English Pages 630 [622] Year 2023
T. Agami Reddy Gregor P. Henze
Applied Data Analysis and Modeling for Energy Engineers and Scientists Second Edition
T. Agami Reddy The Design School and the School of Sustainable Engineering and the Built Environment Arizona State University Tempe, AZ, USA
Gregor P. Henze Department of Civil, Environmental and Architectural Engineering University of Colorado Boulder, CO, USA
ISBN 978-3-031-34868-6
ISBN 978-3-031-34869-3 (eBook)
https://doi.org/10.1007/978-3-031-34869-3
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Paper in this product is recyclable.
Dedicated to the men who have had a profound and significant impact on my life (TAR): my grandfather Satyakarma, my father Dayakar, my brother Shantikar, and my son Satyajit.

Dedicated to the strong, loving, and inspiring women in my life (GPH): my mother Uta Birgit, my wife Martha Marie, and our daughters Sophia Miriam and Josephine Charlotte.
Preface (Second Edition)
This second edition has been undertaken over a dozen years after the first edition and is a complete revision meant to modernize, update, and expand on topic coverage and reference case study examples. The general intent remains the same, i.e., a practical textbook on applied data analysis and modeling targeting students and professionals in engineering and applied science working on energy and environmental issues and systems in general and in the building energy domain in particular. Statistical textbooks often tend to be opaque, making it difficult to acquire an intuitive understanding of how to (and when not to) apply the numerous analysis techniques to one’s chosen field. The style of writing of this book is to present a simple, clear,¹ and logically laid-out structure of the various aspects of statistical theory² and practice along with suggestions, discussion, and case study examples meant to enhance comprehension and act as a catalyst for self-discovery by the reader. The book remains modular, and several chapters can be studied on a standalone basis.

The structure of the first edition has been retained, but important enhancements have been made. The first six chapters deal with basic topics covered in most statistical textbooks (and these could serve as a first course if needed), while the remaining six chapters deal with more advanced topics with domain-relevant discussion and case study examples. The latter chapters have been revised extensively with new statistical methods and subject matter along with numerous examples from recently published technical papers meant to nurture and stimulate a more research-focused mindset in the reader. The chapter on “classification and clustering” in the first edition has been renamed “statistical learning through data analysis” given the enormous advances in data science, and has been greatly expanded in scope and treatment. The chapters on inverse methods as applied to black-box, gray-box, and white-box models (Chaps.
9 and 10) have been better structured, thoroughly revised, and improved. A new section on sustainability assessments has been added to the last chapter on decision-making. It tries to dispel the currently confusing nomenclature in the sustainability literature by defining and scoping various terms and clearly distinguishing between the different assessment frameworks and their application areas. An attempt is made to combine traditional decision-making with the broader domain of “sustainability assessments.”

In recent years, “science-based data analysis” has become a term often used by certain sections of society to lend credence to whatever opinion they wish to promote based on some sort of analysis of some sort of data; it is close to assuming the aura of a “new religion.” Data analysis and modeling is an art with principles grounded in science. The same data set can often be analyzed in different ways by different analysts, thereby affording a great deal of methodological freedom. Unfortunately, a lack of rigor and excessive reliance on freely available software packages undermine the attitude of humility and cautious scrutiny historically expected of the scientific method. Therefore, we hold that the significant body of statistical analysis methods is something those wishing to become competent and trustworthy analysts are encouraged to acquire as a foundation, but more important is the need to develop a mindset and a skill set built on years of hands-on experience and self-evaluation.

With the advent of powerful computing and convenient-to-use statistical packages, analysts can perform numerous different types of analysis before reaching a conclusion, a convenience not available to analysts merely a few decades ago. As alluded to above, this has also led to cases of misuse and even error. Numerous technical papers report results of statistical analysis that are incomplete and deficient, which leads to an unfortunate erosion of confidence in scientific findings among the research and policy community, not to mention the public. Further, there are several cases where the original approach to a specific topic went down an analysis path that was later found to be inappropriate, but subsequent researchers (and funding agencies) kept pushing that pathway simply because of historic inertia. Hence, it is imperative that one have the courage to be impartial about one’s results and research findings. There are several instances where science, or at least applied science, is not self-correcting, which can be attributed to the tendency to belittle confirmatory research, to limited funding, to constant shifting of research focus by funding agencies, academics, and researchers, and to the mindset that changing course will open one’s previous research to criticism.

The original author (TAR) is very pleased to have his colleague, Prof. Gregor Henze, join the authorship of this edition.

One of the most noticeable differences in the second edition of this book is the inclusion of electronic resources. This will enhance senior-level and graduate instruction while also serving as a self-learning aid to professionals in this domain area. For readers to gain exposure to (and perhaps become proficient in) performing hands-on analysis, the open-source Python and R programming languages have been adopted in the form of Jupyter notebooks and R Markdown files, which can be downloaded from https://github.com/henze-research-group/adam. This repository contains numerous data sets and sample computer code reflective of real-world problems, which will continue to grow as new examples are developed and graciously contributed by readers of this book. The link also allows the large data tables in Appendix B and various chapters to be downloaded conveniently for analysis.

¹ “If you cannot explain the concepts simply, you probably do not understand them properly” (Richard Feynman).
² Theory is something with principles which are not obvious initially but from which surprising consequences can be deduced and even the principles confirmed.
Acknowledgments

In addition to the numerous talented and dedicated colleagues who contributed in various ways to the first edition of this book, we would like to acknowledge, in particular, the following:

• Colleagues: Bass Abushakra, Marlin Addison, Brad Allenby, Juan-Carlos Baltazar, David Claridge, Daniel Feuermann, Srinivas Katipamula, George Runger, Kris Subbarao, Frank Vignola, and Radu Zmeureanu.
• Former students: Thomaz Carvalhaes, Srijan Didwania, Ranajoy Dutta, Phillip Howard, Alireza Inanlouganji, Mushfiq Islam, Saurabh Jalori, Salim Moslehi, Emmanuel Omere, Travis Sabatino, and Steve Snyder.

TAR would like to acknowledge the love and encouragement from his wife Shobha, their children Agaja and Satyajit, and granddaughters Maya and Mikaella. GPH would like to acknowledge the encouragement and support of his wife Martha and their children Sophia and Josephine.

T. Agami Reddy, Tempe, AZ, USA
Gregor P. Henze, Boulder, CO, USA
Preface (First Edition)
A Third Need in Engineering Education

At its inception, engineering education was predominantly process oriented, while engineering practice tended to be predominantly system oriented.³ While it was invaluable to have a strong fundamental knowledge of the processes, educators realized the need to have courses where this knowledge translated into an ability to design systems; therefore, most universities, starting in the 1970s, mandated that seniors take at least one design/capstone course. However, a third aspect is acquiring increasing importance: the need to analyze, interpret, and model data. Such a skill set is proving to be crucial in all scientific activities, none so much as in engineering and the physical sciences. How can data collected from a piece of equipment be used to assess the claims of the manufacturers? How can performance data from either a natural system or a man-made system be used, respectively, to maintain it more sustainably or to operate it more efficiently? Such needs are driven by the fact that system performance data is easily available in our present-day digital age, where sensor and data acquisition systems have become reliable, cheap, and part of the system design itself. This applies both to experimental data (gathered from experiments performed according to some predetermined strategy) and to observational data (where one can neither intrude on system functioning nor control the experiment, such as in astronomy). Techniques for data analysis also differ depending on the size of the data; smaller data sets may require the use of “prior” knowledge of how the system is expected to behave or how similar systems have been known to behave in the past.

Let us consider a specific instance of observational data: once a system is designed and built, how does one evaluate its condition in terms of design intent and, if possible, operate it in an “optimal” manner under variable operating conditions (say, based on cost, or on minimal environmental impact such as carbon footprint, or any appropriate pre-specified objective)? Thus, data analysis and data-driven modeling methods as applied to this instance can be meant to achieve certain practical ends, for example:

(a) Verifying stated claims of the manufacturer;
(b) Product improvement or product characterization from performance data of a prototype;
(c) Health monitoring of a system, i.e., how does one use quantitative approaches to reach sound decisions on the state or “health” of the system based on its monitored data?
(d) Controlling a system, i.e., how best to operate and control it on a day-to-day basis?
(e) Identifying measures to improve system performance, and assessing the impact of these measures;
(f) Verification of the performance of implemented measures, i.e., are the remedial measures implemented impacting system performance as intended?
³ Stoecker, W.F., 1989. Design of Thermal Systems, 3rd Edition, McGraw-Hill, New York.
Intent

Data analysis and modeling is not an end in itself; it is a well-proven and often indispensable aid for subsequent decision-making, allowing realistic assessments and predictions to be made concerning expected behavior, the current operational state of the system, and/or the impact of any intended structural or operational changes. It has its roots in statistics, probability, regression, mathematics (linear algebra, differential equations, numerical methods, ...), modeling, and decision-making. Engineering and science graduates are somewhat comfortable with mathematics, while they do not usually get any exposure to decision analysis at all. Statistics, probability, and regression analysis are usually squeezed into a sophomore term, resulting in them remaining “a shadowy mathematical nightmare, and ... a weakness forever”⁴ even to academically good graduates. Further, many of these concepts, tools, and procedures are taught as disparate courses not only in physical sciences and engineering but also in life sciences, statistics, and econometrics departments. This has led to many in the physical sciences and engineering communities having a pervasive “mental block” or apprehensiveness or lack of appreciation of this discipline altogether. Though these analysis skills can be learnt over several years by some (while some never learn them well enough to be comfortable even after several years of practice), what is needed is a textbook which provides:

1. A review of classical statistics and probability concepts,
2. A basic and unified perspective of the various techniques of data-based mathematical modeling and analysis,
3. An understanding of the “process” along with the tools,
4. A proper combination of classical methods with the more recent machine learning and automated tools which the widespread use of computers has spawned, and
5. Well-conceived examples and problems involving real-world data that would illustrate these concepts within the purview of specific areas of application.

Such a text is likely to dispel the current sense of unease and provide readers with the necessary measure of practical understanding and confidence in being able to interpret their numbers rather than merely generating them. This would also have the added benefit of advancing the current state of knowledge and practice in that the professional and research community would better appreciate, absorb, and even contribute to the numerous research publications in this area.
⁴ Keller, D.K., 2006. The Tao of Statistics, Sage Publications, London, UK.

Approach and Scope

Forward models needed for system simulation and design have been addressed in numerous textbooks and have been well-inculcated into the undergraduate engineering and science curriculum for several decades. It is the issue of data-driven methods which I feel is inadequately reinforced in undergraduate and first-year graduate curricula, and hence the basic rationale for this book. Further, this book is not meant to be a monograph or a compilation of information on papers, i.e., not a literature review. It is meant to serve as a textbook for senior undergraduate or first-year graduate students or for continuing education professional courses, as well as a self-study reference book for working professionals with adequate background.

Applied statistics and data-based analysis methods find applications in various engineering, business, medical, and physical, natural, and social sciences. Though the basic concepts are the same, the diversity in these disciplines results in rather different focus and differing emphasis of the analysis methods. This diversity may be in the process itself, in the type and quantity of data, and in the intended purpose of the analysis. For example, many engineering systems have low “epistemic” uncertainty or uncertainty associated with the process itself, and also allow easy gathering of adequate performance data. Such systems are typically characterized by strong relationships between variables which can be formulated in mechanistic terms, and accurate models can consequently be identified. This is in stark contrast to such fields as economics and social sciences, where even qualitative causal behavior is often speculative, and the quantity and quality of the data are rather poor. In fact, even different types of engineered and natural systems require widely different analysis tools. For example, electrical and specific mechanical engineering disciplines (e.g., those involving rotary equipment) largely rely on frequency-domain analysis methods, while time-domain methods are more suitable for most thermal and environmental systems. This consideration has led me to limit the scope of the analysis techniques described in this book to thermal, energy-related, environmental, and industrial systems.

There are those students for whom a mathematical treatment and justification helps in better comprehension of the underlying concepts. However, my personal experience has been that the great majority of engineers do not fall in this category, and hence a more pragmatic approach is adopted. I am not particularly concerned with proofs, deductions, and statistical rigor, which tend to overwhelm the average engineering student. The intent is, rather, to impart a broad conceptual and theoretical understanding as well as a solid working familiarity (by means of case studies) of the various facets of data-driven modeling and analysis as applied to thermal and environmental systems. On the other hand, this is neither a cookbook nor a reference book listing various models of the numerous equipment and systems which comprise thermal systems; rather, it stresses underlying scientific, engineering, statistical, and analysis concepts.
It should not be considered as a substitute for specialized books, nor should their importance be trivialized. A good general professional needs to be familiar, if not proficient, with a number of different analysis tools and how they “map” onto each other, in order to select the most appropriate tools for the occasion. Though nothing can replace hands-on experience in design and data analysis, being familiar with the appropriate theoretical concepts would not only shorten modeling and analysis time but also enable better engineering analysis to be performed. Further, those who have gone through this book will gain the required basic understanding to tackle the more advanced topics dealt with in the literature at large, and hence elevate the profession as a whole. This book has been written with a certain amount of zeal in the hope that this will give this field some impetus and lead to its gradual emergence as an identifiable and important discipline (just as that enjoyed by a course on modeling, simulation, and design of systems) and would ultimately be a required senior-level course or first-year graduate course in most engineering and science curricula.

This book has been intentionally structured so that the same topics (namely, statistics, parameter estimation, and data collection) are treated first from a “basic” level, primarily by reviewing the essentials, and then from an “intermediate” level. This would allow the book to have broader appeal, and allow a gentler absorption of the needed material by certain students and practicing professionals. As pointed out by Asimov,⁵ the Greeks demonstrated that abstraction (or simplification) in physics allowed a simple and generalized mathematical structure to be formulated which led to greater understanding than would otherwise be possible, along with the ability to subsequently restore some of the real-world complicating factors which were ignored earlier.
Most textbooks implicitly follow this premise by presenting “simplistic” illustrative examples and problems. I strongly believe that a book on data analysis should also expose the student to the “messiness” present in real-world data. To that end, examples and problems which deal with case studies involving actual (either raw or marginally cleaned) data have been included. The hope is that this would provide the student with the necessary training and confidence to tackle real-world analysis situations.
⁵ Asimov, I., 1966. Understanding Physics: Light, Magnetism, and Electricity, Walker Publications.
Assumed Background of Reader

This is a book written for two sets of audiences: on the one hand, a basic treatment meant for the general engineering and science senior as well as the general practicing engineer, and, on the other, the general graduate student and the more advanced professional entering the fields of thermal and environmental sciences. The exponential expansion of scientific and engineering knowledge as well as its cross-fertilization with allied emerging fields such as computer science, nanotechnology, and bio-engineering have created the need for a major reevaluation of the thermal science undergraduate and graduate engineering curricula. The relatively few academic slots for professional and free electives available to students require that traditional subject matter be combined into fewer classes, whereby the associated loss in depth and rigor is compensated for by a better understanding of the connections among different topics within a given discipline as well as between traditional and newer ones.

It is presumed that the reader has the necessary academic background (at the undergraduate level) of traditional topics such as physics, mathematics (linear algebra and calculus), fluids, thermodynamics, and heat transfer, as well as some exposure to experimental methods, probability, statistics, and regression analysis (taught in lab courses at the freshman or sophomore level). Further, it is assumed that the reader has some basic familiarity with important energy and environmental issues facing society today. However, special effort has been made to provide pertinent review of such material so as to make this into a sufficiently self-contained book.

Most students and professionals are familiar with the uses and capabilities of the ubiquitous spreadsheet program. Though many of the problems can be solved with the existing (or add-on) capabilities of such spreadsheet programs, it is urged that the instructor or reader select an appropriate statistical program to do the statistical computing work because of the added sophistication it provides. This book does not delve into how to use these programs; rather, its focus is education-based, intended to provide the knowledge and skill sets necessary for value, judgment, and confidence in how to use them, as against training-based, whose focus would be to teach facts and specialized software.
Acknowledgements

Numerous talented and dedicated colleagues contributed in various ways over the several years of my professional career; some by direct association, others indirectly through their textbooks and papers, both of which were immensely edifying and stimulating to me personally. The list of acknowledgements of such meritorious individuals would be very long indeed, and so I have limited myself to those who have either provided direct valuable suggestions on the overview and scope of this book, or have generously given their time in reviewing certain chapters of this book. In the former category, I would like to gratefully mention Drs. David Claridge, Jeff Gordon, Gregor Henze, John Mitchell, and Robert Sonderegger, while in the latter, Drs. James Braun, Patrick Gurian, John House, Ari Rabl, and Balaji Rajagopalan. I am also appreciative of interactions with several exceptional graduate students, and would like to especially thank the following whose work has been adopted in case study examples in this book: Klaus Andersen, Song Deng, Jason Fierko, Wei Jiang, Itzhak Maor, Steven Snyder, and Jian Sun. Writing a book is a tedious and long process; the encouragement and understanding of my wife, Shobha, and our children, Agaja and Satyajit, were sources of strength and motivation.

T. Agami Reddy
Tempe, AZ, USA
December 2010
Contents
1
Mathematical Models and Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Forward and Inverse Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.1 Preamble . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.2 The Energy Problem and Importance of Buildings . . . . . . . . . . . 1.1.3 Forward or Simulation Approach . . . . . . . . . . . . . . . . . . . . . . . 1.1.4 Inverse or Data Analysis Approach . . . . . . . . . . . . . . . . . . . . . . 1.1.5 Discussion of Both Approaches . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 System Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2.1 What Is a System Model? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2.2 Types of Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3 Types of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3.1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3.2 Types of Uncertainty in Data . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4 Mathematical Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4.1 Basic Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4.2 Block Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4.3 Mathematical Representation . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4.4 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.4.5 Steady-State and Dynamic Models . . . . . . . . . . . . . . . . . . . . . . 1.5 Mathematical Modeling Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5.1 Broad Categorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5.2 Simulation or Forward Modeling . . . . . . . . . . . . . . . . 
. . . . . . . . 1.5.3 Inverse Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.5.4 Calibrated Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.6 Data Analytic Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.6.1 Data Mining or Knowledge Discovery . . . . . . . . . . . . . . . . . . . . 1.6.2 Machine Learning or Algorithmic Models . . . . . . . . . . . . . . . . . 1.6.3 Introduction to Big Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.7 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.7.2 Basic Stages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.7.3 Example of a Data Collection and Analysis System . . . . . . . . . . 1.8 Topics Covered in Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1 1 1 1 2 2 3 3 3 3 4 4 6 7 7 7 9 10 13 14 14 14 17 19 19 20 20 21 22 22 22 23 25 27 30
2
Probability Concepts and Probability Distributions . . . . . . . . . . . . . . . . . . . 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1.1 Classical Concept of Probability . . . . . . . . . . . . . . . . . . . . . . . 2.1.2 Bayesian Viewpoint of Probability . . . . . . . . . . . . . . . . . . . . . 2.1.3 Distinction Between Probability and Statistics . . . . . . . . . . . . .
31 31 31 32 32
. . . . .
xiii
xiv
Contents
2.2
3
Classical Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.1 Basic Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.2 Basic Set Theory Notation and Axioms of Probability . . . . . . . . 2.2.3 Axioms of Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.4 Joint, Marginal, and Conditional Probabilities . . . . . . . . . . . . . . 2.2.5 Permutations and Combinations . . . . . . . . . . . . . . . . . . . . . . . . 2.3 Probability Distribution Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.1 Density Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.2 Expectations and Moments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.3 Function of Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . 2.3.4 Chebyshev’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4 Important Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.4.2 Distributions for Discrete Variables . . . . . . . . . . . . . . . . . . . . . . 2.4.3 Distributions for Continuous Variables . . . . . . . . . . . . . . . . . . . 2.5 Bayesian Probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.1 Bayes’ Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.2 Application to Discrete Probability Variables . . . . . . . . . . . . . . . 2.5.3 Application to Continuous Probability Variables . . . . . . . . . . . . 2.6 Three Kinds of Probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . .
32 32 33 34 35 38 39 39 42 43 44 45 45 45 50 58 58 61 63 64 66 74
3 Data Collection and Preliminary Analysis
3.1 Sensors and Their Characteristics
3.2 Data Collection Systems
3.2.1 Generalized Measurement System
3.2.2 Types and Categories of Measurements
3.2.3 Data Recording Systems
3.3 Raw Data Validation and Preparation
3.3.1 Definitions
3.3.2 Limit Checks
3.3.3 Consistency Checks Involving Conservation Balances
3.3.4 Outlier Rejection by Visual Means
3.3.5 Handling Missing Data
3.4 Statistical Measures of Sample Data
3.4.1 Summary Descriptive Measures
3.4.2 Covariance and Pearson Correlation Coefficient
3.5 Exploratory Data Analysis (EDA)
3.5.1 What Is EDA?
3.5.2 Purpose of Data Visualization
3.5.3 Static Univariate Graphical Plots
3.5.4 Static Bi- and Multivariate Graphical Plots
3.5.5 Interactive and Dynamic Graphics
3.5.6 Basic Data Transformations
3.6 Overall Measurement Uncertainty
3.6.1 Need for Uncertainty Analysis
3.6.2 Basic Uncertainty Concepts: Random and Bias Errors
3.6.3 Random Uncertainty
3.6.4 Bias Uncertainty
3.6.5 Overall Uncertainty
3.6.6 Chauvenet's Statistical Criterion of Data Rejection
3.7 Propagation of Errors
3.7.1 Taylor Series Method for Cross-Sectional Data
3.7.2 Monte Carlo Method for Error Propagation Problems
3.8 Planning a Non-Intrusive Field Experiment
Problems
References
4 Making Statistical Inferences from Samples
4.1 Introduction
4.2 Basic Univariate Inferential Statistics
4.2.1 Sampling Distribution and Confidence Interval of the Mean
4.2.2 Hypothesis Test for Single Sample Mean
4.2.3 Two Independent Sample and Paired Difference Tests on Means
4.2.4 Single and Two Sample Tests for Proportions
4.2.5 Single and Two Sample Tests of Variance
4.2.6 Tests for Distributions
4.2.7 Test on the Pearson Correlation Coefficient
4.3 ANOVA Test for Multi-Samples
4.3.1 Single-Factor ANOVA
4.3.2 Tukey's Multiple Comparison Test
4.4 Tests of Significance of Multivariate Data
4.4.1 Introduction to Multivariate Methods
4.4.2 Hotelling T^2 Test
4.5 Non-Parametric Tests
4.5.1 Signed and Rank Tests for Medians
4.5.2 Kruskal–Wallis Multiple Samples Test for Medians
4.5.3 Test on Spearman Rank Correlation Coefficient
4.6 Bayesian Inferences
4.6.1 Background
4.6.2 Estimating Population Parameter from a Sample
4.6.3 Hypothesis Testing
4.7 Some Considerations About Sampling
4.7.1 Random and Non-Random Sampling Methods
4.7.2 Desirable Properties of Estimators
4.7.3 Determining Sample Size During Random Surveys
4.7.4 Stratified Sampling for Variance Reduction
4.8 Resampling Methods
4.8.1 Basic Concept
4.8.2 Application to Probability Problems
4.8.3 Different Methods of Resampling
4.8.4 Application of Bootstrap to Statistical Inference Problems
4.8.5 Closing Remarks
Problems
References
5 Linear Regression Analysis Using Least Squares
5.1 Introduction
5.2 Regression Analysis
5.2.1 Objective of Regression Analysis
5.2.2 Ordinary Least Squares
5.3 Simple OLS Regression
5.3.1 Estimation of Model Parameters
5.3.2 Statistical Criteria for Model Evaluation
5.3.3 Inferences on Regression Coefficients and Model Significance
5.3.4 Model Prediction Uncertainty
5.4 Multiple OLS Regression
5.4.1 Higher Order Linear Models
5.4.2 Matrix Formulation
5.4.3 Point and Interval Estimation
5.4.4 Beta Coefficients and Elasticity
5.4.5 Partial Correlation Coefficients
5.4.6 Assuring Model Parsimony: Stepwise Regression
5.5 Applicability of OLS Parameter Estimation
5.5.1 Assumptions
5.5.2 Sources of Errors During Regression
5.6 Model Residual Analysis and Regularization
5.6.1 Detection of Ill-Conditioned Behavior
5.6.2 Leverage and Influence Data Points
5.6.3 Remedies for Nonuniform Residuals
5.6.4 Serially Correlated Residuals
5.6.5 Dealing with Misspecified Models
5.7 Other Useful OLS Regression Models
5.7.1 Zero-Intercept Models
5.7.2 Indicator Variables for Local Piecewise Models: Linear Splines
5.7.3 Indicator Variables for Categorical Regressor Models
5.8 Resampling Methods Applied to Regression
5.8.1 Basic Approach
5.8.2 Jackknife and k-Fold Cross-Validation
5.8.3 Bootstrap Method
5.9 Case Study Example: Effect of Refrigerant Additive on Chiller Performance
5.10 Parting Comments on Regression Analysis and OLS
Problems
References
6 Design of Physical and Simulation Experiments
6.1 Introduction
6.1.1 Types of Data Collection
6.1.2 Purpose of DOE
6.1.3 DOE Terminology
6.2 Overview of Different Statistical Methods
6.2.1 Different Types of ANOVA Tests
6.2.2 Link Between ANOVA and Regression
6.2.3 Recap of Basic Model Functional Forms
6.3 Basic Concepts
6.3.1 Levels, Discretization, and Experimental Combinations
6.3.2 Blocking
6.3.3 Unrestricted and Restricted Randomization
6.4 Factorial Designs
6.4.1 Full Factorial Design
6.4.2 2^k Factorial Designs
6.4.3 Concept of Orthogonality
6.4.4 Fractional Factorial Designs
6.5 Block Designs
6.5.1 Complete Block Design
6.5.2 Latin Squares
6.6 Response Surface Designs
6.6.1 Applications
6.6.2 Methodology
6.6.3 First- and Second-Order Models
6.6.4 Central Composite Design and the Concept of Rotation
6.7 Simulation Experiments
6.7.1 Background
6.7.2 Similarities and Differences Between Physical and Simulation Experiments
6.7.3 Monte Carlo and Allied Sampling Methods
6.7.4 Sensitivity Analysis for Screening
6.7.5 Surrogate Modeling
6.7.6 Summary
Problems
References
7 Optimization Methods
7.1 Introduction
7.1.1 What Is Optimization?
7.1.2 Simple Example
7.2 Terminology and Classification
7.2.1 Definition of Terms
7.2.2 Categorization of Methods
7.2.3 Types of Objective Functions and Constraints
7.2.4 Sensitivity Analysis and Post-Optimality Analysis
7.3 Analytical Methods
7.3.1 Unconstrained Problems
7.3.2 Direct Substitution Method for Equality Constrained Problems
7.3.3 Lagrange Multiplier Method for Equality Constrained Problems
7.3.4 Problems with Inequality Constraints
7.3.5 Penalty Function Method
7.4 Numerical Unconstrained Search Methods
7.4.1 Univariate Methods
7.4.2 Multivariate Methods
7.5 Linear Programming (LP)
7.5.1 Standard Form
7.5.2 Example of an LP Problem
7.5.3 Linear Network Models
7.5.4 Example of Maximizing Flow in a Transportation Network
7.5.5 Mixed Integer Linear Programming (MILP)
7.5.6 Example of Reliability Analysis of a Power Network
7.6 Nonlinear Programming
7.6.1 Standard Form
7.6.2 Quadratic Programming
7.6.3 Popular Numerical Multivariate Search Algorithms
7.7 Illustrative Example: Integrated Energy System (IES) for a Campus
7.8 Examples of Dynamic Programming
Problems
References
8 Analysis of Time Series Data
8.1 Basic Concepts
8.1.1 Introduction
8.1.2 Terminology
8.1.3 Basic Behavior Patterns
8.1.4 Illustrative Data Set
8.2 General Model Formulations
8.3 Smoothing Methods
8.3.1 Arithmetic Moving Average (AMA)
8.3.2 Exponentially Weighted Moving Average (EWA)
8.3.3 Determining Structure by Cross-Validation
8.4 OLS Regression Models
8.4.1 Trend Modeling
8.4.2 Trend and Seasonal Models
8.4.3 Forecast Intervals
8.4.4 Fourier Series Models for Periodic Behavior
8.4.5 Interrupted Time Series
8.5 Stochastic Time Series Models
8.5.1 Introduction
8.5.2 ACF, PACF, and Data Detrending
8.5.3 ARIMA Class of Models
8.5.4 Recommendations on Model Identification
8.6 ARMAX or Transfer Function Models
8.6.1 Conceptual Approach and Benefit
8.6.2 Transfer Function Modeling of Linear Dynamic Systems
8.7 Quality Control and Process Monitoring Using Control Chart Methods
8.7.1 Background and Approach
8.7.2 Shewhart Control Charts for Variables
8.7.3 Shewhart Control Charts for Attributes
8.7.4 Practical Implementation Issues of Control Charts
8.7.5 Time-Weighted Monitoring
8.7.6 Concluding Remarks
Problems
References
9 Parametric and Non-Parametric Regression Methods
9.1 Introduction
9.2 Important Concepts in Parameter Estimation
9.2.1 Structural Identifiability
9.2.2 Ill-Conditioning
9.2.3 Numerical Identifiability
9.3 Dealing with Collinear Regressors: Variable Selection and Shrinkage
9.3.1 Problematic Issues
9.3.2 Principal Component Analysis and Regression
9.3.3 Ridge and Lasso Regression
9.3.4 Chiller Case Study Involving Collinear Regressors
9.3.5 Other Multivariate Methods
9.4 Going Beyond OLS
9.4.1 Background
9.4.2 Maximum Likelihood Estimation (MLE)
9.4.3 Generalized Linear Models (GLM)
9.4.4 Box-Cox Transformation
9.4.5 Logistic Functions
9.4.6 Error in Variables (EIV) and Corrected Least Squares
9.5 Non-Linear Parametric Regression
9.5.1 Detecting Non-Linear Correlation
9.5.2 Different Non-Linear Search Methods
9.5.3 Overview of Various Parametric Regression Methods
9.6 Non-Parametric Regression
9.6.1 Background
9.6.2 Extensions to Linear Models
9.6.3 Basis Functions
9.6.4 Polynomial Regression and Smoothing Splines
9.7 Local Regression: LOWESS Smoothing Method
9.8 Neural Networks: Multi-Layer Perceptron (MLP)
9.9 Robust Regression
Problems
References
10 Inverse Methods for Mechanistic Models
10.1 Fundamental Concepts
10.1.1 Applicability
10.1.2 Approaches and Their Characteristics
10.1.3 Mechanistic Models
10.1.4 Scope of Chapter
10.2 Gray-Box Static Models
10.2.1 Basic Notions
10.2.2 Performance Models for Solar Photovoltaic Systems
10.2.3 Gray-Box and Black-Box Models for Water-Cooled Chillers
10.2.4 Sequential Stagewise Regression and Selection of Data Windows
10.2.5 Case Study of Non-Intrusive Sequential Parameter Estimation for Building Energy Flows
10.2.6 Application to Policy: Dose-Response
10.3 Certain Aspects of Data Collection
10.3.1 Types of Data Collection
10.3.2 Measures of Information Content
10.3.3 Functional Testing and Data Fusion
10.4 Gray-Box Models for Dynamic Systems
10.4.1 Introduction
10.4.2 Sequential Estimation of Thermal Network Model Parameters from Controlled Tests
10.4.3 Non-Intrusive Identification of Thermal Network Models and Parameters
10.4.4 State Space Representation and Compartmental Models
10.4.5 Example of a Compartmental Model
10.4.6 Practical Issues During Identification
10.5 Bayesian Regression and Parameter Estimation: Case Study
10.6 Calibration of Detailed Simulation Programs
10.6.1 Purpose
10.6.2 The Basic Issue
10.6.3 Detailed Simulation Models for Energy Use in Buildings
10.6.4 Uses of Calibrated Simulation
10.6.5 Causes of Differences
10.6.6 Definition of Terms
10.6.7 Raw Input Tuning (RIT)
10.6.8 Semi-Analytical Methods (SAM)
10.6.9 Physical Parameter Estimation (PPE)
10.6.10 Thoughts on Statistical Criteria for Goodness-of-Fit
Problems
References
11 Statistical Learning Through Data Analytics
11.1 Introduction
11.2 Distance as a Similarity Measure
11.3 Unsupervised Learning: Clustering Approaches
11.3.1 Types of Clustering Methods
11.3.2 Centroid-Based Partitional Clustering by K-Means
11.3.3 Density-Based Partitional Clustering Using DBSCAN
11.3.4 Agglomerative Hierarchical Clustering Methods
11.4 Supervised Learning: Statistical-Based Classification Approaches
11.4.1 Different Types of Approaches
11.4.2 Distance-Based Classification: k-Nearest Neighbors
11.4.3 Naive Bayesian Classification
11.4.4 Classical Regression-Based Classification
11.4.5 Discriminant Function Analysis
11.4.6 Neural Networks: Radial Basis Function (RBF)
11.4.7 Support Vector Machines (SVM)
11.5 Decision Tree–Based Classification Methods
11.5.1 Rule-Based Method and Decision-Tree Representation
11.5.2 Criteria for Tree Splitting
11.5.3 Classification and Regression Trees (CART)
11.5.4 Ensemble Method: Random Forest
11.6 Anomaly Detection Methods
11.6.1 Introduction
11.6.2 Graphical and Statistical Methods
11.6.3 Model-Based Methods
11.6.4 Data Mining Methods
11.7 Applications to Reducing Energy Use in Buildings
Problems
References
12 Decision-Making, Risk Analysis, and Sustainability Assessments
12.1 Introduction
12.1.1 Types of Decision-Making Problems and Applications
12.1.2 Purview of Reliability, Risk Analysis, and Decision-Making
12.1.3 Example of Discrete Decision-Making
12.1.4 Example of Chiller FDD
12.2 Single Criterion Decision-Making
12.2.1 General Framework
12.2.2 Representing Problem Structure: Influence Diagrams and Decision Trees
12.2.3 Single and Multi-Stage Decision Problems
12.2.4 Value of Perfect Information
12.2.5 Different Criteria for Outcome Evaluation
12.2.6 Discretizing Probability Distributions
12.2.7 Utility Value Functions for Modeling Risk Attitudes
12.2.8 Monte Carlo Simulation for First-Order and Nested Uncertainties
12.3 Risk Analysis
12.3.1 The Three Aspects
12.3.2 The Empirical Approach
12.3.3 Context of Environmental Risk to Humans
12.3.4 Other Areas of Application
12.4 Case Study: Risk Assessment of an Existing Building
12.5 Multi-Criteria Decision-Making (MCDM) Methods
12.5.1 Introduction and Description of Terms
12.5.2 Classification of Methods
12.5.3 Basic Mathematical Operations
12.6 Single Discipline MCDM Methods: Techno-Economic Analysis
12.6.1 Review
12.6.2 Consistent Attribute Scales
12.6.3 Inconsistent Attribute Scales: Dominance and Pareto Frontier
12.6.4 Case Study of Conflicting Criteria: Supervisory Control of an Engineered System
12.7 Sustainability Assessments: MCDM with Multi-Discipline Attribute Scales
12.7.1 Definitions and Scope
12.7.2 Indicators and Metrics
12.7.3 Sustainability Assessment Frameworks
12.7.4 Examples of Non-, Semi-, and Fully-Aggregated Assessments
12.7.5 Two Case Studies: Structure-Based and Performance-Based
12.7.6 Closure
Problems
References
Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607
1
Mathematical Models and Data Analysis
Abstract
This chapter starts with an introduction of forward and inverse models and provides a practical context of their distinctive usefulness and specific capabilities and scopes in terms of a major societal concern, namely the high energy use in buildings. This is followed by a description of the various types of models generally encountered, namely conceptual, physical, and mathematical, with the last type being the sole focus of this book. Next, different types of data collection schemes and the different types of uncertainty encountered are discussed. This is followed by introducing the elements of mathematical models and the different ways to classify them such as linear and nonlinear, lumped and distributed, dynamic and steady-state, etc. Subsequently, how algebraic and first-order differential equations capture different characteristics related to the response of sensors is illustrated. Next, the distinction between simulation or forward (or well-defined or well-specified) problems, and inverse (or data-driven or ill-defined) problems is highlighted. This chapter introduces analysis approaches relevant to the latter, which include calibrated forward models and statistical models identified primarily from data, which can be black-box or gray-box. The latter can again be separated into (i) partial gray-box, i.e., inspired by only partial understanding of system functioning, and (ii) reduced-order mechanistic gray-box models. More recently, a new class of analysis methods has evolved, namely the data analytic approaches, which include data mining or knowledge discovery, machine learning, and big data analysis. These methods, which have been largely driven by increasing computational power and sensing capabilities, are briefly discussed. Next, the various steps involved in a statistical analysis study are discussed followed by an example of data collection and analysis of field-monitored data from an engineering system.
Finally, the various topics covered in each chapter of this book are outlined.
1.1
Forward and Inverse Approaches
1.1.1
Preamble
Applied data analysis and modeling of system performance is historically older than simulation modeling. The ancients, starting as far back as 12,000 years ago, observed the movements of the sun, moon, and stars in order to predict their behavior and initiate certain tasks such as planting crops or readying for winter. Theirs was a necessity compelled by survival; surprisingly, still relevant today. The threat of climate change and its dire consequences are being studied by scientists using in essence similar types of analysis tools— tools that involve measured data to refine and calibrate their models, extrapolate, and evaluate the effect of different scenarios and mitigation measures. These tools fall under the general purview of inverse data analysis and modeling methods, and it would be expedient to illustrate their potential and relevance with a case study application that the reader can relate to more practically.
1.1.2
The Energy Problem and Importance of Buildings
One of the current major societal problems facing mankind is the issue of energy, not only due to the gradual depletion of fossil fuels but also due to the adverse climatic and health effects that their burning creates. According to the U.S. Department of Energy (USDOE), total worldwide primary energy consumption in 2021 was about 580 Exajoules (= 580 × 10^18 J). The average annual growth rate is about 2%, which suggests a doubling time of 35 years. The United States accounts for 17% of the worldwide energy use (with only 5% of the world’s population), while the building sector alone (residential plus commercial buildings) in the United States consumes about 40% of the total primary energy use,
# The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 T. A. Reddy, G. P. Henze, Applied Data Analysis and Modeling for Energy Engineers and Scientists, https://doi.org/10.1007/978-3-031-34869-3_1
over 76% of the electricity generated, and is responsible for 40% of the CO2 emitted. Improvement in energy efficiency in all sectors of the economy worldwide has been rightly identified as a major and pressing need, and aggressive programs and measures are being implemented worldwide. By 2030, USDOE estimates that building energy use in the United States could be cut by more than 20% using technologies known to be cost effective today and by more than 35% if research goals are met. Much higher savings are technically possible. Building efficiency must be considered as improving the performance of a complex system designed to provide occupants with a comfortable, safe, and attractive living and work environment. This requires superior architecture and engineering designs, quality construction practices, and intelligent operation and maintenance of the structures. Identifying energy conservation and efficiency opportunities, verifying by monitoring whether anticipated benefits are in fact realized when such measures/systems are implemented, optimal operating of buildings, etc.; all these tasks require skills in data analysis and modeling.
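The 35-year doubling time quoted above for a 2% annual growth rate can be checked with a short calculation. The sketch below assumes constant compound growth; the function name `doubling_time` is illustrative and not from the text.

```python
import math


def doubling_time(annual_growth_rate: float) -> float:
    """Years needed to double at a constant compound annual growth rate."""
    return math.log(2.0) / math.log(1.0 + annual_growth_rate)


years = doubling_time(0.02)  # 2% per year, as quoted in the text
print(f"Doubling time at 2% per year: {years:.1f} years")
```

The result agrees with the figure in the text to within rounding.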
1.1.3
Forward or Simulation Approach
Building energy simulation models (or forward models) are mechanistic (i.e., based on a mathematical formulation of the physical behavior) and deterministic (i.e., there is no randomness in the inputs or outputs).1 They require as inputs the hourly climatic data of the selected location, the layout, orientation and physical description of the building (such as wall material, thickness, glazing type and fraction, type of shading overhangs, etc.), the type of mechanical and electrical systems available inside the building in terms of air distribution secondary systems, performance specifications of primary equipment (chillers, boilers, etc.), and the hourly operating and occupant schedules of the building. The simulation predicts hourly or sub-hourly energy use during the entire year from which monthly total energy use and peak use along with utility rates provide an estimate of the operating costs of the building. The primary strength of such a forward simulation model is that it is based on sound engineering principles usually taught in colleges and universities, and consequently has gained widespread acceptance by the design and professional community. Major public domain simulation codes (e.g., Energy Plus 2009) have been developed with hundreds of man-years invested in their development by very competent professionals. This modeling approach is generally useful for design purposes where different design options are to be evaluated before the actual system is built.
1 These terms will be described more fully in Sect. 1.5.2.
1.1.4
Inverse or Data Analysis Approach
Inverse modeling methods, on the other hand, are used when performance data of the system is available, and one uses this data for certain specific purposes, such as predicting or controlling the behavior of the system under different operating conditions, or for identifying energy conservation opportunities, or for verifying the effect of energy conservation measures and commissioning practices once implemented, or even to verify that the system is performing as intended (called condition monitoring). Consider the case of an existing building whose energy consumption is known (either utility bill data or monitored data). The following are some of the tasks to which knowledge of data analysis methods may be advantageous to a building energy specialist:
(a) Commissioning tests: How can one evaluate whether a component or a system is installed and commissioned properly?
(b) Comparison with design intent: How does the consumption compare with design predictions? In case of discrepancies, are they due to anomalous weather, to unintended building operation, to improper equipment operation, or to other causes?
(c) Demand side management (DSM): How would the energy consumption decrease if certain operational changes are made, such as lowering thermostat settings, ventilation rates or indoor lighting levels?
(d) Operation and maintenance (O&M): How much energy could be saved by retrofits to building shell, changes to air handler operation from constant air volume to variable air volume operation, or due to changes in the various control settings, or due to replacing the old chiller with a new and more energy efficient one?
(e) Monitoring and verification (M&V): If the retrofits are implemented in the system, can one verify that the savings are due to the retrofit, and not to other confounding causes, e.g., the weather or changes in building occupancy?
(f) Automated fault detection, diagnosis, and evaluation (AFDDE): How can one automatically detect faults in heating, ventilating, air-conditioning, and refrigerating (HVAC&R) equipment, which reduce operating life and/or increase energy use? What are the financial implications of this degradation? Should this fault be rectified immediately or at a later time? What specific measures need to be taken?
(g) Optimal supervisory operation: How can one characterize HVAC&R equipment (such as chillers, boilers, fans, pumps, etc.) in their installed state and optimize the control and operation of the entire system?
(h) Smart-grid interactions: How to best facilitate dynamic energy interactions between building energy systems and the smart-grid with advanced communication and load control capability and high solar/wind energy penetration?
1.1.5
Discussion of Both Approaches
All the above questions are better addressed by data analysis methods. The forward approach could also be used, by, say, (i) going back to the blueprints of the building and of the HVAC system, and repeating the analysis performed at the design stage while using actual building schedules and operating modes, and (ii) performing a calibration or tuning of the simulation model (i.e., varying the inputs in some fashion) since simulated performance is unlikely to match observed performance. This process is, however, tedious and much effort has been invested by the building professional community in this regard with only limited success (Reddy 2006). A critical limitation of the calibrated simulation approach is that the data being used to tune the forward simulation model must meet certain criteria, and even then, all the numerous inputs required by the forward simulation model cannot be mathematically identified (this is referred to as an “over-parameterized problem”). Though awkward, labor intensive, and not entirely satisfactory in its current state of development, the calibrated building energy simulation model is still an attractive option and has its place in the toolkit of data analysis methods (discussed at length in Sect. 10.6). The fundamental difficulty is that there is no general and widely used model or software for dealing with data-driven applications as they apply to building energy; only specialized software programs have been developed, which allow certain types of narrow analysis to be performed. In fact, given the wide diversity in applications of data-driven models, it is unlikely that any one methodology or software program will ever suffice.
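The “over-parameterized problem” mentioned above can be illustrated with a deliberately tiny case. The sketch below is an assumption made for illustration only, not a model from the text: heat flux through a wall is taken as q = (k / L) ΔT, and only q and ΔT are measured. Two different pairs of conductivity k and thickness L then reproduce the same data exactly, so the data cannot identify k and L separately, only their ratio. A real building simulation has far more inputs, but the difficulty is of the same kind.

```python
import math


def heat_flux(k: float, thickness: float, delta_t: float) -> float:
    """Hypothetical wall model: heat flux in W/m2 for conductivity k (W/m K)."""
    return (k / thickness) * delta_t


delta_t_values = [5.0, 10.0, 15.0, 20.0]  # assumed measured temperature differences, K

# Two different (assumed) parameter sets with the same ratio k / thickness
params_1 = {"k": 0.5, "thickness": 0.25}
params_2 = {"k": 1.0, "thickness": 0.50}

pred_1 = [heat_flux(delta_t=dt, **params_1) for dt in delta_t_values]
pred_2 = [heat_flux(delta_t=dt, **params_2) for dt in delta_t_values]

# Both parameter sets predict identical outputs for every observation
same = all(math.isclose(a, b) for a, b in zip(pred_1, pred_2))
print("Predictions identical:", same)
```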
This leads to the basic premise of this book that there exists a crucial need for building energy professionals to be familiar and competent with a wide range of data analysis methods and tools so that they could select the one that best meets their purpose with the end result that buildings will be operated and managed in a much more energy-efficient manner than currently. Building design simulation tools have played a significant role in lowering energy use in buildings. These are necessary tools and their importance should not be understated. Historically, most of the business revenue in architectural engineering and HVAC&R firms was generated from design/build contracts, which required extensive use of simulation and design software programs. Hence, the professional community is fairly knowledgeable in this area, and several universities teach classes geared toward the use of building energy modeling (BEM) simulation programs.
The last 40 years or so have seen a dramatic increase in building energy services as evidenced by the number of firms that offer services in this area. The understanding, skills, and tools required for this work differ from those required for building design. Other market forces are also at play. The recent interest in “green” and “sustainable” has resulted in a plethora of products and practices aggressively marketed by numerous companies. Often, claims that one product can save much more energy than another, or that one device is more environmentally friendly than others, prove unfounded under closer scrutiny. Unbiased evaluations and independent verification are imperative; otherwise the whole “green” movement may degrade into mere “green-washing” rather than overcoming a dire societal challenge. A sound understanding of applied data analysis is essential for this purpose, and future science and engineering graduates have an important role to play. Thus, the raison d’être of this book is to provide a general introduction and a broad foundation to the mathematical, statistical, and modeling aspects of data analysis methods.
1.2
System Models
1.2.1
What Is a System Model?
A system is the object under study, which could be as simple or as complex as one may wish to consider. It is any ordered, inter-related set of things, and their attributes. A model is a construct that allows one to represent the real-life system so that it can be used to predict the future behavior of the system under various “what–if” scenarios. The construct could be a scaled down physical version of the actual system (widely adopted historically in engineering) or a mental construct. The development of a model is not the ultimate objective; in other words, it is not an end by itself. It is a means to an end, the end being a credible means to make decisions that could involve system-specific issues (such as gaining insights about influential drivers and system dynamics, or predicting system behavior, or determining optimal control conditions) as well as those involving a broader context (such as operation management, deciding on policy measures and planning, etc.).
1.2.2
Types of Models
Fig. 1.1 Different types of models (tree diagram: Models divide into Physical and Abstract; Physical into Scale and Analogue; Abstract into Conceptual and Mathematical; Mathematical into Empirical and Physics-based; Physics-based into Analytical and Numerical)
One differentiates between different types of models (Fig. 1.1):
(a) Abstract models can be:
(i) Conceptual (or qualitative or descriptive models), where the system’s behavior is summarized in non-analytical ways because only general qualitative trends of the system are known. They are mental/intuitive abstract constructs that capture subjective/qualitative behavior or expectations of how something works based on prior experience. Such models are primarily used as an aid to thought or communication.
(ii) Mathematical models, which capture system response using mathematical equations; these are further discussed below.
(b) Physical models can be either:
(i) Scaled down (or up) physical constructs, whose characteristics resemble those of the physical system being studied. They often supported and guided the work of earlier scientists and engineers and are still extensively used for validating mathematical models (such as architectural daylight experiments in test rooms), or
(ii) Analogue models, which are actual physical setups meant to reproduce the physics of the systems and involve secondary measurements of the system to be made (flow, energy, etc.).
(c) Mathematical models can be further subdivided into (see Fig. 1.1):
(i) Empirical models, which are abstract models based on observed/monitored data with a loose mathematical structure. They capture general qualitative trends of the system based on data describing properties of the system summarized in a graph, a table, or a curve fit to observation points. Such models presume some knowledge of the fundamental quantitative trends but lack accurate understanding. Examples: econometric, medical, sociological, anthropological behavior.
(ii) Physics-based mechanistic models (or structural models), which use metric or count data and are based on mathematical relationships derived from physical laws such as Newton’s laws, the laws of thermodynamics and heat transfer, etc. Such models can be used for prediction (during system design) or for proper system operation and control (involving data analysis). This group of models can be classified into:
• Analytical, for which closed form mathematical solutions exist for the equation or set of equations.
• Numerical, which require numerical procedures to solve the equation(s).
Alternatively, mathematical models can be considered to be:
• Exact structural models where the equation is thought to apply rigorously, i.e., the relationship between variables and parameters in the model is exact, or as close to exact as current state of scientific understanding permits.
• Inexact structural models where the equation applies only approximately, either because the process is not fully known or because one chose to simplify the exact model so as to make it more usable. A typical example is the dose–response model, which characterizes the relation between the amount of toxic agent imbibed by an individual and the incidence of adverse health effect.
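The difference between analytical and numerical physics-based models can be made concrete with a small sketch. The example below assumes a first-order sensor response, dT/dt = (T_env - T) / tau, chosen only because it has a known closed-form solution; the parameter values and function names are illustrative and not from the text. The same equation is solved once analytically and once with an explicit Euler scheme.

```python
import math


def analytical_response(t, t0_temp, env_temp, tau):
    """Closed-form solution of dT/dt = (env_temp - T) / tau with T(0) = t0_temp."""
    return env_temp + (t0_temp - env_temp) * math.exp(-t / tau)


def euler_response(t_end, t0_temp, env_temp, tau, dt=0.01):
    """Explicit Euler integration of the same equation from 0 to t_end."""
    n_steps = round(t_end / dt)
    temp = t0_temp
    for _ in range(n_steps):
        temp += dt * (env_temp - temp) / tau
    return temp


# Assumed values: initial 20 C, surroundings 30 C, time constant 5 s
exact = analytical_response(10.0, 20.0, 30.0, 5.0)
approx = euler_response(10.0, 20.0, 30.0, 5.0)
print(f"analytical: {exact:.3f} C, numerical: {approx:.3f} C")
```

The two answers agree closely for a small time step; the numerical route becomes the only practical option when no closed-form solution exists.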
1.3
Types of Data
1.3.1
Classification
Data2 can be classified in different ways. One classification scheme is to distinguish between experimental data gathered under controlled test conditions where the observer can perform tests in the manner or sequence intended, and observational data collected while the system is under normal operation or when the system cannot be controlled (as in astronomy).
2 Several authors make a strict distinction between “data,” which is plural, and “datum,” which is singular and implies a single data point. No such distinction is made throughout this book, and the word “data” is used to designate either.
Fig. 1.2 The rolling of a die is an example of discrete data where the data can only assume whole numbers. Even if the die is fair, one would not expect, out of 60 throws, each of the numbers 1 through 6 to appear exactly 10 times but only approximately so
Another classification scheme is based on type of data:
(a) Categorical/qualitative data, which involve non-numerical descriptive measures or attributes, such as belonging to one of several categories. One can further distinguish between:
(i) Nominal (or unordered) data, consisting of attribute data with no rank, such as male/female, yes/no, married/unmarried, eye color, engineering major, etc.
(ii) Ordinal data, i.e., data that has some order or rank, such as a building envelope that is leaky, medium, or tight, or a day that is hot, mild, or cold. Such data can be converted into an arbitrary rank order by a min-max scaling (e.g., hot/mild/cold days can be ranked as 1/2/3). Such rank-ordered data can be manipulated arithmetically to some extent.
(b) Numerical/quantitative data, i.e., data obtained from measurements of such quantities as time, weight, and height. Further, there are two different kinds:
(i) Count or discrete data, which can take on only a finite or countable number of values. An example is the data series one would obtain by rolling a die 60 times (Fig. 1.2).
(ii) Continuous or metric data involving measurements of time, weight, height, energy, or others. Such data may take on any value in an interval (most metric data is continuous, and hence is not countable); for example, the daily average outdoor dry-bulb temperature in Philadelphia, PA, over a year (Fig. 1.3). Further, one can distinguish between:
– Data measured on an interval scale, which has an arbitrary zero point (such as the Celsius scale) and so only differences between values are meaningful.
– Data measured on a ratio scale, which has a zero point that cannot be arbitrarily changed (such as mass or volume or temperature in Kelvin); both differences and ratios are meaningful.
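The die-rolling example of Fig. 1.2 is easy to reproduce. The sketch below simulates 60 throws of a fair die with Python's pseudo-random generator and tabulates how often each face appears; the seed is arbitrary and only fixed so that a run can be repeated. The counts are discrete data, and they will usually scatter around 10 per face rather than equal it.

```python
import random
from collections import Counter

random.seed(1)  # arbitrary seed, fixed only for repeatability

N_THROWS = 60
throws = [random.randint(1, 6) for _ in range(N_THROWS)]  # faces 1..6, equally likely
counts = Counter(throws)

for face in range(1, 7):
    print(f"face {face}: {counts[face]} times")
```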
Fig. 1.3 Continuous data separated into a large number of bins (in this case, 300) resulted in the above histogram of the hourly outdoor dry-bulb temperature (in °F) in Philadelphia, PA, over a year. A smoother distribution would have resulted if a smaller number of bins had been selected
For data analysis purposes, it is often important to view data based on their dimensionality, i.e., the number of axes needed to graphically present the data. A univariate data set consists of observations based on a single variable, bivariate those based on two variables, and multivariate those based on more than two variables. A fourth type of distinction between data types is by the source or origin of the data: (a) Population is the collection or set of all individuals (or items, or characteristics) representing the same quantity with a connotation of completeness, i.e., the entire group of items being studied whether they be the freshmen student body of a university, instrument readings of a test quantity, or points on a curve. (b) Sample is a portion or limited number of items from a population from which information or readings are collected. There are again two types of samples: – Single-sample is a single reading or succession of readings taken at the same time or under different times but under identical conditions. – Multi-sample is a repeated measurement of a fixed quantity using altered test conditions, such as different observers or different instruments or both. Many experiments may appear to be multi-sample data but are actually single-sample data. For example, if the same instrument is used for data collection during different times, the data should be regarded as single-sample not multisample. One can differentiate between different types of multisample data. Consider the case of solar thermal collector testing (as described in Pr. 5.7 of Chap. 5). In essence, the collector is subjected to different inlet fluid temperature levels
Fig. 1.4 Example of multisample data in the framework of a “round-robin” experiment of testing the same solar thermal collector in six different test facilities (shown by different symbols) following the same testing methodology. The test data is used to determine and plot the collector efficiency versus the reduced temperature along with uncertainty bands (see Pr. 5.7 for nomenclature). (Streed et al. 1979)
under different values of incident solar radiation and ambient air temperatures using an experimental facility with instrumentation of pre-specified accuracy levels. The test results are processed according to certain performance models and the data are plotted as collector efficiency versus reduced temperature level. The test protocol would involve performing replicate tests under similar reduced temperature levels, and this is one type of multi-sample data. Another type of multi-sample data would be the case when the same collector is tested at different test facilities nationwide. The results of such a “round-robin” test are shown in Fig. 1.4, where one detects variations around the trend line given by the performance model that can be attributed to differences in test facility and instrumentation, and to slight variations in how the test protocols were implemented in different facilities.
(c) Two-stage experiments are successive staged experiments where the chance results of the first stage determine the conditions under which the next stage will be carried out. For example, when checking the quality of a lot of mass-produced articles, it is frequently possible to decrease the average sample size by carrying out the inspection in two stages. One may first take a small sample and accept the lot if all articles in the sample are satisfactory; otherwise a large second sample is inspected.
Finally, one needs to distinguish between: (i) a duplicate, which is a separate specimen taken from the same source as the first specimen, and tested at the same time and in the same manner, and (ii) a replicate, which is the same specimen tested again at a different time. Thus, while duplication allows one to test samples till they are destroyed (such as tensile strength testing of an iron specimen), replicate testing stops short of doing permanent damage to the samples.
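The claim that two-stage inspection lowers the average sample size can be checked with a short calculation. The sketch below rests on assumptions that are not in the text: each article is defective independently with a constant probability, the first stage inspects `n1` articles, and a second stage of `n2` articles is triggered only when at least one first-stage article is defective. The sample sizes and defect rates are illustrative.

```python
def expected_sample_size(n1: int, n2: int, defect_rate: float) -> float:
    """Average number of articles inspected per lot under the two-stage plan."""
    p_all_ok = (1.0 - defect_rate) ** n1  # probability stage 1 finds no defect
    return n1 + (1.0 - p_all_ok) * n2     # stage 2 is needed only otherwise


n1, n2 = 10, 40  # assumed stage sizes; a single-stage plan would inspect n1 + n2 = 50
for rate in (0.0, 0.01, 0.05):
    avg = expected_sample_size(n1, n2, rate)
    print(f"defect rate {rate:.2f}: average sample size {avg:.1f} (single stage: {n1 + n2})")
```

Under these assumptions the saving is largest when defects are rare, since most lots are then accepted after the first stage.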
1.3.2
Types of Uncertainty in Data
If the same results are obtained when an experiment is repeated under the same conditions, one says that the experiment is deterministic. It is this deterministic nature of science that allows theories or models to be formulated and permits the use of scientific theory for prediction (Hodges and Lehman 1970). However, all observational or experimental data invariably have a certain amount of inherent noise or randomness, which introduces a certain degree of uncertainty in the results or conclusions. Because of limitations of the instrument or measurement technique, improper understanding of all influential factors, or the inability to measure some of the driving parameters, random and/or bias types of errors usually infect the deterministic data. However, there are also experiments whose results vary due to the very nature of the experiment; for example, gambling outcomes (throwing of dice, card games, etc.). These are called random experiments. Without uncertainty or randomness, there would have been little need for statistics. Probability theory and inferential statistics have been largely developed to deal with random experiments and the same approach has also been adapted to deterministic experimental data analysis. Both inferential statistics and stochastic model building have to deal with the random nature of observational or experimental data, and thus require knowledge of probability.
There are several types of uncertainty in data, and all of them have to do with the inability to determine the true state of affairs of a system (Haimes 1998). A succinct classification involves the following sources of uncertainties:
(a) Purely stochastic variability (or aleatory uncertainty), where the ambiguity in outcome is inherent in the nature of the process, and no amount of additional measurements can reduce the inherent randomness. Common examples involve coin tossing or card games. These processes are inherently random (either on a temporal or spatial basis), and their outcome, while uncertain, can be anticipated on a statistical basis.
(b) Epistemic uncertainty or ignorance or lack of complete knowledge of the process, which results in certain influential variables not being considered (and, thus, not measured).
(c) Inaccurate measurement of numerical data due to instrument or sampling errors.
(d) Cognitive vagueness involving human linguistic description. For example, people use words like tall/short or very important/not important, which cannot be quantified exactly. This type of uncertainty is generally associated with qualitative and ordinal data where subjective elements come into play.
The traditional approach is to use probability theory along with statistical techniques to address (a), (b), and (c) types of uncertainties. The variability due to sources (b) and (c) can be diminished by taking additional measurements, by using more accurate instrumentation, by better experimental design, and by acquiring better insight into specific behavior with which to develop more accurate models. Several authors apply the term “uncertainty” to only these two sources. Finally, source (d) can be modeled using probability approaches though some authors argue that it would be more convenient and appropriate to use fuzzy logic to model such vagueness in human speech.
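How additional measurements diminish the effect of random measurement error, source (c) above, can be seen in a small simulation. The sketch below assumes zero-mean Gaussian instrument noise around a fixed true value; the true value, noise level, and number of trials are illustrative choices, not values from the text. It compares how widely the average of 4 readings and the average of 100 readings scatter from one trial to the next.

```python
import random
import statistics

random.seed(0)  # arbitrary seed, fixed only for repeatability

TRUE_VALUE = 20.0  # assumed true value of the measured quantity
NOISE_SD = 0.5     # assumed standard deviation of the random instrument error


def mean_of_readings(n: int) -> float:
    """Average of n noisy readings of the same quantity."""
    readings = [random.gauss(TRUE_VALUE, NOISE_SD) for _ in range(n)]
    return statistics.fmean(readings)


def spread_of_means(n: int, trials: int = 2000) -> float:
    """Standard deviation of the n-reading average over many repeated trials."""
    means = [mean_of_readings(n) for _ in range(trials)]
    return statistics.stdev(means)


spread_4 = spread_of_means(4)
spread_100 = spread_of_means(100)
print(f"scatter of the mean with 4 readings:   {spread_4:.3f}")
print(f"scatter of the mean with 100 readings: {spread_100:.3f}")
```

The scatter shrinks roughly as one over the square root of the number of readings. Averaging does not remove a bias error, which requires calibration or better instrumentation.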
1.4
Mathematical Models
1.4.1
Basic Terminology
One can envision two different types of systems: open systems in which energy and/or matter flows into and out of the system, and closed systems in which matter is not exchanged with the environment but energy flows can be present. A system model is a description of the system. Empirical and mechanistic models are made up of three components:
(i) Input variables (also referred to as regressor, forcing, exciting, covariate, exogenous, or independent variables in the engineering, statistical, and econometric literature), which act on the system. Note that there are two types of such variables: those controllable by the experimenter, and uncontrollable or extraneous variables, such as climatic variables.
(ii) System structure and parameters/properties, which provide the necessary mathematical description of the system in terms of physical and material constants; for example, thermal mass, overall heat transfer coefficients, mechanical properties of elements.
(iii) Output variables (also called response, state, endogenous, or dependent variables), which describe system response to the input variables.
A structural model of a system is a mathematical relationship between one or several input variables and parameters and one or several output variables. Its primary purpose is to allow better physical understanding of the phenomenon or process or, alternatively, to allow accurate prediction of system response. This is useful for several purposes; for example, preventing adverse phenomena from occurring, for proper system design (or optimization), or to improve system performance by evaluating other modifications to the system. A satisfactory mathematical model is subject to two contradictory requirements (Edwards and Penney 1996): it must be sufficiently detailed to represent the phenomenon it is attempting to explain or capture, yet it must be sufficiently simple to make the mathematical analysis practical. This requires judgment and experience of the modeler backed by experimentation and validation.3
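The three components of a model described above can be mapped onto a few lines of code. The sketch below uses a steady-state envelope heat loss relation, Q = UA (T_indoor - T_outdoor), chosen here purely for illustration; the class name, parameter value, and temperatures are assumptions and not from the text. The overall heat loss coefficient UA is the system parameter, the two temperatures are input variables (the outdoor one being uncontrollable), and the heat loss rate Q is the output variable.

```python
from dataclasses import dataclass


@dataclass
class EnvelopeModel:
    """Hypothetical steady-state heat loss model of a building envelope."""

    ua: float  # system parameter: overall heat loss coefficient, W/K

    def heat_loss(self, t_indoor: float, t_outdoor: float) -> float:
        """Output variable: heat loss rate in W for the given input temperatures (C)."""
        return self.ua * (t_indoor - t_outdoor)


model = EnvelopeModel(ua=250.0)  # assumed parameter value
# Inputs: indoor setpoint (controllable) and outdoor temperature (uncontrollable)
q = model.heat_loss(t_indoor=21.0, t_outdoor=-4.0)
print(f"Heat loss rate: {q:.0f} W")
```

Keeping parameters in the object and inputs in the method call mirrors the distinction the text draws between what describes the system and what acts on it.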
1.4.2
Block Diagrams
An information flow or block diagram4 is a standard shorthand manner of schematically representing the inputs and output quantities of an element or a system as well as the computational sequence of variables. It is a concept widely used in the context of system modeling and simulation since a block implies that its output can be calculated provided the inputs are known. They are useful for setting up the set of model equations to solve in order to simulate or analyze systems or components. As illustrated in Fig. 1.5, a centrifugal pump could be represented as one of many possible block diagrams (as shown in Fig. 1.6), depending on which parameters are of interest. If the model equation is cast in a form such that the outlet pressure p2 is the response variable and the inlet pressure p1 and the fluid flow volumetric rate v are the forcing variables, then the associated block diagram is that shown in Fig. 1.6a. Another type of block diagram is shown in Fig. 1.6b, where flow rate v is the response variable. The arrows indicate the direction of unilateral information or signal flow, which in term can be viewed as cause–effect 3
Validation is defined as the process of bringing the user’s confidence about the model to an acceptable level either by comparing its performance to other more accepted models or by experimentation. 4 Block diagrams should not be confused with material flow diagrams, which for a given system configuration are unique. On the other hand, there can be numerous ways of assembling block diagrams depending on how the problem is framed.
relationship; this is why these models are termed causal. Thus, such diagrams depict the manner in which the simulation models of the various components of a system need to be formulated. In general, a system or process is subject to one or more inputs (or stimulus or excitation or forcing functions) to which it responds by producing one or more outputs (or system response). If the observer is unable to act on the system, i.e., change some or any of the inputs, so as to produce a desired output, the system is not amenable to
Fig. 1.5 Schematic of a centrifugal pump rotating at speed s (say, in rpm), which pumps a water flow rate v from lower pressure p1 to higher pressure p2
Fig. 1.6 Different block diagrams for modeling a pump depending on how the problem is formulated. (a) Inputs p1 and v, output p2. (b) Inputs p1 and p2, output v. (c) Pump with a feedback signal summed with the speed signal s
control. If, however, the inputs can be varied, then control is feasible. Thus, a control system is defined as an arrangement of physical components connected or related in such a manner as to command, direct, or regulate itself or another system (Stubberud et al. 1994). One needs to distinguish between open and closed loops, and block diagrams provide a convenient way of doing so. (a) An open loop control system is one in which the control action is independent of the output (see Fig. 1.7a). Two important features are: (i) their ability to perform accurately is determined by their calibration, i.e., by how accurately one is able to establish the input–output relationship; and (ii) they are generally not unstable. A practical example is an automatic toaster, which is simply controlled by a timer. If the behavior of an open loop system is not completely understood or if unexpected disturbances act on it, then there may be considerable and unpredictable variations in the output. (b) A closed loop control system, also referred to as a feedback control system, is one in which the control action is somehow dependent on the output (Fig. 1.7b). If the value of the response y(t) is too low or too high, then the control action modifies the manipulated variable (shown as u(t)) appropriately. Such systems are designed to cope with lack of exact knowledge of system behavior, inaccurate component models, and unexpected disturbances. Thus, increased accuracy is achieved by reducing the sensitivity of the ratio of output to input to variations in system characteristics (i.e., increased bandwidth defined as the range of variation in the inputs over which the system will respond satisfactorily) or due to random perturbations of the system by the environment. They have a serious disadvantage though: they can
Fig. 1.7 Open and closed loop systems for a controlled output y(t). (a) Open loop. (b) Closed loop, with blocks labeled input x(t), summing point, control element, manipulated variable u(t), system, disturbance, feedback element, and controlled output y(t)
inadvertently develop unstable oscillations. This issue is an important one by itself and is treated extensively in control textbooks. Using the same example of a centrifugal pump but going one step further would lead us to the control of the pump. For example, if the inlet pressure p1 is specified, and the pump needs to be operated or controlled (i.e., say by varying its rotational speed s) under variable outlet pressure p2 so as to maintain a constant fluid flow rate v, then some sort of control mechanism or feedback is often used (shown in Fig. 1.6c). The small circle at the intersection of the signal s and the feedback represents a summing point that denotes the algebraic operation being carried out. For example, if the feedback signal is summed with the signal s, a “+” sign is placed just outside the summing point. Such graphical representations are called signal flow diagrams and are used in process or system control, which requires inverse modeling and parameter estimation.
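The contrast between open and closed loop control can be sketched numerically. The block below is a minimal illustration and not a method taken from the text: it assumes a first-order system with gain K and time constant tau, a constant unmeasured disturbance d, and, for the closed loop case, a proportional-integral control element with hypothetical gains kp and ki. All numerical values are made up for illustration.

```python
def simulate(closed_loop, r=10.0, d=3.0, K=2.0, tau=5.0,
             kp=1.0, ki=0.5, dt=0.1, n_steps=600):
    """Euler simulation of tau*dy/dt = -y + K*u + d (assumed first-order system).

    r is the desired output, d an unmeasured disturbance, and u the
    manipulated variable. Returns the output y at the end of the run.
    """
    y = 0.0          # controlled output y(t)
    integral = 0.0   # running integral of the error (closed loop only)
    for _ in range(n_steps):
        if closed_loop:
            error = r - y                  # feedback: compare output with target
            integral += error * dt
            u = kp * error + ki * integral
        else:
            u = r / K                      # calibrated input that ignores the output
        y += dt / tau * (-y + K * u + d)
    return y


y_open = simulate(closed_loop=False)
y_closed = simulate(closed_loop=True)
print(f"open loop:   y = {y_open:.3f} (target 10)")
print(f"closed loop: y = {y_closed:.3f} (target 10)")
```

Under these assumptions the open loop output settles at the target plus the disturbance, since nothing in the control action depends on y, whereas the feedback arrangement drives the output back to the target.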
1.4.3 Mathematical Representation
Let us start by explaining the difference between parameters and variables in a model. A deterministic model is a mathematical relationship, derived from physical considerations, between variables and parameters. The quantities in a model that can be measured independently during an experiment are the “variables,” which can be either input or output variables (as described earlier). To formulate the relationship among variables, one usually introduces “constants,” called parameters, that denote inherent properties of nature or of the engineering system. Consider the dynamic model of a component or system represented by the block diagram in Fig. 1.8. For simplicity, let us assume a linear model with no lagged terms in the forcing variables. Then, the model can be represented in matrix form as:
$$Y_t = A\,Y_{t-1} + B\,U_t + C\,W_t \quad \text{with} \quad Y_1 = d \tag{1.1}$$
where the output or state variable at time t is Yt. The forcing or input variables are of two types: vector U denoting observable and controllable input variables, and vector W indicating uncontrollable input variables or disturbing inputs that may or may not be observable. The parameter vectors of the model are {A, B, C} while d represents the initial condition vector.
Fig. 1.8 Block diagram of a simple component with parameter vectors {A, B, C}. Vectors U and W are the controllable/observable and the uncontrollable/disturbing inputs, respectively, while Y is the state variable or system response
Examples of Simple Models
(a) Pressure drop Δp of a fluid flowing at velocity v through a pipe of hydraulic diameter Dh and length L:
$$\Delta p = f \, \frac{L}{D_h} \, \frac{\rho v^2}{2} \tag{1.2}$$
where f is the friction factor, and ρ is the density of the fluid. For a given system, v can be viewed as the independent or input variable, while the pressure drop is the state variable. The factors f, L, and Dh are the system or model parameters and ρ is a property of the fluid. Note that the friction factor f is itself a function of the velocity, thus making the problem a bit more complex. Sometimes, the distinction between parameters and variables is ambiguous and depends on the context, i.e., the objective of the study and the manner in which the experiment is performed. For example, in Eq. 1.2, pipe length has been taken to be a fixed system parameter since the intention was to study the pressure drop against fluid velocity. However, if the objective is to determine the effect of pipe length on pressure drop for a fixed velocity, the length would then be viewed as the independent variable.
(b) Rate of heat transfer from a fluid to a surrounding solid:
$$\dot{Q} = UA \, (T_f - T_o) \tag{1.3}$$
where the parameter UA is the overall heat conductance, and Tf and To are the mean fluid and solid temperatures (which are the input variables).
(c) Rate of heat added to a flowing fluid:
$$\dot{Q} = \dot{m} \, c_p \, (T_{out} - T_{in}) \tag{1.4}$$
where ṁ is the fluid mass flow rate, cp is its specific heat at constant pressure, and Tout and Tin are the exit and inlet fluid temperatures. It is left to the reader to identify the input variables, state variables, and the model parameters.
(d) Lumped model of the water temperature Ts in a storage tank with an immersed heating element and losing heat to the environment is given by the first-order ordinary differential equation (ODE):
$$M c_p \, \frac{dT_s}{dt} = P - UA \, (T_s - T_{env}) \tag{1.5}$$
where Mcp is the thermal heat capacitance of the tank (water plus tank material), Tenv the environment temperature, and P is the auxiliary power (or heat rate) supplied to the tank. It is left to the reader to identify the input variables, state variables, and parameters.
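The link between the tank model of Eq. 1.5 and the general form of Eq. 1.1 can be made concrete with a short sketch. Assuming a forward-Euler discretization with time step dt (a choice made here for illustration, not one prescribed by the text) and inputs held constant, Eq. 1.5 becomes a scalar recursion of the same form as Eq. 1.1, with U = P and W = Tenv. The tank size, heater power, and loss coefficient below are hypothetical values.

```python
import math


def simulate_tank(T0, P, T_env, M_cp, UA, dt, n_steps):
    """Forward-Euler discretization of M_cp * dTs/dt = P - UA * (Ts - T_env).

    Rearranged, each step reads Ts[t] = a*Ts[t-1] + b*P + c*T_env, i.e. the
    scalar form of Y_t = A*Y_{t-1} + B*U_t + C*W_t with constant inputs.
    """
    a = 1.0 - dt * UA / M_cp   # plays the role of A
    b = dt / M_cp              # plays the role of B (input U = P)
    c = dt * UA / M_cp         # plays the role of C (input W = T_env)
    Ts = [T0]                  # initial condition d
    for _ in range(n_steps):
        Ts.append(a * Ts[-1] + b * P + c * T_env)
    return Ts


# Illustrative (assumed) values: 200 kg of water, 2 kW heater, 20 degC surroundings
M_cp = 200 * 4.186          # kJ/K
UA = 0.05                   # kW/K
P, T_env, T0 = 2.0, 20.0, 20.0
dt, n_steps = 60.0, 1440    # 60 s steps over 24 h

Ts = simulate_tank(T0, P, T_env, M_cp, UA, dt, n_steps)

# Analytical solution of the same ODE, for comparison
tau = M_cp / UA
t_end = dt * n_steps
Ts_exact = T_env + P / UA + (T0 - T_env - P / UA) * math.exp(-t_end / tau)
print(f"Euler: {Ts[-1]:.2f} degC, exact: {Ts_exact:.2f} degC")
```

In this reading, Ts is the state variable, P and Tenv are the input variables, and Mcp and UA are the parameters that determine the coefficients a, b, and c.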
1.4.4 Classification
Predicting the behavior of a system requires a mathematical representation of the system components. The process of deciding on the level of detail appropriate for the problem at hand is called abstraction (Cha et al. 2000). This process has to be undertaken with care; (i) over-simplification may result in loss of important system behavior predictability, while (ii) an overly detailed model may result in undue data collection effort and computational resources as well as time spent in understanding the model assumptions and results generated. There are different ways by which mathematical models can be classified. Some of these are shown in Table 1.1 and described below. (i) Distributed Versus Lumped Parameter In a distributed parameter system, the elements of the system are continuously distributed along the system geometry so that the variables they influence must be treated as differing not only in time but also in space, i.e., from point to point. Partial differential or difference equations are usually needed. Recall that a partial differential equation (PDE) is a differential equation between partial derivatives of an unknown function against at least two independent variables. One distinguishes between two general cases:
• The independent variables are space variables only.
• The independent variables are both space and time variables.
Though partial derivatives of multivariable functions are ordinary derivatives with respect to one variable (the other being kept constant), the study of PDEs is not an easy extension of the theory for ordinary differential equations (ODEs). The solution of PDEs requires fundamentally different approaches. Recall that ODEs are solved by first finding general solutions and then using subsidiary conditions to determine arbitrary constants. However, such arbitrary constants in general solutions of ODEs are replaced by arbitrary functions in PDE, and determination of these arbitrary functions using subsidiary conditions is usually impossible. In other words, general solutions of ODEs are of limited use in solving PDEs. In general, the solution of the PDEs and subsidiary conditions (called initial or boundary conditions) needs to be determined simultaneously. Hence, it is wise to try to simplify the PDE model as far as possible when dealing with data analysis situations.
In a lumped parameter system, the elements are small enough (or the objective of the analysis is such that simplification is warranted) so that each such element can be treated as if it were concentrated (i.e., lumped) at one particular spatial point in the system. The position of the point can change with time but not in space. Such systems usually are adequately modeled by ODE or difference equations. A heated billet as it cools in air could be analyzed as either a distributed system or a lumped parameter system depending on whether the Biot number (Bi) is greater than or less than 0.1 (Fig. 1.9). Recall that the Biot number is proportional to the ratio of the internal to the external heat flow resistances of the sphere. So, a small Biot number would imply that the resistance to heat flow attributed to internal body temperature
Table 1.1 Ways of classifying mathematical models
     Different classification schemes
1    Distributed vs lumped parameter
2    Dynamic vs static or steady-state
3    Deterministic vs stochastic
4    Continuous vs discrete
5    Linear vs nonlinear in the functional model
6    Linear vs nonlinear in the model parameters
7    Time invariant vs time variant
8    Homogeneous vs non-homogeneous
9    Simulation vs performance models
10   Physics based (white-box) vs data based (black-box) and mix of both (gray-box)
Adapted from Eisen (1988)
Fig. 1.9 Cooling of a solid sphere in air can be modeled as a lumped model provided the Biot number Bi < 0.1. This number is proportional to the ratio of the heat conductive resistance (1/k) inside the sphere to the convective resistance (1/h) from the outer envelope of the sphere to the air
Fig. 1.10 Thermal networks to model heat flow through a homogeneous plane wall of surface area A and wall thickness Δx. (a) Schematic of the wall with the indoor and outdoor temperatures and convective heat flow coefficients. (b) Lumped model with two resistances and one capacitance (2R1C model). (c) Higher nth order model with n layers of equal thickness (Δx/n). The numerical discretization assumes all capacitances to be equal, while only the (n - 2) internal resistances (excluding the two end resistances) are taken to be equal. (From Reddy et al. 2016)
gradient is small enough that it can be neglected without biasing the analysis. Thus, a small body with high thermal conductivity and low convection coefficient can be adequately modeled as a lumped system. Another example of lumped model representation is the 1-D heat flow through the wall of a building (Fig. 1.10a) using the analogy between heat flow and electricity flow. The internal and external convective film heat transfer coefficients are represented by hi and ho, respectively, while k, ρ, and cp are the thermal conductivity, density, and specific heat of the wall material, respectively. In the lower limit, the wall can be discretized into one lumped layer of capacitance C with two resistors as shown by the electric network of Fig. 1.10b (referred to as 2R1C network). In the upper limit, the network can be represented by “n” nodes (see Fig. 1.10c). The 2R1C simplification does lead to some errors, which under certain circumstances are outweighed by the convenience it provides while yielding acceptable results.
(ii) Dynamic Versus Steady-State
Dynamic models are defined as those that allow transient system or equipment behavior to be captured with explicit recognition of the time varying behavior of both output and input variables. The steady-state or static or zero-order model is one that assumes no time variation in its input variables (and hence, no change in the output variable as well). One can also distinguish an intermediate type, referred to as quasi-static models. Cases arise when the input variables (such as incident solar radiation on a solar hot water panel) are constantly changing at a short time scale (say, at the minute scale) and the thermal output needs to be predicted at hourly intervals only. The dynamic behavior is poorly predicted by the solar collector model at such high-frequency time scales, and so the input variables can be “time-averaged” so as to make them constant during a specific hourly interval. This is akin to introducing a “low pass filter” for the inputs. Thus, the use of quasi-static models allows one to predict the system output(s) in discrete time variant steps or intervals during a given day with the system inputs averaged (or summed) over each of the time intervals fed into the model. These models could be either zero-order or low-order ODEs. Dynamic models are usually represented by PDEs or, when spatially lumped, by ODEs with respect to time. One could solve them directly, and the simple cases are illustrated in Sect. 1.4.5. Since solving these equations gets harder as the order of the model increases, it is often more convenient to recast the differential equations in a time-series formulation using response functions or transfer functions, which involve time-lagged values of the input variable(s) only, or of both the inputs and the response, respectively. This formulation is discussed in Chap. 8. The steady-state or static or zero-order model is one that assumes no time variation in its inputs or outputs. Its time-series formulation results in simple algebraic equations with no time-lagged values of the input variable(s) appearing in the function.
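The Biot number criterion for choosing between a lumped and a distributed model, discussed under (i), can be checked in a few lines. The sketch below assumes the common convention Bi = h Lc / k with the characteristic length Lc taken as volume over surface area (radius/3 for a sphere); the text only states that Bi is proportional to the ratio of internal to external resistance, so this definition and the material values are assumptions made for illustration.

```python
def biot_number(h, k, radius):
    """Biot number of a sphere, Bi = h * Lc / k, with Lc = volume/area = radius/3."""
    characteristic_length = radius / 3.0
    return h * characteristic_length / k


def lumped_time_constant(rho, cp, h, radius):
    """Time constant rho*cp*V / (h*A) of the lumped (single-node) cooling model."""
    return rho * cp * (radius / 3.0) / h


# Assumed illustrative cases (SI units): a small steel ball and a large plastic ball
cases = {
    "steel, r = 10 mm": dict(h=20.0, k=40.0, radius=0.010),
    "plastic, r = 50 mm": dict(h=50.0, k=0.2, radius=0.050),
}
for name, c in cases.items():
    bi = biot_number(**c)
    verdict = "lumped model adequate" if bi < 0.1 else "distributed model needed"
    print(f"{name}: Bi = {bi:.4f} -> {verdict}")

tau_steel = lumped_time_constant(rho=7800.0, cp=470.0, h=20.0, radius=0.010)
print(f"steel ball time constant: {tau_steel:.0f} s")
```

When the criterion is met, the whole body is described by a single temperature and a first-order ODE, whose time constant is what the second function returns.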
(iii) Deterministic Versus Stochastic
A deterministic system is one whose response to specified inputs under specified conditions is completely predictable (to within a certain accuracy of course) from physical laws. Thus, the response is precisely reproducible time and time again. A stochastic system is one where the specific output can be predicted to within an uncertainty range only, which could be due to two reasons: (i) that the inputs themselves are random and vary unpredictably within a specified range of values (such as the electric power output of a wind turbine subject to gusting winds), and/or (ii) because the models are not accurate (e.g., the dose–response of individuals when subject to asbestos inhalation). Concepts from probability theory are required to make predictions about the response. Most observed data contain some stochasticity, either due to measurement noise/errors or due to the random nature of the process itself. If the random element is so small that it is negligible as compared to the “noise” in the system, then the process or system can be treated in a purely deterministic framework. The orbits of the planets, though well described by Kepler’s laws, have small disturbances due to other secondary effects, but Newton was able to treat them as deterministic and verify his law of gravitation. On the other hand, Brownian molecular motion is purely random, and has to be treated by stochastic methods.
(iv) Continuous Versus Discrete
A continuous system is one in which all the essential variables are continuous in nature and the time that the system operates is some interval (or intervals) of the real numbers. Usually such systems need differential equations to describe them. A discrete system is one in which all essential variables are discrete and the time that the system operates is a finite subset of the real numbers. This system can be described by difference equations.
In most applications in engineering, the system or process being studied is fundamentally continuous. However, the continuous output signal from a system is usually converted into a discrete signal by sampling. Alternatively, the continuous system can be replaced by its discrete analog that, of course, has a discrete signal. Hence, analysis of discrete data is usually more widespread in data analysis applications.
(v) Linear Versus Nonlinear
A system is said to be linear if, and only if, it has the following property: if an input x1(t) produces an output y1(t), and if an input x2(t) produces an output y2(t), then an input [c1 x1(t) + c2 x2(t)] produces an output [c1 y1(t) + c2 y2(t)] for all pairs of inputs x1(t) and x2(t) and all pairs of real number constants c1 and c2. This concept is illustrated in Fig. 1.11. An equivalent concept is the principle of superposition, which states that the response of a linear system due to several inputs acting simultaneously is equal to the sum of the responses of each input acting alone. This is an extremely important concept since it allows the response of a complex system to be determined more simply by decomposing the input driving function into simpler terms, solving the equation for each term separately, and then summing the individual responses to obtain the desired aggregated response. Such a strategy is common in detailed hour-by-hour building energy simulation programs (Reddy et al. 2016).
Fig. 1.11 Principle of superposition of a linear system: input x1 gives output y1, input x2 gives output y2, and the combined input c1 x1 + c2 x2 gives the output c1 y1 + c2 y2
An important distinction needs to be made between a linear model and a model that is linear in its parameters. For example,
• y = ax1 + bx2 is linear in both model and parameters a and b.
• y = a sin x1 + bx2 is a nonlinear model but is linear in its parameters.
• y = a exp (bx1) is nonlinear in both model and parameters.
In all fields, linear differential or difference equations are by far more widely used than nonlinear equations. Even if the models are nonlinear, every attempt is made, due to the subsequent convenience it provides, to make them linear either by suitable transformation (such as logarithmic transform) or by piece-wise linearization, i.e., linear approximation over a smaller range of variation. The advantages of linear systems over nonlinear systems are many:
• Linear systems are simpler to analyze.
• General theories are available to analyze them.
• They do not have singular solutions (simpler engineering problems rarely have them anyway).
• Well-established methods are available, such as the state space approach (see Sect. 10.4.4), for analyzing even relatively complex sets of equations. The practical advantage with this type of time domain transformation is that large systems of higher-order ODEs can be transformed into a first-order system of simultaneous equations that, in turn, can be solved rather easily by numerical methods.
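The practical value of a model being linear in its parameters, even when it is nonlinear in its variables, can be illustrated with a short sketch. For y = a sin x1 + b x2, the regressors sin x1 and x2 can be computed first, after which a and b follow from ordinary linear least squares. The data below are synthetic and noise-free, generated with assumed values a = 2.0 and b = 0.5; the function name is hypothetical.

```python
import math


def fit_linear_in_parameters(x1, x2, y):
    """Least-squares estimate of (a, b) in y = a*sin(x1) + b*x2.

    The regressors sin(x1) and x2 are computed first; the model is then linear
    in a and b, so the 2x2 normal equations can be solved in closed form.
    """
    g1 = [math.sin(v) for v in x1]
    g2 = list(x2)
    s11 = sum(u * u for u in g1)
    s12 = sum(u * v for u, v in zip(g1, g2))
    s22 = sum(v * v for v in g2)
    s1y = sum(u * t for u, t in zip(g1, y))
    s2y = sum(v * t for v, t in zip(g2, y))
    det = s11 * s22 - s12 * s12
    a = (s1y * s22 - s12 * s2y) / det
    b = (s11 * s2y - s12 * s1y) / det
    return a, b


# Synthetic, noise-free data generated with assumed values a = 2.0, b = 0.5
x1 = [0.1 * i for i in range(20)]
x2 = [0.5 * i for i in range(20)]
y = [2.0 * math.sin(u) + 0.5 * v for u, v in zip(x1, x2)]

a_hat, b_hat = fit_linear_in_parameters(x1, x2, y)
print(f"a = {a_hat:.6f}, b = {b_hat:.6f}")
```

No such closed-form solution exists for y = a exp(b x1) as written, because b appears inside the exponential; that case needs a transformation or an iterative method.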
(vi) Time Invariant Versus Time Variant
A system is time invariant or stationary if neither the form of the equations characterizing the system, nor the model parameters vary with time under either constant or varying inputs; otherwise the system is time-variant or non-stationary. In some cases, when the model structure is poor and/or when the data are very noisy, time variant models are used requiring either on-line or off-line updating depending on the frequency of the input forcing functions and how quickly the system responds. Examples of such instances abound in electrical engineering applications. Usually, one tends to encounter time invariant models in less complex thermal and environmental engineering applications.
(vii) Homogeneous Versus Non-Homogeneous
If there are no external inputs and the system behavior is determined entirely by its initial conditions, then the system is called homogeneous or unforced or autonomous; otherwise it is called non-homogeneous or forced. Consider the general form of an nth order time-invariant or stationary linear ODE:
$$A y^{(n)} + B y^{(n-1)} + \dots + M y'' + N y' + O y = P(x) \tag{1.6}$$
where y′, y″, and y(n) are the first, second, and nth derivatives of y with respect to x, and A, B, . . . M, N, and O are constants. The function P(x) frequently corresponds to some external influence on the system and is a function of the independent variable. Often, the independent variable is the time variable t. This is intentional since time comes into play when the dynamic behavior of most physical systems is modeled. However, the variable x can be assigned any other physical quantity as appropriate. To completely specify the problem, i.e., to obtain a unique solution y(x), one needs to specify two additional factors: (i) the interval of x over which a solution is desired, and (ii) a set of n initial conditions. If these conditions are such that y(x) and its (n - 1) derivatives are specified for x = 0, then the problem is called an initial value problem. Thus, one distinguishes between:
(a) The homogeneous form where P(x) = 0, i.e., there is no external driving force. The solution of the differential equation:
$$A y^{(n)} + B y^{(n-1)} + \dots + M y'' + N y' + O y = 0 \tag{1.7}$$
yields the free response of the system. The homogeneous solution is a general solution whose arbitrary constants are then evaluated using the initial (or boundary) conditions, thus making it unique to the situation.
(b) The non-homogeneous form where P(x) ≠ 0 and Eq. 1.6 applies. The forced response of the system is associated with the case when all the initial conditions are identically zero, i.e., y(0), y′(0), . . ., up to the (n - 1)th derivative at x = 0 are all zero. Thus, the implication is that the forced response is only dependent on the external forcing function P(x).
The total response of the linear time-invariant ODE is the sum of the free response and the forced response (thanks to the superposition principle). When system control is being studied, slightly different terms are often used to specify total dynamic system response: (a) the steady-state response is that part of the total response that does not approach zero as time approaches infinity, and (b) the transient response is that part of the total response that approaches zero as time approaches infinity.
(viii) Simulation Versus Performance-Based Models
The distinguishing trait between simulation and performance models is the basis on which the model structure is framed (this categorization is quite important). Simulation models are used to predict system performance during the design phase when no actual system exists, and design alternatives are being evaluated. A performance-based model (also referred to as “stochastic data model”) relies on measured performance data of the actual system to provide insights into model structure and to estimate its parameters or to simply predict future performance (such models are referred to as “algorithmic models”). Both modeling approaches are discussed in Sects. 1.5 and 1.6.
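To make the decomposition into free and forced responses described under (vii) concrete, consider the hypothetical second-order example y″ + 3y′ + 2y = P with constant P, chosen here only because its characteristic roots (-1 and -2) give simple closed-form solutions. The sketch below evaluates the free response (P = 0, given initial conditions) and the forced response (zero initial conditions) analytically, and checks their sum against a direct fourth-order Runge-Kutta integration of the full problem.

```python
import math


def free_response(t, y0, v0):
    """Solution of y'' + 3y' + 2y = 0 with y(0) = y0 and y'(0) = v0."""
    c1 = 2.0 * y0 + v0
    c2 = -(y0 + v0)
    return c1 * math.exp(-t) + c2 * math.exp(-2.0 * t)


def forced_response(t, P):
    """Solution of y'' + 3y' + 2y = P (constant) with zero initial conditions."""
    return P / 2.0 - P * math.exp(-t) + (P / 2.0) * math.exp(-2.0 * t)


def integrate_rk4(y0, v0, P, t_end, n_steps):
    """Direct integration of the full problem as a first-order system (y, v = y')."""
    def deriv(y, v):
        return v, P - 3.0 * v - 2.0 * y

    h = t_end / n_steps
    y, v = y0, v0
    for _ in range(n_steps):
        k1y, k1v = deriv(y, v)
        k2y, k2v = deriv(y + 0.5 * h * k1y, v + 0.5 * h * k1v)
        k3y, k3v = deriv(y + 0.5 * h * k2y, v + 0.5 * h * k2v)
        k4y, k4v = deriv(y + h * k3y, v + h * k3v)
        y += h * (k1y + 2.0 * k2y + 2.0 * k3y + k4y) / 6.0
        v += h * (k1v + 2.0 * k2v + 2.0 * k3v + k4v) / 6.0
    return y


y0, v0, P, t_end = 1.0, 0.0, 4.0, 2.0   # assumed illustrative values
total = free_response(t_end, y0, v0) + forced_response(t_end, P)
numeric = integrate_rk4(y0, v0, P, t_end, n_steps=2000)
print(f"free + forced = {total:.6f}, direct integration = {numeric:.6f}")
```

In this example the term P/2 is the steady-state response, while the exponential terms, from both the free and the forced parts, make up the transient response.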
1.4.5 Steady-State and Dynamic Models
Let us illustrate steady-state and dynamic system responses using the example of measurement sensors. Steady-state models (also called zero-order models) apply when input variables (and hence, the output variables) are maintained constant. A zero-order model for the dynamic performance of measuring systems is used (i) when the variation in the quantity to be measured is very slow as compared to how quickly the instrument responds, or (ii) as a standard of comparison for other more sophisticated models. For a zero-order instrument, the output is directly proportional to the input (Doebelin 1995):
$$a_0 q_o = b_0 q_i \tag{1.8a}$$
or
$$q_o = K q_i \tag{1.8b}$$
where a0 and b0 are the system parameters, assumed time invariant, qo and qi are the output and the input quantities, respectively, and K = b0/a0 is called the static sensitivity of the instrument. Hence, only K is required to completely specify the response of the instrument. Thus, the zero-order instrument is an ideal instrument; no matter how rapidly the measured variable changes, the output signal faithfully and instantaneously reproduces the input. The next step in complexity is the first-order model:
$$a_1 \frac{dq_o}{dt} + a_0 q_o = b_0 q_i \tag{1.9a}$$
or
$$\tau \frac{dq_o}{dt} + q_o = K q_i \tag{1.9b}$$
where τ is the time constant of the instrument (τ = a1/a0), and K is the static sensitivity of the instrument. Thus, two numerical parameters are used to completely specify a first-order instrument. The solution to Eq. 1.9b for a step change in input is:
$$q_o(t) = K q_{is} \left( 1 - e^{-t/\tau} \right) \tag{1.10}$$
where qis is the value of the input quantity after the step change. After a step change in the input, the steady-state value of the output will be K times the input qis (just as in the zero-order instrument). This is shown as a dotted horizontal line in Fig. 1.12 with a numerical value of 20. The time constant characterizes the speed of response; the smaller its value the faster its response, and vice versa, to any kind of input. Figure 1.12 illustrates the dynamic response and the associated time constants for two instruments when subject to a step change in the input. Numerically, the time constant represents the time taken for the response to reach 63.2% of its final change, or to reach a value within 36.8% of the final value. This follows from Eq. 1.10 by setting t = τ, i.e., $q_o(\tau) / (K q_{is}) = 1 - e^{-1} = 0.632$. Another useful measure of response speed for any instrument is the 5% settling time, i.e., the time for the output signal to get to within 5% of the final value. For any first-order instrument, it is equal to about three times the time constant (since $e^{-3} \approx 0.05$).
Fig. 1.12 Step-responses of two first-order instruments with different response times on a plot with instrument reading (y-axis) versus time from step change in input, in seconds (x-axis). The response is characterized by the time constant, which is the time for the instrument reading to reach 63.2% of the steady-state value. Curves are shown for a small and a large time constant, with the steady-state value and the 63.2% level marked
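The two measures of response speed quoted for a first-order instrument, the 63.2% point at t = τ and the 5% settling time of about 3τ, can be verified directly from Eq. 1.10. The sketch below assumes a zero initial reading; the values of K, qis, and the two time constants are hypothetical, chosen so that the steady-state reading is 20.

```python
import math


def step_response(t, K, q_is, tau):
    """First-order instrument output after a step input of size q_is (Eq. 1.10)."""
    return K * q_is * (1.0 - math.exp(-t / tau))


def settling_time(tau, band=0.05):
    """Time for the output to come within `band` (a fraction) of its final value."""
    return -tau * math.log(band)


K, q_is = 2.0, 10.0      # assumed values giving a steady-state reading of 20
for tau in (2.0, 6.0):   # assumed "small" and "large" time constants, in seconds
    frac_at_tau = step_response(tau, K, q_is, tau) / (K * q_is)
    ts = settling_time(tau)
    print(f"tau = {tau:.0f} s: {100 * frac_at_tau:.1f}% of final value at t = tau, "
          f"5% settling time = {ts:.2f} s ({ts / tau:.2f} tau)")
```

The ratio of settling time to time constant is ln(20) for any first-order instrument, which is why the rule of thumb of three time constants does not depend on K or on the size of the step.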
1.5 Mathematical Modeling Approaches
1.5.1 Broad Categorization
As shown in Fig. 1.13, one can differentiate between two broad types of mathematical approaches meant for:
(a) Simulation or forward (or well-defined or well-specified) problems.
(b) Inverse problems, which include calibrated forward models and statistical models identified primarily from data, which can be black-box or gray-box. The latter can again be designated as partial gray-box, i.e., inspired from partial understanding, or mechanistic gray-box or physics-based reduced-order structural models.
1.5.2 Simulation or Forward Modeling
Simulation or forward or white-box or detailed mechanistic models are based on the laws of physics and permit accurate and microscopic modeling of the various fluid flow, heat and mass transfer phenomena, etc. that occur within engineered systems. Students are generally familiar with representing the temporal (or transient or dynamic) and spatial performance of numerous equipment, subsystems, and systems in terms of algebraic, ordinary differential, and partial differential equations (ODEs and PDEs, respectively), and with solving them analytically or numerically (sequentially or iteratively) over different time periods and durations (from minutes to hours, days, and years). A high level of physical understanding is necessary to develop these models, complemented with some
Fig. 1.13 Overview of various traditional mathematical modeling approaches with applications. Box labels: mathematical models; forward models (simulation-based); inverse models (based on performance data); white-box models; calibrated forward models; data-driven or statistical models; black-box models (curve fitting); partial grey-box models (mechanistic); physics-based reduced order models; grey-box models (mechanistic). Application labels: system design applications; performance prediction of existing systems; understanding, prediction, control
expertise in numerical analysis. Consequently, these have found their niche in design studies prior to building a system where the effect of different system configurations needs to be evaluated under different operating conditions and scenarios. Adopting the model specified by Eq. 1.1 and Fig. 1.8, such problems are framed as:
$$\text{Given } \{U, W\} \text{ and } \{B, C\}, \text{ determine } Y \tag{1.11}$$
The objective is to predict the response or state variables of a specified model with known structure and known parameters when subject to specified input or forcing variables. This is also referred to as the “well-defined problem” since it has a unique solution if formulated properly (in other words, the number of degrees of freedom is zero). Such models are implicitly studied in classical mathematics and also in system simulation design courses. For example, consider a simple steady-state problem wherein the operating point of a pump and piping network is represented by black-box models of the pressure drop (Δp) and volumetric flow rate (V), such as shown in Fig. 1.14:
$$\begin{aligned} \Delta p &= a_1 + b_1 V + c_1 V^2 && \text{for the pump} \\ \Delta p &= a_2 + b_2 V + c_2 V^2 && \text{for the pipe network} \end{aligned} \tag{1.12}$$
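The operating point defined by the two curves of Eq. 1.12 can be found by equating them, which leaves a single quadratic equation in V. The sketch below does this in closed form; the coefficient values and their units are hypothetical, chosen only so that the pump curve falls and the system curve rises with flow rate, and the function name is an assumption.

```python
import math


def operating_point(pump, system):
    """Intersection of two quadratic curves dp = a + b*V + c*V**2.

    `pump` and `system` are (a, b, c) tuples. Returns (V0, dp0) for the
    smallest positive flow rate at which the two curves agree.
    """
    a = pump[0] - system[0]
    b = pump[1] - system[1]
    c = pump[2] - system[2]
    if c == 0.0:
        raise ValueError("equal quadratic terms; solve the linear case separately")
    disc = b * b - 4.0 * c * a
    if disc < 0.0:
        raise ValueError("the curves do not intersect")
    roots = [(-b + s * math.sqrt(disc)) / (2.0 * c) for s in (1.0, -1.0)]
    positive = [v for v in roots if v > 0.0]
    if not positive:
        raise ValueError("no operating point at positive flow rate")
    V0 = min(positive)
    dp0 = system[0] + system[1] * V0 + system[2] * V0 ** 2
    return V0, dp0


# Hypothetical coefficients (dp in kPa, V in m3/s), chosen only for illustration
pump_curve = (100.0, 0.0, -2000.0)    # a1, b1, c1: pressure rise falls with flow
system_curve = (20.0, 0.0, 3000.0)    # a2, b2, c2: pressure drop rises with flow

V0, dp0 = operating_point(pump_curve, system_curve)
print(f"V0 = {V0:.4f} m3/s, dp0 = {dp0:.1f} kPa")
```

With two unknowns and two equations there are no remaining degrees of freedom, which is what makes this a well-defined forward problem.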
Fig. 1.14 Example of a forward problem where solving two simultaneous equations, one representing the pump curve and the other the system curve, yields the operating point
Fig. 1.15 Schematic of the cooling plant for Example 1.5.1, showing the condenser, expansion valve, compressor, and evaporator, with labels tb, tc, te, qc, qe, and P
Solving the two equations simultaneously yields the performance condition of the operating point, i.e., pressure drop and flow rate (Δp0, V0). Note that the numerical values of the model parameters {ai, bi, ci} are known, and that Δp and V are the two variables, while the two equations provide the two constraints. This simple example has obvious extensions to the solution of differential equations where the combined spatial and temporal response is sought. In order to ensure accuracy of prediction, the models have tended to become increasingly complex, especially with the advent of powerful and inexpensive computing. The divide and conquer mind-set is prevalent in this approach, often with detailed mathematical equations based on scientific laws used to model micro-elements of the complete
system. This approach presumes detailed knowledge of not only the various natural phenomena affecting system behavior but also of the magnitude of various interactions (e.g., heat and mass transfer coefficients, friction coefficients, etc.). The main advantage of this approach is that the system need not be physically built in order to predict its behavior. Thus, this approach is ideal in the preliminary design and analysis stage and is most often employed as such. Note that incorporating superfluous variables and needless modeling details does increase computing time and complexity in the numerical resolution. However, if done correctly, it does not usually compromise the accuracy of the solution obtained.
Example 1.5.1 Simulation of a chiller.
Consider an example of simulating a chilled water cooling plant consisting of the condenser, compressor, and evaporator, as shown in Fig. 1.15.⁵ Simple black-box models are used for easier comprehension. The steady-state cooling capacity qe (in kWt⁶) and the compressor electric power draw P (in kWe) are functions of the refrigerant evaporator temperature te and the refrigerant condenser temperature tc in °C, and are supplied by the equipment manufacturer:
$$\begin{aligned} q_e = {} & 239.5 + 10.073\,t_e - 0.109\,t_e^2 - 3.41\,t_c - 0.00250\,t_c^2 - 0.2030\,t_e t_c \\ & + 0.00820\,t_e^2 t_c + 0.0013\,t_e t_c^2 - 0.000080005\,t_e^2 t_c^2 \end{aligned} \tag{1.13}$$
and
$$\begin{aligned} P = {} & -2.634 - 0.3081\,t_e - 0.00301\,t_e^2 + 1.066\,t_c - 0.00528\,t_c^2 - 0.0011\,t_e t_c \\ & - 0.000306\,t_e^2 t_c + 0.000567\,t_e t_c^2 + 0.0000031\,t_e^2 t_c^2 \end{aligned} \tag{1.14}$$
Another equation needs to be introduced for the heat rejected at the condenser $q_c$ (in kWt). This is simply given by a heat balance of the system (i.e., from the first law of thermodynamics) as:

$$ q_c = q_e + P \tag{1.15} $$
The forward problem would entail determining the unknown values of Y = {$t_e$, $t_c$, $q_e$, $P$, $q_c$}. Since there are five unknowns, five equations are needed. In addition to the three equations above, two more are required. They are the heat transfer equations at the evaporator and the condenser between the refrigerant (assumed to be changing phase and so at a constant temperature) and the circulating water:

$$ q_e = m_e c_p (t_a - t_e)\left[1 - \exp\left(-\frac{UA_e}{m_e c_p}\right)\right] \tag{1.16} $$

and

$$ q_c = m_c c_p (t_c - t_b)\left[1 - \exp\left(-\frac{UA_c}{m_c c_p}\right)\right] \tag{1.17} $$

where $c_p$ is the specific heat of water = 4.186 kJ/(kg K). Further, the following parameter values are specified:
• Water flow rate through the evaporator, $m_e$ = 6.8 kg/s, and through the condenser, $m_c$ = 7.6 kg/s
• Thermal conductance of the evaporator, $UA_e$ = 30.6 kW/K, and that of the condenser, $UA_c$ = 26.5 kW/K
• Inlet water temperature to the evaporator, $t_a$ = 10 °C, and that to the condenser, $t_b$ = 25 °C

Solving Eqs. 1.13–1.17 results in:

$$ t_e = 2.84\,^{\circ}\mathrm{C}, \quad t_c = 34.05\,^{\circ}\mathrm{C}, \quad q_e = 134.39\ \mathrm{kW}, \quad P = 28.34\ \mathrm{kW} $$

so that, from Eq. 1.15, $q_c = q_e + P = 162.73$ kW.

To summarize, the performance of the various equipment and their interaction have been represented by mathematical equations, which allow a single solution set to be determined. This is the case of the well-defined forward problem adopted in system simulation and design studies.

There are instances when the same system could be subject to the inverse model approach. Consider the case when a cooling plant similar to that assumed above exists, and the facility manager wishes to instrument the various components in order to: (i) verify that the system is performing adequately, and (ii) vary some of the operating variables so that the power consumed by the compressor is reduced. In such a case, the numerical model coefficients given in Eqs. 1.13 and 1.14 will be unavailable, and so will the UA values, either because the manager is unable to find the manufacturer-provided models in the plant documents or because the equipment has degraded such that the original models are no longer accurate. The model calibration will involve determining these values from experimental data gathered by appropriately sub-metering the evaporator, condenser, and compressor on both the refrigerant and the water coolant side. How best to make these measurements, how accurate the instrumentation should be, what the sampling frequency should be, for how long one should monitor, etc. are all issues that fall within the purview of design of field monitoring. Uncertainty in the measurements, as well as the fact that the assumed models are approximations of reality, will introduce model prediction errors, and so the verification of the actual system against measured performance will have to consider such aspects realistically.

⁵ From Stoecker (1989), by permission of McGraw-Hill.
⁶ kWt denotes that the units correspond to thermal energy, while kWe denotes energy in electric units.
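The forward problem of Example 1.5.1 can be reproduced numerically. The sketch below is one possible way to do so and is not the solution procedure used in the source: it eliminates $q_e$, $P$, and $q_c$ through Eqs. 1.13–1.15 and applies a Newton iteration with a finite-difference Jacobian to the two remaining unknowns $t_e$ and $t_c$. The function names, starting guess, and tolerances are illustrative assumptions.

```python
import math

# Parameter values quoted in Example 1.5.1
CP = 4.186                # specific heat of water, kJ/(kg K)
M_E, M_C = 6.8, 7.6       # evaporator / condenser water flow rates, kg/s
UA_E, UA_C = 30.6, 26.5   # evaporator / condenser conductances, kW/K
T_A, T_B = 10.0, 25.0     # evaporator / condenser inlet water temperatures, deg C


def q_evap(te, tc):
    """Cooling capacity qe in kWt (Eq. 1.13)."""
    return (239.5 + 10.073 * te - 0.109 * te**2 - 3.41 * tc - 0.00250 * tc**2
            - 0.2030 * te * tc + 0.00820 * te**2 * tc + 0.0013 * te * tc**2
            - 0.000080005 * te**2 * tc**2)


def power(te, tc):
    """Compressor power P in kWe (Eq. 1.14)."""
    return (-2.634 - 0.3081 * te - 0.00301 * te**2 + 1.066 * tc
            - 0.00528 * tc**2 - 0.0011 * te * tc - 0.000306 * te**2 * tc
            + 0.000567 * te * tc**2 + 0.0000031 * te**2 * tc**2)


def residuals(te, tc):
    """Mismatch between the equipment maps and the heat-exchanger equations."""
    eff_e = 1.0 - math.exp(-UA_E / (M_E * CP))
    eff_c = 1.0 - math.exp(-UA_C / (M_C * CP))
    qe = q_evap(te, tc)
    qc = qe + power(te, tc)                      # Eq. 1.15
    r1 = qe - M_E * CP * (T_A - te) * eff_e      # Eq. 1.16
    r2 = qc - M_C * CP * (tc - T_B) * eff_c      # Eq. 1.17
    return r1, r2


def solve_chiller(te=5.0, tc=35.0, tol=1e-9, max_iter=50, h=1e-6):
    """Newton iteration on (te, tc) with a forward-difference Jacobian."""
    for _ in range(max_iter):
        r1, r2 = residuals(te, tc)
        if max(abs(r1), abs(r2)) < tol:
            break
        a1, a2 = residuals(te + h, tc)   # perturb te
        b1, b2 = residuals(te, tc + h)   # perturb tc
        j11, j12 = (a1 - r1) / h, (b1 - r1) / h
        j21, j22 = (a2 - r2) / h, (b2 - r2) / h
        det = j11 * j22 - j12 * j21
        # Solve J * [d_te, d_tc] = -[r1, r2] by Cramer's rule
        te += (-r1 * j22 + r2 * j12) / det
        tc += (-r2 * j11 + r1 * j21) / det
    qe, p = q_evap(te, tc), power(te, tc)
    return te, tc, qe, p, qe + p


te, tc, qe, p, qc = solve_chiller()
print(f"te = {te:.2f} C, tc = {tc:.2f} C, qe = {qe:.2f} kW, "
      f"P = {p:.2f} kW, qc = {qc:.2f} kW")
```

With the parameter values of the example, the iteration should settle close to the solution quoted in the text; small differences in the last digit can arise from rounding of the published coefficients.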
1.5.3 Inverse Modeling
It is rather difficult to succinctly define inverse problems since they apply to different classes of problems with applications in diverse areas, each with its own terminology and viewpoints (it is no wonder that the field suffers from the “blind men and the elephant” syndrome). Generally speaking, inverse problems are those that involve identification of model structure (system identification) and/or estimates of model parameters where the system under study already exists, and one uses measured or observed system behavior to aid in the model building and/or refinement. Different model forms may capture the data trend; this is why some argue that inverse problems can be referred to as “ill-defined” or “ill-posed.” Following Eq. 1.1 and Fig. 1.8, inverse problems can be conceptually framed as either:
– Parameter estimation problems:

$$ \text{given } \{Y, U, W, d\}, \text{ determine } \{A, B, C\} \tag{1.18} $$

– Control models:

$$ \text{given } \{Y''\} \text{ and } \{A, B, C\}, \text{ determine } \{U, W, d\} \tag{1.19} $$

where Y″ is meant to denote that only limited measurements may be available for the state variable.

Typically, one (i) takes measurements of the various parameters (or regressor variables) affecting the output (or response variables) of a device or a phenomenon, (ii) identifies a quantitative correlation between them by regression, and (iii) uses it to make predictions about system behavior under future operating conditions. As shown in Fig. 1.13, one can distinguish between black-box and gray-box approaches in terms of how the quantitative correlation is identified. Table 1.2 summarizes different characteristics of both these approaches and those of the simulation modeling approach.
(i) Black-box approach identifies a simple mathematical function between response and regressor variables assuming that either (i) nothing is known about the
Table 1.2 Characteristics of different types of modeling approaches

| Approach | Time variation of system inputs/outputs | Model type | Physical understanding | Types of equations |
|---|---|---|---|---|
| Simulation | Dynamic, quasi-static | White-box, detailed mechanistic | High | PDE, ODE, algebraic |
| Mechanistic-inverse | Dynamic, quasi-static, steady-state | Gray-box, semi-physical, reduced-order | Medium | ODE, algebraic |
| Empirical-inverse | Static or steady-state | Black-box, curve-fit | Low | Algebraic |

ODE ordinary differential equation, PDE partial differential equation
innards or the inside workings of the system, or (ii) there is very little/partial understanding of system behavior. The system functioning is opaque to the user.
(ii) Gray-box approach uses the physics-based understanding of system structure and functioning to identify a mathematical model and deduce physically relevant system parameters from measured response and input variables. The resulting models are usually reduced-order, i.e., lumped models based on first-order ODE or algebraic equations. Such system parameters (e.g., overall heat loss coefficient, time constant, etc.) can serve to improve our mechanistic understanding of the phenomenon or system behavior. The mechanistic model structure can make more accurate predictions and provide better control capability. The identification of these models, which combine phenomenological plausibility with mathematical simplicity, generally requires both a good understanding of the physical phenomenon or of the systems/equipment being modeled and competence in statistical methods.

These analysis methods were initially proposed several hundred years ago and have seen major improvements over the years (especially during the last hundred years), resulting in a rich literature with great diversity of techniques and levels of sophistication. Traditionally, the distinction was made between parametric and non-parametric methods, and these are discussed in Chaps. 5 and 9. With the advent of computing power, resampling methods have been gaining increasing importance and popularity because they (i) provide additional flexibility in model building and in predictive accuracy, (ii) are more robust in estimating the errors in both model coefficients and predictive ability, and (iii) are able to handle much larger sets of regressor variables than traditional parametric modeling. These are discussed in Chap. 5.
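To make the gray-box and resampling ideas above concrete, the following sketch estimates a single physically meaningful parameter, an overall heat loss coefficient UA, from synthetic steady-state data and then uses a bootstrap (one common resampling technique, chosen here for illustration) to gauge the uncertainty of the estimate. The model form Q = UA·ΔT, the data, the noise level, and all names are hypothetical assumptions and do not come from the source.

```python
import random


def fit_ua(delta_t, load):
    """Least-squares slope of load = UA * delta_t (line through the origin)."""
    sxy = sum(x * y for x, y in zip(delta_t, load))
    sxx = sum(x * x for x in delta_t)
    return sxy / sxx


def bootstrap_ua(delta_t, load, n_boot=2000, seed=0):
    """95% percentile interval for UA from resampling (delta_t, load) pairs."""
    rng = random.Random(seed)
    n = len(delta_t)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        estimates.append(fit_ua([delta_t[i] for i in idx],
                                [load[i] for i in idx]))
    estimates.sort()
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot) - 1]


# Synthetic "measurements": true UA = 0.25 kW/K plus Gaussian sensor noise
rng = random.Random(42)
TRUE_UA = 0.25
delta_t = [rng.uniform(5.0, 30.0) for _ in range(40)]            # K
load = [TRUE_UA * dt + rng.gauss(0.0, 0.3) for dt in delta_t]    # kW

ua_hat = fit_ua(delta_t, load)
ci_lo, ci_hi = bootstrap_ua(delta_t, load)
print(f"UA estimate = {ua_hat:.3f} kW/K, "
      f"95% bootstrap interval = [{ci_lo:.3f}, {ci_hi:.3f}]")
```

Because the model structure is imposed from physical reasoning and only its parameter is learned from data, the estimate has a direct physical interpretation, which a purely black-box curve fit would not offer.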
The gray-box approach requires context-specific approximate numerical or analytical solutions for linear and nonlinear problems and often involves model selection and parameter estimation as well. The ill-conditioning, i.e., the solution is extremely sensitive to the data (see Sect. 9.2), is often due to the repetitive nature of the data collected while the system is under normal operation. There is a rich and diverse body of knowledge on inverse methods applied to physical systems, and numerous textbooks, monographs, and research papers are available on this subject. Chapter 10 addresses these problems at more length. In summary, different models and parameter estimation techniques need to be adopted depending on whether: (i) the intent is to subsequently predict system behavior within the temporal and/or spatial range of input variables—in such cases, simple and well-known methods such as curve fitting
Fig. 1.16 Example of a parameter estimation problem where the model parameters of a presumed function of pressure drop versus volume flow rate are identified from discrete experimental data points
may suffice (Fig. 1.16); (ii) the intent is to subsequently understand/predict/control system behavior outside the temporal and/or spatial range of input variables—in such cases, physically based models are generally more appropriate. Example 1.5.2 Dose–response models. An example of how model-driven methods differ from a straightforward curve fit is given below (the same example is treated in more depth in Sect. 10.2.6). Consider the case of models of risk to humans when exposed to toxins (or biological poisons), which are extremely deadly even in small doses. Dose is the total mass of toxin that the human body ingests. Response is the measurable physiological change in the body produced by the toxin, which can have many manifestations; here, the focus is on human cells becoming cancerous. There are several aspects to this problem relevant to inverse modeling: (i) Can the observed data of dose versus response provide some insights into the process that induces cancer in biological cells? (ii) How valid are these results extrapolated down to low doses? (iii) Since laboratory tests are performed on animal subjects, how valid are these results when extrapolated to humans? The manner one chooses to extrapolate the dose–response curve downward is dependent on either the assumption one makes regarding the basic process itself or how one chooses to err (which has policy-making implications). For example, erring too conservatively in terms of risk would overstate the risk and prompt implementation of more precautionary measures, which some critics would fault as unjustified and improper use of limited resources. There are no simple answers to these queries (until the basic process itself is completely understood). There is yet another major issue. Since different humans (and test animals) react differently to the same dose, the response is often interpreted as a probability of cancer being induced, which can be framed as a risk. Probability is
bound to play an important role given the nature of the process; hence the adoption by various agencies (such as the U.S. Environmental Protection Agency) of probabilistic methods for risk assessment and modeling. Figure 1.17 illustrates three methods of extrapolating dose–response curves down to low doses (Heinsohn and Cimbala 2003). The dots represent observed laboratory tests performed at high doses. Three types of models are fit to the data. They all agree at high doses; however, they deviate substantially at low doses because the models are functionally different. While model I is a nonlinear model applicable to highly toxic agents, curve II is generally taken to apply to contaminants that are quite harmless at low doses (i.e., the body is able to metabolize the toxin at low doses). Curve III is an intermediate one between the other two curves. The above models are somewhat empirical (or black-box) and provide little understanding of the basic process itself. Models based on simplified but phenomenological considerations of how biological cells become cancerous have also been developed and these are described in Sect. 10.2.6.

Fig. 1.17 Three different inverse models depending on toxin type for extrapolating dose–response observations at high doses to the response at low doses. (From Heinsohn and Cimbala 2003, by permission of CRC Press)

1.5.4 Calibrated Simulation

The calibrated simulation approach can be viewed as a hybrid of the forward and inverse methods (refer to Fig. 1.13). Here one uses a mechanistic model originally developed for the purpose of system simulation, and modifies or “tunes” the numerous model parameters so that model predictions match observed system behavior as closely as possible. Often, only a subset or limited number of measurements of system states and forcing function values are available, resulting in a highly over-parameterized problem with more than one possible solution. Following Eq. 1.1, such inverse problems can be framed as:

$$ \text{given } \{Y'', U'', W'', d''\}, \text{ determine } \{A'', B'', C''\} \tag{1.20} $$

where the ″ notation is used to represent limited measurements or a reduced parameter set. Example 1.5.1 is a simple simulation or forward model with explicit algebraic equations for each component with no feedback loops. Detailed simulation programs are much more complex (with hundreds of variables, complex interactions and boundary conditions, etc.) involving ODEs or PDEs; one example is computational fluid dynamic (CFD) models for indoor air quality studies. Calibrating such models is extremely difficult given the lack of proper instrumentation to measure detailed spatial and temporal fields, and the inability to conveniently compartmentalize the problem so that inputs and outputs of sub-blocks could be framed and calibrated individually as done in the cooling plant example above. Thus, in view of such limitations in the data, developing a simpler system model consistent with the data available while retaining the underlying mechanistic considerations as far as possible is a more appealing approach, albeit a challenging one. Such an approach is called the “gray-box” approach, involving inverse models (see Fig. 1.13). An in-depth discussion of calibrated simulation approaches is provided in Sect. 10.6.
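As a much-simplified illustration of tuning a mechanistic parameter against measurements, the sketch below calibrates the evaporator conductance of the cooling plant example from noisy data, using the structure of Eq. 1.16. It is only an assumed single-parameter case; real calibrated simulation involves many parameters and the over-parameterization issues noted above. The "measured" data, the degraded conductance of 27 kW/K, the noise level, and the function names are all hypothetical.

```python
import math
import random

CP = 4.186          # specific heat of water, kJ/(kg K)
M_E = 6.8           # evaporator water flow rate, kg/s (Example 1.5.1)
UA_DESIGN = 30.6    # design conductance, kW/K (Example 1.5.1)


def effectiveness(ua, m=M_E):
    """Bracketed term of Eq. 1.16: 1 - exp(-UA / (m cp))."""
    return 1.0 - math.exp(-ua / (m * CP))


def calibrate_ua(t_water_in, t_evap, q_meas, m=M_E):
    """Least-squares estimate of UA from measured temperatures and loads."""
    x = [m * CP * (ta - te) for ta, te in zip(t_water_in, t_evap)]
    # Eq. 1.16 is linear in the effectiveness: q = eps * x (line through origin)
    eps = sum(xi * qi for xi, qi in zip(x, q_meas)) / sum(xi * xi for xi in x)
    # Invert the effectiveness relation to recover the conductance
    return -m * CP * math.log(1.0 - eps)


# Hypothetical field data: a degraded evaporator observed with sensor noise
rng = random.Random(7)
UA_TRUE = 27.0
t_water_in = [rng.uniform(8.0, 14.0) for _ in range(30)]   # deg C
t_evap = [rng.uniform(1.0, 5.0) for _ in range(30)]        # deg C
q_meas = [M_E * CP * (ta - te) * effectiveness(UA_TRUE) + rng.gauss(0.0, 1.5)
          for ta, te in zip(t_water_in, t_evap)]           # kW

ua_hat = calibrate_ua(t_water_in, t_evap, q_meas)
print(f"calibrated UA = {ua_hat:.2f} kW/K (design value {UA_DESIGN} kW/K)")
```

The calibrated value can then replace the design value in the forward model, which is the sense in which a simulation model is "tuned" to observed behavior.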
1.6 Data Analytic Approaches
Several authors, for example Sprent (1998), use terms such as (i) data-driven models to imply those that are suggested by the data at hand and commensurate with knowledge about system behavior (this is somewhat akin to our definition of black-box models discussed above), and (ii) model-driven approaches to imply those that assume a pre-specified model whose parameters are determined from the data; this is synonymous with gray-box inverse methods. Data-driven or statistical methods have been traditionally separated into black-box and gray-box approaches. An alternate view (Fig. 1.18), reflective of current thinking, is to distinguish between traditional stochastic methods and data analytic methods (Breiman 2001 goes so far as to refer to these as different cultures). The traditional methods consisting of parametric, non-parametric, and resampling methods (which have been introduced earlier) fall under “stochastic data modeling” and were meant primarily to understand and predict system behavior. In the last few decades there has been an explosion of other analysis methods, loosely called data analytic methods, which are less based on statistics and probability and more on data exploration and computer-based learning algorithms. The superiority of such algorithmic⁷ methods is manifest under situations when patterns in the data are too complex for humans to grasp or learn. The algorithmic approach is better suited for learning or knowledge discovery or gaining insight into important relationships/correlations (i.e., finding patterns in data) while providing superior predictive capability and control of nonlinear systems under complex situations. Several statisticians (e.g., Breiman 2001) argue the need to emphasize this approach given its diversity in application and its ability to make better use of large data sets in the real world. This book deals largely with the traditional stochastic modeling methods, with only Chap. 11 devoted to data analytic methods.

Fig. 1.18 An overview of different approaches under the broad classification of “traditional stochastic” methods and “data analytics” methods. Labels shown: Statistical Analysis; Traditional Stochastic Methods (Classical: Parametric, Non-Parametric; Resampling); Data Analytics Methods (Data Mining, Machine Learning, Big Data)

1.6.1 Data Mining or Knowledge Discovery

Data mining (DM) is defined as the science of extracting useful information from large/enormous data sets; that is why it is also referred to as knowledge discovery. The associated suite of approaches was developed in fields outside statistics. Though DM is based on a range of techniques, from the very simple to the sophisticated (involving such methods as clustering, classification, anomaly detection, etc.), it has the distinguishing feature that it is concerned with sifting through large/enormous amounts of data with no clear aim in mind except to discern hidden information; discover patterns, associations, and trends; or summarize data behavior (Dunham 2003). Thus, its distinctiveness lies not only in the data management problems associated with storing and retrieving large amounts of data from perhaps multiple data sets, but also in its being much more exploratory and less formalized in nature than statistics and model building, where one analyzes a relatively small data set with some specific objective in mind. Data mining has borrowed concepts from several fields such as multivariate statistics and Bayesian theory, as well as less formalized ones such as machine learning, artificial intelligence, pattern recognition, and data management, so as to bound its own area of study and define the specific elements and tools involved. It is the result of the digital age where enormous digital databases abound, from the mundane (supermarket transactions, credit card records, telephone calls, Internet postings, etc.) to the very scientific (astronomical data, medical images, etc.). Thus, the purview of data mining is to explore such databases in order to find patterns or characteristics (called data discovery), or even to respond to some very general research question not provided by any previous mechanistic understanding of the social or engineering system, so that some action can be taken resulting in a benefit or value to the owner. Data mining techniques are briefly discussed in Chap. 11.

⁷ The terminology follows Breiman (2001), who argues that algorithmic modeling rather than traditional stochastic approaches is much better suited to tackling modern-day problems, which involve complex systems and decision-making behavior based on numerous factors, variables, and large sets of informational data.
1.6.2 Machine Learning or Algorithmic Models

Machine learning (ML) or predictive algorithmic modeling is the field of study that develops algorithms that computers follow in order to identify and extract patterns from data. Its primary purpose is to develop prediction models, with accuracy being the primary aim and understanding or explaining the data trends of secondary concern (Kelleher and Tierney 2018). It has also been defined as a field of study that gives computers the ability to learn without being explicitly programmed. Thus, ML models and algorithms learn to map input to output in an iterative manner with the internal structure changing continuously. It is especially useful when the patterns in the data are too complex for traditional statistical analysis methods to handle. The learning of ML algorithms such as neural network models improves with additional data and is adaptive to changes in environmental conditions. ML is one of the important fields in computer science. Chapter 11 presents some of the important ML algorithms such as neural networks.

Fig. 1.19 Examples of some infrastructure-related technological applications of big data from the perspective of different stakeholders (SMI smart metering infrastructure). Labels shown: Examples of Technological Applications of Big Data; Building Operations: learning energy consumption in residences (DM using ANN ensembles and adaptive); Electric Utilities: datamine SMI (cluster customers for rate plan modification)
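The phrase "learn to map input to output in an iterative manner" can be illustrated with a deliberately tiny sketch: a one-input linear model whose two weights are nudged after every new sample by a least-mean-squares (stochastic gradient) update. This is a stand-in chosen for brevity, not one of the algorithms presented in Chap. 11; the data stream, learning rate, and names are hypothetical.

```python
import random


def lms_fit(stream, lr=0.1):
    """Update weight w and bias b after each (x, y) sample in the stream."""
    w, b = 0.0, 0.0
    sq_errors = []
    for x, y in stream:
        err = (w * x + b) - y     # prediction error before the update
        sq_errors.append(err * err)
        w -= lr * err * x         # gradient step on the squared error
        b -= lr * err
    return w, b, sq_errors


# Hypothetical data stream generated by y = 2x + 1 with small noise
rng = random.Random(3)
TRUE_W, TRUE_B = 2.0, 1.0
stream = []
for _ in range(5000):
    x = rng.uniform(0.0, 1.0)
    stream.append((x, TRUE_W * x + TRUE_B + rng.gauss(0.0, 0.05)))

w, b, sq_errors = lms_fit(stream)
early_mse = sum(sq_errors[:100]) / 100
late_mse = sum(sq_errors[-100:]) / 100
print(f"w = {w:.2f}, b = {b:.2f}, early MSE = {early_mse:.4f}, "
      f"late MSE = {late_mse:.4f}")
```

The prediction error shrinks as more samples arrive, and because the weights keep being updated, the same loop would also track a slow drift in the underlying relationship.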
1.6.3 Introduction to Big Data

One of the major influences on modern-day society is the discipline called big data or data science. It involves harnessing information in novel ways and extracting a certain level of quantification in terms of trends and behavior from large data sets. While the knowledge may not enhance fundamental understanding or insight or wisdom,⁸ it produces useful actionable insights on goods and services of significant value to businesses and society. “It is meant to inform rather than explain” (Mayer-Schonberger and Cukier 2013, p. 4); and in that sense DM and ML algorithms are at the core of its suite of analysis tools. The basic trait that distinguishes it from statistical learning methods is that it is based on processing huge amounts of heterogeneous multi-source data (sensors, videos, Internet searches, social media, survey, government, etc.), which is characterized by variety, large noise, huge volume, velocity (real-time streaming), data fusion from multiple disparate sources, and sensor fusion from multiple sensors. The thinking is that the size of the data set can compensate for the use of simple models and noisier non-curated data. However, there are some major concerns:
• Can obscure long-term view/behavior of phenomena
• Increases danger of false learning/beliefs and overconfidence
• Major issue with privacy and ethics

A whole new set of tools, procedures, and software has been developed for datafication, which is the process of transforming raw data such as Internet searches, Internet purchasing, texts, visuals, social media, and information from phones, cars, etc. into a quantifiable format so that it can be tabulated, stored, and analyzed. The value of the data shifts from its primary use to its potential future use. The new profession of the data scientist combines numerous traditional skills: those of the statistician (data mining), the computer scientist (machine learning), the software programmer, the machine learning expert, the informatics specialist, the planner, etc. Numerous textbooks have appeared recently on data science, for example, Kelleher and Tierney (2018). There is a debate on whether big data processing capabilities, and the trends and behavior extracted from them, diminish the value of the domain expert in applications that traditionally fell under the purview of statistical and engineering analysis. Nevertheless, big data offers great promise to reshape entire business sectors and to address several societal and global problems. Some examples from a technological perspective as applied to infrastructure applications are listed below and shown in Fig. 1.19:
– Improve current sensing and data analysis capabilities in individual buildings.
– Analyze the behavior of a large number of buildings/cities in terms of energy efficiency, and integrate higher renewable energy penetration with the smart grid.
– Analyze the benefits and limitations of engineered infrastructures and routine operational practices on society and different entities.
– Identify gaps in current societal needs, use this knowledge to develop strategies to enhance operational efficiency and decarbonization, track the implementation of these strategies, and assess the benefit once implemented.
– Provide and enhance routine services as well as emergency response measures.

Fig. 1.19 (continued), labels shown: Smart Grid Operations; City Level; Routine City Operations; Extreme Events Mgmt; Development Planning for Aspirational Goals (carbon neutrality, sustainable cities)

⁸ “Where is the wisdom we have lost in knowledge? Where is the knowledge we have lost in information?” T.S. Eliot (1934).
1.7 Data Analysis

1.7.1 Introduction
Data analysis is not performed just for its own sake; its usefulness lies in the support it provides to such objectives as gaining insight about system behavior, characterizing current system performance against a baseline, deciding whether retrofits and suggested operational changes to the system are warranted or not, quantifying the uncertainty in predicting future behavior of the present system, suggesting robust, cost-effective, and risk-averse ways to operate an existing system, avoiding catastrophic system failure, etc. Analysis is often a precursor to decision-making in the real world.

There is another discipline with aims that overlap and complement those of data analysis and modeling. Risk analysis and decision-making provide both an overall paradigm and a set of tools with which decision makers can construct and analyze a model of a decision situation (Clemen and Reilly 2001). Even though it does not give specific answers to problems faced by a person, decision analysis provides structure, guidance, and analytical tools on how to logically and systematically tackle a problem, model uncertainty in different ways, and hopefully arrive at rational decisions in tune with the personal preferences of the individual who has to live with the choice(s) made. While it is applicable to problems without uncertainty but with multiple outcomes, its strength lies in being able to analyze complex multiple-outcome problems that are inherently uncertain or stochastic, compounded with the utility functions or risk preferences of the decision maker.

There are different sources of uncertainty in a decision-making process, but the one pertinent to data modeling and analysis in the context of this book is that associated with fairly well-behaved and well-understood engineering systems with relatively low uncertainty in their performance data. This is the reason why, historically, engineering students were not subjected to a class in decision analysis.
However, many engineering systems are operated wherein the attitudes and behavior of people operating these systems assume importance; in such cases, there is a need to adapt
many of the decision analysis tools and concepts with traditional data analysis and modeling techniques. This issue is addressed in Chap. 12.
1.7.2 Basic Stages
In view of the diversity of fields to which data analysis is applied, an all-encompassing definition would have to be general. One good definition is: “an evaluation of collected observations so as to extract information useful for a specific purpose.” The evaluation relies on different mathematical and statistical tools depending on the intent of the investigation. In the area of science, the systematic organization of observational data, such as the orbital movement of the planets, provided a means for Newton to develop his laws of motion. Observational data from deep space allow scientists to develop/refine/verify theories and hypotheses about the structure, relationships, origins, and presence of certain phenomena (such as black holes) in the cosmos. At the other end of the spectrum, data analysis can also be viewed as simply: “the process of systematically applying statistical and logical techniques to describe, summarize, and compare data.” From the perspective of an engineer/scientist, data analysis is a process that when applied to system performance data, collected either intrusively or non-intrusively, allows certain conclusions about the state of the system to be drawn, and, thereby, to initiate follow-up actions. Studying a problem through the use of statistical data analysis usually involves four basic steps (Arsham 2008): (a) Defining the Problem The context of the problem and the exact definition of the problem being studied need to be framed. This allows one to design both the data collection system and the subsequent analysis procedures to be followed. (b) Collecting the Data In the past (say, 70 years back), collecting the data was the most difficult part, and was often the bottleneck of data analysis. Nowadays, one is overwhelmed by the large amounts of data resulting from the great strides in sensor and data collection technology; and data cleaning, handling, and summarizing have become major issues. 
Paradoxically, the design of data collection systems has been marginalized by an apparent belief that extensive computation can make up for any deficiencies in the design of data collection. Gathering data without a clear definition of the problem often results in failure or limited success. Data can be
collected from existing sources or obtained through observation and experimental studies designed to obtain new data. In an experimental study, the variable of interest is identified. Then, one or more factors in the study are controlled so that data can be obtained about how the factors influence the variables. In observational studies, no attempt is made to control or influence the variables of interest either intentionally or due to the inability to do so (two examples are surveys and astronomical data). (c) Analyzing the Data There are various statistical and analysis approaches and tools that one can bring to bear depending on the type and complexity of the problem and the type, quality, and completeness of the data available. Several categories of problems encountered in data analysis are shown in Figs. 1.13 and 1.18. Probability is an important aspect of data analysis since it provides a mechanism for measuring, expressing, and analyzing the uncertainties associated with collected data and mathematical models used. This, in turn, impacts the confidence in our analysis results: uncertainty in future system performance predictions, confidence level in our confirmatory conclusions, uncertainty in the validity of the action proposed, etc. The majority of the topics addressed in this book pertain to this category. (d) Reporting the Results The final step in any data analysis effort involves preparing a report. This is the written document that logically describes all the pertinent stages of the work, presents the data collected, discusses the analysis results, states the conclusions reached, and recommends further action specific to the issues of the problem identified at the onset. The final report and any technical papers resulting from it are the only documents that survive over time and are invaluable to other professionals. Unfortunately, the task of reporting is often cursory and not given its due importance. 
The term “intelligent” data analysis has been used, which has a different connotation from traditional ones (Berthold and Hand 2003). This term is used not in the sense that it involves increasing knowledge/intelligence of the user or analyst in applying traditional tools, but that the statistical tools themselves have some measure of intelligence built into them. A simple example is when a regression model has to be identified from data. Software packages and programmable platforms are available, which facilitate hundreds of built-in functions to be evaluated, and a prioritized list of models identified. The recent evolution of computer-intensive methods (such as bootstrapping and Monte Carlo methods) along with soft computing algorithms (such as artificial neural networks, genetic algorithms, etc.) enhances the
computational power and capability of traditional statistics, model estimation, and data analysis methods. Such added capabilities of modern-day computers and the sophisticated manner in which the software programs are written allow “intelligent” data analysis to be performed.
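The simple example mentioned above, in which software evaluates many candidate functions and returns a prioritized list of regression models, can be sketched as follows. Only four candidate forms are tried here, each fitted by ordinary least squares after a linearizing transformation and then ranked by root-mean-square error in the original units. The candidate set, the synthetic data, and all names are illustrative assumptions rather than the behavior of any particular software package.

```python
import math
import random


def ols(u, v):
    """Intercept and slope of the least-squares line v = a + b*u."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    b = (sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
         / sum((ui - mu) ** 2 for ui in u))
    return mv - b * mu, b


# name: (transform of x, transform of y, prediction in original units)
CANDIDATES = {
    "linear": (lambda x: x, lambda y: y,
               lambda a, b, x: a + b * x),
    "logarithmic": (math.log, lambda y: y,
                    lambda a, b, x: a + b * math.log(x)),
    "exponential": (lambda x: x, math.log,
                    lambda a, b, x: math.exp(a + b * x)),
    "power": (math.log, math.log,
              lambda a, b, x: math.exp(a + b * math.log(x))),
}


def rank_models(x, y):
    """Return (rmse, name) pairs sorted from best to worst fit."""
    results = []
    for name, (fx, fy, predict) in CANDIDATES.items():
        a, b = ols([fx(xi) for xi in x], [fy(yi) for yi in y])
        sse = sum((predict(a, b, xi) - yi) ** 2 for xi, yi in zip(x, y))
        results.append((math.sqrt(sse / len(x)), name))
    return sorted(results)


# Hypothetical data following a power law with 1% multiplicative noise
rng = random.Random(11)
x = [1.0 + 0.5 * i for i in range(19)]
y = [3.0 * xi ** 1.5 * (1.0 + rng.gauss(0.0, 0.01)) for xi in x]

ranking = rank_models(x, y)
for rmse, name in ranking:
    print(f"{name:12s} RMSE = {rmse:.3f}")
```

Ranking by goodness of fit alone is a design simplification; a fuller treatment would also penalize model complexity and examine residuals, topics taken up in the regression chapters.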
1.7.3 Example of a Data Collection and Analysis System
Data can be separated into experimental or observational depending on whether the system operation can be modified by the observer or not. Consider a system where the initial phase of designing and installing the monitoring system is complete. Figure 1.20 is a flowchart depicting various stages in the collection, analysis, and interpretation of data collected from an engineering thermal9 system while in operation. The various elements involved are: (a) A measurement system consisting of various sensors of pre-specified types and accuracy. The proper location, commissioning, and maintenance of these sensors are important aspects of this element. (b) The data sampling element whereby the output of the various sensors is read at a pre-determined frequency. The low cost of automated data collection has led to increasingly higher sampling rates. Typical frequencies for thermal systems are in the range of 1 s–1 min. (c) The cleaning of raw data for spikes, gross errors, mis-recordings, and missing or dead channels; average (or sum) the data samples and, if necessary, store them in a dynamic fashion (i.e., online) in a central electronic database with an electronic time stamp. (d) The averaging of raw data and storing in a database; typical periods are in the range of 1–30 min. One can also include some finer checks for data quality by flagging data when they exceed physically stipulated ranges. This process need not be done online but could be initiated automatically and periodically, say, every day. It is this data set that is queried as necessary for subsequent analysis. (e) The above steps in the data collection process are performed on a routine basis. This data can be used to advantage provided one can frame the issues relevant to the client and determine which of these can be satisfied. 
Examples of such routine uses are assessing overall time-averaged system efficiencies and preparing weekly performance reports, as well as subtler actions such as supervisory control and automated fault detection.
9 Electrical systems have different considerations since they mostly use very high-frequency sampling rates.
[Fig. 1.20 flowchart boxes, in order: Measurement System Design; System Monitoring; Data Sampling (1 s–1 min); Clean (and Store Raw Data); Average and Store Data (1–30 min); Define Issue to be Analyzed; Extract Data Sub-set for Intended Analyses; Perform Engineering Analysis; Perform Decision Analysis; Present Decision to Client; End. Feedback branches: Perform additional analyses; Redesign and take additional measurements.]
Fig. 1.20 Flowchart depicting various stages in data analysis and decision-making as applied to continuous monitoring of thermal systems
Table 1.3 Analysis methods covered in this book
Chapter   Topic
1         Introduction to mathematical models and description of different classes of analysis approaches
2         Probability and statistics, important probability distributions, Bayesian statistics
3         Data collection, exploratory data analysis, measurement uncertainty, and propagation of errors
4         Inferential statistics, non-parametric tests, Bayesian, sampling and resampling methods
5         Linear ordinary least squares (OLS) regression, residual analysis, point and interval estimation, resampling methods
6         Design of experiments (factorial and block, response surface, context of computer simulations)
7         Traditional optimization methods; linear, nonlinear, and dynamic programming
8         Time series analysis, trend and seasonal models, stochastic methods (ARIMA), quality control
9         Advanced regression (parametric, non-parametric, collinearity, nonlinear, and neural networks)
10        Parameter estimation of static and dynamic gray-box models, calibration of simulation models
11        Data analytics, unsupervised learning (clustering), supervised (classification)
12        Decision-making, risk analysis, and sustainability assessment methods
(f) Occasionally, the owner would like to evaluate major changes such as equipment change out or addition of new equipment, or would like to improve overall system performance or reliability without knowing exactly how to achieve this. Alternatively, one may wish to evaluate system performance under an exceptionally hot spell of several days. This is when specialized consultants are brought in to make recommendations to the owner. Historically, such analysis was done based on the professional expertise of the consultant with minimal or no measurements of the actual system. However, both the financial institutions that would lend the money for implementing these changes and the upper management of the company owning the system are insisting on a more transparent engineering analysis based on actual data. Hence, the preliminary steps involving relevant data extraction and more careful data proofing and validation are essential.
(g) Extracted data are then subjected to certain engineering analyses that can be collectively referred to as applied data modeling and analysis. These involve statistical inference, identifying patterns in the data, regression analysis, parameter estimation, performance extrapolation, classification or clustering, deterministic modeling, etc.
(h) Performing a decision analysis, in our context, involves using the results of the engineering analyses and adding a further layer of analysis that includes modeling uncertainties (involving, among other issues, a sensitivity analysis), modeling stakeholder preferences, and structuring decisions. Several iterations may be necessary between this element and the ones involving engineering analysis and data extraction.
(i) The presentation of the various choices suggested by the decision analysis to the owner or decision maker so that a final course of action may be determined. Sometimes, it may be necessary to perform additional analyses or even modify or enhance the capabilities of the measurement system in order to satisfy client needs.
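The cleaning and averaging elements, (c) and (d) above, can be illustrated with a minimal Python sketch. It assumes a hypothetical list of (time in seconds, reading) pairs from a single sensor, with None marking a missing sample. The function name clean_and_average, the range limits, the spike tolerance, and the use of the overall median as the spike reference are all illustrative assumptions, not a procedure prescribed in this book.

```python
from statistics import mean, median


def clean_and_average(samples, lo, hi, spike_tol, period_s):
    """Clean raw samples and average them over fixed periods.

    samples   : list of (time_s, value) pairs; value is None if missing
    lo, hi    : physically plausible range (assumed limits)
    spike_tol : maximum allowed deviation from the median (assumed rule)
    period_s  : averaging period in seconds
    Returns a dict {period_start_s: mean_value}.
    """
    # Drop missing samples and gross errors outside the physical range
    valid = [(t, v) for t, v in samples if v is not None and lo <= v <= hi]
    if not valid:
        return {}
    # Crude spike filter: discard values far from the overall median
    med = median(v for _, v in valid)
    kept = [(t, v) for t, v in valid if abs(v - med) <= spike_tol]
    # Group the surviving samples into averaging periods
    bins = {}
    for t, v in kept:
        start = int(t // period_s) * period_s
        bins.setdefault(start, []).append(v)
    return {start: mean(vals) for start, vals in sorted(bins.items())}


# Hypothetical 30-s temperature readings (deg C) with one missing sample,
# one spike (95.0), and one out-of-range gross error (500.0)
raw = [(0, 20.0), (30, 20.2), (60, None), (90, 95.0), (120, 20.4), (150, 500.0)]
averaged = clean_and_average(raw, lo=0.0, hi=100.0, spike_tol=5.0, period_s=60)
print(averaged)
```

In this run the period starting at 60 s has no surviving samples and is therefore absent from the result, which is one simple way of exposing a gap rather than hiding it. A production system would more likely use a moving-window spike test and keep explicit quality flags.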
1.8
Topics Covered in Book
The overall structure of the book is depicted in Table 1.3 along with a simple suggestion as to how this book could be used for two courses if necessary. This chapter has provided a general introduction to mathematical models and discussed the different types of problems and analysis tools available for data-driven modeling and analysis. Chapter 2 reviews basic probability concepts (both classical and Bayesian) and covers various important probability distributions with emphasis on their practical usefulness. Chapter 3 covers data collection and proofing, along with exploratory data analysis and descriptive statistics. The latter entails performing “numerical detective work” on the data and developing methods for screening, organizing, summarizing, and detecting basic trends in the data (such as graphs and tables), which would help in information gathering and knowledge generation. Historically, formal statisticians have shied away from exploratory data analysis, considering it to be either too simple to warrant serious discussion or too ad hoc in nature to be able to expound logical steps (McNeil 1977). This area had to await the pioneering work by John Tukey and others to obtain a formal structure. A brief overview is provided in this book, and the interested reader can refer to Hoaglin et al. (1983) or Tukey (1988) for an excellent perspective. The concepts of measurement uncertainty and propagation of errors are also addressed, and relevant equations provided. Chapter 4 covers statistical inference involving hypothesis testing of single-sample and multi-sample
parametric tests (such as analysis of variance [ANOVA]) involving univariate and multivariate samples. Inferential problems are those that involve making uncertainty inferences or calculating confidence intervals of population estimates from selected samples. These methods are the backbone of classical statistics from which other approaches evolved (covered in Chaps. 5, 6, and 9). Non-parametric tests and Bayesian inference methods have also proven to be useful in certain cases, and these approaches are also covered. The advent of computers has led to the very popular sampling and resampling methods that reuse the available sample multiple times resulting in more intuitive, versatile, and robust point and interval estimation. Chapter 5 deals with inferential statistics applicable to linear regression situations, an application that is perhaps the most prevalent. In essence, the regression problem involves (i) taking measurements of the various parameters (or regressor variables) and of the output (or response variables) of a device/system or a phenomenon, (ii) identifying a causal quantitative correlation between them by regression, (iii) estimating the model coefficients/ parameters, and (iv) using it to make predictions about system behavior under future operating conditions. When a regression model is identified from data, the data cannot be considered to include the entire “population” data, i.e., all the observations one could possibly conceive. Hence, model parameters and model predictions suffer from uncertainty, which falls under the purview of inferential statistics. There is a rich literature in this area called “model building” with great diversity of techniques and level of sophistication. Traditional regression methods using ordinary least squares (OLS) for univariate and multivariate linear problems along with advanced parametric and non-parametric methods are covered. 
Residual analysis and the detection of leverage and influential points are also discussed along with simple remedial measures one could take if the residual behavior is improper. Resampling methods applied in a regression context are also presented. Chapter 6 covers experimental design methods and discusses factorial and response surface methods that allow extending hypothesis testing to multiple variables as well as identifying sound performance models. “Design of experiments” (DOE) is the process of prescribing the exact manner in which samples for testing need to be selected, and the conditions and sequence under which the testing needs to be performed such that the relationship or model between a response variable and a set of regressor variables can be identified in a robust and accurate manner. The extension of traditional DOE approaches to computer simulation-based design of energy-efficient buildings involving numerous design variables is also discussed. The material from all these five chapters (Chaps. 2, 3, 4, 5 and 6) is generally covered in an undergraduate statistics
and probability course, and can be used for that purpose. It can also be used as review or refresher material (especially useful to the general practitioner) for a second course meant to cover more advanced concepts and statistical techniques and to better prepare graduate students and energy researchers. Chapter 7 reviews various traditional optimization methods separated into analytical (such as the Lagrange multiplier method) and univariate and multivariate numerical methods. Linear programming problems (network models and mixed integer problems) as well as nonlinear programming problems are covered. Optimization is at the heart of numerous data analysis situations (including regression model building) and is essential to current societal problems such as the design and future planning of energy-efficient and resilient infrastructure systems. Time series analysis methods, treated in Chap. 8, are a set of tools that include traditional model building techniques as well as those that capture the sequential behavior of the data and its noise. They involve the analysis, interpretation, and manipulation of time series signals in either the time domain or the frequency domain. Several methods to smooth time series data in the time domain are presented. Forecasting models based on OLS modeling and the more sophisticated class of stochastic models (such as autoregressive and moving average methods) suitable for linear dynamic models are discussed. An overview is also provided of control chart techniques extensively used for process control and condition monitoring. Chapter 9 deals with subtler and more advanced topics related to parametric and non-parametric regression analysis. The dangers of collinearity among regressors during multivariate regression are addressed, and ways to minimize such effects (such as principal component analysis, ridge regression, and shrinkage methods) are discussed.
An overarching class of models, namely Generalized Linear Models (GLM), is introduced that combines in a unified framework both the strictly linear models and nonlinear models that can be transformed into linear ones. Next, parameter estimation of intrinsically nonlinear models as well as non-parametric estimation methods involving smoothing and regression splines and the use of kernel functions are covered. Finally, the multi-layer perceptron (MLP) neural network model is discussed. Inverse modeling is an approach to data analysis that combines the basic physics of the process with statistical methods so as to achieve a better understanding of the system dynamics, and thereby use it to predict system performance. Chapter 10 presents an overview of the types of problems that fall under inverse estimation methods applied to structural models: (a) static gray-box models involving algebraic equations, (b) dynamic gray-box models involving differential equations, and (c) the calibration of white-box
models involving detailed simulation programs. The concept of information content of collected data and associated quantitative metrics are also introduced. Chapter 11 deals with data analytic methods, which include data mining and machine learning. They are directly concerned with practical applications (discerning hidden information; discovering patterns, associations, and trends; or summarizing data behavior) through data exploration and computer-based learning algorithms. The problems can be broadly divided into two categories: unsupervised learning approaches (such as clustering methods) and supervised learning approaches (such as classification methods). Classification problems are those where one would like to develop a model to statistically distinguish or “discriminate” differences between two or more groups when one knows beforehand that such groupings exist in the data set provided, and to subsequently assign, allocate, or classify a future unclassified observation into a specific group with the smallest probability of error. During clustering, the number of clusters or groups is not known beforehand (thus, a more difficult problem), and the intent is to allocate a set of observation sets into groups that are similar or “close” to one another. Several important subtypes of both these categories are presented and discussed. Decision theory is the study of methods for arriving at “rational” decisions under uncertainty. Chapter 12 covers this issue (including risk analysis), and further provides an introduction to sustainability assessment methods that have assumed central importance in recent years. An overview of quantitative decision-making methods is followed by the suggestion that this area be divided into single and multicriteria methods and further separated into single discipline and multi-discipline applications that usually involve non-consistent attribute scales.
How the decision-making process is actually applied to the selection of the most appropriate course of action for engineered systems is discussed and various relevant concepts introduced. A general introduction of sustainability and the importance of assessments in this area are discussed. The two primary sustainability assessment frameworks, namely the structure-based and the performance-based, are described along with their various subcategories and analysis procedures; illustrative case study examples are also provided.
Problems
Pr. 1.1 Identify which of the following equations are linear functional models, which are linear in their parameters (a, b, c), and which are both:
(a) $y = a + bx + cx^2$
(b) $y = a + \frac{b}{x} + \frac{c}{x^2}$
(c) $y = a + b(x - 1) + c(x - 1)^2$
(d) $y = (a_0 + b_0 x_1 + c_0 x_1^2) + (a_1 + b_1 x_1 + c_1 x_1^2)\, x_2$
(e) $y = a + b \sin(c + x)$
(f) $y = a + b \sin(cx)$
(g) $y = a + b x^c$
(h) $y = a + b x^{1.5}$
(i) $y = a + b e^x$
Pr. 1.2 Consider the equation for pressure drop in a pipe given by Eq. 1.2.
(a) Recast the equation such that it expresses the fluid volume flow rate (rather than velocity) in terms of pressure drop and other quantities.
(b) Draw a block diagram to represent the case when a feedback control is used to control the flow rate from measured pressure drop.
Pr. 1.3 Consider Eq. 1.5, which is a lumped model of a fully mixed hot water storage tank. Assume the initial temperature is $T_{s,initial}$ = 60 °C while the ambient temperature is constant at 20 °C.
(i) Deduce the expression for the time constant of the tank in terms of model parameters.
(ii) Compute its numerical value when $M c_p$ = 9.0 MJ/°C and UA = 0.833 kW/°C.
(iii) What will be the storage tank temperature after 6 h under cool-down (with P = 0)?
(iv) How long will the tank temperature take to drop to 40 °C under cool-down?
(v) Derive the solution for the transient response of the storage tank under electric power input P.
(vi) If P = 50 kW, calculate and plot the response when the tank is initially at 30 °C (akin to Fig. 1.12).
Pr. 1.4 Consider Fig. 1.9 where a heated sphere is being cooled. The analysis simplifies considerably if the sphere can be modeled as a lumped one. This can be done if the Biot number $Bi \equiv h L_e / k < 0.1$. Assume that the external heat transfer coefficient is 10 W/m² °C and that the radius of the sphere is 15 cm. The equivalent length of the sphere is $L_e = \text{Volume} / \text{Surface area}$. Determine whether the lumped model assumption is appropriate for spheres made of the following materials:
(a) Steel with thermal conductivity k = 34 W/m °C
(b) Copper with thermal conductivity k = 340 W/m °C
(c) Wood with thermal conductivity k = 0.15 W/m °C
Fig. 1.21 Steady-state heat flow through a composite wall of surface area A made up of three layers in series. (a) Sketch. (b) Electrical resistance analog (Pr. 1.6). (From Reddy et al. 2016)
Pr. 1.5 The thermal network representation of a homogeneous plane wall is illustrated in Fig. 1.10. Draw the 3R2C network representation and derive expressions for the three resistors and the two capacitors in terms of the two air film coefficients and the wall properties (Hint: follow the approach illustrated in Fig. 1.10 for the 2R1C network).
Pr. 1.6 Consider the composite wall of surface area A as shown in Fig. 1.21 with four interfaces designated by (1, 2, 3, 4). It consists of three different materials (A, B, C) with thermal conductivity k and thickness Δx. The surface temperatures at points A and C are T1 and T4, respectively, while the steady-state heat flow rate is Q.
(a) Write the equation for steady-state heat flow through this wall. This is the forward application. When would one use this equation?
(b) This equation can also be used for inverse modeling applications. Give two practical instances when this applies. Clearly state what the intent is, how the equation has to be restructured, what parameters are to be determined, and what measurements are needed to do so.
Pr. 1.7 A building with a floor area of A = 1000 m² and inside wall height of h = 3 m has an air infiltration (i.e., leakage) rate of 0.4 air changes per hour (Note: one air change is equal to the volume of the room). The outdoor temperature is To = 2 °C and the indoor temperature is Ti = 22 °C.
(a) Write the equation for the rate of heat input Q that must be provided by the building heating system to warm the cold outside air assuming the location to be at sea level.
(b) This equation can also be used for inverse modeling applications. Give a practical instance. Clearly state what the intent is, how the equation has to be restructured, what parameters are to be determined, and what measurements are needed to do so.
Pr. 1.8 Consider a situation where a water pump is to deliver a quantity of water F = 500 L/s from a well of depth d = 60 m to the top of a building of height h = 30 m. The friction pressure drop through the pipe is Δp = 30 kPa. The pump-motor efficiency is ηp = 60%.
(a) Write the equation for electric power consumption in terms of the various quantities specified. This is the forward situation. Under what situations would this equation be used?
(b) This equation can also be used for inverse modeling applications. Give a practical instance. Clearly state what the intent is, how the equation has to be restructured, what parameters are to be determined, and what measurements are needed to do so.
Pr. 1.9 Two pumps in parallel problem viewed from the forward and the inverse perspectives
Consider Fig. 1.22, which will be analyzed in both the forward and data-driven approaches.
(a) Forward problem (see footnote 10): Two pumps with parallel networks deliver F = 0.01 m³/s of water from a reservoir to the destination. The pressure drops in Pascals (Pa) of each network are given by $\Delta p_1 = (2.1 \times 10^{10})\, F_1^2$ and $\Delta p_2 = (3.6 \times 10^{10})\, F_2^2$, where $F_1$ and $F_2$ are the flow rates through each branch in m³/s. Assume that
pumps and their motor assemblies have the same efficiency. Let P1 and P2 be the electric power in Watts (W) consumed by the two pump-motor assemblies.
(i) Sketch the block diagram for this system with total electric power as the output variable.
(ii) Frame the total power P as the objective function that needs to be minimized against total delivered water F.
(iii) Solve the problem for F1 and P.
(b) Inverse problem: Now consider the same system in the inverse framework where one would instrument the existing system such that operational measurements of P for different F1 and F2 are available.
(i) Frame the function appropriately using insights into the functional form provided by the forward model.
(ii) The simplifying assumption of constant efficiency of the pumps is unrealistic. How would the above function need to be reformulated if efficiency can be taken to be a quadratic polynomial (or black-box model) of flow rate as shown below for the first piping branch (with a similar expression applying for the second branch): $\eta_1 = a_1 + b_1 F_1 + c_1 F_1^2$.
Fig. 1.22 Pumping system with two pumps in parallel (Pr. 1.9). [Figure labels: $\Delta P_1 = (2.1 \times 10^{10})\, F_1^2$; $\Delta P_2 = (3.6 \times 10^{10})\, F_2^2$; F = 0.01 m³/s; branch flows F1 and F2.]
Fig. 1.23 Perspective of the forward problem for the lake contamination situation (Pr. 1.10). [Figure labels: incoming stream Qs = 5.0 m³/s, Cs = 10.0 mg/L; contaminated outfall Qw = 0.5 m³/s, Cw = 100.0 mg/L; lake V = 10.0 × 10⁶ m³, k = 0.20/day, C = ?; outgoing stream Qm = ? m³/s, Cm = ? mg/L.]
10 From Stoecker (1989), by permission of McGraw-Hill.
Pr. 1.10 Lake contamination problem viewed from the forward and the inverse perspectives
A lake of volume V is fed by an incoming stream with volumetric flow rate Qs contaminated with concentration Cs (Fig. 1.23; see footnote 11). The outfall of another source (say, the sewage from a factory) also discharges a flow Qw of the same pollutant with concentration Cw. The wastes in the stream and sewage have a decay coefficient k.
(a) Let us consider the forward model approach. In order to simplify the problem, the lake will be considered to be a fully mixed compartment, and evaporation and seepage losses to the lake bottom will be neglected. In such a case, the concentration of the outflow is equal to that in the lake, i.e., Cm = C. Then, the steady-state concentration in the lake can be determined quite simply: Input rate = Output rate + Decay rate, where Input rate = QsCs + QwCw, Output rate = QmCm = (Qs + Qw)Cm, and Decay rate = kCV (see footnote 12). This results in:
$$C = \frac{Q_s C_s + Q_w C_w}{Q_s + Q_w + kV}$$
11 From Masters and Ela (2008), by permission of Pearson Education.
12 This term is the first-order Taylor series approximation of exp(kCVt) where t = time.
Verify the above-derived expression, and also check that C = 3.5 mg/L when the numerical values for the various quantities given in Fig. 1.23 are used.
(b) Now consider the inverse control problem when an actual situation can be generally represented by the model treated above. One can envision several scenarios; let us consider a simple one. Flora and fauna downstream of the lake have been found to be adversely affected, and an environmental agency would like to investigate this situation by installing appropriate instrumentation. The agency believes that the factory is polluting the lake, which the factory owner, on the other hand, disputes. Since it is rather difficult to get a good reading of spatially averaged concentrations in the lake, the experimental procedure involves measuring the cross-sectionally averaged concentrations and volumetric flow rates of the incoming, outgoing, and outfall streams.
(i) Using the above model, describe the agency’s thought process whereby they would conclude that indeed the factory is the major cause of the pollution.
(ii) Identify arguments that the factory owner can raise to rebut the agency’s findings.
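The numerical check requested in part (a) can be sketched in a few lines of Python. The function name lake_concentration and its argument names are illustrative choices, and the conversion of the decay coefficient from per day to per second is an assumption needed to make its units consistent with flow rates given in m³/s.

```python
def lake_concentration(q_s, c_s, q_w, c_w, volume, k_per_day):
    """Steady-state concentration (mg/L) of a fully mixed lake.

    Implements C = (Qs*Cs + Qw*Cw) / (Qs + Qw + k*V), with flows in m3/s,
    concentrations in mg/L, volume in m3, and decay coefficient in 1/day.
    """
    k_per_s = k_per_day / 86400.0  # convert 1/day to 1/s to match flow units
    return (q_s * c_s + q_w * c_w) / (q_s + q_w + k_per_s * volume)


# Values read from Fig. 1.23
c = lake_concentration(q_s=5.0, c_s=10.0, q_w=0.5, c_w=100.0,
                       volume=10.0e6, k_per_day=0.20)
print(f"C = {c:.2f} mg/L")  # prints "C = 3.49 mg/L", i.e., about 3.5 mg/L
```

With these inputs the decay term kV is about 23.1 m³/s, roughly four times the combined inflow of 5.5 m³/s, so decay rather than flushing dominates the removal of the pollutant in this example.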
References
Arsham, http://home.ubalt.edu/ntsbarsh/stat-data/Topics.htm, downloaded August 2008.
Berthold, M. and D.J. Hand (eds.), 2003. Intelligent Data Analysis, 2nd Edition, Springer, Berlin.
Breiman, L., 2001. Statistical modeling: The two cultures, Statistical Science, vol. 16, no. 3, pp. 199–231.
Cha, P.D., J.J. Rosenberg and C.L. Dym, 2000. Fundamentals of Modeling and Analyzing Engineering Systems, 2nd Ed., Cambridge University Press, Cambridge, UK.
Clemen, R.T. and T. Reilly, 2001. Making Hard Decisions with Decision Tools, Brooks Cole, Duxbury, Pacific Grove, CA.
Doebelin, E.O., 1995. Engineering Experimentation: Planning, Execution and Reporting, McGraw-Hill, New York.
Dunham, M., 2003. Data Mining: Introductory and Advanced Topics, Pearson Education Inc.
Edwards, C.H. and D.E. Penney, 1996. Differential Equations and Boundary Value Problems, Prentice Hall, Englewood Cliffs, NJ.
Eisen, M., 1988. Mathematical Methods and Models in the Biological Sciences, Prentice Hall, Englewood Cliffs, NJ.
Energy Plus, 2009. Energy Plus Building Energy Simulation software, developed by the National Renewable Energy Laboratory (NREL) for the U.S. Department of Energy, under the Building Technologies program, Washington DC, USA. http://www.nrel.gov/buildings/energy_analysis.html#energyplus.
Haimes, Y.Y., 1998. Risk Modeling, Assessment and Management, John Wiley and Sons, New York.
Heinsohn, R.J. and J.M. Cimbala, 2003. Indoor Air Quality Engineering, Marcel Dekker, New York, NY.
Hoaglin, D.C., F. Mosteller and J.W. Tukey, 1983. Understanding Robust and Exploratory Data Analysis, John Wiley and Sons, New York.
Hodges, J.L. and E.L. Lehmann, 1970. Basic Concepts of Probability and Statistics, 2nd Edition, Holden Day.
Kelleher, J.D. and B. Tierney, 2018. Data Science, MIT Press, Cambridge, MA.
Masters, G.M. and W.P. Ela, 2008. Introduction to Environmental Engineering and Science, 3rd Ed., Prentice Hall, Englewood Cliffs, NJ.
Mayer-Schonberger, V. and K. Cukier, 2013. Big Data, John Murray, London, UK.
McNeil, D.R., 1977. Interactive Data Analysis, John Wiley and Sons, New York.
Reddy, T.A., 2006. Literature review on calibration of building energy simulation programs: Uses, problems, procedures, uncertainty and tools, ASHRAE Transactions, 112(1), January.
Reddy, T.A., J.F. Kreider, P. Curtiss and A. Rabl, 2016. Heating and Cooling of Buildings: Principles and Practice of Energy Efficient Design, 3rd Edition, CRC Press, Boca Raton, FL.
Sprent, P., 1998. Data Driven Statistical Methods, Chapman and Hall, London.
Stoecker, W.F., 1989. Design of Thermal Systems, 3rd Edition, McGraw-Hill, New York.
Streed, E.R., J.E. Hill, W.C. Thomas, A.G. Dawson and B.D. Wood, 1979. Results and Analysis of a Round Robin Test Program for Liquid-Heating Flat-Plate Solar Collectors, Solar Energy, 22, p. 235.
Stubberud, A., I. Williams and J. DiStefano, 1994. Outline of Feedback and Control Systems, Schaum Series, McGraw-Hill.
Tukey, J.W., 1988. The Collected Works of John W. Tukey, W. Cleveland (Editor), Wadsworth and Brooks/Cole Advanced Books and Software, Pacific Grove, CA.
2
Probability Concepts and Probability Distributions
Abstract
This chapter reviews basic notions of probability (or stochastic variability), which is the formal study of the laws of chance, i.e., where the ambiguity in outcome is inherent in the nature of the process itself. Probability theory allows idealized behavior of different types of systems to be modeled and provides the mathematical underpinning of statistical inference. In that respect, it can be viewed as pertaining to the forward modeling domain. Both the primary views of probability, namely the frequentist (or classical) and the Bayesian, are covered, and a discussion is provided of the difference between probability and statistics. The basic laws of probability are presented followed by an introductory treatment of set theory nomenclature and algebra. Relevant concepts of random variables, namely density functions, moment generation attributes, and transformation of variables, are also covered. Next, some of the important discrete and continuous probability distributions are presented along with a discussion of their genealogy, their mathematical form, and their application areas. Subsequently, Bayes’ theorem is derived, and how it provides a framework to include prior knowledge in multistage tests is illustrated using examples involving forward and reverse tree diagrams. Finally, the three different kinds of empirical probabilities, namely absolute, relative, and subjective, are described with illustrative examples.
2.1
Introduction
2.1.1
Classical Concept of Probability
Random data by its very nature is indeterminate. So how can a scientific theory attempt to deal with indeterminacy? Probability theory does just that and is based on the fact that
though the result of any particular trial or experiment or event cannot be predicted, a long sequence of performances taken together reveals a stability that can serve as the basis for fairly precise predictions. Consider the case when an experiment was carried out several times and the anticipated event E occurred in some of them. Relative frequency is the ratio denoting the fraction of event E occurring. It is usually estimated empirically after the event as:
$$p(E) = \frac{\text{number of times } E \text{ occurred}}{\text{number of times the experiment was carried out}} \qquad (2.1)$$
For certain simpler events, one can determine this proportion without actually carrying out the experiment; this is referred to as being “wise before the event.” For example, the relative frequency of getting heads (selected as a “success” event) when tossing a fair coin is 0.5. In any case, this a-priori proportion is interpreted as the long-run relative frequency and is referred to as the probability of event E occurring. This is the classical or frequentist or traditionalist definition of probability on which probability theory is founded. This interpretation arises from the strong law of large numbers (a well-known result in probability theory), which states that the average of a sequence of independent random variables having the same distribution will converge to the mean of that distribution. If a six-faced die is rolled repeatedly, the observed fraction of rolls showing a pre-selected number between 1 and 6 (say, 4) will vary from one set of rolls to another, but the long-run average will tend toward 1/6. The classical probability concepts are often described or explained in terms of dice-tossing or coin-flipping or card-playing outcomes since they are intuitive and simple to comprehend, but their applicability is much wider and extends to all sorts of problems as will become evident in this chapter.
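The long-run behavior described above can be explored with a small simulation. The following is only a sketch: it assumes Python's pseudo-random generator is an adequate stand-in for a fair die, and the function name relative_frequency and the seed value are arbitrary illustrative choices.

```python
import random


def relative_frequency(target, n_rolls, rng):
    """Fraction of n_rolls of a fair six-faced die that show `target`."""
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == target)
    return hits / n_rolls


rng = random.Random(42)  # fixed seed so the run is repeatable
for n in (10, 100, 10_000, 100_000):
    # Relative frequency of rolling a 4, as in Eq. 2.1
    print(n, round(relative_frequency(4, n, rng), 4))
```

For small numbers of rolls the estimate can be far from 1/6 (about 0.1667), while for large numbers it should settle close to that value, which is the stability that the frequentist definition relies upon.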
# The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 T. A. Reddy, G. P. Henze, Applied Data Analysis and Modeling for Energy Engineers and Scientists, https://doi.org/10.1007/978-3-031-34869-3_2
2.1.2
Bayesian Viewpoint of Probability
The classical or traditional or objective probability concepts are associated with the frequentist view of probability, i.e., interpreting probability as the long-run frequency. This has a nice intuitive interpretation, hence its appeal. However, people have argued that many processes are unique events and do not occur repeatedly, thereby questioning the validity of the frequentist or objective probability viewpoint. Further, even when one may have some basic preliminary idea of the probability associated with a certain event, the classical view excludes such subjective insights in the determination of probability. The Bayesian approach, however, recognizes such issues by allowing one to update assessments of probability that integrate prior knowledge with observed events, thereby allowing better conclusions to be reached. It can thus be viewed as an approach combining a-priori probability (i.e., estimated ahead of the experiment) with post-priori knowledge gained after the experiment is over. Both the classical and the Bayesian approaches converge to the same results as increasingly more data (or information) is gathered. It is when the datasets are small that the additional insight offered by the Bayesian approach becomes advantageous. Thus, the Bayesian view is not an approach that is at odds with the frequentist approach, but rather adds (or allows the addition of) refinement to it. This can be a great benefit in many types of analysis, and therein lies its appeal. The Bayes’ theorem and its application to discrete and continuous probability variables are discussed in Sect. 2.5, while Sect. 4.6 presents its application to estimation and hypothesis testing problems.
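The convergence of the two viewpoints as data accumulate can be illustrated numerically. The sketch below rests on assumptions that are not part of the text above: a success/failure (binomial) process, a Beta prior for the unknown success probability, and a prior centered on 0.5 with parameters (5, 5). For that particular prior, the posterior mean after k successes in n trials is (a + k)/(a + b + n). The function names are illustrative.

```python
def frequentist_estimate(successes, trials):
    """Relative frequency of successes."""
    return successes / trials


def bayesian_posterior_mean(successes, trials, prior_a, prior_b):
    """Posterior mean of a Beta(prior_a, prior_b) prior updated with binomial data."""
    return (prior_a + successes) / (prior_a + prior_b + trials)


PRIOR_A, PRIOR_B = 5.0, 5.0  # assumed prior: belief centered on 0.5
for k, n in ((8, 10), (80, 100), (8000, 10_000)):
    f = frequentist_estimate(k, n)
    b = bayesian_posterior_mean(k, n, PRIOR_A, PRIOR_B)
    print(f"n = {n:>6}: frequentist = {f:.3f}, Bayesian = {b:.3f}")
```

With 10 trials the two estimates differ noticeably (0.800 versus 0.650) because the prior still carries weight. With 10,000 trials they agree to three decimals, consistent with the statement that both approaches converge as more data are gathered.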
2.1.3 Distinction Between Probability and Statistics
The distinction between probability and statistics is often not clear cut, and sometimes the terminology adds to the confusion.¹ In its simplest sense, probability theory generally allows one to predict the behavior of the system “before” the event under the stipulated assumptions, while statistics refers to a body of a posteriori knowledge whose application allows one to make sense out of the data collected. Thus, probability concepts provide the theoretical underpinnings of those aspects of statistical analysis that involve random behavior or noise in the actual data being analyzed. Recall that in Sect. 1.3.2, a distinction had been made between four types of uncertainty or unexpected variability in the data. The first was due to the stochastic or inherently random nature of the process itself, which no amount of experiment, even if done perfectly, can overcome. The study of probability theory is mainly mathematical, and applies to this type, i.e., to situations/processes/systems whose random nature is known to be of a certain type or can be modeled such that its behavior (i.e., certain events being produced by the system) can be predicted in the form of probability distributions. Thus, probability deals with the idealized behavior of a system under a known type of randomness. Unfortunately, most natural or engineered systems do not fit neatly into any one of these groups, and so when performance data of a system is available, the objective may be:
(i) To try to understand the overall nature of the system from its measured performance, i.e., to explain what caused the system to behave in the manner it did.
(ii) To try to make inferences about the general behavior of the system from a limited amount of data.
Consequently, some authors have suggested that the probability approach be viewed as a “deductive” science where the conclusion is drawn without any uncertainty, while statistics is an “inductive” science where only an imperfect conclusion can be reached, with the added problem that this conclusion hinges on the types of assumptions one makes about the random nature of the process and its forcing functions! Here is a simple example to illustrate the difference. Consider the flipping of a coin supposed to be fair. The probability of getting “heads” is ½. If, however, “heads” comes up eight times out of the last ten trials, what is the probability that the coin is not fair? Statistics allows an answer to this type of “inverse” enquiry, while probability is the approach for the “forward” type of questioning.
¹ For example, “statistical mechanics” in physics has nothing to do with statistics at all but is a type of problem studied under probability.
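The “forward” side of the coin example can be sketched numerically. The snippet below is only an illustration: it assumes independent flips with a fixed probability of heads (a binomial model, which the text does not state explicitly), and the function name `prob_at_least` is hypothetical.

```python
from math import comb

def prob_at_least(k_min: int, n: int, p: float = 0.5) -> float:
    """Probability of at least k_min successes in n independent trials,
    each with success probability p (binomial model, an assumption here)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))

# Forward question: if the coin really is fair, how likely are 8 or more heads in 10 flips?
p_tail = prob_at_least(8, 10)
print(f"P(8 or more heads in 10 fair flips) = {p_tail:.4f}")
```

A small value makes the fairness assumption look doubtful, but it is not by itself “the probability that the coin is not fair”; answering that inverse question needs the statistical machinery treated later in the book.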
2.2 Classical Probability
2.2.1 Basic Terminology
A random (or stochastic) variable is one whose numerical value depends on the outcome of a random phenomenon, experiment, or trial, i.e., one whose value depends on chance and is thus not entirely predictable. For example, rolling a dice is a random experiment. There are two types of random variables:
• Discrete random variables: those that can take on only a finite or countable number of values.
• Continuous random variables: those that may take on any value in an interval.
The following basic notions relevant to the study of probability apply primarily to discrete random variables:
(i) Simple event or trial of a random experiment is one that has a single outcome. It cannot be decomposed into anything simpler. For example, getting a {6} when a dice is rolled.
(ii) Sample space (some refer to it as “universe”) is the set of all possible outcomes of a single trial. For the rolling of a six-faced dice, the sample space is S = {1, 2, 3, 4, 5, 6}.
(iii) Compound or composite event is one involving the grouping of several simple events. For example, getting a pre-selected total, say, A = {6}, from all the possible outcomes of rolling two six-faced dice together would constitute a composite event.
(iv) Complement of an event is the set of outcomes in the sample space not contained in A above. For example, Ā = {2, 3, 4, 5, 7, 8, 9, 10, 11, 12} is the complement of the event A.
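The terminology above can be made concrete by enumerating the two-dice experiment. This is a minimal sketch that assumes event A means “the two faces add up to 6” (the reading suggested by the complement listed in item (iv)); all variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space for rolling two six-faced dice: 36 equally likely ordered pairs.
sample_space = set(product(range(1, 7), repeat=2))

# Composite event A: the two faces sum to 6 (several simple events grouped together).
event_a = {roll for roll in sample_space if sum(roll) == 6}

# Complement of A: every outcome of the sample space that is not in A.
complement_a = sample_space - event_a

p_a = Fraction(len(event_a), len(sample_space))
complement_sums = sorted({sum(roll) for roll in complement_a})

print("p(A) =", p_a)
print("Totals in the complement of A:", complement_sums)
```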
2.2.2 Basic Set Theory Notation and Axioms of Probability
A “set” in mathematics is a collection of well-defined distinct objects that can be considered as an object in its own right. Set theory has its own definitions, axioms, notations, and algebra, which are closely related to the properties of random variables. Familiarity with the algebraic manipulation of sets can enhance understanding and manipulation of probability concepts. The outcomes of simple or compound events can also be considered to be elements of a set. For example, the sample space S of outcomes from rolling a dice is a collection of items or a set, and is usually indicated by S = {1, 2, 3, 4, 5, 6}. If set B were to denote B = {1, 2, 3}, it would be a subset of S, which is expressed as B ⊂ S or S ⊃ B. Each element of B “belongs to” S (e.g., 2 ∈ S), while the symbol ∉ represents “not belonging to.” Generally, if E is a set of numbers between 3 and 6 (inclusive), it can be stated as E = {x | 3 ≤ x ≤ 6}. Finally, a compound or joint event is one that arises from operations involving two or more events occurring at the same time. The concepts of complement, union, intersection, etc. discussed below in the context of probability also apply to set manipulation. The Venn diagram is a pictorial representation wherein elements are shown as points in a plane and sets as closed regions within an enclosing rectangle denoting the universal set or sample space. It offers a convenient manner of illustrating set and subset interaction and allows intuitive understanding of compound events and the properties of their combined probabilities. Figure 2.1 illustrates the following concepts:
Fig. 2.1 Venn diagrams for a few simple cases. (a) Event A is denoted as a region in space S. The probability of event A is represented by the ratio of the area inside the circle to that inside the rectangle. (b) The intersection of events A and B is the common overlapping area (shown hatched). (c) Events A and B are mutually exclusive or are disjoint events. (d) Event B is a subset of event A
• The universe of outcomes or sample space S is a set denoted by the area enclosed within a rectangle, while the probability of a particular event (say, event A) is denoted by a region within (Fig. 2.1a).
• Union of two events A and B (Fig. 2.1b) is represented by the set of outcomes in either A or B or both, and is denoted by A ∪ B (where the symbol ∪ is conveniently remembered as “u” of “union”). This is akin to an addition and the composite event is denoted mathematically as C = A ∪ B = B ∪ A. An example is the number of cards in a pack of 52 cards which are either hearts or spades (= 52 × (1/4 + 1/4) = 26).
• Intersection of two events A and B is represented by the set of outcomes that are in both A and B simultaneously. This is akin to a multiplication, and is denoted by D = A ∩ B = B ∩ A. It is represented by the hatched area in Fig. 2.1b. An example is drawing a card from a deck and finding it to be a red jack (probability = (1/2) × (1/13) = 1/26). The figure also shows the areas denoted by the intersection of A and B with their complements Ā and B̄.
• Mutually exclusive events or disjoint events are those that have no outcomes in common (Fig. 2.1c). In other words, the two events cannot occur together during the same trial. If events A and B are disjoint, this can be expressed as A ∩ B = Ø, where Ø denotes a null or empty set. An example is drawing a red spade (nil).
• Event B is inclusive in event A when all outcomes of B are contained in those of A, i.e., B is a subset of A (Fig. 2.1d). This is expressed as B ⊂ A or A ⊃ B. An example is the set of red cards numbered less than six (event B), which is contained in the set of red cards (event A). The figure also shows the area (A – B) representing the difference between events A and B.
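The card examples used in the bullets above map directly onto set operations. The sketch below builds a standard 52-card deck and checks each case; counting the ace as lower than six is an assumption made only for the subset example, and all names are illustrative.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "spades", "clubs"]
deck = set(product(ranks, suits))  # sample space S with 52 outcomes

hearts = {card for card in deck if card[1] == "hearts"}
spades = {card for card in deck if card[1] == "spades"}
red = {card for card in deck if card[1] in ("hearts", "diamonds")}
jacks = {card for card in deck if card[0] == "J"}

hearts_or_spades = hearts | spades   # union
red_jacks = red & jacks              # intersection
red_spades = red & spades            # mutually exclusive events: empty set

# Subset example; treating the ace as "less than six" is an assumption.
low_red = {card for card in red if card[0] in ("A", "2", "3", "4", "5")}

p_red_jack = Fraction(len(red_jacks), len(deck))
print(len(hearts_or_spades), p_red_jack, red_spades == set(), low_red <= red)
```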
2.2.3 Axioms of Probability
Let us now apply the above concepts to random variables and denote the sample space S as consisting of two events A and B with probabilities p(A) and p(B), respectively. Then:
(i) Probability of any event, say A, cannot be negative. This is expressed as:
$$p(A) \ge 0 \tag{2.2}$$
(ii) Probabilities of all events must sum to unity (i.e., normalized):
$$p(S) \equiv p(A) + p(B) = 1 \tag{2.3}$$
(iii) Probabilities of mutually exclusive events A and B add up:
$$p(A \cup B) = p(A) + p(B) \tag{2.4}$$
When a dice is rolled, the possible outcomes are mutually exclusive. If event A is the occurrence of 2 and event B that of 3, then p(A or B) = 1/6 + 1/6 = 1/3, i.e., the additive rule (Eq. 2.4) applies. The extension to more than two mutually exclusive events is straightforward. Some other inferred relations are:
(iv) Probability of the complement of event A:
$$p(\bar{A}) = 1 - p(A), \quad \text{which corresponds to } A \cup \bar{A} = S \tag{2.5}$$
(v) Probability for either A or B to occur (when they are not mutually exclusive) is:
$$p(A \cup B) = p(A) + p(B) - p(A \cap B) \tag{2.6}$$
This is intuitively obvious from the Venn diagram (see Fig. 2.1b) since the hatched area (representing p(A ∩ B)) gets counted twice in the sum, and so needs to be deducted once. This equation can also be deduced from the axioms of probability. Note that if events A and B are mutually exclusive, then Eq. 2.6 reduces to Eq. 2.4.
Example 2.2.1
(a) Set theory approach
Consider two sets A and B defined by integers from 1 to 10 by: A = {1, 4, 5, 7, 8, 9} and B = {2, 3, 5, 6, 8, 10}. Then A ∪ B = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} = sample space S. Also A ∩ B = {5, 8}, A – B = {1, 4, 7, 9}, and B – A = {2, 3, 6, 10}. The reader is urged to draw the corresponding Venn diagram for better conceptual understanding.
(b) Probability approach
Let the two sets be redefined based on the number of integers in each set. Then p(A) = 6/10 = 0.6 and p(B) = 0.6. Clearly the two sets overlap since their probabilities sum to greater than one. The union p(A ∪ B) = 1, i.e., the entire space S. Rearranging Eq. 2.6 results in the intersection p(A ∩ B) = p(A) + p(B) - p(A ∪ B) = 0.6 + 0.6 - 1 = 0.2, which is consistent with (a) where the intersection consisted of two elements out of 10 in the set.
Example 2.2.2
For three non-mutually exclusive events, Eq. 2.6 can be extended to
$$p(A \cup B \cup C) = p(A) + p(B) + p(C) - p(A \cap B) - p(A \cap C) - p(B \cap C) + p(A \cap B \cap C) \tag{2.7}$$
This is clear from the corresponding Venn diagram shown in Fig. 2.2. The last term p(A ∩ B ∩ C) denotes the intersection area of all three events occurring simultaneously. It is counted three times as part of p(A), p(B), and p(C), and deducted three times as part of p(A ∩ B), p(A ∩ C), and p(B ∩ C), and so needs to be added back once.
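The set and probability results of Example 2.2.1, and the three-event union rule of Example 2.2.2 in its standard inclusion-exclusion form (triple intersection added back once), can be checked by direct counting. This is a sketch only: the third set `C` is a hypothetical addition that does not appear in the text, and `p` is an illustrative helper assuming equally likely integers.

```python
from fractions import Fraction

S = set(range(1, 11))            # sample space: integers 1 to 10
A = {1, 4, 5, 7, 8, 9}
B = {2, 3, 5, 6, 8, 10}
C = {3, 5, 7, 10}                # hypothetical third event, for illustration only

def p(event: set) -> Fraction:
    """Probability of an event, assuming all integers in S are equally likely."""
    return Fraction(len(event), len(S))

# Two events: p(A ∪ B) = p(A) + p(B) - p(A ∩ B)
two_event_rhs = p(A) + p(B) - p(A & B)

# Three events: pairwise intersections are deducted, the triple intersection is added back once.
three_event_rhs = (p(A) + p(B) + p(C)
                   - p(A & B) - p(A & C) - p(B & C)
                   + p(A & B & C))

print(A & B, A - B, B - A)
print(p(A | B), two_event_rhs, p(A | B | C), three_event_rhs)
```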
Fig. 2.2 Venn diagram for the union of three non-mutually exclusive events (Example 2.2.2)
Fig. 2.3 Two components connected in series and in parallel (Example 2.2.3)
2.2.4 Joint, Marginal, and Conditional Probabilities
(a) Joint probability of two independent events represents the case when both events occur together at the same point in time. Such events and more complex probability problems are not appropriate for Venn diagram representation. Then, following the multiplication law:
$$p(A \text{ and } B) = p(A)\, p(B) \quad \text{if } A \text{ and } B \text{ are independent} \tag{2.8}$$
These are called product models. The notations p(A ∩ B) and p(A and B) can be used interchangeably. Consider a six-faced dice-tossing experiment. If the outcome of event A is the occurrence of an even number, then p(A) = 1/2. If the outcome of event B is that the number is less than or equal to 4, then p(B) = 2/3. The probability that both outcomes occur when a dice is rolled is p(A and B) = 1/2 × 2/3 = 1/3. This is consistent with our intuition since outcomes {2, 4} would satisfy both the events.
Example 2.2.3
Probability concepts directly apply to reliability problems associated with engineered systems. Consider two electronic components A and B with different rates of failure expressed as probabilities, say p(A) = 0.1 and p(B) = 0.25. What is the failure probability of a system made up of the two components if connected (a) in series and (b) in parallel? Assume independence, i.e., the failure of one component is independent of the other. The two cases are shown in Fig. 2.3 with the two components A and B connected in series and in parallel. Reliability is the probability of functioning properly and is the complement of the probability of failure, i.e., reliability R(A) = 1 – p(A) = 1 – 0.1 = 0.9, and R(B) = 1 – p(B) = 1 – 0.25 = 0.75.
(a) Series connection: For the system to function, both components should be functioning. Then the joint probability of the system functioning (or the reliability):
$$R(A \text{ and } B) = R(A) \cdot R(B) = 0.9 \times 0.75 = 0.675$$
The failure probability of the system is p(system failure) = 1 – R(A and B) = 1 – 0.675 = 0.325. For the special case when both components have the same probability of failure p, it is left to the reader to verify that system reliability or probability of system functioning = (1 – p)^2.
(b) Parallel connection: For the system to fail, both components should fail. In this case, it is better to work with failure probabilities. Then, the joint probability of the system failing is p(A) · p(B) = 0.1 × 0.25 = 0.025, which is much lower than the 0.325 found for the components-in-series scenario. This result is consistent with the intuitive fact that components in parallel increase the reliability of the system. The corresponding probability of functioning is R(A and B) = 1 - 0.025 = 0.975. For the special case when both components have the same probability of failure p, it is left to the reader to verify that system reliability or probability of system functioning = (1 – p^2).
(b) Marginal probability of an event A compared to another event B refers to its probability of occurrence irrespective of B. It is sometimes referred to as “unconditional probability” of A on B. Let the space contain only events A and B, i.e., events A and B are known to have occurred. Since S can be taken to be the sum of event space B and its complement B̄, the probability of A can be expressed in terms of the sum of the disjoint parts of B:
$$p(A) = p(A \cap B) + p(A \cap \bar{B}) = p(A \text{ and } B) + p(A \text{ and } \bar{B}) \tag{2.9}$$
This expression (a form of the law of total probability) can be extended to the case of more than two joint events. This equation will be made use of in Sect. 2.5.
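The series and parallel calculations of Example 2.2.3 generalize readily to any number of independent components. The sketch below is one possible way to code them; the function names are hypothetical and independence of component failures is assumed, as in the example.

```python
from typing import Iterable

def series_reliability(failure_probs: Iterable[float]) -> float:
    """Series system: it functions only if every component functions."""
    reliability = 1.0
    for p_fail in failure_probs:
        reliability *= 1.0 - p_fail
    return reliability

def parallel_reliability(failure_probs: Iterable[float]) -> float:
    """Parallel system: it fails only if every component fails."""
    p_all_fail = 1.0
    for p_fail in failure_probs:
        p_all_fail *= p_fail
    return 1.0 - p_all_fail

r_series = series_reliability([0.1, 0.25])
r_parallel = parallel_reliability([0.1, 0.25])
print(f"series: R = {r_series:.3f}, failure = {1 - r_series:.3f}")
print(f"parallel: R = {r_parallel:.3f}, failure = {1 - r_parallel:.3f}")
```

With two identical components of failure probability p, the same functions reproduce the special cases (1 – p)^2 and 1 – p^2 that the example leaves to the reader.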
Example 2.2.4
The percentage data of annual income versus age has been gathered from a large population living in a certain region (see Table 2.1). Let X be the income and Y the age. The marginal probability of X for each class is simply the sum of the probabilities under each column and that of Y the sum of those for each row. Thus, p(X ≤ 40,000) = 0.15 + 0.10 + 0.08 = 0.33, and so on. Also, verify that the marginal probabilities of X and of Y each sum to 1.00 (to satisfy the normalization condition). ■
(c) Conditional probability: There are several situations involving compound outcomes that are sequential or successive in nature. The chance result of the first stage determines the conditions under which the next stage occurs. Such events, called two-stage (or multistage) events, involve step-by-step outcomes that can be represented as a probability tree. This allows better visualization of how the probabilities progress from one stage to the next. If A and B are events, then the probability that event B occurs given that A has already occurred is given by the conditional probability of B given A:
$$p(B/A) = \frac{p(A \cap B)}{p(A)} \tag{2.10}$$
An example of a conditional probability event is the drawing of a spade from a pack of cards from which a first card was already drawn. If it is known that the first card was not a spade, then all 13 spades remain among the 51 cards left, and the probability of drawing a spade the second time is 13/51. On the other hand, if the first card drawn was a spade, then the probability of getting a spade on the second draw is 12/51 = 4/17. A special but important case is when p(B/A) = p(B). In this case, B is said to be independent of A because the fact that event A has occurred does not affect the probability of B occurring. In this case, one gets back Eq. 2.8.
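Marginal and conditional probabilities can both be read off a joint probability table. The sketch below uses the entries of Table 2.1; the dictionary layout, the income-class labels, and the helper `conditional_age_given_income` are illustrative choices rather than anything prescribed by the text.

```python
# Joint probabilities p(X, Y) from Table 2.1: rows are age classes (Y), columns income classes (X).
joint = {
    "Under 25": {"<=40k": 0.15, "40k-90k": 0.09, ">=90k": 0.05},
    "25 to 40": {"<=40k": 0.10, "40k-90k": 0.16, ">=90k": 0.12},
    "Above 40": {"<=40k": 0.08, "40k-90k": 0.20, ">=90k": 0.05},
}

# Marginal probability of Y: sum along each row.
marginal_y = {age: sum(row.values()) for age, row in joint.items()}

# Marginal probability of X: sum down each column.
marginal_x = {}
for row in joint.values():
    for income, prob in row.items():
        marginal_x[income] = marginal_x.get(income, 0.0) + prob

def conditional_age_given_income(age: str, income: str) -> float:
    """p(Y = age / X = income) = p(X and Y) / p(X), following Eq. 2.10."""
    return joint[age][income] / marginal_x[income]

print(marginal_x)
print(marginal_y)
print(conditional_age_given_income("Under 25", "<=40k"))
```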
Mutually exclusive events and independent events are not to be confused. While the former is a property of the events themselves that occur simultaneously, the latter is a property that arises from the event probabilities that are sequential or staged over time. The distinction is clearer if one keeps in mind that:
– If A and B events are mutually exclusive, p(A ∩ B) = p(AB) = 0.
– If A and B events are independent, then p(AB) = p(A) · p(B).
– If A and B events are mutually exclusive and independent, then p(AB) = 0 = p(A) · p(B), and so at least one of the events must have zero probability, i.e., cannot occur.
Example 2.2.5
A single fair dice is rolled. Let event A = {even outcome} and event B = {outcome is divisible by 3}.
(a) List the various outcomes in the sample space: {1 2 3 4 5 6}
(b) List the outcomes in A and find p(A): {2 4 6}, p(A) = 1/2
(c) List the outcomes of B and find p(B): {3 6}, p(B) = 1/3
(d) List the outcomes in A ∩ B and find p(A ∩ B): {6}, p(A ∩ B) = 1/6
(e) Are the events A and B independent? Yes, since Eq. 2.8 holds: p(A) · p(B) = 1/2 × 1/3 = 1/6 = p(A ∩ B) ■
Example 2.2.6
Two defective bulbs have been mixed with eight good ones. Let event A = {first bulb is good}, and event B = {second bulb is good}.
(a) If two bulbs are chosen at random with replacement, the two events are independent. What is the probability that both are good? Given p(A) = 8/10 and p(B) = 8/10. Then from Eq. 2.8:
Table 2.1 Computing marginal probabilities from a probability table (Example 2.2.4)
| Age (Y) | Income (X) ≤$40,000 | Income (X) 40,000–90,000 | Income (X) ≥90,000 | Marginal probability of Y |
|---|---|---|---|---|
| Under 25 | 0.15 | 0.09 | 0.05 | 0.29 |
| Between 25 and 40 | 0.10 | 0.16 | 0.12 | 0.38 |
| Above 40 | 0.08 | 0.20 | 0.05 | 0.33 |
| Marginal probability of X | 0.33 | 0.45 | 0.22 | Should sum to 1.00 both ways |
$$p(A \cap B) = \frac{8}{10} \cdot \frac{8}{10} = \frac{64}{100} = 0.64$$
(b) What is the probability that two bulbs drawn in sequence (i.e., not replaced) are both good? Here the probability for the second draw is conditional on the first bulb being known to be good. From Eq. 2.10, p(both bulbs drawn are good):
$$p(A \cap B) = p(A)\, p(B/A) = \frac{8}{10} \cdot \frac{7}{9} = \frac{28}{45} = 0.622$$
■
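The two sampling schemes of Example 2.2.6 can be compared with exact fractions, and the without-replacement result cross-checked by brute-force enumeration. The sketch assumes eight good and two defective bulbs, which is the mix implied by p(A) = 8/10 in the example; the bulb labelling is arbitrary.

```python
from fractions import Fraction
from itertools import permutations

good, defective = 8, 2
total = good + defective

# (a) With replacement: the second draw is independent of the first.
p_with_replacement = Fraction(good, total) * Fraction(good, total)

# (b) Without replacement: the second draw is conditional on the first bulb being good.
p_without_replacement = Fraction(good, total) * Fraction(good - 1, total - 1)

# Cross-check (b) by listing every ordered pair of distinct bulbs (labels 0..7 are good).
pairs = list(permutations(range(total), 2))
both_good = [pair for pair in pairs if pair[0] < good and pair[1] < good]
p_enumerated = Fraction(len(both_good), len(pairs))

print(p_with_replacement, p_without_replacement, p_enumerated)
```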
Example 2.2.7
Two events A and B have the following probabilities: p(A) = 0.3, p(B) = 0.4, and p(Ā ∩ B) = 0.28.
(a) Determine whether the events A and B are independent or not.
From Eq. 2.5, p(Ā) = 1 - p(A) = 0.7. Next, one verifies whether Eq. 2.8 holds, i.e., whether p(Ā ∩ B) = p(Ā) p(B), or whether 0.28 is equal to (0.7 × 0.4). Since this is correct, one can state that events A and B are independent.
(b) Find p(A ∪ B).
From Eqs. 2.6 and 2.8:
$$p(A \cup B) = p(A) + p(B) - p(A \cap B) = p(A) + p(B) - p(A)\,p(B) = 0.3 + 0.4 - (0.3)(0.4) = 0.58$$
Example 2.2.8
Generating a probability tree for a residential air-conditioning (AC) system. Assume that the AC is slightly undersized for the house it serves. There are two possible outcomes (S, satisfactory, and NS, not satisfactory) depending on whether the AC is able to maintain the desired indoor temperature. The outcomes depend on the outdoor temperature, and, for simplicity, its annual variability is grouped into three categories: very hot (VH), hot (H), and not hot (NH). The probabilities for outcomes S and NS to occur in each of the three day-type categories are shown in the conditional probability tree diagram (Fig. 2.4), while the joint probabilities computed following Eq. 2.10 (i.e., the day-type probability multiplied by the conditional probability of the outcome) are assembled in Table 2.2. Note that the probabilities of the three branches in the first stage, and those of the two outcome branches for each day type, add to unity (e.g., for Very Hot, the S and NS outcomes add to 1.0, and so on). Further, note that the joint probabilities shown in the table also must sum to unity (it is advisable to perform such verification checks). The probability of the indoor conditions being satisfactory is determined as p(S) = 0.02 + 0.27 + 0.6 = 0.89, while p(NS) = 0.08 + 0.03 + 0 = 0.11. It is wise to verify that p(S) + p(NS) = 1.0. ■
Fig. 2.4 The conditional probability tree for the residential air-conditioner when two outcomes are possible (S, satisfactory; and NS, not satisfactory) for each of three day-types (VH, very hot; H, hot; and NH, not hot). (Example 2.2.8)
Table 2.2 Joint probabilities of various outcomes (Example 2.2.8)
| Outcome S | Outcome NS |
|---|---|
| p(VH ∩ S) = 0.1 × 0.2 = 0.02 | p(VH ∩ NS) = 0.1 × 0.8 = 0.08 |
| p(H ∩ S) = 0.3 × 0.9 = 0.27 | p(H ∩ NS) = 0.3 × 0.1 = 0.03 |
| p(NH ∩ S) = 0.6 × 1.0 = 0.6 | p(NH ∩ NS) = 0.6 × 0 = 0 |
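The probability tree of Example 2.2.8 can be propagated with a few lines of code. The sketch below uses the branch probabilities listed in Table 2.2; the dictionary names are illustrative.

```python
# First stage: probability of each day type.
p_day = {"VH": 0.1, "H": 0.3, "NH": 0.6}

# Second stage: conditional probability of a satisfactory outcome given the day type.
p_s_given_day = {"VH": 0.2, "H": 0.9, "NH": 1.0}

# Joint probabilities: product of the probabilities along each branch of the tree.
joint = {}
for day, p in p_day.items():
    joint[(day, "S")] = p * p_s_given_day[day]
    joint[(day, "NS")] = p * (1.0 - p_s_given_day[day])

# Marginal probabilities of each outcome: sum the joint probabilities over the day types.
p_s = sum(prob for (day, outcome), prob in joint.items() if outcome == "S")
p_ns = sum(prob for (day, outcome), prob in joint.items() if outcome == "NS")

print(f"p(S) = {p_s:.2f}, p(NS) = {p_ns:.2f}, total = {p_s + p_ns:.2f}")
```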
Table 2.3 Probabilities of various outcomes (Example 2.2.9)
| Box A | Box B |
|---|---|
| p(A ∩ R) = 1/2 × 1/2 = 1/4 | p(B ∩ R) = 1/2 × 3/4 = 3/8 |
| p(A ∩ W) = 1/2 × 1/2 = 1/4 | p(B ∩ W) = 1/2 × 0 = 0 |
| p(A ∩ G) = 1/2 × 0 = 0 | p(B ∩ G) = 1/2 × 1/4 = 1/8 |
Example 2.2.9
Two-stage experiment
Consider a problem where there are two boxes with marbles as specified: Box A (1 red and 1 white) and Box B (3 red and 1 green). A box is chosen at random and a marble drawn from it. What is the probability of getting a red marble? One is tempted to say that since there are 4 red marbles in total out of 6 marbles, the probability is 2/3. However, this is incorrect, and the proper analysis approach requires that one frames this problem as a two-stage experiment. The first stage is the selection of the box, and the second the drawing of the marble. Let event A (or event B) denote choosing Box A (or Box B). Let R, W, and G represent red, white, and green marbles. The resulting probabilities are shown in Table 2.3.
Fig. 2.5 The first stage of the forward probability tree diagram involves selecting a box (either A or B) while the second stage involves drawing a marble that can be red (R), white (W ), or green (G) in color. The total probability of drawing a red marble is 5/8. (Example 2.2.9)
Thus, the probability of getting a red marble = 1/4 + 3/8 = 5/8. The above example is depicted in Fig. 2.5, where the reader can visually note how the probabilities propagate through the probability tree. This is called the “forward tree” to differentiate it from the “reverse tree” discussed in Sect. 2.5. The above example illustrates how a two-stage experiment must be approached. First, one selects a box, which by itself does not tell us whether the marble is red (since one has yet to pick a marble). Only after a box is selected can one use the prior probabilities regarding the color of the marbles inside the box in question to determine the probability of picking a red marble. These prior probabilities can be viewed as conditional probabilities; for example, p(A ∩ R) = p(R/A) p(A). ■
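The two-stage reasoning of Example 2.2.9 can be checked with exact fractions. The sketch assumes each box is equally likely to be chosen, as in the example; the data layout and the helper `p_color` are illustrative.

```python
from fractions import Fraction

# Contents of each box: marble color -> count.
boxes = {
    "A": {"R": 1, "W": 1},
    "B": {"R": 3, "G": 1},
}

# Stage 1: a box is chosen at random (equal probability for each box).
p_box = {name: Fraction(1, len(boxes)) for name in boxes}

def p_color(color: str) -> Fraction:
    """Stage 2: sum over boxes of p(box) * p(color / box)."""
    total = Fraction(0)
    for name, contents in boxes.items():
        n_marbles = sum(contents.values())
        total += p_box[name] * Fraction(contents.get(color, 0), n_marbles)
    return total

p_red = p_color("R")
naive_pooled = Fraction(4, 6)   # the tempting but incorrect "4 red out of 6 marbles"

print("two-stage p(R) =", p_red, "| naive pooled estimate =", naive_pooled)
```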
2.2.5 Permutations and Combinations
The study of probability requires a sound knowledge of combinatorial mathematics, which is concerned with developing rules for situations involving permutations and combinations.
(a) Permutation P(n, k) is the number of ways in which k objects can be selected from n objects with the order being important. It is given by:
$$P(n, k) = \frac{n!}{(n-k)!} \tag{2.11}$$
A special case is the number of permutations of n objects taken n at a time:
$$P(n, n) = n! = n(n-1)(n-2) \cdots (2)(1) \tag{2.12}$$
(b) Combination C(n, k) is the number of ways in which k objects can be selected from n objects with the order not being important. It is given by:
$$C(n, k) = \frac{n!}{(n-k)!\,k!} \equiv \binom{n}{k} \tag{2.13}$$
Note that the same equation also defines the binomial coefficients since the expansion of (a + b)^n according to the Binomial theorem is:
$$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^{n-k} b^{k} \tag{2.14}$$
Example 2.2.10
(a) Calculate the number of ways in which three people from a group of seven people can be seated in a row.
This is a case of permutation since the order is important. The number of possible ways is given by Eq. 2.11:
$$P(7, 3) = \frac{7!}{(7-3)!} = (7)(6)(5) = 210$$
(b) Calculate the number of combinations in which three people can be selected from a group of seven.
Here the order is not important, and the combination formula can be used (Eq. 2.13). Thus:
$$C(7, 3) = \frac{7!}{(7-3)!\,3!} = \frac{(7)(6)(5)}{(3)(2)} = 35$$
■
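The counts in Example 2.2.10 can be reproduced with the standard-library functions `math.perm` and `math.comb` (available from Python 3.8). The last lines also spot-check the binomial expansion of Eq. 2.14 for one arbitrarily chosen set of values (a = 2, b = 3, n = 5, picked only for illustration).

```python
from math import comb, perm

# Example 2.2.10(a): ordered seating of 3 people chosen from 7.
seatings = perm(7, 3)

# Example 2.2.10(b): unordered selection of 3 people from 7.
selections = comb(7, 3)

# Spot check of the binomial theorem (Eq. 2.14) for arbitrary illustrative values.
a, b, n = 2, 3, 5
expansion = sum(comb(n, k) * a ** (n - k) * b ** k for k in range(n + 1))

print(seatings, selections, expansion == (a + b) ** n)
```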
Table 2.4 Number of combinations for equipment scheduling in a large physical plant of a campus. Status of each unit: 0, off; 1, on
| Case | Prime-movers | Boilers | Chillers-Vapor compression | Chillers-Absorption | Number of combinations |
|---|---|---|---|---|---|
| One of each | 0–1 | 0–1 | 0–1 | 0–1 | 2^4 = 16 |
| Two of each: assumed identical | 0–0, 0–1, 1–1 | 0–0, 0–1, 1–1 | 0–0, 0–1, 1–1 | 0–0, 0–1, 1–1 | 3^4 = 81 |
| Two of each: non-identical except for boilers | 0–0, 0–1, 1–0, 1–1 | 0–0, 0–1, 1–1 | 0–0, 0–1, 1–0, 1–1 | 0–0, 0–1, 1–0, 1–1 | 4^3 × 3^1 = 192 |
Another type of combinatorial problem is the factorial problem to be discussed in Chap. 6 while dealing with design of experiments. Consider a specific example involving equipment scheduling at a physical plant of a large campus that includes prime-movers (diesel engines or turbines that produce electricity), boilers, and chillers (vapor compression and absorption machines). Such equipment needs a certain amount of time to come online and so operators typically keep some of them “idling” so that they can start supplying electricity/heating/cooling at a moment’s notice. Their operating states can be designated by a binary variable, say “1” for on-status and “0” for off-status. Extensions of this concept include cases where, instead of two states, one could have m states. An example of three states is when, say, two identical boilers are to be scheduled. One could have three states altogether: (i) when both are off (0–0), (ii) when both are on (1–1), and (iii) when only one is on (1–0). Since the boilers are identical, state (iii) is identical to 0–1. In case the two boilers are of different size, there would be four possible states. The number of combinations possible for “n” such equipment where each one can assume “m” states is given by m^n. Some simple cases for scheduling four different types of energy equipment in a physical plant are shown in Table 2.4.
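The counts in Table 2.4 can be reproduced by enumerating the on/off states for each equipment type and multiplying across types. This is a sketch under the assumptions stated in the text (identical units make the states 0–1 and 1–0 indistinguishable); the function names are hypothetical.

```python
from itertools import combinations_with_replacement, product

def unit_states(n_units: int, identical: bool) -> list:
    """Distinct on/off states for n_units machines of one equipment type."""
    if identical:
        # Order does not matter: (0, 1) and (1, 0) count as the same state.
        return list(combinations_with_replacement((0, 1), n_units))
    return list(product((0, 1), repeat=n_units))

def count_schedules(equipment: list) -> int:
    """equipment: one (n_units, identical) tuple per equipment type."""
    total = 1
    for n_units, identical in equipment:
        total *= len(unit_states(n_units, identical))
    return total

# Order of types: prime-movers, boilers, vapor compression chillers, absorption chillers.
one_of_each = count_schedules([(1, False)] * 4)
two_identical = count_schedules([(2, True)] * 4)
two_mixed = count_schedules([(2, False), (2, True), (2, False), (2, False)])

print(one_of_each, two_identical, two_mixed)
```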
2.3 Probability Distribution Functions
2.3.1 Density Functions
The notions of discrete and continuous random variables were introduced in Sect. 2.2.1. The concepts relevant to discrete outcomes of events or discrete random variables were addressed in Sect. 2.2, and these will now be extended to continuous random variables. The distribution of a random variable represents the probability of it taking its various possible values. For example, if the y-axis in Fig. 1.2 of the six-faced dice rolling experiment were to be changed into a relative frequency (= 1/6), the resulting histogram would graphically represent the corresponding probability density function (PDF) (Fig. 2.6a). Thus, the probability of getting a 2 in the rolling of a dice is 1/6. Since this is a discrete random variable, the function takes on specific values at discrete points of the x-axis (which represents the outcomes).
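For the dice example, the discrete probability function and the stepped cumulative distribution built from it can be tabulated directly. This is a minimal sketch assuming a fair six-faced dice; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import accumulate

faces = range(1, 7)

# Discrete probability function: each face of a fair dice has probability 1/6.
pdf = {x: Fraction(1, 6) for x in faces}

# Cumulative distribution: running sum of the probabilities, i.e., p(X <= x).
cdf = dict(zip(faces, accumulate(pdf[x] for x in faces)))

for x in faces:
    print(x, pdf[x], cdf[x])
```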
The same type of y-axis normalization done to the data shown in Fig. 1.3 would result in the PDF for the case of continuous random data. This is shown in Fig. 2.7a for the random variable taken to be the hourly outdoor dry-bulb temperature over the year at Philadelphia, PA. Notice that this is the envelope of the histogram of Fig. 1.3. Since the variable is continuous, it is not meaningful to try to determine the probability of a single outcome, say a temperature of exactly 57.5 °F. One would instead be interested in the probability of outcomes within a range, say 55–60 °F. The probability can then be determined as the area under the PDF as shown in Fig. 2.7b. It is for such continuous random variables that the cumulative distribution function (CDF) is useful. It is simply the cumulative area under the curve starting from the lowest value of the random variable X to the current value (Fig. 2.8). The vertical scale directly gives the probability (or, in this case, the fractional time) that X is less than or equal to a certain value. Thus, the probability p(X ≤ 60) is about 0.58. The concept of CDF also applies to discrete variables, but as a discontinuous stepped curve as illustrated in Fig. 2.6b for the dice rolling example. To restate, depending on whether the random variable is discrete or continuous, one gets discrete or continuous PDFs. Though most experimentally gathered data is discrete, the underlying probability theory is based on the data being continuous. Replacing the integration sign by the summation sign in the equations that follow allows extending the definitions to discrete distributions. Let f(x) be the PDF associated with a random variable X. This is a function that provides the probability that a discrete random variable X takes on a particular value x among its various possible values. The axioms of probability (Eqs. 2.2 and 2.3) for the discrete case are expressed for the case of continuous random variables as:
• PDF cannot be negative: f(x) ≥ 0
-1 10. In problems where the normal distribution is used, it is more convenient to standardize the random variable into a new random variable $z \equiv (x - \mu)/\sigma$ with mean zero and variance of unity. This results in the standard normal curve or z-curve:
$$N(z; 0, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right) \tag{2.42b}$$
In actual problems, the standard normal distribution is used to determine the probability of the random variable having a value within a certain interval, say z between z1 and z2. Then Eq. 2.42b can be modified into:
$$N(z_1 \le z \le z_2) = \int_{z_1}^{z_2} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{z^2}{2}\right) dz \tag{2.42c}$$
Fig. 2.16 Figures meant to illustrate the fact that the shaded figure areas are the physical representations of the tabulated standardized probability values in Table A3 (corresponds to Example 2.4.9ii). (a) Lower limit: z = -1.2, p(-1.2) = 0.1151. (b) Upper limit: z = 0.8, read from the symmetric value z = -0.8, p(-0.8) = 0.2119
The shaded area in Table A3 permits evaluating the above integral, i.e., determining the associated probability assuming z1 = -∞. Note that for z = 0, the probability given by the shaded area is equal to 0.5. Since not all texts adopt the same format in which to present these tables, the user is urged to use caution in interpreting the values shown in such tables.
Example 2.4.9
Graphical interpretation of probability using the standard normal table
Resistors made by a certain manufacturer have a nominal value of 100 ohms, but their actual values are normally distributed with a mean of μ = 100.6 ohms and standard deviation σ = 3 ohms. Find the percentage of resistors that will have values:
(i) Higher than the nominal rating. The standard normal variable z(X = 100) = (100 – 100.6)/3 = -0.2. From Table A3, p(z ≤ -0.2) = 0.4207, so this corresponds to a probability of (1 – 0.4207) = 0.5793 or 57.93%.
(ii) Within 3 ohms of the nominal rating (i.e., between 97 and 103 ohms). The lower limit is z1 = (97 - 100.6)/3 = -1.2, and the tabulated probability from Table A3 is p(z ≤ –1.2) = 0.1151 (as illustrated in Fig. 2.16a). The upper limit is z2 = (103 - 100.6)/3 = 0.8. However, care should be taken in properly reading the corresponding value from Table A3, which only gives probability values for z < 0. One first determines the probability for the negative value symmetric about 0, i.e., p(z ≤ –0.8) = 0.2119 (shown in Fig. 2.16b). Since the total area under the curve is 1.0, p(z ≤ 0.8) = 1.0 – 0.2119 = 0.7881. Finally, the required probability p(–1.2 < z < 0.8) = (0.7881 – 0.1151) = 0.6730 or 67.3%. ■
Inspection of Table A3 allows the following statements to be made, which are often adopted during statistical inferencing:
• The interval μ ± σ contains approximately [1 – 2(0.1587)] = 0.683 or 68.3% of the observations.
• The interval μ ± 2σ contains approximately 95.4% of the observations.
• The interval μ ± 3σ contains approximately 99.7% of the observations.
Another manner of using the standard normal table is for the “backward” problem. In such cases, instead of the z value being specified and the probability having to be deduced, the probability is specified and the z value is to be deduced.
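The table look-ups of Example 2.4.9 can be cross-checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) in place of Table A3; the variable names are illustrative.

```python
from statistics import NormalDist

# Resistor values: normally distributed with mean 100.6 ohms and standard deviation 3 ohms.
resistance = NormalDist(mu=100.6, sigma=3.0)

# (i) Fraction of resistors above the nominal 100 ohms.
p_above_nominal = 1.0 - resistance.cdf(100.0)

# (ii) Fraction within 3 ohms of nominal, i.e., between 97 and 103 ohms.
p_within_3_ohms = resistance.cdf(103.0) - resistance.cdf(97.0)

# Coverage of the intervals mu ± k*sigma for the standard normal curve.
z = NormalDist()
coverage = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

print(f"(i) {p_above_nominal:.4f}  (ii) {p_within_3_ohms:.4f}")
print({k: round(v, 4) for k, v in coverage.items()})
```

Small differences in the last digit relative to the tabulated values are to be expected, since printed tables are rounded.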
Example 2.4.10
Reinforced and pre-stressed concrete structures are designed so that the compressive stresses are carried mostly by the concrete itself. For this and other reasons, the main criterion by which the quality of concrete is assessed is its compressive strength. Specifications for concrete used in civil engineering jobs may require specimens of specified size and shape (usually cubes) to be cast and tested on site. One can assume the normal distribution to apply. If the mean and standard deviation of this distribution are μ and σ, and the civil engineer wishes to determine the “statistical minimum strength” x, specified as the strength below which only, say, 5% of the cubes are expected to fail, one searches Table A3 and determines the value of z for which the probability is 0.05, i.e., p(z = -1.645) = 0.05. This would correspond to x = μ - 1.645σ.

Fig. 2.17 Comparison of the normal (or Gaussian) z curve N(0,1) to two Student’s t-curves with different degrees of freedom (d.f. = 10 and d.f. = 5); PDF plotted against x. As the d.f. decreases, the PDF for the Student’s t-distribution flattens out and deviates increasingly from the normal distribution
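The table look-ups of Examples 2.4.9 and 2.4.10 can be cross-checked numerically. The following is a minimal sketch using the `NormalDist` class of the Python standard library in place of Table A3; the variable names are illustrative and not part of the original text.

```python
from statistics import NormalDist

# Example 2.4.9: resistor values are N(mu = 100.6 ohms, sigma = 3 ohms)
resistors = NormalDist(mu=100.6, sigma=3.0)

# (i) fraction of resistors above the nominal rating of 100 ohms
p_above_nominal = 1.0 - resistors.cdf(100.0)

# (ii) fraction within 3 ohms of the nominal rating (97 to 103 ohms)
p_within_3_ohms = resistors.cdf(103.0) - resistors.cdf(97.0)

# Example 2.4.10 ("backward" problem): z value whose lower-tail area is 0.05
z_5_percent = NormalDist().inv_cdf(0.05)

print(f"P(X > 100)      = {p_above_nominal:.4f}")
print(f"P(97 < X < 103) = {p_within_3_ohms:.4f}")
print(f"z for p = 0.05  = {z_5_percent:.3f}")
```

The computed values should agree with the tabulated results (0.5793, 0.6730, and -1.645) to within the rounding of the table.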
(b) Student’s t-Distribution
One important application of the normal distribution is that it allows making statistical inferences about population means from random samples (see Sect. 4.2). In case the random samples are small (say, n < 30), then the Student’s t-distribution, rather than the normal distribution, should be used. If one assumes that the sampled population is approximately normally distributed, then the random variable

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

has the Student’s t-distribution t(x̄, s, ν), where x̄ is the sample mean, μ is the population mean, s is the sample standard deviation, and ν is the degrees of freedom = (n - 1). Thus, the number of degrees of freedom (d.f.) equals the number of data points minus the number of constraints or restrictions placed on the data. Table A4 (which is set up differently from the standard normal table) provides numerical values of the t-distribution for different degrees of freedom at different confidence levels. How to use these tables will be discussed in Sect. 4.2. Unlike the z curve, one has a family of t-distributions for different values of ν. Qualitatively, the t-distributions are similar to the standard normal distribution in that they are symmetric about a zero mean, while they are slightly wider than the corresponding normal distribution as indicated in Fig. 2.15. However, in terms of probability values represented by areas under the curves as in Example 2.4.11, the differences between the normal and the Student’s t-distributions are large enough to warrant retaining this distinction (Fig. 2.17).
Example 2.4.11 Differences between normal and Student’s t-inferences
Consider Example 2.4.10 where the distribution of the strength of concrete samples tested followed a normal distribution with a mean and standard deviation of μ and σ. The “minimum strength” x, specified as the strength below which only, say, 5% of the cubes are expected to fail, was determined from Table A3 using p(z = -1.645) = 0.05. It was then inferred that the statistical minimum strength would correspond to x = μ - 1.645σ. The Student’s t-distribution allows one to investigate how this value changes when different numbers of samples are taken. The mean and standard deviation now correspond to those of the sample. The critical values are the multipliers of the standard deviation, and the results are assembled in the table below for different numbers of samples tested (found from Table A4 for a single-tailed distribution of 95%). For an infinite number of samples, we get back the critical value found for the normal distribution.

| Number of samples | Degrees of freedom | Critical value |
|---|---|---|
| 5 | 4 | 2.132 |
| 10 | 9 | 1.833 |
| 30 | 29 | 1.699 |
| ∞ | ∞ | 1.645 |
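The critical values in the table above can be reproduced without Table A4. The sketch below is one possible approach, not the method used in the text: it integrates the Student’s t PDF numerically with Simpson’s rule and inverts the resulting CDF by bisection, using only the Python standard library. The function names `t_cdf` and `t_critical` are illustrative.

```python
from math import gamma, pi, sqrt
from statistics import NormalDist


def t_cdf(t, nu, steps=2000):
    """CDF of Student's t with nu degrees of freedom, for t >= 0 (Simpson's rule)."""
    c = gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2))

    def pdf(u):
        return c * (1 + u * u / nu) ** (-(nu + 1) / 2)

    h = t / steps  # steps is even, as Simpson's rule requires
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    # By symmetry, the area to the left of zero is 0.5
    return 0.5 + total * h / 3


def t_critical(p, nu):
    """One-tailed critical value t such that CDF(t) = p, for p > 0.5 (bisection)."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Sample sizes from Example 2.4.11; degrees of freedom = n - 1
critical = {n: t_critical(0.95, n - 1) for n in (5, 10, 30)}
z_limit = NormalDist().inv_cdf(0.95)  # limiting case of an infinite sample

for n, value in critical.items():
    print(f"n = {n:2d}, d.f. = {n - 1:2d}, critical value = {value:.3f}")
print(f"n -> infinity, critical value = {z_limit:.3f}")
```

The multipliers decrease toward the normal value of 1.645 as the sample size grows, which is the trend the example is meant to show.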
(c) Lognormal Distribution
This distribution is appropriate for non-negative skewed data for which the symmetrical normal distribution is no longer appropriate. If a variate X is such that ln(X) is normally distributed, then the distribution of X is said to be lognormal. With ln(X) ranging from -∞ to +∞, X ranges from 0 to +∞. It is characterized by two parameters, the mean and standard deviation (μ, σ), as follows:

$$L(x; \mu, \sigma) = \begin{cases} \dfrac{1}{\sigma x \sqrt{2\pi}} \exp\left[-\dfrac{(\ln x - \mu)^2}{2\sigma^2}\right] & \text{when } x > 0 \\ 0 & \text{elsewhere} \end{cases} \tag{2.43}$$

Note that the two parameters pertain to the logarithmic mean and standard deviation, i.e., to ln(X) and not to the random variable X itself. The lognormal curves are a family of skewed curves as illustrated in Fig. 2.18. Lognormal failure laws apply when the degradation in lifetime of a system is proportional to the current state of degradation. Typical applications include flood frequency in civil engineering; crack growth and mechanical wear in mechanical engineering; failure of electrical transformers or faults in electrical cables in electrical engineering; and pollutants produced by chemical plants and threshold values for drug dosage in environmental engineering. Lognormal distributions are also often used to characterize “fragility curves”, which represent the probability of damage to the built environment (buildings and other infrastructure) due to extreme natural events (hurricanes, earthquakes, etc.). For example, Winkler et al. (2010) adopted topological and terrain-specific (μ, σ) parameters to model failure of electric power transmission lines, sub-stations, and supporting towers/poles during hurricanes.

Fig. 2.18 Lognormal distributions for different mean and standard deviation values (curves L(1,1), L(2,2), and L(3,3); PDF plotted against x)

Example 2.4.12 Using lognormal distributions for pollutant concentrations
Concentration of pollutants produced by chemical plants is often modeled by lognormal distributions and is used to evaluate compliance with government regulations. The concentration of a certain pollutant, in parts per million (ppm), is assumed lognormal with parameters μ = 4.6 and σ = 1.5. What is the probability that the concentration exceeds 10 ppm? One can use Eq. 2.43, or, simpler still, use the z tables (Table A3) by suitable transformation of the random variable:

$$L(X > 10) = 1 - N[\ln(10);\, 4.6,\, 1.5] = 1 - N\left[\frac{\ln(10) - 4.6}{1.5}\right] = 1 - N(-1.531) = 1 - 0.0630 = 0.937$$

■

(d) Gamma Distribution
The gamma distribution (called the Erlang distribution when the shape parameter α is a positive integer) is a good candidate for modeling random phenomena that can only be positive and are unimodal (akin to the lognormal distribution). The gamma distribution is derived from the gamma function for positive values of α, which, one may recall from mathematics, is defined by the integral:

$$\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} e^{-x}\, dx, \qquad \alpha > 0 \tag{2.44a}$$

Integration results in the following expression for non-negative integers k:

$$\Gamma(k + 1) = k! \tag{2.44b}$$

The continuous random variable X has a gamma distribution with positive parameters α and λ if its density function is given by:

$$G(x; \alpha, \lambda) = \begin{cases} \lambda^{\alpha} e^{-\lambda x}\, \dfrac{x^{\alpha - 1}}{(\alpha - 1)!} & x > 0 \\ 0 & \text{elsewhere} \end{cases} \tag{2.44c}$$

where (α - 1)! is to be read as Γ(α) when α is not an integer. The mean and variance of the gamma distribution are:

$$\mu = \frac{\alpha}{\lambda} \quad \text{and} \quad \sigma^2 = \frac{\alpha}{\lambda^2} \tag{2.44d}$$

Variation of the parameter α (called the shape factor) and of λ (whose reciprocal β = 1/λ is the scale parameter) allows a wide variety of shapes to be generated (see Fig. 2.19). From Fig. 2.11, one notes that the gamma distribution is the parent distribution for many other distributions discussed. If α → ∞ and λ = 1, the gamma distribution approaches the normal (see Fig. 2.11). When α = 1, one gets the exponential distribution. When α = ν/2 and λ = 1/2, one gets the chi-square distribution (discussed below).

Fig. 2.19 Gamma distributions for different combinations of the shape parameter α and the scale parameter β = 1/λ (two panels, a and b, with curves labeled G(3,1), G(3,0.33), G(3,0.2), G(0.5,1), and G(1,1); PDF plotted against x)

(e) Exponential Distribution
A special case of the gamma distribution for α = 1 is the exponential distribution. It is the continuous analogue of the geometric distribution, which applies to discrete random variables. Its PDF is given by

$$E(x; \lambda) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases} \tag{2.45a}$$

where λ is the mean value per unit time or distance. The mean and variance of the exponential distribution are:

$$\mu = \frac{1}{\lambda} \quad \text{and} \quad \sigma^2 = \frac{1}{\lambda^2} \tag{2.45b}$$

The distribution is represented by a family of curves for different values of λ (see Fig. 2.20). Exponential failure laws apply to products whose current age does not have much effect on their remaining lifetimes. Hence, this distribution is said to be “memoryless.” It is used to model such processes as the interval between two occurrences, e.g., the distance between consecutive faults in a cable, or the time between chance failures of a component (such as a fuse) or a system, or the time between consecutive emissions of α-particles, or the time between successive arrivals at a service facility.

Fig. 2.20 Exponential distributions for three different values of the parameter λ (curves E(0.5), E(1), and E(2); PDF plotted against x)

The exponential and the Poisson distributions are closely related. While the latter represents the number of failures per unit time, the exponential represents the time between successive failures. In the context of faults in a long cable, if the number of faults per unit length is Poisson distributed, then the cable length between faults is exponentially distributed. Its CDF is given by:

$$CDF[E(a, \lambda)] = \int_0^{a} \lambda e^{-\lambda x}\, dx = 1 - e^{-\lambda a} \tag{2.45c}$$
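The CDF of Eq. 2.45c and the “memoryless” property can be illustrated with a few lines of code. This is a minimal sketch under assumed values: the rate λ = 0.4 per year and the times s and t are illustrative choices, and the function names are not from the text.

```python
import math


def exponential_cdf(a, lam):
    """Eq. 2.45c: probability that the waiting time does not exceed a."""
    return 1.0 - math.exp(-lam * a) if a >= 0 else 0.0


def exponential_survival(a, lam):
    """Probability that the waiting time exceeds a (complement of the CDF)."""
    return math.exp(-lam * a) if a >= 0 else 1.0


lam = 0.4          # assumed rate: 0.4 occurrences per year
p_within_one_year = exponential_cdf(1.0, lam)

# Memoryless property: P(X > s + t | X > s) equals P(X > t)
s, t = 3.0, 2.0    # assumed elapsed time and additional time, in years
conditional = exponential_survival(s + t, lam) / exponential_survival(s, lam)
unconditional = exponential_survival(t, lam)

print(f"P(X <= 1)          = {p_within_one_year:.4f}")
print(f"P(X > 5 | X > 3)   = {conditional:.4f}")
print(f"P(X > 2)           = {unconditional:.4f}")
```

The last two printed values coincide, which is what is meant by saying that the current age has no effect on the remaining lifetime.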
Example 2.4.13
Temporary disruptions to the power grid can occur due to random events such as lightning, transformer failures, forest fires, etc. The exponential distribution is known to be a good function to model the time between such failures. If these occur, on average, say, once every 2.5 years, then λ = 1/2.5 = 0.40 per year.
(a) What is the probability that a disruption occurs within the next year, i.e., that the time X to the next disruption does not exceed 1 year? From Eq. 2.45c:

$$CDF[E(X \le 1; \lambda)] = 1 - e^{-0.4(1)} = 1 - 0.6703 = 0.3297$$

(b) What is the probability that no disruption occurs within the next two years? This is the complement of a disruption occurring within two years:

$$\text{Probability} = 1 - CDF[E(X \le 2; \lambda)] = 1 - \left[1 - e^{-0.4(2)}\right] = 0.4493$$

■

(f) Weibull Distribution
Another widely used distribution is the Weibull distribution, which has been found to be applicable to datasets from a wide variety of systems and natural phenomena. It has been used to model the time of failure or life of a component as well as engine emissions of various pollutants. Moreover, the Weibull distribution has been found to be very appropriate to model reliability of a system, i.e., the failure time of the weakest component of a system (bearing, pipe joint failure, etc.). The continuous random variable X has a Weibull distribution with parameters α and β (shape and scale factors, respectively) if its density function is given by:

$$W(x; \alpha, \beta) = \begin{cases} \dfrac{\alpha}{\beta^{\alpha}}\, x^{\alpha - 1} \exp\left[-(x/\beta)^{\alpha}\right] & \text{for } x \ge 0 \\ 0 & \text{elsewhere} \end{cases} \tag{2.46a}$$

with mean

$$\mu = \beta\, \Gamma\left(1 + \frac{1}{\alpha}\right) \tag{2.46b}$$

Figure 2.21 shows the versatility of this distribution for different sets of α and β values. Also shown is the special case of W(1,1), which is the exponential distribution. For shape factors α > 1, the curves become close to bell-shaped and somewhat resemble the normal distribution. The expression for the CDF is given by:

$$CDF[W(x; \alpha, \beta)] = \begin{cases} 1 - \exp\left[-(x/\beta)^{\alpha}\right] & \text{for } x \ge 0 \\ 0 & \text{elsewhere} \end{cases} \tag{2.46c}$$

Fig. 2.21 Weibull distributions for different values of the two parameters α and β (the shape and scale factors, respectively); two panels, a and b, with curves labeled W(1,1), W(2,1), W(2,5), W(10,0.5), W(10,1), and W(10,2); PDF plotted against x

Example 2.4.14 Modeling wind distributions using the Weibull distribution
The Weibull distribution is widely used to model the hourly variability of wind velocity. The mean wind speed and its distribution on an annual basis, which are affected by local climate conditions, terrain, and height of the tower, are important in order to determine annual power output from a wind turbine of a certain design whose efficiency changes with wind speed. It has been found that the shape factor α varies between 1 and 3 (when α = 2, the distribution is called the Rayleigh distribution). The probability distribution shown in Fig. 2.22 has a mean wind speed of 7 m/s. In this case:
(a) The numerical value of the parameter β assuming the shape factor α = 2 can be calculated from the gamma function Γ(1 + 1/2) = 0.8862, from which β = μ/0.8862 = 7/0.8862 = 7.9.
(b) Using the PDF given by Eq. 2.46a, it is left to the reader to compute the probability density at a wind speed of 10 m/s (and verify the solution against Fig. 2.22, which indicates a value of 0.064). ■

Fig. 2.22 PDF of the Weibull distribution W(2, 7.9) (Example 2.4.14); PDF plotted against x
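The two steps of Example 2.4.14 can be checked numerically. The sketch below evaluates Eqs. 2.46a and 2.46b with the Python standard library; the function name `weibull_pdf` and the variable names are illustrative.

```python
import math


def weibull_pdf(x, alpha, beta):
    """Eq. 2.46a with shape factor alpha and scale factor beta."""
    if x < 0:
        return 0.0
    return (alpha / beta**alpha) * x ** (alpha - 1) * math.exp(-((x / beta) ** alpha))


alpha = 2.0            # Rayleigh case assumed in the example
mean_speed = 7.0       # mean wind speed in m/s

# (a) Invert Eq. 2.46b: beta = mu / Gamma(1 + 1/alpha)
gamma_term = math.gamma(1.0 + 1.0 / alpha)
beta = mean_speed / gamma_term

# (b) Probability density at a wind speed of 10 m/s
density_at_10 = weibull_pdf(10.0, alpha, beta)

print(f"Gamma(1 + 1/alpha) = {gamma_term:.4f}")
print(f"beta               = {beta:.2f} m/s")
print(f"W(10; alpha, beta) = {density_at_10:.4f}")
```

The density at 10 m/s comes out close to the value of about 0.064 read from Fig. 2.22; small differences are expected because the figure is read graphically and β is rounded to 7.9 in the text.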
(g) Chi-square Distribution
A third special case of the gamma distribution is when α = ν/2 and λ = 1/2, where ν is a positive integer called the degrees of freedom. This distribution, called the chi-square (χ²) distribution, plays an important role in inferential statistics where it is used as a test of significance for hypothesis testing and analysis of variance type of problems. It is based on the standard normal distribution with mean of 0 and standard deviation of 1. Just like the t-statistic, there is a family of distributions for different values of ν (Fig. 2.23). The somewhat complicated PDF is given by Eq. 2.47a, but its usefulness lies in the determination of a range of values from this distribution; more specifically, it provides the probability of observing a value of χ² from 0 to a specified value (Wolberg, 2006). Note that the distribution cannot assume negative values, and that it is positively skewed. Table A5 assembles critical values of the chi-square distribution for different values of the degrees of freedom parameter ν and for different significance levels. The usefulness of these tables will be discussed in Sect. 4.2. The PDF of the chi-square distribution is:

$$\chi^2(x; \nu) = \begin{cases} \dfrac{1}{2^{\nu/2}\, \Gamma(\nu/2)}\, x^{\nu/2 - 1} e^{-x/2} & x > 0 \\ 0 & \text{elsewhere} \end{cases} \tag{2.47a}$$

while the mean and variance values are:

$$\mu = \nu \quad \text{and} \quad \sigma^2 = 2\nu \tag{2.47b}$$

Fig. 2.23 Chi-square distributions for different values of the variable ν denoting the degrees of freedom (curves χ²(1), χ²(4), and χ²(6); PDF plotted against x)

(h) F-Distribution
While the t-distribution allows comparison between two sample means, the F-distribution allows comparison between two or more sample variances. It is defined as the ratio of two independent chi-square random variables, each divided by its degrees of freedom. The F-distribution is also represented by a family of plots (Fig. 2.24) where each plot is specific to a set of numbers representing the degrees of freedom of the two random variables (ν1, ν2). Table A6 assembles critical values of the F-distributions for different combinations of these two parameters, and its use will be discussed in Sect. 4.2.

Fig. 2.24 Typical F-distributions for two different combinations of the degrees of freedom (ν1 and ν2) of the two random variables (curves F(6,24) and F(6,5); PDF plotted against x)

(i) Uniform Distribution
The uniform probability distribution is the simplest of all PDFs and applies to both continuous and discrete data whose outcomes are all equally likely, i.e., have equal probabilities. Flipping a coin for heads/tails or rolling a six-sided die for getting numbers between 1 and 6 are examples that come readily to mind. The probability function for the discrete case where X can assume values x1, x2, ..., xk is given by:

$$U(x; k) = \frac{1}{k} \tag{2.48a}$$

with mean and variance

$$\mu = \frac{\sum_{i=1}^{k} x_i}{k} \quad \text{and} \quad \sigma^2 = \frac{\sum_{i=1}^{k} (x_i - \mu)^2}{k} \tag{2.48b}$$

For random variables that are continuous over an interval (c, d) as shown in Fig. 2.25, the PDF is given by:

$$U(x) = \begin{cases} \dfrac{1}{d - c} & \text{when } c < x < d \\ 0 & \text{otherwise} \end{cases} \tag{2.48c}$$

Fig. 2.25 The uniform distribution assumed continuous over the interval [c, d]

The mean and variance of the uniform distribution (using notation shown in Fig. 2.25) are given by:

$$\mu = \frac{c + d}{2} \quad \text{and} \quad \sigma^2 = \frac{(d - c)^2}{12} \tag{2.48d}$$

The probability of random variable X being between, say, x1 and x2 is:

$$U(x_1 \le X \le x_2) = \frac{x_2 - x_1}{d - c} \tag{2.48e}$$

Example 2.4.15
A random variable X has a uniform distribution with c = -5 and d = 10 (Fig. 2.25).
(a) On average, what proportion will have a negative value? (Answer: 1/3)
(b) On average, what proportion will fall between -2 and 2? (Answer: 4/15) ■

(j) Beta Distribution
Another versatile distribution is the Beta distribution, which is appropriate for continuous random variables between 0 and 1 (such as those representing proportions). It is a two-parameter model that is given by:

$$\text{Beta}(x; p, q) = \frac{(p + q - 1)!}{(p - 1)!\,(q - 1)!}\, x^{p - 1} (1 - x)^{q - 1} \tag{2.49a}$$

Depending on the values of p and q, one can model a wide variety of curves from u-shaped ones to skewed distributions (Fig. 2.26). The distributions are symmetrical when p and q are equal, with the curves becoming peakier as the numerical values of the two parameters increase. Skewed distributions are obtained when the parameters are unequal. The mean and variance of the Beta distribution are:

$$\mu = \frac{p}{p + q} \quad \text{and} \quad \sigma^2 = \frac{pq}{(p + q)^2 (p + q + 1)} \tag{2.49b}$$

Fig. 2.26 Various shapes assumed by the Beta distribution depending on the values of the two model parameters

This distribution originates from the Binomial distribution, and one can detect the obvious similarity of a two-outcome affair with specified probabilities. The usefulness of this distribution will become apparent in Sect. 2.5.3, dealing with the Bayesian approach to problems involving continuous probability distributions.
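A quick numerical check helps confirm that a Beta density is properly normalized and that its mean and variance follow the standard closed-form results p/(p + q) and pq/[(p + q)²(p + q + 1)]. The sketch below assumes the standard normalizing constant Γ(p + q)/[Γ(p)Γ(q)], which equals (p + q - 1)!/[(p - 1)!(q - 1)!] for integer parameters; the parameter values p = 2 and q = 5 and the midpoint-rule integrator are illustrative choices.

```python
import math


def beta_pdf(x, p, q):
    """Beta density on (0, 1) with the standard gamma-function normalization."""
    norm = math.gamma(p + q) / (math.gamma(p) * math.gamma(q))
    return norm * x ** (p - 1) * (1 - x) ** (q - 1)


def integrate(f, a, b, n=20000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


p, q = 2, 5  # assumed illustrative parameters (skewed case, p != q)

area = integrate(lambda x: beta_pdf(x, p, q), 0.0, 1.0)
mean = integrate(lambda x: x * beta_pdf(x, p, q), 0.0, 1.0)
variance = integrate(lambda x: (x - mean) ** 2 * beta_pdf(x, p, q), 0.0, 1.0)

print(f"area under PDF = {area:.6f}")
print(f"mean           = {mean:.6f}  (closed form: {p / (p + q):.6f})")
print(f"variance       = {variance:.6f}  "
      f"(closed form: {p * q / ((p + q) ** 2 * (p + q + 1)):.6f})")
```

The same functions can be reused with equal parameters (for example p = q = 3) to see the symmetric case described above.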
2.5 Bayesian Probability

2.5.1 Bayes’ Theorem

It was stated in Sect. 2.1.2 that the Bayesian viewpoint can enhance the usefulness of the classical frequentist notion of probability.³ Its strength lies in the fact that it provides a framework to include prior information in a two-stage (or multistage) experiment. If one substitutes the term p(A) in Eq. 2.10 by that given by Eq. 2.9 (known as Bayes’ Rule), one gets:

$$p(B/A) = \frac{p(A \cap B)}{p(A \cap B) + p(A \cap \bar{B})} \tag{2.50}$$

Also, one can re-arrange Eq. 2.10 into: p(A ∩ B) = p(A) p(B/A) = p(B) p(A/B). This allows expressing Eq. 2.50 into the following expression referred to as the law of total probability or Bayes’ theorem:

$$p(B/A) = \frac{p(A/B)\, p(B)}{p(A/B)\, p(B) + p(A/\bar{B})\, p(\bar{B})} \tag{2.51}$$

Bayes’ theorem, superficially, appears to be simply a restatement of the conditional probability equation given by Eq. 2.10. The question is why is this reformulation so insightful or advantageous? First, the probability is now re-expressed in terms of its disjoint parts B and B̄, and second, the probabilities have been “flipped,” i.e., p(B/A) is now expressed in terms of p(A/B). Consider the two events A and B. If event A is observed while event B is not, this expression allows one to infer the “flip” probability, i.e., probability of occurrence of B from that of the observed event A. In Bayesian terminology, Eq. 2.51 can be written as:

$$\text{Posterior probability of event } B \text{ given event } A = \frac{(\text{Likelihood of } A \text{ given } B)\,(\text{Prior probability of } B)}{\text{Prior probability of } A} \tag{2.52}$$

Thus, the probability p(B) is called the prior probability (or unconditional probability) since it represents the opinion before any data was collected, while p(B/A) is said to be the posterior probability, which is reflective of the opinion revised in light of new data. The term “likelihood” is synonymous with “the conditional probability” of A given B, i.e., p(A/B).

Equation 2.51 applies to the case when only one of two events is possible. It can be extended to the case of more than two events that partition the space S. Consider the case where one has n events, B1, ..., Bn, which are disjoint and make up the entire sample space. Figure 2.27 shows a sample space of four events. Then, the law of total probability states that the probability of an event A is the sum of its disjoint parts:

$$p(A) = \sum_{j=1}^{n} p(A \cap B_j) = \sum_{j=1}^{n} p(A/B_j)\, p(B_j) \tag{2.53}$$

Then

$$p(B_i/A) = \frac{p(A \cap B_i)}{p(A)} = \frac{p(A/B_i)\, p(B_i)}{\sum_{j=1}^{n} p(A/B_j)\, p(B_j)} \tag{2.54}$$

where p(Bi/A) is the posterior probability, p(A/Bi) is the likelihood, and p(Bi) is the prior.

This expression is known as Bayes’ theorem for multiple events. To restate, the marginal or prior probabilities p(Bi) for i = 1, ..., n are assumed to be known in advance, and the intention is to update or revise our “belief” on the basis of the observed evidence of event A having occurred. This is captured by the probability p(Bi/A) for i = 1, ..., n called the posterior probability. This is the weight one can attach to each event Bi after event A is known to have occurred.

³ There are several texts that deal only with Bayesian statistics; for example, Bolstad (2004).

Fig. 2.27 Bayes’ theorem for multiple events depicted on a Venn diagram. In this case, the sample space S is assumed to be partitioned into four discrete events B1, ..., B4. If an observable event A shown by the circle has already occurred, the conditional probability of B3 is p(B3/A) = p(B3 ∩ A)/p(A). This is the ratio of the hatched area to the total area inside the ellipse
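Equations 2.53 and 2.54 translate directly into a few lines of code. The following is a minimal sketch; the function name `posterior` and the three-event priors and likelihoods are hypothetical values chosen only for illustration.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem for multiple events (Eq. 2.54).

    priors[j]      : p(B_j), assumed disjoint and summing to 1
    likelihoods[j] : p(A / B_j)
    returns        : list of p(B_j / A)
    """
    # Joint probabilities p(A and B_j) = p(A / B_j) p(B_j)
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Law of total probability (Eq. 2.53)
    p_a = sum(joint)
    return [j / p_a for j in joint]


# Hypothetical three-event partition of the sample space
priors = [0.5, 0.3, 0.2]
likelihoods = [0.1, 0.2, 0.4]

post = posterior(priors, likelihoods)
for i, value in enumerate(post, start=1):
    print(f"p(B{i}/A) = {value:.3f}")
```

Although B3 has the smallest prior in this made-up case, it ends up with the largest posterior because event A is much more likely under B3 than under the other two events.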
Example 2.5.1
Consider the two-stage experiment of Example 2.2.9 with six marbles of three colors in two boxes. Assume that the experiment has been performed and that a red marble has been obtained. One can use the information known beforehand, i.e., the prior probabilities of R, W, and G, to determine which box the marble came from. Note that the probability of the red marble having come from Box A, represented by p(A/R), is now the conditional probability of the “flip” problem. This is called the posterior probability of event A with event R having occurred. Thus, from the law of total probability, the posterior conditional probabilities (Eq. 2.51) are:
- For the red marble to be from Box B:

$$p(B/R) = \frac{p(R/B)\, p(B)}{p(R/B)\, p(B) + p(R/\bar{B})\, p(\bar{B})} = \frac{\left(\tfrac{1}{2}\right)\left(\tfrac{3}{4}\right)}{\left(\tfrac{1}{2}\right)\left(\tfrac{3}{4}\right) + \left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)} = \frac{3}{5}$$

- For the red marble to be from Box A:

$$p(A/R) = \frac{\left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)}{\left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right) + \left(\tfrac{1}{2}\right)\left(\tfrac{3}{4}\right)} = \frac{2}{5}$$

The reverse probability tree for this experiment is shown in Fig. 2.28. The reader is urged to compare this with the forward tree diagram of Example 2.2.9. The probabilities of 1.0 for both W and G outcomes imply that there is no uncertainty at all in predicting where the marble came from. This is obvious since only Box A contains W, and only Box B contains G. However, for the red marble, one cannot be sure of its origin, and this is where a probability measure must be determined. ■

Fig. 2.28 The probabilities of the reverse tree diagram at each stage are indicated. If a red marble (R) is picked, the probabilities that it came from either Box A or Box B are 2/5 and 3/5, respectively (Example 2.5.1)

Example 2.5.2 Forward and reverse probability trees for fault detection of equipment
A large piece of equipment is being continuously monitored by an add-on fault detection system developed by another vendor in order to detect faulty operation. The vendor of the fault detection system states that their product correctly identifies faulty operation when indeed it is faulty (this is referred to as sensitivity) 90% of the time. This implies that there is a probability p = 0.10 of a “false negative” occurring (i.e., a missed opportunity of signaling a fault). Also, the vendor quoted that the correct status prediction rate or specificity of the detection system (i.e., system identified as healthy when indeed it is so) is 0.95, implying that the “false positive” or false alarm rate is 0.05. Finally, historic data seem to indicate that the large piece of equipment tends to develop faults only 1% of the time.

Figure 2.29 shows how this problem can be systematically represented by a forward tree diagram. State A is the fault-free state and state B is the faulty state. Further, each of these states can have two outcomes as shown. While outcomes A1 and B1 represent, respectively, correctly identified fault-free and faulty operations, the other two outcomes are errors arising from an imperfect fault detection system. Outcome A2 is the false positive event (or false alarm, or type I error in the usual convention where the null hypothesis is fault-free operation; error types will be discussed at length in Sect. 4.2), while outcome B2 is the false negative event (or missed opportunity, or type II error). The figure clearly illustrates that the probabilities of A and B occurring, along with the conditional probabilities p(A1/A) = 0.95 and p(B1/B) = 0.90, result in the probabilities of each of the four states as shown in the figure.

Fig. 2.29 The forward tree diagram showing the four events that may result when monitoring the performance of a piece of equipment (Example 2.5.2)

The reverse tree situation, shown in Fig. 2.30, corresponds to the following situation. A fault has been signaled. What is the probability that this is a false alarm? Using Eq. 2.51:

$$p(A/A2) = \frac{p(A2/A)\, p(A)}{p(A2/A)\, p(A) + p(B1/B)\, p(B)} = \frac{(0.05)(0.99)}{(0.05)(0.99) + (0.90)(0.01)} = \frac{0.0495}{0.0495 + 0.009} = 0.846$$

Fig. 2.30 Reverse tree diagram depicting two possibilities. If an alarm sounds, it could be either an erroneous one (outcome A from A2) or a valid one (B from B1). Further, if no alarm sounds, there is still the possibility of missed opportunity (outcome B from B2). The probability that it is a false alarm is 0.846, which is too high to be acceptable in practice. How to decrease this is discussed in the text

Working backward using the forward tree diagram (Fig. 2.29) allows one to visually understand the basis of the quantities appearing in the expression above. The value of 0.846 for the false alarm probability is very high for practical situations and could well result in the operator disabling the fault detection system altogether. One way of reducing this false alarm rate, and thereby enhancing robustness, is to increase the specificity of the detection device from its current 95% to something higher by altering the detection threshold. This would result in a higher missed opportunity rate, which one must accept as the price of reduced false alarms. For example, the current missed opportunity rate is:

$$p(B/B2) = \frac{p(B2/B)\, p(B)}{p(B2/B)\, p(B) + p(A1/A)\, p(A)} = \frac{(0.10)(0.01)}{(0.10)(0.01) + (0.95)(0.99)} = \frac{0.001}{0.001 + 0.9405} = 0.001$$

This is probably lower than what is needed, and so the above suggested remedy is one that can be considered. A practical way to reduce the false alarm rate is not to take action when a single alarm is sounded but to do so when several faults are flagged. Such procedures are adopted by industrial process engineers using control chart techniques (see Sect. 8.7). Note that as the piece of machinery degrades, the percent of time when faults are likely to develop will increase from the current 1% to something higher. This will have the effect of lowering the false alarm rate (it is left to the reader to work out why). ■

Bayesian statistics provide the formal manner by which prior opinion expressed as probabilities can be revised in the light of new information (from additional data collected) to yield posterior probabilities. When combined with the relative consequences or costs of being right or wrong, it allows one to address decision-making problems as pointed out in the example above. It has had some success in engineering (as well as in social sciences) where subjective judgment, often referred to as intuition or experience gained in the field, is relied upon heavily. Bayes’ theorem is a consequence of the probability laws and is accepted by all statisticians. It is the interpretation of probability that is controversial. The two approaches differ in how probability is defined:
• Classical viewpoint: long-run relative frequency of an event.
• Bayesian viewpoint: degree of belief held by a person about some hypothesis, event, or uncertain quantity (Phillips 1973).

Advocates of the classical approach argue that human judgment is fallible while dealing with complex situations, and this was the reason why formal statistical procedures were developed in the first place. Introducing the vagueness of human judgment as done in Bayesian statistics would dilute the “purity” of the entire mathematical approach. Advocates of the Bayesian approach, on the other hand, argue that the “personalist” definition of probability should not be interpreted as the “subjective” view. Granted, the prior probability varies from one individual to another based on their own experience, but with additional data collection all these views get progressively closer. Thus, with enough data, the initial divergent opinions would become indistinguishable. Hence, they argue, the Bayesian method brings consistency to informal thinking when complemented with collected data, and should, thus, be viewed as a mathematically valid approach.
2.5.2 Application to Discrete Probability Variables

The following examples illustrate how the Bayesian approach can be applied to discrete data.

Example 2.5.3⁴
Consider a machine whose prior PDF of the proportion X of defectives is given by Table 2.7. If a random sample of size 2 is selected, and one defective is found, the Bayes’ estimate of the proportion of defectives produced by the machine is determined as follows.

Table 2.7 Prior PDF of proportion of defectives (x) (Example 2.5.3)

| x | 0.1 | 0.2 |
|---|---|---|
| f(x) | 0.6 | 0.4 |

Let y be the number of defectives in the sample. The probability that the random sample of size 2 yields one defective is given by the Binomial distribution since this is a two-outcome situation:

$$f(y/x) = B(y; n, x) = \binom{2}{y} x^{y} (1 - x)^{2 - y}; \qquad y = 0, 1, 2$$

If x = 0.1, then

$$f(1/0.1) = B(1; 2, 0.1) = \binom{2}{1} (0.1)^{1} (0.9)^{2 - 1} = 0.18$$

Similarly, for x = 0.2, f(1/0.2) = 0.32.

Thus, the total probability of finding one defective in a sample size of 2 is:

f(y = 1) = (0.18)(0.6) + (0.32)(0.40) = (0.108) + (0.128) = 0.236

The posterior probability f(x/y = 1) is then given:
• for x = 0.1: 0.108/0.236 = 0.458
• for x = 0.2: 0.128/0.236 = 0.542

Finally, the Bayes’ estimate of the proportion of defectives x is:

x = (0.1)(0.458) + (0.2)(0.542) = 0.1542

which is quite different from the value of 0.5 given by the classical method. ■

⁴ From Walpole et al. (2007), by permission of Pearson Education.
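The steps of Example 2.5.3 (binomial likelihood, total probability, posterior, and Bayes’ estimate) can be sketched in code as follows. The variable and function names are illustrative; only the prior of Table 2.7 and the sample outcome (one defective in a sample of two) come from the example.

```python
from math import comb


def binomial_pmf(y, n, x):
    """Probability of y defectives in a sample of n when the defective proportion is x."""
    return comb(n, y) * x**y * (1 - x) ** (n - y)


# Prior PDF of the proportion of defectives (Table 2.7)
prior = {0.1: 0.6, 0.2: 0.4}
n, y = 2, 1  # sample size and number of defectives observed

# Likelihood times prior for each candidate proportion
joint = {x: binomial_pmf(y, n, x) * p for x, p in prior.items()}

# Total probability of observing one defective in a sample of two
marginal = sum(joint.values())

# Posterior probabilities and the resulting Bayes' estimate
posterior = {x: j / marginal for x, j in joint.items()}
bayes_estimate = sum(x * p for x, p in posterior.items())

print(f"f(y = 1)       = {marginal:.3f}")
for x, p in posterior.items():
    print(f"f(x = {x} / y = 1) = {p:.3f}")
print(f"Bayes estimate = {bayes_estimate:.4f}")
```

Changing `n` and `y` shows how a larger sample shifts the posterior away from the prior and toward the observed proportion of defectives.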
Example 2.5.4⁵ Using the Bayesian approach to enhance value of concrete piles testing
Concrete piles driven in the ground are used to provide bearing strength to the foundation of a structure (building, bridge, etc.). Hundreds of such piles are used in large construction projects. These piles should not develop defects such as cracks or voids in the concrete, which would lower compressive strength. Tests are performed by engineers on piles selected at random during the concrete pour process in order to assess overall foundation strength. Let the random discrete variable be the proportion of defective piles out of the entire lot, which is taken to assume five discrete values as shown in the first column of Table 2.8. Consider the case where the prior experience of an engineer as to the proportion of defective piles from similar sites is given in the second column of the table. Before any testing is done, the expected value of the probability of finding one pile to be defective is:
$$p = (0.2)(0.30) + (0.4)(0.40) + (0.6)(0.15) + (0.8)(0.10) + (1.0)(0.05) = 0.44$$
(as shown in the last row under the second column). This is the prior probability. Had he drawn a conclusion on just a single test that turns out to be defective, without using his prior judgment, he would have concluded that all the piles were defective; clearly, an overstatement. Suppose the first pile tested is found to be defective. How should the engineer revise his prior probability of the proportion of piles likely to be defective? Bayes’ theorem (Eq. 2.51) can be used. For proportion x = 0.2, the posterior conditional probability is:
$$p(x = 0.2) = \frac{(0.2)(0.3)}{(0.2)(0.3) + (0.4)(0.4) + (0.6)(0.15) + (0.8)(0.10) + (1.0)(0.05)} = \frac{0.06}{0.44} = 0.136$$
This is the value that appears in the first row under the third column. Similarly, the posterior probabilities for the other values of x can be determined, from which the expected value is found to be 0.55. Hence, a single inspection has led to the engineer revising his prior opinion upward from 0.44 to 0.55. The engineer would probably get a second pile tested, and if it also turns out to be defective, the associated probabilities are shown in the fourth column of Table 2.8. For example, for x = 0.2:
$$p(x = 0.2) = \frac{(0.2)(0.136)}{(0.2)(0.136) + (0.4)(0.364) + (0.6)(0.204) + (0.8)(0.182) + (1.0)(0.114)} = \frac{0.0272}{0.55} = 0.049$$
⁵ From Ang and Tang (2007), by permission of John Wiley and Sons.
The expected value in case the two piles tested turn out to be defective increases to 0.66. In the limit, if each successive pile tested turns out to be defective, one gets back the classical distribution, listed in the last column of the table. The progression of the PDF from the prior to the infinite case is illustrated in Fig. 2.31. Note that as more piles tested turn out to be defective, the evidence from the data gradually overwhelms the prior judgment of the engineer. However, it is only when collecting data is so expensive or time-consuming that decisions must be made from limited data that the power of the Bayesian approach becomes evident. Of course, if one engineer’s prior judgment is worse than that of another engineer, then his conclusion from the same data would be poorer than that of the other engineer. It is this type of subjective disparity that antagonists of the Bayesian approach are uncomfortable with. On the other hand, proponents of the Bayesian approach would argue that experience (even if intangible) gained in the field is a critical asset in engineering applications and that discarding this type of heuristic knowledge entirely is naïve and short-sighted. ■
There are instances when no previous knowledge or information is available about the behavior of the random variable; this is sometimes referred to as a prior of pure ignorance. It can be shown that this assumption of the prior leads to results identical to those of the traditional probability approach (see Example 2.5.5).

Table 2.8 Illustration of how a prior PDF is revised with new data (Example 2.5.4). All columns except the first give the probability of being defective.

| Proportion of defectives (x) | Prior PDF of defectives | After one pile tested is found defective | After two piles tested are found defective | ... | Limiting case of infinite defectives |
|---|---|---|---|---|---|
| 0.2 | 0.30 | 0.136 | 0.049 | ... | 0.0 |
| 0.4 | 0.40 | 0.364 | 0.262 | ... | 0.0 |
| 0.6 | 0.15 | 0.204 | 0.221 | ... | 0.0 |
| 0.8 | 0.10 | 0.182 | 0.262 | ... | 0.0 |
| 1.0 | 0.05 | 0.114 | 0.205 | ... | 1.0 |
| Expected probability of defective piles | 0.44 | 0.55 | 0.66 | ... | 1.0 |

Fig. 2.31 Illustration of how the prior discrete PDF is affected by data collection following Bayes’ theorem (Example 2.5.4)
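The sequential revision of the prior in Example 2.5.4 can be reproduced numerically. The following is a minimal Python sketch and not part of the original text; the function names `update_on_defective` and `expected_value` are illustrative. As in the example, it takes the likelihood of a tested pile being defective, given a proportion x of defectives, to be x itself, and applies Bayes’ theorem once per tested pile.

```python
# Discrete Bayesian updating for Example 2.5.4 (illustrative sketch).
x_values = [0.2, 0.4, 0.6, 0.8, 1.0]      # candidate proportions of defectives
prior = [0.30, 0.40, 0.15, 0.10, 0.05]    # engineer's prior PDF (Table 2.8)


def update_on_defective(x_values, pdf):
    """Return the posterior PDF after one more pile is found defective.

    The likelihood of a defective pile, given proportion x, is x itself.
    """
    unnormalized = [x * p for x, p in zip(x_values, pdf)]
    total = sum(unnormalized)             # equals the expected value before the test
    return [u / total for u in unnormalized]


def expected_value(x_values, pdf):
    """Expected proportion of defective piles under the given PDF."""
    return sum(x * p for x, p in zip(x_values, pdf))


after_one = update_on_defective(x_values, prior)
after_two = update_on_defective(x_values, after_one)

for label, pdf in [("prior", prior),
                   ("one defective", after_one),
                   ("two defective", after_two)]:
    print(label, [round(p, 3) for p in pdf],
          round(expected_value(x_values, pdf), 2))
```

The printed values should agree with the columns of Table 2.8 to within rounding in the third decimal. Calling `update_on_defective` repeatedly moves the probability mass toward x = 1.0, which is the limiting case listed in the last column.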
2.5.3 Application to Continuous Probability Variables
Bayes’ theorem can also be extended to the case of continuous random variables (Ang and Tang 2007). Let X be the random variable with a prior PDF denoted by p(x). Though any appropriate distribution can be chosen, the Beta distribution is particularly convenient,⁶ and is widely used to characterize the prior PDF. Another commonly used prior is the uniform distribution, called a diffuse prior. For consistency with convention, a slightly different nomenclature than that of Eq. 2.51 is adopted. Assume that the Beta distribution (Eq. 2.49a) can be rewritten to yield the prior:
$$p(x) \propto x^{a} (1 - x)^{b} \tag{2.55}$$
Recall that the higher the values of the exponents a and b, the peakier the distribution, which indicates that the prior distribution is relatively well defined. Let L(x) represent the conditional probability or likelihood function of observing y “successes” out of n observations. Then, the posterior probability is given by:
$$f(x/y) \propto L(x)\, p(x) \tag{2.56}$$
In the context of Fig. 2.27, the likelihood of the unobservable events B1, ..., Bn is the conditional probability that A has occurred given Bi for i = 1, ..., n, or p(A/Bi). The likelihood function can be gleaned from probability considerations in many cases. Consider Example 2.5.4 involving testing the foundation piles of buildings. The Binomial distribution gives the probability of x failures in n independent Bernoulli trials, provided the probability of failure in any one trial is p. This applies to the case when one holds p constant and studies the behavior of the PDF of defectives x. If instead, one holds x constant and lets p vary over its possible values, one gets the likelihood function. Suppose n piles are tested and y piles are found to be defective or sub-par. In this case, the likelihood function is written as follows for the Binomial PDF:
$$L(x) = \binom{n}{y} x^{y} (1 - x)^{n - y}, \qquad 0 \le x \le 1 \tag{2.57}$$
⁶ Because of the corresponding mathematical simplicity that it provides as well as the ability to capture a wide variety of PDF shapes.
Notice that the Beta distribution has the same form as the likelihood function. Consequently, the posterior distribution given by Eq. 2.56 assumes the form:
$$f(x/y) = k\, x^{a + y} (1 - x)^{b + n - y} \tag{2.58}$$
where k is independent of x and is a normalization constant introduced to satisfy the probability law that the area under the PDF is unity. What is interesting is that the information contained in the prior has the net result of “artificially” augmenting the number of observations taken. While the classical approach would use the likelihood function with exponents y and (n - y) (see Eq. 2.57), these are inflated to (a + y) and (b + n - y) in Eq. 2.58 for the posterior distribution. This is akin to having taken more observations and supports the previous statement that the Bayesian approach is particularly advantageous when the number of observations is low. The examples below illustrate the use of Eq. 2.58.

Fig. 2.32 Probability distributions of the prior, likelihood function, and the posterior (Example 2.5.5). (From Ang and Tang 2007, by permission of John Wiley and Sons)

Example 2.5.5 Let us consider the same situation as that treated in Example 2.5.4 for the concrete pile testing situation. However, the proportion of defectives X is now a continuous random variable for which no prior distribution can be assigned. This implies that the engineer has no prior information, and, in such cases, a uniform distribution (or a diffuse prior) is assumed:
$$p(x) = 1.0 \quad \text{for } 0 \le x \le 1$$
The likelihood function for the case of the single tested pile turning out to be defective is x, i.e., L(x) = x. From Eq. 2.58, the posterior distribution is then:
$$f(x/y) = k\, x\, (1.0)$$
The normalizing constant is:
$$k = \left[ \int_0^1 x\, dx \right]^{-1} = 2$$
Hence, the posterior probability distribution is:
$$f(x/y) = 2x \quad \text{for } 0 \le x \le 1$$
The Bayesian estimate of the proportion of defectives, when one pile is tested and it turns out to be defective, is then:
$$p = E(x/y) = \int_0^1 x\, (2x)\, dx = 0.667$$
■
Example 2.5.6⁷ Enhancing historical records of wind velocity using the Bayesian approach
Buildings are designed to withstand a maximum wind speed, which depends on the location. The probability x that the wind speed will not exceed 120 km/h more than once in 5 years is to be determined. Past records of wind speeds of a nearby location indicated that the following Beta distribution would be an acceptable prior for the probability distribution (Eq. 2.49a):
$$p(x) = 20\, x^{3} (1 - x) \quad \text{for } 0 \le x \le 1$$
Further, the likelihood that the annual maximum wind speed will exceed 120 km/h in 1 out of 5 years is given by the Binomial distribution as:
$$L(x) = \binom{5}{4} x^{4} (1 - x) = 5\, x^{4} (1 - x)$$
Hence, the posterior probability is deduced following Eq. 2.58:
$$f(x/y) = k \left[ 5\, x^{4} (1 - x) \right] \left[ 20\, x^{3} (1 - x) \right] = 100\, k\, x^{7} (1 - x)^{2}$$
where the constant k can be found from the normalization criterion:
$$k = \left[ \int_0^1 100\, x^{7} (1 - x)^{2}\, dx \right]^{-1} = 3.6$$
Finally, the posterior PDF is given by:
$$f(x/y) = 360\, x^{7} (1 - x)^{2} \quad \text{for } 0 \le x \le 1$$
Plots of the prior, likelihood, and the posterior functions are shown in Fig. 2.32. Notice how the posterior distribution has become more peaked, reflecting the fact that the single test data has provided the analyst with more information than that contained in either the prior or the likelihood function. ■
⁷ From Ang and Tang (2007), by permission of John Wiley and Sons.
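The normalization constants in Examples 2.5.5 and 2.5.6 can be checked with exact arithmetic. The sketch below is an illustration added here, not the authors' procedure; the helper names `beta_integral` and `posterior_summary` are hypothetical. It relies on the standard result that, for non-negative integers a and b, the integral of x^a (1 - x)^b over [0, 1] equals a! b! / (a + b + 1)!, and it works with the kernel of Eq. 2.58 so that constant factors such as the binomial coefficient drop out.

```python
from fractions import Fraction
from math import factorial


def beta_integral(a, b):
    """Integral of x**a * (1 - x)**b over [0, 1] for non-negative integers a, b."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))


def posterior_summary(a, b, y, n):
    """Posterior for a prior kernel x**a (1-x)**b and y 'successes' in n trials.

    Returns the posterior exponents (a + y, b + n - y) of Eq. 2.58, the overall
    coefficient multiplying x**(a+y) * (1-x)**(b+n-y), and the posterior mean.
    """
    pa, pb = a + y, b + n - y
    coefficient = 1 / beta_integral(pa, pb)          # makes the area equal to 1
    mean = beta_integral(pa + 1, pb) * coefficient   # E(x/y)
    return pa, pb, coefficient, mean


# Example 2.5.5: uniform prior (a = b = 0), one pile tested and found defective.
print(posterior_summary(0, 0, 1, 1))

# Example 2.5.6: prior 20 x^3 (1 - x), likelihood 5 x^4 (1 - x).
# The coefficient returned here corresponds to 100k in the text.
print(posterior_summary(3, 1, 4, 5))
```

For Example 2.5.5 the coefficient is 2 and the mean is 2/3, matching f(x/y) = 2x and the estimate 0.667. For Example 2.5.6 the coefficient is 360, consistent with k = 3.6.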
2.6 Three Kinds of Probabilities
The previous sections in this chapter presented basic notions of classical probability and how the Bayesian viewpoint is appropriate for certain types of problems. Both these viewpoints are still associated with the concept of probability as the relative frequency of an occurrence. In a broader context, one should distinguish between three kinds of probabilities:

(i) Objective or absolute probability, which is the classical measure interpreted as the “long-run frequency” of the outcome of an event. It is an informed estimate of an event that in its simplest form is a constant; for example, historical records yield the probability of a flood occurring this year or of the infant mortality rate in the United States. It would be unchanged for all individuals since it is empirical, having been deduced from historic records. Table 2.9 assembles probability estimates for the occurrence of natural disasters with 10 and 1000 fatalities per event (indicative of the severity level) during different time spans (1, 10, and 20 years). Note that floods and tornadoes have relatively small return times for small events while earthquakes and hurricanes have relatively short times for large events. Such probability considerations can be determined at a finer geographical scale, and these play a key role in the development of codes and standards for designing large infrastructures (such as dams) as well as small systems (such as residential buildings). Note that the probabilities do not add up to one since one cannot define the entire population of possible events.

Table 2.9 Estimates of absolute probabilities for different natural disasters in the United States, for different exposure times. (Adapted from Barton and Nishenko 2008)

| Disaster | 10 fatalities per event: 1 year | 10 fatalities per event: 10 years | 10 fatalities per event: 20 years | 10 fatalities per event: return time (years) | 1000 fatalities per event: 1 year | 1000 fatalities per event: 10 years | 1000 fatalities per event: 20 years | 1000 fatalities per event: return time (years) |
|---|---|---|---|---|---|---|---|---|
| Earthquakes | 0.11 | 0.67 | 0.89 | 9 | 0.01 | 0.14 | 0.26 | 67 |
| Hurricanes | 0.39 | 0.99 | >0.99 | 2 | 0.06 | 0.46 | 0.71 | 16 |
| Floods | 0.86 | >0.99 | >0.99 | 0.5 | 0.004 | 0.04 | 0.08 | 250 |
| Tornadoes | 0.96 | >0.99 | >0.99 | 0.3 | 0.006 | 0.06 | 0.11 | 167 |

(ii) Relative probability, where the chance of occurrence of one event is stated in terms of another. This is a way of comparing the effect or outcomes of different types of adverse events happening on a system or on a population when the absolute probabilities are difficult to quantify. For example, the relative risk for lung cancer is (approximately) 10 if a person has smoked before, compared to a non-smoker. This means that he is 10 times more likely to get lung cancer than a non-smoker. Table 2.10 shows leading causes of death in the United States in the year 1992. Here the observed values of the individual number of deaths due to various causes are used to determine a relative risk expressed as percent (%) in the last column. Thus, heart disease, which accounts for 33% of the total deaths, is 16 times riskier than motor vehicle deaths. However, as a note of caution, these are values aggregated across the whole population and during a specific time interval and need to be interpreted accordingly. State and government analysts separate such relative risks by location, age groups, gender, and race for public policy-making purposes.

Table 2.10 Leading causes of death in the United States, 1992. (Adapted from Kolluru et al. 1996)

| Cause | Annual deaths (× 1000) | Percent (%) |
|---|---|---|
| Cardiovascular or heart disease | 720 | 33 |
| Cancer (malignant neoplasms) | 521 | 24 |
| Cerebrovascular diseases (strokes) | 144 | 7 |
| Pulmonary disease (bronchitis, asthma, etc.) | 91 | 4 |
| Pneumonia and influenza | 76 | 3 |
| Diabetes mellitus | 50 | 2 |
| Non-motor vehicle accidents | 48 | 2 |
| Motor vehicle accidents | 42 | 2 |
| HIV/AIDS | 34 | 1.6 |
| Suicides | 30 | 1.4 |
| Homicides | 27 | 1.2 |
| All other causes | 394 | 18 |
| Total annual deaths (rounded) | 2177 | 100 |

(iii) Subjective probability, which differs from one person to another, is an informed or best guess about an event that can change as our knowledge of the event increases. Subjective probabilities are those where the objective view of probability has been modified to treat two types of events: (a) when the occurrence is unique and is unlikely to repeat itself, or (b) when an event has occurred but one is unsure of the final outcome. In such cases, one still has to assign some measure of likelihood of the event occurring and use this in the analysis. Thus, a subjective interpretation is adopted with the probability representing a degree of belief of the outcome selected as having actually occurred (which could be based on a scientific analysis subject to different assumptions or even on “gut feeling”). There are no “correct answers,” simply a measure reflective of one’s subjective judgment. A good example of such subjective probability is one involving forecasting the probability of whether the impacts on gross world product of a 3°C global climate change by 2090 would be large or not. A survey was conducted involving 20 leading researchers working on global warming issues but with different technical backgrounds, such as scientists, engineers, economists, ecologists, and politicians, who were asked to assign a probability estimate (along with 10% and 90% confidence intervals). Though this was not a scientific study as such since the whole area of expert opinion elicitation is still not fully mature, there was nevertheless a protocol in how the questioning was performed, which led to the results shown in Fig. 2.33. The median, and 10% and 90% confidence intervals predicted by different respondents show great scatter, with the ecologists estimating impacts to be 20–30 times higher (the two right-most bars in the figure), while the economists on average predicted the chance of large consequences to have only a 0.4% loss in gross world product. An engineer or a scientist may be uncomfortable with such subjective probabilities, but there are certain types of problems where this is the best one can do with current knowledge. Thus, formal analysis methods must accommodate such information, and it is here that Bayesian techniques can play a key role.

Fig. 2.33 Example illustrating large differences in subjective probability. A group of prominent economists, ecologists, and natural scientists were polled so as to get their estimates of the loss of gross world product due to doubling of atmospheric carbon dioxide (which is likely to occur by the end of the twenty-first century when mean global temperatures increase by 3°C). The two ecologists predicted the highest adverse impact while the lowest four individuals were economists. (From Nordhaus 1994)
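The link between the return times and the exposure-time probabilities listed in Table 2.9 can be illustrated with a short calculation. The text does not state how the table entries were derived; the sketch below assumes, purely for illustration, that events arrive as a Poisson process with mean rate 1/T, where T is the return time, so that the probability of at least one event in an exposure time t is 1 - exp(-t/T). The function name `prob_at_least_one_event` is hypothetical.

```python
from math import exp


def prob_at_least_one_event(exposure_years, return_time_years):
    """Probability of one or more events during the exposure time.

    Assumes a Poisson process with mean rate 1/return_time; this is an
    assumption made for this sketch, not a statement from the source of
    Table 2.9.
    """
    return 1.0 - exp(-exposure_years / return_time_years)


# Return times (years) for events with 10 fatalities, taken from Table 2.9.
return_times = {"Earthquakes": 9, "Hurricanes": 2, "Floods": 0.5, "Tornadoes": 0.3}

for disaster, t_return in return_times.items():
    probs = [prob_at_least_one_event(t, t_return) for t in (1, 10, 20)]
    print(disaster, [round(p, 2) for p in probs])
```

Under this assumption the earthquake row evaluates to about 0.11, 0.67, and 0.89 for exposure times of 1, 10, and 20 years, which agrees with the tabulated values; the same function can be applied to the return times for events with 1000 fatalities.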
Problems
Pr. 2.1 Three sets are defined as integers from 1 to 12: A = {1,3,5,6,8,10}, B = {4,5,7,8,11} and C = {2,9,12}.
(a) Represent these sets in a Venn diagram.
(b) What are A ∪ B, A ∩ B, A ∪ C, A ∩ C?
(c) What are A ∪ B, A ∪ B, A ∪ C, A ∪ C?
(d) What are AB, A – B, A + B?

Pr. 2.2 A county office determined that of the 1000 homes in their area, 400 were older than 20 years (event A), that 500 were constructed of wood (event B), and that 400 had central air-conditioning (AC) (event C). Further, it is found that events A and B occur in 300 homes, that events A or C occur in 625 homes, that all three events occur in 150 homes, and that no event occurs in 225 homes. If a single house is picked, determine the following probabilities (also draw the Venn diagrams):
(a) That it is older than 20 years and has central AC.
(b) That it is older than 20 years and does not have central AC.
(c) That it is older than 20 years and is not made of wood.
(d) That it has central AC and is made of wood.

Pr. 2.3 A university researcher has submitted three research proposals to three different agencies. Let E1, E2, and E3 be the outcomes that the first, second, and third bids are successful with probabilities: p(E1) = 0.15, p(E2) = 0.20, p(E3) = 0.10. Assuming independence, find the following probabilities using group theory:
(a) That all three bids are successful.
(b) That at least two bids are successful.
(c) That at least one bid is successful.
Verify the above results using the probability tree approach.
Fig. 2.34 Components in parallel and in series (Problem 2.4)
Pr. 2.4 Example 2.2.3 illustrated how to compute the reliability of a system made up of two components A and B. As an extension, it would be insightful to determine whether reliability of the system will be better enhanced by (i) duplicating the whole system in parallel, or (ii) duplicating the individual components in parallel. Consider a system made up of two components A and B. Figure 2.34(a) represents case (i) while Figure 2.34(b) represents case (ii).
(a) If p(A) = 0.1 and p(B) = 0.8 are the failure probabilities of the two components, what are the probabilities of the system functioning properly for both configurations?
(b) Prove that functional probability of system (i) is greater than that of system (ii) without assuming any specific numerical values. Derive the algebraic expressions for proper system functioning for both configurations from which the above proof can be deduced.

Pr. 2.5 Consider the two system schematics shown in Fig. 2.35. At least one pump must operate when one chiller is operational, and both pumps must operate when both chillers are on. Assume that both chillers have identical reliabilities of 0.90 and that both pumps have identical reliabilities of 0.95.
(a) Without any computation, make an educated guess as to which system would be more reliable overall when (i) one chiller operates, and (ii) when both chillers operate.
(b) Compute the overall system reliability for each of the configurations separately under cases (i) and (ii) defined above.

Fig. 2.35 Two possible system configurations (for Pr. 2.5)

Pr. 2.6⁸ An automatic sprinkler system for a high-rise apartment has two different types of activation devices for each sprinkler head. Reliability of such devices is a measure of the probability of success, i.e., that the device will activate when called upon to do so. Type A and Type B devices have reliability values of 0.90 and 0.85, respectively. In case a fire does start, calculate:
(a) The probability that the sprinkler head will be activated (i.e., at least one of the devices works).
(b) The probability that the sprinkler will not be activated at all.
(c) The probability that both activation devices will work properly.
(d) Verify the above results using the probability tree approach.
⁸ From McClave and Benson (1988) with permission of Pearson Education.

Pr. 2.7 Consider the following probability distribution of a random variable X:
$$f(x) = \begin{cases} (1 + b)\, x^{b} & \text{for } 0 \le x \le 1 \\ 0 & \text{elsewhere} \end{cases}$$
Use the method of moments:
(a) To find the estimate of the parameter b
(b) To find the expected value of X
Pr. 2.8 Consider the following cumulative distribution function (CDF):
$$F(x) = \begin{cases} 1 - \exp(-2x) & \text{for } x > 0 \\ 0 & x \le 0 \end{cases}$$
(a) Construct and plot the cumulative distribution function.
(b) What is the probability of X < 2?
(c) What is the probability of 3 < X < 5?

Pr. 2.9 The joint density for the random variables (X,Y) is given by:
$$f(x, y) = 10xy^{2}$$
$$= 0$$
0 abs(r) → weak
² A more statistically sound procedure is described in Sect. 4.2.7, which allows one to ascertain whether observed correlation coefficients are significant or not depending on the number of data points.
It is very important to note that inferring non-association of two variables x and y from their correlation coefficient is misleading since it only indicates linear relationship. Hence, a poor correlation does not mean that no relationship exists between them (e.g., a second-order relation may exist between x and y; see Fig. 3.13f). Detection of non-linear correlations is addressed in Sect. 9.5.1. Note also that correlation analysis does not indicate whether the relationship is causal, i.e., whether the variation in the y-variable is directly caused by that in the x-variable. Finally, keep in mind that the correlation analysis does not provide an equation for predicting the value of a variable—this is done under model building (see Chaps. 5 and 9). Example 3.4.2 The following observations are taken of the extension of a spring under different loads (Table 3.4). Using Eq. 3.8, the standard deviations of load and extension are 3.7417 and 18.2978, respectively, while the correlation coefficient r = 0.9979 (following Eq. 3.12). This indicates a very strong linear positive correlation between the two variables as one would expect. ■
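The numbers quoted in Example 3.4.2 can be reproduced from the spring data of Table 3.4. The following Python sketch is an illustration only; Eqs. 3.8 and 3.12 are not reproduced in this section, so the code assumes they are the usual sample standard deviation (divisor n - 1) and the Pearson correlation coefficient. The function names are hypothetical.

```python
from math import sqrt

# Spring data from Table 3.4
load = [2, 4, 6, 8, 10, 12]                        # Newtons
extension = [10.4, 19.6, 29.9, 42.2, 49.2, 58.5]   # mm


def sample_std(values):
    """Sample standard deviation with divisor (n - 1); assumed form of Eq. 3.8."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def pearson_r(x, y):
    """Pearson correlation coefficient; assumed form of Eq. 3.12."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


print(round(sample_std(load), 4))        # standard deviation of load
print(round(sample_std(extension), 4))   # standard deviation of extension
print(round(pearson_r(load, extension), 4))
```

With these assumptions the three printed values match those quoted in the example (3.7417, 18.2978, and 0.9979). As cautioned above, a value of r close to 1 indicates only a strong linear association; it neither implies causality nor provides a predictive equation.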
Table 3.4 Extension of a spring with applied load (Example 3.4.2)

| Load (Newtons) | Extension (mm) |
|---|---|
| 2 | 10.4 |
| 4 | 19.6 |
| 6 | 29.9 |
| 8 | 42.2 |
| 10 | 49.2 |
| 12 | 58.5 |

Fig. 3.13 Illustration of various plots with different correlation strengths. (a) Moderate linear positive correlation (b) Perfect linear positive correlation (c) Moderate linear negative correlation (d) Perfect linear negative correlation (e) No correlation at all (f) Correlation exists but it is not linear. (From Wonnacutt and Wonnacutt 1985, by permission of John Wiley and Sons)

3.5 Exploratory Data Analysis (EDA)

3.5.1 What Is EDA?

EDA is an analysis process (some use the phrase “attitude toward data analysis”) that was developed and championed by John Tukey (1970) and considerably expanded and popularized by subsequent statisticians (e.g., Hoaglin et al. 1983). Rather than directly proceeding to perform the then-traditional confirmatory analysis (such as hypothesis testing) or stochastic model building (such as regression), Tukey suggested that the analyst should start by “looking at the data and see what it seems to say.” Such an investigation of data visualization and exploration, it was argued, will be more effective since it would indicate otherwise hidden/unexpected behavior in the dataset, allow evaluating the assumptions in data behavior presumed by traditional statistical analyses, and provide better guidance into the selection and use of new/appropriate statistical techniques. The timely advances in computer technology and software sophistication allowed new graphical techniques to be developed along with ways to transform/normalize variables to correct for unwanted shape/spread of their distributions. In short, EDA can be summarized as a process of data guiding the analysis! At first glance, EDA techniques may seem to be ad hoc and not follow any unifying structure; however, the interested reader can refer to Hoaglin et al. (1983) for a rationale for developing the EDA techniques and for an explanation and illustration of the connections between EDA and classical statistical theory. Heiberger and Holland (2015) provide an in-depth coverage and illustrate how different types of graphical data displays can be used to aid in exploratory data analysis and enhance various types of statistical computation and analysis, such as for inference, hypothesis testing, regression, time series, and experimental design (topics that are also covered in different chapters in this book). The book also provides software code in the open-source R statistical environment for generating such visual displays (https://github.com/henze-research-group/adam/).
3.5.2 Purpose of Data Visualization
Data visualization is done by graphs and serves two purposes. First, during exploration of the data, it provides a better means of assimilating broad qualitative trend behavior of the data than can be provided by tabular data. Second, it provides an excellent manner of communicating to the reader what the author wishes to state or illustrate (recall the adage “a picture is worth a thousand words”). Hence, data visualization can serve as a medium to communicate information, not just to explore data trends (an excellent reference is Tufte 2001). However, it is important to be clear as to the intended message or purpose of the graph, and also to tailor it to suit the intended audience’s background and understanding. A pretty graph may be visually appealing but may obfuscate rather than clarify or highlight the necessary aspects being communicated. For example, unless one is experienced, it is difficult to read numerical values off 3-D graphs. Thus, graphs should present data clearly and accurately without hiding or distorting the underlying intent. Table 3.5 provides a succinct summary of graph formats appropriate for different applications. Graphical methods are usually more insightful than numerical screening in identifying data errors in smaller datasets with few variables. For large datasets with several variables, they become onerous if basic software, such as the ubiquitous spreadsheet, is used. Additionally, the strength of a graphical analysis during EDA is to visually point out to the analyst relationships (linear or non-linear) between two or more variables in instances when a sound physical understanding is lacking, thereby aiding in the selection of the appropriate regression model. Present-day graphical visualization tools allow much more than this simple objective,
some of which will become apparent below. There is a very large number of graphical ways of presenting data, and it is impossible to cover them all. Only a small set of representative and commonly used plots will be discussed below, while other types of plots will be presented in later chapters as relevant. Nowadays, several high-end graphical software programs allow complex, and sometimes esoteric, plots to be generated. Graphical representations of data are the backbone of exploratory data analysis. They are usually limited to one-, two-, and three-dimensional (1-D, 2-D, and 3-D) data. In the last few decades, there has been a dramatic increase in the types of graphical displays largely due to the seminal contributions of Tukey (1970), Cleveland (1985), and Tufte (1990, 2001). A particular graph is selected based on its ability to emphasize certain characteristics or behavior of one-dimensional data, or to indicate relations between two- and three-dimensional data. A simple manner of separating these characteristics is to view them as being:
(i) Cross-sectional (i.e., data collected at one point in time or when time is not a factor)
(ii) Time-series data
(iii) Hybrid or combined
(iv) Relational (i.e., emphasizing the joint variation of two or more variables)
An emphasis on visualizing data to be analyzed has resulted in statistical software programs becoming increasingly convenient to use and powerful. Any data analysis effort involving univariate and bivariate data should start by looking at basic plots (higher-dimension data require more elaborate plots discussed later).
Table 3.5 Type and function of appropriate graph formats

| Type of message | Function | Typical format |
|---|---|---|
| Component | Shows relative size of various parts of a whole | Pie chart (for one or two important components); Bar chart; Dot chart; Line chart |
| Relative amounts | Ranks items according to size, impact, degree, etc. | Bar chart; Line chart; Dot chart |
| Time series | Shows variation over time | Bar chart (for few intervals); Line chart |
| Frequency | Shows frequency of distribution among certain intervals | Histogram; Line chart; Box-and-Whisker |
| Correlation | Shows how changes in one set of data are related to another set of data | Paired bar; Line chart; Scatter diagram |

Downloaded from the Energy Information Agency (EIA) website in 2009, which was since removed (http://www.eia.doe.gov/neic/graphs/introduc.htm)
3.5.3 Static Univariate Graphical Plots
Commonly used graphics for cross-sectional representation are mean and standard deviation plots, stem-and-leaf diagrams, dot plots, histograms, box-whisker-mean plots, distribution plots, bar charts, pie charts, area charts, and quantile plots. Mean and standard deviation plots summarize the data distribution using the two most basic measures; however, this manner is of limited use (and even misleading) when the distribution is skewed.

(a) Histograms
For univariate discrete or continuous data, plotting of histograms is very straightforward while providing a compact visual representation of the spread and shape (such as unimodal or bimodal) of the relative frequency distribution. There are no hard and fast rules regarding how to select the number of bins (Nbins) or classes in case of continuous data, probably because there is no proper theoretical basis. The shape of the underlying distribution is better captured with a larger number of bins, but then each bin will contain fewer observations, and may exhibit jagged behavior (see Fig. 1.3 for the variation of outdoor air temperature in Philadelphia). Generally, the larger the number of observations n, the more classes can be used, though as a guide it should be between 5 and 20. Devore and Farnum (2005) suggest:
$$\text{Number of bins or classes} = N_{bins} = n^{1/2} \tag{3.14a}$$
which would suggest that if n = 100, Nbins = 10. Doebelin (1995) proposes another equation:
$$N_{bins} = 1.87\, (n - 1)^{0.4} \tag{3.14b}$$
which would suggest that if n = 100, Nbins = 12.

Fig. 3.14 Box and whisker plot and its association with a standard normal distribution. The box represents the 50th percentile range while the whiskers extend 1.5 times the inter-quartile range (IQR) on either side. (From Wikipedia website)

(b) Box-and-whisker plots
The box-and-whisker plots better summarize the distribution, but this is done using percentiles. Figure 3.14 depicts its shape for univariate continuous data, identifies various important quantities, and illustrates how these can be associated with the Gaussian or standard normal distribution. The lower and upper box values Q1 and Q3 (called hinges) correspond to the 25th and 75th percentiles (recall the interquartile range [IQR] defined in Sect. 3.4.1) while the whiskers commonly extend to 1.5 times the IQR on either side. From Fig. 3.14, it is obvious that the IQR would contain 50% of the data observations while the range bounded by the
low and high whiskers for a normally distributed dataset would contain 99.3% of the total data (close to the 99.7% value corresponding to ± (3 × standard deviation)). Such a representation reveals the skewness in the data, and also indicates outlier points. Any observation between (1.5 × IQR) and (3.0 × IQR) from the closest quartile is considered to be a mild outlier, and shown as a closed or filled point, while one falling outside (3.0 × IQR) from the closest quartile is taken to be an extreme outlier and shown as an open circle point. However, if the most extreme observation falls within the (1.5 × IQR) spread, then the whisker line should be terminated at this point (Tukey 1970).

(c) Q-Q plots
Though plotting a box-and-whisker plot or a plot of the distribution itself can suggest the shape of the underlying distribution, a better visual manner of ascertaining whether a presumed or specified theoretical probability distribution is consistent with the dataset being analyzed is by means of quantile plots. Recall that quantiles are points dividing the range of a probability distribution (or any sample univariate data) into continuous intervals with equal probabilities; hence quantiles are a good way of segmenting a distribution. Further, percentiles are a subset of quantiles that divide the data into 100 equally sized groups. In essence, a quantile plot is one where the quantiles of the dataset are plotted as a cumulative distribution. A quantile–quantile or Q-Q plot is one where the quantiles of the data are plotted against the quantiles of a specified standardized theoretical distribution.³ How they align or deviate from the 45° reference line allows one to visually determine whether the two distributions are in agreement, and, if not, may indicate likely causes for this difference. The Q-Q plot can also be generated with quantiles from one sample data against those of another.

Table 3.6 Values of time taken (in minutes) for 20 students to complete an exam (see Example 3.5.1)
37.0  37.5  38.1  40.0  40.2  40.8  41.0
42.0  43.1  43.9  44.1  44.6  45.0  46.1
47.0  62.0  64.3  68.8  70.1  74.5
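Two of the rules of thumb discussed above, the bin-count guidelines of Eqs. 3.14a and 3.14b and the 1.5 × IQR and 3.0 × IQR outlier thresholds of the box-and-whisker plot, can be sketched in a few lines of Python. This is an illustration added here and not the book's own code. The function names are hypothetical, and the quartiles are computed with the "inclusive" interpolation of Python's `statistics.quantiles`, which is only one of several conventions; other software may give slightly different quartiles and hence different fences.

```python
from math import sqrt
from statistics import quantiles


def bins_sqrt(n):
    """Number of bins from Eq. 3.14a (Devore and Farnum 2005), rounded."""
    return round(sqrt(n))


def bins_doebelin(n):
    """Number of bins from Eq. 3.14b (Doebelin 1995), rounded."""
    return round(1.87 * (n - 1) ** 0.4)


def classify_outliers(data):
    """Split observations into mild and extreme outliers using the IQR rule."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for value in data:
        # distance beyond the closest quartile (zero if inside the box)
        distance = max(q1 - value, value - q3, 0.0)
        if distance > 3.0 * iqr:
            extreme.append(value)
        elif distance > 1.5 * iqr:
            mild.append(value)
    return q1, q3, mild, extreme


# Exam completion times (minutes) from Table 3.6
times = [37.0, 37.5, 38.1, 40.0, 40.2, 40.8, 41.0,
         42.0, 43.1, 43.9, 44.1, 44.6, 45.0, 46.1,
         47.0, 62.0, 64.3, 68.8, 70.1, 74.5]

print(bins_sqrt(100), bins_doebelin(100))
print(classify_outliers(times))
```

For n = 100 the two bin rules return 10 and 12, as stated in the text. With the quartile convention assumed here, the three longest exam times in Table 3.6 are flagged as mild outliers and none as extreme; a different quartile convention could change which points are flagged.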
A special case of the more general Q-Q plot is the normal probability plot where the comparison is done against the normal distribution plotted on the y-axis. There is some variation between different statistical software in the terminology adopted and how these plots are displayed (an issue to be kept in mind). Q-Q plots are better interpreted with larger datasets, but the following example is meant for illustrative purposes.
3 Another similar type of plot for comparing two distributions is the P-P plot, which is based on the cumulative probability distributions. However, the Q-Q plot based on the cumulative quantile distribution is more widely used.
Fig. 3.15 Q-Q plot of the data in Table 3.6 against a standardized Gaussian distribution
Example 3.5.1 An instructor wishes to ascertain whether the time taken by her students to complete the final exam follows a normal or Gaussian distribution. The values in minutes shown in Table 3.6 have been recorded. The normal quantile plot for this dataset against a Gaussian (or standard normal) is shown in Fig. 3.15. The pattern is clearly non-linear, and so a Gaussian distribution is not appropriate for this data. The apparent break appearing in the data on the right side of the graph indicates the presence of outliers (in this case, caused by five students taking much longer to complete the exam). ■

Example 3.5.2 Consider the same dataset as for Example 3.4.1. The following plots have been generated (shown in Fig. 3.16):
(a) Box-and-whisker plot (note that the two whiskers are not equal, indicating a slight skew).
(b) Histogram of data (assuming 9 bins) shown as relative frequency; the heights of the individual bars should sum to 100%.
(c) Normal probability plot, which allows one to evaluate whether the distribution is close to a normal distribution. The tails deviate from the straight line, indicating departure from a normal distribution. Note that the normal quantile is now plotted on the y-axis and rescaled compared to the Q-Q plot of Fig. 3.15. Different software programs generate such plots differently. One also comes across Q-Q plots where both axes represent quantiles, one from the normal distribution and one from the sample distribution.
Fig. 3.16 Various exploratory plots for the dataset in Table 3.2
Fig. 3.17 Common components of the box plot and the violin plot for an arbitrary continuous variable (shown on the y-axis)
[Fig. 3.17 labels: Outside Points, Upper Adjacent Value, Third Quartile, Median, First Quartile, Lower Adjacent Value; panels: Box Plot and Violin Plot; y-axis ticks from 20 to 120]
(d) Run chart (or time-series plot), meant to retain the time-series nature of the data, which the other graphics do not. The run chart as generated here is meaningless since the data has been entered into the spreadsheet in the wrong sequence, with data entered column-wise instead of row-wise. Had the data been entered correctly, the run chart would have resulted in a monotonically increasing curve and would have been more meaningful. ■
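The construction behind a normal Q-Q plot, as used in Examples 3.5.1 and 3.5.2, can be sketched in a few lines. This is only an illustrative sketch: the plotting position p_i = (i - 0.5)/n is one of several conventions in use (different software adopts different ones), the function name `normal_qq_pairs` is hypothetical, and the sample values are invented for illustration and are not those of Table 3.6.

```python
from statistics import NormalDist


def normal_qq_pairs(sample):
    """Return (standard normal quantile, ordered observation) pairs."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    pairs = []
    for i, value in enumerate(ordered, start=1):
        p = (i - 0.5) / n  # assumed plotting position; other conventions exist
        pairs.append((std_normal.inv_cdf(p), value))
    return pairs


# Hypothetical exam completion times in minutes (illustrative only)
times = [38.0, 41.5, 43.0, 44.2, 45.1, 46.0, 47.3, 62.0, 70.5]
for z, x in normal_qq_pairs(times):
    print(f"{z:6.2f}  {x:5.1f}")
```

Plotting the second column against the first gives the Q-Q plot; points that bend away from a straight line at one end, as the two largest hypothetical values would here, suggest a skewed tail or outliers.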
(d) Violin plots Another visually appealing plot that conveys essentially the same information as the box plot (but more completely) is the violin plot (Fig. 3.17). It reveals the probability density of the data at different values (each half of the violin shows the same distribution; this convention has been adopted for symmetry and visual aesthetics). The plot is usually smoothed by a kernel density estimator. The dot in the middle of the box inside the violin is the median, and the first and third quartiles are also drawn (the box represents the interquartile range).
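Both the box plot and the violin plot are built on the quartiles. The quartile, IQR, and mild/extreme outlier rules described for box-and-whisker plots in this section can be sketched as follows. This is a minimal sketch under stated assumptions: quartiles are computed with the "inclusive" interpolation method of Python's `statistics.quantiles` (other quartile conventions give slightly different values), and the names `tukey_fences` and `classify` as well as the data are hypothetical.

```python
from statistics import quantiles


def tukey_fences(data):
    """Return Q1, Q3 and the 1.5 x IQR and 3.0 x IQR fences."""
    q1, _median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # beyond this: mild outlier
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)  # beyond this: extreme outlier
    return q1, q3, inner, outer


def classify(x, inner, outer):
    """Label an observation relative to the fences."""
    if x < outer[0] or x > outer[1]:
        return "extreme outlier"
    if x < inner[0] or x > inner[1]:
        return "mild outlier"
    return "within whiskers"


# Hypothetical data (illustrative only)
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
q1, q3, inner, outer = tukey_fences(data)
for x in (10, 15, 25):
    print(x, classify(x, inner, outer))
```

With these hypothetical values, Q1 = 3 and Q3 = 7, so the IQR is 4 and the fences lie at (-3, 13) and (-9, 19).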
3.5.4 Static Bi- and Multivariate Graphical Plots
There are numerous graphical representations that fall in this category, and only an overview of the more common plots will be provided here. The box plot representation, discussed earlier, also allows a convenient visual comparison of the similarities and differences between the spread and shape of two or more datasets when multiple box plots are plotted side by side.

(a) Pie Charts Multivariate stationary data of worldwide percentages of total primary energy sources can be represented by the widely used pie chart (Fig. 3.18a), which allows the relative aggregate amounts of the variables to be clearly visualized. The same information can also be plotted as a bar chart (Fig. 3.18b), which is not quite as revealing.

Fig. 3.18 Two different ways of plotting stationary data. Data corresponds to worldwide percentages of total primary energy supply in 2003. (a) Pie chart. (b) Bar chart. (From IEA, World Energy Outlook, IEA, Paris, France, 2004)

(b) Elaborate Bar Charts More elaborate bar charts (such as those shown in Fig. 3.19) allow numerical values of more than one variable to be plotted such that their absolute and relative amounts are clearly highlighted. The plots depict differences between the electricity sales during each of the four different quarters of the year over 6 years. Such plots can be drawn as compounded plots to allow better visual inter-comparisons (Fig. 3.19a). Column charts or stacked charts (Fig. 3.19b, c) show the same information
Fig. 3.19 Different types of bar and area plots to illustrate year-by-year variation (over 6 years) in quarterly electricity sales (in GigaWatt-hours) for a certain city
as that in Fig. 3.19a but are stacked one above another instead of showing the numerical values side by side. One plot shows the stacked values normalized such that the sum adds to 100%, while another stacks them so as to retain their numerical values. Finally, the same information can be plotted as an area chart (Fig. 3.19d) wherein both the time-series trend and the relative magnitudes are clearly highlighted.

(c) Scatter and Time-Series Plots Time-series plots or relational plots or scatter plots (such as x–y plots) between two variables are the most widely used types of graphical displays. Scatter plots allow visual determination of the trend line between two variables and the extent to which the data scatter around the trend line (Fig. 3.20). An important issue is that the manner of selecting the range of the variables can be misleading to the eye. The same data is plotted in Fig. 3.21 on two different scales, but one would erroneously conclude that there is more data scatter around the trend line for (b) than for (a). Such distortion is quantified by the lie factor, defined as the ratio of the apparent size of effect in the graph to the actual size of effect in the data (Tufte 2001). The data at hand and the intent of the analysis should dictate the scale of the two axes, but it is difficult in practice to determine this heuristically.4 It is in such instances that statistical measures can be used to provide an indication of the magnitude of the graphical scales.

(d) Bubble Plots Bubble plots allow observations with three attributes to be plotted. The 2-D version of such plots is the well-known x–y scatter plot. An additional variable is represented by enlarging the dot, i.e., the bubble size. Figure 3.22 is illustrative of such a representation for the commute patterns in major US cities in 2008.
4 Generally, it is wise, at least at the outset, to adopt scales starting from zero, view the resulting graphs, and then make adjustments to the scales as appropriate.
Fig. 3.20 Scatter plot (or x–y plot) of worldwide population growth over time showing past values and projected values with 2010 as the current year. In this case, a second-order (quadratic) regression model has been selected to plot the trend line. The actual population in 2021 was 7.84 billion (quite close to the projected value)
Fig. 3.21 Figure to illustrate how the effect of resolution can mislead visually. The same data is plotted in the two plots, but one would erroneously conclude that there is more data scatter around the trend line for (b) than for (a)
Fig. 3.22 Bubble plot showing the commute patterns in major US cities in 2008. The size of the bubble represents the number of commuters. (From Wikipedia website, downloaded 2010)
Fig. 3.24 Scatter plot combined with box-whisker-mean (BWM) plot of the same data as shown in Fig. 3.10. (From Haberl and Abbas 1998, by permission of Haberl)
Fig. 3.23 Several types of combination charts are possible. The plots shown allow visual comparison of the standardized (subtracted by the mean and divided by the standard deviation) daily whole-house electricity use in 29 similar residences against the overlaid standard normal distribution. (From Reddy 1990)
(e) Combination Charts Combination charts can take numerous forms but, in essence, are those where two basic but different ways of representing data are combined in one graph. One example is Fig. 3.23, where the histogram depicts the actual data spread, the distribution of which can be visually evaluated against the standard normal curve overlaid on the histogram. The dataset corresponds to the daily energy use of 29 residences with similar diurnal energy use (classified as Stratum 5) during the summer and winter days. Occupant vagaries can be likened to randomness/noise in the electricity use data. Possible causes for the closeness to or deviation from normality for each season can provide physical insights to the analyst and also allow different electricity curtailment measures to be modeled in a probabilistic framework.

For purposes of data checking, x–y plots are perhaps most appropriate, as discussed in Sect. 3.3.4. The x–y scatter plot (Fig. 3.10) of hourly cooling energy use of a large institutional building versus outdoor temperature allowed outliers to be detected. The same data could be summarized by combined box-and-whisker plots (first suggested by Tukey 1988) as shown in Fig. 3.24. Here the x-axis range is subdivided into discrete bins (in this case, 5 °F bins), showing the median values (joined by a continuous line) along with the 25th and 75th percentiles (shown boxed), the 10th and 90th percentiles indicated by the vertical whiskers from the box, and the values less than the 10th percentile and those greater than the 90th percentile shown as individual points.5 Such a representation is clearly a useful tool for data quality checking, for detecting underlying patterns in data at different sub-ranges of the independent variable, and also for ascertaining the shape of the data spread around this pattern. Note that the cooling energy use line seems to plateau at outside temperature values above 80 °F. What does this indicate about the installed capacity of the chiller? Probably that the chiller was undersized at the outset, or has degraded over time, or that the building loads have increased over time.

5 Note that the whisker end points are different from those described earlier in Sect. 3.5.3. Different textbooks and studies adopt slightly different selection criteria.

(f) Component-Effect Plots In case the functional relationship between the independent and dependent variables changes due to known causes, it is advisable to plot these in different frames. For example, hourly energy use in a commercial building is known to change with time of day, and, moreover, the functional relationship is quite different depending on the season (time of year). Component-effect plots are multiple plots between the variables for cold, mild, and hot
periods of the year combined with different box-and-whisker plots for different hours of the day. They provide more clarity in underlying trends and scatter as illustrated in Fig. 3.25, where the time of year is broken up into three temperature bins.

Fig. 3.25 Example of a combined box-whisker-component plot depicting how hourly energy use varies with hour of day during a year for different outdoor temperature bins for a large commercial building. (From ASHRAE 2002, © American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., www.ashrae.org)

Fig. 3.26 Contour plot characterizing the sensitivity of total power consumption (condenser water pump power plus tower fan power) to condenser water-loop controls for a single chiller load, ambient wet-bulb temperature, and chilled water supply temperature. (From Braun et al. 1989, © American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., www.ashrae.org)

(g) Contour Plots A contour plot depicts the relationship between system response (or dependent variable) and two independent variables plotted on the two axes. This relationship is captured by a series of lines drawn for different pre-selected values of the response variable. This is illustrated by Fig. 3.26, where the total power of a condenser loop of a cooling system is the sum of the pump power and the cooling tower fan power, both of which are functions of their operating speeds (since they can be modulated). The two axes of the figure are the normalized fan and pump speeds relative to their rated values. The minimum total power is shown by a cross at the center of the innermost contour circle, while one notes that this minimum is quite broad and different combinations of the two control variables are admissible. Such plots clearly indicate the degree of latitude allowable in the operating control settings of the pump and the fan and reveal the non-linear sensitivity of total power as these control points stray from the optimal setting. Insights into how different combinations of the two independent variables impact total system power are useful to system operators.

(h) Scatter Plot Matrix Figure 3.27, called a scatter plot matrix, is another useful way of visualizing multivariate data. Here the various permutations of the variables are shown as individual scatter plots. The idea, though not novel, has merit because of the way the graphs are organized and presented. The graphs are arranged in rows and columns such that each row or column has all the graphs relating a certain variable to all the others; thus, the variables have shared axes. Though there are twice as many graphs as needed minimally (since each graph has another one with the axes interchanged), the redundancy is sometimes useful to the analyst in better detecting underlying trends.
Fig. 3.27 Scatter plot matrix or carpet plots for multivariable graphical data visualization. The data corresponds to hourly climatic data for Phoenix, AZ, for January 1990. The bottom left-hand corner frame indicates how solar radiation in Btu/hr-ft² (x-axis) varies with dry-bulb temperature (in °F) and is a flipped and rotated image of that at the top right-hand corner. The HR variable represents humidity ratio (in lbm/lba). Points that fall distinctively outside the general scatter can be flagged as outliers. (Panel variables: TDB, HR, Solar)
(i) 3-D Plots Three-dimensional (or 3-D) plots have been increasingly used over the past several decades. They allow plotting the variation of a variable when it is influenced by two independent factors. They also allow trends to be gauged and are visually appealing, but the numerical values of the variables are difficult to read. The data plotted in Fig. 3.28 corresponds to the mean hourly energy use of 29 residences with similar diurnal energy use (classified as Stratum 5). The day has been broken up into eight segments of 3 h each to reduce clutter in the graph. Such a graph can reveal probabilistic trends in how occupants consume electricity so that different demand curtailment measures can be evaluated by electric utilities in a probabilistic modeling framework. Another benefit of such 3-D plots is their ability to aid in the identification of oversights. For example, energy use data collected from a large commercial building could be improperly time-stamped, such as by overlooking the daylight saving time shift or by misaligning 24-h holiday profiles (Fig. 3.29). One drawback associated with these graphs is the difficulty in viewing exact details such as the specific hour or specific day on which a misalignment occurs. Some analysts complain that 3-D surface plots obscure data that is behind "hills" or in "valleys." Clever use of color or dotted lines has been suggested to make it easier to interpret such graphs.

Fig. 3.28 Three-dimensional (3-D) surface chart of mean hourly whole-house electricity during different hourly segments of the day across several residences. (From Reddy 1990)
(j) Domain-Specific Charts Different disciplines have developed different types of plots and graphical representations. Tornado diagrams are commonly used to illustrate variable sensitivity during risk analysis, and spider plots are common in multicriteria decision-making studies (Chap. 12). One example in HVAC studies is the well-known psychrometric chart (Reddy et al. 2016), which allows one to determine (for a given location characterized by its elevation above sea level) the various properties of air–water mixtures such as dry-bulb temperature, absolute humidity, relative humidity, specific volume, enthalpy, and wet-bulb temperature. Solar scientists and architects have developed the sun-path diagram, which allows one to determine the position of the sun in the sky (defined by the solar altitude and the solar azimuth angles) at different times of the day and the year for a given latitude (40° N in Fig. 3.30). Such a representation has also been used to determine periods of the year when shading occurs from neighboring obstructions. Such considerations are important while siting solar systems or designing buildings.
3.5.5 Interactive and Dynamic Graphics
The above types of plots can be generated by most present-day data analysis software programs. More specialized software programs allow interactive data visualization, which provides much greater insight into, and intuitive understanding of, data trends, correlations, outliers, and local behavior, especially when large amounts of data are being analyzed. Animation has also been used to advantage in understanding time-series system behavior from monitored data since effects such as diurnal and seasonal differences in building energy use can be conveniently investigated. Animated scatter plots of the x and y variables, use of filtering, zooming, brushing, use of distortion lenses, etc. in conjunction with judicious use of color can provide even better visual insights to the professional and enhance classroom learning as well. The interested reader can refer to Keim and Ward (2003) for a more in-depth classification and treatment of advanced data visualization techniques most appropriate for large datasets.

Different domains have developed specialized visualization tools. Glaser and Ubbelohde (2001) describe novel high-performance visualization techniques for viewing time-dependent data common to building energy simulation program output. Some of these techniques include: (i) brushing and linking where the user can investigate the behavior during a few days of the year, (ii) tessellating a 2-D chart into multiple smaller 2-D charts giving a four-dimensional (4-D) view of the data such that a single value of a representative sensor can be evenly divided into smaller spatial plots arranged by time of day, (iii) magic lenses that can zoom into a certain portion of the room, and (iv) magic brushes. These techniques enable rapid inspection of trends and singularities that cannot be gleaned from conventional viewing methods.

Fig. 3.29 Example of a three-dimensional plot of measured hourly electricity use in a commercial building over 9 months. (From ASHRAE 2002, © American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., www.ashrae.org)

Fig. 3.30 Figure illustrating an overlay plot for shading calculations. The sun-path diagram is generated by computing the solar altitude and azimuth angles for a given latitude (for 40° N in this case) during different times of the day and times of the year. Trees and other objects can obstruct the observer. Such periods are conveniently determined by drawing the contours of these objects characterized by the angles φp and βp computed from basic geometry and overlaid on the sun-path diagram. (From Reddy et al. 2016, by permission of CRC Press)
3.5.6 Basic Data Transformations
Another important aspect of EDA is data transformation. Such transformations are meant to simplify the analysis by removing effects such as strong asymmetry, many outliers in one tail, or batches of data with different spreads, and by promoting linear relationships between regressor and response variables; in short, to yield more effective insights into the dataset being analyzed (Hoaglin et al. 1983). Typical examples include converting into appropriate units, taking ratios, rescaling, and applying mathematical corrections to the data such as taking logarithms or taking square roots. Such transformations are discussed in various chapters of this book (e.g., Chap. 5 in the context of regression model building, Chap. 11 for data mining, and Chap. 12 for sustainability assessments). Only the very basic rescaling or normalization methods are described below:

(a) Decimal scaling moves the decimal point but still preserves most of the original data. The specific observations of a given variable may be divided by $10^x$, where x is the smallest integer for which all the scaled observations lie between -1 and 1. For example, say the largest value is 289 and the smallest value is -150. Then, since x = 3, all observations are divided by 1000 so as to lie within [-0.150, 0.289].

(b) Min-max scaling allows for better distribution of observations over the range of variation than does decimal scaling. It does this by redistributing the values to lie between [0 and 1]. Hence, each observation is normalized as:

$$z_i = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}} \qquad (3.15)$$

where $x_{\max}$ and $x_{\min}$ are the maximum and minimum numerical values, respectively, of the x variable. Sometimes, $x_{\min}$ may be close or equal to zero, and then Eq. 3.15 simplifies to:

$$z_i = \frac{x_i}{x_{\max}} \qquad (3.16)$$

Note that though this transformation may look very appealing, the scaling relies largely on the minimum and maximum values, which are generally not very robust and may be error prone. The min-max scaling is a linear transformation. One could also scale some of the variables using non-linear functions such as power or logarithmic functions (as discussed in Sect. 12.5.3).

(c) Standard deviation scaling is widely used for distance measures (such as in multivariate statistical analysis) but transforms data into a form unrecognizable from the original data. Here, each observation is transformed as follows:

$$z_i = \frac{x_i - \bar{x}}{s_x} \qquad (3.17)$$

where $\bar{x}$ and $s_x$ are the mean and standard deviation, respectively, of the x variable. The dataset is said to be converted into one with zero mean and unit standard deviation (a transformation adopted in several statistical tests and during multivariate regression).

3.6 Overall Measurement Uncertainty

3.6.1 Need for Uncertainty Analysis

Any measurement exhibits some difference between the measured value and the true value and, therefore, has an associated uncertainty. A statement of measured value without an accompanying uncertainty statement has limited meaning. Uncertainty is the interval around the measured value within which the true value is expected to fall at some stated confidence level (CL). "Good data" is not characterized by point values only; the data should be within an acceptable uncertainty interval or, in other words, should provide the acceptable degree of confidence in the result. Measurements made in the field are especially subject to errors. In contrast to measurements taken under the controlled conditions of a laboratory setting, field measurements are typically made under less predictable circumstances and with less accurate and less expensive instrumentation. Furthermore, field measurements are vulnerable to errors arising from:
(a) Variable measurement conditions so that the method employed may not be the best choice for all operating conditions.
(b) Limited instrument field calibration, because it is typically more complex and expensive than laboratory calibration.
(c) Limitations in the ability to adjust instruments in the field.

With appropriate care, many of these sources of error can be addressed: (i) through the optimization of the measurement system to provide maximum benefit for the chosen budget, and (ii) through the systematic development of a procedure by which an uncertainty statement can be ascribed to the result. The results of a practitioner who does not consider sources of error are likely to be questioned by others, especially since the engineering community is becoming increasingly sophisticated and mature about the proper reporting of measured data and associated uncertainties.
3.6.2 Basic Uncertainty Concepts: Random and Bias Errors

There are several standard documents for evaluating and reporting uncertainties in measurements, parameters, and methods and for propagation of those uncertainties to the test result. The International Organization for Standardization (ISO) standards are the basis from which professional organizations have developed standards specific to their purpose (e.g., ASME PTC19.1-2018). The following material is largely drawn from Guideline 2 (ASHRAE G-2 2005), which deals with engineering analysis of experimental data. Uncertainty sources may be classified as either systematic/bias or random, and these are treated as random variables, with, however, different multipliers applied to them. The end result of an uncertainty analysis is a numerical estimate of the test uncertainty with an appropriate CL.

(a) Bias or systematic error (or precision or fixed error) is analogous to sensor precision (see Sect. 3.1). In certain cases, the fixed offset (bias) errors can be determined. For example, a bias is present if a temperature sensor always reads 1 °C higher than the true value from a certified calibration procedure, and this miscalibration error can be corrected. However, there are other causes such as improper placement of the sensor, degradation, or particular measurement technique that cause perturbations on the sensor reading (akin to Fig. 3.1b). These perturbations are also treated as a random variable but characterized by a fixed and unchanging uncertainty value which does not reduce even with multi-sampling (this aspect is elaborated below in Sect. 3.6.4). Thus, for the specific situation, a simple bias correction to the measurements cannot be applied.

(b) Random error (or inaccuracy error) is an error due to the unpredictable and unknown variations in the experiment that cause readings to take random values on either side of some mean value. Measurements may be accurate or non-accurate, depending on how well an instrument can reproduce the subsequent readings of an unchanged input (Fig. 3.31). Only random errors can be treated by statistical methods. There are two types of random errors: (i) additive errors that are independent of the magnitude of the observations, and (ii) multiplicative errors that are dependent on the magnitude of the observations (Fig. 3.32). Usually instrument accuracy is stated in terms of percent of full scale, and in such cases uncertainty of a reading is taken to be additive, i.e., irrespective of the magnitude of the reading.

Random errors are differences from one observation to the next due to both sensor noise and extraneous conditions affecting the sensor. The random error changes from one observation to the next, but its mean (average value) over a very large number of observations is taken to approach zero. Random error generally has a well-defined probability distribution that can be used to bound its variability in statistical terms as described in the next two sub-sections when a finite number of observations is made of the same variable.

Fig. 3.31 The four general manifestations of measurement bias and precision errors of population average estimated from sample measurements. [Panels: (a) Unbiased and precise, (b) Biased and precise, (c) Unbiased and imprecise, (d) Biased and imprecise; each panel plots frequency against parameter measurement and marks the true value and the population average]

Fig. 3.32 Conceptual figures illustrating how additive and multiplicative errors affect the uncertainty bands around the trend line
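The distinction between additive and multiplicative errors discussed in this section can be illustrated numerically. The sketch below is only an illustration under assumed values: the function names are hypothetical, and the 1% figures and the full-scale value of 100 are invented, not taken from any instrument specification. An additive error stated as a percent of full scale gives a band of constant width, while a multiplicative error stated as a percent of reading gives a band that widens with the reading.

```python
def additive_band(reading, full_scale, pct_of_full_scale):
    """Uncertainty band when the error is a fixed percent of full scale."""
    half_width = pct_of_full_scale / 100.0 * full_scale
    return reading - half_width, reading + half_width


def multiplicative_band(reading, pct_of_reading):
    """Uncertainty band when the error is a percent of the reading itself."""
    half_width = pct_of_reading / 100.0 * abs(reading)
    return reading - half_width, reading + half_width


# Hypothetical instrument: full scale of 100 units, 1% error in both cases
for reading in (20.0, 80.0):
    print(reading,
          additive_band(reading, 100.0, 1.0),
          multiplicative_band(reading, 1.0))
```

For the additive case the half-width stays at 1 unit for both readings, whereas for the multiplicative case it grows from 0.2 to 0.8 units, which is the fanning-out pattern that Fig. 3.32 depicts conceptually.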
3.6.3
Random Uncertainty
Based on measurements of a random variable X, the true value of X can be specified to lie in the interval (Xbest ± Ux) where Xbest is usually the mean value of the measurements taken and Ux is the uncertainty in X that corresponds to the estimate of the effects of combining fixed and random errors.
The uncertainty being reported is specific to a confidence level (CL),6 which can be directly interpreted as a probability. The confidence interval (CI) defines the range of values or the bounds/limits that can be expected to include the true value with a stated probability. For example, a statement that the CI at the 95% CL is 5.1–8.2 implies that the true value will be contained between the interval bounded by {5.1, 8.2} in 19 out of 20 predictions (95% probability), or that one is 95% confident that the true value lies between 5.1 and 8.2. This is a loose interpretation (but easier to understand by practitioners); the more accurate one is that the CI applies to the difference between the sample and the population means and not to the population mean itself. An uncertainty statement with a low CL is usually of little use. For example, in the previous example, if a CL of 40% is used instead of 95%, the interval becomes a tight {7.6, 7.7}. However, only 8 out of 20 predictions will likely lie between 7.6 and 7.7. Conversely, it is useless to seek a 100% CL since then the true value of some quantity would lie between plus and minus infinity. Multi-sample data (repeated measurements of a fixed quantity using altered test conditions, such as different observers or different instrumentation or both) provides greater reliability and accuracy than single sample data (measurements by one person using a single instrument). For the majority of engineering cases, it is impractical and too costly to perform a true multi-sample experiment. Strictly speaking, merely taking repeated readings with the same procedure and equipment does not provide multisample results; however, such a procedure is often accepted by the engineering community as a fair approximation of a multi-sample experiment. Depending upon the sample size of the data (greater or less than about 30 samples), different statistical considerations and equations apply. 
The issue of estimating CI is further discussed in Chap. 4, while operational equations are presented below. These levels or limits are directly based on the Gaussian and the Student-t distributions presented in Sect. 2.4.3.

(a) Random uncertainty in large samples (n > about 30): The best estimate of a variable x is usually its sample mean value, given by $\bar{x}$. The limits of the CI are determined from the sample standard deviation $s_x$. The typical procedure is then to assume that the individual data values are scattered about the mean following a certain probability distribution function, within $(\pm z \cdot s_x)$ of the mean, where z is a multiplier described below. Usually a normal probability curve (Gaussian distribution) is assumed to represent the dispersion in experimental data, unless the process is known to follow one of the other standard distributions (discussed in Sect. 2.4). For a normal distribution, the standard deviation indicates the following degrees of dispersion of the values about the mean. From Table A3, for z = 1.96,7 the area shown shaded is 0.025, which translates to 0.05 for a two-tailed distribution, implying that 95% of the data will be within $(\pm 1.96 s_x)$ of the mean. Thus, the z-multiplier has a direct relationship with the CL selected (assuming a known probability distribution). The CI for the mean of n multi-sample random data, with no fixed error, is:

$$x_{\min} = \bar{x} - \frac{z \cdot s_x}{\sqrt{n}} \quad \text{and} \quad x_{\max} = \bar{x} + \frac{z \cdot s_x}{\sqrt{n}} \qquad (3.18a)$$

6 Several publications cite uncertainty intervals without specifying a corresponding CL; such practice should be avoided.
(b) Random uncertainty in small samples (n < about 30). In many circumstances, the analyst will not be able to collect a large number of data points and may be limited to a dataset of less than 30 values (n < 30). Under such conditions, the mean value and the standard deviation are computed as before. The z-value applicable for the normal distribution cannot be used for small samples. The new values, called t-values, are tabulated for different degrees of freedom d.f. (ν = n - 1) and for the acceptable degree of confidence (see Table A4).8 The CI for the mean value of x, when no fixed (bias) errors are present in the measurements, is given by:
$$x_{\min} = \bar{x} - \frac{t \cdot s_x}{\sqrt{n}} \quad \text{and} \quad x_{\max} = \bar{x} + \frac{t \cdot s_x}{\sqrt{n}} \tag{3.18b}$$
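As a minimal sketch of Eqs. 3.18a and 3.18b, the Python snippet below computes the CI of a sample mean. The function name `mean_confidence_interval` and the illustrative inputs (a mean of 30, a standard deviation of 3, and sample sizes of 50 and 21) are assumptions made for this illustration. The z-multiplier is obtained from the normal distribution in the Python standard library, while the t-multiplier is simply passed in as read from a table such as Table A4.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(mean, std_dev, n, multiplier):
    """Return (lower, upper) limits of the CI for a sample mean.

    multiplier is the z-value (large samples) or the t-value (small samples).
    """
    half_width = multiplier * std_dev / math.sqrt(n)
    return mean - half_width, mean + half_width


# Large sample (n > about 30): two-tailed z-multiplier for a 95% CL.
z_95 = NormalDist().inv_cdf(1 - 0.05 / 2)  # approximately 1.96
large_ci = mean_confidence_interval(30.0, 3.0, 50, z_95)

# Small sample: t-multiplier read from a table (2.086 for d.f. = 20, 95% CL).
t_95_df20 = 2.086
small_ci = mean_confidence_interval(30.0, 3.0, 21, t_95_df20)

print("large-sample CI:", [round(v, 2) for v in large_ci])
print("small-sample CI:", [round(v, 2) for v in small_ci])
```

For the same mean and standard deviation, the smaller sample yields the wider interval, both because n is smaller and because the t-multiplier exceeds the z-multiplier.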
For example, consider the case of d.f. = 10 and two-tailed 95% CL. One finds from Table A4 that t = 2.228 for 95% CL. Note that this reduces to t = 2.086 for d.f. = 20 and reaches the z-value of 1.96 for d.f. = ∞.
Example 3.6.1 Estimating confidence intervals (CI)
(a) The length of a field is measured 50 times. The mean is 30 with a standard deviation of 3. Determine the 95% CI assuming no fixed error.
This is a large sample case, for which the z-multiplier is 1.96. Hence, from Eq. 3.18a, the 95% CI is:
$$30 \pm \frac{(1.96)(3)}{(50)^{1/2}} = 30 \pm 0.83 = \{29.17, 30.83\}$$
7 Note that the value of 1.96 corresponds to very large samples (>120 or so). For a sample size of 30, the multiplier can be read off Table A4, which shows a value of 2.045 for degrees of freedom = 30 - 1 = 29 for a two-tailed 95% CL. However, it is common practice to simply assume the z-values for samples greater than 30 even though this does introduce some error.
8 Table A4 assembles critical values for both the one-tailed and two-tailed distributions, while most of the discussion here applies to the latter. See Sect. 4.2.1 for the distinction between both.
(b) Only 21 measurements are taken and the same mean and standard deviation as in (a) are found. Determine the 95% CI assuming no fixed error.
This is a small sample case for which the t-value = 2.086 for d.f. = 20. Then, from Eq. 3.18b, the 95% CI will turn out to be wider:
$$30 \pm \frac{(2.086)(3)}{(21)^{1/2}} = 30 \pm 1.37 = \{28.63, 31.37\}$$
■
3.6.4
Bias Uncertainty
Estimating the bias or fixed error of a random variable at a specified confidence level (commonly, 95% CL) is described below. The fixed error BX for a given value x is assumed to be a single value drawn from some larger distribution of possible fixed errors. The treatment is similar to that of random errors with the major difference that only one value is considered even though several observations may be taken. When further knowledge is lacking, a normal distribution is usually assumed. Hence, if a manufacturer specifies that the fixed uncertainty BX = ±1.0 °C with 95% CL (compared to some standard reference device), then one assumes that the fixed error belongs to a larger distribution (taken to be Gaussian) with a standard deviation SB = 0.5 °C (since the corresponding z-value ≃2.0).
3.6.5
Overall Uncertainty
The overall uncertainty of a measured variable x combines the random and bias uncertainty estimates. Though several forms of this expression appear in different texts, a convenient working formulation is as follows:
$$U_x = \left[ B_x^2 + \left( \frac{t \cdot s_x}{\sqrt{n}} \right)^2 \right]^{1/2} \tag{3.19}$$
where:
$U_x$ = overall uncertainty in the value x at a specified CL
$B_x$ = uncertainty in the bias or fixed component at the specified CL
$s_x$ = standard deviation estimate for the random component
n = sample size
t = t-value at the specified CL for the appropriate degrees of freedom
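Equation 3.19 can be sketched in a few lines of Python. The function name `overall_uncertainty` and the numerical inputs below are hypothetical and chosen only for illustration (t = 2.262 is the two-tailed 95% value for 9 degrees of freedom). The loop shows how the overall uncertainty approaches the bias term as the sample size grows.

```python
import math


def overall_uncertainty(bias, std_dev, n, t_value):
    """Overall uncertainty per Eq. 3.19: root-sum-square of bias and random parts."""
    random_part = t_value * std_dev / math.sqrt(n)
    return math.sqrt(bias**2 + random_part**2)


# Hypothetical inputs: bias of 1.0, sample standard deviation of 0.5,
# 10 readings, and t = 2.262 (two-tailed 95% CL, d.f. = 9).
u_x = overall_uncertainty(bias=1.0, std_dev=0.5, n=10, t_value=2.262)
print(f"U_x = {u_x:.3f}")

# With the t-value held fixed for simplicity (an assumption of this sketch),
# more readings shrink only the random part, so U_x levels off at the bias.
for n in (1, 10, 100, 1000):
    print(n, round(overall_uncertainty(1.0, 0.5, n, 2.262), 3))
```

In practice the t-value also decreases toward the z-value as n increases, which reinforces the same trend.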
Example 3.6.2 For a single measurement, the statistical concept of standard deviation does not apply. Nonetheless, one could estimate it from manufacturer’s specifications if available. One wishes to estimate the overall uncertainty at 95% CL in an individual measurement of water flow rate in a pipe under the following conditions:
(a) Full-scale meter reading 150 L/s
(b) Actual flow reading 125 L/s
(c) Random error of instrument is ±6% of full-scale reading at 95% CL
(d) Fixed (bias) error of instrument is ±4% of full-scale reading at 95% CL
The solution is rather simple since all stated uncertainties are at 95% CL. It is implicitly assumed that the normal distribution applies. The random error = 150 × 0.06 = ±9 L/s. The fixed error = 150 × 0.04 = ±6 L/s. The overall uncertainty can be estimated from Eq. 3.19 with n = 1:
$$U_x = (6^2 + 9^2)^{1/2} = \pm 10.82 \text{ L/s}$$
The fractional overall uncertainty at 95% CL is:
$$\frac{U_x}{x} = \frac{10.82}{125} = 0.087 = 8.7\%$$
■
Example 3.6.3 Consider Example 3.6.2. In an effort to reduce the overall uncertainty, 25 readings of the flow are taken instead of only one reading. The resulting uncertainty in this case is determined as follows:
• The bias error remains unchanged at ±6 L/s
• The random error decreases by a factor of $\sqrt{n}$ to $9/(25)^{1/2} = \pm 1.8$ L/s
• Then from Eq. 3.19, the overall uncertainty is: $U_x = (6^2 + 1.8^2)^{1/2} = \pm 6.26$ L/s
• The fractional overall uncertainty at 95% CL = $U_x/x = 6.26/125 = 0.05 = 5.0\%$
Increasing the number of readings from 1 to 25 reduces the absolute uncertainty in the flow measurement from ±10.82 L/s to ±6.26 L/s and the relative uncertainty from ±8.7% to ±5.0%. Because of the large, fixed error, further increase in the number of readings would result in only a small reduction in the overall uncertainty. ■
Example 3.6.4 A flow meter manufacturer stipulates a random error of 5% for his meter at 95.5% CL (i.e., at z = 2). Once installed, the engineer estimates that the bias error due to the placement of the meter in the flow circuit is 2% at 95.5% CL. The flow meter takes a reading every minute, but only the mean value of 15 such measurements is recorded once every 15 min. Estimate the overall uncertainty of the mean of the recorded values at 99% CL.
The bias uncertainty can be associated with the normal tables. From Table A3, z = 2.575 has an associated two-tailed probability of 0.01, which corresponds to the 99% CL. Given that the bias error at 95.5% CL (z = 2) is 2%, the bias uncertainty at 99% CL (z = 2.575) would be 2.575%. Next, the random error at z = 1 is half of that at z = 2, i.e., 2.5%. However, the number of observations is less than 30, and so the Student-t distribution has to be used for the random uncertainty component. From Table A4, the critical t-value = 2.977 for d.f. = 15 - 1 = 14 and two-tailed CL = 99%. Hence, from Eq. 3.19, the overall uncertainty of the recorded values at 99% CL (with all terms expressed in percent) is:
$$U_x = \left[ 2.575^2 + \left( \frac{(2.977)(2.5)}{(15)^{1/2}} \right)^2 \right]^{1/2} = 3.22\% = 0.0322$$
■
Table 3.7 Table for Chauvenet’s criterion of rejecting outliers
| Number of readings n | Deviation ratio dmax/sx |
|---|---|
| 5 | 1.65 |
| 6 | 1.73 |
| 7 | 1.80 |
| 10 | 1.96 |
| 15 | 2.13 |
| 20 | 2.24 |
| 25 | 2.33 |
| 30 | 2.51 |
| 50 | 2.57 |
| 100 | 2.81 |
| 300 | 3.14 |
| 500 | 3.29 |
| 1000 | 3.48 |
3.6.6
Chauvenet’s Statistical Criterion of Data Rejection
The statistical considerations described above can lead to analytical screening methods that can point out data errors not flagged by graphical methods alone. Though several types of rejection criteria have been proposed, perhaps the best known is Chauvenet’s criterion, which is said to provide an objective and quantitative method for data rejection. This criterion, which presumes that the errors are normally distributed and have constant variance, specifies that any reading out of a series of n readings shall be rejected if the magnitude of its deviation $d_{\max}$ from the mean value of the series ($= |x_i - \bar{x}|$) is such that the two-sided probability of occurrence of such a deviation is less than 1/(2n). This criterion should not be applied to small datasets since the Gaussian distribution does not apply in such cases. Chauvenet’s criterion is approximately given by the following regression model:
$$\frac{d_{\max}}{s_x} = 0.8478 + 0.5375 \ln(n) - 0.02309 \left[\ln(n)\right]^2 \tag{3.20}$$
where $s_x$ is the standard deviation of the series and n is the number of data points.9 The deviation ratio for different numbers of readings is more accurately given in Table 3.7. For example, if one takes 15 observations, an observation shall be discarded if its deviation from the mean exceeds a value $d_{\max} = (2.13) s_x$. This data rejection should be done only once; more than one round of elimination using Chauvenet’s criterion is not advised. Note that Chauvenet’s criterion has inherent assumptions that may not be justified. For example, the underlying distribution may not be normal, but could have a longer tail. In such cases, one may be throwing out good data. A more scientific manner of dealing with outliers is not to reject data points but to use either weighted regression or robust regression, where observations farther away from the mean are given less weight than those from the center (see Sects. 5.6 and 9.9).
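A single pass of Chauvenet's criterion might be sketched in Python as follows. The function names and the ten readings are hypothetical. The sketch evaluates the deviation ratio in two ways: from the regression fit of Eq. 3.20, and directly from the normal distribution by finding the deviation whose two-sided tail probability equals 1/(2n), which is one reading of the criterion as stated above.

```python
import math
from statistics import NormalDist, mean, stdev


def chauvenet_ratio_fit(n):
    """Deviation ratio d_max/s_x from the regression fit of Eq. 3.20."""
    ln_n = math.log(n)
    return 0.8478 + 0.5375 * ln_n - 0.02309 * ln_n**2


def chauvenet_ratio_normal(n):
    """Deviation ratio whose two-sided normal tail probability equals 1/(2n)."""
    return NormalDist().inv_cdf(1 - 1 / (4 * n))


def chauvenet_screen(readings):
    """Apply the criterion once; return (kept, rejected) readings."""
    n = len(readings)
    x_bar, s_x = mean(readings), stdev(readings)
    limit = chauvenet_ratio_normal(n) * s_x
    kept = [x for x in readings if abs(x - x_bar) <= limit]
    rejected = [x for x in readings if abs(x - x_bar) > limit]
    return kept, rejected


# Hypothetical readings with one suspicious value.
data = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 14.0]
kept, rejected = chauvenet_screen(data)
print("ratio (fit, n=15):", round(chauvenet_ratio_fit(15), 2))
print("rejected:", rejected)
```

The screening is deliberately applied only once, in line with the advice against repeated rounds of elimination.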
3.7
Propagation of Errors
In many cases, the variable used for data analysis is not directly measured, but values of several associated variables are measured, which are then combined using a functional relationship. The objective of this section is to present the methodology to estimate overall error/uncertainty10 of a functional value y knowing the uncertainties in the individual input variables xi. The random and fixed components, which together constitute the overall error, must be estimated separately. The treatment that follows, though limited to random errors, could also apply to bias errors. It is recommended that the Taylor series method be applied when the errors are relatively small compared to the measurement values (say, 5% or so). When the errors are large (say, over 15%), the Monte Carlo (MC) method (Sect. 3.7.2) is preferable.
9 The regression fit has an R-square of 99.6% and root mean square error (RMSE) = 0.0358; these goodness-of-fit statistical criteria are explained in Sect. 5.3.2.
3.7.1
Taylor Series Method for Cross-Sectional Data
The general approach to estimate the error of a function y = y(x1, x2, . . ., xn), whose independently measured variables are all specified at the same confidence level, is to use the first-order Taylor series expansion (often referred to as the Kline-McClintock propagation of errors equation):
$$\varepsilon_y = \left[ \sum_{i=1}^{n} \left( \frac{\partial y}{\partial x_i} \, \varepsilon_{x,i} \right)^2 \right]^{1/2} \tag{3.21}$$
where:
$\varepsilon_y$ = error in function value
$\varepsilon_{x,i}$ = error in measured quantity $x_i$ (equivalent to $U_x$ in the previous section)
Neglecting terms higher than the first order (implied by a first-order Taylor series expansion), the propagation of error expressions for the basic arithmetic operations are given below. Let $x_1$ and $x_2$ have errors $\varepsilon_{x_1}$ and $\varepsilon_{x_2}$. Then, for the basic arithmetic operations, Eq. 3.21 simplifies to:
Addition or subtraction: For $y = x_1 \pm x_2$
$$\varepsilon_y = \left( \varepsilon_{x_1}^2 + \varepsilon_{x_2}^2 \right)^{1/2} \tag{3.22}$$
Multiplication: For $y = x_1 x_2$
$$\varepsilon_y = (x_1 x_2) \left[ \left( \frac{\varepsilon_{x_1}}{x_1} \right)^2 + \left( \frac{\varepsilon_{x_2}}{x_2} \right)^2 \right]^{1/2} \tag{3.23}$$
Division: For $y = x_1 / x_2$
$$\varepsilon_y = \frac{x_1}{x_2} \left[ \left( \frac{\varepsilon_{x_1}}{x_1} \right)^2 + \left( \frac{\varepsilon_{x_2}}{x_2} \right)^2 \right]^{1/2} \tag{3.24}$$
For functions involving multiplication and division only, it is much simpler to use the following expression than it is to use the more general Eq. 3.21. If $y = x_1 x_2 / x_3$, then the fractional standard deviation is given by:
$$\frac{\varepsilon_y}{y} = \left( \frac{\varepsilon_{x_1}^2}{x_1^2} + \frac{\varepsilon_{x_2}^2}{x_2^2} + \frac{\varepsilon_{x_3}^2}{x_3^2} \right)^{1/2} \tag{3.25}$$
10 This chapter uses the terms “uncertainty” and “error” interchangeably. There is, however, a distinction. The word “error” is usually used in the context of sensors and derived measurements when the bias and random influences are small compared to the magnitude of the observation (say, 5% or less). When these are large, the term “uncertainty” is more appropriate, for example, when population statistics are derived from sample data, or when the error propagation analysis involves very large uncertainties of the individual variables warranting a stochastic approach.
The error or uncertainty in the result depends on the squares of the uncertainties in the independent variables. This means that if the uncertainty in one variable is larger than the uncertainties in the other variables, then it is the largest uncertainty that dominates. To illustrate, suppose there are three variables with an uncertainty of magnitude 1 and one variable with an uncertainty of magnitude 5. The uncertainty in the result would be $(5^2 + 1^2 + 1^2 + 1^2)^{0.5} = (28)^{0.5} = 5.29$. Clearly, the effect of the uncertainty in the single largest variable dominates the others.
An analysis involving relative magnitude of uncertainties plays an important role during the design of an experiment and the procurement of instrumentation. Very little is gained by trying to reduce the “small” uncertainties since it is the “large” ones that dominate. Any improvement in the overall experimental result must be achieved by improving the instrumentation or experimental technique resulting in these relatively large uncertainties. This concept is illustrated below.
Example 3.7.1¹¹ Relative error in Reynolds number for flow in a pipe
Water is flowing in a pipe at a certain measured rate. The temperature of the water is measured, and the viscosity and density are then found from tables of water properties. Determine the probable errors of the Reynolds numbers (Re) at the low- and high-flow conditions given the information assembled in Table 3.8. Recall that $Re = \rho V d / \mu$. Since the function involves multiplication and division only, it is easier to work with Eq. 3.25.
11 Adapted from Schenck (1969), by permission of McGraw-Hill.
Table 3.8 Error table of the four quantities that define the Reynolds number (Example 3.7.1)
| Quantity | Minimum flow | Maximum flow | Random error at full flow (95% CL) | % Error (a), minimum flow | % Error (a), maximum flow |
|---|---|---|---|---|---|
| Velocity, m/s (V) | 1 | 20 | 0.1 | 10 | 0.5 |
| Pipe diameter, m (d) | 0.2 | 0.2 | 0 | 0 | 0 |
| Density, kg/m³ (ρ) | 1000 | 1000 | 1 | 0.1 | 0.1 |
| Viscosity, kg/m-s (μ) | 1.12 × 10⁻³ | 1.12 × 10⁻³ | 0.45 × 10⁻⁵ | 0.4 | 0.4 |
(a) Note that the last two columns under “% Error” are computed from the previous three columns of data
Fig. 3.33 Expected variation in experimental relative error (at 95% CL) with magnitude of Reynolds number (Example 3.7.1). [Scatter plot of relative error in Re versus Reynolds number (Re)]
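At minimum flow condition, the relative error in Re (assuming no error in pipe diameter value) is:
$$\frac{\varepsilon(Re)}{Re} = \left[ \left( \frac{0.1}{1} \right)^2 + \left( \frac{1}{1000} \right)^2 + \left( \frac{0.45}{112} \right)^2 \right]^{1/2} = \left( 0.1^2 + 0.001^2 + 0.004^2 \right)^{1/2} = 0.100 \text{ or } 10\%$$
On the other hand, at maximum flow condition, the percentage error is:
$$\frac{\varepsilon(Re)}{Re} = \left( 0.005^2 + 0.001^2 + 0.004^2 \right)^{1/2} = 0.0065 \text{ or } 0.65\%$$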
The above example reveals that (i) at low-flow conditions the error is 10%, which reduces to 0.65% at high-flow conditions, and (ii) at low-flow conditions the other sources of error are absolutely dwarfed by the 10% error due to flow measurement uncertainty. Thus, the only way to improve the experiment is to improve flow measurement accuracy. If the experiment is run without changes, one can confidently expect the data at the low-flow end to show a broad scatter becoming smaller as the velocity is increased. This phenomenon is captured by the 95% CI shown as relative errors in Fig. 3.33. ■
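The arithmetic of Example 3.7.1 can be reproduced with a short Python sketch of Eq. 3.25. The function and variable names are illustrative assumptions; the numerical values are those listed in Table 3.8, with the pipe diameter taken as error-free.

```python
import math


def fractional_error(relative_errors):
    """Eq. 3.25: root-sum-square of the fractional errors of the factors."""
    return math.sqrt(sum(e**2 for e in relative_errors))


# Absolute random errors from Table 3.8 (pipe diameter assumed error-free).
err_velocity = 0.1        # m/s
err_density = 1.0         # kg/m^3
err_viscosity = 0.45e-5   # kg/m-s

density = 1000.0          # kg/m^3
viscosity = 1.12e-3       # kg/m-s

results = {}
for label, velocity in (("minimum flow", 1.0), ("maximum flow", 20.0)):
    results[label] = fractional_error([
        err_velocity / velocity,
        err_density / density,
        err_viscosity / viscosity,
    ])
    print(f"{label}: relative error in Re = {results[label]:.4f}")
```

Changing `err_velocity` in this sketch is a quick way to see how much a better flow measurement would tighten the low-flow result.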
Equation 3.21 applies when the measured variables are uncorrelated. If they are correlated, their interdependence can be quantified by the covariance (defined by Eq. 3.11). If two variables $x_1$ and $x_2$ are correlated, then the error of their sum is given by:
$$\varepsilon_y = \left[ \varepsilon_{x_1}^2 + \varepsilon_{x_2}^2 + 2 \, x_1 x_2 \, \mathrm{cov}(x_1, x_2) \right]^{1/2} \tag{3.26}$$
Note that the covariance term can assume positive or negative values, and so the combined error can be higher or lower than that for uncorrelated independent variables.
Example 3.7.2 Uncertainty in overall heat transfer coefficient
The equation of the overall heat transfer coefficient U of a heat exchanger consisting of a fluid flowing inside and another fluid flowing outside a steel pipe of negligible thermal resistance is:
$$U = \left( \frac{1}{h_1} + \frac{1}{h_2} \right)^{-1} = \frac{h_1 h_2}{h_1 + h_2} \tag{3.27}$$
where $h_1$ and $h_2$ are the individual coefficients of the two fluids on either side of the pipe. If $h_1$ = 15 W/m² °C with a fractional error of 5% at 95% CL and $h_2$ = 20 W/m² °C with a fractional error of 3%, also at 95% CL, what will be the fractional random uncertainty of the U coefficient at 95% CL assuming bias error to be zero?
In this case, because of the addition term, one has to use the fundamental equation given by Eq. 3.21. In order to use the propagation of error equation, the partial derivatives need to be computed. One could proceed to do so analytically using basic calculus. Then:
$$\frac{\partial U}{\partial h_1} = \frac{h_2 (h_1 + h_2) - h_1 h_2}{(h_1 + h_2)^2} = \frac{h_2^2}{(h_1 + h_2)^2} \tag{3.28}$$
and
$$\frac{\partial U}{\partial h_2} = \frac{h_1 (h_1 + h_2) - h_1 h_2}{(h_1 + h_2)^2} = \frac{h_1^2}{(h_1 + h_2)^2} \tag{3.29}$$
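As a quick check on Eqs. 3.28 and 3.29, the sketch below evaluates the analytical partial derivatives for $h_1$ = 15 and $h_2$ = 20 and compares them with a simple numerical perturbation of Eq. 3.27. The helper names `u_overall` and `central_difference`, and the 1% relative step, are assumptions made for this illustration.

```python
def u_overall(h1, h2):
    """Overall heat transfer coefficient of Eq. 3.27."""
    return h1 * h2 / (h1 + h2)


def central_difference(func, args, index, rel_step=0.01):
    """Central finite-difference estimate of d(func)/d(args[index])."""
    step = rel_step * args[index]
    upper = list(args)
    lower = list(args)
    upper[index] += step
    lower[index] -= step
    return (func(*upper) - func(*lower)) / (2 * step)


h1, h2 = 15.0, 20.0

# Analytical partial derivatives (Eqs. 3.28 and 3.29).
dU_dh1 = h2**2 / (h1 + h2) ** 2
dU_dh2 = h1**2 / (h1 + h2) ** 2

# Numerical estimates for comparison.
dU_dh1_num = central_difference(u_overall, (h1, h2), 0)
dU_dh2_num = central_difference(u_overall, (h1, h2), 1)

print(round(u_overall(h1, h2), 3), round(dU_dh1, 4), round(dU_dh2, 4))
print(round(dU_dh1_num, 4), round(dU_dh2_num, 4))
```

Agreement between the two sets of derivatives is a useful safeguard against algebra slips before the values are used in the propagation equation.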
The absolute uncertainty $\varepsilon_U$ in the overall heat transfer coefficient U is given by Eq. 3.21:
$$\varepsilon_U = \left[ \left( \frac{\partial U}{\partial h_1} \varepsilon_{h_1} \right)^2 + \left( \frac{\partial U}{\partial h_2} \varepsilon_{h_2} \right)^2 \right]^{1/2} \tag{3.30}$$
where $\varepsilon_{h_1}$ and $\varepsilon_{h_2}$ are the errors of the coefficients $h_1$ and $h_2$, respectively, and are determined as:
$$\varepsilon_{h_1} = 0.05 \times 15 = 0.75 \quad \text{and} \quad \varepsilon_{h_2} = 0.03 \times 20 = 0.60$$
Plugging numerical values in the expression for U given by Eq. 3.27, one gets U = 8.571, while the partial derivatives given by Eqs. 3.28 and 3.29 are computed as:
$$\frac{\partial U}{\partial h_1} = 0.3265 \quad \text{and} \quad \frac{\partial U}{\partial h_2} = 0.1837$$
Finally, from Eq. 3.30, the absolute error in the overall heat transfer coefficient U is:
$$\varepsilon_U = \left[ (0.3265 \times 0.75)^2 + (0.1837 \times 0.60)^2 \right]^{1/2} = 0.269 \text{ at 95\% CL}$$
The fractional error in U = (0.269/8.571) = 3.1%. ■
Another method of determining partial derivatives is to adopt a perturbation approach, which allows the local sensitivity or slope of the function to be evaluated numerically. A computer routine can be written to perform this task. One method is based on approximating partial derivatives by a central finite-difference approach. If y = y(x1, x2, . . . xn), then:
$$\frac{\partial y}{\partial x_1} = \frac{y(x_1 + \Delta x_1, x_2, \ldots) - y(x_1 - \Delta x_1, x_2, \ldots)}{2 \, \Delta x_1}$$
$$\frac{\partial y}{\partial x_2} = \frac{y(x_1, x_2 + \Delta x_2, \ldots) - y(x_1, x_2 - \Delta x_2, \ldots)}{2 \, \Delta x_2} \quad \text{etc.} \tag{3.31}$$
No strict rules for the size of the perturbation or step size Δx can be framed since they would depend on the underlying shape of the function. Perturbations in the range of 1–4% of the value are reasonable choices, and one should evaluate the stability of the partial derivative computed numerically by repeating the calculations for a few different step sizes. In cases involving complex experiments with extended debugging phases, one should update the uncertainty analysis whenever a change is made in the data reduction program. Commercial software programs are also available with in-built uncertainty propagation formulae. This procedure is illustrated in Example 3.7.3.
Example 3.7.3 Uncertainty in exponential growth models
Exponential growth models are used to describe several commonly encountered phenomena from population growth to consumption of resources. The amount of resource consumed over time Q(t) can be modeled as:
$$Q(t) = \int_0^t P_0 e^{rt} \, dt = \frac{P_0}{r} \left( e^{rt} - 1 \right) \tag{3.32}$$
where $P_0$ = initial consumption rate, and r = exponential rate of growth. The world coal consumption in 1986 was equal to 5.0 billion (short) tons and the recoverable reserves of coal were estimated at 1000 billion tons.
(a) If the growth rate is assumed to be 2.7% per year, how many years will it take for the total coal reserves to be depleted? Rearranging Eq. 3.32 results in
$$t = \frac{1}{r} \ln \left( 1 + \frac{Q \cdot r}{P_0} \right) \tag{3.33}$$
Or
$$t = \frac{1}{0.027} \ln \left[ 1 + \frac{(1000)(0.027)}{5} \right] = 68.75 \text{ years}$$
(b) Assume that the growth rate r and the recoverable reserves are subject to random uncertainty. If the uncertainties of both quantities are taken to be normal with standard deviation values of 0.2% (absolute) and 10% (relative), respectively, determine the lower and upper estimates of the years to depletion at the 95% CL. While the partial derivatives can be derived analytically in this case, the use of Eq. 3.21 following a numerical approach is adopted for illustration. The pertinent results using Eq. 3.31 with a perturbation multiplier of 1% to both the base values of r (= 0.027) and of Q (= 1000) are assembled in Table 3.9. From here:
$$\frac{\partial t}{\partial r} = \frac{(68.37917 - 69.12924)}{(0.02727 - 0.02673)} = -1389 \quad \text{and} \quad \frac{\partial t}{\partial Q} = \frac{(69.06297 - 68.43795)}{(1010 - 990)} = 0.03125$$
Table 3.9 Numerical computation of the partial derivatives of t with Q and r (Example 3.7.3)
| Multiplier | r (assuming Q = 1000) | t (from Eq. 3.32) | Q (assuming r = 0.027) | t (from Eq. 3.32) |
|---|---|---|---|---|
| 0.99 | 0.02673 | 69.12924 | 990 | 68.43795 |
| 1.00 | 0.027 | 68.75178 | 1000 | 68.75178 |
| 1.01 | 0.02727 | 68.37917 | 1010 | 69.06297 |
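The numerical procedure summarized in Table 3.9 can be sketched in Python as follows. The function names are illustrative assumptions, and the 1% perturbation multiplier mirrors the one used in the table. The sketch also carries the central-difference derivatives through the first-order propagation formula of Eq. 3.21, using the stated standard deviations of Q (10% of 1000) and r (0.002).

```python
import math


def years_to_depletion(reserves, growth_rate, initial_rate=5.0):
    """Eq. 3.33: time for cumulative consumption to reach the reserves."""
    return math.log(1 + reserves * growth_rate / initial_rate) / growth_rate


def central_difference(func, args, index, multiplier=0.01):
    """Central finite-difference derivative (Eq. 3.31) with a relative step."""
    step = multiplier * args[index]
    upper, lower = list(args), list(args)
    upper[index] += step
    lower[index] -= step
    return (func(*upper) - func(*lower)) / (2 * step)


base = (1000.0, 0.027)           # Q (billion tons), r (per year)
errors = (0.1 * 1000.0, 0.002)   # standard deviations of Q and r

# First-order Taylor series propagation (Eq. 3.21).
partials = [central_difference(years_to_depletion, base, i) for i in range(2)]
eps_t = math.sqrt(sum((d * e) ** 2 for d, e in zip(partials, errors)))

t_base = years_to_depletion(*base)
print(f"t = {t_base:.2f} years, dt/dQ = {partials[0]:.5f}, dt/dr = {partials[1]:.0f}")
print(f"eps_t = {eps_t:.3f}")
```

Rerunning the sketch with `multiplier=0.02` is a simple way to test the stability of the numerically derived partial derivatives.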
Then:
$$\varepsilon_t = \left[ \left( \frac{\partial t}{\partial r} \varepsilon_r \right)^2 + \left( \frac{\partial t}{\partial Q} \varepsilon_Q \right)^2 \right]^{1/2} = \left\{ [(-1389)(0.002)]^2 + [(0.03125)(0.1)(1000)]^2 \right\}^{1/2} = \left( 2.778^2 + 3.125^2 \right)^{1/2} = 4.181$$
Thus, the lower and upper limits at the 95% CL (with z = 1.96) are 68.75 ± (1.96)(4.181) = {60.55, 76.94} years.
The analyst should repeat the above procedure with, say, a perturbation multiplier of 2% in order to evaluate the stability of the numerically derived partial derivatives. If these differ substantially, it is urged that the function be plotted and scrutinized for irregular behavior around the point of interest. ■
Example 3.7.4 Selecting instrumentation during the experimental design phase
A general uncertainty analysis is recommended at the planning phase for the purposes of proper instrument selection. This analysis should intentionally be kept simple. It is meant to identify the primary sources of uncertainties, to evaluate the relative weights of different sources, and to perform a sensitivity analysis. An experimental program is being considered involving continuous monitoring of a large chiller under field conditions. The objective of the monitoring is to determine the chiller coefficient-of-performance (COP) on an hourly basis. The fractional uncertainty in the COP should not be greater than 5% at 95% CL. The rated full load is 450 tons of cooling (1 ton = 12,000 Btu/h). The chiller is operated under constant chilled water and condenser water flow rates. Only random errors are to be considered. The COP of a chiller is defined as the ratio of the amount of cooling at the evaporator ($Q_{ch}$) to the electric power (E) consumed:
$$COP = \frac{Q_{ch}}{E} \tag{3.34}$$
While power E can be measured directly, the amount of cooling $Q_{ch}$ has to be determined by individual measurements of the chilled water volumetric flow rate and the difference between the supply and return chilled water temperatures along with water properties:
$$Q_{ch} = \rho V c \Delta T \tag{3.35}$$
where:
ρ = density of water
V = chilled water volumetric flow rate, assumed constant during operation (rated flow rate = 1080 gpm)
c = specific heat of water
ΔT = temperature difference between the entering and leaving chilled water at the evaporator (a quantity that changes during operation)
From Eq. 3.25, the fractional uncertainty in COP (neglecting the small effect of uncertainties in the density and specific heat terms) is:
$$\frac{U_{COP}}{COP} = \left[ \left( \frac{U_V}{V} \right)^2 + \left( \frac{U_{\Delta T}}{\Delta T} \right)^2 + \left( \frac{U_E}{E} \right)^2 \right]^{1/2} \tag{3.36}$$
Note that since this is a preliminary uncertainty analysis, only random errors are considered. The maximum flow reading of the selected meter is 1500 gpm with 4% uncertainty at 95% CL. This leads to an absolute uncertainty of (1500 × 0.04) = 60 gpm. The first term $U_V/V$ is a constant and does not depend on the chiller load since the flow through the evaporator is maintained constant at 1080 gpm. Thus,
$$\left( \frac{U_V}{V} \right)^2 = \left( \frac{60}{1080} \right)^2 = 0.0031 \quad \text{and} \quad \frac{U_V}{V} = \pm 0.056$$
The random error at 95% CL for the type of commercial grade sensor to be used for temperature measurement is 0.2 °F. Consequently, the error in the measurement of temperature difference ΔT = $(0.2^2 + 0.2^2)^{1/2}$ = 0.28 °F. From manufacturer catalogs, the temperature difference between supply and return chilled water temperatures at full load can be assumed to be 10 °F. The fractional uncertainty at full load is then
$$\left( \frac{U_{\Delta T}}{\Delta T} \right)^2 = \left( \frac{0.28}{10} \right)^2 = 0.00078 \quad \text{and} \quad \frac{U_{\Delta T}}{\Delta T} = \pm 0.028$$
The power instrument has a full-scale value of 400 kW with an error of 1% at 95% CL, i.e., an error of 4.0 kW. The chiller rated capacity is 450 tons of cooling, with an assumed realistic lower bound of 0.8 kW per ton of cooling. The anticipated electric draw at full load of the chiller = 0.8 × 450 = 360 kW. The fractional uncertainty at full load is then:
$$\left( \frac{U_E}{E} \right)^2 = \left( \frac{4.0}{360} \right)^2 = 0.00012 \quad \text{and} \quad \frac{U_E}{E} = \pm 0.011$$
Thus, the fractional uncertainty in the power is about one fifth of that in the flow rate. Propagation of the above errors yields the fractional uncertainty at 95% CL at full chiller load of the measured COP (using Eq. 3.36):
$$\frac{U_{COP}}{COP} = (0.0031 + 0.00078 + 0.00012)^{1/2} = 0.063 = 6.3\%$$
It is clear that the fractional uncertainty of the proposed instrumentation is not satisfactory for the intended purpose. Looking at the fractional uncertainties of the individual contributions, the logical remedy is to select a more accurate flow meter or one with a lower maximum flow reading. ■
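The screening calculation of Example 3.7.4 lends itself to a short script, which makes it easy to try alternative instruments. The sketch below uses the instrument data stated in the example. The function name `cop_fractional_uncertainty` and the alternative 2% flow meter are hypothetical and serve only to illustrate the sensitivity analysis.

```python
import math

# Instrument data assumed in Example 3.7.4 (all at 95% CL).
flow_error = 0.04 * 1500.0                   # gpm, 4% of full-scale reading
flow_rate = 1080.0                           # gpm, constant evaporator flow
delta_t_error = math.sqrt(0.2**2 + 0.2**2)   # deg F, two sensors at 0.2 deg F
delta_t = 10.0                               # deg F at full load
power_error = 0.01 * 400.0                   # kW, 1% of full-scale value
power = 0.8 * 450.0                          # kW at full load


def cop_fractional_uncertainty(flow_err):
    """Eq. 3.36 evaluated at full load for a given absolute flow error."""
    terms = (flow_err / flow_rate, delta_t_error / delta_t, power_error / power)
    return math.sqrt(sum(term**2 for term in terms))


u_cop = cop_fractional_uncertainty(flow_error)
print(f"Fractional uncertainty in COP: {u_cop:.3f}")

# Hypothetical alternative: a flow meter with 2% uncertainty of full scale.
u_cop_better = cop_fractional_uncertainty(0.02 * 1500.0)
print(f"With a 2% flow meter: {u_cop_better:.3f}")
```

Under these assumptions, halving the flow meter uncertainty would bring the full-load COP uncertainty below the 5% target, consistent with the remedy suggested in the example.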
3.7.2
Monte Carlo Method for Error Propagation Problems
The previous method of ascertaining errors/uncertainty based on the first-order Taylor series expansion is widely used, but it has limitations. If relative uncertainty is large, this method may be inaccurate for non-linear functions since it assumes first-order derivatives based on local functional behavior. Further, for the CI of the functional variable to have a statistical interpretation, the errors have to be normally distributed. Finally, deriving partial derivatives of complex and interlinked analytical functions (as is common for models involving system simulation of various individual components) is a tedious and error-prone affair. A more general manner of dealing with uncertainty propagation is to use Monte Carlo (MC) methods, which are widely used for a number of complex applications (and treated at more length in Sects. 6.7.3, 10.6.7, and 12.2.8). These are numerical methods for solving problems involving random numbers and require considerations of probability. MC, in essence, is a numerical process of repeatedly calculating a mathematical function in which the input variables and/or the function parameters are random or contain uncertainty with prescribed probability distributions. Specifically, the individual inputs and/or parameters are sampled randomly from their prescribed probability distributions to form one repetition (or run or trial). The corresponding numerical solution is one possible outcome of the function. This process of generating runs is repeated a large number of times, resulting in a distribution of the functional values that can then be represented as probability distributions, or as histograms, or by summary statistics, or by CI for any percentile threshold chosen. Such insights cannot be gained from using the traditional Taylor series error propagation approach.
The MC process is computer intensive and requires thousands of runs to be performed. However, the entire process is simple and easily implemented even on spreadsheet programs (which have in-built functions for generating pseudo-random numbers of selected distributions). Specialized engineering software programs are also available. There is a certain amount of arbitrariness associated with the process because MC simulation is a numerical method. Several authors propose approximate formulae for determining the number of trials, but a simple method is as follows. Start with a large number of trials (say, 1000), and generate pseudo-random numbers with the assumed probability distribution. Since they are pseudo-random, the mean and the distribution (say, the standard deviation) may deviate somewhat from the desired ones (which depend on the accuracy of the algorithm used). Generate a few such sets and pick one whose mean and standard deviation values are closest to the desired quantities. Use this set to simulate the corresponding values of the function. This can be repeated a few times until the mean and standard deviations stabilize around some average values, which can be taken to be the answer. It is also urged that the analyst evaluate the effect on the results of using a different number of trials, say 3000 trials, and ascertain that the results of both the 1000-trial and 3000-trial sets are similar. If they are not, sets with an increasingly large number of trials should be used until the results converge. The approach is best understood by means of a simple example.
Example 3.7.5 Using Monte Carlo (MC) method to determine uncertainty in exponential growth models
Let us solve the problem given in Example 3.7.3 by the MC method.
The approach involves setting up a spreadsheet table as shown in Table 3.10. Since only two variables (namely Q and r) have uncertainty, one only needs to assign two columns to these and a third column to the desired quantity, i.e., time t over which the total coal reserves will be depleted. The first row shows the calculation using the mean values and one sees that the value of t = 68.75 as found in part (a) of Example 3.7.3 is obtained (this is done for verifying the spreadsheet cell formula). The analyst then generates random numbers of Q and r with the corresponding mean and standard deviations as specified and shown in the first row of the table. MC methods, being numerical methods, require that a large sample be generated in order to obtain reliable results.
Table 3.10 The first few and last few calculations used to determine uncertainty in variable t using the Monte Carlo (MC) method
| Run # | Q (1000, 100) | r (0.027, 0.002) | t (years) |
|---|---|---|---|
| 1 | 1000.0000 | 0.0270 | 68.7518 |
| 2 | 1050.8152 | 0.0287 | 72.2582 |
| 3 | 1171.6544 | 0.0269 | 73.6445 |
| 4 | 1098.2454 | 0.0284 | 73.2772 |
| 5 | 1047.5003 | 0.0261 | 69.0848 |
| 6 | 1058.0283 | 0.0247 | 67.7451 |
| 7 | 946.8644 | 0.0283 | 68.5256 |
| 8 | 1075.5269 | 0.0277 | 71.8072 |
| 9 | 967.9137 | 0.0278 | 68.6323 |
| 10 | 1194.7164 | 0.0262 | 73.3758 |
| ... | ... | ... | ... |
| 990 | 1133.6639 | 0.0278 | 73.6712 |
| 991 | 997.0123 | 0.0252 | 66.5173 |
| 992 | 896.6957 | 0.0257 | 63.8175 |
| 993 | 1056.2361 | 0.0283 | 71.9108 |
| 994 | 1033.8229 | 0.0298 | 72.8905 |
| 995 | 1078.6051 | 0.0295 | 73.9569 |
| 996 | 1137.8546 | 0.0276 | 73.4855 |
| 997 | 950.8749 | 0.0263 | 66.3670 |
| 998 | 1023.7800 | 0.0264 | 68.7452 |
| 999 | 950.2093 | 0.0248 | 64.5692 |
| 1000 | 849.0252 | 0.0247 | 61.0231 |
| Mean | 1005.0 | 0.0272 | 68.91 |
| SD | 101.82 | 0.00199 | 3.919 |
The last two rows indicate the mean and standard deviation (SD) of the individual columns (Example 3.7.5)
Fig. 3.34 Pertinent plots of the Monte Carlo (MC) analysis for the time variable t (years) (a) Histogram. (b) Normal probability plot with 95% limits
Often, 1000 normal distribution samples are selected but it is advisable to repeat the analysis a few times for more robust results. For example, instead of having (1000, 100) for the mean and standard deviation of Q, the 1000 samples have (1005.0, 101.82). The corresponding mean and standard deviation of t are found to be (68.91, 3.919) compared to the previously estimated values of (68.75, 4.181). Further, even though normal distributions were assumed for variables Q and r, the functional form for time t results in a non-normal distribution as can be seen by the histogram and the normal probability plots of Fig. 3.34. Hence it is more meaningful to look at the percentiles rather than simply the mean and standard deviation (shown in Table 3.10). These are easily deduced from the MC runs and shown in Table 3.11. Such additional insights into the distribution of the variable t provided by the MC-generated data are certainly advantageous. ■
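The MC procedure of Example 3.7.5 is implemented in a spreadsheet in the text; the same idea might be sketched in Python as shown below. The seed, the number of trials (20,000), and the function name are assumptions made for this illustration, and Q and r are sampled independently. Because a different set of pseudo-random numbers is drawn, the summary statistics will not match those of Table 3.10 exactly.

```python
import math
import random
import statistics


def years_to_depletion(reserves, growth_rate, initial_rate=5.0):
    """Eq. 3.33: time for cumulative consumption to reach the reserves."""
    return math.log(1 + reserves * growth_rate / initial_rate) / growth_rate


rng = random.Random(42)   # fixed seed (arbitrary) so the run is repeatable
n_trials = 20000          # assumed number of trials for this sketch

# Sample Q and r independently from their assumed normal distributions.
t_samples = [
    years_to_depletion(rng.gauss(1000.0, 100.0), rng.gauss(0.027, 0.002))
    for _ in range(n_trials)
]

t_mean = statistics.mean(t_samples)
t_sd = statistics.stdev(t_samples)

# statistics.quantiles with n=100 returns the 1st to 99th percentiles.
percentiles = statistics.quantiles(t_samples, n=100)
p05, p50, p95 = percentiles[4], percentiles[49], percentiles[94]

print(f"mean = {t_mean:.2f}, SD = {t_sd:.2f}")
print(f"5th, 50th, 95th percentiles: {p05:.2f}, {p50:.2f}, {p95:.2f}")
```

Repeating the run with a different seed or a larger `n_trials` is the scripted equivalent of checking that the results have converged.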
Table 3.11 Percentiles for the time variable t (years) determined from the 1000 MC runs of Table 3.10
| % | Percentiles |
|---|---|
| 1.0 | 58.509 |
| 5.0 | 62.331 |
| 10.0 | 63.8257 |
| 25.0 | 66.3155 |
| 50.0 | 68.968 |
| 75.0 | 71.6905 |
| 90.0 | 73.818 |
| 95.0 | 74.9445 |
| 99.0 | 77.051 |
3.8
Planning a Non-Intrusive Field Experiment
One needs to differentiate between two conditions under which data can be collected. On the one hand, one can have a controlled setting where the various variables of interest can be altered by the experimenter. In such a case, referred to as intrusive testing, one can plan an “optimal” experiment where one can adjust the inputs and boundary or initial conditions as well as choose the number and location of the sensors so as to minimize the effect of errors on estimated values of the parameters (treated in Chap. 6). On the other hand, one may be in a situation where one is a mere “spectator,” i.e., the system or phenomenon cannot be controlled, and the data is collected under non-experimental conditions (as is the case of astronomical observations). Such an experimental protocol, known as non-intrusive identification, is usually not the best approach. In certain cases, the driving forces may be so weak or repetitive that even when a “long” dataset is used for identification, a strong enough or varied output signal cannot be elicited for proper statistical treatment (see Sect. 10.3). An intrusive or controlled experimental protocol, wherein the system is artificially stressed to elicit a strong response, is more likely to yield robust and accurate models and their parameter estimates. However, in some cases, the type and operation of the system may not allow such intrusive experiments to be performed. Further, one should appreciate differences between measurements made in a laboratory setting and in the field. The potential for errors, both bias and random, is usually much greater in the latter. Not only can measurements made on a piece of laboratory equipment be better designed and closely controlled, but they will be more accurate as well because more expensive sensors can be selected and placed correctly in the system. 
For example, proper flow measurement requires that the flow meter be placed 30 pipe diameters after a bend to ensure that the flow profile is well established. A laboratory setup can be designed accordingly, while field conditions may not allow such requirements to be met satisfactorily. Further, systems being operated in the field may not allow controlled tests to be performed, and one has to develop a model or make decisions based on what one can observe.
Any experiment should be well planned and involve several rational steps (e.g., ascertaining that the right sensors and equipment are chosen, that the right data collection protocol and scheme are followed, and that the appropriate data analysis procedures are selected). It is advisable to explicitly adhere to the following steps (ASHRAE 2005):
(i) Identify experimental goals and the acceptable accuracy that can be achieved within the time and budget available for the experiment.
(ii) Identify the entire list of measurable variables and relationships. If some are difficult to measure, find alternative variables.
(iii) Establish measured variables and limits (theoretical limits and expected bounds) to match the selected instrument limits. Also, determine instrument limits: all sensors and measurement instruments have physical limits that restrict their ability to accurately measure quantities of interest.
(iv) Preliminary instrumentation selection should be based on the accuracy, repeatability, and features of the instrument, as well as cost. Regardless of the instrument chosen, it should have been calibrated within the last 12 months or within the interval required by the manufacturer, whichever is less. The required accuracy of the instrument will depend upon the acceptable level of uncertainty for the experiment.
(v) Document the uncertainty of each measured variable using information gathered from manufacturers or past experience with specific instrumentation. This information will then be used in estimating the overall uncertainty of results using propagation of error methods.
(vi) Perform preliminary uncertainty analysis of proposed measurement procedures and experimental methodology. This should be completed before the procedures and methodology are finalized in order to estimate the uncertainty in the final results. The higher the accuracy required of measurements, the higher the accuracy of sensors needed to obtain the raw data. The uncertainty analysis is the basis for selection of a measurement system that provides acceptable uncertainty at least cost. How to perform such a preliminary uncertainty analysis was illustrated by Example 3.7.4.
(vii) Final instrument selection and methods should be based on the results of the preliminary uncertainty analysis and selection of instrumentation. Revise selection, if necessary, to achieve the acceptable uncertainty in the experiment results.
(viii) Install instrumentation in accordance with manufacturer’s recommendations. Any deviation in the installation from the manufacturer’s recommendations should be documented and the effects of the deviation on instrument performance evaluated. A change in instrumentation or location may be required if in situ uncertainty exceeds acceptable limits determined by the preliminary uncertainty analysis.
(ix) Perform initial data quality verification to ensure that the measurements taken are not too uncertain and represent reality. Instrument calibration and independent checks of the data are recommended. Independent checks can include sensor validation, energy balances, and material balances (see Sect. 3.3).
(x) Collect data. The challenge for data acquisition in any experiment is to collect the required amount of information while avoiding collection of superfluous information. Superfluous information can overwhelm simple measures taken to follow the progress of an experiment and can complicate data analysis and report generation. The relationship between the desired result (static, periodic stationary, or transient) and time is the determining factor for how much information is required. A static, non-changing result requires only the steady-state result and proof that all transients have died out. A periodic stationary result, the simplest dynamic result, requires information for one period and proof that the one selected is one of three consecutive periods with identical results within acceptable uncertainty. Transient or non-repetitive results, whether a single pulse or a continuing, random result, require the most information. Regardless of the result, the dynamic characteristics of the measuring system and the full transient nature of the result must be documented for some relatively short interval of time.
Identifying good models requires a certain amount of diversity in the data, i.e., the data should cover the spatial domain of variation of the independent variables. Some basic suggestions pertinent to controlled experiments are summarized below, which are also pertinent for non-intrusive data collection.
(a) Range of variability: The most obvious way in which an experimental plan can be made compact and efficient is to space the variables in a predetermined manner. If a functional relationship between an independent variable X and a dependent variable Y is sought, the most obvious way is to select end points or limits of the test, thus covering the test envelope or domain that encloses the complete family of data. For a model of the type Z = f(X,Y), a plane area or map is formed (see Fig. 3.35). Functions involving more variables are usually broken down to a series of maps. The above discussion relates to controllable regressor variables. Extraneous variables, by their very nature, cannot be varied at will. Examples are phenomena driven by climatic variables. The energy use of a building is affected, among others, by outdoor dry-bulb temperature, humidity, and solar radiation. Since these cannot be varied at will, a proper experimental data collection plan would entail collecting data during different seasons of the year.
Fig. 3.35 A possible XYZ envelope with X and Z as the independent variables. The dashed lines enclose the total family of points over the feasible domain space
(b) Grid spacing considerations: Once the domains or ranges of variation of the variables are defined, the next step is to select the grid spacing. Being able to anticipate the system behavior from theory or from prior publications would lead to a better experimental design. For a relationship between X and Y that is known to be linear, the optimal grid is to space the points at the two extremities. However, if a linear relationship between X and Y is sought for a phenomenon that can only be approximated as linear, then it would be best to space the X points evenly. For non-linear or polynomial functions, an equally spaced test sequence in X is clearly not optimal. Consider the pressure drop through a new fitting as a function of flow. It is known that the relationship is quadratic. Choosing an experiment with equally spaced X values would result in a plot such as that shown in Fig. 3.36a. One would have more observations in the low-pressure drop region and fewer in the higher range. One may argue that an optimal spacing would be to select the velocity values such that the pressure drop readings are more or less evenly spaced (see Fig. 3.36b). Which one of the two is better depends on the instrument precision. If the pressure drop instrument has constant uncertainty over the entire range of variation of the experiment, then test spacing as shown in Fig. 3.36b is better because it is uniform. But if the fractional uncertainty of the instrument decreases with increasing pressure drop values, then the points at the lower end have higher uncertainty. In that case it is better to take more readings at the lower end as for the spacing sequence shown in Fig. 3.36a.
Fig. 3.36 Two different experimental designs for proper identification of the parameter (k) appearing in the model for pressure drop versus velocity of a fluid flowing through a pipe assuming ΔP = kV². The grid spacing shown in (a) is the more common one based on equal increments in the regressor variable, while that in (b) is likely to yield more robust estimation but would require guess-estimating the range of variation for the pressure drop
(xi) Accomplish data reduction and analysis, which involves the distillation of raw data into a form that is usable for further analysis. It may involve averaging multiple measurements, quantifying necessary conditions (e.g., steady-state), comparing with physical limits or expected ranges, and rejecting outlying measurements.
(xii) Perform final uncertainty analysis, which is done after the entire experiment has been completed and when the results of the experiments are to be documented or reported. This will take into account unknown field effects and variances in instrument accuracy during the experiment. A final uncertainty analysis involves the following steps: (i) estimate the fixed (bias) error based upon instrumentation calibration results, and (ii) document the random error due to the instrumentation based upon instrumentation calibration results. The fixed errors needed for the detailed uncertainty analysis are usually more difficult to estimate with a high degree of certainty. Minimizing fixed errors can be accomplished by careful calibration with referenced standards.
(xiii) Reporting results is the primary means of communication. Different audiences require different reports with various levels of detail and background information. The report should be structured to clearly explain the goals of the experiment and the evidence gathered to achieve the goals. It should describe the data reduction, data analysis, and uncertainty analysis performed. Graphical and mathematical representations are often used. On graphs, error bars placed vertically and horizontally on representative points are a very clear way to present expected uncertainty. A data analysis section and a conclusion are critical sections and should be prepared with great care while being succinct and clear.
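The two grid layouts discussed under the grid spacing considerations above (Fig. 3.36a and b) can be sketched numerically. The following is only an illustrative sketch, assuming the quadratic model ΔP = kV² and a test range specified in terms of velocity; the function name `velocity_grids` and the numerical limits are hypothetical and do not come from the text.

```python
import numpy as np

def velocity_grids(v_min, v_max, n_points):
    """Return two candidate velocity grids for the model dP = k * V**2.

    equal_v : velocities equally spaced in V (layout of Fig. 3.36a)
    equal_dp: velocities chosen so that dP is equally spaced (layout of Fig. 3.36b)
    """
    equal_v = np.linspace(v_min, v_max, n_points)
    # Equal increments in dP = k*V^2 are equal increments in V^2, so the
    # constant k cancels when the limits are given in terms of velocity.
    equal_dp = np.sqrt(np.linspace(v_min**2, v_max**2, n_points))
    return equal_v, equal_dp

# Hypothetical test range (arbitrary velocity units)
equal_v, equal_dp = velocity_grids(v_min=1.0, v_max=5.0, n_points=5)
print("Equal spacing in V :", np.round(equal_v, 3))
print("Equal spacing in dP:", np.round(equal_dp, 3))
```

In the second grid the velocity points crowd toward the upper end of the range, which is what produces evenly spaced pressure drop readings. Which layout is preferable still depends on how the instrument uncertainty varies over the range, as noted in the text.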
Problems
Pr. 3.1 Consider the data given in Table 3.2. Determine:
(a) The 10% trimmed mean value.
(b) Which observations can be considered to be “mild” outliers (>1.5 × IQR)?
(c) Which observations can be considered to be “extreme” outliers (>3.0 × IQR)?
(d) Identify outliers using Chauvenet’s criterion given by Eq. 3.20. Compare them with those obtained by using Table 3.7.
(e) Compare the results from (b), (c), and (d).
(f) Generate a few plots (such as Fig. 3.16) using your statistical software. Other types of plots can be generated but they should be relevant.
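As a starting point for parts (b) and (c), the quartile-based screening can be sketched as below. This is a minimal sketch under stated assumptions: distances are measured from the nearest quartile (below Q1 or above Q3), quartiles use NumPy's default linear interpolation, and the sample values are made up for illustration (they are not the data of Table 3.2). The function name `classify_outliers` is hypothetical.

```python
import numpy as np

def classify_outliers(values):
    """Split observations into mild and extreme outliers using IQR fences.

    Assumed convention: an observation is "mild" if it lies more than
    1.5*IQR (but not more than 3.0*IQR) outside the quartile box, and
    "extreme" if it lies more than 3.0*IQR outside it.
    """
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])   # default linear interpolation
    iqr = q3 - q1
    # Distance outside the box; negative or zero for points inside it
    dist = np.maximum(q1 - x, x - q3)
    extreme = dist > 3.0 * iqr
    mild = (dist > 1.5 * iqr) & ~extreme
    return x[mild], x[extreme]

# Made-up sample, for illustration only
sample = [10, 11, 12, 12, 13, 13, 14, 15, 22, 40]
mild, extreme = classify_outliers(sample)
print("Mild outliers   :", mild)
print("Extreme outliers:", extreme)
```

For this made-up sample the value 22 is flagged as a mild outlier and 40 as an extreme one. Other quartile conventions (statistical packages differ) can shift the fences slightly for small samples.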
Pr. 3.2 Consider the data given in Table 3.6. Perform an exploratory data analysis involving pertinent statistical summary measures and generate at least three pertinent graphical plots along with a discussion of findings.

Pr. 3.3¹² A nuclear power facility produces a vast amount of heat that is usually discharged into the aquatic system. This heat raises the temperature of the aquatic system resulting in a greater concentration of chlorophyll that in turn extends the growing season. To study this effect, water samples were collected monthly at three stations for one year. Station A is located closest to the hot water discharge, and Station C the farthest (Table 3.12). You are asked to perform the following tasks with the time-series data and annotate with pertinent comments:
(a) Flag any outlier points: (i) visually, (ii) using box-and-whisker plots, and (iii) following Chauvenet’s criterion.
(b) Compute pertinent statistical descriptive measures (after removing outliers).
(c) Generate at least two pertinent graphical plots and discuss the relevance of these plots in terms of the insights they provide.
(d) Compute the covariance and correlation coefficients between the three stations and draw relevant conclusions (do this both without and with outlier rejection).

Table 3.12 Data table for Problem 3.3
Month      Station A  Station B  Station C
January     9.867      3.723      4.410
February   14.035      8.416     11.100
March      10.700     12.723      4.470
April      13.853      9.168      8.010
May         7.067      4.778     14.080
June       11.670      9.145      8.990
July        7.357      8.463      3.350
August      3.358      4.086      4.500
September   4.210      4.233      6.830
October     3.630      2.320      5.800
November    2.953      3.843      3.480
December    2.640      3.610      3.020
¹² Data available electronically on book website.

Pr. 3.4¹³ Consider a basic indirect heat exchanger where heat exchange rates associated with the cold and hot fluid flow sides are given by:
$$Q_{actual} = m_c \, c_{pc} \, (T_{c,o} - T_{c,i}) \quad \text{(cold side heating)}$$
$$Q_{actual} = m_h \, c_{ph} \, (T_{h,i} - T_{h,o}) \quad \text{(hot side cooling)} \tag{3.37}$$
where m, T, and cp are the mass flow rate, temperature, and specific heat, respectively. The subscripts o and i stand for outlet and inlet, and c and h denote cold and hot streams, respectively. Assume the values and uncertainties of various parameters shown in Table 3.13:
(a) Compute the heat exchanger loads and the uncertainty ranges for the hot and cold sides assuming all variables to be uncorrelated.
(b) What would you conclude regarding the uncertainty around the heat balance checks?
(c) It has been found that the inlet hot fluid temperature is correlated with the hot fluid mass flow rate with a correlation coefficient of –0.6 (i.e., when the hot fluid flow rate increases, the corresponding inlet temperature decreases). How would your results for (a) and (b) change? Provide a short discussion.
Hint: The standard deviations can be taken to be half of the 95% uncertainty values listed in Table 3.13.

Table 3.13 Parameters and uncertainties to be assumed (Pr. 3.4)
Parameter  Nominal value   95% Uncertainty
cpc        1 Btu/lb°F      ±5%
mc         475,800 lb/h    ±10%
Tc,i       34 °F           ±1 °F
Tc,o       46 °F           ±1 °F
cph        0.9 Btu/lb°F    ±5%
mh         450,000 lb/h    ±10%
Th,i       55 °F           ±1 °F
Th,o       40 °F           ±1 °F
¹³ From ASHRAE-G2 (2005) © American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., www.ashrae.org.

Pr. 3.5 Consider Example 3.7.4 where the uncertainty analysis on chiller COP was done at full-load conditions. What about part-load conditions, especially since there is no collected data? One could use data from chiller manufacturer catalogs for a similar type of chiller, or one could assume that part-load operation will affect the inlet minus the outlet chilled water temperatures (ΔT) in a proportional manner, as stated below.
(a) Compute the 95% CL uncertainty in the COP at 70% and 40% full load assuming the evaporator water flow rate to be constant. At part load, the evaporator temperature difference is reduced proportionately to the chiller load, while the electric power drawn is assumed to increase from a full-load value of 0.8 kW/t to 1.0 kW/t at 70% full load and to 1.2 kW/t at 40% full load.
(b) Would the instrumentation be adequate or would it be prudent to consider better instrumentation if the fractional COP uncertainty at 95% CL is to be less than 10%?
(c) Note that fixed (bias) errors have been omitted from the analysis, and some of the assumptions in predicting part-load chiller performance can be questioned. A similar exercise with slight variations in some of the assumptions, called a sensitivity study, would be prudent at this stage. How would you conduct such an investigation?

Pr. 3.6 Determining cooling coil degradation based on effectiveness
The thermal performance of a cooling coil can also be characterized by the concept of effectiveness widely used for thermal modeling of traditional heat exchangers. In such coils, a stream of humid air flows across a coil supplied by chilled water and is cooled and dehumidified as a result. In this case, the effectiveness can be determined as:
$$\varepsilon = \frac{\text{actual heat transfer rate}}{\text{maximum possible heat transfer rate}} = \frac{(h_{ai} - h_{ao})}{(h_{ai} - h_{ci})} \tag{3.38}$$
where hai and hao are the enthalpies of the air stream at the inlet and outlet, respectively, and hci is the enthalpy of entering chilled water. The effectiveness is independent of the operating conditions provided the mass flow rates of air and chilled water remain constant. An HVAC engineer would like to determine whether the coil has degraded after it has been in service for a few years. For this purpose, she measures the current coil performance at air and water flow rates identical to those when originally installed as shown in Table 3.14. Note that the uncertainty in determining the air enthalpies is relatively large due to the uncertainty associated with measuring bulk air stream temperatures and humidities. However, the uncertainty in the enthalpy of the chilled water is only half of that of air.
(a) Assess, at 95% CL, whether the cooling coil has degraded or not. Clearly state any assumptions you make during the evaluation.
(b) What are the relative contributions of the uncertainties in the three enthalpy quantities to the uncertainty in the effectiveness value? Do these differ from the installed period to the time when current tests were performed?

Table 3.14 Data table for Problem 3.6
Quantity                        Units   When installed  Current  95% Uncertainty
Entering air enthalpy (hai)     Btu/lb  38.7            36.8     5%
Leaving air enthalpy (hao)      Btu/lb  27.2            28.2     5%
Entering water enthalpy (hci)   Btu/lb  23.2            21.5     2.5%

Pr. 3.7 Table 3.15 assembles values of the total electricity generated by five different types of primary energy sources and their associated total emissions (EIA 1999). Clearly, coal and oil generate a lot of emissions of pollutants, which are harmful not only to the environment but also to public health.
Table 3.15 Data table for Problem 3.7: US power generation mix and associated pollutant emissions (emissions in short tons, = 2000 lb/t)
Fuel         Electricity, kWh (1999)  % Total  SO2       NOx       CO2
Coal         1.77E+12                 55.7     1.13E+07  6.55E+06  1.90E+09
Oil          8.69E+10                 2.7      6.70E+05  1.23E+05  9.18E+07
Natural gas  2.96E+11                 9.3      2.00E+03  3.76E+05  1.99E+08
Nuclear      7.25E+11                 22.8     0.00E+00  0.00E+00  0.00E+00
Hydro/Wind   3.00E+11                 9.4      0.00E+00  0.00E+00  0.00E+00
Totals       3.18E+12                 100.0    1.20E+07  7.05E+06  2.19E+09
Data available electronically on book website

France, on the other hand, has a mix of 21% coal and 79% nuclear.
(a) Calculate the total and percentage reductions in the three pollutants should the United States change its power generation mix to mimic that of France (Hint: First normalize the emissions per kWh for all three pollutants).
(b) The relative uncertainties of the three pollutants SO2, NOx, and CO2 are 10%, 12%, and 5%, respectively. Assuming log-normal distributions for all quantities, compute the uncertainty in the total reductions of the three pollutants estimated in (a) above.

Pr. 3.8 Uncertainty in savings from energy conservation retrofits
There is great interest in implementing retrofit measures meant to conserve energy in individual devices as well as in buildings. These measures must be justified economically, and including uncertainty in the estimated energy savings is an important element of the analysis. Consider the rather simple problem involving replacing an existing electric motor with a more energy-efficient one. The annual energy savings Esave in kWh/year are given by:
$$E_{save} = (0.746)\,(HP)\,(Hours)\left[\frac{1}{\eta_{old}} - \frac{1}{\eta_{new}}\right] \tag{3.39}$$
with the symbols described in Table 3.16 along with their numerical values.
(a) Determine the absolute and relative uncertainties in Esave under these conditions.
(b) If this uncertainty had to be reduced, which variable will you target for further refinement?
(c) What is the minimum value of ηnew at which the lower bound of the 95% CL interval of Esave is greater than zero?

Table 3.16 Data table for Problem 3.8
Symbol  Description                               Value  95% Uncertainty
HP      Horsepower of the end-use device          40     5%
Hours   Number of operating hours in the year     6500   10%
ηold    Efficiency of the old motor               0.85   4%
ηnew    Efficiency of the new motor               0.92   2%

Fig. 3.37 Sketch of an all-air HVAC system supplying conditioned air to indoor zones/rooms of a building (Problem 3.9). Labels in the sketch: outdoor air (OA), mixed air (MA), return air (RA), air-handler unit, and supply to building zones

Pr. 3.9 Uncertainty in estimating outdoor air fraction in HVAC systems
Ducts in heating, ventilating, and air-conditioning (HVAC) systems supply conditioned air (SA) to the various spaces in a building, and also exhaust the air from these spaces, called return air (RA). A sketch of an all-air HVAC system is shown in Fig. 3.37. Occupant health requires that a certain amount of outdoor air (OA) be brought into the HVAC systems while an equal amount of return air be exhausted to the outdoors. The OA and the RA mix at a point just before the air-handler unit. Outdoor air ducts have dampers installed to control the OA since excess OA leads to unnecessary energy wastage. One of the causes for recent complaints from occupants has been identified as inadequate OA, and sensors installed inside the ducts could modulate the dampers accordingly. Flow measurement is always problematic on a continuous basis. Hence, OA flow is inferred from measurements of the air temperature TR inside the RA stream, of TO inside the OA stream, and TM inside the mixed air (MA) stream. The supply air is deduced by measuring the fan speed with a tachometer,
using a differential pressure gauge to measure static pressure rise, and using the manufacturer’s equation for the fan curve. The random error of the sensors is 0.2 °F at 95% CL with negligible bias error.
(a) From a sensible heat balance (with changes in specific heat with temperature neglected), derive the following expression for the outdoor air fraction (ratio of outdoor air to mixed air): $OA_f = (T_R - T_M)/(T_R - T_O)$.
(b) Derive the expression for the uncertainty in OAf and calculate the 95% CL uncertainty in OAf if TR = 70 °F, TO = 90 °F, and TM = 75 °F.

Pr. 3.10 Sensor placement in HVAC ducts with consideration of flow non-uniformity
Consider the same situation as in Pr. 3.9. Usually, the air ducts have large cross-sections. The problem with inferring outdoor air flow using temperature measurements is the large thermal non-uniformity usually present in these ducts due to both stream separation and turbulence effects. Moreover, temperature (and, hence density) differences between the OA and MA streams result in poor mixing. Table 3.17 gives the results of a traverse in the mixed air duct with nine measurements (using an equally spaced grid of 3 × 3 designated by numbers in bold in Table 3.17). The measurements were replicated four times under the same outdoor conditions. The random error of the sensors is 0.2 °F at 95% CL with negligible bias error. Determine:
(a) The worst and best grid locations for placing a single sensor (to be determined based on analyzing the recordings at each of the nine grid locations and for all four time periods).
(b) The maximum and minimum errors at 95% CL one could expect in the average temperature across the duct cross-section, if the best grid location for the single sensor was adopted.

Table 3.17 Table showing the temperature readings (in °F) at the nine different sections (S#1–S#9) of the mixed air (MA) duct (Pr. 3.10)
S#1: 55.6, 54.6, 55.8, 54.2   S#2: 56.3, 58.5, 57.6, 63.8   S#3: 53.7, 50.2, 59.0, 49.4
S#4: 66.4, 67.8, 68.7, 67.6   S#5: 58.0, 62.4, 62.3, 65.8   S#6: 61.2, 56.3, 64.7, 58.8
S#7: 63.5, 65.0, 63.6, 64.8   S#8: 67.4, 67.4, 66.8, 65.7   S#9: 63.9, 61.4, 62.4, 60.6
Data available electronically on book website

Pr. 3.11 Consider the uncertainty in the heat transfer coefficient illustrated in Example 3.7.2. The example was solved
analytically using the Taylor series approach. You are asked to solve the same example using the Monte Carlo (MC) method:
(a) Using 100 data points
(b) Using 1000 data points
Compare the results from this approach with those in the solved example. Also determine the 2.5th and the 97.5th percentile bounds and compare results.

Pr. 3.12 You will repeat Example 3.7.3 involving uncertainty in exponential growth using the Monte Carlo (MC) method.
(a) Instead of computing the standard deviation, plot the distribution of the time variable t to evaluate its shape for 100 trials. Generate probability plots (such as q-q plots) against a couple of the promising distributions.
(b) Determine the mean, median, 25th percentile, 75th percentile, and the 2.5th and 97.5th percentiles.
(c) Compare the 2.5th and 97.5th percentile values to the corresponding values in Example 3.7.3.
(d) Generate the box-and-whisker plot and compare it with the results of (b).

Pr. 3.13 In 2015, the United States had about 1000 GW of installed electricity generation capacity.
(a) Assuming an exponential electric growth rate of 1% per year, what would be the needed electricity generation capacity in 2050?
(b) If the penetration target set for renewables and energy conservation is 75% of the electricity capacity in 2050, what should be their needed annual growth rate (taken to be exponential)? Assume an initial value of 14% renewable capacity in 2015 (which includes hydropower).
(c) If the electric growth rate assumed in (a) has an uncertainty of 15% (taken to be normally distributed), calculate the uncertainty associated with the growth rate computed in (b).

Pr. 3.14 Uncertainty in the estimation of biological dose over time for an individual
Consider an occupant inside a building in which a biological agent has been accidentally released. The dose (D) is the cumulative amount of the agent to which the human body is subjected, while the response is the measurable physiological change produced by the agent.
The widely accepted approach for quantifying dose is to assume functional forms based on first-order kinetics. For biological and radiological agents where the process of harm being done is cumulative, one can use Haber’s law (Heinsohn and Cimbala 2003):
$$D(t) = k \int_{t_1}^{t_2} C(t)\,dt \tag{3.40}$$
where C(t) is the indoor concentration at a given time t, k is a constant that includes effects such as the occupant breathing rate, the absorption efficiency of the agent or species, etc., and t1 and t2 are the start and end times. This relationship is often used to determine health-related exposure guidelines for toxic substances. For a simple one-zone building, the free response, i.e., the temporal decay given in terms of the initial concentration C(t1), is determined by:
$$C(t) = C(t_1)\,\exp[-a(t - t_1)] \tag{3.41}$$
where the model parameter a is a function of the volume of the space and the outdoor and supply air flow rates. The above equation is easy to integrate during any time period from t1 to t2, thus providing a convenient means of computing total occupant inhaled dose when occupants enter or leave the contaminated zones at arbitrary times. Let a = 0.017186 with 11.7% uncertainty while C(t1) = 7000 cfu/m3 (cfu: colony forming units) with 15% uncertainty. Assume k = 1.
(a) Determine the total dose to which the individual is exposed at the end of 15 min.
(b) Compute the uncertainty of the corresponding total dose in absolute and relative terms.
Solve this problem following both the numerical partial derivative method and the Monte Carlo (MC) method and compare results.

Pr. 3.15 Propagation of optical and tracking errors in solar concentrators
Solar concentrators are optical devices meant to increase the incident solar radiation flux density (power per unit area) on a receiver. Separating the solar collection component (viz., the reflector) and the receiver can allow heat losses per collection area to be reduced. This would result in higher fluid operating temperatures at the receiver. However, there are several sources of errors that lead to optical losses:
(i) Due to non-specular or diffuse reflection from the reflector, which could be due to improper curvature of the reflector surface during manufacture (shown in Fig. 3.38), or to progressive dust accumulation over the surface over time as the system operates in the field.
(ii) Due to tracking errors arising from improper tracking mechanisms as a result of improper alignment sensors or non-uniformity in drive mechanisms (usually, the tracking is not continuous; a sensor activates a motor every few minutes, which re-aligns the reflector to the solar disk as it moves in the sky). The result is a spread in the reflected radiation as illustrated in Fig. 3.38b.
(iii) Improper reflector and receiver alignment during the initial mounting of the structure or due to small ground/pedestal settling over time.
The above errors are characterized by root mean square random errors (or RMSE) and their combined effect can be determined statistically following the basic propagation of errors formula. Bias errors such as that arising from structural mismatch can be partially corrected by one-time or regular corrections and are not considered. Note that these errors need not be normally distributed, but such an assumption is often made in practice. Thus, RMSE values representing the standard deviations of these errors are often used for such types of analysis.
(a) You will analyze the absolute and relative effects of this source of radiation spread at the receiver considering various other optical errors described above, using the numerical values shown in Table 3.18.
$$\sigma_{totalspread} = \left[ (\sigma_{solardisk})^2 + (2\sigma_{manuf})^2 + (2\sigma_{dustbuild})^2 + (2\sigma_{sensor})^2 + (2\sigma_{drive})^2 + (\sigma_{rec\text{-}misalign})^2 \right]^{1/2} \tag{3.42}$$
(b) Plot the variation of the total error as a function of the tracker drive non-uniformity error for three discrete values of dust buildup (0, 1, and 2 mrad). Note that the finite angular size of the solar disk results in incident solar rays that are not parallel but subtend an angle of about 33 min or 9.6 mrad.

Fig. 3.38 Different types of optical and tracking errors (Problem 3.15). (a) Micro-roughness in solar concentrator surface leads to a spread in the reflected radiation. The roughness is illustrated as a dotted line for the ideal reflector surface and as a solid line for the actual surface. (b) Tracking errors lead to a spread in incoming solar radiation shown as a normal distribution. Note that a tracker error of σtrack results in a reflection error σreflec = 2σtrack from Snell’s law. Factor of 2 also applies to other sources based on the error occurring as light both enters and leaves the optical device (see Eq. 3.42)

Table 3.18 Data table for Problem 3.15
Component   Source of error        RMSE error, fixed value  Variation over time
Solar disk  Finite angular size    9.6 mrad                 –
Reflector   Curvature manufacture  1.0 mrad                 –
Reflector   Dust buildup           –                        0–2 mrad
Tracker     Sensor misalignment    2.0 mrad                 –
Tracker     Drive non-uniformity   –                        0–10 mrad
Receiver    Misalignment           2.0 mrad                 –

References
Abbas, M., and J.S. Haberl, 1994. Development of indices for browsing large building energy databases, Proc. Ninth Symp. Improving Building Systems in Hot and Humid Climates, pp. 166–181, Dallas, TX, May.
ASHRAE G-14, 2002. Guideline 14–2002: Measurement of Energy and Demand Savings, American Society of Heating, Refrigerating and Air-Conditioning Engineers, Atlanta.
ASHRAE G-2, 2005. Guideline 2–2005: Engineering Analysis of Experimental Data, American Society of Heating, Refrigerating and Air-Conditioning Engineers, Atlanta, GA.
ASME PTC 19.1, 2018. Test Uncertainty, American Society of Mechanical Engineers, New York, NY.
Ayyub, B.M. and R.H. McCuen, 1996. Numerical Methods for Engineers, Prentice-Hall, Upper Saddle River, NJ.
Baltazar, J.C., D.E. Claridge, J. Ji, H. Masuda and S. Deng, 2012. Use of First Law Energy Balance as a Screening Tool for Building Energy Data, Part 2: Experiences on its Implementation As a Data Quality Control Tool, ASHRAE Trans., Vol. 118, Pt. 1, Conf. Paper CH-12C021, pp. 167–174, January.
Braun, J.E., S.A. Klein, J.W. Mitchell and W.A. Beckman, 1989. Methodologies for optimal control of chilled water systems without storage, ASHRAE Trans., 95(1), American Society of Heating, Refrigerating and Air-Conditioning Engineers, Atlanta, GA.
Cleveland, W.S., 1985. The Elements of Graphing Data, Wadsworth and Brooks/Cole, Pacific Grove, California.
Devore J., and N. Farnum, 2005. Applied Statistics for Engineers and Scientists, 2nd Ed., Thomson Brooks/Cole, Australia.
Doebelin, E.O., 1995. Measurement Systems: Application and Design, 4th Edition, McGraw-Hill, New York.
EIA, 1999. Electric Power Annual 1999, Vol. II, October 2000, DOE/EIA-0348(99)/2, Energy Information Administration, US DOE, Washington, D.C. 20585–065 http://www.eia.doe.gov/eneaf/electricity/epav2/epav2.pdf.
Glaser, D. and S. Ubbelohde, 2001. Visualization for time dependent building simulation, 7th IBPSA Conference, pp. 423–429, Rio de Janeiro, Brazil, Aug. 13–15.
Haberl, J.S. and M. Abbas, 1998. Development of graphical indices for viewing building energy data: Part I and Part II, ASME J. Solar Energy Engg., vol. 120, pp. 156–167.
Hawkins, D., 1980. Identification of Outliers, Chapman and Hall, Kluwer Academic Publishers, Boston/Dordrecht/London.
Heiberger, R.M. and B. Holland, 2015. Statistical Analysis and Data Display, 2nd Ed., Springer, New York.
Heinsohn, R.J. and J.M. Cimbala, 2003. Indoor Air Quality Engineering, Marcel Dekker, New York, NY.
Hoaglin, D.C., F. Mosteller and J.W. Tukey (Eds.), 1983. Understanding Robust and Exploratory Data Analysis, John Wiley and Sons, New York.
Holman, J.P. and W.J. Gajda, 1984. Experimental Methods for Engineers, 5th Ed., McGraw-Hill, New York.
Keim, D. and M. Ward, 2003. Chap. 11 Visualization, in Intelligent Data Analysis, M. Berthold and D.J. Hand (Editors), 2nd Ed., Springer-Verlag, Berlin, Germany.
Reddy, T.A., J.K. Kreider, P.S. Curtiss and A. Rabl, 2016. Heating and Cooling of Buildings, 3rd Ed., CRC Press, Boca Raton, FL.
Reddy, T.A., 1990. Statistical analyses of electricity use during the hottest and coolest days of summer for groups of residences with and without air-conditioning. Energy, vol. 15(1): pp. 45–61.
Schenck, H., 1969. Theories of Engineering Experimentation, 2nd Edition, McGraw-Hill, New York.
Tufte, E.R., 1990. Envisioning Information, Graphics Press, Cheshire, CT.
Tufte, E.R., 2001. The Visual Display of Quantitative Information, 2nd Edition, Graphics Press, Cheshire, CT.
Tukey, J.W., 1970. Exploratory Data Analysis, Vol. 1, Addison-Wesley, Reading, MA.
Tukey, J.W., 1988. The Collected Works of John W. Tukey, W. Cleveland (Editor), Wadsworth and Brooks/Cole Advanced Books and Software, Pacific Grove, CA.
Wang, Z., T. Parkinson, P. Li, B. Lin and T. Hong, 2019. The squeaky wheel: Machine learning for anomaly detection in subjective thermal comfort votes, Building and Environment, 151 (2019), pp. 219–227, Elsevier.
Wonnacott, R.J. and T.H. Wonnacott, 1985. Introductory Statistics, 4th Ed., John Wiley & Sons, New York.
4 Making Statistical Inferences from Samples
Abstract
This chapter covers various concepts and statistical methods on inferring reliable parameter estimates about a population from sample data using knowledge of probability and probability distributions. More specifically, such statistical inferences involve point estimation, confidence interval estimation, hypothesis testing of means and variances from two or more samples, analysis of variance methods, goodness of fit tests, and correlation analysis. The basic principle of inferential statistics is that a random sample (or subset) drawn from a population tends to exhibit the same properties as those of the entire population. Traditional single and multiple parameter estimation techniques for sample means, variances, correlation coefficients, and empirical distributions are presented. This chapter covers the popular single-factor ANOVA technique which allows one to test whether the mean values of data taken from several different groups are essentially equal or not, i.e., whether the samples emanate from different populations or whether they are essentially from the same population. Also treated are non-parametric statistical procedures, best suited for ordinal data or for noisy data, that do not assume a specific probability distribution from which the sample(s) is taken. They are based on relatively simple heuristic ideas and are generally more robust; but, on the other hand, are generally less powerful and efficient. How prior information on the population (i.e., the Bayesian approach) can be used to make sounder statistical inferences from samples and for hypothesis testing problems is also discussed. Further, various types of sampling methods are described which is followed by a discussion on estimators and their desirable properties. 
Finally, resampling methods are treated; these reuse, multiple times, a sample already drawn from a population in order to make statistical inferences about parameter estimates. Though computationally expensive, they are more intuitive, conceptually simple, and versatile, and they allow robust point and interval estimation. Not surprisingly, they have become indispensable techniques in modern-day statistical analyses.
4.1 Introduction
The primary reason for resorting to sampling as against measuring the whole population is to reduce expense, or to make quick decisions (say, in case of a real-time production process), or because often, it is simply impossible to do otherwise. Random sampling, the most common form of sampling, involves selecting samples (or subsets) from the population in a random and independent manner. If done correctly, it reduces or eliminates bias while enabling inferences to be made about the population from the sample. Such inferences are usually made on distributional characteristics or parameters such as the mean value or the standard deviation. Estimators are mathematical formulae or expressions applied to sample data to deduce the estimate of the true parameter (i.e., its numerical value). For example, Eqs. (3.4a) and (3.8) in Chap. 3 are the estimators for deducing the mean and standard deviation of a data set. Given the uncertainty involved, the point estimates must be specified along with statistical confidence intervals (CI) albeit under the presumption that the samples are random. Methods of doing so are treated in Sect. 4.2 and 4.3 for single and multiple samples involving single parameters and in Sect. 4.4 involving multiple parameters. Unfortunately, certain unavoidable, or even undetected, biases may creep into the supposedly random sample, and this could lead to improper or biased inferences. This issue, as well as a more complete discussion of sampling and sampling design, is covered in Sect. 4.7. Resampling methods, which are techniques that involve reanalyzing an already drawn sample, are discussed in Sect. 4.8. Parameter tests on population estimates assume that the sample data are random and independently drawn, and thus, treated as random variables. The sampling fraction, in the case of finite populations, is very much smaller than the
# The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 T. A. Reddy, G. P. Henze, Applied Data Analysis and Modeling for Energy Engineers and Scientists, https://doi.org/10.1007/978-3-031-34869-3_4
population size (often about one to three orders of magnitude). Further, the data of the random variable are assumed to be close to normally distributed. There is an entire field of inferential statistics based on nonparametric or distribution-free tests, which can be applied to population data with unknown probability distributions. Though nonparametric tests are encumbered by fewer restrictive assumptions and are easier to apply and understand, they are less efficient than parametric tests (in that their uncertainty intervals are wider). These are briefly discussed in Sect. 4.5, while Bayesian inferencing, whereby one uses prior information to enhance the inference-making process, is addressed in Sect. 4.6.
4.2 Basic Univariate Inferential Statistics
4.2.1 Sampling Distribution and Confidence Interval of the Mean
(a) Sampling distribution of the mean
Consider a huge box filled with N round balls of unknown but similar diameter (the population). Let μ be the population mean and σ the population standard deviation of the ball diameter. If a sample of n balls is drawn, and recognizing that the mean diameter $\bar{X}$ of the sample will vary from sample to sample, what can one say about the distribution of $\bar{X}$? It can be shown that the sample mean $\bar{X}$ would behave like a normally distributed random variable such that its expected value is:

$$E(\bar{X}) = \mu \tag{4.1}$$

and the standard error (SE) of $\bar{X}$ ¹ is:

$$SE(\bar{X}) = \frac{\sigma}{n^{1/2}} \tag{4.2}$$

Since commonly the population standard deviation σ is not known, the sample standard deviation $s_x$ (given by Eq. 3.8) can be used instead. In case the population is small, and sampling is done without replacement, then Eq. (4.2) is modified to:

$$SE(\bar{X}) = \frac{s_x}{n^{1/2}} \left( \frac{N-n}{N-1} \right)^{1/2} \tag{4.3}$$

where N is the population size, and n the sample size. Note that if N ≫ n, one effectively gets back Eq. (4.2).
¹ The standard deviation is a measure of the variability within a sample while the SE is a measure of the variability of the mean between samples.
The parameters (say, e.g., the mean) of a single sample are unlikely to be equal to those of the population. Since the sample mean values vary from sample to sample, this pattern of variation is referred to as the sampling distribution of the mean. Can the information contained in the sampling distribution be reliably extended to provide estimates of the population mean? The answer is yes. However, since randomness is involved from one sample to another, the answer can only be framed in terms of probabilities or confidence levels. In summary, the confidence interval (CI) of the population parameter estimates depends on the sample size; the larger the better. It also depends on the variance in outcomes of different samples or trials. The larger the variance from sample to sample, the larger ought to be the sample size for the same degree of confidence.
The distribution of the sample data is presumed to be Gaussian or normal. Such an assumption is based on the Central Limit Theorem (one of the most important theorems in statistical theory), which states that if independent random samples of n observations are selected with replacement from a population with any arbitrary distribution, then the distribution of the sample means $\bar{X}$ will approximate a Gaussian distribution provided n is sufficiently large (n > 30). The larger the sample size n, the more closely the sampling distribution approximates the Gaussian (Fig. 4.1).²
² That the sum of two Gaussian distributions from a population would be another Gaussian variable (a property called invariant under addition) is somewhat intuitive. Why the sum of two non-Gaussian distributions should gradually converge to a Gaussian is less so, and hence the importance of this theorem.
A consequence of the theorem is that it leads to a simple method of computing approximate probabilities of sums of independent random variables. It explains the remarkable fact that the empirical frequencies of so many natural “populations” exhibit bell-shaped (i.e., normal or Gaussian) curves. Let $X_1, X_2, \ldots, X_n$ be a sequence of independent identically distributed random variables with population mean μ and variance $\sigma^2$. Then the distribution of the random variable Z (Sect. 2.4.3):

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \tag{4.4}$$

approaches the standard normal as n tends toward infinity. Note that this theorem is valid for any distribution of X; therein lies its power. Probabilities for random quantities can be found by determining areas under the standard normal curve as described in Sect. 2.4.3. Suppose one takes a random sample of size n from a population of mean μ and standard deviation σ. Then the random variable Z has (i) approximately the standard normal distribution if n > 30 regardless of the
distribution of the population, and (ii) exactly the standard normal distribution if the population itself is normally distributed regardless of the sample size (Fig. 4.1).
Fig. 4.1 Illustration of the central limit theorem. The sampling distribution of $\bar{X}$ contrasted with the parent population distribution for three cases. The first case (left column of figures) shows sampling from a normal population. As sample size n increases, the standard error of $\bar{X}$ decreases. The next two cases show that even though the populations are not normal, the sampling distribution still becomes approximately normal as n increases. (From Wonnacutt and Wonnacutt 1985 by permission of John Wiley and Sons)
Note that when sample sizes are small (n < 30) and the underlying distribution is unknown, the Student t-distribution which has wider uncertainty bands (Sect. 2.4.3) should be
used with (n - 1) degrees of freedom instead of the Gaussian (Fig. 2.15 and Table A.4). Unlike the z-curve, there are several t-curves depending on the degrees of freedom (d.f.). At the limit of infinite d.f., the t-curve collapses into the z-curve.
Fig. 4.2 Illustration of critical cutoff values between one-tailed and two-tailed tests for the standard normal distribution. The shaded areas representing the probability values (p) corresponding to 95% CL or significance level α = 0.05 illustrate the difference in how these two types of tests differ in terms of the critical values (which are determined from Table A.3)
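The behavior described by Eqs. (4.1)–(4.3) and by the Central Limit Theorem can be checked with a small simulation. The sketch below is illustrative only: the exponential parent population, the sample size of 36, the number of trials, and the function names `standard_error` and `simulate_sample_means` are assumptions chosen for this illustration and do not come from the text.

```python
import math
import random
import statistics


def standard_error(s, n, N=None):
    """Standard error of the mean, Eq. (4.2).

    If a finite population size N is supplied (sampling without
    replacement), the correction factor of Eq. (4.3) is applied.
    """
    se = s / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se


def simulate_sample_means(n, trials, seed=0):
    """Draw `trials` samples of size n from a skewed (exponential)
    population with mean 1 and standard deviation 1, and return the
    list of sample means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(trials)
    ]


means = simulate_sample_means(n=36, trials=5000)
mean_of_means = statistics.fmean(means)   # expected to be close to mu = 1
sd_of_means = statistics.stdev(means)     # expected to be close to sigma / sqrt(n)
predicted_se = standard_error(1.0, 36)    # Eq. (4.2) with sigma = 1, n = 36

print(f"mean of sample means      = {mean_of_means:.3f}")
print(f"std. dev. of sample means = {sd_of_means:.3f}")
print(f"SE predicted by Eq. (4.2) = {predicted_se:.3f}")
print(f"SE with N = 100, Eq. (4.3) = {standard_error(1.0, 36, N=100):.3f}")
```

Even though the parent population here is strongly skewed, the spread of the simulated sample means should track the value given by Eq. (4.2), and the finite-population correction of Eq. (4.3) can only shrink the standard error.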
(b) One-tailed and two-tailed tests
An important concept needs to be clarified, namely “when does one use one-tailed as against two-tailed tests?” In the two-tailed test, one is testing whether the sample parameter is different (i.e., smaller or larger) than that of the stipulated population. In cases where one wishes to test whether the sample parameter is specifically larger (or specifically smaller) than that of the stipulated population, then the one-tailed test is used. The tests are set up and addressed in like manner, the difference being in how the significance level expressed as a probability (p) value is finally determined. The smaller the p-value, the stronger the statistical evidence that the observed data did not happen by chance. The shaded areas of the normal distributions shown in Fig. 4.2 illustrate the difference in how the one-tailed and two-tailed types of tests have to be performed. For a significance level α of 0.05 or probability p = 0.05 that the observed value is lower than the mean value, the cut-off value is 1.645 for the one-tailed test and is 1.96 for the two-tailed test, as indicated in Fig. 4.2. The nomenclature adopted to denote the critical cutoff values for two-tailed and one-tailed tests is $z_{\alpha/2} = 1.96$ and $z_\alpha = -1.645$, respectively, for 95% CL (which can be determined from Table A.3).
(c) Confidence interval for the mean
In the sub-section above, the behavior of many samples, all taken from one population, was considered. Here, only one large random sample from a population is selected and analyzed to make an educated guess on parameter estimates of the population such as its mean and standard deviation. This process is called inductive reasoning or arguing backwards from a set of observations to reach a reasonable hypothesis. However, the benefit provided by having to select only one sample of the population comes at a price: one must necessarily accept some uncertainty in the estimates. Based on a sample taken from a population:
(a) One can test whether the sample mean differs from a known population mean (covered in Sect. 4.2.2).
(b) One can deduce interval bounds of the population mean at a specified probability or confidence level, expressed as a confidence interval CI (covered in this sub-section).
The concept of CI was introduced in Sect. 3.6.3 in reference to instrument errors. This concept, pertinent to random variables in general, is equally applicable to sampling. A 95% CI is traditionally interpreted as implying that there is a 95% chance that the difference between the sample and population mean values is contained within this interval (and that this conclusion may be erroneous 5% of the time).³ The range is obtained from the z-curve by finding the value at which the area under the curve (i.e., the probability) is equal to 0.95 (Fig. 4.2). From Table A.3, the corresponding two-tailed cutoff value $z_{\alpha/2}$ is 1.96 (corresponding to a probability value of [(1 - 0.95)/2] = 0.025). This implies that the probability is:

$$p\left(-1.96 < \frac{\bar{X}-\mu}{s_x/\sqrt{n}} < 1.96\right) \approx 0.95 \quad \text{or} \quad \bar{X} - 1.96\,\frac{s_x}{\sqrt{n}} < \mu < \bar{X} + 1.96\,\frac{s_x}{\sqrt{n}} \tag{4.5a}$$

This result holds for large samples (n > 30). The half-width of the 95% CI about the mean value is $1.96\, s_x/\sqrt{n}$ and is called the bound of the error of estimation. For small samples, instead of the random variable z, one uses the Student-t variable. Note that Eq. (4.5a) refers to the long-run bounds, i.e., in the long run, roughly 95% of the intervals will contain μ. If one is interested in predicting a single X value that has yet to be observed, one uses the following equation, and the term prediction interval (PI) is used (Devore and Farnum 2005):

$$PI(X) = \bar{X} \pm t_{\alpha/2}\, s_x \left( 1 + \frac{1}{n} \right)^{1/2} \tag{4.6}$$
where $t_{\alpha/2}$ is the two-tailed cutoff value determined from the Student t-distribution at d.f. = (n - 1) at the desired confidence level, and $s_x$ is the sample standard deviation. It is apparent that the prediction intervals are much wider than the CI because the quantity “1” within the brackets of Eq. (4.6) will generally dominate (1/n). This means that there is a lot more uncertainty in predicting the value of a single observation X than there is in estimating the mean value μ.
Example 4.2.1 Evaluating manufacturer-quoted lifetime of light bulbs from sample data
A manufacturer of xenon light bulbs for street lighting claims that the distribution of the lifetimes of his best model has a mean μ = 16 years and a standard deviation σ = 2.0 years when the bulbs are lit for 12 h every day. Suppose that a city official wants to check the claim by purchasing a sample of 36 of these bulbs and subjecting them to tests that determine their lifetimes.
(i) Assuming the manufacturer’s claim to be true, describe the sampling distribution of the mean lifetime of a sample of 36 bulbs. Even though the shape of the distribution is unknown, the Central Limit Theorem suggests that the normal distribution can be used. Thus $\mu = \bar{x} = 16$ and $SE(\bar{x}) = 2.0/\sqrt{36} = 0.33$ years.⁴
(ii) What is the probability that the sample purchased by the city official has a mean lifetime of 15 years or less? The normal distribution N(16, 0.33) is drawn and the darker shaded area to the left of x = 15 as shown in Fig. 4.3 provides the probability of the city official …
⁴ The nomenclature of using X or x is somewhat confusing. The symbol X is usually used to denote the random variable while x is used to denote one of its values.
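The quantities in Example 4.2.1, together with the critical values quoted for Fig. 4.2, can be reproduced with Python's standard library. This is a minimal sketch: the helper names `confidence_interval` and `prediction_interval` are hypothetical, and passing the z critical value to the prediction interval is a large-sample shortcut assumed here (Eq. 4.6 itself calls for the Student-t value).

```python
import math
from statistics import NormalDist

std_normal = NormalDist()                    # standard normal N(0, 1)
z_one_tailed = std_normal.inv_cdf(0.95)      # alpha = 0.05 placed in one tail
z_two_tailed = std_normal.inv_cdf(0.975)     # alpha/2 = 0.025 in each tail

# Example 4.2.1: claimed mean 16 years, sigma = 2.0 years, n = 36 bulbs
mu, sigma, n = 16.0, 2.0, 36
se = sigma / math.sqrt(n)                    # standard error, Eq. (4.2)
p_mean_le_15 = NormalDist(mu, se).cdf(15.0)  # P(sample mean <= 15 years)


def confidence_interval(x_bar, s, n, crit):
    """Bounds on the population mean, in the form of Eq. (4.5a)."""
    half_width = crit * s / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


def prediction_interval(x_bar, s, n, crit):
    """Bounds on a single future observation, in the form of Eq. (4.6).

    `crit` should be the Student-t value for small samples; a z value
    is only a large-sample approximation.
    """
    half_width = crit * s * math.sqrt(1.0 + 1.0 / n)
    return x_bar - half_width, x_bar + half_width


ci = confidence_interval(mu, sigma, n, z_two_tailed)
pi = prediction_interval(mu, sigma, n, z_two_tailed)

print(f"z (one-tailed) = {z_one_tailed:.3f}, z (two-tailed) = {z_two_tailed:.3f}")
print(f"SE = {se:.3f} years, P(mean <= 15) = {p_mean_le_15:.5f}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f}); 95% PI = ({pi[0]:.2f}, {pi[1]:.2f})")
```

Since a sample mean of 15 years lies three standard errors below the claimed mean, the computed probability is very small, and the prediction interval is visibly wider than the confidence interval, as the discussion of Eq. (4.6) anticipates.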
Fig. 4.3 Sampling distribution of $\bar{X}$ for a normal distribution N(16, 0.33), plotted as density versus x. Shaded area represents the probability of the mean life of the sample of bulbs being …
… 1200 h, but the convention is to frame the alternative hypothesis as “different from” the null hypothesis.
Assume a sample size of n = 100 bulbs manufactured by the new process and set the significance or error level of the test to be α = 0.05. This is clearly a one-tailed test since the new bulb manufacturing process should have a longer life, not just one different from that of the traditional process, and the critical value must be selected accordingly. The mean life $\bar{x}$ of the sample of 100 bulbs is assumed to be normally distributed with a mean value of 1200 and standard error $\sigma/\sqrt{n} = 300/\sqrt{100} = 30$. From the standard normal table (Table A.3), the one-tailed critical z-value is $z_\alpha = 1.64$. Recalling that the critical value is defined as:

$$z_\alpha = \frac{\bar{x}_c - \mu_0}{\sigma/\sqrt{n}}$$

leads to $\bar{x}_c = 1200 + 1.64 \times 300/(100)^{1/2} = 1249$, or about 1250. Suppose testing of the 100 bulbs yields a value of $\bar{x} = 1260$. As $\bar{x} > \bar{x}_c$, one would reject the null hypothesis at the 0.05 significance (or error) level. This is akin to jury trials where the null hypothesis is taken to be that the accused is innocent, and the burden of proof during hypothesis testing is on the alternate hypothesis, i.e., on the prosecutor to show overwhelming evidence of the culpability of the accused. If such overwhelming evidence is absent, the null hypothesis is preferentially favored. ■
There is another way of looking at this testing procedure (Devore and Farnum 2005):
(a) Null hypothesis H0 is true, but one has been exceedingly unlucky and got a very improbable sample with mean $\bar{x}$. In other words, the observed difference turned out to be significant when, in fact, there is no real difference. Thus, the null hypothesis has been rejected erroneously. The innocent man has been falsely convicted;
(b) H0 is not true after all. Thus, it is no surprise that the observed $\bar{x}$ value was so high, or that the accused is indeed culpable.
The second explanation is likely to be more plausible, but there is always some doubt because statistical decisions inherently contain probabilistic elements. In other words, statistical tests of hypothesis do not always yield conclusions with absolute certainty: they have in-built margins of error just like jury trials are known to hand down wrong verdicts. Specifically, two types of errors can be distinguished:
(i) Concluding that the null hypothesis is false, when in fact it is true, is called a Type I error. Its probability is α (i.e., the pre-selected significance level), the probability of erroneously rejecting the null hypothesis. This is also called the false positive or “false alarm” rate. The upper normal distribution shown in Fig. 4.4 has a mean value of 1200 (equal to the population or claimed mean value) with a standard error of 30. The area to the right of
Fig. 4.4 The two kinds of error that occur in a classical test. (a) False positive (Type I error): If H0 is true, then significance level α = probability of erring (rejecting the true hypothesis H0). (b) False negative (Type II error): If Ha is true, then β = probability of erring (judging that the false hypothesis H0 is acceptable). The numerical values correspond to data from Example 4.2.2. Upper panel: density of N(1200, 30) (“From population”), where the area to the right of the critical value, in the “Reject H0” region, represents the probability of falsely rejecting the null hypothesis (Type I error). Lower panel: density of N(1260, 30) (“From sample”), where the area to the left of the critical value, in the “Accept H0” region, represents the probability of a Type II error. Horizontal axis: x.
the critical value of 1250 represents the probability of a Type I error occurring.
(ii) The flip side, i.e., concluding that the null hypothesis is true when in fact it is false, is called a Type II error. Its probability β is that of erroneously rejecting the alternate hypothesis, also called the false negative rate. The lower plot of the normal distribution shown in Fig. 4.4 now has a mean of 1260 (the mean value of the sample) with a standard error of 30, while the area to the left of the critical value $\bar{x}_c$ indicates the probability β of being in error of Type II.
The two types of error are inversely related, as is clear from the vertical line in Fig. 4.4 drawn through both plots. A decrease in the probability of one type of error is likely to result in an increase in the probability of the other. Unfortunately, one cannot simultaneously reduce both by selecting a smaller value of α. The analyst would select the significance level depending on the tolerance for, or the seriousness of the consequences of, either type of error specific to the circumstance. Recall that the probability of making a Type I error is called the significance level of the test. The probability of correctly rejecting a false null hypothesis, (1 - β), is referred to as the statistical power. The only way of reducing both types of
errors is to increase the sample size with the expectation that the standard error would decrease, and the sample mean would get closer to the population mean.
One final issue relates to the selection of the test statistic. One needs to distinguish between the following two instances: (i) if the population standard deviation σ is known and for sample sizes n > 30, then the z-statistic is selected for performing the test along with the standard normal tables (as done for Example 4.2.2 above); (ii) if the population standard deviation is unknown or if the sample size n < 30, then the t-statistic is selected (using the sample standard deviation $s_x$ instead of σ) for performing the test using Student-t tables with the appropriate degrees of freedom.
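The one-tailed test of Example 4.2.2 and the two error probabilities pictured in Fig. 4.4 can be sketched numerically as follows. The function name `one_tailed_test` is hypothetical, and evaluating β at a true mean of 1260 simply follows the lower panel of Fig. 4.4; the case n = 400 is an added assumption meant only to show the effect of a larger sample.

```python
import math
from statistics import NormalDist


def one_tailed_test(mu0, sigma, n, alpha, mu_alt):
    """Upper-tailed z-test on a mean.

    Returns the critical sample mean and the Type II error probability
    (beta) that would apply if the true mean were mu_alt.
    """
    se = sigma / math.sqrt(n)                       # standard error
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)     # one-tailed critical z
    x_crit = mu0 + z_alpha * se                     # critical sample mean
    beta = NormalDist(mu_alt, se).cdf(x_crit)       # area left of x_crit
    return x_crit, beta


# Example 4.2.2: mu0 = 1200, sigma = 300, n = 100, alpha = 0.05
x_crit, beta = one_tailed_test(1200.0, 300.0, 100, 0.05, mu_alt=1260.0)
x_bar = 1260.0
reject_h0 = x_bar > x_crit       # decision rule of the test
power = 1.0 - beta               # probability of correctly rejecting H0

# Same alpha, but a sample four times larger (illustrative assumption)
x_crit_400, beta_400 = one_tailed_test(1200.0, 300.0, 400, 0.05, mu_alt=1260.0)

print(f"critical mean = {x_crit:.1f}, reject H0: {reject_h0}")
print(f"beta = {beta:.3f}, power = {power:.3f}")
print(f"with n = 400: critical mean = {x_crit_400:.1f}, beta = {beta_400:.4f}")
```

The critical mean differs slightly from the 1249 quoted in the example because the code uses the unrounded critical value (about 1.645) rather than 1.64. Holding α fixed while increasing n moves the critical value toward 1200 and shrinks β, which is the sense in which a larger sample reduces both kinds of error.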
4.2.3 Two Independent Sample and Paired Difference Tests on Means
As opposed to hypothesis tests for a single population mean, there are hypothesis tests that allow one to compare two population mean values from samples taken from each
population. Two basic presumptions for the tests (described below) to be valid are that the standard deviations of the populations are reasonably close, and that the populations are approximately normally distributed.
(a) Two independent samples test
The test is based on information (namely, the mean and the standard deviation) obtained from taking two independent random samples from the two populations under consideration whose variances are unknown and unequal. Using the same notation as before for population and sample and using subscripts 1 and 2 to denote the two samples, the random variable

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\left( \dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2} \right)^{1/2}} \tag{4.7}$$

is said to approximate the standard normal distribution for large samples ($n_1 > 30$ and $n_2 > 30$) where $s_1$ and $s_2$ are the standard deviations of the two samples. The denominator is called the standard error (SE) and is a measure of the total variability of both samples combined. Notice the similarity between the expression for SE and that for the combined error of two independent measurements given by the quadrature sum of their squared values (Eq. 3.22). The confidence intervals CI of the difference in the population means can be determined as:

$$\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm z_\alpha\, SE(\bar{x}_1, \bar{x}_2) \quad \text{where} \quad SE(\bar{x}_1, \bar{x}_2) = \left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^{1/2} \tag{4.8}$$
where $z_\alpha$ is the critical value at the selected significance level. Thus, the testing of the two samples involves a single random variable combining the properties of both. For smaller sample sizes, Eq. (4.8) still applies, but the z-standardized variable is replaced with the Student-t variable. The critical values are found from the Student t-tables with degrees of freedom d.f. = $n_1 + n_2 - 2$. If the variances of the population are known, then these should be used instead of the sample variances. When the samples are small and only when the variances of both populations are close, some textbooks suggest that the two samples be combined, and the parameter be treated as a single random variable. Here, instead of using individual standard deviation values $s_1$ and $s_2$, a combined quantity called the pooled variance $s_p^2$ is used:

$$s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2} \quad \text{with d.f.} = n_1 + n_2 - 2 \tag{4.9}$$

Note that the pooled variance is simply the weighted average of the two sample variances. The use of the pooled variance approach is said to result in tighter confidence intervals, and hence its appeal. The random variable approximates the t-distribution, and the confidence interval CI of the difference in the population means is:

$$\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm t_\alpha\, SE(\bar{x}_1, \bar{x}_2) \quad \text{where} \quad SE(\bar{x}_1, \bar{x}_2) = \left[ s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right) \right]^{1/2} \tag{4.10}$$

Note that the above equation is said to apply when the variances of both samples are close. However, Devore and Farnum (2005) strongly discourage the use of the pooled variance approach as a general rule, and so the better approach, when in doubt, is to use Eq. (4.8) so as to be conservative. Manly (2005), on the other hand, states that the independent random sample test is fairly robust to the assumptions of normality and equal population variance especially when the sample size exceeds 20 or so. The assumption of equal population variances is said not to be an issue if the ratio of the two variances is within 0.4–2.5.
Example 4.2.3 Verifying savings from energy conservation measures in homes
Certain electric utilities with limited generation capacities fund contractors to weather strip residences in an effort to reduce infiltration losses thereby reducing building loads, which in turn lower electricity needs.⁵ Suppose an electric utility wishes to determine the cost-effectiveness of their weather-stripping program by comparing the annual electric energy use of 200 similar residences in a given community, half of which were weather-stripped, and the other half were not. Samples collected from both types of residences yield:
Control sample: $\bar{x}_1$ = 18,750; $s_1$ = 3200 and $n_1$ = 100.
Weather-stripped sample: $\bar{x}_2$ = 15,150; $s_2$ = 2700 and $n_2$ = 100.
⁵ Such energy conservation programs result in a revenue loss in electricity sales but often this is more cost effective to utilities in terms of deferred generation capacity expansion costs. Another reason for implementing conservation programs is the mandatory requirement set by Public Utility Commissions (PUC) which regulates electric utilities in the United States.
The mean difference $(\bar{x}_1 - \bar{x}_2)$ = 18,750 - 15,150 = 3600, i.e., the mean savings fraction or percentage in each weather-stripped residence is 19.2% (= 3600/18,750) of the mean baseline or control home. However, there is an uncertainty associated with this mean value since only a sample has been analyzed. This uncertainty is characterized as a
bounded range for the mean difference. The one-tailed critical value at the 95% CL corresponding to a significance level α = 0.05 is $z_\alpha$ = 1.645 from Table A.3. Then from Eq. (4.8):

$$\mu_1 - \mu_2 = (\bar{x}_1 - \bar{x}_2) \pm 1.645 \left( \frac{s_1^2}{100} + \frac{s_2^2}{100} \right)^{1/2}$$

To complete the calculation of the confidence interval (CI), it is assumed, given that the sample sizes are large, that the sample variances are reasonably close to the population variances. Thus, the CI is approximately:

$$(18{,}750 - 15{,}150) \pm 1.645 \left( \frac{3200^2}{100} + \frac{2700^2}{100} \right)^{1/2} = 3600 \pm 689 = (2911 \text{ and } 4289).$$

These intervals represent the lower and upper values of saved energy at the 95% CL. To conclude, one can state that the savings are positive, i.e., one can be 95% confident that there is an energy benefit in weather-stripping the homes. More specifically, the mean saving percentage is 19.2% = [(18,750 - 15,150)/18,750] of the baseline value with an uncertainty of 19.1% (= 689/3600) in the savings at the 95% CL. Thus, the uncertainty in the savings estimate is quite large. In practice, the energy savings fraction is usually (much) smaller than what was assumed above, and this could result in the uncertainty in the savings to be as large as the savings amount itself. Such a concern reflects realistic situations where the efficacy of energy conservation programs in homes is often difficult to verify accurately. One could try increasing the sample size or resorting to stratified sampling (see Sect. 4.7.4) but they may not necessarily be more conclusive. Another option is to adopt a less stringent confidence level such as the 90% CL. ■
(b) Paired difference test
The previous section dealt with independent samples from two populations with close to normal probability distributions. There are instances when the samples are somewhat correlated, and such interdependent samples are called paired samples. This interdependence can also arise when the samples are taken at the same time and are affected by a time-varying variable which is not explicitly considered in the analysis. Rather than the individual values, the difference is taken as the only random sample since it is likely to exhibit much less variability than those of the two samples. Thus, the confidence intervals calculated from paired data will be narrower than those calculated from two independent samples. Let $d_i$ be the difference between individual readings of two small paired samples (n < 30), and $\bar{d}$ their mean value. Then, the t-statistic is taken to be:

$$t = \bar{d}/SE \quad \text{where} \quad SE = s_d/\sqrt{n} \tag{4.11a}$$

and the CI around $\bar{d}$ is:

$$\mu_d = \bar{d} \pm t_\alpha\, s_d/\sqrt{n} \tag{4.11b}$$
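As a numerical check on Example 4.2.3 above, the interval from Eq. (4.8) can be reproduced in a few lines, alongside the pooled form of Eqs. (4.9) and (4.10) for comparison. This is only a sketch: the function names `se_unpooled` and `se_pooled` are hypothetical, and using z = 1.645 with the pooled standard error is a large-sample shortcut assumed here, whereas Eq. (4.10) is written in terms of the Student-t value.

```python
import math


def se_unpooled(s1, n1, s2, n2):
    """Standard error of the difference in means, Eq. (4.8)."""
    return math.sqrt(s1**2 / n1 + s2**2 / n2)


def se_pooled(s1, n1, s2, n2):
    """Standard error based on the pooled variance, Eqs. (4.9)-(4.10)."""
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))


# Example 4.2.3: control homes (1) versus weather-stripped homes (2)
x1, s1, n1 = 18750.0, 3200.0, 100
x2, s2, n2 = 15150.0, 2700.0, 100
z_alpha = 1.645                      # critical value quoted in the example

diff = x1 - x2                       # mean savings
se = se_unpooled(s1, n1, s2, n2)
half_width = z_alpha * se
lower, upper = diff - half_width, diff + half_width

savings_fraction = diff / x1         # savings relative to the control homes
relative_uncertainty = half_width / diff

print(f"difference = {diff:.0f} +/- {half_width:.0f} -> ({lower:.0f}, {upper:.0f})")
print(f"savings fraction = {savings_fraction:.1%}, "
      f"relative uncertainty = {relative_uncertainty:.1%}")
print(f"pooled SE = {se_pooled(s1, n1, s2, n2):.2f}, unpooled SE = {se:.2f}")
```

Because the two samples in this example are of equal size, the pooled and unpooled standard errors coincide; the two approaches give different intervals only when $n_1 \neq n_2$.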
Hypothesis testing of means for paired samples is done the same way as that for a single independent mean and is usually (but not always) superior to an independent sample test. Paired difference tests are used for comparing “before and after” or “with and without” type of experiments done on the same group in turn, say, to assess the effect of an action performed. For example, the effect of an additive in gasoline meant to improve gas mileage can be evaluated statistically by considering a set of data representing the difference in the gas mileage of n cars which have each been subjected to tests involving “no additive” and “with additive.” Its usefulness is illustrated by the following example which is a more direct application of paired difference tests.
Example 4.2.4 Comparing energy use of two similar buildings based on utility bills: the wrong way
Buildings which are designed according to certain performance standards are eligible for recognition as energy-efficient buildings by federal and certification agencies. A recently completed building (B2) was awarded such an honor. The federal inspector, however, denied the request of another owner of an identical building (B1) located close by who claimed that the differences in energy use between both buildings were within statistical error. An energy consultant was hired by the owner to prove that B1 is as energy efficient as B2. He chose to compare the monthly mean utility bills over a year between the two commercial buildings using data recorded over the same 12 months and listed in Table 4.1 (the analysis would be more conclusive if bill data over several years were used, but such data may be hard to come by). This problem can be addressed using the two-sample test method. The null hypothesis is that the mean monthly utility charges $\mu_1$ and $\mu_2$ for the two buildings are equal against the alternative hypothesis that Building B2 is more energy efficient than B1 (thus, a one-tailed test is appropriate).
Since the sample sizes are less than 30, the t-statistic has to be used instead of the standard normal z-statistic. The pooled variance approach given by Eq. (4.9) is appropriate in this instance. It is computed as:
Table 4.1 Monthly utility bills and the corresponding outdoor temperature for the two buildings being compared (Example 4.2.4)

| Month | Building B1 utility cost ($) | Building B2 utility cost ($) | Difference in costs (B1 − B2) ($) | Outdoor temperature (°C) |
|---|---|---|---|---|
| 1 | 693 | 639 | 54 | 3.5 |
| 2 | 759 | 678 | 81 | 4.7 |
| 3 | 1005 | 918 | 87 | 9.2 |
| 4 | 1074 | 999 | 75 | 10.4 |
| 5 | 1449 | 1302 | 147 | 17.3 |
| 6 | 1932 | 1827 | 105 | 26 |
| 7 | 2106 | 2049 | 57 | 29.2 |
| 8 | 2073 | 1971 | 102 | 28.6 |
| 9 | 1905 | 1782 | 123 | 25.5 |
| 10 | 1338 | 1281 | 57 | 15.2 |
| 11 | 981 | 933 | 48 | 8.7 |
| 12 | 873 | 825 | 48 | 6.8 |
| Mean | 1,349 | 1,267 | 82 | |
| Std. deviation | 530.07 | 516.03 | 32.00 | |

Fig. 4.5 Month-by-month variation of the utility bills for the two buildings B1 and B2 (Example 4.2.5). [Plot of utility bills ($/month) against month of year for B1, B2, and their difference.]
\[ s_p^2 = \frac{(12-1)(530.07)^2 + (12-1)(516.03)^2}{12 + 12 - 2} = 273{,}630.6 \]
while the t-statistic can be deduced by rearranging Eq. (4.10):
\[ t = \frac{(1349 - 1267) - 0}{\left[(273{,}630.6)\left(\frac{1}{12} + \frac{1}{12}\right)\right]^{1/2}} = \frac{82}{213.54} = 0.38 \]
The t-value is very small and will not lead to the rejection of the null hypothesis even at significance level α = 0.10 (from Table A.4, the one-tailed critical value is 1.321 for CL = 90% and d.f. = 12 + 12 − 2 = 22). Thus, the consultant would report that insufficient statistical evidence exists to state that the two buildings are different in their energy consumption. As explained next, this example demonstrates a faulty analysis.
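The pooled-variance calculation of Example 4.2.4 can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `pooled_t` and the variable names are illustrative choices, and the critical value is simply the Table A.4 figure quoted in the text rather than something computed here.

```python
import math
import statistics

# Monthly utility bills ($) for the two buildings, taken from Table 4.1
b1 = [693, 759, 1005, 1074, 1449, 1932, 2106, 2073, 1905, 1338, 981, 873]
b2 = [639, 678, 918, 999, 1302, 1827, 2049, 1971, 1782, 1281, 933, 825]


def pooled_t(sample1, sample2):
    """Return (pooled variance, t-statistic, d.f.) for two independent samples,
    assuming equal population variances."""
    n1, n2 = len(sample1), len(sample2)
    var1 = statistics.variance(sample1)  # sample variance, divisor n - 1
    var2 = statistics.variance(sample2)
    sp_sq = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    diff = statistics.mean(sample1) - statistics.mean(sample2)
    t = diff / math.sqrt(sp_sq * (1 / n1 + 1 / n2))
    return sp_sq, t, n1 + n2 - 2


sp_sq, t_stat, dof = pooled_t(b1, b2)

# One-tailed critical value quoted in the text (Table A.4, CL = 90%, d.f. = 22)
T_CRITICAL = 1.321

print(f"pooled variance = {sp_sq:.1f}, t = {t_stat:.2f}, d.f. = {dof}")
print("reject H0" if t_stat > T_CRITICAL else "cannot reject H0")
```

Because the variances are recomputed from the raw bills rather than from the rounded standard deviations in Table 4.1, the pooled variance differs from the tabulated value in the last few digits; the t-statistic is unaffected at two decimals.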
Example 4.2.5 Comparing energy use of two similar buildings based on utility bills: the right way
There is, however, a problem with the way the energy consultant performed the test of the previous example. Close observation of the data as plotted in Fig. 4.5 would lead one not only to suspect that this conclusion is erroneous, but also to observe that the utility bills of the two buildings tend to rise and fall together because of seasonal variations in the outdoor temperature. Hence the condition that the two samples are independent is violated. It is in such circumstances that a paired test is relevant. Here, the test is meant to determine whether the monthly mean of the differences in utility bills between buildings B1 and B2 (x̄D) is zero (null hypothesis) or is positive. In this case:
\[ t\text{-statistic} = \frac{\bar{x}_D - 0}{s_D/\sqrt{n_D}} = \frac{82 - 0}{32/\sqrt{12}} = 8.88 \quad \text{with d.f.} = 12 - 1 = 11 \]
Fig. 4.6 Conceptual illustration of three characteristic cases that may arise during two-sample testing of medians. The box and whisker plots provide some indication as to the variability in the results of the tests. Case (a) very clearly indicates that the samples are very much different; case (b) also suggests the same but with a little less certitude. Finally, it is more difficult to draw conclusions from case (c), and it is in such cases that statistical tests are useful.
where the values of 82 and 32 are found from Table 4.1. For a significance level of α = 0.05 and d.f. = 11, the one-tailed test (Table A.4) suggests a critical value t0.05 = 1.796. Because 8.88 is much higher than this critical value, one can safely reject the null hypothesis. In fact, Bldg. 1 is less energy efficient than Bldg. 2 even at a significance level of 0.0005 (or CL = 99.95%), and the owner of B1 does not have a valid case at all! This illustrates how misleading results can be obtained if inferential tests are misused, or if the analyst ignores the underlying assumptions behind a particular test. It is important to keep in mind the premise that the random variables should follow a normal distribution, and so some preliminary exploratory data analysis using multiple years of utility bills is advisable (Sect. 3.5).

Figure 4.6 illustrates, in a simple conceptual manner, the three characteristic cases which can arise when comparing the means of two populations based on sampled data. Recall that the box and whisker plot is a type of graphical display of the shape of the distribution where the solid line denotes the median, the upper and lower hinges of the box indicate the interquartile range values (25th and 75th percentiles), and the whiskers extend to 1.5 times this range. Case (a) corresponds to the case where the whisker of one box plot extends to the lower quartile of the second box plot. One does not need to perform a statistical test to conclude that the two population means are different. Case (b) also suggests a difference between population means but with a little less certitude. Case (c) illustrates the case when the two whisker bands are quite close, and the value of statistical tests becomes apparent. As a rough rule of thumb, if the 25th percentile for one sample exceeds the median line of the other sample, one could conclude that the means are likely to be different (Walpole et al. 2007).
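Returning to Example 4.2.5, the paired test reduces to a single-sample t-test on the month-by-month differences. The sketch below recomputes the statistic from the Table 4.1 data using only the standard library; `paired_t` is a hypothetical helper name, and the critical value is the Table A.4 figure quoted in the text.

```python
import math
import statistics

# Monthly utility bills ($) from Table 4.1
b1 = [693, 759, 1005, 1074, 1449, 1932, 2106, 2073, 1905, 1338, 981, 873]
b2 = [639, 678, 918, 999, 1302, 1827, 2049, 1971, 1782, 1281, 933, 825]


def paired_t(first, second, mu0=0.0):
    """Return (t-statistic, d.f.) for the mean of the paired differences."""
    diffs = [x - y for x, y in zip(first, second)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of differences
    return (mean_diff - mu0) / (sd_diff / math.sqrt(n)), n - 1


t_paired, dof_paired = paired_t(b1, b2)

# One-tailed critical value quoted in the text (Table A.4, alpha = 0.05, d.f. = 11)
T_CRITICAL = 1.796

print(f"paired t = {t_paired:.2f} with d.f. = {dof_paired}")
print("reject H0" if t_paired > T_CRITICAL else "cannot reject H0")
```

The same twelve pairs of bills that gave an inconclusive pooled test now give a very large statistic, because differencing removes the seasonal variation shared by both buildings.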
4.2.4 Single and Two Sample Tests for Proportions
There are several cases where surveys are performed to determine fractions or proportions of populations who either have preferences of some sort or have purchased a certain type of equipment. For example, the gas company may wish to determine what fraction of its customer base has natural gas heating as against another source (e.g., electric heat pumps). The company performs a survey on a random sample from which it would like to extrapolate and ascertain confidence limits on this fraction. Each response can be interpreted as either a “success” (the customer has gas heat) or a “failure”; in short, this is a binomial experiment (see Sect. 2.4.2b), and it is in such cases that the following test is useful.

(a) Single sample test
Let p be the population proportion one wishes to estimate from the sample proportion p̂, which can be determined as:
\[ \hat{p} = \frac{\text{number of successes in sample}}{\text{total number of trials}} = \frac{x}{n} \]
Then, provided the sample is large (n ≥ 30), the proportion p̂ is an unbiased estimator of p with an approximately normal distribution. Since the number of successes in n Bernoulli trials has variance np(1 − p) (see Eq. 2.37b), dividing by n² and taking the square root yields the standard error of the sampling distribution of p̂:
\[ SE(\hat{p}) = \left[\hat{p}(1 - \hat{p})/n\right]^{1/2} \tag{4.12} \]
Thus, the large-sample CI for p for the two-tailed case at a significance level α is given by:
\[ CI = \hat{p} \pm z_{\alpha/2}\left[\hat{p}(1 - \hat{p})/n\right]^{1/2} \tag{4.13} \]
Example 4.2.6
In a random sample of n = 100 new residences in Scottsdale, AZ, it was found that 63 had swimming pools. Find the 95% CI for the fraction of buildings which have pools.
In this case, n = 100, while p̂ = 63/100 = 0.63. From Table A.3, the two-tailed critical value z0.05/2 = 1.96, and hence from Eq. (4.13), the two-tailed 95% CI for p is:
\[ 0.63 - 1.96\left[\frac{0.63(1 - 0.63)}{100}\right]^{1/2} < p < 0.63 + 1.96\left[\frac{0.63(1 - 0.63)}{100}\right]^{1/2} \]
or 0.5354 < p < 0.7246. ■
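The interval of Example 4.2.6 follows directly from Eq. (4.13). A minimal sketch, assuming the large-sample normal approximation and using the Table A.3 value z = 1.96 quoted above (the function name `proportion_ci` is an illustrative choice):

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Large-sample two-tailed confidence interval for a proportion, Eq. (4.13)."""
    p_hat = successes / n
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)  # Eq. (4.12)
    return p_hat - z * standard_error, p_hat + z * standard_error


# Example 4.2.6: 63 of 100 sampled residences have swimming pools
ci_low, ci_high = proportion_ci(63, 100)
print(f"95% CI: {ci_low:.4f} < p < {ci_high:.4f}")
```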
Example 4.2.7
The same equations can also be used to determine the sample size needed for the estimate of p not to exceed a certain range or error e. For instance, one would like to determine, from the data of Example 4.2.6, the sample size which will yield an estimate of p within 0.02 or less at the 95% CL. Then, recasting Eq. (4.13) results in a sample size:
\[ n = \frac{z_{\alpha/2}^2\,\hat{p}(1 - \hat{p})}{e^2} = \frac{1.96^2\,(0.63)(1 - 0.63)}{(0.02)^2} = 2239 \]
It must be pointed out that the above example is somewhat misleading since one does not know the value of p beforehand. One may have a preliminary idea, in which case the sample size n would be an approximate estimate, and this may have to be revised once some data are collected. ■

(b) Two sample tests
The intent here is to estimate whether statistically significant differences exist between proportions of two populations based on one sample drawn from each population. Assume that the two samples are large and independent. Let p̂1 and p̂2 be the sample proportions. Then, the sampling distribution of (p̂1 − p̂2) is approximately normal, with (p̂1 − p̂2) being an unbiased estimator of (p1 − p2) and the standard error given by:
\[ SE(\hat{p}_1 - \hat{p}_2) = \left[\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}\right]^{1/2} \tag{4.14} \]
The following example illustrates the procedure.

Example 4.2.8 Hypothesis testing of increased incidence of lung ailments due to radon in homes
The Environmental Protection Agency (EPA) would like to determine whether the fraction of residents with health problems living in an area where the subsoil is known to have elevated radon concentrations is statistically higher than the fraction in areas where subsoil radon concentrations are negligible. Specifically, the agency wishes to test the hypothesis at the 95% CL that the fraction of residents p1 with lung ailments in radon-prone areas is higher than the fraction p2 corresponding to low radon level locations. The following data are collected:
High radon level area: n1 = 100, p̂1 = 0.38
Low radon area: n2 = 225, p̂2 = 0.22
Null hypothesis H0: (p1 − p2) = 0
Alternative hypothesis H1: (p1 − p2) > 0
One calculates the random variable z, using Eq. (4.14) to compute the SE:
\[ z = \frac{\hat{p}_1 - \hat{p}_2}{\left[\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}\right]^{1/2}} = \frac{0.38 - 0.22}{\left[\frac{(0.38)(0.62)}{100} + \frac{(0.22)(0.78)}{225}\right]^{1/2}} = 2.865 \]
A one-tailed test is appropriate, and from Table A.3 the critical value is z0.05 = 1.65 for the 95% CL. Since the calculated z value > zα, this would suggest that the null hypothesis can be rejected. Thus, one would conclude that those living in areas of high radon levels have a statistically higher incidence of lung ailments than those who do not. Further inspection of Table A.3 reveals that z = 2.865 corresponds to a one-tailed probability value of about 0.002, i.e., close to the 99.8% CL. Should the EPA require mandatory testing of all homes at some expense to all homeowners, or should some other policy measure be adopted? These types of considerations fall under the purview of decision-making discussed in Chap. 12. ■
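The sample-size estimate of Example 4.2.7 and the two-sample z-test of Example 4.2.8 can be sketched in a few lines. The function names are illustrative, and the upper-tail probability is obtained from the standard normal distribution via `math.erfc` instead of being read from Table A.3; this is an assumption about how one might reproduce the table lookup, not the book's procedure.

```python
import math


def sample_size_for_proportion(p_guess, margin, z=1.96):
    """Sample size so that the CI half-width of Eq. (4.13) does not exceed `margin`."""
    return math.ceil(z ** 2 * p_guess * (1 - p_guess) / margin ** 2)


def two_proportion_z(p1_hat, n1, p2_hat, n2):
    """z-statistic for H0: p1 - p2 = 0, with the standard error of Eq. (4.14)."""
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    return (p1_hat - p2_hat) / se


def upper_tail_probability(z):
    """P(Z > z) for a standard normal random variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Example 4.2.7: preliminary estimate 0.63, desired error 0.02 at the 95% CL
n_required = sample_size_for_proportion(0.63, 0.02)

# Example 4.2.8: high-radon versus low-radon areas
z_obs = two_proportion_z(0.38, 100, 0.22, 225)
p_value = upper_tail_probability(z_obs)

print(f"required sample size = {n_required}")
print(f"z = {z_obs:.3f}, one-tailed p-value = {p_value:.4f}")
```

Rounding the sample size up with `math.ceil` is a design choice: any fractional result must be rounded to the next whole observation for the error bound to hold.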
4.2.5 Single and Two Sample Tests of Variance
Recall that when a sample mean is used to provide an estimate of the population mean μ, it is more informative to give a confidence interval CI for μ instead of simply stating the value x̄. A similar approach can be adopted for estimating the population variance from that of a sample.
(a) Single sample test
The CI for a population variance σ² based on the sample variance s² is to be determined. To construct such a CI, one will use the fact that if a random sample of size n is taken from a population that is normally distributed with variance σ², then the random variable
\[ \chi^2 = \frac{n - 1}{\sigma^2}\, s^2 \tag{4.15} \]
has the Pearson chi-square distribution with ν = (n − 1) degrees of freedom (described in Sect. 2.4.3). The advantage of using χ² instead of s² is akin to standardizing a variable to a normal random variable. Such a transformation allows standard tables (such as Table A.5) to be used for determining probabilities irrespective of the magnitude of s². The basis of these probability tables is again akin to finding the areas under the chi-square curves.

Example 4.2.9
A company which makes boxes wishes to determine whether their automated production line requires major servicing or not. Their decision will be based on whether the weight from one box to another is significantly different from a maximum permissible population variance value of σ² = 0.12 kg². A sample of 10 boxes is selected, and their variance is found to be s² = 0.24 kg². Is this difference significant at the 95% CL?
From Eq. (4.15), the observed chi-square value is χ² = (10 − 1)(0.24)/0.12 = 18. Inspection of Table A.5 for ν = 9 degrees of freedom reveals that for a significance level α = 0.05, the critical chi-square value is χ²α = 16.92 and, for α = 0.025, χ²α = 19.02. Thus, the result is significant at α = 0.05 or 95% CL but not at the 97.5% CL. Whether to service the automated production line based on these statistical tests would involve weighing the cost of service against the associated benefits in product quality. ■

(b) Two sample tests
This instance applies to the case when two independent random samples are taken from two populations that are normally distributed, and one needs to determine whether the variances of the two populations are different or not. Such tests find application prior to conducting t-tests on two means, which presume equal variances. Let σ1 and σ2 be the standard deviations of the two populations, and s1 and s2 be the sample standard deviations. If σ1 = σ2, then the random variable
\[ F = \frac{s_1^2}{s_2^2} \tag{4.16} \]
has the F-distribution (described in Sect. 2.4.3) with degrees of freedom (d.f.) = (ν1, ν2) where ν1 = (n1 − 1) and ν2 = (n2 − 1). Note that the distributions are different for different combinations of ν1 and ν2. The probabilities for F can be determined using areas under the F curves or from tabulated values as in Table A.6. Note that the F-test applies to independent samples, and, unfortunately, is known to be rather sensitive to the assumption of normality. Hence, some argue against its use altogether for two sample testing (for example, Manly 2005).

Example 4.2.10 Comparing variability in daily productivity of two workers
It is generally acknowledged that worker productivity increases if the environment is properly conditioned to meet the stipulated human comfort requirements. One is interested in comparing the mean productivity of two office workers under the same conditions. However, before undertaking that evaluation, one is unsure about the assumption of equal variances in productivity of the workers (i.e., how consistent the workers are from one day to another). This test can be used to check the validity of this assumption. Suppose the following data have been collected for two workers under the same environment and performing similar tasks. An initial analysis of the data suggests that the normality condition is met for both workers:
Worker A: n1 = 13 days, mean x̄1 = 26.3 production units, standard deviation s1 = 8.2 production units.
Worker B: n2 = 18 days, mean x̄2 = 19.7 production units, standard deviation s2 = 6.0 production units.
The intent here is to compare not the means but the standard deviations. The F-statistic is determined by always choosing the larger variance as the numerator. Then F = (8.2/6.0)² = 1.87. From Table A.6, the critical F value is Fα = 2.38 for (13 − 1) = 12 and (18 − 1) = 17 degrees of freedom at a significance level α = 0.05. Thus, as illustrated in Fig. 4.7, one is forced to accept the null hypothesis since the calculated F-value < Fα, and conclude that the data do not provide enough evidence to indicate that the population variances of the two workers are statistically different at α = 0.05. Hence, one can now proceed to use the two-sample t-test with some confidence to determine whether the difference in the means between both workers is statistically significant or not.
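The two variance statistics of Examples 4.2.9 and 4.2.10 are simple enough to verify by hand, but a short sketch makes the conventions explicit, in particular the rule of placing the larger sample variance in the numerator of F. The function names are illustrative, and the critical values are the Table A.5 and Table A.6 figures quoted in the text, not values computed here.

```python
def chi_square_variance_statistic(n, sample_variance, population_variance):
    """Chi-square statistic of Eq. (4.15) for a single-sample variance test."""
    return (n - 1) * sample_variance / population_variance


def f_ratio(s1, n1, s2, n2):
    """F-statistic of Eq. (4.16) with the larger sample variance in the numerator.

    Returns (F, (numerator d.f., denominator d.f.)).
    """
    var1, var2 = s1 ** 2, s2 ** 2
    if var1 >= var2:
        return var1 / var2, (n1 - 1, n2 - 1)
    return var2 / var1, (n2 - 1, n1 - 1)


# Example 4.2.9: 10 boxes, s^2 = 0.24 kg^2 against sigma^2 = 0.12 kg^2
chi2_boxes = chi_square_variance_statistic(10, 0.24, 0.12)
CHI2_CRITICAL_05 = 16.92   # Table A.5 value quoted in the text, nu = 9
CHI2_CRITICAL_025 = 19.02  # Table A.5 value quoted in the text, nu = 9

# Example 4.2.10: Worker A (s = 8.2, n = 13) versus Worker B (s = 6.0, n = 18)
f_workers, dof_workers = f_ratio(8.2, 13, 6.0, 18)
F_CRITICAL_05 = 2.38       # Table A.6 value quoted in the text, d.f. = (12, 17)

print(f"chi-square = {chi2_boxes:.2f}; significant at 0.05: {chi2_boxes > CHI2_CRITICAL_05}")
print(f"F = {f_workers:.2f} with d.f. = {dof_workers}; significant at 0.05: {f_workers > F_CRITICAL_05}")
```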
Fig. 4.7 Since the calculated F value is lower than the critical value, one is forced to accept the null hypothesis (Example 4.2.10). [Plot of the F-distribution density against x, marking the calculated F-value = 1.87, the critical value = 2.38 for α = 0.05, and the rejection region to the right of the critical value.]

Table 4.2 Expected number of homes for different number of non-code compliance values if the process is assumed to be a Poisson distribution with sample mean of 0.5 (Example 4.2.11)

| X = number of non-code compliance values | P(x) · n | Expected number |
|---|---|---|
| 0 | (0.6065)(380) | 230.470 |
| 1 | (0.3033)(380) | 115.254 |
| 2 | (0.0758)(380) | 28.804 |
| 3 | (0.0126)(380) | 4.788 |
| 4 | (0.0016)(380) | 0.608 |
| 5 or more | (0.0002)(380) | 0.076 |
| Total | (1.000)(380) | 380 |

4.2.6 Tests for Distributions
Recall from Sect. 2.4.3 that the chi-square (χ²) statistic applies to discrete data. It is used to statistically test the hypothesis that a set of empirical or sample data does not differ significantly from that expected from a specified theoretical distribution. In other words, it is a goodness-of-fit test to ascertain whether the distribution of proportions of one group differs from another or not. The chi-square statistic is computed as:
\[ \chi^2 = \sum_{k} \frac{\left(f_{obs} - f_{exp}\right)^2}{f_{exp}} \tag{4.17} \]
where fobs is the observed frequency of each class or interval, fexp is the expected frequency for each class predicted by the theoretical distribution, and k is the number of classes or intervals. If χ² = 0, then the observed and theoretical frequencies agree exactly. If not, the larger the value of χ², the greater the discrepancy. Tabulated values of χ² are used to determine significance for different values of degrees of freedom υ = k − 1 (Table A.5). Certain restrictions apply for proper use of this test. The sample size should be greater than 30, and none of the expected frequencies should be less than 5 (Walpole et al. 2007). In other words, a long tail of the probability curve at the lower end is not appropriate, and in that sense, some power is lost when the test is adopted. The following two examples serve to illustrate the process of applying the chi-square test.

Example 4.2.11 Ascertaining whether non-code compliance infringements in residences are random or not
A county official was asked to analyze the frequency of cases when home inspectors found new homes built by one specific builder to be non-code compliant and determine whether the violations were random or not. The following data for 380 homes were collected:

| No. of code infringements | Number of homes |
|---|---|
| 0 | 242 |
| 1 | 94 |
| 2 | 38 |
| 3 | 4 |
| 4 | 2 |
The underlying random process can be characterized by the Poisson distribution (see Sect. 2.4.2):
\[ P(x) = \frac{\lambda^x \exp(-\lambda)}{x!} \]
The null hypothesis, namely that the sample is drawn from a population that is Poisson distributed, is to be tested at the 0.05 significance level. The sample mean is
\[ \lambda = \frac{0(242) + 1(94) + 2(38) + 3(4) + 4(2)}{380} = 0.5 \ \text{infringements per home.} \]
For a Poisson distribution with λ = 0.5, the underlying or expected values are found for different values of x as shown in Table 4.2. The last three categories have expected frequencies that are less than 5, which does not meet one of the requirements for using the test (as stated above). Hence, these will be combined into a new category called “3 or more cases” which will have an expected frequency of (4.788 + 0.608 + 0.076) = 5.472 and an observed frequency of (4 + 2) = 6. The following statistic is calculated first:
\[ \chi^2 = \frac{(242 - 230.470)^2}{230.470} + \frac{(94 - 115.254)^2}{115.254} + \frac{(38 - 28.804)^2}{28.804} + \frac{(6 - 5.472)^2}{5.472} = 7.483 \]
Since there are only 4 groups, the degrees of freedom υ = 4 − 1 = 3, and from Table A.5, the critical value at the 0.05 significance level is χ²α = 7.815. This would suggest that the null hypothesis cannot be rejected at the 0.05 significance level; however, the calculated and critical values are very close, and some further analysis may be warranted. ■
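The steps of Example 4.2.11 (estimate λ, compute expected Poisson frequencies, pool the sparse upper classes, and sum the chi-square terms) can be sketched as follows. This is a standard-library illustration with hypothetical variable names; it carries full floating-point precision for the Poisson probabilities, so the statistic differs slightly in the third significant figure from the value obtained with the rounded entries of Table 4.2.

```python
import math

# Observed number of homes for each count of code infringements (Example 4.2.11)
observed = {0: 242, 1: 94, 2: 38, 3: 4, 4: 2}

n_homes = sum(observed.values())
lam = sum(x * count for x, count in observed.items()) / n_homes  # sample mean


def poisson_pmf(x, lam):
    """Poisson probability P(x) = lam^x exp(-lam) / x!"""
    return lam ** x * math.exp(-lam) / math.factorial(x)


# Expected frequencies for x = 0, 1, 2; everything else is pooled into "3 or more"
expected = [n_homes * poisson_pmf(x, lam) for x in range(3)]
expected.append(n_homes - sum(expected))

observed_pooled = [observed[0], observed[1], observed[2], observed[3] + observed[4]]

# Eq. (4.17)
chi2_poisson = sum((o - e) ** 2 / e for o, e in zip(observed_pooled, expected))
dof_poisson = len(observed_pooled) - 1  # as used in the text

CHI2_CRITICAL = 7.815  # Table A.5 value quoted in the text, 3 d.f., alpha = 0.05

print(f"lambda = {lam}, chi-square = {chi2_poisson:.3f}, d.f. = {dof_poisson}")
print("reject H0" if chi2_poisson > CHI2_CRITICAL else "cannot reject H0")
```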
Table 4.3 Observed and computed (assuming gender independence) number of accidents in different circumstances (Example 4.2.12)

| Circumstance | Male observed | Male expected | Female observed | Female expected | Total observed |
|---|---|---|---|---|---|
| At work | 40 | 26.3 | 5 | 18.7 | 45 |
| At home | 49 | 62.6 | 58 | 44.4 | 107 |
| Other | 18 | 18.1 | 13 | 12.9 | 31 |
| Total | 107 | | 76 | | 183 = n |

Example 4.2.12⁶ Evaluating whether injuries in males and females are independent of circumstance
Chi-square tests are also widely used as tests of independence using contingency tables. In 1975, more than 59 million Americans suffered injuries. More males (33.6 million) were injured than females (25.6 million). These statistics do not distinguish whether males and females tend to be injured in similar circumstances. A safety survey of n = 183 accident reports was selected at random to study this issue in a large city, and the results are summarized in Table 4.3. The null hypothesis is that the circumstance of an accident (whether at work or at home) is independent of the gender of the victim. This hypothesis is to be verified at a significance level of α = 0.01. The degrees of freedom are d.f. = (r − 1)(c − 1) where r is the number of rows and c the number of columns. Hence, d.f. = (3 − 1)(2 − 1) = 2. From Table A.5, the critical value is χ²α = 9.21 at α = 0.01 for d.f. = 2. The expected values for different joint occurrences (male/work, male/home, male/other, female/work, female/home, female/other) are shown in Table 4.3 and correspond to the case when the occurrences are independent. Recall from basic probability (Eq. 2.10) that if events A and B are independent, then p(A ∩ B) = p(A)·p(B) where p indicates the probability. In our case, if being “male” and “being involved in an accident at work” were truly independent, then p(work ∩ male) = p(work)·p(male). Consider the cell corresponding to male/at work. Its expected value is
\[ n\,p(\text{work} \cap \text{male}) = n\,p(\text{work})\,p(\text{male}) = 183\left(\frac{45}{183}\right)\left(\frac{107}{183}\right) = \frac{(45)(107)}{183} = 26.3 \]
(as shown in Table 4.3). Expected values for other joint occurrences shown in the table have been computed in like manner. Thus, the chi-square statistic is
\[ \chi^2 = \frac{(40 - 26.3)^2}{26.3} + \frac{(5 - 18.7)^2}{18.7} + \dots + \frac{(13 - 12.9)^2}{12.9} = 24.3 \]
Since χ²α = 9.21 is much lower than 24.3, the null hypothesis can be safely rejected at a significance level of 0.01. Hence, one would conclude that gender has a bearing on the circumstance in which the accidents occur. ■

⁶ From Weiss (1987) by permission of Pearson Education.
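The expected frequencies and the chi-square statistic of Example 4.2.12 can be reproduced from the observed counts alone, since each expected cell is (row total × column total)/n under independence. The sketch below uses only the standard library; `chi_square_independence` is a hypothetical helper, and the critical value is the Table A.5 figure quoted in the example.

```python
# Observed accident counts from Table 4.3: rows = circumstance, columns = (male, female)
observed_counts = [
    [40, 5],    # at work
    [49, 58],   # at home
    [18, 13],   # other
]


def chi_square_independence(table):
    """Return (chi-square, d.f., expected table) for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Under independence, expected count = row total * column total / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof, expected


chi2_gender, dof_gender, expected_counts = chi_square_independence(observed_counts)

CHI2_CRITICAL_01 = 9.21  # Table A.5 value quoted in the text, d.f. = 2, alpha = 0.01

print(f"chi-square = {chi2_gender:.1f} with d.f. = {dof_gender}")
print("reject H0" if chi2_gender > CHI2_CRITICAL_01 else "cannot reject H0")
```

Using unrounded expected counts gives a statistic within about 0.1 of the value obtained from the one-decimal entries of Table 4.3; the conclusion is the same.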
Another widely used test to determine whether the distribution of a set of empirical or sample data comes from a specified theoretical distribution is the Kolmogorov-Smirnov (KS) test (Shannon, 1975). It is analogous to the chi-square test and is especially useful for small sample sizes since it does not require that adjacent categories with expected frequencies less than 5 be combined. The chi-square test is very powerful for large samples (sample sizes of about n > 100, with some authors even suggesting n > 30). The KS test is based on binning the cumulative probability distribution of the empirical data into a specific number of classes or intervals and is best used for 10 < n < 100. In general, the greater the number of intervals, the more discriminating the test. While the chi-square test requires a minimum of 5 data points in each class, the KS test can accommodate even a single observation in a class.
4.2.7 Test on the Pearson Correlation Coefficient
Recall that the Pearson correlation coefficient was presented in Sect. 3.4.2 as a means of quantifying the linear relationship between samples of two variables. One can also define a population correlation coefficient ρ for two variables. Section 4.2.1 presented methods by which the uncertainty around the population mean could be ascertained from the sample mean by determining confidence limits. Similarly, one can make inferences about the population correlation coefficient ρ from knowledge of the sample correlation coefficient r. Provided both the variables are normally distributed (called a bivariate normal population), Fig. 4.8 provides a convenient way of ascertaining the 95% CL of the population correlation coefficient for different sample sizes. Say, r = 0.6 for a sample of n = 10 pairs of observations; then the 95% CL interval limits for the population correlation coefficient are (−0.05 < ρ < 0.87), which are very wide. Notice how increasing the sample size shrinks these bounds. For example, when n = 100, the interval limits are (0.47 < ρ < 0.71). The accurate manner of determining whether the sample correlation coefficient r found from analyzing a data set with two variables is significant or not is to conduct a standard hypothesis test involving null and alternative hypotheses.
Fig. 4.8 Plot depicting 95% CI for population correlation in a bivariate normal population for various sample sizes n. The bold vertical line defines the lower and upper limits of ρ when r = 0.6 from a data set of 10 pairs of observations. (From Wonnacutt and Wonnacutt 1985 by permission of John Wiley and Sons)
Table A.7 lists the critical values of the sample correlation coefficient r for testing whether the population correlation coefficient is statistically significant (i.e., the null hypothesis ρ = 0 against the alternative ρ ≠ 0) at the 0.05 and 0.01 significance levels for one and two tailed tests. The interpretation of these values is of some importance in many cases, especially when dealing with small data sets. Say, analysis of the 12 monthly bills of a residence revealed a linear correlation of r = 0.6 with degree-days at the location (see Pr. 2.28 for a physical explanation of the degree-day concept). Assume that a one-tailed test applies. The sample correlation suggests the presence of a correlation at a significance level α = 0.05 (the critical value from Table A.7 is ρc = 0.497) while there is none at α = 0.01 (for which ρc = 0.658). Note that certain simplified suggestions on interpreting values of r in terms of whether they are strong, moderate, or weak for general engineering analysis were given by Eq. (3.13); these are to be used with caution and were meant as rules of thumb only.

Instances do arise when the correlation coefficients determined on the same two variables, but from two different samples, are to be statistically compared. Fisher’s r to z transformation method can be adopted whereby the sampling distribution of the Pearson correlation coefficient r is converted into a normally distributed random variable z. Convert the two data sets (x1, y1) and (x2, y2) into two sets of normalized (zx1, zy1) and (zx2, zy2) scores and calculate the correlation coefficients zr1 and zr2 of both data sets separately following Eq. (3.12). Then calculate the random variable:
\[ z_{obs} = \frac{z_{r1} - z_{r2}}{\left[\frac{1}{n_1 - 3} + \frac{1}{n_2 - 3}\right]^{1/2}} \tag{4.18} \]
Standard statistical significance tests can then be performed on the random variable zobs. For example, say the observed value was zobs = 2.15. If the level of significance is set at 0.05, the critical value for a two-tailed test would be zcritical = 1.96. Since zobs > zcritical, the null hypothesis that the two correlations are not significantly different can be rejected.
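Both ideas in this section lend themselves to a short numerical sketch. Two assumptions are made that are not spelled out in the text: (i) the critical values of r in Table A.7 are taken to follow from the Student t critical values through the standard identity r_c = t_c / (t_c² + n − 2)^{1/2}, using t-table values for 10 d.f. that are not quoted in the section, and (ii) Fisher's transformation is taken in its usual form z_r = arctanh(r). The sample correlations used for the comparison are hypothetical.

```python
import math


def critical_r(t_critical, n):
    """Critical sample correlation implied by a t critical value with n - 2 d.f.
    (standard identity; assumed to underlie tables such as Table A.7)."""
    dof = n - 2
    return t_critical / math.sqrt(t_critical ** 2 + dof)


def fisher_z_difference(r1, n1, r2, n2):
    """z-statistic of Eq. (4.18) for comparing two sample correlations,
    assuming the usual Fisher transformation z_r = arctanh(r)."""
    z_r1, z_r2 = math.atanh(r1), math.atanh(r2)
    return (z_r1 - z_r2) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))


# 12 monthly bills: one-tailed t critical values for 10 d.f. from a standard t-table
r_crit_05 = critical_r(1.812, 12)
r_crit_01 = critical_r(2.764, 12)
print(f"critical r: {r_crit_05:.3f} (alpha = 0.05), {r_crit_01:.3f} (alpha = 0.01)")

# Hypothetical comparison of two samples (values chosen for illustration only)
z_obs = fisher_z_difference(r1=0.8, n1=50, r2=0.6, n2=60)
Z_CRITICAL_TWO_TAILED = 1.96
print(f"z_obs = {z_obs:.3f}; correlations differ: {abs(z_obs) > Z_CRITICAL_TWO_TAILED}")
```

With n = 12 the identity reproduces the critical values 0.497 and 0.658 cited above, which is a useful sanity check on the assumed relation.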
4.3 ANOVA Test for Multi-Samples
The statistical methods known as ANOVA (analysis of variance) are a broad set of widely used and powerful techniques meant to identify and measure sources of variation within a data set. This is done by partitioning the total variation in the data into its component parts. Specifically, ANOVA uses variance information from several samples to make inferences about the means of the populations from which these samples were drawn (and, hence, the appellation).
Fig. 4.9 Conceptual explanation of the basis of a single-factor ANOVA test. (From Devore and Farnum 2005 with permission from Thomson Brooks/Cole). [Panel (a): when H0 is true, the ratio of the variation between sample means to the variation within samples is small. Panel (b): when H0 is false, this ratio is large.]

4.3.1 Single-Factor ANOVA
Recall that z-tests and t-tests described previously are used to test for differences in one parameter treated as a random variable (such as mean values) between two independent groups, depending on whether the sample sizes are greater than or less than about 30, respectively. The two groups differ in some respect; they could be, say, two samples of 10 marbles each selected from a production line during different “time periods.” The “time period” is the experimental variable, which is referred to as a factor in designed experiments and hypothesis testing. The cases treated in Sect. 4.2 are single-factor hypothesis tests (on mean and variance) involving single and two groups or samples; the extension to multiple groups is the single-factor (or one-way) ANOVA method. This section deals with the single-factor ANOVA method, which is a logical lead-in to multivariate techniques (discussed in Sect. 4.4) and to experimental design methods involving multiple factors or experimental variables (discussed in Chap. 6).

The ANOVA procedure uses just one test for comparing k sample means, just like that followed by the two-sample test. The following example allows a conceptual understanding of the approach. Say, four random samples of the same physical quantity have been selected, each from a different source. Whether the sample means differ enough to suggest different parent populations for the sources can be ascertained by comparing the within-sample variation to the variation between the four samples. The more the sample means differ, the larger will be the between-samples variation, as shown in Fig. 4.9b, and the less likely it is that the samples arise from the same population. The reverse is true if the ratio of between-samples variation to that of the within-samples variation is small (Fig. 4.9a). Single-factor ANOVA methods test the null hypothesis of the form:
\[ H_0: \mu_1 = \mu_2 = \dots = \mu_k \qquad H_a: \text{at least two of the } \mu_i\text{'s are different} \tag{4.19} \]
Adopting the following notation:
Sample sizes: n1, n2, ..., nk
Sample means: x̄1, x̄2, ..., x̄k
Sample standard deviations: s1, s2, ..., sk
Total sample size: n = n1 + n2 + ... + nk
Grand average: ⟨x⟩ = weighted average of all n responses
Then, one defines the between-sample variation, called the “treatment sum of squares”⁷ (SSTr), as:
\[ SSTr = \sum_{i=1}^{k} n_i\left(\bar{x}_i - \langle x \rangle\right)^2 \quad \text{with d.f.} = k - 1 \tag{4.20} \]
and the within-samples variation or “error sum of squares” (SSE) as:
\[ SSE = \sum_{i=1}^{k} (n_i - 1)\,s_i^2 \quad \text{with d.f.} = n - k \tag{4.21} \]
Together these two sources of variation comprise the “total sum of squares” (SST):
\[ SST = SSTr + SSE = \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(x_{ij} - \langle x \rangle\right)^2 \quad \text{with d.f.} = n - 1 \tag{4.22} \]

⁷ The term “treatment” was originally coined for historic reasons where one was interested in evaluating the effect of treatments or specific changes in material mix and processing of a product during its development.
SST is simply (n − 1) times the sample variance of the combined set of n data points, i.e., SST = (n − 1)s² where s is the standard deviation of all the n data points. The statistic defined below as the ratio of two variances follows the F-distribution:
\[ F = \frac{MSTr}{MSE} \tag{4.23} \]
where the mean between-sample variation is
\[ MSTr = SSTr/(k - 1) \tag{4.24} \]
and the mean within-sample square error is
\[ MSE = SSE/(n - k) \tag{4.25} \]
Recall that the p-value is the area of the F curve for (k − 1, n − k) degrees of freedom to the right of the F-value. If p-value ≤ α (the selected significance level), then the null hypothesis can be rejected. Note that the F-test is meant to be used for normal populations and equal population variances.

Example 4.3.1⁸ Comparing mean life of five motor bearings
A motor manufacturer wishes to evaluate five different motor bearings for motor vibration (which adversely results in reduced life). Each type of bearing is installed on different random samples of six motors. The amount of vibration (in microns) is recorded when each of the 30 motors is running. The data obtained are assembled in Table 4.4.

Table 4.4 Amount of vibration (values in microns) for five brands of bearings tested on six motor samples (Example 4.3.1)ᵃ

| Sample | Brand 1 | Brand 2 | Brand 3 | Brand 4 | Brand 5 |
|---|---|---|---|---|---|
| 1 | 13.1 | 16.3 | 13.7 | 15.7 | 13.5 |
| 2 | 15.0 | 15.7 | 13.9 | 13.7 | 13.4 |
| 3 | 14.0 | 17.2 | 12.4 | 14.4 | 13.2 |
| 4 | 14.4 | 14.9 | 13.8 | 16.0 | 12.7 |
| 5 | 14.0 | 14.4 | 14.9 | 13.9 | 13.4 |
| 6 | 11.6 | 17.2 | 13.3 | 14.7 | 12.3 |
| Mean | 13.68 | 15.95 | 13.67 | 14.73 | 13.08 |
| Std. dev. | 1.194 | 1.167 | 0.816 | 0.940 | 0.479 |

ᵃ Data available electronically on book website

Determine from an F-test whether the bearing brands influence motor vibration at the α = 0.05 significance level.
In this example, the grand average ⟨x⟩ = 14.22, k = 5, and n = 30. The one-way ANOVA table is first generated as shown in Table 4.5 using Eqs. (4.20)–(4.25). For example,
\[ SSTr = 6 \times \left[(13.68 - 14.22)^2 + (15.95 - 14.22)^2 + (13.67 - 14.22)^2 + (14.73 - 14.22)^2 + (13.08 - 14.22)^2\right] = 30.855 \]
\[ SSE = 5 \times \left[1.194^2 + 1.167^2 + 0.816^2 + 0.940^2 + 0.479^2\right] = 22.838 \]

Table 4.5 ANOVA table for Example 4.3.1

| Source | d.f. | Sum of squares | Mean square | F-value |
|---|---|---|---|---|
| Factor | 5 − 1 = 4 | SSTr = 30.855 | MSTr = 7.714 | 8.44 |
| Error | 30 − 5 = 25 | SSE = 22.838 | MSE = 0.9135 | |
| Total | 30 − 1 = 29 | SST = 53.694 | | |

From the F-tables (Table A.6) and for α = 0.05, the critical F-value for d.f. = (4, 25) is Fc = 2.76, which is less than F = 8.44 computed from the data. Hence, one is compelled to reject the null hypothesis that all five means are equal, and conclude that the type of motor bearing does have a significant effect on motor vibration. In fact, this conclusion can be reached even at the more stringent significance level of α = 0.001. The results of the ANOVA analysis can be conveniently illustrated by generating an effects plot, as shown in Fig. 4.10a. This clearly illustrates the relationship between the mean values of the response variable, i.e., vibration level for the five different motor bearing brands. Brand 5 gives the lowest average vibration, while Brand 2 has the highest. Note that such plots, though providing useful insights, are not generally a substitute for an ANOVA analysis. Another way of plotting the data is a means plot (Fig. 4.10b) which includes 95% CL intervals as well as the information provided in Fig. 4.10a. Thus, a sense of the variation within samples can be gleaned. ■

⁸ From Devore and Farnum (2005) by permission of Cengage Learning.
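The entries of the ANOVA table for Example 4.3.1 can be recomputed directly from the raw vibration data of Table 4.4 using Eqs. (4.20)–(4.25). The following is a minimal standard-library sketch; the function name `one_way_anova` and the dictionary layout are illustrative choices, and the critical F-value is the Table A.6 figure quoted in the example.

```python
import statistics

# Vibration (microns) for five bearing brands, six motors each (Table 4.4)
vibration = {
    "Brand 1": [13.1, 15.0, 14.0, 14.4, 14.0, 11.6],
    "Brand 2": [16.3, 15.7, 17.2, 14.9, 14.4, 17.2],
    "Brand 3": [13.7, 13.9, 12.4, 13.8, 14.9, 13.3],
    "Brand 4": [15.7, 13.7, 14.4, 16.0, 13.9, 14.7],
    "Brand 5": [13.5, 13.4, 13.2, 12.7, 13.4, 12.3],
}


def one_way_anova(groups):
    """Single-factor ANOVA quantities following Eqs. (4.20)-(4.25)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-sample (treatment) sum of squares, Eq. (4.20)
    sstr = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-sample (error) sum of squares, Eq. (4.21)
    sse = sum((len(g) - 1) * statistics.variance(g) for g in groups)
    mstr = sstr / (k - 1)   # Eq. (4.24)
    mse = sse / (n - k)     # Eq. (4.25)
    return {
        "SSTr": sstr, "SSE": sse, "SST": sstr + sse,
        "MSTr": mstr, "MSE": mse, "F": mstr / mse,
        "dof": (k - 1, n - k),
    }


anova = one_way_anova(list(vibration.values()))

F_CRITICAL = 2.76  # Table A.6 value quoted in the text, d.f. = (4, 25), alpha = 0.05

for name in ("SSTr", "SSE", "SST", "MSTr", "MSE", "F"):
    print(f"{name} = {anova[name]:.3f}")
print("reject H0" if anova["F"] > F_CRITICAL else "cannot reject H0")
```

Working from the raw observations rather than the rounded means and standard deviations avoids the small rounding differences that appear when the sums of squares are evaluated by hand.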
Fig. 4.10 (a) Effect plot. (b) Means plot showing the 95% CL intervals around the mean values of the 5 brands (Example 4.3.1)
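The arithmetic of Example 4.3.1 can be retraced with a few lines of Python. The sketch below is not part of the original text. It assumes that the five values squared in the SSE expression are the sample standard deviations of the five brands, and it recomputes the grand mean from the rounded brand means, so the results agree with the quoted values only to within rounding. The critical value is simply copied from the text rather than computed.

```python
# One-way ANOVA from summary statistics (Example 4.3.1).
# Brand means and standard deviations are the values quoted in the text.
means = [13.68, 15.95, 13.67, 14.73, 13.08]    # sample mean of each brand
stdevs = [1.194, 1.167, 0.816, 0.940, 0.479]   # assumed sample standard deviations
n_per_group = 6                                # motors tested per brand

k = len(means)
n_total = k * n_per_group
grand_mean = sum(means) / k                    # valid because group sizes are equal

# Between-sample (treatment) and within-sample (error) sums of squares
ss_treatment = n_per_group * sum((m - grand_mean) ** 2 for m in means)
ss_error = (n_per_group - 1) * sum(s ** 2 for s in stdevs)

# Mean squares and F-value
ms_treatment = ss_treatment / (k - 1)
ms_error = ss_error / (n_total - k)
f_value = ms_treatment / ms_error

F_CRITICAL = 2.76  # tabulated value quoted in the text: d.f. = (4, 25), alpha = 0.05

print(f"SSTr = {ss_treatment:.3f}, SSE = {ss_error:.3f}")
print(f"MSTr = {ms_treatment:.3f}, MSE = {ms_error:.4f}, F = {f_value:.2f}")
print("Reject H0" if f_value > F_CRITICAL else "Cannot reject H0")
```

Working from summary statistics keeps the sketch independent of the raw data in Table 4.4; with the raw observations at hand, the same sums of squares would be accumulated directly from the individual readings.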
4.3.2
Tukey’s Multiple Comparison Test

A limitation of the ANOVA test is that, when the null hypothesis is rejected, one is unable to determine the exact cause. For example, one poor motor bearing brand could have been the cause of this rejection in the example above even though there is no significant difference among the other four brands. Thus, one needs to be able to pinpoint the culprit. One could, of course, perform paired comparisons of two brands one at a time. In the case of 5 sets, one would then make 10 such tests. Apart from the tediousness of such a procedure, making independent paired comparisons leads to a decrease in sensitivity, i.e., type I errors are magnified.⁹ Rigorous classical procedures that allow multiple comparisons to be made simultaneously have been proposed for this purpose (see Manly 2005). One such method is discussed in Sect. 4.4.2.

In this section, Tukey’s HSD (honestly significant difference) procedure is described, which is used to test the differences among multiple sample means by pairwise testing. It is limited to cases of equal sample sizes. This procedure allows the simultaneous formation of prespecified confidence intervals for all paired comparisons using the Studentized range distribution. Separate tests are conducted to determine whether μi = μj for each pair (i, j) of means in an ANOVA study of k population means. Tukey’s procedure is based on comparing the distance (i.e., the absolute difference) between any two sample means |x̄i - x̄j| to a threshold value T that depends on the significance level α as well as on the mean square error (MSE) from the ANOVA test. The T-value is calculated as:

$$T = q_\alpha \left( \frac{\mathrm{MSE}}{n_i} \right)^{1/2} \tag{4.26a}$$

where ni is the size of the sample drawn from each brand population, and the qα values, called the Studentized range distribution values, are given in Table A.8 for α = 0.05 for d.f. = (k, n - k). If |x̄i - x̄j| > T, then one concludes that μi ≠ μj at the corresponding significance level. Otherwise, one concludes that there is no significant difference between the two means. Tukey also suggested a convenient visual representation to keep track of the results of all these pairwise tests. Tukey’s HSD procedure and this representation are illustrated in the following example.

⁹ This can be shown mathematically but is beyond the scope of this text.

Example 4.3.2¹⁰
Using the same data as in Example 4.3.1, conduct a multiple comparison procedure to distinguish which of the motor bearing brands are superior to the rest.

Following Tukey’s HSD procedure given by Eq. (4.26a), the critical distance between sample means at α = 0.05 is:

$$T = q_\alpha \left( \frac{\mathrm{MSE}}{n_i} \right)^{1/2} = 4.15 \left( \frac{0.913}{6} \right)^{1/2} = 1.62$$

where qα is found by interpolation from Table A.8 based on d.f. = (k, n - k) = (5, 25). The pairwise distances between the five sample means shown in Table 4.6 can be determined, and appropriate inferences made. Note that the number of pairwise comparisons is [k(k - 1)/2], which in this case of k = 5 is equal to 10 comparisons as shown in the table.

Table 4.6 Pairwise analysis of the five samples following Tukey’s HSD procedure (Example 4.3.2)

Samples   Distance                   Conclusionᵃ
1,2       |13.68 - 15.95| = 2.27     μi ≠ μj
1,3       |13.68 - 13.67| = 0.01
1,4       |13.68 - 14.73| = 1.05
1,5       |13.68 - 13.08| = 0.60
2,3       |15.95 - 13.67| = 2.28     μi ≠ μj
2,4       |15.95 - 14.73| = 1.22
2,5       |15.95 - 13.08| = 2.87     μi ≠ μj
3,4       |13.67 - 14.73| = 1.06
3,5       |13.67 - 13.08| = 0.59
4,5       |14.73 - 13.08| = 1.65     μi ≠ μj
ᵃ Indicated only if distance > critical value of 1.62

Thus, the distances between the following pairs are less than T = 1.62: {1,3; 1,4; 1,5}, {2,4}, {3,4; 3,5}. This information is visually summarized in Fig. 4.11 by arranging the five sample means in ascending order and then drawing rows of bars connecting the pairs whose distances do not exceed T = 1.62. It is now clear that though brand 5 has the lowest mean value, it is not significantly different from brands 1 and 3. Hence, the final selection of a motor bearing with low vibration can be made from these three brands only (brands 1, 3, 5). ■

[Figure: sample means in ascending order, Brand 5 (13.08), Brand 3 (13.67), Brand 1 (13.68), Brand 4 (14.73), Brand 2 (15.95), with bars connecting the pairs that are not significantly different]
Fig. 4.11 Graphical depiction summarizing the ten pairwise comparisons following Tukey’s HSD procedure. Brand 2 is significantly different from Brands 1, 3, and 5, and so is Brand 4 from Brand 5 (Example 4.3.2)

¹⁰ From Devore and Farnum (2005), by permission of Cengage Learning.

The Tukey method is said to be too conservative when the intent is to compare the means of (k - 1) samples against a single sample taken to be the “control” or reference. An alternative multi-comparison test suitable in such instances is Dunnett’s method (see, for example, Devore and Farnum 2005). Instead of performing pairwise comparisons among all the samples, this method computes the critical T-value following:

$$T = t_\alpha \left[ \mathrm{MSE} \left( \frac{1}{n_i} + \frac{1}{n_c} \right) \right]^{1/2} \tag{4.26b}$$

where ni and nc are the sizes of the samples drawn from the individual groups and the control group respectively, and tα is the critical value found from Table A.9 based on d.f. = (k - 1, n - k). Thus, the number of pairwise comparisons reduces to (k - 1) tests as against [k(k - 1)/2] for the Tukey method. For the example above, the number of tests is only 4 instead of 10 for the Tukey approach.
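The ten pairwise comparisons of Example 4.3.2 lend themselves to a short loop. The following Python sketch is an illustration added here, not the authors' code; the value qα = 4.15 is copied from the text (it is not computed), and the variable names are arbitrary.

```python
from itertools import combinations
from math import sqrt

# Brand means and ANOVA results quoted in Examples 4.3.1 and 4.3.2
brand_means = {1: 13.68, 2: 15.95, 3: 13.67, 4: 14.73, 5: 13.08}
MSE = 0.913        # mean square error from the one-way ANOVA
N_PER_BRAND = 6    # equal sample size for every brand
Q_ALPHA = 4.15     # Studentized range value from the text (alpha = 0.05, d.f. = (5, 25))

# Critical distance, Eq. (4.26a)
threshold = Q_ALPHA * sqrt(MSE / N_PER_BRAND)

significant_pairs = []
for i, j in combinations(sorted(brand_means), 2):
    distance = abs(brand_means[i] - brand_means[j])
    is_different = distance > threshold
    if is_different:
        significant_pairs.append((i, j))
    verdict = "different" if is_different else "not distinguishable"
    print(f"brands {i},{j}: distance = {distance:.2f} -> {verdict}")

print(f"T = {threshold:.2f}; significantly different pairs: {significant_pairs}")
```

`itertools.combinations` enumerates exactly k(k - 1)/2 unordered pairs, which for k = 5 gives the 10 comparisons discussed above.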
4.4
Tests of Significance of Multivariate Data
4.4.1
Introduction to Multivariate Methods
Multivariate statistical analysis (also called multifactor analysis) deals with statistical inference as applied to multiple parameters (each considered to behave like a random variable) deduced from one or several samples taken from one or several populations. Multivariate methods can be used to make inferences about parameters such as means and variances. Rather than treating each parameter or variable separately, as done in t-tests and single-factor ANOVA, multivariate inferential methods allow the analysis of multiple variables simultaneously as a system of measurements. This generally allows sounder inferences to be made, a point elaborated below.

The univariate probability distributions presented in Sect. 2.4 can also be extended to bivariate and multivariate distributions. Let x1 and x2 be two variables of the same type, say both discrete (for continuous variables, the summations in the equations below need to be replaced with integrals). Their joint distribution f(x1, x2) satisfies:

$$f(x_1, x_2) \ge 0 \quad \text{and} \quad \sum_{\text{all } (x_1, x_2)} f(x_1, x_2) = 1 \tag{4.27}$$
Consider two sets of multivariate data each consisting of p variables. However, the sets could be different in size, i.e., the number of observations in each set may be different, say n1 and n2. Let X̄1 and X̄2 be the sample mean vectors of dimension p. For example,

$$\bar{X}_1 = \left[ x_{11}, x_{12}, \ldots, x_{1i}, \ldots, x_{1p} \right] \tag{4.28}$$
where x1i is the sample average over n1 observations of parameter i for the first set. Further, let C1 and C2 be the sample covariance matrices of size (p × p) for the two sets respectively (the basic concepts of covariance and correlation were presented in Sect. 3.4.2). Then, the sample matrix of variances and covariances for the first data set is given by:
$$C_1 = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1p} \\ c_{21} & c_{22} & \cdots & c_{2p} \\ \vdots & \vdots & & \vdots \\ c_{p1} & c_{p2} & \cdots & c_{pp} \end{bmatrix} \tag{4.29}$$

where cii is the variance of parameter i and cik the covariance of parameters i and k. Similarly, the sample correlation matrix, in which the diagonal elements are equal to unity and the other terms are scaled appropriately, is given by:

$$R_1 = \begin{bmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{bmatrix} \tag{4.30}$$

Both matrices contain information on the correlation between each pair of variables, and they are symmetric about the diagonal since, say, r12 = r21, and so on. This redundancy is simply meant to allow easier reading. These matrices provide a convenient visual representation of the extent to which the different sets of variables are correlated with each other, thereby allowing strongly correlated sets to be easily identified. Note that correlations are not affected by shifting and scaling the data. Thus, standardizing the variables, i.e., subtracting the mean from each observation and dividing by the standard deviation, retains the correlation structure of the original data set while providing certain convenient interpretations of the results.

The underlying assumptions for multivariate tests of significance are that the two samples have close to multivariate normal distributions with equal population covariance matrices. The multivariate normal distribution is a generalization of the univariate normal distribution to p ≥ 2, where p is the number of dimensions or parameters. Figure 4.12 illustrates how the bivariate normal distribution is distorted in the presence of correlated variables. The contour lines are circles for uncorrelated variables and ellipses for correlated ones.

Fig. 4.12 Two bivariate normal distributions and associated 50% and 90% contours assuming equal standard deviations for both variables. However, the left-hand side plot (a) presumes the two variables to be uncorrelated, while for the right (plot b) the two variables have a correlation coefficient of 0.75, which results in elliptical contours. (From Johnson and Wichern, 1988, by permission of Pearson Education)
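The statement that standardizing leaves the correlation structure unchanged can be checked numerically. The sketch below is a minimal illustration using only the Python standard library; the data set is hypothetical (invented for this illustration), and the function names `covariance_matrix` and `correlation_matrix` are likewise assumptions, not names used in the text.

```python
from math import sqrt

def covariance_matrix(rows):
    """Sample covariance matrix (divisor n - 1) for observations stored row-wise."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[k] - means[k]) for r in rows) / (n - 1)
             for k in range(p)] for i in range(p)]

def correlation_matrix(cov):
    """Scale a covariance matrix so that its diagonal elements equal unity."""
    p = len(cov)
    return [[cov[i][k] / sqrt(cov[i][i] * cov[k][k]) for k in range(p)]
            for i in range(p)]

# Hypothetical observations of p = 3 variables (illustrative numbers only)
data = [[2.0, 10.0, 1.5],
        [3.0, 14.0, 1.1],
        [5.0, 19.0, 0.9],
        [4.0, 15.0, 1.0],
        [6.0, 24.0, 0.4]]

C = covariance_matrix(data)
R = correlation_matrix(C)

# Standardize: subtract each variable's mean and divide by its standard deviation
n, p = len(data), len(data[0])
means = [sum(r[j] for r in data) / n for j in range(p)]
sds = [sqrt(C[j][j]) for j in range(p)]
standardized = [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in data]

C_std = covariance_matrix(standardized)   # covariance of the standardized data
R_std = correlation_matrix(C_std)         # correlation of the standardized data

for row in R:
    print(["%.3f" % v for v in row])
```

For standardized variables the covariance matrix and the correlation matrix coincide, which is one of the "convenient interpretations" alluded to above.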
4.4.2
Hotelling T² Test
The simplest extension of univariate statistical tests is the situation when two or more samples are evaluated to determine whether they originate from populations with: (i) different means, and (ii) different variances/covariances. One can distinguish between the following types of multivariate inference tests involving more than one parameter (Manly 2005):
(a) comparison of several mean values of factors from two samples, which is best done using the Hotelling T²-test;
(b) comparison of variance for two samples (several procedures have been proposed; the best known are Box’s M-test, Levene’s test based on the T²-test, and the Van Valen test);
(c) comparison of mean values for several samples (several tests are available; the best known are the Wilks’ lambda statistic test, Roy’s largest root test, and Pillai’s trace statistic test);
(d) comparison of variance for several samples (using Box’s M-test).
Only case (a) will be described here, while the others are treated in texts such as Manly (2005).

Consider two samples with sample sizes n1 and n2. One wishes to compare differences in p random variables among the two samples. Let X̄1 and X̄2 be the mean vectors of the two samples. A pooled estimate of the covariance matrix is:

$$C = \frac{(n_1 - 1) C_1 + (n_2 - 1) C_2}{n_1 + n_2 - 2} \tag{4.31}$$

where C1 and C2 are the covariance matrices given by Eq. (4.29). Then, the Hotelling T²-statistic is defined as:

$$T^2 = \frac{n_1 n_2}{n_1 + n_2} \left( \bar{X}_1 - \bar{X}_2 \right)' C^{-1} \left( \bar{X}_1 - \bar{X}_2 \right) \tag{4.32}$$

A large numerical value of this statistic suggests that the two population mean vectors are different. The null hypothesis test uses the transformed statistic:

$$F = \frac{(n_1 + n_2 - p - 1) \, T^2}{(n_1 + n_2 - 2) \, p} \tag{4.33}$$

which follows the F-distribution with p and (n1 + n2 - p - 1) degrees of freedom. Since the T² statistic is quadratic, it can also be written in double sum notation as:

$$T^2 = \frac{n_1 n_2}{n_1 + n_2} \sum_{i=1}^{p} \sum_{k=1}^{p} (x_{1i} - x_{2i}) \, c^{ik} \, (x_{1k} - x_{2k}) \tag{4.34}$$

where c^{ik} is the element in row i and column k of the inverse matrix C⁻¹. The following solved example serves to illustrate the use of the above equations.

Example 4.4.1¹¹ Comparing mean values of two samples by pairwise and by Hotelling T² procedures
Consider two independent samples, each described by 5 parameters (p = 5). Sample 1 has 21 observations and sample 2 has 28. The mean vectors and covariance matrices of both these samples have been calculated and are shown below:

$$\bar{X}_1 = \begin{bmatrix} 157.381 \\ 241.000 \\ 31.433 \\ 18.500 \\ 20.810 \end{bmatrix} \quad \text{and} \quad C_1 = \begin{bmatrix} 11.048 & 9.100 & 1.557 & 0.870 & 1.286 \\ 9.100 & 17.500 & 1.910 & 1.310 & 0.880 \\ 1.557 & 1.910 & 0.531 & 0.189 & 0.240 \\ 0.870 & 1.310 & 0.189 & 0.176 & 0.133 \\ 1.286 & 0.880 & 0.240 & 0.133 & 0.575 \end{bmatrix}$$

$$\bar{X}_2 = \begin{bmatrix} 158.429 \\ 241.571 \\ 31.479 \\ 18.446 \\ 20.839 \end{bmatrix} \quad \text{and} \quad C_2 = \begin{bmatrix} 15.069 & 17.190 & 2.243 & 1.746 & 2.931 \\ 17.190 & 32.550 & 3.398 & 2.950 & 4.066 \\ 2.243 & 3.398 & 0.728 & 0.470 & 0.559 \\ 1.746 & 2.950 & 0.470 & 0.434 & 0.506 \\ 2.931 & 4.066 & 0.559 & 0.506 & 1.321 \end{bmatrix}$$

¹¹ From Manly (2005), by permission of CRC Press.

If one performed two-sample t-tests with each parameter taken one at a time (as described in Sect. 4.2.3), one would compute the pooled variance for the first parameter as:

$$s_1^2 = \frac{(21 - 1)(11.048) + (28 - 1)(15.069)}{21 + 28 - 2} = 13.36$$

and the t-statistic as:

$$t = \frac{157.381 - 158.429}{\left[ 13.36 \left( \frac{1}{21} + \frac{1}{28} \right) \right]^{1/2}} = -0.99$$

with (21 + 28 - 2 =) 47 degrees of freedom. This is not significantly different from zero, as one can note from the p-value indicated in Table A.4. Table 4.7 assembles similar results for all other parameters. One would conclude that none of the five parameters differs significantly between the two data sets.
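The parameter-by-parameter t-statistics of Example 4.4.1 follow directly from the sample means and the diagonal elements (variances) of C1 and C2. The short Python sketch below is an added illustration rather than the authors' code; it reproduces only the t-values, since p-values would require a t-distribution routine that the standard library does not provide.

```python
from math import sqrt

n1, n2 = 21, 28
mean1 = [157.381, 241.000, 31.433, 18.500, 20.810]
mean2 = [158.429, 241.571, 31.479, 18.446, 20.839]
var1 = [11.048, 17.500, 0.531, 0.176, 0.575]   # diagonal of C1
var2 = [15.069, 32.550, 0.728, 0.434, 1.321]   # diagonal of C2

t_values = []
for m1, m2, v1, v2 in zip(mean1, mean2, var1, var2):
    # Pooled variance of the two samples for this parameter
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    # Two-sample t-statistic with n1 + n2 - 2 degrees of freedom
    t_values.append((m1 - m2) / sqrt(pooled_var * (1 / n1 + 1 / n2)))

for index, t in enumerate(t_values, start=1):
    print(f"parameter {index}: t = {t:+.2f} (d.f. = {n1 + n2 - 2})")
```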
Table 4.7 Two-sample t-tests for each of the five parameters taken one at a time (Example 4.4.1)

            First data set        Second data set
Parameter   Mean      Variance    Mean      Variance    t-value (d.f. = 47)   p-value
1           157.38    11.05       158.43    15.07       -0.99                 0.327
2           241.00    17.50       241.57    32.55       -0.39                 0.698
3           31.43     0.53        31.48     0.73        -0.20                 0.842
4           18.50     0.18        18.45     0.43         0.33                 0.743
5           20.81     0.58        20.84     1.32        -0.10                 0.921

To perform the multivariate test, one first calculates the pooled sample covariance matrix (Eq. 4.31):

$$C = \frac{20 C_1 + 27 C_2}{47} = \begin{bmatrix} 13.358 & 13.748 & 1.951 & 1.373 & 2.231 \\ 13.748 & 26.146 & 2.765 & 2.252 & 2.710 \\ 1.951 & 2.765 & 0.645 & 0.350 & 0.423 \\ 1.373 & 2.252 & 0.350 & 0.324 & 0.347 \\ 2.231 & 2.710 & 0.423 & 0.347 & 1.004 \end{bmatrix}$$

where, for example, the first entry is: (20 × 11.048 + 27 × 15.069)/47 = 13.358. The inverse of the matrix C yields

$$C^{-1} = \begin{bmatrix} 0.2061 & -0.0694 & -0.2395 & 0.0785 & -0.1969 \\ -0.0694 & 0.1234 & -0.0376 & -0.5517 & 0.0277 \\ -0.2395 & -0.0376 & 4.2219 & -3.2624 & -0.0181 \\ 0.0785 & -0.5517 & -3.2624 & 11.4610 & -1.2720 \\ -0.1969 & 0.0277 & -0.0181 & -1.2720 & 1.8068 \end{bmatrix}$$

Substituting the elements of the above matrix in Eq. (4.34) leads to:

$$T^2 = \frac{(21)(28)}{(21 + 28)} \left[ (157.381 - 158.429)(0.2061)(157.381 - 158.429) + (157.381 - 158.429)(-0.0694)(241.000 - 241.571) + \cdots + (20.810 - 20.839)(1.8068)(20.810 - 20.839) \right] = 2.824$$

which from Eq. (4.33) results in an F-statistic of

$$F = \frac{(21 + 28 - 5 - 1)(2.824)}{(21 + 28 - 2)(5)} = 0.517$$

with d.f. = (5, 43).
This is clearly not significant since Fcritical = 2.4 (from Table A.6), and so there is no evidence that the population means of the two groups are statistically different when all five parameters are simultaneously considered. In this case one could have drawn such a conclusion directly from Table 4.7 by looking at the individual p-values, but this may not always happen. ■

Other than the elegance provided, there are two distinct advantages of performing a single multivariate test as against a series of univariate tests. First, when a series of univariate tests is performed, the probability of finding a type I (false positive) result purely by accident increases as the number of variables increases. Second, the multivariate test takes proper account of the correlation between variables. The above example illustrated the case where no significant differences in population means could be discerned either from univariate tests performed individually or from an overall multivariate test. However, there are instances when the latter test turns out to be significant as a result of the cumulative effects of all parameters even though no single parameter is significantly different. The converse may also hold: the evidence provided by one significantly different parameter may be swamped by the lack of differences between the other parameters. Hence, it is advisable to perform both types of tests, as illustrated in the example above.

Sections 4.2, 4.3, and 4.4 treated several cases of hypothesis testing. An overview of these cases is provided in Fig. 4.13 for greater clarity. The specific sub-section of each of the cases is also indicated. The ANOVA case corresponds to the lower right box, namely testing for differences in the means of a single factor or variable which is sampled from several populations, while the Hotelling T²-test corresponds to the case when the means of several variables from two samples are evaluated.

As noted above, formal use of statistical methods can become very demanding mathematically and computationally when multivariate and multiple samples are considered, and hence the advantage of using numerically based resampling methods (discussed in Sect. 4.8).
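As a cross-check on Example 4.4.1, the double sum of Eq. (4.34) and the transformation of Eq. (4.33) can be evaluated in a few lines of Python. This sketch is an added illustration under stated assumptions: it uses the inverse matrix exactly as printed in the text (rounded to four decimals) instead of inverting C numerically, so the computed T² matches the quoted 2.824 only to within rounding, and the critical value is copied from the text.

```python
n1, n2, p = 21, 28, 5
mean1 = [157.381, 241.000, 31.433, 18.500, 20.810]
mean2 = [158.429, 241.571, 31.479, 18.446, 20.839]

# Inverse of the pooled covariance matrix, copied from the text (4-decimal rounding)
C_INV = [
    [0.2061, -0.0694, -0.2395, 0.0785, -0.1969],
    [-0.0694, 0.1234, -0.0376, -0.5517, 0.0277],
    [-0.2395, -0.0376, 4.2219, -3.2624, -0.0181],
    [0.0785, -0.5517, -3.2624, 11.4610, -1.2720],
    [-0.1969, 0.0277, -0.0181, -1.2720, 1.8068],
]

# Vector of differences between the two sample mean vectors
d = [a - b for a, b in zip(mean1, mean2)]

# Quadratic form of Eq. (4.34): sum over i and k of d_i * c^{ik} * d_k
quad_form = sum(d[i] * C_INV[i][k] * d[k] for i in range(p) for k in range(p))
t2 = n1 * n2 / (n1 + n2) * quad_form

# Transformation to an F-statistic, Eq. (4.33), with d.f. = (p, n1 + n2 - p - 1)
f_stat = (n1 + n2 - p - 1) * t2 / ((n1 + n2 - 2) * p)

F_CRITICAL = 2.4  # value quoted in the text from Table A.6
print(f"T2 = {t2:.3f}, F = {f_stat:.3f}, d.f. = ({p}, {n1 + n2 - p - 1})")
print("Reject H0" if f_stat > F_CRITICAL else "Cannot reject H0")
```

In practice one would solve the linear system C y = d with a numerical library rather than form the inverse explicitly; the printed inverse is used here only to stay close to the worked example.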
[Fig. 4.13 flowchart: hypothesis tests grouped into one-sample, two-sample, and multi-sample cases, including the Hotelling T² test (Sect. 4.4.2), ANOVA (Sect. 4.3), and non-parametric tests (Sect. 4.5)]
Fig. 4.13 Overview of various types of parametric hypothesis tests treated in this chapter along with section numbers. Non-parametric tests are also covered briefly as indicated. The term “variable” is used instead of “parameters” since the latter are treated as random variables
4.5
Non-Parametric Tests
The parametric tests described above have implicit built-in assumptions regarding the distributions from which the samples are taken. Comparison of samples from populations using the t-test and F-test can yield misleading results when the random variables being measured are not normally distributed or do not have equal variances. Clearly, the fewer the assumptions, the broader the potential applications of a test. One would like the significance tests used to lead to sound conclusions, or the risk of coming to misleading/wrong conclusions to be minimized. Two concepts relate to the latter aspect. The robustness of a test is inversely related to its sensitivity to violations of the underlying assumptions. The power of a test, on the other hand, is the probability of correctly rejecting a false null hypothesis; a more powerful test needs fewer observations to detect a given effect, which reduces the cost of experimentation.

Non-parametric tests can be applied to continuous numerical measurements but tend to be more widely used for categorical or ordinal data, i.e., data that can only be ranked in order of magnitude. This occurs frequently in management, social sciences, and psychology. For example, a consumer survey respondent may rate one product as better than another but be unable to assign quantitative values to each product. Data involving such “preferences” cannot be subjected to the t- and F-tests. It is in such cases that one must resort to nonparametric statistics. Rather than use actual numbers, nonparametric tests usually use relative ranks, obtained by sorting the data by rank (or score) and discarding their specific numerical values. Because nonparametric tests do not use all the information contained in the data, they are generally less powerful than parametric ones (i.e., they tend to lead to wider confidence intervals). On the other hand, they are more intuitive, simpler to perform, more robust, and less sensitive to outlier points when the data are noisy or when the underlying empirical distribution of the parameters is non-Gaussian. Much of the material discussed below can be found in statistical textbooks; for example, McClave and Benson (1988), Devore and Farnum (2005), Walpole et al. (2007), and Heiberger and Holland (2015).

The non-parametric tests described in this section are meant to evaluate whether the medians of drawn samples/groups are close enough to conclude that they have emanated from the same population. The tests are predicated on the homogeneity of variation assumption, i.e., the population variances of the dependent variable are equal within all groups with the same Gaussian or non-Gaussian distribution shape. Whether this is credible or not could be evaluated by visually inspecting sample distributions. A better approach is to use Levene’s variance test. It is based on the fact that the absolute differences between all scores and their (group) means should be roughly equal over groups. A larger variance implies that, on average, the data values are “further away” from their mean, and vice versa. Thus, this test is very similar to a one-factor ANOVA test based, not on the actual data, but on the absolute difference scores.
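The description of Levene’s test as a one-factor ANOVA on absolute difference scores can be made concrete with a short Python sketch. It is an added illustration, not taken from the text: the two data sets are hypothetical, the function names are arbitrary, and only the F-type statistic is computed (comparison against a tabulated critical F-value for d.f. = (k - 1, n - k) would follow as in Sect. 4.3).

```python
def one_way_f(groups):
    """F statistic of a one-way ANOVA for a list of samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def levene_statistic(groups):
    """One-way ANOVA F statistic applied to absolute deviations from group means."""
    abs_deviations = [[abs(x - sum(g) / len(g)) for x in g] for g in groups]
    return one_way_f(abs_deviations)

# Hypothetical samples: same spread in every group (only the location shifts)
equal_spread = [[1.0, 2.0, 3.0, 5.0],
                [11.0, 12.0, 13.0, 15.0],
                [21.0, 22.0, 23.0, 25.0]]

# Hypothetical samples: the second group is far more scattered than the others
unequal_spread = [[4.0, 5.0, 6.0, 5.5],
                  [1.0, 5.0, 9.0, 13.0],
                  [4.8, 5.0, 5.2, 5.1]]

print(f"equal spread:   W = {levene_statistic(equal_spread):.3f}")
print(f"unequal spread: W = {levene_statistic(unequal_spread):.3f}")
```

A statistic near zero indicates that the average absolute deviation is about the same in all groups, whereas a large value points to unequal variances.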
4.5.1
Signed and Rank Tests for Medians

Some of the common non-parametric tests for the medians of one-sample and two-sample data are described below. No restriction is placed on the empirical distributions other than that they be continuous and symmetric (so that the mean and median are approximately equal).

(a) The Sign Test for medians
The median is a value where one expects 50% of the observations to be above and 50% below this value. The magnitude of the observations is converted to either a + sign or a - sign depending on whether the individual observation value is above or below the median value. One would expect half of the signs to be positive (+) and half to be negative (-). A statistical test is performed on the actual number of + and - values to draw an inference. The procedure is best illustrated by means of an example.

Example 4.5.1
A building maintenance technician claims that he can repair most non-major mechanical breakdowns of the HVAC system within a median time of 6 h. The maintenance supervisor records the following response times in the last seven events (assumed independent):

Response time (hour)      5.2    6.7    5.5    5.8    6.3    5.6
Difference from median   -0.8   +0.7   -0.5   -0.2   +0.3   -0.4
Sign of difference        -      +      -      -      +      -

In the observed sample of 7 events, there are 2 instances when the response time is greater than the median value of 6 h, i.e., a proportion of (2/7 = 0.286). The probability that two “successes” occurred purely by chance can be ascertained by using the binomial distribution (Eq. 2.37a) with p = 0.5, since only one of two outcomes is possible (the response time is either greater or less than the stated median):

$$B(2; 7, 0.5) = \binom{7}{2} \left( \frac{1}{2} \right)^{2} \left( 1 - \frac{1}{2} \right)^{7-2} = \frac{7 \times 6}{2} \times \frac{1}{4} \times \left( \frac{1}{2} \right)^{5} = 0.164, \text{ i.e., } 16.4\%$$

Similarly, B(0; 7, 0.5) = 0.5^7 = 0.0078, and B(1; 7, 0.5) = 0.0547. Finally, the probability that two or fewer successes occurred purely by chance = 0.0078 + 0.0547 + 0.164 = 0.227. Since this value of 0.227 < 0.286, the null hypothesis that the median value of the response times is less than 6 h should be rejected. ■

When the number of observations is large, the calculations may become tedious. In such cases, it is convenient to use the binomial distribution tables (Appendix A1); alternatively, the normal approximation to the binomial distribution can be used for ease of computation.

(b) Mann–Whitney or Wilcoxon rank sum test for medians of two samples
The sign test can be said to discard too much information in the data. A stronger test is the Mann–Whitney test (also referred to as the Wilcoxon rank sum test), used to evaluate whether two separate samples have been drawn from the same population or not. It is thus the non-parametric version of the two-sample t-test for sample means (Sect. 4.2.3). The sampled data must be continuous, and the sampled population should be close to symmetric and can be either Gaussian or non-Gaussian (if Gaussian, parametric tests are preferable). The Mann–Whitney test involves ranking the individual observations of both samples combined, and then summing the ranks of both groups separately. A test is performed on the two sums to deduce whether the two samples come from the same population or not. While simple and intuitive, the test is nonetheless grounded in statistical theory. The following example illustrates the approach.

Example 4.5.2 Ascertaining whether oil company researchers and academics differ in their predictions of future atmospheric carbon dioxide levels
The intent is to compare the predictions of the change in atmospheric carbon dioxide levels between researchers who are employed by oil companies and those who are in academia. The gathered data, shown in Table 4.8, are the percentage increases in carbon dioxide from the current level over the next 10 years as predicted by six oil company researchers and seven academics. Perform a statistical test at the 0.05 significance level to evaluate the following hypotheses:
(i) Predictions made by oil company researchers differ from those made by academics.
(ii) Predictions made by oil company researchers tend to be lower than those made by academics.
Table 4.8 Wilcoxon rank sum test calculation for two independent samples (Example 4.5.2)

        Oil company researchers      Academics
        Prediction (%)    Rank       Prediction (%)    Rank
1       3.5               4          4.7               6
2       5.2               7          5.8               9
3       2.5               2          3.6               5
4       5.6               8          6.2               11
5       2.0               1          6.1               10
6       3.0               3          6.3               12
7       –                 –          6.5               13
Sum                       25                           66

(i) First, both groups are combined into a single group and ranks are assigned to the combined group, starting from rank = 1 for the lowest algebraic value and continuing in ascending order. Next, the ranks of each group are separately tabulated and summed. Since there are 13 predictions, the ranks run from 1 through 13 as shown in the table. The test statistic is based on the sum totals of each group separately (and hence its name). If they are close, the implication is that there is no evidence that the probability distributions of both groups are different; and vice versa. Let TA and TB be the rank sums of the two groups. Then, the sum of all the individual ranks is:

$$T_A + T_B = \frac{n(n + 1)}{2} = \frac{13(13 + 1)}{2} = 91 \tag{4.35}$$

where n = n1 + n2 with n1 = 6 and n2 = 7. Note that n1 should be selected as the sample with fewer observations. Since (TA + TB) is fixed, a small value of TA implies a large value of TB, and vice versa. Hence, the greater the difference between the two rank sums, the greater the evidence that the samples come from different populations. Since one is testing whether the predictions by both groups are different or not, the two-tailed significance test is appropriate. Table A.11 provides the lower and upper cutoff values for different values of n1 and n2 for both the one-tailed and the two-tailed tests. Note that the lower and upper cutoff values are (28, 56) at the 0.05 significance level for the two-tailed test. Since the computed statistics of TA = 25 and TB = 66 fall outside this range, the null hypothesis is rejected, and one would conclude that the predictions from the two groups are different.

(ii) Here one wishes to test the hypothesis that the predictions by oil company researchers are lower than those made by academics. Then, one uses a one-tailed test whose cutoff values are given in part (b) of Table A.11. These cutoff values at the 0.05 significance level are (30, 54), but only the lower value of 30 is used for the problem specified. The null hypothesis will be rejected only if TA < 30. Since this is so, the above data suggest that the null hypothesis can be rejected at a significance level of 0.05. ■

(c) The Wilcoxon signed rank sum test for medians of two paired samples
This test is meant for paired data where the samples taken are not independent. It is analogous to the two-sample paired difference test treated in Sect. 4.2.3b, where the paired differences in data are converted to a single random variable. Again, the sampled data must be continuous, and the sampled population should be close to symmetric. The Wilcoxon signed rank sum test involves calculating the differences between the paired data, ranking their absolute values, and summing the ranks of the positive and negative differences separately. A statistical test is finally applied to reach a conclusion on whether the two distributions are significantly different or not. This is illustrated by the following example.

Example 4.5.3 Evaluating predictive accuracy of two climate change models from expert elicitation
A policy maker wishes to evaluate the predictive accuracy of two different climate change models for predicting short-term (e.g., 30 years) carbon dioxide changes in the atmosphere. He consults 10 experts and asks them to grade these models on a scale from 1 to 10, with 10 being extremely accurate. Clearly, these data are not independent since the same expert is asked to make two value judgments about the models being evaluated. The data shown in Table 4.9 are obtained (note that these are not ranked values, except for the last column, but are grades from 1 to 10 assigned by the experts). The tests are on the medians, which are approximately equal to the means for symmetric distributions.
Null hypothesis H0: μ1 - μ2 = 0 (there is no significant difference in the distributions)
Alternative hypothesis Ha: μ1 - μ2 ≠ 0 (there is a significant difference in the distributions)

The paired differences are first computed (as shown in Table 4.9), from which the ranks are generated based on the absolute differences, and finally the sums of the positive and negative ranks are computed. Note how the ranking has been assigned since there are repeats in the absolute difference values. There are three values of “1” in the absolute difference column, which would occupy ranks 1, 2, and 3. Hence the mean rank of “2” has been assigned to all three. Similarly, for the three absolute differences of “2,” the rank is given as “5,” and so on. For the highest absolute difference of “6,” the rank is assigned as “10.” The values shown in the last two rows of the table are also simple to deduce. The values in the difference (A - B) column are either positive or negative. One simply adds up all the rank values corresponding to the cases when (A - B) is positive, and separately those when it is negative. These are found to be 46 and 9 respectively.

The test statistic for the null hypothesis is T = min (T-, T+). In our case, T = 9. The smaller the value of T, the stronger the evidence that the difference between both distributions is significant. The rejection region for T is determined from Table A.12. The two-tailed critical value for n = 10 at the 0.05 significance level is 8. Since the computed value of T is higher, one cannot reject the null hypothesis, and so one would conclude that there is not enough evidence to suggest that one of the models is more accurate than the other at the 0.05 significance level. Note that if a significance level of 0.10 were selected, the null hypothesis would have been rejected.

Looking at the ratings shown in Table 4.9, one notices that these seem to be generally higher for model A than for model B. In case one wishes to test the hypothesis, at a significance level of 0.05, that researchers deem model B to be less accurate than model A, one would have used T- as the test statistic and compared it to the critical value in the one-tailed column of Table A.12. Since the critical value is 11 for n = 10, which is greater than 9, one would reject the null hypothesis. This example illustrates the fact that it is important to frame the problem correctly in terms of whether a one-tailed or a two-tailed test is more appropriate. ■
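The test statistics of Examples 4.5.1 to 4.5.3 can be reproduced with the Python standard library alone. The sketch below is an added illustration, not the authors' code; the helper names `binomial_pmf` and `average_ranks` are assumptions introduced here. It computes only the statistics, and the comparison against the cutoff values of Tables A.11 and A.12 proceeds as described in the text.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

def average_ranks(values):
    """Rank values in ascending order; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1          # positions are 0-based, ranks 1-based
        for idx in order[start:end + 1]:
            ranks[idx] = mean_rank
        start = end + 1
    return ranks

# Example 4.5.1 (sign test): chance of two or fewer "+" signs in seven events
prob_two_or_fewer = sum(binomial_pmf(x, 7, 0.5) for x in range(3))

# Example 4.5.2 (rank sum test): rank the combined sample, then sum by group
oil = [3.5, 5.2, 2.5, 5.6, 2.0, 3.0]
academics = [4.7, 5.8, 3.6, 6.2, 6.1, 6.3, 6.5]
combined_ranks = average_ranks(oil + academics)
t_a = sum(combined_ranks[:len(oil)])
t_b = sum(combined_ranks[len(oil):])

# Example 4.5.3 (signed rank sum test): rank absolute differences, sum by sign
model_a = [6, 8, 4, 9, 4, 7, 6, 5, 6, 8]
model_b = [4, 5, 5, 8, 1, 9, 2, 3, 7, 2]
diffs = [a - b for a, b in zip(model_a, model_b)]
abs_ranks = average_ranks([abs(d) for d in diffs])
t_plus = sum(r for r, d in zip(abs_ranks, diffs) if d > 0)
t_minus = sum(r for r, d in zip(abs_ranks, diffs) if d < 0)

print(f"Sign test: P(two or fewer successes) = {prob_two_or_fewer:.3f}")
print(f"Rank sum test: TA = {t_a}, TB = {t_b}")
print(f"Signed rank test: T+ = {t_plus}, T- = {t_minus}, T = {min(t_plus, t_minus)}")
```

The tie handling in `average_ranks` mirrors the convention used in Table 4.9, where repeated absolute differences receive the mean of the ranks they would jointly occupy. The sketch assumes that no paired difference is exactly zero, which holds for these data.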
4.5.2
Kruskal–Wallis Multiple Samples Test for Medians
Recall that the single-factor ANOVA test was described in Sect. 4.3.1 for inferring whether mean values from several samples emanate from the same population or not, with the necessary assumption of normal distributions. The Kruskal– Wallis H test (Kruskal and Wallis, 1952) is the nonparametric or distribution-free equivalent of the F-test used in one-factor ANOVA but applies to the medians. It can also be taken to be the extension or generalization of the rank-sum test to more than two groups. Hence, the test applies to the case when one wishes to compare more than two groups which should be symmetrical but may not be normally distributed. Again, the evaluation is based on the rank sums where the ranking is made based on samples of all k groups combined. The test is framed as follows: Null hypothesis H 0 : All populations have identical probability distributions Alternative hypothesis H a : Probability distributions of at least two populations are different Let R1, R2, R3 denote the rank sums of, say, three samples. The H-test statistic measures the extent to which the three samples differ with respect to their relative ranks, and is given by:
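Table 4.9 Wilcoxon signed rank sum test calculation for paired non-independent samples (Example 4.5.3)

| Expert | Model A | Model B | Difference (A - B) | Absolute difference | Rank |
|---|---|---|---|---|---|
| 1 | 6 | 4 | 2 | 2 | 5 |
| 2 | 8 | 5 | 3 | 3 | 7.5 |
| 3 | 4 | 5 | -1 | 1 | 2 |
| 4 | 9 | 8 | 1 | 1 | 2 |
| 5 | 4 | 1 | 3 | 3 | 7.5 |
| 6 | 7 | 9 | -2 | 2 | 5 |
| 7 | 6 | 2 | 4 | 4 | 9 |
| 8 | 5 | 3 | 2 | 2 | 5 |
| 9 | 6 | 7 | -1 | 1 | 2 |
| 10 | 8 | 2 | 6 | 6 | 10 |
| | | | | Sum of positive ranks T+ | = 46 |
| | | | | Sum of negative ranks T- | = 9 |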
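Table 4.10 Data table for Example 4.5.4ᵃ

| | Agriculture # Employees | Rank | Manufacturing # Employees | Rank | Service # Employees | Rank |
|---|---|---|---|---|---|---|
| 1 | 10 | 5 | 244 | 25 | 17 | 9.5 |
| 2 | 350 | 27 | 93 | 19 | 249 | 26 |
| 3 | 4 | 2 | 3532 | 30 | 38 | 15 |
| 4 | 26 | 13 | 17 | 9.5 | 5 | 3 |
| 5 | 15 | 8 | 526 | 29 | 101 | 20 |
| 6 | 106 | 21 | 133 | 22 | 1 | 1 |
| 7 | 18 | 11 | 14 | 7 | 12 | 6 |
| 8 | 23 | 12 | 192 | 23 | 233 | 24 |
| 9 | 62 | 17 | 443 | 28 | 31 | 14 |
| 10 | 8 | 4 | 69 | 18 | 39 | 16 |
| | | R1 = 120 | | R2 = 210.5 | | R3 = 134.5 |

ᵃ Data available electronically on book website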
$$ H = \frac{12}{n(n+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(n+1) \tag{4.36} $$

where k is the number of groups, nj is the number of observations in the jth sample and n is the total sample size (n = n1 + n2 + . . . + nk). The number 12 occurs naturally from the expression for the sample variance of the ranks of the outcomes (for the mathematical derivation see Kruskal and Wallis, 1952). Thus, if the H-statistic is close to zero, one would conclude that all groups have the same mean rank, and vice versa. The distribution of the H-statistic is approximated by the chi-square distribution, which is used to make statistical inferences. The following example illustrates the approach.

Example 4.5.4¹² Evaluating probability distributions of number of employees in three different occupations using a non-parametric test
One wishes to compare, at a significance level of 0.05, the number of employees in companies representing each of three different business classifications, namely agriculture, manufacturing, and service. Samples from ten companies each were gathered which are shown in Table 4.10. Since the distributions are unlikely to be normal (for example, one detects some large numbers in the first and third columns), a nonparametric test is appropriate. First, the individual ranks for all samples from the three classes combined are generated as shown tabulated under the 2nd, 4th, and 6th columns. The values of the sums Rj are also computed and shown in the last row. Note that n = 30, while nj = 10. The test statistic H is computed next:

$$ H = \frac{12}{30(31)} \left[ \frac{120^2}{10} + \frac{210.5^2}{10} + \frac{134.5^2}{10} \right] - 3(31) = 99.097 - 93 = 6.097 $$
The degrees of freedom are the number of groups minus one, or d.f. = 3 - 1 = 2. From the chi-square tables (Table A.5), the critical value at α = 0.05 is 5.991. Since the computed H value exceeds this threshold, one would reject the null hypothesis at 95% CL, and conclude that at least two of the three probability distributions describing the number of employees in the sectors are different. However, the verdict is marginal since the computed H statistic is close to the critical value. It would be wise to consider the practical implications of the statistical inference test and perform a decision analysis study. ■
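The calculation of Example 4.5.4 can be retraced programmatically. The sketch below uses only the Python standard library and follows Eq. (4.36) as written, i.e., without the tie correction that some statistical packages apply; the function name `kruskal_wallis_h` is illustrative and not from the text.

```python
def kruskal_wallis_h(groups):
    """Return (H, rank sums) following Eq. (4.36); no tie correction is applied."""
    pooled = sorted(v for g in groups for v in g)
    # Mean 1-based rank of each distinct value in the pooled, sorted sample
    rank_of = {}
    for v in set(pooled):
        first = pooled.index(v)   # 0-based position of the first occurrence
        count = pooled.count(v)   # number of tied observations
        rank_of[v] = first + (count + 1) / 2
    n = len(pooled)
    rank_sums = [sum(rank_of[v] for v in g) for g in groups]
    h = (12 / (n * (n + 1))
         * sum(r ** 2 / len(g) for r, g in zip(rank_sums, groups))
         - 3 * (n + 1))
    return h, rank_sums


# Number of employees from Table 4.10
agriculture = [10, 350, 4, 26, 15, 106, 18, 23, 62, 8]
manufacturing = [244, 93, 3532, 17, 526, 133, 14, 192, 443, 69]
service = [17, 249, 38, 5, 101, 1, 12, 233, 31, 39]

h_stat, rank_sums = kruskal_wallis_h([agriculture, manufacturing, service])
chi2_critical = 5.991  # alpha = 0.05, d.f. = 2, value quoted in the text from Table A.5
print(rank_sums, round(h_stat, 3), h_stat > chi2_critical)
```

Because only one pair of observations is tied (the two values of 17), a tie-corrected statistic would differ only marginally from the value obtained here.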
4.5.3
Test on Spearman Rank Correlation Coefficient
¹² From McClave and Benson (1988), by permission of Pearson Education.

The Pearson correlation coefficient (Sect. 3.4.2) was a parametric measure meant to quantify the correlation between two quantifiable variables. The Spearman rank correlation coefficient rSp is similar in definition to the Pearson correlation coefficient but uses the relative ranks of the data instead of the numerical values themselves. The same equation as Eq. (3.12) can be used to compute this measure, with its magnitude and sign interpreted in the same fashion. However, a simpler formula is often used to calculate the Spearman rank correlation coefficient (McClave and Benson 1988):

$$ r_{Sp} = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \tag{4.37} $$
where n is the number of paired measurements, and the difference between the ranks for the ith measurement for ranked variables u and v is di = ui - vi.

Example 4.5.5 Non-parametric testing of correlation between the sizes of faculty research grants and teaching evaluations
The provost of a major university wants to determine whether a statistically significant correlation exists between the research grants and teaching evaluation rating of its senior faculty. Data over 3 years have been collected and assembled in Table 4.11, which also shows the manner in which ranks have been generated and the quantities di = (ui - vi) computed. Using Eq. (4.37) with n = 10, the Spearman rank correlation coefficient is:

$$ r_{Sp} = 1 - \frac{6(260)}{10(100 - 1)} = -0.576 $$

Table 4.11 Data table for Example 4.5.5 showing how to conduct the non-parametric correlation test

| Faculty | Research grants ($) | Teaching evaluation (out of 10) | Research rank (ui) | Teaching rank (vi) | Difference (di) | Diff squared (di²) |
|---|---|---|---|---|---|---|
| 1 | 1,480,000 | 7.05 | 5 | 7 | -2 | 4 |
| 2 | 890,000 | 7.87 | 1 | 8 | -7 | 49 |
| 3 | 3,360,000 | 3.90 | 10 | 2 | 8 | 64 |
| 4 | 2,210,000 | 5.41 | 8 | 5 | 3 | 9 |
| 5 | 1,820,000 | 9.02 | 7 | 9 | -2 | 4 |
| 6 | 1,370,000 | 6.07 | 4 | 6 | -2 | 4 |
| 7 | 3,180,000 | 3.20 | 9 | 1 | 8 | 64 |
| 8 | 930,000 | 5.25 | 2 | 4 | -2 | 4 |
| 9 | 1,270,000 | 9.50 | 3 | 10 | -7 | 49 |
| 10 | 1,610,000 | 4.45 | 6 | 3 | 3 | 9 |
| | | | | | Total | 260 |

Thus, one notes that there exists a negative correlation between the sample data. However, whether this is significant at the population level requires that a statistical test be performed for the correlation coefficient rSp:
Null hypothesis H0: rSp = 0 (there is no significant population correlation)
Alternative hypothesis Ha: rSp ≠ 0 (there is significant population correlation)
Table A.10 in Appendix A gives the absolute cutoff values for different significance levels of the Spearman rank correlation. For n = 10, the one-tailed absolute critical value for α = 0.05 is rSp,α = 0.564. Since the absolute value of the computed coefficient (0.576) exceeds this cutoff, there is a negative correlation between research grants and teaching evaluations which differs statistically from 0 at a significance level of 0.05 (albeit barely). It is interesting to point out that had a parametric analysis been undertaken, the corresponding Pearson correlation coefficient (Sect. 3.4.2) would have been r = -0.620 and deemed significant at α = 0.05 (the critical value from Table A.7 is 0.549). The correlation coefficients by both methods are quite close (-0.576 and -0.620) with the parametric method indicating stronger correlation. However, non-parametric tests are distribution-free and, in that sense, are more robust. It is advisable, as far as possible, to perform both types of tests and then draw conclusions. ■

How the confidence intervals widen as n decreases has been previously discussed in Sect. 4.2.7 for the Pearson correlation coefficient. The number of data points n also has a large effect on whether the Spearman correlation coefficient rSp determined from a data set is significant or not (Wolberg, 2006). For values of n greater than about 10, the random variable z defined below can be used for the significance test (assuming a Gaussian distribution):

$$ z = r_{Sp}\,\sqrt{n - 1} \tag{4.38} $$
From Table A.3 for a one-tailed distribution, the critical value is zα = 1.645 for a 5% significance level. From Eq. (4.38), for a sample size of n = 101, the critical value is rSp,α = 1.645/sqrt (101 - 1) = 0.1645. However, for a sample size of n = 10, the critical value is rSp,α = 1.645/sqrt (10 - 1) = 0.548, which is 3.3 times greater than the previous estimate! This simple example serves to illustrate the importance of the number of data points on the significance test of a sample correlation coefficient.
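The Spearman calculation of Example 4.5.5 and the critical values derived from the Gaussian approximation can be sketched as follows. This is a minimal illustration using the Python standard library; the helper names are hypothetical, the ranking helper assumes there are no tied values (true for Table 4.11, and the case for which the shortcut formula of Eq. (4.37) is exact), and the critical value is obtained by rearranging the relation used in the text as rSp,α = zα/sqrt(n - 1).

```python
import math


def simple_ranks(values):
    """1-based ranks, smallest value first. Assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(x) + 1 for x in values]


def spearman_from_ranks(u, v):
    """Spearman rank correlation from the shortcut formula, Eq. (4.37)."""
    n = len(u)
    sum_d2 = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


def critical_r(n, z_alpha=1.645):
    """Approximate critical |r_Sp| from the Gaussian approximation (n > ~10)."""
    return z_alpha / math.sqrt(n - 1)


# Data from Table 4.11
grants = [1_480_000, 890_000, 3_360_000, 2_210_000, 1_820_000,
          1_370_000, 3_180_000, 930_000, 1_270_000, 1_610_000]
teaching = [7.05, 7.87, 3.90, 5.41, 9.02, 6.07, 3.20, 5.25, 9.50, 4.45]

r_sp = spearman_from_ranks(simple_ranks(grants), simple_ranks(teaching))
print(round(r_sp, 3))                                        # -0.576
print(round(critical_r(101), 4), round(critical_r(10), 3))   # 0.1645 0.548
```

For n = 10 the approximate cutoff (0.548) is close to, but not identical with, the tabulated value of 0.564 quoted in the example, which is consistent with the approximation being intended for n greater than about 10.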
4.6
Bayesian Inferences
4.6.1
Background
Bayes’ theorem and how it can be used for probability-related problems have been treated in Sect. 2.5. Its strength lies in the fact that it provides a framework for including prior information in a two-stage (or multi-stage) experiment whereby one could draw stronger conclusions than one could with observational data alone. It is especially advantageous for small data sets, and it was shown that its predictions converge with those of the classical method for two cases: (i) as the data set of observations gets larger; and (ii) if the prior distribution is modeled as a uniform distribution. It was pointed out that advocates of the Bayesian approach view probability as a degree of belief held by a person about an uncertain issue, as compared to the objective view of long-run relative frequency held by traditionalists. This section will discuss how the Bayesian approach can be used to make statistical inferences from samples about an uncertain population parameter, and to address hypothesis testing problems.
4.6.2
Estimating Population Parameter from a Sample
Consider the case when the population mean μ is to be estimated (point and interval estimates) from the sample mean x̄ with the population distribution assumed to be Gaussian with a known standard deviation σ. This case is given by the sampling distribution of the mean x̄ treated in Sect. 4.2.1. The probability P of a two-tailed distribution at significance level α can be expressed as:

$$ P\left( \bar{x} - z_{\alpha/2}\,\frac{\sigma}{n^{1/2}} < \mu < \bar{x} + z_{\alpha/2}\,\frac{\sigma}{n^{1/2}} \right) = 1 - \alpha \tag{4.39} $$
where n is the sample size and z is the value from the standard normal tables. The traditional or frequentist interpretation is that one can be (1 - α) confident that the above interval contains the true population mean (see Sect. 4.2.1c). However, the interval itself should not be interpreted as a probability interval for the parameter. The Bayesian approach uses the same formula, but the mean is modified since the posterior distribution is now used which includes the sample data as well as the prior belief. The confidence interval is referred to as the credible interval (also, referred to as the Bayesian confidence interval). The Bayesian interpretation is that the value of the mean is fixed but has been chosen from some known (or assumed) prior probability distribution. The data collected allows one to recalculate the probability of different values of the mean (i.e., the posterior probability) from which the (1 - α)
credible interval can be surmised. Thus, the traditional approach leads to a probability statement about the interval, while the Bayesian approach leads to one about the population mean parameter (Phillips 1973). The credible interval is usually narrower than the traditional confidence interval. The relevant procedure to calculate the credible intervals for the case of a Gaussian population and a Gaussian prior is presented without proof below (Wonnacutt and Wonnacutt 1985). Let the prior distribution, assumed normal, be characterized by a mean μ0 and variance σ0², while the sample mean and standard deviation values are x̄ and sx. Selecting a prior distribution is equivalent to having a quasi-sample whose size n0 is given by:

$$ n_0 = \frac{s_x^2}{\sigma_0^2} \tag{4.40} $$
The posterior mean and standard deviation μ* and σ* are then given by:

$$ \mu^{*} = \frac{n_0 \mu_0 + n \bar{x}}{n_0 + n} \quad \text{and} \quad \sigma^{*} = \frac{s_x}{(n_0 + n)^{1/2}} \tag{4.41} $$
Note that the expression for the posterior mean is simply the weighted average of the sample and the prior mean and is likely to be less biased than the sample mean alone. Similarly, the standard deviation is divided by the square root of the total sample size (quasi-sample plus actual sample), which results in increased precision. However, had a different prior rather than the normal distribution been assumed above, a slightly different interval would have resulted, which is another reason why traditional statisticians (so-called frequentists) are uneasy about fully endorsing the Bayesian approach.

Example 4.6.1 Comparison of classical and Bayesian confidence intervals
A certain solar PV module is rated at 60 W with a standard deviation of 2 W. Since the rating varies somewhat from one shipment to the next, a sample of 12 modules has been selected from a shipment and tested to yield a mean of 65 W and a standard deviation of 2.8 W. Assuming a Gaussian distribution, determine the two-tailed 95% CI by both the traditional and the Bayesian approaches.
(a) Traditional approach

$$ \mu = \bar{x} \pm 1.96\,\frac{s_x}{n^{1/2}} = 65 \pm 1.96\,\frac{2.8}{12^{1/2}} = 65 \pm 1.58 $$

Note that a value of 1.96 is used from the z tables even though the sample is small since the distribution is assumed to be Gaussian.
(b) Bayesian approach. Using Eq. (4.40) to calculate the quasi-sample size inherent in the prior:

$$ n_0 = \frac{2.8^2}{2^2} = 1.96 \simeq 2.0 $$

i.e., the prior is equivalent to information from testing an additional 2 modules. Next, Eq. (4.41) is used to determine the posterior mean and standard deviation:

$$ \mu^{*} = \frac{2(60) + 12(65)}{2 + 12} = 64.29 \quad \text{and} \quad \sigma^{*} = \frac{2.8}{(2 + 12)^{1/2}} = 0.748 $$

The Bayesian 95% CI is then:

$$ \mu = \mu^{*} \pm 1.96\,\sigma^{*} = 64.29 \pm 1.96(0.748) = 64.29 \pm 1.47 $$

Since prior information has been used, the Bayesian interval is likely to be better centered and more precise (with a narrower interval) than the traditional or classical interval.
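The two interval calculations of Example 4.6.1 can be wrapped in small helper functions so that the effect of the prior is easy to explore. The following is a sketch under the same assumptions as the example (Gaussian population, Gaussian prior, z = 1.96); the function names are illustrative and not from the text. Unlike the hand calculation, the quasi-sample size n0 = 1.96 is not rounded to 2, so the results differ slightly in the second decimal.

```python
import math


def classical_interval(xbar, s, n, z=1.96):
    """Traditional two-tailed interval: xbar +/- z * s / sqrt(n)."""
    half_width = z * s / math.sqrt(n)
    return xbar - half_width, xbar + half_width


def credible_interval(mu0, sigma0, xbar, s, n, z=1.96):
    """Bayesian credible interval for a Gaussian prior, Eqs. (4.40) and (4.41)."""
    n0 = s ** 2 / sigma0 ** 2                    # quasi-sample size of the prior
    mu_post = (n0 * mu0 + n * xbar) / (n0 + n)   # weighted average of prior and sample means
    sigma_post = s / math.sqrt(n0 + n)           # posterior standard deviation
    half_width = z * sigma_post
    return mu_post - half_width, mu_post + half_width


# Example 4.6.1: prior 60 W +/- 2 W; sample of 12 modules with mean 65 W, std. dev. 2.8 W
classical = classical_interval(xbar=65.0, s=2.8, n=12)
bayesian = credible_interval(mu0=60.0, sigma0=2.0, xbar=65.0, s=2.8, n=12)

print([round(v, 2) for v in classical])  # about 65 +/- 1.58
print([round(v, 2) for v in bayesian])   # about 64.30 +/- 1.47
```

Rerunning `credible_interval` with a larger `sigma0` (a vaguer prior) shrinks n0 and moves the Bayesian interval toward the classical one, which is the convergence behavior mentioned in Sect. 4.6.1.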
4.6.3
Hypothesis Testing
Section 4.2 dealt with the traditional approach to hypothesis testing where one frames the problem in terms of two competing claims. The application areas discussed involved testing for single sample mean, testing for two sample means and paired differences, testing for single and two sample variances, testing for distributions, and testing on the Pearson correlation coefficient. In all these cases, one proceeds by defining two hypotheses:
• The null hypothesis (H0), which represents the status quo, i.e., that the hypothesis will be accepted unless the data provides convincing evidence of the contrary.
• The research or alternative hypothesis (Ha), which is the premise that the variation observed in the data sample cannot be ascribed to random variability or chance alone, and that there must be some inherent structural or fundamental cause.
Thus, the traditional or frequentist approach is to divide the sample space into an acceptance region and a rejection region and posit that the null hypothesis can be rejected only if the probability of the test statistic lying in the rejection region can be ascribed to chance or randomness at the preselected significance level α. Advocates of the Bayesian
approach have several objections to this line of thinking (Phillips 1973):
(i) The null hypothesis is rarely of much interest. The precise specification of, say, the population mean is of limited value; rather, ascertaining a range would be more useful.
(ii) The null hypothesis is only one of many possible values of the uncertain variable, and undue importance being placed on this value is unjustified.
(iii) As additional data are collected, the inherent randomness in the collection process would lead to the null hypothesis being rejected in most cases.
(iv) Erroneous inferences from a sample may result if prior knowledge is not considered.
The Bayesian approach to hypothesis testing is not to base the conclusions on a traditional significance level like α < 0.05. Instead it makes use of the posterior credible interval for the population mean μ of the sample collected against a prior mean value μ0. The procedure is described in texts such as Bolstad (2004) and illustrated in the following example.

Example 4.6.2 Traditional and Bayesian approaches to determining confidence intervals
The life of a certain type of smoke detector battery is specified as having a mean of 32 months and a standard deviation of 0.5 months. A building owner decides to test this claim at a significance level of 0.05. He tests a sample of 9 batteries and finds a mean of 31 months and a sample standard deviation of 1 month. Note that this is a one-sided hypothesis test case.
(a) The traditional approach would entail testing H0: μ = 32 versus Ha: μ < 32. The Student-t value is:

$$ t = \frac{31 - 32}{1/\sqrt{9}} = -3.0 $$

From Table A.4, the critical value for d.f. = 8 is t0.05 = -1.86. Thus, he can reject the null hypothesis, and state that the claim of the manufacturer is incorrect.
(b) The Bayesian approach, on the other hand, would require calculating the posterior probability of the null hypothesis. The prior distribution has a mean μ0 = 32 and variance σ0² = 0.5² = 0.25. First, use Eq. (4.40) and determine n0 = 1²/0.5² = 4, i.e., the prior information is "equivalent" to increasing the sample size by 4. Next, use Eq. (4.41) to determine the posterior mean and standard deviation:

$$ \mu^{*} = \frac{4(32) + 9(31)}{4 + 9} = 31.3 \quad \text{and} \quad \sigma^{*} = \frac{1.0}{(4 + 9)^{1/2}} = 0.277 $$

From here:

$$ t = \frac{31.3 - 32.0}{0.277} = -2.53 $$

This t-value lies beyond the critical value t0.05 = -1.86. The building owner can reject the null hypothesis. In this case, both approaches gave the same result, but this is not always true. ■
4.7
Some Considerations About Sampling
4.7.1
Random and Non-Random Sampling Methods
A sample is a limited portion, or a finite number of items/ elements/members drawn from a larger entity called population of which information and characteristic traits are sought. Point and interval estimation as well as notions of inferential statistics covered in the previous sections involved the use of samples drawn from some underlying population. The premise was that finite samples would reduce the expense associated with the estimation, while the associated uncertainty which would consequently creep into the estimation process could be estimated and managed. It is quite clear that the sample drawn must be representative of the population, and that the sample size should be such that the uncertainty is within certain preset bounds. However, there are different ways by which one could draw samples; this aspect falls under the purview of sampling design. Since these methods have different implications, they are discussed in this section. There are three general rules of sampling design: (i) The more representative the sample of the population, the better the results. (ii) All else being equal, larger samples yield better results, i.e., more precise estimates with narrower uncertainty bands. (iii) Larger samples cannot compensate for a poor sampling design plan or a poorly executed plan. Some of the common sampling methods are described below: (a) Random sampling (also called simple random sampling) is the simplest conceptually and is most widely used. It involves selecting the sample of n elements in such a way that all possible samples of n elements have the same chance of being selected. Two important strategies of random sampling involve: (i) Sampling with replacement, in which the object selected is put back into the population pool and
has the possibility to be selected again in subsequent picks, and (ii) Sampling without replacement, where the object picked is not put back into the population pool prior to picking the next item. Random sampling without replacement of n objects from a population of size N could be practically implemented in one of several ways. The most common is to order the objects of the population (e.g., 1, 2, 3, . . ., N), use a random number generator to generate n numbers from 1 to N without replication, and pick only the objects whose numbers have been generated. This approach is illustrated by means of the following example. A consumer group wishes to select a sample of 5 cars from a lot of 500 cars for crash testing. It assigns integers from 1 to 500 to every car on the lot, uses a random number generator to select a set of 5 integers, and then selects the 5 cars corresponding to the 5 integers picked randomly. Dealing with random samples has several advantages: (i) any random sub-sample of a random sample or its complement is also a random sample; (ii) after a random sample has been selected, any random sample from its complement can be added to it to form a larger random sample.
(b) Non-random sampling occurs when the selection of members from the population is done according to some method or pre-set process which is not random. Often it occurs unintentionally or unwittingly with the experimenter thinking that he is dealing with random samples while he is not. In such cases, bias or skewness is introduced, and one obtains misleading confidence limits which may lead to erroneous inferences depending on the degree of non-randomness in the data set. However, in some cases, the experimenter intentionally selects the samples in a non-random manner and analyzes the data accordingly. This can result in the required conclusions being reached with reduced sample sizes, thereby saving resources.
There are different types of nonrandom sampling (ASTM E 1402 1996), and some of the important ones are listed below:
(i) Stratified sampling, in which the target population is such that it is amenable to partitioning into disjoint subsets or strata based on some criterion. Samples are selected independently from each stratum, possibly of different sizes. This improves efficiency of the sampling process in some instances and is discussed at more length in Sect. 4.7.4.
(ii) Cluster sampling, in which naturally occurring strata or clusters are first selected, then random sampling is done to identify a subset of clusters, and finally all the elements in the picked clusters are selected for analysis. For example, a state can be divided into districts and then into municipalities for final sample selection. This approach is used often in marketing research.
(iii) Sequential sampling is a quality control procedure where a decision on the acceptability of a batch of products is made from tests done on a sample of the batch. Tests are done on a preliminary sample, and depending on the results, either the batch is accepted, or further sampling tests are performed. This procedure usually requires, on average, fewer samples to be tested to meet a pre-stipulated accuracy.
(iv) Composite sampling, where elements from different samples drawn over a designated time period are combined together. An example is mixing water samples drawn hourly to form a composite sample over a day.
(v) Multistage or nested sampling, which involves selecting a sample in stages. A larger sample is first selected, and then subsequently smaller ones. For example, for testing indoor air quality in a population of office buildings, the design could involve selecting individual buildings during the first stage of sampling, choosing specific floors of the selected buildings in the second stage of sampling, and finally, selecting specific rooms in the floors chosen to be tested during the third and final stage.
(vi) Convenience sampling, also called opportunity sampling, is a method of choosing samples arbitrarily following the manner in which they are acquired. If the situation is such that a planned experimental design cannot be followed, the analyst must make do with the samples collected in this manner. Though impossible to treat rigorously, it is commonly encountered in many practical situations.

4.7.2
Desirable Properties of Estimators
The parameters from a sample are random variables since different sets of samples will result in different values of the parameters. Recall the definition of two seemingly analogous, but distinct, terms: estimators are mathematical expressions to be applied to sample data which yield random variables, while an estimate is a specific number or value of this random variable. Commonly encountered estimators are the mean, median, standard deviation, etc. Since the search for estimators is the crux of the parameter estimation process, certain basic notions and desirable properties of estimators need to be explicitly recognized (a good discussion is provided by Pindyck and Rubinfeld 1981). Many of these concepts are logical extensions of the concepts applicable to errors and also apply to regression models treated in Chap. 5. For example, consider the case where inferences about the population mean parameter μ are to be made from the sample mean estimator x̄.

Fig. 4.14 Concept of biased and unbiased estimators (probability distributions of an unbiased and a biased estimator relative to the actual value)
Fig. 4.15 Concept of efficiency of estimators (probability distributions of an efficient and an inefficient estimator relative to the actual value)

(a) Lack of bias: A very desirable property is for the distribution of the estimator to have the parameter as its mean value (see Fig. 4.14). Then, if the experiment were repeated many times, one would at least be assured that one would be right on average. In such a case, the bias E(x̄ - μ) = 0, where E represents the expected value. An example of bias in the estimator is when (n) is used rather than (n - 1) while calculating the standard deviation following Eq. (3.8).
(b) Efficiency: Lack of bias provides no indication regarding the variability. Efficiency is a measure of how small the dispersion can possibly get. The value of the mean x̄ is said to be an efficient unbiased estimator if, for a given sample size, the variance of x̄ is smaller than the variance of any other unbiased estimator (see Fig. 4.15) and is the smallest limiting variance that can be achieved. More often a relative order of merit, called the relative efficiency, is used, which is defined as the ratio of both variances. Efficiency is desirable since the greater the efficiency associated with an estimation process, the stronger the statistical or inferential statements one can make about the estimated parameters. Consider the following example (Wonnacutt and Wonnacutt 1985). If a population being sampled is
symmetric, its center can be estimated without bias by either the sample mean x̄ or its median x̃. For some populations x̄ is more efficient; for others x̃ is more efficient. In case of a normal parent distribution, the standard error of the median is SE(x̃) = 1.25σ/√n. Since SE(x̄) = σ/√n, the efficiency of x̄ relative to x̃ is:

$$ \text{Efficiency} = \frac{\operatorname{var}(\tilde{x})}{\operatorname{var}(\bar{x})} = 1.25^2 = 1.56 \tag{4.42} $$

Fig. 4.16 Concept of mean square error which includes bias and efficiency of estimators (probability distributions of an unbiased estimator and a minimum mean square error estimator relative to the actual value)
(c) Mean square error: There are many circumstances in which one is forced to trade off bias and variance of estimators. When the goal of a model is to maximize the precision of predictions, for example, an estimator with very low variance and some bias may be more desirable than an unbiased estimator with high variance (see Fig. 4.16). One criterion which is useful in this regard is the goal of minimizing mean square error (MSE), defined as:

$$ \text{MSE}(\bar{x}) = E(\bar{x} - \mu)^2 = [\text{Bias}(\bar{x})]^2 + \operatorname{var}(\bar{x}) \tag{4.43} $$

where E(x̄) is the expected value of x̄. Thus, when x̄ is unbiased, the mean square error and variance of the estimator x̄ are equal. MSE may be regarded as a generalization of the variance concept. This leads to the generalized definition of the relative efficiency of two estimators, whether biased or unbiased: "efficiency is the ratio of both MSE values."
(d) Consistency: Consider the properties of estimators as the sample size increases. In such cases, one would like the estimator x̄ to converge to the true value, or stated differently, the probability limit of x̄ (plim x̄) should equal μ as sample size n approaches infinity (see Fig. 4.17). This leads to the criterion of consistency: x̄ is a consistent estimator of μ if plim (x̄) = μ. In other words, as the sample size grows larger, a consistent estimator would tend to approximate the true parameters, i.e., the mean square error of the estimator approaches zero. Thus, one of the conditions that make an estimator consistent is that both its bias and variance approach zero in the limit. However, it does not necessarily follow that an unbiased estimator is a consistent estimator. Although consistency is an abstract concept, it often provides a useful preliminary criterion for sorting out estimators. Generally, one tends to be more concerned with consistency than with lack of bias as an estimation criterion. A biased yet consistent estimator may not equal the true parameter on average but will approximate the true parameter as the sample information grows larger. This is more reassuring practically than the alternative of finding a parameter estimate which is unbiased initially yet will continue to deviate substantially from the true parameter as the sample size gets larger. However, to finally settle on the best estimator, the efficiency is a more powerful criterion. As discussed earlier, the sample mean is preferable to the median for estimating the center of a normal population because the former is more efficient though both estimators are clearly consistent and unbiased.

Fig. 4.17 A consistent estimator is one whose distribution becomes gradually peaked as the sample size n is increased (probability distributions for increasing sample size n, centered on the true value)
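The relative efficiency of the sample mean and the median discussed under (b) above can be checked with a small Monte Carlo experiment. The sketch below is illustrative only: the sample size, number of replications, and seed are arbitrary assumptions, and the simulated ratio will scatter around the theoretical value of about 1.56 rather than reproduce it exactly.

```python
import random
import statistics


def relative_efficiency(n=101, replications=2000, seed=1):
    """Estimate var(median) / var(mean) for samples drawn from a standard normal."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    means, medians = [], []
    for _ in range(replications):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
        medians.append(statistics.median(sample))
    # Both estimators are unbiased here, so the variance ratio is also the MSE ratio
    return statistics.variance(medians) / statistics.variance(means)


ratio = relative_efficiency()
print(f"var(median) / var(mean) = {ratio:.2f}")
```

Increasing `replications` reduces the scatter of the estimated ratio. Replacing the normal generator with a heavy-tailed distribution is one way to explore the statement that, for some populations, the median is the more efficient estimator.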
4.7.3
Determining Sample Size During Random Surveys
Population census, market surveys, pharmaceutical field trials, etc. are examples of survey sampling. These can be done in one of two ways which are discussed in this section and in the next. The discussion and equations presented in the previous sub-sections pertain to random sampling. Survey sampling frames the problem using certain terms slightly
different from those presented above. Here, a major issue is to determine the sample size which can meet a certain pre-stipulated precision at predefined confidence levels. The estimates from the sample should be close enough to the population characteristic to be useful for drawing conclusions and taking subsequent decisions. One generally assumes the underlying probability distribution to be normal (this may not be correct since lognormal distributions are also encountered often). Let RE be the relative error (often referred to as the margin of error) of the population mean μ at a confidence level (1 - α) where α is the significance level. For a two-tailed distribution, it is defined as:

$$ RE_{1-\alpha} = z_{\alpha/2}\,\frac{SE(\bar{x})}{\mu} \tag{4.44} $$
where SE(x̄) is the standard error of the sample mean given by Eq. (4.3). A measure of variability in the population needs to be introduced, and this is done through the coefficient of variation (CV) defined as:¹³

$$ CV = \frac{\text{std. dev.}}{\text{true mean}} = \frac{s_x}{\mu} \tag{4.45} $$
where sx is the sample standard deviation (if the population standard deviation is known, it is better to use that value). From a practical point of view, one should ascertain that sx is lower than the maximum value sx,max at a confidence level (1 - α) given by:

$$ s_{x,\max,1-\alpha} = z_{\alpha/2} \cdot CV_{1-\alpha} \cdot \mu \tag{4.46} $$
Let N be the population size. One could deduce the required sample size n from the above equation to reach the target RE1-α. Replacing (N - 1) by N in Eq. (4.3), which is the expression for the standard error of the mean for small samples without replacement, results in:

$$ SE(\bar{x})^2 = \frac{s_x^2}{n}\left(\frac{N - n}{N}\right) = \frac{s_x^2}{n} - \frac{s_x^2}{N} \tag{4.47} $$
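The required sample size n is found by rearranging terms and using Eqs. (4.44) and (4.46):

$$ n = \frac{1}{\dfrac{SE(\bar{x})^2}{s_x^2} + \dfrac{1}{N}} = \frac{1}{\left(\dfrac{RE_{1-\alpha}}{z_{\alpha/2}\,CV_{1-\alpha}}\right)^2 + \dfrac{1}{N}} \tag{4.48} $$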
This is the functional form normally used in survey sampling to determine sample size provided some prior estimates of the population mean and standard deviation are known. In summary, sample sizes relative to the population are determined from three considerations: the margin of error, the confidence level, and the expected variability.

¹³ Note that this definition is slightly different from that of CV defined by Eq. (3.9a) since population mean rather than sample mean is used.

Example 4.7.1 Determination of random sample size needed to verify peak reduction in residences at preset confidence levels
An electric utility has provided financial incentives to many customers to replace their existing air-conditioners with high efficiency ones. This rebate program was initiated to reduce the aggregated electric peak during hot summer afternoons which is dangerously close to the peak generation capacity of the utility. The utility analyst would like to determine the sample size necessary to assess whether the program has reduced the peak as projected such that the relative error RE ≤ 10% at 95% CL. The following information is given:
The total number of customers: N = 20,000
Estimate of the mean peak saving: μ = 2 kW (from engineering calculations)
Estimate of the standard deviation: sx = 1 kW (from engineering calculations)
This is a one-tailed distribution problem with 95% CL. Then, from Table A.4, z0.05 = 1.645 for large sample sizes. Inserting values of RE = 0.1 and CV = sμx = 12 = 0:5 in Eq. (4.48), the required sample size is: n=
1 0:1 ð1:645Þð0:5Þ
2
1 þ 20,000
= 67:4 ≈ 70
It would be advisable to perform some sensitivity runs given that many of the assumed quantities are based on engineering calculations. It is simple to use the above approach to generate figures such as Fig. 4.18 for assessing the tradeoff between reducing the margin of error and increasing the cost of verification (instrumentation, installation, and monitoring) as sample size is increased. Note that accepting additional error reduces sample size in a hyperbolic manner. For example, relaxing the requirement from RE ≤ 10% to RE ≤ 15% decreases n from 70 to about 30, while tightening the requirement to RE ≤ 5% would require a sample size of about 270 (outside the range of the figure). On the other hand, there is not much one could do about varying CV since this represents an inherent variability in the population if random sampling is adopted. However, non-random stratified sampling, described next, could be one approach to reduce sample sizes. ■
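The sensitivity runs suggested above are easy to script. The following is a minimal sketch of Eq. (4.48) using the values of Example 4.7.1; the function name `required_sample_size` and its argument names are illustrative choices, not notation from the text.

```python
import math


def required_sample_size(re, z, cv, population):
    """Sample size from Eq. (4.48): n = 1 / [(RE / (z * CV))**2 + 1/N]."""
    return 1.0 / ((re / (z * cv)) ** 2 + 1.0 / population)


# Values of Example 4.7.1: one-tailed 95% CL, CV = 0.5, N = 20,000
Z_ONE_TAILED_95 = 1.645

for re in (0.05, 0.10, 0.15):
    n = required_sample_size(re, Z_ONE_TAILED_95, cv=0.5, population=20_000)
    # math.ceil gives the smallest integer sample that meets the target RE
    print(f"RE <= {re:.0%}: n = {n:.1f} (at least {math.ceil(n)} customers)")
```

Looping over several CV values in the same way would generate the kind of curves shown in Fig. 4.18.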
Fig. 4.18 Size of random sample needed to satisfy different relative errors of the population mean for two different values of population variability (CV of 25% and 50%), plotted as sample size n versus relative error (%) for a population size of 20,000 and a one-tailed 95% CL. Data correspond to Example 4.7.1
4.7.4 Stratified Sampling for Variance Reduction
Variance reduction techniques are a special type of sample estimating procedures which rely on the principle that prior knowledge about the structure of the model and the properties of the input can be used to increase the precision of estimates for a fixed sample size, or, conversely, to decrease the sample size required to obtain a fixed degree of precision. These techniques distort the original problem so that special techniques can be used to obtain the desired estimates at a lower cost. Variance can be decreased by considering a larger sample size, which involves more work. So, the efficiency with which a parameter is estimated can be evaluated as (Shannon, 1975):

$$\text{efficiency} = (\text{variance} \times \text{work})^{-1} \qquad (4.49)$$
This implies that a reduction in variance is not worthwhile if the work needed to achieve it is excessive. A common recourse among social scientists to increase efficiency per unit cost in statistical surveys is to use stratified sampling, which counts as a variance reduction technique. In stratified sampling, the distribution function to be sampled is broken up into several pieces, each piece is then sampled separately, and the results are later combined into a single estimate. The specification of the strata to be used is based on prior knowledge about the characteristics of the population to be sampled. Often an order of magnitude variance reduction is achieved by stratified sampling as compared to the standard random sampling approach.
Example 4.7.2¹⁴ Example of stratified sampling for variance reduction
Suppose a home improvement center wishes to estimate the mean annual expenditure of its local residents in the hardware section and the drapery section. It is known that the expenditures by women differ more widely than those by men. Men visit the store more frequently and spend annually approximately $50; expenditures of as much as $100 or as little as $25 per year are found occasionally. Annual expenditures by women can vary from nothing to over $500. The variance for expenditures by women is therefore much greater, and the mean expenditure more difficult to estimate. Assume that 80% of the customers are men and that a sample size of n = 15 is to be taken. If simple random sampling were employed, one would expect the sample to consist of approximately 12 men (original male fraction of the population f1 = 12/15 = 0.8) and 3 women (original female fraction f2 = 0.2). However, assume that a sample that included n1 = 5 men and n2 = 10 women was selected instead (more women have been preferentially selected because their expenditures are more variable). Suppose the annual expenditures of the members of the sample turned out to be:

Men:    45, 50, 55, 40, 90
Women:  80, 50, 120, 80, 200, 180, 90, 500, 320, 75

It is intuitively clear that such data will lead to a more accurate estimate of the overall average than would the expenditures of 12 men and 3 women.

¹⁴ From Shannon (1975).
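Because the sample of Example 4.7.2 is deliberately disproportionate, a plain average of the 15 values would over-represent women. The sketch below shows the reweighting that Eq. (4.50) expresses, written in the equivalent form "population fraction times stratum mean"; the function name `stratified_mean` is an illustrative choice and not from the text.

```python
# Annual expenditures ($) from the sample of Example 4.7.2
men = [45, 50, 55, 40, 90]
women = [80, 50, 120, 80, 200, 180, 90, 500, 320, 75]

# Known population fractions of the two strata
f_men, f_women = 0.80, 0.20


def stratified_mean(strata):
    """Weight each stratum mean by its population fraction.

    strata: iterable of (population_fraction, observations) pairs.
    """
    return sum(f * sum(obs) / len(obs) for f, obs in strata)


# Naive mean ignores that women are over-sampled relative to the population
naive = (sum(men) + sum(women)) / (len(men) + len(women))
weighted = stratified_mean([(f_men, men), (f_women, women)])

print(f"unweighted sample mean : {naive:.1f}")
print(f"stratified (weighted)  : {weighted:.1f}")
```

The weighted value agrees with the estimate of about $79 quoted for this example, whereas the unweighted mean is pulled upward by the over-sampled women's stratum.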
The appropriate sample weights must be applied to the original sample data if one wishes to deduce the overall mean. Thus, if $M_i$ and $W_i$ are used to designate the ith sample of men and women, respectively,

$$\bar{X} = \frac{1}{n}\left[\frac{f_1}{(n_1/n)}\sum_{i=1}^{5} M_i + \frac{f_2}{(n_2/n)}\sum_{i=1}^{10} W_i\right] = \frac{1}{15}\left[\frac{0.80}{0.33}(280) + \frac{0.20}{0.67}(1695)\right] \approx 79 \qquad (4.50)$$

where 0.80 and 0.20 are the original weights in the population, and 0.33 and 0.67 the sample weights, respectively. This value is likely to be a more realistic estimate than if the sampling had been done based purely on the percentage of the gender of the customers. The above example is a simple case of stratified sampling where the customer base was first stratified into the two genders, and then these were sampled disproportionately. There are statistical formulae which suggest near-optimal sizes for stratified samples, for which the interested reader can refer to Devore and Farnum (2005) and other texts.

4.8 Resampling Methods

4.8.1 Basic Concept

Resampling methods reuse a single available sample to make statistical inferences about the population. The precision of a population-related estimate can be improved by drawing multiple samples from the population and inferring the confidence limits from these samples rather than determining them from classical analytical estimation formulae based on a single sample only. However, this is infeasible in most cases because of the associated cost and time of assembling multiple samples. The basic rationale behind resampling methods is to draw one single sample, treat this original sample as a surrogate for the population, and generate numerous sub-samples by simply resampling the sample itself. Thus, resampling refers to the use of given data, or a data generating mechanism, to produce new samples from which the required estimates can be deduced numerically. It is obvious that the sample must be unbiased and be reflective of the population (which it will be if the sample is drawn randomly), otherwise the precision of the method is severely compromised. Efron and Tibshirani (1982) have argued that given the available power of computing, one should move away from the constraints of traditional parametric theory with its overreliance on a small set of standard models for which theoretical solutions are available and substitute computational power for theoretical analysis. This parallels the way numerical methods have, in large part, replaced closed-form solution techniques of differential equations in almost all fields of engineering mathematics. Thus, versatile numerical techniques allow one to overcome such problems as the lack of knowledge of the probability distribution of the errors of the variables, and even determine sampling distributions of such quantities as the median or the inter-quartile range or even the 5th and 95th percentiles for which no traditional tests exist. The methods are conceptually simple, require low levels of mathematics, can be used to determine any estimate whatsoever of any parameter (not just the mean or variance), and even allow the empirical distribution of the parameter to be obtained. Thus, they have clear advantages when assumptions of traditional parametric tests (such as normal distributions) are not met. Note that the estimation must be done directly and uniquely from the samples, and none of the parametric equations related to standard error, etc., discussed in Sect. 4.2 (such as Eq. 4.2 or Eq. 4.5), should be used. Since the needed additional computing power is easily provided by present-day personal computers, resampling methods have become increasingly popular and have even supplanted classical/traditional parametric tests.

4.8.2 Application to Probability Problems
How resampling methods can be used for solving probability-type problems is illustrated below (Simon 1992). Consider a simple example, where one has six balls labeled 1–6. What is the probability that three balls will be picked such that they have 1, 2, 3 in that order if this is done with replacement? The traditional probability equation would yield $(1/6)^3$. The same result can be determined by simulating the 3-ball selection a large number of times. This approach, though tedious, is more intuitive since this is exactly what the traditional probability of the event is meant to represent; namely, the long-run frequency. One could repeat this 3-ball selection, say, a million times, count the number of times one gets 1, 2, 3 in sequence, and from there infer the needed probability. The procedure rules or the sequence of operations of drawing samples must be written in computer code, after which the computer does the rest. Much more difficult problems can be simulated in this manner, and the advantages of the approach lie in its versatility, the low level of mathematics required, and most importantly, its direct bearing on the intuitive interpretation of probability as the long-run frequency.
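The ball-drawing experiment can be coded in a few lines. The sketch below is one possible implementation using only the Python standard library; the function name, the fixed seed, and the choice of 200,000 trials (rather than a million, to keep the run short) are assumptions made for illustration.

```python
import random


def estimate_sequence_probability(trials, seed=0):
    """Estimate P(drawing 1, 2, 3 in that order) from six labeled balls,
    drawing three times with replacement, as a long-run frequency."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    target = (1, 2, 3)
    hits = 0
    for _ in range(trials):
        draw = tuple(rng.randint(1, 6) for _ in range(3))
        if draw == target:
            hits += 1
    return hits / trials


estimate = estimate_sequence_probability(200_000)
exact = (1 / 6) ** 3
print(f"simulated: {estimate:.5f}   exact (1/6)^3: {exact:.5f}")
```

Increasing the number of trials narrows the gap between the simulated frequency and the exact value, which is the long-run frequency interpretation referred to above.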
4.8.3 Different Methods of Resampling

The creation of multiple sub-samples from the original sample can be done in several ways, and this distinguishes one method from the other (an important distinction is whether the sampling is done with or without replacement). The three most common resampling methods are:

(a) Permutation method (or randomization method without replacement) is one where all possible subsets of r items (which is the sub-sample size) out of the total n items (the sample size) are generated and used to deduce the population estimate and its confidence levels or its percentiles. This may require some effort in many cases, and so, an equivalent and less intensive variant of this method is to use only a sample of all possible subsets. The size of the sample is selected based on the accuracy needed, and about 1000 samples are usually adequate. The use of the permutation method when making inferences about the medians of two populations is illustrated below. The null hypothesis is that there is no difference between the two populations. First, we sample both populations to create two independent random samples. The difference in the medians between both samples is computed. Next, two subsamples without replacement (say 10–20 cases per subgroup) are created from the two samples, and the difference in the medians between both resampled subgroups is recalculated. This is done a large number of times, say 1000 times. The resulting distribution contains the necessary information regarding the statistical confidence in the null hypothesis of the parameter being evaluated. For example, if the difference in the median between the two original samples was lower in 50 of 1000 possible subgroups, then one concludes that the one-tailed probability of the original event was only 0.05. It is clear that such a sampling distribution can be generated for any statistic of interest, not just the median. However, the number of randomizations quickly becomes very large, and so one must select the number of randomizations with some care.

(b) The jackknife method creates subsamples without replacement. Introduced by Quenouille in 1949 and later extended by Tukey in 1958, it is a technique of universal applicability and great flexibility that allows confidence intervals of an estimate calculated from sub-groups to be determined while reducing bias of the estimator. There are several numerical schemes for implementing the jackknife. The original “leave one out” method of implementation is to simply create n subsamples with (n - 1) data points wherein a single different observation is omitted in each subgroup or subsample. There is no randomness in the results since the same parameter estimates and confidence intervals will be obtained if repeated several times. However, if n is large, this process may be time consuming. A more recent and popular version is the “k-fold cross-validation” method where one (i) divides the random sample of n observations into k groups of equal size, (ii) omits one group at a time and determines what are called pseudo-estimates from the (k - 1) groups, and (iii) estimates the actual confidence intervals of the parameters.¹⁵

(c) The bootstrap method (popularized by Efron in 1979) is similar but differs in that no groups are formed; instead, the different sets of data sequences are generated by repeated sampling with replacement from the observational data set (Davison and Hinkley 1997). Individual estimators deduced from such samples directly permit estimates and confidence intervals to be determined. The analyst must select the number of randomizations, while the sample size is selected to be equal to that of the original sample. The method would appear to be circular; i.e., how can one acquire more insight by resampling the same sample? The simple explanation is that “the population is to the sample as the sample is to the bootstrap sample.”

Though the jackknife is a viable method, it has been supplanted by the bootstrap method, which has emerged as the most efficient of the resampling methods in that better estimates of parameters such as the mean, median, variance, percentiles, empirical distributions and confidence limits are obtained. Several improvements to the naïve bootstrap have been proposed (such as the bootstrap-t method), especially for long-tailed distributions or for time series data.

There is a possibility of confusion between the bootstrap method and the Monte Carlo approach (presented in Sect. 3.7.2). The tie between them is obvious: both are based on repetitive sampling and then direct examination of the results. A key difference between the methods, however, is that bootstrapping uses the original or initial sample as the population from which to resample, whereas Monte Carlo simulation is based on setting up a sample data generation process for the inputs of the simulation or computational model.
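To make the permutation idea concrete, the sketch below tests the difference in medians of two samples. Note that it uses the common label-shuffling form of the randomization test (pool both samples, reshuffle, and split again), which is a swapped-in variant and not exactly the subgroup scheme described in item (a). The data values and the function name are hypothetical.

```python
import random
from statistics import median


def permutation_test_median(sample_a, sample_b, n_perm=1000, seed=0):
    """One-tailed randomization test for median(a) - median(b).

    Returns the observed difference and the fraction of reshuffled
    splits whose difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = median(sample_a) - median(sample_b)
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment, without replacement
        diff = median(pooled[:n_a]) - median(pooled[n_a:])
        if diff >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n_perm


# Hypothetical measurements from two groups (illustration only)
group_a = [23.1, 25.4, 24.8, 26.0, 22.7, 27.3, 25.9, 24.1, 26.6, 23.8]
group_b = [20.2, 21.5, 19.8, 22.0, 20.9, 21.1, 19.5, 22.4, 20.6, 21.8]

observed_diff, p_value = permutation_test_median(group_a, group_b)
print(f"observed median difference = {observed_diff:.2f}, "
      f"one-tailed p = {p_value:.3f}")
```

Any other statistic (a percentile, the inter-quartile range) can be substituted for `median` without changing the rest of the procedure.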
4.8.4 Application of Bootstrap to Statistical Inference Problems

The use of the bootstrap method in two types of instances is illustrated in this section: determining confidence intervals and correlation analysis. At its simplest, the algorithm of the bootstrap method consists of the following steps (Devore and Farnum 2005):

1. Obtain a random sample of size n from the population.
2. Generate a random sample of size n with replacement from the original sample in step 1.
3. Calculate the statistic of interest for the sample in step 2.
4. Repeat steps 2 and 3 many times to form an approximate sampling distribution of the statistic.

It is important to note that bootstrapping requires that sampling be done with replacement, and about 1000 samples are often required. It is advised that the analyst perform a few evaluations with different numbers of samples in order to be more confident about the results. The following example illustrates the implementation of the bootstrap method.

¹⁵ The k-fold cross-validation is also used in regression modeling (Sect. 5.8) and in tree-based classification problems (Sect. 11.5).

Example 4.8.1¹⁶ Using the bootstrap method for deducing confidence intervals
The data in Table 4.12 correspond to the breakdown voltage (in kV) of an insulating liquid, which is indicative of its dielectric strength. Determine the 95% CI.

Table 4.12 Data table for Example 4.8.1
62  59  54  46  57  53
50  64  55  55  48  52
53  50  57  53  63  50
57  53  50  54  57  55
41  64  55  52  57  60
53  62  50  47  55  50
55  50  56  47  53  56
61  68  55  55  59  58

First, use the large-sample confidence interval formula to estimate the two-tailed 95% CI of the mean. Summary quantities are: sample size n = 48, $\sum x_i = 2626$ and $\sum x_i^2 = 144{,}950$, from which $\bar{x} = 54.7$ and standard deviation s = 5.23. The 95% CI following the traditional parametric approach is then:

$$54.7 \pm 1.96\,\frac{5.23}{\sqrt{48}} = 54.7 \pm 1.5 = (53.2,\ 56.2)$$

The confidence intervals using the bootstrap method are now recalculated to evaluate differences. A histogram of 1000 samples of n = 48 each, drawn with replacement, is shown in Fig. 4.19. The 95% CI corresponds to the two-tailed 0.05 significance level. Thus, one selects 1000(0.05/2) = 25 units from each end of the distribution, i.e., the 25th and the 975th values of the ordered distribution, which yield (53.33, 56.27); these are very close to the parametric range determined earlier. This example illustrates the fact that bootstrap intervals usually agree with traditional parametric ones when all the assumptions underlying the latter are met. It is when they do not that the power of the bootstrap stands out.

Fig. 4.19 Histogram of bootstrap sample means with 1000 samples (Example 4.8.1)

Further, the bootstrap resampling approach can also provide 95% CI for other quantities not shown in the figure, such as for the standard deviation (4.075, 6.238) and for the median (53.0, 56.0). ■

¹⁶ From Devore and Farnum (2005) by © permission of Cengage Learning.

The following example illustrates the versatility of the bootstrap method for determining correlation between two variables, a problem which is recast as comparing two sample means.

Example 4.8.2¹⁷ Using the bootstrap method with a nonparametric test to ascertain correlation of two variables
One wishes to determine whether there exists a correlation between athletic ability and intelligence level of teenage students. A sample of 10 high school athletes was obtained involving their athletic and IQ scores. The data are listed in descending order of athletic scores in the first two columns of Table 4.13. A nonparametric approach is adopted to solve this problem; the parametric version would be the test on the Pearson correlation coefficient (Sect. 4.2.7). The athletic scores and the IQ scores are rank ordered from 1 to 10 as shown in the last two columns of the table. The two observations (athletic rank, IQ rank) are treated together since one would like to determine their joint behavior. The table is split into two groups of five “high” and five “low.” An even split of the group is advocated since it uses the available information better and usually leads to greater “efficiency.” The sum of the observed IQ ranks of the five top athletes = (3 + 1 + 7 + 4 + 2) = 17. The resampling scheme will involve numerous trials where a subset of 5 numbers is drawn randomly from the set {1...10}. One then adds these five IQ rank numbers for each individual trial. If the sum across trials is consistently higher than the observed value of 17, this will indicate that the best athletes have not earned the observed IQ scores purely by chance. The probability can be directly estimated from the proportion of trials whose sum exceeded 17. Figure 4.20 depicts the histogram of the IQ rank sum of 5 random observations using 100 trials (a rather low number of trials meant for illustration purposes). Note that in only 2% of the trials was the sum 17 or lower. Hence, one can state to within 98% CL that there does exist a correlation between athletic ability and IQ score. It is instructive to compare this conclusion against one from a parametric method. The Pearson correlation coefficient (Sects. 3.4.2 and 4.2.7) between the raw athletic scores and the IQ scores has been determined to be r = 0.7093 for this sample with a p-value of 2.2%, which is almost identical to the approximate p-value of 2.0% determined by the bootstrap method.

Table 4.13 Data table for Example 4.8.2 along with ranksᵃ
Athletic score   IQ score   Athletic rank   IQ rank
97               114        1               3
94               120        2               1
93               107        3               7
90               113        4               4
87               118        5               2
86               101        6               8
86               109        7               6
85               110        8               5
81               100        9               9
76               99         10              10
ᵃ Data available electronically on book website

¹⁷ From Simon (1992) by © permission of Duxbury Press.

Fig. 4.20 Histogram based on 100 trials of the sum of 5 random IQ ranks from the sample of 10. Note that in only 2% of the trials was the sum equal to 17 or lower (Example 4.8.2)
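Both examples of this section can be reproduced with a short script. The sketch below is an illustration under stated assumptions rather than the procedure used to generate Figs. 4.19 and 4.20: the function names, seeds, and number of trials are hypothetical choices, the percentile interval is the simple (naïve) bootstrap interval, and the rank-sum part draws 5 ranks without replacement from {1, ..., 10}, so the numerical results will differ slightly from those quoted in the examples.

```python
import random
from statistics import mean

# Breakdown voltages (kV) from Table 4.12
voltages = [
    62, 59, 54, 46, 57, 53, 50, 64, 55, 55, 48, 52,
    53, 50, 57, 53, 63, 50, 57, 53, 50, 54, 57, 55,
    41, 64, 55, 52, 57, 60, 53, 62, 50, 47, 55, 50,
    55, 50, 56, 47, 53, 56, 61, 68, 55, 55, 59, 58,
]


def bootstrap_percentile_ci(data, stat, n_boot=1000, alpha=0.05, seed=1):
    """Steps 2 to 4 of the bootstrap algorithm with a percentile interval."""
    rng = random.Random(seed)
    n = len(data)
    # Resample n values WITH replacement and evaluate the statistic each time
    stats = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    k = round(n_boot * alpha / 2)  # 25 values trimmed from each end
    return stats[k - 1], stats[n_boot - k - 1]  # 25th and 975th ordered values


def rank_sum_p_value(observed_sum, n_trials=20_000, seed=2):
    """Fraction of random 5-rank subsets whose sum is <= the observed sum."""
    rng = random.Random(seed)
    ranks = range(1, 11)
    hits = sum(
        sum(rng.sample(ranks, 5)) <= observed_sum for _ in range(n_trials)
    )
    return hits / n_trials


ci_low, ci_high = bootstrap_percentile_ci(voltages, mean)
print(f"bootstrap 95% CI of the mean: ({ci_low:.2f}, {ci_high:.2f})")

p_chance = rank_sum_p_value(17)  # 17 = 3 + 1 + 7 + 4 + 2 (Example 4.8.2)
print(f"fraction of trials with IQ rank sum <= 17: {p_chance:.3f}")
```

Passing `statistics.median` or `statistics.stdev` as `stat` gives the corresponding intervals for the median and standard deviation without any change to the resampling code.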
4.8.5 Closing Remarks

Resampling methods can be applied to diverse problems (Good 1999): (i) for determining probability in complex situations, (ii) to estimate confidence intervals (CI) of an estimate during univariate sampling of a population, (iii) hypothesis testing to compare estimates of two samples, (iv) to estimate confidence bounds during regression, and (v) for classification. These problems can all be addressed by classical methods provided one makes certain assumptions regarding probability distributions of the random variables. The analytic solutions can be daunting to those who use these statistical analytic methods rarely, and one can even select the wrong formula by mistake. Resampling is much more intuitive and provides a way of simulating the physical process without having to deal with the sometimes obfuscating statistical constraints of the analytic methods. A big virtue of resampling methods is that they extend classical statistical evaluation to cases which cannot be dealt with mathematically. The downside to the use of these methods is that they require larger computing resources (two or three orders of magnitude). This issue is no longer a constraint because of the computing power of modern-day personal computers. Resampling methods are also referred to as computer-intensive methods, although other techniques discussed in Sect. 10.6 are more often associated with this general appellation. It has been suggested that one should use a parametric test when the samples are large, say number of observations is greater than 40, or when they are small ( 350 at the 0.05 significance level.

Pr. 4.15 Comparison of human comfort correlations between Caucasian and Chinese subjects
Human indoor comfort can be characterized by the occupants’ feeling of well-being in the indoor environment. It depends on several interrelated and complex phenomena involving subjective as well as objective criteria. Research initiated over 50 years ago and subsequent chamber studies have helped define acceptable thermal comfort ranges for indoor occupants.
Perhaps the most widely used standard is ASHRAE Standard 55-2004 (ASHRAE 2004), which is described in several textbooks (e.g., Reddy et al., 2016). The basis of the standard is the thermal sensation scale determined by the votes of the occupants following the scale in Table 4.23. The individual votes of all the occupants are then averaged to yield the predicted mean vote (PMV). This is one of the two indices relevant to define acceptability of a large population of people exposed to a certain indoor environment. PMV = 0 is defined as the neutral state (neither cool nor warm), while positive values indicate that occupants feel warm, and vice versa. The mean scores from the chamber studies are then regressed against the influential environmental parameters to yield an empirical correlation which can be used as a means of prediction:

$$PMV = a^{*}\,T_{db} + b^{*}\,P_v + c^{*} \qquad (4.51)$$

where $T_{db}$ is the indoor dry-bulb temperature (degrees C), $P_v$ is the partial pressure of water vapor (kPa), and the numerical values of the coefficients a*, b*, and c* are dependent on such factors as sex, age, hours of exposure, clothing levels, type of activity, etc. The values relevant to healthy adults in an office setting for a 3 h exposure period are given in Table 4.24.
Fig. 4.21 Predicted percentage of dissatisfied (PPD) people as function of predicted mean vote (PMV) following Eq. (4.52)
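The curve of Fig. 4.21 follows directly from the curve-fit expression of Eq. (4.52). A minimal sketch is given below; the function name `ppd_from_pmv` is an illustrative choice, and the PMV values in the loop are arbitrary sample points.

```python
import math


def ppd_from_pmv(pmv):
    """Percentage of people dissatisfied (PPD) from Eq. (4.52)."""
    return 100.0 - 95.0 * math.exp(-(0.03353 * pmv**4 + 0.2179 * pmv**2))


# Tabulate PPD over the usual thermal sensation range
for pmv in (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0):
    print(f"PMV = {pmv:+.1f}  ->  PPD = {ppd_from_pmv(pmv):5.1f} %")
```

The expression is symmetric in PMV and reaches its minimum of 5% at PMV = 0, consistent with the statement that some occupants are dissatisfied even under optimal conditions.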
In general, the distribution of votes will always show considerable scatter. The second index is the percentage of people dissatisfied (PPD), defined as people voting outside the range of -1 to +1 for a given value of PMV. When the PPD is plotted against the mean vote of a large group characterized by the PMV, one typically finds a distribution such as that shown in Fig. 4.21. This graph shows that even under optimal conditions (i.e., a mean vote of zero), at least 5% are dissatisfied with the thermal comfort. Hence, because of individual differences, it is impossible to specify a thermal environment that will satisfy everyone. A curve fit expression between PPD and PMV has also been suggested:

$$PPD = 100 - 95\,\exp\left[-\left(0.03353\,PMV^4 + 0.2179\,PMV^2\right)\right] \qquad (4.52)$$

Note that the overall approach is consistent with the statistical approach of approximating distributions by the two primary measures, the mean and the standard deviation. However, in this instance, the standard deviation (characterized by PPD) has been empirically found to be related to the mean value, namely PMV (Eq. 4.51). A research study was conducted in China by Jiang (2001) in order to evaluate whether the above types of correlations, developed using American and European subjects, are applicable to Chinese subjects as well. The environmental chamber test protocol was generally consistent with previous Western studies. The total number of Chinese subjects in the pool was about 200, and several tests were done with smaller batches (about 10–12 subjects per batch evenly split between males and females). Each batch of subjects first spent some time in a pre-conditioning chamber after which they were moved to the main chamber. The environmental conditions (dry-bulb temperature Tdb, relative humidity RH and air velocity) of the main chamber were controlled such that: Tdb (±0.3 °C), RH (±5%) and air velocity

better visualized if plotted on a psychrometric chart shown in Fig. 4.22. Based on this data, one would like to determine whether the psychological responses of Chinese people are different from those of American/European people.
(a) Formulate the various types of statistical tests one would perform stating the intent of each test.
(b) Perform some or all these tests and draw relevant conclusions.
(c) Prepare a short report describing your entire analysis.
Hint One of the data points is suspect. Also use Eqs. (4.51) and (4.52) to generate the values pertinent to Western subjects prior to making comparative evaluations.

Fig. 4.22 Chamber test conditions plotted on a psychrometric chart for Chinese subjects (Problem 4.15)

References

CV, this would indicate that the model deviates more at the lower range, and vice versa. There is a third way in which the RMSE can be normalized, though it is not used as much in the statistical literature. It is simply to divide the RMSE by the range of variation in the response variable y:

$$CV'' = RMSE/(y_{\max} - y_{\min}) \qquad (5.10c)$$
This measure has a nice intuitive appeal, and its range is bounded between [0,1].

⁶ Parsimony in the context of regression model building is a term meant to denote the most succinct model, i.e., one without any statistically superfluous regressors.

(e) Mean Bias Error
The mean bias error (MBE) is defined as the mean difference between the actual data values and model predicted values:

$$MBE = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)}{n-p} \qquad (5.11a)$$

Note that when a model is identified by OLS using the original set of data, MBE should be zero (to within round-off errors of the computer). Only when, say, the model identified from a first set of observations is used to predict the value of the response variable under a second set of conditions will MBE be different from zero (see Sect. 5.8 for further discussion). Under the latter circumstances, the MBE is also called the mean simulation or prediction error. A normalized MBE (or NMBE) is often used and is defined as the MBE given by Eq. 5.11a divided by the mean value of the response variable $\bar{y}$:

$$NMBE = \frac{MBE}{\bar{y}} \qquad (5.11b)$$

(f) Mean Absolute Deviation
The mean absolute deviation (MAD) is defined as the mean absolute difference between the actual data values and model predicted values:

$$MAD = \frac{\sum_{i=1}^{n}|y_i - \hat{y}_i|}{(n-p)} \qquad (5.12)$$

This metric is also called mean absolute error (MAE) and is a measure of the systematic bias in the model.

Example 5.3.2 Using the data from Example 5.3.1 repeat the exercise using a spreadsheet program. Calculate R², RMSE, and CV values.
From Eq. 5.2, SSE = 323.3 and SSR = 3390.5. From this SST = SSE + SSR = 3713.9. Then from Eq. 5.7a, R² = 91.3%, while from Eq. 5.8, RMSE = 3.2295, from which CV = 0.095 = 9.5%. ■

(g) Skill Factor
Another measure is sometimes used to compare the relative improvement of one model over another when applied to the same data set. The relative score or skill factor (SF) allows one to quantify the improvement in predictive accuracy that, say, model B would bring compared to another, say model A. It is common to use the RMSE of both models as the basis:

$$SF(\%) = \frac{RMSE_{\text{model A}} - RMSE_{\text{model B}}}{RMSE_{\text{model A}}} \times 100, \qquad -\infty < SF \le 100 \qquad (5.13)$$

Hence, SF < 0% would indicate that model B is poorer than model A. Conversely, SF > 0 would mean that model B is an improvement. If, say, SF = 35%, this could be interpreted as the predictive accuracy of model B being 35% better than that of model A. The limit of SF = 100% is indicative of the upper limit of perfect predictions, that is, RMSE (model B) = 0. The SF metric is a conceptually appealing measure that allows several potential models to be directly evaluated and ranked compared to a baseline or reference model. The Adj. R-square, the RMSE (or CV), MBE (or NMBE), and MAD are perhaps the most widely used metrics to evaluate competing regression model fits to data. Under certain circumstances, one model may be preferable to another in terms of one index but not the other. The analyst is then perplexed as to which index to pick as the primary one. In such cases, the specific intent of how the model is going to be subsequently applied should be considered, which may suggest the model selection criterion.

5.3.3 Inferences on Regression Coefficients and Model Significance

Once an overall regression model is identified, is the model statistically significant? If it is not, the entire identification process loses its value. The F-statistic, which tests for the significance of the overall regression model (not that of a particular regressor), is defined as:

$$F = \frac{\text{variance explained by the regression}}{\text{variance not explained by the regression}} = \frac{MSR}{MSE} = \frac{SSR}{SSE}\cdot\frac{n-p}{p-1} \qquad (5.14a)$$

Note that the degrees of freedom for SSR = (p - 1) while that for SSE = (n - p). Thus, the smaller the value of F, the poorer the regression model. It will be noted that the F-statistic is directly related to R² as follows:

$$F = \frac{R^2}{1-R^2}\cdot\frac{(n-p)}{(p-1)} \qquad (5.14b)$$

Hence, the F-statistic can alternatively be viewed as being a measure to test the R² significance itself. During univariate regression, the F-test is really the same as a Student t-test for
176
5
the significance of the slope coefficient. In the general case, the F-test allows one to test the joint hypothesis of whether all coefficients of the regressor variables are equal to zero or not. Example 5.3.3 Calculate the F-statistic for the model identified in Example 5.3.1. What can you conclude about the significance of the fitted model? From Eq. 5.14a, F=
3390:5 33 - 2 = 325 323:3 2-1
which clearly indicates that the overall regression fit is significant. The reader can verify that Eq. 5.14b also yields an identical value of F. ■

Note that the values of coefficients a and b based on the given sample of n observations are only estimates of the true model parameters α and β. If the experiment is repeated, the estimates of a and b are likely to vary from one set of experimental observations to another. OLS estimation assumes that the model residual ε is a random variable with zero mean. Further, the residuals εi at specific values of x are taken to be randomly distributed, which is akin to saying that the distributions shown in Fig. 5.3 at specific values of x are normal and have equal variance.

After getting an overall picture of the regression model, it is useful to study the significance of each individual regressor on the overall statistical fit in the presence of all other regressors. The Student t-statistic is widely used for this purpose and is applied to each regression parameter. For the slope parameter b in Eq. 5.1, the Student t-value is
$$t = \frac{b - 0}{s_b} \tag{5.15a}$$
The estimated standard deviation (also referred to as “the standard error of the sampling distribution”) of the slope parameter b is given by
$$s_b = \frac{\mathrm{RMSE}}{\sqrt{S_{xx}}} \tag{5.15b}$$
with Sxx being the sum of squares given by Eq. 5.6b and RMSE by Eq. 5.8. For the intercept parameter a in Eq. 5.1, the Student t-value is
$$t = \frac{a - 0}{s_a} \tag{5.16a}$$
where the estimated standard deviation of the intercept parameter a is
$$s_a = \mathrm{RMSE}\left[\frac{\sum_{i}^{n} x_i^2}{n\,S_{xx}}\right]^{1/2} \tag{5.16b}$$
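The t-values of Eqs. 5.15 and 5.16 and the two forms of the F-statistic can be checked numerically. The following is a minimal sketch, not the authors' code: the function name and the five data points are hypothetical, RMSE is assumed to be sqrt(SSE/(n - p)), and the form of Eq. 5.14a is inferred from the numbers used in Example 5.3.3.

```python
import math

def simple_ols_stats(x, y):
    """Fit y = a + b*x by OLS; return coefficients, t-values and F-statistics."""
    n = len(x)
    p = 2  # model parameters: intercept a and slope b
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_mean - b * x_mean
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_mean) ** 2 for yi in y)
    ssr = sst - sse
    rmse = math.sqrt(sse / (n - p))  # assumed definition of RMSE
    s_b = rmse / math.sqrt(sxx)                                   # Eq. 5.15b
    s_a = rmse * math.sqrt(sum(xi ** 2 for xi in x) / (n * sxx))  # Eq. 5.16b
    r2 = ssr / sst
    return {
        "a": a,
        "b": b,
        "t_b": b / s_b,                                      # Eq. 5.15a
        "t_a": a / s_a,                                      # Eq. 5.16a
        "F_from_ss": (ssr / sse) * (n - p) / (p - 1),        # Eq. 5.14a (inferred form)
        "F_from_r2": (r2 / (1 - r2)) * (n - p) / (p - 1),    # Eq. 5.14b
    }

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
stats = simple_ols_stats(x, y)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

For a univariate model the two F values coincide and equal the square of the slope t-value, which is the equivalence between the F-test and the t-test noted earlier.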
Basically, the t-test as applied to regression model building is a formal statistical test to determine how significantly different an individual coefficient is from zero in the presence of the remaining coefficients. Stated simply, it enables an answer to the following question: would the fit become poorer if the regressor variable in question is not used in the model at all?

Recall that the confidence intervals CI refer to the limits for the mean response at specified values of the predictor variables for a specified confidence level (CL). Let β and α denote the hypothesized true values of the slope and intercept coefficients. The CI for the model parameters are determined as follows. For the slope term:
$$b - \frac{t_{\alpha/2}\,\mathrm{RMSE}}{\sqrt{S_{xx}}} < \beta < b + \frac{t_{\alpha/2}\,\mathrm{RMSE}}{\sqrt{S_{xx}}}$$

[...]

[Figure: plot with x-axis “Solids Reduction”]

$$21.9025 - (2.04)(0.87793) < \langle y_{20} \rangle < 21.9025 + (2.04)(0.87793)$$
or,
$$20.112 < \langle y_{20} \rangle < 23.693 \quad \text{at 95% CL.}$$ ■

Example 5.3.9 Calculate the 95% PI for predicting the individual response for x = 20 using the linear model identified in Example 5.3.1.
Using Eq. 5.21,
$$\mathrm{Var}^{1/2}(y_0) = (3.2295)\left[1 + \frac{1}{33} + \frac{(20 - 33.4545)^2}{4152.18}\right]^{1/2} = 3.3467$$
Further, t0.05/2 = 2.04. Using an expression analogous to Eq. 5.20 yields the PI for the individual response:
$$21.9025 - (2.04)(3.3467) < y_{20} < 21.9025 + (2.04)(3.3467)$$
or
$$15.075 < y_{20} < 28.730 \quad \text{at 95% CL.}$$ ■

5.4 Multiple OLS Regression

Regression models can be classified as:
(i) Univariate or multivariate, depending on whether only one or several regressor variables are being considered.
(ii) Single-equation or multi-equation, depending on whether only one or several interconnected response variables are being considered.
(iii) Linear or nonlinear, depending on whether the model is linear or nonlinear in its parameters (and not its functional form). Thus, a regression equation such as y = a + b x + c x2 is said to be linear in its parameters {a, b, c} though it is nonlinear in its functional structure (see Sect. 1.4.4 for a discussion on the classification of mathematical models).

Certain simple univariate equation models are shown in Fig. 5.7. Frame (a) depicts simple linear models (one with a positive slope and another with a negative slope), while frames (b) and (c) are higher-order polynomial models which, though nonlinear in the function, are models linear in their parameters. The other figures depict nonlinear models. Analysts often adopt a linear model as an approximation (especially over a limited range) even if the relationship of the data is not strictly linear. If a function such as that shown in frame (d) is globally nonlinear, and if the domain of the experiment is limited, say, to the right knee of the curve (bounded by points c and d), then a linear function in this region could be postulated. Models tend to be preferentially framed as linear ones largely due to the simplicity in subsequent model building and the prevalence of solution methods based on matrix algebra.

Fig. 5.7 General shape of regression curves (From Shannon 1975 by permission of Pearson Education)

5.4.1 Higher Order Linear Models

When more than one regressor variable is known to influence the response variable, a multivariate model will explain more of the variation and provide better predictions than a univariate model. The parameters of such a model can be identified using multiple regression techniques. This section will discuss certain important issues regarding multivariate, single-equation models linear in the parameters using the OLS
approach. For now, the treatment is limited to regressors that are uncorrelated or independent. Consider a data set of n readings that include k regressor variables. The number of model parameters p will then be (k+1) because one of the parameters is the intercept or constant term. The corresponding form, called the additive multivariate linear regression (MLR) model, is:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon \tag{5.22a}$$
where ε is the error or unexplained variation in y. Due to the lack of any interaction terms, the model is referred to as “additive.” The simple interpretation of the numerical value of the model parameters is that βi represents the unit influence of xi on y (i.e., the slope dy/dxi). Note that this is strictly valid only when the variables are independent or uncorrelated, which is often not true.

The same model formulation is equally valid for a kth degree polynomial regression model, which is a special case of Eq. 5.22a with x1 = x, x2 = x^2, etc.:
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_k x^k + \varepsilon \tag{5.23}$$
Polynomial models are commonly used to represent empirical behavior and can capture a variety of shapes. Usually, they are limited to second-order (quadratic) or third-order (cubic) functional forms. Let xij denote the ith observation of regressor j. Then Eq. 5.22a can be re-written as
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i \tag{5.22b}$$
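To make the "special case" concrete, the powers of a single variable x can simply be treated as the regressors of Eq. 5.22b. The short sketch below illustrates this; the function name and the sample values are hypothetical and serve only as an illustration.

```python
def polynomial_design_matrix(x_values, degree):
    """Return rows [1, x, x^2, ..., x^degree], one row per observation.

    Column j of each row plays the role of regressor x_ij in Eq. 5.22b,
    with column 0 (all ones) carrying the intercept.
    """
    return [[x ** j for j in range(degree + 1)] for x in x_values]

# Hypothetical observations of a single variable x
rows = polynomial_design_matrix([0.5, 1.0, 2.0], degree=2)
for row in rows:
    print(row)
```

Once the rows are built this way, any multiple linear regression routine can be applied unchanged, since the model remains linear in its parameters.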
Often, it is most convenient to transform the regressor variables and express them as a difference from the mean
(this approach is used in Sect. 5.4.4 and also in Sect. 6.4 while dealing with experimental design methods). This transformation is also useful to reduce the ill-conditioning effects of multicollinearity, which introduces errors and large uncertainties in the model parameter estimates (discussed in Sect. 9.3). Specifically, Eq. 5.22a can be transformed into:
$$y = \beta_0' + \beta_1 (x_1 - \bar{x}_1) + \beta_2 (x_2 - \bar{x}_2) + \cdots + \beta_k (x_k - \bar{x}_k) + \varepsilon \tag{5.24}$$
An important special case is the second-order or quadratic regression model ( p = 3) shown in Fig. 5.7b. The straight line is now replaced by parabolic curves depending on the value of β (i.e., either positive or negative).

Multivariate model development utilizes some of the same techniques as discussed in the univariate case. The first step is to identify all variables that can influence the response as predictor variables. It is the analyst’s responsibility to identify these potential predictor variables based on his or her knowledge of the physical system. It is then possible to plot the response against all possible predictor variables to identify any obvious trends. The greatest single disadvantage to this approach is the sheer labor involved when the number of possible regressor variables is high.

A situation that arises in multivariate regression is variable synergy, commonly called interaction between variables (this is a consideration in other problems; for example, when dealing with the design of experiments). This occurs when two or more variables interact and impact system response to a degree greater than when the variables operate independently. In such a case, the first-order linear model with two interacting regressor variables takes the form:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon \tag{5.25}$$
The term (β3x1x2) is called the interaction term. How the interaction parameter affects the shape of the family of curves is illustrated in Fig. 5.8.

Fig. 5.8 Plots illustrating the effect of interaction among two regressor variables due to the presence of cross-product terms. (a) Non-interacting. (b) Interacting (From Neter et al. 1983)

The origin of this model function is easy to derive. In the non-interacting case, the lines for different values of regressor x2 are essentially parallel, and so the slope terms are equal. Let the model with the first regressor be y = a′ + bx1, while the intercept is given by a′ = f(x2) = a + cx2. Combining both equations results in y = a + bx1 + cx2. This corresponds to Fig. 5.8a. For the interaction case, both the slope and the intercept terms are functions of x2. Hence, writing y = a′ + b′x1 with a′ = a + cx2 and b′ = b + dx2:
$$y = a + c x_2 + (b + d x_2)\,x_1 = a + b x_1 + c x_2 + d x_1 x_2$$
which is identical in structure to Eq. 5.25.

Simple linear functions have been assumed above. It is straightforward to derive expressions for higher-order models by analogy. For example, the second-order (or quadratic) model without interacting variables is:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \varepsilon \tag{5.26}$$
For a second-order model with interacting terms, the corresponding expression can be easily derived. Consider the second-order polynomial model with one regressor (with the error term dropped):
$$y = b_0 + b_1 x_1 + b_2 x_1^2 \tag{5.27}$$
If the parameters {b0, b1, b2} can themselves be expressed as second-order polynomials of another regressor x2, the full model will have nine regression parameters:
$$y = b_{00} + b_{10} x_1 + b_{01} x_2 + b_{11} x_1 x_2 + b_{20} x_1^2 + b_{02} x_2^2 + b_{21} x_1^2 x_2 + b_{12} x_1 x_2^2 + b_{22} x_1^2 x_2^2 \tag{5.28}$$
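The effect of the cross-product term and the count of nine parameters in Eq. 5.28 can be illustrated with a few lines of code. This is only a sketch with hypothetical function names; the interacting coefficients are those quoted in the caption of Fig. 5.10b, and the additive slope 0.95 is taken from the caption of Fig. 5.10a.

```python
from itertools import product

def biquadratic_terms():
    """Exponent pairs (i, j) of the terms x1^i * x2^j appearing in Eq. 5.28."""
    return list(product(range(3), repeat=2))

def slope_wrt_x1(b1, b3, x2):
    """dy/dx1 for the model y = b0 + b1*x1 + b2*x2 + b3*x1*x2 (Eq. 5.25)."""
    return b1 + b3 * x2

print(len(biquadratic_terms()))  # 9 terms, hence nine regression parameters

# Additive model (b3 = 0): the slope with respect to x1 is the same for every x2,
# so the family of lines is parallel.
print([slope_wrt_x1(0.95, 0.0, x2) for x2 in (0, 1, 2)])

# Interacting model y = 5*x1 + 7*x2 + 3*x1*x2: the slope changes with x2,
# so the lines fan out.
print([slope_wrt_x1(5.0, 3.0, x2) for x2 in (0, 1, 2)])
```

The second printout is constant while the third increases with x2, which is the difference between the parallel and the diverging families of lines described above.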
Fig. 5.9 Mortality ratio of men as a function of age and percent of normal weight (Ezekiel and Fox 1959)

Fig. 5.10 Response contour diagrams (Neter et al. 1983). (a) Non-interacting independent variables: y = 20 + 0.95x1 − 0.50x2. (b) Interacting independent variables: y = 5x1 + 7x2 + 3x1x2

The functional dependence of a response variable (mortality ratio) on two independent variables (age and percent of normal weight) is perhaps better illustrated using 3-D plots as shown in Fig. 5.9. Clearly, any model used to fit the shape of this curved surface would require higher-order functional models with interacting terms. For example, men who are either underweight or overweight at age 22 seem to have higher mortality rates than normal (y-axis = 100 is normal), but this is not so for the 52-year age group where overweight is the only high-risk factor. This example also illustrates the fact that such polynomial models can fit simple ridges, peaks, valleys, and saddles. It is important to emphasize that the analyst should strive to identify the simplest model possible with the model order as low as possible in the case of multivariate regression.

3-D plots, such as in Fig. 5.9, are sometimes hard to read, and response contour plots for two independent variable situations are often more telling. Instead of plotting the
dependent or response variable on the z-axis, the two independent variables are shown on the x- and y-axis, and discrete values of the response variable are shown as contour lines. Fig. 5.10 illustrates this type of graphical presentation for two cases: without and with interaction between the two independent variables. The synergistic behavior of independent variables can result in two or more variables working together to “overpower or usurp” another variable’s prediction capability. As a result, it is necessary to always check the importance of each individual predictor variable while performing multivariate regression. Those variables with low absolute values of the t-statistic should be omitted from the model and the remaining predictors used to re-estimate the model parameters. The stepwise regression method described in Sect. 5.4.6 is based on this concept.
5.4.2 Matrix Formulation

When dealing with multiple regression, it is advantageous to resort to matrix algebra because of the compactness and ease of manipulation it offers without loss in clarity. Though the solution is conveniently provided by a computer, a basic understanding of matrix formulation is nonetheless useful. In matrix notation (with Y^T denoting the transpose of Y), the linear model given by Eq. 5.22b can be expressed as follows for the n data points (with the matrix dimension shown in subscripted brackets for better understanding):
$$Y_{(n,1)} = X_{(n,p)}\,\beta_{(p,1)} + \varepsilon_{(n,1)} \tag{5.29}$$
where p is the number of parameters in the model (= k + 1 for a linear model), k is the order of the model, and n is the number of data points, each of which consists of one response and p regressor observations. The individual terms can be expressed as:
$$Y^T = [y_1 \; y_2 \; \ldots \; y_n], \qquad \beta^T = [\beta_0 \; \beta_1 \; \ldots \; \beta_k], \qquad \varepsilon^T = [\varepsilon_1 \; \varepsilon_2 \; \ldots \; \varepsilon_n] \tag{5.30}$$
and
$$X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{bmatrix} \tag{5.31}$$
The first column of 1 is meant for the constant term; it is strictly not needed but is convenient for matrix manipulation. The interpretation of the matrix elements is simple. For example, x21 refers to the second observation of the first regressor x1, and so on. The descriptive measures applicable for a single variable can be extended to multiple variable models of order k and written in compact matrix notation.

5.4.3 Point and Interval Estimation

The approach involving the minimization of SSE for the univariate case (Sect. 5.3.1) can be generalized to multivariate linear regression. Here, the parameter set β is to be identified such that the sum of squares function L is minimized:
$$L = \sum_{i=1}^{n} \varepsilon_i^2 = \varepsilon^T \varepsilon = (Y - X\beta)^T (Y - X\beta) \tag{5.32}$$
or
$$\frac{\partial L}{\partial \beta} = -2X^T Y + 2X^T X \beta = 0 \tag{5.33}$$
which leads to the system of normal equations
$$X^T X\, b = X^T Y \tag{5.34}$$
with
$$X^T X = \begin{bmatrix} n & \sum_{i=1}^{n} x_{i1} & \cdots & \sum_{i=1}^{n} x_{ik} \\ \sum_{i=1}^{n} x_{i1} & \sum_{i=1}^{n} x_{i1}^2 & \cdots & \sum_{i=1}^{n} x_{i1} x_{ik} \\ \vdots & \vdots & & \vdots \\ \sum_{i=1}^{n} x_{ik} & \sum_{i=1}^{n} x_{ik} x_{i1} & \cdots & \sum_{i=1}^{n} x_{ik}^2 \end{bmatrix} \tag{5.35}$$
The above matrix is a symmetric matrix with the main diagonal elements being the sum of squares of the elements in the columns of X and the off-diagonal elements being the sum of the cross-products. From here, the regression model coefficient vector b is the least square estimator vector of β given by:
$$b = (X^T X)^{-1} X^T Y = C\, X^T Y \tag{5.36}$$
provided matrix (X^T X) is not singular. Note that the matrix C = (X^T X)^{-1}, called the variance-covariance matrix of the estimated regression coefficients, is also a symmetric matrix with the main diagonal elements being the variances of the model coefficient estimators and the off-diagonal elements being the sum of the covariances. Under OLS regression, the variance of the model parameters is given by:
$$\mathrm{Var}(b) = \sigma^2 (X^T X)^{-1} = \sigma^2 C \tag{5.37}$$
where σ^2 is the mean square error of the model error terms
$$\sigma^2 = (\text{sum of square errors})/(n - p) \tag{5.38}$$
An unbiased estimator of σ^2 is the sample s^2 or residual mean square
$$s^2 = \frac{\varepsilon^T \varepsilon}{n - p} = \frac{SSE}{n - p} \tag{5.39}$$
For predictions within the range of variation of the original data, the mean and individual response values are normally distributed with the variance given by the following:
(a) For the mean response at a specific set of x0 values, or the confidence interval CI, under OLS
$$\mathrm{var}(y_0) = s^2\, X_0 (X^T X)^{-1} X_0^T \tag{5.40}$$
(b) The variance of an individual prediction, or the prediction level, is
$$\mathrm{var}(y_0) = s^2 \left[ 1 + X_0 (X^T X)^{-1} X_0^T \right] \tag{5.41}$$
where 1 is a column vector of unity. Two-tailed confidence intervals CI at a significance level α are:
$$y_0 \pm t(n - k, \alpha/2)\, \mathrm{var}^{1/2}(y_0) \tag{5.42}$$

Example 5.4.1 Part load performance of fans (and pumps)
Part-load performance curves do not follow the idealized fan laws due to various irreversible losses. For example, reducing the flow rate to half of the rated flow does not reduce the power consumption to (1/8)th of its rated value, as the fan laws would predict. Hence, actual tests are performed for such equipment under different levels of loading. The performance tests of the flow rate and the power consumed are then normalized by the rated or 100% load conditions, called part load ratio (PLR) and fractional full-load power (FFLP), respectively. Polynomial models can then be fit between these two quantities with PLR as the regressor and FFLP as the response variable. Data assembled in Table 5.2 were obtained from laboratory tests on a variable speed drive (VSD) control, which is a very energy efficient device and increasingly installed.
(a) What is the matrix X in this case if a second-order polynomial model is to be identified of the form y = β0 + β1x1 + β2x1^2?
(b) Using the data given in the table, identify the model and report relevant statistics on both parameters and overall model fit.
(c) Compute the confidence interval and the prediction interval at 0.05 significance level for the response at values of PLR = 0.2 and 1.00 (i.e., the extreme points).

Table 5.2 Data table for Example 5.4.1
PLR    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1.0
FFLP   0.05   0.11   0.19   0.28   0.39   0.51   0.68   0.84   1.00

Solution
(a) The independent variable matrix X given by Eq. 5.26b is:
$$X = \begin{bmatrix} 1 & 0.2 & 0.04 \\ 1 & 0.3 & 0.09 \\ 1 & 0.4 & 0.16 \\ 1 & 0.5 & 0.25 \\ 1 & 0.6 & 0.36 \\ 1 & 0.7 & 0.49 \\ 1 & 0.8 & 0.64 \\ 1 & 0.9 & 0.81 \\ 1 & 1 & 1 \end{bmatrix}$$
(b) The results of the regression are shown below:

Parameter    Estimate      Standard error   t-statistic   p-value
Constant     -0.0204762    0.0173104        -1.18288      0.2816
PLR           0.179221     0.0643413         2.78547      0.0318
PLR^2         0.850649     0.0526868        16.1454       0.0000

Analysis of Variance
Source          Sum of squares   Df   Mean square     F-ratio    p-value
Model           0.886287         2    0.443144        5183.10    0.0000
Residual        0.000512987      6    0.0000854978
Total (Corr.)   0.8868           8
Goodness-of-fit: R2 = 99.9%, Adj-R2 = 99.9%, RMSE = 0.009246, and mean absolute error (MAD) = 0.00584. The equation of the fitted model is (with appropriate rounding)
$$\mathrm{FFLP} = -0.0205 + 0.1792\,\mathrm{PLR} + 0.8506\,\mathrm{PLR}^2 \tag{5.43}$$
Since the model p-value in the ANOVA table is less than 0.05, there is a statistically significant relationship between FFLP and PLR at 95% CL. However, the p-value of the constant term is large (>0.05), and a model without an intercept term is more appropriate as physical considerations suggest since power consumed by the pump is zero if there is no flow. The values shown are those provided by the software package. Note that the standard errors and the Student t-values for the model coefficients shown in the table cannot be computed from Eq. 5.14 to 5.15, which apply for the simple linear model. The equations for polynomial regression are rather complicated to solve by hand, and the interested reader can refer to texts such as Neter et al. (1983) for more details.

The 95% CI and PI are shown in Fig. 5.11. Because the fit is excellent, these are very narrow and close to each other. The predicted values as well as the 95% CI and PI for the two data points are given in the table below. Note that the uncertainty range is relatively larger at the lower value than at the higher range.
Fig. 5.11 Plot of fitted model along with 95% CI and 95% PI
x      Predicted y    95% prediction limits        95% confidence limits
                      Lower        Upper           Lower        Upper
0.2    0.0493939      0.0202378    0.0785501       0.0310045    0.0677834
1.0    1.00939        0.980238     1.03855         0.991005     1.02778
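The numbers reported in Example 5.4.1 can be reproduced directly from the matrix relations of Sect. 5.4.3. The following is a minimal sketch in plain Python, not the software package used in the text: the helper functions are hypothetical, the matrix inverse is a small Gauss-Jordan routine adequate only for such tiny systems, and the critical t-value of 2.447 (two-tailed, 0.05 significance) is taken for the residual degrees of freedom n - p = 6 shown in the ANOVA table.

```python
import math

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]

def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(a)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        d = aug[c][c]
        aug[c] = [v / d for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

# Table 5.2 data
plr = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
fflp = [0.05, 0.11, 0.19, 0.28, 0.39, 0.51, 0.68, 0.84, 1.00]

X = [[1.0, x, x * x] for x in plr]            # matrix X of part (a)
Y = [[y] for y in fflp]
Xt = transpose(X)
C = inverse(matmul(Xt, X))                     # C = (X^T X)^-1
b = [row[0] for row in matmul(C, matmul(Xt, Y))]   # Eq. 5.36

n, p = len(plr), len(b)
resid = [y - sum(bj * xj for bj, xj in zip(b, row)) for y, row in zip(fflp, X)]
s2 = sum(e * e for e in resid) / (n - p)       # Eq. 5.39
std_err = [math.sqrt(s2 * C[j][j]) for j in range(p)]   # diagonal of Eq. 5.37
t_stat = [bj / sj for bj, sj in zip(b, std_err)]

T_CRIT = 2.447  # tabulated t-value, 6 degrees of freedom (assumption, see lead-in)

def intervals(x0):
    """Return prediction, 95% confidence limits and 95% prediction limits."""
    row = [1.0, x0, x0 * x0]
    y0 = sum(bj * xj for bj, xj in zip(b, row))
    h = sum(row[i] * C[i][j] * row[j] for i in range(p) for j in range(p))
    ci = T_CRIT * math.sqrt(s2 * h)            # from Eq. 5.40
    pi = T_CRIT * math.sqrt(s2 * (1.0 + h))    # from Eq. 5.41
    return y0, (y0 - ci, y0 + ci), (y0 - pi, y0 + pi)

print("coefficients:", [round(v, 6) for v in b])
print("standard errors:", [round(v, 6) for v in std_err])
print("t-statistics:", [round(v, 3) for v in t_stat])
print("RMSE:", round(math.sqrt(s2), 6))
for x0 in (0.2, 1.0):
    print(x0, intervals(x0))
```

Forming and inverting X^T X explicitly mirrors Eqs. 5.34 to 5.37 and is fine for a three-parameter model; for larger or ill-conditioned problems, statistical packages generally rely on more stable factorizations.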
Example 5.4.2 Table 5.3 gives the solubility of oxygen in water (in mg/L) at 1 atm pressure for different temperatures and different chloride concentrations (in mg/L).
(a) Plot the data and formulate two different potential models for oxygen solubility (the response variable) against the two regressors.
(b) Evaluate both models and identify the better one. Give justification for your choice. Report pertinent statistics for model parameters as well as for the overall model fit.

(a) The above data set (28 data points in all) is plotted in Fig. 5.12a. One notes that the series of plots are slightly nonlinear but parallel, suggesting a higher-order model without interaction terms. Hence, the second-order polynomial models without interaction are probably more logical, but let us investigate both the first-order and second-order linear models.
(b1) Analysis results of the first-order model (n = 28 and p = 3)
Goodness-of-fit indicators: R2 = 96.83%, Adj-R2 = 96.57%, RMSE = 0.41318. All three model parameters are statistically significant as indicated by the p-values ( [...]

[...] p, this would indicate a biased model because of underfitting (Walpole et al. 1998).

Measures other than Adj-R2 and the Mallows Cp statistic have been proposed for subset selection. Two of the most widely adopted ones are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) (see, e.g., James et al. 2013). Another model selection approach, which can be automated to evaluate models with a large number of possible parameters, is the iterative approach, which comes in three variants (in many cases, all three may yield slightly different results).

⁸ Actually, there is no “best” model since random variables are involved. A better term would be “most plausible” and should include mechanistic considerations, if appropriate.

(b-1) Backward Elimination Method
One begins with selecting an initial model that includes the full set of possible predictor variables from the candidate pool, and then successively dropping one variable at a time based on their contribution to the reduction of SSE (or dropping the variable which results in the smallest decrease in R2) until a sudden large drop in R2 is noticed. The OLS method is used to estimate all model parameters along with t-values for each model parameter. If all model parameters are statistically significant, the model building process stops. If some model parameters are not significant, the model parameter of least significance (lowest absolute t-value) is omitted from the regression equation, and the reduced model is refit. This process continues until all parameters that remain in the model are statistically significant.

(b-2) Forward Selection Method
One begins with an equation containing no regressors (i.e., a constant model). The model is then augmented by including the regressor variable with the highest simple correlation with the response variable (or one which will increase R2 by the highest amount). If this regression coefficient is significantly different from zero, it is retained, and the search for a second variable is made. This process of adding regressors one by one is terminated when the last variable entering the equation is not statistically significant or when all the variables are included in the model. Clearly, this approach involves fitting many more models than in the backward elimination method.

(b-3) Stepwise Regression Method
Prior to the advent of resampling methods (Sect. 5.8), stepwise regression was the preferred model-building approach. It combines both the procedures discussed above. Stepwise regression begins by computing correlation coefficients between the response and each predictor variable (like the forward selection method). The variable most highly correlated with the response is then allowed to “enter the regression equation.” The parameter for the single-variable regression equation is then estimated along with a measure of the goodness of fit.
The next most highly correlated predictor variable is identified, given the current variable already in the regression equation. This variable is then allowed to enter the equation and the parameters re-estimated along with the goodness of fit. Following each parameter estimation, t-values for each parameter are calculated and compared to the t-critical value to determine whether all parameters are still statistically significant. Any parameter that is not statistically significant is removed from the regression equation. This process continues until no more variables “enter” or “leave” the regression equation. In general, it is best to select the model that yields a reasonably high “goodness of fit” for the fewest parameters in the model (referred to as model parsimony). The final decision on model selection requires the judgment of the model builder based on mechanistic insights into the problem. Again, one must guard against
the danger of overfitting by performing a cross-validation check (Sect. 5.8). When a black-box model containing several regressors is used, stepwise regression would improve the robustness of the model identified by reducing the number of regressors and, thus, hopefully reduce the adverse effects of multicollinearity between the remaining regressors. Many packages use the F-test indicative of the overall model instead of the t-test on individual parameters to perform the stepwise regression. It is suggested that stepwise regression not be used in case the regressors are highly correlated since it may result in nonrobust models. However, the backward procedure is said to better handle such situations than the forward selection procedure.

A note of caution is warranted in using stepwise regression for engineering models based on mechanistic considerations. In certain cases, stepwise regression may omit a regressor which ought to be influential when using a particular data set, while the regressor is picked up when another data set is used. This may be a dilemma when the model is to be used for subsequent predictions. In such cases, discretion based on physical considerations should trump purely statistical model building.

Resampling methods, such as the cross-validation method (Sect. 5.8.2), can be used as the final judge to settle on the most appropriate model among a relatively small number of variable subsets found by automatic model selection. A sounder estimate of the model prediction error is also directly provided.

Example 5.7.3⁹ Proper model identification with multivariate regression models
An example of multivariate regression is the development of model equations to characterize the performance of refrigeration compressors. It is possible to regress the compressor manufacturer’s tabular data of compressor performance using the following simple bi-quadratic formulation (see Fig. 5.14 for nomenclature):
$$y = b_0 + b_1 T_{cho} + b_2 T_{cdi} + b_3 T_{cho}^2 + b_4 T_{cdi}^2 + b_5 T_{cho} T_{cdi} \tag{5.48a}$$
where y represents either the compressor power (Pcomp) or the cooling capacity (Qch). OLS is then used to develop estimates of the six model parameters, b0 to b5, based on the compressor manufacturer’s data. The biquadratic model was used to estimate the parameters for compressor cooling capacity (in refrigeration tons)¹⁰ for a screw compressor. The model and its corresponding parameter estimates are given below. Although the overall curve fit for the data was excellent (R2 = 99.96%), the t-values of b2 and b4 were clearly insignificant, indicating that the first-order term Tcdi and the second-order term Tcdi^2 should be dropped. As an aside, it has been suggested by some authors that one should “maintain hierarchy,” that is, include in the model lower order terms of a specific variable even if found to be statistically insignificant by stepwise analysis or analysis akin to the one done above. This practice is not universally accepted though and is better left to the discretion of the analyst. A second stage regression is done by omitting these regressors, resulting in the following model and coefficient t-values shown in Table 5.5.
$$y = b_0 + b_1 T_{cho} + b_3 T_{cho}^2 + b_5 T_{cho} T_{cdi} \tag{5.48b}$$

Table 5.5 Results of the first and second stage model building (Example 5.7.3)
               With all parameters        With significant parameters only
Coefficient    Value        t-value       Value        t-value
b0             152.50        6.27         114.80       73.91
b1             3.71         36.14         3.91         11.17
b2             -0.335       -0.62         –            –
b3             0.0279       52.35         0.027        14.82
b4             -0.000940    -0.32         –            –
b5             -0.00683     -6.13         -0.00892     -2.34

All the parameters in the simplified model are significant and the overall model fit remains excellent: R2 = 99.5%. ■

⁹ From ASHRAE (2005) © American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., www.ashrae.org
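The backward elimination logic described above, which is also the two-stage refit carried out in Example 5.7.3, can be sketched as follows. This is an illustrative sketch only: the function names, the small data set, and the threshold t_crit = 2.0 are hypothetical (in practice the critical value depends on the degrees of freedom and the chosen significance level), and the intercept is assumed to be always retained.

```python
import math

def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(a)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        d = aug[c][c]
        aug[c] = [v / d for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def solve_ols(X, y):
    """Return OLS coefficients and their t-values (X includes the column of 1s)."""
    n, p = len(X), len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    C = inverse(xtx)
    b = [sum(C[i][j] * xty[j] for j in range(p)) for i in range(p)]
    sse = sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2 for r, yi in zip(X, y))
    s2 = sse / (n - p)
    t = [b[j] / math.sqrt(s2 * C[j][j]) for j in range(p)]
    return b, t

def backward_elimination(columns, y, t_crit=2.0):
    """columns: dict mapping regressor name -> list of values."""
    names = list(columns)
    while True:
        X = [[1.0] + [columns[nm][i] for nm in names] for i in range(len(y))]
        b, t = solve_ols(X, y)
        if not names:
            return names, b, t
        # Least significant regressor; index 0 (the intercept) is never dropped
        j = min(range(len(names)), key=lambda k: abs(t[k + 1]))
        if abs(t[j + 1]) >= t_crit:
            return names, b, t
        names.pop(j)  # drop it and refit the reduced model

# Hypothetical data: y depends on x1 only; x2 is an irrelevant regressor
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
y = [5.1, 7.8, 11.1, 14.0, 16.9, 20.2, 22.9, 26.1]

kept, coeffs, t_values = backward_elimination({"x1": x1, "x2": x2}, y)
print("retained regressors:", kept)
print("coefficients:", [round(v, 4) for v in coeffs])
print("t-values:", [round(v, 2) for v in t_values])
```

As cautioned in the text, such a purely statistical procedure should not override physical insight, and the retained subset is best confirmed by a cross-validation check.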
¹⁰ 1 Ton of refrigeration = 12,000 Btu/h.

5.5 Applicability of OLS Parameter Estimation

5.5.1 Assumptions

The term “least squares” regression is generally applied to linear models, although the concept can be extended to nonlinear functions as well. The ordinary least squares (OLS) regression method is a special and important sub-class whose parameter estimates are best only when a number of conditions regarding the functional form and the model residuals or errors are met (discussed below). It enables simple (univariate) or multivariate linear regression models to be identified from data, which can then be used for future prediction of the response variable along with its uncertainty intervals. It also allows statistical statements to be made
about the estimated model parameters- a process known as “inference”. No statistical assumptions are used to obtain the OLS estimators for the model coefficients. When nothing is known regarding measurement errors, OLS is often the best choice for estimating the parameters. However, to make statistical statements about these estimators and the model predictions, it is necessary to acquire information regarding the measurement errors. Ideally, one would like the error terms (or residuals) εi to be normally distributed, without serial correlation, with mean zero and constant variance. The implications of each of these four assumptions, as well as a few additional ones, will be briefly addressed below since some of these violations may lead to biased coefficient estimates and to distorted estimates of the standard errors, of the, confidence intervals, and to improper conclusions from statistical tests. (a) Errors should have zero mean: If this is not true, the OLS estimator of the intercept will be biased. The impact of this assumption not being correct is generally viewed as the least critical among the various assumptions. Mathematically, this implies that expected error values E(εi) = 0. (b) Errors should be normally distributed: If this is not true, statistical tests and confidence intervals are incorrect for small samples though the OLS coefficient estimates are unbiased. Fig. 5.3 which illustrates this behavior has already been discussed. This problem can be avoided by having larger samples and verifying that the model is properly specified. (c) Errors should have constant variance: var (εi) = σ 2. This violation of the basic OLS assumption results in increasing the standard errors of the estimates and widening the model prediction confidence intervals (though the OLS estimates themselves are unbiased). In this sense, there is a loss in statistical power. 
This condition in which the variance of the residuals or error terms is not constant is called heteroscedasticity and is discussed further in Sect. 5.6.3. (d) Errors should not be serially correlated: This violation is equivalent to having fewer independent data and also results in a loss of statistical power with the same consequences as (c) above. Serial correlations may occur due to the manner in which the experiment is carried out. Extraneous factors, that is, factors beyond our control (such as the weather) may leave little or no choice as to how the experiments are executed. An example of a reversible experiment is the classic pipefriction experiment where the flow through a pipe is varied to cover both laminar and turbulent flows, and the associated friction drops are observed. Gradually increasing the flow one way (or decreasing it the other
5
Linear Regression Analysis Using Least Squares
way) may introduce biases in the data, which will subsequently also bias the model parameter estimates. In other circumstances, certain experiments are irreversible. For example, the loading on a steel sample to produce a stress–strain plot must be performed by gradually increasing the load until the sample breaks; one cannot proceed in the other direction. Usually, the biases brought about by the test sequence are small, and this may not be crucial. In mathematical terms, this condition, for a first-order case, can be written as the expected value of the product of two consecutive errors being zero, $E(\varepsilon_i \, \varepsilon_{i+1}) = 0$. This assumption, which is said to be the hardest to verify, is further discussed in Sect. 5.6.4.
(e) Errors should be uncorrelated with the regressors: Violation of this assumption results in OLS coefficient estimates being biased and the predicted OLS confidence intervals being understated, that is, narrower. This violation is a very important one and is often due to “mis-specification error” or underfitting. Omission of influential regressor variables and improper model formulation (assuming a linear relationship when it is not) are likely causes. This issue is discussed at more length in Sect. 5.6.5.
(f) Regressors should not have any measurement error: Violation of this assumption in some (or all) regressors will result in biased OLS coefficient estimates for those (or all) regressors. The model can be used for prediction, but the confidence intervals will be understated. Strictly speaking, this assumption is hardly ever satisfied since there is always some measurement error. However, in most engineering studies, measurement errors in the regressors are not large compared to the random errors in the response, and so this violation may not have important consequences. As a rough rule of thumb, this violation becomes important when the errors in x reach about a fifth of the random errors in y, and when multicollinearity is present. If the errors in x are known, there are procedures that allow unbiased coefficient estimates to be determined (see Sect. 9.4.6). Mathematically, this condition is expressed as $\mathrm{Var}(x_i) = 0$.
(g) Regressor variables should be independent of each other: This violation applies to models identified by multiple regression when the regressor variables are correlated with each other (called multicollinearity). This is true even if the model provides an excellent fit to the data. Estimated regression coefficients, though unbiased, will tend to be unstable (their values tend to change greatly when a data point is dropped or added), and the OLS standard errors and the prediction intervals will be understated. Multicollinearity is likely to be a problem only when one (or more) of the correlation coefficients among the regressors exceeds 0.8–0.85 or so. Sect. 9.3 deals with this issue at more length.
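The effect of violating assumption (f) can be illustrated with a small simulation. The sketch below is only illustrative: the "true" parameters, the sample size, and the size of the regressor error are assumed values chosen for the demonstration, not taken from the text. It fits a straight line once with an error-free regressor and once with a regressor observed with additive error, and compares the second slope with the classical large-sample attenuation factor var(x) / (var(x) + var(γ)).

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the sketch is repeatable
n = 20_000
beta0, beta1 = 10.0, 1.5                # assumed "true" parameters (illustrative)

x_true = rng.uniform(1.0, 10.0, n)      # error-free regressor
y = beta0 + beta1 * x_true + rng.normal(0.0, 1.0, n)   # random error in y only


def ols_slope(x, y):
    """Slope of a straight-line OLS fit of y on x."""
    slope, _intercept = np.polyfit(x, y, 1)
    return slope


slope_clean = ols_slope(x_true, y)

# Same response, but the regressor is now observed with an additive error gamma
sigma_gamma = 2.0                       # assumed std. dev. of the regressor error
x_obs = x_true + rng.normal(0.0, sigma_gamma, n)
slope_noisy = ols_slope(x_obs, y)

# Classical large-sample attenuation factor var(x) / (var(x) + var(gamma))
attenuation = x_true.var() / (x_true.var() + sigma_gamma**2)

print(f"slope, error-free x : {slope_clean:.3f}")
print(f"slope, noisy x      : {slope_noisy:.3f}")
print(f"beta1 * attenuation : {beta1 * attenuation:.3f}")
```

With a regressor error this large, the fitted slope is pulled noticeably toward zero, whereas the fit with the error-free regressor recovers a slope close to the assumed value.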
5.5.2 Sources of Errors During Regression
Perhaps the most crucial issue during parameter identification is the type of measurement inaccuracy present. This has a direct influence on the estimation method to be used. Though statistical theory has neatly classified this behavior into a finite number of groups, the data analyst is often stymied by data that do not fit into any one category. The remedial action advocated does not seem to entirely remove the adverse data conditioning. A certain amount of experience is required to surmount this type of adversity, which, further, is circumstance specific. As discussed earlier, there can be two types of errors:
(a) Measurement error. The following sub-cases can be identified depending on whether the error occurs:
(i) In the dependent variable, in which case the model form is:
$$y_i + \delta_i = \beta_0 + \beta_1 x_i \qquad (5.49a)$$
(ii) In the regressor variable, in which case the model form is:
$$y_i = \beta_0 + \beta_1 (x_i + \gamma_i) \qquad (5.49b)$$
(iii) In both dependent and regressor variables:
$$y_i + \delta_i = \beta_0 + \beta_1 (x_i + \gamma_i) \qquad (5.49c)$$
Further, the errors δ and γ (which will be jointly represented by ε) can be additive, in which case εi ≠ f(yi, xi), or multiplicative, in which case εi = f(yi, xi), or, worse still, a combination of both. Section 9.4.1 discusses this issue further.
(b) Model misspecification error. How this would affect the model residuals εi is difficult to predict and is extremely circumstance specific. Misspecification could be due to several factors, for example:
(i) One or more important variables have been left out of the model
(ii) The functional form of the model is incorrect
Even if the physics of the phenomenon or of the system is well understood and can be cast in mathematical terms, identifiability constraints may require that a simplified or macroscopic model be used for parameter identification rather than the detailed model (see Sect. 9.2). This is likely to introduce both bias and random noise in the parameter estimation process except when the model R² is very high (R² > 0.9). This issue is further discussed in Sect. 5.6. Formal statistical procedures do not explicitly treat this case but limit themselves to type (a) errors and, more specifically, to case (i) assuming purely additive or multiplicative errors. The implicit assumptions in OLS and their implications, if violated, are described below.
5.6 Model Residual Analysis and Regularization¹¹
5.6.1 Detection of Ill-Conditioned Behavior
The availability of statistical software has resulted in routine and easy application of OLS to multiple linear models. However, there are several underlying assumptions that affect the individual parameter estimates of the model as well as the overall model itself. Once a model has been identified, the general tendency of the analyst is to hasten to use the model for whatever purpose was intended. However, it is extremely important (and this phase is often overlooked) that the model be assessed to determine whether the OLS assumptions are met; otherwise, the model is likely to be deficient or misspecified and yield misleading results. In the last 50 years or so, much progress has been made on how to screen model residual behavior to gain diagnostic insight into model deficiency or misspecification, take remedial action, or adopt more advanced regression techniques.¹² Some of the simple methods to screen and correct for ill-behaved model residuals are presented in this chapter, while more advanced statistical concepts and regression methods are addressed in Chap. 9. A few idealized plots illustrate some basic patterns of improper residual behavior, which are addressed in more detail in the later sections of this chapter. Fig. 5.15 illustrates the effect of omitting an important dependence, which suggests that an additional variable that distinguishes between the two groups should be introduced in the model. The presence of outliers and the need for more robust regression schemes that are immune to such outliers are illustrated in Fig. 5.16. The presence of nonconstant variance (or heteroscedasticity) in the residuals is a very common violation, and one of several possible manifestations is shown in Fig. 5.17. This particular residual behavior is likely to be remedied by using a log transform of the response variable instead of the variable itself.
Fig. 5.15 The residuals can be separated into two distinct groups (shown as crosses and dots), which suggests that the response variable is related to another regressor not considered in the regression model. This improper residual pattern can be rectified by reformulating the model to include this additional variable. One example of such a time-based event system change is shown in Fig. 8.17
Fig. 5.16 Outliers indicated by crosses suggest that data should be checked and/or robust regression used instead of OLS
Fig. 5.17 Residuals with bow shape and increased variability (i.e., the error increases as the response variable y increases) can often be rectified by a log transformation of y
Fig. 5.18 Bow-shaped residuals can often be rectified by evaluating higher-order linear models
Fig. 5.19 Serial correlation is indicated by a pattern in the residuals when plotted in the sequence the data was collected, that is, when plotted against time even though time may not be a regressor in the model
11 Herschel: “. . . almost all of the greatest discoveries in astronomy have resulted from the consideration of what . . . (was) termed residual phenomena.”
12 Unfortunately, many of these techniques are not widely used by those involved in energy-related data analysis.
Another approach is to use the weighted least squares estimation procedures described later in this section. Though nonconstant variance is easy to detect visually, its cause is difficult to identify. Fig. 5.18 illustrates a typical behavior that arises when a linear function is used to model a quadratic variation. The proper corrective action will increase the predictive accuracy of the model (RMSE will be lower), result in the estimated parameters being more efficient (i.e., lower standard errors), and most importantly, allow more sound and realistic interpretation of the model prediction uncertainty bounds. Figure 5.19 illustrates the occurrence of serial correlations in time series data, which arise when the error terms are not independent. Such patterned residuals occur commonly during model development and provide useful insights into
model deficiency. Serial correlation (or autocorrelation) has special pertinence to time series data (or data ordered in time) collected from in-situ performance of mechanical and thermal systems and equipment. Autocorrelation is present if adjacent residuals show a trend or a pattern of clusters above or below the zero value that can be discerned visually. Such correlations can either suggest that additional variables have been left out of the model (model-misspecification error) or could be due to the nature of the process itself (dynamic nature of the process—further treated in Chap. 8 on time series analysis). The latter is due to the fact that equipment loading over a day would follow an overall cyclic curve (as against random jumps from say full load to half load) consistent with the diurnal cycle and the way the system is operated. In such cases, positive residuals would tend to be followed by positive residuals, and vice versa. Problems associated with model underfitting and overfitting are usually the result of a failure to identify the non-random pattern in time series data. Underfitting does not capture enough of the variation in the response variable which the corresponding set of regressor variables can possibly explain. For example, all four models fit to their respective sets of data as shown in Fig. 5.20 have identical R2 values and t-statistics but are distinctly different in how they capture the data variation. Only plot (a) can be described by a linear model. The data in (b) need to be fitted by a higher-order model, while one data point in (c) and (d) distorts the entire
Fig. 5.20 Plot of the data (x, y) with the fitted lines for four data sets. The models have identical R2 and t-statistics but only the first model is a realistic model (From Chatterjee and Price 1991, with permission from John Wiley and Sons)
model. Blind model fitting (i.e., relying only on model statistics) is, thus, inadvisable. This aspect is further discussed in Sect. 5.6.5.
Overfitting implies capturing randomness in the model, that is, attempting to fit the noise in the data. A rather extreme example is when one attempts to fit a model with six parameters to six data points that have some inherent experimental error. The model has zero degrees of freedom, and the set of six equations can be solved without error (i.e., RMSE = 0). This is clearly unphysical because the model parameters have also “explained” the random noise in the observations in a deterministic manner.
Both underfitting and overfitting can be detected by performing certain statistical tests on the residuals. The most widely used test for white noise (i.e., uncorrelated residuals) involving model residuals is the Durbin-Watson (DW) statistic defined by:
$$DW = \frac{\sum_{i=2}^{n} (\varepsilon_i - \varepsilon_{i-1})^2}{\sum_{i=1}^{n} \varepsilon_i^2} \qquad (5.50a)$$
where $\varepsilon_i$ is the residual at time interval i, defined as $\varepsilon_i = y_i - \hat{y}_i$. An approximate relationship can also be used (Chatterjee and Price 1991):
$$DW \approx 2(1 - r) \qquad (5.50b)$$
where r is the correlation coefficient (Eq. 3.12) between time-lagged residuals. If there is no serial correlation (autocorrelation) present, the expected value of DW is 2 (the limiting range being 0–4). The closer DW is to 2, the stronger the evidence that there is no autocorrelation in the data. A value of DW < 2 suggests that the model underfits, while DW > 2 indicates an overfitted model. Tables are available for approximate significance tests with different numbers of regressor variables and numbers of data points. Table A.13 assembles lower and upper critical values of the DW statistic to test for autocorrelation. These apply to tests for positive autocorrelation (DW < 2); if, however, a test is to be conducted for negative autocorrelation (DW > 2), the quantity (4 − DW) should be used. For example, if n = 20 and the model has three variables (p = 3), the lower and upper critical values at the 0.05 significance level are 1.00 and 1.68: the null hypothesis that the correlation coefficient is equal to zero can be rejected if DW is below 1.00, it cannot be rejected if DW is above 1.68, and the test is inconclusive for values in between. Note that the critical values in the table are one-sided, that is, they apply to a one-tailed distribution. It is important to note that the DW statistic is only sensitive to correlated errors in adjacent observations, that is, when only first-order autocorrelation is present. For example, if the time series has seasonal patterns, then higher-order autocorrelations may be present which the DW statistic will be unable to detect. More advanced concepts and modeling are discussed in Sect. 8.5.3 while treating stochastic time series data.
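The DW statistic and its approximation are simple to compute once the residuals are available in time order. The following is a minimal sketch; the function names and the two synthetic residual series (white noise and a hypothetical first-order autoregressive series with coefficient 0.8) are assumptions made for illustration and are not data from the text.

```python
import numpy as np


def durbin_watson(residuals):
    """Durbin-Watson statistic of Eq. 5.50a for residuals in time order."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)


def durbin_watson_approx(residuals):
    """Approximation of Eq. 5.50b, DW ~ 2(1 - r), with r the lag-1 correlation."""
    e = np.asarray(residuals, dtype=float)
    r = np.corrcoef(e[:-1], e[1:])[0, 1]
    return 2.0 * (1.0 - r)


rng = np.random.default_rng(1)
white = rng.normal(size=500)              # uncorrelated residuals

# Hypothetical positively autocorrelated residuals: e_i = 0.8 e_(i-1) + noise
ar = np.zeros(500)
for i in range(1, 500):
    ar[i] = 0.8 * ar[i - 1] + white[i]

for name, e in [("white noise", white), ("AR(1), 0.8", ar)]:
    print(f"{name:12s} DW = {durbin_watson(e):.2f}  "
          f"2(1-r) = {durbin_watson_approx(e):.2f}")
```

For the uncorrelated series, DW should come out close to 2, while the positively autocorrelated series should give a value well below 2, in line with the discussion above. The computed value would still need to be compared against tabulated critical values (such as those in Table A.13) for a formal test.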
5.6.2 Leverage and Influence Data Points
Most of the aspects discussed above relate to identifying general patterns in the residuals of the entire data set. Another issue is the ability to identify subsets of data that have an unusual or disproportionate influence on the estimated model in terms of parameter estimation. Being able to flag such influential subsets or individual points allows one to investigate their validity, or to glean insights for better experimental design since they may contain the most interesting system behavioral information. Note that such points are not necessarily “bad” data points that should be omitted; rather, they are to be viewed as “distinctive” observations in the overall data set. It is useful to provide a geometrical understanding of outlier points and their potential impact on the model parameter estimates. No matter how carefully an experiment is designed and performed, there always exists the possibility of serious errors. These errors could be due to momentary instrument malfunction (say, dirt sticking onto a paddle wheel of a flow meter), power surges (which may cause data logging errors), or the engineering system deviating from its intended operation due to random disturbances. Usually, it is difficult to pinpoint the cause of the anomalies. The experimenter is often not fully sure whether the outlier is anomalous, or whether it is a valid or legitimate data point that does not conform to what the experimenter “thinks” it should be. In such cases, throwing out a data point may amount to data “tampering” or fudging of results. Usually, data that exhibit such anomalous tendencies are a minority. Even then, if the data analyst retains these questionable observations, they can bias the results of the entire analysis since they exert an undue influence and can dominate a computed relationship between two variables.
Fig. 5.21 Illustrating different types of outliers. Point A is very probably a doubtful point; point B might be bad but could potentially be a very important point in terms of revealing unexpected behavior; point C is close enough to the general trend and should be retained until more data is collected
Consider the case of outliers during regression for the univariate case. Data points are said to be outliers when their model residuals are large relative to the other points. A visual investigation can help one distinguish between endpoints and center points (this is the intent of exploratory data analysis Sect. 3.5). For example, point A of Fig. 5.21 is quite obviously an outlier, and if the rejection criterion orders its removal, one should proceed to do so. On the other hand, point B, which is near the end of the data domain, may not be a bad point at all, but merely the beginning of a new portion of the curve (say, the onset of turbulence in an experiment involving laminar flow). Similarly, even point C may be valid and important. Hence, the only way to remove this ambiguity is to take more observations at the lower end. Thus, a simple heuristic is to reject points only when they are center points. Several advanced books present formal statistical treatment of outliers in a regression context. One can diagnose whether the data set is ill-conditioned or not, as well as identify and reject, if needed, the necessary outliers that cause ill-conditioning during the model-building process (e.g., Belsley et al. 1980). Consider Fig. 5.22a. The outlier point will have little or no influence on the regression parameters identified, and in fact retaining it would be beneficial since it would lead to a reduction in model parameter variance. The behavior shown in Fig. 5.22b is more troublesome because the estimated slope is almost wholly determined by the extreme point. In fact, one may view this situation as a data set with only two data points, or one may view the single point as a spurious point and remove it from the analysis. Gathering more data at that range would be advisable but may not be feasible; this is where the judgment of the analyst or prior information about the underlying trend line is useful. 
Fig. 5.22 Two other examples of outlier points. While the outlier point in (a) is most probably a valid point, this is not clear for the outlier point in (b). Either more data must be collected or, failing that, it is advisable to delete this point from any subsequent analysis (From Belsley et al. (1980) by permission of John Wiley and Sons)
How, and the extent to which, each data point affects the fitted regression line determines whether that particular point is an influence point or not. Scatter plots often reveal such outliers easily for single regressor situations but are inappropriate for multivariate cases. Hence, several statistical measures have been proposed to deal with multivariate situations, the influence and leverage indices being widely used (Belsley et al. 1980; Cook and Weisberg 1982; Chatterjee and Price 1991). The leverage of a datum point quantifies the extent to which that point is “isolated” in the x-space, that is, its distinctiveness in terms of the regressor variables. Consider the following symmetric matrix (called the hat matrix):
$$H = X (X^T X)^{-1} X^T = [p_{ij}] \qquad (5.51)$$
where X is a data matrix with n rows (n is the number of observations) and p columns (given by Eq. 5.31). The order of the H matrix is (n × n), that is, equal to the number of observations. The diagonal element $p_{ii}$ is defined as the leverage of the ith data point. The diagonal elements can be related to the distance between $x_i$ and the mean $\bar{x}$; they take values between 0 and 1, and their average value is equal to (p/n). Points with $p_{ii}$ > [3 (p/n)] are regarded as points with high leverage (sometimes the threshold is taken as [2 (p/n)]). Large residuals are traditionally used to highlight suspect data points or data points unduly affecting the regression model. Instead of looking at the residuals $\varepsilon_i$, it is more meaningful to study a normalized or scaled value, namely the R-student residuals, where
$$\text{R-student} = \frac{\varepsilon_i}{\mathrm{RMSE} \cdot [1 - p_{ii}]^{1/2}} \qquad (5.52)$$
Thus, studentized residuals measure how many standard deviations each observed value deviates from a model fitted using all of the data except that observation. Points with |R-student| > 3 can be said to be influence points; this threshold corresponds to a significance level of 0.01. Sometimes a less conservative value of 2 is used, corresponding to the 0.05 significance level, with the underlying assumption that residuals or errors are Gaussian. A data point is said to be influential if its deletion, singly or in combination with a relatively few others, causes statistically significant changes in the fitted model coefficients. There are several measures used to describe influence; a common one is DFITS:
$$DFITS_i = \frac{\varepsilon_i \, (p_{ii})^{1/2}}{s_i \, (1 - p_{ii})^{1/2}} \qquad (5.53)$$
where $\varepsilon_i$ is the residual error of observation i, and $s_i$ is the standard deviation of the residuals without considering the ith residual. Points with DFITS ≥ $2[p/(n - p)]^{1/2}$ are flagged as influential points. It is advisable to identify points with high leverage, and then examine them in terms of R-student statistics and the DFITS index for final determination. Influential observations can impact the final regression model in different ways (Hair et al. 1998). For example, in Fig. 5.23a, the model residuals are not significant, and the two influential observations shown as filled dots reinforce the general pattern in the model and lower the standard error of the parameters and of the model prediction. Thus, the two points would be considered to be leverage points that are beneficial to our model building. Influential points that adversely impact model building are illustrated in Fig. 5.23b and c. In the former, the two influential points almost totally account for the observed relationship but would not have been identified as outlier points. In Fig. 5.23c, the two influential points have totally altered the model identified, and the actual data points would have shown up as points with large residuals which the analyst would probably have identified as spurious. The next frame (d) illustrates the instance when an influential point changes the intercept of the model but leaves the
Fig. 5.23 (a–f) Common patterns of influential observations (From Hair et al. 1998 by permission of Pearson Education)
slope unaltered. The two final frames, Fig. 5.23e, f, illustrate two, hard to identify and rectify, cases when two influential points reinforce each other in altering both the slope and the intercept of the model though their relative positions are very much different. Note that data points that satisfy both these statistical criteria, that is, are both influential and have high leverage, are the ones worthy of closer scrutiny. Most statistical programs have the ability to flag such points, and hence performing this analysis is fairly straightforward. Thus, in summary, individual data points can be outliers, leverage, or influential points. Leverage of a point is a measure of how unusual the point lies in the x-space. As mentioned above, just because a point has high leverage does not make it influential. An influence point is one that has an important effect on the regression model if that particular point were to be removed from the data set. Influential points are the ones that need particular attention since they provide insights about the robustness of the fit. In any case, all three measures (leverage pii, DFITS, and R-student) provide indications as to the role played by different observations toward the overall model fit. Ultimately, the decision to either retain or reject such points is somewhat based on judgment.
Table 5.6 Data table for Example 5.6.1ᵃ
x     y[0,1]      y1
1     11.69977    11.69977
2     12.72232    12.72232
3     16.24426    16.24426
4     19.27647    19.27647
5     21.19835    21.19835
6     23.73313    23.73313
7     21.81641    21.81641
8     25.76582    25.76582
9     29.09502    29.09502
10    28.9133     50
ᵃ Data available electronically on book website
Example 5.6.1 Example highlighting different characteristics of residuals versus influence points. Consider the following made-up data (Table 5.6) where x ranges from 1 to 10, and the model is y = 10 + 1.5x, to which random normal noise ε with zero mean and σ = 1 has been added to give y[0,1] (second column). In the third column, y1, the response of the last observation has been intentionally corrupted to a value of 50 as shown (say, due to a momentary spike in power supply to the instrument).
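The leverage and studentized-residual diagnostics of this section can be computed directly for the corrupted data of Table 5.6. The sketch below is one possible way of doing so with NumPy; it is not the statistical package used in the example. It assumes that the studentized residual is computed with the residual standard deviation obtained when the observation in question is left out (the $s_i$ of Eq. 5.53), which is consistent with the verbal description given after Eq. 5.52. Variable names are illustrative.

```python
import numpy as np

# Table 5.6: x and the corrupted response y1 (last observation set to 50)
x = np.arange(1.0, 11.0)
y = np.array([11.69977, 12.72232, 16.24426, 19.27647, 21.19835,
              23.73313, 21.81641, 25.76582, 29.09502, 50.0])

n = x.size
X = np.column_stack([np.ones(n), x])      # data matrix with intercept column
p = X.shape[1]                            # number of model parameters

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
resid = y - y_hat

# Hat matrix H = X (X'X)^-1 X' (Eq. 5.51); its diagonal holds the leverages p_ii
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Residual standard deviation with observation i left out (assumed form of s_i)
sse = np.sum(resid**2)
s_del = np.sqrt((sse - resid**2 / (1.0 - leverage)) / (n - p - 1))

# Studentized (deleted) residuals
r_student = resid / (s_del * np.sqrt(1.0 - leverage))

print("obs   y_hat    resid   p_ii  R-student  flags")
for i in range(n):
    flags = []
    if leverage[i] > 3 * p / n:
        flags.append("high leverage")
    if abs(r_student[i]) > 3:
        flags.append("large R-student")
    print(f"{i + 1:3d} {y_hat[i]:8.3f} {resid[i]:8.3f} {leverage[i]:6.3f} "
          f"{r_student[i]:7.2f}  {', '.join(flags)}")
```

Under these assumptions, only the last observation is flagged for a large studentized residual (about 11.4, with a predicted value of about 37.26), while its leverage of roughly 0.35 stays below both the 3(p/n) = 0.6 and the 2(p/n) = 0.4 thresholds because x = 10 is not isolated in the x-space.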
Fig. 5.24 (a) Observed vs predicted plot. (b) Residual plot versus regressor plot
How well a linear model fits the data is depicted in Fig. 5.24. Not surprisingly, the table of unusual residuals shown below does include the last observation since its absolute Studentized residual value is greater than 3.0 (99% CL). It has been flagged as an influential point since it has a major impact on the model coefficients. However, the same point has not been flagged as a leverage point since it is not “isolated” in the x-space. This example serves to highlight the different impacts of leverage versus influence points. ■

Influential points flagged by the statistical package
Row   x      y      Predicted Y   Residual   Studentized residual
10    10.0   50.0   37.2572       12.743     11.43

5.6.3 Remedies for Nonuniform Residuals
Nonuniform model residuals or heteroscedasticity can be due to: (i) the nature of the process investigated, (ii) noise in the data, or (iii) the method of data collection from samples that are known to have different variances. Three possible generic remedies for nonconstant variance are to (Chatterjee and Price 1991):
(a) Introduce additional variables into the model and collect new data
The physics of the problem along with model residual behavior can shed light on whether certain key variables, left out in the original fit, need to be introduced or not. This aspect is further discussed in Sect. 5.6.5.
(b) Transform the dependent variable
This is appropriate when the errors in measuring the dependent variable follow a probability distribution whose variance is a function of the mean of the distribution. In such cases, the model residuals are likely to exhibit heteroscedasticity that can be removed by using exponential, Poisson, or binomial transformations. For example, a variable that is distributed binomially with parameters “n and p” has mean (n·p) and variance [n·p·(1 − p)] (Sect. 2.4.2). For a Poisson variable, the mean and variance are equal. The transformations shown in Table 5.7 will stabilize variance, and the distribution of the transformed variable will be closer to the normal distribution.

Table 5.7 Transformations in dependent variable y likely to stabilize nonuniform model variance
Distribution   Variance of y in terms of its mean μ   Transformation
Poisson        μ                                      y^(1/2)
Binomial       μ(1 − μ)/n                             sin⁻¹(y^(1/2))

The logarithmic transformation is also widely used in certain cases to transform a nonlinear model into a linear one (see Sect. 9.4.3). When the variables have a large standard deviation compared to the mean, working with the data on a log scale often has the effect of dampening variability and reducing asymmetry. This is often an effective means of removing heteroscedasticity as well. However, this approach is valid only when the magnitude of the residuals increases (or decreases) with that of one of the variables.
Example 5.6.2 Example of variable transformation to remedy improper residual behavior
The following example serves to illustrate the use of variable transformation. Table 5.8 shows data/observations from 27 departments in a university with y as the number of faculty and staff and x the number of students. A simple linear regression yields a model with R² = 77.6% and an RMSE = 21.73. However, the residuals reveal an unacceptable, strongly funnel-shaped behavior (see Fig. 5.25a). Instead of a linear model in y, a linear model in ln(y) is investigated. In this case, the model R² = 76.1% and RMSE = 0.25. However, these statistics should NOT be compared directly with the previous indices since the y variable is no longer the same (in one case, it is “y”; in the other “ln(y)”). Leaving this issue aside for now, notice that a first-order model does reduce some of the improper residual variances, but an inverted “U” shape behavior can still be detected, indicating model misspecification (see Fig. 5.25b).

Table 5.8 Data table for Example 5.6.2ᵃ
Obs #   x      y        Obs #   x      y
1       294    30       15      615    100
2       247    32       16      999    109
3       267    37       17      1022   114
4       358    44       18      1015   117
5       423    47       19      700    106
6       311    49       20      850    128
7       450    56       21      980    130
8       534    62       22      1025   160
9       438    68       23      1021   97
10      697    78       24      1200   180
11      688    80       25      1250   112
12      630    84       26      1500   210
13      709    88       27      1650   135
14      627    97
ᵃ Data available electronically on book website
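The model comparison of Example 5.6.2 can be reproduced from the Table 5.8 data with a few lines of code. The sketch below is illustrative only: the helper function `fit` and the crude "spread in the lower versus upper half of x" check are assumptions introduced here as a numerical stand-in for the visual inspection of the residual plots, and RMSE is assumed to be computed with (n − p) degrees of freedom. A quadratic model in ln(y) is included alongside the two first-order models for comparison.

```python
import numpy as np

# Table 5.8: x = number of students, y = number of faculty and staff
x = np.array([294, 247, 267, 358, 423, 311, 450, 534, 438, 697, 688, 630,
              709, 627, 615, 999, 1022, 1015, 700, 850, 980, 1025, 1021,
              1200, 1250, 1500, 1650], dtype=float)
y = np.array([30, 32, 37, 44, 47, 49, 56, 62, 68, 78, 80, 84, 88, 97,
              100, 109, 114, 117, 106, 128, 130, 160, 97, 180, 112, 210,
              135], dtype=float)


def fit(X, z):
    """OLS fit of z on the columns of X; returns coefficients, R2, RMSE, residuals."""
    coef, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ coef
    sse = np.sum(resid**2)
    r2 = 1.0 - sse / np.sum((z - z.mean())**2)
    rmse = np.sqrt(sse / (len(z) - X.shape[1]))   # degrees of freedom n - p
    return coef, r2, rmse, resid


ones = np.ones_like(x)
models = {
    "y ~ x":          (np.column_stack([ones, x]), y),
    "ln(y) ~ x":      (np.column_stack([ones, x]), np.log(y)),
    "ln(y) ~ x, x^2": (np.column_stack([ones, x, x**2]), np.log(y)),
}

order = np.argsort(x)
results = {}
for name, (X, z) in models.items():
    coef, r2, rmse, resid = fit(X, z)
    results[name] = (coef, r2, rmse)
    # Crude funnel check: residual spread in the lower vs upper half of x
    lo, hi = resid[order][:13], resid[order][13:]
    print(f"{name:15s} R2 = {r2:.3f}  RMSE = {rmse:.3g}  "
          f"spread(low x) = {lo.std():.3g}  spread(high x) = {hi.std():.3g}")
```

As the text cautions, the R² and RMSE values of the model in y and the models in ln(y) are not directly comparable; the residual spread in the two halves of the x range gives a rough indication of whether the funnel shape has been reduced.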
Finally, using a quadratic model along with the ln transformation results in the model:
$$\ln(y) = 2.8516 + 0.00311267\,x - 0.00000110226\,x^2 \qquad (5.54)$$
The residuals shown in Fig. 5.25c are now quite well behaved as a result of such a transformation. ■
(c) Perform weighted least squares
This approach is more flexible, and several variants exist (Chatterjee and Price 1991). As described earlier, OLS model residual behavior can exhibit nonuniform variance (called heteroscedasticity) even if the model is structurally complete, that is, the model is not misspecified. This violates one of the standard OLS assumptions. In a multiple regression model, the detection of heteroscedasticity may not be very straightforward since only one or two variables may be the culprits. Examination of the residuals versus each variable in turn, along with intuition and understanding of the physical phenomenon being modeled, can be of great help. Otherwise, the OLS estimates will lack precision, and the estimated standard errors of the model parameters will be wider. If this phenomenon occurs, the model identification should be redone with explicit recognition of this fact. During OLS, the sum of the squared model residuals of all points is minimized with no regard to the values of the individual points or to points from different domains of the range of variability of the regressors. The basic concept of weighted least squares (WLS) is to simply assign different weights to different points according to a certain (rational) statistical scheme. The magnitude of the weight of an observation indicates the importance to be given to that observation. A note of caution is that the weights are not known exactly and that assigning values to them is circumstance specific. The general formulation of WLS is that the following function should be minimized:
$$\text{WLS function} = \sum_i w_i \left( y_i - \beta_0 - \beta_1 x_{1i} - \cdots - \beta_p x_{pi} \right)^2 \qquad (5.55)$$
where $w_i$ are the weights of individual points. These are formulated differently depending on the weighting scheme selected.
(c-i) Errors Are Proportional to x Resulting in Funnel-Shaped Residuals
Consider the simple model y = α + βx + ε whose residuals ε have a standard deviation that increases with the regressor variable (resulting in the funnel-like shape in Fig. 5.26). Dividing the terms of the model by x results in:
Fig. 5.25 (a) Residual plot of linear model. (b) Residual plot of log-transformed linear model. (c) Residual plot of log-transformed quadratic model (Eq. 5.54)
Fig. 5.26 Type of heteroscedastic model residual behavior which arises when errors are proportional to the magnitude of the x variable
$$\frac{y}{x} = \frac{\alpha}{x} + \beta + \frac{\varepsilon}{x} \quad \text{or} \quad y' = \alpha x' + \beta + \varepsilon' \qquad (5.56)$$
with the variance of ε′ becoming constant and equal to a constant k2. This is akin to weighting different vertical slices of the regressor variable by varðεi Þ = k 2 x2i . If the assumption about the weighting scheme is correct, the transformed model will be homoscedastic, and the model parameters α and β will be efficiently estimated by OLS (i.e., the standard errors of the estimates will be optimal). The above transformation is only valid when the model residuals behave as shown in Fig. 5.26. If residuals behave differently, then different transformations or weighting schemes should be explored. Whether a particular transformation is adequate or not can only be gauged by the behavior of the variance of the residuals. Note that the analyst must
200
5
Linear Regression Analysis Using Least Squares
Table 5.9 Measured x and y variables, OLS residuals deduced from Eq. 5.58a and the weights calculated from Eq. 5.58b (Example 5.6.3)a x 1.15 1.90 3.00 3.00 3.00 3.00 3.00 5.34 5.38 5.40 5.40 5.45 7.70 7.80 7.81 7.85 7.87 7.91 7.94 a
y 0.99 0.98 2.60 2.67 2.66 2.78 2.80 5.92 5.35 4.33 4.89 5.21 7.68 9.81 6.52 9.71 9.82 9.81 8.50
OLS residual εi 0.26329 - 0.59826 - 0.2272 - 0.1572 - 0.1672 - 0.0472 - 0.0272 0.435964 - 0.17945 - 1.22216 -0.66216 - 0.39893 - 0.48358 1.53288 - 1.76847 1.37611 1.463402 1.407986 0.063924
wi 0.9882 1.7083 6.1489 6.1489 6.1489 6.1489 6.1489 15.2439 13.6185 12.9092 12.9092 11.3767 0.9318 0.8768 0.8716 0.8512 0.8413 0.8219 0.8078
x 9.03 9.07 9.11 9.14 9.16 9.37 10.17 10.18 10.22 10.22 10.22 10.18 10.50 10.23 10.03 10.23
OLS residual εi - 0.20366 1.730922 2.375506 1.701444 0.828736 0.580302 - 1.18802 1.410628 0.005212 - 3.02479 0.875212 - 2.29937 - 4.0927 2.423858 - 0.61906 - 1.10614
y 9.47 11.45 12.14 11.50 10.65 10.64 9.78 12.39 11.03 8.00 11.90 8.68 7.25 13.46 10.19 9.93
wi 0.4694 0.4614 0.4535 0.4477 0.4440 0.4070 0.3015 0.3004 0.2963 0.2963 0.2963 0.3004 0.2696 0.2953 0.3167 0.2953
Data available electronically on book website
perform two separate regressions: first, an OLS regression to determine the residuals of the individual data points, and then a WLS regression for final parameter identification. This is often referred to as two-stage estimation.

(c-ii) Replicated Measurements with Different Variance

It could happen, especially with designed experiments involving only one regressor variable, that one obtains replicated measurements of the response variable corresponding to a set of fixed values of the explanatory variable. For example, consider the case when the regressor variable x takes several discrete values. If the physics of the phenomenon cannot provide any theoretical basis on how to select a particular weighting scheme, then this must be determined heuristically by studying the data. If there is an increasing pattern in the heteroscedasticity present in the data, this could be modeled either by a logarithmic transform (as illustrated in Example 5.6.2) or by another suitable variable transformation. Another, more versatile approach that can be applied to any pattern of the residuals is illustrated here. Each observed residual $\varepsilon_{ij}$ (where i indexes the discrete x values and j = 1, 2, ..., $n_i$ indexes the observations at each discrete x value) is made up of two parts:

$$\varepsilon_{ij} = (y_{ij} - \bar{y}_i) + (\bar{y}_i - \hat{y}_{ij})$$

where $\bar{y}_i$ is the mean of the replicate observations at the i-th value of x and $\hat{y}_{ij}$ is the model prediction. The first part is referred to as pure error, while the second part measures lack of fit. An assessment of heteroscedasticity is based on pure error. Thus, the WLS weight may be estimated as $w_i = 1/s_i^2$, where the mean square error is:

$$s_i^2 = \frac{\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2}{n_i - 1} \qquad (5.57)$$
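As a minimal sketch of Eq. 5.57, the pure-error variance and the corresponding weights can be computed per replicate group as shown below. The function name `pure_error_weights` and the small data set are hypothetical and serve only as an illustration; the sketch assumes that every x level has at least two replicates with nonzero spread.

```python
import numpy as np

def pure_error_weights(x, y):
    """Return the distinct x levels, the pure-error variance s_i^2 at each
    level (Eq. 5.57), and the WLS weights w_i = 1 / s_i^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    levels = np.unique(x)
    # ddof=1 divides the sum of squared deviations by (n_i - 1)
    s2 = np.array([y[x == level].var(ddof=1) for level in levels])
    return levels, s2, 1.0 / s2

# Hypothetical replicated data: three y readings at each of three x levels
x_rep = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y_rep = [1.0, 1.2, 0.8, 2.0, 2.6, 1.4, 3.0, 4.0, 2.0]
levels, s2, weights = pure_error_weights(x_rep, y_rep)
print(levels, s2, weights)
```

The weights fall as the spread of the replicates grows, so noisier x levels contribute less to the weighted fit.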
Alternatively, a model can be fit to the mean values of x and the $s_i^2$ values in order to smooth out the weighting function, and this function is used instead. Thus, this approach would also qualify as a two-stage estimation process. The following example illustrates this approach.

Example 5.6.3¹³ Example of two-stage weighted regression for replicate measurements
Consider the x-y data given in Table 5.9, noting that replicate measurements of y are taken at different values of x (which vary slightly).

Step 1: An OLS model is identified from the data.

$$y = -0.578954 + 1.1354\,x \quad \text{with } R^2 = 0.841 \text{ and RMSE} = 1.4566 \qquad (5.58a)$$
From the summary tables of the regression analysis, one notes that the intercept term in the model is not statistically significant (p-value = 0.4 for the t-statistic), while the overall model fit given by the F-ratio is significant. A scatter plot of these data and the simple OLS linear model are shown in Fig. 5.27a. The residuals of the simple linear OLS model shown in Fig. 5.27b reveal, as expected, marked heteroscedasticity. Hence, the OLS model is bound to lead to misleading uncertainty bands even if the model predictions themselves are not biased. The residuals from the above OLS model are also shown in the 3rd column of Table 5.9.

Fig. 5.27 (a) Data set and OLS regression line of observations with nonconstant variance and replicated observations in x. (b) Residuals of a simple linear OLS model fit (Eq. 5.58a) during step 1. (c) Residuals and regression line of a second order polynomial OLS fit to the mean x and mean square error (MSE) of the replicate values during step 2 (Eq. 5.58b). (d) Residuals of the weighted regression model identified during step 3 (Eq. 5.58c)

¹³ From Draper and Smith (1981), with permission from John Wiley and Sons.
Step 1: OLS model coefficients
Parameter    Least squares estimate    Standard error    t-statistic    p-value
Intercept    -0.578954                 0.679186          -0.852423      0.4001
Slope        1.1354                    0.086218          13.169         0.0000

Step 1: OLS analysis of variance
Source           Sum of squares    D.f.    Mean square    F-ratio    p-value
Model            367.948           1       367.948        173.42     0.0000
Residual         70.0157           33      2.12169
Total (Corr.)    437.964           34

Step 2: The residuals of the OLS model are heteroscedastic. One needs to identify a regression model for the variance of the OLS model residuals. The data range of the regressor variable can be partitioned into five ranges (these are the discrete values of x for which observations were taken if the first two rows are omitted from the analysis). These five values of x and the corresponding mean square error $s_i^2$ following Eq. 5.57 are shown in the Step 2 table.

x        s_i^2
3        0.0072
5.39     0.373
7.84     1.6482
9.15     0.8802
10.22    4.1152

The data exhibit a quadratic pattern (see Fig. 5.27c), and so a second-order polynomial model is regressed to these data to yield:

$$s_i^2 = 1.887 - 0.8727\,x + 0.09967\,x^2 \quad \text{with } R^2 = 0.743 \qquad (5.58b)$$

Step 3: The regression weights $w_i = 1/s_i^2$ can thus be deduced by using the individual values $x_i$ instead of the mean x in the above equation. The values of the weights are also shown in Table 5.9 under the 4th column.

Step 4: Finally, a weighted regression is performed following the functional form given by Eq. 5.55 (most statistical packages have this capability) using the data under the 1st, 2nd, and 4th columns.

$$y = -0.942228 + 1.16252\,x \quad \text{with } R^2 = 0.896 \text{ and RMSE} = 1.2725 \qquad (5.58c)$$

The residual plot is shown as Fig. 5.27d. Though the goodness of fit is only slightly better than that of the OLS model, the real advantage is that this model will have better prediction accuracy and more realistic (unbiased) prediction errors than Eq. 5.58a. ■

(c-iii) Nonpatterned Variance in the Residuals
A third type of nonconstant residual variance is one where no pattern is discerned with respect to the regressors, which can be discrete or vary continuously. In this case, a practical approach is to look at a plot of the model residuals against the response variable, divide the range of the response variable into as many regions as seem to have different variances, and calculate the standard deviation of the residuals for each of these regions. In that sense, the general approach parallels the one adopted in case (c-ii) when dealing with replicated values with a non-constant variance; however, now, no model such as Eq. 5.58b is needed. The general approach would involve the following steps:
• First, fit an OLS model to the data.
• Next, discretize the domain of the regressor variables into a finite number of groups and determine $\varepsilon_i^2$, from which the weights $w_i$ for each of these groups can be deduced.
• Finally, perform a WLS regression to estimate the efficient model parameters.

Though this two-stage estimation approach is conceptually easy and appealing for simple models, it may become rather complex for multivariate models, and moreover, there is no guarantee that heteroscedasticity will be removed entirely.
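The three steps above can be sketched as follows. This is an illustrative outline, not the procedure of any particular statistical package: the synthetic data, the choice of four equal-count groups along x, and the helper names `ols` and `wls` are all assumptions made for the sketch. WLS is implemented by scaling each row by the square root of its weight and then solving an ordinary least-squares problem.

```python
import numpy as np

def ols(X, y):
    """Ordinary least-squares coefficients."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def wls(X, y, w):
    """Weighted least squares: scale each row by sqrt(w_i), then solve OLS."""
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

# Synthetic (hypothetical) data whose noise standard deviation grows with x
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(1.0, 10.0, n)
y = 2.0 + 1.5 * x + rng.normal(0.0, 0.3 * x)
X = np.column_stack([np.ones(n), x])

# Step 1: OLS fit and residuals
b_ols = ols(X, y)
resid = y - X @ b_ols

# Step 2: discretize x into groups and estimate the residual variance per group
n_groups = 4
edges = np.quantile(x, np.linspace(0.0, 1.0, n_groups + 1))
group = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_groups - 1)
var_g = np.array([np.mean(resid[group == g] ** 2) for g in range(n_groups)])
w = 1.0 / var_g[group]          # weight of each observation

# Step 3: WLS fit with the group-wise weights
b_wls = wls(X, y, w)
print("OLS:", b_ols, "WLS:", b_wls)
```

The number of groups is a judgment call: too few groups blur the variance pattern, while too many leave each variance estimate resting on very few residuals.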
5.6.4
Serially Correlated Residuals
Another manifestation of improper residual behavior is serial correlation. As stated earlier (Sect. 5.6.1), one should distinguish between the two different types of autocorrelation, namely pure autocorrelation and model misspecification, although it is often difficult to distinguish between them. The latter is usually addressed using the weight matrix approach (Pindyck and Rubinfeld 1981), which is fairly formal and general, but somewhat demanding. Pure autocorrelation relates to the case of “pseudo” patterned residual behavior, which arises because the regressor variables have strong serial correlation. This serial correlation behavior is subsequently transferred over to the model, and hence to its residuals, even when the regression model functional form is close to “perfect.” The remedial approach to be adopted is to transform the original data set prior to the regression itself. There are several techniques for doing so, and the widely used Cochrane-Orcutt (CO) procedure is described here. It involves the use of generalized differencing to alter the linear model into one in which the errors are independent. The two-stage first-order CO procedure involves:
(i) Fitting an OLS model to the original variables
(ii) Computing the first-order serial correlation coefficient r of the model residuals (Eq. 3.12)
(iii) Transforming the original variables y and x into a new set of pseudo-variables:

$$y_t^* = y_t - r\,y_{t-1} \quad \text{and} \quad x_t^* = x_t - r\,x_{t-1} \qquad (5.59)$$

(iv) OLS regression on the pseudo-variables y* and x* to re-estimate the parameters ($b_0^*$ and $b_1^*$) of the model
(v) Finally, obtaining the fitted regression model in the original variables by a back transformation of the pseudo regression coefficients:

$$b_0 = \frac{b_0^*}{1 - r} \quad \text{and} \quad b_1 = b_1^* \qquad (5.60)$$
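Steps (i) to (v) can be sketched in a few lines for a single regressor. This is a minimal illustration on synthetic data, not the analysis of any example in this chapter. The lag-1 estimator used for r is a common form and is assumed here to correspond to Eq. 3.12; the function names and the simulated series are hypothetical.

```python
import numpy as np

def fit_line(x, y):
    """OLS intercept and slope of y = b0 + b1*x."""
    X = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1]

def lag1_autocorr(e):
    """First-order serial correlation coefficient of a residual series."""
    e = e - e.mean()
    return np.sum(e[1:] * e[:-1]) / np.sum(e ** 2)

def durbin_watson(e):
    """Durbin-Watson statistic (values near 2 suggest no autocorrelation)."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def cochrane_orcutt(x, y):
    b0_ols, b1_ols = fit_line(x, y)                  # step (i)
    e = y - (b0_ols + b1_ols * x)
    r = lag1_autocorr(e)                             # step (ii)
    y_star = y[1:] - r * y[:-1]                      # step (iii), Eq. 5.59
    x_star = x[1:] - r * x[:-1]
    b0_star, b1_star = fit_line(x_star, y_star)      # step (iv)
    b0, b1 = b0_star / (1.0 - r), b1_star            # step (v), Eq. 5.60
    e_star = y_star - (b0_star + b1_star * x_star)
    return b0, b1, r, durbin_watson(e), durbin_watson(e_star)

# Synthetic daily series: serially correlated regressor and AR(1) errors
rng = np.random.default_rng(1)
n = 365
x = np.empty(n)
u = np.empty(n)
x[0], u[0] = 20.0, 0.0
for t in range(1, n):
    x[t] = 20.0 + 0.9 * (x[t - 1] - 20.0) + rng.normal(0.0, 2.0)
    u[t] = 0.7 * u[t - 1] + rng.normal(0.0, 1.0)
y = 5.0 + 2.0 * x + u

b0, b1, r, dw_ols, dw_co = cochrane_orcutt(x, y)
print(f"r = {r:.2f}, DW (OLS) = {dw_ols:.2f}, DW (CO) = {dw_co:.2f}")
```

Comparing the two Durbin-Watson values before and after the transformation gives a quick check on whether a single CO pass was sufficient.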
Though two estimation steps are involved, the entire process is simple to implement. This approach, when originally proposed, advocated that the process be continued until the residuals become random (say, based on the Durbin-Watson test). However, the current recommendation is that alternative estimation methods should be attempted if one iteration proves inadequate. This approach can be used during parameter estimation of MLR models provided only one of the regressor variables is the cause of the pseudo-correlation. Also, a more sophisticated version of the CO procedure has been suggested by Hildreth and Lu (Chatterjee and Price 1991) involving only one estimation process where the optimal value of r is determined along with the parameters. This, however, requires nonlinear estimation methods.

Example 5.6.4 Using the Cochrane-Orcutt (CO) procedure to remove first-order autocorrelation
Consider the case when observed pre-retrofit data of either cooling or heating energy consumption in a commercial building support a linear regression model as follows:

$$E_i = b_0 + b_1 T_i \qquad (5.61)$$

where
$T_i$ = daily average outdoor dry-bulb temperature
$E_i$ = daily total energy use predicted by the model
i = subscript representing a particular day
$b_0$ and $b_1$ are the least-squares regression coefficients

How the above transformation yields a regression model different from OLS estimation is illustrated in Fig. 5.28 with year-long daily cooling energy use from a large institutional building in central Texas. The first-order autocorrelation coefficients of cooling energy and average daily temperature were both equal to 0.92, while that of the OLS residuals was 0.60. The Durbin-Watson statistic for the OLS residuals (i.e., untransformed data) was DW = 3 indicating strong residual autocorrelation, while that of the CO transform was 1.89 indicating little or no autocorrelation. Note that the CO transform is inadequate in cases of model misspecification and/or seasonal operational changes. ■

Fig. 5.28 How serial correlation in the residuals affects model identification (Example 5.6.4)
5.6.5
Dealing with Misspecified Models
An important source of error during model identification is model misspecification error. This is unrelated to measurement error and arises when the functional form of the model is not appropriate. This can occur due to:
(i) Inclusion of irrelevant variables: Does not bias the estimation of the intercept and slope parameters, but generally reduces the efficiency of the slope parameters, that is, their variance will be larger. This source of error can be eliminated by, say, stepwise regression or simple tests such as t-tests.
(ii) Exclusion of an important variable: Will result in the slope parameters being both biased and inconsistent.
(iii) Assumption of a linear model: When a linear model is erroneously assumed.
(iv) Incorrect model order: When one assumes a lower- or higher-order model than what the data warrant.
Fig. 5.29 Improvement in residual behavior for a model of hourly energy use of a variable air volume HVAC system in a commercial building as influential regressors are incrementally added to the model. (From Katipamula et al. 1998)
The latter three sources of errors are very likely to manifest themselves in improper residual behavior (the residuals will show serial correlation or non-constant variance behavior). The residual analysis may not identify the exact cause, and several attempts at model reformulations may be required to overcome this problem. Even if the physics of the phenomenon or of the system is well understood and can be cast in mathematical terms, experimental or identifiability constraints may require that a simplified or macroscopic model be used for parameter identification rather than the detailed model. This could cause model misspecification.

Example 5.6.5 Example to illustrate how the inclusion of additional regressors can remedy improper model residual behavior
Energy use in commercial buildings accounts for about 19% of the total primary energy use in the United States and, consequently, it is a prime area of energy conservation efforts. For this purpose, the development of baseline models, that is, models of energy use for a specific end-use before energy conservation measures are implemented, is an important modeling activity for monitoring and verification studies. Let us illustrate the effect of improper selection of regressor variables or model misspecification for modeling measured thermal cooling energy use of a large commercial building operating 24 hours a day under a variable air volume HVAC system (Katipamula et al. 1998). Figure 5.29 illustrates the residual pattern when hourly energy use is modeled with only the outdoor dry-bulb temperature ($T_o$). The residual pattern is blatantly poor, exhibiting both non-constant variance and systematic bias in the low range of the x-variable. Once the outdoor dew point temperature $T_{dp}^{+}$,¹⁴ the global horizontal solar radiation ($q_{sol}$), and the internal building heat loads $q_i$ (such as lights and equipment) are introduced in the model, the residual behavior improves significantly, but the lower tail is still improper. Finally, when additional terms involving indicator variables I applied to both the intercept and $T_o$ are introduced (described in Sect. 5.7.2), acceptable residual behavior is achieved. ■
5.7
Other Useful OLS Regression Models
5.7.1
Zero-Intercept Models
Sometimes the physics of the system dictates that the regression line passes through the origin. For the linear case, the model assumes the form:

$$y = \beta_1 x + \varepsilon \qquad (5.62)$$
The interpretation of R2 under such a case is not the same as for the model with an intercept, and this statistic cannot be used to compare the two types of models directly. Recall that for linear models the R2 value indicates the percentage variation of the response variable about its mean explained by that of the regressor variable. For the no-intercept case, the R2 value relates to the percentage variation of the response variable about the origin explained by the regressor variable. Thus, when comparing both models, one should decide on which is the better model based on their RMSE values.
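The distinction between the two R2 definitions can be made concrete with a small numerical sketch. The synthetic data are hypothetical, and the RMSE is assumed here to be computed with n minus the number of fitted parameters in the denominator. The zero-intercept slope has the closed form b1 = Σxy / Σx².

```python
import numpy as np

# Hypothetical data generated from a line through the origin plus noise
rng = np.random.default_rng(2)
x = np.linspace(1.0, 10.0, 30)
y = 3.0 * x + rng.normal(0.0, 1.0, x.size)
n = x.size

# Zero-intercept model (Eq. 5.62): closed-form least-squares slope
b1_zero = np.sum(x * y) / np.sum(x ** 2)
res_zero = y - b1_zero * x

# Model with intercept, for comparison
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
res_full = y - X @ beta

# R2 about the origin (zero-intercept) vs. R2 about the mean (with intercept)
r2_origin = 1.0 - np.sum(res_zero ** 2) / np.sum(y ** 2)
r2_mean = 1.0 - np.sum(res_full ** 2) / np.sum((y - y.mean()) ** 2)

# RMSE values are directly comparable between the two models
rmse_zero = np.sqrt(np.sum(res_zero ** 2) / (n - 1))
rmse_full = np.sqrt(np.sum(res_full ** 2) / (n - 2))

print(f"R2 about origin = {r2_origin:.4f}, R2 about mean = {r2_mean:.4f}")
print(f"RMSE zero-intercept = {rmse_zero:.3f}, RMSE with intercept = {rmse_full:.3f}")
```

Because the two R2 values use different reference sums of squares, only the RMSE values in the last line provide a like-for-like comparison.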
5.7.2
Indicator Variables for Local Piecewise Models— Linear Splines
Spline functions are an important class of functions, described in numerical analysis textbooks in the framework of interpolation, which allow distinct functions to be used over different ranges while maintaining continuity in the function. They are extremely flexible functions in that they allow a wide range of locally different behavior to be captured within one elegant functional framework. In addition to interpolation, splines have been used in a regression context as well as for data smoothing (discussed below and in Sect. 9.6).
Fig. 5.30 Piece-wise linear model or first-order spline fit with hinge point at xc. Such models are referred to as change point models in building energy modeling terminology
Thus, a globally nonlinear function can be decomposed into simpler local patterns. Two common cases are discussed below.

(a) The simpler case is one where it is known which points lie on which trend, that is, when the physics of the system is such that the location of the structural break or “hinge point” $x_c$ of the regressor is known. One could represent the two regions by a piece-wise linear spline (as shown in Fig. 5.30); otherwise, the third-degree polynomial spline is often used to capture highly nonlinear trends (see Sect. 9.6.4). The objective here is to formulate a linear model and identify the parameters that best describe the data points in Fig. 5.30. One cannot simply divide the data into two regions and fit each region with a separate linear model since the two segments are unlikely to intersect at exactly the hinge point (the constraint that the model be continuous at the hinge point would be violated). A model of the following form would be acceptable:

$$y = b_0 + b_1 x + b_2 (x - x_c) I \qquad (5.63a)$$

where the indicator (also called dummy or binary) variable

$$I = \begin{cases} 1 & \text{if } x > x_c \\ 0 & \text{otherwise} \end{cases} \qquad (5.63b)$$

Hence, for the region $x \le x_c$,

$$y = b_0 + b_1 x \qquad (5.64)$$

and for the region $x > x_c$,

$$y = (b_0 - b_2 x_c) + (b_1 + b_2)\,x$$

Thus, the slope of the model is $b_1$ before the break and $(b_1 + b_2)$ afterward. The intercept term changes as well from $b_0$ before the break to $(b_0 - b_2 x_c)$ after the break. The logical extensions to linear spline models with two structural breaks or to higher order splines involving quadratic and cubic terms are fairly straightforward and treated further in Sect. 9.6.4.

¹⁴ Actually, the outdoor humidity impacts energy use only when the dew point temperature $T_{dp}$ exceeds a certain threshold which many studies have identified to be about 55°F (this is related to how the HVAC cooling coil is controlled to meet indoor occupant comfort). This conditional variable, indicated by a + superscript, is equal to ($T_{dp}$ - 55) when the term is positive, and zero otherwise.

(b) The second case arises when the change point is not known. A simple approach is to look at the data, identify a “ball-park” range for the change point, perform numerous regression fits with the data set divided according to each possible value of the change point in this ball-park range, and pick the value which yields the best overall R2 or RMSE. Alternatively, the more accurate but more complex approach is to cast the problem as a nonlinear estimation problem with the change point as one of the parameters.

Example 5.7.1 Change point models for building utility bill analysis
The theoretical basis of modeling monthly energy use in buildings is discussed in several papers (e.g., Reddy et al., 1997, 2016). The interest in this particular time scale is obvious: such information is easily obtained from utility bills, which are usually available on a monthly time scale. The models suitable for this application are similar to linear spline models and are referred to as change point models by building energy analysts. A simple example is shown below to illustrate the above equations. Electricity utility bills of a residence in Houston, TX have been normalized by the number of days in the month and assembled in Table 5.10 along with the corresponding month and monthly mean outdoor temperature values for Houston (the first three columns of the table). The intent is to use Eq. 5.63a to model this behavior. The scatter plot and the trend lines drawn in Fig. 5.31 suggest that the change point is in the range 17–19 °C. Let us
perform the calculation assuming a value of 17 °C. Defining an indicator variable:

$$I = \begin{cases} 1 & \text{if } x > 17\,^{\circ}\mathrm{C} \\ 0 & \text{otherwise} \end{cases}$$

Based on this assumption, the last two columns of the table have been generated to correspond to the two regressor variables in Eq. 5.63a. A linear multiple regression yields:

$$y = 0.1046 + 0.005904\,x + 0.00905\,(x - 17)\,I \quad \text{with } R^2 = 0.996 \text{ and RMSE} = 0.0055 \qquad (5.65)$$
with all three parameters being statistically significant. The reader can repeat this analysis assuming a different value for the change point (say $x_c$ = 18 °C) in order to study the sensitivity of the model to the choice of the change point value. Though only three parameters are determined by regression, this is an example of a four-parameter (or 4-P) model in building science terminology. The fourth parameter is the change point $x_c$, which also needs to be selected/determined. Specialized software programs have been developed to determine the optimal value of $x_c$ (i.e., that which results in minimum RMSE of different possible choices of $x_c$) following a numerical search process akin to the one described in this example. ■

Fig. 5.31 Piece-wise linear regression lines for building electric use with outdoor temperature. The change point is the point of intersection of the two lines. The combined model is called a change point model, which in this case, is a four-parameter model given by Eq. 5.65

Table 5.10 Measured monthly energy use data and calculation step for deducing the change point independent variable assuming a base value of 17°C. Data for Example 5.7.1ᵃ
Month    Mean outdoor temperature (°C)    Monthly mean daily electric use (kWh/m2/day)    x (°C)    (x - 17 °C)I (°C)
Jan      11                               0.1669                                          11        0
Feb      13                               0.1866                                          13        0
Mar      16                               0.1988                                          16        0
Apr      21                               0.2575                                          21        4
May      24                               0.3152                                          24        7
Jun      27                               0.3518                                          27        10
Jul      29                               0.3898                                          29        12
Aug      29                               0.3872                                          29        12
Sept     26                               0.3315                                          26        9
Oct      22                               0.2789                                          22        5
Nov      16                               0.2051                                          16        0
Dec      13                               0.1790                                          13        0
ᵃ Data available electronically on book website
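The regression of Example 5.7.1 and the search over candidate change points can be sketched as follows, using the data of Table 5.10. The helper name `fit_change_point`, the candidate grid from 16 to 20 °C in 0.5 °C steps, and the use of n minus 3 degrees of freedom in the RMSE are assumptions made for this illustration; they are not a description of the specialized software mentioned above.

```python
import numpy as np

# Table 5.10: monthly mean outdoor temperature (deg C) and
# monthly mean daily electric use (kWh/m2/day)
temp = np.array([11, 13, 16, 21, 24, 27, 29, 29, 26, 22, 16, 13], dtype=float)
use = np.array([0.1669, 0.1866, 0.1988, 0.2575, 0.3152, 0.3518,
                0.3898, 0.3872, 0.3315, 0.2789, 0.2051, 0.1790])

def fit_change_point(x, y, xc):
    """Fit y = b0 + b1*x + b2*(x - xc)*I (Eq. 5.63a) for a given change point."""
    hinge = np.where(x > xc, x - xc, 0.0)          # (x - xc) * I
    X = np.column_stack([np.ones(len(x)), x, hinge])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    rmse = np.sqrt(np.sum(resid ** 2) / (len(y) - X.shape[1]))
    return b, rmse

# Fit with the change point fixed at 17 deg C, as in the example
b17, rmse17 = fit_change_point(temp, use, 17.0)
print("coefficients at xc = 17:", b17, "RMSE:", rmse17)

# Simple numerical search over candidate change points (assumed grid)
candidates = np.arange(16.0, 20.5, 0.5)
best_xc, best_rmse = min(
    ((xc, fit_change_point(temp, use, xc)[1]) for xc in candidates),
    key=lambda pair: pair[1],
)
print("lowest-RMSE change point on the grid:", best_xc, "RMSE:", best_rmse)
```

With only 12 monthly observations, the RMSE may vary little across nearby candidates, so the selected change point should be read as approximate.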
5.7.3
Indicator Variables for Categorical Regressor Models

The use of indicator (also called dummy) variables has been illustrated in the previous section when dealing with spline models. Indicator variables are also used in cases when shifts in either the intercept or the slope are to be modeled, with the condition of continuity now being relaxed. Most variables encountered in mechanistic models are quantitative and continuous, that is, the variables are measured on a numerical scale. Variables which cannot be controlled are often called “covariates”. Some examples are temperature, pressure, distance, energy use, and age. Occasionally, the analyst comes across models involving qualitative or categorical variables, that is, regressor data that belong in one of two (or more) possible categories. One would like to evaluate whether differences in intercept and slope between categories are significant enough to warrant two separate models or not. This concept is illustrated by the following example.

Whether the annual energy use of regular commercial buildings is markedly higher than that of buildings certified as being energy efficient is to be determined. Data from several buildings which fall in each group are gathered to ascertain whether the presumption is supported by the actual data. Covariates that affect the normalized energy use (variable y) of both experimental groups are conditioned floor area (variable $x_1$) and outdoor temperature (variable $x_2$). Suppose that a linear relationship can be assumed with the same intercept for both groups. One approach would be to separate the data into two groups, one for regular buildings and one for efficient buildings, and develop regression models for each group separately. Subsequently, one could perform a t-test to determine whether the slope terms of the two models are significantly different or not. However, the assumption of a constant intercept term for both models may be erroneous, and this may confound the analysis.

A better approach is to use the entire data and adopt a modeling approach involving indicator variables. Let model 1 be for regular buildings:

$$y = a_1 + b_1 x_1 + c_1 x_2 \qquad (5.66a)$$

and model 2 be for energy-efficient buildings:

$$y = a_2 + b_2 x_1 + c_2 x_2 \qquad (5.66b)$$

The complete model (or model 3) would be formulated as:

$$y = a_1 + b_1 x_1 + c_1 x_2 + I\,(a_2 + b_2 x_1 + c_2 x_2) \qquad (5.67a)$$

where I is an indicator variable such that

$$I = \begin{cases} 1 & \text{for energy-efficient buildings} \\ 0 & \text{for regular buildings} \end{cases} \qquad (5.67b)$$

Note that a basic assumption in formulating this model is that all three model parameters are affected by the building group. Formally, one would like to test the null hypothesis H0: $a_2 = b_2 = c_2 = 0$. The hypothesis is tested by constructing an F-statistic for the comparison of the two models. Note that model 3 is referred to as the full model (FM) or pooled model. Model 1, when the null hypothesis holds, is the reduced model (RM). The idea is to compare the goodness-of-fit of the FM and that of the RM using both data sets combined. If the RM provides as good a fit as the FM, then the null hypothesis is valid. Let SSE(FM) and SSE(RM) be the corresponding sums of squared model residuals. Then, the following F-test statistic is defined:

$$F = \frac{[\mathrm{SSE(RM)} - \mathrm{SSE(FM)}]/(p - m)}{\mathrm{SSE(FM)}/(n - p)} \qquad (5.68)$$

where n is the number of data points, p is the number of parameters of the FM, and m is the number of parameters of the RM. If the observed F-value is larger than the tabulated value of F with (p - m) and (n - p) degrees of freedom at the prespecified significance level (provided by Table A.6), the RM is unsatisfactory, and the full model has to be retained. As a cautionary note, this test is strictly valid only if the OLS assumptions for the model residuals hold.

Example 5.7.2 Combined modeling of energy use in regular and energy-efficient buildings
Consider the data assembled in Table 5.11. Let us designate the regular buildings by group (A) and the energy-efficient buildings by group (B), with the problem simplified by assuming both types of buildings to be located in the same geographic location. Hence, the model has only one regressor variable involving floor area. The complete model with the indicator variable term given by Eq. 5.67a is used to verify whether group B buildings consume less energy than group A buildings.
Table 5.11 Data table for Example 5.7.2ᵃ
Energy use (y)    Floor area (x1)    Bldg type    Energy use (y)    Floor area (x1)    Bldg type
45.44             225                A            32.13             224                B
42.03             200                A            35.47             251                B
50.1              250                A            33.49             232                B
48.75             245                A            32.29             216                B
47.92             235                A            33.5              224                B
47.79             237                A            31.23             212                B
52.26             265                A            37.52             248                B
50.52             259                A            37.13             260                B
45.58             221                A            34.7              243                B
44.78             218                A            33.92             238                B
ᵃ Data available electronically on book website
The full model (FM) given by Eq. 5.67a reduces to the following form, since only one regressor is involved and only a shift in the intercept between the two groups is retained: $y = a + b_1 x_1 + b_2 I$, where I is an indicator variable equal to 0 for group A and 1 for group B. The null hypothesis is H0: $b_2 = 0$. The reduced model (RM) is $y = a + b\,x_1$; it is identified using the entire data set without distinguishing between the building types. The estimated full and reduced models are:

$$\text{FM: } y = 14.2762 + 0.14115\,x_1 - 13.2802\,I \qquad \text{RM: } y = 5.7768 + 0.1491\,x_1 \qquad (5.69)$$

The analysis of variance results in SSE(FM) = 7.7943 and SSE(RM) = 889.245. The F-statistic in this case is:

$$F = \frac{(889.245 - 7.7943)/1}{7.7943/(20 - 3)} = 1922.5$$

One can thus safely reject the null hypothesis and state with confidence that buildings built as energy-efficient ones consume statistically less energy than regular buildings. ■
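The comparison of the reduced and full models can be reproduced numerically from the data of Table 5.11. The sketch below assumes the full model contains an intercept shift between the two groups (one indicator term), which is the form consistent with the reported coefficients; the helper name `sse` is hypothetical.

```python
import numpy as np

# Table 5.11: energy use (y) and floor area (x1) for groups A and B
y_a = [45.44, 42.03, 50.1, 48.75, 47.92, 47.79, 52.26, 50.52, 45.58, 44.78]
x_a = [225, 200, 250, 245, 235, 237, 265, 259, 221, 218]
y_b = [32.13, 35.47, 33.49, 32.29, 33.5, 31.23, 37.52, 37.13, 34.7, 33.92]
x_b = [224, 251, 232, 216, 224, 212, 248, 260, 243, 238]

y = np.array(y_a + y_b, dtype=float)
x1 = np.array(x_a + x_b, dtype=float)
ind = np.array([0.0] * len(y_a) + [1.0] * len(y_b))   # I = 0 (A), 1 (B)
n = y.size

def sse(X, y):
    """Least-squares fit; returns the coefficients and the sum of squared residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(np.sum(resid ** 2))

# Reduced model: y = a + b*x1 (m = 2 parameters)
beta_rm, sse_rm = sse(np.column_stack([np.ones(n), x1]), y)
# Full model: y = a + b1*x1 + b2*I (p = 3 parameters)
beta_fm, sse_fm = sse(np.column_stack([np.ones(n), x1, ind]), y)

p, m = 3, 2
F = ((sse_rm - sse_fm) / (p - m)) / (sse_fm / (n - p))   # Eq. 5.68
print("RM coefficients:", beta_rm, "SSE(RM):", sse_rm)
print("FM coefficients:", beta_fm, "SSE(FM):", sse_fm)
print("F-statistic:", F)
```

The computed F-statistic would then be compared with the tabulated F value for (p - m) and (n - p) degrees of freedom at the chosen significance level.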
5.8
Resampling Methods Applied to Regression
5.8.1
Basic Approach
The fundamental tasks in regression involve model selection, that is, identifying a suitable model and estimating the values and the uncertainty intervals of the model parameters, and model assessment, that is, deducing the predictive accuracy of the model for subsequent use. The classical OLS equations presented in Sects. 5.3 and 5.4 can be used for this purpose in conjunction with model residual analysis along with the Durbin-Watson statistic (Sect. 5.6.1) to guard against the dangers of model under-fitting and over-fitting. However, it should be recognized that the data set used is simply a sample of a much larger set of population data characterizing the behavior of the stochastic system under study. In that sense, the model selection and evaluation results are somewhat limited because they do not make full use of the variability inherent in samples drawn from a population. These limitations can be overcome by resampling methods (see Sect. 4.8), which offer distinct advantages in terms of better accuracy, robustness, versatility, and intuitive appeal in the context of regression modeling as well. They are widely regarded as being able to better perform the fundamental tasks involved in regression, and as a result are becoming increasingly popular.

Recall from Sect. 4.8.3 that the basic rationale behind resampling methods is to draw one single sample or experimental data set, treat this original sample as a surrogate for the population, and generate numerous sub-samples by simply resampling the sample itself. Thus, resampling refers to the use of given data, or a data generating mechanism, to produce new samples from which the required estimates can be deduced numerically. Note, however, that resampling methods cannot overcome some of the limitations inherent in the original sample. The resampled samples are to the sample what the sample is to the population. Hence, if the sample does not adequately cover the spatial range or if the sample is not truly random, then the resampling results will be inaccurate as well.
5.8.2
Jackknife and k-Fold Cross-Validation
In most practical situations, it is misleading to use the entire data set to identify the regression model and report the resulting RMSE as the predictive accuracy of the model. This is due to the possibility of model overfitting and associated under-estimation of the predictive RMSE error.
Overfitting refers to the situation in which the regression model is able to capture the training set with high accuracy but is poor at predicting new data. In other words, the model is over-trained on features specific to the training set, which may differ for new data. A better but somewhat empirical approach is to randomly partition the data set into two samples (say, in the proportion of 80/20), use the 80% portion of the data to train or develop the model, calculate the internal predictive error (say, the RMSE following Eq. 5.8), use the 20% portion of the data as the validation data set to predict or test the y values using the already identified model, and finally calculate the test or external or simulation error magnitude. The competing models can then be compared, and a selection made based on both the internal and external predictive errors pertinent to the training and testing data sets, respectively. The test error indices will generally be greater than the training errors; larger discrepancies are suggestive of greater over-fitting, and vice versa. This general approach is the basis of the two most common resampling methods discussed below.

The jackknife method and its more recent version, the cross-validation method, were described in Sect. 4.8.3. The latter method of model evaluation, also referred to as holdout sample validation, can avoid model over-fitting. With the advent of powerful computers, a variant, namely the “k-fold cross-validation” method, has become popular. It involves:
(i) Dividing the random sample of n observations into k groups of equal size.
(ii) Omitting one group at a time and performing the regression with the other (k-1) groups.
(iii) Determining and saving the internal or modeling errors as well as the external or simulation or predictive errors (say, in terms of the RMSE values) of both sets of sub-samples.
(iv) Using the saved parameter values to deduce the mean and uncertainty intervals for the model parameters and computing the mean and confidence levels of the modeling and simulation prediction errors.

The model parameters and the test values determined in the last step are likely to be less biased, much more robust, and more representative of the actual model behavior than those obtained using classical methods.
Note, however, that though the same equations are used to compute the RMSE indices, the degrees of freedom (d.f.) are different. Let n be the total number of data points. Then d.f. = (k-1)(n-p)/k while computing the internal errors for model building or selection, and d.f. = n/k while computing the RMSE of the external predictive errors. There is a trade-off between high bias error and high variance in the choice of the number k. It is recommended that k be selected as either k = 5 or k = 10 (James et al. 2013), colloquially referred to as the “magic” numbers of folds.

Example 5.8.1 k-fold cross-validation
Consider Example 5.3.1, which involved fitting a simple OLS regression with 33 data observations of solids reduction (x-variable) and oxygen demand (y-variable). The regression analysis will be redone to illustrate the insights provided by k-fold cross-validation; k = 3 has been assumed in this simplified illustration. The data set has first been randomized to remove the monotonic increase in the regressor variable and broken up into 3 sub-sets of 11 observations each. Three data samples with different combinations of the sub-sets involving two sub-sets (i.e., 22 data points) can now be created. The analysis is performed with the 22 data points used for training, that is, to identify an OLS regression model, and the remaining sub-set of 11 data points used for testing, that is, to compute the simulation or external prediction error. The results are summarized in Table 5.12. One notes that the model parameters vary from one run to another, indicating their random variability as different samples of data are selected. If the k-fold analysis were done with a greater number of folds (say k = 5), one could have deduced the variance of these estimates, which would most likely be less biased and more accurate than those determined from classical methods.
The final model determined from k-fold regression analysis is the same as the one using all data points, while the estimate of the predictive error (as indicated by the test RMSE = 3.447) is the average of the RMSE values from the three samples. Thus, the extra effort of creating k-fold samples, identifying regression models, and calculating the prediction errors was simply to get a better estimate of the prediction RMSE. It is a more representative value which, as stated earlier, is usually greater than the training or internal RMSE.

Table 5.12 Summary of the OLS regression analysis results using threefold cross-validation with data from Example 5.3.1

| Data | OLS model | R2 | Training RMSE | Test RMSE |
|---|---|---|---|---|
| All-data (Example 5.3.1) | y = 3.830 + 0.9036 x | 0.913 | 3.229 | – |
| Threefold sample 1 | y = 1.797 + 0.9734 x | 0.933 | 3.176 | 3.628 |
| Threefold sample 2 | y = 3.962 + 0.8941 x | 0.949 | 3.442 | 2.826 |
| Threefold sample 3 | y = 6.079 + 0.8339 x | 0.911 | 2.980 | 3.886 |
| Final model | y = 3.830 + 0.9036 x | 0.913 | 3.229 | 3.447 |
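The k-fold procedure of Example 5.8.1 can be sketched in a few lines of Python. This is only an illustrative sketch: the data are synthetic (not the 33 observations of Example 5.3.1), the function name `kfold_ols` is hypothetical, and p is assumed to denote the number of model parameters (p = 2 for a straight line) in the degrees-of-freedom expressions quoted in the text.

```python
import numpy as np

def kfold_ols(x, y, k=3, seed=0):
    """k-fold cross-validation of the straight-line model y = b0 + b1*x.

    Returns one row per fold: (b0, b1, training RMSE, test RMSE).
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    n, p = len(x), 2                      # p: number of model parameters (assumed meaning)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)   # randomize, then split into k sub-sets
    rows = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        b1, b0 = np.polyfit(x[train], y[train], 1)  # OLS fit on the k-1 training sub-sets
        sse_train = np.sum((y[train] - (b0 + b1 * x[train])) ** 2)
        sse_test = np.sum((y[test] - (b0 + b1 * x[test])) ** 2)
        # degrees of freedom as stated in the text for internal and external errors
        rmse_train = np.sqrt(sse_train / ((k - 1) * (n - p) / k))
        rmse_test = np.sqrt(sse_test / (n / k))
        rows.append((b0, b1, rmse_train, rmse_test))
    return np.array(rows)

# Synthetic illustration data (hypothetical values)
rng = np.random.default_rng(1)
x_demo = np.linspace(3, 50, 33)
y_demo = 4.0 + 0.9 * x_demo + rng.normal(0, 3, 33)

cv = kfold_ols(x_demo, y_demo, k=3)
cv_test_rmse = cv[:, 3].mean()   # reported predictive RMSE = average over the folds
```

As in Table 5.12, the final model would still be fitted to all the data; the folds serve only to estimate the predictive RMSE and to reveal how much the coefficients vary from sample to sample.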
5.8.3 Bootstrap Method

Recall that in Sect. 4.8.3, the use of the bootstrap method (one of the most powerful and popular methods currently in use) was illustrated to infer variance and confidence intervals of parametric statistical measures in a univariate context, and also in a situation involving a nonparametric approach, where the correlation coefficient between two variables was to be deduced. Bootstrap is a statistical method where random resampling with replacement is done repeatedly from an original or initial sample, and then each bootstrapped sample is used to compute a statistic (such as the mean, median, or the interquartile range). The resulting empirical distribution of the statistic is then examined and interpreted as an approximation to the true sampling distribution. Thus, bootstrapping is an ensemble training method, which "bags" the results from numerous data sets into an ensemble average. It is often used as a robust nonparametric alternative to inference-type problems when parametric assumptions are in doubt (e.g., knowledge of the probability distribution of the errors), or where parametric inference is impossible or requires very complicated formulas for the calculation of variance.

Say one has a data set of multivariate observations: zi = {yi, x1i, x2i, . . .} with i = 1, . . ., n (this can be viewed as a sample with n observations in the bootstrap context taken from a population of possible observations). One distinguishes between two approaches:

(i) Case resampling, where the predictors and response observations i are random and change from sample to sample. One selects a certain number of bootstrap sub-samples (say 1000) from zi, fits the model and saves the model coefficients from each bootstrap sample. The generation of the confidence intervals for the regression coefficients is now similar to the univariate situation and is quite straightforward (Sects. 5.3.3 and 5.3.4). One of the benefits is that the correlation structure between the regressors is maintained;

(ii) Model-based resampling or fixed-X resampling, where the regressor data structure is already imposed or known with confidence. Here, the basic idea is to generate or resample the model residuals and not the observations themselves. This preserves the stochastic nature of the model structure and so the variance is better representative of the model's own assumption. The implementation involves attaching a random error to each yi, and thereby producing a fixed-X bootstrap sample. The errors could be generated: (a) parametrically from a normal distribution with zero mean and variance equal to the estimated error variance in the regression if normal errors can be assumed (this is analogous to the concept behind the Monte Carlo approach), or (b) nonparametrically, by resampling residuals from the original regression. One would then regress the bootstrapped values of the response variable on the fixed X matrix to obtain bootstrap replications of the regression coefficients. This approach is often adopted with data from designed experiments.

The reader can refer to Efron and Tibshirani (1985), Davison and Hinkley (1997) and other more advanced papers such as Freedman and Peters (1984) for a more complete treatment.
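The two bootstrap approaches described above, case resampling and fixed-X (residual) resampling, can be contrasted with a short sketch. It assumes a straight-line model, synthetic data, and percentile confidence intervals; the function name `bootstrap_ols` and all numerical values are hypothetical and are not taken from the book.

```python
import numpy as np

def bootstrap_ols(x, y, n_boot=1000, method="case", seed=0):
    """Bootstrap replications of the OLS coefficients of y = b0 + b1*x.

    method="case": resample (x, y) pairs together (case resampling).
    method="residual": keep X fixed and resample the OLS residuals
    (nonparametric fixed-X resampling).
    """
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    X = np.column_stack([np.ones(n), x])
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # fit to the original sample
    resid = y - X @ beta_hat
    coefs = np.empty((n_boot, 2))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)              # n draws with replacement
        if method == "case":
            Xb, yb = X[idx], y[idx]
        else:
            Xb, yb = X, X @ beta_hat + resid[idx]
        coefs[b] = np.linalg.lstsq(Xb, yb, rcond=None)[0]
    return beta_hat, coefs

# Synthetic illustration data (hypothetical values)
rng = np.random.default_rng(1)
x_demo = np.linspace(3, 50, 33)
y_demo = 4.0 + 0.9 * x_demo + rng.normal(0, 3, 33)

beta_hat, coefs_case = bootstrap_ols(x_demo, y_demo, method="case")
_, coefs_resid = bootstrap_ols(x_demo, y_demo, method="residual")

# 95% percentile intervals; rows = (lower, upper), columns = (intercept, slope)
ci_case = np.percentile(coefs_case, [2.5, 97.5], axis=0)
ci_resid = np.percentile(coefs_resid, [2.5, 97.5], axis=0)
```

The parametric variant of fixed-X resampling would replace `resid[idx]` with normal draws whose variance equals the estimated error variance of the regression.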
5.9 Case Study Example: Effect of Refrigerant Additive on Chiller Performance (footnote 15)

Footnote 15: The monitored data were provided by Ken Gillespie, for which we are grateful.

The objective of this analysis is to verify the claim of a company that had developed a refrigerant additive to improve chiller COP. The performance of a chiller before (called pre-retrofit period) and after (called post-retrofit period) addition of this additive was monitored for several months to determine whether the additive results in an improvement in chiller performance, and if so, by how much. The same four variables described in Example 5.4.3, namely two temperatures (Tcho and Tcdi), the chiller thermal cooling load (Qch), and the electrical power consumed (Pcomp), were measured at intervals of 15 min. Note that the chiller COP can be deduced from the last two variables. Altogether, there were 4607 and 5078 data points for the pre- and post-periods, respectively.

Step 1: Perform Exploratory Data Analysis
At the outset, an exploratory data analysis should be performed to determine the spread of the variables and their occurrence frequencies during the pre- and post-periods, that is, before and after the addition of the refrigerant additive. Further, it is important to ascertain whether the operating conditions during both periods are similar or not. The eight frames in Fig. 5.32 summarize the spread and frequency of the important variables. The chiller outlet water temperature ranges are similar during both periods. However, the condenser water temperature and the chiller cooling load show much larger variability during the post-period. Finally, the cooling load and power consumed are noticeably lower during the post-period. The histogram of Fig. 5.33 suggests that COPpost > COPpre. An ANOVA test with results shown in Table 5.13 and Fig. 5.34 also indicates that the mean of post-retrofit power use is statistically different at 95% CL as compared to the pre-retrofit power.

Fig. 5.32 Histograms (eight panels: Tcho, Tcdi, Qch, and Pcomp, each for the pre- and post-periods) depicting the range of variation and frequency of the four important variables before and after the retrofit (pre = 4607 data points, post = 5078 data points). The condenser water temperature and the chiller cooling load show much larger variability during the post-period. The cooling load and power consumed are noticeably lower during the post-period

t-test to Compare Means
Null hypothesis: mean (COPpost) = mean (COPpre)
Alternative hypothesis: mean (COPpost) ≠ mean (COPpre)
Assuming equal variances: t = 38.8828, p-value = 0.0

The null hypothesis is rejected at α = 0.05. Of particular interest is the CI for the difference between the means, which extends from 0.678 to 0.750. Since the interval does not contain the value 0.0, there is a statistically significant difference between the means of the two samples at 95.0% CL. However, it would be incorrect to infer that COPpost > COPpre since the operating conditions are different, and thus one should not use the t-test to draw any conclusions. Hence, a regression model-based approach is warranted.

Table 5.13 Results of the ANOVA test of comparison of means at a significance level of 0.05

| Quantity | Estimate |
|---|---|
| 95.0% CI for mean of COPpost | 8.573 ± 0.03142 = [8.542, 8.605] |
| 95.0% CI for mean of COPpre | 7.859 ± 0.01512 = [7.844, 7.874] |
| 95.0% CI for the difference between the means assuming equal variances | 0.714 ± 0.03599 = [0.678, 0.750] |

Fig. 5.33 Histogram plots of the coefficient of performance (COP) of the chiller before and after the retrofit. Clearly, there are several instances when COPpost > COPpre, but that could be due to operating conditions. Hence, a regression modeling approach is clearly warranted

Fig. 5.34 ANOVA test results in the form of box-and-whisker plots for chiller COP before and after addition of refrigerant additive

Step 2: Use the Entire Pre-retrofit Data to Identify a Model
The GN chiller models (Gordon and Ng 2000) are described in Sect. 10.2.3. The monitored data are first used to compute the transformed variables y, x1, x2, and x3 (temperatures are in Kelvin and cooling load and power in kW) of the model given by Eq. 10.14b. Then, a linear regression is performed using Eq. 10.14a, which is given below with the standard errors of the coefficients shown within parentheses:

$$ y = \underset{(0.00163)}{-0.00187}\,x_1 + \underset{(15.925)}{261.2885}\,x_2 + \underset{(0.000111)}{0.022461}\,x_3 \qquad \text{with adjusted } R^2 = 0.998 \qquad (5.70) $$

This model is then re-transformed into a model for power using Eq. 10.15, and the error statistics using the pre-retrofit data are found to be: RMSE = 9.36 kW and CV = 2.24%. Figure 5.35 shows the x–y plot from which one can visually evaluate the goodness of fit of the model. Note that the mean power use is 418.7 kW while the mean of the model residuals is 0.017 kW (very close to zero, as it should be; this step validates that the spreadsheet cells have been coded correctly with the right formulas).

Step 3: Calculate Savings in Electrical Power
The above chiller model, representative of the thermal performance of the chiller without refrigerant additive, is used to estimate savings by first predicting power consumption for each 15 min interval using the two operating temperatures and the load corresponding to the 5078 post-retrofit data points. Subsequently, savings in chiller power are deduced for each of the 5078 data points:

$$ \text{Power savings} = \text{Model predicted}_{\text{pre-retrofit}} - \text{Measured}_{\text{post-retrofit}} \qquad (5.71) $$

It is found that the mean power savings are -21.0 kW (i.e., an increase in power use), against a measured mean power use of 287.5 kW. Overlooking the few outliers, one can detect two distinct patterns from the x–y plot of Fig. 5.36: (i) the lower range of data (for chiller power < 300 kW or so), where the differences between model-predicted and post-period measurements are minor (or nil), and (ii) the higher range of data, for which post-retrofit electric power usage was higher than that of the model identified from pre-retrofit data. This is the cause of the negative power savings determined above. The reason for the onset of two distinct patterns in operation is worthy of a subsequent investigation.

Step 4: Calculate Uncertainty in Savings and Draw Conclusions
The uncertainty arises from two sources: prediction model and power measurement errors. The latter are usually small, about 0.1% of the reading, which in this particular case is less than 1 kW. Hence, this contribution can be neglected during an initial investigation such as this one. The model uncertainty is given by:

$$ \text{absolute uncertainty in power use savings or reduction} = (t\text{-value} \times \text{RMSE}) \qquad (5.72) $$

The t-value at 90% CL = 1.65 and the RMSE of the model (for the pre-retrofit period) = 9.36 kW. Hence, the calculated power savings due to the refrigerant additive = -21.0 kW ± 15.44 kW at 90% CL, that is, an increase in power use. Thus, one would conclude that the refrigerant additive is actually penalizing chiller performance by 7.88% since electric power use has increased.
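The arithmetic of Steps 3 and 4 can be retraced with the numbers quoted in the text. The sketch below is an interpretation rather than the authors' calculation: in particular, it assumes that the 7.88% penalty is expressed relative to the mean model-predicted power, which follows from Eq. 5.71 as the measured mean plus the (negative) mean savings. The variable names are hypothetical.

```python
# Values quoted in the case study
t_value = 1.65                 # t-value at 90% CL
rmse_kw = 9.36                 # RMSE of the pre-retrofit model, kW
mean_savings_kw = -21.0        # mean of (model predicted - measured), kW
mean_measured_post_kw = 287.5  # mean measured post-retrofit power, kW

# Eq. 5.72: absolute uncertainty in the savings
uncertainty_kw = t_value * rmse_kw

# From Eq. 5.71, mean predicted power = mean measured power + mean savings
mean_predicted_kw = mean_measured_post_kw + mean_savings_kw

# Assumed basis for the percentage: increase relative to the model prediction
pct_increase = -mean_savings_kw / mean_predicted_kw * 100.0

# The increase is distinguishable from zero if it exceeds the uncertainty band
significant = abs(mean_savings_kw) > uncertainty_kw

print(f"savings = {mean_savings_kw:.1f} kW +/- {uncertainty_kw:.2f} kW")
print(f"power increase = {pct_increase:.2f}% of predicted mean power")
```

Under this reading the uncertainty band of about 15.4 kW is smaller than the 21.0 kW increase, which is consistent with the conclusion drawn in Step 4.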
Fig. 5.35 Measured vs modeled plot of chiller power during pre-retrofit period. The overall fit is excellent (RMSE = 9.36 kW and CV = 2.24%), and except for a few data points, the data seem well behaved. Total number of data points = 4607

Fig. 5.36 Difference in post-period measured vs pre-retrofit model predicted data of chiller power indicating that post-retrofit values are higher than those during pre-retrofit period (mean increase = 21 kW or 7.88%). One can clearly distinguish two operating patterns in the data suggesting some intrinsic behavioral change in chiller operation. Entire data set for the post-period consisting of 5078 observations has been used in this analysis

5.10 Parting Comments on Regression Analysis and OLS
Recall that OLS regression is an important sub-class of regression analysis methods. The approach adopted in OLS was to minimize an objective function (also referred to as the loss function) expressed as the sum of the squared residuals (given by Eq. 5.3). One was able to derive closed form solutions for the model parameters and their variance under certain simplifying assumptions as to how noise corrupts measured system performance. Such closed form solutions cannot be obtained for many situations where the function to be minimized has to be framed differently, and these require the adoption of search methods. Thus, parameter estimation problems are, in essence, optimization problems where the objective function is framed in accordance with what one knows about the errors in the measurements and in the model structure. OLS yields the best linear unbiased estimates provided the conditions (called the Gauss-Markov conditions) stated in Sect. 5.5.1 are met. These conditions, along with the stipulation that an additive linear relationship exists between the response and regressor variables, can be summarized as:
• Data are random and normally distributed, that is, the expected value of model residuals/errors is zero: E{εi} = 0, i = 1, . . ., N.
• Model residuals {ε1, . . ., εN} and regressors {x1, . . ., xN} are independent.
• Model residuals are uncorrelated with one another: cov{εi, εj} = 0, i, j = 1, . . ., N, i ≠ j.
• Residuals have constant variance: var{εi} = σ², i = 1, . . ., N. (5.73)
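The conditions listed above can be screened numerically once a model has been fitted. The following sketch computes a few simple residual diagnostics; the function name `residual_checks` and the choice of diagnostics are illustrative assumptions, not a procedure prescribed by the text. The Durbin-Watson statistic is the standard ratio of the sum of squared successive residual differences to the sum of squared residuals.

```python
import numpy as np

def residual_checks(x, resid):
    """Simple screening of OLS residuals (residuals assumed ordered by x or time)."""
    x, e = np.asarray(x, float), np.asarray(resid, float)
    half = len(e) // 2
    return {
        # zero-mean condition: should be close to 0
        "mean": e.mean(),
        # residuals vs regressor: correlation should be close to 0
        "corr_with_x": np.corrcoef(x, e)[0, 1],
        # serial correlation: Durbin-Watson close to 2 suggests none
        "durbin_watson": np.sum(np.diff(e) ** 2) / np.sum(e ** 2),
        # crude constant-variance check: variance of second half / first half
        "var_ratio": e[half:].var(ddof=1) / e[:half].var(ddof=1),
    }

# Tiny hypothetical example: alternating residuals
checks = residual_checks([1, 2, 3, 4], [1.0, -1.0, 1.0, -1.0])
```

Formal tests and residual plots remain preferable; these numbers only flag cases that deserve a closer look.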
Further, OLS applies when the measurement errors in the regressors are small compared to that of the response and when the response variable is normally distributed. These conditions are often not met in practice. This has led to the development of a unified approach called generalized linear models (GLM) which is treated in Sect. 9.4.
Problems

Pr. 5.1 Table 5.14 lists various properties of saturated water in the temperature range 0–100°C.
(a) Investigate first-order and second-order polynomials that fit saturated vapor enthalpy to temperature in °C. Identify the better model by looking at R2, RMSE, and CV values for both models. Predict the value of saturated vapor enthalpy at 30°C along with 95% CI and 95% prediction intervals.
(b) Repeat the above analysis for specific volume but investigate third-order polynomial fits as well. Predict the value of specific volume at 30°C along with 95% CI and 95% prediction intervals.
(c) Calculate the skill factors for the second and third-order models with the first order as the baseline model.

Pr. 5.2 Regression of home size versus monthly energy use
It is natural to expect that monthly energy use in a home increases with the size of the home. Table 5.15 assembles 10 data points of home size (in square feet) versus energy use (kWh/month). You will analyze these data as follows:
(a) Plot the data as is and visually determine linear or polynomial trends in the data. Perform the regression and report the model goodness-of-fit and model parameter values using (i) all 10 points, and (ii) a bootstrap analysis with 10 samples. Compare the results of both methods and draw pertinent conclusions.
(b) Repeat the above analysis taking the logarithm of (home area) versus energy use.
(c) Which of these two models would you recommend for future use? Provide justification.

Pr. 5.3 Tensile tests on a steel specimen yielded the results shown in Table 5.16.
(a) Assuming the regression of y on x to be linear, estimate the parameters of the regression line and determine the 95% CI for x = 4.5.
(b) Now regress x on y and estimate the parameters of the regression line. For the same value of y predicted in (a) above, determine the value of x. Compare this value with the value of 4.5 assumed in (a). If different, discuss why.
(c) Compare the R2 and CV values of both models. Discuss any differences with the results of part (b).
(d) Plot the residuals of both models and identify the preferable one for OLS.
(e) Repeat the analysis using bootstrapping and compare the model parameter estimates with those of (a) and (b) above.

Pr. 5.4 The yield of a chemical process was measured at three temperatures (in °C), each with two concentrations of a particular reactant, as recorded in Table 5.17.
(a) Use OLS to find the best values of the coefficients a, b, and c assuming the equation: y = a + b·t + c·x.
(b) Calculate the R2, RMSE, and CV of the overall model as well as the SE of the parameters.
(c) Using the β coefficient concept described in Sect. 5.4.5, determine the relative importance of the two independent variables on the yield.
Table 5.14 Data table for Problem 5.1

| Temperature t (°C) | Specific volume v (m^3/kg) | Sat. vapor enthalpy (kJ/kg) |
|---|---|---|
| 0 | 206.3 | 2501.6 |
| 10 | 106.4 | 2519.9 |
| 20 | 57.84 | 2538.2 |
| 30 | 32.93 | 2556.4 |
| 40 | 19.55 | 2574.4 |
| 50 | 12.05 | 2592.2 |
| 60 | 7.679 | 2609.7 |
| 70 | 5.046 | 2626.9 |
| 80 | 3.409 | 2643.8 |
| 90 | 2.361 | 2660.1 |
| 100 | 1.673 | 2676 |
Table 5.15 Data table for Problem 5.2

| Home area (sq. ft) | Energy use (kWh/mo) |
|---|---|
| 1290 | 1182 |
| 1350 | 1172 |
| 1470 | 1264 |
| 1600 | 1493 |
| 1710 | 1571 |
| 1840 | 1711 |
| 1980 | 1804 |
| 2230 | 1840 |
| 2400 | 1956 |
| 2930 | 1954 |

Table 5.16 Data table for Problem 5.3

| Tensile force x | Elongation y |
|---|---|
| 1 | 15 |
| 2 | 35 |
| 3 | 41 |
| 4 | 63 |
| 5 | 77 |
| 6 | 84 |

Table 5.17 Data table for Problem 5.4

| Temperature, t | Concentration, x | Yield y |
|---|---|---|
| 40 | 0.2 | 38 |
| 40 | 0.4 | 42 |
| 50 | 0.2 | 41 |
| 50 | 0.4 | 46 |
| 60 | 0.2 | 46 |
| 60 | 0.4 | 49 |
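Table 5.18 Data table for Problem 5.5 (data available electronically on book website)

| LF (%) | CCoal (cents per million Btu) | CEle (mills/kWh) |
|---|---|---|
| 85 | 15 | 4.1 |
| 80 | 17 | 4.5 |
| 70 | 27 | 5.6 |
| 74 | 23 | 5.1 |
| 67 | 20 | 5.0 |
| 87 | 29 | 5.2 |
| 78 | 25 | 5.3 |
| 73 | 14 | 4.3 |
| 72 | 26 | 5.8 |
| 69 | 29 | 5.7 |
| 82 | 24 | 4.9 |
| 89 | 23 | 4.8 |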
Table 5.19 Data table of outlet water temperature Tco (°C) for Problem 5.6 (data available electronically on book website). Columns are ambient wet-bulb temperature Twb (°C)

| Range R (°C) | Twb = 20 | Twb = 21.5 | Twb = 23 | Twb = 23.5 | Twb = 26 |
|---|---|---|---|---|---|
| 10 | 25.54 | 26.47 | 27.31 | 27.29 | 29.08 |
| 13 | 26.51 | 27.30 | 27.69 | 28.18 | 29.84 |
| 16 | 27.34 | 27.86 | 28.14 | 29.16 | 29.88 |
| 19 | 27.68 | 28.40 | 29.24 | 29.29 | 30.98 |
| 22 | 27.89 | 29.15 | 29.19 | 29.34 | 30.83 |
Pr. 5.5 Cost of electric power generation versus load factor and cost of coal
The cost to an electric utility of producing power (CEle) in mills per kilowatt-hr (10^-3 $/kWh) is a function of the load factor (LF) in % and the cost of coal (CCoal) in cents per million Btu. Relevant data are assembled in Table 5.18.
(a) Investigate different models (first order and second order with and without interaction terms) and identify the best model for predicting CEle vs LF and CCoal. Use stepwise regression if appropriate. (Hint: plot the data and look for trends first).
(b) Perform residual analysis.
(c) Calculate the R2, RMSE, CV, and DW of the overall model as well as the SE of the parameters. Is DW relevant?
(d) Repeat the analysis using bootstrapping and compare the model parameter estimates with those of the best model identified earlier.

Pr. 5.6 Modeling of cooling tower performance
Manufacturers of cooling towers often present catalog data showing outlet-water temperature Tco as a function of ambient air wet-bulb temperature (Twb) and range R (which is the difference between inlet and outlet water temperatures). Table 5.19 assembles data for a specific cooling tower.
(a) Identify an appropriate model (investigate first-order linear and second-order polynomial models without and with interaction terms for Tco) by looking at R2, RMSE, and CV values, the individual t-values of the parameters as well as the behavior of the overall model residuals.
(b) Calculate the skill factor of the final model compared to the baseline model (assumed to be the first-order baseline model without interaction effects).
(c) Repeat the analysis for the best model using k-fold cross-validation with k = 5.
(d) Summarize the additional insights which the k-fold analysis has provided.

Pr. 5.7 Steady-state performance testing of solar thermal flat plate collector
Solar thermal collectors are devices that convert the radiant energy from the sun into useful thermal energy that goes to heating, say, water for domestic or for industrial applications. Because of low collector time constants, heat capacity effects are usually small compared to the hourly time step used to drive the model. The steady-state useful energy qC delivered by a solar flat-plate collector of surface area AC is given by the Hottel-Whillier-Bliss equation (see any textbook on solar energy thermal collectors, e.g., Reddy 1987):

$$ q_C = A_C F_R \left[ I_T \eta_n - U_L (T_{Ci} - T_a) \right]^{+} \qquad (5.74) $$

where FR is called the heat removal factor and is a measure of the solar collector performance as a heat exchanger (since it can be interpreted as the ratio of actual heat transfer to the maximum possible heat transfer); ηn is the optical efficiency or the product of the transmittance and absorptance of the cover and absorber of the collector at normal solar incidence; UL is the overall heat loss coefficient of the collector, which is dependent on collector design only; IT is the radiation intensity on the plane of the collector; TCi is the temperature of the fluid entering the collector; and Ta is the ambient temperature. The + sign denotes that only positive values are to be used, which physically implies that the collector should not be operated if qC is negative, that is, when the collector loses more heat than it can collect (which can happen under low radiation and high TCi conditions).

Steady-state collector testing is the best manner for a manufacturer to rate his product. From an overall heat balance on the collector fluid and from Eq. 5.74, the expressions for the instantaneous collector efficiency ηC under normal solar incidence are:

$$ \eta_C = \frac{q_C}{A_C I_T} = \frac{(m c_p)_C \,(T_{Co} - T_{Ci})}{A_C I_T} \qquad (5.75a) $$

$$ \eta_C = F_R \eta_n - F_R U_L \,\frac{T_{Ci} - T_a}{I_T} \qquad (5.75b) $$

where mC is the total fluid flow rate through the collectors, cpC is the specific heat of the fluid flowing through the collector, and TCi and TCo are the inlet and exit temperatures of the fluid to the collector. Thus, measurements (of course done as per the standard protocol, ASHRAE 1978) of IT, TCi and TCo are done under a pre-specified and controlled value of fluid flow rate from which ηC can be calculated using Eq. 5.75a. The test data are then plotted as ηC against reduced temperature [(TCi - Ta)/IT] as shown in Fig. 5.37. A linear fit is made to these data points by regression using Eq. 5.75b, from which the values of FRηn and FRUL are deduced. If the same collector is tested on different days, slightly different numerical values are obtained for the two parameters FRηn and FRUL, which are often, but not always, within the uncertainty bands of the estimates. Model misspecification (i.e., the model is not perfect, which can occur, for example, when the collector heat losses are not strictly linear) is partly the cause of such variability. This is somewhat disconcerting to a manufacturer since this introduces ambiguity as to which values of the parameters to present in his product specification sheet. The data points of Fig. 5.37 are assembled in Table 5.20. Assume that water is the working fluid.

Fig. 5.37 Test data points of thermal efficiency of a double glazed flat-plate liquid collector with reduced temperature. The regression line of the model given by Eq. 5.75 is also shown (From ASHRAE (1978) © American Society of Heating, Refrigerating and Air-conditioning Engineers, Inc., www.ashrae.org)

(a) Perform OLS regression using Eq. 5.75b and identify the two parameters FRηn and FRUL along with their variance. Plot the model residuals and study their behavior.
(b) Repeat the analysis using bootstrapping and compare the model parameter estimates with those of the best model identified earlier.
(c) Draw a straight line visually through the data points and determine the x-axis and y-axis intercepts. Estimate the FRηn and FRUL parameters and compare them with those determined from (a).
(d) Calculate the R2, RMSE and CV values of the model.
(e) Calculate the F-statistic to test for overall significance of the model.
(f) Perform t-tests on the individual model parameters.
(g) Use the model to predict collector efficiency when IT = 800 W/m2, TCi = 35 °C and Ta = 10 °C.
(h) Determine the 95% CL intervals for the mean and individual responses for (g) above.
(i) The steady-state model of the solar thermal collector assumes the heat loss term given by [UA(TCi - Ta)] is linear with the temperature difference between collector inlet temperature and the ambient temperature. One wishes to investigate whether the model improves if the loss term is to include an additional second-order term:
(i) Derive the resulting expression for collector efficiency analogous to Eq. 5.75b. (Hint: start with the fundamental heat balance equation, Eq. 5.74).
(ii) Do the data justify the use of such a model?
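Because Eq. 5.75b is a straight line in the reduced temperature, the two collector parameters follow directly from the intercept and slope of a linear fit. The sketch below illustrates this mapping with noise-free hypothetical values (not the data of Table 5.20); the function name `collector_params` is an assumption. Efficiency is taken here as a fraction, whereas Table 5.20 reports it in percent.

```python
import numpy as np

def collector_params(x, eta):
    """Fit eta = FR_eta_n - FR_UL * x (Eq. 5.75b), where x = (Tci - Ta)/IT.

    Returns (FR_eta_n, FR_UL): the intercept and the negative of the slope.
    """
    slope, intercept = np.polyfit(np.asarray(x, float), np.asarray(eta, float), 1)
    return intercept, -slope

# Hypothetical, noise-free illustration: FR*eta_n = 0.70, FR*UL = 6.0
x_demo = np.array([0.01, 0.03, 0.05, 0.07, 0.09])   # reduced temperature
eta_demo = 0.70 - 6.0 * x_demo                      # efficiency as a fraction
fr_eta_n, fr_ul = collector_params(x_demo, eta_demo)

# Prediction at given operating conditions (values chosen for illustration only)
I_T, T_ci, T_a = 800.0, 35.0, 10.0
eta_pred = fr_eta_n - fr_ul * (T_ci - T_a) / I_T
```

With real test data the fit would also supply standard errors for both parameters, which is what parts (a), (f) and (h) of the problem ask for.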
Table 5.20 Data table for Problem 5.7 (data available electronically on book website)

| x | y (%) | x | y (%) | x | y (%) | x | y (%) |
|---|---|---|---|---|---|---|---|
| 0.009 | 64 | 0.051 | 30 | 0.064 | 27 | 0.077 | 20 |
| 0.011 | 65 | 0.052 | 30 | 0.065 | 26 | 0.080 | 16 |
| 0.025 | 56 | 0.053 | 31 | 0.065 | 24 | 0.083 | 14 |
| 0.025 | 56 | 0.056 | 29 | 0.069 | 24 | 0.086 | 14 |
| 0.025 | 52.5 | 0.056 | 29 | 0.071 | 23 | 0.091 | 12 |
| 0.025 | 49 | 0.061 | 29 | 0.071 | 21 | 0.094 | 10 |
| 0.050 | 35 | 0.062 | 25 | 0.075 | 20 | | |
Pr. 5.8 Dimensionless model for fans or pumps (footnote 16)
The performance of a fan or pump is characterized in terms of the head or the pressure rise across the device and the flow rate for a given shaft power. The use of dimensionless variables simplifies and generalizes the model. Dimensional analysis (consistent with fan affinity laws for changes in speed, diameter, and air density) suggests that the performance of a centrifugal fan can be expressed as a function of two dimensionless groups representing pressure head and flow coefficient, respectively:

$$ \Psi = \frac{SP}{D^2 \omega^2 \rho} \quad \text{and} \quad \Phi = \frac{Q}{D^3 \omega} \qquad (5.76) $$

where SP is the static pressure, Pa; D is the diameter of the wheel, m; ω is the rotative speed, rad/s; ρ is the density, kg/m3; and Q is the volume flow rate of air, m3/s. For a fan operating at constant density, it should be possible to plot one curve of Ψ vs Φ that represents the performance at all speeds and diameters for this generic class of pumps. The performance of a certain 0.3 m diameter fan is shown in Table 5.21.
(a) Convert the given data into the two dimensionless groups defined by Eq. 5.76.
(b) Next, plot the data and formulate two or three promising functions.
(c) Identify the best function by looking at the R2, RMSE, CV and DW values and at residual behavior.
(d) Repeat the analysis for the best model using k-fold cross-validation with k = 5.
(e) Summarize the additional insights which the k-fold analysis has provided.
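The conversion in part (a) is a direct evaluation of Eq. 5.76. As a minimal sketch, the function below (its name `fan_dimensionless` is hypothetical) computes both groups for the first row of Table 5.21, using the 0.3 m wheel diameter and the air density of 1.204 kg/m3 stated in the problem.

```python
import numpy as np

def fan_dimensionless(sp_pa, q_m3s, omega_rad_s, d_m=0.3, rho=1.204):
    """Return (psi, phi) of Eq. 5.76 for scalar or array inputs."""
    sp = np.asarray(sp_pa, float)
    q = np.asarray(q_m3s, float)
    omega = np.asarray(omega_rad_s, float)
    psi = sp / (d_m ** 2 * omega ** 2 * rho)   # pressure head group
    phi = q / (d_m ** 3 * omega)               # flow coefficient group
    return psi, phi

# First row of Table 5.21: omega = 157 rad/s, Q = 1.42 m3/s, SP = 861 Pa
psi_1, phi_1 = fan_dimensionless(861.0, 1.42, 157.0)
```

Passing the full columns of Table 5.21 as arrays returns all the (Ψ, Φ) pairs at once, ready for plotting in part (b).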
Table 5.21 Data table for Problem 5.8 (data available electronically on book website)

| Rotation ω (rad/s) | Flow rate Q (m3/s) | Static pressure SP (Pa) | Rotation ω (rad/s) | Flow rate Q (m3/s) | Static pressure SP (Pa) |
|---|---|---|---|---|---|
| 157 | 1.42 | 861 | 94 | 0.94 | 304 |
| 157 | 1.89 | 861 | 94 | 1.27 | 299 |
| 157 | 2.36 | 796 | 94 | 1.89 | 219 |
| 157 | 2.83 | 694 | 94 | 2.22 | 134 |
| 157 | 3.02 | 635 | 94 | 2.36 | 100 |
| 157 | 3.30 | 525 | 63 | 0.80 | 134 |
| 126 | 1.42 | 548 | 63 | 1.04 | 122 |
| 126 | 1.79 | 530 | 63 | 1.42 | 70 |
| 126 | 2.17 | 473 | 63 | 1.51 | 55 |
| 126 | 2.36 | 428 | | | |
| 126 | 2.60 | 351 | | | |
| 126 | 3.30 | 114 | | | |
Footnote 16: From Stoecker (1989), with permission from McGraw-Hill.
Table 5.22 Data table for Problem 5.10 (data available electronically on book website)

| kT | (Id/I) | kT | (Id/I) |
|---|---|---|---|
| 0.1 | 0.991 | 0.5 | 0.658 |
| 0.15 | 0.987 | 0.55 | 0.55 |
| 0.2 | 0.982 | 0.6 | 0.439 |
| 0.25 | 0.978 | 0.65 | 0.333 |
| 0.3 | 0.947 | 0.7 | 0.244 |
| 0.35 | 0.903 | 0.75 | 0.183 |
| 0.4 | 0.839 | 0.8 | 0.164 |
| 0.45 | 0.756 | 0.85 | 0.166 |
| | | 0.9 | 0.165 |
Table 5.23 Data table for Problem 5.11

| Balance point temp. (°C) | VBDD (°C-Days) |
|---|---|
| 25 | 4750 |
| 20 | 3900 |
| 15 | 2000 |
| 10 | 1100 |
| 5 | 500 |
| 0 | 100 |
| -5 | 0 |
Pr. 5.8 (continued) Assume the density of air at STP conditions to be 1.204 kg/m3.

Pr. 5.9 Consider the data used in Example 5.6.3 meant to illustrate the use of weighted regression for replicate measurements with non-constant variance. For the same data set, identify a model using the logarithmic transform approach similar to that shown in Example 5.6.2.

Pr. 5.10 Spline models for solar radiation
This problem involves using splines for functions with abrupt hinge points. Several studies have proposed correlations to predict different components of solar radiation from more routinely measured components. One such correlation relates the ratio of the hourly diffuse solar radiation on a horizontal surface (Id) to the global radiation on a horizontal surface (I) to a quantity known as the hourly atmospheric clearness index (kT = I/I0), where I0 is the extraterrestrial hourly radiation on a horizontal surface at the same latitude and time and day of the year (Reddy 1987). The latter is an astronomical quantity and can be predicted almost exactly. Data have been gathered (Table 5.22) from which a correlation (Id/I) = f(kT) needs to be identified.
(a) Plot the data and visually determine the likely locations of hinge points. (Hint: there should be two points, one at either extreme).
(b) Previous studies have suggested the following three functional forms: a constant model for the lower range, a second-order model for the middle range, and a constant model for the higher range. Evaluate with the data provided whether this functional form still holds, and report pertinent models and relevant goodness-of-fit indices. [Hint: Make sure that the function is continuous at the hinge point].

Pr. 5.11 Modeling variable base degree-days with balance point temperature at a specific location
Degree-day methods provide a simple means of determining annual energy use in envelope-dominated buildings operated constantly and with simple HVAC systems, which can be characterized by a constant efficiency. Such simple single-measure methods capture the severity of the climate in a particular location. The variable base degree day (VBDD) is conceptually similar to the simple degree-day method but is an improvement since it is based on the actual balance point of the house instead of the outdated default value of 65°F or 18.3°C (Reddy et al. 2016). Table 5.23 assembles the VBDD values for New York City, NY from actual climatic data over several years at this location.
(a) Identify a suitable regression curve for VBDD versus balance point temperature for this location and report all pertinent statistics (goodness-of-fit and model parameter estimates and their CL).
(b) Repeat the analysis using bootstrapping and compare the model parameter estimates and their CL with those of the results from (a).

Pr. 5.12 Consider Example 5.7.2 where two types of buildings were modeled following the full-model (FM) and the reduced model (RM) approaches using categorical variables. Only the question of whether the model slope parameters of the two types of buildings differ was evaluated. Extend the analysis to test whether both model slope and intercept parameters are affected by the type of building.
Table 5.24 Data table for Pr. 5.13 (data available electronically on book website)

| Year | Month | E (W/ft2) | To (°F) | foc |
|---|---|---|---|---|
| 94 | Aug | 1.006 | 78.233 | 0.41 |
| 94 | Sep | 1.123 | 73.686 | 0.68 |
| 94 | Oct | 0.987 | 66.784 | 0.67 |
| 94 | Nov | 0.962 | 61.037 | 0.65 |
| 94 | Dec | 0.751 | 52.475 | 0.42 |
| 95 | Jan | 0.921 | 49.373 | 0.65 |
| 95 | Feb | 0.947 | 53.764 | 0.68 |
| 95 | Mar | 0.876 | 59.197 | 0.58 |
| 95 | Apr | 0.918 | 65.711 | 0.66 |
| 95 | May | 1.123 | 73.891 | 0.65 |
| 95 | Jun | 0.539 | 77.840 | 0 |
| 95 | Jul | 0.869 | 81.742 | 0 |
| 95 | Aug | 1.351 | 81.766 | 0.39 |
| 95 | Sep | 1.337 | 76.341 | 0.71 |
| 95 | Oct | 0.987 | 65.805 | 0.68 |
| 95 | Nov | 0.938 | 56.714 | 0.66 |
| 95 | Dec | 0.751 | 52.839 | 0.41 |
| 96 | Jan | 0.921 | 49.270 | 0.65 |
| 96 | Feb | 0.947 | 55.873 | 0.66 |
| 96 | Mar | 0.873 | 55.200 | 0.57 |
| 96 | Apr | 0.993 | 66.221 | 0.65 |
| 96 | May | 1.427 | 78.719 | 0.64 |
| 96 | Jun | 0.567 | 78.382 | 0.1 |
| 96 | Jul | 1.005 | 82.992 | 0.2 |
Pr. 5.13 Change point models of utility bills in variable occupancy buildings
Example 5.7.1 illustrated the use of linear spline models to model monthly energy use in a commercial building versus outdoor dry-bulb temperature. Such models are useful for several purposes, one of which is for energy conservation. For example, the energy manager may wish to track the extent to which energy use has been increasing over the years, or the effect of a recently implemented energy conservation measure (such as a new chiller). For such purposes, one would like to correct, or normalize, for any changes in weather since an abnormally hot summer could obscure the beneficial effects of a more efficient chiller. Hence, factors that change over the months or the years need to be considered explicitly in the model. Two common normalization factors include changes to the conditioned floor area (e.g., an extension to an existing building wing), or changes in the number of students in a school. A model regressing monthly utility energy use against outdoor temperature is appropriate for buildings with constant occupancy (such as residences) or even offices. However, buildings such as schools are practically closed during summer, and hence, the occupancy rate needs to be included as the second regressor. The functional form of the model, in such cases, is a multivariate change point model given by:
$$y = \beta_{0,un} + \beta_0 f_{oc} + \beta_{1,un} x + \beta_1 f_{oc} x + \beta_{2,un} (x - x_c) I + \beta_2 f_{oc} (x - x_c) I \qquad (5.77)$$
where x is the monthly mean outdoor temperature (To) and y is the electricity use per square foot of the school (E). Also, foc = (Noc/Ntotal) is the ratio of the number of days in the month when the school is in session (Noc) to the total number of days in that particular month (Ntotal). The factor foc can be determined from the school calendar. Clearly, the unoccupied fraction is fun = (1 - foc).
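The search for the change point in Eq. 5.77 can be sketched as a grid search: for each trial value of xc the model is linear in its six coefficients, so one OLS fit is run per candidate and the candidate with the lowest RMSE is retained. The sketch below is only illustrative. It uses synthetic, noise-free data rather than Table 5.24, assumes the indicator I equals 1 when x exceeds xc and 0 otherwise, and assumes RMSE is computed with n - 6 degrees of freedom; the function name `fit_change_point` and all numerical values are hypothetical.

```python
import numpy as np

def fit_change_point(x, y, f_oc, xc_grid):
    """Grid search over xc: one OLS fit per candidate, keep the minimum-RMSE fit."""
    best = None
    for xc in xc_grid:
        # (x - xc) * I, with I assumed to be 1 when x > xc and 0 otherwise
        hinge = np.where(x > xc, x - xc, 0.0)
        # Columns follow the term order of Eq. 5.77
        X = np.column_stack([np.ones_like(x), f_oc, x, f_oc * x,
                             hinge, f_oc * hinge])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        sse = float(np.sum((y - X @ beta) ** 2))
        rmse = np.sqrt(sse / (len(y) - X.shape[1]))  # assumed n - p definition
        if best is None or rmse < best[0]:
            best = (rmse, xc, beta)
    return best

# Synthetic 24-month data set generated from an assumed "true" model
rng = np.random.default_rng(0)
x = rng.uniform(48.0, 84.0, 24)      # monthly mean outdoor temperature, °F
f_oc = rng.uniform(0.0, 0.7, 24)     # occupied fraction of each month
beta_true = np.array([0.5, 0.4, 0.002, 0.003, 0.01, 0.03])
hinge_true = np.where(x > 65.0, x - 65.0, 0.0)
y = np.column_stack([np.ones_like(x), f_oc, x, f_oc * x,
                     hinge_true, f_oc * hinge_true]) @ beta_true

rmse, xc_hat, beta_hat = fit_change_point(x, y, f_oc, np.arange(55.0, 76.0, 1.0))
print(f"best xc = {xc_hat:.1f} °F, RMSE = {rmse:.2e}")
```

With real monthly data the RMSE curve over xc is usually flat near its minimum, so the grid spacing and the significance of each coefficient deserve a closer look than this sketch provides.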
The term I represents an indicator variable whose numerical value is given by Eq. 5.71b. Note that the change point temperatures for occupied and unoccupied periods are assumed to be identical since the monthly data do not allow this separation to be identified. Consider the monthly data assembled for an actual school (shown in Table 5.24).
(a) Plot the data and look for change points in the data. Note that the model given by Eq. 5.77 has 7 parameters of which xc (the change point temperature) is the one that makes the estimation nonlinear. By inspection of the scatter plot, you will assume a reasonable value for this variable, and proceed to perform a linear regression as illustrated in Example 5.7.1. The search for the best value of xc (one with minimum RMSE) would require several OLS regressions assuming different values of the change point temperature.
(b) Identify the parsimonious model and estimate the appropriate parameters of the model. Note that of the six parameters appearing in Eq. 5.77, some of the parameters may be statistically insignificant, and appropriate care should be exercised in this regard. Report appropriate model and parameter statistics.
(c) Perform a residual analysis and discuss the results.
(d) Repeat the analysis using k-fold cross-validation with k=4 and compare the model parameter estimates with those of (b) above.

Pr. 5.14 Determining energy savings from monitoring and verification (M&V) projects
A crucial element in any energy conservation program is the ability to verify savings from measured energy use data; this is referred to as monitoring and verification (M&V). Energy service companies (ESCOs) are required to perform this as part of their services. Figure 5.38 depicts how energy savings are estimated.

Fig. 5.38 Schematic representation of energy use prior to and after installing energy conservation measures (ECM) and of the resulting energy savings

A common M&V protocol involves measuring the monthly total energy use at the facility for the whole year before the retrofit (this is the baseline period or the “pre-retrofit period”) and a whole year after the retrofit (called the “post-retrofit period”). The time taken for implementing the energy-saving measures (called the “construction period”) is neglected in this simple example. One first identifies a baseline regression model of energy use against ambient dry-bulb temperature To during the pre-retrofit period Epre = f(To). This model is then used to predict energy use during each month of the post-retrofit period by using the corresponding ambient temperature values. The difference between model predicted and measured monthly energy use is the energy savings during that month.
$$\text{Energy savings} = \text{Model-predicted pre-retrofit use} - \text{Measured post-retrofit use} \qquad (5.78)$$

Finally, the annual savings resulting from the energy retrofit and their uncertainty are determined. It is very important that the uncertainty associated with the savings estimates be determined as well for meaningful conclusions to be reached regarding the impact of the retrofit on energy use. You are given monthly data of outdoor dry bulb temperature (To) and area-normalized whole building electricity use (WBe) for two years (Table 5.25). The first year is the pre-retrofit period before a new energy management and control system (EMCS) for the building is installed, and the second is the post-retrofit period. The construction period, that is, the period it takes to implement the conservation measures, is taken to be negligible.

Table 5.25 Data table for Problem 5.14 (data available electronically on book website)
Pre-retrofit period | | | Post-retrofit period | |
Month | To (°F) | WBe (W/ft²) | Month | To (°F) | WBe (W/ft²)
1994-Jul | 84.04 | 3.289 | 1995-Jul | 83.63 | 2.362
Aug | 81.26 | 2.827 | Aug | 83.69 | 2.732
Sep | 77.98 | 2.675 | Sep | 80.99 | 2.695
Oct | 71.94 | 1.908 | Oct | 72.04 | 1.524
Nov | 66.80 | 1.514 | Nov | 62.75 | 1.109
Dec | 58.68 | 1.073 | Dec | 57.81 | 0.937
1995-Jan | 56.57 | 1.237 | 1996-Jan | 54.32 | 1.015
Feb | 60.35 | 1.253 | Feb | 59.53 | 1.119
Mar | 62.70 | 1.318 | Mar | 58.70 | 1.016
Apr | 69.29 | 1.584 | Apr | 68.28 | 1.364
May | 77.14 | 2.474 | May | 78.12 | 2.208
Jun | 80.54 | 2.356 | Jun | 80.91 | 2.070
(a) Plot time series and x–y plots and see whether you can visually distinguish the change in energy use as a result of installing the EMCS (similar to Fig. 5.38).
(b) Evaluate at least two different models (with one of them being a model with indicator variables) for the pre-retrofit period, and select the better model.
(c) Repeat the analysis using bootstrapping (10 samples are adequate) and compare the model parameter estimates with those of (a) and (b) above.
(d) Use this baseline model to determine month-by-month energy use during the post-retrofit period representative of energy use had the conservation measure not been implemented (this is the “model-predicted pre-retrofit energy use” of Eq. 5.78).
(e) Determine the month-by-month as well as the annual energy savings.
(f) The ESCO which suggested and implemented the ECM claims a savings of 15%. You have been retained by the building owner as an independent M&V consultant to verify this claim. Prepare a short report describing your analysis methodology, results, and conclusions. (Note: you should also calculate the 90% uncertainty in the savings estimated assuming zero measurement uncertainty. Only the cumulative annual savings and their uncertainty are required, not month-by-month values).
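The savings calculation of Eq. 5.78 can be sketched as follows using the data of Table 5.25. The straight-line baseline used here is only an illustrative assumption (the problem asks for at least two candidate models, one of them with indicator variables), the variable names are hypothetical, and the uncertainty analysis required in part (f) is not attempted.

```python
import numpy as np

# Table 5.25: outdoor temperature (°F) and whole-building electricity use (W/ft²)
to_pre = np.array([84.04, 81.26, 77.98, 71.94, 66.80, 58.68,
                   56.57, 60.35, 62.70, 69.29, 77.14, 80.54])
wbe_pre = np.array([3.289, 2.827, 2.675, 1.908, 1.514, 1.073,
                    1.237, 1.253, 1.318, 1.584, 2.474, 2.356])
to_post = np.array([83.63, 83.69, 80.99, 72.04, 62.75, 57.81,
                    54.32, 59.53, 58.70, 68.28, 78.12, 80.91])
wbe_post = np.array([2.362, 2.732, 2.695, 1.524, 1.109, 0.937,
                     1.015, 1.119, 1.016, 1.364, 2.208, 2.070])

# Baseline model identified on pre-retrofit data only
# (assumed form: WBe = intercept + slope * To)
slope, intercept = np.polyfit(to_pre, wbe_pre, 1)

# Model-predicted pre-retrofit use under post-retrofit weather
baseline_post = intercept + slope * to_post

# Eq. 5.78 applied month by month, then aggregated over the year
monthly_savings = baseline_post - wbe_post
mean_savings = monthly_savings.mean()                       # W/ft²
fractional_savings = monthly_savings.sum() / baseline_post.sum()

print(f"mean savings = {mean_savings:.3f} W/ft2")
print(f"fractional savings = {100 * fractional_savings:.1f} %")
```

Swapping in a change point baseline only changes the two lines that fit and evaluate the model; the savings arithmetic stays the same.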
6 Design of Physical and Simulation Experiments
Abstract
One of the objectives of performing engineering experiments is to assess performance/quality improvements (or system response variable) of a product under different changes/variations during the manufacturing process (called “treatments”). Experimental design is the term used to denote the series of planned experiments to be undertaken to compare the effect of one or more treatments or interventions on a response variable. The analysis of such data once collected entails methods which are a logical extension of the Student t-test and one-way ANOVA hypothesis tests meant to compare two or more population means of samples. Design of Experiments (DOE) is a broader term which includes defining the objective and scope of the study, selecting the response variable, identifying the treatments and their levels/ranges of variability, prescribing the exact manner in which samples for testing need to be selected, specifying the conditions and executing the test sequence where one variable is varied at a time, analyzing the data collected to verify (or refute) statistical hypotheses, and then drawing meaningful conclusions. Selected experimental design methods are discussed such as full and fractional factorial designs, and complete block and Latin squares designs. The parallel between model building in a DOE framework and linear multiple regression is illustrated. Also discussed are response surface modeling (RSM) designs, which are meant to accelerate the search toward optimizing a process or finding the proper product mix by simultaneously varying more than one continuous treatment variable. It is a sequential approach where one starts with test conditions in a plausible area of the search space, analyzes test results to determine the optimal direction to move, performs a second set of test conditions, and so on till the required optimum is reached. Central composite design (CCD) is often used for RSM situations for
continuous treatment variables since it allows fitting a second-order response surface with greater efficiency. Computer simulations are, to some extent, replacing the need to perform physical experiments, which are more expensive, time-consuming, and limited in the number of factors one can consider. There are parallels between the traditional physical DOE approach and designs based on computer simulations. The last section of this chapter discusses similarities and the important considerations/differences between experimental design in both fields. It presents the various methods of sampling when the set of input design variables is very large (in some cases the RSM-CCD design can be used but the Latin Hypercube Monte Carlo and its variants such as the Morris method are much more efficient), for performing sensitivity analysis to identify important input variables (similar to screening) and reducing the number of computer simulations by adopting space-filling interpolation methods (also called surrogate modeling).
6.1 Introduction
6.1.1 Types of Data Collection
All statistical data analyses are predicated on acquiring proper data, and the more “proper” the data, the sounder the statistical analysis. Basically, data can be collected in one of three ways (Montgomery 2017):
(i) A retrospective cohort study involving a control group and a test group. For example, in medical and psychological research, data are collected from a group of individuals exposed to, or vaccinated against, a certain factor and compared to another control group;
(ii) An observational study where data are collected during normal operation of the system and the observer cannot intervene; relevant analysis methods have been discussed briefly in Sect. 3.8 and in Sect. 10.3;
(iii) A designed experiment where the analyst can control certain system inputs and has the leeway to frame/perform the sequence of experiments as desired. This is the essence of a body of knowledge referred to as design of experiments (DOE), the focus of this chapter.
One of the objectives of performing designed experiments in an engineering context is to improve some quality of products during their manufacture. Specifically, this involves evaluating process yield improvement or detecting changes in model system performance, i.e., the response to different specific changes (called treatments). An example related to metallurgy is to study the influence of adding carbon to iron in different concentrations to increase the strength and toughness of steels. The word “treatment” is generic and is used to denote an intervention in a process such as, say, an additive nutrient in fertilizers, use of different machines during manufacture, design change in manufacturing components/processes, etc. (Box et al. 1978). Often data are collected without proper reflection of the intended purpose, and the analyst then tries to do the best he can. The two previous chapters dealt with statistical techniques for analyzing data which were already gathered. However, no amount of “creative” statistical data analysis can reveal information not available in the data itself. The richness of the data set is determined not by the amount of data but by the extent to which all possible states of the system are represented in the data set. This is especially true for observational data sets where data are collected while the system is under routine day-to-day operation without any external intervention by the observer.
# The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 T. A. Reddy, G. P. Henze, Applied Data Analysis and Modeling for Energy Engineers and Scientists, https://doi.org/10.1007/978-3-031-34869-3_6
The process of proper planning and execution of experiments, intentionally designed to provide data rich in information especially suited for the intended objective with low/least effort and cost, is referred to as experimental design. Optimal experimental design is one which stipulates the conditions under which each observation should be taken to minimize/maximize the system response or inherent characteristics. Practical considerations and constraints often complicate the design of optimal experiments, and these factors need to be explicitly considered.
6.1.2 Purpose of DOE
Experimental design techniques were developed at the beginning of the twentieth century primarily in the context of agricultural research, subsequently migrating to industrial engineering, and then on to other fields. The historic reason for their development was to ascertain, by hypothesis testing, whether a certain treatment increased the yield or improved the strength of the product. The statistical techniques which stipulate how each of the independent variables or factors has to be varied so as to obtain the most information about system behavior quantified by a response variable, and do so with a minimum of tests (and hence, low/least effort and expense), are the essence of the body of knowledge known as experimental design. Design of experiments (DOE) is broader in scope and includes additional aspects: defining the objective and scope of the study, selecting the response variable, identifying the treatments and their levels/ranges of variability, prescribing the exact way samples for testing need to be selected, specifying the conditions and executing the test sequence, analyzing the data collected to verify (or refute) statistical hypotheses and then drawing meaningful conclusions. Often the terms experimental design and DOE are used interchangeably with the context dispelling any confusion. Experimental design involves one or more aspects where the intent is to (i) “screen” a large number of possible candidates or likely variables/factors and identify the dominant variables. These possible candidate factors are then subject to more extensive investigation; (ii) formulate the test conditions and sequence so that sources of unsuspected and uncontrollable/extraneous errors can be minimized while eliciting the necessary “richness” in system behavior, and (iii) build a suitable mathematical model between the factors and the response variable using the data set acquired.
This involves both hypothesis testing to identify the significant factors as well as model building and residual diagnostic checking. Many software packages have a DOE wizard which walks one through the various steps of defining the entire set of experimental combinations. Typical application goals of DOE are listed in Table 6.1. The relative importance of these goals depends on the
Table 6.1 Typical application goals of DOE
Application | Description | Chapter section
Hypothesis testing | Determine whether there is a difference between the mean of the response variable for the different levels of a factor (i.e., to verify whether the new product or process is indeed an improvement over the status quo) | 6.2, 6.4, 6.5
Factor screening | Determine which factors greatly influence the response variable | 6.4.2
Factor interaction | Determine whether (or which) factors interact with each other | 6.4
Model development | Determine functional relationship between factors and response variable | 6.4, 6.5
Response surface design | Experimental design which allows determining the numerical values or settings of the factors that maximize/minimize the response variable | 6.6
specific circumstance. For example, often and especially so in engineering model building, the dominant regressor or independent variable set is known beforehand, acquired either from mechanistic insights or prior experimentation; and so, the screening phase may be redundant. In the context of calibrating a detailed simulation program with monitored data (Sect. 6.7), the problem involves dozens of input parameters. Which parameters are influential can be determined from a sensitivity analysis which is directly based on the principles of screening tests in DOE.
6.1.3 DOE Terminology
Literature on DOE has its own somewhat unique terminology which needs to be well understood. Referring to Fig. 1.8, a component or system can be represented by a simple block diagram, which consists of controllable input variables, uncontrollable inputs, system response, and the model structure denoted by the mathematical model and its parameter vector. The description of important terms is assembled in Table 6.2. Of special importance are the three terms: treatment factors, nuisance factors, and random effects. These and other terms appearing in this table are discussed in this chapter.
6.2 Overview of Different Statistical Methods
6.2.1 Different Types of ANOVA Tests
Recall the two statistical tests covered in Chap. 4, both of which involve comparing the mean values of one factor or variable corresponding to samples drawn from different populations. The Student t-test is a hypothesis test meant to compare the means of two populations using the t-distribution (Sect. 4.2). The one-factor or one-way ANOVA (Sect. 4.3) is a variance-based statistical F-test to concurrently compare three or more population means of samples gathered after the system has been subject to one or more treatments or interventions. Note that it could be used for comparing two population means; but the t-test is simpler. These tests are not meant for cases where more than one factor influences the outcome. Consider an example where an animal must be fattened (increase in weight to fetch a higher sale price), and the effect of two different animal feeds is being evaluated. If the tests are conducted without considering the confounding random effect of other disturbances which may influence the outcome, one obtains a large variance in the mean and the effect of the feed type is difficult to isolate/identify statistically (Fig. 6.1). It is in such cases that DOE can determine, by adopting suitable block designs, whether (and by how much) a change in one input variable (or intended treatment) among several known and controllable inputs affects the mean behavior/response/output of the process/system. Two-way ANOVA is used to compare mean differences on one dependent continuous variable between groups which have been subject to different interventions or treatments involving two independent discrete factors with one of them being a nuisance/secondary factor/variable. Similarly, three-way ANOVA involves techniques where two nuisance factors are blocked. The design of experiments can be extended to include:
(i) Several factors, but only two to four factors will be discussed here for the sake of simplicity. The factors could be either controllable by the experimenter or cannot be controlled (random or extraneous factors). The source of variation of controllable variables can be reduced by fixing them at preselected levels called blocking and collecting different sets of experimental data. The analysis of these different sets or experimental units or blocks would allow the effect of treatment variables to be better discerned. If a model is being identified, the impact of cofactors can be included as well. Averaging out the effect of the uncontrollable/unspecified disturbances requires suitable experimental procedures involving randomization and replication;
(ii) Several levels or treatments, but only two to four levels will be considered here for conceptual simplicity. The levels of a factor are the different values or categories which the factor can assume. This can be dictated by the type of variable (which can be continuous or discrete or qualitative) or selected by the experimenter. Note that the factors can be either continuous or qualitative. In case of continuous variables, their range of variation is discretized into a small set of numerical values; as a result, the levels have a magnitude associated with them. This is not the case with qualitative/categorical variables where there is no magnitude involved; the grouping is done based on some classification criterion such as different types of treatments.

Fig. 6.1 The variance in two samples of the response variable (weight) against two treatments (factors 1 and 2) can be large when several nuisance factors are present. In such cases, one-way ANOVA tests are inadequate. DOE strategies reduce the variance and bias due to secondary or nuisance factors and average out the influence of random or uncontrollable effects. This is achieved by restricted randomization, replication, and blocking resulting in tighter limits as shown

Table 6.2 Description of important terms used in DOE
No. | Terminology | Description
1 | Component/system response | Usually a continuous variable (akin to the dependent variable in regression analysis discussed in Chap. 5)
2 | Factors | Controllable component or system variables/inputs, usually discrete or qualitative/categorical. Continuous variables need to be discretized
2a | Treatment or primary factors | The interventions or controlled variables whose impact on the response is the primary purpose of adopting DOE
2b | Nuisance or secondary factors | Variables which are sources of variability and can be controlled by blocking. They are not of primary interest to the experimenter but need to be included in the model
2c | Covariates | Nuisance factors that cannot be controlled but can be measured prior to, or during, the experiment. They can be included in the model if they are of interest
3 | Random effects | Extraneous or unspecified (or lurking) disturbances which the experimenter cannot control or measure, but which impact system response and obfuscate the statistical analysis
4 | Factor levels | Discrete values of the treatment and nuisance factors selected for conducting experiments (e.g., low, medium, and high can be coded as -1, 0, and +1). For continuous variables, the range of variation is discretized into a small number of numerical value sets or levels
5 | Blocking | Clamping down on the variability of known nuisance factors to minimize or remove their influence/impact on the variability of the main factor. Reduces experimental error and increases power of statistical analysis. By not blocking a nuisance factor, it is much more difficult to detect whether the primary factor is significant or not
6 | Randomization | Executing test runs in a random order to average out the impact of random effects. Meant to minimize the effect of “noise” in the experiments and improve estimation of other factors using statistical methods
7 | Replication | Repeating each factor/level combination independently of any previous run to reduce the effect of random error. Replication allows estimating the experimental error and provides a more precise estimate of the actual parameter values
8 | Experimental units | Physical entities such as lab equipment setups, pilot plants, field parcels, etc. for conducting DOE experiments. Any practical situation will have a limitation on the number of units available
9 | Block | A noun used to denote a set or group of experimental units where one of the treatment factors has been blocked.
6.2.2 Link Between ANOVA and Regression
ANOVA and linear regression are equivalent when used to test the same hypotheses. The response variable is continuous in both cases while the factors or independent variables are discrete for ANOVA and continuous in the regression setting (Sect. 5.7 discusses the use of discrete or dummy regressors, but this is not the predominant situation). ANOVA can be viewed as a special case of regression analysis with all independent factors being qualitative. However, there is a difference in their application. ANOVA is an analysis method, widely used in experimental statistics, which addresses the question: what is the expected difference in the mean response between different groups/categories? Its main concern is to reduce the residual variance of a response variable in such a manner that the individual impact of specific factors can be better determined statistically. On the other hand, regression analysis is a mathematical modeling tool that aims to develop a predictive model for the change in the response when the predictor(s) changes by a given amount (or between different groups/categories). Achieving high model goodness-of-fit and predictive accuracy are the main concerns.
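The equivalence can be checked numerically. The sketch below, which uses hypothetical response values for three treatment groups, computes the one-way ANOVA F-statistic with `scipy.stats.f_oneway` and then recovers the same statistic from an OLS regression on dummy (indicator) regressors, with the first group taken as the reference level.

```python
import numpy as np
from scipy import stats

# Hypothetical responses for three treatment groups (four replicates each)
groups = [
    np.array([20.1, 21.3, 19.8, 20.6]),
    np.array([22.4, 23.0, 21.9, 22.7]),
    np.array([19.0, 18.4, 19.5, 18.9]),
]

# One-way ANOVA: tests whether all group means are equal
f_anova, p_anova = stats.f_oneway(*groups)

# Same hypothesis posed as a regression with dummy regressors
y = np.concatenate(groups)
labels = np.repeat(np.arange(len(groups)), [len(g) for g in groups])
dummies = [(labels == j).astype(float) for j in range(1, len(groups))]
X = np.column_stack([np.ones_like(y)] + dummies)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = np.sum((y - X @ beta) ** 2)       # residual sum of squares
sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
p = X.shape[1] - 1                      # number of dummy regressors
n = len(y)
f_reg = ((sst - sse) / p) / (sse / (n - p - 1))   # overall regression F-test

print(f"F (ANOVA) = {f_anova:.4f}, F (regression) = {f_reg:.4f}")
```

In this coding the intercept is the mean of the reference group and each dummy coefficient is the difference between a group mean and the reference mean, which is the regression counterpart of the ANOVA question about differences in mean response.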
6.2.3 Recap of Basic Model Functional Forms
Section 5.4 discussed higher-order regression models involving more than one regressor. The basic additive linear model was given by Eq. 5.22a, while its “normal” transformation is one where the mean value of each regressor variable is subtracted from that variable:

$$y = \beta_0' + \beta_1 (x_1 - \bar{x}_1) + \beta_2 (x_2 - \bar{x}_2) + \cdots + \beta_k (x_k - \bar{x}_k) + \varepsilon \qquad (5.20)$$

where k is the number of regressors and ε is the error or unexplained variation in y. The intercept term β0′ can be interpreted as the mean response of y. The basic regression model can be made to capture nonlinear variation in the response variables. For example, as shown in Fig. 6.2, a third observation would allow quadratic behavior (either concave or convex) to be modeled.

Fig. 6.2 Points required for modeling linear and non-linear effects

The linear models capture variation when there is no interaction between regressors. The first-order linear model with two interacting regressor variables could be stated as:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon \qquad (5.21)$$
The term (β3x1x2) was called the interaction term. How the presence of interaction affects the shape of the family of curves was previously illustrated in Fig. 5.8. The second-order quadratic model with interacting terms (Eq. 5.28) was also presented. The number of runs/trials or data points must be greater than the number of terms in the model to (i) estimate the model parameters, (ii) determine, by hypothesis testing, whether they are statistically significant or not, and (iii) obtain an estimate of the random/pure error. Even with blocking and randomization, it is important to adopt a replication strategy to increase the power of the DOE design.
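The requirement that the number of runs exceed the number of model terms can be illustrated with the interaction model above. In the sketch below, a two-factor, two-level design with coded levels -1 and +1 is replicated twice, giving 8 runs for 4 model terms and hence 4 residual degrees of freedom. The response values are hypothetical: they are generated from assumed coefficients plus a balanced replicate-to-replicate offset of ±0.1 standing in for random error.

```python
import itertools
import numpy as np

# Two-factor, two-level design in coded units, replicated twice (8 runs)
design = np.array(list(itertools.product([-1.0, 1.0], repeat=2)))
x1, x2 = np.tile(design, (2, 1)).T

# Model matrix for y = b0 + b1*x1 + b2*x2 + b3*x1*x2
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])

# Hypothetical responses: assumed coefficients plus a +/-0.1 replicate offset
true_beta = np.array([10.0, 2.0, -3.0, 1.5])
offset = np.repeat([0.1, -0.1], len(design))
y = X @ true_beta + offset

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

dof = len(y) - X.shape[1]                 # runs minus model terms
pure_error_var = np.sum(resid ** 2) / dof

print("estimated coefficients:", np.round(beta_hat, 3))
print(f"residual dof = {dof}, error variance estimate = {pure_error_var:.3f}")
```

Without the second replicate there would be 4 runs for 4 terms: the model would pass exactly through the data, leaving zero degrees of freedom and no estimate of the error, so none of the coefficients could be tested for significance.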
6.3 Basic Concepts
6.3.1 Levels, Discretization, and Experimental Combinations
A full factorial design is one where experiments are done for each and every combination of the factors and their levels. The number of combinations is then given by:

$$\text{Number of experiments for full factorial} = \prod_{i=1}^{k} n_i \qquad (6.1)$$

where k is the number of factors, n_i is the number of levels of factor i, and i is the index for the factors. For example, a factorial experiment with three factors involving one two-level factor, a three-level factor, and a four-level factor would have 2 × 3 × 4 = 24 runs or trials. For the special case when all factors have the same number of levels n, the number of experiments = n^k. A widely used experimental design is the 2^k factorial design since it is intuitive and easy to design. It derives its terminology from the fact that only two levels for each of the k factors are presumed, one indicative of the lower level of its range of variation (coded as “-”) and the other representing the higher level (coded as “+”). The factors can be qualitative or continuous; if the latter, they are discretized into categories or levels. Figure 6.3a illustrates how the continuous regressors x1 and x2 are discretized depending on their range of variation into four system states, while Fig. 6.3b depicts how these four observations would appear in a scatter plot should they exhibit no factor interaction (that is why the lines are parallel). For two factors, the number of experiments/trials (without any replication) would be 2^2 = 4; for three factors, this would be 2^3 = 8, and so on. The formalism of coding the low and high levels of the factors as -1 and +1 respectively is most widespread; other ways of coding variables can also be adopted. A full factorial design involving three factors (k = 3) at two levels each (n = 2) would involve 8 tests. These are shown conceptually in Fig. 6.4 as the 8 corner points of the cube. One could add center points to a two-level design at the center of the cube, and this would provide an estimate of residual error (the variation in the response not captured by the model) and allow one to test the goodness of model fit of a linear model.
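The run counts quoted above can be reproduced by enumerating the designs directly. The sketch below is a minimal illustration using the Python standard library; the helper name `full_factorial` and the level labels of the three- and four-level factors are hypothetical.

```python
import itertools
import math

def full_factorial(levels_per_factor):
    """Return every combination of factor levels, one tuple per run."""
    return list(itertools.product(*levels_per_factor))

# One two-level, one three-level, and one four-level factor (Eq. 6.1)
mixed_levels = [[-1, 1], [1, 2, 3], [1, 2, 3, 4]]
mixed_runs = full_factorial(mixed_levels)
n_expected = math.prod(len(levels) for levels in mixed_levels)

# Two-level design for three factors with coded levels -1 and +1
two_level_runs = full_factorial([[-1, 1]] * 3)

print(len(mixed_runs), n_expected)   # number of runs in the mixed design
for run in two_level_runs:           # the corner points of the cube
    print(run)
```

In practice the resulting run list would be shuffled before execution so that the test sequence is randomized.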
Fig. 6.3 Illustration of how models are built from factorial design data. A $2^2$ factorial design is assumed. (a) Discretization of the range of variation of the regressors x1 and x2 into “low” and “high” ranges, and (b) Regression of the system performance data as they appear on a scatter plot of Y versus X1 for low and high X2 (there is no factor interaction since the two lines are shown parallel)
Fig. 6.4 Full factorial design for three factors at two levels ($2^3$ design) involves 8 experiments mapped to each of the corners of the cube. The three factors (A, B, C) are coded as 1 or -1 to denote the high- and low-level settings of the factors. Section 6.4.2 discusses this representation in more detail
It would also indicate whether quadratic effects are present; however, it would be unable to identify the specific factor causing this behavior. To capture the quadratic effect of individual factors, one needs to add experimental points at the center of each of the 6 surfaces. This strategy is not used in the classic factorial designs but has been adopted in later designs discussed in
Sect. 6.6.4. Designs with more than 3 factors are often referred to as hyper-cube designs and cannot be drawn graphically. Figure 6.5 is a flowchart of some of the simpler cases encountered in practice and discussed in this chapter. One distinguishes between two broad categories of traditional designs:
(a) Full factorial and fractional factorial designs (Sect. 6.4): Full factorial designs are the most conservative of all design types where it is assumed that all trials can be run randomly by varying one factor/variable at a time. For example, with 3 factors with 3 levels each, the number of combinations is $3^3 = 27$ while that for four factors is $3^4 = 81$ and so on. Full factorial designs apply to instances when all factors at all their individual levels are deemed equally important and one wishes to compare k treatment/factor means. They allow estimating all possible interactions. Of special importance in this category of designs is the $2^k$ design which is limited to each factor having two levels only. It is often used at the early stages of an investigation for factor screening and involves an iterative approach with each iteration providing incremental insight into factor dominance and into model building. The total number of runs required is much lower than that needed for a full factorial design with more levels per factor. The number of experiments required for a full factorial design increases exponentially with the number of factors, and it is rare to adopt such designs. Moreover, not all the factors initially selected may be dominant, thereby warranting that the list of factors be screened. The fractional factorial design is appropriate in such instances. It requires only a subset of the trials of the full factorial design, but then some of the main effects and two-way interactions are confounded and cannot be separated from the effects of other higher-order interactions.
(b) Complete block and Latin squares (Sect. 6.5): Complete block design is used for instances when only the effect of one categorical treatment variable is being investigated over a range of different conditions when one or more categorical nuisance variables are present. The experimental units can be grouped in such a way that complete block designs require fewer experiments than the full factorial and allow the effect of interaction to be modeled. Latin squares are a special type of complete block design requiring even fewer experiments or trials. However, they have limitations: interaction effects cannot be captured, and they are restricted to the case of an equal number of levels for all factors.
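The run counts quoted above can be checked by enumerating the level combinations directly. The following is a minimal sketch, not part of the original text: the function name `full_factorial` and the level labels are illustrative assumptions, and only the Python standard library is used.

```python
from itertools import product
from math import prod


def full_factorial(levels_per_factor):
    """Return every combination of factor levels, one tuple per run."""
    return list(product(*levels_per_factor))


# Mixed-level case from the text: 2 x 3 x 4 = 24 runs.
mixed_levels = [("-", "+"), (1, 2, 3), (1, 2, 3, 4)]
mixed_runs = full_factorial(mixed_levels)
expected_count = prod(len(levels) for levels in mixed_levels)

# Two-level design with k = 3 factors coded as -1/+1: the 8 cube corners.
cube_corners = full_factorial([(-1, +1)] * 3)

# Optional center point (0, 0, 0) added to the two-level design.
cube_with_center = cube_corners + [(0, 0, 0)]

print(len(mixed_runs), expected_count)           # number of runs, counted two ways
print(len(cube_corners), len(cube_with_center))  # corners, corners plus center
```

Calling `full_factorial([(1, 2, 3)] * 4)` in the same way gives the 81 combinations mentioned for four three-level factors, which illustrates how quickly the run count grows.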
[Flowchart: the decision nodes are the number of total factors (k = 1 or k > 1), the number of important treatment factors (only random effects, one, or all), whether enough experimental units are available, whether interaction terms are important, the number of blocked factors, and the model behavior (linear or quadratic). The end nodes include the t-test (Sect. 4.2.3), one-factor ANOVA (Sect. 4.3.1), complete block (Sect. 6.5.1), Latin squares, Graeco-Latin squares, fractional factorial, full factorial (Sect. 6.4.1), two-level $2^k$ design (Sect. 6.4.2), and three-level $3^k$ design (Sect. 6.4.2).]
Fig. 6.5 Flow chart meant to provide an overview of the applicability of the various traditional DOE methods covered in different sections of this chapter. Response surface design (treated in Sect. 6.6) is akin to a three-level design while requiring fewer experiments; this is not shown. Monte Carlo methods are also not indicated in the flowchart since they are not traditional and are adopted for different types of applications (such as computer simulation experiments)
6.3.2
Blocking
The general strategy, or foundational principle, adopted by DOE to reduce the effect of nuisance factors and random effects is to “block what factors you can and randomize and replicate (or repeat) what you cannot” (Box et al. 1978). The concept of blocking is a form of stratified sampling whereby experimental tests or items in the sample of data are grouped into blocks according to some “matching” criterion so that the similarity of subjects within each block or group is maximized while that from block to block is minimized. Pharmaceutical companies wishing to test the effectiveness of a new drug adopt the above concept extensively. Since different people react differently, grouping of subjects is done according to some criteria (such as age, gender, body fat percentage, etc.). Such blocking would result in more uniformity within groups. Subsequently, a random administration of the drug to half of the people within each group/block, with a placebo given to the other half, would constitute randomization. Thus, any differences in each block taken separately would be more pronounced than if randomization were done without blocking.
6.3.3
Unrestricted and Restricted Randomization
After the experimental design is formulated, the sequence of the tests, i.e., the selection of the combinations of different levels and different factors, should be done in a random manner. This randomization would reduce the effect of random or extraneous factors beyond the control of the experimenter or arising from inherent experimental bias on the results of the statistical analysis. The simplest design is the randomized unrestricted block design, which involves selecting at random the combinations of the factors and levels under which to perform the experiments. This type of design, if done naively, is not very efficient and may require an unnecessarily large number of experiments to be performed. The concept is illustrated with a simple example from the agricultural area from which DOE emerged. Say, one wishes to evaluate the yield of four newly developed varieties of wheat (labeled x1, x2, x3, x4). Since the yield is affected in addition by regional climatic and soil differences, one would like to perform the evaluation at four different locations. If one had four plots of land at each individual region (i.e., 16 experimental units in total), all the tests could be completed over one period or cycle. Consider the more realistic case when, unfortunately, only one plot of land or field station is available at each of the four different geographic regions, i.e., only four experimental units in all. Due to this limitation, the evaluation process would now require four time periods. This simple case illustrates the importance of designing one’s DOE strategy keeping in mind physical constraints such as the number of experimental units available and the total time duration within which to complete the entire evaluation. The simplest way of assigning which station will be planted with which variety of wheat is to do so randomly; one such result (among many possible ones) is shown in Table 6.3. Such an unrestricted randomization leads to needless replication (e.g., wheat type x1 is tested twice in Region 1 and not at all in Region 4) and is not very efficient. Since the intention is to reduce variability in the uncontrolled variable, in this case the “region” variable, one can insist that each variety of wheat be tested in each region. There are again several possibilities, with one being shown in Table 6.4. This example illustrates the principle of restricted randomization by blocking the effect of the uncontrollable factor. This aspect is further discussed in Sect. 6.5.2. Note that the intent of this investigation is to determine the effect of wheat variety on total yield. The location of the field or station is a “nuisance” variable, but in this case can be controlled by suitable blocking. However, there may be other disturbances which are uncontrollable (say, excessive rainfall in one of the regions during the test) and, even worse, some of the disturbances may be unknown. Such effects can be partially compensated for by replication, i.e., repeating the tests more than once for each combination.

Table 6.3 Example of unrestricted randomized block design (one of several possibilities) for one factor of interest at levels x1, x2, x3, x4 and one nuisance variable (Regions 1, 2, 3, 4)

| Period | Region 1 | Region 2 | Region 3 | Region 4 |
|---|---|---|---|---|
| 1 | x1 | x1 | x2 | x3 |
| 2 | x3 | x2 | x1 | x4 |
| 3 | x1 | x3 | x3 | x2 |
| 4 | x2 | x1 | x3 | x4 |

This is not an efficient design since x1 appears twice under Region 1 and not at all in Region 4

Table 6.4 Example of restricted randomized block design (one of several possibilities) for the same example as that of Table 6.3

| Period | Region 1 | Region 2 | Region 3 | Region 4 |
|---|---|---|---|---|
| 1 | x1 | x2 | x3 | x4 |
| 2 | x2 | x3 | x4 | x1 |
| 3 | x3 | x4 | x1 | x2 |
| 4 | x4 | x1 | x2 | x3 |

Table 6.5 Standard method of assembling test results for a balanced (3 × 2) design with two replicates per cell

| Factor A | Factor B, Level 1 | Factor B, Level 2 | Average |
|---|---|---|---|
| Level 1 | 10, 14 | 18, 14 | 14 |
| Level 2 | 23, 21 | 16, 20 | 20 |
| Level 3 | 31, 27 | 21, 25 | 26 |
| Average | 21 | 19 | 20 |
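The difference between the unrestricted assignment of Table 6.3 and the restricted assignment of Table 6.4 can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the seed is arbitrary, and the restricted plan is built by shuffling a cyclic square, which is one simple construction (an assumption made here) and does not sample uniformly from all valid arrangements.

```python
import random


def unrestricted_plan(varieties, n_regions, n_periods, rng):
    """plan[region][period]: each cell is drawn independently, repeats allowed."""
    return [[rng.choice(varieties) for _ in range(n_periods)]
            for _ in range(n_regions)]


def restricted_plan(varieties, rng):
    """plan[region][period] built from a cyclic square with shuffled rows,
    columns, and labels, so each variety occurs once per region and period."""
    n = len(varieties)
    labels = list(varieties)
    rng.shuffle(labels)
    rows = list(range(n))
    cols = list(range(n))
    rng.shuffle(rows)
    rng.shuffle(cols)
    return [[labels[(r + c) % n] for c in cols] for r in rows]


def is_balanced(plan, varieties):
    """True if every variety appears exactly once in each region and period."""
    target = sorted(varieties)
    by_region = all(sorted(region) == target for region in plan)
    by_period = all(sorted(period) == target for period in zip(*plan))
    return by_region and by_period


rng = random.Random(2023)  # arbitrary seed, used only for repeatability
wheat = ["x1", "x2", "x3", "x4"]
free = unrestricted_plan(wheat, 4, 4, rng)
blocked = restricted_plan(wheat, rng)
print(is_balanced(free, wheat), is_balanced(blocked, wheat))
```

The restricted plan is balanced by construction, whereas an unrestricted draw is balanced only by chance, which is the inefficiency noted for Table 6.3.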
6.4
Factorial Designs
6.4.1
Full Factorial Design
Consider two factors (labeled A and B) that are to be studied at a and b levels, respectively. This is often referred to as an (a × b) design, and the standard manner of representing the results of the test is by assembling them as shown in Table 6.5. Each combination of factor-level can be tested more than once to minimize the effect of random errors, and this is called replication. Though more tests are done, replication reduces experimental errors introduced by extraneous factors not explicitly controlled during the experiments that can bias the results. Often for mathematical convenience, each combination is tested at the same replication level, and this is called a balanced design. Thus, Table 6.5 is an example of a (3 × 2) balanced design with replication r = 2. The above terms are perhaps better understood in the context of regression analysis (treated in Chap. 5). Let Z be the response variable which is linear in regressor variable X, and a model needs to be identified. Further, say, another variable Y is known to influence Z, which may corrupt the sought-after relation. Selecting three specific values is akin to selecting three levels for the factor X (say, x1, x2, x3). The nuisance effect of variable Y can be “blocked” by performing the tests at pre-selected fixed levels or values of Y (say, y1 and y2). The corresponding scatter plot is shown in Fig. 6.6. Repeat testing at each of the six combinations in order to reduce experimental errors is akin to replication; in this example, replication r = 3. Finally, if the 18 tests are performed in random sequence, the experimental design would qualify as a full factorial random design. The averages shown in Table 6.5 correspond to those of either the associated row or the associated column. Thus, the average of the first row, i.e., {10, 14, 18, 14}, is shown as 14, and so on. Plots of the average response versus the levels of a factor yield a graph which depicts the trend, called the main effect of the factor. Thus, Fig.
6.7a suggests that the average response tends to increase linearly as factor A changes from A1 to A3, while that of factor B decreases a little as factor B changes from B1 to B2. The effect of the factors on the response may not be purely additive, and an interaction or multiplicative term (see Eq. 5.25) may have to be included as well. In such cases, the two factors are said to interact with each other. Whether this interaction effect is statistically significant or not can be determined from the results shown in Table 6.6. The effect of going from A1 to A3 is 17 under B1 and only 7 under B2. This suggests interaction effects. A simpler and more direct approach is to graph the two-factor interaction plot as shown in Fig. 6.7b. Since the lines are not parallel (in this case they cross each other), one would infer interaction between the two factors. However, in many instances, such plots are not conclusive enough, and one needs to perform statistical tests to determine whether the main or the interaction effects are significant or not (illustrated below). Figure 6.8 shows the type of interaction plots one would obtain for the case when the interaction effects are not at all significant.

Fig. 6.6 Correspondence between block design approach and multiple regression analysis (scatter plot of Z versus X at levels x1, x2, x3 for the two blocked levels y1 and y2)
Fig. 6.7 Plots for the (3 × 2) balanced factorial design. (a) Main effects of factors A and B with mean and 95% intervals (data from Table 6.5). (b) Two-factor interaction plot. (Data from Table 6.6)

Table 6.6 Interaction effect calculations for the data in Table 6.5

| Effect of changing A (B fixed at B1) | | Effect of changing A (B fixed at B2) | |
|---|---|---|---|
| A1 and B1 | (10 + 14)/2 = 12 | A1 and B2 | (18 + 14)/2 = 16 |
| A2 and B1 | (23 + 21)/2 = 22 | A2 and B2 | (16 + 20)/2 = 18 |
| A3 and B1 | (31 + 27)/2 = 29 | A3 and B2 | (21 + 25)/2 = 23 |
| Change from A1 to A3 | 29 - 12 = 17 | Change from A1 to A3 | 23 - 16 = 7 |

ANOVA decompositions allow breaking up the observed total sum of square variation (SST) into its various contributing causes (the one-factor ANOVA was described in Sect. 4.3.1). For a two-factor ANOVA decomposition (Devore and Farnum 2005):

$$SST = SSA + SSB + SS(AB) + SSE \tag{6.2a}$$

where the observed sum of squares:

$$SST = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{m=1}^{r} \left( y_{ijm} - \langle y \rangle \right)^2 = (\text{stdev})^2 \, (abr - 1) \tag{6.2b}$$
sum of squares associated with factor A:

$$SSA = b\,r \sum_{i=1}^{a} \left( \bar{A}_i - \langle y \rangle \right)^2 \tag{6.2c}$$

sum of squares associated with factor B:

$$SSB = a\,r \sum_{j=1}^{b} \left( \bar{B}_j - \langle y \rangle \right)^2 \tag{6.2d}$$

error or residual sum of squares:

$$SSE = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{m=1}^{r} \left( y_{ijm} - \bar{y}_{ij} \right)^2 \tag{6.2e}$$

with
$y_{ijm}$ = observation under mth replication when A is at level i and B is at level j
a = number of levels of factor A
b = number of levels of factor B
r = number of replications per cell
$\bar{A}_i$ = average of all response values at ith level of factor A
$\bar{B}_j$ = average of all response values at jth level of factor B
$\bar{y}_{ij}$ = average of y for each cell (i.e., across replications)
$\langle y \rangle$ = grand average of y values
i = 1, . . . a is the index for levels of factor A
j = 1, . . . b is the index for levels of factor B
m = 1, . . . r is the index for replicate.

The sum of squares associated with the AB interaction is SS(AB), and this is deduced from Eq. 6.2a since all other quantities can be calculated. A linear statistical model, referred to as a random effects model, between the response and the two factors which includes the interaction term between factors A and B can be deduced. More specifically, this is called a nonadditive two-factor model (it is nonadditive because the interaction term is present). It assumes the following form given that one starts with the grand average and then adds individual effects of the factors, the interaction terms, and the noise or error term:

$$y_{ij} = \langle y \rangle + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ij} \tag{6.3}$$

where
$\alpha_i$ represents the main effect of factor A at the ith level = $\bar{A}_i - \langle y \rangle$, and $\sum_{i=1}^{a} \alpha_i = 0$
$\beta_j$ the main effect of factor B at the jth level = $\bar{B}_j - \langle y \rangle$, and $\sum_{j=1}^{b} \beta_j = 0$
$(\alpha\beta)_{ij}$ the interaction between factors A and B = $\bar{y}_{ij} - (\langle y \rangle + \alpha_i + \beta_j)$, and $\sum_{j=1}^{b} \sum_{i=1}^{a} (\alpha\beta)_{ij} = 0$
and $\varepsilon_{ij}$ is the error (or residuals) assumed uncorrelated with mean zero and variance $\sigma^2$ = MSE.
Fig. 6.8 An example of a two-factor interaction plot when the factors have no interaction
The analysis of variance is done as described earlier, but care must be taken to use the correct degrees of freedom to calculate the mean squares (refer to Table 6.7). The analysis of variance model (Eq. 6.3) can be viewed as a special case of multiple linear regression (more specifically, one with indicator variables; see Sect. 5.7.3). This concept is illustrated in Example 6.4.1.
Table 6.7 Computational procedure for a two-factor ANOVA design

| Source of variation | Sum of squares | Degrees of freedom | Mean square | Computed F statistic | Degrees of freedom for p-value |
|---|---|---|---|---|---|
| Factor A | SSA | a - 1 | MSA = SSA/(a - 1) | MSA/MSE | a - 1, ab(r - 1) |
| Factor B | SSB | b - 1 | MSB = SSB/(b - 1) | MSB/MSE | b - 1, ab(r - 1) |
| AB interaction | SS(AB) | (a - 1)(b - 1) | MS(AB) = SS(AB)/[(a - 1)(b - 1)] | MS(AB)/MSE | (a - 1)(b - 1), ab(r - 1) |
| Error | SSE | ab(r - 1) | MSE = SSE/[ab(r - 1)] | - | - |
| Total variation | SST | abr - 1 | - | - | - |
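The decomposition of Eqs. 6.2a to 6.2e and the mean squares of Table 6.7 can be sketched as a short routine. This is an illustrative sketch rather than the authors' implementation: the function name `two_factor_anova` and the nested-list data layout are assumptions, and the data are those of Table 6.5.

```python
# Table 6.5 data: cells[i][j] holds the r replicates for level i of A, level j of B
cells = [
    [[10, 14], [18, 14]],
    [[23, 21], [16, 20]],
    [[31, 27], [21, 25]],
]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def two_factor_anova(cells):
    """Sums of squares, degrees of freedom, mean squares and F statistics
    for a balanced two-factor design with replication (Table 6.7)."""
    a, b, r = len(cells), len(cells[0]), len(cells[0][0])
    all_obs = [y for row in cells for cell in row for y in cell]
    grand = mean(all_obs)
    a_means = [mean(y for cell in cells[i] for y in cell) for i in range(a)]
    b_means = [mean(y for i in range(a) for y in cells[i][j]) for j in range(b)]

    sst = sum((y - grand) ** 2 for y in all_obs)                      # Eq. 6.2b
    ssa = b * r * sum((m - grand) ** 2 for m in a_means)              # Eq. 6.2c
    ssb = a * r * sum((m - grand) ** 2 for m in b_means)              # Eq. 6.2d
    sse = sum((y - mean(cell)) ** 2
              for row in cells for cell in row for y in cell)         # Eq. 6.2e
    ssab = sst - ssa - ssb - sse                                      # from Eq. 6.2a

    ss = {"A": ssa, "B": ssb, "AB": ssab, "E": sse}
    df = {"A": a - 1, "B": b - 1, "AB": (a - 1) * (b - 1), "E": a * b * (r - 1)}
    ms = {key: ss[key] / df[key] for key in ss}
    f_stat = {key: ms[key] / ms["E"] for key in ("A", "B", "AB")}
    return {"SST": sst, "SS": ss, "df": df, "MS": ms, "F": f_stat}


result = two_factor_anova(cells)
print(result["SST"], result["SS"])
print(result["MS"], result["F"])
```

The F statistics returned here still have to be compared against critical values (or converted to p-values) with the degrees of freedom listed in the last column of Table 6.7; that step is left out to keep the sketch free of external libraries.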
Example 6.4.1 Two-factor ANOVA analysis and random effect model fitting
Using the data from Table 6.5, determine whether the main effect of factor A, the main effect of factor B, and the interaction effect of AB are statistically significant at α = 0.05. Subsequently, identify the random effects model.
It is recommended that one start by generating the treatment and effect plots as shown in Fig. 6.7a, b. The response increases with the increasing level of factor A while it decreases a little with factor B. The effect of factor A on the response looks more pronounced.
First, using all 12 observations, one computes the grand average $\langle y \rangle$ = 20 and the standard deviation stdev = 6.015. Then, following Eq. 6.2b:

$$SST = \text{stdev}^2 \, (abr - 1) = 6.015^2 \, [(3)(2)(2) - 1] = 398$$

$$SSA = (2)(2)\left[ (14 - 20)^2 + (20 - 20)^2 + (26 - 20)^2 \right] = 288$$

$$SSB = (3)(2)\left[ (21 - 20)^2 + (19 - 20)^2 \right] = 12$$

$$SSE = (10 - 12)^2 + (14 - 12)^2 + (18 - 16)^2 + (14 - 16)^2 + (23 - 22)^2 + (21 - 22)^2 + (16 - 18)^2 + (20 - 18)^2 + (31 - 29)^2 + (27 - 29)^2 + (21 - 23)^2 + (25 - 23)^2 = 42$$

Then, from Eq. 6.2a, SST = SSA + SSB + SS(AB) + SSE, so that

$$SS(AB) = SST - SSA - SSB - SSE = 398 - 288 - 12 - 42 = 56$$

Next, the expressions shown in Table 6.7 result in:

$$MSA = \frac{SSA}{a - 1} = \frac{288}{3 - 1} = 144$$

$$MSB = \frac{SSB}{b - 1} = \frac{12}{2 - 1} = 12$$

$$MS(AB) = \frac{SS(AB)}{(a - 1)(b - 1)} = \frac{56}{(2)(1)} = 28$$

$$MSE = \frac{SSE}{ab(r - 1)} = \frac{42}{(3)(2)(1)} = 7$$

The statistical significance of the factors can now be evaluated by computing the F-values and comparing them with the corresponding critical values.
• Factor A: F-value = MSA/MSE = 144/7 = 20.57. Since the critical F-value for degrees of freedom (2, 6) at the 0.05 significance level is Fc(2, 6) = 5.14, and because the calculated F > Fc, one concludes that this factor is indeed significant at the 95% confidence level (CL).
• Factor B: F-value = MSB/MSE = 12/7 = 1.71. Since Fc(1, 6) at the 0.05 significance level = 5.99, this factor is not significant.
• Factor AB: F-value = MS(AB)/MSE = 28/7 = 4. Since Fc(2, 6) at the 0.05 significance level = 5.14, this factor is not significant.
These results are also assembled in Table 6.8 for easier comprehension.

Table 6.8 Results of the ANOVA analysis for Example 6.4.1

| Source | Sum of squares | d.f. | Mean square | F-ratio | P-value |
|---|---|---|---|---|---|
| Main effects | | | | | |
| A: Factor A | 288.0 | 2 | 144.0 | 20.57 | 0.0021 |
| B: Factor B | 12.0 | 1 | 12.0 | 1.71 | 0.2383 |
| Interactions | | | | | |
| AB | 56.0 | 2 | 28.0 | 4.00 | 0.0787 |
| Residual | 42.0 | 6 | 7.0 | | |
| Total (corrected) | 398.0 | 11 | | | |

All F-ratios are based on the residual mean square error

The use of Eq. 6.3 can also be illustrated in terms of this example. The main effects of A and B are given by the differences between the level averages and the grand average = 20 (see Table 6.5):

$$\alpha_1 = (14 - 20) = -6; \quad \alpha_2 = (20 - 20) = 0; \quad \alpha_3 = (26 - 20) = 6;$$
$$\beta_1 = (21 - 20) = 1; \quad \beta_2 = (19 - 20) = -1;$$

and those of the interaction terms by (refer to Table 6.6):

$$(\alpha\beta)_{11} = 12 - (20 - 6 + 1) = -3; \quad (\alpha\beta)_{21} = 22 - (20 + 0 + 1) = 1; \quad (\alpha\beta)_{31} = 29 - (20 + 6 + 1) = 2;$$
$$(\alpha\beta)_{12} = 16 - (20 - 6 - 1) = 3; \quad (\alpha\beta)_{22} = 18 - (20 + 0 - 1) = -1; \quad (\alpha\beta)_{32} = 23 - (20 + 6 - 1) = -2;$$

Finally, following Eq. 6.3, the random effects model can be expressed as:

$$y_{ij} = 20 + \{-6, 0, 6\}_i + \{1, -1\}_j + \{-3, 1, 2, 3, -1, -2\}_{ij} \quad \text{with } i = 1, 2, 3 \text{ and } j = 1, 2 \tag{6.4a}$$

where the six interaction values are listed in the order (1,1), (2,1), (3,1), (1,2), (2,2), (3,2). For example, the cell corresponding to (A1, B1) has a mean value of 12, which is predicted by the above model as $y_{11} = 20 - 6 + 1 - 3 = 12$, and so on. Finally, the prediction error of the model has a variance $\sigma^2$ = MSE = 7.
Recasting the above model (Eq. 6.4a) as a regression model with indicator variables may be insightful (though cumbersome) to those more familiar with regression analysis methods:

$$y_{ij} = 20 + (-6) I_1 + (0) I_2 + (6) I_3 + (1) J_1 + (-1) J_2 + (-3) I_1 J_1 + (3) I_1 J_2 + (1) I_2 J_1 + (-1) I_2 J_2 + (2) I_3 J_1 + (-2) I_3 J_2 \tag{6.4b}$$

where $I_i$ and $J_j$ are indicator variables corresponding to the α and β terms.
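The effects model of Eq. 6.3 can be rebuilt from the cell averages of Table 6.6 with the short sketch below. The variable names and the dictionary layout are illustrative assumptions, and the calculation relies on the design being balanced, so that the grand average equals the mean of the cell averages.

```python
# Cell averages from Table 6.6, keyed by (level of A, level of B)
cell_mean = {(1, 1): 12, (2, 1): 22, (3, 1): 29,
             (1, 2): 16, (2, 2): 18, (3, 2): 23}
a_levels = sorted({i for i, _ in cell_mean})
b_levels = sorted({j for _, j in cell_mean})

# For a balanced design the grand average is the mean of the cell averages
grand = sum(cell_mean.values()) / len(cell_mean)

# Main effects: level average minus grand average
alpha = {i: sum(cell_mean[i, j] for j in b_levels) / len(b_levels) - grand
         for i in a_levels}
beta = {j: sum(cell_mean[i, j] for i in a_levels) / len(a_levels) - grand
        for j in b_levels}

# Interaction: what remains of each cell average after the additive part
interaction = {(i, j): cell_mean[i, j] - (grand + alpha[i] + beta[j])
               for i in a_levels for j in b_levels}


def predict(i, j):
    """Cell mean from the indicator-variable form of the model: an indicator
    is 1 only for the level actually present, so one term per group survives."""
    total = grand
    for level in a_levels:
        total += alpha[level] * (1 if level == i else 0)
    for level in b_levels:
        total += beta[level] * (1 if level == j else 0)
    for (ia, jb), value in interaction.items():
        total += value * (1 if (ia, jb) == (i, j) else 0)
    return total


print(grand, alpha, beta)
print(interaction)
print(predict(1, 1), predict(1, 2))
```

Because the model has as many parameters as there are cells, `predict(i, j)` reproduces every cell average exactly; the residual variance comes only from the scatter of the replicates within each cell.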
6.4.2
$2^k$ Factorial Designs
The above treatment of full factorial designs can lead to a prohibitive number of runs when numerous levels need to be considered. As pointed out by Box et al. (1978), it is wise to design a DOE investigation in stages, with each successive iteration providing incremental insight into influential factors, type of interaction, etc. while suggesting subsequent investigations. Factorial designs, primarily $2^k$ and $3^k$, are of great value at the early stages of an investigation, where many possible factors are investigated with the intention of either narrowing down the number (screening) or getting a preliminary understanding of the mathematical relationship between factors and the response variable. These are, thus, viewed as a logical lead-in to the response surface method discussed in Sect. 6.6. The associated mathematics and interpretation of $2^k$ designs are simple and can provide insights into the framing of more sophisticated and complete experimental designs called sequential designs which allow for more precise parameter estimation (a practical example is given in Sect. 10.4.2). They are popular in R&D of products and processes and are used extensively. They can also be used during computer simulation experiments; see, for example, Hou et al. 1996 for evaluating the performance of building energy systems (discussed in Sect. 6.7.4).
Figure 6.4 depicts the full factorial design for three factors at two levels ($2^3$ design), which involves 8 experiments mapped to each of the corners of the cube. The three factors (A, B, C) are coded as 1 or -1 to denote the high- and low-level settings of the factors. Table 6.9 depicts a quick and easy way of setting up a two-level three-factor design (following the standard form suggested by Yates). Notice that the last but one column has four (-) followed by four (+), the last but two column has successive pairs of (-) and (+), and the second column has alternating (-) and (+). The Yates algorithm is easily extended to a higher number of factors. However, the sequence in which the runs are to be performed should be randomized; a good way is to simply sample the set of trials {1, . . ., 8} in a random fashion without replacement.
The approach can be modified to treat the case of parameter interaction. Table 6.9 is simply modified by including separate columns for the interaction terms, as shown in Table 6.10. The product of any two columns of the factors yields a column for the effect of the interaction term (e.g., the interaction of A and B is denoted by AB). The appropriate sign for the interactions is determined by multiplying the signs of each of the two corresponding terms. For example, AB for trial 1 would be coded as (-)(-) = (+), and so on (see footnote 1). Note that every column has an equal number of (-) and (+) signs. The orthogonality property (discussed in the next section) relates to the fact that the sum of the product of the signs in any two columns is zero.
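Table 6.9 The standard form (suggested by Yates) for setting up the two-level three-factor (or $2^3$) design

| Trial | A | B | C | Response |
|---|---|---|---|---|
| 1 | - | - | - | y1 |
| 2 | + | - | - | y2 |
| 3 | - | + | - | y3 |
| 4 | + | + | - | y4 |
| 5 | - | - | + | y5 |
| 6 | + | - | + | y6 |
| 7 | - | + | + | y7 |
| 8 | + | + | + | y8 |

Table 6.10 The standard form of the two-level three-factor (or $2^3$) design with interactions

| Trial | A | B | C | AB | AC | BC | ABC | Response |
|---|---|---|---|---|---|---|---|---|
| 1 | - | - | - | + | + | + | - | y1 |
| 2 | + | - | - | - | - | + | + | y2 |
| 3 | - | + | - | - | + | - | + | y3 |
| 4 | + | + | - | + | - | - | - | y4 |
| 5 | - | - | + | + | - | - | + | y5 |
| 6 | + | - | + | - | + | - | - | y6 |
| 7 | - | + | + | - | - | + | - | y7 |
| 8 | + | + | + | + | + | + | + | y8 |

Footnote 1: The statistical basis of this simple coding process is given by Box et al. (1978).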
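The sign tables above can be generated programmatically. The sketch below is only an illustration: the function names `yates_design` and `with_interactions` are hypothetical, levels are coded as -1 and +1, and the seed used for the randomized run order is arbitrary.

```python
import random
from itertools import combinations
from math import prod


def yates_design(k):
    """Rows of a 2**k design in standard (Yates) order, coded -1/+1.
    The first factor alternates fastest and the last factor slowest."""
    return [[+1 if (run >> f) & 1 else -1 for f in range(k)]
            for run in range(2 ** k)]


def with_interactions(design, names):
    """One column per main effect and interaction; an interaction column is
    the element-wise product of the columns of its parent factors."""
    k = len(names)
    columns = {}
    for order in range(1, k + 1):
        for idx in combinations(range(k), order):
            label = "".join(names[i] for i in idx)
            columns[label] = [prod(row[i] for i in idx) for row in design]
    return columns


cols = with_interactions(yates_design(3), "ABC")

# Trials (numbered from 1) at which the AB column is (+)
ab_plus = [t + 1 for t, sign in enumerate(cols["AB"]) if sign > 0]

# Orthogonality check: the sum of products of signs of any two columns is zero
orthogonal = all(sum(u * v for u, v in zip(cols[p], cols[q])) == 0
                 for p, q in combinations(cols, 2))

# Randomized run order: sample the trial numbers without replacement
run_order = random.Random(0).sample(range(1, 9), k=8)

print(ab_plus, orthogonal, run_order)
```

The same two functions extend to more factors by changing `k` and the string of factor names, which mirrors the remark that the Yates layout is easily extended.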
The main effect of, say, factor C can be determined simply as:

$$\text{Main effect of } C = \bar{C}_+ - \bar{C}_- = \frac{y_5 + y_6 + y_7 + y_8}{4} - \frac{y_1 + y_2 + y_3 + y_4}{4} \tag{6.5}$$

where the overbar indicates the average value. Statistical textbooks on DOE provide elaborate details of how to obtain estimates of all main and interaction effects when more factors are to be considered, and then how to use statistical procedures such as ANOVA to identify the significant ones.
The standard form shown in Table 6.10 can be rewritten as in Table 6.11 for the $2^3$ design with interactions by expanding each of the interaction columns into their (+) and (-) columns respectively. For example, AB in Table 6.10 is (+) for trials 1, 4, 5, and 8 and these are listed under the AB+ column in Table 6.11. This is referred to as the response table form and is advantageous in that it allows the analysis to be done in a clear and modular manner.

Table 6.11 Response table representation for the $2^3$ design with interactions (omitting the ABC term) generated by expanding Table 6.10

| Trial | Resp. | A+ | A- | B+ | B- | C+ | C- | AB+ | AB- | AC+ | AC- | BC+ | BC- |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | y1 | | y1 | | y1 | | y1 | y1 | | y1 | | y1 | |
| 2 | y2 | y2 | | | y2 | | y2 | | y2 | | y2 | y2 | |
| 3 | y3 | | y3 | y3 | | | y3 | | y3 | y3 | | | y3 |
| 4 | y4 | y4 | | y4 | | | y4 | y4 | | | y4 | | y4 |
| 5 | y5 | | y5 | | y5 | y5 | | y5 | | | y5 | | y5 |
| 6 | y6 | y6 | | | y6 | y6 | | | y6 | y6 | | | y6 |
| 7 | y7 | | y7 | y7 | | y7 | | | y7 | | y7 | y7 | |
| 8 | y8 | y8 | | y8 | | y8 | | y8 | | y8 | | y8 | |
| Avg | | $\bar{A}_+$ | $\bar{A}_-$ | $\bar{B}_+$ | $\bar{B}_-$ | $\bar{C}_+$ | $\bar{C}_-$ | $\overline{AB}_+$ | $\overline{AB}_-$ | $\overline{AC}_+$ | $\overline{AC}_-$ | $\overline{BC}_+$ | $\overline{BC}_-$ |
| Effect | | $(\bar{A}_+ - \bar{A}_-)$ | | $(\bar{B}_+ - \bar{B}_-)$ | | $(\bar{C}_+ - \bar{C}_-)$ | | $(\overline{AB}_+ - \overline{AB}_-)$ | | $(\overline{AC}_+ - \overline{AC}_-)$ | | $(\overline{BC}_+ - \overline{BC}_-)$ | |

The Resp. column holds all 8 responses, and each signed column holds 4 of them; the sum of each signed column divided by 4 gives its average.

The physical interpretation of the measure of interaction AB is that it is the difference between the average change in the response with factor A and that of factor B. Similarly, AB+ denotes the average effect of A on the response variable when B is held fixed (or blocked) at the B+ level. On the other hand, AB- denotes the effect of A when B is held fixed at the lower level B-. The main effect of A is simply:

$$\text{Main effect of } A = \bar{A}_+ - \bar{A}_- = \frac{1}{4}\left[ (y_2 + y_4 + y_6 + y_8) - (y_1 + y_3 + y_5 + y_7) \right] \tag{6.6a}$$

Similarly, the interaction effect of, say, BC can be determined as the average of the B effect when C is held constant at +1 minus the B effect when C is held constant at -1.

$$\text{Interaction effect of } BC = \overline{BC}_+ - \overline{BC}_- = \frac{1}{4}\left[ (y_1 + y_2 + y_7 + y_8) - (y_3 + y_4 + y_5 + y_6) \right] \tag{6.6b}$$

Thus, the individual and interaction effects directly provide a prediction model of the form:

$$y = b_0 + \underbrace{b_1 A + b_2 B + b_3 C}_{\text{Main effects}} + \underbrace{b_{12} AB + b_{13} AC + b_{23} BC + b_{123} ABC}_{\text{Interaction terms}} \tag{6.7a}$$

The intercept term is given by the grand average of all the response values y. This model is analogous to Eq. 5.25 which is one form of the additive multiple linear models discussed in Chap. 5. Note that Eq. 6.7a has eight parameters, and with eight experimental runs the model fit will be perfect with no residual variance. A measure of the random error can only be deduced if the degrees of freedom (d.f.) > 0, and so replication (i.e., repeats of runs) is necessary. Another option, relevant when interaction effects are known to be negligible, is to adopt a model which includes main effects only:

$$y = b_0 + b_1 A + b_2 B + b_3 C \tag{6.7b}$$
In this case, d.f. = 4, and so a measure of random error of the model can be determined.
Example 6.4.2 Deducing a prediction model for a $2^3$ factorial design
Consider a problem where three factors {A, B, C} are presumed to influence a response variable y. The problem is to specify a DOE design, collect data, ascertain the statistical importance of the factors, and then identify a prediction model. The numerical values of the factors or regressors corresponding to the high and low levels are assembled in Table 6.12. It was decided to use two replicate tests for each of the 8 combinations to enhance accuracy. Thus, 16 runs were performed, and the results are tabulated in the standard form as suggested by Yates (Table 6.9) and shown in Table 6.13.
(a) Identify statistically significant terms
This tabular data can be used to create a table similar to Table 6.11, which is left to the reader. Then, the main effects and interaction terms can be calculated following Eq. 6.6a and Eq. 6.6b.
Main effect of factor A:

$$\frac{1}{(2)(4)}\left[ (26 + 29) + (21 + 22) + (23 + 22) + (18 + 18) - (34 + 40) - (33 + 35) - (24 + 23) - (19 + 18) \right] = -\frac{47}{8} = -5.875$$

while the effect sum of squares SSA = $(-47.0)^2/16$ = 138.063.

Table 6.12 Assumed low and high levels for the three factors (Example 6.4.2)

| Factor | Low level | High level |
|---|---|---|
| A | 0.9 | 1.1 |
| B | 1.20 | 1.30 |
| C | 20 | 30 |

Table 6.13 Standard table (Example 6.4.2)

| Trial | A | B | C | Responses (two replicates) |
|---|---|---|---|---|
| 1 | 0.9 | 1.2 | 20 | 34, 40 |
| 2 | 1.1 | 1.2 | 20 | 26, 29 |
| 3 | 0.9 | 1.3 | 20 | 33, 35 |
| 4 | 1.1 | 1.3 | 20 | 21, 22 |
| 5 | 0.9 | 1.2 | 30 | 24, 23 |
| 6 | 1.1 | 1.2 | 30 | 23, 22 |
| 7 | 0.9 | 1.3 | 30 | 19, 18 |
| 8 | 1.1 | 1.3 | 30 | 18, 18 |

Data available electronically on book website
Similarly, the main effects of B = -4.625 and C = -9.375, while interaction effects AB = -0.625, AC = 5.125, BC = -0.125, and ABC = 0.875. The results of the ANOVA analysis are assembled in Table 6.14. One concludes that the main effects A, B, and C and the interaction effect AC are significant at the 0.01 level. The main effect and interaction effect plots are shown in Figs. 6.9 and 6.10. These plots do suggest that interaction effects are present only for factors A and C since the lines are clearly not parallel.
(b) Identify prediction model
Only four terms, namely A, B, C, and the AC interaction, are found to be statistically significant at the 0.05 level (see Table 6.14). In such a case, the functional form of the prediction model reduces to:

$$y = b_0 + b_1 x_A + b_2 x_B + b_3 x_C + b_4 x_A x_C \tag{6.8a}$$

Substituting the values of the effect estimates determined earlier results in

$$y = 25.313 - 2.938 x_A - 2.313 x_B - 4.688 x_C + 2.563 x_A x_C \tag{6.8b}$$

where coefficient $b_0$ is the mean of all observations. Also, note that the values of the model coefficients are half the values of the main and interaction effects determined in part (a). For example, the main effect of factor A was calculated to be (-5.875), which is twice the (-2.938) coefficient for the $x_A$ factor shown in the equation above. The division by 2 is needed because of the way the factors were coded, i.e., the high and low levels, coded as +1 and -1, are separated by 2 units. The performance equation thus determined can be used for predictions. For example, when $x_A$ = +1, $x_B$ = -1, $x_C$ = -1, one gets y = 26.813, which agrees reasonably well with the average of the two replicates performed (26 and 29) despite dropping the non-significant interaction terms.
Table 6.14 Results of the ANOVA analysis

| Source | Sum of squares | D.f. | Mean square | F-ratio | p-value |
|---|---|---|---|---|---|
| Main effects | | | | | |
| Factor A | 138.063 | 1 | 138.063 | 41.68 | 0.0002 |
| Factor B | 85.5625 | 1 | 85.5625 | 25.83 | 0.0010 |
| Factor C | 351.563 | 1 | 351.563 | 106.13 | 0.0000 |
| Interactions | | | | | |
| AB | 1.5625 | 1 | 1.5625 | 0.47 | 0.5116 |
| AC | 105.063 | 1 | 105.063 | 31.72 | 0.0005 |
| BC | 0.0625 | 1 | 0.0625 | 0.02 | 0.8941 |
| ABC | 3.063 | 1 | 3.063 | 0.92 | 0.3640 |
| Residual or error | 26.5 | 8 | 3.3125 | | |
| Total (corrected) | 711.438 | 15 | | | |

Interaction effects AB, BC, and ABC are not significant (Example 6.4.2a)
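The effect estimates and sums of squares of this example can be recomputed from Table 6.13 using the contrast of each signed column. The sketch below is illustrative only: the variable names and the `predict` helper are assumptions, the factor levels are entered in coded (-1/+1) form, and p-values are not computed so that only the standard library is needed.

```python
from math import prod

# Table 6.13 in standard order: coded levels (A, B, C) and the two replicates
runs = [
    ((-1, -1, -1), (34, 40)),
    ((+1, -1, -1), (26, 29)),
    ((-1, +1, -1), (33, 35)),
    ((+1, +1, -1), (21, 22)),
    ((-1, -1, +1), (24, 23)),
    ((+1, -1, +1), (23, 22)),
    ((-1, +1, +1), (19, 18)),
    ((+1, +1, +1), (18, 18)),
]
terms = {"A": (0,), "B": (1,), "C": (2,),
         "AB": (0, 1), "AC": (0, 2), "BC": (1, 2), "ABC": (0, 1, 2)}

n_obs = sum(len(reps) for _, reps in runs)              # 16 observations
grand_mean = sum(sum(reps) for _, reps in runs) / n_obs

effects, sum_sq = {}, {}
for name, idx in terms.items():
    # Contrast: signed sum of all responses for this column
    contrast = sum(prod(x[i] for i in idx) * sum(reps) for x, reps in runs)
    effects[name] = contrast / (n_obs / 2)   # average at (+) minus average at (-)
    sum_sq[name] = contrast ** 2 / n_obs     # effect sum of squares

# Residual sum of squares from the scatter of replicates about each cell mean
sse = sum((y - sum(reps) / len(reps)) ** 2 for _, reps in runs for y in reps)
mse = sse / (n_obs - len(runs))              # 16 - 8 = 8 degrees of freedom
f_ratio = {name: ss / mse for name, ss in sum_sq.items()}


def predict(xa, xb, xc):
    """Reduced model with A, B, C and AC; each coefficient is half the effect."""
    return (grand_mean + effects["A"] / 2 * xa + effects["B"] / 2 * xb
            + effects["C"] / 2 * xc + effects["AC"] / 2 * xa * xc)


print(effects)
print(sum_sq, sse)
print(f_ratio)
print(predict(+1, -1, -1))
```

The division by `n_obs / 2` reflects that half of the 16 observations sit at the (+) level and half at the (-) level of every column, which is the replicated counterpart of the factor 1/4 in Eqs. 6.6a and 6.6b.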
Fig. 6.9 Main effect scatter plots for the three factors for Example 6.4.2
Fig. 6.10 Interaction plots for Example 6.4.2
(c) Comparison with linear multiple regression approach
The parallel between this approach and regression modeling involving indicator variables was discussed previously (Sects. 5.7.3 and 6.2.2). For example, if one were to fit a multiple linear regression model to the above data with the three regressors coded as -1 and +1 for low and high values, respectively, one would obtain the results shown in Table 6.15.
Note that the same four variables (A, B, C, and the AC interaction) are statistically significant, while the model coefficients are identical to the ones determined by the ANOVA analysis. If the regression were to be redone with only these four variables present, the model coefficients would remain unchanged. This is a great advantage of factorial designs in that one could include additional variables incrementally in the model without impacting the model coefficients of variables already identified. Why this is so is explained in Sect. 6.4.3.

Table 6.15 Results of performing a multiple linear regression to the same data with regressors coded as +1 and -1 (Example 6.4.2c)

Parameter                    Parameter estimate   Standard error   t-statistic   p-value
Constant                     25.3125              0.455007         55.631        0.0000
Factor A                     -2.9375              0.455007         -6.45595      0.0002
Factor B                     -2.3125              0.455007         -5.08234      0.0010
Factor C                     -4.6875              0.455007         -10.302       0.0000
Factor A*Factor B            -0.3125              0.455007         -0.686803     0.5116
Factor A*Factor C            2.5625               0.455007         5.63178       0.0005
Factor B*Factor C            -0.0625              0.455007         -0.137361     0.8941
Factor A*Factor B*Factor C   0.4375               0.455007         0.961524      0.3644

Table 6.16 Goodness-of-fit statistics of different multiple linear regression models (Example 6.4.2)

Regression model                   Model R²   Adjusted R²   RMSE
With all terms                     0.963      0.930         1.820
With only four significant terms   0.956      0.940         1.684

Table 6.16 assembles pertinent goodness-of-fit indices for the complete model and for the model with the four significant regressors only. Note that while the R² value of the former is higher (a misleading statistic to consider when dealing with multivariate model building), the Adj-R² and the RMSE of the reduced model are superior. Finally, Figs. 6.11 and 6.12 show the observed versus predicted values and the model residuals versus predicted values, respectively, which allow one to ascertain how well the model has fared; in this case, there seems to be a larger scatter at higher values, indicative of non-additive errors. This suggests that a linear additive model may not be the best choice, and the analyst may undertake further refinements if time permits.

Fig. 6.11 Observed versus predicted values for the regression model indicate larger scatter at high values (Example 6.4.2)
Fig. 6.12 Model residuals versus model predicted values highlight the larger scatter present at higher values indicative of non-additive errors (Example 6.4.2)

In summary, DOE involves the complete reasoning process of defining the structural framework, i.e., prescribing the exact manner in which samples for testing need to be selected, and the conditions and sequence under which the testing needs to be performed under specific restrictions imposed by space, time, and nature of the process (Mandel 1964). The applications of DOE have expanded to the area of model building as well. It is now used to identify which subsets among several possible variables influence the response variable, and to determine a quantitative relationship between them.
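The link between Tables 6.14, 6.15, and 6.16 can be checked numerically. The following Python sketch is only an illustration (the variable and function names are hypothetical, and the raw data of Example 6.4.2 are not used). It assumes N = 16 runs with regressors coded as -1/+1, in which case the sum of squares of each term equals N times the square of its coefficient, and it takes the residual and total sums of squares from Table 6.14.

```python
import math

N = 16                      # number of runs (total d.f. = 15 in Table 6.14)
SS_TOTAL = 711.438          # total (corrected) sum of squares, Table 6.14
SS_RESID = 26.5             # residual sum of squares of the full model (8 d.f.)

# Coded-variable regression coefficients from Table 6.15
coef = {"A": -2.9375, "B": -2.3125, "C": -4.6875,
        "AB": -0.3125, "AC": 2.5625, "BC": -0.0625, "ABC": 0.4375}

# With orthogonal +/-1 columns, each term contributes SS = N * b^2 (1 d.f.)
ss = {term: N * b ** 2 for term, b in coef.items()}
mse_full = SS_RESID / 8
f_ratio = {term: s / mse_full for term, s in ss.items()}   # equals t-statistic squared
std_err = math.sqrt(mse_full / N)                           # identical for every coefficient


def fit_stats(dropped_terms, n_params):
    """R2, adjusted R2, and RMSE when the dropped terms are pooled into the error."""
    sse = SS_RESID + sum(ss[t] for t in dropped_terms)
    dof = N - n_params
    r2 = 1 - sse / SS_TOTAL
    adj_r2 = 1 - (sse / dof) / (SS_TOTAL / (N - 1))
    return r2, adj_r2, math.sqrt(sse / dof)


full = fit_stats([], n_params=8)
reduced = fit_stats(["AB", "BC", "ABC"], n_params=5)
print("full model   :", [round(v, 3) for v in full])
print("reduced model:", [round(v, 3) for v in reduced])
```

Because the columns are orthogonal, dropping the three non-significant interactions simply moves their sums of squares into the error term; the retained coefficients do not need to be re-estimated.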
6.4.3 Concept of Orthogonality
An important concept in DOE is orthogonality, by which it is implied that trials should be framed such that the data matrix X (see footnote 2) results in a diagonal matrix (X^T X), where X^T is the transpose of X (see footnote 3). In such a case, the off-diagonal terms of the matrix (X^T X) will be zero, i.e., the regressors are uncorrelated. This would lead to the best designs since it would minimize the variance of the regression coefficients. For example, consider Table 6.10 where the standard form for the two-level three-factor design is shown. Replacing low and high values (i.e., - and +) by -1 and +1, and noting that an extra column of 1 needs to be introduced to take care of the constant term in the model (see Eq. 5.31), results in the regressor matrix being defined by:

$$
X = \begin{bmatrix}
1 & -1 & -1 & -1 \\
1 & +1 & -1 & -1 \\
1 & -1 & +1 & -1 \\
1 & +1 & +1 & -1 \\
1 & -1 & -1 & +1 \\
1 & +1 & -1 & +1 \\
1 & -1 & +1 & +1 \\
1 & +1 & +1 & +1
\end{bmatrix}
\tag{6.9}
$$

Footnote 2: Refer to Sect. 5.4.2 for a refresher.
Footnote 3: Recall from basic geometry that two straight lines are perpendicular when the product of their slopes is equal to -1. Orthogonality is an extension of this concept to multiple dimensions.

The reader can verify that the off-diagonal terms of the matrix (X^T X) are indeed zero. All n^k factorial designs are thus orthogonal, i.e., (X^T X)^-1 is a diagonal matrix with nonzero diagonal components. This leads to the soundest parameter estimation (as discussed in Sect. 9.2.3). Another benefit of orthogonal designs is that parameters of regressors already identified remain unchanged as additional regressors are added to the model, thereby allowing the model to be developed incrementally. Thus, the effect of each term of the model can be examined independently. These are two great benefits when factorial designs are adopted for model identification.

Example 6.4.3 (see footnote 4) Matrix approach to inferring a prediction model for a 2^3 design
This example will illustrate the analysis procedure for a complete 2^k factorial design with three factors. The model, assuming a linear form, is given by Eq. 5.28 and includes main and interaction effects. Denoting the three factors by x1, x2, and x3, the regressor matrix X will have four main effect parameters (the intercept term is the first column) as well as the four interaction terms as shown in Fig. 6.13. For example, the 6th column is the product of the 2nd and 3rd columns and so on. Let us assume that a DOE has yielded the following eight values for the response variable:

$$
Y^T = \begin{bmatrix} 49 & 62 & 44 & 58 & 42 & 73 & 35 & 69 \end{bmatrix}
\tag{6.10}
$$

Footnote 4: From Beck and Arnold (1977) by permission of Beck.

Fig. 6.13 The coded regression matrix X with four main and four interaction parameters with eight experiments

            Main effects              Interaction effects
Intercept   x1    x2    x3    x1x2   x1x3   x2x3   x1x2x3
 1          -1    -1    -1     1      1      1     -1
 1           1    -1    -1    -1     -1      1      1
 1          -1     1    -1    -1      1     -1      1
 1           1     1    -1     1     -1     -1     -1
 1          -1    -1     1     1     -1     -1      1
 1           1    -1     1    -1      1     -1     -1
 1          -1     1     1    -1     -1      1     -1
 1           1     1     1     1      1      1      1

The intention is to identify a parsimonious model, i.e., one in which only the statistically significant terms are retained in the model given by Eq. 5.28. The inverse of (X^T X) is (1/8) I, and the (X^T Y) terms can be deduced by taking the sums of the y_i terms multiplied by either (+1) or (-1) as indicated in X^T. The coefficient b0 = 54 (average of all eight values of y), while b1, following Eq. 6.5, is b1 = [(62 + 58 + 73 + 69) - (49 + 44 + 42 + 35)]/(4 × 2) = 11.5, and so on. The resulting model is:

$$
y_i = 54 + 11.5\,x_{1i} - 2.5\,x_{2i} + 0.75\,x_{3i} + 0.5\,x_{1i}x_{2i} + 4.75\,x_{1i}x_{3i} - 0.25\,x_{2i}x_{3i} + 0.25\,x_{1i}x_{2i}x_{3i}
\tag{6.11}
$$
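The matrix calculation of Example 6.4.3 can be reproduced with a few lines of Python. The sketch below is illustrative only: it assumes the eight runs are listed in standard order (x1 varying fastest), builds the coded regressor matrix, checks that (X^T X) equals 8 I, and then obtains the coefficients as X^T Y / 8. The variable names are not from the text.

```python
from itertools import product

# Response vector of Eq. 6.10 (runs in standard order, x1 varying fastest)
y = [49, 62, 44, 58, 42, 73, 35, 69]

# Build the coded regressor matrix, one row per run:
# intercept, x1, x2, x3, x1x2, x1x3, x2x3, x1x2x3
X = []
for x3, x2, x1 in product((-1, 1), repeat=3):
    X.append([1, x1, x2, x3, x1 * x2, x1 * x3, x2 * x3, x1 * x2 * x3])

n, p = len(X), len(X[0])

# X^T X: orthogonality means every off-diagonal entry vanishes
XtX = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(p)] for i in range(p)]
is_orthogonal = all(XtX[i][j] == (n if i == j else 0) for i in range(p) for j in range(p))

# Since (X^T X)^-1 = I/8, the least-squares solution reduces to b = X^T y / 8
b = [sum(X[r][j] * y[r] for r in range(n)) / n for j in range(p)]

names = ["b0", "x1", "x2", "x3", "x1x2", "x1x3", "x2x3", "x1x2x3"]
print("orthogonal design:", is_orthogonal)
print(dict(zip(names, b)))
```

No matrix inversion is needed: each coefficient is an independent signed average of the responses, which is why terms can later be added or dropped without changing the others.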
With eight parameters and eight observations (and no replication), the model will fit the data perfectly with zero degrees of freedom left for the error; this is referred to as a saturated model. This is not a prudent situation since the model variance cannot be computed, nor can the p-values of the various terms be inferred. Had a replicated design been adopted, the model variance could have been conveniently estimated and some measure of the goodness-of-fit of the model deduced (as in Example 6.4.1). In this case, the simplest recourse is to drop one of the terms from the model (say, the (x1x2x3) interaction term) and then perform the ANOVA analysis.
Table 6.17 Results of the ANOVA analysis (Example 6.4.3)

Source               Sum of squares   D.f.   Mean square   F-ratio   p-value
Main effects
  Factor x1          1058             1      1058          2116      0.0138
  Factor x2          50.0             1      50.0          100       0.0635
  Factor x3          4.50             1      4.50          9.00      0.2050
Interactions
  x1x2               2.00             1      2.00          4.00      0.2950
  x1x3               180.5            1      180.5         361       0.0335
  x2x3               0.50             1      0.50          1.00      0.5000
Residual or error    0.50             1      0.50
Total (corrected)    1296             7
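The entries of Table 6.17 follow directly from the coefficients of Eq. 6.11. The Python sketch below is an illustration under two stated assumptions: the sum of squares of each term is taken as 8 b² (valid for orthogonal -1/+1 columns), and the dropped (x1x2x3) term supplies the single residual degree of freedom. The p-values use the closed form for an F(1, 1) variate, which is the square of a Student-t variate with one degree of freedom; they agree with the tabulated values to about three decimal places. Function and variable names are hypothetical.

```python
import math

n_runs = 8
# Coefficients of Eq. 6.11 (coded variables)
coef = {"x1": 11.5, "x2": -2.5, "x3": 0.75,
        "x1x2": 0.5, "x1x3": 4.75, "x2x3": -0.25, "x1x2x3": 0.25}

# Orthogonality: each term has 1 d.f. and SS = n_runs * b^2
ss = {term: n_runs * b ** 2 for term, b in coef.items()}

# The dropped three-factor interaction serves as the residual (1 d.f.)
ss_error = ss.pop("x1x2x3")
ms_error = ss_error / 1


def p_value_f_1_1(f):
    """Upper-tail probability of an F(1, 1) variate.

    F(1, 1) is the square of a Student-t variate with 1 d.f. (a Cauchy
    variate), so P(F > f) = 1 - (2/pi) * atan(sqrt(f)).
    """
    return 1 - (2 / math.pi) * math.atan(math.sqrt(f))


# term -> (sum of squares, F-ratio, p-value)
table = {t: (s, s / ms_error, p_value_f_1_1(s / ms_error)) for t, s in ss.items()}
for term, (s, f, p) in table.items():
    print(f"{term:5s} SS={s:8.2f}  F={f:8.2f}  p={p:.4f}")
print("total SS =", sum(ss.values()) + ss_error)
```

With only one residual degree of freedom the F-test has very little power, which is one reason why replication is recommended in the text.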
Because of the orthogonal behavior, the significance of the dropped term can be evaluated at a later stage without affecting the model terms already identified. The effect of individual terms is now investigated in a manner similar to the previous example. The ANOVA analysis shown in Table 6.17 suggests that only the terms x1 and (x1x3) are statistically significant at the 0.05 level. However, the p-value for x2 is close, and so it would be advisable to keep this term. Then, the parsimonious model is directly stated as:

$$
y_i = 54 + 11.5\,x_{1i} - 2.5\,x_{2i} + 4.75\,x_{1i}x_{3i}
\tag{6.12}
$$

The above example illustrates how data gathered within a DOE design and analyzed following the ANOVA method can yield an efficient functional predictive model of the data. It is left to the reader to repeat the analysis illustrated in Example 6.4.1 where an identical model was obtained by straightforward use of multiple linear regression. Note that orthogonality is maintained only if the analysis is done with coded variables (-1 and +1), and not with the original ones.
Recall that a 2^2 factorial design implies two regressors or factors, each at two levels, say "low" and "high." Since there are only two states, one can only frame a first-order functional model to the data such as Eq. 6.12. Thus, a 2^2 factorial design is inherently constrained to identifying a first-order linear model between the regressors and the response variable. If the mathematical relationship requires higher-order terms, multilevel factorial designs are required (to identify polynomial models such as Eq. 5.23). For example, the 3^k design will require the range of variation of the factors to be aggregated into three levels, such as "low", "medium", and "high". If the situation is one with three factors (i.e., k = 3), one needs to perform 27 experiments even if no replication tests are considered. This is more than three times the number of tests needed for the 2^3 design. Thus, the additional higher-order insight can only be gained at the expense of a larger number of runs which, for a higher number of factors, may become prohibitive. In such instances, central composite designs are often advisable since they allow second-order effects to be modeled with 2^k designs; this design is discussed in Sect. 6.6.4.

Fig. 6.14 Illustration of the differences between (a) full factorial and (b) fractional factorial design for a 2^3 DOE experiment. Several different combinations of fractional factorial designs are possible; only one such combination is shown
6.4.4 Fractional Factorial Designs
One way of greatly reducing the number of runs, provided interaction effects are known to be negligible, is to adopt fractional factorial designs. The 27 tests needed for a full 3^3 factorial design can be reduced to only 9 tests. Thus, instead of 3^k tests, an incomplete block design would only require 3^(k-1) tests. A graphical interpretation of how a fractional factorial design differs from a full factorial one for a 2^3 instance is illustrated in Fig. 6.14. Three factors are involved (A, B, and C) at two levels each (-1, 1). While 8 test runs are performed corresponding to each of the 8 corners of the cube for the full factorial, only 4 runs are required for the fractional factorial as shown. The Latin squares design, discussed in Sect. 6.5.2, is a type of fractional factorial method. The interested reader can refer to Box et al. (1978) or Montgomery (2017) for a detailed treatment of fractional factorial design methods.
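A half fraction of the 2^3 design can be generated as sketched below in Python. This is an illustration only: the generator C = A·B (defining relation I = ABC) is an assumption made here, and the complementary half (C = -A·B) is an equally valid choice, so the four runs may not coincide with the ones drawn in Fig. 6.14.

```python
from itertools import product

# Full 2^3 factorial: all eight corners of the cube, as (A, B, C) with A varying fastest
full_design = [(a, b, c) for c, b, a in product((-1, 1), repeat=3)]

# Half fraction 2^(3-1): keep only the runs that satisfy C = A*B (assumed generator)
half_fraction = [(a, b, c) for a, b, c in full_design if c == a * b]

print(len(full_design), "runs in the full factorial")
print(len(half_fraction), "runs in the half fraction:", half_fraction)

# The three main-effect columns remain mutually orthogonal in the half fraction
cols = list(zip(*half_fraction))
dot_products = [sum(u * v for u, v in zip(cols[i], cols[j]))
                for i in range(3) for j in range(i + 1, 3)]
print("pairwise dot products of A, B, C columns:", dot_products)
```

The saving in runs has a price: with C = A·B, the main effect of C cannot be distinguished from the AB interaction, which is why the text requires interaction effects to be negligible.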
6.5 Block Designs

6.5.1 Complete Block Design

Complete block design pertains to the instance when one wishes to investigate the effect of only one primary or treatment factor/variable on the response when other secondary factors (or nuisance variables) are present. The effect of these nuisance variables is minimized/eliminated by blocking. The following example serves to illustrate the concept of randomized complete block design with one nuisance factor (see footnote 5).

Footnote 5: Since two factors are involved, such problems are often referred to as two-way ANOVA problems.

Example 6.5.1 (see footnote 6) Evaluating performance of four machines while blocking effect of operator dexterity
The performance of four different machines M1, M2, M3, and M4 is to be evaluated in terms of time needed to manufacture a widget. It is decided that the same widget will be made on these machines by six different machinists/operators in a randomized block experiment. The machines are assigned in a random order to each operator. Since dexterity is involved, there will be a difference among the operators in the time needed to machine the widget. Table 6.18 assembles the machining time in minutes after 24 tests have been completed.

Footnote 6: From Walpole et al. (2007) by permission of Pearson Education.

Table 6.18 Machining time (in minutes) for Example 6.5.1

           Operator
Machine    1        2        3        4        5        6        Average
1          42.5     39.3     39.6     39.9     42.9     43.6     41.300
2          39.8     40.1     40.5     42.3     42.5     43.1     41.383
3          40.2     40.5     41.3     43.4     44.9     45.1     42.567
4          41.3     42.2     43.5     44.2     45.9     42.3     43.233
Average    40.950   40.525   41.225   42.450   44.050   43.525   42.121

Data available electronically on book website

Here, the machine type is the primary treatment factor, while the nuisance factor is the operator (the intent of the study could have been the reverse). The effect of this nuisance factor is blocked or controlled by the randomized complete block design where all operators use all four machines. The analysis calls for testing the hypothesis at the 0.05 level of significance that the performance of the machines is identical. Let Factor A correspond to the machine type and B to the operator. Thus a = 4 and b = 6, with replication r = 1. Then, Eq. 6.2a reduces to:

$$ SST = SSA + SSB + SSE $$

where

$$ SSA = 6\left[(41.3 - 42.121)^2 + (41.383 - 42.121)^2 + \ldots\right] = 15.92 $$

$$ SSB = 4\left[(40.95 - 42.121)^2 + (40.525 - 42.121)^2 + \ldots\right] = 42.09 $$

Total variation:

$$ SST = (abr - 1)\cdot \mathrm{stdev}^2 = (23)(1.8865^2) = 81.86 $$

Subsequently,

$$ SSE = 81.86 - 15.92 - 42.09 = 23.84 $$

The ANOVA table can then be generated as depicted in Table 6.19.

Table 6.19 ANOVA table for Example 6.5.1

Source of variation   Sum of squares   Degrees of freedom   Mean square   Computed F statistic   p-value
Machines              15.92            3                    5.31          3.34                   0.048
Operators             42.09            5                    8.42
Error                 23.84            15                   1.59
Total                 81.86            23                   –

The F-statistic = (5.31/1.59) = 3.34 is significant at probability p = 0.048. One would conclude that the performance of the machines cannot be taken to be similar at the 0.05 significance level (this is a close call though and would merit further investigation!). What can one infer about differences in the dexterity of the machinists?
As illustrated earlier, graphical display of data can provide useful diagnostic insights in ANOVA type of problems as well. For example, a simple plotting of the raw observations around each treatment mean can provide a feel for variability between sample means and within samples. Figure 6.15 depicts all the data as well as the mean variation. One notices that there are two unusually different values which stand out, and it may be wise to go back and study the experimental conditions which produced these results. Without these, the interaction effects seem small.
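The sums of squares of Example 6.5.1 can be recomputed from the raw data of Table 6.18. The Python sketch below is only an illustration with hypothetical variable names; it computes the total sum of squares directly from the 24 observations rather than from the standard deviation, and it also evaluates the F-ratio for the operators, which the text leaves as a question for the reader.

```python
# Machining times from Table 6.18: rows = machines 1-4, columns = operators 1-6
times = [
    [42.5, 39.3, 39.6, 39.9, 42.9, 43.6],
    [39.8, 40.1, 40.5, 42.3, 42.5, 43.1],
    [40.2, 40.5, 41.3, 43.4, 44.9, 45.1],
    [41.3, 42.2, 43.5, 44.2, 45.9, 42.3],
]
a, b = len(times), len(times[0])          # a = 4 treatments, b = 6 blocks

grand = sum(map(sum, times)) / (a * b)
machine_means = [sum(row) / b for row in times]
operator_means = [sum(col) / a for col in zip(*times)]

ss_a = b * sum((m - grand) ** 2 for m in machine_means)      # treatments (machines)
ss_b = a * sum((m - grand) ** 2 for m in operator_means)     # blocks (operators)
ss_t = sum((y - grand) ** 2 for row in times for y in row)   # total
ss_e = ss_t - ss_a - ss_b                                    # error

ms_a, ms_b = ss_a / (a - 1), ss_b / (b - 1)
ms_e = ss_e / ((a - 1) * (b - 1))
f_machines, f_operators = ms_a / ms_e, ms_b / ms_e

print(f"SSA={ss_a:.2f} SSB={ss_b:.2f} SSE={ss_e:.2f} SST={ss_t:.2f}")
print(f"F(machines)={f_machines:.2f}  F(operators)={f_operators:.2f}")
```

Small differences in the last digit relative to Table 6.19 can arise from rounding of intermediate values.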
Fig. 6.15 Factor mean plots of the two factors with six levels for the operator variable and four for the machine variable

A random effects model can also be identified. In this case, an additive linear model is perhaps adequate such as:

$$ y_{ij} = \langle y \rangle + \alpha_i + \beta_j + \varepsilon_{ij} \tag{6.13} $$

Inspection of the residuals can provide diagnostic insights regarding violation of normality and non-uniform variance akin to regression analysis. Since model predictions are given by:

$$ \hat{y}_{ij} = \langle y \rangle + \left(A_i - \langle y \rangle\right) + \left(B_j - \langle y \rangle\right) = A_i + B_j - \langle y \rangle \tag{6.14a} $$

the residuals of the (i,j) observation are:

$$ \varepsilon_{ij} \equiv y_{ij} - \hat{y}_{ij} = y_{ij} - \left(A_i + B_j - \langle y \rangle\right), \quad i = 1, \ldots, 4 \ \text{and} \ j = 1, \ldots, 6 \tag{6.14b} $$

Fig. 6.16 Scatter plot of the residuals versus the six operators
Fig. 6.17 Scatter plot of residuals versus predicted values
Fig. 6.18 Normal probability plot of the residuals
Two different residual plots have been generated. Figures 6.16 and 6.17 reveal that the residuals plotted against the operators and against the model predicted values are fairly random except for two large residuals (as noted earlier). Further, a normal probability plot of the model residuals seems to show some departure from normality, and this issue may need further scrutiny (Fig. 6.18). An implicit and important assumption in the above model design is that the treatment and block effects are additive, i.e., that interaction effects are negligible. In the context of Example 6.5.1, it means that if, say, Operator 3 is on average 0.5 min faster than Operator 2 on machine 1, the same difference also holds for machines 2, 3, and 4. This pattern would be akin to that depicted in Fig. 6.3 where the mean responses of different blocks differ by the same amount from one treatment to the next. In many experiments, this assumption of additivity does not hold, and the treatment and block effects interact (as illustrated in Fig. 6.7b). For example, Operator 1 may be faster by 0.5 min on average than Operator 2 when machine 1 is used, but may be slower by, say, 0.3 min on average than Operator 2 when machine 2 is used. In such a case, the operators and the machines are said to be interacting.
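The residuals of the additive model can be computed directly from Table 6.18. The Python sketch below is illustrative only and assumes that A_i and B_j in Eqs. 6.14a and 6.14b denote the mean response of machine i and of operator j, respectively; the names used in the code are hypothetical.

```python
# Machining times from Table 6.18: rows = machines 1-4, columns = operators 1-6
times = [
    [42.5, 39.3, 39.6, 39.9, 42.9, 43.6],
    [39.8, 40.1, 40.5, 42.3, 42.5, 43.1],
    [40.2, 40.5, 41.3, 43.4, 44.9, 45.1],
    [41.3, 42.2, 43.5, 44.2, 45.9, 42.3],
]
a, b = len(times), len(times[0])

grand = sum(map(sum, times)) / (a * b)
A = [sum(row) / b for row in times]            # machine (treatment) means
B = [sum(col) / a for col in zip(*times)]      # operator (block) means

# Residual = observation minus additive prediction, keyed by (machine, operator)
residuals = {
    (i + 1, j + 1): times[i][j] - (A[i] + B[j] - grand)
    for i in range(a) for j in range(b)
}

# Rank cells by absolute residual to flag observations worth re-examining
ranked = sorted(residuals.items(), key=lambda kv: abs(kv[1]), reverse=True)
for (machine, operator), e in ranked[:3]:
    print(f"machine {machine}, operator {operator}: residual = {e:+.2f}")

sse = sum(e ** 2 for e in residuals.values())   # same quantity as SSE in the ANOVA
print(f"sum of squared residuals = {sse:.2f}")
```

Under this assumption, the two largest residuals in absolute value belong to (machine 1, operator 1) and (machine 4, operator 6), and the sum of squared residuals reproduces the error sum of squares of the ANOVA.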
The above treatment of full factorial designs was limited to one nuisance factor. The treatment can be extended to a greater number of factors; the extension is quite straightforward, though the analysis gets messier; see, for example, Box et al. (1978) or Montgomery (2017).

6.5.2 Latin Squares

For the special case when all factors have the same number of levels, the number of experiments necessary for a complete factorial design which includes all main effects and interactions is n^k, where k is the number of factors and n the number of levels. If certain assumptions are made, this number can be reduced considerably (see Fig. 6.14). Such methods are referred to as fractional factorial designs. The Latin squares approach is one such special design meant for problems: (i) involving three factors with one treatment and two noninteracting nuisance factors (i.e., k = 3), (ii) that allow blocking in two directions, i.e., eliminating two sources of nuisance variability, (iii) where the number of levels for each factor is the same, and (iv) where interaction terms among factors are negligible (i.e., the interaction terms (αβ)ij in the statistical effects model given by Eq. 6.3 are dropped).
This results in a significant reduction in the number of experimental runs, especially when several levels are involved. However, replication is advisable to reduce random error, to estimate the experimental error, and to provide more precise estimates of the parameter values. A Latin square for n levels, denoted by (n × n), is a square of n rows and n columns with each of the n^2 cells containing one specific treatment that appears once, and only once, in each row and column. Consider a three-factor experiment at four different levels each. The number of experiments required for the full factorial, i.e., to map out the entire experimental space, would be 4^3 = 64. For incomplete factorials, the number of experiments reduces to 4^2 = 16.
A Latin square is said to be reduced (also, normalized or in standard form) if both its first row and its first column are in their natural order. The standard manner of specifying a (4 × 4) Latin square design with three factors is shown in Table 6.20a. One of the blocked factors is represented by levels (1, 2, 3, 4) and the other by (I, II, III, IV), with the primary or treatment factor by (A, B, C, D). A randomized design (one of several) is shown in Table 6.20b. Note that the Latin square design shown in Table 6.20b is not unique. The number of possible combinations N grows very rapidly with the number of levels n. For n = 3, N = 12; for n = 4, N = 576; and for n = 5, N = 161,280. For 3 factors at 3 levels each, a Latin square design needs only 3^2 = 9 experiments as against the 3^3 = 27 experiments required for the full factorial design. Thus, Latin square designs reduce the required number of experiments from n^3 to n^2 (where n is the number of levels), thereby saving cost and time. In general, the fractional factorial design requires n^(k-1) experiments, while the full factorial requires n^k. A simple way of generating Latin square designs for higher values of n is to write the levels in order in the first row, with the subsequent rows generated by shifting the sequence of levels one space to the left. Then one needs to randomize (and perhaps include replicates as well) to average out the effect of random influences. Table 6.21 assembles the analysis of variance equations for a Latin square design, which will be illustrated in Example 6.5.2. Latin square designs usually have a small number of error degrees of freedom (e.g., 2 for a 3 × 3 and 6 for a 4 × 4 design), which allows a measure of model variance to be deduced.

Table 6.20 A (4 × 4) Latin square design with three factors and four levels. The treatment factor levels are (A, B, C, D) while those of the two nuisance factors are shown as (1, 2, 3, 4) and (I, II, III, IV). Note that each treatment occurs in every row and column. (a) The standard manner of specifying the design called reduced form. (b) One of several possible randomized designs

(a)
       I    II   III  IV
1      A    B    C    D
2      B    C    D    A
3      C    D    A    B
4      D    A    B    C

(b)
       I    II   III  IV
1      A    D    B    C
2      B    C    D    A
3      D    A    C    B
4      C    B    A    D

In summary, while the randomized block design allows blocking of one source of variation, the Latin square design allows systematic blocking of two sources of variability for problems involving three factors (k = 3 with two nuisance factors). The restrictions of this design are that (i) all 3 factors must have the same number of levels, and (ii) no interaction effects are present. The concept, under the same assumptions as those for Latin square design, can be extended to problems
with four factors (k = 4) where three sources of variability need to be blocked; this is done using Graeco-Latin square designs (see Box et al. 1978; Montgomery 2017).

Table 6.21 The analysis of variance equations for (n × n) Latin square design

Source of variation   Sum of squares   Degrees of freedom   Mean square              Computed F statistic
Row                   SSR              n - 1                SSR/(n - 1)              FR = (MSR/MSE)
Column                SSC              n - 1                SSC/(n - 1)              FC = (MSC/MSE)
Treatment             SSTr             n - 1                SSTr/(n - 1)             FTr = (MSTr/MSE)
Error                 SSE              (n - 1)(n - 2)       SSE/[(n - 1)(n - 2)]     –
Total                 SST              n^2 - 1              –                        –

Example 6.5.2 Evaluating impact of air filter type on breathing complaints with school vintage and season being nuisance factors (see footnote 7)
To reduce breathing-related complaints from students, four different types of air cleaning filters (labeled A, B, C, and D, which are the levels of the treatment factor) are being considered for mandatory replacement of existing air filters in all schools in a school district. Since seasonal effects are important, tests are to be performed in each of the four seasons (and corrected for the number of days when the school is in session in each of these seasons). Further, it is decided that tests should be conducted in four schools representative of different vintage (labeled 1 through 4). Because of the potential for differences in the HVAC systems between old and new schools, it is logical to insist that each filter type be tested at each school during each season of the year. It would have been advisable to have replicates, but that would have increased the duration of the testing period from one year to two years, which was not acceptable to the school board.

Footnote 7: Since three factors are involved, such problems are often referred to as three-way ANOVA problems.

(a) Develop a DOE design
This is a three-factor problem with four levels in each. The total number of treatment combinations for a completely randomized design would be 4^3 = 64. Since all three factors have the same number of levels, a Latin square design can be adopted, and the analysis of variance performed using the results of only 16 treatment combinations. One such Latin square is given in Table 6.22. The rows and columns represent the two sources of variation one wishes to control. One notes that in this design, each treatment occurs exactly once in each row and in each column. Such a balanced arrangement allows the effect of the air-cleaning filter to be separated from that of the season variable. Note that if interaction between the sources of variation is present, the Latin square model cannot be used; this assessment ought to be made based on previous studies or expert opinion.

Table 6.22 Experimental design (Example 6.5.2)

                 Season
School vintage   Fall   Winter   Spring   Summer
1                A      B        C        D
2                D      A        B        C
3                C      D        A        B
4                B      C        D        A

(b) Perform an ANOVA analysis
Table 6.23 summarizes the data collected under such an experimental protocol, where the numerical values shown are the number of breathing-related complaints per season corrected for the number of days when the school is in session and for changes in student population.

Table 6.23 Data table showing the number of breathing complaints. A, B, C, and D are four different types of air filters being evaluated (Example 6.5.2)

School vintage   Fall    Winter   Spring   Summer   Average
1                A 70    B 75     C 68     D 81     73.50
2                D 66    A 59     B 55     C 63     60.75
3                C 59    D 66     A 39     B 42     51.50
4                B 41    C 57     D 39     A 55     48.00
Average          59.00   64.25    50.25    60.25    58.4375

Data available electronically on book website

Assuming that the various sources of variation do not interact, the objective is to statistically determine whether filter type affects the number of breathing complaints. A secondary objective is to investigate whether any (and, if so, which) of the nuisance factors (school vintage and season) are influential. Generating scatter plots such as those shown in Fig. 6.19 for school vintage and filter type is a logical first step. The standard deviation is stdev = 12.91, while the averages of the four treatments or filter types are:
A = 55.75, B = 53.25, C = 61.75, D = 63.00

Fig. 6.19 Scatter plots of number of complaints vs (a) school vintage and (b) filter type [vertical axes: Complaints; horizontal axes: School and Filter]

In this example, one would make a fair guess, based on the between-group and within-group variation, that filter type is probably not an influential factor on the number of complaints while school vintage may be. The analysis of variance approach or ANOVA is likely to be more convincing because of its statistical rigor. From the probability values in the last column of Table 6.24, it can be concluded that the number of complaints depends strongly on the school vintage (p = 0.006), only marginally on the season (p = 0.105, i.e., borderline at the 0.10 level), and not significantly on the filter type (p = 0.213).

Table 6.24 ANOVA results following equations shown in Table 6.21 (Example 6.5.2)

Source of variation   Sum of squares   Degrees of freedom   Mean square   Computed F statistic   p-value
School vintage        1557.2           3                    519.06        11.92                  0.006
Season                417.69           3                    139.23        3.20                   0.105
Filter type           263.69           3                    87.90         2.02                   0.213
Error                 261.37           6                    43.56         –                      –
Total                 2499.94          15                   –             –                      –
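The Latin square ANOVA of Example 6.5.2 can be reproduced from the data of Table 6.23 using the equations of Table 6.21. The following Python sketch is only an illustration with hypothetical names; it encodes the filter layout of Table 6.22, in which each row is the previous one shifted cyclically by one position, and computes the sums of squares and F-ratios (p-values are not computed here).

```python
# Complaints from Table 6.23: rows = school vintage 1-4, columns = Fall, Winter, Spring, Summer
complaints = [
    [70, 75, 68, 81],
    [66, 59, 55, 63],
    [59, 66, 39, 42],
    [41, 57, 39, 55],
]
# Filter assignment of Table 6.22 (one string per school, one letter per season)
filters = ["ABCD", "DABC", "CDAB", "BCDA"]
n = 4

# One record per experiment: (school index, season index, filter label, response)
cells = [(r, c, filters[r][c], complaints[r][c]) for r in range(n) for c in range(n)]
grand = sum(cell[3] for cell in cells) / n ** 2


def ss_between(key):
    """n times the sum of squared deviations of the group means from the grand mean."""
    groups = {}
    for cell in cells:
        groups.setdefault(cell[key], []).append(cell[3])
    return n * sum((sum(g) / n - grand) ** 2 for g in groups.values())


ss_school, ss_season, ss_filter = ss_between(0), ss_between(1), ss_between(2)
ss_total = sum((cell[3] - grand) ** 2 for cell in cells)
ss_error = ss_total - ss_school - ss_season - ss_filter

ms_error = ss_error / ((n - 1) * (n - 2))
f_ratio = {name: (ss / (n - 1)) / ms_error
           for name, ss in [("school", ss_school), ("season", ss_season), ("filter", ss_filter)]}

print(f"SS: school={ss_school:.2f} season={ss_season:.2f} "
      f"filter={ss_filter:.2f} error={ss_error:.2f} total={ss_total:.2f}")
print({name: round(f, 2) for name, f in f_ratio.items()})
```

The same function handles rows, columns, and treatments because, in a Latin square, every level of every factor appears exactly n times.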
6.6 Response Surface Designs

6.6.1 Applications
Recall that the factorial methods described in the previous section can be applied to either continuous or discrete qualitative/categorical variables with only one variable changed at a time. The 2^k factorial method allows both screening to identify dominant factors and identification of a robust linear predictive model. Historically, these techniques were then extended to optimizing a process or product by Box and Wilson in the early 1950s. A special class of mathematical and statistical techniques was developed to identify models and analyze data between a response and a set of continuous treatment variables, with the intent of determining the conditions under which a maximum (or a minimum) of the response variable is obtained when one or more of the variables are simultaneously changed (Box et al. 1978). For example, the optimal mix of two alloys which would result in the product having maximum strength can be deduced by fitting the data from factorial experiments with a model from which the optimum is determined either by calculus or by search methods (described in Chap. 7 under optimization methods). These models, called response surface (RS) models, can be framed as second-order models (sometimes, the first order is adequate if the optimum is far from the initial search space) which are linear in the parameters. RS designs involve not just the modeling aspect, but also recommendations on how to perform the sequential search involving several DOE steps.
The reader may wonder why most of the DOE models treated in this chapter assume empirical polynomial models. This is for historical reasons: the types of applications which triggered the development of DOE were not understood well enough to adopt mechanistic functional forms. Empirical polynomial models are linear in the parameters but can be nonlinear in their functional form due to interaction terms and higher-order terms in the variables (such as Eq. 6.7a).
6.6.2 Methodology
A typical RS experimental design involves three general phases (screening, optimizing, and confirming) performed with the specific intention of limiting the number of experiments required to achieve a rich data set. This will be illustrated using the following example. The R&D staff of a steel company wants to improve the strength of the metal sheets sold. They have identified a preliminary list of five factors that might impact the strength of their metal sheets: concentrations of chemical A and chemical B, the annealing temperature, the time to anneal, and the thickness of the sheet casting.
The first phase is to run a screening design to identify the main factors influencing the metal sheet strength. Thus, those factors that are not important contributors to the metal sheet strength are eliminated from further study. How to perform such screening tests involving the 2^k factorial design has been discussed in Sect. 6.4.2. Suppose it was concluded that the chemical concentrations A and B are the main treatment factors that survive the screening design.
To optimize the mechanical strength of the metal sheets, one needs to know the relationship between the strength of the metal sheet and the concentration of chemicals A and B in the mix; this is done in the second phase, which requires a sequential search process. The following steps are undertaken during the sequential search:
(i) Identify the levels of the amount of chemicals A and B to study. Three distinct values for each factor are usually necessary to fit a quadratic function, so standard two-level designs are not appropriate for fitting curved surfaces.
(ii) Generate the experimental design using one of several factorial methods.
(iii) Run the experiments.
(iv) Analyze the data using ANOVA to identify the statistical significance of factors.
(v) Draw conclusions and develop a model for the response variable. The quadratic terms in these equations approximate the curvature in the underlying response function. If a maximum or minimum exists inside the design region, the point where that value occurs can be estimated. Unfortunately, this is unlikely to be the case. The approximate model identified is representative of the behavior of the metal in the local design space only, while the global optimum may lie outside the search space.
(vi) Using optimization methods (such as calculus-based methods or search methods such as steepest descent), move in the direction where the overall optimum is likely to lie (refer to Sect. 7.4.2).
(vii) Repeat steps (i) through (vi) until the global optimum is reached.
Once the optimum has been identified, the R&D staff would want to confirm that the new, improved metal sheets have higher strength; this is the third phase. They would resort to hypothesis tests involving running experiments to support the alternate hypothesis that the strength of the new, improved metal sheet is greater than the strength of the existing metal sheet. In summary, the goals of the second and third phases of the RS design are to determine and then confirm, with the needed statistical confidence, the optimum levels of chemicals A and B that maximize the metal sheet strength.
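Step (v) mentions that the location of a maximum or minimum can be estimated once a quadratic model has been fitted. A minimal Python sketch of that calculation for two factors is given below. The coefficient values are entirely hypothetical (they are not results from the steel example), and the calculation simply sets the two partial derivatives of the fitted quadratic to zero and solves the resulting 2 × 2 linear system.

```python
# Hypothetical coefficients of a fitted second-order model in coded units:
# y = b0 + b1*x1 + b2*x2 + b11*x1^2 + b22*x2^2 + b12*x1*x2
b0, b1, b2 = 80.0, 4.0, -2.0
b11, b22, b12 = -3.0, -2.0, 1.0

# Zero gradient: [2*b11  b12; b12  2*b22] [x1; x2] = [-b1; -b2]
h11, h12, h22 = 2 * b11, b12, 2 * b22
det = h11 * h22 - h12 ** 2               # nonzero for the values assumed here
x1_s = (-b1 * h22 + h12 * b2) / det      # Cramer's rule
x2_s = (-h11 * b2 + h12 * b1) / det

# Predicted response at the stationary point
y_s = (b0 + b1 * x1_s + b2 * x2_s
       + b11 * x1_s ** 2 + b22 * x2_s ** 2 + b12 * x1_s * x2_s)

# Classify the stationary point from the curvature matrix
if det > 0 and h11 < 0:
    kind = "maximum"
elif det > 0 and h11 > 0:
    kind = "minimum"
else:
    kind = "saddle point or degenerate"

# A stationary point outside the coded design region (-1 to +1) is an extrapolation
inside = abs(x1_s) <= 1 and abs(x2_s) <= 1

print(f"stationary point: x1 = {x1_s:.3f}, x2 = {x2_s:.3f} ({kind})")
print(f"predicted response = {y_s:.2f}, inside design region: {inside}")
```

If the stationary point falls outside the design region, the fitted model should not be trusted there, and the sequential search of steps (vi) and (vii) is the appropriate next move.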
6.6.3 First- and Second-Order Models
In most RS problems, the form of the relationship between the response and the regressors is unknown. Consider the case where the yield (Y) of a chemical process is to be maximized with temperature (T) and pressure (P) being the two independent variables (Montgomery 2017). The 3-D plot (called the response surface plot in DOE terminology) is shown in Fig. 6.20, along with its projection onto a 2-D plane, known as a contour plot. The maximum yield Y = 70 is achieved at T = 138 and P = 28. If one did not know the shape of this surface, one simple approach would be to assume a starting point (say, T = 115 and P = 20, as shown) and repeatedly perform experiments in an effort to reach the maximum point. This is akin to a univariate optimization search (see Sect. 7.4), which is not very efficient. In this example involving a chemical process, varying one variable at a time may work because of the symmetry of the RS plot. However, in cases (and this is often so) when the RS plot is asymmetrical or when the search location is far away from the optimum, such a univariate search may erroneously indicate a nonoptimal maximum. A superior manner, and the one adopted in most numerical methods, is the steepest gradient method, which involves adjusting all the variables together (see Sect. 7.4). As shown in Fig. 6.21, if the responses Y at each of the four corners of the square are known by experimentation, a suitable model is identified (in the figure, a linear model is assumed and so the set of lines for different values of Y are parallel). The steepest gradient method involves moving along a direction perpendicular to the sets of lines (indicated by the "steepest descent" direction in the figure) to another point where the next set of experiments ought to be performed.
Repeated use of this testing, modeling, and stepping is likely to lead one close to the sought-after maximum or minimum (provided one is not caught in a local peak or valley or a saddle point).
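The test-model-step cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four corner responses, the step length, and the function names `fit_first_order` and `steepest_ascent_direction` are hypothetical and do not come from Fig. 6.21. Because the response is being maximized, the step is taken along the positive gradient of the fitted first-order model.

```python
import math

# Coded corner points of a 2^2 factorial centered on the current operating point
corners = [(-1, -1), (1, -1), (-1, 1), (1, 1)]
responses = [40.0, 46.0, 44.0, 50.0]  # hypothetical measured yields at the corners


def fit_first_order(points, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2.

    The simple averages below are exact only for an orthogonal +/-1 coded design,
    where the sum of squares of each coded column equals the number of runs.
    """
    n = len(points)
    b0 = sum(y) / n
    b1 = sum(p[0] * yi for p, yi in zip(points, y)) / n
    b2 = sum(p[1] * yi for p, yi in zip(points, y)) / n
    return b0, b1, b2


def steepest_ascent_direction(b1, b2):
    """Unit vector along the gradient, i.e., perpendicular to the model contours."""
    norm = math.hypot(b1, b2)
    return b1 / norm, b2 / norm


b0, b1, b2 = fit_first_order(corners, responses)
direction = steepest_ascent_direction(b1, b2)

step = 2.0  # hypothetical step length in coded units
next_center = (step * direction[0], step * direction[1])
print(b0, b1, b2, direction, next_center)
```

The next set of experiments would be run around `next_center`, and the fit repeated until the first-order model no longer indicates a useful direction of improvement.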
Fig. 6.20 A three-dimensional response surface between the response variable (the expected yield) and two regressors (temperature and pressure) with the associated contour plots indicating the optimal value. (From Montgomery 2017 by permission of John Wiley and Sons)
Fig. 6.21 Figure illustrating how the first-order response surface model (RSM) fit to a local region can progressively lead to the global optimum using the steepest descent search method
The following recommendations are noteworthy to minimize the number of experiments to be performed:
(a) During the initial stages of the investigation, a first-order polynomial model in some region of the range of variation of the regressors is usually adequate. Such models have been extensively covered in Chap. 5 with Eq. 5.29 being the linear first-order model form in vector notation. $2^k$ factorial designs, both full and fractional, are good choices at the preliminary stage of the RS investigation. As stated earlier, due to the benefit of orthogonality, these designs are recommended since they would minimize the variance of the regression coefficients.
(b) Once close to the optimal region, polynomial models higher than the first order are advised. This could be a second-order polynomial (involving just the main effects) or a higher-order polynomial which also includes quadratic effects and interactions between pairs of factors (two-factor interactions) to account for curvature. Quadratic models are usually sufficient for most engineering applications, though increasing the order of approximation could sometimes further reduce model errors. Of course, it is unlikely that a polynomial model will be a reasonable approximation of the true functional relationship over the entire space of the independent variables, but for a relatively small region, such models usually work well. Note that rarely would all the terms of the quadratic model be needed; how to identify a parsimonious model has been illustrated in Examples 6.4.2 and 6.4.3.
6.6.4
Central Composite Design and the Concept of Rotation
Fig. 6.22 A central composite design (CCD) contains two sets of experiments: a fractional factorial or “cube” portion which serves as a preliminary stage where one can fit a first-order (linear) model, and a group of axial or “star” points that allow estimation of curvature. A CCD always contains twice as many axial (or star) points as there are factors in the design. In addition, a certain number of center points are also used to capture inherent random variability in the process or system behavior. (a) CCD for two factors. (b) CCD for three factors
One must assume three levels for the factors in order to fit quadratic models. For a $3^k$ factorial design with the number of factors k = 3, one needs 27 experiments with no replication, which, for k = 4, grows to 81 experiments. Thus, the number of trials at each iteration point increases exponentially. Hence, $3^k$ designs become impractical for k > 3. A more efficient design requiring fewer experiments uses the concept of rotation (such designs are also referred to as axisymmetric). An experimental design is said to be rotatable if the trials are selected such that they are equidistant from the center. Since the location of the optimum point is unknown, such a design would result in equal precision of estimation in all directions. In other words, the variance of the response variable at any point in the regressor space is a function of only the distance of the point from the design center. Central composite design (CCD) contains three components (Berger and Maurer 2002):
(a) A two-level (fractional) factorial design, which estimates the main and two-factor interaction terms;
(b) A “star” or “axial” design, which, in conjunction with the other two components, allows estimation of curvature by allowing quadratic terms to be introduced in the model function;
(c) A set of center points (which are essentially random repeats of the center point), which provides a measure of process stability by reducing model prediction error and allowing one to estimate the error. They provide a check for curvature, i.e., if the response surface is curved, the center points will be lower or higher than predicted by the design points (see Fig. 6.22).
The total number of experimental runs for a CCD with k factors is
$$\text{number of runs} = 2^k + 2k + c \tag{6.15}$$
where c is the number of center points.
This equation is confirmed by Fig. 6.22. For a two-factor experiment design, the CCD generates 4 factorial points and 4 axial points, i.e., 4 + 4 + 1 = 9 points (assuming only one center point). For a three-factor experiment design, the CCD generates 8 factorial points and 6 axial points, i.e., the number of experiments = 8 + 6 + 1 = 15 points. The factorial or “cube” portion and center points (shown as circles in Fig. 6.22) may aid in fitting a first-order (linear) model during the preliminary stage while still providing evidence regarding the importance of a second-order contribution or curvature. A CCD always contains twice as many axial (or star) points as there are factors in the design. The star points represent new extreme values (low and high) for each factor in the design. The number of center points for some useful CCDs has also been suggested. Sometimes, more center points than the numbers suggested are introduced; nothing will be lost by this except the cost of performing the additional runs. For a two-factor CCD, it is recommended that at least two center points be used, while many researchers are said to routinely use as many as 6–8 points.
Fig. 6.23 Rotatable central composite designs (CCD) for two factors and three factors during RSM. The black dots indicate locations of experimental runs
CCDs are most widely used during RSM experimental design since they inherently satisfy the desirable design properties of orthogonal blocking and rotatability and allow for efficient estimation of the quadratic terms in the second-order model. Central composite designs with two and three factors, along with the manner of coding the levels of the factors at which experimental tests must be conducted, are shown more clearly in Fig. 6.23a, b, respectively. If the distance from the center of the design space to a factorial point is ±1 unit for each factor, the distance from the center of the design space to a star point is ±α with |α| > 1. The precise value of α depends on certain properties desired for the design, like orthogonal blocking, and on the number of factors involved. To maintain rotatability, the value of α for CCD is chosen such that:
$$\alpha = n_f^{1/4} \tag{6.16}$$
where $n_f$ is the number of experimental runs in the factorial portion. For example, if the experiment has 2 factors, the full factorial portion would contain $2^2 = 4$ points, and the value of α for rotatability would be $\alpha = (2^2)^{1/4} = 1.414$. If the experiment has 3 factors, $\alpha = (2^3)^{1/4} = 1.682$; if the experiment has 4 factors, $\alpha = (2^4)^{1/4} = 2$; and so on. As shown in Fig. 6.23, CCDs usually have axial points outside the “cube” (unless one intentionally specifies α ≤ 1 due to, say, safety concerns in performing the experiments). Finally, since the design points describe a circle circumscribed about the factorial square, the optimum values must fall within this experimental region. If not, suitable constraints must be imposed on the function to be optimized (as illustrated in the example below). For further reading on CCD, the texts by Box et al. (1978) and Montgomery (2017) are recommended. Many software packages have a DOE wizard, which walks one through the various steps of defining the entire set of experimental/simulation combinations.
Example 6.6.1⁸ Optimizing the deposition rate for a tungsten film on silicon wafer. A two-factor rotatable central composite design (CCD) was run to optimize the deposition rate for a tungsten film on silicon wafer. The two factors are the process pressure (in kPa) and the ratio of hydrogen H2 to tungsten hexafluoride WF6 in the reaction atmosphere. The levels for these factors are given in Table 6.25. Let x1 be the pressure factor and x2 the ratio factor for the two coded factors. The rotatable CCD design with three center points was adopted with the experimental results assembled in Table 6.26. For example, for the pressure term, the low level of 0.4 is coded as -1, and the high level of 8.0 as +1. The numerical value of the response is left unaltered.
⁸ From Buckner et al. (1993) with small modification.
Table 6.25 Assumed low and high levels for the two factors (Example 6.6.1)
| Factor | Low level | High level |
|---|---|---|
| Pressure | 0.4 | 8.0 |
| Ratio H2/WF6 | 2 | 10 |
Table 6.26 Results of the CCD rotatable design for two coded factors with 3 center points (Example 6.6.1)
| x1 | x2 | y |
|---|---|---|
| -1 | -1 | 3663 |
| 1 | -1 | 9393 |
| -1 | 1 | 5602 |
| 1 | 1 | 12488 |
| -1.414 | 0 | 1984 |
| 1.414 | 0 | 12603 |
| 0 | -1.414 | 5007 |
| 0 | 1.414 | 10310 |
| 0 | 0 | 8979 |
| 0 | 0 | 8960 |
| 0 | 0 | 8979 |
Data available electronically on book website
A second-order linear regression with all 11 data points results in a model with Adj-$R^2$ = 0.969 and RMSE = 608.9. The model coefficients assembled in Table 6.27 indicate that the coefficients of $x_1x_2$ and $x_1^2x_2^2$ are not statistically significant. Dropping these terms results in a better model with Adj-$R^2$ = 0.983 and RMSE = 578.8. The corresponding values of the reduced model coefficients are shown in Table 6.28. In determining whether the model can be further simplified, one notes that the highest p-value on the independent variables is 0.0549, belonging to $x_2^2$. Since this p-value is slightly greater than 0.05, that term may not be statistically significant and one could consider removing it from the model; this, however, is a close call.
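The coded design points of Table 6.26 follow directly from the run count of Eq. 6.15 and the rotatability condition of Eq. 6.16. The sketch below generates such a design for any number of factors and compares its size with that of a full three-level factorial. It assumes a full (not fractional) factorial portion, and the function names `rotatable_alpha` and `ccd_points` are illustrative only.

```python
import itertools


def rotatable_alpha(k):
    """Axial distance for rotatability (Eq. 6.16) with a full 2^k factorial portion."""
    return (2 ** k) ** 0.25


def ccd_points(k, n_center=1):
    """Coded design points of a rotatable CCD: cube, star, and center points."""
    alpha = rotatable_alpha(k)
    # Cube portion: all 2^k combinations of the coded levels -1 and +1
    cube = [tuple(float(v) for v in p)
            for p in itertools.product((-1, 1), repeat=k)]
    # Star portion: two axial points per factor at distance alpha from the center
    star = []
    for j in range(k):
        for sign in (-1.0, 1.0):
            point = [0.0] * k
            point[j] = sign * alpha
            star.append(tuple(point))
    # Center portion: repeated runs at the design center
    center = [tuple([0.0] * k) for _ in range(n_center)]
    return cube + star + center


for k in (2, 3, 4):
    n_ccd = len(ccd_points(k, n_center=1))
    print(k, round(rotatable_alpha(k), 3), n_ccd, 3 ** k)  # CCD runs vs 3^k runs
```

With `k = 2` and `n_center = 3`, the function returns the 11 coded points of Table 6.26 (to the three decimals shown there), although not necessarily in the same row order.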
Table 6.27 Model coefficients for the second-order complete model with coded regressors (Example 6.6.1)
| Parameter | Estimate | Standard error | t-statistic | p-value |
|---|---|---|---|---|
| Constant | 8972.6 | 351.53 | 25.5246 | 0.0000 |
| x1 | 3454.43 | 215.284 | 16.046 | 0.0001 |
| x2 | 1566.79 | 215.284 | 7.27781 | 0.0019 |
| x1*x2 | 289.0 | 304.434 | 0.949303 | 0.3962 |
| x1^2 | -839.837 | 277.993 | -3.02107 | 0.0391 |
| x2^2 | -657.282 | 277.993 | -2.36438 | 0.0773 |
| x1^2*x2^2 | 310.952 | 430.60 | 0.722137 | 0.5102 |
Table 6.28 Model coefficients for the reduced model with coded regressors (Example 6.6.1)
| Parameter | Estimate | Standard error | t-statistic | p-value |
|---|---|---|---|---|
| Constant | 8972.6 | 334.19 | 26.8488 | 0.0000 |
| x1 | 3454.43 | 204.664 | 16.8785 | 0.0000 |
| x2 | 1566.79 | 204.664 | 7.65544 | 0.0003 |
| x1^2 | -762.044 | 243.63 | -3.12787 | 0.0204 |
| x2^2 | -579.489 | 243.63 | -2.37856 | 0.0549 |
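The coefficient estimates of the reduced model can be reproduced from the data of Table 6.26 by ordinary least squares. The sketch below uses only the Python standard library and solves the normal equations by Gaussian elimination; this is an illustrative choice (the helper names `solve_linear` and `fit_least_squares` are hypothetical), not necessarily the procedure used to produce the table, and a statistics package would normally be preferred since it also reports standard errors and p-values.

```python
# Coded design and responses from Table 6.26
x1 = [-1, 1, -1, 1, -1.414, 1.414, 0, 0, 0, 0, 0]
x2 = [-1, -1, 1, 1, 0, 0, -1.414, 1.414, 0, 0, 0]
y = [3663, 9393, 5602, 12488, 1984, 12603, 5007, 10310, 8979, 8960, 8979]


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_least_squares(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve_linear(XtX, Xty)


# Reduced model: constant, x1, x2, x1^2, x2^2
X = [[1.0, a, b, a * a, b * b] for a, b in zip(x1, x2)]
coef = fit_least_squares(X, y)
for name, value in zip(["Constant", "x1", "x2", "x1^2", "x2^2"], coef):
    print(f"{name:9s} {value:10.3f}")
```

The printed estimates can be compared line by line with the Estimate column of Table 6.28.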
Thus, the final RSM model with coded regressors is:
$$y = 8972.6 + 3454.4x_1 + 1566.8x_2 - 762.0x_1^2 - 579.5x_2^2 \tag{6.17}$$
It would be wise to look at the residuals. Figure 6.24 suggests quite good agreement between the model and the observations. However, Fig. 6.25 indicates that one of the points (the first row, i.e., y = 3663) is unusual since its studentized residual is high (recall that studentized residuals measure how many standard deviations each observed value of y deviates from the fitted model using all of the data except that observation, see Sect. 5.6.2). Observations with studentized residuals greater than 3 in absolute value warrant a close look and, if necessary, may be removed prior to model fitting.
Fig. 6.24 Observed versus model predicted values (Example 6.6.1)
Fig. 6.25 Studentized residuals versus model predicted values (Example 6.6.1)
The optimal values of the two coded regressors associated with the maximum response are determined by taking partial derivatives of Eq. 6.17 and setting them to zero, resulting in x1 = 2.267 and x2 = 1.353. However, this optimum lies outside the experimental region used to identify the RSM model (see Fig. 6.26) and is unacceptable. A constrained optimization is warranted since the spherical constraint of a rotatable CCD must be satisfied; i.e., the optimum must lie within a distance $\alpha = (2^2)^{1/4} = 1.414$ (from Eq. 6.16) of the design center. This would guarantee that the optimal condition falls within the experimental region assumed. Resorting to a constrained optimization (see Sect. 7.3) results in the optimal values of the regressors: x1* = 1.253 and x2* = 0.656, representing a maximum deposition rate y* = 12,883. The low and high values of the two regressors are shown in Table 6.25. These optimal values of the coded variables can be transformed back in terms of the original variables to yield pressure = 8.56 kPa and ratio H2/WF6 = 8.0. Finally, a confirmatory experiment would have to be conducted in the neighborhood of this optimum.
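The unconstrained and constrained optima of Eq. 6.17 can be checked numerically. The sketch below is an assumption-laden illustration rather than the method of Sect. 7.3: since both quadratic coefficients are negative (the fitted surface is concave) and the unconstrained stationary point lies outside the design circle of radius α = 1.414, the constrained maximum must lie on that circle, so a simple scan over the angle is enough. The function names are hypothetical.

```python
import math

# Coefficients of Eq. 6.17 (coded regressors)
B0, B1, B2, B11, B22 = 8972.6, 3454.4, 1566.8, -762.0, -579.5
ALPHA = 1.414  # radius of the rotatable two-factor CCD (Eq. 6.16)


def rsm(x1, x2):
    """Deposition rate predicted by the reduced second-order model (Eq. 6.17)."""
    return B0 + B1 * x1 + B2 * x2 + B11 * x1 ** 2 + B22 * x2 ** 2


# Unconstrained stationary point: set both partial derivatives to zero
x1_free = -B1 / (2 * B11)
x2_free = -B2 / (2 * B22)
outside_region = math.hypot(x1_free, x2_free) > ALPHA


def best_on_circle(radius, n_angles=200_000):
    """Scan the circle of the given radius; return (y, x1, x2) at the largest y."""
    best = (-math.inf, 0.0, 0.0)
    for i in range(n_angles):
        theta = 2 * math.pi * i / n_angles
        x1, x2 = radius * math.cos(theta), radius * math.sin(theta)
        best = max(best, (rsm(x1, x2), x1, x2))
    return best


y_star, x1_star, x2_star = best_on_circle(ALPHA)
print("unconstrained:", round(x1_free, 3), round(x2_free, 3), outside_region)
print("constrained:  ", round(x1_star, 3), round(x2_star, 3), round(y_star))
```

A general-purpose constrained optimizer would give the same point; the angular scan is used here only because it keeps the example free of external dependencies.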
Fig. 6.26 Contour plot of the RSM given by Eq. 6.17 for the two coded factors (Example 6.6.1)
6.7
Simulation Experiments
6.7.1
Background
Computer simulations of product behavior, processes, and systems ranging from simple (say a solar photovoltaic system) to extremely complex (such as design of a wide-area distributed energy system involving traditional and renewable power generation subunits or simulation of future climate change scenarios) have acquired great importance and a well-respected role in all branches of scientific and engineering endeavor. These virtual tools are based on “behavioral models” at a certain level of abstraction which allow prediction, assessment, and verification of the performance of products and systems under different design and operating
conditions. They are, to some extent, replacing the need to perform physical experiments, which are more expensive, time-consuming, and limited in the number of issues one can consider. Note that simulations are done assuming a set of discrete values for the design parameters/variables, which is akin to the selection of specific levels for treatment and secondary factors while conducting physical experiments. Thus, the primary purpose of performing simulation studies is to learn as much as possible about system behavior under different sets of design factors/variables/parameters at the lowest possible cost (Shannon 1975). The parallel between the traditional physical-experimental DOE methods and the selection of samples of input vectors of design variables for simulations is obvious. Unfortunately, the technical developments in the design of computer simulations have tended to be siloed, with rediscovery and duplication slowing down progress. It is common to have several competing commercial simulation programs meant for the same purpose but with differing degrees of accuracy, sophistication, and capabilities; however, such issues will not be considered here. It is simply assumed that a reasonably high-fidelity validated computer simulation program is available for the intended purpose. Further, this discussion is aimed at computationally intensive long-run simulations, i.e., instances where a single computer simulation run may require minutes or hours to complete. The material in this section is primarily focused on the use of detailed hourly or sub-hourly time-step computer simulation programs for the design of energy-efficient buildings, which predict the hourly energy consumption and the indoor comfort conditions for a whole year for any building
geometry and climatic conditions (see, for example, Clarke 1993). These programs include models for the dynamic thermal response of the building envelope and for the performance of the different types of HVAC systems, and they consider the specific manner in which the building is scheduled and operated (widely used ones are EnergyPlus, TRNSYS, Modelica, and eQuest; refer to, for example, de Wit 2003). Commonly, computer simulations can be used for one or more of the following applications:
(a) During design, i.e., while conceptualizing the system before it is built. It generally involves two different tasks:
(i) Sizing equipment or systems based on point conditions such as peak conditions during normal course of operation. System reliability, safety, etc. must meet some code-specified criteria. Examples involve (1) a sloping roof that is structurally able to support code-specified conditions (say, location-specific 50-year maximum snow loading), (2) extreme conditions of outdoor temperature and solar radiation for sizing heating and cooling equipment (often based on 1% and 99% annual probability criteria) for building HVAC equipment in a specific location, and (3) mitigation measures to be activated to avoid environmental pollutants due to vehicles exceeding some prespecified threshold in a specific city neighborhood.
(ii) Long-term temporal simulation for system performance prediction under a pre-selected range of design conditions or during normal range of operating conditions assuming efficient or optimal control as per prevailing industry norms. Two examples are: (1) simulating the energy use in a building controlled in a standard energy-efficient manner on an hour-by-hour basis over a standard year, and (2) predicting the annual electricity produced by a PV system assuming fault-free behavior. Such capability allows optimal or satisfactory options to be evaluated based on circumstance-specific constraints in addition to criteria such as energy or cost or both.
(b) During day-to-day operation, using model-based short-term forecasting and optimization techniques to control and operate the system as efficiently as possible. Such conditions may deviate from the idealized and standard conditions assumed during design. Typical applications are model-based supervisory control of energy systems (such as a distributed energy system in conjunction with building HVAC systems) and fault detection and diagnosis.
(c) System response/performance under extraordinary/rare events which result in full or partial component/system failure leading to large functionality loss and severe economic and social hardship (e.g., a hurricane knocking down power lines). Simulations are performed to study system robustness and ability to recover quickly (this aspect falls under reliability (Sect. 7.5.6) and resilience analysis (complementary to sustainability, briefly mentioned in Sect. 12.7.6)).
Note that the above applications may cover conditions wherein the model/simulation program inputs are either assumed to be deterministic or stochastic. The latter is an instance when some of the physical parameters of the model are not known with certainty and need to be expressed by a probability distribution. For example, during the construction of the wall assembly of a building, deviation from design specifications is common (referred to as specification uncertainty).
Another source of uncertainty is that several external drivers (such as weather) may not be known with the needed accuracy at the intended site. In addition, the conditions under which the system is operated may not be known properly and this introduces scenario uncertainty. For example, an architect during design may assume the building to be operated for 12 h/day while the owner may subsequently use it for 16 h/day. Finally, given the complexity of the actual system performance, the underlying models used for the simulation are simplifications of reality (modeling uncertainty). The extent to which all these uncertainties along with the uncertainty of the numerical method used to solve the set of modeling equations (called numerical uncertainty) would affect the final design should be evaluated.
Surprisingly, this aspect has yet to reach the necessary level of maturity for practicing architects and building energy analysts to adopt routinely. The impact of uncertainties in building simulations, and how to treat them realistically, has been addressed by de Wit (2003).
6.7.2
Similarities and Differences Between Physical and Simulation Experiments
DOE in conjunction with model building can be used to simplify the search for an optimum or a satisficing solution when detailed simulation programs of physical systems requiring long-run computer times are to be used. The similarity of such problems to DOE experiments on physical processes is obvious since the former requires:
(a) Important design parameters (or independent variables) and their range of variability to be selected, which is identical to that for treatment and secondary factors.
(b) Number of levels for the factors to be decided based on some prior insight of the dependence between response and the set of design parameters.
(c) Experimental design: the specific combinations of primary factors to be selected for which one predicts system responses to improve the design. This involves making multiple runs of the computer model using specific values and pairings of these input parameters.
(d) Finally, the data analysis phase allows insights into the following:
(i) Screening: performing a sensitivity analysis to determine a subset of dominant model input parameter combinations,
(ii) Model: gaining insights into the structure of the performance model, such as whether it is linear, second-order, includes cross-product terms, etc.,
(iii) Design optimization: finding an optimum (or near-optimum) either from an exhaustive search of the set of discrete simulation results or by fitting an appropriate mathematical model to the data, referred to as surrogate modeling (discussed in Sect. 6.7.5).
There are, however, some major differences. One major difference is that computer experiments are deterministic, i.e., one gets the same system response under the same set of inputs. Replication is not required and only one center point is sufficient if CCD design is adopted.
A second major difference is that since the analyst selects the input variable set prior to each simulation run, blocking and control of the variables are inherent and no special consideration is required (Dean et al. 2017). Another difference is that the initial number of design variables tends to be much larger than that involving physical experiments. The resulting range of input combinations or design space is very large, requiring different approaches to generate the input combination samples, reduce the initial variable set by sensitivity analysis (similar to screening), and reduce the number of computer simulations during the search for a satisficing/optimal solution by adopting space-filling interpolation methods (also called “surrogate modeling”). These aspects are discussed below.
6.7.3
Monte Carlo and Allied Sampling Methods
The Monte Carlo (MC) method has been introduced earlier in the context of uncertainty analysis (Sect. 3.7.2). The MC approach, of which there are several variants, comprises that branch of computational mathematics that relies on experiments using random numbers to infer the response of a system (Hammersley and Handscomb 1964). Chance events are artificially recreated numerically (on a computer), the simulation is run many times, and the results provide the necessary insights. MC methods provide approximate solutions to a variety of deterministic and stochastic problems, hence their widespread appeal. The methods vary but tend to follow a particular pattern when applied to computer simulation situations:
(i) Define a domain of possible inputs,
(ii) Generate inputs randomly from an assumed probability distribution over the domain,
(iii) Perform a deterministic computation on the inputs, and
(iv) Aggregate and analyze the results.
The many advantages of MC methods are conceptual simplicity, low level of mathematics, applicability to a large number of different types of problems, ability to account for correlations between inputs, and suitability to situations where model parameters have unknown distributions. MC methods are numerical methods in that all the uncertain inputs must be assigned a definite probability distribution. For each simulation, one value is selected at random for each input based on its probability of occurrence. Numerous such input sequences are generated, and simulations are performed. Provided the number of runs is large, the mean of the simulation output values will be approximately normally distributed irrespective of the probability distributions of the inputs (this follows from the Central Limit Theorem described in Sect. 4.2.1). Even though nonlinearities between the inputs and output are accounted for, the accuracy of the results depends on the number of runs.
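The four-step pattern listed above can be illustrated with a deliberately simple case. In the sketch below, the deterministic model (steady-state conduction heat loss through a wall), the input distributions, and all numerical values are hypothetical choices made only for illustration; they are not taken from the text.

```python
import random
import statistics


def heat_loss(u_value, area, delta_t):
    """Step (iii): deterministic model, steady-state heat loss in W (hypothetical)."""
    return u_value * area * delta_t


def monte_carlo(n_runs, seed=0):
    """Crude MC propagation of input uncertainty through heat_loss."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    outputs = []
    for _ in range(n_runs):
        # Steps (i) and (ii): assumed input domains and probability distributions
        u_value = rng.gauss(0.5, 0.05)      # wall U-value, W/(m2 K), normal
        delta_t = rng.uniform(15.0, 25.0)   # indoor-outdoor difference, K, uniform
        outputs.append(heat_loss(u_value, 10.0, delta_t))  # area fixed at 10 m2
    return outputs


# Step (iv): aggregate and analyze the results
results = monte_carlo(20_000)
mean_q = statistics.mean(results)
std_q = statistics.stdev(results)
print(f"mean = {mean_q:.1f} W, standard deviation = {std_q:.1f} W")
```

For a long-running building simulation program, `heat_loss` would be replaced by a call to the simulation itself, which is why the number of runs, and hence the sampling scheme, matters.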
However, given the power of modern computers, the relatively large computational effort is no longer a major limitation except in very large simulation studies. The concept of “sampling efficiency” has been used to compare different schemes of implementing MC methods and was introduced earlier (refer to Eq. 4.49 in Sect. 4.7.4 dealing with stratified sampling). In the present context of computer simulations, this term assumes a different meaning and Eq. 4.49 needs to be modified. Here, a more efficient scheme of sampling a pre-chosen set of input variables along with their range of variation is one which results in greater spread or variance in the resulting simulation response set with fewer simulations n (thus requiring less computing time). Say two methods, methods 1 and 2, are to be compared. Method 1 created a sample of $n_1$ simulation runs, while method 2 created a sample of $n_2$ runs. The resulting two sets of the response variable outputs were found to have variances $\sigma_1^2$ and $\sigma_2^2$. Then, the sampling efficiency of method 1 with respect to method 2 can be expressed as:
$$\frac{\varepsilon_1}{\varepsilon_2} = \frac{\sigma_1^2/\sigma_2^2}{n_1/n_2} \tag{6.18}$$
where $(n_1/n_2)$ is called the labor ratio, and $\sigma_1^2/\sigma_2^2$ is called the variance ratio. MC methods have emerged as a basic and widely used generic approach to quantify variabilities associated with model predictions, and for examining the relative importance of model parameters that affect model performance (Spears et al. 1994; de Wit 2003). There are different types of MC methods depending on the sampling algorithm for generating the trials (Helton and Davis 2003):
(a) Random sampling methods, which were the historic manner of explaining MC methods. They involve using random sampling for estimating integrals (i.e., for computing areas under a curve and solving differential equations). However, there is no assurance that a sample element will be generated from any subset of the sample space. Important subsets of space with low probability but high consequences are likely to be missed.
(b) Crude MC, which uses traditional random sampling where each sample element is generated independently following a pre-specified distribution.
(c) Stratified MC (also called “importance sampling”), where the population is divided into groups or strata according to some pre-specified criterion, and sampling is done so that each stratum is guaranteed representation (unlike the crude MC method). Thus, it has the advantage of forcing the inclusion of specified subsets of sampling space while maintaining the probabilistic character of random sampling. This method is said to be an order of magnitude more efficient than the crude MC method. A major problem is the necessity of defining the strata and calculating their probabilities, especially for high-dimension situations.
Fig. 6.27 The LHMC sampling method is based on quantiles, i.e., dividing the range of variation of each input variable into intervals of equal probability and then combining successive variables by sampling without replacement. LHMC is more efficient than basic MC sampling since it assures that each interval is sampled with the same density
(d) Latin hypercube, or LHMC, a stratified sampling without replacement technique for generating a set of input vectors from a multidimensional distribution. It is often used to construct computer experiments for performing sensitivity and uncertainty analysis on complex systems, and can be viewed as a compromise procedure combining many of the desirable features of random and stratified sampling. A Latin hypercube is the generalization of the Latin square to an arbitrary number of dimensions. It is most appropriate for design problems involving computer simulations because of its higher efficiency (McKay et al. 2000). LHMC is said to yield an unbiased estimator of the mean, but the estimator of the variance is biased (the bias is unknown but generally small).
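A minimal sketch of Latin hypercube sampling is given below, assuming that the range of every input is divided into intervals of equal probability, that each interval is sampled the same number of times, and that the values of successive inputs are paired at random without replacement. The function names and the uniform input ranges are hypothetical; inputs with other distributions would be obtained by passing the unit-interval values through the corresponding inverse cumulative distribution function.

```python
import random


def latin_hypercube(m, n_intervals, k, rng):
    """Return m points in the unit hypercube [0, 1)^k.

    Each of the n_intervals equal-probability intervals of every variable
    receives exactly m / n_intervals points.
    """
    if m % n_intervals != 0:
        raise ValueError("m must be a multiple of n_intervals")
    per_interval = m // n_intervals
    columns = []
    for _ in range(k):
        # Random value inside each interval, repeated per_interval times
        column = [(i + rng.random()) / n_intervals
                  for i in range(n_intervals)
                  for _ in range(per_interval)]
        rng.shuffle(column)  # random pairing with the other variables
        columns.append(column)
    return [tuple(column[i] for column in columns) for i in range(m)]


def scale_uniform(u, low, high):
    """Map a unit-interval value to a uniformly distributed input on [low, high]."""
    return low + u * (high - low)


rng = random.Random(42)
unit_samples = latin_hypercube(m=16, n_intervals=8, k=2, rng=rng)
# Hypothetical ranges for two inputs of a simulation
design = [(scale_uniform(u1, 0.2, 0.6), scale_uniform(u2, 18.0, 26.0))
          for u1, u2 in unit_samples]
print(design[:3])
```

With `m = 16` and `n_intervals = 8`, each interval of each variable holds exactly two points, which is the equal-density property that crude MC sampling does not guarantee.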
LHMC sampling, which can be considered to be a factorial method, is conceptually easy to grasp. Say the input variable/factor vector is of dimension k and a sample of size m is to be generated from p = [p1, p2, p3, . . . , pk]. The range of each variable pj is divided into n disjoint or nonoverlapping intervals of equal probability⁹ (see Fig. 6.27 where n = 8 and m = 16) and values are selected randomly from each interval. The m values thus obtained for p1 are paired at random without replacement with similarly obtained m values for p2. These pairs (called k2-pairs) are then combined in a random manner without replacement with the m values of p3 to form k3-triples. This process is continued until m samples of kp-tuples are formed. Note that this method of generating samples assures that each interval/subspace is sampled with the same density; this leads to greater efficiency compared to the basic MC method (as shown in Fig. 6.27). How to construct LHMC samples, and why this method is advantageous in terms of computer-time efficiency, is discussed in more detail by Dean et al. (2017). How to modify this method to deal with correlated variables has also been proposed and is briefly discussed in the next section.
⁹ Recall that quantiles split sorted data or a probability distribution into equal parts.
6.7.4
Sensitivity Analysis for Screening
Sensitivity analysis is akin to factor screening in DOE and is different from an uncertainty analysis (treated in Sects. 3.6 and 3.7). The aim of sensitivity analysis is to explore/determine/identify the impact of input factors/variables/parameters on the predicted/simulated output variable/response, and then, to quantify their relative importance.¹⁰ During many design studies, the performance of the system often depends on a relatively few significant factors and to a much lesser degree on several insignificant ones (this is the Pareto principle, a rule of thumb often stated as 20/80).
¹⁰ Uncertainty analyses, on the other hand, use probabilistic values of model inputs to estimate probability distributions of model outputs.
Fig. 6.28 Figure illustrating linear and nonlinear sensitivity behavior of annual energy use of a building under different sets of design variables (from Lam and Hui 1996). (a) Linear effect of four different options of glazing design, (b) Nonlinear effect of three different external shading designs
The mapping between random input and the obtained model output can be explored with various techniques to determine the effects of the individual factors. The simplest manner of identifying parameter importance, appropriate for low-dimension input parameter vectors, is to generate scatterplots for each factor versus the response, which can visually reveal the relationships between them. Another possibility is to use the popular least squares techniques to construct a regression model that relates the parameters with the response. Several studies have proposed using partial correlation coefficients of linear regression (Sect. 5.4.5) or using stepwise regression (Sect. 5.4.6) for ranking
parameter importance and for detecting parameter interaction (one such study is that by Wang et al. 2014). The use of linear regression is questionable; it may be suitable when system performance is linear (as exhibited by energy use in certain types of buildings and their HVAC system behavior, by certain design parameters, for a narrow range of parameter variation, etc.), but may not be of general applicability (see Fig. 6.28). Data mining methods, such as the random forest algorithm (Sect. 11.5), are generally superior to regression-based methods in both screening and ranking parameters for building energy design (see, e.g., Dutta et al. 2016). There are several more formal sensitivity analysis methods (see Saltelli et al. 2000 and relevant technical papers). In such methods, the analyst is often faced with the difficult task of selecting the one method most appropriate for the application. A report by Iman and Helton (1985) compares different sensitivity analysis methods as applied to complex engineering systems and summarizes current knowledge in this area. Note that the modeling equations on which the simulation is based are often nonlinear in the parameters and, even if linear, the parameters may interact. In that case, the sensitivity of a parameter/factor may vary from point to point in the parameter space, and random samples in different regions will be required for proper analysis. Thus, one needs to distinguish between local sensitivity analysis (LSA) and global sensitivity analysis (GSA) (Heiselberg et al. 2009). There are two types of sensitivities:
(i) LSA (or one-parameter-at-a-time method), which describes the influences of individual design parameters on system response with all other parameters held constant at their standard or base values, i.e., at a specific local region in the parameter space. It is satisfactory for linear models and when the sensitivity of each individual input is independent of the value of the other inputs (often not true),
(ii) GSA, which provides insight into the influence of a single design parameter on system response when all other parameters are varied together based on their individual ranges and probability distributions.
These two types are discussed below.
(a) LSA or local sensitivity analysis
The general approach to determining individual sensitivity coefficients is summarized below:
(i) Formulate a base case reference and its description.
(ii) Study and break down the factors into basic parameters (parameterization).
(iii) Identify parameters of interest and determine their base case values.
(iv) Determine which simulation outputs are to be investigated and their practical implications.
(v) Introduce perturbations to the selected parameters about their base case values one at a time. Factorial and fractional factorial designs are commonly used.
(vi) Study the corresponding effects of the variation/perturbation on the simulation outputs.
(vii) Determine the sensitivity coefficients for each selected parameter. The ones with the largest values are deemed more significant.
Sensitivity coefficients (also called influence coefficients, or elasticity in economics) are defined in various ways as shown in Table 6.29. All these formulae involve discrete changes to the variables, denoted by the step change Δ, as against partial derivatives. The first form is simply the
Table 6.29 Different forms of sensitivity coefficient (from Lam and Hui 1996)
Form | Formula (see note 1) | Dimension | Common name(s)
1 | ΔOP / ΔIP | With dimension | Sensitivity coefficient, influence coefficient
2a | (ΔOP/OP_BC) / (ΔIP/IP_BC) | % OP change / % IP change | Influence coefficient, point elasticity
2b | (ΔOP/OP_BC) / ΔIP | With dimension | Influence coefficient
3a | [ΔOP / ((OP_1 + OP_2)/2)] / [ΔIP / ((IP_1 + IP_2)/2)] | % OP change / % IP change | Arc mid-point elasticity, meant for two inputs
3b | (ΔOP/ΔIP) / (OP_mean/IP_mean) | % OP change / % IP change | (See note 2)
Note 1: ΔOP, ΔIP = changes in output and input, respectively; OP_BC, IP_BC = base case values of output and input, respectively; IP_1, IP_2 = two values of input; OP_1, OP_2 = the two corresponding values of output; OP_mean, IP_mean = mean values of output and input, respectively
Note 2: The slope of the linear regression line divided by the ratio of the mean output and mean input values
derivative of the output variable (OP) with respect to the input parameter (IP). The second group uses the base case values to express the sensitivity as a percentage change, while the third group uses mean values to do so (this is similar to the forward differencing and central differencing approaches used in numerical methods). Form (1) is the local sensitivity coefficient and is the simplest to interpret. Forms (2a), (3a), and (3b) have the advantage that the sensitivity coefficients are dimensionless. However, form (3a) can only be applied to a single step change and cannot be used for multiple sets of parameters. In general, such methods are of limited use for simulation models with complex sets of nonlinear functions and a large set of input variables.
Figure 6.28 illustrates the linear and nonlinear behavior of two different sets of design strategies. The first set consists of four glazing design variables, while the second set consists of three external shading variables (all the variables are described in the figure). The effects of the parameters related to the four different window designs (Fig. 6.28a) are linear, and the local sensitivity method would provide the necessary insight. Further, it is clear that the shading coefficient (SC) has the largest impact on annual energy use. In contrast, Fig. 6.28b shows that the sensitivity of building energy use to the external shading designs is clearly nonlinear. The projection ratio of the egg-crate external shading design (shown as EG) has the largest impact on energy use. Further, all three external shading designs exhibit an exponential asymptotic behavior.
(b) GSA or global sensitivity analysis
Global sensitivity methods allow parameter sensitivity and ranking for nonlinear models over the entire design space.
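Before turning to the global methods in detail, the local coefficients of Table 6.29 can be made concrete with a short calculation. The following is a minimal Python sketch, not taken from Lam and Hui (1996); the function names and the numerical values (an input raised from 0.4 to 0.5 with the output rising from 100 to 110) are hypothetical and chosen only for illustration.

```python
def sensitivity_forms(ip_1, op_1, ip_2, op_2):
    """Forms 1, 2a, 2b and 3a of Table 6.29 for one step change.

    (ip_1, op_1) is treated as the base case and (ip_2, op_2) as the
    perturbed case.
    """
    d_ip = ip_2 - ip_1
    d_op = op_2 - op_1
    return {
        "1": d_op / d_ip,
        "2a": (d_op / op_1) / (d_ip / ip_1),
        "2b": (d_op / op_1) / d_ip,
        "3a": (d_op / ((op_1 + op_2) / 2)) / (d_ip / ((ip_1 + ip_2) / 2)),
    }


def form_3b(ips, ops):
    """Form 3b: regression slope of OP on IP divided by (mean OP / mean IP)."""
    n = len(ips)
    ip_mean = sum(ips) / n
    op_mean = sum(ops) / n
    sxy = sum((x - ip_mean) * (y - op_mean) for x, y in zip(ips, ops))
    sxx = sum((x - ip_mean) ** 2 for x in ips)
    return (sxy / sxx) / (op_mean / ip_mean)


# Hypothetical numbers: input raised from 0.4 to 0.5, output from 100 to 110.
forms = sensitivity_forms(0.4, 100.0, 0.5, 110.0)
for name, value in forms.items():
    print(f"form {name}: {value:.3f}")
print(f"form 3b: {form_3b([0.3, 0.4, 0.5], [90.0, 100.0, 110.0]):.3f}")
```

For a strictly linear response, forms 1 and 2b stay constant wherever the step is taken, which is why the local method suffices for the glazing variables of Fig. 6.28a; for the shading variables of Fig. 6.28b the values would change with the base point.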
The simplest approaches applied to the design of energy-efficient buildings include traditional randomized factorial two-level sampling designs (e.g., Hou et al. 1996), variations of the Latin squares, and CCD in conjunction with quadratic regression (e.g., Snyder et al. 2013). These approaches are suitable when the number of input variables is relatively low (say, up to 6–7 variables with 2 or 3 levels). For a larger number of variables, however, such approaches are usually infeasible since detailed simulation programs are computationally intensive and have long run-times. One cannot afford to perform separate simulation runs for sensitivity analysis and for acquiring system response for different input vector combinations. Besides being an attractive way of generating multidimensional input vector samples for extensive computer experiments, LHMC is also the most popular method for performing sensitivity studies, i.e., for screening variable importance (Hofer 1999). Once the LHMC simulation runs have been performed, one can identify the strong (influential) parameters and/or the weak ones based on the results of the response variable. For example, the designer of an energy-efficient building would be most concerned with identifying the subset of input parameters which are more likely to lead to low annual energy consumption (such parameters can be referred to as “strong” or influential parameters). This is achieved by a process called regional sensitivity analysis. If the “weak” parameters can be fixed at their nominal values and removed from further consideration during design, the parameter space would be reduced enormously, which somewhat alleviates the “curse of dimensionality.”
There are several approaches one could adopt; a rather simple and intuitive statistical method is described below. Assume that m “candidate” input/design parameters were initially selected, with each parameter discretized into three levels or states (i.e., n = 3). The necessary number of LHMC runs (say, 1,000) is then conducted.
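The sampling step just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any particular LHMC package: the function name `latin_hypercube_levels` is hypothetical, each parameter is taken to be uniformly distributed on a normalized [0, 1) range, and m = 5 is an arbitrary choice.

```python
import random


def latin_hypercube_levels(n_runs, n_params, n_levels=3, seed=0):
    """Latin hypercube sample with each parameter mapped to discrete levels.

    For every parameter, [0, 1) is split into n_runs equal-probability strata,
    one point is drawn in each stratum, and the points are shuffled so that
    the pairing across parameters is random. Each point is then mapped to a
    level index 0 .. n_levels - 1.
    """
    rng = random.Random(seed)
    columns = []
    for _ in range(n_params):
        points = [(i + rng.random()) / n_runs for i in range(n_runs)]
        rng.shuffle(points)
        columns.append([int(p * n_levels) for p in points])
    # Transpose so that each element is one input vector (one simulation run).
    return [tuple(col[r] for col in columns) for r in range(n_runs)]


# Assumed sizes: 1,000 runs, m = 5 candidate parameters, three levels each.
runs = latin_hypercube_levels(n_runs=1000, n_params=5)
print(runs[:3])
```

Because of the stratification, each level of each parameter appears in very nearly one third of the runs, which is what makes the expected count in the screening test below well defined.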
Out of these 1,000 runs, it was found that only 30 runs had response variable values in the acceptable range corresponding to 30 specific vectors of input parameters. One would expect
Table 6.30 Critical thresholds for the chi-square statistic at different significance levels α for two degrees of freedom (d.f. = 2)
α | 0.001 | 0.005 | 0.01 | 0.05
χ² (d.f. = 2) | 13.815 | 10.597 | 9.210 | 5.991
the influential input parameters to appear more often in one level in this “acceptable subset” of input parameter vectors than the weak or non-influential ones. In fact, the latter are likely to be randomly distributed among the acceptable subset of vectors. Had a parameter no influence at all, each of its three states would be expected to occur 30/3 = 10 times. Thus, the extent to which the number of occurrences of an individual parameter within each discrete state differs from 10 would indicate whether this parameter is strong or weak. This is a type of sensitivity test where the weak and strong parameters are identified using non-random pattern tests (Saltelli et al. 2000). The well-known chi-square (χ²) test for comparing distributions (see Sect. 2.4.3g) can be used to assess statistical independence for each and every parameter. First, the χ² statistic is computed for each of the m input variables as:
$$\chi^2 = \sum_{s=1}^{3} \frac{\left(p_{obs,s} - p_{exp}\right)^2}{p_{exp}} \tag{6.19}$$
where p_obs,s is the observed number of occurrences in state s, p_exp is the expected number (in the example above, this is 10), and the subscript s is the index of the state (in this case, there are three states). If the observed numbers are close to the expected number, the χ² value will be small, indicating that the observed distribution fits the theoretical (uniform) distribution closely. This would imply that the particular parameter is weak since its distribution among the acceptable runs can be viewed as random. Note that this test requires that the degrees of freedom (d.f.) be selected as (number of states - 1), i.e., in our case d.f. = 2. The critical values of the χ² distribution for different significance levels α are given in Table 6.30. If the χ² statistic for a particular parameter is greater than 9.21 (α = 0.01), one could assume it to be very strong since the associated statistical probability is greater than 99%. On the other hand, a parameter with a value below 1.386 (α = 0.5) could be considered weak, and those with values in between as uncertain in influence.
(c) The Morris method
A method superior to the one described above for performing GSA is the variance-based approach, which relies on decomposing the variance of the response variable. One such variant has proven to be particularly attractive for many practical design problems. The elementary effects method, also referred to as the Morris method (Morris 1991), is an MC approach utilizing random sampling and a one-at-a-time (OAT) approach to generate vectors of input parameters. It
Table 6.30 (continued)
α | 0.2 | 0.3 | 0.5 | 0.9
χ² (d.f. = 2) | 3.219 | 2.408 | 1.386 | 0.211
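The chi-square screening of Eq. 6.19 with the thresholds of Table 6.30 can be sketched as follows. This is a minimal Python illustration; the function names, the parameter names, and the level counts among the 30 acceptable runs are all hypothetical, and the strong/weak cut-offs are the 9.210 and 1.386 values quoted in the text.

```python
def chi_square_statistic(observed_counts):
    """Eq. 6.19 with a uniform expectation over the discrete states."""
    p_exp = sum(observed_counts) / len(observed_counts)
    return sum((p_obs - p_exp) ** 2 / p_exp for p_obs in observed_counts)


def classify(chi2, strong_threshold=9.210, weak_threshold=1.386):
    """Label a parameter using the d.f. = 2 thresholds of Table 6.30."""
    if chi2 > strong_threshold:
        return "strong"
    if chi2 < weak_threshold:
        return "weak"
    return "uncertain"


# Hypothetical occurrences of each of the three levels in the 30 acceptable runs.
level_counts = {
    "param_A": [22, 5, 3],
    "param_B": [11, 9, 10],
    "param_C": [14, 10, 6],
}
results = {}
for name, counts in level_counts.items():
    chi2 = chi_square_statistic(counts)
    results[name] = (chi2, classify(chi2))
    print(f"{name}: chi2 = {chi2:.1f} -> {results[name][1]}")
```

In this made-up case the first parameter is concentrated in one level and would be retained, the second is spread almost evenly and could be fixed at its nominal value, and the third falls in the uncertain band.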
Table 6.31 Simple example of how the Morris method generates a trajectory of five simulation runs with four design parameters (k = 4)
Run | Parameter k1 | Parameter k2 | Parameter k3 | Parameter k4
Run 0 (a) | k1,1 | k2,1 | k3,1 | k4,1
Run 1 | k1,2* | k2,1 | k3,1 | k4,1
Run 2 | k1,2 | k2,2* | k3,1 | k4,1
Run 3 | k1,2 | k2,2 | k3,2* | k4,1
Run 4 | k1,2 | k2,2 | k3,2 | k4,2*
Note that the parameter values above and below those marked with an asterisk (the value resampled in that run) are frozen for consecutive runs
(a) Baseline run
is an extension of the derivative-based methods (see Table 6.29 for different types of working definitions). It includes variance as an additional sensitivity index for screening the global space and allows combining first- and second-order sensitivity analysis. Each of the k input variables is defined within the range of a continuous variable, normalized by min-max scaling (see Eq. 3.15), and discretized into sub-intervals (equal in number to the levels selected for each parameter) of equal probability (see Fig. 6.27). It uses an LHMC input vector generation method where the pairing of successive parameter combinations retains their correlation behavior. A trajectory is defined as a set of (k + 1) simulation runs, i.e., a sequence of vectors of the k parameters, with each successive point differing from the preceding one in the value of one variable only, by a multiple of a pre-defined step size Δi. Multiple trajectories need to be generated during the parametric design. How the vector sequence is generated is illustrated using an example with four parameters in Table 6.31. The baseline run (Run 0) is generated by randomly selecting the discrete levels of each parameter (indicated as k1,1, k2,1, ..., k4,1). For Run 1, one of the parameters, in this case k1, is resampled with the other three parameter values unchanged. Run 2 consists of randomly selecting one of the remaining parameters (the selection should not be sequential; here, k2 has been selected for easier comprehension) and then randomly resampling its value (shown as k2,2). This is repeated for the remaining parameters k3 and k4. Hence, only one parameter is resampled for each run, and that value is frozen for the subsequent runs in the trajectory. The second simulation trajectory is created by randomly selecting a new set of level combinations for its baseline run and using the OAT approach for subsequent runs, as for the first trajectory.
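The construction of one trajectory can be sketched in Python as below. This is a simplified illustration of the one-at-a-time scheme described above rather than the exact sampling of Morris (1991): the function name `morris_trajectory` is hypothetical, levels are represented by integer indices, and the new level of the resampled parameter is drawn uniformly from the other levels.

```python
import random


def morris_trajectory(n_params, n_levels, rng):
    """Build one trajectory of (n_params + 1) runs of discrete level indices.

    Run 0 is a random baseline; each later run resamples exactly one
    parameter, chosen in random order, and keeps it frozen afterwards.
    """
    current = [rng.randrange(n_levels) for _ in range(n_params)]
    order = list(range(n_params))
    rng.shuffle(order)  # random, not sequential, choice of the next parameter
    runs = [tuple(current)]
    for i in order:
        other_levels = [lvl for lvl in range(n_levels) if lvl != current[i]]
        current[i] = rng.choice(other_levels)  # move parameter i to a new level
        runs.append(tuple(current))
    return runs, order


rng = random.Random(42)
runs, order = morris_trajectory(n_params=4, n_levels=3, rng=rng)
for number, run in enumerate(runs):
    print(f"Run {number}: {run}")
```

Calling the function r times with the same random generator yields the r independent trajectories, i.e., r(k + 1) input vectors to be simulated.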
It has been found that only a relatively small number of such trajectory simulation runs are needed. Once the
Fig. 6.29 Variation of the mean and standard deviation (μ* and σ) for the 23 building design variables chosen with annual energy use as the response variable (from Didwania et al. 2023). Only the seven design variables falling outside the envelope indicated are influential, the others are anonymously shown as dots
simulations are performed, the elementary effect (EE) of an individual parameter or factor i can be calculated for all points in the trajectory as follows (Sanchez et al. 2014):
$$EE_i = \frac{y(\mathbf{k} + \mathbf{e}_i \Delta_i) - y(\mathbf{k})}{\Delta_i} \tag{6.20}$$
where y is the response variable (design criterion or objective function), k in the argument of y denotes the current vector of input parameter values, and Δi is the pre-defined step size. The term ei is a vector of zeros except for its ith component, which takes on integer values by which different levels of the discretized parameter can be selected. Each trajectory with (k + 1) simulation runs provides one estimate of the elementary effect for each of the k parameters or variables. A set of r such different trajectories is defined, so that the total number of simulation runs is r(k + 1). The average μi and standard deviation σi of the elementary effects are computed for each parameter i over the r trajectories (indexed by t):
$$\mu_i = \frac{1}{r}\sum_{t=1}^{r} EE_{it} \tag{6.21}$$
$$\sigma_i = \sqrt{\frac{1}{r-1}\sum_{t=1}^{r}\left(EE_{it} - \mu_i\right)^2} \tag{6.22}$$
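Equations 6.20 to 6.22 translate directly into code. The sketch below is a minimal Python illustration with hypothetical function names and made-up numbers; it assumes normalized continuous inputs, takes the step as the signed change of the single parameter that moved between consecutive runs, and also returns the mean of the absolute elementary effects, which is how μ* is commonly defined in the literature.

```python
import math


def elementary_effects(runs, responses):
    """Eq. 6.20 along one trajectory: one elementary effect per parameter."""
    effects = {}
    for j in range(len(runs) - 1):
        before, after = runs[j], runs[j + 1]
        # Index of the single parameter that changed between the two runs.
        i = next(p for p in range(len(before)) if before[p] != after[p])
        effects[i] = (responses[j + 1] - responses[j]) / (after[i] - before[i])
    return effects


def morris_statistics(ee_values):
    """Mean (Eq. 6.21), mean of absolute values, and std. dev. (Eq. 6.22)."""
    r = len(ee_values)
    mu = sum(ee_values) / r
    mu_star = sum(abs(e) for e in ee_values) / r
    sigma = math.sqrt(sum((e - mu) ** 2 for e in ee_values) / (r - 1))
    return mu, mu_star, sigma


# Hypothetical linear response y = 2*x0 + 3*x1 on normalized inputs.
trajectory = [(0.0, 0.0), (0.5, 0.0), (0.5, 0.5)]
responses = [2 * x0 + 3 * x1 for x0, x1 in trajectory]
print(elementary_effects(trajectory, responses))

# Hypothetical elementary effects of one parameter over r = 4 trajectories.
mu, mu_star, sigma = morris_statistics([3.0, -3.0, 3.0, -3.0])
print(mu, mu_star, round(sigma, 3))
```

The second example shows why the absolute-value mean is useful: effects of opposite sign cancel in μ, while μ* and σ both flag the parameter as influential and non-additive.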
The ensemble of trajectory runs is analyzed by computing and plotting the two statistical indicators, namely the mean and standard deviation (μ* and σ) of the elementary effects of each design or input parameter on the response (μ* commonly denotes the mean of the absolute values of the elementary effects, which prevents effects of opposite sign from cancelling). The relative importance of each parameter on the response variable, and of parameter interaction on the response, can be determined as follows¹¹:
1. Negligible – low average (μ*) and low standard deviation (σ)
2. Linear and additive – high average (μ*) and low standard deviation (σ)
3. Nonlinear or presence of interactions – high standard deviation (σ)
¹¹ Note that this process is akin to the regional sensitivity approach using the chi-square test given by Eq. 6.19.
How the final selection is done based on the above criteria is illustrated in Fig. 6.29. The points falling outside the zone of influence indicated by a curved line are deemed influential. One notes that out of 23 parameters investigated, only 7 are significant and interactive. A lower μ* indicates that changing the variable will not have a substantial impact on the objective function; a lower σ suggests that its impact on the objective function is neither affected by other parameters nor nonlinear. If the model parameters have a significantly nonlinear effect, then it is suggested that an additional analysis involving second-order effects be performed as described by Sanchez et al. (2014).
The Morris method is said to require far fewer simulation trajectory runs (i.e., to be more efficient) than the traditional LHMC method when a large number of parameters with possible interaction and second-order effects are being screened. The greater the number of trajectory simulation runs, the greater the accuracy, but studies have shown that 15–20 trajectory runs are adequate with 25 or so design variables. The Morris method has also been combined with the parallel coordinate graphical representation to enhance the ability of architects and designers to visually explore different combinations of sustainable building design which meet pre-stipulated ranges of variation of multi-criteria objective functions (Didwania et al. 2023).
(d) Closure
Helton and Davis (2003) cite over 150 references in the area of sensitivity analysis, discuss the clear advantages of LHMC for the analysis of complex systems, and enumerate the reasons for the popularity of such methods. Another popular GSA method, similar to the Morris method and meant for complex mathematical models, is the Sobol method (Sobol 2001). It generates a sample that is uniformly distributed over the unit
hypercube and uses variance-based global sensitivity indices to determine first-order, second-order, and total effects of individual variables or groups of variables on the model output. It is said to be more stable than the Morris method but requires more simulation runs. The results of both methods have been reported to be similar by Didwania et al. (2023) for a case study involving the design of energy-efficient buildings.
In addition to the LHMC method and the variance-based GSA method, screening and ranking of variables/factors by importance can also be done by data mining methods (such as the random forest algorithms discussed in Sect. 11.5). Dutta et al. (2016) illustrate this approach in the context of energy-efficient building design as a more attractive alternative to the CCD design approach, which has been reported by Snyder et al. (2013) and is described in Problem 11.14.
6.7.5 Surrogate Modeling
Fig. 6.30 Simple example with two design variables to visually illustrate the surrogate modeling iterative approach as applicable to the design of an energy-efficient building. The dots indicate the preliminary computer simulation runs which would provide insights into how to narrow the solution space for subsequent simulation runs. (From Westermann and Evins 2019)
All the experimental design methods discussed above are referred to as static sampling approaches since all the samples are defined prior to running the batch of simulations and are not adjusted depending on simulation outcomes. A more efficient alternative is the iterative sampling approach, which speeds up convergence while requiring fewer simulation runs. This is akin to RSM which, as described in Sect. 6.6, is an iterative approach that accelerates the search toward the optimum condition by simultaneously varying more than one treatment variable. It proceeds by first identifying a model between a response and a set of several continuous treatment variables over an initial limited solution space, analyzing test results to determine the optimal direction in which to move, performing a second set of tests, and so on until the desired optimum is reached. A similar methodology for iteratively reducing the number of simulations and speeding up convergence during the search for an optimum can be adopted for detailed computer simulation programs. Specifically, the following steps are undertaken:
(i) Initially select a relatively small set of specific values and pairings of these input parameters (akin to performing factorial experiments).
(ii) Make multiple runs on the computer model and perform a sensitivity analysis to determine a subset of dominant model input parameter combinations (akin to screening).
(iii) Fit an appropriate mathematical model between the response and the set of independent variables (usually a second-order polynomial model with/without interactions).
(iv) Use this fitted response surface polynomial model as a surrogate (replacement or proxy) for the computer model to rapidly revise/shrink the original search space.
(v) Repeat steps (ii) to (iv) until the desired optimal/satisficing design solution is reached.
Figure 6.30 visually illustrates the general approach for a simple case involving only two design variables. An energy-efficient building is to be designed by varying two design variables: WWR (window-to-wall ratio) and SHGC (solar heat gain coefficient of the window). The traditional factorial-type experimental design would require that the solution space be uniformly blanketed to find a satisfactory or optimal solution. The same insight
could be provided by fewer simulation runs by performing an initial set of limited runs (indicated as dots), selecting a finer grid in the sub-space of interest, and then iteratively zeroing in on the design solution. This approach is akin to the response surface modeling (RSM) approach described in Sect. 6.6. The significant benefit of the mathematical surrogate model approach is that it allows the solution to be reached more quickly using calculus methods than by computer simulations alone. However, a higher level of domain knowledge and analytical skills is demanded of the analyst. A good review of surrogate modeling techniques in general can be found in Dean et al. (2017), while an extensive literature review of their application to the design of sustainable buildings is provided by Westermann and Evins (2019).
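The iterative loop of this section can be sketched for the two-variable WWR/SHGC case as follows. This is a minimal Python illustration under several assumptions that are not from the source: the function `simulate` is a hypothetical quadratic stand-in for a detailed simulation program, the screening step (ii) is skipped because only two variables are involved, the surrogate is fitted in coded variables (-1, 0, +1) on a 3 × 3 grid, and its minimum is located by a cheap grid evaluation rather than by calculus.

```python
import itertools


def simulate(wwr, shgc):
    """Hypothetical stand-in for one detailed simulation run (annual energy)."""
    return 100 + 80 * (wwr - 0.3) ** 2 + 60 * (shgc - 0.4) ** 2 + 20 * wwr * shgc


def basis(u, v):
    """Second-order polynomial terms with interaction, in coded variables."""
    return [1.0, u, v, u * u, v * v, u * v]


def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][j] * x[j] for j in range(r + 1, n))) / m[r][r]
    return x


def fit_quadratic(points, ys):
    """Least-squares fit of the surrogate via the normal equations."""
    rows = [basis(u, v) for u, v in points]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(6)] for i in range(6)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(6)]
    return solve(ata, aty)


def surrogate_search(center, half_width, n_iter=6):
    coded = list(itertools.product([-1.0, 0.0, 1.0], repeat=2))  # 3 x 3 design
    fine = [i / 20 - 1 for i in range(41)]  # cheap grid on the surrogate only
    n_sim = 0
    for _ in range(n_iter):
        ys = [simulate(center[0] + u * half_width[0], center[1] + v * half_width[1])
              for u, v in coded]
        n_sim += len(coded)
        beta = fit_quadratic(coded, ys)
        u_best, v_best = min(
            itertools.product(fine, fine),
            key=lambda p: sum(b * t for b, t in zip(beta, basis(*p))))
        # Re-center on the surrogate optimum and halve the search region.
        center = (center[0] + u_best * half_width[0],
                  center[1] + v_best * half_width[1])
        half_width = (half_width[0] / 2, half_width[1] / 2)
    return center, n_sim


best, n_sim = surrogate_search(center=(0.5, 0.5), half_width=(0.4, 0.3))
print(f"WWR = {best[0]:.3f}, SHGC = {best[1]:.3f} after {n_sim} simulation runs")
```

Only the nine design points per iteration call the (expensive) simulation; the fine grid is evaluated on the fitted polynomial, which is where the saving over uniformly blanketing the solution space comes from.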
6.7.6 Summary
Sensitivity analysis as pertinent to computer-based simulation design involves three major stages:
Stage 1: Pre-processing or selection of independent design variable combinations
The pre-processing stage involves selecting design variables of interest and identifying practical ranges based on the building type, project requirements, and owner specifications. If nonlinear relationships between predictors and response are suspected, then a minimum of three levels for each variable (which allow nonlinear and interaction effects to be explicitly captured) should be selected for the traditional factorial design methods, while two levels can be used for rotatable CCD design. The number of evaluative combinations increases exponentially with the number of factors and levels. For example, 15 variables at three levels would lead to 3^15 ≈ 14 × 10^6 combinations, an impractical number of simulations. An experimental design technique is therefore essential to select fewer runs while ensuring stratified (representative) sampling of the variable space. LHMC is popular since it is numerically efficient and its results are easy to interpret when performing sensitivity and uncertainty analyses of complex systems with many design parameters (Helton and Davis 2003; Heiselberg et al. 2009). Alternatively, a more traditional CCD design could be adopted if the initial variable set is relatively small (say, fewer than 6–7 variables)¹². Since the samples are defined prior to simulation and not adjusted depending on simulation outcomes, this approach is called static sampling (it is more popular and easier to implement than the iterative approach).
Stage 2: Simulation-based generation of system responses
Selected variable combinations can now be input into an hourly building energy simulation program for batch processing. The responses could be direct outputs from the chosen simulation program, such as annual energy use or peak demand, or could be derived metrics such as energy costs or environmental impacts. A database of such simulation outputs is created, which is then used in the next stage.
Stage 3: Post-processing of simulation results
Traditionally, least squares regression analysis has been the most popular technique for developing a model to predict energy consumption over the range of variation of the numerous design parameters. However, many studies have found that global regression techniques are questionable for this purpose and that nonparametric data mining methods such as the random forest algorithm (described in Sect. 11.5) are superior both in feature selection (identification of the design variables that are most influential) and as a global prediction model.
Problems
Pr. 6.1¹³ Full-factorial design for evaluating three different missile systems
A full-factorial experiment is conducted to determine which of three different missile systems is preferable. The propellant burning rate for 24 static firings was measured using four different propellant types. The experiment performed duplicate observations (replicate r = 2) of burning rates (in minutes) at each combination of the treatments. The data, after coding, are given in Table 6.32.
Table 6.32 Burning rates in minutes for the (3 × 4) case with two replicates (Problem 6.1)
Missile system | Propellant type b1 | b2 | b3 | b4
A1 | 34.0, 32.7 | 30.1, 32.8 | 29.8, 26.7 | 29.0, 28.9
A2 | 32.0, 33.2 | 30.2, 29.8 | 28.7, 28.1 | 27.6, 27.8
A3 | 28.4, 29.3 | 27.3, 28.9 | 29.7, 27.3 | 28.8, 29.1
Data available electronically on book website
¹² Refer to Problem 6.14 in the context of energy efficient building design.
¹³ From Walpole et al. (2007) by © permission of Pearson Education.
The following hypothesis tests are to be studied:
(i) There is no difference in the mean propellant burning rates when different missile systems are used.
(ii) There is no difference in the mean propellant burning rates of the four propellant types.
(iii) There is no interaction between the different missile systems and the different propellant types.
Pr. 6.2 Random effects model for worker productivity
A full-factorial experiment was conducted to study the effect of indoor environment condition (depending on such factors as dry bulb temperature, relative humidity, ...) on the productivity of workers manufacturing widgets. Four groups of workers, called G1, G2, G3, and G4, were selected and distinguished by such traits as age, gender, .... The number of widgets produced over a day by two members of each group under three different environmental conditions (E1, E2, and E3) was recorded. These results are assembled in Table 6.33. Using a 0.05 significance level, test the hypotheses that: (a) different environmental conditions have no effect on the number of widgets produced, (b) different worker groups have no effect on the number of widgets produced, (c) there are no interaction effects between the two factors. Subsequently, identify a suitable random effects model, study model residual behavior, and draw relevant conclusions.
Table 6.33 Number of widgets produced daily using a replicate r = 2 (Problem 6.2)
Environmental conditions | Group number G1 | G2 | G3 | G4
E1 | 227, 221 | 214, 259 | 225, 236 | 260, 229
E2 | 187, 208 | 181, 179 | 232, 198 | 246, 273
E3 | 174, 202 | 198, 194 | 178, 213 | 206, 219
Data available electronically on book website
Pr. 6.3 Two-factor two-level (2²) factorial design (complete or balanced)
Consider a brand of variable speed electric motor which is meant to operate at different ambient temperatures (factor X) and at different operating speeds (factor Y). The time to failure in hours (the response variable) has been measured for the four different treatment groups (conducted in a randomized manner) at a replication level of three (Table 6.34).
Table 6.34 Data table for Problem 6.3
Factor X | Factor Y: Low | Factor Y: High
Low | 82, 78, 86 | 64, 61, 57
High | 67, 74, 67 | 46, 54, 52
Table 6.35 Thermal efficiencies (%) of the two solar thermal collectors (Problem 6.4)
Collector | Mean operating temperature 80 °C | 70 °C | 60 °C | 50 °C
Without selective surface | 28, 29, 31, 32 | 34, 33, 35, 34 | 38, 39, 41, 38 | 40, 42, 41, 41
With selective surface | 33, 36, 33, 34 | 38, 38, 36, 35 | 41, 40, 43, 42 | 43, 45, 44, 45
Data available electronically on book website
(a) Code the data in the table using the standard form suggested by Yates (see Table 6.10).
(b) Generate the main effect and interaction plots and summarize observations.
(c) Repeat the analysis procedure described in Example 6.4.2 and identify a suitable prediction model for the response variable at the 0.05 significance level.
(d) Compare this model with one identified using linear OLS multiple regression.
Pr. 6.4 The thermal efficiency of solar thermal collectors decreases as their average operating temperatures increase. One of the means of improving the thermal performance is to use selective surfaces for the absorber plates, which have the special property that the absorption coefficient is high for solar radiation and low for the infrared radiative heat losses. Two collectors, one without a selective surface and another with, were tested at four different operating temperatures under replication r = 4. The experimental results of thermal efficiency in % are tabulated in Table 6.35.
(i) Perform an analysis of variance to test for significant main and interaction effects.
(ii) Identify a suitable random effects model.
(iii) Identify a linear regression model and compare your results with those from part (ii).
(iv) Study model residual behavior and draw relevant conclusions.
Pr. 6.5 Complete factorial design (3²) with replication
The carbon monoxide (CO) emissions in g/m³ from automobiles (the response variable) depend on the amount of ethanol added to a standard fuel (Factor A) and the air/fuel ratio (Factor B). A standard 3^k factorial design (i.e., three levels) with k = 2 and two replicates results in the values shown in
Table 6.36. Low, medium, and high levels are coded as -1, 0, and +1.
Table 6.36 Data table for Problem 6.5
Trial # | Factor A level | Factor B level | CO emissions (response Y)
1 | -1 | -1 | 66, 62
2 | 0 | -1 | 78, 81
3 | +1 | -1 | 90, 94
4 | -1 | 0 | 72, 67
5 | 0 | 0 | 80, 81
6 | +1 | 0 | 75, 78
7 | -1 | +1 | 68, 66
8 | 0 | +1 | 66, 69
9 | +1 | +1 | 60, 58
Data available electronically on book website
(a) Generate the main effect and interaction plots and summarize observations.
(b) Identify the significant main and interaction terms using multifactor ANOVA.
(c) Confirm that this is an orthogonal array (by taking the inverse of (X^T X)).
(d) Using the matrix inverse approach (see Eq. 6.11), identify a factorial design model.
(e) Perform a multiple linear regression analysis with only the significant terms and identify a suitable prediction model for the response variable at the 0.05 significance level.
Pr. 6.6 Composite design
The following data were experimentally collected using a composite design (Table 6.37):
Table 6.37 Data table for Problem 6.6
X1 | X2 | Y
-1 | -1 | 2
+1 | -1 | 4
-1 | +1 | 3
+1 | +1 | 5
-2 | 0 | 1
+2 | 0 | 4
0 | -2 | 1
0 | +2 | 5
0 | 0 | 3
Data available electronically on book website
(a) Generate the main effect and interaction plots and summarize observations.
(b) Identify the significant main and interaction terms using multifactor ANOVA.
(c) Identify a prediction model.
(d) Using least squares, evaluate first-order and second-order models. Compare them with the model identified from step (c).
(e) Determine the optimal value.
(f) Criticize the analysis and suggest improvements to the design procedure (for example, more central points, replicates, more decimal points in the response variable, ...). Is this a rotatable design?
Pr. 6.7 The close similarity between a factorial design model and a multiple linear regression model was illustrated in Example 6.4.2. You will repeat this exercise with data from Example 6.4.3.
(a) Identify a multiple linear regression model and verify that the parameters of all regressors are identical to those of the factorial design model.
(b) Verify that model coefficients do not change when multiple linear regression is redone with the reduced model using variables coded as -1 and +1.
(c) Perform a forward stepwise linear regression and verify that you get back the same reduced model with the same coefficients.
Pr. 6.8 2³ factorial analysis for strength of concrete mix
A civil construction company wishes to maximize the strength of its concrete mix with three factors or variables: A (water content), B (coarse aggregate), and C (silica). A 2³ full factorial set of experimental runs, consistent with the nomenclature of Table 6.10, was performed. These results are assembled in Table 6.38.
(a) You are asked to analyze these data, generate the ANOVA table, and identify statistically meaningful terms (Hint: You will find that none are significant, which is probably due to d.f. = 1 for the residual error. It would have been more robust to do replicate testing).
(b) Analyze the data using stepwise multiple linear regression and identify the statistically significant factors and interactions (at 0.005 significance).
(c) Identify the complete linear regression model using all main and interaction terms and verify that the model coefficients of the statistically significant terms are identical to those using stepwise regression (step b). This is one of the major strengths of the 2^k factorial design.
(d) Develop a factorial design model for this problem using only the statistically significant terms.
Table 6.38 Data table for Problem 6.8
Trial | Level of factor A | B | C | Response, replication 1 | Response, replication 2
1 | -1 | -1 | -1 | 58.27 | 57.32
2 | 1 | -1 | -1 | 55.06 | 55.53
3 | -1 | 1 | -1 | 58.73 | 57.95
4 | 1 | 1 | -1 | 52.55 | 53.09
5 | -1 | -1 | 1 | 54.88 | 55.2
6 | 1 | -1 | 1 | 58.07 | 58.76
7 | -1 | 1 | 1 | 56.60 | 56.16
8 | 1 | 1 | 1 | 59.57 | 58.87
Data available electronically on book website
Pr. 6.9¹⁴ Predictive model inferred from 2³ factorial design on a large laboratory chiller
Table 6.39 assembles steady-state data of a 2³ factorial series of laboratory tests conducted on a 90 ton centrifugal chiller. There are three factors (Tcho = temperature of chilled water leaving the evaporator, Tcdi = temperature of cooling water entering the condenser, and Qch = chiller cooling load) with two levels each, thereby resulting in 8 data points without any replication. Note that there are small differences in the high and low levels of each of the factors because of operational control variability during testing. The chiller coefficient of performance (COP) is the response variable.
(a) Perform an ANOVA analysis, and check the importance of the main and interaction terms using the 8 data points indicated in the table.
(b) Identify the parsimonious predictive model from the above ANOVA analysis.
(c) Identify a least squares regression model with coded variables and compare the model coefficients with those from the model identified in part (b).
(d) Generate model residuals and study their behavior (influential outliers, constant variance, and near-normal distribution).
(e) Reframe both models in terms of the original variables and compare the internal prediction errors.
(f) Using the four data sets indicated in the table as holdout points meant for cross-validation, compute the NMSE, RMSE, and CV values of both models. Draw relevant conclusions.
Pr. 6.10 Blocking design for machining time
Consider Example 6.5.1 where the performance of four machines was analyzed in terms of machining time with operator dexterity being a factor to be blocked. How to identify an additive linear model was also illustrated. It was pointed out that interaction effects may be important.
(a) You will reanalyze the data to determine whether interaction terms are statistically significant or not.
(b) It was noted from the residual plots that two of the extreme data points were suspect. Can you redo the analysis while accounting for this fact?
Pr. 6.11 Latin squares design with k = 3
Reduction in nitrogen oxides due to gasoline additives (the main treatment factor) is to be analyzed for different types of automobiles (factor X) under different drivers (factor Y). One does not expect interaction effects, and so a Latin squares series of experiments is conducted assuming five different levels for all three factors. The following table assembles the results for this design, with nitrogen oxide emissions being shown as continuous numerical values. The different gasoline additives are the treatment, coded as A, B, C, D, and E (Table 6.40).
Table 6.39 Laboratory tests from a centrifugal chiller (Problem 6.9)

Data for model development
Test #   Tcho (°C)   Tcdi (°C)   Qch (kW)   COP
1        10.940      29.816      315.011    3.765
2        10.403      29.559      103.140    2.425
3        10.038      21.537      289.625    4.748
4         9.967      18.086      122.884    3.503
5         4.930      27.056      292.052    3.763
6         4.541      26.783      109.822    2.526
7         4.793      21.523      354.936    4.411
8         4.426      18.666      114.394    3.151

Data for cross-validation
Tcho (°C)   Tcdi (°C)   Qch (kW)   COP
7.940       29.628      286.284    3.593
7.528       24.403      348.387    4.274
6.699       24.288      188.940    3.678
7.306       24.202       93.798    2.517
Data available electronically on book website

¹⁴ Adapted from a more extensive table from data collected by Comstock and Braun (1999). We are thankful to James Braun for providing this data.
Table 6.40 Data table for Problem 6.11
Factor X   Y1      Y2      Y3      Y4      Y5
X1         A 24    B 20    C 19    D 24    E 24
X2         B 17    C 24    D 30    E 27    A 36
X3         C 18    D 38    E 26    A 27    B 21
X4         D 26    E 31    A 26    B 23    C 22
X5         E 22    A 30    B 20    C 29    D 31
Data available electronically on book website

Table 6.41 Data for Problem 6.12 where the grades are out of 100
Time period   Course 1   Course 2   Course 3   Course 4
1             A 84       B 79       C 63       D 97
2             B 91       C 82       D 80       A 93
3             C 59       D 70       A 77       B 80
4             D 75       A 91       B 75       C 68
Data available electronically on book website
(a) Briefly state the benefits and limitations of the Latin square design.
(b) State the relevant hypothesis tests one could perform.
(c) Generate the analysis of variance table and identify the significant factors at the 0.05 significance level.
(d) Perform a linear regression analysis and identify a parsimonious model.

Pr. 6.12 Latin squares for teaching evaluations
The College of Engineering of a large university wishes to evaluate the teaching capabilities of four professors. To eliminate any effects due to different courses offered during different times of the day, a Latin squares experiment was performed in which the letters A, B, C, and D represent the four professors. Each professor taught one section of each of the four different courses scheduled at each of four different times of the day. Table 6.41 shows the grades assigned by these professors to 16 students of approximately equal ability. At the 0.05 level of significance, test the hypothesis that different professors have no effect on the grades.

Pr. 6.13 As part of the first step of a response surface (RS) approach, the following linear model was identified from preliminary experimentation using two coded variables:

$$ y = 55 - 2.5\,x_1 + 1.2\,x_2 \quad \text{with} \quad -1 \le x_i \le +1 \tag{6.23} $$

Determine the path of steepest ascent and draw this path on a contour plot.

Pr. 6.14 Building design involving simulations¹⁵
This is a simplified example to illustrate how DOE can be used in conjunction with a detailed building energy simulation program to find optimal values of the design parameters that minimize annual energy use. Deru et al. (2011) undertook a project of characterizing the commercial building stock in the US and developing reference models for them. Fifteen commercial building types and one multifamily residential building were determined to represent approximately two-thirds of the commercial building stock. The input parameters for the building models came from several sources; some were determined from ASHRAE standards, and the rest were determined from other studies of data and standard practices. National data from 2003 CBECS (EIA 2003) were used to determine the appropriate, average mix of representative buildings, with the intention to represent 70% of US commercial building floor area.
The building selected for this example is a medium office building of about 52,000 ft², with three floors and an aspect ratio of 1.5. Building energy use depends on several variables, but only five variables are assumed, as shown in Table 6.42: lighting power density (LPD), window shading coefficient (SC), exterior wall R-value (EWR), window U-value (WU), and window-to-wall ratio (WWR). All these design variables are continuous, and the range of variation set by the architect is also shown. The study used the CCD design at two levels to generate an "optimal" set of 43 input combinations (edge + axial + center = 2^5 + 2 × 5 + 1 = 43). Each of these combinations was then simulated by the detailed building energy simulation program to yield the annual energy use of both electricity and natural gas, as assembled in the last two columns of Table B.3 in Appendix B and also available electronically. The location is Madison, WI, and the TMY2 climate data file was used.
(a) Analyze this study in terms of design approach. Start by identifying the edge points, the axial points, and the central points. Compare this design with the full factorial and the partial factorial design methods when the behavior is known to be non-linear (i.e., 3 levels for each factor). Generate the combinations and discuss benefits and limitations compared to CCD.
(b) Electricity use in kWh must be converted into the same units as that of natural gas. Assume a conversion factor of 33% (this is the efficiency of electricity generation and supply). Combine the two annual energy use quantities into a total thermal energy use, which is the aggregated variable to be considered below.
(c) Perform ANOVA analysis and identify significant terms and interactions of the aggregated energy use variable. Fit a response surface model and deduce the optimal design point (note: the design variables are bounded as indicated in Table 6.42). Discuss advantages and limitations.
(d) If the study were to be expanded to include more variables (say 15), how would you proceed? Lay out your experimental design procedure in logical successive steps. (Hint: consider a two-step process: sensitivity analysis as well as identification of the optimal combination of design variables.)

Table 6.42 Design parameters along with their range of variability and the response variables (Pr. 6.14)
Design parameter                                  Variable name   Range
Lighting power density                            LPD             0.8–1.5 W/ft²
Window shading coefficient                        SC              0.2–0.7
Exterior wall R-value (total resistance)          EWR             7.8–27 h-ft²-°F/Btu
Window U-value (overall heat loss coefficient)    WU              0.26–1 Btu/h-ft²-°F
Window to wall ratio                              WWR             0.1–0.5
Response variables                                Variable name   Units
Electricity use (annual)                          E_elec          10³ kWh
Natural gas use (annual)                          E_gas           10⁶ Btu

¹⁵ We thank Steve Snyder for this design problem, which is fully discussed in Snyder et al. (2013).
Table B.3 Five factors at two-level CCD design combinations and the two response variable values (building energy use per year of electricity and natural gas) found by simulation. Only a few rows are shown for comprehension while the entire data set is given in Appendix B3. The units of the variables are specified in Table 6.42
Run #   LPD     SC      EWR      WU      WWR     E_elec    E_gas
1       0.800   0.450   17.400   0.630   0.300   444.718   117.445
2       1.003   0.555   13.364   0.786   0.216   463.986   116.974
3       1.003   0.555   21.436   0.786   0.216   463.374   117.064
...
43      1.500   0.450   17.400   0.630   0.300   523.427   116.534
Data for this problem are given in Appendix B.3 and also available electronically on book website
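The run count quoted in Pr. 6.14 (2^5 + 2 × 5 + 1 = 43) can be checked with a short sketch that enumerates a central composite design for the five variables of Table 6.42. The axial distance is an assumption here: the rotatable value alpha = (2^5)^(1/4) is inferred from the few rows shown in Table B.3 and is not stated in the problem. The function and variable names are illustrative only, and the run order will not match that of Table B.3.

```python
from itertools import product

# Ranges of the five design variables, taken from Table 6.42.
RANGES = {
    "LPD": (0.8, 1.5),
    "SC": (0.2, 0.7),
    "EWR": (7.8, 27.0),
    "WU": (0.26, 1.0),
    "WWR": (0.1, 0.5),
}


def ccd_coded(k):
    """Return the coded runs of a CCD with k factors and the axial distance."""
    alpha = (2 ** k) ** 0.25  # rotatable axial distance (an assumption)
    # Edge (factorial) points: every combination of -1 and +1.
    edge = [list(p) for p in product((-1.0, 1.0), repeat=k)]
    # Axial points: one factor at +/-alpha, all others at the center.
    axial = []
    for i in range(k):
        for level in (-alpha, alpha):
            point = [0.0] * k
            point[i] = level
            axial.append(point)
    center = [[0.0] * k]
    return edge + axial + center, alpha


def to_physical(coded_run, alpha):
    """Scale a coded run so that +/-alpha maps onto the bounds of each range."""
    values = []
    for x, (low, high) in zip(coded_run, RANGES.values()):
        mid = 0.5 * (low + high)
        half = 0.5 * (high - low)
        values.append(mid + half * x / alpha)
    return values


coded_runs, alpha = ccd_coded(len(RANGES))
runs = [to_physical(run, alpha) for run in coded_runs]
print(len(runs))  # 43
```

With this scaling the axial points sit on the bounds set by the architect, and the edge points fall inside the range, which is consistent with the rows of Table B.3 shown above (to three decimals).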
References
Beck, J.V. and K.J. Arnold, 1977. Parameter Estimation in Engineering and Science, John Wiley and Sons, New York.
Berger, P.D. and R.E. Maurer, 2002. Experimental Design with Applications in Management, Engineering, and the Sciences, Duxbury Press.
Box, G.E.P., W.G. Hunter and J.S. Hunter, 1978. Statistics for Experimenters, John Wiley & Sons, New York.
Buckner, J., D.J. Cammenga and A. Weber, 1993. Elimination of TiN peeling during exposure to CVD tungsten deposition process using designed experiments, Statistics in the Semiconductor Industry, Austin, Texas: SEMATECH, Technology Transfer No. 92051125A-GEN, vol. I, 4-445-3-71.
Clarke, J.A., 1993. Assessing building performance by simulation, Building and Environment, 28(4), pp. 419–427.
Comstock, M.C. and J.E. Braun, 1999. Development of Analysis Tools for the Evaluation of Fault Detection and Diagnostics in Chillers, ASHRAE Research Project 1043-RP; also, Ray W. Herrick Laboratories, Purdue University, HL 99-20: Report #4036-3, December.
Dean, A., D. Voss and D. Draguljić, 2017. Design and Analysis of Experiments, 2nd ed., Springer-Verlag, New York.
Deru, M., K. Field, D. Studer, K. Benne, B. Griffith, P. Torcellini, B. Liu, M. Halverson, D. Winiarski, M. Yazdanian, J. Huang and D. Crawley, 2011. U.S. Department of Energy Commercial Reference Building Models of the National Building Stock, National Renewable Energy Laboratory, NREL/TP-5500-46861, U.S. Department of Energy, February.
Devore, J. and N. Farnum, 2005. Applied Statistics for Engineers and Scientists, 2nd Ed., Thomson Brooks/Cole, Australia.
De Witt, S., 2003. Chap. 2, Uncertainty in Building Simulation, in Advanced Building Simulation, Eds. A.M. Malkawi and G. Augenbroe, Spon Press, Taylor and Francis, New York.
Didwania, K., T.A. Reddy and M. Addison, 2023. Synergizing Design of Building Energy Performance using Parametric Analysis, Dynamic Visualization and Neural Network Modeling, J. of Arch Eng., American Society of Civil Engineers, Vol. 29, issue 4, Sept.
Dutta, R., T.A. Reddy and G. Runger, 2016. A Visual Analytics Based methodology for Multi-Criteria Evaluation of Building Design Alternatives, ASHRAE Winter Conference paper, OR-16-C051, Orlando, FL, January.
EIA, 2003. Commercial Building Energy Consumption Survey (CBECS), www.eia.gov/consumption/commercial/data/2005/
Hammersley, J.M. and D.C. Handscomb, 1964. Monte Carlo Methods, Methuen and Co., London.
Heiselberg, P., H. Brohus, A. Hesselholt, H. Rasmussen, E. Seinre and S. Thomas, 2009. Application of sensitivity analysis in design of sustainable buildings, Renewable Energy, 34 (2009), pp. 2030–2036.
Helton, J.C. and F.J. Davis, 2003. Latin hypercube sampling and the propagation of uncertainty of complex systems, Reliability Engineering and System Safety, vol. 81, pp. 23–69.
Hofer, E., 1999. Sensitivity analysis in the context of uncertainty analysis for computationally intensive models, Computer Physics Communication, vol. 117, pp. 21–34.
Hou, D., J.W. Jones, B.D. Hunn and J.A. Banks, 1996. Development of HVAC system performance criteria using factorial design and DOE-2 simulation, Tenth Symposium on Improving Building Systems in Hot and Humid Climates, pp. 184–192, May 13–14, Fort Worth, TX.
Iman, R.L. and J.C. Helton, 1985. A Comparison of Uncertainty and Sensitivity Analysis Techniques for Computer Models, Sandia National Laboratories report NUREG/CR-3904, SAND 84–1461.
Lam, J.C. and S.C.M. Hui, 1996. Sensitivity analysis of energy performance of office buildings, Building and Environment, vol. 31, no. 1, pp. 27–39.
Mandel, J., 1964. The Statistical Analysis of Experimental Data, Dover Publications, New York.
McKay, M.D., R.J. Beckman and W.J. Conover, 2000. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code, Technometrics, 42(1, Special 40th Anniversary Issue), pp. 55–61.
Montgomery, D.C., 2017. Design and Analysis of Experiments, 9th Edition, John Wiley & Sons, New York.
Morris, M.D., 1991. Factorial sampling plans for preliminary computational experiments, Technometrics, 33(2), pp. 161–174.
Saltelli, A., K. Chan and E.M. Scott (eds.), 2000. Sensitivity Analysis, John Wiley and Sons, Chichester.
Sanchez, D.G., B. Lacarriere, M. Musy and B. Bourges, 2014. Application of sensitivity analysis in building energy simulations: Combining first-and-second-order elementary effects methods, Energy and Buildings, 68 (2014), pp. 741–750.
Shannon, R.E., 1975. Systems Simulation: The Art and Science, Prentice-Hall, Englewood Cliffs, New Jersey.
Snyder, S., T.A. Reddy and M. Addison, 2013. Automated Design of Buildings: Need, Conceptual Approach, and Illustrative Example, ASHRAE Conference paper, paper #DA-13-C010, January.
Sobol, I.M., 2001. Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates, Mathematics and Computers in Simulations, 55 (2001), pp. 271–280, Elsevier.
Spears, R., T. Grieb and N. Shiang, 1994. Parameter uncertainty and interaction in complex environmental models, Water Resources Research, vol. 30 (11), pp. 3159–3169.
Walpole, R.E., R.H. Myers, S.L. Myers and K. Ye, 2007. Probability and Statistics for Engineers and Scientists, 8th Ed., Prentice-Hall, Upper Saddle River, NJ.
Wang, M., J. Wright and A. Brownlee, 2014. A comparison of approaches to stepwise regression for the indication of variables sensitivities used with a multi-objective optimization problem, ASHRAE Annual Conference, paper SE-14-C060, Seattle, WA, June.
Westermann, P. and R. Evins, 2019. Surrogate modelling for sustainable building design: A review, Energy and Buildings, 198 (2019), pp. 170–186.
7
Optimization Methods
Abstract
This chapter provides an introductory overview and foundation of traditional optimization techniques along with pertinent engineering applications and illustrative examples. These techniques apply to situations where the impact of uncertainties is relatively minor and can be viewed as a subset of the broader domain of decision-making (treated in Chap. 12). This chapter starts by defining the various terms used in the optimization literature, such as the objective function and the different types of constraints, followed by a description of the various steps involved in an optimization problem such as sensitivity and post-optimality analysis. Simple graphical methods are used to illustrate the fact that one may have problems with a unique solution, no solution, or multiple solutions; and that one may encounter instances when not all the constraints are active, and some may even be redundant. Analytical methods involving calculus-based techniques (such as the Lagrange multiplier method) as well as numerical search methods, both for unconstrained and constrained problems as relevant to univariate and multivariate problems, are reviewed, and the usefulness of the slack variable approach is explained. Subsequently, different solutions to problems which can be grouped as linear, quadratic, non-linear, or mixed integer programming are described, while highlighting the differences between them. Several simple examples illustrate the theoretical approaches, while more in-depth practical examples involving network models and supervisory control of an integrated energy system are also presented. How such optimization analysis approaches can be used for system reliability studies involving breakage of one or more components or links in a power grid is also illustrated. Methods that allow global solutions as against local ones, such as simulated annealing and genetic algorithms, are briefly described. Finally, the important topic of dynamic optimization is covered, which applies to optimizing a trajectory over time, i.e., to situations when a series of decisions have to be made to define or operate a system over a set of discrete time-sequenced stages. There is a vast amount of published material on the subject of optimization, and this chapter is simply meant to provide a good foundation and adequate working understanding for the reader to tackle the more complex and ever-evolving extensions and variants of optimization problems.
7.1 Introduction
7.1.1 What Is Optimization?
One of the most important tools for both design and operation of engineering systems is optimization, which corresponds to the case of finding optimal solutions under low uncertainty. This branch of applied mathematics, also studied under "operations research" (OR),¹ is the use of specific methods where one tries to minimize or maximize a global characteristic (say, the cost or the benefit) whose variation is modeled by an "objective function." The setup of the optimization problem involves both the mathematical formulation of the objective function and, just as importantly, the explicit and complete framing of a set of constraints. Optimization problems arise in almost all branches of industry or society, for example, in product and engineering process design, production scheduling, logistics, traffic control, and even strategic planning.
Optimization in an engineering context involves certain basic aspects consisting of some or all of the following:
(i) the framing of a situation or problem (for which a solution or a course of action is sought) in terms of a mathematical model often called the objective function; this could be a simple expression, or framed as a decision tree model in case of multiple outcomes (deterministic or probabilistic) or sequential decision-making stages;
(ii) defining the range of constraints to the problem in terms of input parameters which may be dictated by physical considerations;
(iii) placing bounds on the solution space of the output variables in terms of some practical or physical constraints;
(iv) defining or introducing uncertainties in the input parameters and in the types of parameters appearing in the model;
(v) mathematical techniques which allow solving such models efficiently (short execution times) and accurately (unbiased solutions); and
(vi) sensitivity analysis to gauge the robustness of the optimal solution to various uncertainties.
Framing of the mathematical model involves two types of uncertainties: (i) epistemic uncertainty, or lack of complete knowledge of the process or system, which can be reduced as more data is acquired, and (ii) aleatory uncertainty, which has to do with the stochasticity of the process and which cannot be reduced by collecting more data. These notions were introduced in Sect. 1.3.2 and are discussed at some length in Chap. 12 while dealing with decision analysis. This chapter deals with traditional optimization techniques as applied to engineering applications which are characterized by low aleatory and low epistemic uncertainty.
Further, recall the concept of abstraction presented in Sect. 1.2.2 in the context of formulating models. It pertains to the process of deciding on the level of detail appropriate for the problem at hand. On one hand, oversimplification may result in the loss of important system behavior predictability; on the other hand, an overly detailed model may demand undue data and computational resources, as well as time spent in understanding the model assumptions and the results generated. The same concept of abstraction also applies to the science of optimization. One must set a level of abstraction commensurate with the complexity of the problem at hand and the accuracy of the solution sought.
Consider a problem framed as finding the optimum of a continuous function. There could, of course, be the added complexity of considering several discrete options; but each option has one or more continuous variables requiring proper control to achieve a global optimum. The simple example given in Pr. 1.9 from Chap. 1 will be used to illustrate this case.

¹ "Operations Research" is the scientific/mathematical/quantitative discipline adopted by industrial and business organizations to better manage their complex business operations/systems with hundreds of variables for optimal operation/scheduling and for planning for future expansion and growth.

# The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 T. A. Reddy, G. P. Henze, Applied Data Analysis and Modeling for Energy Engineers and Scientists, https://doi.org/10.1007/978-3-031-34869-3_7
7.1.2 Simple Example
Example 7.1.1 Function minimization
Two pumps with parallel networks (Fig. 7.1) deliver a volumetric flow rate F = 0.01 m³/s of water from a reservoir to the destination. The pressure drops in Pascals (Pa) of each network are given by $\Delta p_1 = (2.1)\times 10^{10}\, F_1^2$ and $\Delta p_2 = (3.6)\times 10^{10}\, F_2^2$, where F1 and F2 are the flow rates through each branch in m³/s. Assume that both the pumps and their motor assemblies have equal efficiencies η1 = η2 = 0.9. Let P1 and P2 be the electric power in Watts (W) consumed by the two pump-motor assemblies. The total power draw must be minimized.

Fig. 7.1 Pumping system whose operational power consumption is to be minimized (Example 7.1.1)

Since the power delivered to the fluid is equal to the volume flow rate times the pressure drop, and the electric power drawn is this quantity divided by the efficiency, the objective function to be minimized is the sum of the power consumed by both pumps:

$$ J = \frac{\Delta p_1 F_1}{\eta_1} + \frac{\Delta p_2 F_2}{\eta_2} \quad \text{or} \quad J = \frac{(2.1)\times 10^{10}\, F_1^3}{0.9} + \frac{(3.6)\times 10^{10}\, F_2^3}{0.9} \tag{7.1} $$

The sum of both flows is equal to 0.01 m³/s, and so F2 can be eliminated in Eq. 7.1. Thus, the sought-after solution is the value of F1 which minimizes the objective function J:

$$ J^{*} = \min \{J\} = \min \left\{ \frac{(2.1)\times 10^{10}\, F_1^3}{0.9} + \frac{(3.6)\times 10^{10}\, (0.01 - F_1)^3}{0.9} \right\} \tag{7.2} $$

subject to the constraint that F1 > 0.
From basic calculus, setting $dJ/dF_1 = 0$ provides the optimum solution, from which F1 = 0.00567 m³/s and F2 = 0.00433 m³/s, and the total power of both pumps is P = 7501 W. The extent to which non-optimal performance is likely to lead to excess power can be gauged (referred to as post-optimality analysis) by simply plotting the function J vs. F1 (Fig. 7.2). In this case, the optimum is rather broad; the system can be operated such that F1 is in the range of 0.005–0.006 m³/s without much power penalty. On the other hand, sensitivity analysis would involve a study of how the optimum value is affected by certain parameters. For example, Fig. 7.3 shows that varying the efficiency of pump 1 in the range of 0.85–0.95 has negligible impact on the optimal result. However, this may not be the case for some other variable. A systematic study of how various parameters impact the optimal value falls under "sensitivity analysis," and there exist formal methods of investigating this aspect of the problem. Finally, note that this is a very simple optimization problem with a simple imposed constraint which was not even considered during the optimization. ■
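The numbers quoted in Example 7.1.1 can be reproduced with a minimal numerical sketch. It is not the procedure used in the text, which relies on calculus; a golden-section search is swapped in as a simple derivative-free alternative and is then checked against the closed-form stationarity condition. The function and constant names are illustrative.

```python
import math

K1, K2 = 2.1e10, 3.6e10   # pressure-drop coefficients of the two branches
ETA1, ETA2 = 0.9, 0.9     # pump-motor efficiencies
F_TOTAL = 0.01            # total volumetric flow rate, m^3/s


def total_power(f1):
    """Objective J of Eq. 7.2: total electric power (W) as a function of F1."""
    f2 = F_TOTAL - f1
    return K1 * f1 ** 3 / ETA1 + K2 * f2 ** 3 / ETA2


def golden_section_min(func, a, b, tol=1e-9):
    """Minimize a unimodal function on [a, b] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while (b - a) > tol:
        if func(c) < func(d):
            b = d          # minimum lies in [a, d]
        else:
            a = c          # minimum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return 0.5 * (a + b)


f1_opt = golden_section_min(total_power, 0.0, F_TOTAL)

# Closed-form check from dJ/dF1 = 0:
# K1*F1^2/ETA1 = K2*(F_TOTAL - F1)^2/ETA2
ratio = math.sqrt((K2 / ETA2) / (K1 / ETA1))
f1_closed = F_TOTAL * ratio / (1.0 + ratio)

print(f"F1 = {f1_opt:.5f} m3/s, F2 = {F_TOTAL - f1_opt:.5f} m3/s")
print(f"Total power = {total_power(f1_opt):.0f} W")
```

Evaluating `total_power` at F1 = 0.005 and 0.006 m³/s gives a quick feel for the breadth of the optimum discussed in the post-optimality analysis.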
7.2 Terminology and Classification
7.2.1 Definition of Terms
A mathematical formulation of an optimization problem involving control of an engineering system consists of the following terms:
(i) Decision variables or process variables, say (x1, x2, . . ., xn), whose respective values are to be determined. These can be either discrete or continuous variables. Determining the "best" numerical values of these variables is the basic intent of optimization. An example is the air flow rate in an evaporative cooling tower;
(ii) Control variables, which are the physical quantities that can be varied by hardware according to the numerical values of the decision variables sought. An example is the control of the fan speed to achieve the desired air flow rate through the evaporative cooling tower;
(iii) Objective function, which is an analytical formulation of an appropriate measure of performance of the system (or characteristic of the design problem) in terms of decision variables. An example is the total electric power in Example 7.1.1;
(iv) Constraints or restrictions imposed on the values of the decision variables. These can be of two types: non-negativity constraints, for example, flow rates cannot be negative; and functional constraints (also called structural constraints), which can be equality, inequality, or range constraints that specify a range of variation over which the decision variables can be varied. These can be based on direct considerations (such as not exceeding capacity of energy equipment, limitations of temperature, and pressure control values) or on indirect ones (when mass and energy balances have to be satisfied);
(v) Model parameters, which are constants appearing in constraints and objective equations.

Fig. 7.2 One type of post-optimality analysis involves plotting the objective function for Total Power against flow rate through pump 1 (F1) to evaluate the shape of the curve near the optimum. In this case, there is a broad optimum indicating that the system can be operated near-optimally over this range without much corresponding power penalty
Fig. 7.3 Sensitivity analysis with respect to efficiency of pump 1 on the overall optimum
7.2.2 Categorization of Methods
Optimization methods can be categorized in a number of ways.
Fig. 7.4 Examples of unimodal, bimodal and multimodal peaks for a univariate unconstrained function showing the single, dual and multi-critical points, respectively
Fig. 7.5 Two types of critical points for the unconstrained bivariate problem: (a) saddle point, (b) global minimum of a unimodal function
(i) Univariate/multivariate problems with uni/multi-modal critical points. Univariate problems are those where only one variable is involved in the objective function, which can be unimodal, bimodal, or multimodal (Fig. 7.4). Critical points are those where the first derivative of the function is zero, and they can be either local or global maxima or minima depending on whether the second derivative is negative or positive, respectively. If the second derivative is also zero, the test is inconclusive and the point may be a point of inflection (the univariate counterpart of a saddle point). By extension, critical points of a function of two variables are those points at which both partial derivatives of the function are zero. The saddle point (Fig. 7.5) is one where the slopes (or function derivatives) in orthogonal directions are both zero, but where the function is at a maximum along one direction and at a minimum along the other. As the number of variables in the function increases, the solution of the problem becomes exponentially more difficult. Searching for optimum points of non-linear problems poses the great danger of getting stuck in a local minimum; software programs often avoid this situation by adopting a technique called "multistart," where several searches are undertaken from different starting points of the feasible search space.
(ii) Analytical vs. numerical methods. Analytical methods apply to cases when the optimization of the objective function can be expressed as a mathematical relationship using subsidiary equations which can be solved directly or by calculus methods. This contrasts with numerical search methods, which require that the function or its gradient at successive locations be determined by an algorithm which allows homing in on the solution. The categorization of these methods is based on the algorithm used. Even though more demanding computationally, search methods are especially useful and often adopted for discontinuous or complex problems.
(iii) Linear vs. non-linear methods. Linear optimization problems (or linear programming, LP) involve linear models, a linear objective function, and linear constraints. The theory is well developed, and solutions can be found quickly and robustly. There is an enormous amount of published literature on LP problems, and it has found numerous practical applications involving up to several thousands of independent variables. There are several well-known techniques to solve them (the Simplex method in Operations Research, used to solve a large set of linear equations, being the best known). However, many problems from engineering to economics require the use of non-linear models or constraints, in which case non-linear programming² (NLP) techniques must be used. In some cases, non-linear models (e.g., equipment models such as chillers, fans, pumps, and cooling towers) can be expressed as quadratic models, and algorithms more efficient than non-linear programming ones have been developed; this falls under quadratic programming (QP) methods. Calculus-based methods are best suited for finding the solution for simpler problems; for more involved problems resulting in complex non-linear simultaneous equations, a search process would be required.
(iv) Continuous vs. discontinuous. When the objective functions are discontinuous, calculus-based methods can break down. In such cases, one could use non-gradient-based methods or even heuristic-based computational methods such as simulated annealing, particle swarm optimization, or genetic algorithms. The latter are very powerful in that they can overcome problems associated with local minima and discontinuous functions, but they need long computing times, have no guarantee of finding the global optimum, and require a certain amount of knowledge on the part of the analyst. Another form of discontinuity arises when one or more of the variables are discrete as against continuous. Such cases fall under the classification known as integer or discrete programming.
(v) Static vs. dynamic. If the optimization is done with time not being a factor, then the procedure is called static. However, if optimization is to be done over a time period where decisions can be made at several sub-intervals of that period, then a dynamic optimization method is warranted. Two such examples are when one needs to optimize the route taken by a salesman visiting different cities as part of his road trip, or when the operation of a thermal ice storage supplying cooling to a building must be optimized during several hours of the day when high electric demand charges prevail. Whenever possible, analysts make simplifying assumptions to make the optimization problem static.
(vi) Deterministic vs. probabilistic. This depends on whether one neglects or considers the uncertainty associated with various parameters of the objective function and the constraints. The need to treat these uncertainties together, and in a probabilistic manner, rather than one at a time (as is done in a sensitivity analysis) has led to the development of several numerical techniques, the Monte Carlo technique being the most widely used (Sect. 12.2.8).
(vii) Traditional vs. stochastic adaptive search. Most of the methods referred to above can be designated as traditional methods, in contrast to stochastic methods developed in the last three to four decades (also referred to as global optimization methods or metaheuristic methods). The latter are especially meant for very complex multivariate non-linear problems. Three of the best-known methods are simulated annealing, particle swarm optimization, and genetic algorithms. Each of these uses metaheuristic search techniques, that is, a selective combination of intermediate results mixed with randomness. These algorithms are patterned after the evolutionary and flocking behavior which nature adopts for physical and biological processes to gradually and adaptively improve the search towards a near-optimal solution.

² Programming does not refer to computer programming, but arises from the use of the term "program" by the United States military to refer to proposed training and logistics schedules, which were the problems studied by Dantzig (the primary developer of the Simplex method).

Fig. 7.6 An example of a constrained optimization problem with no feasible solution
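The "multistart" idea mentioned under item (i) can be illustrated with a minimal sketch. The test function, step size, and starting points below are arbitrary choices made for illustration and do not come from the text; a plain fixed-step gradient descent stands in for whatever local search a given software package would actually use.

```python
def f(x):
    """Illustrative bimodal function with one local and one global minimum."""
    return x ** 4 - 3.0 * x ** 2 + x


def df(x):
    """First derivative of f."""
    return 4.0 * x ** 3 - 6.0 * x + 1.0


def gradient_descent(x0, step=0.01, iters=5000):
    """Fixed-step gradient descent; converges to the nearest local minimum."""
    x = x0
    for _ in range(iters):
        x -= step * df(x)
    return x


def multistart(starts):
    """Run a local search from each start and keep the best end point."""
    candidates = [gradient_descent(x0) for x0 in starts]
    best = min(candidates, key=f)
    return best, candidates


starts = [-2.0, -1.0, 0.0, 1.0, 2.0]
best_x, candidates = multistart(starts)

for x0, xc in zip(starts, candidates):
    print(f"start {x0:+.1f} -> x = {xc:+.4f}, f = {f(xc):+.4f}")
print(f"best of all starts: x = {best_x:+.4f}")
```

Searches started to the right of the local maximum end in the shallower minimum near x = 1.13, while the others reach the deeper one near x = -1.30; only the comparison across starts identifies the latter. Multistart improves the odds of locating the global optimum but does not guarantee it.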
7.2.3
Types of Objective Functions and Constraints
Single criterion optimization is one where a single objective function can be formulated. For example, an industrialist is considering starting a factory to assemble photovoltaic (PV) cells into PV modules. Whether to invest or not, and if yes, at what capacity level are issues which can both be framed as a single criterion optimization problem. However, if maximizing the number of jobs created is another (altruistic) objective, then the problem must be treated as a multi-criteria decision problem. Such cases are discussed in Sects. 12.5–12.7. Establishing the objective function is often simple. The real challenge is usually in specifying the complete set of constraints. A feasible solution is one that satisfies all the stated constraints, while an infeasible solution is one where at least one constraint is violated. The optimal solution is a feasible solution that has the most favorable value (either maximum or minimum) of the objective function, and it is this solution that is being sought after. The optimal solutions can be a single point or even several points. Also, some problems may have no optimal solutions at all. Figure 7.6 shows a function to be maximized subject to several constraints (six in this case). Note that there is no feasible solution and one of the constraints must be relaxed or the
7
Optimization Methods
Fig. 7.7 An example of a constrained optimization problem with more than one optimal solution
problem reframed. In some optimization problems, one can obtain several equivalent optimal solutions. This is illustrated in Fig. 7.7, where several combinations of the two variables, which define the line segment shown, are possible optima. Sometimes, an optimal solution may not necessarily be the one selected for implementation. A “satisficing” solution (a combination of the words “satisfactory” and “optimizing”) may be the solution that is selected for actual implementation; it reflects the difference between theory (which yields an optimal solution) and the situations faced in reality (due to actual implementation issues, heuristic constraints that cannot be expressed mathematically, the need to treat unpredictable occurrences, risk attitudes of the owner/operator, etc.). Some practitioners also refer to such solutions as “near-optimal”, though this has a somewhat negative connotation. A final issue with optimization problems is that the constraints defined differ in their influence on the optimal solution; sometimes, they have no role at all, and such cases are referred to as non-binding constraints. The detection of such superfluous constraints is not simple; relaxing them would often simplify the solution search.
7.2.4
Sensitivity Analysis and Post-Optimality Analysis
Model parameters are often not known with certainty and could be based on models identified from partial or incomplete observations, or they could even be guess-estimates. The optimum is only correct insofar as the model is accurate, and the model parameters and constraints are reflective of the actual situation. Hence, the optimal solution determined needs to be reevaluated in terms of how the various types of uncertainties affect it. This is done by sensitivity analysis, which determines a range of values:
(i) Of the model parameters over which the optimal solutions will remain unchanged or vary within an allowable range to stay near-optimal. This would flag critical parameters which may require closer investigation, refinement and monitoring.
(ii) Over which the optimal solution will remain feasible with adjusted values for the decision variables (allowable range to stay feasible, i.e., the constraints are satisfied). This would help identify influential constraints.
Further, the above evaluations can be performed by adopting (see Sect. 6.7.4):
(i) Individual parameter sensitivity, where one parameter at a time in the original model is varied (or perturbed) to check its effect on the optimal solution;
(ii) Total sensitivity (also called parametric programming), which involves the study of how the optimal solution changes as many parameters change simultaneously over some range. Thus, it provides insight into “correlated” parameters and trade-offs in parameter values. Such evaluations are conveniently done using Monte Carlo methods (Sect. 12.2.8).
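The individual parameter sensitivity described above can be illustrated with a minimal sketch. It assumes a small quadratic problem (minimize c1·x1² + c2·x2² subject to a1·x1 + a2·x2 = b) whose optimum is available in closed form; the function name `constrained_optimum`, the coefficient values and the 10% perturbation size are all illustrative choices, not taken from the text.

```python
def constrained_optimum(c1, c2, a1, a2, b):
    """Closed-form minimum of c1*x1**2 + c2*x2**2 subject to a1*x1 + a2*x2 = b.

    Follows from the stationarity conditions 2*c_i*x_i = lambda*a_i
    combined with the equality constraint.
    """
    s = a1**2 / c1 + a2**2 / c2
    x1 = b * a1 / (c1 * s)
    x2 = b * a2 / (c2 * s)
    return x1, x2, b**2 / s

# Illustrative (assumed) nominal parameter values
base = {"c1": 4.0, "c2": 5.0, "a1": 2.0, "a2": 3.0, "b": 6.0}
_, _, f_base = constrained_optimum(**base)

# Perturb one parameter at a time by +10% and record the relative
# change in the optimal objective value
sensitivity = {}
for name in base:
    perturbed = dict(base)
    perturbed[name] = base[name] * 1.10
    _, _, f_new = constrained_optimum(**perturbed)
    sensitivity[name] = (f_new - f_base) / f_base

for name, change in sorted(sensitivity.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: {100.0 * change:+.1f}% change in optimal objective")
```

Ranking the parameters by the size of the resulting change gives a rough indication of which ones deserve closer investigation; a total sensitivity study would instead vary several of them together.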
7.3
Analytical Methods
Calculus-based solution methods can be applied to both linear and non-linear problems and are the ones to which undergraduate students are most likely to be exposed. They can be used for problems where the objective function and the constraints are differentiable. These methods are also referred to as classical or traditional optimization methods as distinct from stochastic methods such as simulated annealing or evolutionary algorithms. A brief review of calculus-based analytical and search methods is presented below.
7.3.1
Unconstrained Problems
The basic calculus of the univariate unconstrained optimization problem can be extended to the multivariate case of dimension n by introducing the gradient vector ∇ and by recalling that the gradient of a scalar y is defined as:
$$\nabla y = \frac{\partial y}{\partial x_1}\,\mathbf{i}_1 + \frac{\partial y}{\partial x_2}\,\mathbf{i}_2 + \dots + \frac{\partial y}{\partial x_n}\,\mathbf{i}_n \tag{7.3}$$

where i1, i2, . . ., in are unit vectors and y is the objective function and is a function of n variables: y = y(x1, x2, . . ., xn). With this terminology, the condition for optimality of a continuous function y is simply:

$$\nabla y = 0 \tag{7.4}$$

However, the optimality may be associated with stationary points which could be minimum, maximum, saddle, or ridge points. Since objective functions are conventionally expressed as a minimization problem, one seeks the minimum of the objective function. Recall that for the univariate case, assuring that the optimal value found is a minimum (and not a maximum or a saddle point) involves computing the numerical value of the second derivative at this optimal point, and checking that its value is positive. Graphically, a minimum for a continuous function is found (or exists) when the function is convex, while a saddle point is found for a combination function (see Fig. 7.8). In the multivariate optimization case, one checks whether the Hessian matrix (i.e., the second derivative matrix which is symmetrical) is positive definite or not. It is tedious to check this condition by hand for any matrix whose dimensionality is greater than 2, and so computer programs are invariably used for such problems. A simple hand calculation method (which works well for low dimension problems) for ascertaining whether the optimal point is a minimum (or maximum) is to simply perturb the optimal solution vector obtained by a small amount, compute the objective function, and determine whether this value is higher (or lower) than the optimal value found.

Fig. 7.8 Illustrations of convex, concave and combination functions. A convex function is one where every point on the line joining any two points on the graph does not lie below the graph at any point. A combination function is one that exhibits both convex and concave behavior during different portions with the switch-over being the saddle point

Example 7.3.1 Determine the minimum value of the following function:

$$y = \frac{1}{4x_1} + 8x_1^2 x_2 + \frac{1}{x_2^2} \tag{7.5}$$

First, the two first order derivatives are found:

$$\frac{\partial y}{\partial x_1} = -\frac{1}{4x_1^2} + 16x_1 x_2 \qquad \text{and} \qquad \frac{\partial y}{\partial x_2} = 8x_1^2 - \frac{2}{x_2^3}$$

Setting the above two expressions to zero and solving results in x1 = 0.2051 and x2 = 1.8114, at which condition the optimal value of the objective function is y* = 2.133. It is left to the reader to verify these results, and check whether this is indeed the minimum. ■
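The verification left to the reader in Example 7.3.1 can be sketched in a few lines. The closed-form stationary point used below (x2 = 2^(6/7), x1 = 1/(4·x2^(1/3))) is obtained here by eliminating x1 between the two derivative equations; it is an algebraic step added for this sketch, not one given in the text. The Hessian entries are likewise derived by differentiating the two first-order derivatives once more, and the function names are illustrative.

```python
def y(x1, x2):
    """Objective function of Example 7.3.1."""
    return 1.0 / (4.0 * x1) + 8.0 * x1**2 * x2 + 1.0 / x2**2

def gradient(x1, x2):
    """First-order partial derivatives of y."""
    return (-1.0 / (4.0 * x1**2) + 16.0 * x1 * x2,
            8.0 * x1**2 - 2.0 / x2**3)

def hessian(x1, x2):
    """Second-order partial derivatives (h11, h12, h22) of y."""
    return (1.0 / (2.0 * x1**3) + 16.0 * x2, 16.0 * x1, 6.0 / x2**4)

# Stationary point from eliminating x1 between the two gradient equations
x2_opt = 2.0 ** (6.0 / 7.0)
x1_opt = 1.0 / (4.0 * x2_opt ** (1.0 / 3.0))
y_opt = y(x1_opt, x2_opt)

# Positive definiteness of a symmetric 2x2 matrix: both leading
# principal minors must be positive
h11, h12, h22 = hessian(x1_opt, x2_opt)
positive_definite = h11 > 0.0 and (h11 * h22 - h12**2) > 0.0

# Hand-calculation style check: perturb the solution by a small amount
# and confirm that the objective does not decrease
d = 0.01
perturbed_higher = all(
    y(x1_opt + s1 * d, x2_opt + s2 * d) > y_opt
    for s1 in (-1, 0, 1) for s2 in (-1, 0, 1) if (s1, s2) != (0, 0)
)

print(f"x1 = {x1_opt:.4f}, x2 = {x2_opt:.4f}, y* = {y_opt:.3f}")
print("gradient:", gradient(x1_opt, x2_opt))
print("Hessian positive definite:", positive_definite)
print("all perturbed points higher:", perturbed_higher)
```

Both checks (the Hessian test and the perturbation test) are the ones described in the paragraph preceding the example; they agree here because the function is smooth near the stationary point.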
7.3.2
Direct Substitution Method for Equality Constrained Problems
The simplest approach is the direct substitution method where for a problem involving “n” variables and “m” equality constraints, one tries to eliminate the m constraints by direct substitution and solve the objective function using the unconstrained solution method described above. This approach was used earlier in Example 7.1.1.

Example 7.3.2³ Direct substitution method
Consider the simple optimization problem stated as:

$$\text{Minimize } f(\mathbf{x}) = 4x_1^2 + 5x_2^2 \tag{7.6a}$$

$$\text{subject to: } 2x_1 + 3x_2 = 6 \tag{7.6b}$$

Either x1 or x2 can be eliminated without difficulty. Say, the constraint equation is used to solve for x1, and then substituted into the objective function. This yields the unconstrained objective function $f(x_2) = 14x_2^2 - 36x_2 + 36$. The optimal value of x2* = 1.286 from which by substitution, x1* = 1.071. The resulting value of the objective function is f(x)* = 12.857.

³ From Edgar et al. (2001) by permission of McGraw-Hill.
Fig. 7.9 Graphical representation of how direct substitution can reduce a function with two variables x1 and x2 into one with one variable. The unconstrained optimum is at (0, 0) at the center of the contours (From Edgar et al. 2001 by permission of McGraw-Hill)
This simple problem allows a geometric visualization to better illustrate the approach. As shown in Fig. 7.9, the objective function is a paraboloid shown on the z axis with x1 and x2 being the other two axes. The constraint is represented by a plane surface which intersects the paraboloid as shown. The resulting intersection is a parabola whose optimum is the solution of the objective function being sought after. Notice how this constrained optimum is different from the unconstrained optimum which occurs at (0, 0) (Fig. 7.9). The above approach requires that one variable be first explicitly expressed as a function of the remaining variables, and then eliminated from all equations; this procedure is continued till there are no more constraints. Unfortunately, this is not an approach that is likely to be of general applicability in most problems.
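The substitution step of Example 7.3.2 can be checked numerically with a minimal sketch. It recovers the coefficients of the reduced one-variable parabola from three function evaluations (a device chosen here for illustration, valid because the reduced function is exactly quadratic) and then locates its vertex; the helper names are assumptions.

```python
def f(x1, x2):
    """Objective function of Eq. 7.6a."""
    return 4.0 * x1**2 + 5.0 * x2**2

def x1_from_constraint(x2):
    """Constraint 2*x1 + 3*x2 = 6 (Eq. 7.6b) solved for x1."""
    return (6.0 - 3.0 * x2) / 2.0

def f_reduced(x2):
    """Objective after eliminating x1 by direct substitution."""
    return f(x1_from_constraint(x2), x2)

# f_reduced is a parabola a*x2**2 + b*x2 + c; recover a, b, c from
# evaluations at x2 = 0, 1, 2
f0, f1, f2 = f_reduced(0.0), f_reduced(1.0), f_reduced(2.0)
a = (f2 - 2.0 * f1 + f0) / 2.0
b = f1 - f0 - a
c = f0

# Unconstrained optimum of the reduced function: d f_reduced / d x2 = 0
x2_opt = -b / (2.0 * a)
x1_opt = x1_from_constraint(x2_opt)
f_opt = f(x1_opt, x2_opt)

print(f"f(x2) = {a:g} x2^2 + ({b:g}) x2 + {c:g}")
print(f"x1* = {x1_opt:.3f}, x2* = {x2_opt:.3f}, f* = {f_opt:.3f}")
```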
7.3.3
Lagrange Multiplier Method for Equality Constrained Problems
A more versatile and widely used approach that allows the constrained problem to be reformulated into an unconstrained one is the Lagrange multiplier approach. Consider an optimization problem involving an objective function y, a set of n decision variables x and a set of m equality constraints h(x):

$$\text{Minimize } y = y(\mathbf{x}) \qquad \text{(objective function)} \tag{7.7a}$$

$$\text{Subject to } \mathbf{h}(\mathbf{x}) = 0 \qquad \text{(equality constraints)} \tag{7.7b}$$

The Lagrange multiplier method simply absorbs the equality constraints into the objective function, and states that the optimum occurs when the following modified objective function is minimized:

$$J = \min\{\, y(\mathbf{x}) - \lambda_1 h_1(\mathbf{x}) - \lambda_2 h_2(\mathbf{x}) - \dots - \lambda_m h_m(\mathbf{x}) \,\} \tag{7.8}$$

where the quantities λ1, λ2, . . ., λm are called the Lagrange multipliers and are applied to each of the m equality constraints. The optimization problem, thus, involves minimizing y with respect to both x and the Lagrange multipliers. Eliminating the constraints comes at the price of increasing the dimensionality of the problem from n to (n + m), or stated differently, one is now seeking the optimal values of (n + m) variables as against those of n variables which optimize the function y.

A simple example with one equality constraint serves to illustrate this approach. The objective function y = 2x1 + 3x2 is to be optimized subject to the constraint $x_1 x_2^2 = 48$. Figure 7.10 depicts this problem visually with the two variables being the two axes and the objective function being represented by a series of parallel lines for different assumed values of y. Since the constraint is a curved line, the optimal solution is obviously the point where the tangent vector of the curve (shown as a dotted line) is parallel to these lines (shown as point A).

Fig. 7.10 Optimization of the linear function y = 2x1 + 3x2 subject to the constraint shown. The problem is easily solved using the Lagrange multiplier method to yield optimal values of (x1* = 3, x2* = 4). Graphically, the optimum point A occurs where the constraint function and the lines of constant y (which, in this case, are linear) have a common normal line indicated by the arrow at A

Example 7.3.3⁴ Optimizing a solar water heater system using the Lagrange method
A solar water heater consisting of a solar collector and fully mixed storage tank is to be optimized for lowest first cost consistent with the following specified system performance. During the day, the storage temperature is to be raised gradually from an initial 30 °C (equal to the ambient temperature Ta) to a final desired temperature Tmax, while during the night heat is to be withdrawn from storage such that the storage temperature drops back to 30 °C for the next day’s operation. The system should be able to store 20 MJ of thermal heat over a typical day of the year, during which HT, the incident radiation over the collector operating time (assumed to be 10 h), is 12 MJ/m². The collector performance characteristics⁵ are FRη0 = 0.8 and FRUL = 4.0 W/m²·°C. The costs of the solar subsystem components are fixed cost Cb = $600, collector area proportional cost Ca = $200/m² of collector area, and storage volume proportional cost Cs = $200/m³ of storage volume. Assume that the average inlet temperature to the collector TCi during a charging cycle over a day is equal to the arithmetic mean of Tmax and Ta. Let AC (m²) and VS (m³) be the collector area and storage volume respectively. The objective function is:

$$J = 600 + 200\,A_C + 200\,V_S \tag{7.9}$$

In essence, the optimization involves determining the most cost-effective sizes of the collector area and of the storage tank that can deliver the required amount of thermal energy at the end of the day. A larger collector area would require a smaller storage volume (and vice versa); but then the water temperature in the storage tank would be higher, thereby penalizing the thermal efficiency of the solar collector array. The constraint of the daily amount of solar energy collected is found from the collector performance model expressed as:

$$Q_C = A_C \left[ H_T F_R \eta_0 - F_R U_L \,(T_{Ci} - T_a)\,\Delta t \right] \tag{7.10a}$$

where Δt is the number of seconds during which the collector operates. Plugging numerical values results in:

$$(20)\,10^6 = A_C \left[ (12)\,10^6\,(0.8) - (4)(3600)(10) \left( \frac{T_{max} + 30}{2} - 30 \right) \right]$$

or

$$20 \times 10^6 = (11.76 - 0.072\,T_{max}) \times 10^6\, A_C \tag{7.10b}$$

A heat balance on the storage over the day yields:

$$Q_C = M c_p\,(T_{max} - T_{initial}) \quad \text{or} \quad 20 \times 10^6 = V_S\,(1000)(4190)(T_{max} - 30) \tag{7.11}$$

from which

$$T_{max} = \frac{20}{(4.19)\,V_S} + 30$$

Substituting this back into the constraint Eq. 7.10b results in:

$$A_C \left( 9.6 - \frac{0.344}{V_S} \right) = 20$$

This allows the combined Lagrangian objective function to be deduced as:

$$J = 600 + 200\,A_C + 200\,V_S - \lambda \left[ A_C \left( 9.6 - \frac{0.344}{V_S} \right) - 20 \right] \tag{7.12}$$

The resulting set of Lagrangian equations is:

$$\frac{\partial J}{\partial A_C} = 0 = 200 - \lambda \left( 9.6 - \frac{0.344}{V_S} \right)$$

$$\frac{\partial J}{\partial V_S} = 0 = 200 - \lambda A_C\,\frac{0.344}{V_S^2}$$

$$\frac{\partial J}{\partial \lambda} = 0 = A_C \left( 9.6 - \frac{0.344}{V_S} \right) - 20$$

Solving this set of non-linear equations is not straightforward, and numerical search methods (discussed in Sect. 7.4) need to be adopted. In this case, the sought-after optimal values are AC* = 2.36 m² and VS* = 0.308 m³. The value of the Lagrangian multiplier is λ = 23.58, and the corresponding initial cost J* = $1134. The Lagrangian multiplier can be interpreted as the sensitivity coefficient, which in this example corresponds to the marginal cost of solar thermal energy. In other words, increasing the thermal requirements by 1 MJ would lead to an increase of λ = $23.58 in the initial cost of the optimal solar system. ■

⁴ From Reddy (1987).
⁵ See Pr. 5.7 for a description of the solar collector model.
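One way to solve the three Lagrangian equations of the solar water heater example numerically is sketched below. It assumes the rounded coefficients 9.6 and 0.344 of the constraint, expresses λ and AC in terms of VS using the first and third equations, and then finds the root of the remaining equation by bisection; the bisection routine, the bracketing interval (0.1 to 1.0 m³) and all names are choices made for this sketch rather than the solution procedure used in the text.

```python
def g(v_s):
    """Bracketed term of the constraint: 9.6 - 0.344 / VS."""
    return 9.6 - 0.344 / v_s

def residual(v_s):
    """Second Lagrangian equation after eliminating lambda and AC.

    lambda = 200 / g(VS) from dJ/dAC = 0, and AC = 20 / g(VS) from the
    constraint dJ/dlambda = 0.
    """
    lam = 200.0 / g(v_s)
    a_c = 20.0 / g(v_s)
    return 200.0 - lam * a_c * 0.344 / v_s**2

def bisect(func, lo, hi, tol=1e-10, max_iter=200):
    """Plain bisection on an interval where func changes sign."""
    f_lo = func(lo)
    if f_lo * func(hi) > 0.0:
        raise ValueError("root is not bracketed")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if f_lo * f_mid <= 0.0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

v_s_opt = bisect(residual, 0.1, 1.0)       # storage volume, m^3
a_c_opt = 20.0 / g(v_s_opt)                # collector area, m^2
lam_opt = 200.0 / g(v_s_opt)               # Lagrange multiplier, $/MJ
cost_opt = 600.0 + 200.0 * a_c_opt + 200.0 * v_s_opt

print(f"AC* = {a_c_opt:.3f} m2, VS* = {v_s_opt:.3f} m3")
print(f"lambda = {lam_opt:.2f}, J* = ${cost_opt:.0f}")
```

The values obtained this way should land close to those quoted in the example; small differences in the last digit can be expected from the rounding of the constraint coefficients.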
7.3.4
Problems with Inequality Constraints
Most practical problems have constraints in terms of the independent variables, and often these assume the form of
inequality constraints. There are several semi-analytical techniques that allow the constrained optimization problem to be reformulated into an unconstrained one, and the way this is done is what differentiates these methods. In such a case, one can avoid the use of generalized optimization solver approaches if such software programs are unavailable. Notice that no inequality constraints appear in Examples 7.3.2 or 7.3.3. When optimization problems involve inequality constraints, they can be re-expressed as equality constraints by introducing additional variables, called slack (or artificial) variables. Each inequality constraint requires a new slack variable. The order of the optimization problem will increase, but the efficiency in the subsequent numerical solution approach outweighs this drawback. The following simple example serves to illustrate this approach.

Example 7.3.4 Consider the following problem with x1 and x2 being continuous variables:

$$\text{Objective function: maximize } J = (x_1 + 2x_2) \tag{7.13a}$$

$$\text{Subject to constraints: } \quad 2x_2 \le 7, \quad x_1 + x_2 \le 7, \quad 2x_1 \le 11, \quad x_1, x_2 \ge 0 \tag{7.13b}$$

The calculus-based solution involves introducing three additional variables for the three constraints (to simplify the analysis, the last two constraints that the two variables x1 and x2 be positive can be discarded and the resulting solutions verified that these constraints are met):

$$\text{Max } J = (x_1 + 2x_2)$$

$$2x_2 + x_3 = 7, \qquad x_1 + x_2 + x_4 = 7, \qquad 2x_1 + x_5 = 11$$

and x3, x4, x5 ≥ 0. Because of the ≥ sign, the slack variables x3, x4, and x5 can only assume either zero or positive values. Using standard matrix inversion results in a maximum value of the objective function J* = 10.5 for x1* = 3.5, x2* = 3.5. The graphical solution is shown in Fig. 7.11. The dashed lines indicate the constraints, and the region within which the maximum should lie after meeting the constraints is shown partially hatched. The objective function line drawn as a solid line assumes a maximum value as indicated by the circled point.

Fig. 7.11 Graphical solution of Example 7.3.4 (axes x1 and x2; constraint lines x1 ≤ 5.5, x2 ≤ 3.5 and x1 + x2 ≤ 7; objective function line x1 + 2x2 = 10.5)

The above example was a linear optimization problem since the objective function and all the constraints were linear. For such problems, the slack variables are first order unknown quantities. For non-linear problems involving a non-linear objective function or one or more non-linear constraints, the standard way of introducing slack variables is as quadratic terms. In the example above, if the second constraint were $x_1^2 + x_2 \le 7$, it would be expressed as $x_1^2 + x_2 + x_4^2 = 7$. The other slack variables x3 and x5 should then also be represented by their squares.
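The result of Example 7.3.4 can be cross-checked with a minimal sketch that relies on a standard property of linear problems: the optimum lies at a vertex of the feasible region. The code below enumerates the intersections of all pairs of constraint lines, keeps the feasible ones, and picks the one with the largest objective value. This vertex enumeration is a substitute chosen here for illustration and is practical only for very small problems; the data layout and names are assumptions.

```python
from itertools import combinations

# Each constraint is written as a*x1 + b*x2 <= c
constraints = [
    (0.0, 2.0, 7.0),    # 2*x2 <= 7
    (1.0, 1.0, 7.0),    # x1 + x2 <= 7
    (2.0, 0.0, 11.0),   # 2*x1 <= 11
    (-1.0, 0.0, 0.0),   # x1 >= 0
    (0.0, -1.0, 0.0),   # x2 >= 0
]

def objective(x1, x2):
    return x1 + 2.0 * x2

def feasible(x1, x2, eps=1e-9):
    return all(a * x1 + b * x2 <= c + eps for a, b, c in constraints)

best = None
for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        continue  # parallel lines do not intersect
    # Cramer's rule for the 2x2 system formed by the two active lines
    x1 = (c1 * b2 - c2 * b1) / det
    x2 = (a1 * c2 - a2 * c1) / det
    if feasible(x1, x2) and (best is None or objective(x1, x2) > best[0]):
        best = (objective(x1, x2), x1, x2)

J_opt, x1_opt, x2_opt = best
# Slack variables x3, x4, x5 of the equality-constrained form
slacks = (7.0 - 2.0 * x2_opt, 7.0 - x1_opt - x2_opt, 11.0 - 2.0 * x1_opt)

print(f"x1* = {x1_opt}, x2* = {x2_opt}, J* = {J_opt}")
print("slack variables (x3, x4, x5):", slacks)
```

A zero slack identifies a binding constraint, while a positive slack (here the third constraint) identifies one that plays no role at the optimum.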
7.3.5
Penalty Function Method
Another widely used method for constrained optimization is the use of the penalty factor method, where the problem is converted to an unconstrained one. It is especially useful when the constraint is not very rigid or not very important. Consider the problem stated in Example 7.3.2, with the
possibility that the constraints can be inequality constraints as well. Then, a new unconstrained function is framed as:

$$J = \min\left\{ y(\mathbf{x}) + \sum_{i=1}^{k} P_i\,(h_i)^2 \right\} \tag{7.14}$$

where Pi is called the penalty factor for condition i with k being the number of constraints. The choice of this penalty factor provides the relative weighting of the constraint compared to the function. For high Pi values, the search will satisfy the constraints but move more slowly in optimizing the function. If Pi is too small, the search may terminate without satisfying the constraints adequately. The penalty factor can in general assume any function⁶, but the nature of the problem may often influence the selection. For example, when a forward model is being calibrated with experimental data, one has some prior knowledge of the numerical values of the model parameters. Instead of simply performing a calibration based on minimizing the least square errors, one could frame the problem as an unconstrained penalty factor problem where the function to be minimized consists of a term representing the root sum of square errors, and of the penalty factor term which may be the square deviations of the model parameters from their respective estimates. The following example illustrates this approach while it is further described in Sect. 9.5.2 when dealing with non-linear parameter estimation.

Example 7.3.5 Minimize the following problem using the penalty function approach:

$$y = 5x_1^2 + 4x_2^2 \qquad \text{s.t. } 3x_1 + 2x_2 = 6 \tag{7.15}$$

Let us assume a simple form of the penalty factor and frame the problem as:

$$J^* = \min(J) = \min\left[ y + P\,(h)^2 \right] = \min\left[ 5x_1^2 + 4x_2^2 + P\,(3x_1 + 2x_2 - 6)^2 \right] \tag{7.16}$$

Then:

$$\frac{\partial J}{\partial x_1} = 10x_1 + 6P\,(3x_1 + 2x_2 - 6) = 0 \tag{7.17a}$$

and

$$\frac{\partial J}{\partial x_2} = 8x_2 + 4P\,(3x_1 + 2x_2 - 6) = 0 \tag{7.17b}$$

Solving these equations results in $x_1 = \frac{6x_2}{5}$ which, when substituted back into Eq. 7.17a, yields:

$$x_2 = \frac{36}{\dfrac{12}{P} + \dfrac{108}{5} + 12}$$

The optimal values of the variables are found as the limiting values when P becomes very large. In this case, x2* = 1.071 and, subsequently, from Eq. 7.17b, x1* = 1.286; these are the optimal solutions sought. ■

⁶ The penalty function should always remain positive; for example, one could specify an absolute value for the function rather than the square.
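The limiting behavior in Example 7.3.5 can be made visible with a minimal sketch that solves the two stationarity conditions (Eqs. 7.17a and 7.17b), which are linear in x1 and x2, for a sequence of increasing penalty factors. The function name, the use of Cramer's rule and the particular P values are illustrative choices.

```python
def penalty_solution(P):
    """Solve Eqs. 7.17a and 7.17b for x1 and x2 at a given penalty factor P.

    Rearranged into linear form:
        (10 + 18P) x1 +      12P x2 = 36P
             12P x1 + (8 + 8P) x2 = 24P
    """
    a11, a12, b1 = 10.0 + 18.0 * P, 12.0 * P, 36.0 * P
    a21, a22, b2 = 12.0 * P, 8.0 + 8.0 * P, 24.0 * P
    det = a11 * a22 - a12 * a21
    x1 = (b1 * a22 - a12 * b2) / det
    x2 = (a11 * b2 - a21 * b1) / det
    return x1, x2

for P in (0.1, 1.0, 10.0, 100.0, 1.0e4):
    x1, x2 = penalty_solution(P)
    violation = 3.0 * x1 + 2.0 * x2 - 6.0   # constraint residual h
    print(f"P = {P:8g}: x1 = {x1:.4f}, x2 = {x2:.4f}, h = {violation:+.4f}")
```

For small P the constraint is noticeably violated, while for large P the residual h shrinks towards zero and the solution approaches the limiting values quoted in the example.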
7.4
Numerical Unconstrained Search Methods
Most practical optimization problems will have to be solved using numerical search methods. The search toward an optimum is done either exhaustively (or blindly) or systematically and progressively using an iterative approach to gradually zoom onto the optimum. Because the search is performed at discrete points, the precise optimum will not be known. The best that can be achieved is to specify an interval of uncertainty In which is the range of x values in which the optimum is known to exist after n trials or function calls. The explanation of the various solution methods of optimization problems is mainly meant for conceptual understanding.
7.4.1
Univariate Methods
The search methods differ depending on whether the problem is univariate/multivariate or unconstrained/constrained or has continuous/discontinuous functions. Some basic methods for univariate problems without constraints are described below.
(a) Exhaustive or direct search: This is the least imaginative method, but it is very straightforward and simple to use, and is appropriate for simple situations. As illustrated in Fig. 7.12 for a maximum-seeking situation, the initial range or interval I0 over which the solution is being sought is divided into a number of discrete equal intervals (which dictates the “interval of uncertainty”), and the function values are calculated for all the seven points simultaneously. Then, the maximum point is easily identified. The interval of uncertainty after n function calculations is In = I0/[(n + 1)/2].
(b) Basic sequential search: This involves progressively eliminating ranges or regions of the search space based on pairs of observations done sequentially. As illustrated in Fig. 7.13 for the unimodal single variate problem, one starts by dividing the interval I0 (a,b) into, say, three intervals, and calculating the function values y1 and y2 as shown. If y2 < y1, one can then say that the maximum would lie in the range (a, x2), and if y2 > y1 it would lie in the range (x1, b). For the case when the function values y1 = y2, one would assume that the optimal value would lie close to the center of this interval. Note that the search
Fig. 7.12 Conceptual illustration of the direct search method
Fig. 7.13 Conceptual illustration of the basic sequential search process
[Figure panels (a), (b), (c): y(x) plotted against x over the interval (a, b), with trial points x1 and x2 and function values y1 and y2]
Fig. 7.14 Comparison of reduction ratio of different univariate search methods. Note the logarithmic scale of the ordinate scale. (Adapted from Stoecker 1989)
process reuses one of the two function evaluations for the next step. The search is continued until the optimum point is determined at the preset range of uncertainty. This search algorithm is not very efficient computationally and better numerical methods are available.
(c) More efficient sequential search methods: These methods differ from the basic sequential search method in that irregularly spaced intervals are used over the interval. The three most commonly used are the dichotomous search, the Golden Section search and the Fibonacci method (see, e.g., Beveridge and Schechter 1970 or Venkataraman 2002). Numerical efficiency (or power) of a method of solution involves both robustness of the solution and fast execution times. A metric used to compare the execution time of different search methods is the reduction ratio RR = (I0/In) where I0 is the original interval of uncertainty, while In is the range of uncertainty after n trials. Figure 7.14 allows a comparison of the three sequential search methods with the exhaustive search method. The dichotomous search algorithm places the two starting points closer to the mid-point than spacing them equally over the starting interval as is adopted in the basic search procedure. This narrows the search interval for the next iteration, which increases the search efficiency and the RR. The way the spacing of the two points is selected is what differentiates the three algorithms.
The Fibonacci method is said to be the most efficient, especially when greater accuracy is demanded. It requires that the number of trials n be selected/decided in advance, which is a limitation in cases where one has no prior knowledge of the behavior of the function near the maximum. If the function is very steep close to the maximum, selecting too few points would not yield an accurate estimate of the maximum. The other minor disadvantage is that it requires calculations of the function y(x) at rather odd values of x. A modified Fibonacci search method has also been developed for situations when the function is such that one is unable to select the number of trials in advance (Beveridge and Schechter 1970). The Golden Section method is a compromise, being slightly less efficient than the Fibonacci but requiring no preselection of the number of trials n. While the Fibonacci requires different ratios of the ranges to be selected as the search progresses, the Golden Section only assumes a single value, namely 0.618. Figure 7.15 illustrates how the two initial intervals r = r1 = r2 allow the points x1 and x2 to be determined. Two function calls at x1 and x2 allow one to narrow the interval (a,b) to (x1,b). The next calculation is done by considering the interval (x1,b) and dividing it again into two ranges with end points (x1, x2′). The function value at x1 can be reused, while that at the new point x2′ has to be determined again. The search is thus continued until an acceptable range of uncertainty is reached. The Golden Section method is robust and is a widely used numerical search method. It is obvious that the above discussion is pertinent for unimodal functions; multi-modal functions require much greater care.

Fig. 7.15 (a) The Golden section rule divides a segment into two intervals following ratio r = 0.618 as shown. (b) Two search points x1 and x2 are determined within the search interval {a,b} following r = r2 = r1

(d) Newton-Raphson method: Gradient-based methods allow faster convergence than the previous methods. They are numerical methods, but the step size is varied based on the slope of the function. The Newton-Raphson method is quite popular; it was developed for finding roots of single non-linear equations but can be applied to optimization problems as well if the equation is taken to be the derivative of the objective function. It is iterative with each step determined based on a linear Taylor series expansion of the derivative of the objective function, namely (dJ/dx). The algorithm is essentially as follows: (i) assume a starting value x0 and calculate the first derivative (dJ/dx) at x0, (ii) determine the slope of the first derivative function (d²J/dx²) at x0, (iii) update the search value based on the slope to determine the next value x1 = (x0 + Δx), (iv) continue till the desired convergence is reached. Step (iii) of the algorithm is based on the following approximation:

$$\left.\frac{dJ}{dx}\right|_{x_0 + \Delta x} \approx \left.\frac{dJ}{dx}\right|_{x_0} + \left.\frac{d^2J}{dx^2}\right|_{x_0} \Delta x = 0 \tag{7.18a}$$

which can be rewritten in terms of the step size:

$$\Delta x = -\left.\frac{dJ}{dx}\right|_{x_0} \Big/ \left.\frac{d^2J}{dx^2}\right|_{x_0} \tag{7.18b}$$
Example 7.4.1⁷ Minimize

$$J(x) = (x - 1)^2 (x - 2)(x - 3) \qquad \text{s.t. } 2 \le x \le 4 \tag{7.19}$$

The numerical search process is illustrated graphically in Fig. 7.16. As shown, the function obviously has roots (i.e., the function cuts the x-axis) at points 1, 2, and 3. The function J(x) and the first derivative are also plotted in the figure. The first derivative of the objective function is the equation for which the roots must be determined:

$$\frac{dJ}{dx} = 2(x - 1)(x - 2)(x - 3) + (x - 1)^2 (x - 3) + (x - 1)^2 (x - 2) = 0$$

and the second derivative

$$\frac{d^2J}{dx^2} = 2(x - 2)(x - 3) + 4(x - 1)(x - 3) + 4(x - 1)(x - 2) + 2(x - 1)^2$$

Fig. 7.16 Graphical illustration of Example 7.4.1 using the Newton-Raphson method (the plot shows J(x) and dJ/dx against x, with the starting point x0 = 3 and the step Δx)

Assume, say, a starting value of x0 = 3 (note: this value is selected to fall within the range constraint stipulated in the problem). The function value J = 0, the first derivative (dJ/dx) = 4 and (d²J/dx²) = 16. From Eq. (7.18b), Δx = -0.25, and the corresponding value of (dJ/dx) at the new point is 0.875. The derivative is not zero as yet and so another iteration is required. Assuming x1 = 2.75, the second iteration yields J = -0.5742, (dJ/dx) = 0.875 and (d²J/dx²) = 9.25, from which Δx = -0.0946 and (dJ/dx) = 0.104. The value of the derivative has decreased, indicating that we are closer to zero, but more iterations are required. In fact, five iterations are needed to reach a value of (dJ/dx) = 0, at which point x = 2.6404 and J = -0.6197. As a precaution it is urged, especially when dealing with non-linear functions, that the search be repeated with different starting values to assure oneself that a global minimum has indeed been reached. It is important to realize that if one had stipulated the constraint as 0 ≤ x ≤ 4 and chosen a starting value of x0 = 0.5, the solution would have converged to x = 1 which is a local minimum (see Fig. 7.16). This highlights the dangers of local convergence when dealing with non-linear functions. Further, it is clear that the number of iterations will reduce if the starting value is taken to be close to the global solution. Under certain circumstances, all gradient-based methods may fail to converge, such as when the function derivative during the search equals zero. Algorithms based on a combination of gradient-based and search methods have also been developed and exhibit desirable qualities of robustness while being efficient.

⁷ From Venkataraman (2002).
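The iterations of Example 7.4.1 can be reproduced with a minimal sketch of the Newton-Raphson update of Eq. 7.18b. The derivative expressions are those of the example (with the last term of the second derivative taken as 2(x - 1)², which is what differentiating dJ/dx gives); the function names, tolerance and iteration cap are illustrative choices.

```python
def J(x):
    return (x - 1.0) ** 2 * (x - 2.0) * (x - 3.0)

def dJ(x):
    return (2.0 * (x - 1.0) * (x - 2.0) * (x - 3.0)
            + (x - 1.0) ** 2 * (x - 3.0)
            + (x - 1.0) ** 2 * (x - 2.0))

def d2J(x):
    return (2.0 * (x - 2.0) * (x - 3.0)
            + 4.0 * (x - 1.0) * (x - 3.0)
            + 4.0 * (x - 1.0) * (x - 2.0)
            + 2.0 * (x - 1.0) ** 2)

def newton_raphson(x0, tol=1e-8, max_iter=50):
    """Iterate x <- x - dJ/d2J until the step size falls below tol."""
    x = x0
    history = [x]
    for _ in range(max_iter):
        curvature = d2J(x)
        if curvature == 0.0:
            raise ZeroDivisionError("second derivative vanished; no Newton step")
        step = -dJ(x) / curvature
        x += step
        history.append(x)
        if abs(step) < tol:
            break
    return x, history

# Starting inside the stipulated range 2 <= x <= 4
x_a, hist_a = newton_raphson(3.0)
print("from x0 = 3.0:", [round(v, 4) for v in hist_a], "J =", round(J(x_a), 4))

# Starting at x0 = 0.5 leads to the local minimum instead
x_b, hist_b = newton_raphson(0.5)
print("from x0 = 0.5:", [round(v, 4) for v in hist_b], "J =", round(J(x_b), 4))
```

Running the search from the two starting values illustrates the dependence on the starting point discussed in the example. Note that the sketch does not enforce the range constraint; it only reports where the unconstrained iteration settles.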
Fig. 7.17 Conceptual illustration of finding a minimum point of a bi-variate function using a pattern or lattice search method. From an initial point 1, the best subsequent move involves determining the function values around that point at discrete grid points (points 2 through 9) and moving to the point with the lowest function value
7.4.2
Multivariate Methods
Realistically, the efficiency of single variate search algorithms is not a concern given the computing power available nowadays. For multivariate problems, on the other hand, efficiency is a critical aspect since the number of combinations increases exponentially. Numerous multivariate search methods are available in the literature, but only the basic ones are described below. These methods can be grouped into zero-order (when only the function values are to be determined), first-order (based on the first derivative or linear gradient), and second-order (requiring the second order derivatives or quadratic polynomials). One can also categorize the methods as “numerical” or “analytical” or a combination of both. Search methods are most robust and appropriate for non-differentiable or discontinuous functions while calculus-based methods are generally efficient for problems with continuous functions. One distinguishes between valley-descending methods for minimization problems and hill-climbing methods for maximization problems. Three general solution approaches for unconstrained optimization problems are described below in terms of bivariate functions for easier comprehension. (a) Pattern or lattice search is a directed search method where one starts at one point in the search space (shown as point 1 in Fig. 7.17), calculates values of the function at several points around the initial point (points 2–9), and moves to the point which has the lowest value (shown as point 5). This process is repeated till the overall minimum is found. This combination of exploratory moves and heuristic search done iteratively is the basis of the Hooke-Jeeves pattern search method, which is quite popular. Sometimes, one may use a coarse grid search initially; find the optimum within an interval of
Fig. 7.18 Conceptual illustration of finding a minimum point of a bi-variate function using the univariate search method. From an initial point 1, the gradient of the function is used to find the optimal point value of x2 keeping x1 fixed, and so on till the optimal point 5 is found
uncertainty, then repeat the search using a finer grid. Note that this is not a calculus-based method since the function calls are made at discrete surrounding points, and it is not very efficient computationally, especially for higher-dimension problems. However, it is more robust than calculus-based algorithms and simple to implement.
(b) Univariate search method (Fig. 7.18) involves finding the minimum with respect to one variable at a time while keeping the others constant, and repeating this process iteratively. This method accelerates the process of reaching the minimum (or maximum) point of a function compared to the lattice search. One starts by using some preliminary values for all variables other than the one being optimized and finds the optimum value for the selected variable using a one-dimensional search process or a calculus-based one-step process (shown as x1). One then selects a second variable to optimize while retaining this optimal value of the first variable, finds the optimal value of the second variable, and so on for all remaining variables until no significant improvement is found between successive searches. The entire process often requires more than one iteration, as shown in Fig. 7.18. A real danger is that it can get trapped in a local minimum region. The one-dimensional searches do not necessarily require numerical derivatives, giving this algorithm an advantage when used with functions that are discontinuous or not easily differentiable. However, the search process is inherently not very efficient.
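The lattice search of item (a) can be sketched in a few lines of Python. This is only an illustrative sketch: the test function `bowl`, the starting point, and the rule of halving the grid spacing when no neighbor improves are assumptions made here, not part of the text.

```python
def bowl(x1, x2):
    """Hypothetical smooth test function with its minimum at (1, -0.5)."""
    return (x1 - 1.0) ** 2 + 2.0 * (x2 + 0.5) ** 2

def lattice_search(f, start, step=1.0, min_step=1e-6):
    """Evaluate the 8 surrounding grid points and move to the lowest one.

    When no neighbor improves on the current point, the grid is refined
    by halving the step (coarse grid first, finer grid afterwards).
    """
    x1, x2 = start
    best = f(x1, x2)
    while step >= min_step:
        neighbors = [
            (x1 + i * step, x2 + j * step)
            for i in (-1, 0, 1)
            for j in (-1, 0, 1)
            if (i, j) != (0, 0)
        ]
        value, point = min((f(a, b), (a, b)) for a, b in neighbors)
        if value < best:
            best, (x1, x2) = value, point   # exploratory move succeeded
        else:
            step /= 2.0                     # refine the lattice
    return (x1, x2), best

point, value = lattice_search(bowl, start=(3.0, 3.0))
```

Each pass costs eight function calls in two dimensions (and 3^n − 1 calls in n dimensions), which illustrates why the method becomes expensive for higher-dimension problems.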
Example 7.4.2 Illustration of the univariate search method Consider the following function with two variables which is to be minimized using the univariate search process starting with an initial value of x2 = 3.
Fig. 7.19 Conceptual illustration of finding a minimum point of a bi-variate function using the steepest descent search method. From an initial point 1, the gradient of the function is determined, and the next search point determined by moving in that direction, and so on till the optimal point 4 is found
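$$y = x_1 + \frac{16}{x_1 x_2} + \frac{x_2}{2} \tag{7.20a}$$

First, the partial derivatives are found:

$$\frac{\partial y}{\partial x_1} = 1 - \frac{16}{x_2 x_1^2} \quad \text{and} \quad \frac{\partial y}{\partial x_2} = -\frac{16}{x_1 x_2^2} + \frac{1}{2} \tag{7.20b}$$

Next, the initial value of x2 = 3 is used to find the next iterative value of x1 from the (∂y/∂x1) function as follows:

$$\frac{\partial y}{\partial x_1} = 1 - \frac{16}{(3)\,x_1^2} = 0 \quad \text{from where} \quad x_1 = \frac{4\sqrt{3}}{3} = 2.309$$

The other partial derivative is finally used with this value of x1 to yield:

$$-\frac{16}{\left(\frac{4\sqrt{3}}{3}\right) x_2^2} + \frac{1}{2} = 0 \quad \text{from where} \quad x_2 = 3.722$$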
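The alternating updates of this example can be sketched in Python as follows. The closed-form one-dimensional minimizers come from setting each partial derivative of Eq. 7.20b to zero; the function names, the tolerance, and the cycle limit are illustrative assumptions.

```python
import math

def y(x1, x2):
    """Objective function of Eq. 7.20a."""
    return x1 + 16.0 / (x1 * x2) + x2 / 2.0

def best_x1(x2):
    """Solve dy/dx1 = 1 - 16/(x2*x1**2) = 0 for x1 with x2 held fixed."""
    return math.sqrt(16.0 / x2)

def best_x2(x1):
    """Solve dy/dx2 = -16/(x1*x2**2) + 1/2 = 0 for x2 with x1 held fixed."""
    return math.sqrt(32.0 / x1)

def univariate_search(x2_start, tol=1e-10, max_cycles=100):
    """Optimize one variable at a time until x2 stops changing."""
    x2 = x2_start
    history = []
    for _ in range(max_cycles):
        x1 = best_x1(x2)
        x2_new = best_x2(x1)
        history.append((x1, x2_new))
        converged = abs(x2_new - x2) < tol
        x2 = x2_new
        if converged:
            break
    return x1, x2, history

x1_opt, x2_opt, history = univariate_search(x2_start=3.0)
first_cycle = history[0]   # (x1, x2) after the first cycle
```

The first entry of `history` corresponds to the hand calculation above, and later cycles show how quickly successive improvements shrink.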
The new value of x2 is now used for the next cycle, and the iterative process is repeated until consecutive improvements turn out to be sufficiently small to suggest convergence. It is left to the reader to verify that the optimal values are x1 = 2, x2 = 4. ■
(c) Steepest-descent search (Fig. 7.19) is a widely used calculus-based approach because of its efficiency. The computational algorithm involves three steps: one starts with a guess value (represented by point 1), which is selected somewhat arbitrarily but, if possible, close to the optimal value. One then evaluates the gradient of the function at the current point by computing the partial derivatives either analytically or numerically. Finally,
one moves along this gradient direction (hence, the terminology “steepest”) by deciding, somewhat arbitrarily, on the step size. The relationship between the step sizes Δxi and the partial derivatives (∂y/∂xi) is:

$$\frac{\Delta x_1}{\partial y/\partial x_1} = \frac{\Delta x_2}{\partial y/\partial x_2} = \dots = \frac{\Delta x_i}{\partial y/\partial x_i} \tag{7.21}$$

Steps 2–3 are performed iteratively until the minimum (or maximum) point is reached. A note of caution is that too large a step size can result in numerical instability, while too small a step size increases computation time. The above valley-descending methods are suited for function minimization. The algorithm is easily modified to the hill-climbing situation for problems requiring maximization.
Example 7.4.3 Illustration of the steepest descent method
Consider the following function with three variables to be minimized:

$$y = \frac{72 x_1}{x_2} + \frac{360}{x_1 x_3} + x_1 x_2 + 2 x_3 \tag{7.22}$$

Assume a starting point of (x1 = 5, x2 = 6, x3 = 8). At this point, the value of the function is y = 115. These numerical values are inserted in the expressions for the partial derivatives:

$$\begin{aligned}
\frac{\partial y}{\partial x_1} &= \frac{72}{x_2} - \frac{360}{x_3 x_1^2} + x_2 = \frac{72}{6} - \frac{360}{(8)(5)^2} + 6 = 16.2 \\
\frac{\partial y}{\partial x_2} &= -\frac{72 x_1}{x_2^2} + x_1 = -\frac{72(5)}{(6)^2} + 5 = -5 \\
\frac{\partial y}{\partial x_3} &= -\frac{360}{x_1 x_3^2} + 2 = -\frac{360}{(5)(8)^2} + 2 = 0.875
\end{aligned} \tag{7.23}$$

In order to compute the next point, a step size must be assumed. Arbitrarily assume Δx1 = -1 (verify that taking a negative value results in a decrease in the function value y). Applying Eq. 7.21 with the derivative values of Eq. 7.23 results in

$$\frac{-1}{16.2} = \frac{\Delta x_2}{-5} = \frac{\Delta x_3}{0.875}$$

from where Δx2 = 0.309, Δx3 = -0.054. Thus, the new point is (x1 = 4, x2 = 6.309, x3 = 7.946). The reader can verify that the new point has resulted in a decrease in the functional value from 115 to 98.1. Repeated use of the search method will gradually result in the optimal value being found. ■
(d) More Efficient Methods based on Quadratic Convergence. An improvement over the pattern or lattice search algorithm is the Powell’s Conjugate Direction Method, which searches along a set of directions that is conjugate or orthogonal to the objective function. Instead of discrete function calls at surrounding points to direct the search direction as in pattern search (which is referred to as a zero-order model since it does not require derivatives to be determined), this method uses quadratic polynomial approximation for local interpolation of the objective function. The solution of the quadratic approximation serves as the starting point of the next iteration, and so on, while providing an indication of the step size as well. Note that if the objective function is quadratic to start with, the approximation is exact, and the minimum point is found in one step. Otherwise, several iterations are needed, but the convergence is very rapid. The Powell algorithm is not a calculus-based method; however, it is unsuitable for complicated non-linear objective functions. Also, it may be computationally inefficient for higher-dimension problems, and worse, may not converge to the global optimum if the search space is not symmetrical.
The Fletcher-Reeves Method greatly improves the search efficiency of the steepest gradient method by adopting the concept of quadratic convergence. The transformation allows determining good search directions and distances based on the shape of the target function near the initial guess minimum, and then progresses towards the local minimum. This is a calculus-based method which involves using the Hessian matrix to determine the search direction. The advantage is that this method uses information about the local curvature of the fit statistics as well as its local gradients, which often tends to stabilize the search results. Textbooks such as Venkataraman (2002) provide more details about the mathematical theory and ways to code these methods into a software program.
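The single steepest-descent step of Example 7.4.3 can be checked numerically with a short Python sketch. The function names are illustrative; the step is scaled so that Δx1 equals the assumed value of -1, exactly as in the hand calculation.

```python
def y(x1, x2, x3):
    """Objective function of Eq. 7.22."""
    return 72.0 * x1 / x2 + 360.0 / (x1 * x3) + x1 * x2 + 2.0 * x3

def gradient(x1, x2, x3):
    """Analytical partial derivatives of Eq. 7.22 (see Eq. 7.23)."""
    d1 = 72.0 / x2 - 360.0 / (x3 * x1 ** 2) + x2
    d2 = -72.0 * x1 / x2 ** 2 + x1
    d3 = -360.0 / (x1 * x3 ** 2) + 2.0
    return d1, d2, d3

def steepest_step(point, dx1):
    """Move along the gradient direction with the x1 step fixed at dx1.

    Eq. 7.21 states that every step is the same multiple of its
    partial derivative; that multiple is dx1 / (dy/dx1).
    """
    grad = gradient(*point)
    ratio = dx1 / grad[0]
    return tuple(p + ratio * g for p, g in zip(point, grad))

start = (5.0, 6.0, 8.0)
grad_at_start = gradient(*start)
new_point = steepest_step(start, dx1=-1.0)
y_start, y_new = y(*start), y(*new_point)
```

Calling `steepest_step` repeatedly, with a suitably reduced `dx1` as the gradient flattens, reproduces the iterative procedure described in the text.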
7.5
Linear Programming (LP)8
7.5.1
Standard Form
Recall the concept of numerical efficiency (or power) of a method of solution, which involves both robustness of the solution and fast execution times. Optimization problems which can be framed as a linear problem (even at the expense of a little loss in accuracy) have great numerical efficiency. Only if the objective function and the constraints (either equalities or inequalities) are both linear functions is the problem designated as a linear optimization problem; otherwise, it is deemed a non-linear optimization problem. The objective function can involve one or more functions to be either minimized or maximized (either objective can be treated identically since it is easy to convert one into the other).
8 “Programming” is synonymous with planning activities or optimization in operations research.
The standard form of linear programming problems is:

$$\text{minimize } f(\mathbf{x}) = \mathbf{c}^T \mathbf{x} \tag{7.24a}$$

$$\text{subject to: } g(\mathbf{x}): \mathbf{A}\mathbf{x} = \mathbf{b} \tag{7.24b}$$
where x is the column vector of variables of dimension n, b that of the constraint limits of dimension m, c that of the cost coefficients of dimension n, and A is the (m × n) matrix of constraint coefficients.
Example 7.5.1 Express the following linear two-dimensional problem into standard matrix notation:

$$\text{Maximize } f(\mathbf{x}): 3186 + 620x_1 + 420x_2 \tag{7.25a}$$

subject to

$$\begin{aligned}
g_1(\mathbf{x}) &: 0.5x_1 + 0.7x_2 \le 6.5 \\
g_2(\mathbf{x}) &: 4.5x_1 - x_2 \le 35 \\
g_3(\mathbf{x}) &: 2.1x_1 + 5.2x_2 \le 60
\end{aligned} \tag{7.25b}$$

with range constraints on the variables x1 and x2 being that these should not be negative. This is a problem with two variables (x1 and x2). However, three slack variables need to be introduced to reframe the three inequality constraints as equality constraints. This makes the problem into one with five unknown variables. The three inequality constraints are rewritten as:

$$\begin{aligned}
g_1(\mathbf{x}) &: 0.5x_1 + 0.7x_2 + x_3 = 6.5 \\
g_2(\mathbf{x}) &: 4.5x_1 - x_2 + x_4 = 35 \\
g_3(\mathbf{x}) &: 2.1x_1 + 5.2x_2 + x_5 = 60
\end{aligned} \tag{7.26}$$

Hence, the terms appearing in the standard form (Eqs. 7.24a and b) are:

$$\mathbf{c} = \begin{bmatrix} -620 & -420 & 0 & 0 & 0 \end{bmatrix}^T, \quad
\mathbf{x} = \begin{bmatrix} x_1 & x_2 & x_3 & x_4 & x_5 \end{bmatrix}^T,$$

$$\mathbf{A} = \begin{bmatrix} 0.5 & 0.7 & 1 & 0 & 0 \\ 4.5 & -1 & 0 & 1 & 0 \\ 2.1 & 5.2 & 0 & 0 & 1 \end{bmatrix}, \quad
\mathbf{b} = \begin{bmatrix} 6.5 & 35 & 60 \end{bmatrix}^T \tag{7.27}$$
Note that the objective function is recast as a minimization problem simply by reversing the signs of the coefficients. Also, the constant does not appear in the optimization since it can be simply added to the optimal value of the function at the end. Step-by-step solutions of such optimization problems are given in several textbooks such as Edgar et al. (2001), Hillier and Lieberman (2001) and Stoecker (1989).
A commercial optimization software program was used to determine the optimal value of the above objective function: f(x) = 9803.8. Note that in this case, since the inequalities are of the “less than or equal to” type, the numerical values of the slack variables (x3, x4, x5) will be non-negative. The optimal values for the primary variables are x1 = 8.493 and x2 = 3.219, while those for the slack variables are x3 = 0, x4 = 0, x5 = 25.424 (implying that constraints 1 and 2 in Eq. 7.25b have turned out to be equality constraints). ■
There is a great deal of literature on efficient algorithms to solve linear problems, which are referred to as linear programming methods. Because of its efficiency, the Simplex algorithm is the most popular numerical technique for solving large linear problems. It proceeds by moving from one feasible solution to another with each step improving the value of the objective function; it also provides the necessary information for performing a sensitivity analysis at the same time. Hence, formulating problems as linear problems (even when they are not strictly so) has a great advantage in the solution phase. Such problems arise in numerous real-world applications where limited resources such as machines (an airline with a fixed number of airplanes that must serve a preset number of cities each day), material, etc. are to be allocated or scheduled in an optimal manner to one of several competing solutions/pathways.
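The optimum reported for Example 7.5.1 can be cross-checked with a small Python sketch. It does not implement the Simplex algorithm; it simply enumerates the corner points of the feasible region, which is practical only for problems with two variables. The function names and tolerances are assumptions made for illustration.

```python
from itertools import combinations

# Each row (a1, a2, rhs) encodes a1*x1 + a2*x2 <= rhs.
CONSTRAINTS = [
    (0.5, 0.7, 6.5),     # g1 of Eq. 7.25b
    (4.5, -1.0, 35.0),   # g2
    (2.1, 5.2, 60.0),    # g3
    (-1.0, 0.0, 0.0),    # x1 >= 0
    (0.0, -1.0, 0.0),    # x2 >= 0
]

def objective(x1, x2):
    """Objective function of Eq. 7.25a (to be maximized)."""
    return 3186.0 + 620.0 * x1 + 420.0 * x2

def intersection(row1, row2):
    """Point where two constraint boundaries meet (Cramer's rule)."""
    (a, b, e), (c, d, f) = row1, row2
    det = a * d - b * c
    if abs(det) < 1e-12:
        return None                      # parallel boundaries
    return ((e * d - b * f) / det, (a * f - e * c) / det)

def is_feasible(point, tol=1e-9):
    return all(a * point[0] + b * point[1] <= rhs + tol
               for a, b, rhs in CONSTRAINTS)

def best_vertex():
    """Return the feasible corner point with the largest objective."""
    vertices = []
    for row1, row2 in combinations(CONSTRAINTS, 2):
        point = intersection(row1, row2)
        if point is not None and is_feasible(point):
            vertices.append(point)
    return max(vertices, key=lambda p: objective(*p))

x1_opt, x2_opt = best_vertex()
f_opt = objective(x1_opt, x2_opt)
# Slack variables x3, x4, x5 of Eq. 7.26 at the optimum.
slacks = [rhs - (a * x1_opt + b * x2_opt) for a, b, rhs in CONSTRAINTS[:3]]
```

The approach relies on the fact that the optimum of a bounded linear problem lies at a corner of the feasible region; the Simplex algorithm exploits the same fact but visits only a small subset of the corners.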
7.5.2
Example of a LP Problem9
This example illustrates how the objective function and the constraints are to be framed given the necessary data, and then expressed in standard form. Specifically, the problem involves optimizing the mix of different mitigation technology pathways available to reduce pollution from a steel-manufacturing company. A steel company wishes to reduce its pollution emissions (specifically particulates, sulfur oxides and hydrocarbons) which are generated in two of its processes: blast furnaces for making pig iron and open-hearth furnaces for changing iron into steel. For both types of equipment, there are three viable technological solutions: taller smokestacks, filters, and better fuels. The required reduction for each of the three pollutants and the emission reductions achievable with each abatement technology are given in Table 7.1, while the cost of each abatement method is shown in Table 7.2. This problem is formulated in terms of the six fractions xi shown in Table 7.3. The unit costs (shown in Table 7.2) are assumed to be constant, and so are the emission rates (shown in Table 7.1). The objective function is framed in terms of minimizing total cost subject to the required emission reductions and the fact that the fractions xi must be non-negative and no greater than 1.

Table 7.1 Emission rate reduction for different abatement technologies and required total annual emission reductions for different pollutants. The technology columns give the maximum feasible reduction in emission rates (10^6 lb/year)

| Pollutant | Taller smokestacks, blast furnace | Taller smokestacks, open-hearth furnace | Filters, blast furnace | Filters, open-hearth furnace | Better fuels, blast furnace | Better fuels, open-hearth furnace | Reqd emission reduction (10^6 lb/year) |
|---|---|---|---|---|---|---|---|
| Particulates | 12 | 9 | 25 | 20 | 17 | 13 | 60 |
| Sulfur oxides | 35 | 42 | 18 | 31 | 56 | 49 | 150 |
| Hydrocarbons | 37 | 53 | 28 | 24 | 29 | 20 | 125 |

Table 7.2 Annual costs for different abatement technologies if maximum feasible capacity is implemented

| Abatement method | Cost, blast furnaces ($ millions) | Cost, open-hearth furnaces ($ millions) |
|---|---|---|
| Taller smokestacks | 8 | 10 |
| Filters | 7 | 6 |
| Better fuels | 11 | 9 |

Table 7.3 Decision variables: fractions (xi) of the maximum feasible capacity of each abatement technology for the two manufacturing options

| Abatement method | Blast furnaces | Open-hearth furnaces |
|---|---|---|
| Taller smokestacks | x1 | x2 |
| Filters | x3 | x4 |
| Better fuels | x5 | x6 |

$$\begin{aligned}
\text{Minimize } & J = 8x_1 + 10x_2 + 7x_3 + 6x_4 + 11x_5 + 9x_6 \\
\text{s.t. } & 12x_1 + 9x_2 + 25x_3 + 20x_4 + 17x_5 + 13x_6 = 60 \\
& 35x_1 + 42x_2 + 18x_3 + 31x_4 + 56x_5 + 49x_6 = 150 \\
& 37x_1 + 53x_2 + 28x_3 + 24x_4 + 29x_5 + 20x_6 = 125 \\
& x_1, x_2, x_3, x_4, x_5, x_6 \ge 0 \\
& x_1, x_2, x_3, x_4, x_5, x_6 \le 1
\end{aligned} \tag{7.28a,b}$$

Following the standard notation (Eqs. 7.24a and b), the problem is stated as:

$$\begin{aligned}
& \text{Minimize } J = \mathbf{c}^T \mathbf{x} \\
& \text{Constraints}: g(\mathbf{x}): \mathbf{A}\mathbf{x} = \mathbf{b} \\
& \mathbf{c} = \begin{bmatrix} 8 & 10 & 7 & 6 & 11 & 9 \end{bmatrix}^T, \quad
\mathbf{x} = \begin{bmatrix} x_1 & x_2 & x_3 & x_4 & x_5 & x_6 \end{bmatrix}^T \\
& \mathbf{A} = \begin{bmatrix} 12 & 9 & 25 & 20 & 17 & 13 \\ 35 & 42 & 18 & 31 & 56 & 49 \\ 37 & 53 & 28 & 24 & 29 & 20 \end{bmatrix}, \quad
\mathbf{b} = \begin{bmatrix} 60 & 150 & 125 \end{bmatrix}^T
\end{aligned} \tag{7.29}$$

A commercial solver yields the following values at the optimal point: x*1 = 1.0, x*2 = 0.623, x*3 = 0.343, x*4 = 1.0, x*5 = 0.048, x*6 = 1.0, while the optimal objective function (total cost) is J* = $32.154 million.
Thus, since x*1 = 1.0, all the blast furnaces that can be converted to have taller smokestacks should be modified accordingly, and so on. It is left to the reader to perform a post-optimality analysis to determine the sensitivity of the solution. For example, the individual manufacturing units come in discrete sizes and so the six fractions have to be rounded up to the closest number of units. How the total cost and the total emissions change in such a case needs to be assessed. This example is strictly a mixed integer programming problem, which is usually harder to solve. The above approach is a simplification which works well in most cases.
9 Adapted from Hillier and Lieberman, 2001
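As a quick plausibility check, the reported solution can be substituted back into the cost function and the three emission-reduction constraints. The Python sketch below does only that (it does not solve the LP); the variable names are illustrative, and small residuals are expected because the reported fractions are rounded to three decimals.

```python
# Cost coefficients, constraint matrix and right-hand side of Eq. 7.29.
COST = [8, 10, 7, 6, 11, 9]
A = [
    [12, 9, 25, 20, 17, 13],    # particulates
    [35, 42, 18, 31, 56, 49],   # sulfur oxides
    [37, 53, 28, 24, 29, 20],   # hydrocarbons
]
B = [60, 150, 125]

# Solution reported in the text (rounded to three decimals).
X_REPORTED = [1.0, 0.623, 0.343, 1.0, 0.048, 1.0]

def dot(u, v):
    """Inner product of two equally long sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

total_cost = dot(COST, X_REPORTED)                       # $ millions
residuals = [dot(row, X_REPORTED) - rhs for row, rhs in zip(A, B)]
within_bounds = all(0.0 <= xi <= 1.0 for xi in X_REPORTED)
```

The same arrays can be reused for the post-optimality question raised in the text, for example by rounding the fractions and recomputing `total_cost` and `residuals`.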
7.5.3
Linear Network Models
Network models are suitable for modeling systems with discrete nodes/vertices/points (such as junctions) interconnected by links/lines/edges/branches (such as water pipes for district energy distribution or power lines) through which matter or power can flow. They have been increasingly applied to engineered infrastructures when recovering from partial or complete failure due to extreme weather events10 (such as electric power transmission and wide-area water distribution systems) and have even found applications in social sciences and for modeling internet and social media communication interactions/dynamics (a comprehensive text is that by Newman 2010). A recent building sciences application of network modeling was developed by Sonta and Jain (2020) in which a social and organizational human network structure was learned using ambient sensing data from distributed plug load energy sensors in commercial buildings. In essence, network models are “simplified representations that reduces a system to an abstract structure capturing only the basics of connection patterns with vertices or nodes for components and edges capturing some basic relationship of the node and of the system” (Alderson and Doyle 2010). The focus is on the topology of the essential structural interconnections among components (and not just on individual components), and the behavior of the system under event-based disruptions of such interconnections. A network with m edges connecting n nodes (called a graph) can be formulated as a set of n linear equations using Kirchhoff’s conservation law, which results in an (m × n) matrix called an incidence matrix. For simple cases without complex constraints, this leads to direct solutions (see Strang 1998 or Newman 2010). However, any realistic network would generally require a numerical method to solve the set of linear equations. It has been pointed out (for example, by Alderson et al. 2015) that the representation of an actual engineered system by a simplified surrogate network can be misleading if done simplistically. Hence, some sort of validation of the network topology, modeling equations and simulation is needed before one can place confidence in the analysis results.
10 Such issues are now being increasingly studied under the general area referred to as “resilience.”
7.5.4
Example of Maximizing Flow in a Transportation Network
This example is taken from Vugrin et al. (2014) to illustrate how to analyze transportation networks. Figure 7.20 represents a simple network with 7 nodes and 12 links where the objective is to maximize flow from one starting node 1 to another specified end node 7. The limiting capacities of various links are specified (and indicated in the figure). A fictitious return link (shown dotted) with infinite capacity needs to be introduced to complete the circuit. The optimization model for this flow problem can be framed as:

$$\begin{aligned}
\max \; & x_{71}(t) \\
\text{s.t. } & \sum_{i \in I_n} x_i(t) - \sum_{i \in O_n} x_i(t) = 0 \qquad n = 1, \dots, 7 \\
& 0 \le x_i(t) \le K_i(t) \qquad \forall i
\end{aligned} \tag{7.30}$$
Note that the time element t has been shown even though this analysis only involves the steady state situation. The second expression is the conservation constraint whereby the sum of inflows In is equal to the sum of outflows On at each node n. Ki denotes the limiting capacity of link i. The symbol ∀ is used to state that the constraint range applies to all members i of the set. For the uninterrupted case, i.e., when no links are broken, the maximum flow is 14 units, while the flows through the
Fig. 7.20 Flow network topology with 7 nodes and 12 links. The intent is to maximize flow from node 1 to node 7 under different breakage scenarios. The limiting flow capacities of various links are shown above the corresponding lines. The dotted line is a fictitious link to complete the flow circuit
Table 7.4 Flows through different links for two scenarios in order to maximize total flow from node 1 to node 7 (see Fig. 7.20)

| Link | Flow, uninterrupted case | Flow, compromised case (links 1-4, 2-3, 3-4 are broken) |
|---|---|---|
| 1-2 | 3.0 | 3.0 |
| 1-3 | 7.0 | 7.0 |
| 1-4 | 4.0 | – |
| 2-3 | 0.0 | – |
| 2-5 | 3.0 | 3.0 |
| 3-4 | 0.0 | – |
| 3-5 | 4.0 | 4.0 |
| 3-6 | 3.0 | 3.0 |
| 4-6 | 4.0 | 0.0 |
| 6-5 | 1.0 | 0.0 |
| 5-7 | 8.0 | 7.0 |
| 6-7 | 6.0 | 3.0 |
| Flow from 1-7 | 14.0 | 10.0 |
individual links are assembled in Table 7.4. The same equations can be modified to analyze the situation when one or more of the links breaks. The corresponding flows for a breakage scenario when links 1-4, 2-3 and 3-4 are compromised are also assembled in Table 7.4. In this case, the maximum flow reduces to 10 units. Such analyses can be performed assuming different scenarios of one or more link breakages. Such types of evaluations are usually done in the framework of reliability analyses during the design of the networks.
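The conservation constraint of Eq. 7.30 can be verified against the flows listed in Table 7.4 with a short Python sketch. It only checks the tabulated solutions and does not solve the maximum-flow problem, since the link capacities appear only in Fig. 7.20. The sketch assumes that a link labeled "i-j" carries flow from node i to node j; the names are illustrative.

```python
# Flows from Table 7.4, keyed by (from_node, to_node).
UNINTERRUPTED = {
    (1, 2): 3.0, (1, 3): 7.0, (1, 4): 4.0, (2, 3): 0.0,
    (2, 5): 3.0, (3, 4): 0.0, (3, 5): 4.0, (3, 6): 3.0,
    (4, 6): 4.0, (6, 5): 1.0, (5, 7): 8.0, (6, 7): 6.0,
}
# Broken links 1-4, 2-3 and 3-4 are simply left out.
COMPROMISED = {
    (1, 2): 3.0, (1, 3): 7.0, (2, 5): 3.0, (3, 5): 4.0,
    (3, 6): 3.0, (4, 6): 0.0, (6, 5): 0.0, (5, 7): 7.0,
    (6, 7): 3.0,
}

def net_outflow(flows, node):
    """Sum of outflows minus sum of inflows at a node."""
    outflow = sum(f for (i, _), f in flows.items() if i == node)
    inflow = sum(f for (_, j), f in flows.items() if j == node)
    return outflow - inflow

def check_flows(flows, source=1, sink=7, n_nodes=7):
    """Return (interior nodes balanced?, flow leaving source, flow entering sink)."""
    interior = [n for n in range(1, n_nodes + 1) if n not in (source, sink)]
    balanced = all(abs(net_outflow(flows, n)) < 1e-9 for n in interior)
    return balanced, net_outflow(flows, source), -net_outflow(flows, sink)

result_uninterrupted = check_flows(UNINTERRUPTED)
result_compromised = check_flows(COMPROMISED)
```

The source and sink are excluded from the balance check because the fictitious return link 7-1, which closes the circuit in the model, is not listed in the table.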
7.5.5
Mixed Integer Linear Programming (MILP)
Mixed integer problems (MILP) are a special category of linear optimization problems where some of the variables are integers or even binary variables (such as a piece of equipment being on or off or for a “yes/no” decision coded as 1 and 0). Integer/binary variables arise in scheduling problems involving multiple pieces of equipment. For example, if a large facility has numerous power generation units to meet a variable load, determining which units to operate so as to minimize operating costs would be an integer problem, while determining the fraction of their rated capacity at which to operate would be a continuous variable problem. Both issues taken together would be treated as a MILP problem (as illustrated in the solved example in Sect. 7.7). For example, Henze et al. (2008) developed and validated an optimization environment for a pharmaceutical facility with a chilled water plant with ten different chillers (electrical and absorption) that adopts mixed integer programming to optimize chiller selection (scheduling) and dispatch for any cooling load condition, while an overarching dynamic programming approach selects the optimal charge/discharge strategy of the chilled water thermal energy storage system. Another example is when a manufacturer who has the capability of producing different types of widgets must decide on how many items of each widget type to manufacture in order to maximize profit. Typically, such problems can be set up as standard linear optimization problems, with the added requirement that some of the variables must be integers. MILP problems are generally solved using an LP-based branch-and-bound algorithm (see Hillier and Lieberman 2001). The basic LP-based branch-and-bound can be described as follows. Start by removing all the integrality restrictions on decision variables which can only take on integer values. The resulting LP is called the LP relaxation of the original MILP. On solving this problem, if it so happens that the result satisfies all the integrality restrictions, even though these were not explicitly imposed, then that is the optimal solution sought.
If not, as is usually the case, then the normal procedure is to pick one variable that is restricted to be integer, but whose value in the LP relaxation is fractional. For the sake of argument, suppose that this variable is x and its value in the LP relaxation is 3.3. One can then exclude this value by, in turn, imposing the constraints x ≤ 3.0 and x ≥ 4.0. This process is done sequentially for all the integer variables and is somewhat tedious. Usually, commercial optimization programs have in-built capabilities, and the user does not need to specify/perform such additional steps. Generally, it can be stated that (mixed) integer programming problems are much harder to solve than linear programming problems. MILP problems also arise in circumstances where the constraints are either-or, or the more general case when “K out of N” constraints need to be satisfied. A simple example illustrates the former case (Hillier and Lieberman 2001). Consider a case when at least one of the inequalities must hold:
$$\text{Either } 3x_1 + 2x_2 \le 18 \quad \text{or} \quad x_1 + 4x_2 \le 16 \tag{7.31}$$

This can be reformulated as:

$$\begin{aligned}
3x_1 + 2x_2 &\le 18 + My \\
x_1 + 4x_2 &\le 16 + M(1 - y)
\end{aligned} \tag{7.32}$$
where M is a very large number and y is a binary variable (either 1 or 0). Solving these two constraints along with the objective function will provide the solution. A practical example of how MILP can be used for supervisory control is given in Sect. 7.7 wherein the component models of various energy equipment are framed as linear functions (a useful simplification in many cases).
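The equivalence between the either-or condition of Eq. 7.31 and the big-M reformulation of Eq. 7.32 can be illustrated with a few test points. In the Python sketch below, the value chosen for M and the test points are arbitrary assumptions; the sketch checks feasibility only and does not optimize anything.

```python
BIG_M = 1.0e6   # "very large number" M of Eq. 7.32 (arbitrary choice)

def either_or_holds(x1, x2):
    """Original condition of Eq. 7.31: at least one inequality holds."""
    return (3 * x1 + 2 * x2 <= 18) or (x1 + 4 * x2 <= 16)

def feasible_y_values(x1, x2, big_m=BIG_M):
    """Binary values y for which both constraints of Eq. 7.32 hold."""
    feasible = []
    for y in (0, 1):
        first = 3 * x1 + 2 * x2 <= 18 + big_m * y
        second = x1 + 4 * x2 <= 16 + big_m * (1 - y)
        if first and second:
            feasible.append(y)
    return feasible

# y = 0 enforces the first inequality and relaxes the second;
# y = 1 does the opposite.
only_first = feasible_y_values(0, 5)    # 10 <= 18 holds, 20 <= 16 fails
only_second = feasible_y_values(5, 2)   # 19 <= 18 fails, 13 <= 16 holds
neither = feasible_y_values(6, 3)       # 24 <= 18 and 18 <= 16 both fail
both = feasible_y_values(2, 2)          # both inequalities hold
```

A point satisfies the either-or condition exactly when at least one binary value of y makes both big-M constraints feasible, which is why a MILP solver can handle the condition by treating y as an extra binary decision variable.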
7.5.6
Example of Reliability Analysis of a Power Network
Consider a simple electric power transmission system with seven loads and two generators (at nodes 1 and 3) as shown in Fig. 7.21.11 The step-down transformers are the nodes of this network while the high-voltage power lines are the links or lines indicated by arrows. The loss in power in the lines will be neglected, and the analysis will be done assuming DC current flow (the AC current analysis is more demanding computationally requiring solving non-linear equations and one needs to consider sophisticated stabilizing feedback control loops). Moreover, network models only capture energy/ power quantities, not current and voltage, as essential variables in a power system. This limits the ability to model cascading failures. Network models essentially capture snapshots of power systems at discrete time intervals, while actual power networks are continuous-time dynamical systems. (a) Mathematical model The flow model for the network shown in Fig. 7.21 is described below and illustrated by means of a numerical example. The discussion applies to the case when some of the links are broken, and partial interruption of power supply is experienced by some nodes. Since the line losses have been neglected, there will be multiple solutions for the case without line disruption.
11 This simple illustrative classroom example was analyzed by Alireza Inanlouganji (whom we thank) as part of a research project. The power transmission network for the entire island of Puerto Rico is described in Boyle et al. (2022).
Fig. 7.21 Network topology of a simple electric power transmission system with loads at each of the seven nodes (denoting step-down transformers for distribution) and two generators (located at nodes 1 and 3) with generation capacities as shown in square boxes. The numbers above the links (or lines) inside circles indicate the maximum carrying capacity of the transmission lines
Sets and parameters
$N$: set of nodes
$L$: set of links
$gc_i$: generation capacity at node i
$rd_i$: demand for power at node i
$tc_{ik}$: transmission capacity between nodes i and k
$cur_i$: unit cost of unmet demand at node i

Variables
$psr_i$: power supplied at node i to meet demand
$pg_i$: power generated at node i
$pt_{ik}$: power transmitted between nodes i and k
$wpl_{ik}$: status of the link between nodes i and k

The following objective function J (in dollar cost) is to be optimized:

$$J = \min \sum_{i \in N} cur_i \left( rd_i - psr_i \right) \tag{7.33}$$

Constraints
From Kirchhoff’s first law, balance of the currents at a node:

$$\sum_{k \in S_i^+} pt_{ki} + pg_i = psr_i + \sum_{j \in S_i^-} pt_{ij} \qquad \forall i \in N \tag{7.34}$$

where $S_i^+$ is the set of nodes which transfer power to node i and $S_i^-$ the set of nodes to which node i transfers power. Recall that a ∈ B denotes that a is an element of (or belongs to) set B. The symbol ∀ is the “universal quantifier” which states that the condition holds for all instances of the given variable; i.e., for all elements i belonging to N.

$$psr_i \le rd_i \qquad \forall i \in N \tag{7.35}$$

As the intent is to minimize the total unmet demand, the constraint given by Eq. (7.35) avoids excess power supply to any node. The constraint which imposes the limitation of generation capacity of the nodes in the network is expressed as:

$$pg_i \le gc_i \qquad \forall i \in N \tag{7.36}$$

Finally, the constraint meant to model the transmission capacity limit of each link of the network is:

$$pt_{ik} \le tc_{ik} \qquad \forall i, k \in N \tag{7.37}$$

Note that the network topology can also be represented using these capacity parameters. In particular, if there is no link between two nodes, the above parameter will be zero, and the model would not allow for any power to get transmitted through that link.
(b) Case study to evaluate criticality of different links
Tables 7.5 and 7.6 assemble the needed data to model and simulate this network. The demand at each node is an input to the model while the power actually supplied to the node is determined as a solution of the optimization model. A reliability analysis can be undertaken wherein one studies the impact on the unmet demand at all the nodes combined should a single node fail. This would provide some indication of the most critical link so that some appropriate action (such as line hardening or drawing extra lines) can be taken. As there are seven nodes in the network, seven different failure scenarios will be considered. In this case, it is assumed that the local demand is met first and any excess is transmitted. The optimization is performed individually assuming a single node failure. An index called Figure of Merit is defined as the proportion of total demand that is met with the remaining portion of the network which is functional. Thus,
failure scenarios with a smaller figure of merit imply greater loss of functionality, and vice versa. Table 7.7 summarizes the consequences in terms of unmet demand for different node failure scenarios sorted by increasing figure of merit. Note that node 6 along with its links is the most critical one: if all links connected to it fail, only 58% [=(12-5)/12] of demand in the network can be met. The next critical node is node 3, with a 25% decrease in network supply capacity in case of failure in all connected links. Failure of nodes 2 and 5 has minor implications. Such insights would allow the topology to be modified should the designer/operator not want the figure of merit to be less than a pre-selected threshold. Similar analysis can be done for cases when two nodes fail simultaneously, and so on. One could also evaluate the effect of links (i.e., power lines) breaking due to, say, a hurricane. These types of investigations fall largely under traditional reliability analysis. In case the recovery process is to be modeled in order to allocate repair crews optimally, that would fall under resilience analysis, and a different set of more complex procedures needs to be adopted (Inanlouganji et al. 2022).

Table 7.5 Inputs for network nodes

| Node # | End-use demand (MW) | Cost of unmet demand ($/MWh) | Generation capacity (MW) |
|---|---|---|---|
| 1 | 2 | 300 | 6 |
| 2 | 0.5 | 300 | 0 |
| 3 | 3 | 300 | 8 |
| 4 | 1 | 300 | 0 |
| 5 | 0.5 | 300 | 0 |
| 6 | 3 | 300 | 0 |
| 7 | 2 | 300 | 0 |
| Total | 12 | – | 14 |

Table 7.6 Inputs for network links

| Link # | Starting-ending nodes | Transmission capacity (MW) |
|---|---|---|
| 1 | 1-3 | 7 |
| 2 | 1-4 | 4 |
| 3 | 2-5 | 3 |
| 4 | 3-2 | 1 |
| 5 | 3-4 | 2 |
| 6 | 3-6 | 5 |
| 7 | 4-6 | 4 |
| 8 | 6-5 | 1 |
| 9 | 6-7 | 6 |

Table 7.7 Results of node failure scenarios (usually done as part of reliability analysis). The total demand is 12 MW

| Node failure | Connecting links | Total unmet demand (MW) | Figure of merit |
|---|---|---|---|
| 6 | 3-6, 4-6, 6-5, 6-7 | 5 | 58% |
| 3 | 1-3, 3-2, 3-4, 3-6 | 3 | 75% |
| 1 | 1-3, 1-4 | 2 | 83% |
| 7 | 6-7 | 2 | 83% |
| 4 | 1-4, 3-4, 4-6 | 1 | 92% |
| 2 | 2-5, 3-2 | 0.5 | 96% |
| 5 | 2-5, 6-5 | 0.5 | 96% |

7.6
Nonlinear Programming
7.6.1
Standard Form
Non-linear problems are those where either the objective function or any of the constraints are non-linear. Such problems represent the general case and are of great interest. Examples include complex operations research problems such as selecting optimal engineering system designs, portfolio optimization, and system model calibration. A widely used notation to describe the complete nonlinear optimization problem is to frame the problem as:

$$\text{Minimize} \quad y = y(\mathbf{x}) \qquad \text{objective function} \tag{7.38}$$

$$\text{subject to} \quad \mathbf{h}(\mathbf{x}) = 0 \qquad \text{equality constraints} \tag{7.39}$$

$$\mathbf{g}(\mathbf{x}) \le 0 \qquad \text{inequality constraints} \tag{7.40}$$

$$l_j \le x_j \le u_j \qquad \text{range or boundary constraints} \tag{7.41}$$
where x is a vector of p variables. The constraints h(x) and g(x) are vectors of independent equations of dimension m1 and m2 respectively. If these constraints are linear, then the problem is said to have linear constraints; otherwise, it is said to have non-linear constraints. The constraints lj and uj are lower and upper bounds of the decision variables of dimension m3. Thus, the total number of constraints is m = m1 + m2 + m3. Some of the popular search methods adopted in practice are briefly discussed in Sect. 7.6.3. A note of caution needs to be reiterated. Quite often, the optimal point is such that some of the constraints turn out to be redundant (but one has no way of knowing that beforehand), and even worse, the problem may be found to have either no solution or an infinite number of solutions. In such cases, for a unique solution to be found, the optimization problem may have to be reformulated in such a manner that, while being faithful to the physical problem being solved, some of the constraints are relaxed or reframed. This is easier said than done, and even the experienced analyst may have to evaluate alternative formulations before deciding on the most appropriate one.
7.6.2
Quadratic Programming
A function of dimension n (i.e., there are n variables) is said to be quadratic when:

$$f(x) = a_{11}x_1^2 + a_{12}x_1x_2 + \ldots + a_{ij}x_ix_j + \ldots + a_{nn}x_n^2 = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}x_ix_j \tag{7.42}$$

where the coefficients are constants. Consider the function that is quadratic in two variables:

$$f(x_1, x_2) = 4x_1^2 + 12x_1x_2 - 6x_2x_1 - 8x_2^2 \tag{7.43}$$

It can be written in matrix form as:

$$f(x_1, x_2) = \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} 4 & 12 \\ -6 & -8 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

Because $12x_1x_2 - 6x_2x_1 = 6x_1x_2$, the function can also be written as:

$$f(x_1, x_2) = \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} 4 & 3 \\ 3 & -8 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

More generally, the coefficient matrix of any quadratic function can be written in symmetric form. Quadratic programming problems are in essence a type of non-linear problem whose functional form is such that they can be solved using linear methods. They differ from linear problems in only one aspect: the objective function is quadratic in its terms, while the constraints must be linear but can be either equalities or inequalities. Even though such problems can be treated as nonlinear problems, formulating the problem as a quadratic one allows for greater numerical efficiency in finding the solutions. Numerical algorithms to solve such problems are similar to the linear programming ones; a modified Simplex method has been developed which is quite popular. The standard notation is:

$$\text{Minimize } f(x) = cx + \frac{1}{2}x^TQx \tag{7.44}$$
$$\text{Subject to: } g(x): \; Ax = b \tag{7.45}$$

Note that the coefficient matrix Q is symmetric, as explained above.

Example 7.6.1 Express the following problem in standard quadratic programming formulation:

$$\text{Min } J = 4x_1^2 + 4x_2^2 + 8x_1x_2 - 60x_1 - 45x_2 \tag{7.46}$$
$$\text{subject to } 2x_1 + 3x_2 = 30$$

In this case:

$$c = \begin{bmatrix} -60 & -45 \end{bmatrix}, \quad x = \begin{bmatrix} x_1 & x_2 \end{bmatrix}^T, \quad Q = \begin{bmatrix} 8 & 8 \\ 8 & 8 \end{bmatrix}, \quad A = \begin{bmatrix} 2 & 3 \end{bmatrix}, \quad b = [30]$$

As a verification:

$$\frac{1}{2}x^TQx = \frac{1}{2}\begin{bmatrix} x_1 & x_2 \end{bmatrix}\begin{bmatrix} 8 & 8 \\ 8 & 8 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \frac{1}{2}\begin{bmatrix} (8x_1 + 8x_2) & (8x_1 + 8x_2) \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \frac{1}{2}\left(8x_1^2 + 16x_1x_2 + 8x_2^2\right) = 4x_1^2 + 8x_1x_2 + 4x_2^2$$

The reader can verify that the optimal solution corresponds to x1 = 3.75, x2 = 7.5, which results in an optimal value of J* = -56.25 for the objective function. ■
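The optimum quoted in Example 7.6.1 can be checked numerically. The sketch below is one possible verification, not the modified Simplex method mentioned in the text: it assumes the equality-constrained problem is solved through its first-order (Lagrange) conditions, Qx + Aᵀλ = -c and Ax = b, which form a small linear system. The helper `solve_linear` and the multiplier name `lam` are illustrative choices, not notation from the book.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in reversed(range(n)):  # back-substitution
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


# Data of Example 7.6.1 in the standard notation of Eqs. 7.44 and 7.45
Q = [[8.0, 8.0], [8.0, 8.0]]
c = [-60.0, -45.0]
A = [2.0, 3.0]
b = 30.0

# Stationarity of the Lagrangian plus the constraint:
#   Q x + A^T lam = -c   and   A x = b
kkt = [
    [Q[0][0], Q[0][1], A[0]],
    [Q[1][0], Q[1][1], A[1]],
    [A[0], A[1], 0.0],
]
rhs = [-c[0], -c[1], b]
x1, x2, lam = solve_linear(kkt, rhs)

# Objective value J = c x + 0.5 x^T Q x
J = (c[0] * x1 + c[1] * x2
     + 0.5 * (Q[0][0] * x1 * x1 + 2.0 * Q[0][1] * x1 * x2 + Q[1][1] * x2 * x2))
print(x1, x2, J)
```

Under these assumptions the script returns x1 = 3.75, x2 = 7.5 and J = -56.25, in agreement with the values stated in the example.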
7.6.3
Popular Numerical Multivariate Search Algorithms
Section 7.4 discussed several numerical search methods. They were meant to provide a basic background to optimization problems and ways to solve them using rather simple examples. Most practical problems are much more complex than these and require suitable software codes or commercial software. The general multivariate optimization problem involves both equality and inequality constraints, which can be linear or non-linear. There are two general approaches to solving such problems (see, e.g., Beveridge and Schechter 1970 or Venkataraman 2002). The indirect method is one where the optimization problem is transformed into an unconstrained problem (for example, by the Lagrange multiplier method), which is then solved with search methods like those described in Sect. 7.4 to find the optimal value. On the other hand, the direct method handles the constraints without any transformation. One example is
the gradient-based hemstitching method, which relies entirely on the gradients of the constraints to direct its move. It is based on linearizing the binding constraints, and so the search process proceeds in a zigzag pattern; hence, its appellation. Several software codes have been developed to solve constrained non-linear problems which involve rather sophisticated numerical methods (see, e.g., Venkataraman 2002). Only three of the most widespread algorithms are briefly discussed.
(a) Generalized Reduced Gradient (GRG) (or its popular version GRG2) is widely used for nonlinear problems involving equality constraints (inequality constraints can be converted to equality ones by using slack variables). Its basic algorithm is a refinement of the gradient search method, modified in such a way that the search path does not penetrate any constraint boundary. Spreadsheet programs such as Excel use this method.
(b) Sequential Linear Programming (SLP), where the solution is obtained by successively constructing a linear approximation of the objective function around the current search point. The solution at each search point is found very quickly, but several searches will be required depending on the complexity and non-linear behavior of the objective function.
(c) Sequential Quadratic Programming (SQP), which uses a quadratic expansion (based on Taylor series expansion) of the objective function around the current search point to conduct the search. A quadratic model is successively or recursively minimized either by linearizing the constraints around the search point, or by incorporating the constraints into a penalty function (or barrier function) that is added to the objective function being minimized so as to impose large penalties for violating constraints. The method is computationally expensive for higher dimension problems even though local convergence is very fast. It is suitable for small problems as well as those that involve large sparse matrix structures.
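The idea of folding a constraint into a penalty term can be illustrated on Example 7.6.1. The sketch below is not GRG, SLP, or SQP; it is a plain quadratic-penalty transformation, shown only as one instance of the indirect approach. It assumes the penalized function J(x) + μ(2x1 + 3x2 - 30)², where the weight μ (written `mu`) is a hypothetical parameter introduced here. Because the penalized function is itself quadratic, its unconstrained minimum follows from a 2 × 2 linear system that is solved in closed form.

```python
def penalized_minimum(mu):
    """Unconstrained minimizer of J(x) + mu * (2*x1 + 3*x2 - 30)**2.

    J is the objective of Example 7.6.1. Setting the gradient to zero gives
    (Q + 2*mu*a*a^T) x = -c + 2*mu*b*a  with a = (2, 3) and b = 30.
    """
    h11 = 8 + 8 * mu       # Q11 + 2*mu*a1*a1
    h12 = 8 + 12 * mu      # Q12 + 2*mu*a1*a2
    h22 = 8 + 18 * mu      # Q22 + 2*mu*a2*a2
    r1 = 60 + 120 * mu     # -c1 + 2*mu*b*a1
    r2 = 45 + 180 * mu     # -c2 + 2*mu*b*a2
    det = h11 * h22 - h12 * h12
    x1 = (r1 * h22 - h12 * r2) / det   # Cramer's rule
    x2 = (h11 * r2 - h12 * r1) / det
    return x1, x2


# Increasing the penalty weight drives the unconstrained minimizer
# toward the constrained optimum (3.75, 7.5) of Example 7.6.1.
for mu in (1, 10, 100, 1000):
    x1, x2 = penalized_minimum(mu)
    violation = 2 * x1 + 3 * x2 - 30
    print(mu, x1, x2, violation)
```

For small μ the constraint is visibly violated, and the violation shrinks as μ grows. This is the trade-off that practical penalty-based codes manage automatically.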
7.7
Illustrative Example: Integrated Energy System (IES) for a Campus
This section illustrates, by means of a practical example, how the mixed integer programming (MIP) approach can be used for supervisory control of an IES with multiple types of equipment. The component models of the various energy equipment are intentionally simplified and framed as linear functions. An end-of-chapter problem (Pr. 7.22) extends this to a more realistic non-linear mixed integer problem. IES systems, of which combined heat and power (CHP) prime movers form a part, are described in several books and technical
papers (e.g., see Petchers 2003). Such systems, when meant for commercial/institutional buildings, involve multiple CHP units, chillers and boilers and require more careful and sophisticated equipment scheduling and control methods than those in industrial applications. This is due to the large variability in building thermal and electric loads as well as the equipment scheduling issue. Equipment scheduling involves determining which of the numerous equipment combinations to operate, i.e., it is concerned with starting or stopping prime movers, boilers, and chillers. The second and lower-level type of control is called supervisory control, which involves determining the optimal values of the control parameters (such as loading of prime movers, boilers and chillers) under a specific equipment schedule. The complete optimization problem, for a given hour, would qualify as a mixed-integer programming (MIP) problem (see Sect. 7.5.5) because different discrete pieces of equipment may be on or off. The problem can be tackled by using algorithms appropriate for MIP where certain variables can only assume integer values (e.g., for the combinatorial problem, a certain piece of equipment can be off or on, which can be designated as 0 or 1, respectively). Usually, such algorithms are not too efficient, and a typical approach in engineering problems when faced with mixed integer problems is to treat integer variables as continuous and solve the continuous problem. The near-optimal values of these variables are then simply chosen by rounding to the nearest integer. In the particular case of IES optimization, another approach, which works well for medium-sized situations (involving, say, up to about 50 combinations), is to proceed as follows. For a given hour specified by the climatic variables and the building loads, all the feasible combinations of equipment are first generated.
Subsequently, a lower-level optimization is done for each of these feasible equipment combinations, from which the best combination of equipment to meet the current load can be selected. Currently, little optimization of the interactions among systems is done in buildings. Heuristic control normally used by plant operators often results in suboptimal operation due to the numerous control options available to them as well as due to dynamic, time-varying electric rate structures and seasonal changes in gas and electricity prices. Though reliable estimates are lacking in the technical literature, the consensus is that cost savings of 5–15% can be realized if these multiple-equipment IES plants were operated more rationally and optimally. Figure 7.22 is a generic schematic of how the important subsystems of an integrated energy system (namely, CHP prime movers, vapor compression chillers, absorption chillers and boilers) are often coupled to serve the building loads (Braun 2006; Moslehi and Reddy 2018). The heat recovered in the CHP can be used to directly meet some or all of the heating loads and/or can also be diverted to
Fig. 7.22 Schematic of a simple integrated energy system with only one CHP machine, one boiler, one vapor compression, and one absorption cooling machine meeting the electricity, cooling, and heating needs of a campus of buildings (from Moslehi and Reddy 2018)
meet the cooling needs via an absorption cooling system. The electricity generated by the CHP system can meet all or part of the electric demand, and a sell-back option to the grid can also be envisioned. The static optimization case, without utility sell-back, involves optimizing the operating cost of the IES system for each time step. The simplifying assumptions are given below.
Simplifying Assumptions for the IES Example
(i) Analysis done at hourly increments assuming steady state condition
(ii) No electric sell-back to grid
(iii) No thermal dumping of energy allowed
(iv) Part load efficiencies of equipment assumed to be equal to rated condition¹²
(v) Equipment can be operated down to zero-part load
(vi) No start up or shut-down costs of equipment
(vii) No time locks or ramp up constraints of equipment
(viii) Auxiliary equipment power use is included along with the primary equipment (e.g., the COP of the chillers reflects that of the total cooling plant which includes cooling tower fans and pumps as well)
(ix) No thermal losses in pipes
(x) Equipment operation and maintenance (O&M) costs are neglected.
The objective function and the constraints are shown in Eqs. 7.47–7.51, while Tables 7.8, 7.9, and 7.10 assemble the relevant input data: assumed sizing, performance, and costs of the various IES equipment; building electric, heating and cooling hourly loads (for two different scenarios: one during summer with higher cooling loads, and one during winter with higher heating loads); and the utility costs for electricity and natural gas. The necessary nomenclature is also provided.
¹² Assumptions (iv) and (v) are unrealistic simplifications; refer to Problem 7.22 for how to frame them more realistically.
Table 7.8 Equipment specification

| Symbol | Description | Numerical value |
|---|---|---|
| CHP prime mover (micro-turbine) | | |
| E_PM^R | Rated electric output | 360 kW |
| η_PM,el | Rated electric efficiency | 30% |
| η_PM,th | Thermal heat recovery | 50% of waste heat = (1 - 0.3) × 0.5 = 0.35, i.e., 35% of heat input to PM, or (0.35/0.3) of electric output by PM |
| Boiler | | |
| Q_BO^R | Rated heat output | 500 kW |
| η_BO | Thermal efficiency | 85% |
| Vapor compression chiller (VC) | | |
| Q_VC^R | Rated cooling capacity | 950 kW (~270 tons of cooling) |
| COP_VC | COP | 4.0 |
| Absorption chiller (AC) | | |
| Q_AC^R | Rated cooling capacity | 400 kW (~114 tons of cooling) |
| COP_AC | COP | 0.8 |

Table 7.9 Building loads for both scenarios analyzed

| Symbol | Description | S#1 (Winter) | S#2 (Summer) |
|---|---|---|---|
| E_LE | Electric loads (excluding cooling plant) | 480 kWh | 480 kWh |
| Q_h | Thermal heating loads | 1.6 × 10^6 kJ/h | 0.9 × 10^6 kJ/h |
| Q_c | Thermal cooling loads | 1.0 × 10^6 kJ/h | 3.0 × 10^6 kJ/h |

Table 7.10 Energy costs

| Symbol | Description | Unit cost | Alternative units |
|---|---|---|---|
| C_el | Electricity rate (no demand charge) | $0.10/kWh | $28 per 10^6 kJ |
| C_ng | Natural gas cost | $6 per 10^6 kJ | 1 million Btu ~ 10^6 kJ |

(a) Objective function (total cost of energy consumption)

Min{J}, where J = cost of grid electricity purchase + cost of natural gas purchase (for boiler and PM):

$$J = (C_{el} \times E_{GR}) + C_{ng} \times \left(\frac{Q_{BO}}{\eta_{BO}} + \frac{E_{PM}}{\eta_{PM,el}}\right) \tag{7.47}$$

subject to

(i) Energy balance constraints

$$\text{Electricity:} \quad E_{PM} + E_{GR} = E_{LE} + E_{VC} \tag{7.48}$$
$$\text{Heating:} \quad Q_{BO} + Q_{HR} = Q_h + H_{AC,in} \tag{7.49}$$
$$\text{Cooling:} \quad Q_{AC} + Q_{VC} = Q_c \tag{7.50}$$

(ii) None of the quantities can be negative
(iii) The part-load ratios (PLR) of the equipment are bounded:

$$0 \le PLR \le 1 \tag{7.51}$$

(b) Component models of the major components

(b1) Prime mover or CHP:

(i) Electricity generated per hour (output) (kJ/h)

$$E_{PM} = E_{PM}^{R} \times PLR_{PM} = (360\ \text{kW} \times 3600\ \text{s/h}) \times PLR_{PM} \tag{7.52}$$

(ii) Natural gas heat input (kJ/h)

$$H_{PM,in} = E_{PM}/\eta_{PM,el} = E_{PM}/0.3 \tag{7.53}$$

(iii) Thermal heat recovered (kJ/h)

$$Q_{HR} = E_{PM} \times \frac{\eta_{PM,th}}{\eta_{PM,el}} = E_{PM} \times \frac{0.35}{0.3} \tag{7.54}$$

(b2) Boiler:

(i) Thermal energy output (kJ/h)

$$Q_{BO} = Q_{BO}^{R} \times PLR_{BO} = (500\ \text{kW} \times 3600\ \text{s/h}) \times PLR_{BO} \tag{7.55}$$

(ii) Natural gas heat input (kJ/h)

$$H_{BO,in} = Q_{BO}/\eta_{BO} = Q_{BO}/0.85 \tag{7.56}$$

(b3) Vapor compression (VC) chiller:

(i) Cooling energy output (kJ/h)

$$Q_{VC} = Q_{VC}^{R} \times PLR_{VC} = (950\ \text{kW} \times 3600\ \text{s/h}) \times PLR_{VC} \tag{7.57}$$

(ii) Electric input (kJ/h)

$$E_{VC} = Q_{VC}/COP_{VC} = Q_{VC}/4.0 \tag{7.58}$$

(b4) Absorption chiller (AC):

(i) Cooling energy output (kJ/h)

$$Q_{AC} = Q_{AC}^{R} \times PLR_{AC} = (400\ \text{kW} \times 3600\ \text{s/h}) \times PLR_{AC} \tag{7.59}$$

(ii) Thermal heat input (kJ/h)

$$H_{AC,in} = Q_{AC}/COP_{AC} = Q_{AC}/0.8 \tag{7.60}$$

Nomenclature
C: unit cost
COP: coefficient of performance
E: electric energy
H: thermal heat input
PLR: part-load ratio
Q: thermal energy
Q_c,h: thermal cooling (or heating) load

Greek Letters
η: efficiency

Subscripts
AC: absorption chiller
BO: boiler
VC: vapor compression chiller
GR: electric grid
HR: heat recovery unit
LE: building lights and equipment
PM: prime mover
el: electric
in: input
ng: natural gas
out: output
th: thermal

Superscripts
R: at rated conditions
The optimization results for both scenarios are summarized in Table 7.11. The results indicate that the prime mover (i.e., the CHP system) should be operated at full capacity (PLRPM = 1.0) while the boiler should be at 5% part-load during winter and shut down during summer. Of course, 5% part-load would require excessive cycling in the actual boiler plant, and practically, this is to be avoided! The absorption chiller is switched off during winter when cooling load is low. The VC chiller should be operated during both scenarios, while as expected, its loading fraction is higher during summer compared to winter (0.734 vs. 0.292). These results are supported by intuitive understanding, but the optimization analysis has provided better quantitative values for the supervisory control system. For S#2, a sensitivity analysis was done by varying the prime mover loading fraction PLRPM around the optimal value. The minimum value of the objective function J = $55.59. The objective function is fairly flat around the optimal value with small changes in the loading of the other equipment to compensate for departures from the optimal operating point.
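The reported costs can be cross-checked by coding the component models directly. The sketch below is not the optimization procedure used to produce Table 7.11; it assumes a simple dispatch rule (recovered heat serves the heating load first, any surplus drives the absorption chiller, the boiler and the VC chiller cover the remainder) and then scans the prime mover loading on a coarse grid. The function names `hourly_cost` and `best_dispatch` are illustrative. Electricity is priced at $28 per 10^6 kJ, the alternative unit listed in Table 7.10, which is an assumption about the rounding behind the tabulated costs.

```python
KJ_PER_KWH = 3600.0

# Table 7.8 ratings converted to kJ/h, and rated performance values
E_PM_RATED = 360.0 * KJ_PER_KWH
Q_BO_RATED = 500.0 * KJ_PER_KWH
Q_VC_RATED = 950.0 * KJ_PER_KWH
Q_AC_RATED = 400.0 * KJ_PER_KWH
ETA_EL, ETA_TH, ETA_BO = 0.30, 0.35, 0.85
COP_VC, COP_AC = 4.0, 0.8

E_LE = 480.0 * KJ_PER_KWH   # Table 7.9 electric load, kJ/h
C_EL, C_NG = 28.0, 6.0      # Table 7.10, $ per 10^6 kJ (assumed rounding)


def hourly_cost(plr_pm, q_h, q_c):
    """Cost ($/h) and part-load ratios for a given prime mover loading.

    Returns None when the assumed dispatch rule violates a bound
    (equipment capacity, no sell-back, no thermal dumping).
    """
    e_pm = E_PM_RATED * plr_pm             # Eq. 7.52
    q_hr = e_pm * ETA_TH / ETA_EL          # Eq. 7.54
    q_bo = max(q_h - q_hr, 0.0)            # boiler covers heating shortfall
    h_ac_in = max(q_hr - q_h, 0.0)         # surplus heat, Eq. 7.49
    q_ac = COP_AC * h_ac_in                # Eq. 7.60
    q_vc = q_c - q_ac                      # Eq. 7.50
    e_vc = q_vc / COP_VC                   # Eq. 7.58
    e_gr = E_LE + e_vc - e_pm              # Eq. 7.48
    plr = {"PM": plr_pm, "BO": q_bo / Q_BO_RATED,
           "VC": q_vc / Q_VC_RATED, "AC": q_ac / Q_AC_RATED}
    if e_gr < 0.0 or q_vc < 0.0 or any(v > 1.0 + 1e-9 for v in plr.values()):
        return None
    cost = (C_EL * e_gr + C_NG * (q_bo / ETA_BO + e_pm / ETA_EL)) / 1e6  # Eq. 7.47
    return cost, plr


def best_dispatch(q_h, q_c, n_steps=100):
    """Scan PLR_PM from 0 to 1 and keep the cheapest feasible point."""
    best = None
    for k in range(n_steps + 1):
        result = hourly_cost(k / n_steps, q_h, q_c)
        if result is not None and (best is None or result[0] < best[0]):
            best = result
    return best


winter = best_dispatch(q_h=1.6e6, q_c=1.0e6)
summer = best_dispatch(q_h=0.9e6, q_c=3.0e6)
print(round(winter[0], 2), {k: round(v, 3) for k, v in winter[1].items()})
print(round(summer[0], 2), {k: round(v, 3) for k, v in summer[1].items()})
```

With these assumptions the scan selects full prime mover loading in both scenarios and gives costs of about $45.64 and $55.59, consistent with Table 7.11. A general-purpose LP or MIP solver would be needed once the dispatch rule is no longer obvious, for example when sell-back or part-load efficiency curves are introduced.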
Table 7.11 Optimization results for the two different scenarios (S)

| Symbol | Description | S#1 (Winter) | S#2 (Summer) |
|---|---|---|---|
| J | Objective function ($) | 45.64 | 55.59 |
| PLR_PM | Part load ratio for prime mover | 1.00 | 1.00 |
| PLR_BO | Part load ratio for boiler | 0.049 | 0.00 |
| PLR_VC | Part load ratio for vapor comp. chiller | 0.292 | 0.734 |
| PLR_AC | Part load ratio for absorption chiller | 0.00 | 0.340 |

7.8
Introduction to Global Optimization
Certain types of optimization problems can have local (or suboptimal) minima, and the optimization methods described earlier can converge to such local minima closest to the starting point and never find the global solution. Further, certain optimization problems can have non-continuous first-order derivatives in certain search regions, and calculus-based methods break down. Global optimization methods are those which can circumvent such limitations but can only guarantee a close approximation to the global optimum (often this is not a major issue). Unfortunately, they generally require large computation times. These methods fall under two general categories (Edgar et al. 2001):
(a) Exact methods include such methods as the branch-and-bound methods and multistart methods. Most commercial non-linear optimization software programs have the multistart capability built in, whereby the search for the optimum solution is done automatically from many starting points. This is a conceptually simple approach, though its efficient implementation requires robust methods of sampling the search space for starting points that do not converge to the same local optima, and also rules for stopping the search process.
(b) Heuristic search methods are those which rely on some rules of thumb or "heuristics" to gradually reach an optimum; an iterative or adaptive improvement algorithm is central. They incorporate algorithms which circumvent the situation of non-improving moves and disallow previously visited states from being revisited. Again, there is no guarantee that a global optimum will be reached, and so the computation is often stopped after a set number of iterations have been completed.
There are four well-known methods which fall in this category: Tabu search, simulated annealing (SA), genetic algorithms (GA), and particle swarm optimization (PSO). The interested reader can refer to specialized texts on this subject, such as Berthold and Hand (2003). PSO algorithms are a recent development shown to be powerful for solving multiobjective problems. They simulate the centralized learning process of a group of individuals and are said to be simple in operation with fast convergence.
GA algorithms mimic the process of natural selection in evolutionary biology and have been widely adopted in diverse engineering applications in the last few decades. Their popularity warrants a brief description. While Tabu search and simulated annealing operate by transforming a single solution at a given step, GA begins by
defining a chromosome or an array of parameter values to be optimized (Haupt and Haupt 1998). Constrained problems with parameter bounds should also be reframed into unconstrained ones, and a continuous variable should be converted to a higher-order binary variable by discretizing its range of variability and quantizing it. The result is an unconstrained discrete optimization problem of an array of n parameters. Obviously, the analyst should aim for lower dimension arrays, which are easier to handle numerically. An initial population of size 2n to 4n starting vectors (or initial strings) is selected (heuristically) as starting values. An example of a random initial population of 2n strings of a binary encoded chromosome with n = 5 is:

$$\text{Initial population chromosome string} = [\,11001_{1} \;\; 10001_{2} \;\; 00110_{3} \;\; \ldots \;\; 10100_{9} \;\; 01100_{10}\,]$$
An objective function or fitness function to be minimized is computed for each initial or parent string, and a subset of the strings which are “fitter,” that is, which yield a lower numerical value of the objective function to be minimized is retained. Successive iterations (called “generations”) are performed by either combining two or more fit individuals (called “crossover”) or by changing an individual (called “mutation”) to gradually minimize the function. This procedure is repeated several thousands of times until the solution converges. Because the random search was inspired by the process of natural selection underlying the evolution of natural organisms, this optimization method is called genetic algorithm. Clearly, the process is extremely computer intensive, especially when continuous variables are involved. Sophisticated commercial software is available, but the proper use of this method requires some understanding of the mathematical basis, and the tradeoffs available to speed convergence.
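The loop of selection, crossover, and mutation described above can be sketched in a few lines. The example below is a toy illustration under stated assumptions and not a substitute for the commercial software mentioned in the text: the fitness function (x - 10)², the population size of 4n = 20, the mutation probability, and the fixed number of generations are all hypothetical choices, and the 5-bit chromosome simply encodes an integer between 0 and 31.

```python
import random

N_BITS = 5  # chromosome length n, as in the n = 5 example in the text


def decode(chrom):
    """Interpret a binary string as an integer in [0, 31]."""
    return int(chrom, 2)


def fitness(chrom):
    """Hypothetical objective to be minimized: (x - 10)^2."""
    return (decode(chrom) - 10) ** 2


def run_ga(pop_size=20, generations=50, p_mut=0.05, seed=0):
    rng = random.Random(seed)
    pop = ["".join(rng.choice("01") for _ in range(N_BITS))
           for _ in range(pop_size)]
    history = []  # best fitness found at each generation
    for _ in range(generations):
        pop.sort(key=fitness)
        history.append(fitness(pop[0]))
        parents = pop[:pop_size // 2]          # retain the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, N_BITS - 1)   # single-point crossover
            child = a[:cut] + b[cut:]
            # mutation: flip each bit with probability p_mut
            child = "".join(bit if rng.random() > p_mut else "10"[int(bit)]
                            for bit in child)
            children.append(child)
        pop = parents + children
    best = min(pop, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = run_ga()
print(best, decode(best), history[-1])
```

Because the fitter half of each generation is carried over unchanged, the best fitness can never get worse from one generation to the next. Whether the global minimum is actually reached depends on the random seed and on the settings, which is the behavior the text attributes to heuristic methods.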
7.9
Examples of Dynamic Programming
Dynamic programming is a discrete-event recursive technique developed to handle a type of problem where one is optimizing a trajectory, i.e., a sequence of decisions, rather than finding an optimum point. The term “dynamic” is used to reflect the fact that subsequent choices are affected by
earlier ones. It is based on Richard Bellman’s Principle of Optimality, namely any optimal policy has the property that, whatever the current state and past decisions leading up to it, the remaining decisions must constitute an optimal policy. Thus, it involves multistage decision-making of discrete processes or continuous functions that can be approximated or decomposed into stages. It is not a simple extension of the “static” or single-stage optimization methods discussed earlier, but one that, as shown below, involves solution methods that are much more computationally efficient. In essence, the dynamic programming approach transforms a complex problem into a sequence of simpler problems. Thus, instead of solving the entire problem at once, the sub-problems associated with individual stages are solved one after the other. The stages could be time intervals or spatial intervals. For example, determining the optimal flight path of a commercial airliner travelling from city A to city B which minimizes fuel consumption while taking into consideration vertical air density gradients (and hence, drag effects), atmospheric disturbances and other effects is a problem in dynamic programming. Example 7.9.1 Traveling salesman Consider the following classic example of a travelling salesman, which has been simplified for easier conceptual understanding (see Fig. 7.23). A salesman starts from city A and needs to end his journey at city D but he is also required to visit two intermediate cities B and C of his choosing among several possibilities (in this problem, three possibilities: B1, B2, and B3 at stage B; and C1, C2, and C3 at stage C). The travel costs to each of the cities at a given stage, from each of the cities from the previous stage, are specified. Thus, this problem consists of four stages (A, B, C and D) and three states (three different possible cities). 
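A staged routing problem of the kind posed in Example 7.9.1 can be solved by a standard backward recursion, sketched below. The travel costs are hypothetical (the values from Fig. 7.23 are not reproduced here), and the function name `staged_shortest_path` is an illustrative choice. The sketch keeps one cost-to-go value per city of each stage, which is the textbook form of Bellman's recursion.

```python
def staged_shortest_path(stage_costs):
    """Backward dynamic programming over a staged network.

    stage_costs[k][i][j] is the cost of going from state i of stage k
    to state j of stage k + 1. Returns (minimum cost, list of state indices).
    """
    n_final = len(stage_costs[-1][0])
    cost_to_go = [0.0] * n_final      # nothing left to pay at the destination
    policy = []
    for costs in reversed(stage_costs):
        new_cost, choice = [], []
        for row in costs:
            totals = [c + g for c, g in zip(row, cost_to_go)]
            j = min(range(len(totals)), key=totals.__getitem__)
            new_cost.append(totals[j])
            choice.append(j)
        cost_to_go = new_cost
        policy.insert(0, choice)
    # Forward pass: follow the stored decisions from the single start state
    state, path = 0, [0]
    for choice in policy:
        state = choice[state]
        path.append(state)
    return cost_to_go[0], path


# Hypothetical costs: A -> (B1, B2, B3) -> (C1, C2, C3) -> D
costs = [
    [[4, 2, 3]],                           # A to B1, B2, B3
    [[5, 6, 2], [4, 7, 6], [3, 6, 5]],     # each B to C1, C2, C3
    [[6], [4], [5]],                       # each C to D
]
total, path = staged_shortest_path(costs)
print(total, path)
```

For these made-up costs the recursion returns a total cost of 11 along A-B1-C3-D (state indices 0, 0, 2, 0). Only 3 + 9 + 3 = 15 additions are needed, and the saving over full enumeration grows quickly as stages and states are added.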
The computational algorithm involves starting from the destination (city D) and working backwards to starting city A (see Table 7.12). The first calculation step involves adding the costs to travel from city D to cities C1, C2, and C3. The second calculation step involves determining costs from D through each of the cities C1, C2, and C3 and on to the three possibilities at stage B. One then identifies the paths through each of the cities C1, C2, and C3 which are the minimum (shown with an asterisk). Thus, path D-C1-B2 is cheaper than paths D-C1-B1 and D-C1-B3. For the third and final step, one limits the calculation to these intermediate sub-optimal paths and performs three calculations only. The least cost path among these three is the optimal path sought (shown as path D-C3-B1-A in Table 7.12). Note that one does not need to compute the other six possible paths at the third stage, which is where the computational savings arise. It is obvious that the computational savings increase for problems with increasing number of stages and states. If all possible combinations were considered for a problem involving n intermediate stages (excluding the start and end stages) with m states each, the total number of enumerations or possibilities would be about $m^n$. On the other hand, for the dynamic programming algorithm described above, the total number would be approximately $n \cdot m^2$ (each of the roughly n transitions between stages requires about m × m evaluations). Thus, for n = m = 4, all possible routes would require 256 calculations as against about 64 for the dynamic programming algorithm.

Fig. 7.23 Flow paths for the traveling salesman problem, in which the salesman starts from city A and needs to reach city D with the requirement that he visit one city among the three options under each of groups B and C

Table 7.12 Solution approach to the travelling salesman problem. Each calculation step is associated with a stage and involves determining the cumulative cost till that stage is reached

| Start | First calculation step | Second calculation step | Third calculation step | Optimal path |
|---|---|---|---|---|
| D | D-C1 | D-C1-B1, D-C1-B2*, D-C1-B3 | D-C1-B2-A | |
| | D-C2 | D-C2-B1, D-C2-B2*, D-C2-B3 | D-C2-B2-A | |
| | D-C3 | D-C3-B1*, D-C3-B2, D-C3-B3 | D-C3-B1-A* | ← |

Note: The cells with an asterisk (*) are the optimal paths at each step

Basic features that characterize the dynamic programming problem are (Hillier and Lieberman 2001):
(a) The problem can be divided into stages with a policy decision made at each stage.
(b) Each stage has a number of states associated with the beginning of that stage.
(c) The effect of the policy decision at each stage is to transform the current state to a state associated with the beginning of the next stage.
(d) The solution procedure is to divide the problem into stages, and given the current state at a certain stage, to find the optimal policy for the next stage only among all future states.
(e) The optimal policy of the remaining stages is independent of the optimal policies selected in the previous stages.
Dynamic programming has been applied to many problems; to name a few: control of system operation over time, design of multistage equipment (such as heat exchangers, reactors, distillation columns, ...), equipment maintenance and replacement policy, production control, economic planning, and investment. One can distinguish between deterministic and probabilistic dynamic programming methods, the latter arising when the state at the next stage is not completely determined by the state and policy decisions of the current stage. Only deterministic problems are considered in this section. Examples of stochastic factors that may arise in probabilistic problems are uncertainty in future demand, random equipment failures, supply of raw material, etc.
Example 7.9.2 Strategy of operating a building to minimize cooling costs
This simple example illustrates the use of dynamic programming for problems involving differential equations. The concept of lumped parameter models was introduced in Sect. 1.4.4, and a thermal network model of heat flow through a wall was discussed in Fig. 1.10. The same concept of a thermal network can be extended to predict the dynamic thermal response of an entire building. Many electric utilities in the United States have summer-peaking problems, meaning that they are hard pressed to meet the demands of their service customers during certain hot afternoon periods, referred to as peak periods. The air-conditioning (AC) use in residences and commercial buildings has been shown to be largely responsible for this situation.
Remedial solutions undertaken by utilities involve voluntary curtailment by customers via incentives or penalties through electric rates, which vary over time of day and by season (called time of day seasonal rates). Engineering solutions, also encouraged by utilities, to alleviate this situation include installing cool ice storage systems, as well as soft options involving dynamic control of the indoor temperature via the thermostat. This is achieved by sub-cooling the building during the night and early morning and controlling the thermostat in a certain manner during the peak period hours such that the “coolth” in the thermal mass of the building structure and its furnishings can partially offset the heat loads of the building and hence, reduce the electricity demands of the AC.
Figure 7.24 illustrates a common situation where the building is occupied from 6:00 am till 7:00 pm with the peak period being from noon till 6:00 pm. The total cost to the resident is the sum of the electricity usage cost plus a demand cost which depends on the maximum hourly (or sub-hourly) usage during the whole month. The normal operation of the building is to set the thermostat to 72 °F during the occupied period and to 85 °F during the unoccupied period (such a thermostat setup scheme is a common energy conservation measure). Three different pre-cooling options are shown; all three involve cooling the building down to 70 °F, representative of the lower occupant comfort level, starting from 6:00 am. The difference between the three options lies in how the thermostat is controlled during the peak period. The first scheme is to simply set up the thermostat to a value of 78 °F, representative of the high-end occupant discomfort value, with the anticipation that the internal temperature Ti will not reach this value before the end of the peak period. If it does, the AC would come on and partially negate the electricity demand benefits which such a control scheme would provide. Often, the thermal mass in the structure will not be high enough for this control scheme to work satisfactorily. Another simple control scheme is to let the indoor temperature ramp up linearly, which is also not optimal but is easy to implement. The third, and optimal, option is to determine a control set-up path which would minimize the following cost function over the entire day:¹³

$$J^{*} = \min\{J\} = \min\left\{\sum_{t=1}^{24} c_{e,t}\,P_t(T_{i,t}) + c_d \max\left[P_{t,\,t_1-t_2}(T_{i,t})\right]\right\} \tag{7.61a}$$

subject to:

$$T_{i,\min} \le T_{i,t} \le T_{i,\max} \quad \text{and} \quad 0 \le P_t \le P_{Rated} \tag{7.61b}$$

where
c_e,t is the unit cost vector of electricity in $/kWh (which can assume different values at different times of the day) as set by the electric utility
P_t is the electric energy use during hour t, and is a function of Ti which changes with time t
c_d is the demand cost in $/kW, also set by the electric utility, which is usually imposed based on the maximum hourly use during the peak hours (note that there could be two demand costs, one for off-peak and one for on-peak during a given day)
max(P_t,t1–t2) is the demand or maximum hourly use during the peak period t1 to t2

¹³ This is a continuous path optimization problem that can be discretized to a finite sum, say with a convenient time period of 1 h (the approximation improves as the number of terms in the sum increases).
Fig. 7.24 Sketch showing the various operating periods of the building discussed in Example 7.9.2: (a) the thermostat operation during normal operation taken to be the baseline; (b) the three different thermostat set-point control strategies for reducing total electric cost. (From Lee and Braun 2008, © American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., www.ashrae.org)
The AC power consumed each hour, represented by Pt, is affected by Ti,t. It cannot be negative and should be less than the capacity of the AC denoted by PRated. The solution to this dynamic programming problem requires two thermal models:
(i) One to represent the thermal response of the building
(ii) Another for the performance (or efficiency) of the AC
Thermal models of varying complexity have been proposed in the literature. A simple model following Reddy et al. (1991) is adequate to illustrate the approach. Consider the 1R1C thermal network shown in Fig. 7.25, where the thermal mass of the building is simplistically represented by one capacitor C and an overall resistance R. The internal heat gains Qg could include both solar heat gains coming through windows as well as thermal loads from lights, occupants and equipment generated from within the building. The simple one-node lumped model for this case is:
Fig. 7.25 A simplified 1R1C thermal network to model the thermal response of a building (i.e., variation of the indoor temperature Ti) subject to heat gains from the outdoor temperature To and from internal heat gains Qg. The overall resistance and the capacitance of the building are R and C respectively and QAC is the thermal heat load to be removed by the air-conditioner
298
7
C
dT i T o ðt Þ - T i = þ Qg ðt Þ - QAC ðt Þ dt R
ð7:62Þ
where To and Ti are the outdoor and indoor dry-bulb temperatures and QAC is the thermal cooling provided by the AC. For the simplified case of constant Qg and To and when the AC is switched off, the transient response of this dynamic system is given by: Δt T i - T i, min = 1 - e- τ T o - T i, min þ RQg
ð7:63Þ
where Δt is the time from when the AC is switched off. The time required for Ti to increase from Ti,min to Ti,max is then:
$$\Delta t = -\tau \ln\left[1 - \frac{T_{i,\max} - T_{i,\min}}{T_o - T_{i,\min} + R Q_g}\right] \qquad (7.64)$$
where τ is the time constant given by τ = RC. The thermal cooling energy ΔQAC avoided by the linear ramp-up strategy can also be determined in a straightforward manner by performing hourly calculations over the peak period using Eq. 7.62, since the Ti values are known in advance. The total thermal cooling energy saved during the peak period is easily deduced as:
$$\Delta Q_{AC} = C\,\frac{T_{i,\max} - T_{i,\min}}{\Delta t_{peak}} \qquad (7.65)$$
where Δtpeak is the duration of the peak period. For the dynamic optimal control strategy, the rise in Ti(t) over the peak period has to be determined. For the sake of simplification, let us discretize the continuous path into, say, hourly increments. Then, Eq. 7.62, with some rearrangement and minor notational change, can be expressed as:
$$Q_{AC,t} = -C\,T_{i,t+1} + T_{i,t}\left(C - \frac{1}{R}\right) + \frac{T_{o,t}}{R} + Q_{g,t} \qquad (7.66a)$$
while the electric power drawn by the AC can be modeled as:
$$P_t = f(Q_{AC}, T_i, T_o) \qquad (7.66b)$$
subject to the conditions of Eq. 7.61b. The above problem can be solved by framing it as one with several stages (each stage corresponding to an hour into the peak period) and states representing the discretized set of possible values of Ti (say, in steps of 0.5 °F). One would get a set of simultaneous equations (the order being equal to the number of stages) which could be solved together to yield the optimal trajectory. Though this is conceptually appealing, it is simpler to perform the computation using a software package given the nonlinear function for Pt (Eq. 7.66b) and the need to introduce constraints on P and Ti. An end-of-chapter problem is framed for the reader to gain working familiarity with the models discussed above.
Table 7.13 assembles peak AC power savings and daily energy savings for a small test building located in Palm Desert, CA, which was modeled using higher order differential equations by Lee and Braun (2008). The model parameters were deduced from experimental testing of the building and the AC equipment, and the models were then used to evaluate different thermostat control options. The table lists daily energy use and peak AC power for baseline operation (NS), against which the other schemes can be compared. Since the energy and demand reductions depend on the outdoor temperature, the tests have been grouped into three conditions (tests 1–3 for very hot days, tests 4–6 for milder days, and tests 7–8 for even milder days). The optimal strategy found by dynamic programming is clearly advantageous both in demand reduction and in diurnal energy savings, although the benefits show a certain amount of day-to-day variability. This could be because of diurnal differences in the driving functions and because of uncertainty in the determination of the model parameters. ■
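The staged formulation described in this example can be sketched as a small backward dynamic program. This is only an illustration under stated assumptions, not the procedure used by Lee and Braun (2008): the building parameters are borrowed from Pr. 7.17, the function f of Eq. 7.66b is assumed to be the part-load model of Eq. 7.84, the grid step is 0.5 °C, the objective is total AC electricity over the peak period, and all function and variable names are hypothetical.

```python
# Backward dynamic-programming sketch for the indoor-temperature trajectory.
# Assumed data (taken from Pr. 7.17 for illustration only); time step = 1 h.
R = 2.5                      # overall resistance, degC/kW
TAU = 6.0                    # building time constant, h
C = TAU / R                  # capacitance, kWh/degC (tau = R*C)
Q_G = 1.5                    # internal gains, kW
T_O = 32.0                   # outdoor temperature, degC
T_MIN, T_MAX = 22.0, 28.0    # thermostat limits, degC
N_HOURS = 8                  # stages: hours in the peak period
Q_RATED, COP_RATED = 4.0, 4.0

# States: discretized indoor temperatures in 0.5 degC steps.
GRID = [T_MIN + 0.5 * k for k in range(int((T_MAX - T_MIN) / 0.5) + 1)]


def q_ac(t_now, t_next):
    """Cooling rate (kW) that moves Ti from t_now to t_next in one hour (Eq. 7.66a)."""
    return -C * t_next + t_now * (C - 1.0 / R) + T_O / R + Q_G


def power(q):
    """Assumed form of Eq. 7.66b: part-load model of Eq. 7.84, zero when the AC is off."""
    if q <= 0.0:
        return 0.0
    plr = q / Q_RATED
    return (Q_RATED / COP_RATED) * (0.023 + 1.429 * plr - 0.471 * plr ** 2)


def optimal_trajectory(t_start=T_MIN):
    """Return (hourly temperature path, total AC electricity in kWh)."""
    inf = float("inf")
    n = len(GRID)
    cost = [[inf] * n for _ in range(N_HOURS + 1)]
    best_next = [[-1] * n for _ in range(N_HOURS)]
    cost[N_HOURS] = [0.0] * n                      # no cost after the peak period
    for hour in range(N_HOURS - 1, -1, -1):        # backward recursion over stages
        for i, t_now in enumerate(GRID):
            for j, t_next in enumerate(GRID):
                q = q_ac(t_now, t_next)
                if q < -1e-9 or q > Q_RATED + 1e-9:
                    continue                       # AC cannot heat or exceed its capacity
                candidate = power(q) + cost[hour + 1][j]
                if candidate < cost[hour][i]:
                    cost[hour][i] = candidate
                    best_next[hour][i] = j
    i = GRID.index(t_start)
    total = cost[0][i]
    path = [GRID[i]]
    for hour in range(N_HOURS):                    # forward pass to recover the path
        i = best_next[hour][i]
        path.append(GRID[i])
    return path, total


if __name__ == "__main__":
    path, total = optimal_trajectory()
    print("Ti trajectory (degC):", path)
    print("AC electricity over peak period (kWh):", round(total, 3))
```

A time-of-use tariff or a demand charge could be introduced by weighting `power(q)` with an hourly price; the recursion itself would not change.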
Table 7.13 Peak AC power reduction and daily energy savings compared to base operation under different control strategies for similar days in October. (From Lee and Braun 2008 © American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., www.ashrae.org)

| Test # | Tout,max (°C) | Control strategy (a) | Peak power (kW) | Peak savings (kW) | Energy use (kWh) | Energy savings (kWh) |
|---|---|---|---|---|---|---|
| 1 | 32.2 | NS (baseline) | 26.10 | – | 243.1 | – |
| 2 | 31.7 | LR | 23.53 | 2.57 | 226.5 | 16.6 |
| 3 | 32.8 | SU | 20.52 | 5.58 | 194.2 | 20.1 |
| 4 | 29.4 | NS (baseline) | 29.70 | – | 224.3 | – |
| 5 | 30.6 | DL | 20.03 | 9.67 | 219.2 | 5.2 |
| 6 | 30.6 | DL | 22.34 | 7.36 | 196.6 | 27.8 |
| 7 | 26.7 | NS (baseline) | 27.04 | – | 233.4 | – |
| 8 | 26.7 | DL | 16.94 | 10.10 | 190.4 | 43.0 |

(a) NS baseline operation, LR linear ramp-up, SU setup, DL demand limiting
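As a quick numerical check on Eqs. 7.63 and 7.64, the free-floating response of the 1R1C model can be evaluated directly. The sketch below is illustrative only; the numbers are assumed values borrowed from Pr. 7.17 (τ = 6 h, R = 2.5 °C/kW, Qg = 1.5 kW, To = 32 °C, set-point limits of 22 and 28 °C) and the function names are hypothetical.

```python
import math


def temperature_after(dt_h, t_start, t_out, r, q_gain, tau_h):
    """Indoor temperature dt_h hours after the AC is switched off (Eq. 7.63)."""
    t_inf = t_out + r * q_gain          # steady-state temperature with the AC off
    return t_start + (t_inf - t_start) * (1.0 - math.exp(-dt_h / tau_h))


def time_to_reach(t_target, t_start, t_out, r, q_gain, tau_h):
    """Hours for Ti to drift from t_start up to t_target (Eq. 7.64)."""
    ratio = (t_target - t_start) / (t_out - t_start + r * q_gain)
    if not 0.0 <= ratio < 1.0:
        raise ValueError("target temperature is not reached by free-floating drift")
    return -tau_h * math.log(1.0 - ratio)


if __name__ == "__main__":
    # Assumed illustrative values (see lead-in).
    dt = time_to_reach(28.0, 22.0, 32.0, 2.5, 1.5, 6.0)
    print("Hours to drift from 22 to 28 degC:", round(dt, 2))
    print("Check with Eq. 7.63:", round(temperature_after(dt, 22.0, 32.0, 2.5, 1.5, 6.0), 2))
```

With these assumed values the space takes roughly 3.4 h to drift from the lower to the upper limit, which is shorter than the 8 h peak period, so some cooling is unavoidable in the later hours.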
Problems
Pr. 7.4 Consider the following optimization problem:
Pr. 7.114 The collector area AC of a solar thermal system is to be determined which minimizes the total discounted savings Cs′ over n years. The solar system delivers thermal energy to an industrial process with the deficit being met by a conventional boiler system (this configuration is referred to as a solar-supplemented thermal system). Given the following expression for discounted savings: 17, 738:08 AC 1 - exp - 190:45
Cs 0 = 87, 875:66 -
- 2000 - 300AC ð7:67Þ
determine the optimal value of collector area AC. Verify your solution graphically and estimate a satisficing range of collector area values which are within 5% of the optimal value of Cs′. Pr. 7.2 Analytical approach (a) Use a graphical approach to determine the values of x1 and x2 which maximize the following function:
0:3x1 þ 0:1x2 ≤ 2:7 0:5x1 þ 0:5x2 = 6
x21
- x22
ð7:68Þ
0:6x1 þ 0:4x2 ≥ 6 x1 ≥ 0 and x2 ≥ 0
(a) Plot the constraints, identify the feasible solution space and locate the optimal point. Comment on impact that the constraints had on the optimal solution. (b) Solve this problem using calculus based methods and compare with results of part (a). Note that at the solution, only one constraint is active. Had this been known before hand, the solution would have been easier to find. Pr. 7.5 Solve this problem graphically: Objective function :
minimize fJ g = ðx1 - 1Þ2 þ x2 - 2
s:t:x2 - x1 = 1 x1 þ x2 ≤ 2
Plot the constraints, the objective function and locate the optimal point. Comment on how the solution is found. Pr. 7.6 Consider the following problem:
(b) Solve the problem analytically using the slack variable approach and verify your results. (c) Perform a sensitivity analysis of the optimum. Pr. 7.3 Consider this optimization problem
s:t:
ð7:70Þ
-4≤0
minimize
Objective function :
minimize f J g = x21 þ x22
ð7:71Þ
J = 0:4x1 þ 0:5x2 s:t:
Objective function : s:t: x1 - 10 ≤ 0
maximizef J g = ð3x1 þ 4x2 Þ 4 x1 þ 2x2 ≤ 80
2x1 þ 5x2 ≤ 180 x1 , x 2 ≥ 0 ð7:69Þ (a) Plot the constraints, identify the feasible solution space and locate the optimal point. (b) Solve this problem using calculus based methods and compare results with part (a).
ðx1 - 1Þ2 þ x2 - 2
subject to x2 - x1 = 1, x1 þ x2 ≤ 2:
Solve this problem graphically and clearly indicate the feasible solutions, the constraints and the optimal solution (similar to Fig. 7.11). Discuss results. Pr. 7.7 This problem is meant to give you a geometric understanding of how slightly different inequalities may complicate the solutions. The maximum is to be determined for the following function: y=
x2 5x4 35x3 þ þ - 25x2 þ 24x - 4 5 2 3
Solve this for two cases with different constraints: (a) subject to x2 - (4.1)2 = 0 (b) subject to x2 - (4.1)2 ≤ 0
14
From Reddy (1987).
ð7:72Þ
ð7:73Þ
Hint: Plot the function. You will see that for case (b), there are four solutions for the first derivative function, and you will have to determine the maximum or minimum by taking the second derivative. This problem also illustrates the phenomenon of local and global maxima.
Pr. 7.8 Find the minimal value of the unimodal function by different numerical search methods such that the final interval of uncertainty is within 5% of the range of variation of x:
$$y = (2x - 9)^2 \quad \text{subject to } 0 \le x \le 10 \qquad (7.74)$$
(a) Exhaustive search using equal intervals
(b) Golden section search
(c) Newton-Raphson method
Compare the number of function calls and discuss your results found by these methods.
Pr. 7.9 Use the steepest descent search method to minimize the following unconstrained optimization problem.
$$y = \frac{360}{x_1 x_3} + \frac{72 x_1}{x_2} + x_1 x_2 + 2 x_3 \qquad (7.75)$$
(a) Perform the first search calculation step assuming starting points: x1 = 5, x2 = 6, x3 = 8 (y = 115 at this point). Start by changing x1 by 1. You should find x1 = 4, x2 = 6.309, x3 = 7.946.
(b) Perform the second iterative calculation step by taking a value of 1.5 for x1.
(c) Perform the third iteration.
(d) The values at the location that minimizes the function are: x1 = 1.357, x2 = 8.485, x3 = 11.516 and the function value min (y) = 69.98. What can you say from the three steps?
Pr. 7.10 Lagrange approach The linear objective function to be minimized is
$$y = 2x_1 + 3x_2 \quad \text{s.t. } x_1 x_2^2 = 48 \qquad (7.76)$$
Solve this problem using (i) the Lagrange approach and (ii) the successive substitution method and graphically verify the optimal values of (x1 = 3, x2 = 4).
Pr. 7.11 Consider the following quadratic programming problem:
$$f(x_1, x_2) = 15x_1 + 30x_2 + 4x_1 x_2 - 2x_1^2 - 4x_2^2 \quad \text{s.t. } x_1 + 2x_2 \le 30 \qquad (7.77)$$
Fig. 7.26 Sketch of a parabolic trough solar power system. The energy collected from the collectors can be stored in the thermal storage tanks, which is used to produce steam to operate a Rankine power engine (Downloaded from http://www1.eere.energy.gov/solar/)
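The first search step quoted in part (a) of Pr. 7.9 can be reproduced with a few lines of code. The sketch below is a hedged illustration rather than a full solution: it assumes the common variant of steepest descent in which all variables move in proportion to the gradient and the step is scaled so that x1 changes by a prescribed amount. The function names are hypothetical.

```python
def y(x1, x2, x3):
    """Objective function of Eq. 7.75."""
    return 360.0 / (x1 * x3) + 72.0 * x1 / x2 + x1 * x2 + 2.0 * x3


def grad(x1, x2, x3):
    """Analytical partial derivatives of Eq. 7.75."""
    return (
        -360.0 / (x1 ** 2 * x3) + 72.0 / x2 + x2,
        -72.0 * x1 / x2 ** 2 + x1,
        -360.0 / (x1 * x3 ** 2) + 2.0,
    )


def step_with_fixed_dx1(point, dx1):
    """Move along the gradient direction, scaled so that x1 changes by exactly dx1.

    For a descent step dx1 must have the opposite sign of dy/dx1.
    """
    g = grad(*point)
    scale = dx1 / g[0]
    return tuple(p + scale * gi for p, gi in zip(point, g))


if __name__ == "__main__":
    start = (5.0, 6.0, 8.0)
    first = step_with_fixed_dx1(start, -1.0)   # reduce x1 by 1, as in part (a)
    print("y at start:", y(*start))
    print("point after first step:", tuple(round(v, 3) for v in first))
    print("y after first step:", round(y(*first), 2))
```

The same helper can be called again with a different `dx1` for the later iterations; choosing the step length is the part of the method the problem asks the reader to explore.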
Pr. 7.12 Use the Lagrange multiplier approach in conjunction with a search method to minimize the following constrained optimization problem:
$$J = x_1^3 + x_2^2 + \frac{x_3^3}{2} \quad \text{s.t. } x_1 - 2x_2 = 3, \quad x_1 + 1.5x_2 - x_3 = 5 \qquad (7.78)$$
Pr. 7.13 Solar thermal power system optimization Generating electricity via thermal energy collected from solar collectors is a mature technology which is cheaper (at least until now) than that from photovoltaic systems (if no rebates and financial incentives are considered). Such solar thermal power systems in essence comprise a solar collector field, a thermal storage, a heat exchanger to transfer the heat collected from the solar collectors to a steam boiler and the conventional Rankine steam power plant (Fig. 7.26). A simpler system without a storage tank will be analyzed here such that the fluid leaving the solar collector array at temperature Tco will directly enter the Rankine engine to produce electricity.
Problem 5.7 described the performance model of a flat-plate solar thermal collector; however, concentrating collectors are required for power generation. A simplified expression analogous to Eq. 5.67b for concentrating collectors with concentration ratio C is:
$$\eta_C = F_R \eta_{opt} - \frac{F_R U_L}{C}\,\frac{T_{ci} - T_a}{I_T} \qquad (7.79)$$
The efficiency of the solar system decreases with higher values of Tci while that of the Rankine power cycle increases with higher values of Tco such that the optimal operating point is the product of both efficiency curves (Fig. 7.27). The problem is to determine the optimal value of Tci, which maximizes the overall system efficiency given the following information:
(i) Concentration ratio C = 30
(ii) Beam solar irradiation IT = 800 W/m2
(iii) Trough collectors with: FRηopt = 0.75 and FRUL = 10.0 W/m2-°C
(iv) Ambient temperature Ta = 20 °C
(v) Assume that the Rankine efficiency of the steam power cycle is half of that of the Carnot cycle operating between the same high and low temperatures (take the low-end temperature to be 10 °C above Ta; the high temperature is to be determined from an energy balance on the solar collector: Tco = Tci + Qc/(m cp), where Qc is the useful energy collected per unit collector area (W/m2), the mass flow rate per unit collector area m = 8 kg/m2 h, and the specific heat of the heat transfer fluid cp = 3.26 kJ/kg°C).
Solve this unconstrained problem analytically and verify your result graphically. Perform a post optimality analysis and identify influential variables and parameters in this problem.
Fig. 7.27 Combined solar collector and Rankine engine efficiencies dictate the optimum operating temperature of the system
Pr. 7.14¹⁵ Minimizing pressure drop in ducts Using the method of Lagrange multipliers, determine the diameters of the circular duct in the system shown in Fig. 7.28 so that the drop in the static pressure between points A and B will be a minimum. Use the following additional information:
(i) Quantity of sheet metal available = 60 m2
(ii) Pressure drop in a section of straight duct of diameter D (m) and length L (m) with fluid flowing at velocity V (m/s),
$$\Delta p = f\,\frac{L}{D}\,\frac{V^2}{2}\,\rho \qquad (7.80)$$
where f is the friction factor = 0.02 and air density ρ = 1.2 kg/m3. Also, recall that the volumetric flow rate for circular ducts is given by:
$$Q = \pi D^2 V / 4 \qquad (7.81)$$
Neglect pressure drop in the straight-through section past the outlets and the influence of changes in velocity pressure. Use pertinent information from Fig. 7.28.
Pr. 7.15 Replacement of filters in HVAC ducts The HVAC air supply in hospitals has to meet high standards in terms of biological and dust-free cleanliness for which purpose high quality filters are used in the air ducts. Fouling by way of dust buildup on these filters causes additional pressure drops, which translates into an increased electricity consumption of the fan-motor circulating the air. Hence, the maintenance staff is supposed to replace these filters on a regular basis. Changing them too frequently results in undue expense due to the high cost of these filters, while not replacing them in a timely manner also increases the expense due to that associated with the pumping power. Determine the optimal filter replacement schedule under the following operating conditions (neglecting time value of money):
– The HVAC system operates 24 h/day and 7 days/week and circulates Q = 100 m3/s of air
15
From Stoecker (1989), with permission from McGraw-Hill.
Fig. 7.28 Ducting layout with pertinent information for Pr. 7.14
– The pressure drop in the HVAC duct when the filters are new is 5 cm of water or H = 0.05 m
– The pressure drop across the filters increases in a linear fashion by 0.01 m of water gauge for every 1000 h of operation (this is a simplification; the actual increase is likely to be exponential)
– The total cost of replacing all the filters is $2000
– The efficiency of the fan is 65% and that of the motor is 90%
– The levelized cost of electricity is $0.10 per kWh
The electric power consumed in kW by the fan-motor is given by:
$$E\,(\text{kW}) = \frac{Q\,(\text{L/s})\; H\,(\text{m})}{102\,\eta_{fan}\,\eta_{motor}} \qquad (7.82)$$
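The trade-off in Pr. 7.15 can be explored numerically with a short script. The sketch below is one possible way of setting it up, not the book's worked solution: it assumes the figure of merit is the average cost per operating hour over one replacement cycle (filter cost plus electricity), uses the linear pressure-rise simplification stated above, and converts the flow of 100 m3/s to 100,000 L/s for Eq. 7.82. Function and parameter names are hypothetical.

```python
def fan_power_kw(flow_l_per_s, head_m, eta_fan=0.65, eta_motor=0.90):
    """Electric power drawn by the fan-motor set (Eq. 7.82)."""
    return flow_l_per_s * head_m / (102.0 * eta_fan * eta_motor)


def average_hourly_cost(hours, flow_l_per_s=100_000.0, head_new_m=0.05,
                        head_rise_per_h=0.01 / 1000.0, filter_cost=2000.0,
                        price_per_kwh=0.10):
    """Average cost ($/h) over one cycle when the filters are replaced every `hours` hours."""
    # With a linear rise, the cycle-average head is the head at mid-cycle.
    mean_head = head_new_m + 0.5 * head_rise_per_h * hours
    energy_kwh = fan_power_kw(flow_l_per_s, mean_head) * hours
    return (filter_cost + price_per_kwh * energy_kwh) / hours


if __name__ == "__main__":
    print("Fan power with new filters (kW):", round(fan_power_kw(100_000.0, 0.05), 1))
    best_days = min(range(1, 366), key=lambda d: average_hourly_cost(24 * d))
    print("Replacement interval with lowest average cost (days):", best_days)
```

Plotting `average_hourly_cost` against the replacement interval shows how flat the cost curve is near its minimum, which indicates how much latitude the maintenance staff has.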
(Hint: Express and plot the electricity cost as a cumulative cost versus days of operation as shown in Table 7.14)
Pr. 7.16 Relative loading of two chillers Thermal performance models for chillers have been described in Sect. 10.2.3 of Chap. 10. The black box model given by Eq. 10.17 for the COP often appears in a modified form with the chiller electric power consumption P being expressed as:
$$P = a_0 + a_1 (T_{cdo} - T_{chi}) + a_2 (T_{cdo} - T_{chi})^2 + a_3 Q_{ch} + a_4 Q_{ch}^2 + a_5 (T_{cdo} - T_{chi})\,Q_{ch} \qquad (7.83)$$
where Tcdo and Tchi are the leaving condenser water and supply chilled water temperatures respectively, Qch is the chiller thermal load, and ai are the model parameters. Consider a situation where two chillers, denoted by Chiller A and Chiller B, are available to meet a thermal cooling load. The chillers are to be operated such that Tcdo = 85 °F and Tchi = 45 °F. Chiller B is more efficient than Chiller A at low relative load fractions and vice versa. Their model coefficients from performance data supplied by the chiller manufacturer are given in Table 7.15.
(a) The loading fraction of a chiller is the ratio of the actual thermal load supplied by the chiller to its rated value Qch-R. Use the method of Lagrange multipliers to prove that the optimum loading fractions y1* and y2* occur when the slopes of the curves are equal, i.e., when
$$\frac{\partial P_A}{\partial Q_{ch,A}} = \frac{\partial P_B}{\partial Q_{ch,B}}$$
(b) Determine the optimal loading (which minimizes the total power draw) of both chillers at three different values of Qch, namely 800, 1200, and 1600 Tons, and calculate the corresponding power draw. Investigate the effect of near-optimal operation and plot your results in a fashion useful for the operator of this cooling plant.

Table 7.14 Intermediate working solution for Pr. 7.15

Table 7.15 Values of coefficients in Eq. 7.83 (From ASHRAE 1999 © American Society of Heating, Refrigerating and Air-Conditioning Engineers, Inc., www.ashrae.org)

| Coefficient | Units | Chiller A | Chiller B |
|---|---|---|---|
| Qch-rated | Tons (cooling) | 1250 | 550 |
| a0 | kW | 106.4 | 119.7 |
| a1 | kW/°F | 6.147 | 0.1875 |
| a2 | kW/°F2 | 0.1792 | 0.04789 |
| a3 | kW/ton | -0.0735 | -0.3673 |
| a4 | kW/ton2 | 0.0001324 | 0.0005324 |
| a5 | kW/ton.°F | -0.001009 | 0.008526 |

Pr. 7.17 Comparing different thermostat control strategies The three thermostat pre-cooling strategies whereby air-conditioner (AC) electrical energy use in commercial buildings can be reduced have been discussed in Example 7.9.2. You will compare these three strategies for the following small commercial building and specified control limits: Assume that the RC network shown in Fig. 7.25 is an acceptable representation of the actual building. The building time constant is 6 h and its overall heat transfer resistance R = 2.5 °C/kW. The internal loads of the space can be assumed constant at Qg = 1.5 kW. The peak period lasts for 8 h and the ambient temperature can be assumed constant at To = 32 °C. The minimum and maximum thermostat control set points are Ti,min = 22 °C and Ti,max = 28 °C. A very simple model for the AC is assumed (Reddy et al. 2016):
$$P_{AC} = \frac{Q_{AC,R}}{COP_R}\left(0.023 + 1.429\,PLR - 0.471\,PLR^2\right) \qquad (7.84)$$
where PLR = part load ratio = (QAC/QAC,R) and COPR is the Coefficient of Performance of the reciprocating chiller. Assume the rated values of COPR = 4.0 and QAC,R = 4.0 kW.
(a) Usually Ti,max is not the set point temperature of the space. For pre-cooling strategies to work better, a higher temperature is often selected since occupants can tolerate this increased temperature for a couple of hours without adverse effects. Calculate the AC electricity consumed and the demand during the peak period if Tr = 26 °C; this will serve as the baseline electricity consumption scenario.
(b) For the simple-minded setup strategy, first compute the number of hours for the selected Ti,max to be reached starting from Ti,min. Then calculate the electricity consumed and the demand by the AC during the remaining number of hours left in the peak period if the space is kept at Ti,max.
(c) For the ramp-up strategy, calculate the electricity consumed and the demand by the AC during the peak period (by summing those at hourly time intervals).
(d) Assuming that the thermostat is being controlled at hourly intervals, determine the optimal trajectory of the indoor temperature. What is the corresponding AC electricity use and demand?
(e) Summarize your results in a table similar to Table 7.13, and discuss your findings.
Pr. 7.18 Maximizing flow under a compromised network topology Consider the flow transportation network shown in Fig. 7.21 in Sect. 7.5.6.
(a) Analyze the same problem as described and compare your results with those assembled in Table 7.7.
(b) Analyze this network for different failure scenarios by breaking one link at a time and determining the most critical link based on the criterion of maximizing the flow between nodes 1 and 7.
(c) Heuristically identify ways to enhance the overall robustness of this network under the most critical case by making changes to the network topology.
(d) Analyze these mitigation measures using your optimization model and identify better ways to maintain functionality as far as possible.
Pr. 7.19 Electric generation planning using linear programming The state of Hawaii consists of 137 islands (with eight big ones) and has set very ambitious goals of meeting its energy needs while greatly reducing its greenhouse gas (GHG) emissions. In fact, the target is to achieve a 100% Renewable Portfolio Standard (RPS) by 2045. Several detailed studies have been performed (e.g., GE 2015) and extensive data on electricity loads, generation technologies and costs of installing and operating them as well as associated GHG emissions are available (USEIA 2021; USEPA 2020). Such analyses fall under generation expansion planning (GEP), which involves determining the generation types and their mix, the unit sizes and number of generation units, and their location, so as to meet preset GHG targets with prespecified constraints on initial costs and limits placed on maximum capacities of different technologies. How to operate the “optimal” generation assets properly over the year under operating and cost constraints is a subsequent issue. The analysis involves dynamic programming since the expansion is done over several decades in finite planning periods (say in 5-year time intervals) and involves non-linear models and complex constraints in terms of rate of expansion, etc. A comprehensive review of various GEP models and studies is provided by Sadeghi et al. (2017). This analysis is limited to the island of Oahu, which is served by a single independent and stand-alone electric utility. The detailed analysis of multiple scenarios over a 30-year time horizon is treated as a dynamic programming problem by Sabatino (2021). This problem uses realistic data that have been slightly modified and greatly simplified in terms of data and constraints so as to be amenable to analysis using linear programming techniques. Table 7.16 assembles pertinent data for the island of Oahu.
There are eight different generation technologies consisting of two fossil-based and the rest falling under renewable technologies. The two solar photovoltaic (PV) options along with the two wind generation technologies are not dispatchable, i.e., they generate and can dispatch power depending on the availability of the solar and wind resources at that time and location (and the energy cannot be stored).
(i) The first row shows the current installed capacity (in 2015) of the various technologies while the second row shows the energy generated in GWh/year.
(ii) The capacity factor indicates the fraction of the year when the plant is operated at maximum (or rated) capacity. For example, a capacity factor of 0.5 would imply that the plant is operated (8760/2) = 4380 h in the year. Dispatchable plants (such as oil, coal, biodiesel and biomass) can be operated at will while the four solar and wind technology plants are not dispatchable and are outside operator control.
(iii) The initial cost or capital cost is assembled in the 5th row.
(iv) The next three rows list the variable cost of generation (e.g., the fuel cost), the fixed cost, and the decommissioning cost (you will not need the last set of values; they are provided in case the reader wishes to evaluate alternate scenarios not specified in this problem).
(v) The last row indicates the CO2 emissions of each technology.
You will perform the following broad scenario evaluations at the annual level (neglecting the diurnal and seasonal fluctuations in load demand and resource availability which a rigorous study would entail):
(a) Business-as-usual. For the specified generation mix and neglecting CO2 emission considerations, frame the optimization problem as one that minimizes the operating cost for the electric utility while meeting the annual load demand. Note that the oil, coal, biodiesel and biomass plants can be controlled and operated at will (i.e., their capacity factors can be changed within the specified range shown in the table) while those for solar and wind are fixed and cannot be changed.
Table 7.16 Pertinent electric generation data for the Hawaiian island of Oahu (Problem 7.19) (a)

| Generating units | Oil | Coal | DistributedPV | CentralPV | Onshorewind | Offshorewind | Biodiesel | Biomass | Total |
|---|---|---|---|---|---|---|---|---|---|
| Current capacity-2015 (MW) | 1286 | 180 | 471 | 18 | 99 | 0 | 112 | 69 | 2235 |
| Energy Gen (GWh/year) | 4200 | 1400 | 464 | 41 | 216 | 0 | 52 | 386 | 6759 |
| Capacity factor (range) | 0.48–0.87 | 0.61–0.85 | 0.2 | 0.2 | 0.35 | 0.15 | 0.05–0.3 | 0.6–0.7 | |
| Capital cost ($/MW) | 1600 | 2400 | 3945 | 2793 | 2465 | 5062 | 1237 | 5251 | |
| Variable cost ($/MWh) | $4.49 | $7.09 | $0.00 | $0.00 | $0.00 | $0.00 | $12.99 | $12.98 | |
| Fixed cost ($/kW-year) | $17.29 | $50.49 | $0.00 | $22.97 | $27.40 | $96.71 | $9.01 | $79.05 | |
| Decommission cost ($/MW) | $31,000 | $117,000 | $0 | $57,000 | $51,000 | $212,000 | $31,000 | $117,000 | |
| CO2 emissions (tons/MWh) | 0.72 | 0.86 | 0.10 | 0.10 | 0.02 | 0.02 | 0.25 | 1.20 | |

(a) Data available electronically on book website
(b) Repeat the above analysis but assuming a societal cost of $100/ton for CO2 emissions. Compare the results with scenario (a).
(c) Compute the total capital cost of the current installed power generation infrastructure, as well as the annual operating cost. Assuming a 25-year life for all technologies (strictly speaking oil and coal plants have longer lifetimes), calculate the present worth assuming a discount factor of 4%.
(d) 50% RPS scenario. If the CO2 emissions have to be reduced by 50% compared to the business-as-usual scenario, determine the necessary generation mix assuming a societal cost of $100/ton for CO2 emissions. The constraints are that (i) the total capital cost is the same as that determined in step (c), (ii) that biodiesel capacity 0
ð8:37bÞ
while PACF is φ11 = r1 and φ22 requires an iterative solution. There are no simple formulae to derive the PACF for orders higher than 2, and hence software programs involving iterative equations, known as the Yule-Walker equations, are used to estimate the parameters of the AR model (see, for example, Box and Luceno 1997). Two processes, one for AR(1) with a positive coefficient and the other for AR(2) with one positive and one negative coefficient, are shown in Figs. 8.23 and 8.24 along with their respective ACF and PACF plots. The ACF of the AR(1) process, though the model is of order 1, dies down exponentially, and this is where the PACF is useful. Only one PACF term is statistically significant at the 95% confidence level for AR(1) in Fig. 8.23 while two terms are so in Fig. 8.24 (as it should be). The process mean line for AR(1) and the constant term appearing in the model are related:
$$\mu = \frac{c}{1 - \varphi} = \frac{5}{1 - 0.8} = 25$$
which is consistent with the process behavior shown in Fig. 8.23a. For AR(2), the process mean line is
$$\mu = \frac{c}{1 - \varphi_1 - \varphi_2} = \frac{25}{1 - 0.8 + 0.8} = 25$$
which is consistent with Fig. 8.24a. For the ACF: r1 = 0.8/(1 - (-0.8)) = 0.44 and r2 = -0.8 + (0.8)^2/(1 - (-0.8)) = -0.44. The latter value is slightly different from the value of about -0.5 shown in Fig. 8.24b, which is due to the white noise introduced in the synthetic sequence. Finally, Fig. 8.25 illustrates an ARMA(1,1) process where elements of both MA(1) and AR(1) processes are present. Note the exponential damping of both the ACF and the PACF. These four sets of figures (Figs. 8.22, 8.23, 8.24, and 8.25) partially illustrate the fact that one can model a stochastic process using different models, a dilemma which one faces even when identifying classical OLS models. Hence, evaluation of competing models using a cross-validation sample is highly advisable, as is investigating whether there is any correlation structure left in the residuals of the series after the stochastic effect has been removed. These tests closely parallel those which one would perform during OLS regression.
Fig. 8.23 (a–c) One realization of an AR(1) process for Zt = (5 + 0.8Zt-1 + εt) along with corresponding ACF and PACF with error term being Normal(0,1)
Fig. 8.24 (a–c) One realization of an AR(2) process for Zt = (25 + 0.8Zt-1 - 0.8Zt-2 + εt) along with corresponding ACF and PACF with error term being Normal(0,1)
Fig. 8.25 One realization of an ARMA(1,1) process for Zt = (15 + 0.8Zt-1 + εt + 0.9εt-1) along with corresponding ACF and PACF with error term being Normal(0,1)
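The process means and the first two autocorrelations quoted above can be checked with a few lines of code. The sketch below uses only the standard library; the simulation routine, its seed and its burn-in length are assumptions made for illustration, and the relations coded are the standard ones for a stationary AR process (mean c/(1 - Σφ), and for AR(2) r1 = φ1/(1 - φ2), r2 = φ2 + φ1²/(1 - φ2)).

```python
import random


def ar_mean(c, phis):
    """Process mean of a stationary AR(p) model with constant term c."""
    return c / (1.0 - sum(phis))


def ar2_acf(phi1, phi2):
    """Theoretical lag-1 and lag-2 autocorrelations of an AR(2) process."""
    r1 = phi1 / (1.0 - phi2)
    r2 = phi2 + phi1 ** 2 / (1.0 - phi2)
    return r1, r2


def simulate_ar(c, phis, n, seed=0, burn_in=500):
    """One realization of Z_t = c + sum(phi_k * Z_{t-k}) + eps_t with eps ~ Normal(0, 1)."""
    rng = random.Random(seed)
    p = len(phis)
    z = [ar_mean(c, phis)] * p                 # start at the process mean
    for _ in range(n + burn_in):
        past = z[-1:-p - 1:-1]                 # most recent value first
        z.append(c + sum(f * v for f, v in zip(phis, past)) + rng.gauss(0.0, 1.0))
    return z[-n:]


if __name__ == "__main__":
    print("AR(1) mean:", round(ar_mean(5.0, [0.8]), 2))
    print("AR(2) mean:", round(ar_mean(25.0, [0.8, -0.8]), 2))
    print("AR(2) r1, r2:", tuple(round(r, 2) for r in ar2_acf(0.8, -0.8)))
    series = simulate_ar(25.0, [0.8, -0.8], 20000)
    print("AR(2) sample mean:", round(sum(series) / len(series), 2))
```

A single short realization, such as those plotted in the figures, will show sample autocorrelations that scatter around these theoretical values.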
8.5.3.5 Identification and Forecasting
One could use the entire ARIMA model structure as described above to identify a complete model. However, with the intention of simplifying the process of model identification, and in recognition of the importance of AR models, the procedures for identifying an AR(p) model with n data points, and then using it for forecasting purposes, are summarized below.
(a) To identify an AR(p) model:
(i) Make the time series stationary. Evaluate different trend and seasonal models using OLS, and identify the best one based on internal and external predictive accuracies:
$$\hat{y}_t = b_0 + b_1 x_{1,t} + \cdots + b_k x_{k,t} \qquad (8.38a)$$
where the x terms are regressors which account for the trend and seasonal variation and k is the number of x terms.
(ii) Calculate the residual series as the difference between observed and predicted values: Z_t = (y_t - ŷ_t)
(iii) Determine the ACFs of the residual series for different lags: r1, r2, ..., rp
(iv) Determine the PACF function of the residual series for different lags: φ11, φ22, ..., φpp
(v) Generate correlograms for the ACF and PACF, and make sure that the series is stationary
(vi) Evaluate different AR models based on their internal and external predictive accuracies, and select the most parsimonious AR model (often, 1 or 2 terms should suffice). Then use an overall model as shown to predict future values:
$$y_t = (b_0 + b_1 x_{1,t} + \cdots + b_k x_{k,t}) + \phi_1 Z_{t-1} + \phi_2 Z_{t-2} + \cdots + \phi_p Z_{t-p} \qquad (8.38b)$$
(vii) Calculate the RMSE of the overall model (trend plus seasonal plus stochastic).
(b) To forecast a future value y_{t+1} when updating is possible at each step (i.e., the value y_t is known):
(i) Calculate the series: Z_t, Z_{t-1}, Z_{t-2}, ..., Z_{t-p}
(ii) Estimate
$$Z_{t+1} = \phi_1 Z_t + \phi_2 Z_{t-1} + \cdots + \phi_p Z_{t-p+1}$$
(iii) Finally, use the overall model (Eq. 8.38b) modified to time step (t + 1) as follows:
$$y_{t+1} = (b_0 + b_1 x_{1,t+1} + \cdots + b_k x_{k,t+1}) + Z_{t+1} \qquad (8.38c)$$
(iv) Calculate approximate 95% prediction intervals for the forecast given by:
$$y_{t+1} \pm 1.96\,\text{RMSE} \qquad (8.39)$$
(v) Re-initialize the series by setting Z_t as the residual of the most recent step and repeat steps (i) to (iv).
(c) To forecast a future value when updating is not possible over the time horizon:
In case one lacks observed values for future forecasts (such as having to make forecasts over a horizon involving several time steps ahead), one must proceed in a recursive manner as follows. The first forecast is made as before, but now one is unable to compute the model error which is to be used for predicting the second forecast, and the subsequent accumulation of errors widens the confidence intervals as one predicts further into the future.
(i) Future forecast y_{t+2} for the case when no updating is possible (i.e., y_{t+1} is not known and so one cannot determine Z_{t+1} from an observation; the estimate Ẑ_{t+1} from step (b)(ii) is used in its place):
$$Z_{t+2} = \phi_1 \hat{Z}_{t+1} + \phi_2 Z_t + \cdots + \phi_p Z_{t-p+2}$$
$$y_{t+2} = (b_0 + b_1 x_{1,t+2} + \cdots + b_k x_{k,t+2}) + Z_{t+2} \qquad (8.40)$$
and so forth.
(ii) An approximate 95% prediction interval for m-steps should be determined, and this is provided by the software program used. For the simple case of AR(1), for forecasts m time-steps ahead:
$$y_{t+m} \pm 1.96\,\text{RMSE}\left[1 + \phi_1^{2} + \cdots + \phi_1^{2(m-1)}\right]^{1/2} \qquad (8.41)$$
Note that the ARMA models are usually written as equations with fixed estimated parameters representing a stochastic structure that does not change with time. Hence, such models are not adaptive. This is the reason why some researchers caution against the use of these models for forecasting several time steps ahead when updating is not possible.
Example 8.5.2 AR model for peak electric demand
Consider the same data shown in Table 8.1 for the electric utility which consists of four quarterly observations per year for 12 years (from 1974 to 1985). The use of the AR(1) model with this data set will be illustrated and its importance highlighted as compared to the various models described earlier.
The trend and seasonal model is given in Example 8.4.2. This model is used to calculate the residuals {Zt} for each of the 44 data points. The ACF and PACF functions for {Zt} are shown in Fig. 8.26. Since the PACF cuts off abruptly after lag 1, it is concluded that an AR(1) model is adequate to model the stochastic residual series. The corresponding model was identified by OLS:
$$Z_t = 0.657\,Z_{t-1} + a_t$$
Note that the value φ1 = 0.657 is consistent with the value shown in the PACF plot of Fig. 8.26. The RMSE for the trend and seasonal model during training was 7.86 (see Example 8.4.3). For the complete trend and seasonal plus AR(1) model, this reduces to 4.86, indicating an important improvement. The various models illustrated in previous sections have been compared in terms of their internal prediction errors during training. The more appropriate manner of comparing them is in terms of their external prediction errors such as bias and RMSE when applied to the testing data. The peak loads for the next four quarters for 1986 will be used as the basis of comparison.
Fig. 8.26 The ACF and PACF functions for the residuals in the time series data after removing the linear trend and seasonal behavior (Example 8.5.2)
The AR model can also be used to forecast the future values for the four quarters of 1986 (y49 to y52). First, one determines the residual for the last quarter of 1985 (Z48) using the trend and seasonal model to forecast yLS,48 . The AR (1) correction is subsequently determined and the forecast for the first quarter of 1986 or y49 is computed from:
$$\begin{aligned}
Z_{48} &= Y_{48} - Y_{LS,48} = 135.1 - (149.05) = -13.95 \\
Z_{49} &= r_1 Z_{48} = (0.657)(-13.95) = -9.16 \\
Y_{49} &= Y_{LS,49} + Z_{49} = (164.34) - 9.16 = 155.2
\end{aligned}$$

Table 8.6 Summary of the RMSE values of various OLS regression models when applied to the electric utility load data given in Tables 8.1 and 8.2 (one-step forecast for the 4 quarters of 1986)

| Model | Training (1974–1985) | Testing (1986) |
|---|---|---|
| Linear | 12.00 | 13.48 |
| Linear + Seasonal | 7.86 | 12.12 |
| Linear + Seasonal + AR(1) | 4.86 | 7.80 |
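The arithmetic of this one-step forecast can be retraced with a few lines of Python. This is only a sketch: the function name `ar1_corrected_forecast` and its argument names are illustrative choices, not notation from the text, and the numerical inputs are the values quoted above. Because the AR(1) coefficient is entered here rounded to three decimals, the intermediate residual may differ from the printed value in the last digit.

```python
def ar1_corrected_forecast(y_last, trend_last, trend_next, phi1):
    """One-step forecast: trend/seasonal prediction plus an AR(1) residual correction.

    y_last     : last observed value (here Y48)
    trend_last : trend + seasonal model value for the last period (Y_LS,48)
    trend_next : trend + seasonal model value for the next period (Y_LS,49)
    phi1       : AR(1) coefficient identified from the residual series
    """
    z_last = y_last - trend_last      # residual of the last observed period
    z_next = phi1 * z_last            # AR(1) forecast of the next residual
    y_next = trend_next + z_next      # corrected forecast
    return z_last, z_next, y_next


z48, z49, y49 = ar1_corrected_forecast(
    y_last=135.1, trend_last=149.05, trend_next=164.34, phi1=0.657
)
print(f"Z48 = {z48:.2f}, Z49 = {z49:.2f}, Y49 = {y49:.1f}")
```

The same function can be reused for the following quarters once the corresponding observation becomes available and the residual is updated.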
Finally,⁹ the 95% confidence limits can be stated as:

$$y_{49} \pm 1.96\,\mathrm{RMSE} = 155.2 \pm 1.96 \times 9.01 = 155.2 \pm 17.66 = (137.54,\ 172.86)$$

Similarly, the values of the three other forecast steps are found to be: $y_{50} = 140.01$, $y_{51} = 162.12$ and $y_{52} = 147.76$. Table 8.6 assembles the RMSE values of various regression models used in previous examples as well as for the [OLS regression + AR(1)] model. The RMSE prediction errors of all models are higher for the testing data than for the training data (as is generally the case). The [OLS regression + AR(1)] model is clearly the most accurate. Note, however, that the models are tentative since the residual behavior is not satisfactory and detrending is inadequate (see Example 8.4.2). One of the problems at the end of this chapter requires that alternative regression models be evaluated to overcome these deficiencies and that the entire analysis be redone.

⁹ This step is statistically unsound since the model residuals are still ill-behaved.

8.5.4 Recommendations on Model Identification
As stated earlier, several texts suggest that, in most cases, it is not necessary to include both the AR and the MA elements; one of these two, depending on system behavior, should suffice. Further, it is recommended that low-order models (order 3 or less) should be adequate in most instances provided the seasonality has been properly filtered out. Other texts state that adopting an ARMA model is likely to result in a model with fewer terms than those of a pure MA or AR process by itself (Chatfield 1989). Some recommendations on how to identify and evaluate a stochastic model are summarized below.

(a) Stationarity check: Whether the series is stationary or not in its mean trend is easily identified (one must also verify that the variance is stable, for which transformations such as taking logarithms may be necessary). A non-constant trend in the underlying mean value of a process will result in the ACF not dying out rapidly. If seasonal behavior is present, one can detect it from the ACF as well since it will exhibit a decay with cyclic behavior. The seasonality effect needs to be removed by using an appropriate regression model (e.g., the traditional OLS or even a Fourier series model) or by differencing. Figure 8.20 illustrates how a seasonal trend shows up in the ACF of the time series data of Table 8.1. The seasonal nature of the time series is reflected in the rectified sinusoidal behavior of the ACF. Differencing is another way of detrending the series which is especially useful when the cyclic behavior is known (such as 24 h lag differencing for electricity use in buildings). If differencing twice does not remove the seasonality, consider a transformation of the time series data using natural logarithms.

(b) Model selection: The correlograms of both the ACF and the PACF are the appropriate means for identifying the model type (whether ARIMA, ARMA, AR or MA) and the model order. The identification procedure can be summarized as follows (McCain and McCleary 1979):
(i) For AR(1): ACF decays exponentially, PACF has a spike at lag 1, and other spikes are not statistically significant, i.e., are contained within the 95% confidence intervals
(ii) For AR(2): ACF decays exponentially (indicative of positive model coefficients) or with sinusoidal-exponential decay (indicative of a positive and a negative coefficient), and PACF has two statistically significant spikes
(iii) For MA(1): ACF has one statistically significant spike at lag 1 and PACF damps down exponentially
(iv) For MA(2): ACF has two statistically significant spikes (one at lag 1 and one at lag 2), and PACF has an exponential decay or a sinusoidal-exponential decay
(v) For ARMA(1,1): ACF and PACF have spikes at lag 1 with exponential decay.

Usually, it is better to start with the lowest values of p and q for an ARMA(p, q) process. Subsequently, the model order is increased until no systematic patterns are evident in the residuals of the model. Most time series data from engineering experiments or from physical systems or processes should be adequately modeled by low orders, i.e., about 1–3 terms. If higher orders are required, the analyst should check the data for bias or unduly large noise effects. Cross-validation using the sample holdout approach is strongly recommended for model selection since this avoids over-fitting and would better reflect the predictive capability of the model.
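The identification rules above rely on the sample ACF and PACF and on their approximate 95% bands. A minimal pure-Python sketch is given below. It assumes the usual sample autocorrelation estimator, obtains the PACF from the ACF with the Durbin-Levinson recursion, and uses the approximate band ±1.96/√n; the function names and the simulated AR(1) series (coefficient 0.657, chosen to echo the earlier example) are illustrative assumptions, not the book's data or code.

```python
import math
import random


def sample_acf(series, max_lag):
    """Sample autocorrelations r_0 .. r_max_lag of a series."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]


def pacf_from_acf(acf):
    """Partial autocorrelations from autocorrelations (Durbin-Levinson recursion)."""
    pacf = [1.0]
    phi_prev = []                      # coefficients of the order (k-1) AR fit
    for k in range(1, len(acf)):
        num = acf[k] - sum(phi_prev[j] * acf[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * acf[j + 1] for j in range(k - 1))
        phi_kk = num / den
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


# Illustration on a simulated AR(1) series (hypothetical data).
random.seed(1)
z = [0.0]
for _ in range(499):
    z.append(0.657 * z[-1] + random.gauss(0.0, 1.0))

acf = sample_acf(z, max_lag=6)
pacf = pacf_from_acf(acf)
band = 1.96 / math.sqrt(len(z))        # approximate 95% significance band
for k in range(1, 7):
    flag = "significant" if abs(pacf[k]) > band else "within band"
    print(f"lag {k}: ACF = {acf[k]: .3f}, PACF = {pacf[k]: .3f} ({flag})")
```

For an AR(1) process one would expect the ACF to decay gradually and only the first PACF spike to stand clearly outside the band, which is pattern (i) in the list above.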
The model selection is somewhat subjective as described above. To circumvent this arbitrariness, objective criteria have been proposed for model selection. Wei (1990) describes several such criteria: the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the Criterion for Autoregressive Transfer function (CAT), to name three of several indices.

(c) Model evaluation: After a tentative time series model has been identified and its parameters estimated, a diagnostic check must be made to evaluate its adequacy.
This check could consist of two steps as described below:
(i) The autocorrelation function of the simulated series (i.e., the time series generated by the model) and that of the original series must be close.
(ii) The residuals from a satisfactory model should be white noise. This would be reflected by the sample autocorrelation function of the residuals being close or equal to zero. Since it is assumed that the random error terms in the actual process are normally distributed and independent of each other (i.e., white noise), the model residuals should also behave similarly. This is tested by computing the sample autocorrelation function for lag k of the residuals. If the model is correctly specified, the residual autocorrelations $r_k$ (up to about k = 15 or so) are themselves uncorrelated, normally distributed random variables with mean 0 and variance (1/n), where n is the number of observations in the time series. Finally, the sum of the squared independent normal random variables, denoted by the Q statistic, is computed as:

$$Q = n \sum_{k=1}^{K} r_k^2 \tag{8.42}$$
Q must be approximately distributed as chi-square ($\chi^2$) with (K − p − q) degrees of freedom. Table A.5 provides the critical value needed to decide whether or not to accept the hypothesis that the model is adequate.

Despite their obvious appeal, a note of caution on ARMA models is warranted. Fitting reliable multivariate time series models is difficult. For example, in the case of non-experimental data, which is not controlled, there may be high correlation between and within series which may or may not be real (there may be mutual correlation with time). An apparently good fit to the data may not necessarily result in better forecasting accuracy than using a simpler univariate model. Though ARMA models usually provide very good fits to the data series, a much simpler method may often give predictions that are just as good. An implied assumption of ARMA models is that the data series is stationary and normally distributed. If the data series is not, it is important to find a suitable transformation to make it normally distributed prior to OLS model fitting. Further, if the data series has non-random disturbance terms, the maximum likelihood estimation (MLE) method described in Sect. 9.4.2 is said to be statistically more efficient than OLS estimation. The reader can refer to Wei (1990), Box and Jenkins (1976), or Montgomery and Johnson (1976) for more details. The autocorrelation function and the spectral method are closely related; the latter can provide insight into the appropriate order of the ARMA model (Chatfield 1989).

Fig. 8.27 Conceptual difference between the single-variate ARMA approach and the multivariate ARMAX approach applied to dynamic systems. (a) Traditional ARMA approach. (b) ARMAX approach
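The residual whiteness check based on the Q statistic of Eq. 8.42 can be sketched as follows. This is a hedged illustration rather than a reference implementation: the function names are hypothetical, the chi-square critical value is assumed to be read from a table (such as Table A.5) by the user rather than computed, and the alternating residual series is a deliberately non-white toy input.

```python
def autocorrelation(residuals, lag):
    """Sample autocorrelation r_k of the residual series at the given lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [e - mean for e in residuals]
    c0 = sum(d * d for d in dev)
    return sum(dev[t] * dev[t + lag] for t in range(n - lag)) / c0


def q_statistic(residuals, max_lag):
    """Q = n * sum of squared residual autocorrelations for lags 1..K (Eq. 8.42)."""
    n = len(residuals)
    return n * sum(autocorrelation(residuals, k) ** 2
                   for k in range(1, max_lag + 1))


def residuals_look_white(residuals, max_lag, p, q, chi2_critical):
    """Compare Q with a tabulated chi-square critical value.

    chi2_critical must correspond to (max_lag - p - q) degrees of freedom
    at the chosen significance level; it is supplied by the user.
    """
    dof = max_lag - p - q
    if dof <= 0:
        raise ValueError("max_lag must exceed p + q")
    return q_statistic(residuals, max_lag) <= chi2_critical


# Toy example: strongly alternating residuals are clearly not white noise.
toy = [1.0, -1.0] * 10
print("Q for K = 5:", round(q_statistic(toy, 5), 3))
# 9.488 is the 5% critical value of chi-square with 4 degrees of freedom.
print("white noise?", residuals_look_white(toy, max_lag=5, p=1, q=0,
                                           chi2_critical=9.488))
```

In practice K is taken much larger than in this toy case (the text suggests lags up to about 15), and the residuals come from the fitted ARMA model.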
8.6 ARMAX or Transfer Function Models

8.6.1 Conceptual Approach and Benefit
The ARIMA models presented above involve detrending the data (via the “Integrated” component) prior to modeling. The systematic stochastic component is modeled by ARMA models which are univariate by definition since they only consist of detrended lagged series of a single variable {Zt} and do not include the effects of known forcing variables. An alternate form of detrending is to use OLS models such as described in Sect. 8.4 which can involve indicator variables for seasonality, and the time variable to detrend. One can even have other “independent” variables X appear in the OLS model if necessary; for example: Yt = f(Xt). However, such models do not contain lagged variables in X, as shown in Fig. 8.27a. That is why the ARMA models are said to be basically univariate since they relate to the detrended series {Zt}. This series is taken to be the detrended response of white noise plus a feedback loop whose effect is taken into consideration via the variation of the lagged variables. Thus, the stochastic time series data points are in equilibrium over time and fluctuate about a mean value. However, there are systems whose response cannot be satisfactorily modeled using ARMA models alone since their mean values vary greatly over time. The obvious case is of dynamic systems which have some sort of feedback in the independent or regressor variables, and explicit recognition of such effects is warranted. Traditional ARMA models would then be of limited use since the error term would include some of the structural variation which one could directly attribute to the variation of the regressor variables. Thus, the model predictions could be biased with uncertainty bands so large as to make predictions very poor, and often
useless. In such cases, a regression model relating the dependent variable with lagged values of itself plus current and lagged values of the forcing inputs, plus the error term captured by the time-series model, is likely to be superior to the ARMA models alone (see Fig. 8.27b). Instead of two separate processes of detrending to reach stationarity and then regression, the two steps can be combined. Such a model formulation is called "Multivariate ARMA" (or MARMA) or, more colloquially, ARMAX or transfer function models. Such models have found extensive applications in engineering, econometrics, and other disciplines as well, and are briefly described below.
8.6.2 Transfer Function Modeling of Linear Dynamic Systems
Dynamic systems are modeled by differential equations. An example taken from Reddy et al. (2016) will illustrate how linear differential equations can be recast as transfer function models such that the order of the differential equation is equal to the number of lag terms in a time series. Consider a simple building represented by a 1C1R thermal network as shown in Fig. 7.25 (see Sect. 1.4.4 for an introductory discussion about representing the transient heat conduction through a plane wall by electrical network analogues). The internal node is the indoor air temperature $T_i$ which is assumed to be closely coupled to the thermal mass of the building or room. This node is impacted by internal heat loads generated from people and various equipment (Q) and also by heat conduction from the outdoors at temperature $T_o$ through the outdoor wall with an effective resistance R. The thermal performance of such a system is modeled by:

$$C\,\dot{T}_i = \frac{T_o - T_i}{R} + Q \tag{8.43}$$

where $\dot{T}_i$ is the time derivative of $T_i$.
Introducing the time constant $\tau = RC$, the above equation can be re-written as:

$$\tau\,\dot{T}_i + T_i = T_o + RQ \tag{8.44}$$

For the simplifying case when both the driving terms $T_o$ and Q are constant, one gets

$$\tau\,\dot{T}(t) + T(t) = 0 \quad \text{where } T(t) = T_i(t) - T_o - RQ \tag{8.45}$$
The solution is

$$T(t+1) = T(0)\exp\!\left(-\frac{t+1}{\tau}\right) = T(0)\exp\!\left(-\frac{t}{\tau}\right)\exp\!\left(-\frac{1}{\tau}\right) = T(t)\exp\!\left(-\frac{1}{\tau}\right) \tag{8.46}$$

which can be expressed as:

$$T(t) + a_1 T(t-1) = 0 \quad \text{where } a_1 = -\exp\!\left(-\frac{1}{\tau}\right) \tag{8.47}$$

The first order ODE is, thus, recast as the traditional single-variate AR(1) model with one lag term. In this case, there is a clear interpretation of the coefficient $a_1$ in terms of the time constant of the system. Example 7.9.2 in Chap. 7 illustrates the use of such models in the context of operating a building to minimize the cooling energy use and demand during the peak period of the day.

Reddy et al. (2016) also give another example of a network with two nodes (i.e., 2R2C network) where, for a similar assumption of constant driving terms $T_o$ and Q, one obtains a second order ODE which can be cast as a time series model with two lag terms, with the time series coefficients (or transfer function coefficients) still retaining a clear relation with the resistances and the two time constants of the system. For more complex models and for cases when the driving terms are not constant, such clear interpretation of the time series model coefficients in terms of resistances and capacitances would not exist since the same time series model can apply to different RC networks, and so uniqueness is lost.

For the general case of the free (or floating) response of a non-air-conditioned room or building represented by indoor air temperature $T_i$ and which is acted upon by two driving terms $T_o$ and Q which are time variant, the general form of the transfer function model of order n is:

$$\begin{aligned}
T_{i,t} + a_1 T_{i,t-1} + a_2 T_{i,t-2} + \dots + a_n T_{i,t-n} = {} & b_0 T_{o,t} + b_1 T_{o,t-1} + b_2 T_{o,t-2} + \dots + b_n T_{o,t-n} \\
& + c_0 Q_t + c_1 Q_{t-1} + c_2 Q_{t-2} + \dots + c_n Q_{t-n}
\end{aligned} \tag{8.48}$$

The model identification process, when applied to the observed time series of these three variables, would determine how many coefficients or weighting factors to retain in the final model for each of the variables. Once such a model has been identified, it can be used for accurate forecasting purposes. In some cases, physical considerations can impose certain restrictions on the transfer function coefficients, and it is urged that these be considered since it would result in sounder models. For the above example, it can be shown that at the limit of steady state operation, when present and lagged values of each of the three variables are constant, heat loss would be expressed in terms of the overall heat loss coefficient U times the cross-sectional area A perpendicular to heat flow: $Q_{loss} = UA(T_o - T_i)$. This would require that the following condition be met:

$$(1 + a_1 + a_2 + \dots + a_n) = (b_0 + b_1 + b_2 + \dots + b_n)$$

The transfer function approach¹⁰ has been widely used in several detailed building energy simulation software programs developed over 40 years ago to model unsteady state heat transfer and thermal mass storage effects such as wall and roof conduction, solar heat gains and internal heat gains. It is a widely understood and accepted modeling approach among building energy professionals. The following example illustrates the approach.

¹⁰ Strictly, this formulation should be called "discrete transfer function or z-transform" since it uses discrete time intervals (of one hour).

Example 8.6.1 Transfer function model to represent unsteady state heat transfer through a wall
For the case of a wall subject to solar irradiation and an outdoor temperature on the outside surface, the combined effect can be modeled by the sol-air temperature concept (see any building science textbook, for example Reddy et al. 2016). The indoor air temperature $T_i$ is kept constant by air-conditioning. The transfer function model for the unsteady state heat conduction through the wall is expressed as:

$$\begin{aligned}
Q_{cond,t} = {} & -d_1 Q_{cond,t-\Delta t} - d_2 Q_{cond,t-2\Delta t} - \dots \\
& + b_0 T_{solair,t} + b_1 T_{solair,t-\Delta t} + b_2 T_{solair,t-2\Delta t} + \dots - T_i \sum_{n \ge 0} c_n
\end{aligned} \tag{8.49a}$$

or

$$Q_{cond,t} = -\sum_{n \ge 1} d_n Q_{cond,t-n\Delta t} + \sum_{n \ge 0} b_n T_{solair,t-n\Delta t} - T_i \sum_{n \ge 0} c_n \tag{8.49b}$$

where
$Q_{cond}$ is the conduction heat gain through the wall
$T_{solair}$ is the sol-air temperature (a variable which includes the combined effect of outdoor dry-bulb air temperature and the solar radiation incident on the wall)
$T_i$ is the indoor air temperature
$\Delta t$ is the time step (usually 1 h)
$b_n$, $c_n$, and $d_n$ are the transfer function coefficients

Table 8.7 Conduction transfer function coefficients for a 4 inch (10 cm) concrete wall with 2 inch (5 cm) insulation

| | n = 0 | n = 1 | n = 2 | n = 3 | n = 4 |
|---|---|---|---|---|---|
| $b_n$ | 0.00099 | 0.00836 | 0.00361 | 0.00007 | 0.00 |
| $d_n$ | 1.00 | -0.93970 | 0.04664 | 0.00 | 0.00 |
| $\sum c_n$ | 0.01303 | | | | |
Table 8.7 assembles values of the transfer function coefficients for a 4-inch (10 cm) concrete wall with 2 inches (5 cm) of insulation. For a given hour, say 10:00 am, Eq. 8.49b can be expressed as:

$$\begin{aligned}
Q_{cond,10} = {} & (0.93970)\,Q_{cond,9} - (0.04664)\,Q_{cond,8} \\
& + (0.00099)\,T_{solair,10} + (0.00836)\,T_{solair,9} + (0.00361)\,T_{solair,8} + (0.00007)\,T_{solair,7} \\
& - (0.01303)\,T_i
\end{aligned} \tag{8.50}$$

First, values of the driving term $T_{solair}$ must be computed for all the hours over which the computation is to be performed. To start the calculation, initial guess values for the transient conduction at hours 9:00 and 8:00, denoted by $Q_{cond,9}$ and $Q_{cond,8}$, are assumed. For a specified $T_i$, one can then calculate $Q_{cond,10}$ and repeat the recursive calculation for each subsequent hour. At the end of the diurnal cycle, the final transient conduction terms will differ from the initial guess values. The heat gains are periodic because of the diurnal periodicity of $T_{solair}$. The calculation cycle is repeated over as many diurnal cycles as needed to reach convergence. The effect of the initial guess values soon dies out and the calculations attain the desired accuracy after a few iterations. ■
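The recursive calculation described in this example can be sketched in Python as shown below. The coefficients are those of Table 8.7; everything else is an assumption made for illustration: the function name `conduction_gain`, the zero initial guess, the convergence tolerance, and in particular the sinusoidal sol-air temperature profile, which is hypothetical and not taken from the text.

```python
import math

# Conduction transfer function coefficients from Table 8.7
B = [0.00099, 0.00836, 0.00361, 0.00007]   # b_0 .. b_3
D = [1.0, -0.93970, 0.04664]               # d_0 .. d_2
SUM_C = 0.01303                            # sum of the c_n coefficients


def conduction_gain(t_solair, t_indoor, tol=1e-8, max_cycles=100):
    """Periodic hourly conduction gains from Eq. 8.49b for one diurnal cycle.

    t_solair : hourly sol-air temperatures for one day (assumed periodic)
    t_indoor : constant indoor air temperature
    """
    hours = len(t_solair)
    q = [0.0] * hours                       # initial guess for all hours
    for _ in range(max_cycles):
        q_old = list(q)
        for h in range(hours):
            # Lagged conduction terms; indices wrap to the previous day.
            history = sum(D[n] * q[(h - n) % hours] for n in range(1, len(D)))
            # Current and lagged sol-air temperature terms.
            drive = sum(B[n] * t_solair[(h - n) % hours] for n in range(len(B)))
            q[h] = -history + drive - SUM_C * t_indoor
        # Stop when two successive diurnal cycles agree.
        if max(abs(a - b) for a, b in zip(q, q_old)) < tol:
            break
    return q


# Hypothetical sol-air profile: daily mean 30 with a 15-degree swing.
profile = [30.0 + 15.0 * math.sin(2.0 * math.pi * (h - 9) / 24.0)
           for h in range(24)]
gains = conduction_gain(profile, t_indoor=24.0)
peak_hour = max(range(24), key=lambda h: gains[h])
print(f"peak conduction gain {gains[peak_hour]:.3f} at hour {peak_hour}")
```

A useful sanity check on such a routine is the steady-state limit: with a constant sol-air temperature, the gain should settle at $\sum b_n\,(T_{solair} - T_i)/(1 + d_1 + d_2)$, given that $\sum b_n = \sum c_n$ for the coefficients of Table 8.7.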
8.7 Quality Control and Process Monitoring Using Control Chart Methods

8.7.1 Background and Approach
The concept of statistical quality control and quality assurance was proposed by Shewhart in the 1920s (and, hence, many of these techniques bear his name) with the intent of using sampling and statistical analysis techniques to improve and maintain quality during industrial production. Process monitoring using control chart techniques provides an ongoing check on the stability of the process, and points to problems whose elimination can reduce variation and permanently improve the system (Box and Luceno 1997). It has been extended to include condition monitoring and performance degradation of various equipment and systems, as well as to control of industrial processes involving process adjustments using feedback control to compensate for sources of drift variation. The basic concept is that variation in any production process is unavoidable, the causes of which can be categorized into:
(i) Common causes or random fluctuations due to the overall process itself, such as variation in quality of raw materials and inconsistency of equipment performance; these lead to random and acceptable variability and statistical concepts apply;
(ii) Special or assignable causes (or non-random or sporadic large changes) due to specific deterministic circumstances, such as operator error, machine fault, faulty sensors, or performance degradation of the measurement and control equipment.

It is the instances associated with special causes which are to be detected. When the variability is due to random or common causes and within some stipulated confidence interval, the process is said to be in statistical control. The Gaussian or normal curve is assumed to describe the process measurements with the confidence limits indicated as the upper and lower control limits (UCL and LCL) as shown in Fig. 8.28. Thus, the detection power of the control chart method increases with the number of observations in each sample (since the average values become more normally distributed); however, this number is often dictated by practical and cost constraints.

Fig. 8.28 The upper and lower three-sigma limits indicative of the UCL and LCL limits shown on a normal distribution. The corresponding probability that the mean of a sample will exceed these limits is only 0.27%.

The practice of plotting the attributes or characteristics of the process over time is called monitoring via control charts. It consists of a horizontal plot which locates the process mean (called "centerline") and two lines (the UCL and LCL limits), as shown in Fig. 8.29. The intent of statistical process monitoring using control charts is to detect the occurrence of special or non-random events which impact the central tendency and the variability of the process, and then to take corrective action to eliminate them. Thus, the process can again be brought back to stable and statistical control as quickly as possible. These limits are often drawn to correspond to 3 times the standard deviation so that one can infer that there is strong evidence that points outside the limits are faulty or indicate an unstable process. The decision as to whether to deem an incoming sample of size n at a particular point in time to be in or out of control is, thus, akin to a two-tailed hypothesis test:

$$\begin{aligned}
\text{Null hypothesis} \quad & H_0: \text{process is in control, } \bar{X} = \mu_0 \\
\text{Alternative hypothesis} \quad & H_a: \text{process is out of control, } \bar{X} \ne \mu_0
\end{aligned} \tag{8.51}$$

where $\bar{X}$ denotes the sample mean, where each sample consists of n observations, and $\mu_0$ the expected mean deduced from an in-control process. Note that the convention in statistical quality control literature is to use upper case letters for the mean value. Just as in hypothesis testing (Sect. 4.2), type I and type II errors can result. For example, a type I error (or false positive error) arises when a sample (or point on the chart) of an in-control process falls outside the control bands. As before, the probability of occurrence is reduced by making appropriate choices of sample size and control limits.

Fig. 8.29 The Shewhart control chart with primary limits

8.7.2 Shewhart Control Charts for Variables

The Shewhart control chart method is a generic name which includes several different types of charts viewed in conjunction. It is primarily meant to keep track of or monitor a process and provide warning as soon as an abnormality occurs. The chart by itself cannot remedy the situation, i.e., it cannot provide the necessary suggestions of how to correct or adjust the process. This approach is appropriate for continuous measurements of variables such as diameter, temperature, and flow, as well as for attributes or derived parameters or quantities (such as overall heat loss coefficient and efficiency).

8.7.2.1 Mean Charts
The mean or $\bar{X}$ chart is used to monitor the central tendency and detect the onset of bias in measured quantities or estimated parameters. This detection is based on a two-tailed hypothesis test assuming a normal error distribution. Even if the variable is not normally distributed, the Central Limit Theorem (Sect. 4.2.1) states that the mean of a number of random samples will tend towards a normal distribution. The control chart plots are deduced as upper and lower control limits about the centerline where the norm is to use the 3-sigma confidence limits:

When s is known:

$$\{UCL, LCL\}_{\bar{X}} = \bar{X} \pm 3\,\frac{s}{n^{1/2}} \tag{8.52}$$

where s is the process standard deviation and $(s/n^{1/2})$ is the standard error of the sample means. Recall that the 3-sigma limits include 99.73% of the area under the normal curve, and hence, that the probability of a type I error (or false positive) is only 0.27%. When the standard deviation of the process is not known, it is suggested that the average range $\bar{R}$¹¹ of the numerous samples be used for 3-sigma limits as follows:

When s is not known:

$$\{UCL, LCL\}_{\bar{X}} = \bar{X} \pm A_2 \bar{R} \tag{8.53}$$

where the factor $A_2$ is given in Table 8.8. Note that this factor decreases as the number of observations in each sample increases. Devore and Farnum (2005) cite a study which demonstrated that the use of medians and the interquartile range (IQR) was superior to the traditional means and range control charts. The former two quantities were found to lead to more robust detection, i.e., less influenced by spurious outliers. In this case, the suggested control limits were:
$$\{UCL, LCL\} = \tilde{X} \pm 3\,\frac{IQR}{k_n\,(n)^{1/2}} \tag{8.54}$$

where $\tilde{X}$ is the median and the values of $k_n$ are selected based on the sample size n as given in Table 8.9.

¹¹ The range for a given sample is simply the difference between the highest and lowest values of the given sample.

Table 8.8 Numerical values of the three coefficients to be used in Eqs. 8.53 and 8.55 for constructing the three-sigma limits for the mean and range charts

| Number of observations in each sample, n | Control chart for the mean: $A_2$ | Control chart for the range: $D_3$ | Control chart for the range: $D_4$ |
|---|---|---|---|
| 2 | 1.880 | 0 | 3.268 |
| 3 | 1.023 | 0 | 2.574 |
| 4 | 0.729 | 0 | 2.282 |
| 5 | 0.577 | 0 | 2.114 |
| 6 | 0.483 | 0 | 2.004 |
| 7 | 0.419 | 0.076 | 1.924 |
| 8 | 0.373 | 0.136 | 1.864 |
| 9 | 0.337 | 0.184 | 1.816 |
| 10 | 0.308 | 0.223 | 1.777 |
| 11 | 0.285 | 0.256 | 1.744 |
| 12 | 0.266 | 0.284 | 1.717 |
| 13 | 0.249 | 0.308 | 1.692 |
| 14 | 0.235 | 0.329 | 1.671 |
| 15 | 0.223 | 0.348 | 1.652 |

Source: Adapted from "1950 ASTM Manual on Quality Control of Materials," American Society for Testing and Materials, in J. M. Juran, ed., Quality Control Handbook (New York: McGraw-Hill Book Company, 1974), Appendix II, p. 39

Table 8.9 Values of the factor $k_n$ to be used in Eq. 8.54

| n | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|
| $k_n$ | 0.596 | 0.990 | 1.282 | 1.512 | 0.942 |
8.7.2.2 Range Charts
Range or R-charts are meant to detect variation or non-uniformity or non-consistency of a process. The range is a rough measure of the "rate of change" of the observed variable which is a more sensitive measure than the mean. Hence, a point which is out of control on the range chart may be flagged as an abnormality before the mean chart does. Consider the case of drawing k samples each with sample size n (i.e., each sample consists of drawing n items or taking n individual measurements).¹² The 3-sigma limits for the range chart are given by:

$$\begin{aligned}
\text{mean line:} \quad & \bar{R} \\
\text{lower control limit:} \quad & LCL_R = D_3 \bar{R} \\
\text{upper control limit:} \quad & UCL_R = D_4 \bar{R}
\end{aligned} \tag{8.55}$$

where $\bar{R}$ is the mean of the ranges of the k samples, and the numerical values of the coefficients $D_3$ and $D_4$ are given in Table 8.8 for different sample sizes.

¹² Note the distinction between the number of samples (k) and the size of each sample or sample size (n).

It is suggested that the mean and range charts be used together since their complementary properties allow better monitoring of a process. Figure 8.30 illustrates two instances where using both charts reveals behavior which one chart alone would have missed. It is wise to check for normality of the mean variable across the samples using, for example, the normality plot (described in Sect. 3.5.3). If the process is stable, one would expect the same number of points above and below the center line. If there are steady upward or downward trends, that would indicate abnormality. Even though this can be detected by control charts such as Fig. 8.30, there are more sensitive methods (such as the Cusum described later on).

There are several variants of the above mean and range charts since several statistical indices are available to measure the central tendency and the variability. One common chart is the standard deviation or s chart, while other types of charts involve $s^2$ charts. Which chart to use depends to some extent on personal preference. Often, the sample size for control chart monitoring is low (around 5 according to Himmelblau 1978), which results in the standard deviation not being a very robust statistic. Further, the range is easier to visualize and interpret, and is more easily determined than the standard deviation.
Example 8.7.1 Illustration of the mean and range charts (a) Consider a process where 20 samples, each consisting of 4 items, are gathered as shown in Table 8.10. The mean and range charts will be used to illustrate how to assess whether the process is in control or not. The X and R charts are shown in Fig. 8.31 with the two control limits set at 3-sigma. Note that no point is beyond the control limits in either plot indicating that the process is in statistical control.
Fig. 8.30 The advantage provided by combining the mean and range charts in detecting out-of-control processes. Two instances are shown: (a) where the variability is within limits, but the mean is out of control which is detected by the mean chart, and (b) where the mean is in control but not the variability which is detected by the range chart
(b) The data in Table 8.10 has been intentionally corrupted slightly such that the four items of sample# 1 are higher by 1% and those of sample# 8 higher by 3% (Fig. 8.32). Whether and how the mean and range plots flag these two occurrences can be seen in Fig. 8.33. The range shows almost no change from Fig. 8.31 while the X chart has been able to detect sample# 8 but not sample# 1 which has only a 1% change. This simple example is meant to illustrate that a process monitored using both the mean and range charts can provide additional insights not provided by each control chart variable alone. ■
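Part (a) of this example can be retraced numerically with the short sketch below. It uses the sample means and ranges listed in Table 8.10 together with the factors of Table 8.8 for n = 4 ($A_2$ = 0.729, $D_3$ = 0, $D_4$ = 2.282), and applies Eqs. 8.53 and 8.55 with the grand mean taken as the centerline. The function names are illustrative assumptions, not part of the text.

```python
# Sample means and ranges of the 20 samples (Table 8.10), each of size n = 4
sample_means = [1.400, 1.394, 1.392, 1.402, 1.390, 1.404, 1.399, 1.395,
                1.400, 1.407, 1.399, 1.399, 1.396, 1.393, 1.400, 1.402,
                1.404, 1.396, 1.399, 1.395]
sample_ranges = [0.042, 0.030, 0.014, 0.033, 0.015, 0.004, 0.023, 0.028,
                 0.023, 0.022, 0.015, 0.020, 0.018, 0.026, 0.028, 0.023,
                 0.023, 0.021, 0.024, 0.014]

A2, D3, D4 = 0.729, 0.0, 2.282      # Table 8.8 factors for n = 4


def xbar_r_limits(means, ranges, a2, d3, d4):
    """Return (LCL, centerline, UCL) for the mean chart and the range chart."""
    grand_mean = sum(means) / len(means)
    r_bar = sum(ranges) / len(ranges)
    mean_chart = (grand_mean - a2 * r_bar, grand_mean, grand_mean + a2 * r_bar)
    range_chart = (d3 * r_bar, r_bar, d4 * r_bar)
    return mean_chart, range_chart


def out_of_control(values, lcl, ucl):
    """1-based sample numbers whose plotted value falls outside the limits."""
    return [i + 1 for i, v in enumerate(values) if v < lcl or v > ucl]


mean_chart, range_chart = xbar_r_limits(sample_means, sample_ranges, A2, D3, D4)
flag_mean = out_of_control(sample_means, mean_chart[0], mean_chart[2])
flag_range = out_of_control(sample_ranges, range_chart[0], range_chart[2])

print("mean chart  (LCL, CL, UCL):", [round(v, 4) for v in mean_chart])
print("range chart (LCL, CL, UCL):", [round(v, 4) for v in range_chart])
print("flagged on mean chart:", flag_mean, "| on range chart:", flag_range)
```

Both lists of flagged samples come out empty for the uncorrupted data, in line with the conclusion of part (a) that the process is in statistical control.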
Table 8.10 Data table for the 20 samples consisting of four items and associated mean and range statistics (Example 8.7.1)

| Sample# | Item 1 | Item 2 | Item 3 | Item 4 | Mean $\bar{X}$ | Range (R) |
|---|---|---|---|---|---|---|
| 1 | 1.405 | 1.419 | 1.377 | 1.400 | 1.400 | 0.042 |
| 2 | 1.407 | 1.397 | 1.377 | 1.393 | 1.394 | 0.030 |
| 3 | 1.385 | 1.392 | 1.399 | 1.392 | 1.392 | 0.014 |
| 4 | 1.386 | 1.419 | 1.387 | 1.417 | 1.402 | 0.033 |
| 5 | 1.382 | 1.391 | 1.390 | 1.397 | 1.390 | 0.015 |
| 6 | 1.404 | 1.406 | 1.404 | 1.402 | 1.404 | 0.004 |
| 7 | 1.409 | 1.386 | 1.399 | 1.403 | 1.399 | 0.023 |
| 8 | 1.399 | 1.382 | 1.389 | 1.410 | 1.395 | 0.028 |
| 9 | 1.408 | 1.411 | 1.394 | 1.388 | 1.400 | 0.023 |
| 10 | 1.399 | 1.421 | 1.400 | 1.407 | 1.407 | 0.022 |
| 11 | 1.394 | 1.397 | 1.396 | 1.409 | 1.399 | 0.015 |
| 12 | 1.409 | 1.389 | 1.398 | 1.399 | 1.399 | 0.020 |
| 13 | 1.405 | 1.387 | 1.399 | 1.393 | 1.396 | 0.018 |
| 14 | 1.390 | 1.410 | 1.388 | 1.384 | 1.393 | 0.026 |
| 15 | 1.393 | 1.403 | 1.387 | 1.415 | 1.400 | 0.028 |
| 16 | 1.413 | 1.390 | 1.395 | 1.411 | 1.402 | 0.023 |
| 17 | 1.410 | 1.415 | 1.392 | 1.397 | 1.404 | 0.023 |
| 18 | 1.407 | 1.386 | 1.396 | 1.393 | 1.396 | 0.021 |
| 19 | 1.411 | 1.406 | 1.392 | 1.387 | 1.399 | 0.024 |
| 20 | 1.404 | 1.396 | 1.391 | 1.390 | 1.395 | 0.014 |
| Grand mean | | | | | 1.398 | 0.022 |

Data available electronically on book website

Fig. 8.31 Shewhart 3-sigma charts for mean and range using data from Table 8.10. (a) Mean chart. (b) Range chart

Fig. 8.32 Scatter plot of the 20 samples consisting of four items or observations in each sample, plotted as Item 1 to Item 4 against subgroup number. (Data from Table 8.10 but with sample#1 and sample#8 intentionally corrupted by 1% and 3%, respectively)

Fig. 8.33 Shewhart charts when samples# 1 and 8 in Table 8.10 have been intentionally corrupted by 1% and 3%, respectively, plotted against subgroup number. (a) $\bar{X}$ chart. (b) Range chart

8.7.3 Shewhart Control Charts for Attributes

In complex assembly operations (or during condition monitoring involving several sensors as in many thermal systems),
numerous quality variables would need to be monitored, and in principle, each one could be monitored separately. A simpler procedure is to inspect n finished products denoting a sample at regular intervals, and to simply flag the proportion of products in the sample found to be defective or non-defective. Thus, analysis using attributes would only differentiate between two possibilities: acceptable or not acceptable. Different types of charts have been proposed; two important ones are listed below:

(a) p-chart for fraction or proportion of defective items in a sample (it is recommended that typically n = 100 or so). An analogous chart for tracking the number of defectives in a sample, i.e., the variable (np), is also widely used;
(b) c-chart for rate of defects or minor flaws or number of non-conformities per unit time. This is a more sophisticated type of chart where a flaw may not be severe enough to render the item useless but would nevertheless compromise the quality of the product. An item can have non-conformities but still be able to function as intended. It is based on the Poisson distribution rather than the binomial distribution which is the basis for method (a) above. The reader can refer to texts such as Devore and Farnum (2005) or Walpole et al. (2007) for a more detailed description of this approach.

Method (a) is briefly described below. Let p be the probability that any particular item is defective. One manner of determining probability p is to infer it as the long run proportion of defective items taken from a previous in-control period. If the process is assumed to be independent between samples, then the expected value and the variance of the fraction of defectives $\hat{p}$ observed in a random sample of size n, the number of defectives being a binomial random variable, are given by (see Sect. 2.4.2):

$$E(\hat{p}) = p \qquad \mathrm{var}(\hat{p}) = \frac{p(1-p)}{n} \tag{8.56}$$
Then, the 3-sigma upper and lower limits are determined as:
Analysis of Time Series Data
Table 8.11 Data table for Example 8.7.2 Sample 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 Mean
Number of defective components 8 6 5 7 2 5 3 8 4 4 3 1 5 4 4 2 3 5 6 3
Fraction defective, p 0.16 0.12 0.10 0.14 0.04 0.10 0.06 0.16 0.08 0.08 0.06 0.02 0.10 0.08 0.08 0.04 0.06 0.10 0.12 0.06 0.088
Data available electronically on book website
pð1 - pÞ fUCL, LCLgp = p ± 3 n
1=2
ð8:57Þ
8.7.4 In case the LCL is negative, it must be set to zero, since negative values are physically impossible for proportions. Example 8.7.2 Illustration of the p-chart method Consider the data shown in Table 8.11 collected from a process where 20 samples are gathered with each sample size being n = 50. If prior knowledge is available as to the expected defective proportion p, then that value should be used. In case it is not, and provided the process is generally fault-free, it can be computed from the data itself as shown. From the table, the mean p value = 0.088. This is used as the baseline for comparison in this example. The centerline and the UCL and LCL values for the p-chart following Eqs. 8.56 and 8.57 are shown in Fig. 8.34. The process can be taken to be in control because the individual points are contained within the UCL and LCL bands. Since the p value cannot be negative, the LCL is forced to zero; this is the reason for the asymmetry in the UCL and LCL bands around the CTR. The analogous chart for the number of defectives is also shown. Note that the two types of charts look very similar (the very small differences are due to rounding) except for the numerical values of the UCL and LCL; this is not surprising since the number of samples n (taken as 50 in this example) is a constant multiplier. ■
Practical Implementation Issues of Control Charts
The basic process for constructing control charts is to first gather at least k = 25–30 samples of data, each with a fixed number n of objects or observations, from a production process known to be working properly, i.e., one in statistical control. A typical value of n for X-bar and R charts is n = 5. As the value of n is increased, one can detect smaller changes but at the expense of more time and money. The mean, UCL and LCL values could be preset and unchanging during the course of operation, or they could be estimated anew at each updating period. Say the analysis is done once a day, with four observations (n = 4) taken hourly, and the process operates 24 h/day. The limits for each day could be updated based on the statistics of the 24 samples taken the previous day or kept fixed at some pre-set value. Such choices are best made based on physical insights into the specific process or equipment being monitored. A practical consideration is that process operators do not like frequent adjustments made to the control limits. Not only can this lead to errors in resetting the limits, but it may also lead to skepticism about the reliability of the entire statistical control approach. When a process is in control, the points from each sample plotted on the control chart should fluctuate in a random manner between the UCL and the LCL with no clear pattern.
Fig. 8.34 Shewhart p-charts. (a) Chart for the proportion of defectives. (b) Chart for the number of defectives (np)
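The p-chart limits of Eq. 8.57 are simple enough to verify numerically. The following is a minimal Python sketch, not the authors' own implementation, that applies Eqs. 8.56 and 8.57 to the defect counts of Table 8.11; the function name `p_chart_limits` and the variable names are illustrative assumptions.

```python
import math

# Number of defective components in each of the 20 samples (Table 8.11)
defects = [8, 6, 5, 7, 2, 5, 3, 8, 4, 4, 3, 1, 5, 4, 4, 2, 3, 5, 6, 3]
n = 50  # number of items inspected in each sample


def p_chart_limits(defect_counts, sample_size):
    """Return (centerline, LCL, UCL) of a 3-sigma p-chart (Eq. 8.57).

    The centerline is estimated from the data itself, which presumes
    the data come from an in-control period.
    """
    fractions = [d / sample_size for d in defect_counts]
    p_bar = sum(fractions) / len(fractions)
    sigma_p = math.sqrt(p_bar * (1.0 - p_bar) / sample_size)  # from Eq. 8.56
    ucl = p_bar + 3.0 * sigma_p
    lcl = max(0.0, p_bar - 3.0 * sigma_p)  # a proportion cannot be negative
    return p_bar, lcl, ucl


p_bar, lcl, ucl = p_chart_limits(defects, n)
fractions = [d / n for d in defects]

# 1-based sample numbers falling outside the control limits
flagged = [i + 1 for i, p in enumerate(fractions) if p < lcl or p > ucl]

# The analogous np-chart simply scales every quantity by the sample size n
np_center, np_lcl, np_ucl = n * p_bar, n * lcl, n * ucl

print(f"p-chart : CTR = {p_bar:.3f}, LCL = {lcl:.3f}, UCL = {ucl:.3f}")
print(f"np-chart: CTR = {np_center:.2f}, LCL = {np_lcl:.2f}, UCL = {np_ucl:.2f}")
print("samples outside the limits:", flagged)
```

With these data the sketch gives a centerline of 0.088, an LCL clipped to zero, and a UCL of about 0.208, so no sample is flagged, which agrees with the conclusion of Example 8.7.2.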
Several “rules” have been proposed to increase the sensitivity of Shewhart charts. Other than “no points outside the control limits,” one could check for such effects as: (i) whether the numbers of points above and below the centerline are about equal, (ii) whether there is a steady rise or decrease in a sequence of points, (iii) whether most of the points are close to the centerline rather than hugging the limits, (iv) whether there is a sudden shift in the process mean, and (v) whether there is cyclic behavior. Devore and Farnum (2005) present an extended list of “out-of-control” rules involving counting the number of points falling within different bounds corresponding to one, two and three sigma lines. Eight different types of out-of-control behavior patterns are shown to illustrate that several possible schemes can be devised for process monitoring. Others have developed similar types of rules to increase the sensitivity of the monitoring process. However, using such types of extended rules also increases the possibility of false alarms (or type I errors), and so, rather than being ad hoc, there should be a solid statistical basis to these rules.
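A few of the supplementary checks listed above can be automated with very little code. The sketch below is only illustrative: the data, the control limits, and the function names are hypothetical, and the run length that should trigger an alarm is a user choice that the text does not prescribe.

```python
def points_outside_limits(values, lcl, ucl):
    """Return 1-based indices of points beyond the control limits."""
    return [i + 1 for i, v in enumerate(values) if v < lcl or v > ucl]


def count_above_below(values, centerline):
    """Check (i): number of points above and below the centerline."""
    above = sum(v > centerline for v in values)
    below = sum(v < centerline for v in values)
    return above, below


def longest_run_one_side(values, centerline):
    """Length of the longest run of consecutive points on one side
    of the centerline; a long run hints at a shift in the mean."""
    longest = current = 0
    previous_side = 0
    for v in values:
        side = (v > centerline) - (v < centerline)  # +1 above, -1 below, 0 on it
        if side != 0 and side == previous_side:
            current += 1
        elif side != 0:
            current = 1
        else:
            current = 0
        previous_side = side
        longest = max(longest, current)
    return longest


# Hypothetical sample means with assumed centerline and 3-sigma limits
values = [10.1, 9.9, 10.2, 9.8, 10.1, 10.2, 10.3, 10.3, 10.4, 10.5, 10.7, 9.9]
centerline, lcl, ucl = 10.0, 9.4, 10.6

outside = points_outside_limits(values, lcl, ucl)
above, below = count_above_below(values, centerline)
run = longest_run_one_side(values, centerline)

print("points outside limits:", outside)
print("above / below centerline:", above, "/", below)
print("longest one-sided run:", run)
```

In this made-up sequence only one point exceeds the upper limit, yet the imbalance of points about the centerline and the long one-sided run already suggest an upward drift before the limit is crossed.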
8.7.5 Time-Weighted Monitoring
8.7.5.1 Cusum Charts
Traditional Shewhart chart methods are based on investigating statistics (mean or range, for example) of an individual sample of n items. Time-weighted procedures allow more sensitive monitoring and detection of gradual shifts in the process mean. Cumulative sum (Cusum) charts are similar to Shewhart charts in that they are diagnostic tools which indicate whether or not a process has gone out of control due to the onset of special non-random causes. However, the Cusum approach bases the inference on a sum of deviations rather than on individual samples. This allows greater sensitivity in detecting deviations (said to be about twice that of Shewhart charts). It damps down random noise while amplifying true process changes and can indicate when and by how much the mean of the process has shifted.
Typical time-weighted approaches have a shorter average run length in detecting small-to-moderate process shifts since they incorporate the past history of the process. Like the EWMA, Cusum is sensitive to small shifts in the process mean but does not match the ability of a Shewhart chart to detect larger shifts. For this reason, it is sometimes used together with a Shewhart chart. However, non-normality and serial correlation in the data have an important effect on conclusions drawn from Cusum plots (Himmelblau 1978). Consider a control chart for the mean with a reference or target level established at $\mu_0$. Let the sample means be given by $\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_r$. Then, the first r Cusums are computed as:
$$\begin{aligned} S_1 &= \bar{X}_1 - \mu_0 \\ S_2 &= S_1 + (\bar{X}_2 - \mu_0) = (\bar{X}_1 - \mu_0) + (\bar{X}_2 - \mu_0) \\ &\;\;\vdots \\ S_r &= S_{r-1} + (\bar{X}_r - \mu_0) = \sum_{i=1}^{r} (\bar{X}_i - \mu_0) \end{aligned} \tag{8.58}$$
The above discussion applied to the mean residuals, i.e., the difference between the measured value and its expected value. Such charts could be based on other statistics such as the range, absolute differences, or successive differences between observations, or even the variable itself. The Cusum chart is simply a plot of $S_r$ over time, like the Shewhart charts, but it provides a different type of visual record. Since the deviations add up, i.e., cumulate, an increase (or decrease) in the process mean will result in an upward (or downward) slope of the value $S_r$. The magnitude of the slope is indicative of the size of the change in the mean. Special templates or overlays are generated according to certain specific rules for constructing them (see, for example, Walpole et al. 2007). A common representation is the V-mask Cusum plot. Here the overlay is shaped as a Vee (since the slope is indicative of a change in the mean of the process) which is placed over the most recent point. If the data points fall within the opening of the Vee, then the process is considered to be in control; otherwise it is not. If there is no shift in the mean, the Cusum values should fluctuate around a horizontal line. Even a moderate change in the mean, however, would result
in the Cusum chart exhibiting a slope, with each new observation highlighting the slope more distinctly. A more popular visualization of Cusum analysis is the tabular Cusum chart, which closely resembles the intuitive Shewhart charts. The horizontal limit lines above and below the centerline are the 3-sigma control line limits (5-sigma limits are also used). Values of the means of each sample j are also indicated as points on a vertical scale. Vertical bars above and below the centerline or nominal target value for each sample represent the one-sided cumulative sums, namely USL (upper specification limit) and the LSL (lower specification limit) given by:
$$\begin{aligned} \mathrm{USL}: \quad S_j^{+} &= \max\left[0,\; \bar{X}_j - (\mu + k\sigma) + S_{j-1}^{+}\right] \\ \mathrm{LSL}: \quad S_j^{-} &= \max\left[0,\; -\bar{X}_j + (\mu - k\sigma) + S_{j-1}^{-}\right] \end{aligned} \tag{8.59}$$
Here $k\sigma$ acts as an allowance (slack) term, so that only deviations of the sample mean beyond $\mu \pm k\sigma$ are accumulated. There are several variants of the Cusum approach along with different analysis options, and the reader can refer to Himmelblau (1978) for these along with a discussion on the practical advantages and disadvantages of this approach.
Fig. 8.35 The Cusum chart with leading V-mask to determine when the process goes out of control. (Data from Table 8.10 with no out-of-control points)
Most statistical software packages offer several built-in control chart options, and the user should exercise due diligence in selecting among them.
Example 8.7.3 Illustrating the use of Cusum plots
(a) The data shown in Table 8.10 represents a process known to be in control. Use the Cusum approach to verify that this data is indeed in control. Figure 8.35 shows the Cusum plot with the Vee mask. Since no point is outside the opening, the process is deemed to be in control.
(b) Corrupt two of the 20 sample data as done in Example 8.7.1 (b) (i.e., the numerical values of the four items forming the sample) and illustrate how the Cusum chart behaves under this situation. The tabular version of the Cusum chart is shown in Fig. 8.36. The various lines and points on this figure have already been described. Only samples whose mean values are outside the control limits are to be flagged as deviations. Suffice it to say that sample #8 again falls outside the 3-sigma lines and is flagged as an outlier, while sample #1 has been missed because of its small degree of corruption. These results are consistent with Fig. 8.33 using Shewhart control charts. Note that the effect of the sharp and abrupt increase in USL at the sample #8 subgroup due to the one single outlier point is only gradually overcome. This behavior illustrates the statement made earlier that Cusum is sensitive to small shifts in the process mean but does not match the ability of a Shewhart chart to detect larger shifts.
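The cumulative sums of Eqs. 8.58 and 8.59 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the procedure used to produce Figs. 8.35 and 8.36: the sample means, the target, the value of σ, and the choice k = 0.5 are all hypothetical, and the function names are made up for this example.

```python
def cusum(sample_means, target):
    """Running sum of deviations from the target (Eq. 8.58)."""
    sums, running = [], 0.0
    for x in sample_means:
        running += x - target
        sums.append(running)
    return sums


def tabular_cusum(sample_means, target, sigma, k=0.5):
    """One-sided upper and lower cumulative sums (Eq. 8.59).

    k * sigma is the allowance: deviations smaller than this are not
    accumulated. The default k = 0.5 is an assumption for illustration.
    """
    upper, lower = [], []
    s_plus = s_minus = 0.0
    for x in sample_means:
        s_plus = max(0.0, x - (target + k * sigma) + s_plus)
        s_minus = max(0.0, -x + (target - k * sigma) + s_minus)
        upper.append(s_plus)
        lower.append(s_minus)
    return upper, lower


# Hypothetical sample means drifting upward from a target of 10.0
means = [10.0, 10.2, 9.9, 10.4, 10.6, 10.5]
target, sigma = 10.0, 0.2  # assumed values

s = cusum(means, target)
s_upper, s_lower = tabular_cusum(means, target, sigma, k=0.5)

for j, (m, a, b, c) in enumerate(zip(means, s, s_upper, s_lower), start=1):
    print(f"j={j}: mean={m:5.1f}  S={a:5.2f}  S+={b:5.2f}  S-={c:5.2f}")
```

In this made-up sequence the upper sum resets to zero whenever the mean falls back below the allowance and then grows steadily once the upward drift sets in, while the lower sum stays at zero throughout.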
Fig. 8.36 The tabular Cusum chart to determine when the process goes out of control. (Data are from Table 8.10 but the four observations of sample #1 and sample #8 are corrupted by an intentional increase of 1% and 3%, respectively. Only sample #8 is seen to fall outside the 3-sigma lines and is flagged as an outlier.) [Axes: CuSum (×0.001) on the vertical axis; Subgroup (0–20) on the horizontal axis]
8.7.5.2 EWMA Process
Moving average control charts can provide great sensitivity in process monitoring and control since information from past samples is combined with that of the current sample. There is, however, the danger that an incipient trend which gradually appears in the past observations may submerge any small shifts in the process. The exponentially weighted moving average (EWMA) process (discussed in Sect. 8.3.2), which has direct links to the AR1 model (see Sect. 8.5.3), has redeeming qualities which make it attractive as a statistical quality control tool. This monitoring approach can be applied either to the sample mean of a set of observations forming a sample or to individual observations taken from a system while in operation. An exponentially weighted average with a discount factor θ such that -1 < θ < +1 would be (Box and Luceno 1997):
$$\tilde{\tilde{y}}_t = (1-\theta)\left(y_t + \theta\, y_{t-1} + \theta^2 y_{t-2} + \ldots\right) \tag{8.60}$$
where the constant (1 - θ) is introduced in order to normalize the sum of the weights to unity since $(1 + \theta + \theta^2 + \ldots) = (1-\theta)^{-1}$. Instead of using Eq. 8.60 to repeatedly recalculate the new variable $\tilde{\tilde{y}}_t$ with each fresh observation, a convenient updating formula is:
$$\tilde{\tilde{y}}_t = \lambda\, y_t + \theta\, \tilde{\tilde{y}}_{t-1} \tag{8.61}$$
where the new variable λ (seemingly redundant) is introduced by convention such that λ = 1 - θ. If λ = 1, all the weight is placed on the latest observation, and one gets the Shewhart chart.
Example 8.7.4 Example of EWMA process control
Consider the sequence of observations from an operating system shown in Table 8.12 (Box and Luceno 1997). Note that the starting value of 10 is taken to be the target value.
Table 8.12 Data for Example 8.7.4
Observation   (start)   1   2   3    4    5   6   7   8
y             10        6   9   12   11   5   6   4   10
If λ = 0.4, then:
$$\begin{aligned} \tilde{\tilde{y}}_1 &= (0.4 \times 6) + (0.6 \times 10) = 8.4 \\ \tilde{\tilde{y}}_2 &= (0.4 \times 9) + (0.6 \times 8.4) = 8.64 \\ \tilde{\tilde{y}}_3 &= (0.4 \times 12) + (0.6 \times 8.64) = 9.98 \end{aligned}$$
and so on. If the process is in a perfect state of control and any deviations can be taken as a random sequence with standard deviation $\sigma_Y$, it can be shown that the associated standard deviation of the EWMA process is:
$$\sigma_{\tilde{\tilde{y}}} = \sigma_Y \left[\frac{\lambda}{2-\lambda}\right]^{1/2} \tag{8.62}$$
Thus, if one assumes λ = 0.4, then $\sigma_{\tilde{\tilde{y}}} / \sigma_Y = 0.5$. The benefits of both the traditional Shewhart charts and the EWMA charts can be combined by generating a co-plot (in the above case, the three-sigma bands for EWMA will be half the width of the three-sigma Shewhart mean bands). Both sets of metrics for each observation can be plotted on such a co-plot for easier visual tracking of the process. An excellent discussion on EWMA and its advantage in terms of process adjustment using feedback control is provided by Box and Luceno (1997).
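The recursive update of Eq. 8.61 and the standard deviation ratio of Eq. 8.62 can be checked with a short Python sketch. It uses the observations of Table 8.12 with λ = 0.4 and the starting (target) value of 10; the function names `ewma` and `ewma_sigma_ratio` are illustrative assumptions and not part of any particular library.

```python
import math


def ewma(observations, lam, start):
    """Recursive EWMA update of Eq. 8.61 with theta = 1 - lam."""
    theta = 1.0 - lam
    smoothed, previous = [], start
    for y in observations:
        previous = lam * y + theta * previous
        smoothed.append(previous)
    return smoothed


def ewma_sigma_ratio(lam):
    """Ratio of the EWMA standard deviation to sigma_Y (Eq. 8.62)."""
    return math.sqrt(lam / (2.0 - lam))


# Observations 1 to 8 of Table 8.12; the starting value 10 is the target
y = [6, 9, 12, 11, 5, 6, 4, 10]
lam = 0.4

values = ewma(y, lam, start=10.0)
ratio = ewma_sigma_ratio(lam)

for t, (obs, smooth) in enumerate(zip(y, values), start=1):
    print(f"t={t}: y={obs:2d}  EWMA={smooth:.3f}")
print(f"sigma_EWMA / sigma_Y = {ratio:.2f}")
```

The first three values reproduce the hand calculations of Example 8.7.4 (8.4, 8.64 and 9.98 after rounding), and the ratio of 0.5 confirms that the 3-sigma EWMA bands are half as wide as the corresponding Shewhart bands for λ = 0.4.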
8.7.6 Concluding Remarks
There are several other related analysis methods which have been described in the literature. To name a few (Devore and Farnum 2005):
(a) Process capability analysis: This analysis provides a means to quantify the ability of a process to meet specifications or requirements. Just because a process is in control does not mean that specified quality characteristics are being met. Process capability analysis compares the distribution of process output to specifications when only common causes determine the variation. Should any special causes be present, this entire line of enquiry is invalid, and so one needs to carefully screen data for special effects before undertaking this analysis. Process capability is measured by the proportion of output that can be produced within design specifications. By collecting data, constructing frequency distributions and histograms, and computing basic descriptive statistics (such as mean and variance), the nature of the process can be better understood.
(b) Pareto analysis for quality assessment: Pareto analysis is a statistical procedure that seeks to discover from an analysis of defect reports or customer complaints which “vital few” causes are responsible for most of the reported problems. The old adage states that 80% of reported problems can usually be traced to 20% of the various underlying causes. By concentrating one’s efforts on rectifying the vital 20%, one can have the greatest immediate impact on product quality. It is used with attribute data based on histograms/frequency of each type of fault and reveals the most frequent defect.
There are several instances when certain products and processes can be analyzed with more than one method, and there is no clear-cut choice. X-bar and R charts are quite robust in that they yield good results even if the data is not normally distributed provided that the sample size is about 20 or greater, while Cusum charts are adversely affected by serial correlation in the data. Table 8.13 assembles useful practical tips as to the effectiveness of different control chart quantities under different situations causing the change.
Table 8.13 Relative effectiveness of control chart quantities in detecting a change in a process (from Himmelblau 1978). 1 = most useful, 2 = next best, 3 = least useful, and – = not appropriate
Cause of change             Mean (X-bar)   Range (R)   Standard deviation (s)   Cumulative sum (CS)
Gross error (blunder)       1              2           –                        3
Shift in mean               2              –           3                        1
Shift in variability        –              1           –                        –
Slow fluctuation (trend)    2              –           –                        1
Rapid fluctuation (cycle)   –              1           2                        –
Problems
Pr. 8.1 Consider the time series data given in Table 8.1 for years 1974–1985 and Table 8.2 for 1986 which was used to illustrate various concepts throughout this chapter. How to use and compare the forecasting ability of different AMA, EWA and linear and seasonal models is illustrated in Example 8.4.2 and summarized in Table 8.3. However, the models were not very satisfactory since the residuals still have a distinct trend. Perform the same types of analyses as illustrated in the text with slightly different training and testing periods. This involves determining whether the model is more accurate when fit to the first 40 data points (years 1974–1983), whether the residuals show less pronounced patterns, and whether the forecasts of the 12 quarters for 1984–1986 have become more accurate. Document your findings in a succinct manner along with your conclusions. As part of your analysis, you should:
(a) investigate smoothing methods, such as AMA (4) because of the annual cyclic behavior, and EWA (0.4) and EWA (0.7),
(b) evaluate alternative OLS models for making the data series stationary (including second order) and check whether the series has constant variance,
(c) evaluate different orders of ARMA models.
Pr. 8.2 Consider Fig. 8.11 which was meant to illustrate four basic constant and linear trend models with additive and multiplicative seasonal components. Using the numerical scales of the coordinates shown for each of the four sub-plots, suggest appropriate functional models and estimate, albeit approximately, the numerical values of the model coefficients.
Pr. 8.3 Use the following time series models for forecasting purposes assuming $\varepsilon \sim N(0, 1)$:
(a) $Z_t = 20 + \varepsilon_t + 0.45\,\varepsilon_{t-1} - 0.35\,\varepsilon_{t-2}$. Given the latest four observations {17.50, 21.36, 18.24, 16.91}, compute forecasts for the next two periods.
(b) $Z_t = 15 + 0.86\,Z_{t-1} - 0.32\,Z_{t-2} + \varepsilon_t$. Given the latest two values of Z {32, 30}, determine the next four forecasts.
Pr. 8.4 Section 8.5.3 describes the manner in which various types of ARMA series can be synthetically generated as shown in Figs. 8.23, 8.24, and 8.25 and how one could verify different recommendations on model identification. These are useful aids for acquiring insights and confidence in the use of ARMA. You are asked to synthetically generate 50 data points using the following models and then use these data sequences to re-identify the models (because of the addition of random noise, there will be some differences in the model parameters identified):
(a) $Z_t = 5 + \varepsilon_t + 0.7\,\varepsilon_{t-1}$ with N(0, 0.5)
(b) $Z_t = 5 + \varepsilon_t + 0.7\,\varepsilon_{t-1}$ with N(0, 1)
(c) $Z_t = 20 + 0.6\,Z_{t-1} + \varepsilon_t$ with N(0, 1)
(d) $Z_t = 20 + 0.8\,Z_{t-1} - 0.2\,Z_{t-2} + \varepsilon_t$ with N(0, 1)
(e) $Z_t = 20 + 0.8\,Z_{t-1} + \varepsilon_t + 0.7\,\varepsilon_{t-1}$ with N(0, 1)
Pr. 8.5 Time series of monthly atmospheric CO2 concentrations from 2002–2006
Figure 8.37 represents global CO2 levels but at monthly levels. Clearly there is both a long-term trend and a cyclic seasonal variation. The corresponding data is shown in Table B.4 (found in Appendix B and also available electronically on the book website). You will use the first 4 years of data (2002–2005) to identify different moving average smoothing techniques, trend + seasonal OLS models, as well as ARIMA models as illustrated through several examples in the text. Subsequently, evaluate these models in terms of how well they predict the monthly values of the last year (i.e., year 2006) assuming the one-step ahead strategy to be applicable. Report the forecast errors. Write a short report describing your entire analysis.
Fig. 8.37 Monthly mean global CO2 concentration for the period 2002–2007. The smoothened line is a moving average over 10 adjacent months. (Downloaded from NOAA website http://www.cmdl.noaa.gov/ccgg/trends/index.php#mlo, 2006)
Pr. 8.6 Time series analysis of sunspot frequency per year from 1770–1869
Data assembled in Table B.5 (in Appendix B and also available electronically on the book website) represents the so-called Wolf number of sunspots per year (n) over many years.
(a) First plot the data and visually note underlying patterns.
(b) You will develop at least 2 alternative models using data from years 1770–1849. The models should include different trend and/or seasonal OLS models, as well as sub-classes of the ARIMA models (where the trends have been removed by OLS models or by differencing). Note that you will have to compute the ACF and PACF for model identification purposes.
(c) Evaluate these models for one-step ahead forecasting application using the expost approach where the data for years 1850–1869 are assumed to be known with certainty (as done in Example 8.5.2). Report on the forecast errors.
Pr. 8.7 Time series of yearly atmospheric CO2 concentrations and temperature differences from 1979–2005
Table B.6 (refer to Appendix B; the data are also available electronically on the book website) assembles data of yearly carbon-dioxide (CO2) concentrations (in ppm) in the atmosphere and the temperature difference with respect to a base year (in °C) from 1979 to 2005 (from Andrews and Jelley 2007 by permission of Oxford University Press).
(a) Plot the data both as time series as well as scatter plots and look for underlying trends. Evaluate at least two AMA and two EWA variants in terms of insights one can gain due to smoothing. Comment on your analysis results.
(b) Using data from years 1979–1999, identify a regression model for the CO2 concentration variable with time as the only regressor variable. These could be trend and/or seasonal or ARIMA type models.
(c) Repeat step (b) but for the Temp. difference variable with time as the only regressor variable.
(d) Using the same data from years 1979–1999, develop a model for Temp. diff where CO2 is one of the regressor variables. Compare the goodness of fit of this model with that identified in (c).
(e) Evaluate the models developed in (b) and (d) under the one-step ahead strategy using data from years 2000–2005 assumed explicitly known (this is the expost conditional case). Report the forecast errors.
(f) Compare the above results for the exante unconditional situation. In this case, future values of CO2 are to be predicted based on the model identified in (d) and used as an input regressor for the Temp. difference model.
(g) Using the final model, forecast the CO2 concentration for 2006 along with 95% CL.
Pr. 8.8 Transfer function analysis of unsteady state heat transfer through a wall
You will use the conduction transfer function coefficients given in Example 8.6.1 to calculate the hourly heat gains (Qcond) through the wall for a constant room temperature of 24 °C and the hourly solar-air temperatures for a day given in Table 8.14 (adapted from Reddy et al. 2016). You will assume guess values to start the calculation and repeat the diurnal calculation over as many days as needed to achieve convergence assuming the same Tsol-air values for successive days. This problem is conveniently solved on a spreadsheet.
Table 8.14 Data table for Problem 8.8 (data available electronically on book website)
Hour ending   Solar-air temp (°C)   Hour ending   Solar-air temp (°C)   Hour ending   Solar-air temp (°C)
1             24.4                  9             32.7                  17            72.2
2             24.4                  10            35.0                  18            58.8
3             23.8                  11            37.7                  19            30.5
4             23.3                  12            40.0                  20            29.4
5             23.3                  13            53.3                  21            28.3
6             25.0                  14            64.4                  22            27.2
7             27.7                  15            72.7                  23            26.1
8             30.0                  16            75.5                  24            25.0
Pr. 8.9 Transfer function analysis using simulated hourly loads in a commercial building
The hourly loads (total electrical, thermal cooling and thermal heating) for a large hotel in Chicago, IL have been generated for 3 days in August using a detailed building energy simulation program. The data table shown in Table B.7 (given in Appendix B) consists of outdoor dry-bulb (Tdb) and wet-bulb (Twb) temperatures in °F as well as the internal electric loads of the building Qint (these are the three regressor variables). The three response variables are the total building electric power use (kWh) and the cooling and heating thermal loads (Btu/h).
(a) Plot the various variables as time series plots and note underlying patterns.
(b) Use OLS to identify a trend and seasonal model using indicator variables using the Fourier series approach but with no lagged terms for “total building electric power.”
(c) For the same response variable, evaluate whether the seasonal differencing approach, i.e., $\nabla_{24} y_t = y_t - y_{t-24}$, is as good as the trend and seasonal model in detrending the data series.
(d) Identify ARMAX models for all three response variables separately by using 2 days for model identification and the last day for model evaluation.
(e) Report all pertinent statistics and compare the results of different models. Provide reasons as to why the particular model was selected as the best one for each of the three response variables.
Pr. 8.10 Example 8.7.1 illustrated the use of Shewhart charts for variables.
(a) We wish to evaluate the effect of reducing the number of samples. Repeat the analysis in this example using only the first 10 samples in Table 8.10 to determine UCL and LCL based on mean and range. Evaluate whether the next 10 samples are within control or not.
(b) Repeat step (a) but using the mean-std. dev charts. Compare the efficacy of both methods.
(c) Repeat the analysis but using the Cusum charts.
(d) Repeat the analysis using EWMA (with λ = 0.4).
(e) Document key results of all the above analyses.
Pr. 8.11 Example 8.7.4 illustrated the use of the EWMA monitoring method.
(a) Use the same data with EWMA (λ = 0.4) to generate the graphical monitoring plot for 3-sigma and determine whether any deviations exceed the control limits.
(b) Repeat the analysis using the traditional Shewhart chart method. Do you see any benefit when both the plots are combined together as co-plots?
(c) Repeat the analysis but using the Cusum and EWMA (with λ = 0.4) and compare results.
(d) How are the results for part (b) affected when EWMA (λ = 0.2) or EWMA (λ = 0.6) is assumed?
Pr. 8.12 Interrupted time series: Analyzing energy use from utility bills by traditional and by time series approaches
The data table shown in Table B.8 (given in Appendix B and also available electronically on the book website) assembles actual electricity and natural gas (NG) usage for a rental home with five young college tenants in a North-Eastern city of the United States. Their occupancy patterns are somewhat erratic since they sometimes go home during weekends and also during school holidays and breaks. Further, the utility bills for electricity use align well with calendar months (even then there are differences of 2–3 days in read dates) but those for NG use have a week offset which will be overlooked in this problem. However, the major effect is the onset of covid lockdowns (starting around April 2020 and lasting till May 2021 or so; during 2020, the worldwide annual energy consumption fell by 4%) where colleges shifted to online mode of instruction and even cancelled classes. This led to the tenants either moving back to their parents’ homes or staying indoors most of the day; which of these two options prevailed and during which dates are not known. You will keep this factor explicitly in mind while analyzing this data set.
Table B.8 (excerpt). Monthly utility usage for a multi-tenant rental house in a North-Eastern city of USA (Pr. 8.12). Data available in Appendix B.8 and also electronically on book website
Year   Month   Outdoor temp (deg F)   Elec use (kWh)   NG use (Cubic feet)
2020   Jan     40                     507              129
       Feb     41                     453              109
...
2022   Dec     38                     643              125
(a) Plot the data and look for trends. Try to identify the calendar months corresponding to the normal period and to the covid lockdown period either visually or by modeling (as described in Sect. 8.4) using a trial-and-error approach since the dates of these periods are unknown. Separate the data in Table B.8 into “normal” and “covid lockdown” periods.
(b) Develop a change point or segmented linear model (see Sect. 5.7.2) for electricity and for NG use using mean outdoor temperature for the two periods. Determine the effect of the lockdown on monthly energy and gas use.
(c) Repeat the analysis using time series methods and models. You can investigate smoothing methods in conjunction with OLS trend and seasonal models as needed. You can also evaluate ARIMA models even though the data length is short (you usually need a minimum of 50 data points or more).
(d) Write a short report summarizing your analysis results and your conclusions regarding the effect of the covid lockdown, and also on whether a time series approach provides additional insights compared to traditional regression methods. Include both visual conclusions as well as quantitative results.
References
Andrews, J. and N. Jelley, 2007. Energy Science: Principles, Technologies and Impacts, Oxford University Press, Oxford.
Bloomfield, P., 1976. Fourier Analysis of Time Series: An Introduction, John Wiley, New York.
Box, G.E.P. and G.M. Jenkins, 1976. Time Series Analysis: Forecasting and Control, Holden Day.
Box, G.E.P. and A. Luceno, 1997. Statistical Control by Monitoring and Feedback Adjustment, John Wiley and Sons, New York.
Chatfield, C., 1989. The Analysis of Time Series: An Introduction, 4th Ed., Chapman and Hall, London.
Devore, J. and N. Farnum, 2005. Applied Statistics for Engineers and Scientists, 2nd Ed., Thomson Brooks/Cole, Australia.
Dhar, A., T.A. Reddy and D.E. Claridge, 1999. Generalization of the Fourier series approach to model hourly energy use in commercial buildings, Journal of Solar Energy Engineering, vol. 121, p. 54, Transactions of the American Society of Mechanical Engineers, New York.
Himmelblau, D.M., 1978. Fault Detection and Diagnosis in Chemical and Petrochemical Processes, Chemical Engineering Monograph 8, Elsevier.
McCain, L.J. and R. McCleary, 1979. The statistical analysis of the simple interrupted time series quasi-experiment, in Quasi-experimentation: Design and analysis issues for field settings, Houghton Mifflin Company, Boston, MA.
McClave, J.T. and P.G. Benson, 1988. Statistics for Business and Economics, 4th Ed., Dellen and Macmillan, London.
Montgomery, D.C. and L.A. Johnson, 1976. Forecasting and Time Series Analysis, McGraw-Hill, USA.
Montgomery, D.C., C.L. Jennings and M. Kulahci, 2016. Introduction to Time Series Analysis and Forecasting, John Wiley and Sons, Hoboken, NJ.
Pindyck, R.S. and D.L. Rubinfeld, 1981. Econometric Models and Economic Forecasts, 2nd Edition, McGraw-Hill, New York, NY.
Reddy, T.A., J.K. Kreider, P.S. Curtiss and A. Rabl, 2016. Heating and Cooling of Buildings, 3rd Ed., CRC Press, Boca Raton, FL.
Ruch, D.K., J.K. Kissock, and T.A. Reddy, 1993. “Model Identification and Prediction Uncertainty of Linear Building Energy Use Models with Autocorrelated Residuals”, Solar Engineering 1993, Proceedings of the ASME-ASES-SED International Solar Energy Conf., Washington DC.
Walpole, R.E., R.H. Myers, S.L. Myers, and K. Ye, 2007. Probability and Statistics for Engineers and Scientists, 8th Ed., Pearson Prentice Hall, Upper Saddle River, NJ.
Wei, W.W.S., 1990. Time Series Analysis: Univariate and Multivariate Methods, Addison-Wesley Publishing Company, Redwood City, CA.
9 Parametric and Non-Parametric Regression Methods
Abstract
This chapter expands on the material presented in Chap. 5 which covered univariate linear regression models (i.e., models with one regressor variable) and multivariate linear regression (MLR) models using ordinary least squares (OLS) as the parameter estimation method. Other issues related to improper model residual behavior and variable selection using stepwise regression during multivariate analysis were also covered. This chapter starts by introducing the fundamental notion of estimation of parameters of a mathematical model which includes the important concepts of both structural and numerical identifiability. Next, the dangers of collinearity among regressors during multivariate regression are addressed, and ways to minimize such effects (such as principal component analysis, ridge regression, and shrinkage methods) are discussed along with a case study example. Then, an overarching class of models is introduced, namely Generalized Linear Models (GLM), which combines in a unified framework both the strictly linear models and non-linear models which can be transformed into linear ones. The latter is achieved by link functions which can be applied to continuous and to binary/categorical random variables (such as exponential, logistic, and Poisson). The parameters are estimated using the maximum likelihood estimation (MLE) approach. GLM also allows one to address error in variable (EIV) situations, i.e., when the errors in the regressor variables are relatively large. Subsequently, how to determine whether any sort of arbitrary or non-linear correlation exists between regressors, using the entropy concept proposed by Shannon, is presented. Next, parameter estimation of intrinsically non-linear models is introduced and the widely used Levenberg-Marquardt algorithm described. This is followed by non-parametric estimation methods involving smoothing and regression splines and the use of kernel functions. 
The latter approach has logical extensions into classification methods treated under data
mining (Chap. 11). With the advent of powerful computing capability, robust regression methods are being increasingly used nowadays since they allow the influence of outliers on model parameter estimation to be deemphasized in ways different than OLS. The multilayer neural network perceptron modeling approach is also described. This chapter deals primarily with parameter estimation of algebraic or steady-state black-box models which may have no clear relation with behavior of a physical system or process. Chapter 10 deals with inverse methods pertinent to algebraic and dynamic functions of structural models (gray-box models or white-box simulation programs).
9.1
Introduction
The terms regression analysis and parameter estimation can be viewed as synonymous, with statisticians favoring the former term, and scientists and engineers the latter. A practical distinction is to use the term regression analysis when the primary objective is to capture the data trend by a functional model, and the term parameter estimation when, given the functional form, the intent is to estimate the parameters of the mathematical function as appropriately as possible. In addition, one needs to distinguish between parameter estimation and curve fitting. The latter is characterized by two degrees of arbitrariness (Bard 1974): (a) the class of functions used is arbitrary and dictated more by convenience (such as using a linear model) than by the physical nature of the process generating the data; (b) the best fit criterion is somewhat arbitrary and statistically unsophisticated (such as making inferences about prediction accuracy and standard errors of model parameters when it is strictly invalid to do so). Thus, curve fitting implies a naïve assumption of a black-box functional model and estimating the parameters using
OLS (by default) without paying much attention to the strict conditions (called Gauss-Markov conditions) upon which OLS is based (these are stated in Sect. 5.5.1 and in Sect. 5.10). The equations and parameters determined from curve fitting usually provide little insight into the nature of the process, and the estimated model parameters and their standard errors are usually not reliable. A curve-fitted model should not be used (or should be used with extreme caution) for predicting the response under conditions outside of the initial range of variability of the data (i.e., for extrapolation) and for ascertaining prediction accuracy. However, this approach is often adequate and suitable as a first pass for many applications involving very complex phenomena, and/or when the project budget allows for only limited monitoring and analysis effort. Parameter estimation, on the other hand, is a more formalized approach which includes curve fitting as its simplest form. Gray-box models, i.e., model structures derived from theoretical considerations, also require parameter estimation (see Sect. 1.5) and ought, moreover, to include previous knowledge concerning the values of the parameters as well as the statistical nature of the measurement errors. Some professionals use the term model fitting to distinguish it from curve fitting. Thus, in the framework of gray-box inverse models, one attempts to identify the structural system behavior in an aggregated or reduced-order manner so that parameters in the model can be interpreted in physical terms. In addition, parameters can be associated with fluid or mechanical properties.
Some of the several practical advantages of using gray-box models coupled with a proper parameter estimation technique are that they (often) provide better understanding of and insight into physical processes and more realistic modeling of system behavior, thereby allowing finer control of the process, better prediction of system performance, and more robust online diagnostics and fault detection than provided by black-box models. The statistical process of determining best parameter values in the face of unavoidable measurement errors and imperfect experimental design is treated in Chap. 10. The concept of estimation (i.e., whether the model parameters can be determined at all in the first place once a functional model is chosen to describe a data set) applies to linear models over their entire spatial region (often referred to as the “global level”) and is usually simpler to address. For non-linear models, verifying whether the model parameters can be estimated must be done in the local sub-space and requires more sophistication. A basic understanding of such fundamental issues is of some importance and is discussed below. Chapter 5 covered univariate linear regression models (i.e., models with one regressor variable) and multivariate linear regression (MLR) models using ordinary least squares
(OLS) as the parameter estimation method. The techniques discussed were mostly traditional ones in that they applied to regression problems with relatively few regressor variables (say, no more than 4–5). In the last few decades, statistical analysis in general has had to grapple with much higher-dimensional data with varying degrees of collinearity (even cases where the number of possible regressors is larger than the number of observations). Stepwise regression (discussed in Sect. 5.4.6) is one possible remedy. The effect of correlated regressors on the regression model and its parameter estimates can be reduced by techniques which are referred to as dimension reduction and shrinkage methods. This chapter treats principal component analysis and ridge regression in Sects. 9.3.2 and 9.3.3, respectively. The term “least squares” regression includes several sub-variants. OLS is a special and important sub-class strictly applicable only when several conditions (called Gauss-Markov conditions; see Sect. 5.10) regarding the model residuals or errors are met and when the model is strictly linear (stated in Sect. 5.5.1 and in Sect. 9.4.1). In summary, the model errors should have an expected value of zero, have constant variance, be uncorrelated with one another, and be uncorrelated with the regressor variables. Stepwise regression, discussed in Sect. 5.4.6, is one method of model identification when regressors are correlated. Further, OLS is applicable when the measurement errors in the regressors are small compared to that of the response and when the response variable and the model residuals are normally distributed. These conditions are often not met in practice, thereby requiring corrective measures. A unified treatment of all these instances, as well as the regression of discrete and binary data, can be handled under the general umbrella of Generalized Linear Models (GLM), which is covered in Sect. 9.4.3.
Non-linear models can be separated into those which can be converted into models linear in the parameters by suitable variable transformation, and those which are inherently non-linear, for which no such transformation exists. Parameter estimation in the former case can generally be handled by the GLM approach. The closed form solutions for identifying the “best” OLS parameters are directly derived by minimizing an objective or loss function framed as the sum of square errors subject to certain inherent conditions (see Eq. 5.3). Such analytical solutions are no longer possible for non-linear models. Inherently non-linear parameter estimation requires search methods like those discussed under optimization (Sect. 7.4). A basic treatment of parameter estimation of intrinsically non-linear models is covered in Sect. 9.5.2. Finally, a compendium of different non-parametric methods, such as local smoothing using local regression, multilayer artificial neural networks, and robust regression, is also treated.
9.2
Important Concepts in Parameter Estimation
Being able to estimate parameters in mathematical models is directly related to the concept of ill-conditioning of the parameter vector β (see Eq. 5.29). It consists of two separate issues: structural identifiability and numerical identifiability, both of which are discussed below. The former is related to the mathematical model without consideration of any real-world noise in the observations. For example, does the assumed functional form to model the data allow the model parameters to be identified at all? Further, for many non-linear models, structural identifiability is an issue in that very similar curves may be obtained with different sets of parameter values. A basic understanding of this concept is provided in Sect. 9.2.1, while case study examples involving building energy flows and chiller modeling in Sects. 10.2.5, 10.4.2, 10.4.3, and 10.6.8 illustrate its relevance in practice. Numerical identifiability, on the other hand, considers the effect of measurement noise on the ability to identify the model parameters with some degree of confidence. Its relevance to parameter estimation is illustrated in several examples in this book, for example, in calibrated simulation in Sect. 10.6.8. These issues closely parallel the mathematical concepts of ill-conditioning of functions and uniqueness of equations which are covered in numerical analysis textbooks (e.g., Chapra and Canale 1988). To summarize, “structural identifiability,” also called “deterministic identifiability,” is concerned with arriving at performance models of the system under perfect or noise-free observations. If it is found that parameters of the assumed model structure cannot be uniquely identified, either the model must be reformulated or else additional measurements made. The problem of “numerical identifiability” is related to the quality of the input-output data gathered, i.e., to noise in the data and to ill-conditioning of the correlation coefficient matrix.
9.2.1
Structural Identifiability
Structural identifiability is defined as the problem of investigating the conditions under which system parameters can be uniquely estimated from experimental data, no matter how noise-free the measurements. This condition can be detected before the experiment is conducted by analyzing the basic modeling equations. Two commonly used testing techniques are the sensitivity coefficient approach (see Sect. 6.7.4), and one that examines the behavior of the poles of the Laplace Transforms of the basic modeling equations (Sinha and Kuszta 1983). Only the sensitivity coefficient approach to detect identifiability is introduced below. Some simple, almost trivial, examples of structural identifiability of models, where y is the system response and x is the input variable, are given below.
(a) A model such as $y = (a + cb)x$, where c is a system constant, will not permit unique identification of parameters a and b when measurements of y and x are made. At best, the overall term $(a + cb)$ can be identified.
(b) A model given by $y = (ab)x$ will not permit explicit identification of a and b, merely of the product $(ab)$.
(c) The model $y = b_1/(b_2 + b_3 t)$, where t is the regressor variable, will not allow all three parameters to be identified, only the ratios $(b_2/b_1)$ and $(b_3/b_1)$. This case is treated in Example 9.2.2 below.
(d) The lumped parameter differential equation model for the temperature drop with time T(t) of a cooling sphere without any heat input (easily derived from Eq. 1.5) is given by:
$$ M c_p \frac{dT}{dt} = -hA\,(T - T_\infty) \tag{9.1} $$
where M is the mass of the sphere, $c_p$ its specific heat, h the heat loss coefficient from the sphere to the ambient, A the surface area of the sphere, and $T_\infty$ the ambient temperature, assumed constant. If measurements over time of T are made, one can only identify the group $(Mc_p/hA)$, since the solution $T(t) = T_\infty + [T(0) - T_\infty]\exp[-t/(Mc_p/hA)]$ depends on the parameters only through this group. Note that this quantity is the time constant of the system (see Sect. 1.4.5).
The geometric interpretation of ill-conditioning can be related to structural identifiability. As stated by Godfrey (1983), one can distinguish between three possible outcomes:
(i) The model parameters can be estimated uniquely, and the model is globally identifiable,
(ii) A finite number of alternative estimates of model parameters is possible, and the model is locally identifiable, or
(iii) An infinite number of model parameter estimates are possible, and the model is unidentifiable from the data (this is the over-parameterized case).
Example 9.2.1 (from Edwards and Penney (1996), by permission of Pearson Education) Consider the following first-order differential equation to illustrate the concept of a locally identifiable problem: $x y' = 2y$, whose solution is
$$ y(x) = Cx^2 \tag{9.2} $$
where C is a constant to be determined from the initial value. If $y(-1) = 1$, then C = 1. Thus, one has a unique solution $y(x) = x^2$ on the open interval to the left of the origin (which contains x = -1), and this solution also passes through the origin (see Fig. 9.1). But to the right of the origin, one may choose any value for C in Eq. 9.2.
Fig. 9.1 Plot of Example 9.2.1 used to illustrate local versus global identifiability
Three different solutions are shown in Fig. 9.1. Hence, though one has uniqueness of the solution near some sub-space, the solution may branch elsewhere, and the uniqueness property may be lost at the global level. ■
The problem of identifiability is almost a non-issue for simple models; for example, models such as (a) or (b) above, and even for Example 9.2.1. However, more complex models demand a formal method rather than dependence on ad hoc manipulation and inspection. The sensitivity coefficient approach allows formal testing to be performed (see Sect. 6.7.4 for a discussion on sensitivity coefficients). Consider a model y(t, b) where t is an independent variable and b is the parameter vector. The first derivative of y with respect to $b_j$ is its sensitivity coefficient and is designated by $(\partial y/\partial b_j)$. Sensitivity coefficients indicate the magnitude of change in the response y due to perturbations in the values of the parameters. Let i index the observations representing the range under which the experiment was performed. The condition for structural identifiability is that the sensitivity coefficients over the range of the observations should not be linearly dependent. Linear dependence is said to occur when, for p parameters in the model, the following relation holds for all observations i with constants $x_j$ that are not all zero (a formal proof is given by Beck and Arnold 1977):
$$ x_1 \frac{\partial y_i}{\partial b_1} + x_2 \frac{\partial y_i}{\partial b_2} + \ldots + x_p \frac{\partial y_i}{\partial b_p} = 0 \tag{9.3} $$
Example 9.2.2 The condition given by Eq. 9.3 is applied to model (c) above, namely y = b1/(b2 + b3t). Though mere inspection indicates that all three parameters b1, b2 and b3 cannot be individually
Fig. 9.2 Verifying linear dependence of model parameters of Eq. 9.6. (From Beck and Arnold 1977 by permission of Beck)
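identified, the question is “can the ratios $(b_2/b_1)$ and $(b_3/b_1)$ be determined under all conditions?” In this case, the sensitivity coefficients are:
$$ \frac{\partial y_i}{\partial b_1} = \frac{1}{b_2 + b_3 t_i}, \qquad \frac{\partial y_i}{\partial b_2} = \frac{-b_1}{(b_2 + b_3 t_i)^2} \qquad \text{and} \qquad \frac{\partial y_i}{\partial b_3} = \frac{-b_1 t_i}{(b_2 + b_3 t_i)^2} \tag{9.4} $$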
It is not clear whether there is linear dependence or not. One can verify this by assuming $x_1 = b_1$, $x_2 = b_2$ and $x_3 = b_3$. Then, the condition of Eq. 9.3 can be expressed as
$$ b_1 \frac{\partial y_i}{\partial b_1} + b_2 \frac{\partial y_i}{\partial b_2} + b_3 \frac{\partial y_i}{\partial b_3} = 0 \tag{9.5} $$
or
$$ z = z_1 + z_2 + z_3 = 0 \tag{9.6} $$
where $z_1 = \dfrac{\partial y_i}{\partial b_1}$, $z_2 = \dfrac{b_2}{b_1}\dfrac{\partial y_i}{\partial b_2}$, and $z_3 = \dfrac{b_3}{b_1}\dfrac{\partial y_i}{\partial b_3}$.
The above function can occur in various cases with linear dependence. Arbitrarily assuming $b_2 = 1$, the variation in the sensitivity coefficients, or more accurately those of $z_1$, $z_2$, and $z_3$, are plotted against $(b_3 t)$ in Fig. 9.2. One notes that z = 0 throughout the entire range, denoting, therefore, that identification of all three parameters is impossible. Further, inspection of Fig. 9.2 reveals that both $z_2$ and $z_3$ seem to have become constant, therefore becoming linearly dependent, for $b_3 t > 3$. This means that not only is it impossible to estimate all three parameters simultaneously from measurements of y and t, but that it is also impossible to estimate both $b_1$ and $b_3$ using data over the spatial range $b_3 t > 3$. ■
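The linear dependence found in Example 9.2.2 can also be checked numerically. The following is only a minimal sketch: the parameter values and the range of t are arbitrary assumptions chosen for illustration, and the helper name `sensitivity_matrix` is hypothetical. It stacks the three sensitivity coefficients of Eq. 9.4 as columns, evaluates the weighted sum of Eq. 9.5, and computes the numerical rank of the resulting matrix.

```python
import numpy as np

def sensitivity_matrix(b, t):
    """Columns are dy/db1, dy/db2, dy/db3 (Eq. 9.4) for y = b1 / (b2 + b3*t)."""
    b1, b2, b3 = b
    d = b2 + b3 * t
    return np.column_stack([1.0 / d, -b1 / d**2, -b1 * t / d**2])

# Illustrative values only (assumed, not taken from the text)
b = np.array([2.0, 1.0, 0.5])
t = np.linspace(0.0, 10.0, 50)

S = sensitivity_matrix(b, t)

# Eq. 9.5: with x_j = b_j the weighted sum vanishes at every observation
weighted_sum = S @ b

# A rank below the number of parameters signals structural non-identifiability
rank = np.linalg.matrix_rank(S)

print("max |weighted sum| =", np.abs(weighted_sum).max())
print("rank of sensitivity matrix =", rank, "for", S.shape[1], "parameters")
```

A rank of two for three parameters is consistent with the statement that only the two ratios $(b_2/b_1)$ and $(b_3/b_1)$ can be identified. With noisy or finite-difference sensitivities the rank test would need a tolerance, which is a judgment call.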
9.2.2
Ill-Conditioning
The condition of a mathematical equation relates to its sensitivity to changes in the data. A computation is numerically unstable with respect to round-off and truncation errors if these uncertainties are grossly magnified by the numerical method. Consider the first-order Taylor series equation:
$$ f(x) = f(x_0) + f'(x_0)(x - x_0) \tag{9.7} $$
The relative error of f(x) can be defined as:
$$ \varepsilon[f(x)] = \frac{f(x) - f(x_0)}{f(x_0)} \approx \frac{f'(x_0)(x - x_0)}{f(x_0)} \tag{9.8} $$
The relative error of x is given by
$$ \varepsilon(x) = \frac{x - x_0}{x_0} \tag{9.9} $$
The condition number is defined as the ratio of these relative errors:
$$ C_d = \frac{\varepsilon[f(x)]}{\varepsilon(x)} = \frac{x_0 f'(x_0)}{f(x_0)} \tag{9.10} $$
Thus, the condition number provides a measure of the extent to which an uncertainty in x is magnified by f(x). A value of 1 indicates that the function’s relative error is identical to the relative error in x. Functions with very large values are said to be ill-conditioned.
Example 9.2.3 Calculate the condition number of the function f(x) = tan x near x = (π/2). From Eq. 9.10, the condition number is
$$ C_d = \frac{x_0 \,(1/\cos^2 x_0)}{\tan x_0} $$
The condition numbers at the following values of x are:
(a) at $x_0 = \pi/2 + 0.1(\pi/2)$, $C_d = \dfrac{1.7279\,(40.86)}{-6.314} = -11.2$
(b) at $x_0 = \pi/2 + 0.01(\pi/2)$, $C_d = \dfrac{1.5865\,(4053)}{-63.66} = -101$
For case (b), the major source of ill-conditioning appears in the derivative, which is due to the singularity of the function close to (π/2). ■
This concept of ill-conditioning can be extended to sets of equations (Lipschutz 1966). The simplest case is a system of two linear equations in two unknowns, say x and y:
$$ \begin{aligned} a_1 x + b_1 y &= c_1 \\ a_2 x + b_2 y &= c_2 \end{aligned} \tag{9.11} $$
Three cases can arise which are best described geometrically.
(a) The system has exactly one solution, where both lines intersect (see Fig. 9.3a).
(b) The system has no solution; the lines are parallel as shown in Fig. 9.3b. This will arise when the slopes of the two lines are equal, but the intercepts are not, i.e., when $\dfrac{a_1}{a_2} = \dfrac{b_1}{b_2} \neq \dfrac{c_1}{c_2}$.
(c) The system has an infinite number of solutions since the lines coincide as shown in Fig. 9.3c. This will arise when both the slopes and the intercepts are equal, i.e., when $\dfrac{a_1}{a_2} = \dfrac{b_1}{b_2} = \dfrac{c_1}{c_2}$.
Consider a set of linear equations represented by
$$ Ax = b \quad \text{whose solution is} \quad x = A^{-1} b \tag{9.12} $$
Whether this set of linear equations can be solved or not is easily verified by computing the rank of the matrix A, which is the integer representing the order of the highest non-vanishing determinant. Consider the following matrix:
$$ \begin{bmatrix} 1 & 2 & -2 & 3 \\ 2 & -1 & 3 & -2 \\ -1 & 3 & 1 & -4 \\ 3 & 6 & -6 & 9 \end{bmatrix} $$
Since the first and last rows are identical except for a constant multiplier, or more correctly are “linearly dependent,” the rank is equal to 3. Computer programs would identify such deficiencies correctly and return an error message such as “matrix is not positive definite” indicating that the estimated data matrix is singular. Hence, one cannot solve for all four unknowns but only for three. However, such a test breaks down when dealing with real data which includes measurement as well as computation errors (during the computation of the inverse of the matrix). This is illustrated by using the same matrix, but with the last row corrupted by a small noise term of the order of only 5%:
$$ \begin{bmatrix} 1 & 2 & -2 & 3 \\ 2 & -1 & 3 & -2 \\ -1 & 3 & 1 & -4 \\ \frac{63}{20} & \frac{117}{20} & -\frac{123}{20} & \frac{171}{20} \end{bmatrix} $$
Most computer programs, if faced with this problem, would determine the rank of this matrix as 4. Hence, even small noise in the data can lead to misleading conclusions. The notion of condition number, introduced earlier, can be extended to include such situations. Recall that any set of linear equations of order n has n roots, either distinct or repeated, which are usually referred to as characteristic roots or eigenvalues. The stability or the robustness of the solution set, i.e., its closeness to singularity, can be characterized by the same concept of condition number ($C_d$) of the matrix A computed as:
Fig. 9.3 Geometric representation of a system of two linear equations: (a) the system has exactly one solution, (b) the system has no solution, (c) the system has an infinite number of solutions
$$ C_d = \left( \frac{\text{largest eigenvalue}}{\text{smallest eigenvalue}} \right)^{1/2} \tag{9.13} $$
The value of the condition number for the above matrix is Cd = 371.7. Hence, a small perturbation in b can induce a relative perturbation up to about 370 times greater in the solution of the system of equations. Thus, a fundamentally singular matrix will be signaled as an ill-conditioned (or badly conditioned) matrix due to round-off and measurement errors, and the analyst has to select (somewhat arbitrary) thresholds to determine whether the matrix is singular or merely ill-conditioned.
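The calculations of this section can be sketched in a few lines of Python with NumPy. This is an illustrative sketch only: the function name `scalar_condition_number` is hypothetical, and the matrix condition number is computed with `np.linalg.cond`, i.e., as the ratio of the largest to the smallest singular value, which corresponds to Eq. 9.13 if the eigenvalues are taken to be those of AᵀA. That interpretation is an assumption, so the value returned need not coincide with the figure quoted above if a different convention was used.

```python
import numpy as np

def scalar_condition_number(f, dfdx, x0):
    """Condition number of a scalar function, Eq. 9.10: Cd = x0 * f'(x0) / f(x0)."""
    return x0 * dfdx(x0) / f(x0)

# Example 9.2.3: f(x) = tan x evaluated close to pi/2
dtan = lambda x: 1.0 / np.cos(x) ** 2
cd_a = scalar_condition_number(np.tan, dtan, 1.10 * np.pi / 2)
cd_b = scalar_condition_number(np.tan, dtan, 1.01 * np.pi / 2)
print(f"Cd (a) = {cd_a:.1f}, Cd (b) = {cd_b:.1f}")

# Exactly singular matrix: the last row is three times the first row
A_exact = np.array([[1, 2, -2, 3],
                    [2, -1, 3, -2],
                    [-1, 3, 1, -4],
                    [3, 6, -6, 9]], dtype=float)

# Same matrix with the last row corrupted by noise of about 5%
A_noisy = A_exact.copy()
A_noisy[3] = np.array([63, 117, -123, 171]) / 20.0

rank_exact = np.linalg.matrix_rank(A_exact)
rank_noisy = np.linalg.matrix_rank(A_noisy)

# 2-norm condition number: largest singular value / smallest singular value
cond_noisy = np.linalg.cond(A_noisy)

print("rank (exact) =", rank_exact, " rank (noisy) =", rank_noisy)
print("condition number (noisy) =", cond_noisy)
```

The rank test alone reports the corrupted matrix as full rank; it is the large condition number that reveals how close the matrix still is to being singular.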
9.2.3
Numerical Identifiability
The concept of ill-conditioning impacts parameter estimability in yet another manner when the regressors are correlated. Even when the disturbing noise or experimental error in the system is low, OLS may not be adequate to identify the parameters with low variance because of multicollinearity effects between regressor variables. As the noise becomes more significant, the standard errors increase and so elaborate numerical schemes such as iterative methods or multi-step methods have to be used. Recall from Sect. 5.4.3 that the parameter estimator vector is given by
$$ b = (X^T X)^{-1} X^T Y = C\, X^T Y \tag{Eq. 5.36} $$
while
$$ \mathrm{var}(b) = \sigma^2 (X^T X)^{-1} = \sigma^2 C \tag{Eq. 5.37} $$
where the matrix $C = (X^T X)^{-1}$ is called the variance-covariance matrix. Numerical non-identifiability, also called “redundancy,” is defined as the inability to obtain proper parameter estimates from the data even if the experiment is structurally identifiable. This can arise when the matrix $X^T X$ becomes close to singular, so that its inverse C is either undetermined or assumes very large numerical values. Such a condition, associated with inadequate richness in the data rather than with model mis-specification, is referred to as ill-conditioned data (or, more loosely, as weak data). If OLS estimation is used with ill-conditioned data, the parameters, though unbiased, are not efficient (unstable with large variance). More importantly, OLS formulae would understate both the standard errors and the model’s prediction uncertainty bands (even though the overall fit may be satisfactory). One option is not to rely on such closed form formulae but to adopt a
resampling technique (Sect. 5.8). Other recourses are to take additional pertinent measurements and/or to simplify or aggregate the model structure so as to remove some of the collinear variables from the model. How to deal with such situations is discussed in Sect. 9.3. There are three commonly used diagnostic measures to evaluate the magnitude of ill-conditioning; all three depend on the matrix C (Belsley et al. 1980).
(i) The correlation matrix R is the normalized version of matrix C where the X data are centered and scaled to have unit length:
$$ R = (X^T X)^{-1} \tag{9.14} $$
With this scaling, the elements of $X^T X$ are the pairwise correlation coefficients between the regressors. This allows one to investigate correlation between pairs of regressors in a quantitative manner but may be of limited use in assessing the magnitude of overall multicollinearity of the regressor set.
(ii) Variance inflation factors provide a better quantitative measure of the overall collinearity in the regressor set. The diagonal elements of the R matrix are:
$$ R_{jj} = \frac{1}{1 - R_j^2}, \qquad j = 1, 2, \ldots, k \tag{9.15} $$
where $R_j^2$ is the coefficient of “multiple determination” resulting from regressing $x_j$ on the other (k - 1) variables. Clearly, the stronger the linear dependency of $x_j$ on the remaining regressors, the larger the value of $R_j^2$. The variance of $b_j$ is said to be “inflated” by the quantity $1/(1 - R_j^2)$. The variance inflation factors are given by VIF($b_j$) = $R_{jj}$. The VIF allows one to look at the joint relationship among a specified regressor and all other regressors. Its weakness, like that of the coefficient of determination R2, is its inability to distinguish among several coexisting near-dependencies and to assign meaningful thresholds between high and low VIF values (Belsley et al. 1980). Many texts suggest rules of thumb: any VIF > 10 or sum of VIF > 10 suggests strong ill-conditioning, while 5 < VIF < 10 indicates a moderate problem.
(iii) Condition number, given by Eq. 9.13, is also widely used by analysts to determine robustness of the parameter estimates, i.e., how low their standard errors are, since it provides a measure of the joint relationship among regressors. Some heuristic suggestions have been proposed. Evidence of collinearity is suggested for condition numbers > 15, and corrective action is warranted when the value exceeds 30 or so (Chatterjee and Price 1991).
To summarize, ill-conditioning of a matrix X is said to occur when one or more columns can be expressed as linear combinations of the other columns, i.e., det (XTX) = 0 or close to 0. Possible causes are that either the data set is inadequate, or the model is over-specified, i.e., too many parameters have been included in the model. Possible remedies one should investigate are: (i) collecting more data, or (ii) dropping variables from the model based on physical insights.
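The three diagnostics discussed in this section can be illustrated with a short numerical sketch. The data below are synthetic and purely an assumption for illustration (two strongly correlated regressors and one roughly independent regressor); the helper name `vif_by_regression` is hypothetical. The sketch computes the VIFs in two ways, as the diagonal of the inverse of the regressor correlation matrix and as $1/(1 - R_j^2)$ from auxiliary regressions (Eq. 9.15), and evaluates the condition number of Eq. 9.13 from the eigenvalues of the correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic regressors (assumed for illustration): x2 nearly duplicates x1
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Center each column and scale it to unit length, so Z'Z is the correlation matrix
Z = X - X.mean(axis=0)
Z = Z / np.linalg.norm(Z, axis=0)
corr = Z.T @ Z

# Diagnostic (ii): VIFs as the diagonal of the inverse correlation matrix
vif = np.diag(np.linalg.inv(corr))

def vif_by_regression(Z, j):
    """VIF of regressor j as 1 / (1 - R_j^2), regressing column j on the others."""
    others = np.delete(Z, j, axis=1)
    coef, *_ = np.linalg.lstsq(others, Z[:, j], rcond=None)
    resid = Z[:, j] - others @ coef
    r2 = 1.0 - (resid @ resid) / (Z[:, j] @ Z[:, j])
    return 1.0 / (1.0 - r2)

vif_check = np.array([vif_by_regression(Z, j) for j in range(Z.shape[1])])

# Diagnostic (iii): condition number from the eigenvalues of the correlation matrix
eigvals = np.linalg.eigvalsh(corr)
cond_number = np.sqrt(eigvals.max() / eigvals.min())

print("pairwise correlations:\n", np.round(corr, 3))
print("VIF (inverse correlation matrix):", np.round(vif, 2))
print("VIF (auxiliary regressions):     ", np.round(vif_check, 2))
print("condition number:", round(float(cond_number), 2))
```

Because the columns are centered, the auxiliary regressions need no intercept. The two VIF calculations agree to rounding, which is the identity expressed by Eq. 9.15.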
9.3
Dealing with Collinear Regressors: Variable Selection and Shrinkage
9.3.1
Problematic Issues
Another major problem with multivariate regression is that the “independent” variables are often not really independent but collinear to some extent (hence the suggestion that the term “regressors” be used instead of “independent variables”). The robustness of the model parameters is adversely affected, as discussed in the above section. Strong collinearity has the result that the variables are “essentially” influencing or explaining the same system behavior. For linear models, the Pearson correlation coefficient (presented in Sect. 3.4.2) provides the necessary indication of the strength of this overlap. Non-linear collinearity is discussed in Sect. 9.5. This issue of collinearity between regressors is a very common phenomenon which has important implications during model building and parameter estimation. Due to the large standard errors associated with the parameter estimates, OLS regression coefficients can even have the wrong sign. Note that this could also happen if the range of variation in the regressor variables is too small, or if some important regressor variable has been left out.
Example 9.3.1 Consider the simple example of a linear model with two regressors, both of which are positively correlated with the response variable y. The data consist of six samples as shown in Table 9.1. The pairwise plots of Fig. 9.4 clearly depict the fairly strong relationship between the response y and variable x1 and that between the two regressors.
Table 9.1 Data table for Example 9.3.1
y    x1   x2
2    1    2
2    2    3
3    2    1
3    5    5
5    4    6
6    5    4
Fig. 9.4 Data for Example 9.3.1 to illustrate how multicollinearity in the regressors could result in model coefficients with wrong signs
Table 9.2 Correlation matrix for Example 9.3.1
      x1      x2      y
x1    1.000   0.776   0.742
x2            1.000   0.553
y                     1.000
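As a numerical check on this example, the sketch below recomputes the correlation coefficients of Table 9.2 from the data of Table 9.1 and fits the two-regressor linear model by OLS. It assumes NumPy and uses `np.linalg.lstsq` as one convenient way of obtaining the least squares solution; the variable names are illustrative. The point to note in the output is the sign of the coefficient of x2.

```python
import numpy as np

# Data of Table 9.1
y = np.array([2, 2, 3, 3, 5, 6], dtype=float)
x1 = np.array([1, 2, 2, 5, 4, 5], dtype=float)
x2 = np.array([2, 3, 1, 5, 6, 4], dtype=float)

# Pearson correlation matrix, rows/columns ordered as (x1, x2, y)
corr = np.corrcoef(np.vstack([x1, x2, y]))

# OLS fit of y = b0 + b1*x1 + b2*x2
X = np.column_stack([np.ones_like(y), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

print("correlation matrix (x1, x2, y):\n", np.round(corr, 3))
print("OLS coefficients (b0, b1, b2):", np.round(b, 2))
```

Both regressors are positively correlated with y, yet the fitted coefficient of x2 is slightly negative, which is the wrong-sign symptom of collinearity that this example illustrates.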
From the correlation matrix for these data (Table 9.2), the correlation coefficient between the two regressors is 0.776, which can be considered to be of moderate strength. An OLS regression results in the following model:
$$ y = 1.30 + 0.75\,x_1 - 0.05\,x_2 \tag{9.16} $$
The model identified suggests a negative correlation between y and x2, which is contrary to both the correlation coefficient matrix and the graphical trend in Fig. 9.4. This irrationality is the result of the high inter-correlation between the regressor variables. What has occurred is that the matrix (X′X), whose inverse yields the variance-covariance matrix of the estimated regression coefficients, has become ill-conditioned and unstable. A simple explanation is that x1 has usurped more than its appropriate share of the explicative power of y to the detriment of x2 which, then, had to correct itself to such a degree that it ended up assuming a negative correlation. ■
Mullet (1976), discussing why regression coefficients in the physical sciences often have wrong signs, quotes: (i) Marquardt, who postulated that multicollinearity is likely to be a problem only when correlation coefficients among regressor variables are higher than 0.95, and (ii) Snee, who used 0.9 as the cutoff point. On the other hand, Draper and Smith (1981) state that multicollinearity is likely to be a problem if the simple correlation between two variables is larger than the correlation of any variable with the dependent variable. In summary, significant collinearity between regressor variables is likely to lead to two different problems:
(i) Though the model may provide a good fit to the current data, its usefulness as a reliable predictive model is suspect. The regression coefficients and the model predictions tend to have large standard errors and uncertainty bands which make the model unstable. It is imperative that sample cross-validation be performed to identify a suitable model (see Example 5.8.1 and the case study presented in Sect. 9.3.4).
(ii) The regression coefficients in the model are no longer proper indicators of the relative physical importance of the regressor parameters.
9.3.2
Principal Component Analysis and Regression
Principal component analysis (PCA) is one of the best-known multivariate methods for removing the adverse effects of collinearity, while reducing data dimension and summarizing the main aspects of the variation in the regressor set (see, for example, Draper and Smith 1981 or Chatterjee and Price 1991). It has a simple intuitive appeal, and though very useful in certain disciplines (such as the social sciences), its use has been rather limited in engineering applications. It is not a statistical method leading to a decision on a hypothesis, but a general method of identifying which regressors are collinear and reducing the data dimension. This reduction in dimensionality is sometimes useful for gaining insights into the behavior of the data set. It also allows for more robust model building, an aspect which is discussed below. Note that PCA is an unsupervised technique, i.e., an analysis method which does not make use of the response or target variable for dimension reduction or for clustering (regression analysis, on the other hand, is a supervised learning problem since it makes use of the outcome or response variable which is to be modeled and then predicted). It is entirely based on the regressor data set. However, the PCA transformed data set can subsequently be used in a regression context, as will be illustrated later on. The premise in PCA is that the variance in the collinear multi-dimension data comprising the regressor variable vector X can be reframed in terms of a set of orthogonal
9.3 Dealing with Collinear Regressors: Variable Selection and Shrinkage
Fig. 9.5 Geometric interpretation of what a PCA analysis does in terms of variable transform for the case of two variables. (a) Original data set, (b) Rotated data set. The rotation has resulted in the primary axis explaining a major part of the variability in the original data with the rest explained by the second rotated axis. Reduction in dimensionality can be achieved by accepting a little loss in the variability contained in the original data set and dropping the second rotated variable altogether
(or uncorrelated) transformed variable vector U. This vector will then provide a means of retaining only a subset of variables which explain most of the variability in the data. Thus, the dimension of the data will be reduced without losing much of the information (reflected by the variability in the data) contained in the original data set, thereby allowing a more robust model to be subsequently identified. A simple geometric explanation of the procedure allows better conceptual understanding of the method. Consider the two-dimensional data shown in Fig. 9.5. One notices that much of the variability in the data occurs along one dimension/direction/variable. If one were to rotate the orthogonal axes such that the major u1 axis lies in the direction of greatest data variability (see Fig. 9.5b), most of this variability will become unidirectional with little variability being left for the orthogonal u2 axis to account for. The variability in the two-dimensional original data set is, thus, largely accounted for by only one variable, i.e., the transformed variable u1. The real power of this method becomes apparent when one has a large number of dimensions; in such cases one needs to have some mathematical means of ascertaining the degree of variation in the multivariate data along different dimensions. This is achieved by looking at the eigenvalues. The eigenvalue can be viewed as indicative of the length of the axis, while the eigenvector specifies the direction of rotation. Usually, PCA is done with standardized variables Z instead of the original variables X such that variables Z have zero mean and unit variance. Recall that the eigenvalues λ (also called characteristic roots or latent roots) and the eigenvector A of a matrix Z are defined by:
$$ZA = \lambda A \tag{9.17}$$
The eigenvalues are the solutions of the determinant equation formed from the covariance matrix of Z:
$$\left| Z'Z - \lambda I \right| = 0 \tag{9.18}$$
where I is a diagonal unity matrix (i.e., its diagonal elements are 1 and off-diagonal elements are 0). The distinction between the covariance and the correlation coefficient between two variables has been discussed in Sect. 3.4.2 (and given by Eqs. 3.11 and 3.12). The correlation matrix R has been introduced earlier (Eq. 9.14) as a symmetric matrix of the correlation coefficients between the numerous individual pairs of regressor variables. Because the original data or regressor set X is standardized, an important property of the eigenvalues is that their sum is equal to the trace of the correlation matrix R:
$$\lambda_1 + \lambda_2 + \dots + \lambda_p = p \tag{9.19}$$
where p is the dimension or number of variables. This follows from the fact that each diagonal element of a correlation matrix is unity, so that its trace equals p. Usually, the eigenvalues are ranked such that the first has the largest numerical value, the second the second largest, and so on. The corresponding eigenvector represents the coefficients of the principal components (PC). Thus, the linear transformation for the PC from the original vector of standardized variables Z can be represented by:
$$\begin{aligned}
\text{PC1}:\quad u_1 &= a_{11} z_1 + a_{12} z_2 + \dots + a_{1p} z_p \quad \text{subject to } a_{11}^2 + a_{12}^2 + \dots + a_{1p}^2 = 1 \\
\text{PC2}:\quad u_2 &= a_{21} z_1 + a_{22} z_2 + \dots + a_{2p} z_p \quad \text{subject to } a_{21}^2 + a_{22}^2 + \dots + a_{2p}^2 = 1 \\
&\;\;\vdots \quad \text{to PC}p
\end{aligned} \tag{9.20}$$
where $a_{ij}$ are called the component weights. Thus, the correlation matrix for the standardized and rotated variables is now transformed into:
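$$R = \begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \lambda_p \end{bmatrix} \quad \text{where } \lambda_1 > \lambda_2 > \dots > \lambda_p \tag{9.21}$$
9 Parametric and Non-Parametric Regression Methods
Table 9.3 How to interpret PCA results and determine thresholds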
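The relations in Eqs. 9.17 to 9.21 can be checked numerically. The following is a minimal sketch, not the procedure of any particular statistical package: the function name `pca_from_correlation` and the synthetic three-variable data set are assumptions introduced purely for illustration. The sketch standardizes the regressors, extracts the eigenvalues and eigenvectors of the correlation matrix, and forms the rotated variables U.

```python
import numpy as np

def pca_from_correlation(X):
    """Return eigenvalues (descending), component weights, and PC scores."""
    # Standardize each column to zero mean and unit variance (the Z variables)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    n = Z.shape[0]
    # The correlation matrix of X is the covariance matrix of Z
    R = (Z.T @ Z) / (n - 1)
    # eigh is intended for symmetric matrices and returns ascending eigenvalues
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Column j of eigvecs holds the component weights of PC(j+1)
    U = Z @ eigvecs
    return eigvals, eigvecs, U

# Synthetic, deliberately collinear data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

eigvals, A, U = pca_from_correlation(X)
p = X.shape[1]
cov_U = np.cov(U, rowvar=False)  # covariance matrix of the rotated variables

print("eigenvalues:", eigvals)
print("sum of eigenvalues:", eigvals.sum(), "for p =", p)
print("covariance of rotated variables:\n", np.round(cov_U, 6))
```

The printed covariance matrix of U should be diagonal with the eigenvalues on the diagonal, which is the numerical counterpart of Eq. 9.21, and the eigenvalues should sum to p as stated in Eq. 9.19.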
Note that the off-diagonal terms are zero because the variable vector U is orthogonal. Further, note that the eigenvalues represent the variability of the data along the principal components. If one keeps all the PCs, nothing is really gained in terms of reduction in dimensionality, even though they are orthogonal (i.e., uncorrelated), and the model building by regression will be more robust. Model reduction is done by rejecting those transformed variables U which exhibit little variance (and hence contribute little to the model). Since the eigenvalues are ranked, PC1 explains the most variability in the original data while each succeeding eigenvalue accounts for increasingly less. A typical rule of thumb to determine the cutoff is to drop any factor which explains less than (1/p) of the variability, where p is the number of parameters or the original dimension of the regressor data set. PCA has been presented as an approach which allows the dimensionality of the multivariate data to be reduced while yielding uncorrelated regressors. Unfortunately, in most cases, the physical interpretation of the X variables, which often represent physical quantities, is lost as a result of the rotation. A few textbooks (Manly 2005) provide examples where the new rotated variables retain some measure of physical interpretation, but these are the exception rather than the rule in the physical sciences. In any case, the reduced set of transformed variables can now be used to identify multivariate models that are, often but not always, more robust. Example 9.3.2 Consider Table 9.3 where PC rotation has already been performed. The original data set contained nine variables which were first standardized, and a PCA resulted in the variance values as shown in the table. One notices that PC1 explains 41% of the variation, PC2 23%, and so on till all nine PC explain 100% i.e., all the variation present in the original data. 
Had the nine original variables been independent (uncorrelated), each PC would have explained on average (1/p) = (1/9) = 11% of the variance. The eigenvalues listed in the table correspond to the number of variables which would have explained an equivalent amount of variation in the data that is attributed to the corresponding PC. For example, the first eigenvalue is determined as: 41/(100/9) = 3.69, i.e., PC1 has the explicative power of 3.69 of the original variables, and so on. The above manner of studying the relative influence of each PC allows heuristic thresholds to be defined. The typical
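| Extracted factor | % of total variance accounted for (incremental) | % of total variance accounted for (cumulative) | Eigenvalue (incremental) | Eigenvalue (cumulative) |
|---|---|---|---|---|
| PC1 | 41 | 41 | 3.69 | 3.69 |
| PC2 | 23 | 64 | 2.07 | 5.76 |
| PC3 | 14 | 78 | 1.26 | 7.02 |
| PC4 | 7 | 85 | 0.63 | 7.65 |
| PC5 | 5 | 90 | 0.45 | 8.10 |
| PC6 | 4 | 94 | 0.36 | 8.46 |
| PC7 | 3 | 97 | 0.27 | 8.73 |
| PC8 | 2 | 99 | 0.18 | 8.91 |
| PC9 | 1 | 100 | 0.09 | 9.00 |

From Kachigan 1991 by permission of Kachigan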
rule of thumb as stated above would result in all PC whose eigenvalues are less than 1.0 being omitted from the reduced multivariate data set. This choice can be defended on the grounds that an eigenvalue of less than 1.0 implies that the PC explains less than would a single original untransformed variable, and so retaining it would be illogical since it would defeat the basic purpose, i.e., achieving a reduction in the dimensionality of the data. However, this is to be taken as a heuristic criterion and not as a hard and fast rule. A convenient visual indication of how higher factors contribute increasingly less to the variance in the multivariate data can be obtained from a scree plot generated by most PCA software. This is simply a plot of the eigenvalues versus the PC (i.e., the first and fourth columns of Table 9.3), as illustrated in the example below. ■
Example 9.3.3 Reduction in dimensionality using PCA for actual chiller data
Consider the data assembled in Table 9.4, which consists of 15 possible variables or characteristic features (CFs) under 27 different operating conditions of a centrifugal chiller. With the intention of reducing the dimensionality of the data set, a PCA is performed to determine an optimum set of principal components. Pertinent steps will be shown and the final selection using the eigenvalues and the scree plot will be justified. Also, the component weights table, which assembles the PC models, will be discussed. A principal component analysis is performed with the purpose of obtaining a small number of linear combinations of the 15 variables which account for most of the variability in the data. From the eigenvalue table (Table 9.5) as well as the scree plot shown (Fig. 9.6), one notes that there are three components with eigenvalues greater than or equal to 1.0, and that together they account for 95.9% of the variability in
Table 9.4 Data table with 15 regressor variables for Example 9.3.3 (footnote a). Each row lists one characteristic feature (CF) across the 27 operating conditions
CF1 3.765 3.405 2.425 4.512 3.947 2.434 4.748 4.513 3.503 3.593 3.252 2.463 4.274 3.678 2.517 4.684 4.641 3.038 3.763 3.342 2.526 4.411 4.029 2.815 4.785 4.443 3.151
CF2 5.529 3.489 1.809 6.240 3.530 1.511 5.087 3.462 2.153 5.033 3.466 1.956 6.108 3.330 1.644 5.823 4.002 1.828 5.126 3.344 1.940 6.244 3.717 1.886 5.528 3.882 2.010
CF3 5.254 3.339 1.832 5.952 3.338 1.558 4.733 3.197 2.053 4.844 3.367 2.004 5.818 3.228 1.714 5.522 3.714 1.796 4.924 3.318 2.053 5.938 3.559 1.964 5.203 3.679 2.054
CF4 3.244 3.344 3.500 2.844 3.322 3.633 3.156 3.444 3.789 2.122 2.122 2.233 2.056 2.089 2.256 2.122 2.456 2.689 1.400 1.567 1.378 1.522 1.178 1.378 1.611 1.933 1.656
CF5 15.078 19.233 31.333 12.378 18.756 35.533 14.478 19.511 28.522 13.900 17.944 27.678 11.944 17.622 30.967 12.089 16.144 29.767 12.744 16.933 25.944 11.689 14.933 26.333 11.756 15.578 25.367
CF6 4.911 3.778 2.611 5.800 3.567 1.967 4.589 3.356 2.244 4.878 3.700 2.578 5.422 3.389 2.133 4.989 3.589 1.989 4.656 3.456 2.600 5.411 3.844 2.122 5.100 3.556 2.333
CF7 2.319 1.822 1.009 3.376 1.914 0.873 2.752 1.892 1.272 2.706 1.720 1.102 3.323 1.907 1.039 3.140 2.188 1.061 2.687 1.926 1.108 3.383 2.128 0.946 3.052 2.121 1.224
CF8 5.473 4.550 3.870 5.131 4.598 3.821 5.060 4.716 4.389 3.796 3.111 2.540 4.072 3.066 2.417 4.038 3.829 3.001 2.541 2.324 1.519 3.193 1.917 1.394 2.948 3.038 1.859
CF9 83.069 73.843 73.652 71.025 71.096 72.116 70.186 69.695 68.169 72.395 76.558 73.381 71.002 70.252 69.184 71.271 70.354 70.279 73.612 70.932 74.649 70.782 69.488 79.851 69.998 70.939 69.686
CF10 39.781 32.534 21.867 45.335 32.443 18.966 39.616 31.321 22.781 48.016 42.664 32.383 52.262 41.724 29.607 50.870 40.714 26.984 62.921 50.601 46.588 61.595 61.187 48.676 59.904 46.667 41.716
CF11 0.707 0.603 0.422 0.750 0.568 0.335 0.665 0.498 0.347 0.722 0.626 0.472 0.751 0.593 0.383 0.732 0.591 0.347 0.733 0.631 0.476 0.749 0.628 0.420 0.717 0.612 0.409
CF12 0.692 0.585 0.397 0.735 0.550 0.311 0.649 0.481 0.326 0.706 0.607 0.446 0.739 0.573 0.361 0.721 0.574 0.327 0.721 0.611 0.453 0.740 0.609 0.398 0.704 0.597 0.390
CF13 1.090 0.720 0.392 1.260 0.779 0.362 1.107 0.844 0.569 0.990 0.707 0.415 1.235 0.722 0.385 1.226 0.928 0.470 1.030 0.698 0.421 1.282 0.817 0.448 1.190 0.886 0.500
CF14 5.332 4.977 3.835 6.435 5.846 3.984 6.883 6.691 5.409 5.211 4.787 3.910 6.098 5.554 4.126 6.673 6.728 4.895 5.426 5.073 4.122 6.252 6.035 4.618 6.888 6.553 5.121
CF15 0.706 0.684 0.632 0.701 0.675 0.611 0.690 0.674 0.647 0.689 0.679 0.630 0.701 0.662 0.610 0.702 0.690 0.621 0.694 0.659 0.613 0.705 0.667 0.609 0.694 0.678 0.615
From Reddy 2007 from measured data supplied by James Braun
a Data available electronically on book website
Table 9.5 Eigenvalue table for Example 9.3.3

| Component number | Eigenvalue | Percent of variance | Cumulative percentage |
|---|---|---|---|
| 1 | 10.6249 | 70.833 | 70.833 |
| 2 | 2.41721 | 16.115 | 86.947 |
| 3 | 1.34933 | 8.996 | 95.943 |
| 4 | 0.385238 | 2.568 | 98.511 |
| 5 | 0.150406 | 1.003 | 99.514 |
| 6 | 0.0314106 | 0.209 | 99.723 |
| 7 | 0.0228662 | 0.152 | 99.876 |
| 8 | 0.00970486 | 0.065 | 99.940 |
| 9 | 0.00580352 | 0.039 | 99.979 |
| 10 | 0.00195306 | 0.013 | 99.992 |
| 11 | 0.000963942 | 0.006 | 99.998 |
| 12 | 0.000139549 | 0.001 | 99.999 |
| 13 | 0.0000495725 | 0.000 | 100.000 |
| 14 | 0.0000261991 | 0.000 | 100.000 |
| 15 | 0.0000201062 | 0.000 | 100.000 |
the original data. Hence, it is safe to only retain three components. The equations of the principal components can be deduced from the table of component weights shown (Table 9.6). For example, the first principal component can be written as shown by Eq. (9.22).
$$\begin{aligned}
\text{PC1} ={}& 0.268037\,\text{CF1} + 0.215784\,\text{CF10} + 0.294009\,\text{CF11} + 0.29512\,\text{CF12} + 0.302855\,\text{CF13} \\
&+ 0.247658\,\text{CF14} + 0.29098\,\text{CF15} + 0.302292\,\text{CF2} + 0.301159\,\text{CF3} - 0.06738\,\text{CF4} \\
&- 0.297709\,\text{CF5} + 0.297996\,\text{CF6} + 0.301394\,\text{CF7} + 0.123134\,\text{CF8} - 0.0168\,\text{CF9}
\end{aligned} \tag{9.22}$$
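The entries of Tables 9.5 and 9.6 can be cross-checked with a few lines of arithmetic. The sketch below copies the eigenvalues of Table 9.5 and the Component 1 weights of Eq. 9.22; the variable names are assumptions made for illustration. It recomputes the percentage of variance as 100 × eigenvalue / p, applies the eigenvalue ≥ 1.0 rule of thumb, and checks the unit-length constraint of Eq. 9.20 on the PC1 weights.

```python
import numpy as np

# Eigenvalues copied from Table 9.5 (15 standardized variables)
eigenvalues = np.array([
    10.6249, 2.41721, 1.34933, 0.385238, 0.150406,
    0.0314106, 0.0228662, 0.00970486, 0.00580352, 0.00195306,
    0.000963942, 0.000139549, 0.0000495725, 0.0000261991, 0.0000201062,
])
p = eigenvalues.size

# Each eigenvalue expressed as a percentage of the total variance (p units)
percent = 100.0 * eigenvalues / p
cumulative = np.cumsum(percent)

# Rule of thumb: retain components whose eigenvalue is at least 1.0
n_retained = int(np.sum(eigenvalues >= 1.0))

# Component 1 weights from Eq. 9.22, in the order CF1, CF10, ..., CF9
pc1_weights = np.array([
    0.268037, 0.215784, 0.294009, 0.29512, 0.302855,
    0.247658, 0.29098, 0.302292, 0.301159, -0.06738,
    -0.297709, 0.297996, 0.301394, 0.123134, -0.0168,
])
pc1_norm_sq = float(np.sum(pc1_weights ** 2))

print("sum of eigenvalues:", eigenvalues.sum())
print("percent of variance (first three):", np.round(percent[:3], 3))
print("cumulative percent after retained PCs:", round(cumulative[n_retained - 1], 3))
print("components retained:", n_retained)
print("sum of squared PC1 weights:", round(pc1_norm_sq, 4))
```

Small departures from exactly 15 for the eigenvalue sum, or from exactly 1 for the sum of squared weights, would only reflect rounding in the tabulated values.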
Note that the values of the variables in the equation are standardized by subtracting their means and dividing by their standard deviations. ■ Thus, in summary, PCA takes a group of n original regressor variables and re-expresses them as another set of n transformed variables, each of which represents a linear combination of the original variables. These transformed
variables, which retain all the information found in the original regressor variables, are known as principal components (PC) and have several useful properties: (i) they are uncorrelated with one another, and (ii) they are ordered so that the first PC explains the largest proportion of the variation of the original data, the second PC explains the next largest proportion, and so on. When the original variables are highly correlated, the variance explained by many of the later PC will be so small that they can be ignored. Consequently, the number of regressor variables in the model can be reduced with little loss in model goodness-of-fit. The same reduction in dimensionality also removes the collinearity between the regressors, and it is the expectation that this would lead to more stable parameter estimation and robust model identification. Such an expectation, often stated in textbooks, should not be taken at face value, and an analysis as illustrated in Sect. 9.3.4 is warranted. The PC regression coefficients retained in the model are said to be more stable and, when the resulting model with the PC variables is transformed back in terms of the original regressor variables, the expectation is that the coefficients would offer more realistic insight into how the individual physical variables influence the response variable. PCA has been shown to be useful in social and other soft sciences as a way of finding effective combinations of variables. If the principal components could be interpreted in physical terms, PCA would be a more valuable tool. Unfortunately, this is often not the case. Draper and Smith (1981) caution that PCA may be of limited usefulness in physical engineering sciences, contrary to social sciences where models are generally weak and numerous correlated regressors tend to be included in the model.

Fig. 9.6 Scree plot of Table 9.5 data

Table 9.6 Component weights for Example 9.3.3

| Variable | Component 1 | Component 2 | Component 3 |
|---|---|---|---|
| CF1 | 0.268037 | 0.126125 | -0.303013 |
| CF10 | 0.215784 | -0.447418 | -0.0319413 |
| CF11 | 0.294009 | -0.0867631 | 0.155947 |
| CF12 | 0.29512 | -0.0850094 | 0.149828 |
| CF13 | 0.302855 | 0.0576253 | -0.0135352 |
| CF14 | 0.247658 | 0.117187 | -0.388742 |
| CF15 | 0.29098 | 0.133689 | 0.08111 |
| CF2 | 0.302292 | 0.0324667 | 0.0801237 |
| CF3 | 0.301159 | 0.0140657 | 0.0975298 |
| CF4 | -0.06738 | 0.620147 | 0.100548 |
| CF5 | -0.297709 | 0.0859571 | 0.00911293 |
| CF6 | 0.297996 | 0.0180763 | 0.130096 |
| CF7 | 0.301394 | 0.0112407 | -0.0443937 |
| CF8 | 0.123134 | 0.576743 | 0.151909 |
| CF9 | -0.0168 | -0.0890017 | 0.796502 |

Reddy and Claridge (1994) conducted synthetic experiments in an effort to evaluate the benefits of PCA against multiple linear regression (MLR) for modeling energy use in buildings, and reached the conclusion that only when the data is poorly explained by the MLR model and when correlation strengths among regressors
were high, was there a possible benefit to PCA over MLR; with, however, the caveat that injudicious use of PCA may exacerbate rather than overcome problems associated with multicollinearity.
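The use of retained principal components as regressors, followed by a transformation back to the original standardized variables, can be sketched as follows. This is only an illustrative outline under stated assumptions: the function `pc_regression`, its `min_eigenvalue` argument, and the synthetic data with two collinear pairs of regressors are hypothetical and are not taken from the studies cited above.

```python
import numpy as np

def pc_regression(X, y, min_eigenvalue=1.0):
    """Regress y on the PCs whose eigenvalue is at least min_eigenvalue."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized regressors
    R = (Z.T @ Z) / (len(Z) - 1)                       # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals >= min_eigenvalue                   # heuristic cutoff
    A = eigvecs[:, keep]                               # retained component weights
    U = Z @ A                                          # retained PC scores
    design = np.column_stack([np.ones(len(U)), U])
    gamma, *_ = np.linalg.lstsq(design, y, rcond=None) # OLS on the PC scores
    # Since U = Z A, the model y = g0 + U g equals y = g0 + Z (A g)
    beta_std = A @ gamma[1:]
    return {"Z": Z, "eigvals": eigvals, "gamma": gamma,
            "beta_std": beta_std, "fitted": design @ gamma,
            "n_kept": int(keep.sum())}

# Synthetic data with two nearly collinear pairs (hypothetical values)
rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
x4 = x3 + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3, x4])
y = 5.0 + 2.0 * x1 + 2.0 * x2 - 1.0 * x3 - 1.0 * x4 + rng.normal(size=n)

result = pc_regression(X, y)
print("eigenvalues:", np.round(result["eigvals"], 3))
print("components kept:", result["n_kept"])
print("coefficients on standardized regressors:", np.round(result["beta_std"], 3))
```

Whether such a reduced model actually predicts better than MLR on the original regressors has to be checked case by case, which is the point made by the caveat above.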
9.3.3
Ridge and Lasso Regression
A regression model with collinear regressors will result in unbiased model parameter estimates, but unfortunately, these estimates will have large variances and, hence, the model will have large prediction intervals. A well established and popular approach is to use ridge regression (see, for example, Chatterjee and Price 1991; Draper and Smith 1981). This method does not perform a dimension reduction (as did the PCA approach) but uses the entire set of regressors as is. Further, contrary to PCA, ridge regression is a supervised regression method in that it uses the response or target variable data as well. This method results in more stable estimates than those of OLS in the sense that their confidence intervals are narrower and the estimates more robust, i.e., less affected by slight variations in the estimation data. There are several alternative ways of defining and computing ridge estimates; the ridge trace is perhaps the most intuitive. It is best understood in the context of a graphical representation which clarifies both the bias and estimation concepts. Since the matrix (X′X) is close to singular (its determinant is close to zero) for collinear regressors, the approach involves introducing a known amount of "noise" via a factor k, leading to the matrix inverse becoming less sensitive to multicollinearity. Recall from Sect. 5.3.1 that the OLS procedure involves minimizing the following residual sum of squares expression to estimate the parameters $b_j$:
$$\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 = \sum_{i=1}^{n} \left( y_i - b_0 - \sum_{j=1}^{p} b_j x_{ij} \right)^2 \tag{9.23}$$
Ridge regression is similar but with a bias term added (similar to the penalty function optimization method; see Sect. 7.3.5):
$$\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 = \sum_{i=1}^{n} \left( y_i - b_0 - \sum_{j=1}^{p} b_j x_{ij} \right)^2 + k \sum_{j=1}^{p} b_j^2 \tag{9.24}$$
where k is the "relative weight factor" (James et al. 2013), also called the jiggling factor. The second term is called the "shrinkage penalty" which is small when the coefficients are small. With this approach, the parameter vector for ridge regression in matrix notation is:
$$b_{Ridge} = \left( X^T X + kI \right)^{-1} X^T Y \tag{9.25}$$
where I is the identity matrix. For different values of k, different values of the parameters would be obtained. Parameter variance is then given by
$$\operatorname{var}\left( b_{Ridge} \right) = \sigma^2 \left( X^T X + kI \right)^{-1} X^T X \left( X^T X + kI \right)^{-1} \tag{9.26a}$$
with prediction bands:
$$\operatorname{var}\left( y_o \right)_{Ridge} = \sigma^2 \left\{ 1 + X_o^T \left[ \left( X^T X + kI \right)^{-1} X^T X \left( X^T X + kI \right)^{-1} \right] X_o \right\} \tag{9.26b}$$
where $\sigma^2$ is the mean square error of the residuals. The interested reader can refer to standard texts such as Draper and Smith (1981) for the derivation of the above equations. The essence of ridge regression is to determine the "best" value of the factor k. First, the entire set of regressors is standardized (i.e., the mean is subtracted from each individual observation and the result divided by the standard deviation) to remove the scale effect of the numerical magnitude of the different regressors. One then performs numerous calculations using Eqs. 9.23 and 9.24 with different values of the jiggling factor k from 0 (the OLS case) to 1.0. The "best" value of k is finally determined based on minimum parameter variance, or better still based on least model mean square prediction error (MSE). The latter should be based on the cross-validation or testing data set (see Sect. 5.8.2) and not on the MSE of the training data set from which the model was developed. Thus, the underlying idea is to trade off bias in parameter estimates against model prediction variance. Usually, the value of k is in the range 0–0.2, and determination of this critical value is the crux of this approach. As illustrated in Fig. 9.7, the ridge estimators are biased but tend to be stable and (hopefully) have smaller variance than OLS estimators which have no bias but large variance (see Fig. 9.8); this is referred to as the bias-variance tradeoff. Predictions of the response variable would tend to be more accurate than OLS and the uncertainty bands more realistic. Unfortunately, many practical problems exhibit all the classical signs of multicollinear behavior but, often, applying PCA or ridge analysis does not necessarily improve the prediction accuracy of the model over the standard multilinear OLS regression (or MLR) (as illustrated in Sect. 9.3.4).
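The matrix expressions of Eqs. 9.25 and 9.26a translate almost directly into code. The sketch below is a minimal illustration under several assumptions: the regressors are scaled so that ZᵀZ is the correlation matrix (so that k is on a comparable footing with the 0 to 0.2 range quoted above), the response is centered so that no intercept is needed, σ² is estimated from the OLS residuals, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def ridge_fit(Z, yc, k):
    """Ridge parameter vector of Eq. 9.25 for scaled regressors Z and centered yc."""
    p = Z.shape[1]
    G = Z.T @ Z                              # plays the role of X^T X
    M = np.linalg.inv(G + k * np.eye(p))     # (X^T X + kI)^-1
    b = M @ Z.T @ yc
    return b, M, G

def ridge_param_cov(M, G, sigma2):
    """Parameter variance-covariance matrix of Eq. 9.26a."""
    return sigma2 * (M @ G @ M)

# Synthetic data with two nearly collinear regressors (hypothetical values)
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 1.0 + 2.0 * x1 + 1.0 * x2 + rng.normal(scale=0.5, size=n)

# Scale so that Z^T Z equals the correlation matrix; center the response
Z = (X - X.mean(axis=0)) / (X.std(axis=0, ddof=1) * np.sqrt(n - 1))
yc = y - y.mean()

# sigma^2 taken as the mean square error of the OLS (k = 0) residuals
b_ols, M0, G = ridge_fit(Z, yc, 0.0)
sigma2 = np.sum((yc - Z @ b_ols) ** 2) / (n - Z.shape[1] - 1)

k_values = [0.0, 0.05, 0.1, 0.2]
total_var = []
for k in k_values:
    b, M, G = ridge_fit(Z, yc, k)
    cov_b = ridge_param_cov(M, G, sigma2)
    total_var.append(np.trace(cov_b))
    print(f"k={k:4.2f}  b={np.round(b, 3)}  sum of parameter variances={total_var[-1]:.4f}")
```

The sum of the parameter variances shrinks as k grows, while the coefficients move away from their OLS values; this is the bias-variance exchange that the choice of k has to balance.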
Ridge regression is a well-known approach among a class of techniques referred to as shrinkage methods whose intent is to reduce or shrink the uncertainty of the model parameter estimates, and thus arrive at more robust models than those deduced using OLS. Note that an obvious disadvantage of ridge regression is that it retains all the regressors in the model function even though some are known to be correlated.
Fig 9.7 The optimal value of the ridge factor k is the value for which the mean square error (MSE) of model predictions is minimum. k = 0 corresponds to OLS estimation. In this case, k = 0.131 is optimal with MSE = 3.78 × 104
Fig. 9.8 Conceptual illustration showing that ridge regression (RR) estimates are biased compared to OLS estimates but that the variance of the parameters will (hopefully) be smaller as shown
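The selection of the ridge factor from the prediction error on a testing data set, in the spirit of Fig. 9.7, can be sketched as a simple scan over candidate k values. The function names, the grid of k values, and the synthetic data are assumptions for illustration; the regressors are scaled with the training-set statistics only, so that the testing data play no part in model identification.

```python
import numpy as np

def ridge_coefficients(Z, yc, k):
    """Solve (Z^T Z + kI) b = Z^T yc without forming the inverse explicitly."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + k * np.eye(p), Z.T @ yc)

def scan_ridge_factor(X_train, y_train, X_test, y_test, k_grid):
    """Return the testing-set MSE for each k and the k with the smallest MSE."""
    mean = X_train.mean(axis=0)
    # Scaling chosen so that Z_train^T Z_train is the training correlation matrix
    scale = X_train.std(axis=0, ddof=1) * np.sqrt(len(X_train) - 1)
    Z_train = (X_train - mean) / scale
    Z_test = (X_test - mean) / scale
    y_mean = y_train.mean()
    mse = []
    for k in k_grid:
        b = ridge_coefficients(Z_train, y_train - y_mean, k)
        pred = y_mean + Z_test @ b
        mse.append(np.mean((y_test - pred) ** 2))
    mse = np.array(mse)
    return mse, k_grid[int(np.argmin(mse))]

# Synthetic collinear data split time-wise into training and testing sets
rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 1.0 + 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)

k_grid = np.linspace(0.0, 1.0, 21)   # k = 0 is the OLS case
mse, best_k = scan_ridge_factor(X[:200], y[:200], X[200:], y[200:], k_grid)
print("testing MSE at k = 0:", mse[0])
print("k with the smallest testing MSE:", best_k)
```

Nothing guarantees that the minimum occurs at k > 0; if the OLS model is already good, the scan may well return k = 0.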
This obfuscates interpretation of the regression model structure. An alternative and recent variant is lasso regression, which has a similar formulation but additionally identifies the best subset of regressors, i.e., performs variable/feature selection. The objective function has a similar formulation as ridge regression except that the penalty term $k \sum_{j=1}^{p} b_j^2$ is now replaced by $k \sum_{j=1}^{p} \left| b_j \right|$. The resulting model is said to be more sparse and easier to interpret (see James et al. 2013).
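How the absolute-value penalty produces a sparse model can be illustrated with a coordinate descent scheme, which is one common way (though not the only one) of minimizing the lasso objective. The sketch below minimizes the residual sum of squares plus k Σ|b_j| for standardized regressors and a centered response; the function names and the synthetic data are hypothetical, and no claim is made that this matches the algorithm of any particular software.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho toward zero by t, returning exactly zero when |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_coordinate_descent(Z, yc, k, n_sweeps=500):
    """Minimize sum((yc - Z b)^2) + k * sum(|b_j|) one coefficient at a time."""
    n, p = Z.shape
    b = np.zeros(p)
    col_ss = np.sum(Z ** 2, axis=0)          # z_j^T z_j for each column
    for _ in range(n_sweeps):
        for j in range(p):
            # Residual with the contribution of regressor j removed
            partial_resid = yc - Z @ b + Z[:, j] * b[j]
            rho = Z[:, j] @ partial_resid
            # Setting the derivative of the objective to zero gives threshold k/2
            b[j] = soft_threshold(rho, k / 2.0) / col_ss[j]
    return b

# Synthetic data: x2 is nearly collinear with x1, x4 is irrelevant (hypothetical)
rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
x4 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3, x4])
y = 3.0 * x1 + 1.5 * x3 + rng.normal(size=n)

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
yc = y - y.mean()

# Above this value of k every coefficient is shrunk exactly to zero
k_max = 2.0 * np.max(np.abs(Z.T @ yc))

b_sparse = lasso_coordinate_descent(Z, yc, 0.2 * k_max)
b_all_zero = lasso_coordinate_descent(Z, yc, 1.05 * k_max)
print("coefficients at k = 0.2 k_max:", np.round(b_sparse, 3))
print("coefficients at k = 1.05 k_max:", b_all_zero)
```

Coefficients that are driven exactly to zero correspond to regressors dropped from the model, which is the sense in which the lasso performs variable selection as part of the fit.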
9.3.4
Chiller Case Study Involving Collinear Regressors
This section describes a study (Reddy and Andersen 2002) where the benefits of ridge regression compared to OLS are evaluated in terms of model prediction accuracy in the framework of steady state chiller modeling. Hourly field monitored data (consisting of 810 observations) of a centrifugal chiller included four variables (refer to Fig. 5.14): (i) Thermal cooling capacity Qch in kW; (ii) Compressor electrical power P in kW; (iii) Supply chilled water temperature Tchi in K, and (iv) Cooling water inlet temperature Tcdi in K. The cross-validation approach (see Sect. 5.8.2) is to divide the data set into two sub-sets: (i) a training set, which is meant to compare how different model formulations fit the data, and (ii) a testing (or validating) data set, which is meant to single out the most suitable model in terms of its predictive accuracy. The intent of having training and testing data sets is to avoid over-fitting and to obtain a more accurate indication of the model prediction errors. Given a data set, there are numerous ways of selecting the training and testing data sets. The simplest is to split the data time-wise into a training set (containing the first 550 data points or about 2/3rd of the monitored data) and a testing set (containing the remaining 260 data points or about 1/3rd of the monitored data). It is advisable to select the training data set such that the range of variation in the individual variables is larger than that of the same variables in the testing data set and that the same types of cross-correlations among variables be present in both data sets. This avoids the issue of model extrapolation errors interfering with the model building process. However, if one wishes to specifically compare the various models in terms of their extrapolation accuracy, i.e., their ability to predict beyond the range of variation in the data used to identify the model, the training and testing data sets can be selected in several ways.
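A time-wise split and the two checks recommended above (range coverage and similarity of cross-correlations) can be sketched as follows. The random array merely stands in for the 810 monitored observations of the four variables, which are not reproduced here; the function names are hypothetical.

```python
import numpy as np

def timewise_split(data, n_train):
    """First n_train rows form the training set, the remainder the testing set."""
    return data[:n_train], data[n_train:]

def range_coverage(train, test):
    """Per variable: True if the testing range lies within the training range."""
    return (test.min(axis=0) >= train.min(axis=0)) & (test.max(axis=0) <= train.max(axis=0))

# Placeholder data: 810 observations of 4 variables (not the actual chiller data)
rng = np.random.default_rng(5)
data = rng.normal(size=(810, 4))

train, test = timewise_split(data, 550)
coverage = range_coverage(train, test)

# Correlation coefficient matrices of both sets, to be compared side by side
corr_train = np.corrcoef(train, rowvar=False)
corr_test = np.corrcoef(test, rowvar=False)

print("training/testing sizes:", len(train), len(test))
print("testing range within training range, per variable:", coverage)
print("largest difference between correlation matrices:",
      np.max(np.abs(corr_train - corr_test)))
```

A variable flagged False by `range_coverage` would require the model to extrapolate for part of the testing set, which is worth knowing before comparing prediction errors.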
An extreme form of data separation is to sort the data by Qch and select the lower 2/3rd portion of the data for model development and the upper 1/3rd for model evaluation. The results of such an analysis are reported in Reddy and Andersen (2002) but omitted here
since the evaluation results were similar (indicative of a robust model). Pertinent descriptive statistics for both sets are given in Table 9.7. Note that there is relatively little variation in the two temperature variables, while the cooling load and power experience important variations. Further, as stated earlier, the range of variation in the variables in the testing data set is generally within those of the training data set. Another issue is to check the correlations and serial correlations among the variables. This is shown in Table 9.8. Note that the correlations between (P, Tchi) and (Qch, Tchi) are somewhat different during the training and testing data sets (the correlations have increased from around 0.5 to about 0.9); one has to contend with this difference. In terms of collinearity, note that the correlation coefficient between (P, Qch) is very high (0.98), while those of the others are about 0.6 or less, which is not negligible but not very significant either. A scatter plot of the thermal load against COP is shown in Fig. 9.9. Two different steady-state chiller thermal performance models have been evaluated using OLS. These are the black-box model (referred to as MLR or multiple linear regression) and the gray-box model referred to as GN model (Gordon and Ng 2000). These models and their functional forms prior to regression are described in Sect. 10.2.3 and Pr. 10.8 in Chap. 10. Note that while the MLR model uses the basic measurement variables, the GN model uses transformed variables (X1, X2, X3). Recall that ridge regression should be performed with standardized variables to remove large differences in the numerical values of the different regressors (statistical software does this automatically).

Table 9.7 Descriptive statistics for the chiller data

| | Training data set (550 data points): P | Qch | Tchi | Tcdi | Testing data set (260 data points): P | Qch | Tchi | Tcdi |
|---|---|---|---|---|---|---|---|---|
| Mean | 222 | 1108 | 285 | 302 | 202 | 1011 | 288 | 302 |
| Std.Dev. | 37.8 | 282 | 2.43 | 0.73 | 14.7 | 149 | 2.97 | 0.77 |
| Min | 154 | 517 | 282 | 299 | 163 | 630 | 283 | 297 |
| Max | 340 | 1771 | 292 | 304 | 230 | 1241 | 293 | 303 |

Table 9.8 Correlation coefficient matrices for the training and testing data sets (upper triangle: training data set, 550 data points; lower triangle: testing data set, 260 data points)

| | P | Qch | Tchi | Tcdi |
|---|---|---|---|---|
| P | – | 0.98 | 0.52 | 0.57 |
| Qch | 0.99 | – | 0.62 | 0.59 |
| Tchi | 0.86 | 0.91 | – | 0.37 |
| Tcdi | 0.54 | 0.61 | 0.54 | – |

Fig. 9.9 Scatter plot of chiller COP against thermal cooling load Qch in kW. High values of Qch are often encountered during hot weather conditions at which times condenser water temperatures tend to be higher, which reduce chiller COP. This effect partly explains the larger scatter and leveling off of COP at higher values of thermal cooling load

(a) GN Gray-Box Model (See Sect. 10.2.3 for full description of transformed variables and model structure)
A look at the descriptive statistics for the regressors in the gray-box physical model provides an immediate indication as to whether the data set may be ill-conditioned or not. The
estimated correlation matrix for the transformed regressors in the GN model is given in Table 9.9. It is apparent that there is evidence of strong multi-collinearity between the regressors. From the statistical analysis, it is found that the GN physical model fits the field-monitored data well except perhaps at the very high end (see Fig. 9.10a). The adjusted R2 = 99.1% and the coefficient of variation (CV) = 1.45%. An analysis of variance also shows that there is no statistical evidence to reduce the model order. The model residuals have constant variance, as indicated by the studentized residual plots versus time (row number of data) and by regressor variable (Fig. 9.10b, c). Further, since most of them are contained within bounds of 2.0, one need not be unduly concerned with influence points and outlier points even though a few can be detected.
Table 9.9 Correlation coefficient matrix for transformed regressors of the GN model during training (training data set, 550 data points)

| | Y | X1 | X2 | X3 |
|---|---|---|---|---|
| Y | 1.0 | 0.93 | 0.82 | -0.79 |
| X1 | | 1.0 | 0.96 | -0.95 |
| X2 | | | 1.0 | -0.92 |
| X3 | | | | 1.0 |
Fig. 9.10 Analysis of chiller data using the GN model (gray-box model): (a) x-y plot, (b) residual plot versus time sequence in which data was collected, (c) residual plot versus predicted response, and (d) variance inflation factors for the regressor variables
Table 9.10 Results of ridge regression applied to the GN model. Despite strong collinearity among the regressors, the OLS model has lower CV and NMBE than those from ridge regression when applied to the testing data set

| Ridge factor | Training data set: Adj-R2 | CV(%) | VIF(X1) | VIF(X2) | VIF(X3) | Testing data set: CV(%) | NMBE(%) |
|---|---|---|---|---|---|---|---|
| k = 0.0 (a) | 99.1 | 1.45 | 18.43 | 12.69 | 9.61 | 1.63 | -1.01 |
| k = 0.02 | 89.8 | 3.02 | 7.67 | 6.39 | 5.71 | 2.86 | 1.91 |
| k = 0.04 | 85.1 | 4.16 | 4.22 | 3.99 | 3.87 | 4.35 | 3.18 |
| k = 0.06 | 82.3 | 4.87 | 2.69 | 2.78 | 2.82 | 5.20 | 3.86 |
| k = 0.08 | 80.2 | 5.32 | 1.88 | 2.07 | 2.17 | 5.73 | 4.27 |

a Equivalent to OLS estimation
Table 9.11 Results of ridge regression applied to the MLR model. Despite strong collinearity among the regressors, the OLS model has lower CV and NMBE than those from ridge regression when applied to the testing data set

| Ridge factor | Training data set: Adj-R2 | CV(%) | VIF(X1) | VIF(X2) | VIF(X3) | Testing data set: CV(%) | NMBE(%) |
|---|---|---|---|---|---|---|---|
| k = 0.0 (a) | 99.2 | 0.95 | 49.0 | 984.8 | 971.3 | 1.13 | 0.62 |
| k = 0.005 | 93.1 | 1.81 | 26.3 | 15.0 | 15.2 | 2.86 | 2.38 |
| k = 0.01 | 89.9 | 2.39 | 16.4 | 6.43 | 6.57 | 3.52 | 2.89 |
| k = 0.015 | 87.8 | 2.80 | 11.2 | 3.89 | 4.00 | 3.95 | 3.21 |
| k = 0.02 | 86.3 | 3.10 | 8.17 | 2.69 | 2.75 | 4.25 | 3.43 |

a Equivalent to OLS estimation
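How the variance inflation factors fall as the ridge factor increases can be illustrated directly from a correlation matrix. The sketch below uses the rounded regressor correlations of Table 9.9 and makes two assumptions that should be stated plainly: the ridge VIF values are taken as the diagonal of (R + kI)⁻¹ R (R + kI)⁻¹, and the condition number is taken as the ratio of the largest to the smallest eigenvalue of R. Because the tabulated correlations are rounded and the definitions used in the study may differ, the output is not expected to reproduce Table 9.10 exactly.

```python
import numpy as np

def ridge_vif(R, k):
    """Diagonal of (R + kI)^-1 R (R + kI)^-1; k = 0 gives the usual OLS VIF."""
    p = R.shape[0]
    M = np.linalg.inv(R + k * np.eye(p))
    return np.diag(M @ R @ M)

# Rounded correlations among X1, X2, X3 taken from Table 9.9
R = np.array([
    [1.00, 0.96, -0.95],
    [0.96, 1.00, -0.92],
    [-0.95, -0.92, 1.00],
])

# Condition number assumed here to be the ratio of extreme eigenvalues of R
eig = np.linalg.eigvalsh(R)
cond = eig.max() / eig.min()

k_values = [0.0, 0.02, 0.04, 0.06, 0.08]
vifs = np.array([ridge_vif(R, k) for k in k_values])

print("condition number (eigenvalue ratio):", round(cond, 1))
for k, v in zip(k_values, vifs):
    print(f"k={k:4.2f}  VIF={np.round(v, 2)}")
```

The pattern, rather than the exact numbers, is what matters: every VIF decreases monotonically with k, so VIF alone always favors a larger ridge factor and cannot by itself identify the best k.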
The regressors are strongly correlated (Table 9.9). The condition number of the matrix is close to 81, suggesting that the data is ill-conditioned since this value is larger than the threshold value of Cd = 30 stated earlier. To overcome this adversity, ridge regression is performed with the ridge factor k varied from 0 (which is the OLS case) to k = 0.2. The ridge trace for individual parameters is shown in Fig. 9.10d and the variance inflation factors (VIF) are shown in Table 9.10. As stated earlier, OLS estimates will be unbiased, while ridge estimation will be biased but likely to be more efficient. The internal and external predictive accuracies of both estimation methods, using the training and testing data sets respectively, were evaluated. It is clear from Table 9.10 that, during model training, CV values increase as k is increased, while the VIF values of the parameters decrease. Hence, from the training results alone, one cannot draw any inferences about whether ridge regression is better than OLS and, if so, which value of the ridge parameter k is optimal. Adopting the rule of thumb that the VIF values should be brought down to about 5 would suggest k = 0.02 or 0.04 to be reasonable choices. The CV and normalized mean bias error (NMBE) values of the models for the testing data set are also shown in Table 9.10. For OLS (k = 0), these values are 1.63% and -1.01%, indicating that the identified OLS model can provide extremely good predictions. Again, both these indices increase as the value of k is increased, indicating poorer predictive ability both in variability or precision and in bias. Hence, in this case, even though the data is ill-conditioned, the OLS identification turns out to be the better estimation approach if the chiller model identified is to be used for predictions only. It is worth
pointing out that this conclusion may be because the OLS regression model is excellent to start with (very high R2 and very low CV values).
(b) MLR Black-Box Model
The MLR model is a black-box model with linear first- and second-order terms in the three regressor variables. Some or many of the variables may be statistically insignificant, and so a step-wise OLS regression was performed. Both forward selection and backward elimination techniques were evaluated using the F-ratio of 4 as the cutoff (recall from Sect. 5.3.3 that the F-test evaluates the significance of the overall regression model). While the backward elimination retained seven terms (excluding the constant), forward selection only retained three. The Adjusted-R2 and CV statistics were almost identical and so the forward selection model is retained for parsimony. The final MLR model contains the following three variables: [Qch², Tcdi·Qch, Tchi·Qch]. The fit is again excellent with adjusted R2 = 99.2% and CV = 0.95% (very slightly better than those of the GN model). An analysis of variance also shows that there is no statistical evidence to reduce the model order, while the residuals are well-behaved. Unfortunately, the regressors are very strongly correlated (the correlation coefficients for all three variables are 0.99), which indicates ill-conditioned data. This is also supported by the large value of the condition number of the matrix (Cd = 76). The ridge regression results for the MLR model are shown in Table 9.11. How the CV values, during model training, increase as the ridge factor k is increased from 0 to 0.02 can
be noted along with the VIF of the regressors. The CV and NMBE values of the models for the testing data set are also assembled. Again, despite the strong ill-conditioning of the OLS model, the OLS model (with k = 0) turns out to be the best predictive model for the testing data set, with CV and NMBE values of 1.13% and 0.62%, respectively, which are excellent. As previously, this conclusion could be because the OLS model is very good to start with. Thus, a takeaway from this example is that remedial transformations suggested in textbooks are not always effective and that one ought to evaluate different approaches prior to making a final determination.
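The behavior noted in this example, where the VIF values fall as the ridge factor k is raised above the OLS case k = 0, can be illustrated with a minimal sketch. It does not use the chiller data of Tables 9.10 and 9.11; it assumes only two standardized regressors with a hypothetical correlation coefficient r = 0.99, and it evaluates the commonly used ridge VIF expression, namely the diagonal of (R + kI)^-1 R (R + kI)^-1 where R is the correlation matrix of the regressors. The function name `ridge_vif` is illustrative.

```python
def ridge_vif(r, k):
    """VIF of either regressor for two standardized regressors with correlation r.

    Evaluates the (1,1) element of (R + kI)^-1 R (R + kI)^-1 with
    R = [[1, r], [r, 1]]; k = 0 recovers the OLS value 1 / (1 - r**2).
    """
    d = 1.0 + k
    det = d * d - r * r
    # First row of A = (R + kI)^-1 (A is symmetric)
    a11, a12 = d / det, -r / det
    # First row of the product A R
    m11 = a11 + a12 * r
    m12 = a11 * r + a12
    # (1,1) element of (A R) A
    return m11 * a11 + m12 * a12


r = 0.99  # hypothetical correlation between the two regressors
ks = [0.0, 0.02, 0.04, 0.1, 0.2]
vifs = [ridge_vif(r, k) for k in ks]
for k, v in zip(ks, vifs):
    print(f"k = {k:4.2f}   VIF = {v:8.3f}")
```

Under these assumed values the VIF starts near 50 for OLS and falls below 5 by k = 0.04. The sketch says nothing about predictive accuracy, which, as the example shows, has to be checked separately on a testing data set.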
9.3.5
Other Multivariate Methods
Canonical correlation analysis is an extension of multiple linear regression (MLR) for systems which have several response variables. It is based on determining a transformed version of the response vector. Both the regressor set (X) and the response set (Y) are first standardized, and then represented by weighted linear combination vectors U and V respectively. They are finally regressed against each other (akin to MLR regression) to yield canonical weights (or derived model parameters) which can be ranked. These canonical weights can be interpreted as beta coefficients (see Sect. 5.4.4) in that they yield insights into the relative contributions of the individual derived variables. Thus, the approach allows one to understand the relationship between and within two sets of variables X and Y. However, in many systems, the response variables Y are not all “equal” in importance; some may be deemed to be more influential than others based on physical insights of system behavior. This relative physical importance is not retained during the rotation since it is based purely on statistical criteria. This approach is said to be more relevant to the social and softer sciences than to engineering and the hard sciences.
Factor analysis (Manly 2005) is similar to PCA in that a reduction in dimension is sought by shrinking a large interdependent and correlated set of X variables into a smaller set of more understandable and manageable factors or indices. However, while PCA seeks to identify new variables that are composites of the observed variables, factor analysis replaces the observed variables by a new set of more general latent/abstract factors which, though not directly measured, are more easily interpreted. Each of the original variables is now reformulated in terms of a small number of common latent factors which impact all the X variables, and a set of errors or specific factors which affect only a single X variable.
Thus, the original variables, after being standardized, are modeled as linear functions:
$$x_i = a_{i1} F_1 + a_{i2} F_2 + a_{i3} F_3 + \ldots + \varepsilon_i \tag{9.27}$$
where $a_{ij}$ is called the factor loading, $F_j$ is the value of the jth common factor, and $\varepsilon_i$ is the part of the test result specific to the ith variable. The factors have mean values of zero and standard deviation of unity. First, the number of factors to include in the model can be estimated from a PCA analysis (equal to the number of eigenvalues which are greater than one) or defined by the analyst. Next, the factors can be redefined for a more equitable variance distribution, unlike the PCA analysis where the first rotation explains the largest portion of the total variance in the data set, the second rotation the second most, etc. There are different strategies for factor rotation. Varimax rotation maximizes the variance of the squared loadings in each column (i.e., $a_{i1}^2 + a_{i2}^2 + a_{i3}^2 + \ldots = 1$). Quartimax maximizes the variance of the squared loadings in each row, while equimax attempts to achieve a balance between rows and columns. The high-loading variables for each factor can be identified and, hopefully, suitable names can be assigned to each factor descriptive of the relevant abstraction or latent feature or attribute. Example 9.3.3 illustrated the use of principal component analysis with 15 variables. It was found that three principal components explained 95.9% of the variation in the data set of 27 sets of observations. The same data was analyzed using factor analysis with 3 factors and selecting the varimax rotation scheme. One obtains very similar results (the selection of the number of factors and the use of different rotation strategies were found to have very minor effects on the results). Again, only three factors account for 98.3% of the variability in the original data. The factor loading plot of these three factors is shown in Fig. 9.11, which indicates that there are six clusters: CF4, CF5, CF9, and CF10 stand alone, one cluster is formed by CF1, CF8, and CF14, and the rest make up the last cluster.
The CF variables forming a cluster can be combined (this is called a cluster analysis as discussed further in Sect. 11.3). The factor loading magnitude also provides a direct link between variables and factors. Thus, factor analysis has provided some important additional insights into variable dependency and interpretability which PCA did not. Note that while PCA is not based on a model, factor analysis presumes a model where the data is made up of inter-related factors. Factor analysis is frequently used for identifying hidden or latent trends or structure in the data whose effects cannot be directly measured. Like PCA, several authors are skeptical of factor analysis since it is somewhat subjective. As with any kind of process that simplifies complexity, there is a trade-off between the accuracy of the data and ease in working. With factor analysis, the best solution is the one that yields a simplification representative of the true nature of the data with minimum loss of precision. On the other hand, other authors point out its descriptive benefit as a means of understanding the causal structure of multivariate data. Factor analysis can be adopted for several
Fig. 9.11 Plot of factor loadings using data from Example 9.3.3 (three-dimensional plot with axes Factor 1, Factor 2, and Factor 3; points labeled CF1 to CF15)
Fig. 9.12 (a) Sketch and nomenclature to explain the different types of parameter estimation problems for the simple case of one exploratory variable (x) and one response variable ( y). Noise and errors are assumed to be additive. (b) The idealized case without any errors or noise
applications: identification of underlying factors (discussed above), screening of variables, summary of data, and clustering of objects. These tasks are commonly required in disciplines such as market research, medicine, biology, psychology, and sociology, but much less in engineering. It has been used to screen indicators/variables characterizing the different attributes related to vulnerability and lack of resilience of communities subject to extreme events and threats (e.g., Asadzadeh et al. 2015). PCA and factor analyses are basically multivariate data analysis tools which involve analyzing the ($X^T X$) matrix only (i.e., they qualify as unsupervised analysis methods). Subsequent application to regression model building is secondary, if at all. Canonical regression uses both ($X^T X$) and ($Y^T Y$), but the regression is still done only after the initial transformations are completed. A more versatile and flexible model identification approach than PCA, which considers the covariance structure between predictor and response variables during model identification, is called partial least squares (PLS) regression. This supervised analysis method uses the cross-product matrix ($Y^T X X^T Y$) to identify the multivariate model. However, since the variable rotation is not orthogonal, there is often some amount of variance inflation. Hence, some authors (e.g., James et al. 2013) state that PLS is often no better than PCA or ridge regression. PLS has been used in instances when there are fewer observations than predictor variables, and is further useful for exploratory analysis and for outlier detection. It has found applications in numerous disciplines where many predictors are used, such as economics, medicine, chemistry, and psychology. Note that PLS is not really meant to understand the underlying structure of multivariate data (as do PCA and factor analysis) but is meant to be an accurate tool for predicting system response.
9.4
Going Beyond OLS
9.4.1
Background
Ordinary least squares (OLS) has been addressed extensively in Chap. 5. This widely used parameter estimation technique applies when the function is known or assumed to be linear and subject to certain strict assumptions regarding origin and type of certain errors. Figure 9.12 is a sketch depicting how errors influence the parameter estimation process. Errors which can corrupt the process can be viewed as either
additive, multiplicative, or mixed. For the simplest case, namely the additive error situation, such errors can appear as measurement error (γ) on the regressor/exploratory variable, on the response variable (δ), and also on the postulated model (ε). Note that Fig. 9.12 distinguishes between the true values of the variables and their measured values at observation i. Even when errors are assumed additive, one can distinguish between three broad types of situations:
(a) Measurement errors ($\gamma_i$, $\delta_i$) and model error ($\varepsilon_i$) may be: (i) unbiased or biased, (ii) normally or non-normally distributed along different vertical slices of $x_i$ values (see Fig. 5.3), (iii) of variance that is zero, uniform, or non-uniform over the range of variation of x.
(b) Covariance effects may, or may not, exist between the errors and the regressor variable, i.e., between (x, δ, γ, ε).
(c) Autocorrelation or serial correlation over different temporal lags may, or may not, exist between the errors and the regressor variables, i.e., between (x, δ, γ, ε).
The reader is urged to refer back to Sect. 5.5.1 and to Sect. 5.10, where these conditions as applicable to OLS were stated. To recall, the OLS situation strictly applies when there is no error in x. Further, OLS applies when $\delta \sim N(0, \sigma_\delta^2)$, i.e., unbiased and normally distributed with constant variance, when $\mathrm{cov}(x_i, \delta_i) = 0$ and when $\mathrm{cov}(\delta_i, \delta_{i+1}) = 0$. Notice that it is impossible to separate the effects of $\delta_i$ from $\varepsilon_i$ in the OLS case, and so the combined effect is to increase the error variance of the following model, which will be reflected in the RMSE value:
$$y = a + bx + (\varepsilon + \delta) \tag{9.28}$$
For cases when δ is a known function, maximum likelihood estimation (MLE) is more appropriate (discussed in the next sub-section). The popularity of linear estimation methods stems from the facts that the computational effort is relatively low, the approach is intuitively appealing, and there exists a wide body of statistical knowledge supporting them. OLS analysis (discussed in Chap. 5) is best suited for general linear models where the model residuals are normally distributed. This occurs when the response variable is normally distributed and the functional form among the regressor variables is linear. Other criteria to be met for OLS to be applicable are given by the Gauss-Markov criteria (Sect. 5.10). The approach adopted in OLS was to minimize an objective function (also referred to as the loss function) expressed as the sum of the squared residuals (given by Eq. 5.3). One was able to derive closed form solutions for estimating the model parameters and their standard errors as well as the uncertainty intervals for mean and individual response. Such closed form solutions cannot be obtained for many situations where the function to be minimized must be framed differently, and these require the adoption of search methods. Thus, parameter estimation problems are, in essence, optimization problems where the objective function is framed in accordance with what one knows about the errors in the measurements and in the model structure. Consider the following model:
$$y = a \exp(bx) \tag{9.29}$$
One would proceed to take logarithms, resulting in the simple linear model:
$$\ln y = \ln a + bx \tag{9.30}$$
Since the log-transformed response variable is unlikely to be normally distributed, one cannot use OLS. Non-linear estimation applies to instances when the model is non-linear in the parameters. The parameter estimation can be done by either least squares or by MLE of a suitably defined loss function. Models non-linear in their parameters are of two types: (i) those which can be made linear by a suitable variable transformation, and (ii) those which are intrinsically non-linear. The former is discussed first; the latter is very similar to a search for optimizing a function and is discussed in Sect. 9.5. One should distinguish between the following important cases relevant to parameter estimation:
– If the response variable is not normally distributed, the Maximum Likelihood Estimation (MLE) method is a popular approach (discussed next in Sect. 9.4.2).
– Another method is to transform the response variable to make it normal, and then use OLS (this is the Box-Cox transformation method discussed in Sect. 9.4.4).
– If the response variable is not normally distributed and/or if the functional form is not strictly a linear combination of the regressor variables, the generalized linear model or GLM approach (not to be confused with the general linear model) can be adopted in conjunction with MLE (discussed in Sect. 9.4.3).
– If measurement errors in the regressors are not small compared to those of the response, use the error in variable (EIV) or corrected least squares method (Sect. 9.4.6).
– For autocorrelated errors, the generalized least squares (GLS) method is appropriate (but not treated in this book).
9.4.2
Maximum Likelihood Estimation (MLE)
Ordinary least squares (OLS) estimation is based on moments of the data (the first moment is the mean, the second is the variance, and so on); hence, this approach is referred to as the Method of Moments Estimation (MME). Maximum Likelihood Estimation (MLE) is another approach and is generally superior to MME since it can handle any type of error distribution in the response variable y provided it is known/specified, while MME is limited to normally distributed errors. MLE allows generation of estimators of unknown parameters that are generally more efficient and consistent than MME, though sometimes estimates can be biased. In many situations, the assumption of normally distributed errors is reasonable, and in such cases, MLE and MME give identical results. Thus, in that regard, MME can be viewed as a special (but important) case of MLE.
Consider the following simple example meant to illustrate the concept of MLE. Suppose that a shipment of computers is sampled for quality, and that two out of five are found defective. The commonsense approach of estimating the population proportion of defectives is π = 2/5 = 0.40. One could use an alternative method by considering a whole range of possible π values. For example, if π = 0.1, then the probability of s = 2 defectives out of a sample of n = 5 would be given by the binomial formula:
$$\binom{n}{s} \pi^{s} (1 - \pi)^{n-s} = \binom{5}{2} 0.1^{2} (0.9)^{3} = 0.0729 \tag{9.31}$$
In other words, if π = 0.1, there is only 7.3% probability of getting the actual sample that was observed. However, if π = 0.2, the chances improve since one gets p = 20.5%. By trying out various values, one can determine the best value of π. How the maximum likelihood function varies for different values of π is shown in Fig. 9.13, from which one finds that the most likely value is 0.4, the same value as per the commonsense approach. Thus, the MLE approach is to simply determine the value of π which maximizes the likelihood function given by the above binomial formula. In other words, MLE is the population value that is more likely than any other value to generate the sample which was actually observed, or which maximizes the likelihood of the observed sample.
The above approach can be generalized as follows. Suppose a sample ($x_1, x_2, \ldots, x_n$) of independent observations is drawn from a population with probability function $p(x_i/\theta)$ where θ is the unknown population parameter to be estimated. If the sample is random, then the joint probability function for the whole sample is:
$$p(x_1, x_2, \ldots, x_n/\theta) = p(x_1/\theta)\, p(x_2/\theta) \ldots p(x_n/\theta) \tag{9.32}$$
The objective is now to estimate the most likely value of θ among all its possible values which maximizes the above probability function. The likelihood function is thus:
$$L(\theta/x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i/\theta) \tag{9.33}$$
where Π denotes the product of n factors. The parameter θ is easily determined by taking natural logarithms, in which case,
$$\ln(L(\theta)) = \sum_{i=1}^{n} \ln p(x_i/\theta) \tag{9.34}$$
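The defective-computer illustration can be reproduced with a short sketch that evaluates the binomial likelihood over a grid of candidate π values and picks the largest. The grid resolution of 0.01 and the function name `binomial_likelihood` are arbitrary choices made here for illustration.

```python
from math import comb


def binomial_likelihood(pi, s=2, n=5):
    """Probability of s defectives in a sample of n when the defective proportion is pi."""
    return comb(n, s) * pi**s * (1.0 - pi) ** (n - s)


# Candidate values 0.01, 0.02, ..., 0.99 (grid resolution is an arbitrary choice)
grid = [i / 100 for i in range(1, 100)]
pi_mle = max(grid, key=binomial_likelihood)

print(binomial_likelihood(0.1))  # likelihood at pi = 0.1
print(binomial_likelihood(0.2))  # likelihood at pi = 0.2
print(pi_mle)                    # grid value with the largest likelihood
```

A grid search is used only because it mirrors the "trying out various values" description; for this one-parameter case the maximum can also be found analytically as s/n.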
Thus, one could determine the MLE of θ by performing a least-squares regression provided the probability function is known or assumed. Say one wishes to estimate the two regression parameters $\beta_0$ and $\beta_1$ of a simple linear model assuming the error distribution to be normal with variance $\sigma^2$ (Sect. 2.4.3); the probability distribution of the residuals would be given by:
$$p(y_i) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left[-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1 x_i)^2\right] \tag{9.35}$$
where ($x_i$, $y_i$) are the individual sample observations. The maximum likelihood function is then:
$$L(y_1, y_2, \ldots, y_n, \beta_0, \beta_1, \sigma^2) = p(y_1)\, p(y_2) \cdots p(y_n) = \prod_{i=1}^{n} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left[-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1 x_i)^2\right] \tag{9.36}$$
Fig. 9.13 The maximum likelihood function for the case when two computers out of a sample of five are found defective
The three unknown parameters ($\beta_0$, $\beta_1$, σ) can be determined either analytically (by setting the partial derivatives to zero), or numerically. In this instance, it can be shown that MLE estimates and OLS estimates of $\beta_0$ and $\beta_1$ are identical, while those for $\sigma^2$ are biased (though consistent). MLE is said to be the dominant estimation method used by statisticians. Its advantages go beyond its obvious intuitive appeal because of the following reasons:
(a) though biased for small samples, the bias reduces as the sample size increases,
(b) where MLE is not the same as MME, the former is generally superior in terms of yielding minimum variance of the model parameter estimates,
(c) MLE is very straightforward and even though it does not have closed form solutions in most cases, the estimation can be easily done by computers,
(d) in addition to providing estimates, MLE is useful to show the range of plausible values for the parameters, and also for deducing confidence limits.
The main drawback is that MLE may lack robustness in dealing with a population of unknown shape, i.e., it cannot be used when one has no knowledge of the underlying error distribution. In such cases, one can evaluate the goodness of fit of different plausible probability distributions using the Chi-square criterion (see Sect. 4.2.6), identify the best candidates and pick one based on some prior physical insights. The numerical computation poses no problem on a computer, but the final selection is, to some extent, at the discretion of the analyst. When sample data is available, one can postulate one (or more) underlying theoretical PDFs and determine the corresponding MLE parameter estimates. The following examples illustrate the approach.
Table 9.12 Data table for Example 9.4.1 (a)
2100  2107  2128  2138  2167  2374
2412  2435  2438  2456  2596  2692
2738  2985  2996  3369
Table 9.13 Data table for Example 9.4.2 (a)
4.7   5.8   6.5   6.9   7.2   7.4
7.7   7.9   8.0   8.1   8.2   8.4
8.6   8.9   9.1   9.5   10.1  10.4
(a) Data available electronically on book website
Example 9.4.1 MLE for exponential distribution
The lifetime of several products and appliances can be described by the exponential distribution (see Sect. 2.4.3) given by the following one parameter model:
$$E(x; \lambda) = \lambda e^{-\lambda x} \ \text{ if } x \geq 0, \qquad = 0 \ \text{ otherwise} \tag{9.37}$$
Sixteen appliances have been tested and operating life data in hours are assembled in Table 9.12 whose mean value is 2508.19. The parameter λ is to be estimated using MLE. Following Eq. 9.33, the likelihood function is:
$$L(\lambda/x_i) = \prod_{i=1}^{n} p(x_i/\lambda) = \left(\lambda e^{-\lambda x_1}\right)\left(\lambda e^{-\lambda x_2}\right)\left(\lambda e^{-\lambda x_3}\right) \ldots = \lambda^{n} e^{-\lambda \sum x_i} \tag{9.38}$$
Taking logs, one gets $\ln(L(\lambda)) = n \ln(\lambda) - \lambda \sum x_i$. Differentiating with respect to λ and setting it to zero yields:
$$\frac{d \ln(L(\lambda))}{d\lambda} = \frac{n}{\lambda} - \sum x_i = 0, \quad \text{from which} \quad \lambda = \frac{n}{\sum x_i} = \frac{1}{\bar{x}} \tag{9.39}$$
Thus, the MLE estimate of the parameter λ = (2508.188)^-1 = 0.000399. ■
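Example 9.4.1 can be checked numerically with a few lines. The sketch below uses the sixteen lifetimes of Table 9.12, applies the closed-form result of Eq. 9.39, and then confirms that the log-likelihood is lower on either side of the estimate. The function name `log_likelihood` and the ±10% perturbation are illustrative choices, not part of the original example.

```python
from math import log

# Operating lives in hours (the 16 values of Table 9.12)
lifetimes = [2100, 2107, 2128, 2138, 2167, 2374, 2412, 2435,
             2438, 2456, 2596, 2692, 2738, 2985, 2996, 3369]


def log_likelihood(lam, data):
    """ln L(lambda) = n ln(lambda) - lambda * sum(x_i) for the exponential model."""
    return len(data) * log(lam) - lam * sum(data)


n = len(lifetimes)
mean_life = sum(lifetimes) / n
lam_mle = 1.0 / mean_life  # closed-form result of Eq. 9.39

print(mean_life, lam_mle)
# Sanity check: the log-likelihood is lower on either side of the estimate
print(log_likelihood(lam_mle, lifetimes) > log_likelihood(0.9 * lam_mle, lifetimes))
print(log_likelihood(lam_mle, lifetimes) > log_likelihood(1.1 * lam_mle, lifetimes))
```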
Example 9.4.2 MLE for Weibull distribution
The observations assembled in Table 9.13 are values of wind speed (in m/s) at a certain location. The Weibull distribution (see Sect. 2.4.3) with parameters (α, β) is appropriate for modeling wind distributions:
$$p(x) = \frac{\alpha}{\beta^{\alpha}}\, x^{\alpha - 1} e^{-(x/\beta)^{\alpha}} \tag{9.40}$$
Estimate the values of the two parameters using MLE. As previously, taking the partial derivatives of ln(L(α, β)), setting them to zero, and solving the two equations results in (Devore and Farnum 2005):
$$\alpha = \left[\frac{\sum x_i^{\alpha} \ln(x_i)}{\sum x_i^{\alpha}} - \frac{\sum \ln(x_i)}{n}\right]^{-1} \quad \text{and} \quad \beta = \left(\frac{\sum x_i^{\alpha}}{n}\right)^{1/\alpha} \tag{9.41}$$
This approach is tedious and error-prone, and so one would tend to use a computer program to perform MLE. Resorting to this option resulted in MLE parameter estimates of (α = 7.9686, β = 0.833). The goodness-of-fit of the model can be evaluated using the Chi-square distribution which is found to be 0.222. The resulting plot and the associated histogram of observations are jointly shown in Fig. 9.14. ■
9.4.3
Generalized Linear Models (GLM)
A flexible and unified generalization of OLS for continuous variables and discrete/categorical/binary data3 is the Generalized Linear Model (GLM) approach. A brief overview is provided here while several books such as Dobson and Barnett (2018) provide in-depth treatment. The technique is an umbrella term applicable to several different types of models requiring their unique variants/nuances. It unifies statistical models such as linear regression, logistic regression
Fig. 9.14 Fit to data of the Weibull distribution with MLE parameter estimation (Example 9.4.2)
(covered in Sect. 9.4.5) and Poisson regression. One application is for non-linear models wherein the non-linearity can be overcome by transforming the response variable using a link function, and then identifying a linear additive model between it and the predictor variables (even though the underlying relationships may be neither linear nor additive). The link transformation, however, may result in a non-normal distribution of the errors in the response variable. Hence, OLS may often be inappropriate for parameter estimation, and the maximum likelihood estimation (MLE) method along with an assumed/presumed distribution for the response variable is adopted. Note that a distinctive characteristic of GLM is that it is not the mean of the response at specific values of the regressor variable (refer to Fig. 5.3), but a function of the mean that is made linearly dependent on the predictors. Some common transformations are shown in Table 9.14. Of special interest are the link functions for the exponential, power and logistic function types. The exponential function (Fig. 9.15a) appears frequently in practice; notice the shape of the curves for different values of coefficient b. Recall that the Poisson distribution deals with the number of occurrences in a fixed period of time, and so applies to discrete random variables; the exponential distribution deals with continuous random variables since it applies to the time between occurrences of successive events. The assumption of multiplicative models in certain cases, such as exponential models, is usually a good one since one would expect the magnitude of the errors to be greater as the magnitude of the variable increases. However, this is by no means obvious for other transformations. Consider the solution of a first-order linear differential equation of a decay process: T(t) = T0 exp(-t/τ) where T0 and τ (interpreted as the initial condition and the system time constant, respectively) are the model parameters.
Taking logarithms on both sides results in: ln(T(t)) = α + βt where α = ln T0 and β = -1/τ. The model has, thus, become linear; but if the probability distribution of ln(T) is not normal, OLS
Table 9.14 Some common GLM transformations (link functions) to convert models non-linear in the parameters into linear additive ones
Function type | Functional model | GLM transformation or link function | Transformed linear regression model
Exponential/Poisson | y = exp(a + b1x1 + b2x2) | y* = ln y | y* = a + b1x1 + b2x2
Power/multiplicative | y = a x1^b x2^c | y* = log y, x* = log x | y* = a + bx1* + cx2*
Logarithmic | y = a + b log x1 + c log x2 | x* = log x | y = a + bx1* + cx2*
Reciprocal | y = (a + b1x1 + b2x2)^-1 | y* = 1/y | y* = a + b1x1 + b2x2
Hyperbolic | y = x / (a + bx) | y* = 1/y; x* = 1/x | y* = b + ax1*
Saturation | y = ax / (b + x) | y* = 1/y; x* = 1/x | y* = 1/a + (b/a) x1*
Logistic | y = exp(a + b1x1 + b2x2) / [1 + exp(a + b1x1 + b2x2)] | y* = ln[y / (1 - y)] | y* = a + b1x1 + b2x2
3 Recall the terminology that factors are qualitative or discrete regressor variables and their categories are called “levels”, while covariates are quantitative continuous regressor variables.
Fig. 9.15 Diagrams depicting different non-linear functions (with slope b) which can be transformed to functions linear in the parameters as shown in Table 9.14. (a) Exponential function. (b) Power Function. (c) Reciprocal function. (d) Hyperbolic function. (From Shannon 1975)
should not be used to estimate α and β, and from there determine T0 and τ. The parameter estimates and model predictions will not be biased but would be inefficient, i.e., the magnitude of the confidence and the prediction intervals provided by the OLS model will be understated. Note also that the statistical goodness-of-fit indices, such as R2 and RMSE, as well as any residual checks, apply to the transformed variables and not to the original ones. Natural extensions of the above single variate power models to multivariate ones are obvious. For example, consider the multivariate power model:
$$y = b_0\, x_1^{b_1} x_2^{b_2} \cdots x_p^{b_p} \tag{9.42a}$$
If one defines:
$$z = \ln y, \quad c = \ln b_0, \quad w_i = \ln x_i \quad \text{for } i = 1, 2, \ldots, p \tag{9.43}$$
Then, one gets the linear model:
$$z = c + \sum_{i=1}^{p} b_i w_i \tag{9.42b}$$
More generally, the expected value of the response of a statistical modeling equation is:
$$E(y_i) = f(X) = f(x_{i,1}, x_{i,2}, \ldots, x_{i,p}) \tag{9.44}$$
Two important specific cases are:
$$\text{for linear:} \quad f(X) = \beta_0 + \beta_1 x_{i,1} + \ldots + \beta_p x_{i,p} \tag{9.45a}$$
$$\text{for Poisson:} \quad f(X) = \exp\left(\beta_0 + \beta_1 x_{i,1} + \ldots + \beta_p x_{i,p}\right) \tag{9.45b}$$
For GLM it does not matter if the transformed response variable or the model residual errors are normally distributed or not since MLE estimation is used. However, a suitable variance function with respect to the mean value of the response variable y is required. For the Poisson regression model, one assumes that y has the Poisson distribution with variance v = mean (see Eq. 2.41b). In logistic and binomial regression, v(mean) = [mean - (mean^2 / n)] for sample size n. Thus, GLM requires both the specification of the link function and a suitable variance function to be used for the transformed response variable along with MLE. The estimation is a bit more involved than OLS, but is easily done with computer software.
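To make the Poisson case of Eq. 9.45b concrete, the following is a minimal sketch of the MLE for a single regressor using Newton-Raphson iterations, in which the weights of the information matrix are the fitted means (the Poisson variance function, variance = mean). The count data are hypothetical and are not taken from the book; the function name `fit_poisson`, the starting values, and the convergence tolerance are likewise illustrative assumptions. In practice a statistical package would normally be used.

```python
from math import exp, log

# Hypothetical count data (not from the book): x is a regressor, y an event count
x = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [1, 2, 2, 4, 5, 7, 12, 15, 24]


def fit_poisson(x, y, iters=50, tol=1e-12):
    """Newton-Raphson MLE of (b0, b1) in E(y) = exp(b0 + b1*x) with Poisson variance = mean."""
    b0, b1 = log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    for _ in range(iters):
        mu = [exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log-likelihood)
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix X'WX with weights W = mu (the variance function)
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (X'WX) d = score
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


b0, b1 = fit_poisson(x, y)
mu_hat = [exp(b0 + b1 * xi) for xi in x]
print(b0, b1)
```

At convergence the score equations are satisfied, i.e., the sums of (y - mu) and of x(y - mu) both vanish, which is the GLM counterpart of the OLS normal equations.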
9.4.4
Box-Cox Transformation
The Box-Cox transformation (Box and Cox 1964) is meant to convert a non-normal distribution for any variable into a normal one. This method was popular prior to the introduction of the GLM approach since linear regression with non-normally distributed response variables could be done using OLS. Should the Box-Cox transformation not prove to be satisfactory, one of the several alternatives of weighted least squares (see Sect. 5.6.3) or the GLM method can be adopted. The Box-Cox transformation stabilizes the variance of the response variable by transforming the scale of the variable, making the deviations around the model more normally distributed. Stated differently, it converts a response variable with non-normal distribution into one which closely resembles a normal distribution, thereby allowing many of the well-known OLS statistical techniques to apply. The class of transformations often used is the power transformations in which the data is raised to a power λ1 after shifting it by a certain amount λ2 (often, the shift parameter is set equal to 0). Some common power transformations are (Dobson and Barnett 2018):
$$y^* = y^{\lambda} - 1 \quad \text{with } \lambda \neq 0 \tag{9.46a}$$
$$y^* = 1 + \frac{y^{\lambda} - 1}{\lambda\, a^{-(1 - \lambda)}} \tag{9.46b}$$
A family of curves will result for different values of λ and the selection is made based on approximating the distribution of the transformed y* variable most closely to a normal distribution. When λ = 0, it can be shown that Eq. (9.46b) for y* essentially becomes the (log y) transformation. Figure 9.16 illustrates how the beta distribution can be converted to a near normal distribution following the Box-Cox transformation. The confidence intervals of model predictions are likely to be more realistic and many of the standard OLS inferences can be made (provided the other OLS conditions are met). Additionally, transforming the variables can improve the model predictive power because transformations usually reduce white noise. Also, the transformation given by Eqs. (9.46a) and (9.46b) is likely to result in a poorer interpretability in the target variable than that provided by the simple log-transformation. Finally, the reverse transformation is said to apply to the median of the prediction distribution (and not to the mean value as is the case for the simple OLS).
Fig. 9.16 Converting a beta distribution into a near normal distribution using the Box-Cox transformation
Example 9.4.3 This example is meant to illustrate the log-transformed regression analysis and one involving the Box-Cox transformation. Data shown in Table 9.15 needs to be fit using the simple power equation: $y = a x^{b}$.
Table 9.15 Data table for Example 9.4.3
x   1    1.5  2    2.5  3    3.5  4    4.5  5
y   0.5  1    1.7  2.2  3.4  4.7  5.7  6.2  8.4
(a) Analysis with Log Linearized Model and OLS
Taking natural logarithms results in:
$$\ln y = \ln a + b \ln x \quad \text{or:} \quad y^* = a' + b x^* \tag{9.47}$$
with $y^* = \ln y$, $x^* = \ln x$ and $a' = \ln a$.
Subsequently, a simple OLS regression yields: a′ = 0.70267 and b′ = 1.7386 with R2 = 0.996 (excellent) and RMSE = 0.0617 for the log-transformed model. From here, a = 0.4953 and b = b′ = 1.7386. In terms of prediction at x = 3.5, y = 4.373, while the observed value is 4.70. Note that the data scatter within 95% CL for individual predictions but not the 95% mean prediction intervals (Fig. 9.17a). The model residuals of the
380
9 2.5
1.8
1.5 Studentized residual
2.3
Ln(y)
1.3 0.8 0.3
Parametric and Non-Parametric Regression Methods
0.5 -0.5
-1.5
-0.2 -0.7
-2.5 0
0.3
0.6
0.9 Ln(x)
1.2
1.5
1.8
0
0.3
0.6
0.9 Ln(x)
1.2
1.5
1.8
Fig. 9.17 Regression results of the transformed power model ln( y) = -0.70267 + 1.7386.ln(x) (Example 9.4.3). (a) Plot of fitted model (b) Residual plot
transformed model (Fig. 9.17b) reveal some improper residual behavior (non-constant variance). The analysis flagged the fourth observation (x = 2.5) as an unusual residual (studentized value of -2.17), and there are no influential points. The RMSE of the back-transformed predictions compared with the original data is 0.2736 (Table 9.16).

Table 9.16 Measured and predicted values of the response following the log-transformed model (Eq. 9.48a) and the Box-Cox transformed model (Eq. 9.48b) converted back to the original response variable
x_i     y_meas   y_Ln model   y_BoxCox model
1       0.5      0.4953       0.513806
1.5     1        1.002353     1.004441
2       1.7      1.652871     1.639405
2.5     2.2      2.436278     2.412385
3       3.4      3.344963     3.318622
3.5     4.7      4.373056     4.354326
4       5.7      5.515817     5.516354
4.5     6.2      6.769298     6.802039
5       8.4      8.130132     8.209064
RMSE             0.2736       0.2756

(b) Analysis with Box-Cox Transformation and OLS
The most suitable Box-Cox transformation, i.e., the one which minimizes the mean squared error following Eq. 9.46b, was first determined. It was found to be:
$\mathrm{BoxCox}(y) = 1 + \dfrac{y^{0.545} - 1}{0.545 \times 2.74576^{-0.455}}$ (9.48a)
The equation of the fitted model was determined by simple OLS regression:
$\mathrm{BoxCox}(y) = -1.66683 + 1.78257\,x$ (9.48b)
with R² = 0.993 and RMSE = 0.219. In this case x = 4.5 was flagged as an unusual residual (studentized residual = -3.44)
and no influential points were detected. The RMSE of the back-transformed predictions compared with the original data is 0.2756. How does one determine which approach is better? Both modeling approaches give results which are close in terms of predictions and in RMSE of the back-transformed y-variable (see Table 9.16). However, the standard errors of the parameter estimates and the model prediction bands of the Box-Cox transformed y-variable are likely to be more statistically robust. The model prediction uncertainties can be determined by resampling methods; the k-fold cross-validation approach is unsuitable given the small data set, while the bootstrap method could be used. In fact, prior to undertaking the Box-Cox transformation analysis, the analyst should have investigated the shape of the distribution of the original measured data and of the log-transformed data. Figure 9.18 indicates that both the original measured response and its log transformation result in normally distributed data, so the Box-Cox transformation was probably not needed in the first place. ■
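Part (a) of Example 9.4.3 can be retraced with a few lines of Python. The sketch below is not the authors' software workflow; it simply applies the closed-form simple OLS formulas to the log-transformed data of Table 9.16. The helper name `ols_line` is hypothetical, and the n - 1 divisor in the back-transformed RMSE is an assumption chosen because it reproduces the tabulated value.

```python
import math

# Data of Example 9.4.3 (columns x_i and y_meas of Table 9.16)
x = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
y = [0.5, 1.0, 1.7, 2.2, 3.4, 4.7, 5.7, 6.2, 8.4]


def ols_line(u, v):
    """Simple OLS fit v = c0 + c1*u; returns (c0, c1). Hypothetical helper."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    s_uv = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    s_uu = sum((ui - u_bar) ** 2 for ui in u)
    c1 = s_uv / s_uu
    return v_bar - c1 * u_bar, c1


# Fit ln(y) = a' + b*ln(x) (Eq. 9.47), then back-transform: a = exp(a')
a_prime, b = ols_line([math.log(v) for v in x], [math.log(v) for v in y])
a = math.exp(a_prime)

# Predictions in the original units and their RMSE against the measured y
y_hat = [a * xi ** b for xi in x]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
rmse = math.sqrt(sse / (len(y) - 1))  # assumption: n - 1 divisor

print(f"a' = {a_prime:.4f}, a = {a:.4f}, b = {b:.4f}, RMSE = {rmse:.4f}")
```

The same pattern (transform, fit by OLS, back-transform, compare in original units) applies to the Box-Cox variant once the transformation of Eq. 9.48a is substituted for the logarithm.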
Fig. 9.18 Normal probability plots (cumulative percentage on the vertical axis) with 95% limits of (a) the measured y-variable, and (b) the log-transformed y-variable, Ln(y). Since the data points fall within the limits, these plots indicate that the variables can be considered to be normally distributed
9.4.5
Logistic Functions
The exponential model was described in Sect. 2.4.3 in terms of modeling unrestricted growth. Logistic models are extensions of the exponential model in that they apply to instances where growth is initially unrestricted, but gradually changes to restricted growth as resources get scarcer. They are an important class of equations which appear in several fields for modeling various types of growth such as populations (humans, animal and biological) as well as energy and material use patterns (Draper and Smith 1981; Masters and Ela 2008). Such models also fall under the GLM umbrella since a link function allows the necessary transformation (see Table 9.14).⁴ The non-linear shape of these models is captured by an S curve (called a sigmoid function) that reaches a steady-state value (see Fig. 9.19). One can note two phases: (i) an early phase during which the environmental conditions are optimal and allow the rate of growth to be exponential, and (ii) a second phase where the rate of growth is restricted by the amount of growth yet to be achieved and assumed to be directly proportional to this amount. The following model captures this behavior of population N over time t:
$\dfrac{dN}{dt} = rN\left(1 - \dfrac{N}{k}\right), \qquad N(0) = N_0$ (9.49)
where N(0) is the population at time t = 0, r is the growth rate constant, and k is the carrying capacity of the environment. The factor k can be constant or time varying; an example of the latter is the observed time variant periodic behavior of predator-prey populations in closed ecosystems. The factor (1 - N/k) is referred to as the environmental factor. One way of describing this behavior in the context of biological organisms is to assume that during the early phase, food is available for both growth and sustenance, while at the saturation level it is restricted and is available for sustenance only, resulting in stoppage of growth. The solution to Eq. 9.49 is:
$N(t) = \dfrac{k}{1 + \exp[-r(t - t^*)]}$ (9.50a)
where t* is the time at which N = k/2, and is given by:
$t^* = \dfrac{1}{r}\ln\left(\dfrac{k}{N_0} - 1\right)$ (9.51)
If the instantaneous growth rate at t = 0 is R(0) = R0, then it can be shown that
$r = \dfrac{R_0}{1 - N_0/k}$ (9.52)
and Eq. 9.50a can be rewritten as:
$N(t) = \dfrac{k}{1 + \left(\dfrac{k}{N_0} - 1\right)\exp(-R_0 t)}$ (9.50b)
Thus, knowing the quantities k, initial conditions N0 and R0 (at t = 0) allows r, t*, and N(t) to be determined. Another useful concept in population biology is the concept of maximum sustainable yield of an ecosystem (Masters and Ela 2008). This corresponds to the maximum removal rate which can sustain the existing population and would occur when $d^2N/dt^2 = 0$, i.e., from Eq. 9.50b, when N = k/2. Thus, if the fish population in a certain pond follows the logistic growth curve when there is no fish harvesting, then the maximum rate of fishing would be achieved when the actual fish population is maintained at half its carrying capacity. Many refinements to the basic logistic growth model have been proposed which allow such factors as fertility and mortality rates, population age composition, migration rates, interaction with other species, etc. to be considered.

⁴ The logistic model shown in Table 9.14 is a more general formulation with a slightly different functional form.

Fig. 9.19 Exponential and logistic growth curves for annual worldwide primary energy use assuming an initial annual growth rate of 2.4% and under two different values of carrying capacity k (Example 9.4.4)

Example 9.4.4 Use of logistic models for predicting growth of worldwide energy use
The primary energy use in the world in 2008 was about 14 Tera Watts (TW) or 441.5 Exajoules (EJ) per year (1 EJ = 10^18 J). The annual growth rate is close to 2.4%.
(a) If the energy growth is taken to be exponential (implying unrestricted growth), in how many years would the energy use double?
The exponential model is given by Q(t) = Q0 exp(R0 t), where R0 is the growth rate (= 0.024) and Q0 (= 14 TW) is the annual energy use at the start, i.e., for year 2008. The doubling time would occur when Q(t) = 2Q0, and by simple algebra: t_doubling = 0.693/R0 = 28.9 years, or at about year 2037.
(b) If the growth is assumed to be logistic with a carrying capacity of k = 45 TW (i.e., the value of annual energy use is likely to stabilize at this value in the far future), determine the annual energy use for year 2037.
With t = 28.9 and R0 = 0.024, Eq. 9.50b yields:
$Q(t) = \dfrac{45}{1 + \left(\dfrac{45}{14} - 1\right)\exp[-(0.024)(28.9)]} = 21.36\ \mathrm{TW}$
which is (as expected) much less than that predicted by the exponential model of 28 TW. The plots in Fig. 9.19 illustrate logistic growth curves for two different values of k with the exponential curve also drawn for comparison. The asymptotic behavior of the logistic curves and the fact that the curves start deviating from each other quite early on are noteworthy points. ■

The interested reader can follow a similar analysis to compare the energy use in 2020 predicted by the above model (if none of the parameters have changed) with the reported value of 555.6 EJ and speculate on reasons for this difference (one reason is the Covid outbreak).
Logistic models have numerous applications other than modeling restricted growth. Some of them are described below:
(a) In marketing and econometric applications for modeling the time rate at which new technologies penetrate the market place (e.g., the saturation curves of new household appliances) or for modeling changes in consumer behavior (i.e., propensity to buy) when faced with certain incentives or penalties (see Pindyck and Rubinfeld 1981 for numerous examples).
(b) In neural network modeling (a form of non-linear blackbox modeling discussed in Sect. 9.8) where logistic curves are used because of their asymptotic behavior at either end. This allows the variability in the regressors to be squashed or clamped within pre-defined limits.
(c) To model probability of occurrence of an event against one or several predictors which could be numerical or categorical. An example of a medical application would be to model the probability of a heart attack for a population exposed to one or more risk factors. Another application is to model the spread of disease in epidemiological studies. A third application is to model the somewhat random event of window positions being open or closed by the occupant in a naturally ventilated house. A fourth is for dose response modeling meant to identify a non-linear relationship between the dose of a toxic agent to which a population is exposed and the response or risk of infection of the individuals stated as a probability or proportion. This is discussed further below.
(d) To model binary responses. For example, manufactured items may be defective or satisfactory; patients may respond positively or not to a new drug during clinical trials; a person exposed to a toxic agent could be infected or not. Logistic regression could be used to discriminate between the two groups or multiple groups (see Sect. 11.4, where classification methods are covered).
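The arithmetic of Example 9.4.4 is easy to retrace numerically. The following sketch (function and variable names are illustrative, not from the text) evaluates the doubling time and the logistic prediction exactly as the example does, i.e., with R0 in the exponent of Eq. 9.50b. As a side calculation it also evaluates the same expression with the rate constant r obtained from Eq. 9.52, since the closed-form solution of Eq. 9.49 carries r in the exponent; the two results differ noticeably.

```python
import math


def doubling_time(r0):
    """Years needed for exponential growth at rate r0 to double the initial value."""
    return math.log(2.0) / r0


def logistic(t, k, n0, rate):
    """Logistic curve in the form of Eq. 9.50b for a given exponent rate."""
    return k / (1.0 + (k / n0 - 1.0) * math.exp(-rate * t))


q0 = 14.0   # TW, energy use in 2008
r0 = 0.024  # initial annual growth rate
k = 45.0    # TW, assumed carrying capacity

t_double = doubling_time(r0)             # about 28.9 years
q_exp = q0 * math.exp(r0 * t_double)     # exponential model, 2 * q0
q_log = logistic(28.9, k, q0, r0)        # as evaluated in Example 9.4.4

# Side calculation (not in the example): rate constant from Eq. 9.52
r = r0 / (1.0 - q0 / k)
q_log_r = logistic(28.9, k, q0, r)

print(f"doubling time = {t_double:.1f} yr, exponential = {q_exp:.1f} TW")
print(f"logistic with R0 = {q_log:.2f} TW, logistic with r = {q_log_r:.2f} TW")
```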
As stated in (c) above, logistic functions are widely used to model how humans are affected when exposed to different toxic loads (called dose-response) in terms of a proportion or a probability P. For example, if a group of people is treated by a drug, not all of them are likely to be responsive. If the experiments can be performed at different dosage levels x, the percentage of responsive people is likely to change. Here the response variable is called the probability of “success” (which actually follows the binomial distribution) and can assume values between 0 and 1, while the regressor variable can assume any appropriate numerical value. Industrial applications include failure analysis, fatigue testing, and reliability testing. For example, functional electrical testing on a semiconductor can yield: (i) “success” in which case the device works, (ii) “failure” due to a short or open circuit, or some other failure mode. Such binary variables can be modeled by the following two parameter model:
$P(x) = \dfrac{1}{1 + \exp[-(\beta_0 + \beta_1 x)]}$ (9.53)
where x is the dose to which the group is exposed and (β0, β1) are the two parameters to be estimated by MLE. Note that this follows from Eq. 9.50a when k = 1. One defines a new variable, called the odds ratio, as [P/(1 - P)]. Then, the log of this ratio is:
$\pi = \ln\left(\dfrac{P}{1 - P}\right)$ (9.54)
where π is called the logit function.⁵ Simple manipulation of Eqs. 9.53 and 9.54 leads to a linear functional form for the logit model:
$\pi = \beta_0 + \beta_1 x$ (9.55)
Thus, rather than formulating a model for P as a non-linear function of x, the approach is to model π as a linear function of x. The logit allows the number of failures and successes to be conveniently determined. For example, π = 3 would imply that success P is e³ ≈ 20 times more likely than failure. Thus, a unit increase in x will result in a change of β1 in the logit function. The above model can be extended to multi-regressors. The general linear logistic model allows the combined effect of several variables or doses to be included:
$\pi = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k$ (9.56)
Note that the regression variables can be continuous, categorical, or mixed. The above models can be used to predict the dosages which induce specific levels of responses. Of particular interest is the dosage which produces a response in 50% of the population (median dose). The following example illustrates these notions.

⁵ The logit model uses the cumulative distribution function of the logistic distribution while a similar function, called the probit model, uses the cumulative distribution function of the standard normal distribution. Both functions will take any number and rescale it to be in between 0 and 1.

Example 9.4.5⁶ Fitting a logistic model to the kill rate of the fruit fly
A toxicity experiment was conducted to model the kill rate of the common fruit fly when exposed to different levels of nicotine concentration for a pre-specified time interval (recall that the product of concentration and duration of exposure equals the dose). Table 9.17 assembles the experimental results.
The single variate form of the logistic model (Eq. 9.55) is used to estimate the two model parameters using MLE. The regressor variable is the concentration, while the response variable is the percent, or the proportion, killed. An MLE analysis yields the results shown in Table 9.18. Estimating the dose or concentration which will result in P = 0.5, i.e., a 50% fatality or success rate, is straightforward. From Eq. 9.54, the logit value is π = ln[P/(1 - P)] = ln(1) = 0. Then, using Eq. 9.55, one gets x50 = -(β0/β1) = 0.276 g/100 cc. ■

Table 9.17 Data table for Example 9.4.5ᵃ
Concentration (g/100 cc)   Number of insects   Number killed   Percent killed
0.10                       47                  8               17.0
0.15                       53                  14              26.4
0.20                       55                  24              43.6
0.30                       52                  32              61.5
0.50                       46                  38              82.6
0.70                       54                  50              92.6
0.95                       52                  50              96.2
ᵃ Data available electronically on book website

Table 9.18 Results of maximum likelihood estimation (MLE) for Example 9.4.5
Parameter   Estimate   Standard error   Chi-square   p-value
β0          -1.7361    0.2420           51.4482      < 0.0001
β1          6.2954     0.7422           71.9399      < 0.0001

⁶ From Walpole et al. (2007) by permission of Pearson Education.

Example 9.4.6 Dose response modeling for sarin gas
Figure 9.20a shows the dose response curves for sarin gas for both casualty dose (CD), i.e., which induces an adverse reaction, and lethal dose (LD) which causes death. The CD50 and LD50 values, which will affect 50% of the population exposed, are specifically indicated because of their importance as stated earlier. Though various functions are equally plausible, the parameter estimation will be done using the logistic curve. Specifically, the LD curve is to be fitted whose data has been read off the plot and assembled in Table 9.19. In this example, let us assume that all conditions for standard multiple regression are met (such as equal error variance across the range of the dependent variable), and so MLE and OLS will yield identical results. In case they were not, and the statistical package being used does not have the MLE capability, then the weighted least squares method could be adopted to yield maximum likelihood estimates.
First, the dependent variable, i.e., the fraction of people affected, is transformed into its logit equivalent given by Eq. 9.54; the corresponding numerical values are given in the last row of Table 9.19 (note that the entries at either end are left out since the logit is undefined for fractions of 0 and 1). A second-order model in the dose level, but linear in the parameters, is of the form:
$\pi = b_0 + b_1 x + b_2 x^2$ (9.57)
This model was found to be more appropriate than the simple model given by Eq. 9.55. The model parameters were statistically significant, and the model fits the observed data very well (see Fig. 9.20b) with Adj R² = 94.3%. The numerical values of the parameters and the 95% CL of the second-order logit model are listed in Table 9.20. ■
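The two fitting routes used in Examples 9.4.5 and 9.4.6 can be sketched in plain Python. This is an illustrative reconstruction, not the statistical package used for the examples. The first part maximizes the binomial likelihood of Eq. 9.53 for the fruit fly data by Newton-Raphson iterations; the second fits Eq. 9.57 by OLS to the logit values of Table 9.19. The helper names `solve` and `fit_logit_mle`, the zero starting values, the fixed iteration count, and the scaling of the dose by 1000 (to keep the normal equations well conditioned) are all assumptions of this sketch.

```python
import math


def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][j] * z[j] for j in range(r + 1, n))) / m[r][r]
    return z


# --- Example 9.4.5: MLE of the logit model, data of Table 9.17 ---
conc = [0.10, 0.15, 0.20, 0.30, 0.50, 0.70, 0.95]
insects = [47, 53, 55, 52, 46, 54, 52]
killed = [8, 14, 24, 32, 38, 50, 50]


def fit_logit_mle(x, n, k, n_iter=25):
    """Newton-Raphson on the grouped binomial log-likelihood; returns (beta0, beta1)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, ni, ki in zip(x, n, k):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = ni * p * (1.0 - p)          # binomial variance weight
            g0 += ki - ni * p               # score with respect to beta0
            g1 += xi * (ki - ni * p)        # score with respect to beta1
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        d0, d1 = solve([[h00, h01], [h01, h11]], [g0, g1])
        b0 += d0
        b1 += d1
    return b0, b1


beta0, beta1 = fit_logit_mle(conc, insects, killed)
x50 = -beta0 / beta1  # median dose, from Eq. 9.55 with logit = 0

# --- Example 9.4.6: OLS fit of Eq. 9.57 to the logits of Table 9.19 ---
dose = [1100, 1210, 1300, 1400, 1500, 1600, 1700, 1880, 2100]
fatal = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
logit = [math.log(p / (1.0 - p)) for p in fatal]
z = [d / 1000.0 for d in dose]                      # scaled dose (assumption)
rows = [[1.0, zi, zi * zi] for zi in z]
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * v for r, v in zip(rows, logit)) for i in range(3)]
c0, c1, c2 = solve(xtx, xty)
b0, b1, b2 = c0, c1 / 1e3, c2 / 1e6                 # back to original dose units

print(f"fruit fly: beta0 = {beta0:.4f}, beta1 = {beta1:.4f}, x50 = {x50:.3f}")
print(f"sarin LD:  b0 = {b0:.4f}, b1 = {b1:.6f}, b2 = {b2:.3e}")
```

The printed coefficients can be compared against Tables 9.18 and 9.20 respectively.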
9.4.6
Error in Variables (EIV) and Corrected Least Squares
Parameter estimation using OLS is based on the premise that the regressors are either known without error (γ = 0), or that their errors are very small compared to those of the response variable. There are situations when the above conditions are
Fig. 9.20 (a) Dose-response curve for sarin gas (from Kowalski 2002 by permission of McGraw-Hill). (b) Plot depicting the accuracy of the identified second-order logit model with observed data
Table 9.19 Data used for model building (for Example 9.4.6)ᵃ
LD Dose (x)     800         1100     1210     1300     1400     1500   1600    1700    1880    2100    3000
Fatalities %    0           10       20       30       40       50     60      70      80      90      100
Logit values    Undefined   -2.197   -1.386   -0.847   -0.405   0      0.405   0.847   1.386   2.197   Undefined
ᵃ Data available electronically on book website
Table 9.20 Estimated model parameters of the second-order logit model (Example 9.4.6)
Parameter   Estimate       Standard error   95% CL lower limit   95% CL upper limit
b0          -10.6207       0.334209         -11.411              -9.83045
b1          0.0096394      0.00035302       0.00880463           0.0104742
b2          -1.700 E-6     8.5546 E-8       -1.902 E-6           -1.498 E-6
Fig. 9.21 Figure depicting how a physical parameter in a model (in this case the heat exchanger thermal resistance R of a chiller) estimated using OLS becomes gradually more biased as noise is introduced in the x variable. No such bias is seen when EIV estimation is adopted. The uncertainty bands by both methods are about the same as the magnitude of the error is increased. (From Andersen and Reddy 2002)
invalid (for example, when the variable x is a function of several basic measurements, each with its own measurement error), and this is when the error in variable (EIV) approach is appropriate. If the variances of the measurement errors are known, the bias of the OLS estimator can be removed and a consistent estimator, called Corrected Least Squares (CLS), can be derived. This is illustrated in Fig. 9.21, which shows how the bias of the OLS parameter estimate gradually increases as more noise is introduced in the x variable, while no such bias is seen for the CLS estimates. However, note that the 95% uncertainty bands by both methods are about the same. CLS accounts for the uncertainty in both the regressor variables and the dependent variable by minimizing a distance weighted by the ratio of these two uncertainties. Consider the simple linear regression model y = α + βx, and assume that the errors in the dependent variable and the errors in the independent variable are uncorrelated. The basis of the minimization scheme is described graphically by Fig. 9.22. Next, a parameter representing the relative weights of the variance of measurements of x and y is defined as:
$\lambda = \dfrac{\mathrm{var}(\gamma)}{\mathrm{var}(\delta)}$ (9.58)
where γ is the measurement error of the regressor variable and δ that of the response variable. The loss function assumes the form (Mandel 1964):
$S = \sum_i \left[(x_i - \hat{x}_i)^2 + \lambda\,(y_i - \hat{y}_i)^2\right]$ (9.59)
subject to the condition that $\hat{y}_i = a + b\,\hat{x}_i$, where $(\hat{y}_i, \hat{x}_i)$ are the estimates of $(y_i, x_i)$. Omitting the derivation, the CLS estimates turn out to be:
Fig. 9.22 Differences in both slope and the intercept of a linear model when the parameters are estimated under OLS and under EIV. Points shown as “*” denote data points and the solid lines are the estimates of the two models. The dotted lines, which differ by angle θ, indicate the distances whose squared sum is being minimized in both approaches. (From Andersen and Reddy 2002)
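$b = \dfrac{\lambda S_{yy} - S_{xx} + \left[\left(\lambda S_{yy} - S_{xx}\right)^2 + 4\lambda S_{xy}^2\right]^{1/2}}{2\lambda S_{xy}}$ (9.60a)
and
$a = \bar{y} - b\,\bar{x}$ (9.60b)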
The sum of squares quantities S_xx, S_yy, and S_xy in Eq. 9.60a are defined in Sect. 5.5.1. The extension of Eq. 9.60a to the multivariate case is given by Fuller (1987):
$b_{CLS} = \left(X'X - S^2_{xx}\right)^{-1}\left(X'Y - S^2_{xy}\right)$ (9.61)
where $S^2_{xx}$ is a (p × p) matrix with the covariance of the measurement errors and $S^2_{xy}$ is a (p × 1) vector with the covariance between the regressor variables and the dependent variable. A simple conceptual explanation is that Eq. 9.61 performs on the estimator matrix an effect essentially the opposite of what ridge regression does. While ridge regression “jiggles” or randomly enhances the dispersion in the numerical values of the X variables in order to reduce the adverse effect of multi-collinearity on the estimated parameter bias, CLS tightens the variation in an attempt to reduce the effect of random error on the X variables. For the simple regression model, the EIV approach recommends that minimization of errors be done following angle θ (see Fig. 9.22) given by:
$\tan(\theta) = \dfrac{b}{s_y/s_x}$ (9.62)
where s is the standard deviation. For the case of b = 1, one gets a curve such as Fig. 9.23. Note that when the ratio of the two standard deviations is less than about 5, the angle θ varies little and is about 10 degrees. This is the basis of the rough rule of thumb that if the measurement uncertainty of the x variable, characterized by the standard deviation, is less than 1/5th of that in the response variable, then there is little benefit in applying EIV regression as compared to OLS. The 1/5th rule has been suggested for the simple regression case and should not be used for multi-regression with correlated regressors. The interested reader can refer to Beck and Arnold (1977) and to Fuller (1987) for more in-depth treatment of the EIV approach.

Fig. 9.23 Simple linear regression with errors in the x variable. Plot of Eq. (9.62) with b = 1

9.5
Non-Linear Parametric Regression
9.5.1
Detecting Non-Linear Correlation
The Pearson correlation coefficient r was introduced in Sects. 3.4.2 and 4.2.7 as a statistical index to quantify the linear relationship between two continuous variables: in a regression context, either two regressors or one regressor and one response variable. Its numerical value lies in the range -1 ≤ r ≤ +1, with -1 representing perfect negative correlation and +1 denoting perfect positive correlation. A visual representation of the variation in two variables for different values of r is provided by Fig. 3.13. A value of r = 0 as shown in subplots Fig. 3.13e, f indicates no correlation whatsoever. However, there is a fundamental distinction between how the data scatter in these two subplots. While Fig. 3.13e exhibits random behavior between the two variables with no correlation of any sort detected by the eye, a clear quadratic relationship can be seen in Fig. 3.13f. It is simply that a quadratic correlation is not captured by the Pearson correlation coefficient. It is logical to look for another method which can detect the presence of any sort of relationship or interdependence between two variables. The approach below follows that by Wolberg (2006).
The concept of non-linear correlation coefficient r_nc was introduced by Shannon in the context of information theory⁷ related to two binary variables. The presence of any sort of correlation or interdependence between, say, two random variables X1 and X2 can be interpreted in terms of being able to infer a probability value of X2 for a given value of X1. In other words, does one gain any sort of information on X2 when the value of X1 is specified? The process of gaining information is equivalent to that of reducing uncertainty or disorder, a concept which Shannon termed entropy (borrowed from the second law of thermodynamics). This forms the basis of the method proposed to quantify the degree of dependence. Importantly, here higher entropy reflects more information content (a positive outcome), while in thermodynamics it reflects greater irreversibility in a process (an undesirable trait).

⁷ The introductory textbook on information theory by Pierce (1980) is recommended for further reading.

Consider two discrete random variables X1 and X2 with the range between the minimum and maximum values binned and categorized into five sub-ranges as shown in Table 9.21 (if the variables are continuous, they need to be discretized into a finite number of categories). The data set consists of 100 data points and the numbers in the individual cells indicate their associated number of occurrences. As suggested by the hand-drawn line, some sort of non-linear correlation between both variables does seem to exist. To quantify this relationship, the data in Table 9.21 is recoded as probabilities (see Table 9.22).

Table 9.21 The number of occurrences of two random variables (total of 100 points) partitioned into 25 cells. The hand drawn curve is meant to show the presence of a non-linear trend between both variables
Non-empty cells, listed by column:
X2 = 1: 10, 20
X2 = 2: 10, 10
X2 = 3: 5
X2 = 4: 20, 10
X2 = 5: 10, 5

Table 9.22 Data from Table 9.21 recoded as probabilities for different categories along with row/column entropies computed following Eqs. 9.64 and 9.65
Non-empty cell probabilities, listed by column:
X2 = 1: 0.10, 0.20
X2 = 2: 0.10, 0.10
X2 = 3: 0.05
X2 = 4: 0.20, 0.10
X2 = 5: 0.10, 0.05
Row totals:
           X1 = 1   X1 = 2   X1 = 3   X1 = 4   X1 = 5   Σ
Sum        0.20     0.10     0.20     0.20     0.30     1.0
Entropy    0.4644   0.3322   0.4644   0.4644   0.5211   2.2465
Column totals:
           X2 = 1   X2 = 2   X2 = 3   X2 = 4   X2 = 5   Σ
Sum        0.30     0.20     0.05     0.30     0.15     1.0
Entropy    0.5211   0.4644   0.2161   0.5211   0.4105   2.1332

The concept of entropy interpreted as information content I_C of a particular value (or cell) is given by:
$I_C = -p_i \times \log_2(p_i)$ (9.63)
The log is to the base 2 and a minus sign (-) is introduced simply to obtain positive values for I_C. For example, the probability sum of X1 = 1 is p = 0.20 and so the information content is I_C = -(0.2) × log2(0.2) = 0.4644. Next, this concept is extended to the probability over all categorical values of both variables with entropy denoted by H(X). For the variable X1, summing the information content over the row sums gives:
$H(X_1) = -\sum_{i=1}^{n_1} p_i \log_2(p_i) = 0.4644 + 0.3322 + 0.4644 + 0.4644 + 0.5211 = 2.2465$ (9.64)
where n1 is the number of categories (5 in this case). Entropy H(X2) is similarly computed, and the values are also shown in Table 9.22. From Table 9.22, H(X1) = 2.2465 and H(X2) = 2.1332. Once the entropy of each variable separately is computed, the entropy of the two variables combined is determined from:
$H(X_1 X_2) = -\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} p_{ij} \times \log_2(p_{ij})$ (9.65)
where n1 and n2 are the number of categories of the two variables. Typically, they are chosen to be the same, but need not be so. For the illustrative example,
$H(X_1 X_2) = -[2 \times 0.05 \log_2(0.05) + 5 \times 0.10 \log_2(0.10) + 2 \times 0.20 \log_2(0.20)] = 3.022$
Finally, the non-linear correlation coefficient r_nc is determined from:
$r_{nc} = \dfrac{2\,[H(X_1) + H(X_2) - H(X_1 X_2)]}{H(X_1) + H(X_2)}, \qquad 0 \le r_{nc} \le 1$ (9.66)
or r_nc = 2 × (2.246 + 2.133 - 3.022)/(2.246 + 2.133) = 0.620. Note that r_nc = 0 indicates total lack of correlation, while r_nc = 1 indicates perfect correlation. In the solved example, a value of 0.620 suggests a good correlation, but a statistical significance test is still warranted. This is done by defining a new random variable T as:
$T = 2n\,[H(X_1) + H(X_2) - H(X_1 X_2)] = (n\, r_{nc}) \times [H(X_1) + H(X_2)]$ (9.67)
where n is the number of observations. The variable T approaches the Chi-square distribution with ν degrees of freedom, where ν = (n1 - 1)(n2 - 1). In the illustrative example, T = 100 × 0.620 × (2.246 + 2.133) = 271.5 with ν = (5 - 1)(5 - 1) = 16. From Table A5 listing the critical values of the chi-square distribution, the null hypothesis can be safely rejected, and one can conclude that the non-linear correlation coefficient is statistically very significant.
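The entropy calculations of Eqs. 9.63 to 9.67 can be checked with a short script. In the sketch below, the row and column totals and the set of nine non-zero cell values are those of Table 9.22, but the placement of the cells within the rows is an assumption made only so that a complete 5 × 5 joint table is available; the entropies depend on the totals and on the cell values, not on this placement. The function name `entropy` is illustrative.

```python
import math

# Joint probabilities p_ij; rows are X1 = 1..5, columns are X2 = 1..5.
# ASSUMPTION: the row positions of the cells are illustrative; they were chosen
# to reproduce the row and column totals of Table 9.22.
P = [
    [0.00, 0.10, 0.00, 0.00, 0.10],
    [0.00, 0.10, 0.00, 0.00, 0.00],
    [0.10, 0.00, 0.05, 0.00, 0.05],
    [0.00, 0.00, 0.00, 0.20, 0.00],
    [0.20, 0.00, 0.00, 0.10, 0.00],
]
n_obs = 100  # number of observations


def entropy(probs):
    """Sum of -p*log2(p) over the non-zero probabilities (Eqs. 9.63 to 9.65)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


row_tot = [sum(row) for row in P]            # marginal probabilities of X1
col_tot = [sum(col) for col in zip(*P)]      # marginal probabilities of X2

h1 = entropy(row_tot)                        # H(X1), Eq. 9.64
h2 = entropy(col_tot)                        # H(X2)
h12 = entropy(p for row in P for p in row)   # H(X1 X2), Eq. 9.65

r_nc = 2.0 * (h1 + h2 - h12) / (h1 + h2)     # Eq. 9.66
t_stat = 2.0 * n_obs * (h1 + h2 - h12)       # Eq. 9.67
dof = (len(row_tot) - 1) * (len(col_tot) - 1)

print(f"H(X1) = {h1:.4f}, H(X2) = {h2:.4f}, H(X1X2) = {h12:.4f}")
print(f"r_nc = {r_nc:.3f}, T = {t_stat:.1f}, degrees of freedom = {dof}")
```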
9.5.2
Different Non-Linear Search Methods
There are numerous functions which are intrinsically non-linear in the parameters, i.e., cannot be transformed into functions linear in the parameters. Two examples of such models are:
$y = b_0 + b_1 \exp(-b_2 x)$ and $y = \exp\left(b_1 + b_2 x^2\right)$ (9.68)
Non-linear least-squares regression is the only recourse in such cases. Unlike linear parameter estimation which has closed form matrix solutions, non-linear least squares estimation is iterative and requires numerical methods which closely parallel the search techniques used in optimization problems (see Sect. 7.4). Any non-linear least squares procedure is considerably more difficult to solve than its linear counterpart, requiring partial derivatives and Jacobian matrices to be determined several times. Note that the three major issues during non-linear regression are: (i) importance of specifying good initial or starting estimates, (ii) a robust algorithm that suggests the proper search direction and step size, and (iii) a valid stopping criterion. Moreover, non-linear estimation is prone to all the pitfalls faced by optimization problems resorting to search methods, such as local convergence, no solution being found, and slow convergence.
There are basically two classes of search methods: gradient-based (such as the steepest-descent and Gauss-Newton methods) and gradient-free methods. In the gradient descent method, the sum of the squared errors is reduced by updating the parameters in the steepest-descent direction. In the Gauss-Newton method, the sum of the squared errors is reduced by assuming the least squares function to be locally quadratic (using Taylor series expansion) and finding the minimum of the quadratic function. It is faster than steepest-descent methods but sometimes may not converge to a solution, while the latter are slower but are more robust and effective when the starting values are far off the final values. Perhaps the most widely used least-squares search method is the Levenberg-Marquardt algorithm (LMA), which combines the desirable features of both the Gauss-Newton and the steepest descent methods while avoiding their more serious limitations (Draper and Smith 1981). Its attractiveness lies in the fact that it is far less prone to divergence than the Gauss-Newton method and does not slow down as do steepest-descent methods. It is widely viewed as a sensible practical choice. LMA adopts the gradient-descent approach when the parameters are far from their optimal value and acts more like the Gauss-Newton method when the parameters are close to their optimal value. It finds a local minimum starting at an initial guess of the parameter values.
In systems where there is only one minimum, LMA will converge to the global minimum even if the initial guess is arbitrary. In systems with multiple minima, such as the one shown conceptually in Fig. 9.24, LMA is likely to find the global minimum only if the initial guess is already close to the solution, and so a multi-start search is warranted. However, it allows more flexibility than the Gauss-Newton method, i.e., the initial values can be selected further away from the solution.
Most of the statistical software packages have the capability of estimating parameters of non-linear models. However, it is advisable to use them with due care and specify proper starting values; otherwise, the trial-and-error solution approach can lead to the program being terminated abruptly (one of the causes being ill-conditioning; see Sect. 9.2). Some authors suggest not to trust the output of a non-linear solution until one has plotted the measured values against the predicted and looked at the residual plots. The interested reader can refer to several advanced texts which deal with non-linear estimation (such as Bard 1974; Beck and Arnold 1977; Draper and Smith 1981; Neter et al. 1983).
Section 7.3.4, which dealt with numerical search methods, described and illustrated the general penalty function approach for constrained optimization problems. Such problems also apply to non-linear least squares parameter estimation where one has a reasonable idea beforehand of the range of variation in the individual parameters (say, based on physical considerations), and would like to constrain the search space to this range.⁸ These problems can be converted into unconstrained multi-objective problems. The objective of minimizing the squared errors δ between measured and model-predicted values is combined with another term which tries to maintain reasonable values of the parameters p by adversely weighting the difference between the search values of parameters and their preferred values based on prior knowledge. If the vector p denotes the set of parameters to be estimated, then the loss or objective function is written as the weighted square sum of the model residuals and those of the parameter deviations:
$J(p) = \sum_{j=1}^{n} w\,\delta_j^2 + \sum_{i=1}^{N} (1 - w)\left(p_i - \bar{p}_i\right)^2$ (9.69)
where w is the weight (usually a fraction) associated with the model residuals δ_j and (1 - w) is the weight associated with the penalty of deviating from the preferred values of the parameter set. Note that the preferred vector $\bar{p}$ is not necessarily the optimal numerical solution $p^*$. A simplistic example will make the approach clearer. One wishes to constrain the parameters a and b to be positive (e.g., certain physical quantities cannot assume negative values). An arbitrary user-specified loss function could be of the sort:
$J(p) = \sum_{j=1}^{n} \delta_j^2 + 1000 \times (a < 0) + 1000 \times (b < 0)$ (9.70)
where the multipliers of 1000 are arbitrary and chosen simply to impose a large penalty should either a or b assume negative values. It is obvious that some care must be taken to assign such penalties pertinent to the problem at hand, with the above example meant for conceptual purposes only.

Example 9.5.1⁹ Fit the following non-linear model to the data in Table 9.23:
$y = a\left(1 - e^{-bt}\right) + \varepsilon$ (9.71)
Fig. 9.24 Example of a function with two minima. One would like the search algorithm to be discriminating/robust enough not to get caught in the local minimum but to keep searching for the global minimum, should one exist
First, it would be advisable to plot the data and look at the general shape. Next, statistical software is used for the non-linear estimation using least squares.
Table 9.23 Data table for Example 9.5.1
t    1     2     3     4     5     7     9     11
y    0.47  0.74  1.17  1.42  1.60  1.84  2.19  2.17
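The least-squares fit of this example can be reproduced with general-purpose numerical software. The sketch below is illustrative only: it assumes Python with NumPy and SciPy rather than the statistical package used in the text, the starting guess `p0` is an arbitrary choice, and the weight `w` and preferred values `p_pref` used in the penalized objective (written in the form of Eq. 9.69) are hypothetical values picked purely for demonstration.

```python
import numpy as np
from scipy.optimize import curve_fit, minimize

# Data of Table 9.23
t = np.array([1, 2, 3, 4, 5, 7, 9, 11], dtype=float)
y = np.array([0.47, 0.74, 1.17, 1.42, 1.60, 1.84, 2.19, 2.17])


def model(t, a, b):
    """Non-linear model of Eq. 9.71 without the error term."""
    return a * (1.0 - np.exp(-b * t))


# Ordinary non-linear least squares; p0 is an assumed starting guess
p_ls, cov = curve_fit(model, t, y, p0=[2.0, 0.5])
se = np.sqrt(np.diag(cov))  # asymptotic standard errors of a and b

# Penalized objective in the form of Eq. 9.69 (w and p_pref are hypothetical)
w = 0.9
p_pref = np.array([2.0, 0.3])


def J(p):
    resid = y - model(t, *p)
    return w * np.sum(resid**2) + (1.0 - w) * np.sum((p - p_pref) ** 2)


# Derivative-free search started from the least-squares solution
res = minimize(J, x0=p_ls, method="Nelder-Mead")

print("least squares  a, b:", p_ls, " std. errors:", se)
print("penalized      a, b:", res.x)
```

With the data of Table 9.23, the unpenalized estimates should come out close to those reported in Table 9.24. The penalized estimates are pulled from the least-squares solution toward `p_pref` by an amount that depends on the chosen `w`.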
8 Some researchers refer to such an approach as “Bayesian non-linear search method”, but the use of this somewhat pretentious terminology can be disputed.
9 From Draper and Smith (1981) by permission of John Wiley and Sons.
390
9
Parametric and Non-Parametric Regression Methods
Table 9.24 Results of the non-linear parameter estimation for Example 9.5.1
Parameter  Estimate  Asymptotic standard error  Asymptotic 95.0% confidence interval
                                                Lower     Upper
a          2.498     0.1072                     2.2357    2.7603
b          0.2024    0.0180                     0.1584    0.2464
Fig. 9.25 Non-linear model fitting (Example 9.5.1). (a) Plot of fitted model to original data. (b) Plot of model residuals
The equation of the fitted model is found to be

$$y = 2.498\,[1 - \exp(-0.2024\,t)]$$

with adjusted R² = 98.9%, standard error of estimate SE = 0.0661 and mean absolute error MAE = 0.0484. The overall model fit is thus deemed excellent. Further, the parameters have low standard errors as can be seen from the parameter estimation results shown in Table 9.24 along with the 95% intervals. How well the model predicts the observed data is illustrated in Fig. 9.25a, while Fig. 9.25b is a plot of the model residuals. Recall that studentized residuals indicate how many standard deviations each observed value of y deviates from a model fitted using all the data except that observation. In this case, there is one studentized residual greater than 2 (point 7), but none greater than 3. DFITS is a statistic which measures how much the estimated coefficients would change if each observation were removed from the data set. Two data points (points 7 and 8) were flagged as having unusually large values of DFITS, and it would be advisable to look at these data points more carefully, especially since these are the two end points. The model function itself may have to be revised and, if possible, collecting more data points at the high end would likely result in a more robust and accurate model. ■
9.5.3
Overview of Various Parametric Regression Methods
A flowchart meant to provide an overview of different types of parametric regression methods along with the sections in which they are treated is shown in Fig. 9.26. Some approaches (such as factor analysis) have been left out to avoid excessive clutter. Parameter estimation is done by either MME (least squares which can be closed form or by search methods) or by MLE methods.
9.6
Non-Parametric Regression
9.6.1
Background
It is important to distinguish between the terms “parametric” and “non-parametric” as applied to regression models. Recall that Sect. 4.5 dealt with statistical inferences about the population from sample data based on non-parametric tests where no assumptions are made about the population distributions underlying the data sample, i.e., the sample has an unknown distribution, or else none is assumed/specified. In the context of modeling, parametric regression, treated in Chap. 5, is taken to imply that the OLS regression model captures the linear functional behavior of the system under the same
Fig. 9.26 Overview of various parametric regression methods discussed along with the section numbers where treated. The two parameter estimation methods are MME and MLE. (OLS ordinary least-squares, WLS weighted least-squares, CO Cochrane-Orcutt, CLS corrected least-squares, PCA principal component analysis, GLM generalized linear models, LMA Levenberg-Marquardt algorithm)
implicit assumption about error distribution and further that the functional model contains model coefficients or parameters to be estimated which are already specified. Only in such instances do the uncertainty intervals of model predictions and the standard errors of the model parameters follow standard statistical formulae (Sect. 5.3). Linear, GLM, and non-linear models are examples of parametric regression models because the function that describes the relationship between the response and explanatory variables is known or assumed in advance and the functional model consists of a fixed set of parameters. Nonparametric regression differs from parametric regression in that it makes no assumptions about the probability distribution of the data, and further the shape of the functional relationships between the response (or dependent) and the explanatory (or independent) variables is not predetermined but adjusted to capture unusual or unexpected features of the data in different subspaces; this process is called “learning.” These have important implications regarding the interpretability of the model coefficients and their standard errors. A closed form function (such as, say, Eq. 5.63a) represents a linear spline model with one hinge point with four model coefficients (or a 4P model where P stands for “parameters”). The four model coefficients can be taken to be parameters if the model prediction uncertainty (and the standard error of the model coefficients) is determined based on, say, an OLS regression. Even if the model prediction uncertainty and the
standard errors of the parameters are determined by bootstrapping, it is still considered a parametric approach. When the relationship between the response and explanatory variables is known, parametric regression modeling is preferable to nonparametric regression since the former is more efficient, and interpretation/insight into the parameter influence/strength is often superior. If the relationship is unknown and non-linear, nonparametric regression models provide greater flexibility. In case the relationship between the response and only a subset of the explanatory variables (covariates or categorical) is known, the term “semiparametric” regression model is sometimes adopted. Finally, it is possible to adopt a strategy combining both approaches. One could adopt a non-parametric approach to explore the data and develop some preliminary understanding of the appropriate functional forms, and then use a parametric approach to evaluate different model forms, select the best and estimate relevant parameters.
9.6.2
Extensions to Linear Models
It is useful to distinguish between two different types of data analysis applications in the current context:
(i) For smoothing, to better detect the underlying trend between the y and x variables during data exploration.
(ii) For regression, to predict future expected values of y, which requires that a model be identified.
Further, one can distinguish between models that do so by fitting global behavior versus those which better capture local behavior. One can also distinguish between two types of data sets:
(i) sparse data sets¹⁰ which have few observations due to the nature of the data, and which are such that there is essentially only one (or a small number of) response value for a specific value of x. This is common in data collected at or over discrete time intervals or during DOE experiments (see Chap. 6). Some examples are the data shown in Fig. 3.11 to illustrate linear interpolation, and Fig. 3.20 showing worldwide population growth for each decade during the period 1960–2060;
(ii) dense data sets with multiple response observations for a specific value of the independent variable(s). This is common while monitoring an existing component or system during routine operation. An example is Fig. 3.10 which is a scatter plot of building hourly cooling energy use versus ambient air temperature. The data spread at a given value of x is due to overlooking the influence of variables other than outdoor temperature, effects of thermal mass of the structure being cooled and noise in the data.
Linear models are often inadequate to capture the system response over its entire range of variation. Several non-parametric modeling approaches have been developed as a result (James et al. 2013). These and related extensions to the linear model are listed below:
(a) Polynomial regression (Eq. 5.23 in Sect. 5.4.1) includes higher order terms of the regressor variable to capture the global data behavior. It has been suggested that polynomials greater than third degree should be avoided, and alternative functional forms investigated if these are inadequate. Note that the model is non-linear in its functional form, but linear in the coefficients.
(b) Regression with indicator variables involving piecewise step functions with discontinuity/jumps to account for the effect of different categorical variables. A discussion with an illustrative example is provided in Sect. 5.7.3.
(c) Regression spline approach (discussed below), which is more flexible than the previous two approaches, involves breaking up the range of variation in the regressor vector X into distinct subspaces which are fit by piecewise polynomial functions. Usually, cubic functions are used for each region since they are continuous and smooth at the region boundaries (referred to as knots) with continuous first and second derivatives. Their main purpose is to provide accurate predictions. Recall that the basic linear spline model has been discussed earlier in Sect. 5.7.2 using indicator variables and illustrated in the context of building energy use (in the terminology of this industry, “knots” are referred to as change points).
10 The word “sparse” is mostly used to denote data sets with few observations overall. Here, it has been used in a slightly different manner.
Fig. 9.27 Conceptual difference between smoothing splines and regression splines for a sparse univariate data set
(d) Smoothing splines, also discussed below, are primarily meant for data exploration, i.e., to discern overall behavior of the data; accurate prediction capability is not the intent. The simplest use is as a scatterplot smoother for pure exploration. In such a context, a plot of f(x) versus x is perhaps all that is required. The functional form is designed to balance fit with smoothing the data as illustrated in Fig. 9.27. Usually, the functions are smoother than regression spline model functions (as suggested by the terminology).
(e) Kernel smoothing approaches are similar to their spline counterparts and are also widely used. They capture the non-linear structure of the data for regression and the detection of data trend for smoothing. For simple scatterplot smoothing, the spline and kernel techniques will frequently produce very similar results. While spline models are based on piecewise polynomial fitting, kernel regression models are based on weighting the neighboring data using some type of local functions. A kernel smoother usually involves two tasks: selection of a subsurface consisting of a number of points and a weighted local transformation of a kernel function either linear or non-linear. Thus, it estimates the real valued function of the response variable y as the weighted average of a user-specified number of neighboring points (k-nn) and of the variance by suitably weighting the difference in y-values of the individual k-nn points from the mean. The weight is defined by the choice of kernel, such that closer points are given higher weights.
Different functions can be used for the kernel with the average (zeroth order), linear (first-order) polynomial and the Gaussian kernel (normal distribution) being
widely used. The interested reader can refer to Wolberg (2006) or Wakefield (2013) for a detailed treatment. Section 11.4.2 illustrates the use of this method, in the context of clustering, for determining uncertainty around an estimated point using the zeroth order k-nn algorithm in conjunction with a physical-based manner of assigning weights to different regressors of a non-linear function representing building energy use (following Subbarao et al. 2011).
(f) Local regression, which is similar to splines but with the difference that regions are allowed to overlap. The often-adopted LOWESS approach, which can be used for both data fitting and for detecting hidden data trends when the data set is large and noisy, is presented in Sect. 9.7.
(g) Multi-layer perceptron modeling, which is a class of neural networks, is in some ways the ultimate in non-linear non-parametric modeling for the purpose of future predictions. An overview of this machine-learning method is provided in Sect. 9.8.
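The kernel-weighted average described in item (e) can be sketched in a few lines. The example below is a minimal illustration, not the method of any cited reference: it implements a zeroth-order (locally constant) estimator of the Nadaraya-Watson type with a Gaussian kernel, and it weights all observations rather than a fixed number k of nearest neighbors. The function name, the `bandwidth` argument and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def gaussian_kernel_smoother(x, y, x_eval, bandwidth):
    """Zeroth-order kernel estimate of y at each point of x_eval."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_eval = np.atleast_1d(np.asarray(x_eval, dtype=float))
    # Scaled distance between every evaluation point and every observation
    u = (x_eval[:, None] - x[None, :]) / bandwidth
    weights = np.exp(-0.5 * u**2)  # closer points receive larger weights
    # Weighted average of the observed responses
    return (weights @ y) / weights.sum(axis=1)


# Synthetic noisy data, for illustration only
rng = np.random.default_rng(0)
x_obs = np.linspace(0.0, 10.0, 60)
y_obs = np.sin(x_obs) + rng.normal(scale=0.3, size=x_obs.size)

x_grid = np.linspace(0.0, 10.0, 5)
y_smooth = gaussian_kernel_smoother(x_obs, y_obs, x_grid, bandwidth=0.8)
print(y_smooth)
```

A larger `bandwidth` averages over more distant points and gives a smoother curve, while a smaller one follows the data more closely.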
9.6.3
Basis Functions
Basis functions are a general model building approach/transformation with spline and kernel functions as special cases. The linear model for the general multivariate set of X regressors is recast as a linear combination of k terms based on preselected univariate functions called basis functions (b1, b2, . . . bk):

$$y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \dots + \beta_k b_k(x_i) + \varepsilon_i \tag{9.72}$$

This equation, referred to as the “standard model of order k”, is simply a transformation of the original regressors xi, i.e., they are converted into a new set of transformed predictors/regressors/variables bj(xi). The transformation is made based on any arbitrary but appropriate feature in the data trend, and thus provides much greater flexibility than the purely linear model. Spline functions and kernel functions are two common ways of making the transformation, but others can also be used (such as Fourier series). Since the basis functions are specified/selected, the problem reduces to one of regression, i.e., to determine the model coefficients (β0, β1, . . . βk). The estimation can be done using OLS if the error conditions are met, and all related inferences on prediction uncertainty and standard errors of model estimates can be based accordingly. The next section elaborates on this concept.
9.6.4
Polynomial Regression and Smoothing Splines
Splines are popular for interpolation and approximation of data sampled at a discrete set of points, e.g., for time series interpolation. As stated above, the approach is to break up the range of variation in X into distinct regions which are fit by piecewise polynomial functions following the Taylor series approximation.¹¹ The kth order spline is simply a piecewise polynomial function of degree k. It is continuous and has continuous derivatives of orders (1, . . ., k - 1) at its knot points. The simplest case is the piecewise linear function, which is a first-order spline (akin to the change point model discussed in Sect. 5.7.2). With one knot, one has two basis functions, with two knots three basis functions, and so on. However, the slope of the first-order spline function is discontinuous at the knot (i.e., the left-side and right-side first derivatives at the knot point are not equal). One can assume quadratic or second-order basis functions. Even then the second derivatives on either side of the knot will still not be equal. Third-order or cubic splines are most commonly used since they provide very good smoothness, with the knots often undetectable by eye, and since there is a theoretical basis for this selection (as shown by Wakefield 2013).
11 Recall that the Taylor series method offers a way to approximate any function at a specific point by a polynomial function whose individual coefficients are higher order derivatives of the original function at that specific point.
For a model with only one regressor/predictor variable (univariate case), the third-order spline model with two knots can be represented by the following truncated power series with eight parameters (six model coefficients and two knot locations):

$$f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + b_1 (x - \xi_1)^3_+ + b_2 (x - \xi_2)^3_+ \tag{9.73}$$

where ξi denotes the ith knot. The “+” sign in the last two terms indicates that the quantity within brackets is zero when negative, i.e., the b1 term equals zero when x < ξ1 (this is akin to the indicator variable terminology using I as in Eq. 5.63a). In cases where there is essentially only one response value for a specific value of x (sparse data), the largest number one can choose is one knot at each value of x. This is usually adopted for interpolation. For smoothing and regression, the number of knots is generally fewer. If the knots are preselected, this becomes a linear regression problem with 6 parameters for the example in Eq. 9.73 (refer to Sect. 5.7.2), and OLS can often be used. Otherwise, a non-linear regression approach is warranted. Notice that Eq. 9.73 reduces to the standard cubic model with four coefficients (β0 to β3) if there are no knot points. In case of a model with two regressors, one would add an additional set of truncated
power series terms, and so on. Splines based on Eq. 9.73 are said to become erratic beyond the extreme knots, and the remedy is to use linear functional forms beyond the two extreme knot points on either end; in such cases they are called natural splines (Wakefield 2013). As stated above, smoothing splines are primarily meant for data exploration, i.e., to discern overall behavior of the data; on the other hand, accurate prediction capability is the intent of regression splines. During regression, one attempts to minimize the sum of squared errors. For smoothing, one would strive to minimize the rate of change of the slope, i.e., the roughness or the second derivative. To balance these considerations under a unified framework, the least squares criterion of minimizing the total squared error is reformulated by including a penalty term:

$$\min \; \sum_{i=1}^{n} [y_i - f(x_i)]^2 + \lambda \int f''(x)^2 \, dx \tag{9.74}$$

where the second term, an integral over the range of x, penalizes the roughness (or curvature/wiggliness) of the function, and λ, called the “smoothing parameter”, controls the weight given to this roughness relative to the total squared error term. The greater the value of λ, the smoother the curve; for very large λ, the fit approaches a straight line. Thus, λ controls the extent of under- and over-fitting. There is some subjectivity involved in choosing λ, which is commonly determined by cross-validation. The number of splines in a regression setting (when λ = 0) will turn out to be less than that of smoothing splines.
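The two ideas of this section can be illustrated numerically. The sketch below first builds the truncated power basis of Eq. 9.73 for preselected knots and estimates the six coefficients by OLS. It then shows the effect of the smoothing parameter using a discrete stand-in for Eq. 9.74, in which the integral of the squared second derivative is replaced by the sum of squared second differences of the fitted values at equally spaced x. This is a simplification adopted here for brevity, not a true smoothing spline. All function names, the knot locations and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def truncated_power_basis(x, knots):
    """Design matrix [1, x, x^2, x^3, (x - xi_1)^3_+, ...] of Eq. 9.73."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2, x**3]
    for xi in knots:
        cols.append(np.clip(x - xi, 0.0, None) ** 3)  # zero when x < xi
    return np.column_stack(cols)


def fit_regression_spline(x, y, knots):
    """OLS estimate of the spline coefficients for preselected knots."""
    B = truncated_power_basis(x, knots)
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return coef


def discrete_smoother(y, lam):
    """Minimize sum (y - f)^2 + lam * sum (second differences of f)^2."""
    y = np.asarray(y, dtype=float)
    n = y.size
    D = np.diff(np.eye(n), n=2, axis=0)  # (n-2) x n second-difference operator
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)


def roughness(f):
    """Sum of squared second differences, a discrete measure of curvature."""
    return float(np.sum(np.diff(f, n=2) ** 2))


# Synthetic noisy data at equally spaced x, for illustration only
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 80)
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

knots = [3.3, 6.6]  # hypothetical preselected knots
coef = fit_regression_spline(x, y, knots)
y_spline = truncated_power_basis(x, knots) @ coef

f_light = discrete_smoother(y, lam=1.0)   # follows the data closely
f_heavy = discrete_smoother(y, lam=1e4)   # much smoother, closer to a line
print(roughness(f_light), roughness(f_heavy))
```

Setting `lam=0` returns the data unchanged, while increasing `lam` lowers the roughness of the fitted values at the expense of a larger squared error.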
9.7
Local Regression: LOWESS Smoothing Method
Sometimes the data is so noisy that the underlying trend may be obscured by the data scatter. A non-parametric black-box method called the “locally weighted scatter plot smoother” or LOWESS (Devore and Farnum 2005) can be used to smooth the data scatter and reveal otherwise undetectable or hard-to-detect trends. Instead of using all the data (such as done in traditional parametric model fitting), the intent is to fit a series of lines (usually polynomial functions) using a prespecified portion of the data. Say, one has n pairs of (x, y) observations, and one elects to use subsets of 20% of the data at a time. For each individual x0 point, one selects the 20% of the x-points closest to it, and fits a polynomial line with only this subset of data. One then uses this model to predict the corresponding value of the response variable y0 at the individual point x0. This process is repeated for each of the n points so that one gets n sets of points (x0, y0). The LOWESS plot is simply the plot of these n sets of data points. Figure 9.28 illustrates a case where the LOWESS plot
indicated a trend which one would be hard pressed to detect in the original data. Looking at Fig. 9.28a, which is a scatter plot of the characteristics of bears, with x as the chest girth of the bear and y its weight, one can faintly detect a non-linear behavior, but it is not too clear. However, the same data when subjected to the LOWESS procedure (Fig. 9.28b) assuming a pre-specified portion of 50% reveal a clear bi-linear behavior with a steeper trend when x > 38. In conclusion, LOWESS is a powerful functional estimation method which is local (as suggested by the name), non-parametric, and can potentially capture any arbitrary feature present in the data. Note that one could use LOWESS regression with individual data points down-weighted if their residuals are large. There are different ways to do so as discussed in Sect. 9.9.
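The procedure described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular software package: a straight line is fitted locally, the optional tricube distance weights are a common choice that the text does not prescribe, the down-weighting of large residuals mentioned above is omitted, and the function name and synthetic data are hypothetical.

```python
import numpy as np


def lowess_sketch(x, y, frac=0.5, use_tricube=True):
    """Local straight-line fits using the closest fraction `frac` of the data."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    k = min(n, max(3, int(np.ceil(frac * n))))  # points used in each local fit
    y_hat = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        idx = np.argsort(d)[:k]  # indices of the k closest x-points
        if use_tricube:
            d_max = max(d[idx].max(), 1e-12)
            w = (1.0 - (d[idx] / d_max) ** 3) ** 3  # closer points weigh more
        else:
            w = np.ones(k)
        # np.polyfit applies w to the residuals, hence the square root
        slope, intercept = np.polyfit(x[idx], y[idx], deg=1, w=np.sqrt(w))
        y_hat[i] = slope * x[i] + intercept  # prediction at the point itself
    return y_hat


# Synthetic noisy data with a change of slope at x = 5, for illustration only
rng = np.random.default_rng(2)
x_demo = np.sort(rng.uniform(0.0, 10.0, size=60))
y_demo = np.where(x_demo < 5.0, x_demo, 5.0 + 3.0 * (x_demo - 5.0))
y_demo = y_demo + rng.normal(scale=1.0, size=x_demo.size)

y_lowess = lowess_sketch(x_demo, y_demo, frac=0.5)
```

Plotting `y_lowess` against `x_demo` gives the LOWESS curve; a smaller `frac` follows local features more closely, while a larger one gives a smoother curve.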
9.8
Neural Networks: Multi-Layer Perceptron (MLP)
A widely used black-box approach is the neural network approach (NN) which is a class of mathematical models related to the way the human brain functions while performing such activities as decision-making, pattern or speech recognition, image or signal processing, system prediction, optimization and control (Wasserman 1989). NN grew out of research in artificial intelligence, and hence, they are often referred to as artificial neural networks. NN possess several unique attributes that allow them to be superior to the traditional methods of knowledge acquisition (of which data analysis and modeling is one specific activity). They have the ability to: (i) exploit a large amount of data/information, (ii) respond quickly to varying conditions, (iii) learn from examples, and to generalize underlying rules of system behavior, (iv) map complex non-linear behavior for which the input-output variable set is known but not their structural interaction (i.e., black-box models), and (v) handle noisy data, i.e., have good robustness. The last 50 years have seen an explosion in the use of NN. These have been successfully applied across numerous disciplines such as engineering, physics, finance, psychology, information theory and medicine. For example, in engineering, NN have been used for system modeling and control, short-term electric load forecasting, fault detection and control of complex systems. Stock market prediction and classification in terms of credit-worthiness of individuals applying for credit cards are examples of financial applications. Medical applications of NN involve predicting the probable onset of certain medical conditions based on a variety of health-related indices. NN have also been used for image processing and pattern recognition. In this section, the discussion is limited to their function mapping capability. There are numerous NN topologies by which the relationship between one or more response variable(s) and one or
Fig. 9.28 Example to illustrate the insight which can be provided by the LOWESS smoothing procedure. The data set is meant to detect the underlying pattern between the weight of a wild bear and its chest girth. While the traditional manner of plotting the data (frame a) suggests a linear function, (frame b) assuming a 50% prespecified smoothing length, reveals a bi-linear behavior. (From Devore and Farnum 2005 by permission of Cengage Learning)
more regressor variable(s) can be framed. A widely used architecture for applications involving predicting system behavior is the feed-forward multi-layer perceptron (MLP). A typical MLP consists of an input layer, one or more hidden layers and an output layer (see Fig. 9.29). The input layer is made up of discrete nodes (or neurons, or units), each of which represents a single regressor variable, while each node of the output layer represents a single response variable. Only one output node is shown but the extension to a set of response variables is obvious. Networks can consist of various topologies; for example, with no hidden layers (though this is uncommon), with one hidden layer, and with numerous hidden layers. It is the wide consensus that, except in rare circumstances, an MLP architecture with only one hidden layer is adequate for most functional mapping applications. While the number of nodes in the input and output layers is dictated by the specific problem at hand, the number of nodes in each of the hidden layers is a design choice (certain heuristics are stated further below). The nodes of the input and first hidden layers are connected by lines to indicate information flow, and so on for each successive layer till the output layer is reached. This
is why such a representation is called “feed-forward.” The input nodes or the X vector are multiplied by an associative weight vector W representative of the strength of the specific connection. These are summed so that (see Figs. 9.29 and 9.30):

$$\mathrm{NET} = (w_{11} x_1 + w_{21} x_2 + w_{31} x_3) = XW \tag{9.75}$$

For each node i, the NETi signal is then processed further by an activation function:

$$\mathrm{OUT}_i = f(\mathrm{NET}_i) + b_i \tag{9.76}$$

The activation function f(NET) (also called basis function or squashing function) can be a linear mapping function (this would lead to the simple linear regression model), but more generally, the following monotone sigmoid forms are adopted:

$$f(\mathrm{NET}_i) = [1 + \exp(-\mathrm{NET}_i)]^{-1} \quad \text{or} \quad f(\mathrm{NET}_i) = \tanh(\mathrm{NET}_i) \tag{9.77}$$
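The three computational steps of Eqs. 9.75 to 9.77 can be sketched for an MLP(3,3,1) network as follows. The sketch is only illustrative: the bias is added after the activation, following the form of Eq. 9.76 as written; the output node is assumed here to be linear, which the text does not specify; and all numerical weights, biases and inputs are arbitrary values chosen for demonstration.

```python
import numpy as np


def logistic(net):
    """Logistic sigmoid of Eq. 9.77."""
    return 1.0 / (1.0 + np.exp(-net))


def mlp_forward(x, W, b_hidden, v, b_out, activation=np.tanh):
    """Forward pass through one hidden layer and a single output node.

    W[j, i] is the weight w_ji from input node j to hidden node i, and
    v[i] is the weight from hidden node i to the output node.
    """
    net_hidden = x @ W                               # Eq. 9.75: weighted sums
    out_hidden = activation(net_hidden) + b_hidden   # Eqs. 9.76 and 9.77
    net_out = out_hidden @ v                         # weighted sum at output
    return net_out + b_out                           # assumed linear output


# Arbitrary illustrative values for an MLP(3,3,1) network
rng = np.random.default_rng(3)
x = np.array([0.2, -1.0, 0.5])
W = rng.normal(size=(3, 3))
b_hidden = np.array([0.1, 0.0, -0.1])
v = rng.normal(size=3)
b_out = 0.05

y_tanh = mlp_forward(x, W, b_hidden, v, b_out)
y_logistic = mlp_forward(x, W, b_hidden, v, b_out, activation=logistic)
print(y_tanh, y_logistic)
```

Training then amounts to adjusting `W`, `v` and the bias terms so that the outputs match the observed responses as closely as possible.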
Fig. 9.29 Feed-forward multilayer perceptron (MLP) topology for a network with 3 nodes in the input layer, 3 in the hidden layer and one in the output layer denoted by MLP(3,3,1). The weights for each of the interactions between the input and hidden layer nodes are denoted by (w11, . . . w33) while those between the hidden and the output nodes are denoted by (v10, v20, v30). Extending the architecture to deal with more nodes in any layer and to a greater number of hidden layers is intuitively straightforward. The square nodes indicate those where some sort of processing is done as elaborated in Fig. 9.30
Fig. 9.30 The three computational steps done at processing node H1 represented by square nodes in Fig. 9.29. The incoming signals are weighted and summed to yield the NET which is then transformed non-linearly by a squashing (or basis) function to which a bias term is added. The resulting OUT signal becomes the input to the next node downstream to which it is connected, and so on. These steps are done at each of the processing nodes
Logistic or hyperbolic functions are selected because of their ability to squash or limit the values of the output within certain limits such as (-1, 1). It is this mapping which allows non-linearity to be introduced in the model structure. The bias term b is called the activation threshold for the corresponding node and is introduced to avoid the activation function getting stuck in the saturated or limiting tails of the function. As shown in Fig. 9.29, this process is continued till an estimate of the output variable y is found. The weight structure determines the total network behavior. Training the MLP network is done by adjusting the network weights in an orderly manner such that each iteration (referred to as “epoch” in NN terminology) results in a step closer to the final value. Numerous epochs (of the order of 100,000 or more) are typically needed; a simple and quick
task for modern personal computers. The gradual convergence is very similar to the gradient descent method used in non-linear optimization or during estimating parameters of a non-linear model. The loss function is usually the squared error of the model residuals just as done in OLS. Adjusting the weights so as to minimize this error function is called training the network (some use the terminology “learning by the network”). The most used algorithm to perform this task is the back-propagation training algorithm where partial derivatives (reflective of the sensitivity coefficients) of the error surface are used to determine the new search direction. The step size is determined by a user-specified learning rate selected so as to hasten the search but not lead to overshooting and instability (notice the similarity of the entire process with the traditional non-linear search methodology described in Sect. 9.5). As with any model identification where one has the possibility of adding a large number of model parameters, there is the distinct possibility of overfitting or over-training, i.e., fitting a structure to the noise in the data.¹² Hence, it is essential that a sample cross-validation scheme be adopted such as that used in traditional regression model identification (see Sect. 5.8.2). In fact, during MLP modeling, the recommended approach is to sub-divide the data set used to train the MLP into three subsets:
12 Spatial data (such as image processing or pattern recognition) have more complex features than functional mapping applications and require an MLP architecture with a large number of hidden layers. However, the back-propagation algorithm breaks down in such cases. Advances in the last decade or so, such as stochastic gradient descent algorithms, are able to train deep networks with multiple layers.
(i) Training data set used to evaluate different MLP architectures and to train the weights of the nodes in the hidden layer.
(ii) Validation data set meant to monitor the performance of the MLP during training. If the network is allowed to train too long, it tends to over-train leading to a loss in generality of the model.
(iii) Testing data set, similar to the cross-validation data set, meant to evaluate the predictive or external accuracy of the MLP network using such indices as the CV and NMBE. These are often referred to as generalization errors.
The usefulness of the validation data set during training of a specific network is illustrated in Fig. 9.31. During the early stages of training, the RMSE values of both the training and validation data sets drop at the same rate. The RMSE for the training data set keeps on decreasing as more iterations
Fig. 9.31 Conceptual figure illustrating how to select the optimal MLP network weights based on the RMSE errors from the training and validation data sets
Fig. 9.32 Two different MLP model architectures appropriate for time series modeling. The simpler feed forward network uses time-lagged values till time t to predict a future value at time (t + 1). Recurrent networks use internally generated past data with only the current value for prediction, and are said to be generally more powerful while, however, needing more expertise in their proper use. (a) Feedforward network. (b) Recurrent network. (From SPSS 1997)
(or epochs) are performed. At each epoch, the trained model is applied to the validation data set, and the corresponding RMSE computed. The validation error eventually starts to rise; it is at this point that training ought to be stopped. The node weights at this point are the ones which correspond to the optimal model. Note, however, that this process is specific to a preselected architecture, and so different architectures need to be evaluated in a similar manner. Compare this process with the traditional regression model building approach wherein the optimal solution is given by closed form solutions and no search or training is needed. However, the need to discriminate between competing model functions still exists and statistical indices such as RMSE or R² and adjusted R² are used for this purpose.
MLP networks have also been used to model and forecast dynamic system behavior as well as time series data over a time horizon. There are two widely used architectures. The time series feed-forward network shown in Fig. 9.32a is easier and more straightforward to understand and implement. A recurrent network is one where the outputs of the nodes in the hidden layer are fed back as inputs to previous layers. This allows a higher level of non-linearity to be mapped, but such networks are generally more difficult to train properly. Many types of interconnections are possible, with the more straightforward network arrangement for one hidden layer shown in Fig. 9.32b. The networks assumed in these figures use the past values of the variable itself; extensions to the multivariate case would involve investigating different combinations of present and lagged values of the regressor variables as well.
Some useful heuristics with MLP training are assembled below (note that some of these closely parallel those followed by traditional model fitting):
(a) In most cases involving function mapping, one hidden layer should be adequate.
(b) Start with a small set of regressor variables deemed most influential, and gradually include additional variables only if the magnitude of the model residuals decreases and if this leads to better behavior of the residuals. This is illustrated in Fig. 9.33, where the use of one regressor results in poorly behaved model residuals, which are greatly reduced once two more relevant regressor variables are introduced.
(c) The number of training data points should not be less than about 10 times the total number of weights to be tuned. Unfortunately, this criterion is not met in some published studies.
(d) The number of nodes in the hidden layer should be directly related to the complexity/non-linearity of the problem. Often, this number should be in the range of 1–3 times the number of regressor variables.
(e) The original data set should be split such that training uses about 50–60% of the observations, with validation and testing each using about 20–25%. All three subsets should be chosen randomly.
(f) As with non-linear parameter estimation, training should be repeated with different plausible architectures, and further, each of the architectures should be trained using different mixes of training/validation/testing subsets to avoid the pitfalls of local minima. Commercial software is available which automates this process, i.e., trains a number of different architectures and lets the analyst select the one deemed most appropriate.
Two examples of MLP modeling from the published building energy literature dealing with predicting building response over time are discussed below. Kawashima et al. (1998) evaluated several time series modeling methods for predicting the hourly thermal load of a building over a 24 h time horizon using current and lagged values of outdoor temperature and solar insolation.
Comparative results in terms of the CV and NMBE statistics during several days in summer and in winter are assembled in Table 9.25 for five methods (out of a total of seven methods in the original study), all of which used 15 regressor variables. It is quite clear that the MLP models are vastly superior in predictive accuracy to the traditional methods such as ARIMA, EWMA, and linear regression. The recurrent model is slightly superior to the feed-forward MLP model. However, the fact that both MLP models have 15 nodes in the input layer and 31 nodes in the single hidden layer implies that a rather complex model was needed to adequately capture the short-term forecasts.
It is instructive to briefly summarize another study (Miller and Seem 1991) to point out that many instances only warrant simple MLP architectures, and that, even then, traditional methods may be more appropriate in practice. The study compared MLP models with traditional methods (namely, recursive least squares) in the context of being able to predict the amount of time needed for a room to return to
Fig. 9.33 Scatter plots of two different MLP architectures fit to the same data set to show the importance of including the proper input variables. Cooling loads of a building are being predicted against outdoor dry-bulb temperature, humidity and internal loads as the three regressor variables. Both the magnitude and the non-uniform behavior of the model residuals are greatly reduced as a result. (a) Residuals of MLP(1-10-1) with outdoor temperature as the only regressor. (b) Residual plot and (c) measured vs modeled plot of MLP(3-10-1) with three regressors
Table 9.25 Accuracy of different modeling approaches in predicting hourly thermal load of a building over a 24 h time horizon
                                                        Coefficient of variation (%)   Normalized mean bias error (%)
Model type                                              Winter    Summer               Winter    Summer
Auto regressive integrated moving average (ARIMA)       27.7      34.4                  1.6       1.0
Exponential weighted moving average (EWMA)              12.0      26.0                  1.8       3.3
Linear regression with 15 regressors                    27.8      21.4                 -2.0      -1.3
Multi-layer perceptron, MLP(15,31,1)                    11.2       9.1                 -0.5      -0.2
Recurrent multi-layer perceptron, MLP(15,31,1)           9.3       6.8                 -0.5      -0.4
Extracted from Kawashima et al. (1998)
its desired temperature after night or weekend thermostat set-back. Only two input variables were used: the room temperature and the outdoor temperature. It was found that the best MLP was one with one hidden layer with two nodes, even though evaluations were done with two hidden layers and up to 24 hidden nodes. The general conclusion was that even though the RMS errors and the maximum error of the MLP were slightly lower, the improvement was not significant enough to justify the more complex and expensive implementation of the MLP algorithm in actual HVAC controller hardware as compared to more traditional approaches (see footnote 13).
For someone with a more traditional background and outlook to model building, a sense of unease is felt when first exposed to MLP. Not only is a clear structure or functional form lacking even after the topology and weights are determined, but the "model" identification is also somewhat of an art. Further, there is the unfortunate tendency among several analysts to apply MLP to problems which can be solved more easily by traditional methods which offer more transparency and allow clearer physical interpretation of the model structure and of its parameters. MLP should be viewed as another tool in the arsenal of the data analyst and, despite its power and versatility, not as the sole one. As with any new approach, repeated use and careful analysis of MLP results will gradually lead to the analyst gaining familiarity, discrimination of when to use it, and increased confidence in its proper use. The interested reader can refer to several excellent textbooks of various levels of theoretical complexity; for example, Wasserman (1989), Fausett (1993), and Haykin (1999).
The MLP architecture is said to be "inspired" by how the human brain functions. This is quite a stretch, and at best a pale and rather pretentious replication, considering that the brain typically has about 10 billion neurons with each neuron having several thousands of interconnections!
13 This is a good example of the quote by Einstein expressing the view that ought to be followed by all good analysts: “Everything should be as simple as possible, but not simpler”.
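The training heuristics and the early-stopping rule described in this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, bias terms are counted as tunable weights (the text does not say whether they should be), and the `patience` parameter is an added convenience that tolerates a few non-improving epochs before stopping.

```python
import random


def count_weights(layers, with_bias=True):
    """Number of tunable weights in a fully connected MLP.

    `layers` lists the node counts per layer, e.g. [15, 31, 1].
    Counting one bias per receiving node is an assumption of this sketch.
    """
    total = 0
    for n_in, n_out in zip(layers[:-1], layers[1:]):
        total += (n_in + (1 if with_bias else 0)) * n_out
    return total


def min_training_points(layers, factor=10):
    """Heuristic (c): about `factor` training points per tunable weight."""
    return factor * count_weights(layers)


def split_indices(n, train_frac=0.6, val_frac=0.2, seed=0):
    """Heuristic (e): random split into training/validation/testing indices."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(round(train_frac * n))
    n_val = int(round(val_frac * n))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def early_stopping_epoch(val_rmse, patience=1):
    """Return the epoch with the lowest validation RMSE seen before the
    error has failed to improve for `patience` consecutive epochs."""
    best_epoch, best = 0, val_rmse[0]
    for epoch, err in enumerate(val_rmse):
        if err < best:
            best_epoch, best = epoch, err
        elif epoch - best_epoch >= patience:
            break  # validation error has started to rise: stop training
    return best_epoch


if __name__ == "__main__":
    print(count_weights([15, 31, 1]))        # weights of an MLP(15,31,1)
    print(min_training_points([15, 31, 1]))  # suggested minimum sample size
    train, val, test = split_indices(100)
    print(len(train), len(val), len(test))
    # Hypothetical validation RMSE history, one value per epoch
    print(early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7], patience=2))
```

For the MLP(15,31,1) architecture of Table 9.25, this count suggests that several thousand training observations would be needed to satisfy heuristic (c), which illustrates why that criterion is often hard to meet in practice.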
9.9 Robust Regression
Proper estimation of the parameters of a pre-specified model depends on the assumptions one makes about the errors. Robust regression methods are parameter estimation methods which are not critically dependent on such assumptions (Chatfield 1995). The term is also used to describe techniques by which the influence of outlier points can be automatically down-weighted in different ways during parameter estimation. Detection of gross outlier points in the data has been addressed previously, and includes limit checks, balance checks and visual means (Sect. 3.3) as well as statistical considerations (Sect. 3.6.6). Diagnostic methods using model residual analysis (Sect. 5.6) are a more refined means of achieving additional robustness since they allow detection of influential points. There are also automated methods that allow robust regression when faced with large and noisy data sets, and some of these are described below.
Recall that OLS assumes certain idealized conditions to hold, one of which is that the errors are normally distributed. Often, departures from normality are not serious enough to warrant any corrective action. However, in certain cases, OLS regression results are very sensitive to a few seemingly outlier data points, or to response data that span several orders of magnitude. In such cases, the squares of certain model residuals may overwhelm the regression and lead to poor fits in other regions. Common types of deficiencies include errors that may be symmetric but non-normal; they may be more peaked than the normal with lighter tails, or the converse. Even if the errors are normally distributed, certain outliers may exist. One could identify outlier points and repeat the OLS fit after omitting them, and then report the results of both fits to document the sensitivity of the fit to the outlier points.
However, the identification of outlier points is arbitrary to some extent, and rather than rejecting points, methods have been developed whereby less emphasis is placed during regression on such dubious points. Such methods are called robust fitting methods which assume some appropriate weighting or loss function. Figure 9.34 shows several such functions. While the two plots in the upper frame are continuous, those at the bottom are discontinuous with one function basically
ignoring points which lie outside some pre-stipulated deviation value.
Fig. 9.34 Different weighting functions for robust regression. (a) OLS versus least absolute difference. (b) Two different outlier weighting functions
(a) Minimization of the least absolute deviations or MAD (Fig. 9.34a). The parameter estimation is performed with the objective function being to:
$$\text{Minimize} \; \sum_{i} \lvert y_i - \hat{y}_i \rvert \qquad (9.78)$$
where $y_i$ and $\hat{y}_i$ are the measured and modeled response variable values for observation i. This is probably the best-known method but is said to be generally the least powerful in terms of managing outliers.
(b) Lorentzian minimization adopts the following criterion:
$$\text{Minimize} \; \sum_{i} \ln\left(1 + \lvert y_i - \hat{y}_i \rvert^{2}\right) \qquad (9.79)$$
This is said to be very effective with noisy data and data that spans several orders of magnitude. It is similar to the normal curve but with much wider tails; for example, even at 10 standard errors, the Lorentzian contains 94.9% of the points. The Gaussian, on the other hand, contains roughly the same percentage at 2 standard errors. Thus, this function can accommodate instances of significant deviations in the data.
(c) Pearson minimization proceeds to:
$$\text{Minimize} \; \sum_{i} \ln\sqrt{1 + \lvert y_i - \hat{y}_i \rvert^{2}} \qquad (9.80)$$
This is the most robust of the three methods, with outliers having almost no impact at all on the fitted line. This minimization should be used in cases where wild and random errors are expected as a natural course.
In summary, robust regression methods are those which are less affected by outliers, and this is a seemingly advisable path to follow with noisy data. However, Draper and Smith (1981) caution against the blind and indiscriminate use of robust regression since clear rules indicating the most appropriate method to use for a presumed type of error distribution do not exist. Rather, they recommend against the use of robust regression based on any one of the above functions and suggest that maximum likelihood estimation be adopted instead. In any case, when the origin, nature, magnitude and distribution of errors are somewhat ambiguous, the cautious analyst should estimate parameters by more than one method, study the results, and then make a final decision with due diligence.
Example 9.9.1 Consider the simple linear regression data set given in Example 5.3.1. One wishes to investigate the extent to which MAD estimation would differ from the standard OLS method. A commercial software program has been used to refit the same data using the MAD optimization criterion, which is more resistant to outliers. The results are summarized below:
• OLS model identified in Example 5.3.1: y = 3.8296 + 0.9036x, with the 95% CL for the intercept being {0.2131, 7.4461} and for the slope {0.8011, 1.0061}.
• Using MAD analysis: y = 2.1579 + 0.9474x.
One notes that the MAD parameters fall comfortably within the OLS 95% CL intervals, but a closer look at Fig. 9.35 reveals that there is some deviation in the model lines, especially at the low range. The difference is small, and one can conclude that the data set is such that outliers have little effect on the model estimated.
Such analyses provide an additional level of confidence when estimating model parameters using the OLS approach. ■
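The kind of comparison made in Example 9.9.1 can be sketched in Python as follows. This is a minimal illustration, not the procedure used by the commercial software mentioned above: the data are synthetic (the data set of Example 5.3.1 is not reproduced here), the function names are hypothetical, and the MAD line is found by exploiting the known property that, for a straight-line model, some line minimizing the sum of absolute deviations passes through at least two of the data points, so that an exhaustive search over point pairs suffices for small data sets.

```python
import math
from itertools import combinations


def ols_line(x, y):
    """Closed-form OLS estimates (intercept, slope) for y = a + b*x."""
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    sxx = sum((xi - xm) ** 2 for xi in x)
    sxy = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ym - slope * xm, slope


def residuals(x, y, intercept, slope):
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def mad_loss(res):
    """Objective of Eq. 9.78: sum of absolute deviations."""
    return sum(abs(r) for r in res)


def lorentzian_loss(res):
    """Objective of Eq. 9.79: sum of ln(1 + residual^2)."""
    return sum(math.log(1.0 + r * r) for r in res)


def mad_line(x, y):
    """MAD estimates (intercept, slope) by exhaustive search over all
    lines through two data points (adequate for small data sets only)."""
    best = None
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue  # vertical line, not a valid regression line
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        loss = mad_loss(residuals(x, y, intercept, slope))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


# Synthetic data: points on y = 2 + x, with the last one a gross outlier
x = [float(i) for i in range(1, 11)]
y = [2.0 + xi for xi in x]
y[-1] = 30.0

a_ols, b_ols = ols_line(x, y)
a_mad, b_mad = mad_line(x, y)

if __name__ == "__main__":
    print("OLS:", a_ols, b_ols)
    print("MAD:", a_mad, b_mad)
    print("Lorentzian loss, OLS line:", lorentzian_loss(residuals(x, y, a_ols, b_ols)))
    print("Lorentzian loss, MAD line:", lorentzian_loss(residuals(x, y, a_mad, b_mad)))
```

With this synthetic data the single outlier pulls the OLS slope well away from the underlying value of 1, whereas the MAD line is unaffected. When the two fits agree closely, as in Example 9.9.1, the analyst gains confidence that outliers are not driving the OLS result.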
Fig. 9.35 Comparison of OLS model along with 95% confidence and prediction bands and the MAD model (Example 9.9.1)
Problems
Pr. 9.1 Compute and interpret the condition numbers for the following:
(a) $f(x) = e^{-x}$ for x = 10
(b) $f(x) = (x^{2} + 1)^{1/2} - x$ for x = 1000
Pr. 9.2 Compute the condition number of the following matrix:
$$\begin{bmatrix} 21 & 7 & -1 \\ 5 & 7 & 7 \\ 4 & -4 & 20 \end{bmatrix}$$
Pr. 9.3 Indicate whether the following functions are linear, intrinsically linear or non-linear in their parameters. In the case of intrinsically linear, show the transformation.
$$y = \exp(b_0 + b_1 x_1) \cdot \exp(b_2 x_2) \qquad (9.81a)$$
$$y = b_0 + \frac{b_1}{b_2} x_1 x_2 \qquad (9.81b)$$
$$y = b_0 \, x_1^{b_1} x_2^{b_2} \qquad (9.81c)$$
$$y = b_0 - b_1 \, b_2^{x_1} \qquad (9.81d)$$
Pr. 9.4 Chiller data analysis using PCA
Consider Table 9.4 of Example 9.3.3 which consists of a data set of 15 possible characteristic features (CFs) or variables under 27 different operating conditions of a centrifugal chiller (Reddy 2007). Instead of retaining all 15 variables, you will reduce the data set first by generating the correlation matrix of this data set and identifying pairs of variables which exhibit (i) the most correlation and (ii) the least correlation. It is enough if you retain only the top 5–6 variables. Subsequently repeat the PCA analysis as shown in Example 9.3.3 and compare results.
Pr. 9.5 Quality control of electronic equipment involves taking a random sample of size n and determining the proportion of items which are defective. Compute and graph the likelihood function for the two following cases: (a) n = 6 with 2 defectives, (b) n = 8 with 3 defectives.
Pr. 9.6 Indoor air quality measurements of carbon dioxide concentration reveal how well the building is ventilated, i.e., whether adequate ventilation air is being brought in and properly distributed to meet the comfort needs of the occupants dispersed throughout the building. The following twelve measurements of CO2 in parts per million (ppm) were taken in the twelve rooms of a building:
{732, 816, 875, 932, 994, 1003, 1050, 1113, 1163, 1208, 1292, 1382}
(a) Assuming a normal distribution, estimate the true average concentration and the standard deviation using MLE.
(b) How are these different from the classical MME values? Discuss.
Pr. 9.7 Non-linear model fitting to thermodynamic properties of steam
Table 9.26 lists the saturation pressure in kilopascals for different values of temperature extracted from the well-known steam tables.
Table 9.26 Data table for Problem 9.7
Temperature t (°C)       10      20      30      40      50      60      70      80      90
Pressure pv,sat (kPa)    1.227   2.337   4.241   7.375   12.335  19.92   31.16   47.36   70.11

Two different models proposed in the literature are:
$$p_{v,sat} = c \cdot \exp\left(\frac{a\,t}{b + t}\right) \quad \text{and} \quad \ln p_{v,sat} = a + \frac{b}{T} \qquad (9.82)$$
where T is the temperature in units of Kelvin.
(a) You are asked to estimate the model parameters by both OLS (using variable transformation to make the estimation linear) and by MLE along with standard errors of the coefficients. Comment on the differences between both methods and the implied assumptions.
(b) Which of the two models is the preferred choice? Give reasons.
(c) You are asked to use the identified models to predict saturation pressure at t = 75 °C along with the model prediction error. Comment.
Pr. 9.8 Consider the Weibull distribution W(2,7,9) shown in Fig. 2.21. Apply the Box-Cox transformation to make the distribution near normal. Plot the two distributions similar to Fig. 9.16.
Pr. 9.9 Consider the log-normal distribution L(2,2) shown in Fig. 2.18. Apply the Box-Cox transformation to make the distribution near normal. Plot the two distributions similar to Fig. 9.16.
Pr. 9.10 Consider the following logistic function:
$$E(y) = \frac{300}{1 + 30 \exp(-1.5x)}, \quad x \ge 0 \qquad (9.83)$$
(a) Plot the function.
(b) What is the asymptote?
(c) At what values of x does the response reach 50% and 90% of its asymptote?
Pr. 9.11 Fit the following non-linear model to the data in Table 9.27 and estimate 95% confidence regions for the two model coefficients:
$$y = \beta_0 \left[1 - \exp(-\beta_1 x)\right] \qquad (9.84)$$
Table 9.27 Data table for Pr. 9.11
x    1      2      3      4      5      7      9      11
y    0.47   0.74   1.17   1.42   1.60   1.84   2.19   2.17

Pr. 9.12 Logistic functions to study residential equipment penetration
Electric utilities provide incentives to homeowners to replace existing appliances by high-efficiency ones, such as lights, air conditioners, and dryers/washers. In order to plan for future load growth, the annual penetration level of such equipment needs to be estimated with some accuracy. Logistic growth models have been found to be appropriate since market penetration rates reach a saturation level, often specified as a saturation fraction or the fractional number of the total who purchase this equipment. Table 9.28 gives a fictitious example of load estimation with three different types of residential equipment. If the start year is 2020, plot the year-to-year energy use for the next 20 years assuming a logistic growth model for each of the three pieces of equipment separately, and also for the combined effect.
Hint: The carrying capacity can be calculated from the predicted saturation fraction and the difference in annual energy use between the current model and the new one.
Table 9.28 Data table for Problem 9.12 (a)
                     Total number   Annual energy use (kWh)      Initial number of   Initial growth   Predicted saturation
Type of equipment    of units       Current model   New model    new models N0       rate r (%)       fraction
Central A/C          10,000         3500            2800         100                 5                0.40
Color TV             15,000         800             600          150                 8                0.60
Lights               40,000         1000            300          500                 20               0.80
(a) Data available electronically on book website
Pr. 9.13 Fitting logistic models for growth of population and energy use
Table 9.29 contains historic population data from 1970 till 2010 (current) as well as extrapolations till 2050 (from the U.S. Census Bureau's International Data Base). Primary energy use in million tons of oil equivalent (MTOE) consumed annually has also been gathered for the same time period (the last value for 2010 was partially extrapolated from 2008). You are asked to analyze this data using logistic models.
(a) Plot the population data.
(b) Estimate population growth rate and carrying capacity by fitting a logistic model to the data from 1970 to 2010. Does your analysis support the often-quoted estimate that the world population would plateau at 10 billion?
(c) Identify a regression model between world population and primary energy use using data from 1970 to 2010. Report pertinent statistics.
(d) Predict primary energy use for the next four decades (2020–2050) along with 95% uncertainty estimates.
(e) Calculate the per capita annual energy use for each decade from 1970 to 2050. Analyze results and draw pertinent conclusions.
Table 9.29 Data table for Problem 9.13 (a)
Year    World population (in billion)    Primary energy use in MTOE
1970    3.712                            4970.2
1980    4.453                            6629.7
1990    5.284                            8094.7
2000    6.084                            9262.6
2010    6.831                            1150 (b)
2020    7.558 (projected)                –
2030    8.202                            –
2040    8.748                            –
2050    9.202                            –
(a) Data available electronically on book website
(b) Extrapolated from measured 2008 value

Pr. 9.14 Dose response model fitting for VX gas
You will repeat the analysis illustrated in Example 9.4.6 using the dose response curves for VX gas, which is a nerve agent. You will identify the logistic model parameters for both the casualty dose (CD) and lethal dose (LD) curves and report relevant model fit and parameter statistics (Fig. 9.36).
Fig. 9.36 Dose response curves for VX gas with 50% casualty dose (CD50) and 50% lethal dose (LD50) points. (From Kowalski 2002 by permission of McGraw-Hill)
Pr. 9.15 Non-linear parameter estimation of a model between volume and pressure of a gas
The pressure P of a gas corresponding to various volumes V is given in Table 9.30.
(a) Estimate the coefficients a and b assuming the ideal gas law:
$$P V^{a} = b \qquad (9.85)$$
(b) Study model residuals and draw relevant conclusions.
Table 9.30 Data table for Problem 9.15
V (cm3)       50      60      70      90      100
P (kg/cm2)    64.7    51.3    40.5    25.9    7.8

Pr. 9.16 Non-linear parameter estimation of a model between light intensity and distance
An experiment was conducted to verify the intensity of light (y) as a function of distance (x) from a light source; the results are shown in Table 9.31.
(a) Plot this data and fit a suitable polynomial model.
(b) Fit the data with an exponential model.
(c) Fit the data using a model derived from the underlying physics of the problem.
(d) Compare the results of all three models in terms of their model statistics as well as their residual behavior.
Table 9.31 Data table for Problem 9.16
x (cm)    30     35     40     45     50     55     60     65     70     75
y         0.85   0.67   0.52   0.42   0.34   0.28   0.24   0.21   0.18   0.15

Pr. 9.17 Consider a hot water storage tank which is heated electrically. A heat balance on the storage tank yields:
$$P(t) = C \, \dot{T}_s(t) + L \left[T_s(t) - T_a(t)\right] \qquad (9.86)$$
where
C = M cp = thermal capacity of the storage tank [J/°C]
L = U A = heat loss coefficient of the storage tank [W/°C] (A = surface area of storage)
Ts(t) = temperature of storage [°C] as function of time
Ta(t) = ambient temperature [°C] as function of time
P(t) = heating power [W] as function of time t
Suppose one has data for Ts(t) during cool-down under constant conditions: P(t) = 0 and Ta(t) = constant. In that case, the energy balance can be written as (where the constant has been absorbed in T, i.e., T is now the difference between the storage temperature and the ambient temperature):
$$\tau \, \dot{T}(t) + T(t) = 0 \qquad (9.87)$$
with $\tau = C/L$ = time constant. Table 9.32 assembles the test results during storage tank cool-down.
(a) First, linearize the model and estimate its parameters.
(b) Study model residuals and draw relevant conclusions.
(c) Determine the time constant of the storage tank along with standard errors.
Table 9.32 Data table for Problem 9.17
t (h)    0      1     2     3     4     5     6     7     8     9     10    11
T(t)     10.1   8     6.8   5.7   4.4   3.8   3     2.4   2     1.8   1.1   1

Pr. 9.18 Model identification for wind chill factor
The National Weather Service generates tables of wind chill factor (WC) for different values of ambient temperature (T) in °F and wind speed (V) in mph. The WC is an equivalent temperature which has the same effect on the rate of heat loss as that of still air (an apparent wind of 4 mph). The equation used to generate the data in Table 9.33 is:
$$WC = \log V \,(0.26T) - 23.68\,(0.63T + 32.9) \qquad (9.88)$$
You are given this data without knowing the model. Examine a linear model between WC = f(T,V), and point out inadequacies in the model by looking at the model fits and the residuals.
(a) Investigate, keeping T fixed, a possible relation between WC and V.
(b) Repeat with WC and T.
(c) Adopt a stage-wise model building approach and evaluate suitability.
(d) Fit a model of the type: WC = a + b T + c V + d (V)^(1/2) and evaluate suitability.
Table 9.33 Data table for Problem 9.18 (a)
Wind speed    Actual air temperature (°F)
(mph)         50    40    30    20    10     0   -10   -20   -30   -40   -50   -60
 5            48    36    27    17     5    -5   -15   -25   -35   -46   -56   -66
10            40    29    18     5    -8   -20   -30   -43   -55   -68   -80   -93
15            35    23    10    -5   -18   -29   -42   -55   -70   -83   -97  -112
20            32    18     4   -10   -23   -34   -50   -64   -79   -94  -108  -121
25            30    15    -1   -15   -28   -38   -55   -72   -88  -105  -118  -130
30            28    13    -5   -18   -33   -44   -60   -76   -92  -109  -124  -134
35            27    11    -6   -20   -35   -48   -65   -80   -96  -113  -130  -137
40            26    10    -7   -21   -37   -52   -68   -83  -100  -117  -135  -140
45            25     9    -8   -22   -39   -54   -70   -86  -103  -120  -139  -143
50            25     8    -9   -23   -40   -55   -72   -88  -105  -123  -142  -145
From Chatterjee and Price 1991 by permission of John Wiley and Sons
(a) Data available electronically on book website

Pr. 9.19 Parameter estimation for an air infiltration model in homes
The most common manner of measuring exfiltration (or infiltration) rates in residences is by artificially pressurizing (or depressurizing) the home using a device called a blower door. This device consists of a door-insert with a rubber edge which can provide an air-tight seal against the door jamb of one of the doors (usually the main entrance). The blower door has a variable speed fan, an air flow measuring meter and a pressure difference manometer with two plastic hoses (to allow the inside and outside pressure differential to be measured). All doors and windows are closed during the testing. The fan speed is increased incrementally and the pressure difference Δp and air flow rate Q are measured at each step. The model used to correlate air flow with pressure difference is a modified orifice flow model given by:
$$Q = k \, (\Delta p)^{n} \qquad (9.89)$$
where k is called the flow coefficient (which is proportional to the effective leakage area of the building envelope) and n is the flow exponent. The latter is close to 0.5 when the flow is strictly turbulent which occurs when the flow paths through the interstices of the envelope are small and tortuous (like in a well-built "tight" house), while n is close to 1.0 when the flow is laminar (such as in a "loose" house). Values of n around 0.65 have been experimentally determined for typical residential construction in the United States. Table 9.34 assembles test results of an actual house where blower door tests were performed both before and after weather-stripping.
(a) Plot the data and determine whether OLS conditions are satisfied in terms of the measured response and in terms of the log transform of the response.
(b) Identify the two sets of coefficients k and n for tests done before and after the house tightening.
(c) Are the changes in the numerical values of the two sets of coefficients consistent with physical expectation?
(d) Based on the uncertainty estimates of these coefficients, what can you conclude about the effect of weather-stripping the house?
Table 9.34 Data table for Problem 9.19 (a)
Before weather-stripping       After weather-stripping
Δp (Pa)     Q (m3/h)           Δp (Pa)     Q (m3/h)
3.0         99.2               2.2         365.0
5.0         170.4              5.5         445.9
5.8         185.6              6.7         492.7
6.7         208.5              8.2         601.8
8.2         263.2              11.6        699.2
9.0         283.1              13.5        757.5
10.0        310.2              15.6        812.4
11.0        346.2              18.2        854.1
(a) Data available electronically on book website

Pr. 9.20 Non-linear model regression (from Neter et al. 1983)
The yield (y) of a chemical process depends on the temperature (x1) and pressure (x2). The following non-linear regression model is expected to apply:
$$y = a \, x_1^{b} \, x_2^{c} \qquad (9.90)$$
Table 9.35 Data table for Problem 9.20 with replication level of 2 (a)
      x1     x2     y1     y2
1     1      1      12     8
2     10     1      32     38
3     100    1      103    98
4     1      10     20     14
5     10     10     61     56
6     100    10     198    205
7     1      100    38     43
8     10     100    133    128
9     100    100    406    398
(a) Data available electronically on book website
Designed experiments with a replication level of 2 were performed in a laboratory, and the data are shown in Table 9.35.
(a) Plot the data and look at the functional shape for the measured and the log-transformed values. What can you conclude about the distributions of the y-variable?
(b) Log-transform the data and perform a multiple regression analysis.
(c) Use these results as starting values and perform a non-linear regression.
(d) Compare results of steps (b) and (c).
Pr. 9.21 Learning Rate Model or Experience Curve Model
Experience gained with a certain technology results in increased efficiency and cost decreases in the beginning. After a certain time, the rate of improvement decreases and eventually the process stabilizes. The initial phase, where the associated cost of production reduces as a technology matures, can be modeled by what is referred to as the learning rate model. This simplified approach is convenient to use for planning and policy studies. The simplest form of this model (one-factor model) is:
$$C_i = C_o \left(\frac{N_i}{N_o}\right)^{\alpha} \qquad (9.91)$$
where Ci is the unit cost at time t, Co the unit cost at time 0, Ni the cumulative production at time t, No the cumulative production at time 0, and α is the learning elasticity parameter. Often a factor PR called the "progress ratio" is defined such that for every doubling of the cumulative production, the unit cost falls to a fraction PR of its previous value:
$$PR = 2^{\alpha} \qquad (9.92a)$$
An alternative term is also commonly used:
$$\text{Learning Rate (LR)} = 1 - PR \qquad (9.92b)$$
Typical LR values range from 0.1 to 0.3. A typical learning rate of 0.2 (or PR = 0.8) is often used as the best estimate of future cost reduction potential for a variety of energy technologies. For more accurate modeling, historic data is often used to determine numerical values for the learning rate. The reduction in solar crystalline-silicon photovoltaic module costs in dollars per Watt-peak seems to historically follow a learning rate model with LR = 0.285 (Fig. 9.37). Note that not all technologies follow this model; some technologies reach an early death during the immature stage of development.
Fig. 9.37 Learning rate curves or reduction in solar photovoltaic panel costs ($/W) with increase in production. Note the log scales for both axes. Every time the world's solar power doubles, the cost of panels drops by 26%. (Downloaded on March 10, 2023 from http://www.rapidshift.net/solar-pv-shows-a-record-learningrate-28-5-reduction-in-cost-perwatt-for-every-doubling-ofcumulative-capacity/)
(a) Estimate the cumulative PV deployment capacity in MW-peak needed to reach grid-parity (i.e., equal to the cost of electricity generated by conventional power plants) under the following assumptions:
• PV costs in 1990 were $8/W-peak
• Worldwide PV capacity in 1990 = 2000 MW-peak
• Learning rate LR = 0.285
• For grid-parity, the module costs should be around $0.3/W-peak (this is location dependent but assume this to be a mean value). Note that in addition to this, a typical solar installation will include material and labor costs as well as permitting fees and profits; these are excluded here.
(b) Investigate the sensitivity of the different assumed values in how they affect the uncertainty of the estimated cumulative PV deployment capacity.
(c) Research the PV costs and installed capacity in the U.S. for 2022 and compare your results with these values. Discuss causes for discrepancy.
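The one-factor learning rate model of Eqs. 9.91 and 9.92 can be sketched in Python as shown below. The function names and the numerical values in the usage lines are hypothetical and chosen only to make the arithmetic easy to follow; they are not the values of Pr. 9.21.

```python
import math


def learning_elasticity(learning_rate):
    """Learning elasticity alpha from LR, using PR = 1 - LR and PR = 2**alpha."""
    progress_ratio = 1.0 - learning_rate
    return math.log2(progress_ratio)  # negative whenever costs fall


def unit_cost(c0, n0, n, learning_rate):
    """Eq. 9.91: unit cost after cumulative production has grown from n0 to n."""
    alpha = learning_elasticity(learning_rate)
    return c0 * (n / n0) ** alpha


def cumulative_production_for_cost(c0, n0, target_cost, learning_rate):
    """Eq. 9.91 solved for the cumulative production that yields target_cost."""
    alpha = learning_elasticity(learning_rate)
    return n0 * (target_cost / c0) ** (1.0 / alpha)


if __name__ == "__main__":
    # Hypothetical technology: unit cost of 100 at a cumulative production of 10
    lr = 0.2  # i.e., PR = 0.8
    print(unit_cost(100.0, 10.0, 20.0, lr))   # cost after one doubling
    print(unit_cost(100.0, 10.0, 40.0, lr))   # cost after two doublings
    print(cumulative_production_for_cost(100.0, 10.0, 64.0, lr))
```

With LR = 0.2, each doubling of cumulative production multiplies the unit cost by PR = 0.8, so two doublings bring a cost of 100 down to about 64; the inverse function recovers the corresponding cumulative production.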
References
Andersen, K.K., and T.A. Reddy, 2002. The error in variable (EIV) regression approach as a means of identifying unbiased physical parameter estimates: Application to chiller performance data, HVAC&R Research Journal, vol.8, no.3, pp. 295–309, July.
Asadzadeh, A., T. Kotter and E. Zebardast, 2015. An augmented approach for measurement of disaster resilience using connective factor analysis and analytic network process (F'ANP) model, Int. Journal of Disaster Risk Reduction, vol.14, part 4, pp. 504–518, December.
Bard, Y., 1974. Nonlinear Parameter Estimation, Academic Press, New York.
Beck, J.V. and K.J. Arnold, 1977. Parameter Estimation in Engineering and Science, John Wiley and Sons, New York.
Belsley, D.A., E. Kuh, and R.E. Welsch, 1980. Regression Diagnostics, John Wiley & Sons, New York.
Box, G.E.P. and D.R. Cox, 1964. An analysis of transformations, Journal of the Royal Statistical Society, Series B, 26, 211–252.
Chapra, S.C. and R.P. Canale, 1988. Numerical Methods for Engineers, 2nd Ed., McGraw-Hill, New York.
Chatfield, C., 1995. Problem Solving: A Statistician's Guide, 2nd Ed., Chapman and Hall, London, U.K.
Chatterjee, S. and B. Price, 1991. Regression Analysis by Example, 2nd Ed., John Wiley & Sons, New York.
Devore, J., and N. Farnum, 2005. Applied Statistics for Engineers and Scientists, 2nd Ed., Thomson Brooks/Cole, Australia.
Dobson, A.J. and A.G. Barnett, 2018. An Introduction to Generalized Linear Models, 4th Ed., CRC Press, Boca Raton, FL.
Draper, N.R. and H. Smith, 1981. Applied Regression Analysis, 2nd Ed., John Wiley and Sons, New York.
Edwards, C.H. and D.E. Penney, 1996. Differential Equations and Boundary Value Problems, Prentice Hall, Englewood Cliffs, NJ.
Fausett, L., 1993. Fundamentals of Neural Networks: Architectures, Algorithms, and Applications, Prentice Hall, Englewood Cliffs, NJ.
Fuller, W.A., 1987. Measurement Error Models, John Wiley & Sons, NY.
Godfrey, K., 1983. Compartmental Models and Their Application, Academic Press, New York.
Gordon, J.M. and K.C. Ng, 2000. Cool Thermodynamics, Cambridge International Science Publishing, Cambridge, UK.
Haykin, S., 1999. Neural Networks, 2nd Ed., Prentice Hall, NJ.
James, G., D. Witten, T. Hastie and R. Tibshirani, 2013. An Introduction to Statistical Learning: with Applications to R, Springer, New York.
Kachigan, S.K., 1991. Multivariate Statistical Analysis, 2nd Ed., Radius Press, New York.
Kawashima, M., C.E. Dorgan and J.W. Mitchell, 1998. Hourly thermal load prediction for the next 24 hours by ARIMA, EWMA, LR, and an artificial neural network, ASHRAE Trans. 98(2), Atlanta, GA. Kowalski, W.J., 2002. Immune Building Systems Technology, McGrawHill, New York Lipschutz, S., 1966. Finite Mathematics, Schaum’s Outline Series, McGraw-Hill, New York Mandel, J., 1964. The Statistical Analysis of Experimental Data, Dover Publications, New York.
407 Manly, B.J.F., 2005. Multivariate Statistical Methods: A Primer, 3rd Ed., Chapman & Hall/CRC, Boca Raton, FL Masters, G.M. and W.P. Ela, 2008. Introduction to Environmental Engineering and Science,3rd Ed. Prentice Hall, Englewood Cliffs, NJ Miller, R.C. and J.E. Seem, 1991. Comparison of artificial neural networks with traditional methods of predicting return time from night or weekend setback, ASHRAE Trans., 91(2), Atlanta, GA., Mullet, G.M., 1976. Why regression coefficients have the wrong sign, J. Quality Technol., 8(3). Neter, J. W. Wasserman and M.H. Kutner, 1983. Applied Linear Regression Models, Richard D. Irwin, Homewood IL. Pierce, J.R., 1980. An Introduction to Information Theory, 2nd Ed., Dover Publications, New York. Pindyck, R.S. and D.L. Rubinfeld, 1981. Econometric Models and Economic Forecasts, 2nd Edition, McGraw-Hill, New York, NY. Reddy, T.A., and D.E. Claridge, 1994. Using synthetic data to evaluate multiple regression and principal component analysis for statistical models of daily energy consumption, Energy and Buildings, Vol.21, pp. 35–44. Reddy, T.A. and K.K. Andersen, 2002. An evaluation of classical steady-state off-line linear parameter estimation methods applied to chiller performance data, HVAC&R Research Journal, vol.8, no.1, pp.101–124. Reddy, T.A., 2007. Application of a generic evaluation methodology to assess four different chiller FDD Methods (RP1275), HVAC&R Research Journal, vol.13, no.5, pp 711–729, September. Shannon, R.E., 1975. System Simulation: The Art and Science, PrenticeHall, Englewood Cliffs, NJ. Sinha, N.K. and B. Kuszta, 1983. Modeling and Identification of Dynamic Systems, Van Nostrand Reinhold Co., New York. SPSS, 1997. Neural Connection- Applications Guide, SPSS Inc and Recognition Systems Inc., Chicago, IL. Subbarao, K., Y. Lei and T.A. Reddy, 2011. The nearest neighborhood method to improve uncertainty estimates in statistical building energy models, ASHRAE Trans, vol. 
117, Part 2, American Society of Heating, Refrigerating and Air-Conditioning Engineers, Atlanta. Wakefield, J., 2013. Bayesian and Frequentist Regression Methods, Springer, New York. Walpole, R.E., R.H. Myers, S.L. Myers, and K. Ye, 2007. Probability and Statistics for Engineers and Scientists, 8th Ed., Prentice-Hall, Upper Saddle River, NJ. Wasserman, P.D., 1989. Neural Computing: Theory and Practice, Van Nostrand Reinhold, New York. Wolberg, J., 2006. Data Analysis Using the Method of Least Squares, Springer, Berlin, Germany.
10 Inverse Methods for Mechanistic Models
Abstract
This chapter deals with different types of inverse methods which were previously introduced and categorized in Chapter 1. Inverse models were defined as “pertaining to the case when the system under study already exists, and one uses measured or observed system performance data to identify a model structure of the system and estimate model parameters.” The focus here is on mechanistic or gray-box models (as against black-box) whose functional form captures, albeit in a simplified manner, the physical functioning and component interactions of the system or process. Specifically, three different categories of applications are treated involving both observational data and data collected under controlled experiments. The concept of information content of collected data and associated quantitative metrics are introduced and illustrated by means of actual examples in the context of non-intrusive off-line and on-line data gathering. The first category of inverse methods involves algebraic or static models. Three types of applications are used to illustrate the commonly adopted parameter estimation methods: solar photovoltaic systems, liquid-cooled chillers, and estimating macro-parameters of the building envelope. How different models can be recast in a form suitable for model parameter estimation, how to evaluate different models in a purely non-intrusive context, and how the physically meaningful model parameters can provide insights into design and operation are discussed. The concept of sequential stagewise regression in conjunction with selection of appropriate data windows is also illustrated. The second category involves dynamic gray-box models and covers three different types of
approaches: (i) sequential parameter estimation using intrusive controlled tests, (ii) evaluation of different thermal network models and ARMAX time series models when applied to non-intrusive data, and (iii) the compartment modeling approach within the state space representation suitable for linear first order ODE describing the behavior of a series of discrete compartments. The Bayesian and nonlinear least-squares methods of calibration are also presented and evaluated against gray-box modeling of a relatively small retail building. The third category is calibration of white-box or detailed simulation programs which is a highly over-parameterized problem. Methods to deal with this situation are also classified into three approaches as they pertain specifically to building energy simulation programs: (i) Raw input tuning (RIT) methods require that one selectively manipulate certain parameters of the model to incrementally achieve better fits to observed data (this is the most prevalent method used by practitioners today). The process can be done heuristically based on domain knowledge or by Monte Carlo (MC) resampling methods in conjunction with regional sensitivity analyses; (ii) The semi-analytic method (SAM) injects some measure of statistical rigor into the RIT-MC variant in terms of determining the number of identifiable parameters which the data can support, and identifying those that are most appropriate; (iii) The physical parameter estimation (PPE) method (still under research development) is more rigorous involving running the simulation program several times with carefully selected changes in its input driving functions and correcting previously defined macro-parameters of the major energy flows of the system.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 T. A. Reddy, G. P. Henze, Applied Data Analysis and Modeling for Energy Engineers and Scientists, https://doi.org/10.1007/978-3-031-34869-3_10
10.1 Fundamental Concepts
10.1.1 Applicability
Inverse problems were previously introduced, and one way of classifying them was proposed in Sect. 1.5.3. The term is synonymous with “system identification”¹ and comprises two sub-tasks. Model identification is the process of selecting the functional form or model structure among several competing alternatives based on an understanding of how the system functions, and parameter estimation is the process of identifying the parameters in the model. Note that the use of the word estimation has a connotation of uncertainty. This uncertainty in determining the parameters of the model, which describes the physical system, is unavoidable and arises from model simplification errors and from the measurement noise that is invariably present. The challenge is to decide on how best to simplify the model and design the associated experimental protocol to minimize the uncertainty in our parameter estimates from data corrupted by noise. Thus, such analyses not only require knowledge of the physical behavior of the system, but also presume expertise in modeling, in designing and performing experiments, and in regression/optimization methods. Many of the chapters in this book (specifically Chaps. 1, 5, 6, 7, 8, and 9) directly pertain to such issues.

The inverse approach is more suitable for better understanding the behavior of unknown or partially known systems, provides better diagnostics, and allows for more accurate predictions and optimal control. The forward approach, on the other hand, is more appropriate for evaluating different component and system alternatives which satisfy preset performance and safety criteria during the design phase.

Consider the discrete-time dynamic linear model with no lagged terms of a component or system represented by the block diagram in Fig. 1.8 reproduced from Sect. 1.4.3 of Chap. 1. Then, the model can be represented in matrix form as:

$$ Y_t = A\,Y_{t-1} + B\,U_t + C\,W_t \quad \text{with} \quad Y_1 = d \qquad \text{(repeat 1.1)} $$
where the output or state variable at time t is $Y_t$. The forcing (or input or exogenous) variables are of two types: vector U denoting observable controllable input variables and vector W indicating observable uncontrollable input variables or disturbing inputs. The parameter vectors of the model are {A, B, C}, while d represents the initial condition vector of Y. Note that the output Y is designated as a vector to allow for the general case of multiple system responses (as against the more common single response instance).

¹ Though there is a connotational difference between the words “identification” and “estimation” in the English language, no such difference is usually made in the field of inverse modeling. Estimation is a term widely used in statistical mathematics to denote a similar effect as the term identification which appears in electrical engineering literature.

One can view inverse modeling as relevant to two types of problems/applications:

(a) The identification problem, which involves both estimating the undetermined system model/function and identifying relevant parameters. One can distinguish between (i) offline or batch identification and (ii) online or recursive or adaptive identification meant to model and control constantly varying physical systems such as airplanes and ships; the latter can also be used for other applications such as short-term prediction (say, of electric power using autoregressive models, Sect. 8.5) or for signal processing (say, digital transmission of signals). An important issue related to the information content of data collection is discussed in Sect. 10.3.2.

(b) The reconstruction problem wherein, given the system response/reaction, one wishes to identify the stimulus or effect which caused it, or vice versa. Further, there may be three sub-classes: (i) the control problem, which involves modifying the system inputs so as to obtain a pre-specified reaction or response; hence, some sort of feedback control is inherently present, (ii) the diagnostic problem, or identification of sources of abnormality, which involves deducing the boundary conditions and/or the system inputs which caused a certain output (one catchy example, attributed to Richard Feynman, is that of a frog looking at the patterns of different ripples on the surface of a pond and trying to figure out how many people jumped in, at what time, and where), and (iii) the hybrid problem, where both of the above are combined. The best possible manner to “reconstruct” such missing information depends on the specific circumstance and the nature of the application.
For example, unwanted chemical outflows from a factory to multiple points in a pond are to be detected and diagnosed, and remedial (control) action is to be automatically initiated to dilute/remove the source of contamination. The number of sensors, their location, their sampling frequency, the number of people affected, and their specific locations are some of the numerous issues which need to be considered while designing such a safety system.

A general mathematical way of framing the above problems has already been presented in Sect. 1.5.3 and is reproduced below:

Parameter estimation problems:

$$ \text{given } \{Y, U, W'', d\}, \text{ determine } \{A, B, C\} \qquad \text{(repeat 1.19)} $$

Control problems:

$$ \text{given } \{Y''\} \text{ and } \{A, B, C\}, \text{ determine } \{U, W, d\} \qquad \text{(repeat 1.20)} $$

where Y″ is meant to denote that only limited measurements may be available for the state variable. Control problems include both control design and the real-time operation of the controller.

Fig. 1.8 (repeat) Block diagram of a simple component with parameter vectors {A, B, C}. Vectors U and W are the controllable/observable and the uncontrollable/disturbing inputs, respectively, while Y is the state variable or system response (reproduced from Chap. 1)
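The parameter estimation problem framed above can be illustrated with a minimal sketch, assuming the simplest scalar case in which A, B, and C reduce to single numbers a, b, and c. The data are synthetic, and the parameter values and the helper names `simulate` and `estimate_abc` are hypothetical choices made only for this illustration; they do not come from the text.

```python
import numpy as np

def simulate(a, b, c, u, w, d):
    """Simulate Y_t = a*Y_{t-1} + b*U_t + c*W_t, starting from Y_1 = d."""
    y = np.empty(len(u))
    y[0] = d
    for t in range(1, len(u)):
        y[t] = a * y[t - 1] + b * u[t] + c * w[t]
    return y

def estimate_abc(y, u, w):
    """Least-squares estimate of (a, b, c) from measured series of Y, U, W."""
    # Each row of X holds the regressors of one time step: Y_{t-1}, U_t, W_t
    X = np.column_stack([y[:-1], u[1:], w[1:]])
    theta, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    return theta

rng = np.random.default_rng(42)
n = 500
u = rng.uniform(0.0, 1.0, n)    # controllable input (assumed synthetic signal)
w = rng.uniform(-1.0, 1.0, n)   # uncontrollable disturbance (assumed synthetic signal)

y_true = simulate(0.8, 2.0, -0.5, u, w, d=1.0)
theta_clean = estimate_abc(y_true, u, w)

# Same estimation when the measured response is corrupted by noise
y_noisy = y_true + 0.05 * rng.standard_normal(n)
theta_noisy = estimate_abc(y_noisy, u, w)

print("noise-free estimates:", theta_clean)
print("noisy estimates:     ", theta_noisy)
```

With noise-free data the three parameters are recovered to numerical precision; with measurement noise the estimates scatter around the assumed values, which is the sense in which the word "estimation" carries a connotation of uncertainty.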
10.1.2 Approaches and Their Characteristics
An important characteristic of inverse problems is that they are usually ill-posed (or incorrectly posed), whereby one or more of the following limitations apply: (i) the measured data does not allow the existence of a solution to the problem (undetermined),² (ii) the solution is not unique, or (iii) even worse, the solution is not stable due to the nature of the disturbance (Argoul 2012). The objective is to find ways to circumvent such limitations by appropriate analysis techniques, by well-designed experimentation, or by both. Inverse problems require context-specific approximate numerical or analytical solutions for linear and non-linear problems. Chapter 9 already covered the concept of parameter estimability in terms of structural and numerical identifiability, as well as ascertaining ill-conditioning of equations using the condition number. Ill-conditioning, i.e., when the solution is extremely sensitive to the data (see Sect. 9.2), is often due to variable multi-collinearity and to the repetitive nature of the data collected (common for systems under normal operation). Another characteristic of inverse problems is that there is no single standard approach to dealing with them, given the numerous ways in which the problem and the associated data available/collected can be ill-posed.
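As a small illustration of how multi-collinearity shows up in the condition number, the sketch below compares two synthetic regressor sets. It assumes the condition number is taken as the ratio of the largest to the smallest singular value of the standardized regressor matrix (what `numpy.linalg.cond` returns by default); the exact definition used in Sect. 9.2 may differ in detail, and all data and names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.uniform(0.0, 10.0, n)
x2_independent = rng.uniform(0.0, 10.0, n)               # unrelated to x1
x2_collinear = 2.0 * x1 + 0.01 * rng.standard_normal(n)  # nearly a multiple of x1

def condition_number(*columns):
    """Condition number of the standardized regressor matrix."""
    X = np.column_stack(columns)
    X = (X - X.mean(axis=0)) / X.std(axis=0)  # remove scale effects
    return np.linalg.cond(X)

cond_independent = condition_number(x1, x2_independent)
cond_collinear = condition_number(x1, x2_collinear)

print("independent regressors:", cond_independent)
print("collinear regressors:  ", cond_collinear)
```

The nearly collinear pair yields a condition number several orders of magnitude larger, signaling that parameter estimates obtained from such data would be very sensitive to small changes in the measurements.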
Some of the approaches which could be adopted are to: (i) take additional observations at carefully selected sampling frequencies determined from system physics (or perform a series of well-conceived intrusive experiments, if permitted), (ii) simplify the set of modeling equations while retaining the important system behavior, i.e., use reduced order models, (iii) reduce the number of parameters of the system model from physical considerations or from sensitivity analysis (see Sect. 6.7.4), (iv) introduce constraints to reduce the estimation/search space, (v) in case of observational data, filter/select windows of the data time series when different system forcing functions and their effects on system response are more pronounced, so that the associated parameters can be better estimated, and (vi) use the Bayesian approach to formulate prior and posterior probabilities of the parameter estimates.

² For example, one sees a small puddle of water near the fridge and tries to determine whether it was a water spill from the water dispenser or a cube of ice which fell out when a handful of ice cubes were grabbed an hour back. Short of further investigation, one cannot come to any firm conclusion.

Recall the three broad types of inverse approaches introduced in Sect. 1.5.3: calibrated white box, black box, and gray box. White-box or detailed mechanistic models are based on first principles and permit accurate and microscopic modeling of the various fluid flow, heat, and mass transfer phenomena which occur within engineered systems. They are most appropriate for detailed or high-fidelity simulation studies involving the evaluation of different candidate system designs. How best to calibrate such models also falls under the purview of inverse methods. Black-box models are inexact structural models where the model equations have no scientific basis and the parameters are unlikely to have physical meaning. “Curve fitting” is the proper terminology to use for their identification, given that there is some amount of arbitrariness in the choice of the function, with the selection criterion based on assuring the best fit. Gray-box models can be viewed as simplified representations of white-box models which approximately capture the underlying physical behavior of the system. Different gray-box models can be considered for the same system behavior depending on how well the system/process is known, on data availability, and on the intent of the problem under study (further discussed in Sect. 10.2).

Consider the multi-component system shown in Fig. 10.1. The overall system has five inter-connected components with three input and two output variables.
Fig. 10.1 Block diagram of a multicomponent system showing different components with their inputs, outputs and interconnections

If each component is to be individually considered, one needs to identify five component functions and estimate eight β parameters. It all depends on what type of data is available, and whether one has some prior information which will allow one to use the Bayesian approach and constrain the parameters in their range of variation. The best-case scenario (and the one which is most expensive) is to instrument each of the components separately and identify the individual component models one by one. Such a component isolation approach is impractical in most cases, and so, often, one is faced with a sub-optimal set of measurements for system identification. One can identify different instances. The lowest level is the input-output case when only measurements of [x1, x2, x3] and [y1, y2] are known. It is impossible to identify the functional form of each component. The best one could hope for is to identify a lumped or overall system model. If no prior knowledge of system behavior is available, one would adopt the black-box approach (i.e., empirical or curve fit) treated in Chaps. 5 and 9 and fit the response with whatever functional form best captures the trend while keeping in mind the issues of numerical identifiability. One could enhance the value of the collected performance data by surrogate modeling (Sect. 6.7.5) with due care exercised to maintain the original correlation structure among regressors and between them and the response variable. The semi-empirical case is when some prior knowledge about the overall system is available, and the model fitting will be limited to functions consistent with this knowledge. This is the gray-box approach (elaborated in Sects. 10.2 and 10.4); of course, the greater the knowledge on the functional form, the less “black” will be the model identified. If well-defined mechanistic model forms of the components are available, the parameter estimation is likely to be sound. Even then, the physical interpretation of the estimated parameters may suffer from variable collinearity (discussed in Sect. 9.3) which can be sometimes mitigated/reduced by
appropriate statistical methods. If intrusive experiments are possible, one would think of suitable ways to perturb the system so as to elicit more information about system behavior, and thereby obtain even sounder model parameter estimates. For non-intrusive data, one could use sequential stagewise regression involving a series of regression steps on carefully selected data windows (discussed in Sect. 10.2.4).

Model reduction, often used for analysis and control design, can be achieved by a relevant and simplified physical interpretation of the system functioning, viz., the gray-box approach. Domain knowledge or prior information (or even insight gained from running a white-box simulation model of the system with its five interacting components) is used to simplify the functioning of each of the components and their system interconnections. Perhaps components 3 and 4 of Fig. 10.1 are less dominant than components 1 and 2 (i.e., the overall system response is less sensitive to them). One could then combine them (or even eliminate them, if appropriate) so that the system simplifies to three components: component (1 + 3), component (2 + 4), and component 5. One could then follow a suitable experimental design to identify these three components individually. If component (1 + 3) is more pronounced during certain periods (e.g., a cooling system during summer months) while component (2 + 4) is more pronounced during other periods (say, a heating system during winter months), one could select time windows and identify these components more robustly instead of using the entire set of data available. This is an illustration of choosing suitable time windows depending on the magnitude of the component response. A note of caution is that a particular component deemed less dominant and eliminated at the initial stages should remain so over all the operating regimes which are being investigated.

Finally, the calibrated white-box model approach (treated in Sect. 10.6) is one where a mechanistic simulation model is available for the system with its five interconnected components, and the problem is to estimate the eight model parameters knowing the inputs and responses as well as their uncertainty ranges and prior probability distributions (akin to the Bayesian approach). There are numerous combinations possible, but a common scenario is one where only [x1, x2, x3] and [y1, y2] measurements are known (see Fig. 10.1). The problem is, of course, over-parametrized, resulting in non-unique parameter estimation even when selecting data windows or performing intrusive experiments is possible. This is where sensitivity analysis (see Sect. 6.7.4) can be used to isolate the less influential component(s) and freeze their model parameters based on prior knowledge and/or combine them to simplify the system model structure. Another promising approach pertinent to white-box models is reduced order modeling (ROM), also referred to as model order reduction (MOR), which is meant to reduce the computational complexity and run-time of a computer model while preserving the expected fidelity within a controlled error. Here, the primary reliance or onus is on mathematical, numerical, and data-driven techniques, with support from domain knowledge as relevant.
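The idea of selecting data windows in which one component dominates can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Sect. 10.2.4: the daily data are synthetic, the balance temperature `T_BAL` is assumed known, and the model form (a base load plus cooling and heating terms proportional to temperature differences) and all numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
day = np.arange(365)
temp = 18.0 + 12.0 * np.sin(2.0 * np.pi * (day - 100) / 365.0)  # outdoor temp, deg C

T_BAL = 18.0                                   # assumed known balance temperature
base_true, beta_c_true, beta_h_true = 10.0, 3.0, 2.0
cooling = np.maximum(temp - T_BAL, 0.0)        # driver of the "cooling" component
heating = np.maximum(T_BAL - temp, 0.0)        # driver of the "heating" component
y = (base_true + beta_c_true * cooling + beta_h_true * heating
     + 0.5 * rng.standard_normal(day.size))    # measured response with noise

# Stage 1: summer window, where only the cooling component is active.
summer = temp > T_BAL + 2.0
X1 = np.column_stack([np.ones(summer.sum()), cooling[summer]])
(base_hat, beta_c_hat), *_ = np.linalg.lstsq(X1, y[summer], rcond=None)

# Stage 2: winter window, with the base load frozen at its stage-1 estimate.
winter = temp < T_BAL - 2.0
residual = y[winter] - base_hat
x2 = heating[winter]
beta_h_hat = float(x2 @ residual / (x2 @ x2))  # least squares through the origin

print("base load:", base_hat)
print("cooling slope:", beta_c_hat)
print("heating slope:", beta_h_hat)
```

Each stage estimates only the parameters that the selected window can support, and parameters identified in an earlier stage are held fixed in later ones.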
10.1.3 Mechanistic Models
Mechanistic gray-box models are those which capture the primary behavior and the interactions between components by mathematical equations based on well-accepted scientific principles. Steady-state behavior can be captured by algebraic equations, while dynamic behavior is formulated in terms of differential equations. The equations can be non-linear in the parameters and may even have time-variant parameters (see Sect. 1.4.4 for an explanation of these terms). Empirical models can be used to describe such behavior using multivariate polynomial or spline functions, but these typically tend to be non-unique with numerous parameters; this results in identifiability and estimability being major issues. Mechanistic gray-box models tend to be more economical (i.e., parsimonious) in the parameters, and their parameters have (limited) physical interpretability, with magnitudes that offer structural insight into system behavior. Further, they contribute to scientific understanding (either by confirming a hypothesis or by providing insights into previously opaque system/process behavior), and they can provide a basis for extrapolation outside the immediate region of study more robustly than do black-box models (Box et al. 1978). Finally, they may suggest further experimentation to enhance understanding or to improve the model formulation.

It must be pointed out, however, that mechanistic models are not always the best choice. Judgement is needed on when to use mechanistic versus empirical models. If the system behavior is too complex or opaque, as in certain disciplines such as the social sciences or even econometrics, empirical methods may be the only recourse. They are also more suitable in instances when the analyst wishes to undertake a preliminary exploratory study with limited resources and time, or when the analyst has limited domain knowledge to properly grasp the system behavior.

Most of the techniques and concepts covered in Chaps. 5 and 9 have been presented in terms of empirical models: how to discriminate between black-box models using statistical indicators such as R² or RMSE, how to estimate model parameters under standard conditions and under ill-conditioned (collinear) data, how to diagnose and remedy improper model residual behavior, the advantages of resampling and cross-validation, the use of OLS and maximum likelihood estimation methods, the use of the Box-Cox transformation, the logistic function and its uses, how to convert some non-linear estimation problems into linear ones using link functions under the general umbrella of GLM models, and how to estimate model parameters for intrinsically non-linear models using search methods. These methods are also relevant to mechanistic inverse problems, as illustrated in this chapter.

10.1.4 Scope of Chapter
The inverse approach has been used in a wide variety of areas; to name a few: geophysical exploration (deducing the internal structure of the ground from surface measurements for earthquake prediction and petroleum deposits), civil engineering (non-destructive testing for component fracture), thermal/heat transfer processes, aeronautics and industrial control, radio and digital signal processing, biomedicine, ecology, transportation, and robotics. It has also been applied to large systems, such as climate change characterization and the evaluation of different future scenarios. The scope of this chapter (except Sects. 10.4.4 and 10.6) is limited to the simplest types of problems involving linear, lumped-parameter, time-invariant systems with direct measurement, where the inputs and outputs (or effects) of the system are directly measured or observable (as against indirect measurements such as geophysical data where the observable effects are linked to the system inputs and outputs). The latter type of problem is generally harder to treat. Both single and multi-component systems under steady-state (using algebraic equations) and dynamic (using first-order differential equations) conditions will be discussed. The focus in this chapter is on gray-box mechanistic models and on the calibration of detailed white-box models. Specifically, the application of these methods is illustrated with case study examples from solar photovoltaic systems, building energy use, and human toxicology.
10.2 Gray-Box Static Models
10.2.1 Basic Notions
System identification is formally defined as the determination of a mathematical model of a system based on its measured input and output data over a relatively short period of time, usually either for understanding the system or phenomenon being studied or for making predictions of system behavior. It deals with the choice of a specific model from a class of models that are mathematically equivalent to the given physical system. Model selection problems are under-constrained, with degrees of freedom greater than zero, so that an infinite number of solutions are possible. One differentiates between: (i) situations when nothing is known about the system functional behavior, sometimes referred to as “complete identification problems,” requiring the use of black-box models, and (ii) “partial identification problems” wherein some insights are available and allow gray-box models to be framed in order to analyze the data at hand.
The objectives are to identify the most plausible system models and the most likely parameters/properties of the system by performing certain statistical analyses and experiments. The difficulty is that several mathematical expressions may appear to explain the input–output relationships, partially due to the presence of unavoidable errors or noise in the measurements and/or limitations in the quantity and spread of data available. Usually, the data is such that it can only support models with a limited number of parameters. Hence, by necessity (as against choice), mathematical inverse models are macroscopic in nature and usually allow determination of only certain essential properties of the system. Thus, an important and key conceptual difference between forward or white-box (also called microscopic or micro-dynamic) models and black-box and gray-box identification problems (also called macro-dynamic) is that one should realistically expect the latter to involve models containing only a few aggregate interactions and parameters. Note that knowledge of the physical (or structural) parameters is not required (as in black-box models) if internal prediction or a future (in time) forecast is the only purpose. Such forecasts can be obtained through the reduced form equations directly, and exact deduction of parameters is not necessary (Pindyck and Rubinfeld 1981).

A model for the inverse problem can be quite different from the conventional forward models. It should contain only a small number of adjustable parameters, because the information content of the data is rather limited, being collected under fairly repetitive conditions and subject to errors. Thus, the model itself can be quite simple, although a fairly sophisticated procedure is often needed to ensure reliable determination of the adjustable model parameters. Developing an inverse mechanistic model involves the following steps:
1. Choose a model (functional form), making sure that it captures the crucial physical features of the system or situation.
2. Identify one (or more) influential variables as regressors and one or more dependent or response variables ymeas; also identify the model parameters to be estimated.
3. Recast the model in a form suitable for regression, and determine the parameters of the model by, say, minimizing the squared differences between ymodel and ymeas, summed over all data points (usually this will be done by least squares, using commercially available regression software).
4. Test the model against another set (or several sets) of performance data (this is called cross-validation).
5. As much as possible, express the model parameters in terms of quantities with direct physical interpretation, such as heat transmission coefficients, admittances, and time constants.
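The five steps above can be sketched on a deliberately simple case. The example assumes a hypothetical building whose steady-state heating load follows Q = UA (T_in − T_out); the data are synthetic, and the value of UA, the indoor temperature, and the noise level are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 120
T_IN = 20.0                                   # indoor temperature, deg C (assumed constant)
UA_TRUE = 250.0                               # heat transmission coefficient, W/deg C (assumed)
t_out = rng.uniform(-5.0, 15.0, n)            # outdoor temperature, deg C
q_meas = UA_TRUE * (T_IN - t_out) + 200.0 * rng.standard_normal(n)  # measured load, W

# Steps 1 and 2: model Q = UA * (T_in - T_out); regressor x, response q_meas,
# and a single parameter UA to be estimated.
x = T_IN - t_out

# Step 3: the model is already linear in UA; estimate it by least squares
# (regression through the origin) on a training subset.
train = np.arange(n) < 80
ua_hat = float(x[train] @ q_meas[train] / (x[train] @ x[train]))

# Step 4: cross-validate against the data not used for estimation.
test = ~train
rmse_test = float(np.sqrt(np.mean((q_meas[test] - ua_hat * x[test]) ** 2)))

# Step 5: the estimated parameter has a direct physical interpretation.
print(f"Estimated UA = {ua_hat:.1f} W/deg C, hold-out RMSE = {rmse_test:.0f} W")
```

The hold-out RMSE should be comparable to the measurement noise if the chosen functional form captures the physics; a much larger value would suggest revisiting step 1.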
Classical inverse estimation methods can be enhanced by using the Bayesian estimation method (one example is shown in Sect. 10.5), which is more subtle in that it allows one to include prior subjective knowledge about the value of the estimate or the probability of the unknown parameters in conjunction with information provided by the sample data (see Sects. 2.5 and 4.6). The prior information can take the form of numerical results obtained from previous tests by other researchers, or even by the same researcher, on similar processes and systems. Including prior information allows better estimates. For larger samples, the Bayesian estimates and the classical estimates should be close. It is when samples are small that Bayesian estimation is particularly useful.
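The statement that Bayesian and classical estimates agree for large samples but can differ for small ones can be illustrated with a minimal sketch. It assumes the simplest conjugate case (a normal prior on a single parameter and normally distributed measurements with known standard deviation); this is not the method of Sect. 10.5, and the function name and all numerical values are hypothetical.

```python
import numpy as np

def normal_posterior(prior_mean, prior_sd, data, noise_sd):
    """Posterior mean and standard deviation of a parameter with a normal
    prior, given normally distributed measurements of known noise_sd."""
    n = len(data)
    prior_precision = 1.0 / prior_sd**2
    data_precision = n / noise_sd**2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean
                            + data_precision * np.mean(data))
    return post_mean, np.sqrt(post_var)

rng = np.random.default_rng(11)
TRUE_VALUE, NOISE_SD = 3.0, 2.0      # assumed for the synthetic data
PRIOR_MEAN, PRIOR_SD = 2.0, 0.5      # assumed prior, e.g., from earlier tests

data_small = TRUE_VALUE + NOISE_SD * rng.standard_normal(5)
data_large = TRUE_VALUE + NOISE_SD * rng.standard_normal(5000)

post_small = normal_posterior(PRIOR_MEAN, PRIOR_SD, data_small, NOISE_SD)
post_large = normal_posterior(PRIOR_MEAN, PRIOR_SD, data_large, NOISE_SD)

print("n = 5:    sample mean", data_small.mean(), "posterior mean", post_small[0])
print("n = 5000: sample mean", data_large.mean(), "posterior mean", post_large[0])
```

With five observations the posterior mean is pulled strongly toward the prior, whereas with five thousand it is practically indistinguishable from the sample mean (the classical estimate).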
10.2.2 Performance Models for Solar Photovoltaic Systems

Photovoltaic (PV) modules/systems are devices that convert incident solar irradiation into DC or AC electric power (numerous textbooks are available for the interested reader, for example, Messenger and Ventre 2010). Several inverse models (both black-box and gray-box) of PV system performance have been proposed and evaluated over the years. A short description of the different performance models for PV systems and for solar cell temperatures is provided below (from Moslehi et al. 2017). This is meant to illustrate the importance of properly understanding the relevant domain knowledge prior to undertaking an inverse analysis and also to illustrate the difference between black-box and gray-box model formulations. This background will also be needed to solve Problem 10.6.

In general, there are two scenarios associated with collection of required data (see Fig. 10.2) and inverse modeling of solar PV performance:
• Predicting PV module power (Pelec) using only climatic data, i.e., solar irradiation (IT) on the plane of array or POA, ambient temperature (Ta), and wind velocity (vwind).
• Predicting PV module power output (Pelec) using measured solar cell/module temperature (Tcell) and climatic data.

Nomenclature
A        PV panels surface area, m2
IT       Total irradiance on tilted plane or on plane of array (POA), W/m2
Kη       Incidence angle modifier
NOCT     Nominal operating cell temperature, °C
POA      Plane of array
Pelec    Solar PV power output, W
Ta       Ambient dry-bulb temperature, °C
Tcell    PV module/cell temperature, °C
U        Overall thermal heat loss coefficient from solar panel, W/m2 °C
vwind    Wind velocity, m/s
β        Temperature degradation coefficient
θi       Solar incidence angle on POA, degrees
ηelec    Electrical efficiency of PV module
ηn       Optical efficiency at normal solar incidence

Fig. 10.2 Sketch of a solar PV module indicating the measured variables relevant for performance modeling. The X indicates a measurement point. Labeled variables: solar radiation IT, solar incidence angle θi, ambient air temperature Ta, ambient wind velocity vwind, solar cell/module temperature Tcell, and power output Pelec
10.2.2.1 Weather-Based Models

(a) Black-Box Models
A black-box model needs monitored data to determine model coefficients appropriate for the specific PV system. In this case, module or array power efficiency is assumed to be a simple linear function of the environmental parameters which, when rewritten in terms of power output, is (Myers and Emery 1976):

$$P_{elec} = A I_T \left(a' + b' T_a + c' I_T + d' v_{wind}\right) \tag{10.1}$$

where A is the collector area. This equation can be re-expressed in a form suitable for regression as:

$$P_{elec} = a I_T + b I_T T_a + c I_T^2 + d I_T v_{wind} \tag{10.2}$$
where a, b, c, and d are regression coefficients which include the PV surface area A. Alternatively, one could use (Pelec/A) as the response variable. Though this model form includes the effects of wind, its statistical significance needs to be evaluated under the various operating conditions experienced by the PV systems over the year.

(b) Gray-Box Models
A more physical (gray-box) model of the power generated in terms of the climatic variables and the PV system characteristics is given by (Gordon and Reddy 1988):

$$P_{elec} = A I_T K_\eta \eta_{rated} \left(a_1 - a_2 T_a - a_3 I_T\right) \tag{10.3a}$$
where Kη is the incidence angle modifier, which is the product of the solar transmittance of the glass cover and the absorptance of the collector surface. This value depends on the collector configuration but also on the solar incidence angle (which can be computed accurately knowing the day of the year, time of the day, latitude of location, and the tilt angle and azimuth angle of the solar panel). The incidence angle modifier is almost constant up to a solar incidence angle of about 60° and then drops precipitously to zero at 90° incidence angle. Strictly speaking, there are different values for the beam and diffuse components of the solar irradiation, but this aspect will be neglected for simplicity. The parameters of the model have a physical interpretation and characterize PV cell properties, such as overall thermal heat loss coefficient, cell temperature at rated conditions (Tcell,rated), temperature degradation coefficient β (which is a negative number), and optical efficiency at normal solar incidence ηn. They are given by:

$$a_1 = 1 + \beta T_{cell,rated}, \quad a_2 = \beta, \quad \text{and} \quad a_3 = \beta \left(\eta_n - \eta_{n,rated}\right) / U(v_{wind}) \tag{10.3b}$$
Further, U(vwind) is simply taken to be equal to (a + b vwind), where the coefficients should be determined from in-situ measurements and are dependent on the deployment type of the solar panel. Neglecting the effects of wind, Eq. 10.3a can be re-expressed in a form suitable for regression as:

$$P_{elec} = a I_T K_\eta + b I_T K_\eta T_a + c \left(I_T K_\eta\right)^2 \tag{10.4}$$

This equation can be considered to be akin to the Myers and Emery model (Eq. 10.2) except that it accounts for the solar incidence angle by including the incidence angle modifier, Kη, calculated as:

$$K_\eta = 1 + b \left(1/\cos\theta_i - 1\right) \tag{10.5}$$

where the coefficient b is negative and can be assumed to be equal to -0.10 for single glazed collectors.
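As a minimal sketch of how Eqs. 10.4 and 10.5 might be evaluated, the Python fragment below computes the incidence angle modifier and the regressors of the gray-box model. It assumes that the quadratic term of Eq. 10.4 is c (IT Kη)²; the function names are hypothetical, and clipping Kη at zero near grazing incidence is an added assumption rather than part of Eq. 10.5.

```python
import math


def incidence_angle_modifier(theta_i_deg, b=-0.10):
    """Eq. 10.5: K_eta = 1 + b * (1/cos(theta_i) - 1).

    Clipping negative values to zero near grazing incidence is an
    assumption added here, not part of Eq. 10.5.
    """
    if theta_i_deg >= 90.0:
        return 0.0
    k_eta = 1.0 + b * (1.0 / math.cos(math.radians(theta_i_deg)) - 1.0)
    return max(k_eta, 0.0)


def gray_box_regressors(i_t, t_a, theta_i_deg):
    """Regressors of Eq. 10.4, assuming the last term is c*(I_T*K_eta)**2."""
    x = i_t * incidence_angle_modifier(theta_i_deg)
    return x, x * t_a, x ** 2


def pv_power(i_t, t_a, theta_i_deg, a, b, c):
    """Predicted P_elec from Eq. 10.4 for given regression coefficients."""
    x1, x2, x3 = gray_box_regressors(i_t, t_a, theta_i_deg)
    return a * x1 + b * x2 + c * x3
```

In practice the three regressors would be computed for every monitored time step and passed to a least-squares routine without an intercept term to identify a, b, and c.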
10.2.2.2 Weather and Cell Temperature-Based Models

The module temperature (also referred to loosely as solar cell temperature) is often not measured in field installations. Even if it is, there is some indeterminacy depending on where such point measurements are taken on the panel surface or on a system with multiple solar panels. For example, Faiman (2008) experimentally found about a 2 °C root mean square error (RMSE) in cell temperature difference at different points of a PV module. A variety of approaches have been proposed for predicting solar PV module/cell temperature.

(a) Gray-Box Models with Measured Cell Temperature
The power output can be predicted based on detailed physical models. However, a more convenient model for field applications is based on (some aggregate) module cell temperature and the incident irradiation. The most widely used model is the one originally proposed by Evans (1981) for efficiency:

$$\eta_{elec} = \eta_{elec,rated} \left[1 + \beta \left(T_{cell} - T_{cell,rated}\right)\right] \tag{10.6}$$

Copper et al. (2016) modified the Evans model and presented it in terms of power output. This model, referred to here as the “modified Evans model”, can be re-expressed as:

$$P_{elec} = a I_T + b \left(T_{cell} - T_{cell,rated}\right) I_T \tag{10.7}$$
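The coefficients a and b of Eq. 10.7 can be identified by least squares without an intercept term. The sketch below solves the 2 x 2 normal equations directly; the function name and the default rated cell temperature of 25 °C are assumptions made for illustration only.

```python
def fit_modified_evans(i_t, t_cell, p_elec, t_cell_rated=25.0):
    """Least-squares estimates of a and b in Eq. 10.7 (no intercept).

    t_cell_rated defaults to 25 degC, which is an assumed value here.
    """
    x1 = [float(v) for v in i_t]
    x2 = [(tc - t_cell_rated) * it for tc, it in zip(t_cell, i_t)]
    # Elements of the normal equations X'X [a, b]' = X'y
    s11 = sum(u * u for u in x1)
    s12 = sum(u * v for u, v in zip(x1, x2))
    s22 = sum(v * v for v in x2)
    s1y = sum(u * y for u, y in zip(x1, p_elec))
    s2y = sum(v * y for v, y in zip(x2, p_elec))
    det = s11 * s22 - s12 ** 2
    a = (s1y * s22 - s2y * s12) / det
    b = (s2y * s11 - s1y * s12) / det
    return a, b
```

Since the coefficient a absorbs the module area and rated efficiency, its magnitude depends on the units chosen for Pelec and IT.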
The gray-box modeling approach retains some of the physics-based interactions while allowing regression-based adjustments to the model form and associated model coefficients when measured data is available. This approach is perhaps the most widely used and most suitable for modeling actual PV systems and installations. The more detailed physics-based models meant for design studies make use of many of the parameters provided in the specification sheets. This level of detail is not well suited for field applications since model calibration at various stages is required, which is somewhat tedious and requires user experience.

(b) Gray-Box Model Based on Empirical Cell Temperature Model
Instead of using measured solar cell or module temperature, one could use an empirical model to first predict cell temperature, Tcell, and then use it for power prediction. A thermal energy balance on the module or array yields:

$$I_T \eta_n K_\eta = I_T \eta_{elec} + U \left(T_{cell} - T_a\right) \tag{10.8}$$

where U is the overall thermal heat loss coefficient of the module in W/m2 °C. It has been pointed out by some experimental studies that the wind velocity representative of the value impacting module heat loss is not a well-defined variable and that using actual values may not yield more accurate models (PVSyst 2012). Several studies present detailed heat transfer equations to model the heat loss coefficient involving forced and natural convection effects as well as radiation heat loss (e.g., Kaplani and Kaplanis 2014), but these are not suitable for field modeling.

Several simplified approaches to predict cell temperature Tcell have been proposed based on the observation that the difference between cell and ambient air temperatures varies linearly with solar irradiation. One such formulation is that by Kurtz et al. (2009):

$$T_{cell} = T_a + I_T \exp\left(a + b v_{wind}\right) \tag{10.9}$$

where a = -3.473 and b = -0.0594. Alternatively, a simpler black-box model was also found to capture the measured electric output of 22 PV systems of six different cell types monitored in the Phoenix area (Mani et al. 2003):

$$T_{cell} = a + b I_T + c T_a + d v_{wind} \tag{10.10}$$

with four coefficients to be determined by regression.

Perhaps the most popular (but not very accurate) forward model to predict Tcell is based on the nominal operating cell temperature (NOCT) value (Myers et al. 2004). The NOCT is a measured performance value of the actual solar PV module under four standard reference climatic conditions: solar irradiation on module of 800 W/m2, ambient temperature of 20 °C, wind velocity of 1 m/s, and PV module tilt angle of 45°. The numerical value is provided by the manufacturer as part of the specification sheet. The cell temperature is then modeled as:

$$T_{cell} = T_a + \frac{NOCT - 20\,^{\circ}\mathrm{C}}{800\ \mathrm{W/m^2}} \times I_T \tag{10.11}$$

If NOCT data is not available or if monitored performance data is available, the model can be recast as an inverse model whose parameters can be identified by regression from:

$$T_{cell} - T_a = a I_T \tag{10.12}$$

This is clearly a very simplified version of Eq. 10.10 with coefficients c = 1, a = 0, and d = 0 (the coefficient b of Eq. 10.10 then plays the role of a in Eq. 10.12). The above brief description of inverse models for solar PV systems will be needed to solve Problem 10.6, which requires that these equations be evaluated against monitored data provided in the problem.
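The cell temperature models of Eqs. 10.9, 10.11, and 10.12 are simple enough to be sketched in a few lines of Python. The function names are hypothetical; the default coefficients in the first function are the values quoted above for the Kurtz et al. (2009) formulation, and any NOCT value passed to the second function would come from the manufacturer's specification sheet.

```python
import math


def tcell_kurtz(t_a, i_t, v_wind, a=-3.473, b=-0.0594):
    """Eq. 10.9: cell temperature from ambient temp., irradiance and wind."""
    return t_a + i_t * math.exp(a + b * v_wind)


def tcell_noct(t_a, i_t, noct):
    """Eq. 10.11: forward model based on the NOCT value (degC)."""
    return t_a + (noct - 20.0) / 800.0 * i_t


def fit_eq_10_12(i_t, t_cell, t_a):
    """Eq. 10.12: least-squares slope a of (T_cell - T_a) vs I_T through the origin."""
    num = sum(it * (tc - ta) for it, tc, ta in zip(i_t, t_cell, t_a))
    den = sum(it ** 2 for it in i_t)
    return num / den
```

The predicted cell temperature can then be substituted into Eq. 10.7 to estimate power output when no temperature sensor is installed on the module.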
Fig. 10.3 Sketch of a large flooded-type chiller with the two water flow loops indicating various important variables needed for performance modeling
10.2.3 Gray-Box and Black-Box Models for Water-Cooled Chillers

This section presents yet another example of gray-box and black-box inverse models proposed in the literature to model the performance of electric water-cooled chillers.

(a) Gray-Box Models
The universal thermodynamic model proposed by Gordon and Ng (2000) is a good example of a gray-box model involving recombination of the basic measured variables. The GN model is a generic model for chiller performance derived from thermodynamic principles and linearized heat losses. The model expresses the dependent chiller COP (defined as the ratio of the chiller (or evaporator) thermal cooling capacity Qch to the electrical power Pcomp consumed by the chiller (or compressor)) in terms of specially chosen independent (and easily measurable) parameters such as the fluid (water or air) inlet temperature to the condenser Tcdi, the fluid temperature leaving the evaporator (or the chilled water return temperature from the building) Tcho, and the thermal cooling capacity of the evaporator (refer to Fig. 10.3). The GN model for constant fluid flow rates at the condenser and the evaporator reduces to a three-parameter model which for parameter identification takes the following form:

$$\left(\frac{1}{COP} + 1\right) \frac{T_{cho}}{T_{cdi}} - 1 = a_1 \frac{T_{cho}}{Q_{ch}} + a_2 \frac{\left(T_{cdi} - T_{cho}\right)}{T_{cdi} Q_{ch}} + a_3 \frac{\left(1/COP + 1\right) Q_{ch}}{T_{cdi}} \tag{10.13}$$

where the temperatures are in absolute units, and the parameters of the model can be associated with physical quantities:
a1 = Δs, the total entropy production rate in the chiller due to internal irreversibilities,
a2 = Qleak, the rate of heat losses (or gains) from (or in to) the chiller,
a3 = R, the total heat exchanger thermal resistance, which represents the irreversibility due to finite-rate heat exchange:

$$a_3 = R = \frac{1}{(mCE)_{cond}} + \frac{1 - E_{evap}}{(mCE)_{evap}}$$

Also, m is the mass flow rate, C the specific heat of water, and E the heat exchanger effectiveness.

For regression, Eq. 10.13 can be expressed as a linear model without an intercept term:

$$y = a_1 x_1 + a_2 x_2 + a_3 x_3 \tag{10.14a}$$

where:

$$x_1 = \frac{T_{cho}}{Q_{ch}}, \quad x_2 = \frac{\left(T_{cdi} - T_{cho}\right)}{T_{cdi} Q_{ch}}, \quad x_3 = \frac{\left(1/COP + 1\right) Q_{ch}}{T_{cdi}} \quad \text{and} \quad y = \left(\frac{1}{COP} + 1\right) \frac{T_{cho}}{T_{cdi}} - 1 \tag{10.14b}$$

For the GN model with constant coolant flow rates, Eq. 10.13 can be rearranged to yield the following expression for the chiller electric power Pcomp:

$$P_{comp} = \frac{Q_{ch} \left(T_{cdi} - T_{cho}\right) + a_1 T_{cdi} T_{cho} + a_2 \left(T_{cdi} - T_{cho}\right) + a_3 Q_{ch}^2}{T_{cho} - a_3 Q_{ch}} \tag{10.15}$$

The model of Eq. 10.15 applies both to unitary and large chillers operating strictly under steady state conditions and for constant flow systems. The above model is linear, and this
is an obvious advantage during parameter estimation. However, the linearization is a result of a non-linear recombination of the basic variables, and some researchers argue that this would introduce larger uncertainty in these transformed variables due to error propagation and corrupt the parameter identification process. Evaluations by several researchers have shown this model to be very accurate for many chiller types and sizes (one such study is described in part (c) below). Recall that a field study based on the use of the GN model was presented in Sect. 5.9 as a case study example.

Although most commercial chillers are designed and installed to operate at constant coolant flow rates, variable condenser water flow operation (as well as variable evaporator flow rates) is being increasingly used to improve overall cooling plant efficiency, especially at low loads. To accurately correlate chiller model performance under variable condenser flow, an analytical model was also developed:

$$\frac{T_{cho} \left(1 + 1/COP\right)}{T_{cdi}} - 1 - \frac{1}{(V \rho C)_{cond}} \frac{\left(1/COP + 1\right) Q_{ch}}{T_{cdi}} = c_1 \frac{T_{cho}}{Q_{ch}} + c_2 \frac{T_{cdi} - T_{cho}}{Q_{ch} T_{cdi}} + c_3 \frac{Q_{ch} \left(1 + 1/COP\right)}{T_{cdi}} \tag{10.16a}$$

If one introduces

$$x_1 = \frac{T_{cho}}{Q_{ch}}, \quad x_2 = \frac{T_{cdi} - T_{cho}}{Q_{ch} T_{cdi}}, \quad x_3 = \frac{\left(1/COP + 1\right) Q_{ch}}{T_{cdi}} \quad \text{and} \quad y = \frac{T_{cho} \left(1/COP + 1\right)}{T_{cdi}} - 1 - \frac{1}{(V \rho C)_{cond}} \frac{\left(1/COP + 1\right) Q_{ch}}{T_{cdi}} \tag{10.16b}$$

where V, ρ, and C are the volumetric flow rate, the density, and the specific heat of the condenser water flow, Equation 10.16a for the case of variable condenser flow reduces to:

$$y = c_1 x_1 + c_2 x_2 + c_3 x_3 \tag{10.16c}$$

(b) Black-Box Models
Whereas the structure of a gray-box model, like the GN model, is determined from an analytical thermodynamic first-principles derivation, the black-box model is characterized as having no (or marginal) information about the physical problem incorporated in the model structure. The model is regarded as a black-box and describes an empirical relationship between input and output variables. This approach does not require reparameterization of the basic measured variables as is needed for the GN gray-box model. The commercially available DOE-2 building energy simulation model (DOE-2 1993) relies on the same parameters as those for the physical model but uses a second-order polynomial model (linear in its coefficients) instead. This “standard” empirical model (also called a multivariate polynomial linear model or MLR) has 10 coefficients which need to be identified from monitored data:

$$COP = b_0 + b_1 T_{cdi} + b_2 T_{cho} + b_3 Q_{ch} + b_4 T_{cdi}^2 + b_5 T_{cho}^2 + b_6 Q_{ch}^2 + b_7 T_{cdi} T_{cho} + b_8 T_{cdi} Q_{ch} + b_9 T_{cho} Q_{ch} \tag{10.17}$$

These coefficients, unlike the three coefficients appearing in the GN model, have no physical meaning and their magnitude cannot be interpreted in physical terms. Collinearity in regressors and ill-behaved residual behavior are also problematic issues. Usually, one needs to retain in the model only those parameters which are statistically significant, and this is best done by stepwise regression. The above background will be needed for Problems 10.8 and 10.9, which require that both these approaches be evaluated for performance data collected from a large centrifugal chiller.

(c) Physical Insights Provided by Gray-Box Models
Jiang and Reddy (2003) evaluated the applicability and advantages provided by the GN modeling approach using data from over 50 water cooled chiller data sets. Chiller types studied include single- and double-stage centrifugal chillers with inlet guide vane and variable speed drive (VSD) capacity control, screw, scroll, and reciprocating chillers, as well as two-stage absorption chillers. It was found that the fundamental GN formulation for all types of vapor compression chillers is excellent in terms of its predictive ability, yielding CV values in the range of 2-5%, which is comparable to the experimental uncertainty of many chiller performance data sets. Further, the study investigated whether there is a practical benefit in terms of the physical interpretation of the model coefficients (since these coefficients can be linked to thermal characteristics specific to the chiller components). Figure 10.4 shows sub-plots of how the three coefficients vary with centrifugal chiller size with two different types of capacity control. Some of the important observations were: (i) One detects a strong pattern in the numerical values of coefficient a3 for all chiller types; a decreasing asymptotic trend as the rated cooling capacity of the chiller increases (shown by trend line). Since coefficient a3 represents effective thermal resistance normalized by the cooling load for the combination of heat exchangers
Fig. 10.4 Scatter plot with standard errors of the regression coefficients (a1, a2 and a3) for centrifugal chillers versus chiller rated capacity using the GN model (Eq. 10.13). Curve 1 corresponds to single-stage and curve 2 to two-stage chillers (a) Centrifugal Chiller with Inlet Guide Vane Control, (b) Centrifugal Chiller with Variable-Speed Drive. (From Jiang and Reddy 2003)
in chillers, the trend seems to indicate that chiller manufacturers, in an effort to improve chiller efficiency by targeting the source of external irreversibility, find it expedient to reduce the load-normalized combined heat exchanger resistance only up to a certain size. For inlet guide vane centrifugal chillers, this limit is about 500 Tons from Fig. 10.4a frame 3, and for VSD chillers the limit is not reached even at 500 Tons (Fig. 10.4b frame 3). As an aside, for screw chillers, the asymptote was found by the same study to be about 300 Tons of refrigeration capacity. These limits are probably dictated by economics since heat exchanger performance (such as effectiveness, for example) follows the law of diminishing returns with increasing heat exchanger size or area. (ii) Clearly, there is no underlying pattern in the scatter of the numerical values of the coefficient a2 for VSD chillers (Fig. 10.4b frame 2). This is probably because this coefficient represents the heat loss (or heat gain) from the chiller, which is affected by refrigerant type (which dictates working temperatures) as well as practical constraints such as the amount of insulation, piping length, etc. However, trend lines for one-stage and two-stage chillers with inlet guide vane control are clearly seen; why this should be so for such chillers and not for VSD chillers is unclear. (iii) The numerical values of the coefficient a1 represent the internal entropy generation due in large part to
irreversibilities in the compression process. In this regard, it is clearly noted from Fig. 10.4 frame 1 that VSD is more efficient than inlet guide vane control with single-stage compression, while two-stage compression is even more efficient (such trends are indeed supported by physical insights). As expected, the numerical value for two-stage chillers (curve 2) was lower than one-stage (curve 1). Finally, the linear positive trend indicates that large chillers have higher total internal entropy production rate (due to friction losses in the compressor, piping, . . .) which is to be expected.
These trends may be useful to chiller manufacturers. For example, they would provide a simple and direct means of evaluating a particular prototype chiller under development against broad industry trends and specific competitors' chiller models. They could also potentially be used as generic patterns for fault detection and diagnosis.
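The variable transformation of Eq. 10.14b and the power expression of Eq. 10.15 can be sketched in Python as follows. This is only an illustration of the algebra; the function names and the numerical coefficient values used in the check are hypothetical and do not correspond to any chiller in the study cited above.

```python
def gn_regressors(t_cdi, t_cho, q_ch, cop):
    """Transformed variables of Eq. 10.14b (temperatures in kelvin)."""
    x1 = t_cho / q_ch
    x2 = (t_cdi - t_cho) / (t_cdi * q_ch)
    x3 = (1.0 / cop + 1.0) * q_ch / t_cdi
    y = (1.0 / cop + 1.0) * t_cho / t_cdi - 1.0
    return x1, x2, x3, y


def gn_compressor_power(t_cdi, t_cho, q_ch, a1, a2, a3):
    """Chiller electric power from Eq. 10.15 (constant coolant flow rates)."""
    numerator = (q_ch * (t_cdi - t_cho) + a1 * t_cdi * t_cho
                 + a2 * (t_cdi - t_cho) + a3 * q_ch ** 2)
    return numerator / (t_cho - a3 * q_ch)


# Consistency check with hypothetical coefficients: the power predicted
# by Eq. 10.15 must satisfy the linear form of Eq. 10.14a.
a1, a2, a3 = 0.05, 100.0, 0.002
t_cdi, t_cho, q_ch = 302.0, 280.0, 1000.0
p_comp = gn_compressor_power(t_cdi, t_cho, q_ch, a1, a2, a3)
x1, x2, x3, y = gn_regressors(t_cdi, t_cho, q_ch, cop=q_ch / p_comp)
residual = y - (a1 * x1 + a2 * x2 + a3 * x3)
```

With monitored data, the tuples returned by `gn_regressors` would form the rows of a no-intercept multiple linear regression from which a1, a2, and a3 are identified.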
10.2.4 Sequential Stagewise Regression and Selection of Data Windows

Another approach that offers the promise of identifying physically interpretable parameters of a linear model whose regressors are correlated is sequential regression. The identification of the model parameters is done in stages, which requires either the judicious selection of data windows in case of non-intrusive data collection (treated in this section), or else the adoption of a series of controlled experiments to identify different elements of the model sequentially (treated in Sect. 10.4.2). It is suitable for lower-order linear functions involving gray-box models and ought to be supported by some degree of physical understanding of the phenomenon being studied. Consider the following multivariate linear model with collinear regressors:

$$y = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p \tag{10.18}$$

The basic idea is to perform a simple regression with one regressor at a time, with the order in which they are selected depending on their correlation strength with the model residuals and consistent with physical domain knowledge. This strength is re-evaluated at each step. The overall methodology consists of the following steps (the selection of the sequence of variables should, as far as possible, be consistent with physical understanding):
(i) Compute the correlation coefficients of the response variable y against each of the regressors, and identify the strongest one, say xi.
(ii) Perform a simple OLS regression of y vs xi and compute the model residuals u. This becomes the new response variable.
(iii) From the remaining regressor variables, identify the one most strongly correlated with the new response variable. If this is represented by xj, then regress u vs xj and recompute the second-stage model residuals, which become the new response variable.
(iv) Repeat this process for as many remaining regressor variables as are significant.
(v) The final model is found by back-substituting or rearranging the terms of the final expression into the standard regression model form, i.e., with y on the left-hand side and the significant regressors on the right-hand side.

The basic difference between this method and the forward stepwise multiple regression method is that in the former the selection of which regressor to include in the second (and subsequent) stages depends on its correlation strength with the residuals of the model defined in the first stage. Stepwise regression selection is directly based on the correlation strength of the regressor with the response variable. Stagewise regression is said to be less precise, i.e., the model usually has larger RMSE; however, it minimizes, if not eliminates, the effect of correlation among variables. Simply interpreting the individual parameters of the final model as the relative influence which these have on the response variable is misleading since the regressors which enter the model earlier pick up more than their due share at the expense of those which enter later. Even though Draper and Smith (1981) do not recommend its use for typical problems since the true OLS is said to provide better overall prediction accuracy, this approach, under certain circumstances, can yield realistic and physically meaningful estimates of the individual parameters.
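Steps (i) to (v) of the stagewise procedure can be sketched in plain Python as shown below. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the threshold `min_abs_corr` is an assumed stand-in for a proper significance test on each newly entered regressor.

```python
def _mean(values):
    return sum(values) / len(values)


def pearson(x, y):
    """Pearson correlation coefficient; returns 0 if either series is constant."""
    mx, my = _mean(x), _mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def simple_ols(x, y):
    """Intercept and slope of a simple OLS regression of y on x."""
    mx, my = _mean(x), _mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def stagewise(columns, y, min_abs_corr=0.1):
    """Sequential stagewise regression.

    columns maps regressor names to lists of values. min_abs_corr is an
    assumed stopping threshold standing in for a significance test.
    Returns the combined intercept, the coefficients in order of entry,
    and the final residuals.
    """
    residual = list(y)
    remaining = dict(columns)
    intercept, coefs = 0.0, {}
    while remaining:
        # Pick the regressor most strongly correlated with current residuals
        name = max(remaining, key=lambda k: abs(pearson(remaining[k], residual)))
        if abs(pearson(remaining[name], residual)) < min_abs_corr:
            break
        x = remaining.pop(name)
        c0, c1 = simple_ols(x, residual)
        residual = [r - c0 - c1 * v for r, v in zip(residual, x)]
        intercept += c0          # back-substitution into the standard form
        coefs[name] = c1
    return intercept, coefs, residual
```

In an actual analysis the automatic choice made by `max` would be overridden whenever it conflicts with physical understanding of the process, as the text recommends.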
10.2.5 Case Study of Non-Intrusive Sequential Parameter Estimation for Building Energy Flows

The superiority of sequential stagewise regression as compared to OLS can be illustrated by a simulation case study example (taken from Reddy et al. 1999). Here, the intent was to estimate building and ventilation parameters of large commercial buildings from non-intrusive monitoring of their heating and cooling energy use from which the net load can be inferred. Since the study uses computer generated synthetic data (i.e., from a commercial detailed hourly building energy simulation software), one knows the “correct” values of the parameters in advance, which allows one to judge the accuracy of the estimation technique. The procedure involves first deducing a macro-model for the thermal loads of an ideal one-zone building suitable for use with monitored data, and then using a sequential linear regression approach to determine the model coefficients (along with their standard errors), which can be finally translated into estimates of the physical parameters (along with the associated errors). The evaluation was done for two different building geometries and building mass levels at two different climatic locations (Dallas, TX and Minneapolis, MN) using daily average and/or summed data to remove/minimize dynamic effects.

(a) Model Formulation
First, a steady-state model for the total heat loads (QB) was formulated in terms of variables that can be conveniently monitored. Building internal loads consist of lights and receptacle loads and occupant loads. Electricity used by lights and receptacles (qLR) inside a building can be conveniently measured. Heat gains from occupants consisting of both sensible and latent portions and other types of latent loads are not amenable to direct measurement and are, thus, usually estimated.
Since the schedule of lights and equipment closely follows that of building occupancy (especially at a daily time scale as presumed in this study), a convenient and logical manner to include the unmonitored sensible loads was to modify qLR by a constant multiplicative correction factor ks which accounts for the miscellaneous (i.e., unmeasurable) internal sensible loads. Also, a simple manner of treating internal latent loads was to introduce a
constant multiplicative factor kl defined as the ratio of internal latent load to the total internal sensible load (= ks qLR), which appears only when outdoor specific humidity w0 is larger than that of the conditioned space. Assuming the sign convention that energy flows are positive for heat gains and negative for heat losses, the following model was proposed:

$$Q_B = q_{LR} k_s \left(1 + k_l \delta\right) A + a'_{sol} + \left(b_{sol} + U A_S + m_v A c_p\right) \left(T_0 - T_z\right) + m_v A h_v \delta \left(w_0 - w_z\right) \tag{10.19}$$

where
A      Conditioned floor area of building
AS     Surface area of building
cp     Specific heat of air at constant pressure
hv     Heat of vaporization of water
kl     Ratio of internal latent loads to total internal sensible loads of building
ks     Multiplicative factor for converting qLR to total internal sensible loads
mv     Ventilation air flow rate per unit conditioned area
QB     Building thermal loads
qLR    Monitored electricity use per unit area of lights and receptacles inside the building
T0     Outdoor air dry-bulb temperature
Tz     Thermostat set point temperature
U      Overall building shell heat loss coefficient
w0     Specific humidity of outdoor air
wz     Specific humidity of air inside space
δ      An indicator variable which is 1 when w0 > wz and 0 otherwise
The effect of solar gains is linearized with outdoor temperature T0 and included in the terms a′sol and bsol. The expression for QB given by Eq. 10.19 includes six physical parameters: ks, kl, UAS, mv, Tz, and wz. One could proceed to estimate these parameters in several ways.

(b) One-Stage Regression Approach
One way to identify these parameters is to directly resort to OLS multiple linear regression provided monitored data of qLR, T0, and w0 is available. For such a scheme, it is more appropriate to combine solar gains into the loss coefficient U and rewrite Eq. 10.19 as:

$$Q_B / A = a + b q_{LR} + c \delta q_{LR} + d T_0 + e \delta \left(w_0 - w_z\right) \tag{10.20a}$$

where the regression coefficients are:

$$a = -\left(U A_S / A + m_v c_p\right) T_z, \quad b = k_s, \quad c = k_s k_l, \quad d = U A_S / A + m_v c_p, \quad e = m_v h_v \tag{10.20b}$$

Subsequently, the five physical parameters can be inferred from the regression coefficients as:

$$k_s = b, \quad k_l = c / b, \quad m_v = e / h_v, \quad U A_S / A = d - e c_p / h_v, \quad T_z = -a / d \tag{10.21}$$
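The mapping of Eq. 10.21 from regression coefficients to physical parameters is sketched below. The function name, dictionary keys, and the numerical values used in the round-trip check are hypothetical; cp and hv must be supplied in units consistent with those of the regression data.

```python
def physical_parameters(a, b, c, d, e, cp, hv):
    """Infer the five physical parameters of Eq. 10.21.

    a..e are the regression coefficients of Eq. 10.20a; cp is the specific
    heat of air and hv the heat of vaporization of water.
    """
    return {
        "ks": b,
        "kl": c / b,
        "mv": e / hv,
        "UAs_per_A": d - e * cp / hv,
        "Tz": -a / d,
    }


# Round-trip check with hypothetical "true" values: build the regression
# coefficients from Eq. 10.20b, then recover the parameters with Eq. 10.21.
ks, kl, ua, mv, tz = 1.2, 0.3, 2.0, 0.001, 22.0
cp, hv = 1006.0, 2.45e6           # J/(kg K) and J/kg, assumed units
d = ua + mv * cp
coeffs = dict(a=-d * tz, b=ks, c=ks * kl, d=d, e=mv * hv)
recovered = physical_parameters(cp=cp, hv=hv, **coeffs)
```

Because each physical parameter is a simple function of the coefficients, the standard errors reported by the regression can be carried through the same expressions using the propagation of errors formulae.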
The uncertainty associated with these physical parameters can be estimated from the classical propagation of errors formulae discussed in Sect. 3.7. The “best” value of building specific humidity wz could be determined by a search method: select the value of wz that yields the best goodness-of-fit to the data (i.e., highest R2 or lowest CV). Since wz has a more or less well-known range of variation, the search is not particularly difficult. Prior studies indicated that the optimal value has a broad minimum in the range of 0.009–0.011 kg/kg. Thus, the choice of wz was not a critical issue, and one could simply assume wz = 0.01 kg/kg without much error in subsequently estimating other parameters.

(c) Two-Stage Regression Approach
Earlier studies based on daily data from several buildings in central Texas indicate that for positive values of (w0 - wz) the variables (i) qLR and T0, (ii) qLR and (w0 - wz), and (iii) T0 and (w0 - wz) are strongly correlated and are likely to introduce bias in the estimation of parameters from OLS regression. It is the last set of variables which is usually the primary cause of uncertainty in the parameter estimation process. Two-stage regression involves separating the data set into two groups depending on δ being 0 or 1 (with wz assumed to be 0.01 kg/kg). During a two-month period under conditions of low outdoor humidity, δ = 0, and Eq. 10.20a reduces to

$$Q_B / A = a + b q_{LR} + d T_0 \tag{10.22}$$

Since qLR and T0 are usually poorly correlated under such low outdoor humidity conditions, the coefficients b and d deduced from multiple linear regression are likely to be unbiased. For the remaining year-long data when δ = 1, Eq. 10.20a can be re-written as:

$$Q_B / A = a + (b + c) q_{LR} + d T_0 + e \left(w_0 - w_z\right) \tag{10.23}$$
Now, there are two ways of proceeding. One variant is to use Eq. 10.23 as is, and determine coefficients a, (b + c), d, and e from multiple regression. The previous values of a and d determined from Eq. 10.22 are rejected, and the parameter b determined from Eq. 10.22 is retained along with a, c, d, and e determined from Eq. 10.23 for deducing the physical parameters. This approach, termed two-stage variant A, may, however, suffer from the collinearity effects between T0 and (w0 - wz). A second variant, termed two-stage variant B, would be to retain both coefficients b and d determined from Eq. 10.22 and use the following modified equation to determine a, c, and e from data when δ = 1:

$$Q_B / A - d T_0 = a + (b + c) q_{LR} + e \left(w_0 - w_z\right) \tag{10.24}$$

The collinearity effects between qLR and (w0 - wz) when δ = 1 are usually small, and this is likely to yield less biased parameter estimates than variant A.

(d) Multi-stage Regression Approach
The multi-stage approach involves four stages, i.e., four regressions are performed based on Eq. 10.20a as against only two in the two-stage approach (see Table 10.1). First, (QB/A) is regressed against qLR only, using a two-parameter (2-P) regression model to determine coefficient b. Next, the residual Y1 = (QB/A - b qLR) is regressed against T0 using either a two-parameter (2-P) model or a four-parameter (4-P) change point model (see Sect. 5.7.2 for explanation of these terms) to determine coefficient d. Next, for data when δ = 1, the new residual is regressed against qLR and (w0 - wz) in turn to obtain coefficients c and e, respectively. Note that this procedure does not allow coefficient a in Eq. 10.20a to be estimated, and so Tz cannot be identified. This, however, is not a serious limitation since the range of variation in Tz is fairly narrow for most commercial buildings. The results of evaluating whether this identification scheme is superior to the other two schemes are presented below.

Table 10.1 Stages involved in the multi-stage regression approach using Eq. 10.20a. Days when δ = 1 correspond to days with outdoor humidity higher than that indoors

Stage #   Dependent variable            Regressor variable   Type of regression model   Parameter identified from Eq. 10.20a   Data set used
1         QB/A                          qLR                  2-P                        b                                      Entire data
2         Y1 = QB/A - b qLR             T0                   2-P or 4-P                 d                                      Entire data
3         Y2 = QB/A - b qLR - d T0      qLR                  2-P                        c                                      Data when δ = 1
4         Y2 = QB/A - b qLR - d T0      (w0 - wz)            2-P or 4-P                 e                                      Data when δ = 1

(e) Analysis and Evaluation
A summary of how accurately the various parameter identification schemes (one-stage, two-stage variant A, two-stage variant B, and the multistage procedures) can identify or recover the “true” parameters is shown in Fig. 10.5. Note that simulation runs R1–R5 contain the influence of solar radiation on building loads, while the effect of this variable has been “disabled” in the remaining four computer simulation runs. The “true” values of each of the four parameters are indicated by a solid line, while the estimated parameters along with their standard errors are shown as small boxes. It is obvious that parameter identification is very poor for one-stage and two-stage procedures (runs R1, R2, and R3) while, except for (UAS/A), the three other
parameters are very accurately identified by the multistage procedure (run R4). Also noteworthy is the fact that the single-stage regressions to daily QB values for buildings B1 and B2 at Dallas and Minneapolis are excellent (R² in the range of 0.97–0.99). So, a model with a very high R² does not by itself assure accurate parameter estimation, though it seems to be a necessary condition for it. The remaining runs (R6–R9) do not include solar effects, and in such cases the multistage parameter identification scheme is accurate for both climatic types (Dallas and Minneapolis) and both building geometries (B1 and B2). From Fig. 10.5, it is seen that though there is no bias in estimating the parameter mv, there is larger uncertainty associated with this parameter than with the other four parameters. Finally, note that the bias in identifying (UAS/A) using the multistage approach when solar is present (R4 and R5) is not really an error; it simply means that the steady-state overall heat loss coefficient must be “modified” to implicitly account for solar interactions with the building envelope. A physical explanation as to why the multistage identification scheme is superior to the other schemes (especially the two-stage scheme) has to do with the cross-correlation of the regressor variables. Table 10.2 presents the correlation coefficients of the various variables, as well as variables Y1 and Y2 (defined in Table 10.1). Note that for both locations, qLR is the variable least correlated with QB as well as with the other regressor variables, because of the finite number of schedules under which the building is operated (five day-types in this case, such as weekdays, weekends, holidays, . . .). Hence, regressing QB with qLR is least likely to result in the regression coefficient of qLR (i.e., b in Eq. 10.20a) picking up the influence of other regressor variables, i.e., the bias in the estimation of b is likely to be minimized.
Had one adopted a scheme of regressing QB with T0 first, the correlation between QB and T0 as well as between T0 and (w0 - wz) for data when δ = 1 would result in coefficient d of Eq. 10.20a being assigned more than its due share of importance, thereby leading to a bias in the UAS value (see runs R1, R2, and R3 in Fig. 10.5) and, thus, underestimating kS. The regression of QB versus qLR for run R6 is shown in Fig. 10.6a. The second stage involves regressing the residual Y1 versus T0 because of the very strong correlation between
both variables (correlation coefficients of about 0.97–0.99, see Table 10.2). Equally good results were obtained by using stage 3 and stage 4 (see Table 10.1) in any order. Stage 3 (see Fig. 10.6c) allows identification of the regression coefficient c in Eq. 10.20a representing the building internal latent load, while stage 4 (i.e., coefficient e of Eq. 10.20a) identifies the corresponding regression coefficient associated with outdoor humidity (Fig. 10.6d). In conclusion, this case study illustrates how a multistage identification scheme has the potential to yield accurate parameter estimates by removing much of the bias introduced in a multiple linear regression approach with correlated regressor variables.

Fig. 10.5 Comparison of how the various estimation schemes (runs R1–R9) were able to recover the “true” values of the four physical parameters of the model given by Eq. 10.20a. The solid lines depict the correct value while the mean values estimated for the various parameters and their 95% uncertainty bands are shown as dotted lines. (From Reddy et al. 1999)

Table 10.2 Correlation coefficient matrix of various parameters for the two cities selected at the daily time scale for Runs #6 and #7 (R6 and R7). Entries above the diagonal are for Dallas and entries below the diagonal are for Minneapolis
             QB,1-zone   Y1      Y2      qLR     T0      W0z     qLR δ
QB,1-zone                0.88   -0.86    0.48    0.91    0.57    0.66
Y1             0.85             -0.82    0.01    0.97    0.59    0.60
Y2             0.52      0.78           -0.30   -0.93   -0.40   -0.48
qLR            0.53      0.00   -0.27            0.13    0.10    0.27
T0             0.88      0.97    0.59    0.11            0.54    0.58
W0z            0.72      0.80    0.68    0.07    0.75            0.72
qLR δ          0.82      0.70    0.44    0.42    0.72    0.66

Fig. 10.6 Different stages to estimate the four model parameters “b, c, d, e” following Eq. 10.20a as described in Table 10.1. (a) Estimation of parameter “b” (b) Estimation of parameter “d” (c) Estimation of parameter “c” (d) Estimation of parameter “e”
10.2.6 Application to Policy: Dose-Response
Example 1.5.2 in Chap. 1 discussed three methods of extrapolating dose-response curves down to low doses using observed laboratory tests performed at high doses. While the three types of models agree at high doses, they deviate substantially at low doses because the models are functionally different (Fig. 1.17). Further, such tests are done on laboratory animals, and how well they reflect actual human response is also suspect. In such cases, model selection is based more on policy decisions than on how well a model fits the data. This aspect is illustrated below using gray-box models based on simplified but phenomenological assumptions of how biological cells become cancerous. This section discusses the application of inverse models to modeling the risk to humans exposed to toxins. Toxins are biological poisons usually produced by bacteria or fungi under adverse conditions such as shortage of nutrients, water, or space. They are, generally, extremely deadly even in small doses. Dose is the variable describing the total mass of toxin which the human body ingests (either by inhalation or by food/water intake) and is a function of the toxin concentration and duration of exposure (some models are based on the rate of ingestion, not simply the dose amount). Response is the measurable physiological change in the body produced by the toxin, which has many manifestations, but here the focus will be on human cells becoming cancerous. Since different humans (and test animals) react differently to the same dose, the response is often interpreted as a probability of cancer being induced, which can be interpreted as a risk. Responses may have either no or often small threshold values to the injected dose coupled with linear or non-linear behavior (see Fig. 1.17). Dose-response curves passing through the origin are considered to apply to carcinogens, and models have been suggested to describe their behavior. The risk or probability of infection from a toxic agent with time-variant concentration C(t) from times t1 to t2 is provided by Haber’s law:

$$R(C,t) = k \int_{t_1}^{t_2} C(t)\,dt \tag{10.25a}$$

where k is a proportionality constant of the specific toxin and is representative of the slope of the curve. The non-linear dose-response behavior is modeled using the toxic load equation:

$$R(C,t) = k \int_{t_1}^{t_2} C^{n}(t)\,dt \tag{10.25b}$$

where n is the toxic load exponent and depends on the toxin. The value of n generally varies between 2.00 and 2.75. The implication of n = 2 is that if a given concentration is doubled with the exposure time remaining unaltered, the response increases fourfold (and not by twice as predicted by a linear model).
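
As a hedged illustration of Eqs. 10.25a and 10.25b, the sketch below evaluates the integral numerically with the trapezoidal rule. The function name `toxic_load`, the constant concentration profile, and the value k = 1 are assumptions chosen only for illustration; they do not come from any particular toxin data set.

```python
import numpy as np

def toxic_load(conc, t, k=1.0, n=1.0):
    """Evaluate R = k * integral of C(t)**n dt with the trapezoidal rule.

    n = 1 corresponds to Haber's law (Eq. 10.25a); n > 1 to the toxic
    load equation (Eq. 10.25b). k and n are illustrative assumptions.
    """
    conc = np.asarray(conc, dtype=float)
    t = np.asarray(t, dtype=float)
    integrand = conc ** n
    # Trapezoidal rule written out explicitly
    return k * float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t)))

# Hypothetical exposure: constant concentration over 10 time units
t = np.linspace(0.0, 10.0, 101)
c_base = np.full_like(t, 2.0)
c_double = 2.0 * c_base  # same exposure time, doubled concentration

ratio_linear = toxic_load(c_double, t, n=1.0) / toxic_load(c_base, t, n=1.0)
ratio_n2 = toxic_load(c_double, t, n=2.0) / toxic_load(c_base, t, n=2.0)
print(ratio_linear, ratio_n2)
```

With the exposure time unchanged, doubling the concentration doubles the response for n = 1 and quadruples it for n = 2, which is the behavior described in the text.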
Fig. 10.8 Two very different models of dose-response behavior with different phenomenological basis. The range of interest is well below the range where data actually exist. (From Crump 1984)
Fig. 10.7 Gray-box models for dose-response are based on different phenomenological presumptions of how human cells react to exposure. (From Kammen and Hassenzahl 1999 by permission of Princeton University Press)
The above models are somewhat empirical (or black-box) and are useful as performance models. However, they provide little understanding of, or insight into, the basic process itself. Gray-box models based on simplified but phenomenological considerations of how biological cells become cancerous have also been proposed. Though other model types have also been suggested (such as the probit and logistic models discussed in Sect. 9.4.5), the Poisson distribution (Sect. 2.4.2) is appropriate since it describes the number of occurrences of isolated independent events when the probability of a single outcome occurring over a very short time period is proportional to the length of the time interval. The process by which a tumor spreads in a body has been modeled by multistage multi-hit models, of which the simpler ones are shown in Fig. 10.7. The probability of getting a hit (i.e., a cancerous cell coming in contact with a normal cell) in an n-hit model is proportional to the dose level and the number n of hits necessary to cause the onset of cancer. Hence, in a one-hit model, one contact is enough to cause a toxic response in the cell with a known probability; in a two-hit model, the two hits are treated as random independent events, so that the probabilities of the response cumulate accordingly. The two-stage model looks superficially similar to the two-hit model but is based on a distinctly different phenomenological premise. The process is treated as one where the cell goes through a number of distinct stages, with each stage leading it gradually towards becoming cancerous by disabling certain specific
functions of the cell (such as tumor suppression capability). This results in the dose effects of each successive hit cumulating non-linearly and exhibiting an upward-curving function. Thus, the multistage process is modeled as one where the accumulation of dose is not linear but includes historic information of past hits in a non-linear manner (see Fig. 10.8). Following Masters and Ela (2008), the one-hit model expresses the probability of cancer onset P(d) as:

$$P(d) = 1 - \exp[-(q_0 + q_1 d)] \tag{10.26}$$

where d is the dose, and q0 and q1 are empirical best-fit parameters. However, cancer can be induced by other causes as well (called “background” causes). Let P(0) be the background rate of cancer incidence corresponding to d = 0. Since exp(x) ≅ (1 + x) for small x, it follows from Eq. 10.26 that P(0) is:

$$P(0) \cong [1 - (1 - q_0)] \cong q_0 \tag{10.27}$$

Thus, model coefficient q0 can be interpreted as the background risk. Hence, the lifetime probability of getting cancer from small doses is a linear model given by:

$$P(d) \cong 1 - [1 - (q_0 + q_1 d)] = q_0 + q_1 d = P(0) + q_1 d \tag{10.28}$$

In a similar fashion, the multistage model of order m takes the form:

$$P(d) = 1 - \exp[-(q_0 + q_1 d + q_2 d^2 + \ldots + q_m d^m)] \tag{10.29}$$

Though the model structure looks empirical superficially, gray-box models allow some interpretation of the model coefficients. Note that the one-hit and the multistage models
are linear only in the low-dose region (which is the region to which most humans will be exposed but where no data are available) but exponential over the entire range. Figure 10.8 illustrates how these models (Eqs. 10.26 and 10.29) capture measured data and, more importantly, that there are differences of several orders of magnitude between these models when applied to the low-dosage region. The multistage model seems more accurate than the one-hit model as far as data fitting is concerned, and one’s preference would be to select the former. However, there is some measure of uncertainty in these models when extrapolated downwards, and further, there is no scientific evidence which indicates that one is better than the other in capturing the basic process. The U.S. Environmental Protection Agency has chosen the one-hit model since it is much more conservative (i.e., predicts higher risk for the same dose value) than the other at lower doses. This was a deliberate choice in view of the lack of scientific evidence in favor of one model over the other. This example discussed one type of problem where inverse methods are used for decision making. Instances when the scientific basis is poor and when one must extrapolate the model well beyond the range over which it was developed qualify as one type of ill-defined problem. The mathematical form of the dose-response curve selected can provide widely different estimates of the risk at low doses. The lack of a scientific basis and the need to be conservative in drafting associated policy measures led to the selection of a model which is probably less accurate in how it fits the data collected but was deemed preferable for its final intended purpose.
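
The low-dose behavior of Eqs. 10.26 to 10.29 can be checked numerically. The following is a minimal sketch in which all coefficient values (q0, q1, q2) and dose levels are hypothetical, chosen only so that the two model forms give the same added risk at d = 1; they are not fitted to any data.

```python
import math

def one_hit(d, q0, q1):
    """One-hit model, Eq. 10.26."""
    return 1.0 - math.exp(-(q0 + q1 * d))

def multistage(d, q):
    """Multistage model of order m, Eq. 10.29, with q = [q0, q1, ..., qm]."""
    return 1.0 - math.exp(-sum(qi * d**i for i, qi in enumerate(q)))

# Hypothetical coefficients (assumptions for illustration only)
q0, q1 = 0.01, 0.05

# Low-dose linearization of the one-hit model, Eq. 10.28
d_low = 0.001
exact = one_hit(d_low, q0, q1)
linear = q0 + q1 * d_low

# A purely quadratic multistage model with the same exponent at d = 1
q2 = q1
d_test = 0.01
excess_one_hit = one_hit(d_test, q0, q1) - one_hit(0.0, q0, q1)
excess_multistage = multistage(d_test, [q0, 0.0, q2]) - multistage(0.0, [q0, 0.0, q2])
print(exact, linear, excess_one_hit, excess_multistage)
```

Under these assumed coefficients the one-hit model predicts an added risk at d = 0.01 roughly two orders of magnitude larger than the quadratic multistage form, which illustrates why the one-hit choice is the more conservative one at low doses.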
10.3 Certain Aspects of Data Collection
10.3.1 Types of Data Collection
As stated in Sect. 6.1, the two common methods of collecting data are (i) design of experiments (DOE), which are often performed in a laboratory setting or under controlled in-situ conditions so as to identify models and parameters as robustly as possible (see Chap. 6), and (ii) observational or non-intrusive data collection. The latter is associated with systems under their normal operation (usually under field conditions) and subject to whatever random and natural stimuli perturb them. Various aspects related to planning an experiment were discussed in Sect. 3.8. Such system performance data have several characteristics significantly different from data collected under DOE. Measurements made in the field have higher errors and uncertainty than those made under controlled laboratory conditions due to factors such as abruptly varying forcing functions, cheaper instrumentation and data collection systems, and constraints in proper placement and calibration of instrumentation. Further, field monitored data is highly repetitive in nature, and its
impact on inverse modeling is discussed in the next sub-section. Further, one can distinguish between two types of model and parameter identification techniques:
(i) Offline or batch identification, where data collection and model identification are successive, i.e., data collection is done first, and the data are analyzed and a model identified later using the entire data stream. There are different ways of processing the data, from the simplest conventional OLS and time series techniques to more advanced ones (some of which are described in Chaps. 8 and 9 and in this chapter).
(ii) Online or real-time identification (also called adaptive estimation or recursive identification), where the model parameters are identified in a continuous or semicontinuous manner during the operation of the system as more data is forthcoming. The term “recursive” applies when data processing is done several times to gradually improve the accuracy of the estimates as more data comes in.
The basic distinction between online and offline identification is that in the former case the new data is used to correct or update the existing estimates without having to re-analyze the data set from the beginning. There are three disadvantages to online identification in contrast to offline identification. One is that the model structure needs to be known before identification is initiated, while in the offline situation different types of models can be tried out separately. The second disadvantage is that, with few exceptions, online identification does not yield parameter estimates as accurate as do offline methods (especially with relatively short data records). The third is related to the speed of computation, which depends on the sampling frequency versus the time needed by the online algorithm to complete the necessary updating. However, there are also advantages in using online identification. One can discard old data and only retain the model parameters. Another advantage is that corrective action, if necessary, can be taken in real time. Online identification is, thus, of critical importance in certain fields involving fast reaction via adaptive control of critical systems, digital telecommunication, and signal processing. The interested reader can refer to numerous textbooks on this subject (e.g., Sinha and Kuszta 1983).
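
To make the distinction between batch and online identification concrete, the sketch below implements recursive least squares (RLS), one common online scheme, and compares it with a batch OLS fit. RLS is used here as an illustrative choice and is not singled out by the text; the synthetic data, the function name `rls_update`, and the initial covariance scale are all assumptions.

```python
import numpy as np

def rls_update(theta, P, x, y, forgetting=1.0):
    """One recursive least-squares step for the model y = x @ theta + noise.

    theta: current parameter estimates, P: current (scaled) covariance,
    x: new regressor vector, y: new scalar observation.
    """
    Px = P @ x
    gain = Px / (forgetting + x @ Px)
    theta = theta + gain * (y - x @ theta)       # correct estimate with new datum
    P = (P - np.outer(gain, Px)) / forgetting    # update covariance (P is symmetric)
    return theta, P

# Synthetic data (hypothetical): y = 1 + 2*x + noise
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_theta = np.array([1.0, 2.0])
y = X @ true_theta + 0.1 * rng.normal(size=n)

# Online identification: only theta and P are retained between observations
theta = np.zeros(2)
P = 1e6 * np.eye(2)  # large initial covariance = vague prior knowledge
for xi, yi in zip(X, y):
    theta, P = rls_update(theta, P, xi, yi)

# Offline (batch) identification on the entire record
theta_batch = np.linalg.lstsq(X, y, rcond=None)[0]
print(theta, theta_batch)
```

Note that the loop never revisits earlier observations, which is the property that allows old data to be discarded once the estimates have been updated.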
10.3.2 Measures of Information Content
Information technology distinguishes between “message” and “information” (Cyganski and Orr 2001). While the former is the content of the transmitted data, the latter represents the conveyed knowledge that is in some way useful and
was not known previously. The concept of quantifying the information content of data is not new, nor is it limited to one discipline. Broadly speaking, it is an “attempt to measure (i.e., quantify or enumerate) the valuable information in a set or stream of data” (Neumann and Krawczyk 2001). It finds application in areas such as information theory, statistics, and applied science, depending on the intent of the stipulated problem. The basic underlying premise is that information is gained when uncertainty is reduced. How this concept is applied practically in two different instances is illustrated below (from Reddy et al. 2003).
(a) More Accurate Parameter Estimation During Planned Experiments
In information theory (IT), the expected value of the information in a sample taken from a distribution (called the entropy) is the average number of bits required to describe the sample (Gershenfeld 1999). The entropy is a maximum if the distribution is flat (since one knows nothing about the next point) and a minimum if the distribution is sharply peaked. In the design of experiments, whether a datum contains useful information or not is judged by whether it reduces the uncertainty in the parameter estimate vector b. Thus, the information content in the context of linear multivariate models can be considered to be a function of the matrix $(X^T X)$ following Eq. (5.36). More specifically, minimizing $\det[(X^T X)^{-1}]$, or alternatively, minimizing $\ln\{\det[(X^T X)^{-1}]\}$, also minimizes the uncertainty of the parameter estimator vector b. These measures are called “projections” in the IT literature, and though some information about specific parameter estimates can be lost, they provide a better overall measure of uncertainty (Beck and Arnold 1977). The analyst can thus attempt to tailor the experimental sequence so that a minimum number of experiments can provide the necessary accuracy in the model estimates.
(b) Initial Training Data During Non-intrusive Field Monitoring
What is the value of such measures of information in the context of non-intrusive field-monitored data? One advantage is that instead of tracking the uncertainty of the full set of model parameter estimates individually as more data is forthcoming (which is tedious), it is far simpler, both computationally and for decision making, to track only one overall measure. A simple measure of information is the trace of a matrix (see any pertinent statistical textbook):

$$I_1 = \mathrm{trace}\left[(X^T X)^{-1}\right] \tag{10.30}$$

Since the trace of a matrix is equal to the sum of the diagonal elements, I1 can be interpreted as the sum of the parameter variances up to a scaling factor. There are pros and cons in using this measure of information. A drawback is that the measure does not consider the correlation between the parameter estimates (i.e., the elements outside the diagonal of $(X^T X)^{-1}$) and can be misleading if the parameters are seriously correlated. Second, if the variances of the different regressors are of different magnitude, the measure might be totally dominated by a single variance. Another measure of information, as discussed earlier, is the log of the mean of the determinant:

$$I_2 = -\ln\left[\det\left(\frac{X^T X}{m}\right)\right] \tag{10.31}$$

where m is the number of observations used to calculate $(X^T X)$. I2 measures the average amount of information and is often used in experimental design to determine optimal input sequences with respect to a given model structure. Although the input sequences cannot be controlled for field monitored data, the measure still provides an advantage in that it allows one to compare the amount of information in different datasets and may serve as a guide for the length of the initial training period required. Measure I2 is generally better than I1 since it also accounts for the covariances as well as the variances of the parameters. The case study example presented below illustrates this concept. Note that the measures I1 and I2 do not contain any indication of the extent to which the regressor set and the response are correlated, since only the former is used. This is clearly a deficiency, and so another measure which includes this type of information would be advantageous. Such a measure is the mutual information measure (Gershenfeld 1999), which is defined as the difference in the information between two samples taken independently and taken together. In other words, it is the amount of uncertainty in system output that is resolved by observing the system inputs. The mutual information matrix is useful in determining (i) the right inputs to a model, and (ii) what transformation of inputs is optimal. In this regard this measure has been used for non-linear black-box modeling. It is calculated following:

$$I(X, y) = \sum_{X}\sum_{y} p(X, y)\,\ln\frac{p(X, y)}{p(X)\,p(y)} \tag{10.32}$$

where the summation is over the discrete values of X and y respectively, p(X,y) is the joint probability mass function of X and y, and p(y) and p(X) are the probability mass functions of y and X respectively. The mutual information measure seems more suitable for experimental design or for offline parameter estimation since it requires prior knowledge of joint probabilities. This measure has been built on the concept of entropy introduced by Shannon (discussed in Sect. 9.5.1 in the framework of detecting non-linear correlation between variables).
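
Equation 10.32 can be evaluated directly once a joint probability mass function is available. The sketch below assumes the joint distribution is supplied as a two-dimensional table (rows for the discrete values of X, columns for those of y); the function name and the two small example tables are hypothetical.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a discrete joint pmf, Eq. 10.32.

    joint[i, j] is p(X = x_i, y = y_j); the table is normalized to sum to 1.
    """
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    px = joint.sum(axis=1, keepdims=True)   # marginal p(X), shape (r, 1)
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, c)
    product = px @ py                       # p(X) * p(y) for every cell
    mask = joint > 0                        # terms with p = 0 contribute nothing
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))

# Independent variables: observing X resolves none of the uncertainty in y
mi_indep = mutual_information(np.outer([0.5, 0.5], [0.3, 0.7]))

# Perfectly dependent binary variables: observing X resolves all of it
mi_dep = mutual_information([[0.5, 0.0], [0.0, 0.5]])
print(mi_indep, mi_dep)
```

The independent case returns zero (to rounding), while the perfectly dependent binary case returns ln 2, the full entropy of y.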
Example 10.3.1 Measures of information content in field data collected from a field chiller
The usefulness of the measures of information presented above is discussed in the framework of field monitored data of a centrifugal chiller located in Toronto, Canada (taken from Reddy et al. 2003 and Reddy and Andersen 2002). This chiller has been monitored for over 5 months from June till October, yielding 810 hourly data points suitable for analysis. The relevant background, nomenclature, and gray-box and black-box modeling equations for chiller systems are described in Sects. 9.3.4 and 10.2.3. The monitored variables are: (i) thermal cooling capacity Qch in kWt, (ii) supply chilled water temperature Tchi in K, (iii) condenser water supply temperature Tcdi in K, and (iv) electric power consumed (P) in kW. From the time series plots of the four variables shown in Fig. 10.9, there is relatively little variation in the two temperature variables, while the load and power experience important variations. The extent to which field monitored data is inherently repetitive in information content will be illustrated below; in other words, more field data does not necessarily mean more information.

Fig. 10.9 Time series data of the four measured variables for the Toronto chiller (Example 10.3.1). Tchi is inlet water temperature to evaporator (K), Tcdi inlet water temperature to condenser (K), P is the electric power consumed by chiller (kW), and Qch is the chiller thermal load (kW)

(a) Stationary Statistics
One intuitive and simple way is to look at the histograms of the regressor variables since this would provide an indication of the variability in operating conditions of the chiller. Uniform distributions of the regressors would indicate good coverage of chiller operating conditions, and vice versa. Figure 10.10 depicts such histograms for the three variables Tcdi, Tchi, and Qch. One notes that Qch values are fairly well distributed, while those for the two temperatures are not. For example, there are only a couple of data points for Tcdi < 27.5 °C. These points are likely to be influence points (Cook and Weisberg 1982), and whether these reflect actual operating conditions or are a result of either erroneous data or uncharacteristic chiller operation must be first ascertained from physical (as against statistical) considerations. Additional insight into the extent to which the collected data is repetitive in nature can be gleaned by studying the joint occurrences. Table 10.3 shows these values for four bins of Tchi and Qch each and for two bins of Tcdi (which exhibits the least variability). It is obvious that there is great variability in how the individual bins are populated. For example, for the higher Tcdi bin, there are a large number of occurrences (95 and 106) in the middle two Qch bins at the two extreme Tchi bins. The extent to which a new datum point is bringing in additional information can be evaluated based on the number of data points already present in that particular bin. Ideally, population uniformity across bins, i.e., having a certain number of occurrences within each bin (say, 4–5 in order to account for random noise in the data), and no more, would be a desirable situation offering sound parameter estimation with a minimum of data points.

Fig. 10.10 Histogram of number of occurrences for the three measured variables using the hourly Toronto chiller data (total of 810 data points)

Table 10.3 Number of joint occurrences for the Toronto chiller data (total of 810 hourly data points) for two bins of condenser water inlet temperature Tcdi
                              Tchi (°C)
Qch (kW)        8.8–10.8   10.8–12.7   12.7–16.6   16.6–20.5
23.4 °C < Tcdi < 28.8 °C
517–797           111          0           0           0
797–1077           84         39          39           3
1077–1424           9         18          31          22
1424–1771           0         16           2           0
28.8 °C < Tcdi < 30.5 °C
517–797             7          0           0           0
797–1077           95          5          11           6
1077–1424          28         60          61         106
1424–1771           0         26          27           4

(b) Effect of Autocorrelation in Variables
Another characteristic of field monitored data is the effect of autocorrelation of the forcing variables. Diurnal and seasonal trends due to operational and climatic influences result in strongly patterned behavior, which also reduces the information value of collected data. The GN chiller model (Sect. 10.2.3) can be linearized by a transformation of variables and reduces to a linear model with three parameters. Specifically, one can observe how the parameter estimates and their associated uncertainty bands converge as the length of data collection is increased. Model robustness in terms of parameter variances can be investigated using resampling techniques (see Sect. 5.8). Specifically, the following computational scheme can be adopted:
(i) A sliding window length of m data points is selected starting at point m0;
(ii) The GN parameter estimates are deduced for this data set;
(iii) Steps (i) and (ii) are repeated a large number of times, specifically 400 times in this analysis, with the starting point m0 being moved from 1 to 400;
(iv) Mean values of the parameter estimates and their 2.5% and 97.5% percentiles are calculated from the 400 sets pertaining to window length m;
(v) Steps (i) to (iv) are repeated by incrementally changing the window length from m = 20 to m = 400.
Figure 10.11 summarizes the results of such an analysis. One notes that proper convergence along with acceptable 2.5 and 97.5 percentiles seems to require close to 300 data points for the GN model parameters, largely because of the serial correlation present. Suppose now that the same procedure is applied, but to m data points no longer taken sequentially but drawn randomly with replacement from the entire data set of 810 points (this is the bootstrap method, Sect. 5.8.3). This would remove much of the autocorrelation in the data. Inspection of Fig. 10.12 leads to a completely different conclusion. The number of observations (from 20 to 400 observations) seems to have no effect on the mean values of the model parameters nor on their variance (indicated by the 2.5 and 97.5 percentiles). In other words, using about 20 independent samples is just as good in terms of variance of parameter estimates as using 300 data points monitored continuously! The above conclusion can be confirmed in a simple manner. From statistical sampling theory, the number of independent observations n' among n observations with constant variance but having a one-lag autocorrelation ρ is equal to:

$$n' = n\,\frac{1 - \rho}{1 + \rho} \tag{10.33}$$

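
The sliding-window and bootstrap schemes described above, together with Eq. 10.33, can be sketched as follows. This is only an illustration on synthetic data: the chiller measurements and the GN model are not reproduced, so a two-parameter linear model driven by an AR(1) regressor stands in for them. All names, the AR(1) coefficient, the noise level, and the window length m = 30 are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic autocorrelated regressor (hypothetical stand-in for field data)
n, rho = 810, 0.9
x = np.empty(n)
x[0] = rng.normal()
for i in range(1, n):
    x[i] = rho * x[i - 1] + np.sqrt(1.0 - rho**2) * rng.normal()
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=n)

def ols(xs, ys):
    """Intercept and slope by ordinary least squares."""
    A = np.column_stack([np.ones_like(xs), xs])
    return np.linalg.lstsq(A, ys, rcond=None)[0]

def sliding_window_estimates(m, n_starts=400):
    """Steps (i)-(iii): sequential windows of length m, start moved 400 times."""
    return np.array([ols(x[s:s + m], y[s:s + m]) for s in range(n_starts)])

def bootstrap_estimates(m, n_draws=400):
    """Same scheme, but each sample of m points is drawn with replacement."""
    out = []
    for _ in range(n_draws):
        idx = rng.integers(0, n, size=m)
        out.append(ols(x[idx], y[idx]))
    return np.array(out)

def summarize(est):
    """Step (iv): mean and 2.5 / 97.5 percentiles of each parameter."""
    return est.mean(axis=0), np.percentile(est, [2.5, 97.5], axis=0)

def effective_n(n_obs, rho1):
    """Number of independent observations, Eq. 10.33."""
    return n_obs * (1.0 - rho1) / (1.0 + rho1)

m = 30
seq_mean, seq_band = summarize(sliding_window_estimates(m))
boot_mean, boot_band = summarize(bootstrap_estimates(m))
n_eff = effective_n(300, 0.9)
print(seq_band[:, 1], boot_band[:, 1], n_eff)
```

Step (v) would simply repeat the last four lines for a range of window lengths m and plot the means and percentile bands against m.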
Fig. 10.11 Convergence properties for the three parameter estimates of the GN model applied to chiller data in the sequence in which the data was collected. For each plot, the middle line is the calculated mean and the upper and lower lines correspond to the 97.5 and 2.5 percentiles. The x-axis shows the window size (i.e., number of data points), and the y-axis the parameter estimates b1, b2, and b3

Fig. 10.12 Same as Fig. 10.11 but when bootstrap resampling is applied, i.e., the samples are created randomly so as to remove the effect of serial autocorrelation

The one-lag serial correlation among the basic variables for the Toronto chiller data is about 0.90, which results in (n'/n) = 1/19. Thus, the number of independent observations n' from n = 300 observations would be equal to about (300/19) = 16, which is consistent with the above observation that about 20 independent samples are sufficient for robust model parameter estimation.
(c) Implications to Online Training
Since the chilled water flow rate is constant, the 810 observations of the three regressor variables [Tcdi, Tchi, Tcho], where Tcho is the return chilled water temperature, are used to calculate the measures of information I1 and I2 using Eqs. (10.30) and (10.31), respectively. The computation has been performed for three cases:
(i) Incremental window, i.e., an ever-increasing window length to mimic the way data will be collected in actuality. Specifically, the computation starts with an initial set of 10 data points, which is increased one observation at a time till the end of the data stream is reached (represented by the point n = 810).
(ii) Sliding window with 100 data points.
(iii) Sliding window of 200 data points.
Fig. 10.13 Plots illustrating how the two measures of information I1 (shown on a log scale) and I2 change as more data is incoming under both incremental and sliding window scenarios. The basic variable set [Tcdi, Tchi, Tcho] for the Toronto chiller has been used
How these two measures of information change with n under the above three cases is shown in Fig. 10.13. It is clear that both measures continuously decrease as more data is collected even though the data may be largely repetitive. However, the asymptotic nature of the curves indicates that there is a decreasing value to incoming data as n increases. One notes that the first 20 or 30 data points seem to bring in large amounts of “new” information, resulting in a steep initial drop in the plots, which gradually decreases in time. Also, to be noted is the fact that the I2 plot seems to asymptote a little quicker than the I1 plot. This is consistent with the earlier observation that this should be so when the regressor set is strongly correlated. As expected, a sliding window of 100 data points shows more variability than one with 200 data points and with the incremental window case, indicating the importance of gathering data for more than one month (or season) of the year. Thus, a window size of 100 data points would be an acceptable waiting period before initiating online tracking of the chiller.
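
The incremental and sliding-window calculations of I1 and I2 (Eqs. 10.30 and 10.31) can be sketched as below. The regressor matrix is synthetic: three independent, normally distributed columns with assumed means and spreads stand in for [Tcdi, Tchi, Tcho], so the curves will not reproduce Fig. 10.13. Function names are hypothetical.

```python
import numpy as np

def info_measures(X):
    """Return (I1, I2) for a regressor matrix X with m rows."""
    m = X.shape[0]
    xtx = X.T @ X
    i1 = float(np.trace(np.linalg.inv(xtx)))     # Eq. 10.30
    _, logdet = np.linalg.slogdet(xtx / m)        # numerically stable ln(det)
    return i1, -float(logdet)                     # Eq. 10.31

def incremental(X, start=10):
    """Case (i): ever-increasing window, one observation added at a time."""
    return np.array([info_measures(X[:k]) for k in range(start, X.shape[0] + 1)])

def sliding(X, width):
    """Cases (ii) and (iii): fixed-width window moved through the record."""
    return np.array([info_measures(X[s:s + width])
                     for s in range(X.shape[0] - width + 1)])

# Synthetic stand-in for the 810 hourly observations (assumed values)
rng = np.random.default_rng(2)
X_all = rng.normal(loc=[30.0, 10.0, 15.0], scale=[1.5, 2.0, 2.0], size=(810, 3))

inc = incremental(X_all)        # columns: I1, I2
win100 = sliding(X_all, 100)
win200 = sliding(X_all, 200)
print(inc[0], inc[-1])
```

For the incremental case, I1 can only decrease as rows are added, because each new observation adds a positive semi-definite term to $(X^T X)$; the rate of decrease is what indicates how much new information the incoming data carry.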
10.3.3 Functional Testing and Data Fusion
With engineered systems becoming increasingly complex, several innovative means have been developed to check the sensing hardware itself. One such technique is functional testing, a term often attributed to software program testing with the purpose of ensuring that the program works the way it was intended while being in conformance with the relevant industry standards. It is also being used in the context of testing the proper functioning of engineered systems which are controlled by embedded sensors. For example, consider a piece of control hardware whose purpose is to change the
operating conditions of a chemical digester (say, the temperature or the concentration of the mix). Whether the control hardware is operating properly or not can be evaluated by intentionally sending test signals to it at some pre-determined frequency and duration, and then analyzing the corresponding digester output to determine satisfactory performance. If performance is not satisfactory, the operator is alerted, and corrective action to recalibrate the control hardware may be warranted. Such approaches are widely used in industrial systems and telecommunication networks and have also been proposed and demonstrated for building energy system control. The techniques and concepts presented in this chapter are several decades old, and though quite basic, are still relevant since they provide the foundation to more advanced and recent methods. Performance of engineered systems was traditionally measured by sampling, averaging and recording analog or digital data from sensors in the form of numeric streams of data. Nowadays, sensory data streams from multiple and disparate sources (such as video, radar, sonar, vibration, . . .) are easy and cheap to obtain. Such multi-sensor and mixed-mode data streams have led to a discipline called data fusion, defined as the use of techniques that combine data from multiple sources to achieve inferences which will be more robust than if they were achieved by means of a single source. A good review of data fusion models and architectures is provided by Esteban et al. (2005). Application areas such as space, robotics, medicine, and sensor networks have seen a plethora of allied data fusion methods aimed at detection, recognition, identification, tracking, change detection, and decision making. These techniques are generally studied under signal processing, sensor networks, data mining, and engineering decision making.
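
As a very elementary illustration of the data fusion idea, the sketch below combines two readings of the same quantity by inverse-variance weighting. This particular rule is an assumption made here for illustration (it presumes independent, unbiased sensors with known error variances) and is not a method taken from the references cited above; the sensor readings are hypothetical.

```python
import numpy as np

def fuse(estimates, variances):
    """Inverse-variance weighted fusion of independent, unbiased estimates.

    Returns the fused estimate and its variance, which is never larger
    than the smallest individual variance.
    """
    est = np.asarray(estimates, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    fused = float(np.sum(w * est) / np.sum(w))
    fused_var = float(1.0 / np.sum(w))
    return fused, fused_var

# Two hypothetical temperature sensors reading the same quantity (deg C)
fused, fused_var = fuse([20.4, 19.8], [0.25, 1.0])
print(fused, fused_var)
```

The fused variance (0.2) is smaller than that of the better sensor alone (0.25), which is the sense in which the combined inference is more robust than one drawn from a single source.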
Fig. 10.14 Classification of dynamic modeling approaches
10.4 Gray-Box Models for Dynamic Systems
10.4.1 Introduction

Some of the important aspects to consider when faced with mechanistic inverse problems are summarized below:

(a) Type of inverse problem: identification or reconstitution (Sect. 10.1).
(b) Model identification or selection is perhaps the most difficult and critical aspect. The model should be consistent with the physics while being parsimonious. Overparametrization is a common problem and must be dealt with adequately.
(c) Is the problem static or dynamic? If the latter, is the model framed in terms of a physical analogue such as a thermal network (Sect. 10.4.2), as a time series model (Sect. 10.4.3), as a series of first-order ODEs (Sect. 10.4.4), or as a higher-order or partial differential equation?
(d) Would a change of variables, called “regularization,” be useful? It is often done to better condition the parameters, either to be able to interpret them in physical terms (such as the GN chiller model, Sect. 10.2.3) or to reduce or remove correlations among regressors.
(e) With non-intrusive data collection, the analysis approach can be: (i) single batch, wherein all the data are analyzed at once, or (ii) online sequential, wherein the parameter estimation is gradually improved as more data come in (Sect. 10.3). An additional consideration is the sampling frequency, which depends on the dynamic behavior of the system (discussed in Sect. 10.4.6).
(f) In case intrusive experiments are feasible, how best should they be performed to improve robustness and obtain the soundest parameter estimates? One could envision a series of sequential experiments under conditions where different forcing drivers are more pronounced (illustrated in Sect. 10.4.2).
(g) What is the process or algorithm to estimate the parameters? Usually, one adopts OLS, MLE or the Levenberg-Marquardt algorithm for non-linear estimation (Sect. 9.5.2).

A classification of various dynamic modeling approaches is shown in Fig. 10.14. There is an enormous amount of knowledge in this field and on relevant inverse methods, and only a basic introduction to a narrow class of models is provided here. As described in Sect. 1.4.4, one differentiates between distributed parameter and lumped parameter system models, which can be analyzed either in the time domain or in the frequency domain. The models, in turn, can be divided into linear and non-linear, and then into continuous-time or discrete-time models. This book limits itself to (i) thermal network models, which are a type of physical analog representation of heat transfer problems, (ii) ARMAX or transfer function models described in Sect. 8.6, which are discrete-time linear models with constant weight coefficients, and (iii) compartmental models, which are appropriate for linear, time-invariant systems with discrete and separate components (these are a special sub-set of the more general state variable model formulation, see Sect. 10.4.4).
10.4.2 Sequential Estimation of Thermal Network Model Parameters from Controlled Tests

Representing the hourly heat flows in and out of a building, considering the thermal mass of the envelope/shell and the interiors, by dynamic representations suitable for inverse modeling has been studied by numerous researchers since the early 1970s (e.g., Sonderegger 1978; Bacot 1985; Subbarao 1985; Hammersten 1986; Rabl 1988). The thermal network analogue representation is especially intuitive to engineers; it was introduced in Sect. 1.4.4 and expanded in Sect. 7.9. The approach involves selecting a network model suitable for the construction type of the building in question and then inferring basic physically relevant building parameters. A model thus identified allows the effect of different measures to be investigated; for example, evaluation of possible retrofits, verification of actual energy performance, the effect of changing thermostat settings, diagnostics, and optimal control. One could adopt a dynamic optimization approach to minimize the operating cost of cooling a building during the several hours of the peak period when electricity costs are high. A study by Lee and Braun (2008), described in Sect. 7.9, deals with the air-conditioner power savings and daily energy savings for a small test building located in Palm Desert, CA. The model parameters of different higher-order differential equations deduced from experimental testing of the building and the AC equipment were used to evaluate different thermostat control options.

A simpler application is presented below. A thermal network configuration has been proposed by Saunders et al. (1994) for modeling the performance of wood frame residences in the New England area of the U.S. under winter conditions. It has been called the MPR (measured performance rating) method since its primary purpose was to rate or evaluate the effect of weatherization of a home on building physical parameters such as the overall heat loss coefficient and the thermal mass of the building. It has been found to be adequate for its intended purpose, which involves using the model to compare the pre- and post-retrofit thermal characteristics of the house. It can also be used to determine, by hour-by-hour analysis throughout the heating season, whether the heating energy use has been reduced compared to that during the pre-retrofit stage.

The thermal network is shown in Fig. 10.15 and can be specified generally as a 2R1C (two resistors and one capacitor) electric network. The thermal storage effects of the building shell are represented by a single node, while the indoor air temperature (Ti) and indoor thermal storage are assumed to be perfectly coupled. Depending on the presence of a basement or crawl-space, an extra node represented by Tb is also included. Heat input Qin is provided either by the furnace itself or by electric auxiliary heaters. The specific choice of this model is somewhat ad hoc and based on the domain knowledge of the researchers, since one could have selected different types of networks. Whether it serves the intended purpose must be determined by field trials.

Fig. 10.15 The 2R1C thermal network configuration on which the MPR method is based

The basic governing equation for this network is:

$$
C \frac{dT_i}{dt} = -\frac{1}{R}\,(T_i - T_o) - \frac{1}{R_b}\,(T_i - T_b) + A_S\, Q_{sol} + Q_{in}\,\eta \tag{10.34}
$$
with To indicating the outdoor dry-bulb temperature, Qsol the solar load on the building per unit aperture area, AS the effective solar aperture, and η the efficiency of the heating source. The five physical parameters to be determined are the resistances R and Rb, the thermal heat capacity C, the effective solar aperture AS and the efficiency η of the heating source. Trying to estimate these parameters simultaneously using non-experimental data collected from the residence over, say, a few days leads to serious bias. Hence an intrusive test protocol was developed wherein the influence of certain physical variables is blocked, either by conducting tests at certain times (such as at night to eliminate solar effects) or by controlling certain influencing factors. This allows the parameters of a partial model to be identified sequentially, with the model gradually expanded to the full model. The MPR experimental protocol involves the following steps:

(a) Determine the efficiency of the heating system by isolating the furnace or boiler and taking relevant measurements.
(b) Perform a co-heating test simultaneously with a tracer gas infiltration test. This is done at night with Ti = Tb and Ti kept constant by electric heaters, with all other sources of internal heat switched off. Hourly readings of To and Qin allow the value of resistance R to be determined by regression.
(c) Perform a second co-heating test, also at night, with Ti kept constant and Tb left to float. From this test, Rb can be determined.
(d) Perform a third test at night, called a cool-down test, wherein the electric heaters are switched off and the drop in Ti is monitored at 1/2-h time steps. Use a finite difference scheme to discretize the governing equation and deduce the capacitance C by regression.
(e) Finally, the effective solar aperture AS is determined by taking solar radiation measurements on all surfaces with transparent glazing during the daytime and using a reduced form of Eq. (10.34). To avoid another test, a simpler alternative would be to measure the window areas on the different orientations and weight them by the corresponding incident solar irradiation. However, this procedure is approximate since it neglects several factors: the effects of diurnal changes in solar irradiation, the variation of glazing transmittivity with solar incidence angle, shading by internal or external devices or by trees, etc.

This protocol was performed on eight weatherized homes, and repeated tests were done on two occupied homes. The study found that estimates of the thermal characteristics of a house can be obtained with satisfactory precision even when testing is limited to one night and one day.
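Steps (b) and (d) of the protocol can be sketched numerically as follows. This is a minimal illustration on synthetic, noise-free data, under assumptions that are not part of the original study: the house has no basement node (so the Rb term of Eq. 10.34 is dropped), the outdoor temperature is constant during the cool-down, and the "true" values `R_TRUE` and `C_TRUE` are hypothetical.

```python
import numpy as np

# Hypothetical "true" house, used only to generate synthetic test data
R_TRUE = 5.0     # shell resistance, degC/kW
C_TRUE = 10.0    # effective heat capacity, kWh/degC

# Step (b): night-time co-heating with T_i constant and T_b = T_i, so that
# Eq. (10.34) reduces to  eta * Q_in = (T_i - T_o) / R
T_i = 20.0
T_o = np.array([2.0, 1.0, 0.0, -1.0, -2.0, -1.5])   # hourly outdoor temps, degC
q_heat = (T_i - T_o) / R_TRUE                        # delivered heat, kW
dT = T_i - T_o
R_hat = np.sum(dT * dT) / np.sum(dT * q_heat)        # 1 / slope of q_heat vs dT

# Step (d): cool-down test, heaters off, half-hour steps, forward difference:
#   C * (T[k+1] - T[k]) / dt = -(T[k] - T_out) / R
dt, T_out, n_steps = 0.5, 0.0, 12
T = np.empty(n_steps + 1)
T[0] = 20.0
for k in range(n_steps):
    T[k + 1] = T[k] - dt * (T[k] - T_out) / (R_TRUE * C_TRUE)

dTdt = np.diff(T) / dt                                  # degC/h
heat_flow = -(T[:-1] - T_out) / R_hat                   # kW, using R from step (b)
C_hat = np.sum(dTdt * heat_flow) / np.sum(dTdt ** 2)    # slope through the origin

print(f"R = {R_hat:.3f} degC/kW, C = {C_hat:.3f} kWh/degC")
```

The point of the sketch is the sequencing: R is identified first from a test in which storage and solar effects are absent, and is then treated as known when C is estimated from the cool-down data. With measured data, each regression would also carry an uncertainty that propagates to the next step.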
10.4.3 Non-Intrusive Identification of Thermal Network Models and Parameters

The selection of the network configuration is judgement-based, drawing on prior domain knowledge. In any case, a few possible networks should be evaluated and the most suitable one picked. The previous section dealt with intrusive testing to determine the physical parameters of a thermal network. Even though this is the better option, it cannot be implemented in some instances and in large buildings. The focus of this section is on analysis methods based on routine and non-intrusive data collection to estimate dynamic and steady-state parameters of a building. Of the several analysis methods proposed, only two will be discussed below, in the framework of energy flows in residences, which are simpler to analyze than those in commercial buildings (taken from Reddy 1989).

Three ranch-style houses of frame construction, located in the Princeton area of NJ and all built in the 1960s (about 30 years old at the time of the study), were instrumented, and data were collected under “lived-in” conditions as non-intrusively as possible during the swing seasons of the year. These correspond to situations where parameter estimation is likely to be most uncertain given that the signals will be relatively “weak” while being “noisy” at the same time. These homes could be considered typical American houses, i.e., of lightweight construction, with no planned solar strategies, common landscaping, and not super-insulated. Moreover, none of these houses has a major source of internal heat generation other than the electrical appliances; even the cooking ranges are electric. This section only presents the results of two homes and two analysis methods.
Periods of data when the air-conditioner system was not being operated (referred to as “free floating mode”) were identified for all three houses. This resulted in 290 hourly observations for House#1 and 486 hourly data points for House#3. The hourly measured values used in the analysis were the interior and outdoor air temperatures Ti and Ta, respectively, and QA and QS, the total internal heat generation and the global solar radiation on a horizontal surface per unit area, respectively. Figure 10.16 illustrates the type of variation in Ti, Ta, QS and QA encountered during the monitoring period. Differences between Ti and Ta were on the order of 6 °C to 8 °C on average, which, considering the values of the overall heat loss coefficients, indicated that the heat loss through the building shell is the most important source of heat flow, while the solar loads are low due to tree-shading.

Four different simple thermal network configurations suitable for internal-mass dominated homes were selected based on some amount of judgement. The original study (Reddy 1989) presents results of all four networks; unfortunately, the best one is not obvious since all four seem to fit the monitored data equally well. Here, only the results of two homes are presented, and the thermal networks which seem to be most appropriate (for reasons explained later) are depicted in Fig. 10.17. (The other two network configurations evaluated are shown in Pr. 10.13.) For the 2R1C network, there are four parameters to estimate: the two resistances, the thermal heat capacitance C and the factor AS associated with the solar irradiation. For the 3R1C network, an additional resistance is due to the inclusion of a clamp temperature (that of the basement). All the monitored internal electricity use is assumed to appear as internal load and did not require a conversion factor. Heat balance equations, based on Kirchhoff’s laws, for the two nodes of the 2R1C configuration can be stated as:

$$
\begin{aligned}
\text{Node } T_i:\quad & \frac{1}{R_1}\,(T_a - T_i) + Q_A = \frac{1}{R_2}\,(T_i - T_s) \\
\text{Node } T_s:\quad & \frac{1}{R_2}\,(T_i - T_s) + A_S\, Q_S = C\,\dot{T}_s
\end{aligned}
\tag{10.35}
$$
Fig. 10.16 Typical sample of the recorded data of the four variables (House#3)

Fig. 10.17 The two thermal network configurations evaluated with monitored data from two residences under free floating mode. (a) 2R1C network with four parameters, (b) 3R1C network with five parameters

The storage (or aggregated internal node) temperature Ts coupled with the capacitor C appears as a time derivative, which denotes the change in heat storage. Since this temperature cannot be measured directly, the two equations must be re-expressed as a single ODE in terms of the air node temperature Ti. One derives an expression for Ts from the first equation, which is then substituted in the second equation to yield:

$$
\dot{T}_i = a_0\,\dot{T}_a + b_1\,(T_a - T_i) + c_0\,\dot{Q}_A + c_1\,Q_A + d_1\,Q_S \tag{10.36a}
$$

subject to the condition that

$$
\frac{a_0}{c_0} = \frac{b_1}{c_1} \tag{10.36b}
$$

where

$$
a_0 = \frac{R_2/R_1}{1 + R_2/R_1}, \quad
b_1 = \frac{1}{C\,R_1\,(1 + R_2/R_1)}, \quad
c_0 = \frac{R_2}{1 + R_2/R_1}, \quad
c_1 = \frac{1}{C\,(1 + R_2/R_1)}, \quad
d_1 = \frac{A_S}{C\,(1 + R_2/R_1)} \tag{10.37}
$$

Equation 10.36a, with its five parameters, is the one to be used for regression after discretization. A simple forward difference scheme is usually acceptable. For the internal air node temperature at time t and time step Δt, Tit = Ti(t - 1), while the first-order derivative can be written as Ṫit = [Ti(t - 1) - Ti(t - 2)]/Δt, and so on. If the data are collected at hourly intervals, Δt = 1 h. Even though there are four physical parameters and five regression coefficients, the former can be determined uniquely due to the constraint between the regression coefficients shown in Eq. 10.36b. A note of caution is that this will not be the case for all network configurations; very often one is faced with an over- or under-determination problem, and one has to modify the network appropriately or impose more constraints.

The ARMAX model (see Sect. 8.6) is well suited to model the time series response of systems with clear forcing functions. The study selected model orders ranging from one to three lags and evaluated all of them for all homes. Defining n = t/Δt, the basic model for ARMAX(2,2,2,2),⁴ i.e., one with two lag terms for all four variables, is (refer to Eq. 8.47):
4 This terminology is different from the traditional one described in Sects. 8.5 and 8.6. Strictly speaking, the model structure assumed here should be called AR2X3 denoting that no MA component is involved, that there are three exogenous forcing functions and that these and the response and forcing variables have two lagged terms each. However, the above terminology seems to have been adopted in the building energy literature.
$$
\begin{aligned}
T_i(n) + a''_1 T_i(n-1) + a''_2 T_i(n-2) ={}& b''_0 T_a(n) + b''_1 T_a(n-1) + b''_2 T_a(n-2) \\
&+ c''_0 Q_A(n) + c''_1 Q_A(n-1) + c''_2 Q_A(n-2) \\
&+ d''_0 Q_S(n) + d''_1 Q_S(n-1) + d''_2 Q_S(n-2)
\end{aligned}
\tag{10.38a}
$$

with the constraint that the sums of the a″ and b″ coefficients must generally satisfy the steady-state condition given by:

$$
\sum_{k=0}^{N_i} a''_k = \sum_{k=0}^{N_a} b''_k \tag{10.38b}
$$

where Ni and Na represent the number of lagged terms for the drivers Ti and Ta (equal to two for the model form assumed above). Note that the coefficients are normalized assuming that the coefficient for Ti(n) is 1 (i.e., a″0 = 1). Rearranging terms yields the following expression suitable for regression:

$$
\begin{aligned}
T_i(n) - T_i(n-1) ={}& a'_2\,[T_i(n-1) - T_i(n-2)] \\
&+ b'_0\,[T_a(n) - T_i(n-1)] + b'_1\,[T_a(n-1) - T_i(n-1)] + b'_2\,[T_a(n-2) - T_i(n-1)] \\
&+ c'_0 Q_A(n) + c'_1 Q_A(n-1) + c'_2 Q_A(n-2) \\
&+ d'_0 Q_S(n) + d'_1 Q_S(n-1) + d'_2 Q_S(n-2)
\end{aligned}
\tag{10.39}
$$

Note the difference between the regression expressions given by Eq. (10.36a) and Eq. (10.39) used for identifying the model coefficients. Once these are determined, the physical parameters can be deduced from Eq. (10.37) for the thermal network approach. For the ARMAX(2,2,2,2) approach, there are 10 regression coefficients and one constraint, from which one cannot infer the five physical parameters uniquely. However, one can determine certain overall physical parameters such as the overall heat loss coefficient and the time constants; the equations for the four most important ones are shown in Table 10.4. The reader can refer to Rabl (1988) for a discussion and complete expressions of higher-order models.

One could have selected the order of the time series model with the highest goodness of fit as the best one. However, since the intent was to estimate the building physical parameters and not simply to fit a prediction model, the better selection approach was to determine which of the RC network configurations and which order of the ARMAX model yielded the most consistent results. It was concluded (albeit indirectly) that the particular network models shown in Fig. 10.17 were the most suitable since the associated parameter values were very close to those estimated separately from ARMAX(3,3,3,3) models (see Table 10.5).

Table 10.4 Equations for the building physical parameters in terms of the regression coefficients of the time series model ARMAX(2,2,2,2) given by Eq. 10.38a (from Rabl 1988)

| Physical parameter | Expression |
|---|---|
| Steady-state building heat loss coefficient | $L = \sum_{k=0}^{2} a''_k \Big/ \sum_{k=0}^{2} c''_k$ |
| Effective solar aperture | $A_S = \sum_{k=0}^{2} d''_k \Big/ \sum_{k=0}^{2} c''_k$ |
| Time constants | $\tau_1 \text{ and } \tau_2 = \Delta t \left[ \ln \dfrac{-a''_1 \pm \sqrt{a''^{\,2}_1 - 4a''_2}}{2a''_2} \right]^{-1}$ |
| Effective building heat capacity | $C = L\,(\tau_1 + \tau_2)$ |
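The expressions of Table 10.4 translate directly into a few lines of code. The sketch below is illustrative only: the function name `physical_parameters` is hypothetical, and the coefficient values are not from the study but are constructed from assumed time constants (40 h and 1 h), an assumed heat loss coefficient (0.4 kW/°C) and an assumed solar aperture (2 m²), so that the formulas can be checked against known answers.

```python
import math


def physical_parameters(a, c, d, dt=1.0):
    """Aggregate building parameters from ARMAX(2,2,2,2) coefficients (Table 10.4).

    a, c, d : sequences [k=0, k=1, k=2] of the a'', c'' and d'' coefficients of
              Eq. 10.38a, with a[0] = 1 by normalization.
    dt      : sampling interval in hours.
    Assumes the discriminant a1^2 - 4*a2 is positive (two real time constants).
    """
    L = sum(a) / sum(c)                      # steady-state heat loss coefficient
    A_s = sum(d) / sum(c)                    # effective solar aperture
    disc = math.sqrt(a[1] ** 2 - 4.0 * a[2])
    taus = [dt / math.log((-a[1] + sign * disc) / (2.0 * a[2]))
            for sign in (+1.0, -1.0)]
    tau_short, tau_long = sorted(taus)
    C = L * (tau_short + tau_long)           # effective heat capacity
    return L, A_s, tau_long, tau_short, C


# Hypothetical coefficients built from assumed "true" values
z1, z2 = math.exp(-1.0 / 40.0), math.exp(-1.0 / 1.0)   # discrete poles, dt = 1 h
a = [1.0, -(z1 + z2), z1 * z2]
sum_c = sum(a) / 0.4                                   # so that L = 0.4 kW/degC
c = [0.5 * sum_c, 0.3 * sum_c, 0.2 * sum_c]
d = [2.0 * ck for ck in c]                             # so that A_S = 2 m^2

L, A_s, tau1, tau2, C = physical_parameters(a, c, d)
print(f"L = {L:.3f} kW/degC, A_S = {A_s:.2f} m2, "
      f"tau1 = {tau1:.2f} h, tau2 = {tau2:.2f} h, C = {C:.2f} kWh/degC")
```

With coefficients estimated from measured data, the discriminant may turn out negative or the argument of the logarithm non-positive; such cases signal that the identified model does not admit two real time constants and need to be handled separately.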
Table 10.5 Building parameter sets identified for the two houses which were deemed most physically realistic and consistent across the time series and the thermal network models (Reddy 1989)

| House | Type of model | Configuration | L (kW/°C) | AS (m²) | C (kWh/°C) | τ1 (h) | τ2 (h) | Adj. R² | RMSE (°C) |
|---|---|---|---|---|---|---|---|---|---|
| House#1 | Thermal network | 2R1C | 0.34 | 1.6 | 16.2 | 47.6 | – | 0.322 | 0.15 |
| House#1 | Time series | 3,3,3,3 | 0.35 | 2.2 | 16.0 | 45.3 | 0.55 | 0.366 | 0.17 |
| House#3 | Thermal network | 3R1C | 0.46 | 6.6 | 21.3 | 46.6 | – | 0.521 | 0.16 |
| House#3 | Time series | 3,3,3,3 | 0.46 | 6.8 | 21.8 | 46.0 | 1.39 | 0.582 | 0.15 |
A final observation is that the best features of both methods can be leveraged. The thermal network equations are somewhat laborious to manipulate especially since there is no ready-made software to do this. Further, the regression results are often not sensitive enough for the analyst to discriminate between competing networks or even between alternate generic configurations. However, its most redeeming advantage is its ability to provide a “mind model,” i.e., a visual representation with physically relevant parameters of the system, a feature most appealing to engineers. The time series model form is inherently much easier to manipulate for regression analysis, and further, one can progressively increase the order of the model till some measure of convergence is obtained in terms of model goodness of fit and parameter values. Even with such a systematic procedure, the analyst can only identify aggregate parameters (and only for lower order models). Hence, such a general multimethodology approach of using different analysis methods to reach sounder conclusions is one which is highly recommended not only during parameter estimation but in other types of statistical analyses as well.
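The thermal network route of this section (discretize Eq. 10.36a, regress, then back out the physical parameters) can be sketched as follows. This is an illustration on synthetic, noise-free data and not the procedure of the original study: the "true" parameter values and the driver profiles are hypothetical, the forward-difference indexing is one possible choice, and the inversion formulas are those obtained by rearranging the coefficient definitions of the 2R1C network, namely R1 = c0/a0, R2 = R1·a0/(1 - a0), C = (1 - a0)/c1 and AS = d1/c1.

```python
import numpy as np

# Hypothetical "true" 2R1C parameters (degC/kW, degC/kW, kWh/degC, m^2)
R1, R2, C, A_S = 3.0, 0.5, 16.0, 2.0
g = 1.0 + R2 / R1
a0, b1, c0 = (R2 / R1) / g, 1.0 / (C * R1 * g), R2 / g
c1, d1 = 1.0 / (C * g), A_S / (C * g)

# Synthetic hourly drivers (deterministic, for reproducibility)
dt, n = 1.0, 240
hours = np.arange(n + 1)
Ta = 15.0 + 5.0 * np.sin(2 * np.pi * hours / 24)                   # outdoor temp
QA = 0.5 + 0.3 * np.sin(2 * np.pi * hours / 7)                     # internal gains
QS = 0.6 * np.maximum(0.0, np.sin(2 * np.pi * (hours - 6) / 24))   # solar
dTa, dQA = np.diff(Ta) / dt, np.diff(QA) / dt

# "Measured" indoor temperature: march the forward-difference form of Eq. 10.36a
Ti = np.empty(n + 1)
Ti[0] = 22.0
for k in range(n):
    Ti[k + 1] = Ti[k] + dt * (a0 * dTa[k] + b1 * (Ta[k] - Ti[k])
                              + c0 * dQA[k] + c1 * QA[k] + d1 * QS[k])

# Ordinary least squares on the discretized equation (no intercept)
X = np.column_stack([dTa, Ta[:-1] - Ti[:-1], dQA, QA[:-1], QS[:-1]])
y = np.diff(Ti) / dt
a0_h, b1_h, c0_h, c1_h, d1_h = np.linalg.lstsq(X, y, rcond=None)[0]

# Back out the four physical parameters from the regression coefficients
R1_h = c0_h / a0_h
R2_h = R1_h * a0_h / (1.0 - a0_h)
C_h = (1.0 - a0_h) / c1_h
AS_h = d1_h / c1_h
print(f"R1 = {R1_h:.3f}, R2 = {R2_h:.3f}, C = {C_h:.3f}, A_S = {AS_h:.3f}")
print(f"constraint check: a0/c0 = {a0_h / c0_h:.4f}, b1/c1 = {b1_h / c1_h:.4f}")
```

With noise-free data the unconstrained regression satisfies the constraint between coefficients automatically. With measured data it generally will not, and the two ratios printed on the last line give a quick indication of how consistent the data are with the assumed network.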
Fig. 10.18 Nomenclature adopted in modeling dynamic systems by the state space representation
10.4.4 State Space Representation and Compartmental Models

Generally, dynamic models are characterized by differential equations of first, second or higher orders. The standard nomenclature adopted to represent such systems is shown in Fig. 10.18. The system is acted upon by a vector of external variables (or signals or influences) u, while y is the output or response vector (the system could have several responses). The vector x characterizes the state of the system; the states are not necessarily the outputs and may represent latent states of the system that are not measurable. For example, in mechanical systems these internal elements may be the positions and velocities of separate components of the system, while in thermal RC networks they may represent the temperatures of the internal nodes within the wall. In many applications, the variables x may not have direct physical significance, nor are they necessarily unique. A special case is the state space formulation, which involves manipulating an nth order differential equation into a set of n first-order ODEs. A multi-input multi-output linear system model can be framed as:

$$
\dot{\mathbf{x}} = \mathbf{A}\mathbf{x} + \mathbf{B}\mathbf{u}, \qquad \mathbf{y} = \mathbf{C}\mathbf{x} + \mathbf{d}\,\mathbf{u} \tag{10.40}
$$

where A is called the state transition matrix, B the input matrix, C the output matrix and d the direct transmission term. This is referred to as a linear time-invariant (LTI) state space model. Note that the first function relates the state vector at the current time with its previous values. This can be expanded into a linear function of p states and m inputs:

$$
\dot{x}_i = a_{i1}x_1 + a_{i2}x_2 + \ldots + a_{ip}x_p + b_{i1}u_1 + b_{i2}u_2 + \ldots + b_{im}u_m \tag{10.41a}
$$

Finally, the outputs themselves may or may not be the state variables. Hence, a more general representation is to express them as linear algebraic combinations:

$$
y_i = c_{i1}x_1 + c_{i2}x_2 + \ldots + c_{ip}x_p + d_{i1}u_1 + d_{i2}u_2 + \ldots + d_{im}u_m \tag{10.41b}
$$

Compartmental models are a sub-category of the state space representation, appropriate when a complex process can be broken up into simpler discrete sub-systems, each of which can be viewed as homogeneous and well-mixed, that exchange mass with each other and/or with a sink/environment. This is a form of discrete linear lumped parameter modeling approach, more appropriate for time-invariant model parameters, which is described in several textbooks (e.g., Godfrey 1983). Further, several assumptions are inherent in this approach: (i) the materials or energy within a compartment get instantly and fully mixed and are homogeneous, (ii) the exchange rates among compartments are related to the concentrations or densities of these compartments, (iii) the volumes of the compartments are taken to be constant over time, and (iv) usually no chemical reaction is involved as the materials flow from one cell to another. The quantity or concentration of material in each compartment can be described by first-order (linear or non-linear) constrained differential equations, the constraints being that physical quantities such as flow rates be non-negative. This type of model has been extensively used in such diverse areas as medicine (biomedicine, pharmacokinetics), science and engineering (ecology, environmental engineering, and indoor air quality), and even in the social sciences. For instance, in a pharmacokinetic model, the compartments may represent different sections of a body within which the concentration of a drug is assumed to be uniform. Though these models can be analyzed with time-variant coefficients, time invariance is usually assumed.

Compartmental models are not appropriate for certain engineering applications such as closed-loop control, or for conservation of momentum equations, which are non-compartmental in nature. Two or more dependent variables, each a function of a single independent variable (usually time for dynamic modeling of lumped physical systems), appear in such problems, which leads to a system of ODEs. Section 12.7.6 briefly mentions how the compartmental model approach has been adopted in sustainability studies under the discipline referred to as “system dynamics modeling,” which is concerned with the overall behavior of complex interconnected environmental systems with numerous sources, sinks, feedback loops and stabilizing loops.

10.4.5 Example of a Compartmental Model
Consider the three radial-room building shown in Fig. 10.19 with volumes V1, V2 and V3. A certain amount of uncontaminated outdoor air (of flow rate r) is brought into Room 1, from where it flows outwards to the other rooms as shown. Assume that an amount of contaminant is injected in the first room as a single burst which mixes uniformly with the air in the first room. The contaminated air then flows to the second room, mixes uniformly with the air in the second room, and on to the third room from where it escapes to the sink (or the outdoors). Let x1(t), x2(t), x3(t) be the volumetric concentrations of the contaminant in the three rooms and let ki = r/Vi. The entire system is modeled by a set of three ordinary differential equations (ODEs) as follows:

$$
\begin{aligned}
\dot{x}_1 &= -k_1 x_1 \\
\dot{x}_2 &= k_1 x_1 - k_2 x_2 \\
\dot{x}_3 &= k_2 x_2 - k_3 x_3
\end{aligned}
\tag{10.42}
$$

Fig. 10.19 A system of three interconnected radial rooms in which an abrupt contamination release has occurred. A quantity of outdoor air r is supplied to Room 1. The temporal variation in the concentration levels in the three rooms can be conveniently modeled following the compartmental modeling approach

In matrix form, the above set of ODEs can be written as:

$$
\begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \\ \dot{x}_3 \end{bmatrix}
=
\begin{bmatrix} -k_1 & 0 & 0 \\ k_1 & -k_2 & 0 \\ 0 & k_2 & -k_3 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
\tag{10.43a}
$$

or

$$
\dot{\mathbf{x}} = \mathbf{A}\mathbf{x} \tag{10.43b}
$$

The eigenvalue method of solving equations of this type consists in finding values of a scalar λ, called the eigenvalue, which satisfy the equation

$$
(\mathbf{A} - \lambda \mathbf{I})\,\mathbf{x} = \mathbf{0} \tag{10.44a}
$$

where I is the identity matrix. For the three-room example, the expanded form of Eq. 10.44a is:

$$
\begin{bmatrix} -k_1 - \lambda & 0 & 0 \\ k_1 & -k_2 - \lambda & 0 \\ 0 & k_2 & -k_3 - \lambda \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}
\tag{10.44b}
$$

The eigenvalues are then determined from the characteristic equation:

$$
|\mathbf{A} - \lambda \mathbf{I}| = 0 \quad \text{or} \quad (-k_1 - \lambda)(-k_2 - \lambda)(-k_3 - \lambda) = 0 \tag{10.45}
$$

The three distinct eigenvalues are the roots of Eq. 10.45, namely: λ1 = -k1, λ2 = -k2, λ3 = -k3. With each eigenvalue is associated an eigenvector v, from which the general solution can be determined. For example, consider the case when k1 = 0.5, k2 = 0.25 and k3 = 0.2. Then, λ1 = -0.5, λ2 = -0.25, λ3 = -0.2. The eigenvector associated with the first eigenvalue is found by substituting λ by λ1 = -0.5 in Eq. 10.44b, to yield

$$
[\mathbf{A} + (0.5)\mathbf{I}]\,\mathbf{v} =
\begin{bmatrix} 0 & 0 & 0 \\ 0.5 & 0.25 & 0 \\ 0 & 0.25 & 0.3 \end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}
\tag{10.46}
$$

Solving it results in v1 = [3 -6 5]^T, where [...]^T denotes the transpose of the vector. A similar approach is followed for the two other vectors. The general solution is:

$$
\mathbf{x}(t) = c_1 \begin{bmatrix} 3 \\ -6 \\ 5 \end{bmatrix} e^{-0.5t}
+ c_2 \begin{bmatrix} 0 \\ 1 \\ -5 \end{bmatrix} e^{-0.25t}
+ c_3 \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} e^{-0.2t}
\tag{10.47}
$$

Next, the three constants are determined from the initial conditions. Expanding the above equation results in:

$$
\begin{aligned}
x_1(t) &= 3c_1\, e^{-0.5t} \\
x_2(t) &= -6c_1\, e^{-0.5t} + c_2\, e^{-0.25t} \\
x_3(t) &= 5c_1\, e^{-0.5t} - 5c_2\, e^{-0.25t} + c_3\, e^{-0.2t}
\end{aligned}
\tag{10.48}
$$

Let the initial concentration levels in the three rooms be x1(0) = 900, x2(0) = 0, x3(0) = 0. Inserting these in Eq. 10.48 and solving yields c1 = 300, c2 = 1800, c3 = 7500. Finally, the equations for the concentrations in the three rooms are given by the following tri-exponential solution (plotted in Fig. 10.20):

$$
\begin{aligned}
x_1(t) &= 900\, e^{-0.5t} \\
x_2(t) &= -1800\, e^{-0.5t} + 1800\, e^{-0.25t} \\
x_3(t) &= 1500\, e^{-0.5t} - 9000\, e^{-0.25t} + 7500\, e^{-0.2t}
\end{aligned}
\tag{10.49}
$$
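The eigenvalue solution of the three-room example can be cross-checked numerically. The sketch below, which assumes NumPy is available, builds the matrix A of the example with k1 = 0.5, k2 = 0.25 and k3 = 0.2, obtains eigenvalues and eigenvectors with `numpy.linalg.eig`, fits the constants to the initial condition, and compares the result with the closed-form tri-exponential solution. The function names `x_eig` and `x_closed` are introduced here for illustration only.

```python
import numpy as np

k1, k2, k3 = 0.5, 0.25, 0.2
A = np.array([[-k1, 0.0, 0.0],
              [k1, -k2, 0.0],
              [0.0, k2, -k3]])
x0 = np.array([900.0, 0.0, 0.0])      # initial concentrations in Rooms 1 to 3

lam, V = np.linalg.eig(A)             # eigenvalues and column eigenvectors
c = np.linalg.solve(V, x0)            # constants from x(0) = V c


def x_eig(t):
    """General solution x(t) = sum_j c_j v_j exp(lam_j t)."""
    return (V * c) @ np.exp(lam * t)


def x_closed(t):
    """Closed-form tri-exponential solution of the three-room example."""
    e1, e2, e3 = np.exp(-0.5 * t), np.exp(-0.25 * t), np.exp(-0.2 * t)
    return np.array([900.0 * e1,
                     -1800.0 * e1 + 1800.0 * e2,
                     1500.0 * e1 - 9000.0 * e2 + 7500.0 * e3])


for t in (0.0, 1.0, 5.0, 30.0):
    print(t, np.round(x_eig(t), 3), np.round(x_closed(t), 3))
```

The eigenvectors returned by the library are normalized to unit length and therefore differ from [3 -6 5]^T by a scale factor; the fitted constants absorb this scaling, so the resulting concentrations are unaffected.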
The application of an inverse modeling approach to this problem can take several forms depending on the intent in developing the model and data available. The basic premise
Fig. 10.20 Variation in the concentrations over time for the three interconnected rooms modeled as compartmental models (following Eq. 10.49)
is that such a building with three inter-connected rooms exists in reality from which actual concentration measurements can be taken. If an actual test similar to that assumed above were to be carried out, where should the sensors be placed (in all three rooms, or would placing sensors in Rooms 1 and 3 suffice), and what should be their sensitivities and response times? One must account for sensor inaccuracies or even drifts; would some manner of fusing or combining all three data streams then result in more robust model identification? Can a set of models identified under one test condition be accurate enough to predict the dynamic behavior in the three rooms under other contaminant releases? What should be the sampling frequency? While longer time intervals may be adequate for routine measurements, would the estimation not be better if high-frequency samples were available immediately after a contaminant release was detected? Such practical issues are surveyed in the next section.
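Questions such as the one on sampling frequency can be explored by simulation before any field test. The sketch below does this for Room 1 only, whose response in the example is 900·exp(-0.5t). It is a hypothetical experiment, not the analysis reported in this chapter: Gaussian noise with a standard deviation of 20 ppm is an assumed instrument noise level, the two sampling schedules (every minute for 30 min, versus every 0.1 min for the first 10 min and every minute thereafter) are assumptions, and the small damped Gauss-Newton routine `fit_exp` is written here for self-containment rather than taken from any library.

```python
import numpy as np


def model(p, t):
    """Single-exponential decay a*exp(-b*t) with p = (a, b)."""
    return p[0] * np.exp(-p[1] * t)


def fit_exp(t, y, start=(800.0, 0.4), iters=100):
    """Damped Gauss-Newton least-squares fit of y = a*exp(-b*t)."""
    p = np.array(start, dtype=float)
    for _ in range(iters):
        e = np.exp(-p[1] * t)
        J = np.column_stack([e, -p[0] * t * e])          # Jacobian wrt (a, b)
        step = np.linalg.lstsq(J, y - p[0] * e, rcond=None)[0]
        sse_old = np.sum((y - model(p, t)) ** 2)
        # Halve the step while it fails to reduce the sum of squared errors
        while (np.sum((y - model(p + step, t)) ** 2) > sse_old
               and np.max(np.abs(step)) > 1e-12):
            step = 0.5 * step
        p = p + step
        if np.max(np.abs(step)) < 1e-9:
            break
    return p


true_p = np.array([900.0, 0.5])
t_coarse = np.arange(0.0, 31.0, 1.0)
t_fine = np.concatenate([np.arange(0.0, 10.0, 0.1), np.arange(10.0, 31.0, 1.0)])

# Reality check: noise-free data should return the assumed parameters
p_clean = fit_exp(t_coarse, model(true_p, t_coarse))

# Monte Carlo: repeat the noisy experiment for both sampling schedules
rng = np.random.default_rng(0)
spread = {}
for name, t in (("coarse", t_coarse), ("fine", t_fine)):
    fits = np.array([fit_exp(t, model(true_p, t) + rng.normal(0.0, 20.0, t.size))
                     for _ in range(200)])
    spread[name] = fits.std(axis=0)
    print(name, "mean (a, b):", fits.mean(axis=0), "std (a, b):", spread[name])
```

Repeating the fit over many noise realizations gives an empirical measure of parameter uncertainty under each schedule, and the same template can be extended to Rooms 2 and 3, to other noise levels, or to delayed starts of sampling.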
10.4.6 Practical Issues During Identification

The complete identification problem consists of selecting an appropriate model, and then estimating the parameters of the matrices {A, B, C} in Eq. 10.40. The concepts of structural and numerical identifiability were introduced in Sects. 9.2.1 and 9.2.3, respectively. Structural identifiability, also called “deterministic identifiability,” is concerned with arriving at performance models of the system under perfect or noise-free observations. If it is found that the parameters of the assumed model structure cannot be uniquely identified, either the model must be reformulated or additional measurements must be made. The latter may involve observing more or different states and/or judiciously perturbing the system with additional inputs. There is a vast body of knowledge pertinent to different disciplines as to the design of optimal input signals; for example, Sinha and Kuszta (1983) vis-a-vis control systems, and Godfrey (1983) for applications involving compartmental models in general (while Evans 1996 limits himself to their use for indoor air quality modeling). The problem of numerical identifiability is related to the quality of the input-output data gathered, i.e., to noise in the data and to ill-conditioning of the correlation coefficient matrix (see Sect. 9.2.3). Godfrey (1983) discusses several pitfalls of compartmental modeling, one of which is the fact that the difference in the sum of squares between different possible models tends to reduce as the noise in the signal increases. He also addresses the effects of a limited range of collected data (early termination of sampling, which neglects slow transients; delayed start of sampling, which neglects rapid transients, can miss the initial spikes, and limits the ability to extrapolate models back to the zero-time intercept values), the effect of poorly spaced samples, and the effect of short samples. The three-room example will be used to illustrate some of these concepts in a somewhat ad hoc manner, which readers can emulate on their own and enhance with different allied investigations.

(a) Introducing Noise in the Data

The dynamic performance of the three rooms is given by Eq. 10.49. It is advisable at the outset to evaluate whether the model parameters can be re-identified with simulated data assuming no noise. This would serve as a reality check before undertaking more sophisticated analysis. The final determination is whether the “correct/assumed” values of the model parameters can be re-identified when some noise is introduced in the data. This is a better representation of reality. One could evaluate this aspect with different magnitudes of noise under different types of distributions. Here, a sequence of normally distributed
random noise with zero bias and standard deviation of 20, i.e., ε(0, 20) has been generated to corrupt the simulated data sample. This is quite a small instrument noise considering that the maximum concentration to be read is 900 ppm. This data has been used to identify the parameters assuming that the system models are known to be the same tri-exponential equations. One would obtain slightly different values depending on the magnitude and type of random noise introduced, and several runs would yield a more realistic evaluation of the uncertainty in the parameters estimated (this is the Monte Carlo simulation as applied to regression analysis). (b) Effect of Sampling Frequency Parameter estimation has been done with two different sampling frequencies over the 30 min period: (i) at one-minute intervals, and (ii) at 0.1 min frequency for the first 10 min and every minute thereafter. These results are assembled in Table 10.6. How well the models fit the data with almost no patterned residual behavior can be noted from Figs. 10.21 and 10.22. Note that though the model R2 is excellent, and the dynamic prediction of the models captures the “actual” behavior quite well, the parameters are somewhat different from the correct values, with the differences being roomspecific. The concentration in Room 1 at time t = 0 found from one-minute interval sampling strategy (i) is a 787.0 which is quite different than the correct value of 900 ppm. However, with 0.1 min sampling frequency strategy (ii), the parameter is estimated almost exactly. This is intuitive; one would expect the signal to be strong in Room 1 during the early
time intervals after the release, which is better captured by the higher-frequency sampling strategy. The same can be said about the parameter b for Room 1. In fact, sampling strategy (ii) is better (except for parameter “e” of Room 3) for the other two rooms as well. The precision of the parameter estimation process generally degrades as one goes to Room 2, but the estimates can be taken to be acceptable. The coefficients for Room 3 are very poorly reproduced, probably because six parameters are being identified compared to two for Room 1 and because the signal strength is lower than the concentrations in the other two rooms. A general issue with non-linear regression problems is the need to supply good (i.e., reasonably close) starting values. Sounder conclusions would require repeating these types of evaluations with different sampling frequencies and noise. In general, for proper model parameter estimation, it is safer to sample the data as frequently as possible with the frequency tied to the time constant of the system. It is advisable to collect 5–10 samples during this period after the system undergoes a step response (Ljung and Glad 1994). On the other hand, the sampling interval should be more than 3–4 times the time constant of the measuring instrument. If the data frequency is found to be needlessly large, one could retain readings at equal intervals, say every 5th value of the original data set, to perform subsequent analysis.

(c) Effect of Fitting Lower-Order Models

How would selecting a lower order model affect the results? This aspect relates to system identification and not to parameter estimation. The simple case of regressing a bi-exponential model to the “observed” sample of Room
3 is illustrated in Fig. 10.23. In this case, there is a distinct pattern in the residual behavior which is unmistakable. In actual situations, this is a much more difficult issue. Godfrey (1983) suggests that, in most practical instances, one should not use more than three or four compartments. The sum of exponential models which the compartmental approach yields would require non-linear estimation which, along with the dynamic response transients and sampling errors, makes robust system identification quite difficult, if not impossible.

Table 10.6 Results of parameter estimation for two cases of the three-room problem with the simulation data corrupted by normally distributed random noise ε(0, 20)

Room     Parameter (Eq. 10.49)   Correct value   (i) Sampling at one-minute intervals   (ii) Sampling at 0.1 min intervals for first 10 min
Room 1   a                       900             787.0                                  900.1
Room 1   b                       -0.500          -0.439                                 -0.500
Room 1   Adj. R2                                 96.6%                                  99.2%
Room 2   a                       -1800           -1221.1                                -1922.2
Room 2   b                       -0.50           -0.707                                 -0.483
Room 2   c                       1800            1118.8                                 1931.3
Room 2   d                       -0.25           -0.207                                 -0.254
Room 2   Adj. R2                                 98.0%                                  98.4%
Room 3   a                       1500            3961.9                                 2706.1
Room 3   b                       -0.50           -0.407                                 -0.436
Room 3   c                       -9000           -11215.8                               -8134.7
Room 3   d                       -0.25           -0.280                                 -0.276
Room 3   e                       7500            7239.0                                 5433.1
Room 3   f                       -0.20           -0.208                                 -0.196
Room 3   Adj. R2                                 97.1%                                  96.9%

Fig. 10.21 Time series plots of measurements sampled at one-minute intervals with random noise and identified models using Eq. 10.49 (case (i) of Table 10.6). (a) First room. (b) Second room. (c) Third room

10.5 Bayesian Regression and Parameter Estimation: Case Study
Bayesian and non-linear least-squares methods of calibration were evaluated and compared for gray-box modeling of a retail building (Pavlak et al. 2014). Gray-box model calibration was examined with perturbations to the simple yet popular European Committee for Standardization (CEN)-ISO thermal network model consisting of four resistors and one capacitor (4R1C). The primary objective was to understand whether the computational expense of probabilistic Bayesian techniques (see Sects. 2.5 and 4.6) is justified to provide robustness to signal noise. The Bayesian approach allows parameter interactions and tradeoffs to be revealed, one form of sensitivity analysis, but its full power for uncertainty quantification cannot be harnessed with gray-box or other simplified models. Synthetic (some use the term “surrogate”) data from a detailed building energy simulation program were used to ensure command over latent variables, whereas a range of signal-to-noise and noise colors were considered in the experimental study. The fidelity to the building zone temperature and thermal load was the basis for comparing results. Bayesian calibration outperformed traditional methods on noisy data sets; however, traditional methods were adequate up to an approximately 25% noise level. The thermal gray-box model calibration has the intended application of model predictive control, where speed, accuracy, and robustness are crucial. EnergyPlus building energy simulation software (Crawley et al. 2001) was used in this study to generate synthetic data from a five-zone retail building consisting of a single floor area of 2300 m2 and a peak occupancy, lighting, and appliance power equal to 7.11 m2/person, 32.3 W/m2, and 5.23 W/m2, respectively. The retail building model was chosen for its relative simplicity, allowing straightforward abstraction to a single-zone approximation reduced-order model, as shown in Fig. 10.24. 
The packaged direct expansion (DX) systems typically associated with stand-alone retail buildings may be good candidates for embedded model-based control algorithms as well. Synthetic data were preferred here over real measurements so that latent variables could be controlled in the experimental study and calibration methods could be directly compared and contrasted. Only hourly sensible zone loads (estimated by the calibration techniques) and corresponding temperatures were utilized as synthetic data for the gray-box modeling.
Fig. 10.22 Plots to illustrate how model parameter identification is improved if the sampling rate is increased to 0.1 min during the first 10 min when dynamic transients are pronounced (case (ii) of Table 10.6). (a) First room. (b) Second room. (c) Third room
Fig. 10.23 Plots to illustrate patterned residual behavior when a lower order exponential model is fit to observed data. In this case, a two exponential model was fit to the data generated from the third compartment with random noise added. The adjusted R2 was 0.90 compared to 0.97 for the full model. (a) Time series plot. (b) Observed versus predicted plot
Fig. 10.24 Retail building simulation models: a detailed, EnergyPlus model (used to produce synthetic data) and the 4R1C gray-box model. (From Pavlak et al. 2014)
As discussed earlier, gray-box models are based on the approximation of heat transfer mechanisms by an analogous electrical lumped RC network. A simplified model containing only the dynamics of interest was sought. The only requirement of the parameter estimation approaches considered here, i.e., the Bayesian and non-linear least-squares approaches, was that values be bounded in accordance with physical feasibility. A five-parameter model based on the RC network used in the CEN-ISO 13790 simple hourly method (ISO 2008)5 of Fig. 10.24 was used to forecast hourly winter heating loads for the retail building. In this five-parameter RC network, the heat transfer and storage characteristics of the opaque building envelope materials are represented by R1, R2, and C. These elements link the ambient temperature node (Ta) to a pseudo-interior surface temperature node (Ts), while potential heat storage of the mass materials is accounted for by the mass temperature node (Tm). Glazing heat transfer is represented by a single resistance (Rw) connecting the ambient temperature node to the surface temperature node because the thermal storage of glazing is typically neglected. The variable R3 represents a combined convection and radiation coefficient between the surface temperature node and zone air temperature node (Tz). The convective portions of internal gains (lighting, occupants, and equipment) are applied as a direct heat source to the zone temperature node, shown as Q̇g,c, and the radiant fraction along with glazing transmitted solar gains (Q̇g,r+sol,w) is applied to the surface node.

5 ISO. (2008). “Energy performance of buildings - Calculation of energy use for space heating and cooling.” ISO 13790, Geneva.

An energy balance on the mass temperature node Tm results in:

$$C \frac{dT_m}{dt} = \frac{T_a - T_m}{R_1} + \frac{T_s - T_m}{R_2} \tag{10.50}$$

Since no storage occurs at the surface node, flows entering and leaving the node sum to zero:

$$\frac{T_a - T_s}{R_w} + \frac{T_z - T_s}{R_3} + \frac{T_m - T_s}{R_2} + \dot{Q}_{g,r+sol,w} = 0 \tag{10.51}$$
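The two node balances above can be sketched numerically. The following minimal Python sketch is not the implementation used in the study; it assumes per-unit-area quantities, an explicit Euler time step, and purely illustrative parameter values. All function and variable names are hypothetical. At each step, Eq. 10.51 is solved algebraically for Ts, and Eq. 10.50 is then advanced in time for Tm.

```python
def surface_temperature(t_a, t_z, t_m, q_rs, r2, r3, r_w):
    """Solve the surface-node balance (Eq. 10.51) for Ts."""
    conductance = 1.0 / r_w + 1.0 / r3 + 1.0 / r2
    return (t_a / r_w + t_z / r3 + t_m / r2 + q_rs) / conductance


def simulate_mass_node(t_a, t_z, q_rs, t_m0, r1, r2, r3, r_w, c, dt):
    """Integrate Eq. 10.50 with explicit Euler steps of dt seconds.

    t_a, t_z and q_rs are equal-length sequences (deg C, deg C, W/m2).
    Returns the Tm and Ts values at the start of each time step.
    """
    t_m = t_m0
    t_m_hist, t_s_hist = [], []
    for ta_k, tz_k, q_k in zip(t_a, t_z, q_rs):
        t_s = surface_temperature(ta_k, tz_k, t_m, q_k, r2, r3, r_w)
        t_m_hist.append(t_m)
        t_s_hist.append(t_s)
        dtm_dt = ((ta_k - t_m) / r1 + (t_s - t_m) / r2) / c
        t_m += dt * dtm_dt
    return t_m_hist, t_s_hist


# Illustrative values only: resistances in m2 K/W, capacitance in J/(m2 K).
R1, R2, R3, RW, C = 4.2, 0.11, 0.12, 3.0, 2.0e5
n_steps = 10 * 24 * 60  # ten days at one-minute steps
tm, ts = simulate_mass_node(
    [0.0] * n_steps, [20.0] * n_steps, [0.0] * n_steps,
    t_m0=20.0, r1=R1, r2=R2, r3=R3, r_w=RW, c=C, dt=60.0,
)
print(round(tm[-1], 2), round(ts[-1], 2))
```

With a constant ambient temperature of 0 °C and a zone held at 20 °C, the mass and surface temperatures settle between the two boundary values, which is a quick sanity check before attempting any calibration.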
The heat gain to the space is then represented by the total heat flow to the zone air node
$$\dot{Q}_{sh} = \frac{T_s - T_z}{R_3} + \frac{T_m - T_s}{R_2} + \dot{Q}_{g,c} \tag{10.52}$$
These last three equations can be combined to form a first-order differential equation for zone temperature. This allows one to perform load calculations for the zone that include the effects of dual temperature setpoints with deadbands and system capacity limitations.

The probabilistic perspective not only provides insight into the relationship between sets of model parameters, revealing tradeoffs and compensating interactions, but also lends itself to continuous model uncertainty quantification and tuning, where the posterior distribution of an initial parameter estimation can be used as the prior for a subsequent parameter estimation update once new building performance data have been collected. It has advantages over traditional methods because prior knowledge of the system can be directly incorporated into the estimation task, and methods for addressing sensor noise are inherent to the Bayesian approach. The inference can essentially be thought of as fitting a joint probability distribution to a measured data set. Specifically, conditional probabilities are related through the product rule to derive Bayes’ theorem, which allows the state of knowledge before and after the data are considered to be related. From a parameter estimation perspective, the probability of parameter set θ given measured data D and a knowledge base of the system K can be written as the posterior probability p(θ|DK). Bayes’ theorem then allows the conditional probability p(θ|DK) to be computed from p(θ|K), p(D|θK), and p(D|K):

$$p(\theta \mid DK) = p(\theta \mid K)\,\frac{p(D \mid \theta K)}{p(D \mid K)} \tag{10.53}$$
where p(θ|K) represents prior knowledge of parameter values; p(D|θK) represents the likelihood of observing the measured data set D, given a particular parameter set θ and knowledge of the system K; and p(D|K) is the probability of observing the data set. The relation can be written in an alternate form, where the numerator remains the product of likelihood and prior and the denominator is a normalization factor so that posterior probabilities sum to unity:

$$p(\theta \mid D) = \frac{p(\theta)\,p(D \mid \theta)}{\sum_i p(\theta_i)\,p(D \mid \theta_i)} \tag{10.54}$$
Assuming random Gaussian noise about a measured datum Di, the likelihood of an observation can be determined from its location within the normal distribution with standard deviation σε centered at μ equal to the measured datum:

$$p(D_i \mid \theta) = \frac{1}{\sigma_\varepsilon \sqrt{2\pi}} \exp\left[-\frac{(D_i - M_i)^2}{2\sigma_\varepsilon^2}\right] \tag{10.55}$$
where Mi is the model output given the parameter set θ.
Assuming independent errors, the likelihood of the entire data set is simply the product of the likelihoods of all individual points. The assumption is valid for common HVAC sensors (e.g., temperature probes), but correlated errors could be handled with a slightly different formulation, which is indicative of a fault model. Measurement errors are often correlated due to hysteresis, linearity, sensitivity, zero shift, and repeatability errors. If such correlated errors are of concern, then a Bayesian (or other probabilistic) method that can accommodate correlated measurements could be used. The use of a least-squares approach is more computationally efficient; however, it is less robust with respect to many of the error sources found in sensor networks. Here, uncorrelated temperature and sensible load measurements are assumed; thus, autocorrelation of errors, which is estimated to be small, is ignored. From this model assumption, the easily computable likelihood function, which happens to be equivalent to the least-squares equation, is derived as:

$$p(D \mid \theta) = \left(\frac{1}{\sigma_\varepsilon \sqrt{2\pi}}\right)^{n} \exp\left[-\frac{1}{2\sigma_\varepsilon^2} \sum_{i=1}^{n} (D_i - M_i)^2\right] \tag{10.56}$$
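The way Eqs. 10.54 to 10.56 fit together can be sketched on a discrete parameter grid. The minimal Python sketch below is an illustration under stated assumptions, not the procedure of the study: it uses a hypothetical one-parameter model M_i = θ x_i, noiseless made-up data, a known σε, and a triangular prior with arbitrarily chosen bounds. The likelihood is accumulated in logarithmic form and the largest value is subtracted before exponentiating, which guards against numerical underflow when the data set is long.

```python
import math


def triangular_prior(theta, lo, mode, hi):
    """Unnormalized triangular density on [lo, hi] peaking at mode."""
    if theta <= lo or theta >= hi:
        return 0.0
    if theta <= mode:
        return (theta - lo) / (mode - lo)
    return (hi - theta) / (hi - mode)


def log_likelihood(data, model_out, sigma):
    """Logarithm of Eq. 10.56 for independent Gaussian errors."""
    n = len(data)
    sse = sum((d - m) ** 2 for d, m in zip(data, model_out))
    return -n * math.log(sigma * math.sqrt(2.0 * math.pi)) - sse / (2.0 * sigma ** 2)


def grid_posterior(thetas, data, model, sigma, prior):
    """Normalized posterior probabilities on a discrete grid (Eq. 10.54)."""
    log_post = []
    for th in thetas:
        p = prior(th)
        log_prior = math.log(p) if p > 0.0 else float("-inf")
        log_post.append(log_prior + log_likelihood(data, model(th), sigma))
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]  # shift avoids underflow
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical one-parameter model M_i = theta * x_i with noiseless data.
x = [float(i) for i in range(1, 11)]
data = [2.0 * xi for xi in x]
thetas = [0.01 * k for k in range(501)]  # grid on [0, 5]
post = grid_posterior(
    thetas, data,
    model=lambda th: [th * xi for xi in x],
    sigma=1.0,
    prior=lambda th: triangular_prior(th, 0.0, 2.5, 5.0),
)
best = thetas[post.index(max(post))]
print(best)
```

A grid is practical only for a handful of parameters; for the five or six parameters considered here, the posterior would normally be explored by sampling, but the normalization and the log-domain arithmetic are the same.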
Evaluating this equation directly can pose numerical issues that can be alleviated by computing the logarithm of the posterior rather than the posterior directly. Taking the natural logarithm results in

$$\ln[p(\theta \mid D)] \propto \ln[p(\theta)] - n \ln\left(\sigma_\varepsilon \sqrt{2\pi}\right) - \frac{1}{2\sigma_\varepsilon^2} \sum_{i=1}^{n} (D_i - M_i)^2 \tag{10.57}$$

After computing the right-hand side, exponentials may be taken, and the normalization factor can be computed to determine posterior probabilities. To express the belief that the best parameters lie within a particular range, triangular priors were placed on each resistance and capacitance parameter, with the same upper and lower bounds used in the least-squares approach.

For the traditional calibration, model parameters were identified using least-squares algorithms that minimize the root mean square error (RMSE) between the reduced-order model predictions and synthetic time series data. A two-stage optimization was implemented that performs a direct search over the parameter space and then executes a non-linear least-squares algorithm starting from the best location identified through the direct search. The direct search is performed over uniformly random points located within a bounded parameter space. A trust-region Newton method was used to solve the constrained non-linear least-squares problem. The parameter space was constrained to consider only physically plausible values, e.g., no negative resistances, while allowing room for estimates to incorporate geometry and construction uncertainty as well, e.g., lower resistances could be chosen to
compensate for an underestimation of actual external surface area. To evaluate the reliability of this method, the least-squares estimation was repeated 2500 times, with each iteration starting from a new set of 500 randomly generated direct search points.

The least-squares and Bayesian calibration procedures were repeated, considering noise of various colors and amplitudes to better evaluate the ability of each method to handle measurement error and data uncertainty. Noise was added to the three-week synthetic load data from the detailed building energy model to simulate sensor error and data uncertainty. The RC parameters were estimated from the noisy data sets, and performance was compared with respect to the noiseless case. For the least-squares case, a data length of one day was also considered to test the impact of significantly reducing the available data. Previous work with inverse gray-box RC models has shown good results using 2–3 weeks of training data for traditional methods.

For the Bayesian approach, triangular priors were considered to evaluate the impact of informative priors. In reality, expert knowledge can often provide more than simple upper and lower bounds. Because Bayesian methods provide a direct means of incorporating such information, considering this feature in the analysis was desired. Because the value of σε is typically not known, a natural extension is to estimate the measurement error standard deviation along with the five model parameters. The Bayesian inference was repeated with σε treated as a free (sixth) parameter to estimate the most likely value. A uniform prior from 0.1 to 250 kW (approximately 0.03–80% of maximum heating load) was placed on σε, which was sampled along with the R and C model parameters. The posterior maximum for all parameters occurred with a σε value of 11 kW (approximately 4% of maximum heating load).
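The two-stage strategy of the traditional calibration (a random direct search within bounds followed by a local refinement from the best point found) can be sketched as follows. This is a minimal illustration under several assumptions, not the code of the study: a hypothetical two-parameter model y = a exp(b t) stands in for the RC network, the data are noiseless and made up, and a simple bounded compass (pattern) search replaces the trust-region Newton solver so that the sketch needs only the Python standard library. A bounded trust-region least-squares solver from a numerical library would be the closer analogue of the second stage.

```python
import math
import random


def sse(params, t, y):
    """Sum of squared errors for the stand-in model y = a * exp(b * t)."""
    a, b = params
    return sum((yi - a * math.exp(b * ti)) ** 2 for ti, yi in zip(t, y))


def two_stage_fit(t, y, bounds, n_random=500, seed=0):
    """Random direct search, then a bounded compass search from the best point."""
    rng = random.Random(seed)
    # Stage 1: evaluate uniformly random points inside the bounded space.
    candidates = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_random)]
    best = min(candidates, key=lambda p: sse(p, t, y))
    best_cost = sse(best, t, y)
    # Stage 2: local refinement; steps are halved when no move improves the fit.
    steps = [0.1 * (hi - lo) for lo, hi in bounds]
    while max(s / (hi - lo) for s, (lo, hi) in zip(steps, bounds)) > 1e-9:
        improved = False
        for j, (lo, hi) in enumerate(bounds):
            for direction in (1.0, -1.0):
                trial = list(best)
                # Clip each trial move so the parameter stays physically feasible.
                trial[j] = min(hi, max(lo, trial[j] + direction * steps[j]))
                cost = sse(trial, t, y)
                if cost < best_cost:
                    best, best_cost, improved = trial, cost, True
        if not improved:
            steps = [0.5 * s for s in steps]
    return best, best_cost


# Made-up noiseless data from a = 900, b = -0.5, sampled every minute for 30 min.
t = [float(i) for i in range(31)]
y = [900.0 * math.exp(-0.5 * ti) for ti in t]
params, cost = two_stage_fit(t, y, bounds=[(0.0, 2000.0), (-2.0, 0.0)])
print(params, cost)
```

Repeating the call with different values of `seed` mimics, in a small way, the repeated restarts used to judge how reliably the procedure returns the same estimates.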
Estimates of those RC parameters described above are summarized in Table 10.7, and the largest discrepancy between the two methods was observed in the external envelope resistance (R1) and window resistance (Rw) estimates. This difference may suggest that the model is relatively insensitive to the parameters R1 and Rw, and consequently
insensitive to ambient temperatures. This interpretation seems plausible because heat transfer in commercial buildings is often dominated by internal gains, and this particular building has notably high lighting loads. The results of the two calibration methods were compared directly by plotting least-squares point estimates on posterior distribution contour slices to contrast and discuss the difference in information available from each method (Fig. 10.25). The largest contour is the line of smallest posterior probability and the darker shades refer to higher values of posterior probability (the innermost circle is the peak). Within the figure presented is a smaller overview plot that shows the NLSQ estimates as circles found from different starting values as well as the domain of the posterior distribution confined by a dashed line. The posterior slice is then enlarged to occupy most of the figure. Relative to the variation of the NLSQ estimates, the domain of posterior probability was found to be relatively small. In general, the most probable parameter set from the six-parameter estimation agreed with the least-squares solutions. Figure 10.25 shows a posterior slice for the parameter pair R1-R2 and reveals that several least-squares runs agreed with the Bayesian inference for R1, although the median value is slightly lower. The R2 values were in close agreement as well, although slightly higher for the Bayesian results. Because R1 and R2 represent the envelope material resistances, choosing higher values for both parameters is not directly explainable from Fig. 10.25. However, the Bayesian inference chose a lower glazing resistance, which results in a similar overall envelope performance. In summary, the study revealed least-squares and Bayesian calibration methods performed similarly, regardless of noise, when using uniform priors. 
Bayesian methods did show the potential to outperform on noisy (i.e., >25% noise level) data sets when utilizing informative (such as triangular) priors. Bayesian calibration was also able to provide further insight into potential parameter interactions and tradeoffs as well as parameter uncertainty. However, the additional information comes at the expense of added computational cost. The Bayesian calibration performed in this work required approximately 100 times more CPU time than the traditional calibration. Additionally, storage and memory requirements are higher as well because all information is stored in the Bayesian analysis. Given the intended application of embedded model-based control algorithms and the expected noise levels, traditional calibration methods appear sufficient for the desired thermal model identification. Future work should consider Bayesian methods for more complex buildings with more parameters and greater uncertainty, where the CEN-ISO model used in this work may be inadequate. The Bayesian analysis can also be extended to include uncertainty in the model structure, which may prove beneficial when considering more complex scenarios and applications.

Table 10.7 Model parameter estimates for the retail building following the two approaches

Parameter          NLSQ median   NLSQ 95% confidence     Bayes (σε estimated) pmax   Bayes 95% credible interval
R1 (m2 K W-1)      4.23200       [4.23179, 4.23228]      4.306                       [4.041, 4.585]
R2 (m2 K W-1)      0.10648       [0.10647, 0.10649]      0.111                       [0.096, 0.120]
R3 (m2 K W-1)      0.12274       [0.12272, 0.12276]      0.113                       [0.099, 0.142]
Rw (m2 K W-1)      2.99975       [2.99967, 2.99983]      2.078                       [1.728, 2.837]
C (kJ m-2 K-1)     198.634       [198.601, 198.664]      200.1                       [184.3, 230.3]
σε (kW)            –             –                       11.0                        [10.26, 11.30]

From Pavlak et al. (2014)
Note: NLSQ nonlinear least squares

Fig. 10.25 Posterior contour plot for parameters R2 and R1 with least-squares solutions to compare Bayesian versus least-squares parameter estimation with σε = 11 kW. LS refers to solutions found using non-linear least squares. (From Pavlak et al. 2014)
10.6 Calibration of Detailed Simulation Programs

10.6.1 Purpose

As an introduction, a simple definition of calibration is: “a process of reconciling the simulated system response with the measured one.” Such a reconciliation involves much greater effort (commensurate with the complexity of the system) in addition to extensive domain knowledge related to the component models behind the simulation program and their interconnections. This approach is much more demanding compared to the effort required for black-box modeling (either curve-fitting or neural network based; see Chaps. 5 and 9) or even using gray-box models (as in Sects. 10.2 and 10.4). Thus, the logical question to ask is: what is the purpose or added benefit to warrant this extra labor? That such an approach would potentially allow more reliable and accurate predictions of future system performance as it exists compared to black-box or gray-box models is questionable, and not the main intent. The specific benefit(s) would depend on the system model and its intended application; but ideally, “the purpose of calibrating a simulation model is to evaluate the effect on system response or functioning under proposed elaborate or/and complex changes to the components and to the existing system operation which can be directly simulated by modifying the model inputs.” This aspect will be expanded later in this sub-section with a specific application in mind, namely, detailed building energy use models.

10.6.2 The Basic Issue
Detailed simulation models involve a coupled, complex set of models with a large number of parameters, where a precise relationship between outputs and inputs cannot be expressed analytically because of the complex nature of the components and their coupling.6 Solving such sets of equations requires computer programs with relatively sophisticated solution routines, and this is what is done under design studies. However, under the inverse situation, such high-fidelity tools suffer from bias and randomness/stochasticity in terms of mismatch between simulation results and actual measured performance. The inaccurate representation could be due to various reasons unrelated to the accuracy of the set of models themselves. Common causes are differences between design-assumed and actual values of some of the primary and secondary variables, or/and improper specification of the system functioning and operation. This aspect is further expanded in Sect. 10.6.4 in terms of a specific application.

Calibration and validation of such models have been addressed in several books and journal papers in diverse areas of engineering and science such as environmental science, hydrology, epidemiology, and structural engineering. The crux of the problem is that the highly over-parameterized situation leads to a major difficulty aptly stated by Hornberger and Spear (1981) as: “. . . most simulation models will be complex, with many parameters, state variables and non-linear relations. Under the best circumstances, such models have many degrees of freedom and, with judicious fiddling, can be made to produce virtually any desired behavior, often with both plausible structure and parameter values.” This process is also referred to in the scientific community as GIGOing (garbage in–garbage out), where a false sense of confidence can result since precise outputs are obtained by arbitrarily restricting the input space (Saltelli 2002). Thus, the basic issue with calibrating detailed simulation programs is that of over-parametrization: there are far too many parameters used to specify the various elements/components and their operation/control schemes compared to the performance data available/collected. Such a detail in specification is needed for accurate modeling during design, but the same level of specificity becomes a major albatross around the neck of the inverse modeler (a metaphor for a burden one cannot escape from)!7

6 For a brief exposure and discussion of this issue, refer to text related to Fig. 10.1.

Fig. 10.26 Elements of a building energy simulation program. (Modified from Ayers and Stamper 1995)
10.6.3 Detailed Simulation Models for Energy Use in Buildings

Selecting a specific application to illustrate the different calibration approaches would be more insightful than providing too generalized a view. Towards that end, the aspect of calibrating detailed building energy use simulation programs has been selected. It is left to the reader to make suitable extensions/adaptations to their domain of study. Numerous reference handbooks and textbooks are available which deal with the principles of modeling and simulating energy flows in buildings to select and size equipment and calculate the annual energy consumption (e.g., ASHRAE, Fundamentals 2021; Reddy et al. 2016). Broadly, the various elements of a building energy simulation program such as the widely popular DOE-2 (Birdsall et al. 1990; DOE-2 1993) and EnergyPlus (2009) are depicted in Fig. 10.26.
7 With apologies to Samuel Taylor Coleridge’s “The Rime of the Ancient Mariner”
The description box includes building physical inputs describing/specifying location, physical envelope (walls, windows, etc.), the zoning of the building, pertinent information related to air infiltration, internal equipment/lights/occupants as well as the diurnal and weekly operation schedules and thermostat control strategy. These and the hourly weather data allow one to calculate the building heating and cooling loads, which refer to the hourly thermal heat rates that must be supplied to or removed from the interior spaces of a building to maintain the desired comfort conditions (Fig. 10.27). The thermal mass of a building is sufficiently significant to delay the heat gains or losses, and so dynamic models are warranted. Even though finite difference methods are used in certain simulation programs, the professional community has chosen to adopt the transfer function time series method involving present and lagged terms as described in Sect. 8.6.2. This approach, however, requires the use of different sets of weighting factors to model the effects of different heat gain components, namely:
(a) Thermal mass of the building envelope subject to changes in outdoor ambient air conditions and in solar radiation over the day
(b) Directly transmitted solar radiation which is first absorbed in the internal mass of the various elements within a room/space (such as slabs, internal walls, and furniture) and released later in time
(c) Internal mass interacting with operational schedules causing diurnal fluctuations in internal load (lights, equipment, and occupants)
(d) Internal mass interaction with control strategies of the indoor environment (diurnal thermostat set-up or set-back or variations in supply air volume)

The above flows are referred to as the heat gain, which is the rate at which heat is transferred by radiation and conduction/convection through the windows, walls, or roofs. The heat flows through opaque construction elements are then split into their radiative and convective parts, the latter appearing as loads immediately (Fig. 10.28). The individual radiative heat flows are not immediately transferred to the indoor air space/node, with the delay depending on the type of room construction and internal masses such as furniture. A second set of weighting factors must be specified to account for dynamic delayed release of the radiative component to the room air by means of convection. The cooling load is the rate at which the cooling equipment would have to remove the sum of convective thermal energy from the indoor air to maintain constant indoor temperature and humidity. Finally, the heat extraction rate is the rate at which the cooling equipment actually removes thermal energy from the space when the indoor air temperature is varied during the day and over a week at a preset thermostat schedule (as is common).

Fig. 10.27 The sources of heat flow in a load calculation. (Reddy et al. 2016)

Fig. 10.28 Schematic illustrating relationships between heat gain, cooling load and heat extraction rate. (From Reddy et al. 2016)
The second element (Fig. 10.28) is to use this heat extraction rate information to determine system loads which consist of secondary systems, i.e., energy distribution systems such as piping/pumps for water, and ducts/fans for air, heating/ cooling coils, and supervisory control of equipment and sub-systems. Note that there is a huge selection of generic secondary system types with different combinations of components (refer to any appropriate textbook, for example, Reddy et al. 2016). Usually, the heat extraction rate is equal to the secondary system loads excluding the electric power needed to run the pumps or fans. Once determined, these secondary loads serve as the basis of calculating the yearlong hour-by-hour (or sub-hourly) energy consumed by the primary systems such as chillers and boilers. The simulation program must separately consider energy flows and energy use of the conditioning equipment for each zone of the building. Often the secondary thermal loads materialize at the cooling/heating coil level for several or all the zones taken together, and this further complicates the calibration. The energy used (electricity and natural gas) by the primary systems is invariably monitored and serves as the main data channel for calibration. More recently, indoor air temperature values are also monitored at each zone, and this data is also used to improve the calibration. It is important to point out that the secondary and primary equipment energy use, with the exception of large chilled water distribution and thermal energy storage systems, does not usually require dynamic models, and so steady-state models are adequate at hourly simulation time steps.
10.6.4 Uses of Calibrated Simulation
Utility deregulation (initiated from the 1990s) and the more recent electric market transformation efforts to better integrate solar photovoltaic and other renewable energy systems with the existing electric power grid have led to new thinking towards proactive load management of single and multiple buildings. The proper implementation of demand-side management (DSM) measures involved first the identification of the appropriate energy conservation measures (ECMs), and then assessing their impact or performance once implemented. This need resulted in monitoring and verification (M&V) activities acquiring key importance. Typically, retrofits involved rather simple energy conservation measures in numerous similar buildings. The economics of these retrofits dictated that the associated M&V effort also be low-cost. This led to utility bill analysis (involving no extra metering cost), and even analyzing only a representative sub-set of the entire number of retrofitted residences or small commercial buildings. Statistical analysis methods were usually adequate for this task.

In contrast, large commercial buildings have much higher utility costs, and the HVAC&R systems are not only more complex but more numerous as well. Hence, the retrofits were not only more extensive, but the large cost associated with them justified a relatively large budget for M&V as well. The black-box and gray-box statistical analysis tools used for DSM projects were often found to be too imprecise and inadequate, which led to subsequent interest in the development of more specialized inverse modeling and analysis methods which use hourly and disaggregated end-use monitored energy data from the building along with other variables such as climatic variables and operating schedules. Most of the topics presented in this book also have direct relevance to such specialized modeling and analysis methods. A well calibrated simulation model potentially offers numerous advantages compared to regression models or gray-box statistical approaches. Initial attempts, dating back to the early 1980s, involved using utility bills with which to perform the calibration. A large number of energy professionals are involved in performing calibrated simulations, and many more profess an active interest in this area. At the outset, it is important to state that the whole area of calibrating detailed building energy simulation programs still lacks maturity in many respects (some of which will be stated below). Even though this area requires further research before it can be used routinely by the professional community, it is useful for those involved in applied data analysis and modeling to be familiar with the general issue and with some of the work done in this area (for reviews, see Reddy 2006 and Coakley et al. 2014). Calibrated simulation can be used for the following purposes:
(a) End-use profiles from aggregated data: To provide an electric utility with a breakdown of baseline, cooling, and heating energy use for one or several buildings based on their utility bills, in order to predict the impact of different load control measures on the aggregated electrical load (Mayer et al. 2003).
(b) Cost effective efficiency improvements: To support investment-grade recommendations made by an energy auditor tasked to identify cost effective ECMs (equipment change, schedule change, control settings, ...) specific to the individual building and determine their payback.
(c) Monitoring and verification (M&V): Calibrated simulation is useful (i) to identify a proper contractual baseline energy use against which to measure energy savings due to ECM implementation; (ii) to allow making corrections to the contractual baseline under unanticipated but all-too-common changes (creep in plug load, changes in operating hours, changes in occupancy or conditioned area, addition of new equipment, ...); (iii) when the M&V requires that the effect of an end-use retrofit be verified using only whole-building monitored data; (iv) when retrofits are complex and interactive (e.g., lighting and chiller retrofits) and the effects of individual retrofits need to be isolated without having to monitor each sub-system individually; (v) when either pre-retrofit or post-retrofit data are inadequate or not available at all (e.g., for a new building or if the monitoring equipment is installed after the ECM has been implemented); and (vi) when the length of post-retrofit monitoring for verification of savings needs to be reduced (ASHRAE 14 2014).
(d) Better control and operation: To provide facility/building management services, owners, and ESCOs with the capability of implementing: (i) continuous commissioning or fault detection (FD) measures to identify equipment malfunction and take appropriate action (such as tuning/optimizing HVAC and primary equipment controls; Claridge and Liu 2001), (ii) optimal supervisory control, equipment scheduling, and operation of the building and its systems, either under normal operation or under active load control in response to real-time price signals.
(e) To meet compliance or certification requirements: The drive for green and energy efficient buildings championed by agencies such as the Green Building Council has resulted in the need for architects to demonstrate by actual monitoring that their specially designed building is, say, 50% more energy efficient (EE) than the baseline, i.e., one built according to the prevailing energy code. Since such a counterfactual baseline is not built, the only recourse is to develop a calibrated simulation model for the energy efficient building and then make suitable changes to the input parameters of the simulation model to reflect how the baseline building would have performed.
The inferred energy savings would be the basis of awarding a rating certificate, say gold or silver, to the building. In addition, an increasing number of cities and even the federal government are mandating by legislation that new buildings demonstrate such an energy use improvement prior to being granted an occupancy certificate.
10.6.5 Causes of Differences
There are several reasons why there are large differences between simulated and measured building energy performance (even with no uncertainty in the measurements themselves). The differences (which can be interpreted as uncertainties in the simulated results) can arise from (de Witt 2003):
(a) Specification uncertainty is due to differences between the way the building (physical construction as well as its materials and systems) is simulated and the way it is actually built. For example, the wall insulation could have been improperly sprayed, causing thermal bridging, so that the actual wall resistance is lower than the documented design value.
(b) Modeling uncertainty is due to the simplifying assumptions in the models of the various energy flows in the building. To begin, there is no uncertainty about the principles; the relevant laws of physics are known beyond a shadow of a doubt. How to apply them is another matter. Some of the processes are extremely complicated and difficult to model/calculate; one needs to know which approximations are acceptable in practice. Prime among them is the number of spaces and thermal zones assumed to represent, say, a large office building. While the design of such buildings requires the specification of dozens of zones or more, the calibration process can only support far fewer zones, and so simplifications have to be made.
(c) Numerical uncertainty arises because of the manner of coding and solving the various equations (discretization, numerical convergence, ...). A methodology to identify and diagnose differences in simulation predictions that may be caused by algorithmic differences, modeling limitations, coding errors, or input errors has been developed and made into a standard (ANSI/ASHRAE Standard 140 2011).
(d) Scenario uncertainty is due to improper specification of the forcing functions. For example, the weather conditions used for the building simulation may be from a location several miles away from the building. Another cause could be that the equipment scheduling of the building and the number of occupants are only vaguely known and so were improperly specified as inputs. Uncertainties associated with human occupancy and human-driven internal gains (lights and equipment) contribute greatly to scenario uncertainty and are difficult to combat.
The conventional view is that uncertainty sources (b) and (c) are minor, and so most of the focus of calibration studies is on the other two types of uncertainties. Further, due diligence by the data analyst prior to undertaking a calibration study would involve carefully looking at the as-built drawings and/or making a detailed walk-through site audit to gather the necessary actual information on the building envelope, the secondary and primary equipment, and the building operation and scheduling. This diligence would serve to reduce errors arising from sources (a) and (d). Unfortunately, time and cost constraints of a consulting company
often preclude such an effort; this is a major cause for the differences between simulation and measurement.
10.6.6 Definition of Terms
The word “calibration” is used rather loosely, and it is appropriate to distinguish between three allied terms (Subbarao et al. 2022):
(a) Tuning is a process whereby a subset of the very large number of raw inputs/parameters to the simulation program are varied in some manner to reduce the difference between simulation results and measured data. Such a brute-force manual approach to a vastly underdetermined or over-parametrized problem does not offer a meaningful interpretation of the parameters of the reconciled model. Variants based on this approach should be referred to as “raw input tuning” (RIT) and, unfortunately, are the ones most widely used currently (described in Sect. 10.6.7). While several professionals still adopt a manual tuning approach, albeit guided by heuristics based on domain knowledge, automated tuning methods have been proposed (for example, Chaudhary et al. 2016) and are gaining popularity.
(b) Parameter estimation is a more physically aligned approach and is somewhat akin to the simplified gray-box parameter estimation approach (described in Sects. 10.2 and 10.4), but using the simulation model and not a simplified model. Two general approaches have been proposed:
(i) A more scientifically grounded variant of the general tuning process is the semi-analytical method (SAM) described in Sect. 10.6.8.
(ii) Another approach seeks to develop corrective equations or factors for physically relevant parameters or sets of macro-parameters which are directly related to the types of applications for which calibrated simulation is subsequently to be used (listed in Sect. 10.6.4). The more realistic/accurate the model inputs assumed for the simulation, the smaller the magnitude of these corrective functions. This approach should be strictly referred to as “physical parameter estimation” (PPE), and one such attempt is described in Sect. 10.6.9.
(c) Calibration, in the context of metrology, essentially entails modifying the raw output of a measuring device with a previously determined corrective correlation (deduced by comparison against a standard reference instrument) in order to obtain an improved or more realistic measurement value. In the context of simulation programs, calibration should imply that the “outputs” of a simulation model are being corrected in some manner (either purely empirically, based on say a black-box model such as a neural network model, or on a physical basis as done in the PPE method), and that it “ought not to be the inputs” which are tuned as done under the RIT approach. A physical interpretation of the corrections is not to be expected. However, this is not how the building energy industry uses this word. Provided one is aware of this distinction, the pervasive use of the term “calibration” should be acceptable as a catch-all term for methods meant to reconcile simulations with measured data.
10.6.7 Raw Input Tuning (RIT)
While being the crudest, this is the most widely adopted method. The analyst will typically “adjust” simulation inputs and operating parameters on a trial-and-error basis until the program output matches the measured/known data. This “fudging” process often results in the manipulation of many variables, which may significantly decrease the credibility of the entire simulation. The problem is further compounded by the fact that it is a dynamic matching over one year, and not a static one at one condition or time frame (as in classical optimization problems). Hence, the analyst relies on judgment and on the circumstances specific to the problem (e.g., some of the model inputs may be better specified than others) to first reduce the order of the model by selecting “best-guess” values of certain parameters, and then to calibrate the remaining parameters. The process depends largely on user knowledge, past experience, statistical expertise, and engineering judgment, and involves an abundance of trial and error until an acceptable calibrated model is reached. It is thus highly dependent on the personal judgment of the analyst doing the calibration. Which parameters to freeze, how many parameters one can hope to identify, and what the uncertainty of such calibrated models will be are important aspects which are not properly considered. One can distinguish between two broad sets of methods under the RIT approach:
• Manual, Iterative, and Pragmatic Methods: Until the late 1990s or so, utility bills (whole-building electricity and gas use at monthly time scales) were the only data available for tuning. If basic input data on the location, building geometry and its envelope, secondary and primary systems, as well as information on the building scheduling and controls is available, a simulation run is created (referred to as the “audit building”). Differences between simulated and measured values provided a pathway to iteratively tune the inputs. Walk-through audits and spot and/or short-term monitoring of certain key end-uses were often used to improve the calibration. In the case of hourly data, the analyst is easily overwhelmed by the quantity of data, and so appropriate graphical plots can help in the input selection. These include carpet plots, 3-D time series plots of energy use and residuals, and superposed and juxtaposed binned box, whisker and mean (BWM) plots, in addition to the standard 2-D plots such as scatter plots and time series plots. The interested reader can refer to Reddy (2006) for a more complete description. The belief is that the analyst learns the necessary “skills of such a trade” gradually over time and, relying on appropriate heuristic knowledge along with some key information about the building operation, gets more adept at overcoming the inherent degree of indeterminacy; unfortunately, such skills are not easily taught.
• Monte Carlo Methods: Section 6.7 dealt with the design of computer simulation experiments, with a focus on detailed hourly time step simulation programs used for the design of energy efficient buildings; such programs predict the hourly energy consumption and the indoor comfort conditions over a whole year for any building geometry and climatic conditions. It presented the various methods of sampling the large set of input design variables with pre-specified ranges of variability (Latin Hypercube Monte Carlo, LHMC, is often the preferred method), along with how to perform regional sensitivity analysis to identify important/influential input variables (similar to screening) and also to narrow down their range of variability based on a chi-square test for non-randomness. The same approach can be used for calibration, but here the screening for influential inputs would be based on the goodness-of-fit criteria (such as RMSE or CV and MBE, see Sect. 5.3.2) between simulated and measured results. This approach is best described by a simplified version (see footnote 8) of a research study fully documented in Reddy et al. (2007). The scope of the study was limited to the case when only monthly utility bills were available (which was the common scenario at the time the study was done); this approach was subsequently extended to cases when calibration is done with hourly energy use data. The approach was evaluated with both synthetic data and real measured data; only the results of the former are presented here. The advantage of evaluating a methodology with synthetic data is that the identified parameters can be directly compared with the known values used to generate the data. The Monte Carlo calibration methodology can be summarized by the following phases (see Fig. 10.29):
(i) Gather relevant building, equipment, and sub-system information along with performance data in the form of either utility bills and/or hourly monitored data.
(ii) Identify a building energy program which has the ability to simulate the types of building elements and systems present, and set up the simulation input file to be as realistic as possible.
(iii) Reduce the dimensionality of the parameter space by resorting to walk-through audits and heuristics. For a given building type, identify/define a set of influential parameters and building operating schedules along with their best-guess estimates (or preferred values) and their range of variation, characterized either by the minimum and maximum values or by the upper and lower 95% probability threshold values. The influential parameters selected should correspond to specific and easy-to-identify inputs to the simulation program.
(iv) Perform a “bounded” grid calibration (or unstructured or blind search) using LHMC trials or realizations with different combinations of input parameter values, assuming a pre-specified range of variability. A preliminary filtering identifies the small set of trials which meet pre-specified goodness-of-fit criteria; regional sensitivity analysis (refer to Sect. 6.7.4) on this set then provides a means of screening, i.e., identifying the weak and the strong parameters, as well as determining narrower bounds of variability for the strong parameters. In Fig. 10.30, the results of 3000 runs or points are shown, but only those close to the origin which meet some pre-set goodness-of-fit criteria are the ones on which the regional sensitivity analysis is performed.
(v) While performing calibration, the many degrees of freedom may produce a good calibration overall even though the individual parameters may be incorrectly identified (e.g., the monthly fits using utility bills for several of the top plausible sets of inputs closely match, as shown in Fig. 10.31). Subsequently altering one or more of these incorrectly identified parameters to mimic the intended ECM is very likely to yield biased predictions. One work-around is to base inferences on several plausible predictions, which partially overcomes the danger of misleading predictions from “so-called” calibrated models. Thus, rather than using only the “best” calibrated solution of the input parameter set (determined solely by how well it fits the data), a small number of the top plausible solutions are identified, and the effect of the intended ECMs is evaluated with each of them. Not only is one likely to obtain a more robust prediction of the energy and demand reductions, but this also allows their associated prediction uncertainty to be determined.
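Phases (iv) and (v) can be illustrated with a minimal sketch. It is not the implementation used in Reddy et al. (2007): the two-parameter `toy_simulation` is a hypothetical stand-in for a detailed simulation program, the monthly driver values and parameter bounds are made up, the goodness-of-fit indices use simple definitions with n in the denominator (Sect. 5.3.2 gives the definitions used in the text), and the 15% and 5% filters are only illustrative. Regional sensitivity analysis is not included.

```python
import random
import statistics


def latin_hypercube(bounds, n_trials, rng):
    """Split each parameter range into n_trials equal strata, draw one value
    per stratum, and shuffle each column so strata are paired at random."""
    columns = []
    for low, high in bounds:
        width = (high - low) / n_trials
        column = [low + (k + rng.random()) * width for k in range(n_trials)]
        rng.shuffle(column)
        columns.append(column)
    return [list(row) for row in zip(*columns)]


def toy_simulation(params, drivers):
    """Hypothetical stand-in for the simulation program:
    monthly energy use = base load + slope * monthly driver."""
    base, slope = params
    return [base + slope * d for d in drivers]


def cv_rmse(measured, simulated):
    """RMSE of the residuals as a percentage of the mean measured value."""
    n = len(measured)
    rmse = (sum((m - s) ** 2 for m, s in zip(measured, simulated)) / n) ** 0.5
    return 100.0 * rmse / statistics.mean(measured)


def nmbe(measured, simulated):
    """Mean residual (measured minus simulated) as a percentage of the mean."""
    bias = sum(m - s for m, s in zip(measured, simulated)) / len(measured)
    return 100.0 * bias / statistics.mean(measured)


rng = random.Random(0)
drivers = [5, 8, 12, 18, 24, 29, 32, 31, 26, 19, 11, 6]  # made-up monthly driver
measured = toy_simulation([100.0, 4.0], drivers)          # synthetic "utility bills"

bounds = [(50.0, 150.0), (1.0, 8.0)]                      # assumed parameter ranges
trials = latin_hypercube(bounds, 3000, rng)

scored = []
for params in trials:
    simulated = toy_simulation(params, drivers)
    scored.append((cv_rmse(measured, simulated),
                   abs(nmbe(measured, simulated)),
                   params))

# Preliminary filtering, then keep a small set of top plausible solutions
plausible = [s for s in scored if s[0] <= 15.0 and s[1] <= 5.0]
top = sorted(plausible, key=lambda s: s[0])[:20]
print(len(plausible), "plausible trials; best CV-RMSE = %.2f%%" % top[0][0])
```

Each entry of `top` would then be used to simulate the intended ECM, so that a spread of predicted savings, rather than a single value, is obtained.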
8 The original study suggested an additional phase involving refining the estimates of the strong parameters after the bounded grid search was completed. This could be done by one of several methods such as analytical optimization or genetic algorithms. This step has been intentionally left out in order not to overly burden the reader.
Fig. 10.29 Flowchart of the methodology for calibrating detailed building energy simulation programs and then using them to determine ECM savings. (Modified from Reddy et al. 2007)
The top 20 calibrated solutions (an arbitrary choice) in terms of a weighted CV were selected as constituting the plausible set of solutions. The median and the inter-quartile standard deviation (i.e., that of the 10 trials whose predicted savings are between the 25th and 75th percentiles) are then calculated from these 20 predictions. The predictive accuracy of the calibrated simulations has been investigated using four different sets of ECMs, some of which would result in increased energy use and demand, such as ECM_C. The results are computed as the predicted % savings in energy use and demand compared to the original building (recall that this is a simulated synthetic building and so the “true” parameters are known in advance). From Fig. 10.32, one notes that the calibration methodology seems to work satisfactorily for the most part (with the target values contained within the spread of the whiskers of the interquartile range of the top 20 calibrated simulation predictions).
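The summary statistics described above can be sketched as follows. The 20 savings values and the "true" saving are invented for illustration and are not results from Reddy et al. (2007); the quartiles are computed with the "inclusive" method of Python's `statistics.quantiles`, which is one of several common conventions.

```python
import statistics

# Hypothetical predicted savings (%) for one ECM from 20 plausible solutions
savings = [12.1, 9.8, 11.4, 10.6, 13.0, 8.9, 10.1, 11.9, 12.6, 9.3,
           10.9, 11.1, 14.2, 7.5, 10.4, 11.6, 12.3, 9.6, 10.8, 11.3]
true_saving = 11.0  # known only because the building is synthetic (made-up value)

# Quartiles of the 20 predictions; the middle cut point is the median
q1, median, q3 = statistics.quantiles(savings, n=4, method="inclusive")

# The trials lying between the 25th and 75th percentiles
inside = [s for s in savings if q1 <= s <= q3]
spread = statistics.stdev(inside)

print("median = %.2f%%, interquartile range = [%.2f, %.2f]%%" % (median, q1, q3))
print("trials inside the interquartile range:", len(inside))
print("standard deviation of those trials: %.2f" % spread)
print("target contained in interquartile range:", q1 <= true_saving <= q3)
```

With 20 distinct predictions, ten fall between the two quartiles, which matches the "10 trials" mentioned in the text.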
Fig. 10.30 Scatter plots depicting goodness-of-fit of annual electricity use (in kWh/year) of a synthetic building with 3000 LHMC trials. Only those of the numerous trials close to the origin inside the box drawn are then used for parameter screening using regional sensitivity analysis. (From Reddy et al. 2007)
Fig. 10.31 Time series plots of the three energy use channels for the best calibrated trial corresponding to a one-stage run with 3,000 LHMC trials. (a) Electricity use. (b) Electricity demand. (c) Gas + thermal energy. (From Reddy et al. 2007)
Fig. 10.32 Electricity use (kWh) savings (in % of the baseline synthetic building values) along with 25th and 75th percentiles for the four ECM measures predicted by the top 20 calibrated solutions from different numbers of LHMC trials (the numbers such as 1500, 3000 etc. correspond to the numbers of LHMC runs, while the designation (a, b) corresponds to slightly different ways of performing the parameter screening). The “correct” values corresponding to the simulated synthetic building are the ones without whiskers. (From Reddy et al. 2007)
10.6.8 Semi-Analytical Methods (SAM)
Several researchers have made computational advances in extending the Monte Carlo approach to sample simulation inputs and to automate the RIT calibration process; one such development is the “Autotune methodology” (Chaudhary et al. 2016). The Autotune methodology also uses a database of building energy profiles, in combination with data mining, to automatically modify the simulation inputs in a semi-guided manner.
There are, however, several ad hoc elements in the Monte Carlo calibration approach. The number of influential inputs to tune, and which specific ones, was determined from their impact on the goodness-of-fit criteria. But does the richness of the data support the identification of so many inputs? An overall conclusion of the previous Monte Carlo approach was that trying to calibrate a detailed simulation program with only utility bill information is never likely to be satisfactory because of the numerical identifiability issue. One needs to either enrich the data set by non-intrusive sub-monitoring of key energy uses at, say, hourly time scales for a few months if not the whole year, measure non-energy related variables such as indoor air temperature, or reduce the number of strong parameters to be tuned by performing controlled experiments, if possible when the building is unoccupied, to estimate their numerical values.
An attempt to enhance the LHMC approach by adopting a general analytic framework with a firmer mathematical and statistical basis is that by Sun and Reddy (2006). A system-level simulation that consists of a set of equations of any general model being calibrated can be written as
$$\eta = \eta(v, \beta) \tag{10.58}$$
where η is a vector of the simulated outputs (e.g., energy consumption), v is a vector of the measured input variables, and β is a vector of the model parameters to be calibrated. The difference between the simulated outputs and the actual measured data can be quantified in several ways. The most common one is the least squares method, which assumes that the best estimated parameters are the ones that result in the minimal sum of the squared deviations (least square error):
$$\text{Minimize } J = \Delta\eta \, w \, \Delta\eta^{T} \tag{10.59}$$
$$\text{s.t. } \beta_{low} \le \beta \le \beta_{high}$$
where J is the objective function and Δη is the vector of model residuals. The weight matrix w is introduced to normalize for differences in numerical magnitude and to reflect cases in which the relative importance of each simulated output is different (e.g., electric energy use and natural gas use may have to be calibrated to different accuracy levels). Note that lower and upper bounds of each parameter can be stipulated, resulting in a constrained optimization problem. The analytical calibration process proposed involves four distinct processes: sensitivity analysis, identifiability analysis, optimization, and uncertainty analysis. Before the tuning stage, which is now framed as an optimization problem, can be reached, one must perform both sensitivity and identifiability analyses.
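A minimal sketch of the bounded least squares problem of Eq. 10.59 is given below. It is only an illustration: the two-output `simulate` function is a hypothetical toy model (not a building simulation program), the weight matrix is assumed diagonal with each channel weighted by the inverse square of its mean measured value, and a coarse grid search over the bounded parameter space is swapped in for the optimization method actually used by Sun and Reddy (2006).

```python
import itertools
import statistics


def simulate(v, beta):
    """Hypothetical toy model eta(v, beta): two output channels
    (electricity-like and gas-like) driven by outdoor temperature v."""
    b1, b2 = beta
    elec = [200.0 + b1 * max(0.0, t - 18.0) / b2 for t in v]
    gas = [b1 * max(0.0, 18.0 - t) for t in v]
    return elec + gas  # concatenated output vector


def objective(beta, v, measured, weights):
    """J = sum_i w_i * residual_i**2, i.e. Eq. 10.59 with a diagonal w."""
    residuals = [m - s for m, s in zip(measured, simulate(v, beta))]
    return sum(w * r * r for w, r in zip(weights, residuals))


def grid_search(v, measured, weights, low, high, steps=41):
    """Evaluate J on a regular grid inside the bounds and return the best point."""
    axes = [[lo + k * (hi - lo) / (steps - 1) for k in range(steps)]
            for lo, hi in zip(low, high)]
    return min(itertools.product(*axes),
               key=lambda beta: objective(beta, v, measured, weights))


v = [2.0, 5.0, 9.0, 14.0, 19.0, 24.0, 27.0, 26.0, 22.0, 15.0, 8.0, 3.0]
beta_true = (1.5, 3.0)               # "true" parameters of the synthetic case
measured = simulate(v, beta_true)    # synthetic, noise-free "measurements"

# Diagonal weights: normalize each channel by its mean measured value squared
n = len(v)
w_elec = 1.0 / statistics.mean(measured[:n]) ** 2
w_gas = 1.0 / statistics.mean(measured[n:]) ** 2
weights = [w_elec] * n + [w_gas] * n

beta_low, beta_high = (0.5, 2.0), (2.5, 4.0)   # assumed parameter bounds
beta_hat = grid_search(v, measured, weights, beta_low, beta_high)
print("estimated parameters:", beta_hat)
```

Without the weights, the channel with the larger numerical magnitude would dominate J, which is the situation the weight matrix is meant to avoid.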
Sensitivity Analysis allows the strong influential input parameters to be identified (note that the sensitivity coefficients should be normalized by their Euclidean norm for clearer determination of their influence) while the values of the weak variables are frozen at their default values without much adverse impact to the calibration process; this is similar to the regional sensitivity screening process using the LHMC method described in the previous section.
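One possible way of computing such normalized sensitivity coefficients is sketched below. The details are assumptions rather than the procedure of Sun and Reddy (2006): derivatives are estimated by central finite differences, each is scaled by the parameter value to make it dimensionless, the influence of a parameter is taken as the Euclidean norm of its scaled derivatives over all outputs, and the influences are finally divided by their own Euclidean norm. The three-parameter `toy_model` is hypothetical.

```python
import math


def toy_model(beta):
    """Hypothetical model with one strong, one moderate, and one weak parameter."""
    b0, b1, b2 = beta
    xs = [1.0, 2.0, 3.0, 4.0]
    return [5.0 * b0 * x + 0.5 * b1 * x * x + 0.001 * b2 for x in xs]


def normalized_sensitivities(model, beta, rel_step=0.01):
    """Return one normalized influence value per parameter."""
    influence = []
    for j, value in enumerate(beta):
        h = rel_step * abs(value) if value != 0 else rel_step
        up = list(beta)
        down = list(beta)
        up[j] = value + h
        down[j] = value - h
        # Central-difference derivative of every output w.r.t. parameter j
        derivs = [(a - b) / (2.0 * h) for a, b in zip(model(up), model(down))]
        # Scale by the parameter value so that parameters are comparable
        scaled = [value * d for d in derivs]
        influence.append(math.sqrt(sum(s * s for s in scaled)))
    total = math.sqrt(sum(s * s for s in influence))
    return [s / total for s in influence]


sens = normalized_sensitivities(toy_model, [1.0, 1.0, 1.0])
for name, s in zip(["b0", "b1", "b2"], sens):
    print("%s: %.4f" % (name, s))
```

In this toy case the third parameter has a negligible normalized sensitivity and would be frozen at its default value before the identifiability analysis.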
Identifiability Analysis is done next to determine how many parameters, and which ones, can be estimated uniquely from the available measured data. The sufficient condition for a local minimum in the neighborhood of the point β0 is that the Hessian matrix be positive definite. The existence of a local minimum implies that one can find a set of values of parameters β by solving the above optimization problem. Interpreted in practical terms, this would imply that all the parameters β can be calibrated simultaneously. Determining whether the Hessian matrix is positive definite is difficult because second-order derivatives of the objective function need to be computed (a simplified approach involving first-order derivatives only has been proposed by Sun and Reddy 2006). Then, the identifiability criterion can be stated as “the number of identifiable parameters is equal to the highest order definite sub-matrix (or the rank of the Hessian matrix) whose condition number (see footnote 9) is less than a pre-selected threshold value.” For example, the LHMC method based on the monthly utility bills for the same synthetic building presented earlier in Sect. 10.6.7 indicated that six inputs were influential. The corresponding Hessian matrix resulted in the following vector of eigenvalues:
$$\lambda_6 = [\,0.0264 \;\; 0.0366 \;\; 0.1794 \;\; 0.5997 \;\; 2.4842 \;\; 27.1637\,]$$
9 The concept of condition number has been discussed in Sect. 9.2.2 under ill-conditioning.
A condition number of cd6 = 1028.93 (the ratio of the largest to the smallest eigenvalue) is much too high, and so the data cannot support the identification of all six inputs. The same process was applied to the sub-matrices, and the eigenvalues and condition numbers were again computed. The maximum condition numbers of the 5-order, 4-order, 3-order, and 2-order sub-matrices were found to be: cd5 = 742.18, cd4 = 151.41, cd3 = 45.30, cd2 = 10.93. If the cutoff value is taken to be 50, only three of the six strong continuous variables can be identified simultaneously. These correspond to: Uwall, the overall heat loss coefficient; EIR, the energy input ratio, which is the inverse of the coefficient of performance (COP) of the cooling system; and MinSA, the minimum value of the outdoor supply air flow rate. The initial/reference values, the parameter ranges assumed, and the final calibrated values of the three inputs are assembled in Table 10.8. The uncertainty ranges are determined using a Monte Carlo approach (described in Sun and Reddy 2006). It is clear that this calibration approach is able to identify the parameters accurately.
Table 10.8 Calibration results of the semi-analytical method (From Sun and Reddy 2006)
Identifiable parameter   Reference value   Low bound   High bound   Final calibration value   Initial guess value
Uwall                    0.640             0.055       0.800        0.655                     0.400
EIR                      0.450             0.359       0.650        0.463                     0.400
MinSA                    0.650             0.300       1.000        0.612                     0.400
This illustrative example demonstrates that the proposed analytical procedure, though involving several steps, can correctly tune the parameters back to the perfectly known reference values assumed for this synthetic building. This study was more of a proof-of-concept and ought to be followed by more systematic studies with hourly data, which will contain more information (and noise) and thereby allow a greater number of parameters to be identified. Whether this approach can be improved further and reach a stage of wide acceptance remains to be seen.
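The arithmetic of the identifiability example can be retraced with a short sketch. The eigenvalues, the sub-matrix condition numbers, and the cutoff of 50 are the values quoted in the text; the helper names `condition_number` and `n_identifiable` are introduced here only for illustration, and the enumeration of sub-matrices that produced cd5 to cd2 is not reproduced.

```python
def condition_number(eigenvalues):
    """Ratio of the largest to the smallest eigenvalue."""
    return max(eigenvalues) / min(eigenvalues)


# Eigenvalues of the 6 x 6 Hessian matrix quoted in the text
eigenvalues = [0.0264, 0.0366, 0.1794, 0.5997, 2.4842, 27.1637]
cd6 = condition_number(eigenvalues)

# Condition numbers by sub-matrix order (orders 5 to 2 are taken from the text)
cd_by_order = {6: cd6, 5: 742.18, 4: 151.41, 3: 45.30, 2: 10.93}

cutoff = 50.0
# Highest order whose condition number is below the pre-selected threshold
n_identifiable = max(order for order, cd in cd_by_order.items() if cd < cutoff)

print("cd6 = %.2f" % cd6)
print("number of identifiable parameters:", n_identifiable)
```

A different cutoff changes the answer: with a threshold of 200, for instance, the same numbers would suggest that four parameters could be identified.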
10.6.9 Physical Parameter Estimation (PPE)
Few attempts have been made towards “calibrating” (see footnote 10) a detailed simulation program to faithfully estimate parameters which are physically relevant. One approach, called the EPE (Enhanced Parameter Estimation) approach (Subbarao et al. 2022), will be outlined here. The underlying premise is to estimate certain macro parameters without modifying the simulation input variables. An inexpensively created simulation model, called the audit model, is the one to be reconciled with performance data. To keep the statistical estimation robust, only the most important flows are considered; their effect is characterized by macro parameters which are estimated from the audit building simulation and then corrected by introducing renormalization coefficients. A two-stage process is used: in stage 1, the shell-related parameters are estimated, while the equipment-related parameters are estimated in stage 2. This allows the transient heat flows appearing in stage 1 to be handled separately so that they do not muddy the estimation of the parameters of the secondary HVAC system and primary systems (chillers and boilers), which are static models. The approach is to work with macro heat flows contributing to the overall building energy balance rather than with the numerous inputs of the simulation program. Certain simulation programs provide this type of disaggregation.
The PPE method is fundamentally similar to another well-known calibration approach called the PSTAR (Primary and Secondary Terms Analysis and Renormalization) method (Subbarao 1988), which identifies and computes heat flow terms of primary and secondary importance, and then introduces corrective parameters to the audit macro parameters; these corrective parameters are estimated by regression from monitored data to enforce energy balance in the actual building. Commercial detailed simulation programs (e.g., EnergyPlus) consider micro heat flows reaching, for example, the indoor air node from each of the other nodes (as well as any direct input such as the convective portion of internal gains). They keep track of many nodes, and thereby a large number of micro heat flows, but do not distinguish how much of the heat flow is due to each of the major external drivers such as indoor temperature, outdoor temperature, and solar radiation. To overcome this deficiency, the PPE method determines the macro heat flows reaching a node by simulating the audit building over the whole year while specific input driving functions are modified in an artificial but carefully selected manner. These year-long time series data sets of inputs and responses can then be analyzed in a stage-wise regression framework to identify more realistic macro parameters characterizing the building thermal performance. This method is an extension of sorts of the PSTAR approach in terms of operation. The interested reader can refer to Subbarao et al. (2022), where this method is discussed and illustrated in some detail.
10 In the true sense of the term as defined in Sect. 10.6.6c.
10.6.10 Thoughts on Statistical Criteria for Goodness-of-Fit
Usually energy-related utility bills include electricity use in kWh, demand in kW, and gas use, and calibration involves reconciling all three quantities. Traditionally, the two most often used statistical indices for goodness-of-fit are the Normalized Mean Bias Error (NMBE) and the Coefficient of Variation (CV); these are defined in Sect. 5.3.2 and are akin to the mean and the standard deviation of the residuals. Guidelines on calibration accuracy have also been proposed for general use (ASHRAE 14-2014). The reconciliation could be undertaken so that each end-use data channel meets these statistical criteria individually, or a single weighted value could alternatively be used (as suggested by Reddy et al. 2007). One needs to distinguish between a calibration done with monthly aggregated residuals from utility bills and one done with hourly residuals. Even though computer simulations predict energy use at the hourly level, one needs to sum these up to the monthly level to compare them with the utility bill readings. Obviously, one would expect the match between simulated and measured values at the monthly level to be more accurate than at the hourly level. Guideline ASHRAE 14 (2014) proposes threshold CV-RMSE values of 15 and 30% for monthly and hourly
level calibration to be deemed acceptable, and 5 and 10%, respectively, for the NMBE values. These thresholds are quite arbitrary and were selected to be reflective of the then prevailing accuracy levels achievable by good consulting firms for the types of applications for which calibrated simulations were used (typically applications (a) to (c-ii) listed in Sect. 10.6.4). These thresholds are certainly not appropriate for research purposes, nor are they reflective of what current analysts with better understanding and sophistication ought to achieve; note also that the applications for which calibrated simulation is used have been expanded to applications (c-iv and c-v), (d), and (e) in Sect. 10.6.4. Alternative metrics have been proposed and should be seriously considered (though they are yet to be accepted in consensus guideline documents), such as those given by Eqs. 5.10b and 5.10c, where the RMSE values are normalized differently than in the traditional definition of CV-RMSE given by Eq. 5.10a. In addition, one should investigate month-by-month differences rather than relying on a single annual value; this would yield additional insights and is likely to improve the calibration greatly.
Table 10.9 Comparison of goodness-of-fit indices for electricity use aggregated at monthly level and at individual hourly level between actual calibration study and the guideline thresholds
Index                Actual building (%)   ASHRAE Guideline 14 (%)
CV-RMSE (monthly)    4.8                   15
NMBE (monthly)       2.6                   5
CV-RMSE (hourly)     13.3                  30
NMBE (hourly)        2.5                   10
Table 10.10 Normalized differences between measured and simulated electricity use in kWh at the monthly level for an actual calibrated building with hourly monitored data
Month    Simulated    Measured     NMBE (%)   CV-RMSE (%)
Jan      8.77E+05     9.17E+05     4.5        10.3
Feb      8.00E+05     8.41E+05     5.1        9.8
Mar      8.78E+05     8.82E+05     0.5        9.4
Apr      8.48E+05     8.37E+05     -1.3       10.2
May      1.05E+06     1.17E+06     11.5       22.5
Jun      1.25E+06     1.29E+06     3.6        10.6
Jul      1.35E+06     1.42E+06     5.7        10.2
Aug      1.27E+06     1.30E+06     2.3        10.9
Sept     1.18E+06     1.17E+06     -0.9       16.0
Oct      8.92E+05     8.84E+05     -0.8       17.4
Nov      8.37E+05     8.31E+05     -0.7       9.5
Dec      8.74E+05     8.51E+05     -2.7       8.9
Yearly   12,105,950   12,404,406   2.5        13.3
11 This data was provided by Srijan Didwania, for which we are grateful.
An actual mixed-use, very large commercial building located in the Washington DC area, for which hourly monitored data is available, has been calibrated using the heuristic/manual RIT approach (see footnote 11). The actual goodness-of-fit values are so much better than those proposed by the standard guideline (Table 10.9) that one would conclude that the calibration is excellent. However, if one looks at Table 10.10, which provides a comparison at the month-by-month level, the normalized differences in electricity use for certain months are very high, with high NMBE values (such as May, whose NMBE = 11.5% and CV-RMSE = 22.5%). Also concerning are the large differences in CV-RMSE of certain months (Sept and Oct) even though the NMBE