Synthesis Lectures on Mathematics & Statistics
Emmanuel N. Barron · John G. Del Greco
Probability and Statistics for STEM: A Course in One Semester, Second Edition
Synthesis Lectures on Mathematics & Statistics Series Editor Steven G. Krantz, Department of Mathematics, Washington University, Saint Louis, MO, USA
This series includes titles in applied mathematics and statistics for cross-disciplinary STEM professionals, educators, researchers, and students. The series focuses on new and traditional techniques to develop mathematical knowledge and skills, an understanding of core mathematical reasoning, and the ability to utilize data in specific applications.
Emmanuel N. Barron Department of Mathematics and Statistics Loyola University Chicago Chicago, IL, USA
John G. Del Greco Department of Mathematics and Statistics Loyola University Chicago Chicago, IL, USA
ISSN 1938-1743 ISSN 1938-1751 (electronic) Synthesis Lectures on Mathematics & Statistics ISBN 978-3-031-38984-9 ISBN 978-3-031-38985-6 (eBook) https://doi.org/10.1007/978-3-031-38985-6 1st edition: © Morgan and Claypool Publishers 2020 2nd edition: © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Emmanuel N. Barron: Dedicated to Christina. John G. Del Greco: for Jim.
Preface to the Second Edition
The second edition has several improvements and additions. We hope we have corrected the errors of the first edition without introducing new ones, although that event has probability zero. We have included new sections on Conditional Distributions, Medians, Order Statistics, and Multiple Mean Hypothesis Testing. Order statistics are not only important in their own right; we also need this topic in order to discuss the important Central Limit Theorem for Medians, which is covered in Chap. 3. This topic is ignored in many textbooks, but it is important in statistics for creating statistical intervals and for hypothesis testing for medians. As in the first edition, the goal is to produce a basic textbook covering the essentials of probability and statistics for mathematically prepared students. The material is typically presented at the sophomore level or higher but is accessible to anyone with calculus preparation. In addition, the entire book is designed to be covered in one semester. Any errors which remain or are introduced in this edition are attributed to the co-author.

Chicago, USA
June 2023
Emmanuel N. Barron John G. Del Greco
Acknowledgement Susanne Filler at Springer is an amazing Editor.
Preface to the First Edition
Every student anticipating a career in science and technology will require at least a working knowledge of probability and statistics, either for use in their own work or to understand the techniques, procedures, and conclusions contained in scholarly publications and technical reports. Probability and statistics have always been, and will continue to be, a significant component of the curricula of mathematics and engineering science majors. These two subjects have also become increasingly important in areas that have not traditionally included them in their undergraduate courses of study, such as biology, chemistry, physics, and economics. Over the last couple of decades, methods originating in probability and statistics have found numerous applications in a wide spectrum of scientific disciplines, and so it is necessary to at least acquaint prospective professionals and researchers working in these areas with the fundamentals of these important subjects.

Unfortunately, there is little time to devote to the study of probability and statistics in a science and engineering curriculum that is typically replete with required courses. What should be a comprehensive two-semester course in probability and statistics has to be, out of necessity, reduced to a single-semester course. This book is an attempt to provide a text that addresses both rigor and conciseness in the main topics of undergraduate probability and statistics. It is intended that this book be used in a one-semester course in probability and statistics for students who have completed two semesters of calculus. It is our goal that readers gain an understanding of the reasons and assumptions underlying various statistical conclusions. The presentation is intended to be at an intermediate (sophomore or junior) level. Most two-semester courses present the subject at a higher level of detail and address a wider range of topics.
On the other hand, most one-semester, lower-level courses do not present any derivations of statistical formulas and provide only limited reasoning motivating the results. This book is meant to bridge the gap and present a concise but mathematically rigorous introduction to all of the essential topics in a first course in probability and statistics. If you are looking for a book which contains all the nooks and crannies of probability or statistics, this book is not for you. If you plan on becoming (or already are) a practicing scientist or engineer, this book will certainly contain much of what you need to know. But, if not, it will give you
the background to know where and how to look for what you do need, and to understand what you are doing when you apply a statistical method and reach a conclusion. Answers to most of the problems are provided at the end of the book. While this book is not meant to accompany a strictly computational course, calculations requiring a computer, or at least a calculator, are inevitably necessary. Therefore, this course requires the use of a TI-83/84/85/89 or any standard statistical package like Excel. Tables of the standard normal, t-distribution, etc., are not provided.

All experiments result in data. These data values are particular observations of underlying random variables. To analyze the data correctly, the experimenter needs to be equipped with the tools that an understanding of probability and statistics provides. That is the purpose of this book. We will present basic, essential statistics and the underlying probability theory needed to understand what the results mean.

The book begins with three foundational chapters on probability. The fundamental types of discrete and continuous random variables and their basic properties are introduced. Moment generating functions are introduced at an early stage and used to calculate expected values and variances, and also to enable a proof of the Central Limit Theorem, a cornerstone result. Much of statistics is based on the Central Limit Theorem, and our view is that students should be exposed to a rigorous argument for why it is true and why the normal distribution plays such a central role. Distributions related to the normal distribution, like the χ², t, and F distributions, are presented for use in the statistical methods developed later in the book. Chapter 3 is the prelude to the statistical topics included in the remainder of the text. This chapter includes the analysis of sample means and sample standard deviations as random variables.
The study of statistics begins in earnest with a discussion of confidence intervals in Chap. 4. Both one-sample and two-independent-sample confidence intervals are constructed, as well as confidence intervals for paired data. Chapter 5 contains the topics at the core of statistics, particularly important for experimenters. We introduce tests of hypotheses for the major categories of experiments. Throughout the chapter, the dual relationship between confidence intervals and hypothesis tests is emphasized. The power of a test of hypotheses is discussed in some detail. Goodness-of-fit tests, contingency tables, tests for independence, and one-way analysis of variance are presented. The book ends with a basic discussion of linear regression, an extremely useful tool in statistics. Calculus students have more than enough background to understand and appreciate how the results are derived. The probability theory introduced in earlier chapters is sufficient to analyze the coefficients derived in the regression.
The book has been used in the Introduction to Statistics and Probability course at our university for several years, in both two-lectures-per-week and three-lectures-per-week formats. Each semester typically involves at least two midterms (usually three) and a comprehensive final exam. In addition, class time includes time for group work as well as in-class quizzes. It makes for a busy semester.

Chicago, USA
June 2020
Emmanuel N. Barron John G. Del Greco
Contents
1 Probability
   1.1 The Basics
      1.1.1 Equiprobable Sample Spaces
   1.2 Conditional Probability
      1.2.1 Independence
      1.2.2 Bayes' Theorem
   1.3 Appendix: Counting Techniques
      1.3.1 Multiplication Principle
      1.3.2 Permutations
      1.3.3 Combinations
   1.4 Problems

2 Random Variables
   2.1 The Basics
      2.1.1 Discrete RVs and PMFs
      2.1.2 Cumulative Distribution Functions
      2.1.3 Continuous RVs and PDFs
   2.2 Important Discrete Distributions
      2.2.1 Discrete Uniform RVs
      2.2.2 Bernoulli RVs
      2.2.3 Binomial RVs
      2.2.4 Geometric RVs
      2.2.5 Negative Binomial RVs
      2.2.6 Poisson RVs
      2.2.7 Hypergeometric RVs
      2.2.8 Multinomial RVs
      2.2.9 Simulating Discrete RVs Using a Box Model
   2.3 Important Continuous Distributions
      2.3.1 Uniform RVs
      2.3.2 Exponential RVs
      2.3.3 Normal RVs
   2.4 Expectations, Variances, Medians, and Percentiles
      2.4.1 Expectation
      2.4.2 Variance
      2.4.3 Medians and Percentiles
   2.5 Moment-Generating Functions
   2.6 Joint Distributions
      2.6.1 Two Discrete RVs
      2.6.2 Two Continuous RVs
      2.6.3 Expected Values
   2.7 Independent RVs
      2.7.1 Conditional Distributions
      2.7.2 An Application of Conditional Distributions: Bayesian Analysis
      2.7.3 Covariance and Correlation
      2.7.4 The General Central Limit Theorem
   2.8 Chebychev's Inequality and the Weak Law of Large Numbers
   2.9 Other Distributions Important in Statistics
      2.9.1 Chi Squared Distribution
      2.9.2 Student's t Distribution
      2.9.3 Fisher-Snedecor F Distribution
   2.10 TI-8x Commands
   2.11 Problems

3 Distributions of Sample Mean and Sample SD
   3.1 The Statistics X̄, S², and S of a Random Sample
   3.2 Normal Populations
      3.2.1 X ∼ N(μ, σ), σ Known
      3.2.2 X ∼ N(μ, σ), σ Unknown
      3.2.3 The Population X is not Normal but has Known Mean and Variance
      3.2.4 The Population is Bernoulli, p is Known
      3.2.5 The Population is Bernoulli, p is Unknown
   3.3 Sampling Distributions of Differences of Two Samples
   3.4 The Median: Order Statistics and the Central Limit Theorem
      3.4.1 Continuous RVs and Order Statistics
      3.4.2 Discrete RVs
      3.4.3 Order Statistics and Sample Percentiles
   3.5 Problems

4 Confidence and Prediction Intervals
   4.1 Confidence Intervals for a Single Sample
      4.1.1 Controlling the Error of an Estimate Using Confidence Intervals
      4.1.2 Pivotal Quantities
      4.1.3 Confidence Intervals for the Mean and Variance of a Normal Distribution
      4.1.4 Confidence Intervals for a Proportion
      4.1.5 One-Sided Confidence Intervals
   4.2 Confidence Intervals for Two Samples
      4.2.1 Difference of Two Normal Means
      4.2.2 Confidence Interval for the Ratio of Variances
      4.2.3 Difference of Two Binomial Proportions
      4.2.4 Paired Samples
   4.3 Prediction Intervals
   4.4 Problems

5 Hypothesis Testing
   5.1 A Motivating Example
   5.2 The Basics of Hypothesis Testing
   5.3 Hypotheses Tests for One Parameter
      5.3.1 Hypotheses Tests for the Normal Parameters, Critical Value Approach
      5.3.2 The P-Value Approach to Hypothesis Testing
      5.3.3 Test of Hypotheses for Proportions
   5.4 Hypotheses Tests for Two Populations
      5.4.1 Test of Hypotheses for Two Proportions
   5.5 Power of Tests of Hypotheses
      5.5.1 Factors Affecting Power of a Test of Hypotheses
      5.5.2 Power of One-Sided Tests
   5.6 More Tests of Hypotheses
      5.6.1 Chi-Squared Statistic and Goodness-of-Fit Tests
      5.6.2 Contingency Tables and Tests for Independence
      5.6.3 Analysis of Variance
   5.7 Multiple Testing Problem and ANOVA
      5.7.1 Bonferroni and Šidák Corrections
      5.7.2 Tukey's Simultaneous Confidence Intervals
   5.8 Problems
   5.9 Summary Tables

6 Linear Regression
   6.1 Introduction and Scatter Plots
   6.2 Introduction to Regression
      6.2.1 The Linear Model with Observed X
      6.2.2 Estimating the Slope and Intercept from Data
      6.2.3 Errors of the Regression
   6.3 The Distributions of â and b̂
   6.4 Confidence Intervals for Slope and Intercept & Hypothesis Tests
      6.4.1 Hypothesis Tests for Slope and Intercept
      6.4.2 ANOVA for Linear Regression
      6.4.3 Confidence and Prediction Bands
      6.4.4 Hypothesis Test for the Correlation Coefficient
   6.5 Problems

7 Appendix: Answers to Problems
   7.1 Answers to Chap. 1 Problems
   7.2 Answers to Chap. 2 Problems
   7.3 Answers to Chap. 3 Problems
   7.4 Answers to Chap. 4 Problems
   7.5 Answers to Chap. 5 Problems
   7.6 Answers to Chap. 6 Problems

Index
About the Authors
Emmanuel N. Barron received his B.S. (1970) in Mathematics from the University of Illinois at Chicago and his M.S. (1972) and Ph.D. (1974) in Mathematics from Northwestern University, specializing in partial differential equations and differential games. After receiving his Ph.D., Dr. Barron was an Assistant Professor at Georgia Tech and then became a Member of Technical Staff at Bell Laboratories. In 1980 he joined the Department of Mathematical Sciences at Loyola University Chicago, where he is Emeritus Professor of Mathematics and Statistics. Professor Barron has published over 80 research papers. He has also authored the book Game Theory: An Introduction, whose second edition appeared in 2013. Professor Barron has received continuous research funding from the National Science Foundation and the Air Force Office of Scientific Research. Dr. Barron has taught probability and statistics to undergraduates and graduate students since 1974.

John G. Del Greco A native of Cleveland, Ohio, Dr. Del Greco holds a B.S. in mathematics from John Carroll University, an M.A. in mathematics from the University of Massachusetts, and a Ph.D. in industrial engineering from Purdue University. Before joining Loyola's faculty in 1987, Dr. Del Greco worked as a systems analyst for Micro Data Base Systems, Inc., located in Lafayette, Indiana. His research interests include applied graph theory, operations research, network flows, and parallel algorithms, and his publications have appeared in such journals as Discrete Mathematics, Discrete Applied Mathematics, The Computer Journal, Lecture Notes in Computer Science, and Algorithmica. He has been teaching probability and statistics at all levels in the Department of Mathematics and Statistics at Loyola for the past 20 years.
1 Probability
There are two types of events that occur: deterministic and stochastic (or random). An event is deterministic if the same set of inputs always yields the same output. For example, Newtonian physics models deterministic phenomena. Engineering design, on the other hand, necessarily involves probability models, since the materials out of which system components are fabricated typically have defects that cannot be controlled completely. For example, airplane components degrade with every takeoff and landing. Changes in temperature and pressure and a myriad of other factors during flight put stress on critical parts. Expensive maintenance procedures can be done only periodically and only if qualified staff are available. These factors, difficult to predict with certainty in advance and therefore considered random effects, must be taken into account in the design process. Failure to do so may result in a catastrophic accident with the loss of many lives. Probability is a rigorous mathematical discipline that attempts to quantify the results and effects of randomness, and it is therefore a critical component of proper engineering design.

In this chapter and in Chaps. 2 and 3, we cover the basic concepts and techniques of probability that we will use throughout the text. In the remainder of the book, probability is applied to construct statistical techniques used to assess and draw conclusions from the data associated with the engineering of complex systems.
1.1 The Basics
We are all familiar with experiments conducted in a chemistry, biology, or physics laboratory. In probability, we also perform experiments, but probability experiments are very different from laboratory experiments. A laboratory experiment is a procedure designed to produce an outcome whose purpose is to test the validity of a conjecture about some aspect of nature. Although we may have some theoretical expectation as to the result of the experiment, a complete description of the outcome is not known in advance. Also, the outcome is the
same for every repetition of the experiment (determinism). In contrast, in a probability experiment, we are not testing a conjecture about nature. In addition, we do know an exact description of the outcomes that may occur. Finally, different repetitions of the experiment may yield different outcomes (nondeterminism, or randomness). For example, in a poker game in which the players are dealt five-card hands, each player knows the possibilities (outcomes) for the hand that he or she may be dealt, but different rounds of the game may yield different hands for the players.

Definition 1.1 A probability experiment is a procedure that produces a well-defined set of outcomes that occur by chance alone.

Going forward, the term 'experiment' will always denote a probability experiment.

Definition 1.2 The set S of all outcomes of an experiment is called the sample space of the experiment.

The set S could be a finite set, like the collection of all five-card poker hands that are possible in a poker game; a countable set (a set that can be put into a one-to-one correspondence with the set {1, 2, 3, ...}), like the number of users that log onto a certain computer system; or a continuum (an uncountable set), like the set of real numbers between 0 and 1.

Definition 1.3 An event is any subset A of S, written A ⊆ S. The sample space S is called the sure (or certain) event, and the empty set ∅ is called the impossible (or null) event.

We denote the set of all events by F, and therefore F = {A | A ⊆ S}. If S is a finite set with N elements, we write |S| = N. Note that |F| = 2^N. (Why?)

Example 1.4 Consider the following sample spaces and events.

(i) If we roll a single die, S = {1, 2, 3, 4, 5, 6}. Rolling an even number is the event A = {2, 4, 6}.
(ii) If we observe the number of customers entering a bakery, S = {0, 1, 2, 3, ...}, and the event that we observe between two and seven customers is A = {2, 3, 4, 5, 6, 7}.
(iii) If we throw a dart at a circular dart board of radius two feet and observe its landing position as an ordered pair of real numbers, then S = {(x, y) | x² + y² ≤ 4}. The event that the dart lands in the first quadrant is A = {(x, y) | x² + y² ≤ 4, x ≥ 0, y ≥ 0}.

We say that an event A occurs if some outcome in the event actually occurs when the experiment is performed. Note that if |S| = N and the experiment is performed, exactly
2^(N−1) events occur, since for each event A the outcome of the experiment must be in either A or A^c = S − A, the set of all outcomes not in A.

Events can be combined using set operations to form new events. For A, B ∈ F, the three basic ones are listed below.

• A ∪ B, called the union of A and B, is the event that A or B (or both) occurs.
• A ∩ B, called the intersection of A and B, is the event that both A and B occur. (A ∩ B is sometimes written as AB.)
• A^c, called the complement of A, is the event that A does not occur. (A^c is sometimes written as Ā or, in some texts, A′.)

If A and B are events and A ∩ B = ∅, then A and B cannot occur together; that is, they are mutually exclusive. We also say that A and B are disjoint events. Note that A and A^c are disjoint. Finally, note that A ∪ A^c = S, meaning that either A or A^c occurs, since any outcome is in S.

There are many set identities that are useful in various situations. A couple of the more important ones are listed below.

• A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) and A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C). (Distributive Laws)
• (A ∩ B)^c = A^c ∪ B^c and (A ∪ B)^c = A^c ∩ B^c. (De Morgan's Laws)

These and most other set identities can be (informally) proved using Venn diagrams. We are now ready to define what is meant by the probability of an event.

Definition 1.5 A probability function is a function P : F → R satisfying the following three axioms.

(i) (Nonnegativity Axiom) For each A ∈ F, P(A) ≥ 0.
(ii) (Certainty Axiom) P(S) = 1.
(iii) (Disjoint Event Axiom) If A and B are disjoint events, then P(A ∪ B) = P(A) + P(B).

For an event A, we interpret P(A) as the probability of A. Whenever we write 'P', we assume it is a probability function. We list some of the most important properties of a probability function in the next proposition.
Proposition 1.6 Suppose P is a probability function on F and A, B ∈ F.

(i) (Complement Property) P(A^c) = 1 − P(A).
(ii) (Monotone Property) If A ⊆ B, then P(A) ≤ P(B).
(iii) (General Sum Property) P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
(iv) (Difference Property) P(A − B) = P(A) − P(A ∩ B).¹
(v) (Law of Total Probability) P(A) = P(A ∩ B) + P(A ∩ B^c).

Proof (i) Since S = A ∪ A^c, by the Certainty and Disjoint Event Axioms, 1 = P(S) = P(A) + P(A^c).

(ii) Note that B = A ∪ (B − A). Since A and B − A are disjoint, by the Disjoint Event Axiom, P(B) = P(A) + P(B − A). Since P(B − A) ≥ 0 by the Nonnegativity Axiom, (ii) follows. As a special case, if we take B = S, then P(A) ≤ P(S) = 1, and so for every event A, 0 ≤ P(A) ≤ 1.

(iii) By the Disjoint Event Axiom, we have

P(A) = P(A − B) + P(A ∩ B) and P(B) = P(B − A) + P(A ∩ B).

Adding these two equations gives

P(A) + P(B) = [P(A − B) + P(A ∩ B) + P(B − A)] + P(A ∩ B),

and since A − B, A ∩ B, and B − A are pairwise disjoint with union A ∪ B, the bracketed sum is P(A ∪ B), and (iii) follows.

(iv) A = (A − B) ∪ (A ∩ B). By the Disjoint Event Axiom, P(A) = P(A − B) + P(A ∩ B), giving (iv).

(v) Note that A ∩ B and A ∩ B^c are disjoint events. Therefore,

P(A) = P(A ∩ S) = P(A ∩ (B ∪ B^c)) = P((A ∩ B) ∪ (A ∩ B^c))  (Distributive Law)
= P(A ∩ B) + P(A ∩ B^c).  (Disjoint Event Axiom)
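As a quick numerical check, the axioms and the properties just proved can be verified on a small finite model using Python's exact fractions. The sample space and events below are arbitrary choices made for illustration, not taken from the text.

```python
from fractions import Fraction

# A tiny equiprobable model: S has six outcomes, each with probability 1/6.
S = frozenset(range(1, 7))
A = frozenset({1, 2, 3})
B = frozenset({3, 4})

def P(E):
    # Equally likely outcomes: P(E) = |E| / |S|.
    return Fraction(len(E), len(S))

assert P(S - A) == 1 - P(A)                                      # Complement Property
assert P(A.union(B)) == P(A) + P(B) - P(A.intersection(B))       # General Sum Property
assert P(A - B) == P(A) - P(A.intersection(B))                   # Difference Property
assert P(A) == P(A.intersection(B)) + P(A.intersection(S - B))   # Law of Total Probability

# De Morgan's Laws as set identities.
assert S - A.intersection(B) == (S - A).union(S - B)
assert S - A.union(B) == (S - A).intersection(S - B)
```

Working with Fraction rather than floats keeps every probability exact, so the identities hold with == rather than approximate comparison.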
1.1.1
Equiprobable Sample Spaces
When the sample space is finite, say |S| = N , and all outcomes in S are equiprobable (or equally likely), we may define a function
¹ A − B = A ∩ B^c.
P(A) = |A| / N.

To see that P is a probability function, we must verify the conditions of the definition.

(i) P(A) ≥ 0 since |A| ≥ 0.
(ii) P(S) = |S| / N = N / N = 1.
(iii) If A and B are disjoint, P(A ∪ B) = |A ∪ B| / N = |A| / N + |B| / N = P(A) + P(B).
The requirement that the outcomes in S be equiprobable is essential. For example, suppose we roll a pair of dice and add the numbers showing on each individual die. Our sample space is then S = {2, 3, ..., 12}. If we assume that the outcomes are equally likely, then if A is the event that we roll a 7, we would obtain P(A) = 1/11, which is not correct. The problem is that in this particular sample space, the outcomes are not equiprobable. If we desire equiprobable outcomes, the sample space must be expanded to account for the result on each die: S = {(1, 1), (1, 2), ..., (1, 6), ..., (6, 1), (6, 2), ..., (6, 6)}. Clearly |S| = 36. Again, if A is the event of rolling a 7, then A = {(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)}. Therefore, P(A) = 6/36 = 1/6.

If the sample space is finite and the outcomes can be easily enumerated, probabilities can usually be computed by simple counting. (Counting techniques will be discussed more fully in the Appendix.) For example, suppose we again roll a pair of dice. Let D1 and D2 be the numbers showing on the first and second die, respectively. Consider the event A = {(D1, D2) | D1 > D2}. Then

A = {(2, 1), (3, 1), (3, 2), (4, 1), (4, 2), (4, 3), (5, 1), (5, 2), (5, 3), (5, 4), (6, 1), (6, 2), (6, 3), (6, 4), (6, 5)}.

Since |A| = 15, P(A) = 15/36 = 5/12. Another way to obtain the same result is to first observe that if B = {(D1, D2) | D1 < D2}, then P(A) = P(B) and P(A) + P(B) + P({(D1, D2) | D1 = D2}) = 1. Then since P({(D1, D2) | D1 = D2}) = 6/36 = 1/6, we have that 2P(A) = 5/6, implying that P(A) = 5/12.
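When the sample space is this small, the counting arguments above can be confirmed by brute-force enumeration. The sketch below lists the 36 equally likely outcomes of rolling two dice; the variable names are our own.

```python
from fractions import Fraction
from itertools import product

# Enumerate the 36 equally likely outcomes (D1, D2) of rolling two dice.
S = list(product(range(1, 7), repeat=2))
N = len(S)                                        # |S| = 36

roll_7 = [(d1, d2) for d1, d2 in S if d1 + d2 == 7]
first_beats_second = [(d1, d2) for d1, d2 in S if d1 > d2]

p_seven = Fraction(len(roll_7), N)                # 6/36 = 1/6
p_greater = Fraction(len(first_beats_second), N)  # 15/36 = 5/12

assert p_seven == Fraction(1, 6)
assert p_greater == Fraction(5, 12)
```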
1.2
Conditional Probability
Events do not typically occur in a vacuum. More often than not, the occurrence (or nonoccurrence) of an event has an impact on the occurrence (or nonoccurrence) of a second, different event. In this situation, the probability of the first event is conditioned on the occurrence of the second event.
Definition 1.7 The conditional probability of an event A, given that event B has occurred, is

P(A | B) = P(A ∩ B) / P(B), if P(B) > 0.

If P(B) = 0, the conditional probability is considered undefined.

As a justification for this definition, consider the case in which the sample space S is finite and the outcomes equally likely. Suppose |S| = N. If the event B has occurred, then the sample space S has effectively contracted to the event B, and therefore the probability of A in this reduced sample space should be simply the ratio of the number of outcomes in A that are also in B, that is |A ∩ B|, to the number of outcomes in B, namely |B|. Therefore,

|A ∩ B| / |B| = (|A ∩ B| / N) / (|B| / N) = P(A ∩ B) / P(B).
Conditional probability can be written in an equivalent form called the Multiplication Rule.

(Multiplication Rule) P(A ∩ B) = P(A | B)P(B) = P(B | A)P(A).

Example 1.8 In a controlled experiment to determine if a certain drug is effective, 71 patients were given the drug (event D), and 75 were given a placebo (event D^c). A patient exhibits a response (event R) or does not (event R^c). The table below (called a two-way or contingency table) summarizes the results of the experiment.

              Drug    Placebo  Subtotals  Probability
Response      26      13       39         0.267
No Response   45      62       107        0.733
Subtotals     71      75       146        *
Probability   0.486   0.514    *          *

The sample space consists of 146 outcomes of the form (Drug, Response), (Placebo, Response), (Drug, No Response), and (Placebo, No Response), all assumed equally likely. The values in the table are recorded after the experiment is performed, and the probabilities are estimated as

P(D) = 71/146 = 0.486, P(R) = 39/146 = 0.267, etc.
For example, P(R) is obtained from the 39 of the 146 equally likely patients that exhibited a response, whether to the drug or to the placebo. We can also use the Law of Total Probability to compute these probabilities. If we want the probability that a randomly chosen patient will exhibit a response, we use the observation that
P(R) = P((R ∩ D) ∪ (R ∩ D^c)) = 26/146 + 13/146 = 39/146 = 0.267 and
P(D) = P((D ∩ R) ∪ (D ∩ R^c)) = 26/146 + 45/146 = 71/146 = 0.486.
We can now answer certain questions concerning conditional probabilities.

(i) If we choose a patient at random and we observe that this patient exhibited a response, we can compute the probability that this patient was administered the drug as

P(D | R) = P(D ∩ R) / P(R) = (26/146) / (39/146) = 26/39.

We used the reduced sample space R to obtain the first equality.

(ii) If we choose a patient at random and we observe that this patient took the drug, we compute the probability that this patient exhibited a response. In this case, P(R | D) = 26/71. Notice that P(D | R) ≠ P(R | D).

(iii) If we wish to compute P(R^c | D), then

P(R^c | D) = P(R^c ∩ D) / P(D) = (45/146) / (71/146) = 45/71.

It is interesting to note that since P(D) = P(R ∩ D) + P(R^c ∩ D),

P(R^c | D) = P(R^c ∩ D) / P(D) = (P(D) − P(R ∩ D)) / P(D) = 1 − P(R | D).
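The conditional probabilities in Example 1.8 can be reproduced directly from the cell counts. The dictionary layout below is our own way of encoding the contingency table, not notation from the text.

```python
from fractions import Fraction

# Cell counts from the contingency table in Example 1.8:
# keys are (treatment, response) pairs.
counts = {("D", "R"): 26, ("Dc", "R"): 13,
          ("D", "Rc"): 45, ("Dc", "Rc"): 62}
total = sum(counts.values())                      # 146 patients

def P(event):
    """Probability of a set of (treatment, response) cells."""
    return Fraction(sum(counts[c] for c in event), total)

D = {("D", "R"), ("D", "Rc")}                     # given the drug
R = {("D", "R"), ("Dc", "R")}                     # exhibited a response

# Conditional probabilities from the definition P(A | B) = P(A ∩ B)/P(B).
p_D_given_R = P(D & R) / P(R)                     # 26/39
p_R_given_D = P(D & R) / P(D)                     # 26/71

assert p_D_given_R == Fraction(26, 39)
assert p_R_given_D == Fraction(26, 71)
assert p_D_given_R != p_R_given_D                 # conditioning order matters
```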
The Law of Total Probability can be rewritten in terms of conditional probability.

Theorem 1.9 If A and B are events with 0 < P(B) < 1, then

P(A) = P(A | B)P(B) + P(A | B^c)P(B^c).

Proof We use the Law of Total Probability combined with the Multiplication Rule to get

P(A) = P(A ∩ B) + P(A ∩ B^c) = P(A | B)P(B) + P(A | B^c)P(B^c).
It is often the case that we want to compute the conditional probability of an event and take into account the occurrence or nonoccurrence of some third event.
Corollary 1.10 If A, B, and C are events, then

P(A | B) = P(A | B ∩ C)P(C | B) + P(A | B ∩ C^c)P(C^c | B),

assuming that all the conditional probabilities exist.

Proof

P(A ∩ B) = P(A ∩ B ∩ C) + P(A ∩ B ∩ C^c)
= [P(A ∩ B ∩ C) / P(B ∩ C)] P(B ∩ C) + [P(A ∩ B ∩ C^c) / P(B ∩ C^c)] P(B ∩ C^c)
= P(A | B ∩ C)P(B ∩ C) + P(A | B ∩ C^c)P(B ∩ C^c).

Therefore,

P(A | B) = P(A ∩ B) / P(B) = P(A | B ∩ C) [P(B ∩ C) / P(B)] + P(A | B ∩ C^c) [P(B ∩ C^c) / P(B)]
= P(A | B ∩ C)P(C | B) + P(A | B ∩ C^c)P(C^c | B).
Another useful fact about conditional probability is that given a fixed event B with P(B) > 0, taking conditional probabilities with B yields a probability function.

Corollary 1.11 If B is a fixed event with P(B) > 0, then Q(A) = P(A | B) is a probability function on the set of events F.

Proof We must show that Q satisfies the axioms of a probability function. Clearly, since P(B) > 0, Q(A) is defined and Q(A) ≥ 0. Also, Q(S) = P(S | B) = P(S ∩ B) / P(B) = P(B) / P(B) = 1. Finally, if A1 and A2 are disjoint events, then

Q(A1 ∪ A2) = P(A1 ∪ A2 | B) = P((A1 ∪ A2) ∩ B) / P(B)
= P((A1 ∩ B) ∪ (A2 ∩ B)) / P(B) = P(A1 ∩ B) / P(B) + P(A2 ∩ B) / P(B)
= P(A1 | B) + P(A2 | B) = Q(A1) + Q(A2),

since A1 ∩ B and A2 ∩ B are disjoint, and P is a probability function.
Example 1.12 (Simpson’s Paradox) Nick and John are major league baseball players for the Chicago Cubs. Their statistics against left- and right-handed pitchers are listed below.
        Lefties                  Righties
        At Bats  Hits  Ave      At Bats  Hits  Ave
John    25       5     0.20     75       25    0.33
Nick    80       20    0.25     20       7     0.35
Nick has a better batting average against both right- and left-handed pitchers, and so we would expect that he would also have a better average overall. Let J and N be the event that John and Nick get a hit respectively. Let J L and N L be the event that John and Nick face left-handed pitchers. Define J R and N R similarly. Computing P(J ) and P(N ), we get P(J ) = P(J | J L)P(J L) + P(J | J R)P(J R) = (0.20)(0.25) + (0.33)(0.75) = 0.30 and P(N ) = P(N | N L)P(N L) + P(N | N R)P(N R) = (0.25)(0.80) + (0.35)(0.20) = 0.27.
We see that, counter to our intuition, John has a better batting average overall. Notice that both John and Nick had better performance against right-handed pitchers. Fortunately for John, he had most of his at bats against right-handed pitchers. As a last observation, we note that P(N | NL) = 20/80 > 5/25 = P(J | JL), and P(N | NR) = 7/20 > 25/75 = P(J | JR). However,

P(N) = (20 + 7)/(80 + 20) < (5 + 25)/(25 + 75) = P(J).

In general, if a, b, c, d, A, B, C, and D are positive numbers for which A/B > a/b and C/D > c/d, it is not necessarily the case that (A + C)/(B + D) > (a + c)/(b + d).
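A few lines of code make the paradox concrete; the data are the (hits, at bats) pairs from the table in Example 1.12, and the dictionary layout is our own.

```python
from fractions import Fraction

# Batting records (hits, at bats) against lefties ("L") and righties ("R").
john = {"L": (5, 25), "R": (25, 75)}
nick = {"L": (20, 80), "R": (7, 20)}

def avg(hits, at_bats):
    return Fraction(hits, at_bats)

# Nick is better in each category...
assert avg(*nick["L"]) > avg(*john["L"])          # 1/4 > 1/5
assert avg(*nick["R"]) > avg(*john["R"])          # 7/20 > 1/3

# ...yet John is better overall: Simpson's paradox.
john_overall = avg(5 + 25, 25 + 75)               # 30/100
nick_overall = avg(20 + 7, 80 + 20)               # 27/100
assert john_overall > nick_overall
```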
1.2.1
Independence
The concept of conditional probability leads naturally to what it means for the occurrence or nonoccurrence of an event to have no impact on the probability of a second event. This notion of two events being in some sense ‘independent’ from one another is one of the most important concepts in probability and statistics. Definition 1.13 Assuming that P(A) > 0 and P(B) > 0, the events A and B are said to be independent if P(A | B) = P(A), and P(B | A) = P(B). Remark 1.14 This definition depends on the existence of the conditional probabilities involved. It turns out that there is a more useful definition that eliminates this dependence. We will say that A and B are independent if P(A ∩ B) = P(A)P(B). If P(A) > 0 and P(B) > 0, it is easy to prove that this alternate definition is equivalent to the definition above. Since A ∩ B ⊆ A and A ∩ B ⊆ B, it is easy to see that if either P(A) = 0 or
P(B) = 0, then by the Monotone Property, P(A ∩ B) = 0 = P(A)P(B). We will adopt this more versatile definition of independence.

Example 1.15 (i) Suppose an experiment has two possible outcomes, a and b, and so S = {a, b}. Assume that P(a) = p, and therefore P(b) = 1 − p. Now suppose we perform this experiment n times under identical conditions. The outcomes from one repetition to the next are independent. Clearly,

P(aa···a) = p^n (n a's) and P(aa···ab) = p^(n−1)(1 − p) (n − 1 a's followed by a b).
(ii) The following two-way table contains data on place of residence and political leaning.
Moderate
Conservative
Total
200 75 275
100 225 325
300 300 600
Urban Rural Total
Is a person’s place of residence and political leaning independent? Let U be the event that a person lives in a city, and let R, M, and C be defined is a similar manner. Then 1 300 1 275 11 1 P(U ∩ M) = 200 600 = 3 , P(U ) = 600 = 2 , and P(M) = 600 = 24 , but since 3 = P(U ∩ M) = P(U )P(M) = 11 48 , being a city dweller and being moderate are not independent.
1.2.2
Bayes’Theorem
Bayes’ Theorem is one of the most important applications of conditional probability and is the starting point of an important area of statistics called Bayesian statistics. (We perform a simple Bayesian analysis in Chap. 2.) Before stating and applying Bayes Theorem, we need a slight generalization of the Law of Total Probability. Definition 1.16 A collection of events {B1 , B2 , ..., Bn } in a sample space S is called a partition of S if (i) S = B1 ∪ B2 ∪ · · · ∪ Bn and (ii) Bi ∩ B j = ∅ , i = j (pairwise disjoint). We can extend the Total Law of Probability to partitions. For any event A, P(A) =
n i=1
P(A ∩ Bi ) =
n i=1
P(A | Bi )P(Bi ).
This is clearly a generalization of the Law of Total Probability introduced earlier. Just take the partition of S to be B and B^c.

Example 1.17 Suppose we draw a card from a well-shuffled deck, and then we draw a second card not observing the card we drew first. Suppose A is the event that this second card is an ace, and suppose that we wish to compute P(A). At first glance, this probability seems to depend on whether the first card was an ace or not. Let B be the event that the first card was an ace. Clearly, S = B ∪ B^c.

P(A) = P(A | B)P(B) + P(A | B^c)P(B^c) = (3/51)(4/52) + (4/51)(48/52) = 4/52.

Amazingly, the probability that the second card is an ace is the same as the probability that the first card is an ace. Thinking about it, this result makes sense. If we do not know what the first card was, the second card has the same probability of being an ace as the first card. In fact, if we draw 51 cards from the deck not observing any of them as we draw, the probability that the 52nd (last) card is an ace is still 4/52.

Suppose we have a partition {B1, B2, ..., Bn} of the sample space S, and some event A occurs. As mentioned previously, in this case, the sample space has effectively contracted to the event A. How does this affect the probabilities of the sets Bi? Bayes' Theorem answers this important question.

Theorem 1.18 (Bayes' Theorem) Let S be a sample space, and let {B1, B2, ..., Bn} be a partition of S. If A is any event with P(A) > 0, then for each i = 1, 2, ..., n,

P(Bi | A) = P(A | Bi)P(Bi) / Σ_{j=1}^{n} P(A | Bj)P(Bj).

Proof

P(Bi | A) = P(Bi ∩ A) / P(A) = P(A ∩ Bi) / P(A) = P(A | Bi)P(Bi) / P(A).
The numerator follows from the Multiplication Rule, and the denominator follows from the Law of Total Probability. The probabilities P(Bi ) are called the prior probabilities, and the probabilities P(Bi | A) are called the posterior probabilities.
Example 1.19 A box has 10 coins in it. Nine of the 10 are fair coins (probability of a head is 1/2) while the remaining coin has a head on both sides. A coin is chosen at random and tossed five times. If all five tosses result in a head, what is the probability that the sixth toss is also a head?

To answer this question, we first set up some events. Let A be the event that the sixth toss is a head, B the event that the first five tosses resulted in a head, and C the event that the chosen coin is fair. Clearly, P(A | C) = 1/2, and P(A | C^c) = 1. We can compute P(A | B) as

P(A | B) = P(A | B ∩ C)P(C | B) + P(A | B ∩ C^c)P(C^c | B)
= (1/2) · P(C | B) + (1)(1 − P(C | B))  (since P(· | B) is a probability function)

where, by Bayes' Theorem,

P(C | B) = P(B | C)P(C) / [P(B | C)P(C) + P(B | C^c)P(C^c)] = (1/32)(9/10) / [(1/32)(9/10) + (1)(1/10)] = 9/41.

Therefore,

P(A | B) = (1/2)(9/41) + (1 − 9/41) = 73/82 ≈ 0.8902.
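Example 1.19 can be reproduced with exact arithmetic; the computation below follows the same two steps, Bayes' Theorem for P(C | B) and then the weighted average for P(A | B).

```python
from fractions import Fraction

# Example 1.19: 9 fair coins, 1 two-headed coin.
p_fair = Fraction(9, 10)
p_B_given_fair = Fraction(1, 2) ** 5              # P(5 heads | fair coin)
p_B_given_twohead = Fraction(1)                   # P(5 heads | two-headed coin)

# Bayes' Theorem for P(fair | 5 heads).
p_B = p_B_given_fair * p_fair + p_B_given_twohead * (1 - p_fair)
p_fair_given_B = p_B_given_fair * p_fair / p_B    # 9/41

# P(6th toss is a head | first 5 were heads).
p_A_given_B = Fraction(1, 2) * p_fair_given_B + 1 * (1 - p_fair_given_B)

assert p_fair_given_B == Fraction(9, 41)
assert p_A_given_B == Fraction(73, 82)
```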
Example 1.20 Medical tests are not foolproof. Suppose a certain test for a virus has a sensitivity of 95% and a specificity of 92%. Let TP be the event that a randomly chosen person tests positive for the virus and D the event that the person has the virus. Sensitivity and specificity of a test are defined as

sensitivity = P(TP | D) and specificity = P(TP^c | D^c).

Further suppose that the occurrence of the virus is 5% in a certain population. What is the chance that a randomly chosen person from the population who tests positive actually has the virus? We use Bayes' Theorem to compute

P(D | TP) = P(TP | D)P(D) / [P(TP | D)P(D) + P(TP | D^c)P(D^c)]
= P(TP | D)P(D) / [P(TP | D)P(D) + (1 − P(TP^c | D^c))P(D^c)]
= (0.95)(0.05) / [(0.95)(0.05) + (1 − 0.92)(0.95)] ≈ 0.3846.

There is only an approximately 38.5% chance of having the virus if a person tests positive for the virus.

Example 1.21 Now suppose there is only a 1% chance of contracting a rare disease. Let D be the event that a randomly chosen person in a certain population has the disease and TP be the event that the person tests positive for the disease. Assume that the sensitivity is P(TP | D) = 0.98 and the specificity is P(TP^c | D^c) = 0.95. We compute
P(D | TP) = P(TP | D)P(D) / [P(TP | D)P(D) + P(TP | D^c)P(D^c)]
= (0.98)(0.01) / [(0.98)(0.01) + (1 − 0.95)(0.99)] = 0.16526.

Now suppose that a second, independent repetition of the test is conducted which also results in a positive test. Now what is the probability that the person has the disease? To answer this question, let TPi be the event that the person tests positive on test i, i = 1, 2. By Bayes' Theorem,

P(D | TP1 ∩ TP2) = P(TP1 ∩ TP2 | D)P(D) / [P(TP1 ∩ TP2 | D)P(D) + P(TP1 ∩ TP2 | D^c)P(D^c)]
= P(TP1 | D)P(TP2 | D)P(D) / [P(TP1 | D)P(TP2 | D)P(D) + (1 − P(TP1^c | D^c))(1 − P(TP2^c | D^c))P(D^c)]  (conditional independence)
= (0.98)^2 (0.01) / [(0.98)^2 (0.01) + (1 − 0.95)^2 (0.99)] = 0.79510.
The probability of having the disease after two positive tests has increased dramatically! As a note, conditional independence is used in the derivation above. Specifically, two events are called conditionally independent if P(A ∩ B | C) = P(A | C)P(B | C).
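The two examples above can be folded into one small helper. Note that updating the prior after each positive test, as below, gives the same answer as the joint two-test formula precisely because of conditional independence; the function name is our own.

```python
from fractions import Fraction

def posterior(prior, sensitivity, specificity):
    """P(D | positive test) by Bayes' Theorem, given P(D) = prior."""
    num = sensitivity * prior
    den = num + (1 - specificity) * (1 - prior)
    return num / den

sens, spec = Fraction(98, 100), Fraction(95, 100)

# One positive test at 1% prevalence (Example 1.21).
p1 = posterior(Fraction(1, 100), sens, spec)
assert abs(float(p1) - 0.16526) < 1e-4

# A second, conditionally independent positive test: feed the first
# posterior back in as the new prior.
p2 = posterior(p1, sens, spec)
assert abs(float(p2) - 0.79510) < 1e-4
```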
1.3
Appendix: Counting Techniques
Counting methods are important in discrete probability theory. Problems in discrete probability involve computing the probability of an event when an experiment has only a finite number of outcomes. We will discuss three types of counting methods: the multiplication principle, permutations, and combinations.
1.3.1
Multiplication Principle
The Multiplication Principle is the result of the following theorem. Theorem 1.22 Suppose a task T can be divided into two sequential subtasks, T1 followed by T2 . If there are m ways of accomplishing T1 and n ways of accomplishing T2 , then there are mn ways of accomplishing T . Proof Choose one of the m ways of doing T1 , and then perform T1 . After T1 is accomplished, there are n ways of doing T2 . In the diagram below, choice i represents the ith way to accomplish T1 . There are m of these choices. After making a choice, there are n ways to
do task T2.

n (choice 1) + n (choice 2) + ··· + n (choice m − 1) + n (choice m) = mn.
Example 1.23 Let S = {a, b, c, d, e, f, g}.

(i) How many 7-letter strings are there? We first choose the first letter of the string. There are seven choices (the subtask T1). There are again seven choices for the second letter in the string (the subtask T2), etc. The total number of possible 7-letter strings from S is 7^7 = 823,543.

(ii) How many 7-letter strings are there that begin with an 'a' or 'b' and end with a 'c' or 'd'? This time, we are restricted to only two choices for the first and last subtasks. Otherwise we can choose any of the seven letters. The number is 2 · 7^5 · 2 = 67,228.

(iii) How many 4-letter strings begin with an 'a' or end with an 'a'? The strings have one of the following three forms: (1) a - - a, (2) a - - x, and (3) x - - a. (x is a letter other than a.) There are 7^2 strings of type (1), 6 · 7^2 strings of type (2) since choosing the last letter can only be done in six ways, and 6 · 7^2 strings of type (3). The total number of strings is 13 · 7^2 = 637.

(iv) How many 5-letter strings are there with no letters repeated? Choosing the first letter can be done in seven ways, but choosing the second letter can only be done in six ways since no letter can be repeated, and so forth. The total number of strings is 7 · 6 · 5 · 4 · 3 = 2,520.
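Counts obtained from the Multiplication Principle are easy to confirm by brute-force enumeration with itertools; this sketch checks the four counts of Example 1.23.

```python
from itertools import product, permutations

letters = "abcdefg"

# (i) all 7-letter strings
assert 7 ** 7 == 823_543

# (ii) begin with 'a' or 'b' and end with 'c' or 'd'
count_ii = sum(1 for s in product(letters, repeat=7)
               if s[0] in "ab" and s[-1] in "cd")
assert count_ii == 2 * 7 ** 5 * 2 == 67_228

# (iii) 4-letter strings beginning or ending with 'a'
count_iii = sum(1 for s in product(letters, repeat=4)
                if s[0] == "a" or s[-1] == "a")
assert count_iii == 13 * 7 ** 2 == 637

# (iv) 5-letter strings with no repeated letters
assert len(list(permutations(letters, 5))) == 7 * 6 * 5 * 4 * 3 == 2_520
```

Case (ii) iterates over all 7^7 = 823,543 strings, which takes only a moment, but this style of check clearly does not scale to much larger alphabets; that is exactly why the counting formulas matter.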
1.3.2
Permutations
We begin with the definition of a permutation. Definition 1.24 A permutation of k objects from n objects is an ordered arrangement of k of the n objects. Example 1.25 For the letters in the set {a, b, c, d} and k = 3, the set of permutations of length three from four objects are listed below. abc abd acd bdc acb adb adc bcd bac bad cad cbd bca bda cda cdb cab dab dac dbc cba dba dca dcb
There are a total of 24 permutations.
Theorem 1.26 The number of permutations of length k from n objects is

nPk = n! / (n − k)! = n(n − 1)(n − 2) ··· (n − k + 1).
Proof There are n ways to choose the first object, but only n − 1 ways to choose the second object (since no object is ever repeated), and so forth. Example 1.27 A certain college student has the following inventory of textbooks: two anthropology books, four computer science books, three biology books, five chemistry books, and three mathematics books. What is the probability that having placed the textbooks on her shelf in random order, all the textbooks of the same subject are grouped together? The number of all possible arrangements of the textbooks on the shelf is 17! First choose a permutation of the subjects. There are 5! ways of doing this. Permuting the books within each subject results in 5!(2!4!3!5!3!) arrangements in which the textbooks on the same subject are grouped together. Let A be the event that all textbooks on the same subject are grouped together. Then,
P(A) = 5!(2! 4! 3! 5! 3!) / 17! = 1/14,294,280 ≈ 6.9958 × 10^(−8).

1.3.3
Combinations
Permutations are ordered arrangements. What if we relax the condition that the objects be ordered? Definition 1.28 A combination of size k from n objects is a set of k unordered objects drawn from n objects. Example 1.29 For the letters in the set {a, b, c, d} and k = 3, the set of combinations are listed below. {a, b, c} {a, b, d} {a, c, d} {b, c, d}
Let x denote the number of combinations of size k from n objects. Choose one of these x combinations. Then permute all the objects within the combination chosen. We have that
x · k! = nPk, and so

x = nPk / k! = [n! / (n − k)!] / k! = n! / (k!(n − k)!), abbreviated C(n, k).

The symbol C(n, k) is called a binomial coefficient. We have proved the following theorem.

Theorem 1.30 The number of combinations of k objects from n objects is nCk = C(n, k) = n! / (k!(n − k)!).

Example 1.31 What is the probability of drawing a full house from a 52-card poker deck? (A full house is three cards of one of the 13 denominations (2, 3, ..., 10, J, Q, K, A) and two cards of another denomination.) Since the order of the cards in the hand is not important, there are
C(52, 5) possible 5-card poker hands in the deck. Let A denote the event of being dealt a full house. Counting the number of full houses can be done by completing the subtasks in the table below.

Subtask  Action                                                Number of ways
1        Choose the two denominations in the full house        C(13, 2)
2        Choose the denomination having three cards            C(2, 1)
3        Choose three cards in the denomination in subtask 2   C(4, 3)
4        Choose two cards in the remaining denomination        C(4, 2)

Therefore, the number of full houses is C(13, 2) C(2, 1) C(4, 3) C(4, 2), and so

P(A) = C(13, 2) C(2, 1) C(4, 3) C(4, 2) / C(52, 5) = 6/4,165 ≈ 0.00144.
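Python's math.comb evaluates binomial coefficients directly, so the full-house count can be checked in a few lines:

```python
from fractions import Fraction
from math import comb

# Full house: C(13,2) ways to pick the two denominations, C(2,1) ways to
# pick which of them supplies three cards, then C(4,3) and C(4,2) ways
# to pick the suits.
full_houses = comb(13, 2) * comb(2, 1) * comb(4, 3) * comb(4, 2)
hands = comb(52, 5)                               # 2,598,960 five-card hands

assert full_houses == 3_744
assert Fraction(full_houses, hands) == Fraction(6, 4165)
```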
The probability of being dealt other types of poker hands can be computed in similar ways.

Example 1.32 Five numbers are drawn at random from the set {1, 2, ..., 20}. What is the probability that the minimum number drawn is at least 6? The number of ways to select five numbers from 20 numbers is C(20, 5). Let A denote the event that the minimum number is at least 6. It is more efficient to work with A^c, the event that the minimum number is at most 5. Let Bi be the event that the minimum number is exactly i, i = 1, 2, ..., 5. If the minimum is i, then the other four numbers can only be chosen from the set {i + 1, i + 2, ..., 20}. Therefore, the number of draws whose minimum is exactly i is C(20 − i, 4). Therefore,
P(A) = 1 − P(A^c) = 1 − P(B1 ∪ B2 ∪ B3 ∪ B4 ∪ B5)
= 1 − (P(B1) + P(B2) + P(B3) + P(B4) + P(B5))  (disjoint events)
= 1 − [C(19, 4) + C(18, 4) + C(17, 4) + C(16, 4) + C(15, 4)] / C(20, 5) = 1,001/5,168 ≈ 0.193692.
1.4
Problems
1.4.1 Suppose P(A) = p, P(B) = 0.3, and P(A ∪ B) = 0.6. Find p so that P(A ∩ B) = 0. Also, find p so that A and B are independent.

1.4.2 When P(A) = 1/3, P(B) = 1/2, and P(A ∪ B) = 3/4, (a) what is P(A ∩ B)? and (b) what is P(A^c ∪ B)?

1.4.3 Show P(AB^c) = P(A) − P(AB) and P(exactly one of A or B occurs) = P(A) + P(B) − 2P(A ∩ B).

1.4.4 32% of Americans smoke cigarettes, 11% smoke cigars, and 7% smoke both. (a) What percent smoke neither cigars nor cigarettes? (b) What percent smoke cigars but not cigarettes?

1.4.5 Let A, B, C be events. Write the expression for the event that
(a) only A occurs
(b) both A and C occur but not B
(c) at least one occurs
(d) at least 2 occur
(e) all 3 occur
(f) none occur
(g) at most 2 occur
(h) at most 1 occurs
(i) exactly 2 occur
(j) at most 3 occur
1.4.6 Suppose n(A) is the number of times A occurs if an experiment is performed N times. Set F_N(A) = n(A)/N. Show that F_N satisfies the definition of a probability function. This leads to the frequency definition of the probability of an event, P(A) = lim_{N→∞} F_N(A), i.e., the probability of an event is the long-term fraction of time the event occurs.
1.4.7 Three events A, B, C cannot occur simultaneously. Further it is known that P(A ∩ B) = P(B ∩ C) = P(A ∩ C) = 1/3. Can you determine P(A)? Hint: A ⊂ (B ∩ C)^c.

1.4.8 (a) Give an example to illustrate that P(A) + P(B) = 1 does not imply A ∩ B = ∅. (b) Give an example to illustrate that P(A ∪ B) = 1 does not imply A ∩ B = ∅. (c) Prove that P(A) + P(B) + P(C) = 1 if and only if P(AB) = P(AC) = P(BC) = 0.

1.4.9 A box contains 2 white balls and an unknown (finite) number of non-white balls. Suppose 4 balls are chosen at random without replacement, and suppose the probability of the sample containing both white balls is twice the probability of the sample containing no white balls. Find the total number of balls in the box.

1.4.10 Let C and D be two events for which one knows that P(C) = 0.3, P(D) = 0.4, and P(C ∩ D) = 0.2. What is P(C^c ∩ D)?

1.4.11 An experiment has only two possible outcomes, only one of which may occur. The first has probability p of occurring, and the second probability p^2. What is p?

1.4.12 We repeatedly toss a coin. A head has probability p, and a tail probability 1 − p, where 0 < p < 1. What is the probability the first head occurs on the 5th toss? What is the probability it takes 5 tosses to get two heads?

1.4.13 Show that if A ⊆ B, then P(A) ≤ P(B).

1.4.14 Analogous to the finite sample space case with equally likely outcomes, we may define P(A) = area of A / area of S, where S ⊂ R^2 is a fixed two-dimensional set (with equally likely outcomes) and A ⊂ S. Suppose that we have a dart board given by S = {x^2 + y^2 ≤ 9}, and A is the event that a randomly thrown dart lands in the ring with inner radius 1 and outer radius 2. Find P(A).

1.4.15 Show that P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(C ∩ B) + P(A ∩ B ∩ C).

1.4.16 Show that P(A ∩ B) ≥ P(A) + P(B) − 1 for all events A, B ∈ F. Use this to find a lower bound on the probability that both events occur if the probability of each event is 0.9.

1.4.17 A fair coin is flipped twice. We know that one of the tosses is a head. Find the probability the other toss is a head. (Hint: The answer is not 1/2.)
1.4.18 Find the probability of two pair in a 5-card poker hand.

1.4.19 Show that De Morgan's Laws (A ∪ B)^c = A^c ∩ B^c and (A ∩ B)^c = A^c ∪ B^c hold, and then find the probability neither A nor B occurs and the probability that either A does not occur or B does not occur but one of the two does occur. Your answer should express these probabilities in terms of P(A), P(B), and P(A ∩ B).

1.4.20 Show that if A and B are independent events, then so are A and B^c as well as A^c and B^c.

1.4.21 If P(A) = 1/3 and P(B^c) = 1/4, is it possible that A ∩ B = ∅? Explain.
1.4.22 Suppose we choose one of two coins C1 or C2, where the probability of getting a head with C1 is 1/3, and with C2 is 2/3. If we choose a coin at random, what is the probability we get a head when we flip it?

1.4.23 Suppose two cards are dealt one at a time from a well-shuffled standard deck of cards. Cards are ranked 2 < 3 < ··· < 10 < J < Q < K < A. (a) Find the probability the second card beats the first card. Hint: Look at Σ_k P(C2 > C1 | C1 = k)P(C1 = k). (b) Find the probability the first card beats the second and the probability the two cards match.

1.4.24 A basketball team wins 60% of its games when it leads at the end of the first quarter, and loses 90% of its games when the opposing team leads. If the team leads at the end of the first quarter about 30% of the time, what fraction of the games does it win?

1.4.25 Suppose there is a box with 10 coins, 8 of which are fair coins (probability of heads is 1/2), and 2 of which have heads on both sides. Suppose a coin is picked at random, and it is tossed 5 times. Given that we got 5 straight heads, what are the chances the coin has heads on both sides?

1.4.26 Is independence for three events A, B, C the same as: A, B are independent; B, C are independent; and A, C are independent? Perform two independent tosses of a coin. Let A = heads on toss 1, B = heads on toss 2, and C = the two tosses are equal. (a) Find P(A), P(B), P(C), P(C|A), P(B|A), and P(C|B). What do you conclude? (b) Find P(A ∩ B ∩ C) and P(A ∩ B ∩ C^c). What do you conclude?
1.4.27 First show that P(A ∪ B) = P(A) + P(A^c ∩ B) and then calculate (a) P(A ∪ B) if it is given that P(A) = 1/3 and P(B|A^c) = 1/4; (b) P(B) if it is given that P(A ∪ B) = 2/3 and P(A^c|B^c) = 1/2.

1.4.28 The events A, B, and C satisfy P(A|B ∩ C) = 1/4, P(B|C) = 1/3, and P(C) = 1/2. Calculate P(A^c ∩ B ∩ C).

1.4.29 Two independent events A and B are given, and P(B|A ∪ B) = 2/3, P(A|B) = 1/2. What is P(B)?

1.4.30 You roll a die and a friend tosses a coin. If you roll a 6, you win. If you don't roll a 6 and your friend tosses a head, you lose. If you don't roll a 6, and your friend does not toss a head, the game repeats. Find the probability you win.

1.4.31 Suppose a single gene controls whether or not a male is bald: bald (B) is dominant and hairy (b) is recessive. Therefore, a male will be bald unless his genotype is bb. A male and female, each with genotype Bb, mate and produce a single male child. The laws of genetics dictate that the child is equally likely to be any of BB, Bb, bB, or bb (with Bb and bB genetically equivalent). (a) What is the probability the male child will be bald (eventually)? (b) Given that the child becomes bald, what is the probability his genotype is Bb?

1.4.32 You are diagnosed with an uncommon disease. You know that there is only a 4% chance of having the disease. Let D = you have the disease, and T = the test says you have it. It is known that the test is imperfect: P(T|D) = 0.9 and P(T^c|D^c) = 0.85. (a) Given that you test positive, what is the probability that you really have the disease? (b) You obtain a second and third opinion, two more (conditionally) independent repetitions of the test. You test positive again on both tests. Assuming conditional independence, what is the probability that you really have the disease?

1.4.33 Two dice are rolled. What is the probability that at least one is a six? If the two faces are different, what is the probability that at least one is a six?

1.4.34 Suppose there are 2 boxes. Box 1 has a $100 bill and a $1 bill. Box 2 has two $100 bills. You do not know which box is which. You choose a box at random and select a bill at random. It turns out to be $100. What is the probability the other bill in the chosen box is also $100?
1.4.35 15% of a group are heavy smokers, 30% are light smokers, and 55% are nonsmokers. In a 5-year study it was determined that the death rates of heavy and light smokers were 5 and 3 times that of nonsmokers, respectively. What is the probability a randomly selected person was a nonsmoker, given that he died?

1.4.36 A, B, and C are mutually independent, and P(A) = 0.5, P(B) = 0.8, and P(C) = 0.9. Find the probabilities that (i) all three occur, (ii) exactly 2 of the 3 occur, and (iii) none occurs.

1.4.37 A box has 8 red and 7 blue balls. A second box has an unknown number of red and 9 blue balls. If we draw a ball from each box at random, we know the probability of getting 2 balls of the same color is 151/300. How many red balls are in the second box?

1.4.38 Show that: (a) P(A|A ∪ B) ≥ P(A|B). Hint: A = (A ∩ B) ∪ (A ∩ B^c) and A ∪ B = B ∪ (A ∩ B^c). (b) If P(A|B) = 1, then P(B^c|A^c) = 1. (c) P(A|B) ≥ P(A) implies P(B|A) ≥ P(B).

1.4.39 Coin 1 is a head with probability 0.4, and Coin 2 is a head with probability 0.7. One of these coins is chosen at random and flipped 10 times. Find (a) P(coin lands heads on exactly 7 of the 10 flips); (b) given the first of these 10 flips is a head, the conditional probability that exactly 7 of the 10 flips are heads.

1.4.40 Show the extended version of the Law of Total Conditional Probability:

P(A|B) = Σ_i P(A|E_i ∩ B)P(E_i|B), where S = ∪_i E_i and E_i ∩ E_j = ∅ for i ≠ j.
1.4.41 There are two universities. The breakdown of males and females majoring in Math at each university is given in the tables.

Univ 1    Math major  Other
Males     200         800
Females   150         850

Univ 2    Math major  Other
Males     30          70
Females   1000        3000
Show that this is an example of Simpson’s paradox. 1.4.42 The table gives the result of a drug trial:
Drug No drug
M Recover
M Die
15 20
40 40
F Recover 90 20
F Die 50 10
O Recover O Die 105 40
90 50
Here M = male, F = female, and O-overall. Show that this is an example of Simpson’s paradox.
2
Random Variables
Random variables are the central objects of study in probability. The name 'random variable' is somewhat of a misnomer since they are neither random nor variables in the conventional sense. A random variable is simply a real-valued function on the sample space of an experiment whose purpose is to extract information, that is, data, from the outcomes of that experiment. For example, we might want to know the number of heads that appear in 10 flips of a fair coin. The random variable assigns to each of the 1024 sequences of heads and tails in the sample space the number of heads in that sequence. Or, having thrown a dart at a dart board of radius 1 modeled as the disk x^2 + y^2 ≤ 1, we might be interested in how far the dart lies from the center of the board. In this case, the random variable assigns to each set of landing coordinates (x, y) in the (uncountable) sample space the distance between (x, y) and the origin (0, 0). Random variables can be classified as either discrete or continuous, and within each category they fall into broad classes with specific purposes and properties. Complex probability calculations can be greatly simplified if the problem can be formulated as an instance of a random variable in one of these 'standard' classes. Random variables play a central role not only in probability but also in statistics. Statisticians analyze data sets trying to make inferences from limited amounts of available data, and the data sets they work with are nothing more than a small sample of values from one or more random variables.
2.1
The Basics
In this section we make precise what is meant by a random variable. We will first describe discrete random variables and then later in the section continuous random variables since these two categories of random variables are handled somewhat differently.
Definition 2.1 A random variable (abbreviated rv) is a function X : S → R, where S is the sample space of some experiment, such that E = {s ∈ S | X(s) ≤ r} is an event in S for each r ∈ R.

Example 2.2 Let S be the set of sequences of heads and tails obtained from tossing a fair coin four times. Denote a head by 1 and a tail by 0. S consists of the sixteen sequences below; the number beneath each outcome is its number of runs:

0000  0001  0010  0011  0100  0101  0110  0111
  1     2     3     2     3     4     3     2
1000  1001  1010  1011  1100  1101  1110  1111
  2     3     4     3     2     3     2     1

The number i below each outcome is the number of runs that appear in the sequence. A run is a contiguous block of 0's or 1's. We can define X : S → R as X(s) = i. Clearly, X(S) = {1, 2, 3, 4}, where X(S) = {X(s) : s ∈ S}. X(S) is also denoted R(X).

Example 2.3 Consider the dart example mentioned above. Throwing a dart at a dart board of radius 1 yields the sample space S = {(x, y) | x^2 + y^2 ≤ 1}. Let X(x, y) = √(x^2 + y^2), the distance between the landing coordinates and the origin. In this case, X(S) = [0, 1].

The above are both examples of rvs, but there do appear to be some differences between them. For example, the rv in Example 2.2 takes on only four values whereas the rv in Example 2.3 takes on an entire continuum of values. This difference is a fundamental one.
2.1.1
Discrete RVs and PMFs
In this section, we restrict ourselves to discrete rvs although some of the concepts we introduce apply to both types. If X is any rv, the notation {X = x} denotes the event {s ∈ S | X(s) = x}. In other words, {X = x} is X^{-1}(x), the set of all outcomes in S whose image under X is x. For example, in Example 2.2, {X = 2} = {0001, 0011, 0111, 1000, 1100, 1110}. In Example 2.3, {X = 1/2} is the circle of radius 1/2. If X is an rv, let p_X : R → [0, 1] be defined by p_X(x) = P(X = x).

Definition 2.4 The rv X is called discrete if {x ∈ R | p_X(x) > 0} is nonempty and finite (or at most countably infinite), in which case the function p_X(x) is called the probability mass function (abbreviated pmf) of X.
The pmf of a discrete rv X weights each value x ∈ R by the probability of the event {X = x} in S. Obviously, some values have a greater 'mass' than others, as in particle physics, where certain elementary particles have greater masses than others. The table below displays the pmf for the rv in Example 2.2, in which the outcomes are assumed equiprobable, that is, have the same probability of occurring.

x        1     2     3     4     x ∉ {1, 2, 3, 4}
p_X(x)   1/8   3/8   3/8   1/8   0
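Since the sample space here is small, the pmf above can be verified by brute-force enumeration. A minimal sketch (the helper names are ours, not part of the text):

```python
from itertools import product

def runs(seq):
    """Number of maximal contiguous blocks of equal symbols."""
    return 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)

# All 16 equiprobable outcomes of four tosses (1 = head, 0 = tail).
outcomes = list(product([0, 1], repeat=4))
pmf = {x: sum(1 for s in outcomes if runs(s) == x) / 16 for x in range(1, 5)}
print(pmf)  # {1: 0.125, 2: 0.375, 3: 0.375, 4: 0.125}
```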
We will derive the pmfs of various 'standard' discrete rvs in the next section.

Remark 2.5 The masses over all values of a discrete rv must add up to 1 because the events {X = x} are disjoint, and every outcome in the sample space must be in one (and only one) of these events. Therefore, Σ_{x ∈ X(S)} p_X(x) = 1. The sum could be an infinite sum if X(S) is countably infinite.

Definition 2.6 Let D be a countable set (finite or countably infinite) on which there is defined a function p(x) such that (i) 0 < p(x) ≤ 1 for x ∈ D, (ii) p(x) = 0 for x ∉ D, and (iii) Σ_{x ∈ D} p(x) = 1. The function p(x) is called a distribution.

Remark 2.7 It is easy to show that given a distribution, there is a sample space S and a discrete rv X on S such that p(x) = p_X(x). (How?) The term 'distribution' is used somewhat loosely in probability. Depending on the context, it may refer to the pmf or to the probability density function (pdf), to be defined later for continuous rvs. It may also refer to one of the standard discrete or continuous rvs like the binomial (discrete) or normal (continuous) rv.
2.1.2
Cumulative Distribution Functions
For any rv, a certain function can be defined on all of R that ‘accumulates’ probability. Definition 2.8 If X is any rv, the function FX : R → [0, 1], called the cumulative distribution function (abbreviated cdf), is defined as FX (x) = P(X ≤ x) where {X ≤ x} = {s ∈ S | X (s) ≤ x}.
Note that as x becomes larger, more probability is accumulated since if x1 < x2, P(X ≤ x1) ≤ P(X ≤ x2) by the Monotone Property.

Example 2.9 Consider the rv X of Example 2.2. Its cdf is given below.

F_X(x) = { 0,    x < 1
           1/8,  1 ≤ x < 2
           1/2,  2 ≤ x < 3
           7/8,  3 ≤ x < 4
           1,    x ≥ 4

A graph of F_X(x) is illustrated below.
2.1.2.1 Properties of Cdfs

Cdfs of rvs have a number of important properties.

Theorem 2.10 If X is any rv, then F_X(x) has the following properties.

(i) F_X(x) is nondecreasing.
(ii) lim_{x→−∞} F_X(x) = 0 and lim_{x→∞} F_X(x) = 1.
(iii) F_X(x) is right continuous, that is, lim_{y→x+} F_X(y) = F_X(x) (y → x+ means that y approaches x from the right of x).

Proof We prove (i) only. If x < y, then {X ≤ x} ⊆ {X ≤ y}. By the Monotone Property, F_X(x) = P(X ≤ x) ≤ P(X ≤ y) = F_X(y).
Fig. 2.1 FX (x)
We can compute probabilities using only the cdf. (We do not need the pmf in the case where X is discrete.) All these results are summarized in the following theorem. The theorem also holds for continuous rvs, to be discussed later. The notation F_X(x−) = lim_{y→x−} F_X(y) (meaning that y approaches x from the left of x).

Theorem 2.11 Let X be any rv, and let F_X(x) be its cdf.

(i) P(X ≤ a) = F_X(a)
(ii) P(X > a) = 1 − F_X(a)
(iii) P(X < a) = F_X(a−)
(iv) P(X ≥ a) = 1 − F_X(a−)
(v) P(a ≤ X ≤ b) = F_X(b) − F_X(a−)
(vi) P(a < X ≤ b) = F_X(b) − F_X(a)
(vii) P(a ≤ X < b) = F_X(b−) − F_X(a−)
(viii) P(a < X < b) = F_X(b−) − F_X(a)
(ix) P(X = a) = F_X(a) − F_X(a−)
Proof We prove (vi); proofs of the others are similar. Note that {a < X ≤ b} = {X ≤ b} − {X ≤ a}. Therefore, by the Difference Property, P(a < X ≤ b) = P(X ≤ b) − P({X ≤ b} ∩ {X ≤ a}) = P(X ≤ b) − P(X ≤ a) = F_X(b) − F_X(a), where the middle equality holds because {X ≤ a} ⊆ {X ≤ b}.
Example 2.12 Suppose X is a random variable with values 1, 2, 3 with probabilities 1/6, 1/3, 1/2, respectively. In Fig. 2.1 the jumps are at x = 1, 2, 3. The size of the jump is P(X = x), x = 1, 2, 3, and at each jump the left endpoint is not included while the
right endpoint is included because the cdf is continuous from the right. We calculate P(X < 2) = P(X = 1) = 1/6, but P(X ≤ 2) = P(X = 2) + P(X = 1) = 1/2.

Example 2.13 Consider the rv X in Example 2.2. Several representative computations are illustrated below.

P(−1 ≤ X < 2.5) = F_X(2.5−) − F_X(−1−) = 4/8 − 0 = 1/2.
P(2 < X < 4) = F_X(4−) − F_X(2) = 7/8 − 1/2 = 3/8.
P(X = 3) = F_X(3) − F_X(3−) = 7/8 − 4/8 = 3/8.
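Theorem 2.11 lets us carry out these computations from the cdf alone. A small sketch for the rv of Example 2.2 (function names are ours; F_left plays the role of F_X(x−)):

```python
# pmf of the runs rv from Example 2.2
pmf = {1: 1/8, 2: 3/8, 3: 3/8, 4: 1/8}

def F(x):
    """F_X(x) = P(X <= x)."""
    return sum(p for v, p in pmf.items() if v <= x)

def F_left(x):
    """F_X(x-) = P(X < x)."""
    return sum(p for v, p in pmf.items() if v < x)

print(F_left(2.5) - F_left(-1))  # P(-1 <= X < 2.5) = 0.5, property (vii)
print(F_left(4) - F(2))          # P(2 < X < 4)     = 0.375, property (viii)
print(F(3) - F_left(3))          # P(X = 3)         = 0.375, property (ix)
```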
2.1.3
Continuous RVs and PDFs
As noted, continuous rvs are handled differently from discrete ones. We begin with a motivating example.

Example 2.14 Consider Example 2.3. We compute the cdf F_X(x). Clearly if x < 0, F_X(x) = 0, and if x > 1, F_X(x) = 1. What if 0 < x < 1? By definition, F_X(x) = P(X ≤ x), that is, the probability that the dart lands at a distance at most x from the origin. We can compute this probability as the ratio of the area of the subdisk of radius x to the area of the entire dart board. Therefore, if 0 < x < 1, F_X(x) = P(X ≤ x) = πx^2/π = x^2. The cdf of X is

F_X(x) = { 0,    x < 0
           x^2,  0 ≤ x < 1
           1,    x ≥ 1

The graph of F_X(x) is given below. Note that F_X(x) is a continuous function, and so F_X(a+) = F_X(a−) = F_X(a).
Using property (v) of cdfs, if 0 ≤ a < b ≤ 1, we get that
P(a ≤ X ≤ b) = F_X(b) − F_X(a−) = b^2 − a^2 = ∫_a^b 2x dx.
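The identity F_X(x) = x^2 can also be checked by simulation. The sketch below (our code; the sample size and seed are arbitrary choices) throws random darts and estimates P(0.3 ≤ X ≤ 0.7), which should be near 0.7^2 − 0.3^2 = 0.40:

```python
import random

random.seed(1)

def dart_distance():
    """Uniform point on the unit disk via rejection sampling; distance from (0, 0)."""
    while True:
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        if x * x + y * y <= 1:
            return (x * x + y * y) ** 0.5

n = 100_000
hits = sum(1 for _ in range(n) if 0.3 <= dart_distance() <= 0.7)
print(hits / n)  # close to 0.7**2 - 0.3**2 = 0.40
```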
It appears that F_X(x) is an antiderivative of f(x) = 2x. This observation will motivate the definition of a continuous rv. Continuous rvs come in two varieties, absolutely continuous and singularly continuous. Singularly continuous rvs will not be discussed in this text since they are of limited value in STEM applications. Going forward, a continuous rv will automatically mean absolutely continuous. Motivated by the previous example, we make the following definition.

Definition 2.15 A rv X is called continuous if there exists a function f_X : R → R, called the probability density function (abbreviated pdf) of X, such that f_X(x) ≥ 0 and

F_X(x) = ∫_{−∞}^{x} f_X(t) dt.
Since F_X(x) is given by an integral, F_X(x) is a continuous function. Pdfs have several important properties, summarized in the following theorem.

Theorem 2.16 Let X be a continuous rv with pdf f_X(x).

(i) ∫_{−∞}^{∞} f_X(x) dx = 1.
(ii) F_X′(x) = f_X(x) at every point x at which f_X(x) is continuous.
(iii) For real numbers a and b, a ≤ b, P(a ≤ X ≤ b) = ∫_a^b f_X(x) dx.

Proof For (i), we know that lim_{x→∞} F_X(x) = 1. Therefore

1 = lim_{x→∞} ∫_{−∞}^{x} f_X(t) dt = ∫_{−∞}^{∞} f_X(x) dx.

Property (ii) follows from the Fundamental Theorem of Calculus. (See Remark 2.17 below.) For (iii), since F_X(x) is continuous, F_X(a−) = F_X(a). Therefore,

P(a ≤ X ≤ b) = F_X(b) − F_X(a−) = F_X(b) − F_X(a) = ∫_{−∞}^{b} f_X(x) dx − ∫_{−∞}^{a} f_X(x) dx = ∫_a^b f_X(x) dx.
Remark 2.17 The pdf f_X(x) may be written as simply f(x) if the rv is clear from the context. If X is continuous, note that P(X = a) = 0 since ∫_a^a f_X(x) dx = 0. (This property distinguishes continuous rvs from discrete rvs, since for a discrete rv X, P(X = a) > 0 for some a.) Also, since F_X(x) is continuous,

P(a ≤ X ≤ b) = P(a < X ≤ b) = P(a ≤ X < b) = P(a < X < b) = ∫_a^b f_X(x) dx.

Endpoints can be ignored. This is not the case if X is discrete! The value f_X(a) has no probability interpretation other than that if f_X(a) < f_X(b) for two distinct values a and b, it is more likely that X takes a value near b than near a. Finally, if X is continuous with pdf f_X(x) and cdf F_X(x), we know that F_X(x) = ∫_{−∞}^{x} f_X(t) dt. Therefore, we can find the pdf if we know the cdf using the Fundamental Theorem of Calculus: F_X′(x) = f_X(x). Another way to see this is from the intuitive notion that f_X(x) dx ≈ P(x ≤ X ≤ x + dx), so that

f_X(x) = lim_{dx→0} [P(X ≤ x + dx) − P(X ≤ x)]/dx.

Since the last difference quotient is just lim_{dx→0} [F_X(x + dx) − F_X(x)]/dx = F_X′(x), that is why f_X(x) = F_X′(x).

2.2

Important Discrete Distributions
Discrete rvs tend to be counting variables. If s is the outcome of an experiment, then X(s) is the number of times a specific property occurs in s, and the pmf gives the probability that the property occurs a given number of times. As noted in the introduction, rvs tend to fall into broad classes with specific properties and purposes. For each of the rvs discussed in this section, we give a simple example before giving the formal definition of the rv.¹
2.2.1
Discrete Uniform RVs
Discrete uniform rvs are the simplest type of discrete rv.

Example 2.18 A single card is drawn from a shuffled poker deck of 52 cards. Label the cards from 1 to 52 in some manner. The rv X assigns to each card its distinct label. The probability of drawing any particular card, for example the ace of spades, is 1/52.

Definition 2.19 A discrete uniform rv is any rv X with pmf

p_X(x) = 1/n, x = 1, 2, ..., n.

We write X ∼ Unif(n).

¹ Refer to Sect. 2.10 for TI-8x commands for the pdfs and cdfs of all rvs.
2.2.2
Bernoulli RVs
Bernoulli rvs form the basis of the binomial rv, arguably the most important discrete distribution. Example 2.21 Suppose an experiment only has two outcomes, say a (viewed as a failure) and b (viewed as a success). Such an experiment is called a Bernoulli trial. For example, tossing a (not necessarily fair) coin is an example of such an experiment. The sample space S = {a, b}. Let X (a) = 0 and X (b) = 1. So X counts the number of successes in an outcome. Define p X (1) = P(X = 1) = p and p X (0) = P(X = 0) = 1 − p, 0 < p < 1. Definition 2.22 A Bernoulli rv with parameter p is any rv X with pmf p X (1) = P(X = 1) = p and p X (0) = P(X = 0) = 1 − p. We write X ∼ Ber n( p).
2.2.3
Binomial RVs
Performing Bernoulli trials independently a finite number of times is the basis for the binomial rv. Suppose we perform n Bernoulli trials and observe x successes and n − x failures. By independence, the probability of any particular such sequence is p^x (1 − p)^{n−x}. Clearly, the number of such sequences is C(n, x) = n!/(x!(n − x)!). Therefore, the probability that we observe x successes is C(n, x) p^x (1 − p)^{n−x}.

Example 2.23 A certain restaurant serves 8 entrées of fish, 12 of beef, and 10 of poultry. Assuming that customers choose among these entrées randomly, what is the probability that two of the next four customers choose fish? Let X count the number of customers who choose fish. Each of the four customers can be viewed as a Bernoulli trial with success probability p = 8/30 = 4/15 (choosing fish) and failure probability 1 − p = 11/15 (choosing beef or poultry). If four customers enter the restaurant, there are C(4, 2) choices for the two customers choosing fish, each with probability (4/15)^2 (11/15)^2. Therefore, P(X = 2) = C(4, 2) (4/15)^2 (11/15)^2 ≈ 0.22945.
Definition 2.24 A binomial rv with parameters n and p is any rv X with pmf

p_X(x) = P(X = x) = C(n, x) p^x (1 − p)^{n−x}, x = 0, 1, ..., n.
We write X ∼ Binom(n, p).

Example 2.25 In roulette, a popular casino game, the dealer, called the croupier, tosses a ball onto a spinning wheel, typically in the opposite direction of the spin. The outside edge of the wheel is marked with the numbers 1 through 36 in alternating red and black colors. There are also a 0 and a 00 marked in green. In one version of the game, as the ball is circling the roulette wheel, players bet on whether the ball will come to rest on a red number. A bet on red has a p = 18/38 probability of winning. Suppose a certain gambler bets $5 on red each time for 100 rounds. Let X be the total dollar amount won or lost on these 100 rounds. If M is the rv that counts the number of games won, then X = 10M − 500, so X takes on the values 0, ±10, ±20, ..., ±500. Clearly, M ∼ Binom(100, 18/38). The probability that the gambler wins 50 games is P(M = 50) = C(100, 50) (18/38)^{50} (20/38)^{50} = 0.0693, so the chance that the gambler breaks even is P(X = 0) = P(M = 50) = 0.0693.
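The break-even probability above is a one-line computation with Python's math.comb (a sketch of ours, not part of the text):

```python
from math import comb

p = 18 / 38                                   # probability a red bet wins
pmf_50 = comb(100, 50) * p**50 * (1 - p)**50  # binomial pmf at M = 50
print(pmf_50)  # approximately 0.0693
```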
2.2.4
Geometric RVs
Geometric rvs count the number of Bernoulli trials needed to achieve a single success. Example 2.26 Cards are drawn at random and with replacement (card is returned to the deck and the deck is reshuffled) from a 52-card poker deck. What is the probability that at least 10 draws are needed to draw the first ace? Let X denote the number of draws needed to 1 obtain the first ace. The card draws can be viewed as Bernoulli trials with success p = 13 12 (an ace is drawn) and failure 1 − p = 13 (a non-ace is drawn). The probability that only 10 9 1 (nine failures and one success), the probability that 11 draws draws are needed is 12 12 10 113 13 are needed is 13 13 (ten failures and one success), and so forth. Therefore, ∞ ∞ 12 i−1 1 1 12 i+9 P(X ≥ 10) = (reindexing) = 13 13 13 13 i=10 i=0 ∞ 1 12 9 12 i 1 12 9 1 = = (geometric sum) 13 13 13 13 13 1 − 12 13 i=0
∼ = 0.48657. Definition 2.27 A geometric rv with parameter p is any rv X with pmf p X (x) = P(X = x) = (1 − p)x−1 p, x = 1, 2, ....
We write X ∼ Geom( p).
2.2.5
Negative Binomial RVs
Negative binomial rvs are generalizations of geometric rvs. Let X denote the number of trials until we get r successes. We must have at least r trials to get r successes, and we get r successes with probability p^r and x − r failures with probability (1 − p)^{x−r}. Since we stop counting when we get the rth success, the last trial must be a success. Therefore, in the preceding x − 1 trials we obtain r − 1 successes, and there are C(x − 1, r − 1) ways that can happen. That is why P(X = x) = C(x − 1, r − 1) p^r (1 − p)^{x−r}, x ≥ r.

Definition 2.28 A negative binomial rv with parameters r and p is any rv X with pmf

p_X(x) = P(X = x) = C(x − 1, r − 1) p^r (1 − p)^{x−r}, x = r, r + 1, r + 2, ...
We write X ∼ NegBin(p, r).

Remark 2.29 Note that if r = 1, then X ∼ NegBin(p, 1) and X ∼ Geom(p).

Example 2.30 A certain mathematics professor has a craving for a Snickers™ bar, which costs $1. She has two quarters and five dimes. The vending machine malfunctions and does not accept the coins every time they are inserted into the machine, most of the time letting them slip through to the change return tray. The probability that the machine malfunctions on a given insertion of a coin is 0.90. What is the probability that it takes at least 20 tries to get the Snickers bar? Let X be the number of trials (coin insertions) needed to pay for the Snickers bar. The coin insertions can be viewed as Bernoulli trials with success probability p = 0.10 (a coin is accepted) and failure probability 1 − p = 0.90 (a coin slips through). The probability that exactly 20 attempts are needed for all seven coins to be accepted is C(19, 6)(0.10)^7(0.90)^{13}, the probability that 21 attempts are needed is C(20, 6)(0.10)^7(0.90)^{14}, and so forth. Therefore,

P(X ≥ 20) = Σ_{x=20}^{∞} C(x − 1, 6) (0.10)^7 (0.90)^{x−7} = 1 − Σ_{x=7}^{19} C(x − 1, 6) (0.10)^7 (0.90)^{x−7} ≈ 0.9983.

The professor will be standing at the machine for quite a while, it appears!

Example 2.31 (Best-of-seven series.) The baseball World Series and the NBA playoffs determine a winner by the two teams playing up to seven games, with the first team to win four
games the champion. Suppose team A wins each game with probability p and loses to team B with probability 1 − p.

(a) If p = 0.52, what is the probability A wins the series? For A to win the series, A can win in 4 straight games, or in 5, 6, or 7 games. This is negative binomial with r = 4 and x = 4, 5, 6, 7, so if X is the number of games until 4 wins for A, then X ∼ NegBin(0.52, 4), and

P(A wins series) = P(X = 4) + P(X = 5) + P(X = 6) + P(X = 7)
= C(3, 3)(0.52)^4(0.48)^0 + C(4, 3)(0.52)^4(0.48)^1 + C(5, 3)(0.52)^4(0.48)^2 + C(6, 3)(0.52)^4(0.48)^3
= 0.54368.

If p = 0.55, the probability A wins the series goes up to 0.60828, and if p = 0.6, the probability A wins is 0.7102.

(b) If p = 0.52 and A wins the first game, what is the probability A wins the series? Once game one is over, A has to be the first to 3 wins in at most 6 remaining games. Let X1 be the number of games (out of the remaining 6) until A wins 3. Then

P(A wins series) = P(X1 = 3) + P(X1 = 4) + P(X1 = 5) + P(X1 = 6)
= C(2, 2)(0.52)^3(0.48)^0 + C(3, 2)(0.52)^3(0.48)^1 + C(4, 2)(0.52)^3(0.48)^2 + C(5, 2)(0.52)^3(0.48)^3
= 0.6929.
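The two series calculations can be packaged as one short function (our sketch; the function name and keyword defaults are our own choices):

```python
from math import comb

def p_win_series(p, already_won=0, wins_needed=4):
    """P(team A is first to wins_needed wins in a best-of-(2*wins_needed - 1) series),
    summing the negative binomial pmf over the admissible series lengths."""
    r = wins_needed - already_won                   # wins A still needs
    max_games = 2 * wins_needed - 1 - already_won   # games remaining at most
    return sum(comb(x - 1, r - 1) * p**r * (1 - p)**(x - r)
               for x in range(r, max_games + 1))

print(p_win_series(0.52))                  # ~0.54368 (part a)
print(p_win_series(0.52, already_won=1))   # ~0.6930  (0.6929 in the text)
```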
2.2.6
Poisson RVs
We will provide a brief summary of the important facts for a Poisson rv. First, here is the definition.

Definition 2.32 A Poisson rv with parameter λ is any rv X with pmf

p_X(x) = P(X = x) = λ^x e^{−λ}/x!, x = 0, 1, ...

We write X ∼ Pois(λ). Where does this come from? It turns out that if X ∼ Binom(n, p) and n > 20 and either np < 5 or n(1 − p) < 5, then the exact binomial probability p_X(x) = C(n, x) p^x (1 − p)^{n−x} can be approximated as p_X(x) ≈ λ^x e^{−λ}/x!, where λ = np.²
² This approximation was obtained by the French mathematician Siméon-Denis Poisson (1781–1840).
Example 2.33 If approximately 1.2% of books made at a certain factory have defective bindings, what is the probability that five books in a production run of 400 books have defective bindings? Let X count the number of books having defective bindings out of the 400 books. Clearly, X ∼ Binom(400, 0.012). The exact probability is p_X(5) = C(400, 5)(0.012)^5(0.988)^{395} ≈ 0.17584. With λ = 0.012 · 400 = 4.8, the Poisson approximation gives p_X(5) ≈ (4.8)^5 e^{−4.8}/5! ≈ 0.17475.
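The comparison in Example 2.33 is easy to reproduce numerically (our sketch):

```python
from math import comb, exp, factorial

n, p, x = 400, 0.012, 5
exact = comb(n, x) * p**x * (1 - p)**(n - x)  # binomial pmf
lam = n * p                                   # 4.8
approx = lam**x * exp(-lam) / factorial(x)    # Poisson pmf
print(exact, approx)  # ~0.1758 and ~0.1748
```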
From elementary calculus we know that

Σ_{x=0}^{∞} λ^x e^{−λ}/x! = e^{−λ} Σ_{x=0}^{∞} λ^x/x! = e^{−λ} e^{λ} = 1.

Therefore, the function p(x) = λ^x e^{−λ}/x! is a distribution, and so must be the pmf of a random variable on S = {0, 1, 2, ...}. A Poisson rv counts the number of random events occurring at a rate of λ events per unit time.

Example 2.34 On a certain summer evening, asteroids which randomly enter the atmosphere are observed at a rate of one every 12 minutes. What is the probability that no asteroids are observed in the next minute? Let X be the number of asteroids observed in one minute. Asteroids are observed at a rate of λ = 1/12 asteroids/minute. We get

P(X = 0) = (1/12)^0 e^{−1/12}/0! = e^{−1/12} ≈ 0.92.

2.2.7

Hypergeometric RVs
The hypergeometric rv is used mainly for sampling without replacement, with sample size n, from a population of finite size N that contains two distinct classes of objects, one of size k and the other of size N − k.

Definition 2.35 A hypergeometric rv with parameters N, k, and n is any rv X with pmf

p_X(x) = P(X = x) = C(k, x) C(N − k, n − x)/C(N, n), x = 0, 1, ..., n.
We write X ∼ HypGeo(N, k, n).

Example 2.36 In a certain lake, there are 10,000 trout, 500 of which are tagged for future scientific study. If 10 trout are captured at random, what is the probability that at least two will be tagged? Let X be the number of tagged trout in the sample of size n = 10. In this example, the classes of objects are the tagged and untagged trout. The population size is 10,000 trout, of which 500 are tagged. The total number of samples of size 10 is C(10000, 10). The number of samples containing no tagged trout is C(500, 0) C(9500, 10), and the number containing exactly one tagged trout is C(500, 1) C(9500, 9). We get
Fig. 2.2 P(X = k): Hypergeometric if enough trials
P(X ≥ 2) = 1 − P(X = 0) − P(X = 1) = 1 − C(500, 0) C(9500, 10)/C(10000, 10) − C(500, 1) C(9500, 9)/C(10000, 10) ≈ 0.086056.
Remark 2.37 In our example, N = 10,000, k = 500, and n = 10. The sampling is done without replacement. Once a fish is caught, it is not returned to the lake. If each fish were returned after being caught, then X ∼ Binom(10, 0.05) and P(X ≥ 2) = 1 − C(10, 0)(0.05)^0(0.95)^{10} − C(10, 1)(0.05)^1(0.95)^9 = 0.086138, close but not the same.

Example 2.38 Suppose we have 10 patients, 7 of whom have a genetic marker for lung cancer and 3 of whom do not. We will choose 6 at random (without replacing them as we make our selection). What are the chances we get exactly 4 patients with the marker and 2 without? If X is the number of chosen patients having the genetic marker, then X ∼ HypGeo(10, 7, 6). View this as drawing 6 people at random from a group of 10 without replacement, with the probability of success (= genetic marker) changing from draw to draw. The trials are not Bernoulli. We have

P(X = 4) = C(7, 4) C(3, 2)/C(10, 6) = 1/2.
If we incorrectly assumed this was X ∼ Binom(6, 0.7), we would get P(X = 4) = C(6, 4)(0.7)^4(0.3)^2 = 0.324.
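Both numbers in Example 2.38 follow directly from math.comb (our sketch):

```python
from math import comb

# Hypergeometric: 4 carriers among 6 drawn from 7 carriers + 3 non-carriers.
hyper = comb(7, 4) * comb(3, 2) / comb(10, 6)
print(hyper)  # 0.5

# The (incorrect) binomial model for the same event.
binom = comb(6, 4) * 0.7**4 * 0.3**2
print(binom)  # ~0.324
```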
As a further example, suppose we have a group of 100 patients with 40 patients possessing the genetic marker (= success). We draw 50 patients at random without replacement and ask for the P(X = k), k = 10, 11, . . . , 50. Here’s the figure of that distribution (see Fig. 2.2). The fact that the figure for the hypergeometric distribution looks like a bell curve is not a coincidence, as we will see, when the population is large.
2.2.8
Multinomial RVs
Consider an experiment that has k outcomes, say A1, A2, ..., Ak, each with probability p_i, 1 ≤ i ≤ k, and p1 + p2 + · · · + pk = 1. (If k = 2, we can think of one as a success and the other as a failure, as in a Bernoulli trial.) Let X_i be the number of occurrences of outcome A_i when n independent trials of the experiment are performed. For example, if we toss a 6-sided die n times, then X_i is the number of times that the number i comes up. If the die is fair, then p1 = p2 = · · · = p6 = 1/6. Let N be the number of ways to arrange n objects, x1 of type A1, x2 of type A2, ..., and xk of type Ak. Fix one of these N arrangements, and call it τ. Within τ, we can arrange the objects of type A_i in x_i! ways without changing τ. For example, in the fixed arrangement τ = 1, 2, 2, 3, 2, 4, 3, 4, 2, we can arrange the 2's in 4! ways within the arrangement. Therefore, each fixed arrangement τ accounts for x1! x2! · · · xk! of the n! orderings of the n objects, and so n! = N · (x1! x2! · · · xk!), giving N = n!/(x1! x2! · · · xk!). We use the notation C(n; x1, x2, ..., xk), called a multinomial coefficient, to denote n!/(x1! x2! · · · xk!). Since the trials are independent, the probability of obtaining τ is p1^{x1} p2^{x2} · · · pk^{xk}, and since there are N such arrangements,

P(X1 = x1, X2 = x2, ..., Xk = xk) = [n!/(x1! x2! · · · xk!)] p1^{x1} p2^{x2} · · · pk^{xk}.
Example 2.39 Bob and Alice play chess. Bob wins 40% of the time, and Alice wins 35% of the time. Bob and Alice draw 25% of the time. Suppose Bob and Alice play a sequence of 20 games. Let B be the number of times Bob wins, A be the number of times Alice wins, and D be the number of times they draw. What is the probability that Bob wins 15 games, Alice wins only 3 games, and they draw 2 games? We have P(B = 15, A = 3, D = 2) =
[20!/(15! 3! 2!)] (0.40)^{15} (0.35)^3 (0.25)^2 = 0.00044610.
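The multinomial pmf is short to code with the factorial function; a sketch (the helper name is ours):

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """P(X1 = x1, ..., Xk = xk) for n = sum(counts) independent trials."""
    coef = factorial(sum(counts))
    for x in counts:
        coef //= factorial(x)
    prob = 1.0
    for x, p in zip(counts, probs):
        prob *= p**x
    return coef * prob

# Example 2.39: Bob 15 wins, Alice 3, draws 2 in 20 games.
print(multinomial_pmf([15, 3, 2], [0.40, 0.35, 0.25]))  # ~0.000446
```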
Definition 2.40 A multinomial rv with parameters n, p1, p2, ..., pk is any random vector X = (X1, X2, ..., Xk) with pmf

p_X(x1, x2, ..., xk) = P(X1 = x1, X2 = x2, ..., Xk = xk) = [n!/(x1! x2! · · · xk!)] p1^{x1} p2^{x2} · · · pk^{xk},

where x1 + x2 + · · · + xk = n. We write X ∼ Multin(n, p1, p2, ..., pk).
Example 2.41 Suppose 25 registered voters are chosen at random from a population in which we know that 55% are Democrats, 40% are Republicans, and 5% are Independents. In our sample of 25 voters, what is the chance that we get 10 Democrats, 10 Republicans, and 5 Independents? Let D, R, and I denote the number of Democrats, Republicans, and Independents in the sample. The vector X = (D, R, I) ∼ Multin(25, 0.55, 0.40, 0.05), so

P(D = 10, R = 10, I = 5) = C(25; 10, 10, 5) (0.55)^{10} (0.40)^{10} (0.05)^5 = 0.000814,

which is very small since we are asking for the probability of the exact vector (10, 10, 5). It is much more tedious to calculate, but we can also compute quantities like P(D ≤ 15, R ≤ 12, I ≤ 20) = 0.6038. Notice that we do not require that 15 + 12 + 20 equal the number of trials.

Example 2.42 Consider a continuous rv X with pdf

f_X(x) = { 60x^2 (1 − x)^3,  0 ≤ x ≤ 1
           0,                otherwise

Suppose 20 independent samples are drawn from X. An outcome is a sample value falling in the range [0, 1/5] when i = 1 or ((i − 1)/5, i/5] for i = 2, 3, 4, 5. What is the probability that 3 observations fall into the first range, 9 fall into the second range, 4 fall into each of the third and fourth ranges, and no observations fall into the last range? Let X = (R1, R2, R3, R4, R5). We compute p1 as

p1 = ∫_0^{0.2} 60x^2 (1 − x)^3 dx = 0.098.

All the probabilities are displayed in the table below.

Range        [0, 0.2]   (0.2, 0.4]   (0.4, 0.6]   (0.6, 0.8]   (0.8, 1.0]
Probability  0.098      0.356        0.365        0.162        0.019

Then X ∼ Multin(20, 0.098, 0.356, 0.365, 0.162, 0.019), and

P(R1 = 3, R2 = 9, R3 = 4, R4 = 4, R5 = 0) = C(20; 3, 9, 4, 4, 0) (0.098)^3 (0.356)^9 (0.365)^4 (0.162)^4 = 0.00205.
2.2.9
Simulating Discrete RVs Using a Box Model
Suppose we have a box containing slips of paper, called tickets, each one with a 0 or a 1 written on it. Such a box, called a population box, is illustrated below.
0 1 1 0 0 0 1 0 1 0

If a ticket is randomly selected from the box, the probability of drawing a 1 is exactly 2/5. Now consider drawing n = 20 tickets from the box. After each draw, the ticket is returned to the box; that is, the tickets are drawn with replacement. After the 20 draws, the number of 1's obtained is recorded. Since each draw is done independently of any other draw, each draw is a Bernoulli trial. If X denotes the number of 1's obtained when 20 draws are made, then clearly X ∼ Binom(20, 2/5). Values of the variable X (commonly called the population random variable) obtained in this way are called random variates, or simply variates. In general, to obtain variates of X ∼ Binom(n, p), the population box is filled with tickets such that the proportion of 1's is p. Drawing n tickets from the box with replacement and counting the number of 1's yields a binomial variate. Now suppose that m binomial variates from X ∼ Binom(n, p) are obtained by independently repeating the above procedure m times. This sequence of m variates is a random sample of size m from the population rv X. Each variate is an integer in the range 0, ..., n. Let X_i represent the draw on the ith repetition. (Clearly, X_i is an rv, and X_i ∼ Binom(n, p).) The sequence X1, X2, ..., Xm, also called a random sample of size m, is a sequence of independent rvs, each with exactly the same distribution. If we let X_i = x_i denote the ith variate that is obtained, then the sequence x1, x2, ..., xm is a single instance of values for X1, X2, ..., Xm. Therefore, X1, X2, ..., Xm can be viewed as a 'theoretical' data set drawn from the rv X. It is from random samples that many statistical procedures are constructed. Variates for most discrete rvs can be obtained from population boxes. Let N denote the number of tickets in the box, and let N1 be the number of tickets with 1 written on them.
Therefore, the number of tickets with 0 written on them is N0 = N − N1. Finally, suppose that p = N1/N.

• X ∼ Bern(p). Draw a single ticket from the box. The variate is the number that is drawn. (A draw of 1 represents a success, and a 0 represents a failure.)
• X ∼ Binom(n, p). Draw n tickets with replacement from the box. The number of 1's is the variate.
• X ∼ Geom(p). Draw tickets from the box with replacement until a 1 is obtained. The number of draws is the variate.
• X ∼ NegBin(r, p). Draw tickets from the box with replacement until the rth 1 is obtained. The number of draws is the variate.
• X ∼ HypGeo(N, N1, n). Draw n tickets from the box without replacement. The number of 1's after the n draws is the variate.
• X ∼ Multin(n, p1, p2, ..., pk). The tickets have numbers written on them from 1 to k. If Ni is the number of tickets with i written on them, then pi = Ni/N. If n draws are made with replacement, then (n1, n2, ..., nk) is a variate, where ni is the number of tickets drawn with i written on them.
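The box model translates almost literally into code with the random module. A sketch of ours, using the ticket counts from the box pictured above (seed and sample size are arbitrary):

```python
import random

random.seed(0)
box = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0]  # population box: proportion of 1's is p = 2/5

def binom_variate(n):
    """Draw n tickets with replacement; the number of 1's is the variate."""
    return sum(random.choice(box) for _ in range(n))

def geom_variate():
    """Draw with replacement until a 1 appears; the number of draws is the variate."""
    k = 1
    while random.choice(box) != 1:
        k += 1
    return k

sample = [binom_variate(20) for _ in range(10_000)]
print(sum(sample) / len(sample))  # near the Binom(20, 2/5) mean of 8
```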
2 Random Variables
The Poisson rv cannot be simulated exactly with a population box. An approximate procedure can be devised, but it depends on the Poisson approximation to the binomial rv, which we will not discuss here. In the case of continuous rvs, population boxes cannot be used to obtain variates. Other methods must be used that are beyond the scope of the text. Nevertheless, the concept of a random sample is still valid. More specifically, the sequence X_1, X_2, ..., X_m of continuous, independent rvs with the same distribution as a continuous rv X is a random sample from X.
2.3
Important Continuous Distributions
There are many continuous rvs of fundamental importance in statistics and a wide variety of scientific and engineering disciplines. They include the uniform, exponential, normal, gamma, chi squared, Student t, and the Fisher–Snedecor F. The uniform, exponential, and normal will be discussed in this section. The chi squared, t, and F will be delayed until Sect. 2.9.
2.3.1
Uniform RVs
Suppose a real number is chosen randomly from the interval (a, b). We would expect that given two nonoverlapping subintervals of (a, b) of the same width, the probability that the number is chosen from either one is the same. This observation suggests that the pdf would have to be a constant function on the interval (a, b), but since the area under the pdf must equal 1, the pdf would have to have the form f(x) = 1/(b − a) for a < x < b, and 0 otherwise.

Definition 2.43 A uniform rv on the open interval (a, b) is any rv X with pdf

f_X(x) = 1/(b − a) for a < x < b, and 0 elsewhere.

We write X ∼ Unif(a, b).
2.3.2
Exponential RVs

An exponential rv with parameter λ > 0, written X ∼ Exp(λ), has pdf f_X(x) = λe^{−λx} for x > 0 (and 0 otherwise), so its cdf is F_X(x) = 1 − e^{−λx}, x ≥ 0. A key feature of the exponential rv is the memoryless property: for s, t > 0,

P(X > s + t | X > t) = P(X > s + t, X > t)/P(X > t) = P(X > s + t)/P(X > t)
= (1 − F_X(s + t))/(1 − F_X(t)) = (1 − (1 − e^{−λ(s+t)}))/(1 − (1 − e^{−λt}))
= e^{−λ(s+t)}/e^{−λt} = e^{−λs} = 1 − (1 − e^{−λs}) = 1 − F_X(s) = P(X > s).
Example 2.51 The lifetime X of a certain cell phone brand is an exponential random variable with a mean of 10 years. Therefore, X ∼ Exp(1/10). If a statistics professor's cell phone has already functioned for 10 years, what is the probability that it will keep functioning for the next 10 years? By the memoryless property, the probability is the same as the probability that a new cell phone will last at least 10 years, which is

P(X > 10) = 1 − P(X ≤ 10) = 1 − (1 − e^{−(1/10)·10}) = e^{−1} ≈ 0.36788.
Remark 2.52 The discrete analog of the exponential rv having the memoryless property is the geometric rv.
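The memoryless property is easy to see in simulation. The sketch below is ours (not from the text): it draws a large number of Exp(1/10) lifetimes and compares the conditional probability P(X > 20 | X > 10) with P(X > 10) = e^{−1} ≈ 0.36788.

```python
import math
import random

random.seed(1)
lam = 1 / 10  # rate, so the mean lifetime is 1/lam = 10 years
lifetimes = [random.expovariate(lam) for _ in range(200_000)]

# Memoryless property: among phones that survived 10 years, the fraction
# surviving 10 more should match the unconditional P(X > 10) = e^-1.
survivors = [x for x in lifetimes if x > 10]
p_cond = sum(1 for x in survivors if x > 20) / len(survivors)
p_new = len(survivors) / len(lifetimes)
print(round(p_cond, 3), round(p_new, 3), round(math.exp(-1), 3))
```

All three printed values agree to within simulation error, which is the memoryless property in action.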
2.3.3
Normal RVs
Normal rvs are arguably the most important class of continuous rvs for a variety of reasons, a number of which will be discussed in this text. As noted in the last section, the binomial rv can be approximated by the Poisson rv under certain conditions on n and p. Another French mathematician, Pierre-Simon Laplace (1749–1827), discovered a completely different approximation to the binomial rv. He showed that if X ∼ Binom(n, p), then

P((X − np)/√(np(1 − p)) ≤ x) → (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt as n → ∞.
It turns out that the function Φ(x) = (1/√(2π)) ∫_{−∞}^{x} e^{−t²/2} dt has all the properties of the cdf of a continuous rv, namely (i) Φ(x) is continuous and increasing, (ii) Φ(x) → 0 as x → −∞, and (iii) Φ(x) → 1 as x → ∞. Proving (iii) is nontrivial and is omitted. The derivative of Φ(x) must then be the pdf of a continuous rv, namely,

Φ′(x) = f(x) = (1/√(2π)) e^{−x²/2}, −∞ < x < ∞.
The function f (x) above can be generalized by adding parameters to make it more useful for modeling data.
Definition 2.53 A normal rv with parameters μ and σ is any rv X with pdf

f_X(x) = (1/(√(2π) σ)) exp(−(1/2)((x − μ)/σ)²), −∞ < x < ∞.

We write X ∼ N(μ, σ). If μ = 0 and σ = 1, then X is called a standard normal rv and is often denoted Z.

Remark 2.54 The cdf Φ(x) of Z can be computed on a TI-8x calculator using the command normalcdf(−999, x, 0, 1) (see Sect. 2.10). The number −999 represents −∞. Any large negative number (in absolute value) can be used, and the last two parameters 0 and 1 can be omitted.

The pdf of X ∼ N(3, 1.5) is depicted below. Normal rvs are typically used to model data that clusters around μ but becomes less concentrated as the distance from μ grows. The line of symmetry of the pdf is at x = μ, the point at which the graph achieves a maximum. The inflection points occur at x = μ ± σ. Both of these facts can be easily checked using basic calculus.
It turns out that every normal rv can be converted to a standard normal rv in a process called standardization, which is very important for computational purposes. We will demonstrate this by proving the following more general result.

Theorem 2.55 Let X be any continuous rv with pdf f_X(x). Let Y = αX + β, α ≠ 0. Then

f_Y(y) = (1/|α|) f_X((y − β)/α).
Proof The cdf of Y is

F_Y(y) = P(Y ≤ y) = P(αX + β ≤ y) = P(X ≤ (y − β)/α) = F_X((y − β)/α) if α > 0, and
F_Y(y) = 1 − P(X ≤ (y − β)/α) = 1 − F_X((y − β)/α) if α < 0.

Therefore,

f_Y(y) = F_Y′(y) = (1/α) f_X((y − β)/α) if α > 0, and −(1/α) f_X((y − β)/α) if α < 0.

In either case, f_Y(y) = (1/|α|) f_X((y − β)/α).

2.5
Moment-Generating Functions

The moment-generating function (mgf) of an rv X is M_X(t) = E(e^{tX}), defined for the values of t at which the expectation is finite. Notice that M_X(0) = 1. One of the principal reasons mgfs are so valuable is that they can be used to generate the so-called moments of an rv, which are simply E(X^n), n = 1, 2, ....

Theorem 2.81 If an rv X has mgf M_X(t), then

E(X^n) = (d^n/dt^n) M_X(t) |_{t=0}.
Proof We prove the theorem in the continuous case. Assuming that we can interchange the derivative and the integral, we obtain

(d^n/dt^n) M_X(t) = (d^n/dt^n) ∫_{−∞}^{∞} e^{tx} f_X(x) dx = ∫_{−∞}^{∞} (d^n/dt^n) e^{tx} f_X(x) dx = ∫_{−∞}^{∞} x^n e^{tx} f_X(x) dx.

Therefore, (d^n/dt^n) M_X(t) |_{t=0} = ∫_{−∞}^{∞} x^n f_X(x) dx = E(X^n).
We now give a few examples of computing the mgfs of important rvs, and we will verify some of the formulas for the means and variances of these rvs given in the table in the preceding section.

Example 2.82 Suppose X ∼ Binom(n, p). Then

M_X(t) = Σ_{x=0}^{n} e^{xt} C(n, x) p^x (1 − p)^{n−x} = Σ_{x=0}^{n} C(n, x) (pe^t)^x (1 − p)^{n−x} = (pe^t + (1 − p))^n.

The last equality follows from the Binomial Theorem.⁴ Therefore, M_X′(t) = n(pe^t + (1 − p))^{n−1}(pe^t), and so M_X′(0) = E(X) = np. Taking the second derivative, we get

M_X″(t) = npe^t (pe^t − p + 1)^{n−2} (npe^t − p + 1) ⟹ M_X″(0) = np(np − p + 1).

Finally, Var(X) = E(X²) − E(X)² = np(np − p + 1) − n²p² = np(1 − p).

Example 2.83 Let X ∼ Pois(λ). Then

M_X(t) = Σ_{x=0}^{∞} e^{xt} λ^x e^{−λ}/x! = e^{−λ} Σ_{x=0}^{∞} (λe^t)^x/x! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)}.

Therefore, M_X′(t) = λ e^{λ(e^t − 1)} e^t. We now get E(X) = M_X′(0) = λ. Continuing, M_X″(t) = λe^t e^{λ(e^t − 1)} (λe^t + 1), and so M_X″(0) = λ(λ + 1). Therefore, Var(X) = E(X²) − E(X)² = λ(λ + 1) − λ² = λ. The expectation and variance of X are identical.
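Theorem 2.81 and the two worked examples above can be checked symbolically. The sketch below is ours; it uses the third-party sympy library, and the helper name moment is our shorthand for the differentiate-and-evaluate step of the theorem.

```python
import sympy as sp

t = sp.symbols('t')
n, p, lam = sp.symbols('n p lambda', positive=True)

def moment(mgf, k):
    """E(X^k) = d^k/dt^k M_X(t) evaluated at t = 0 (Theorem 2.81)."""
    return sp.simplify(sp.diff(mgf, t, k).subs(t, 0))

# Binomial: M_X(t) = (p*e^t + 1 - p)^n
M_binom = (p * sp.exp(t) + 1 - p) ** n
EX, EX2 = moment(M_binom, 1), moment(M_binom, 2)
assert sp.simplify(EX - n * p) == 0                       # E(X) = np
assert sp.simplify(EX2 - EX**2 - n * p * (1 - p)) == 0    # Var(X) = np(1-p)

# Poisson: M_X(t) = exp(lambda*(e^t - 1))
M_pois = sp.exp(lam * (sp.exp(t) - 1))
assert sp.simplify(moment(M_pois, 1) - lam) == 0                          # E(X) = lambda
assert sp.simplify(moment(M_pois, 2) - moment(M_pois, 1)**2 - lam) == 0   # Var(X) = lambda
print("mgf moment checks passed")
```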
Example 2.84 Suppose X ∼ Exp(λ). Then

M_X(t) = ∫_0^∞ e^{tx} λe^{−λx} dx = λ ∫_0^∞ e^{−(λ−t)x} dx = −λ e^{−(λ−t)x}/(λ − t) |_0^∞ = λ/(λ − t) = λ(λ − t)^{−1}, t < λ.

Notice that M_X(t) is only defined for t < λ since the integral only converges in this range. Taking derivatives, we get M_X′(t) = λ(λ − t)^{−2}, and so M_X′(0) = E(X) = 1/λ. Continuing, M_X″(t) = 2λ(λ − t)^{−3}. Therefore, E(X²) = M_X″(0) = 2/λ². We then obtain Var(X) = E(X²) − E(X)² = 2/λ² − 1/λ² = 1/λ².

Example 2.85 Suppose X ∼ N(0, 1). Then

M_X(t) = ∫_{−∞}^{∞} e^{tx} (1/√(2π)) e^{−x²/2} dx = (1/√(2π)) ∫_{−∞}^{∞} e^{tx − x²/2} dx
= e^{t²/2} (1/√(2π)) ∫_{−∞}^{∞} e^{−(x−t)²/2} dx = e^{t²/2}

(make the substitution u = x − t in the last integral).
⁴ The Binomial Theorem says that Σ_{x=0}^{n} C(n, x) a^x b^{n−x} = (a + b)^n.
If X ∼ N(μ, σ), we know Z = (X − μ)/σ, and therefore X = σZ + μ. Then

M_X(t) = E(e^{t(σZ + μ)}) = E(e^{tσZ} e^{μt}) = e^{μt} E(e^{(tσ)Z}) = e^{μt} M_Z(tσ) = e^{μt} e^{σ²t²/2} = e^{μt + σ²t²/2}.
We have already derived the expectation and variance of the normal rv in the previous section. We close this section with an important result that will be used later. The proof of the following theorem is very technical and is omitted. The first part states that if two rvs have the same mgfs, they have the same distribution. The second part states that if the mgfs of a sequence of rvs converge to the mgf of an rv, then the cdfs of the sequence of rvs must also converge to the cdf of the limit rv.

Theorem 2.86 (i) If X and Y are rvs such that M_X(t) = M_Y(t) for t ∈ (−δ, δ), then X and Y have the same cdfs. (ii) If X_k, k = 1, 2, ..., is a sequence of rvs with mgfs M_{X_k}(t), k = 1, 2, ..., and lim_{k→∞} M_{X_k}(t) = M_X(t) for some rv X, then lim_{k→∞} F_{X_k}(x) = F_X(x) at each point of continuity x of F_X(x).
Example 2.87 A certain continuous rv has pdf f_X(x) = (1/√(2π))(1/√x) e^{−x/2} for x > 0, and 0 otherwise. We first compute M_X(t) as

M_X(t) = ∫_0^∞ e^{tx} (1/√(2π))(1/√x) e^{−x/2} dx.

Making the substitution u = √x, we get

M_X(t) = (2/√(2π)) ∫_0^∞ e^{u²(t − 1/2)} du.

Making another substitution z = u√(1 − 2t), t < 1/2, we obtain

M_X(t) = (2/√(2π)) (1/√(1 − 2t)) ∫_0^∞ e^{−z²/2} dz = (2/√(2π)) (1/√(1 − 2t)) (√(2π)/2) = (1 − 2t)^{−1/2}.

Therefore, M_X(t) = (1 − 2t)^{−1/2}, and so M_X′(t) = (1 − 2t)^{−3/2} and E(X) = M_X′(0) = 1. Also, M_X″(t) = 3(1 − 2t)^{−5/2}, and so E(X²) = M_X″(0) = 3. Thus, Var(X) = 3 − 1 = 2. Now consider the rv Y = Z². We compute the mgf of Y as
M_Y(t) = E(e^{tZ²}) = ∫_{−∞}^{∞} e^{tz²} (1/√(2π)) e^{−z²/2} dz = (2/√(2π)) ∫_0^∞ e^{z²(t − 1/2)} dz.

This is the same integral generated in the computation of the mgf of X after making the substitution u = √x. By the theorem, the cdf of Y = Z² is the same as that of X. We can check this!

F_Y(y) = P(Y ≤ y) = P(Z² ≤ y) = P(−√y ≤ Z ≤ √y) = (1/√(2π)) ∫_{−√y}^{√y} e^{−x²/2} dx = (2/√(2π)) ∫_0^{√y} e^{−x²/2} dx.

Differentiating, we get that

F_Y′(y) = f_Y(y) = (2/√(2π)) e^{−y/2} (1/(2√y)) = (1/√(2π))(1/√y) e^{−y/2} for y > 0, and 0 elsewhere.
The two pdfs (and therefore their cdfs) are exactly the same, as guaranteed by the theorem. The rv X (or equivalently Z²) is a member of an important family of continuous rvs called chi squared distributions that will be described in a later section. The table below lists the mgfs of all the rvs studied so far. (There is no closed-form expression for the mgf of the hypergeometric rv.)

X                            M_X(t)
Unif(n)                      (e^t − e^{(n+1)t}) / (n(1 − e^t))
Bern(p)                      (1 − p) + pe^t
Binom(n, p)                  (pe^t + (1 − p))^n
Geom(p)                      pe^t / (1 − (1 − p)e^t),  t < −ln(1 − p)
NegBin(r, p)                 (pe^t / (1 − (1 − p)e^t))^r,  t < −ln(1 − p)
Pois(λ)                      e^{λ(e^t − 1)}
HypGeo(N, k, n)              no closed form exists
Multin(n, p_1, ..., p_k)     (Σ_{i=1}^{k} p_i e^{t_i})^n
Unif(a, b)                   (e^{bt} − e^{at}) / (t(b − a)) for t ≠ 0; 1 for t = 0
Exp(λ)                       λ/(λ − t),  t < λ
N(μ, σ)                      e^{μt + σ²t²/2}

2.6
Joint Distributions
In applications of probability and statistics, we are often confronted with problems that involve more than one rv which may interact with one another or not affect each other at all.
We will restrict our discussion to two rvs in this section. Suppose X and Y are two rvs on a sample space S of an experiment. To study X and Y together as a unit, we define a joint pmf or joint pdf for X and Y .
2.6.1
Two Discrete RVs
Definition 2.88 If X and Y are two discrete rvs on the same sample space S, the joint probability mass function (jpmf) p_{X,Y} : ℝ² → ℝ of X and Y is defined by p_{X,Y}(x, y) = P({X = x} ∩ {Y = y}) (abbreviated P(X = x, Y = y)).

Remark 2.89 Notice that p_{X,Y}(x, y) = 0 for x ∉ X(S) or y ∉ Y(S). Also,

Σ_{all x} Σ_{all y} p_{X,Y}(x, y) = 1.
Example 2.90 Three balls are randomly selected from an urn containing three red, four white, and five blue balls. Let R and W denote the number of red and white balls chosen, respectively. The jpmf of R and W is given by p_{R,W}(i, j) = P(R = i, W = j), where i = 0, 1, 2, 3 and j = 0, 1, 2, 3. For example,

p_{R,W}(1, 2) = C(3,1)C(4,2)/C(12,3) = 18/220 and p_{R,W}(2, 1) = C(3,2)C(4,1)/C(12,3) = 12/220.

All the probabilities p_{R,W}(i, j) are summarized in the following table.

i \ j     0        1         2        3       p_R(i)
0         10/220   40/220    30/220   4/220   84/220
1         30/220   60/220    18/220   0       108/220
2         15/220   12/220    0        0       27/220
3         1/220    0         0        0       1/220
p_W(j)    56/220   112/220   48/220   4/220   1

The row and column sums are given in the margins of the table, and are often referred to as the marginal pmfs of R and W, respectively. They are pmfs since Σ_{i=0}^{3} p_R(i) = 1 and Σ_{j=0}^{3} p_W(j) = 1. Finally, note that we can construct the corresponding tables for p_{R,B}(i, k) and p_{W,B}(j, k).

As can be seen in the above example, the pmfs of the two discrete rvs can be easily recovered from the jpmf by 'summing out' the unwanted variable.
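The whole table of Example 2.90 can be generated mechanically. The sketch below is ours (the helper name p_RW is an assumption, not the text's notation); it uses exact fractions so the entries match the table.

```python
from fractions import Fraction
from math import comb

# Joint pmf of R (red) and W (white) when 3 balls are drawn without
# replacement from an urn with 3 red, 4 white, and 5 blue balls.
def p_RW(i, j):
    if i + j > 3:
        return Fraction(0)
    return Fraction(comb(3, i) * comb(4, j) * comb(5, 3 - i - j), comb(12, 3))

# Marginals by 'summing out' the unwanted variable (Theorem 2.91).
p_R = [sum(p_RW(i, j) for j in range(4)) for i in range(4)]
p_W = [sum(p_RW(i, j) for i in range(4)) for j in range(4)]

print(p_RW(1, 2))          # 18/220 in lowest terms
print(p_R)                 # marginal pmf of R
print(sum(p_R), sum(p_W))  # both sum to 1
```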
Theorem 2.91 If X and Y are discrete rvs, then the marginals must satisfy

p_X(x) = Σ_{all y} p_{X,Y}(x, y) and p_Y(y) = Σ_{all x} p_{X,Y}(x, y).

Proof We prove the theorem for p_X(x):

p_X(x) = P(X = x) = Σ_{all y} P(X = x, Y = y) = Σ_{all y} p_{X,Y}(x, y).
Example 2.92 Roll a fair die until a six is obtained. Let Y be the number of rolls needed (including the six at the end). Let X be the number of ones obtained before the first six. Note that P(X = x | Y = y) = C(y − 1, x) (1/5)^x (4/5)^{y−x−1}. Therefore,

p_{X,Y}(x, y) = P(X = x, Y = y) = P(X = x | Y = y) P(Y = y)
= C(y − 1, x) (1/5)^x (4/5)^{y−x−1} (5/6)^{y−1} (1/6) for x < y, and 0 for x ≥ y.

Clearly, Y ∼ Geom(1/6). (Why?) What about X? It turns out that p_X(x) = (1/2)^{x+1}, x = 0, 1, ....

2.6.2
Two Continuous RVs
Recall that if we have two discrete rvs X and Y, we can define the jpmf p_{X,Y}(x, y) of X and Y. We would like to do the same for two continuous rvs. Since continuous rvs are defined in terms of their cdfs, we need to discuss joint cdfs first.

Definition 2.93 Suppose that X and Y are two rvs, both discrete or both continuous. The joint cumulative distribution function (jcdf) of X and Y is the function F_{X,Y} : ℝ² → ℝ defined for each (x, y) ∈ ℝ² by F_{X,Y}(x, y) = P(X ≤ x, Y ≤ y).

If X and Y are both discrete, then F_{X,Y}(x, y) can be easily computed as

F_{X,Y}(x, y) = Σ_{u≤x} Σ_{v≤y} p_{X,Y}(u, v).

In Example 2.90 of the previous section, F_{X,Y}(1, 2) = Σ_{i≤1} Σ_{j≤2} p_{X,Y}(i, j) = 188/220. The cdfs F_X(x) and F_Y(y) can be recovered from the jcdf F_{X,Y}(x, y) as

F_X(x) = lim_{y→∞} F_{X,Y}(x, y) and F_Y(y) = lim_{x→∞} F_{X,Y}(x, y).

The cdfs F_X(x) and F_Y(y) are called the marginal cdfs of X and Y.
We now define what it means for two continuous random variables to be jointly continuous.

Definition 2.94 Two continuous rvs X and Y are called jointly continuous if F_{X,Y}(x, y) is continuous.

Remark 2.95 It turns out that if X and Y are jointly continuous rvs, then both X and Y are continuous since it can be shown (proof omitted) that the marginal cdfs F_X(x) and F_Y(y) are both continuous. As with a single continuous rv, jointly continuous rvs again come in two varieties, absolutely continuous and singularly continuous.

Definition 2.96 Two jointly continuous rvs X and Y are called jointly absolutely continuous if there exists a nonnegative function f_{X,Y} : ℝ² → ℝ, called the joint probability density function (jpdf) of X and Y, such that

F_{X,Y}(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f_{X,Y}(s, t) dt ds.

Going forward, jointly continuous rvs will automatically mean jointly absolutely continuous since jointly singularly continuous rvs are of no interest in this text. As in the single rv case, the jpdf is given by differentiation. Assuming the partial derivatives of F_{X,Y}(x, y) exist,

f_{X,Y}(x, y) = ∂²F_{X,Y}(x, y)/∂x∂y.

In the discrete case, the pmfs of X and Y can be recovered by 'summing out' the unwanted variable. In the continuous case, the pdfs of X and Y can be recovered by 'integrating out' the unwanted variable. In particular, the marginal pdfs are given by

f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy and f_Y(y) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dx.
The following example sets up a couple of computations involving the jpdf of two rvs. The calculations of the integrals involved are routine and omitted.

Example 2.97 The jpdf of X and Y is given by

f_{X,Y}(x, y) = 2e^{−x} e^{−2y} for x, y > 0, and 0 otherwise.

Compute (a) P(X > 1, Y < 1), (b) P(X < Y), (c) the pdf of the random variable Z = X + Y, and (d) the marginal pdfs of X and Y and their expectations.
Fig. 2.4 The region X < Y (the region above the line y = x in the first quadrant)
(a) Setting up the double integration (both ways: dx dy and then dy dx), we get

P(X > 1, Y < 1) = ∫_0^1 ∫_1^∞ 2e^{−x} e^{−2y} dx dy = ∫_1^∞ ∫_0^1 2e^{−x} e^{−2y} dy dx = e^{−1} − e^{−3} ≈ 0.31809.

(b) We now compute P(X < Y). The region determined by the inequality X < Y is depicted in Fig. 2.4.

P(X < Y) = ∫_0^∞ ∫_0^y 2e^{−x} e^{−2y} dx dy = ∫_0^∞ ∫_x^∞ 2e^{−x} e^{−2y} dy dx = 1/3.

(c) To compute the pdf of Z = X + Y, we first find the cdf of Z and then differentiate. Consider the region described by X + Y ≤ z (see Fig. 2.5).

F_Z(z) = P(Z ≤ z) = P(X + Y ≤ z) = ∫_0^z ∫_0^{z−y} 2e^{−x} e^{−2y} dx dy = ∫_0^z ∫_0^{z−x} 2e^{−x} e^{−2y} dy dx = (e^{−z} − 1)².

Therefore, F_Z(z) = 0 for z < 0 and F_Z(z) = (e^{−z} − 1)² for z ≥ 0. Now differentiate to get the pdf of Z.
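Parts (a)–(c) of Example 2.97 can be verified with a computer algebra system. The sketch below is ours and uses the third-party sympy library to evaluate the same double integrals.

```python
import sympy as sp

x, y, z = sp.symbols('x y z', positive=True)
f = 2 * sp.exp(-x) * sp.exp(-2 * y)  # the jpdf from Example 2.97

pa = sp.integrate(f, (x, 1, sp.oo), (y, 0, 1))   # (a) P(X > 1, Y < 1)
pb = sp.integrate(f, (x, 0, y), (y, 0, sp.oo))   # (b) P(X < Y)
Fz = sp.integrate(f, (x, 0, z - y), (y, 0, z))   # (c) cdf of Z = X + Y

print(float(pa), float(pb))  # ~ 0.31809 and ~ 0.33333
print(sp.simplify(Fz - (sp.exp(-z) - 1) ** 2))   # 0 if the cdf matches (e^-z - 1)^2
```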
Given rvs X and Y with p_Y(y) > 0 (or f_Y(y) > 0), the conditional pmf of X given Y = y, denoted X|Y, is defined as

p_{X|Y}(x|y) = P(X = x | Y = y) = p_{X,Y}(x, y)/p_Y(y)

if X and Y are discrete, and the conditional pdf is

f_{X|Y}(x|y) = f_{X,Y}(x, y)/f_Y(y)

if X and Y are continuous. With these densities we may define conditional expectations. If h : ℝ → ℝ, then

E(h(X)|Y = y) = Σ_x h(x) p_{X|Y}(x|y) if X, Y are discrete, and
E(h(X)|Y = y) = ∫_{−∞}^{∞} h(x) f_{X|Y}(x|y) dx if X, Y are continuous.
It is important to notice that E(h(X)|Y = y) is a function of y, say g(y) = E(h(X)|Y = y), and g(Y) = E(h(X)|Y) is a random variable. We may ask what is the mean of g(Y), and the answer is

E(g(Y)) = E(E(h(X)|Y)) = Σ_y E(h(X)|Y = y) P(Y = y) if Y is discrete, and
E(g(Y)) = ∫_{−∞}^{∞} E(h(X)|Y = y) f_Y(y) dy if Y is continuous.

But we have a formula for E(h(X)|Y = y), and if we use it in the case of continuous rvs, we see that

E(g(Y)) = E(E(h(X)|Y)) = ∫_{−∞}^{∞} E(h(X)|Y = y) f_Y(y) dy
= ∫_{−∞}^{∞} ∫_{−∞}^{∞} h(x) (f_{X,Y}(x, y)/f_Y(y)) f_Y(y) dx dy = ∫_{−∞}^{∞} h(x) f_X(x) dx = E(h(X)),

because f_X(x) = ∫_{−∞}^{∞} f_{X,Y}(x, y) dy is the marginal density of X. The Law of Total Probability can be written down for conditional distributions.

Theorem 2.114 Given an rv X and any event A,

P(A) = Σ_x P(A|X = x) P(X = x) if X is discrete, and
P(A) = ∫_{−∞}^{∞} P(A|X = x) f_X(x) dx if X is continuous.

In particular, if Y is an rv and A = {Y ≤ y},

P(Y ≤ y) = F_Y(y) = Σ_x P(Y ≤ y|X = x) P(X = x) if X is discrete, and
P(Y ≤ y) = F_Y(y) = ∫_{−∞}^{∞} P(Y ≤ y|X = x) f_X(x) dx if X is continuous,

where P(Y ≤ y|X = x) = F_{Y|X}(y|x) = ∫_{−∞}^{y} f_{Y|X}(w|x) dw = ∫_{−∞}^{y} (f_{Y,X}(w, x)/f_X(x)) dw if X, Y are continuous with jpdf f_{Y,X}.
Now let's do a few examples.

Example 2.115 We know that a 5-card poker hand has 2 Aces. What is the probability the hand also has 2 Kings, and how many Kings, on average, will a hand hold which has 2 Aces? Let A be the number of Aces in a 5-card hand and K be the number of Kings. The first part of the question is asking for P(K = 2|A = 2). We get

p_{K|A}(2|2) = P(K = 2|A = 2) = P(A = 2, K = 2)/P(A = 2)
= [C(4,2)C(4,2)C(44,1)/C(52,5)] / [C(4,2)C(48,3)/C(52,5)] = C(4,2)C(44,1)/C(48,3) ≈ 0.0153.

The second part is asking for E(K|A = 2). For this,

E(K|A = 2) = Σ_{k=1}^{3} k P(K = k|A = 2) = Σ_{k=1}^{3} k C(4,k)C(44,3−k)/C(48,3) = 0.25.

On average, every 4 hands containing 2 Aces will also have 1 King.

Example 2.116 A random number Y in (0, 1) is chosen. Then a random number X in (0, Y) is chosen. We want the pdf of X and E(X). To find E(X), it seems we should know the pdf first. But we can find it directly, because if we know Y = y, then the mean of X is y/2 since X is Unif(0, y). Here are the details:

E(X) = E(E(X|Y)) = E(Y/2) = 1/4.

Now let's do this by finding f_X(x).
f_X(x) = ∫_0^1 f_{X,Y}(x, y) dy = ∫_0^1 f_{X|Y}(x|y) f_Y(y) dy,

where f_{X|Y}(x|y) = 1/y for 0 < x < y (and 0 otherwise) and f_Y(y) = 1 for 0 < y < 1 (and 0 otherwise). Therefore,

f_X(x) = ∫_x^1 (1/y) dy = −ln x, 0 < x < 1.

Also, f_X(x) = 0 if x ∉ (0, 1). We have found the pdf of X. Let's calculate E(X) using it:

E(X) = ∫_0^1 x(−ln x) dx = 1/4,

using integration by parts. The answers match.
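The two-stage experiment of Example 2.116 is easy to simulate. The sketch below is ours; it checks both E(X) = 1/4 and the cdf implied by the pdf −ln x, namely P(X ≤ t) = t − t ln t.

```python
import math
import random

random.seed(2)
m = 200_000
xs = []
for _ in range(m):
    y = random.random()              # Y ~ Unif(0, 1)
    xs.append(random.uniform(0, y))  # X | Y = y ~ Unif(0, y)

print(sum(xs) / m)  # close to E(X) = 1/4

# The pdf -ln x gives P(X <= 0.1) = 0.1 - 0.1*ln(0.1) ~ 0.3303.
frac = sum(1 for v in xs if v <= 0.1) / m
print(round(frac, 3), round(0.1 - 0.1 * math.log(0.1), 4))
```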
Example 2.117 Independence also allows us to find an explicit expression for the cdf of the sum of two random variables as an application of the Law of Total Probability. Here's the statement. If X and Y are independent continuous rvs, then

F_{X+Y}(w) = P(X + Y ≤ w) = ∫_{−∞}^{∞} F_X(w − y) f_Y(y) dy.

To see this, set W = X + Y. Then

P(W ≤ w) = P(X + Y ≤ w) = ∫_{−∞}^{∞} P(X + Y ≤ w|Y = y) f_Y(y) dy
= ∫_{−∞}^{∞} P(X ≤ w − y|Y = y) f_Y(y) dy = ∫_{−∞}^{∞} F_X(w − y) f_Y(y) dy.

The first equality uses the Law of Total Probability, and the last equality uses independence. Suppose specifically that X and Y are independent Exp(λ) rvs. Then, for w ≥ 0,

P(X + Y ≤ w) = ∫_0^w F_X(w − y) f_Y(y) dy = ∫_0^w (1 − e^{−λ(w−y)}) λe^{−λy} dy = 1 − (λw + 1)e^{−λw} = F_{X+Y}(w).

If w < 0, F_{X+Y}(w) = 0. To find the density, we take the derivative with respect to w to get f_{X+Y}(w) = λ² w e^{−λw}, w ≥ 0. It turns out that this is the pdf of a so-called Gamma(λ, 2) rv.
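The Gamma(λ, 2) cdf derived above can be checked against a direct simulation of the sum of two independent Exp(λ) rvs. The sketch below is ours, with λ = 2 chosen arbitrarily.

```python
import math
import random

random.seed(3)
lam, m = 2.0, 100_000
sums = [random.expovariate(lam) + random.expovariate(lam) for _ in range(m)]

# Empirical cdf of X + Y versus F(w) = 1 - (lam*w + 1)*e^(-lam*w)
for w in (0.5, 1.0, 2.0):
    emp = sum(1 for s in sums if s <= w) / m
    exact = 1 - (lam * w + 1) * math.exp(-lam * w)
    print(w, round(emp, 3), round(exact, 3))
```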
2.7.2
An Application of Conditional Distributions: Bayesian Analysis
In the previous section, conditional distributions were introduced. In this section, an important application of conditional distributions is presented: Bayesian analysis. Recall that if X and Y are random variables, then the conditional pmf of X given Y is

p_{X|Y}(x|y) = p_{X,Y}(x, y)/p_Y(y).

We also have the conditional pmf of Y given X as

p_{Y|X}(y|x) = p_{X,Y}(x, y)/p_X(x),

and so p_{X,Y}(x, y) = p_{Y|X}(y|x) p_X(x). Substituting p_{Y|X}(y|x) p_X(x) for p_{X,Y}(x, y), we obtain

p_{X|Y}(x|y) = p_{Y|X}(y|x) p_X(x) / p_Y(y).

(If X and Y are continuous, replace p by f.) This formula should look familiar. It can be viewed as Bayes' Theorem for distributions, and it informs us how the distribution of X, called the prior distribution of X, will change given a single observation of Y. This altered distribution p_{X|Y}(x|y) is called the posterior distribution of X. We will illustrate the steps in a Bayesian analysis with several simple examples.

Example 2.118 A coin (which may not be fair) is tossed four times. Let X be the random variable representing the possible set of values for the probability that the coin comes up a head. Suppose the possible values of X are 0.4, 0.5, 0.6, and suppose that we have no reason to favor one value of X over the remaining two. Therefore, the prior distribution of the random variable X is given by

p_X(x) = 1/3, x = 0.4, 0.5, 0.6.
Let Y count the number of heads in four tosses of the coin. Therefore, the values of Y are 0, 1, ..., 4. Notice that Y ∼ Binom(4, X), and so the random variable X is a parameter of Y. The distribution of Y depends on the distribution of X. To do a Bayesian analysis, we find the posterior distribution of X given that Y = 0, 1, ..., 4. The posterior distribution of X can be computed as

p_X(x|y) = p_Y(y|x) p_X(x) / p_Y(y).

The numerator is

p_Y(y|x) p_X(x) = (1/3) C(4, y) x^y (1 − x)^{4−y}.

We summarize the analysis in the following table. The details of the computation are given only for the column p_Y(0|x) p_X(x).

X        p_X(x)   p_Y(0|x)p_X(x)   p_Y(1|x)p_X(x)   p_Y(2|x)p_X(x)   p_Y(3|x)p_X(x)   p_Y(4|x)p_X(x)
0.4      1/3      27/625           72/625           72/625           32/625           16/1875
0.5      1/3      1/48             1/12             1/8              1/12             1/48
0.6      1/3      16/1875          32/625           72/625           72/625           27/625
p_Y(y)            2177/30000       1873/7500        1777/5000        1873/7500        2177/30000

Where do the numbers come from? For example,

p_Y(0|0.4) p_X(0.4) = C(4, 0) (4/10)^0 (6/10)^4 (1/3) = 27/625.

The remaining computations are similar.
What if we toss four heads? Then Y = 4. The posterior distribution for X is given in the table below. The increase or decrease from p_X(x) = 1/3 is indicated in the last column.

X      p_X(x) (prior)   p_X(x|4) = p_Y(4|x)p_X(x)/p_Y(4) (posterior)
0.4    1/3              256/2177 = 0.118 (decreased)
0.5    1/3              625/2177 = 0.287 (decreased)
0.6    1/3              1296/2177 = 0.595 (increased)
       1                1
What if we toss no heads? In this case, Y = 0.

X      p_X(x) (prior)   p_X(x|0) = p_Y(0|x)p_X(x)/p_Y(0) (posterior)
0.4    1/3              1296/2177 = 0.595 (increased)
0.5    1/3              625/2177 = 0.287 (decreased)
0.6    1/3              256/2177 = 0.118 (decreased)
       1                1
Now consider the following situation. Suppose that Y = 3. The posterior for X is given in the table below.

X      p_X(x) (prior)   p_X(x|3) = p_Y(3|x)p_X(x)/p_Y(3) (posterior)
0.4    1/3              0.205
0.5    1/3              0.334
0.6    1/3              0.461
       1                1
Let us suppose that we toss the coin seven more times, obtaining only one head. How do we compute the new posterior distribution for X? One approach is to use the posterior distribution obtained after the first four tosses of the coin as our prior distribution for X. Here p_Y(1|x) = C(7, 1) x (1 − x)^6, and the prior is the posterior p_X(x|3) from the table above. Our results are listed in the table below.

X      p_X(x|3) (prior)   p_Y(1|x)p_X(x|3)    posterior = p_Y(1|x)p_X(x|3)/p_Y(1)
0.4    0.205              2.6783 × 10⁻²       0.50565
0.5    0.334              1.8249 × 10⁻²       0.34453
0.6    0.461              7.9357 × 10⁻³       0.14982
       1                  5.2967 × 10⁻²       1
Can the posterior in the above table be computed directly from the initial prior without generating an intermediate prior? We now address this question. Suppose two observations of a random variable are made before any posterior is computed. Let Y, Y′ denote the two observations, and let p_{Y,Y′}(y, y′) be the jpmf of Y and Y′. The posterior for X is computed as

p_X(x|y, y′) = p_{X,Y,Y′}(x, y, y′) / p_{Y,Y′}(y, y′).

How do we compute the numerator? By definition, we have

p_{X,Y,Y′}(x, y, y′) = P(X = x, Y = y, Y′ = y′).

Now observe that

p_{Y′}(y′|y, x) p_Y(y|x) = [P(Y′ = y′, Y = y, X = x)/P(Y = y, X = x)] · [P(Y = y, X = x)/P(X = x)]
= P(Y′ = y′, Y = y, X = x)/P(X = x).

Multiplying both sides by P(X = x) = p_X(x), we get

p_{Y′}(y′|y, x) p_Y(y|x) p_X(x) = P(Y′ = y′, Y = y, X = x).

Putting everything together, we obtain

p_X(x|y, y′) = p_{Y′}(y′|y, x) p_Y(y|x) p_X(x) / p_{Y,Y′}(y, y′).

The posterior distribution of X can now be computed directly from the initial prior.
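The coin analysis of Example 2.118 can be automated, and exact fractions make it easy to confirm that the sequential update (4 tosses, then 7 more) and the direct one-step update from the initial prior give the same posterior. The sketch and helper names below (binom_pmf, posterior) are ours.

```python
from fractions import Fraction
from math import comb

def binom_pmf(n, y, x):
    return comb(n, y) * x**y * (1 - x)**(n - y)

def posterior(prior, n, y):
    """Bayes update of a discrete prior over x after y heads in n tosses."""
    joint = {x: binom_pmf(n, y, x) * p for x, p in prior.items()}
    total = sum(joint.values())
    return {x: q / total for x, q in joint.items()}

xs = (Fraction(2, 5), Fraction(1, 2), Fraction(3, 5))
prior = {x: Fraction(1, 3) for x in xs}

post1 = posterior(prior, 4, 3)   # after 3 heads in 4 tosses
post2 = posterior(post1, 7, 1)   # then 1 head in 7 more tosses

# Direct one-step update from the initial prior gives the same posterior.
joint = {x: binom_pmf(4, 3, x) * binom_pmf(7, 1, x) * p for x, p in prior.items()}
total = sum(joint.values())
direct = {x: q / total for x, q in joint.items()}

print([round(float(post2[x]), 5) for x in xs])  # [0.50565, 0.34453, 0.14982]
print(direct == post2)                          # True
```

Because the updates multiply the same likelihoods before normalizing, the sequential and direct posteriors agree exactly, matching the derivation above.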
2.7.3
Covariance and Correlation
A very important quantity measuring the linear relationship between two rvs is the following.

Definition 2.119 Given two rvs X and Y, the covariance of X and Y is defined by

Cov(X, Y) = E((X − E(X))(Y − E(Y))) = E(XY) − E(X)E(Y).

The correlation coefficient is defined by

ρ(X, Y) = Cov(X, Y)/(σ_X σ_Y), where σ_X² = Var(X) and σ_Y² = Var(Y).

X and Y are said to be uncorrelated if Cov(X, Y) = 0 or, equivalently, ρ(X, Y) = 0.

Remark 2.120 It looks like covariance measures how independent X and Y are. It is certainly true that if X and Y are independent, then ρ(X, Y) = 0, but the reverse is false. Consider again Example 2.110. We have

Cov(X, Y) = E(XY) − E(X)E(Y) = 0 − 0 · (2/3) = 0.

However, X and Y = X² are not independent.

If X and Y are not necessarily independent, we can obtain a formula for Var(X + Y) that involves Cov(X, Y):

Var(X + Y) = E((X + Y)²) − (E(X + Y))²
= E(X² + 2XY + Y²) − (E(X) + E(Y))²
= E(X²) + 2E(XY) + E(Y²) − (E(X))² − 2E(X)E(Y) − (E(Y))²
= [E(X²) − (E(X))²] + [E(Y²) − (E(Y))²] + 2[E(XY) − E(X)E(Y)]
= Var(X) + Var(Y) + 2Cov(X, Y).

It's an important observation that if X and Y are uncorrelated, then the variances add:

X, Y uncorrelated ⟹ Var(X + Y) = Var(X) + Var(Y).

The above calculation can be extended to any number of rvs.

Theorem 2.121 If X_1, X_2, ..., X_n are rvs, then

Var(Σ_{i=1}^{n} X_i) = Σ_{i=1}^{n} Var(X_i) + 2 Σ_{i<j} Cov(X_i, X_j).
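The identity Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y) holds exactly for sample moments as well, which gives a quick numerical sanity check. The sketch below is ours, with a deliberately correlated pair (Y = X + noise).

```python
import random

random.seed(4)
m = 100_000

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((s - ma) * (t - mb) for s, t in zip(a, b)) / len(a)

xs = [random.gauss(0, 1) for _ in range(m)]
ys = [x + random.gauss(0, 1) for x in xs]   # Var(Y) = 2, Cov(X, Y) = 1
sums = [a + b for a, b in zip(xs, ys)]

lhs = cov(sums, sums)                                 # sample Var(X + Y)
rhs = cov(xs, xs) + cov(ys, ys) + 2 * cov(xs, ys)
print(round(lhs, 6) == round(rhs, 6))  # True: the identity is exact
print(lhs)                             # near Var(X) + Var(Y) + 2Cov = 1 + 2 + 2 = 5
```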
2.7.4
The General Central Limit Theorem
For statistics, one of the major applications of independence is the following fact from Theorem 2.112: if X_i ∼ N(μ_i, σ_i), i = 1, 2, ..., n, and are independent, then

Σ_{i=1}^{n} X_i ∼ N(Σ_{i=1}^{n} μ_i, √(Σ_{i=1}^{n} σ_i²)).

This result is very useful for calculating things about the sample mean and variance, as we will see below. The Central Limit Theorem, which is called that because it is central to probability and statistics, says that even if the X_i s are not normal, the sum is approximately normal if the number of rvs is large. We have already seen a special case of this for binomial rvs, but it is true in a much more general setting. The full proof is covered in more advanced courses.

Theorem 2.122 (Central Limit Theorem) Let X_1, X_2, ... be a sequence of independent rvs all having the same distribution, E(X_1) = μ, and Var(X_1) = σ². Then for any a, b ∈ ℝ,

lim_{n→∞} P(a ≤ (X_1 + ··· + X_n − nμ)/(σ√n) ≤ b) = P(a ≤ Z ≤ b),

where Z ∼ N(0, 1). In other words, for large n (generally n ≥ 30),

S_n = X_1 + ··· + X_n ≈ N(nμ, σ√n),

and, dividing by n, E(S_n/n) = nμ/n = μ and Var(S_n/n) = (1/n²) n σ² = σ²/n. In summary, the critically important rv X̄ = S_n/n satisfies

X̄ = S_n/n ≈ N(μ, σ/√n).

This is true no matter what the distributions of the individual X_i s are, as long as they all have the same finite means and variances.
Example 2.123 An elevator is designed to hold 2000 pounds. The mean weight of a person using the elevator is 175 lbs with a standard deviation of 15 lbs. How many people can board the elevator so that the probability of overloading it is at most 1%? To answer this question, let W = X_1 + X_2 + ··· + X_n, where X_i is the rv representing the weight of person i. All the X_i have the same distribution, but we do not know what that is. We want P(W ≥ 2000) ≤ 0.01. Therefore,

P(W ≥ 2000) = P((W − 175n)/(15√n) ≥ (2000 − 175n)/(15√n)) ≈ P(Z ≥ (2000 − 175n)/(15√n)) ≤ 0.01.

Using a calculator (like the TI-8x, for example), we compute x_0.99 = 2.326. (Recall that x_0.99 is the 99th percentile of Z in this case. If using a TI-8x, the command is invNorm(0.99, 0, 1). The last two parameters 0 and 1 can be omitted.) Therefore,

(2000 − 175n)/(15√n) ≥ 2.326 ⟹ n ≤ 10.78 ⟹ n ≤ 10.

The maximum number of people that can use the elevator at the same time is 10 so that the probability of overloading it is at most 1%.

Here is an indication of how the CLT is proved. It can be skipped on first reading.

Sketch of Proof of the CLT (Optional): We may assume μ = 0 (why?), and we also may assume a = −∞. Set Z_n = S_n/(σ√n). Then, if M(t) = M_{X_i}(t) is the common mgf of the rvs X_i,

M_{Z_n}(t) = [M(t/(σ√n))]^n = exp(n ln M(t/(σ√n))).

If we can show that

lim_{n→∞} M_{Z_n}(t) = e^{t²/2},

then by Theorem 2.86 we can conclude that the cdf of Z_n will converge to the cdf of the random variable that has mgf e^{t²/2}. But that random variable is Z ∼ N(0, 1). That will complete the proof. Therefore, all we need to do is show that

lim_{n→∞} n ln M(t/(σ√n)) = t²/2.

To see this, change variables to x = t/(σ√n), so that

lim_{n→∞} n ln M(t/(σ√n)) = lim_{x→0} (t²/σ²)(ln M(x)/x²).

Since ln M(0) = 0, we may use L'Hôpital's rule to evaluate the limit. We get

(t²/σ²) lim_{x→0} ln M(x)/x² = (t²/σ²) lim_{x→0} M′(x)/(2x M(x))
= (t²/2σ²) lim_{x→0} M″(x)/(x M′(x) + M(x)) (using L'Hôpital again)
= (t²/2σ²) · M″(0)/(0 · M′(0) + M(0)) = (t²/2σ²) · σ²/1 = t²/2,

since M(0) = 1, M′(0) = E(X) = 0, and M″(0) = E(X²) = σ². This completes the proof.
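The elevator calculation in Example 2.123 can be checked numerically, using the stated mean of 175 lbs and standard deviation of 15 lbs. This sketch is ours; it searches for the largest n that keeps the standardized threshold at or above the 99th percentile of Z.

```python
from statistics import NormalDist

mu, sigma, cap = 175.0, 15.0, 2000.0
z99 = NormalDist().inv_cdf(0.99)   # 99th percentile of Z, ~ 2.326

# Largest n with (cap - mu*n)/(sigma*sqrt(n)) >= z99, i.e.,
# P(W >= 2000) <= 0.01 under the CLT approximation.
n = 1
while (cap - mu * (n + 1)) / (sigma * (n + 1) ** 0.5) >= z99:
    n += 1
print(n)  # 10
```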
2.8
Chebychev’s Inequality and the Weak Law of Large Numbers
Suppose that X is an rv, either discrete or continuous, with a finite mean μ and variance σ². Intuitively, it would seem reasonable that the probability is small that X takes a value far from μ. Chebychev's Inequality gives an upper bound on that probability. The result assumes nothing about the distribution of X.

Theorem 2.124 (Chebychev's Inequality) If X is an rv with a finite mean μ and variance σ², then for any c > 0,

P(|X − μ| ≥ c) ≤ σ²/c².

Proof We give the proof in the continuous case.

σ² = E((X − μ)²) = ∫_{x:|x−μ|<c} |x − μ|² f_X(x) dx + ∫_{x:|x−μ|≥c} |x − μ|² f_X(x) dx
≥ ∫_{x:|x−μ|≥c} |x − μ|² f_X(x) dx ≥ c² ∫_{x:|x−μ|≥c} f_X(x) dx = c² P(|X − μ| ≥ c).
Remark 2.125 Chebychev's Inequality is sometimes stated in terms of the number of standard deviations from the mean. In particular,

P(|X − μ| ≥ kσ) ≤ 1/k².

The following table gives these upper bounds for selected values of k.
k             √2     2      3      4       5      6       10
Upper bound   0.5    0.25   0.11   0.0625  0.04   0.0278  0.01
An important application of Chebychev’s Inequality is the Weak Law of Large Numbers. It states that the mean of a random sample converges to the mean of the rvs in the sample as the sample size goes to infinity. Theorem 2.126 (Weak Law of Large Numbers) Let X 1 , X 2 , ..., X n be a random sample of size n all with the same finite meanμ and variance σ 2 . If X = X 1 +X 2n+···+X n , then for any constant c > 0, lim P( X − μ ≥ c) = 0. n→∞
Proof Recall that $Var(\overline{X}) = \sigma^2/n$, so the standard deviation of $\overline{X}$ is $\sigma/\sqrt{n}$. By Chebychev's Inequality,
$$P(|\overline{X} - \mu| \ge c) \le \frac{1}{c^2} \left( \frac{\sigma}{\sqrt{n}} \right)^2 = \frac{\sigma^2}{nc^2} \longrightarrow 0 \text{ as } n \to \infty.$$
Remark 2.127 The type of convergence described in the theorem is called convergence in probability. There is also a Strong Law of Large Numbers, which states that the sample mean converges to μ pointwise on a set of probability one, a mode called almost sure convergence. Almost sure convergence is a stronger form of convergence than convergence in probability, hence the names strong and weak laws. Sequences of rvs can converge to a fixed rv in several different ways, but this topic is beyond the scope of the course.
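The Weak Law can also be watched in action. The sketch below is our own illustration, not from the text; the Bernoulli summands, the threshold c = 0.05, and the sample sizes are arbitrary choices. It estimates $P(|\overline{X} - \mu| \ge c)$ by simulation and shows the probability shrinking as n grows.

```python
import random

random.seed(1)

def tail_prob(n, c=0.05, p=0.5, trials=2000):
    """Estimate P(|Xbar - p| >= c) for the mean of n Bernoulli(p) rvs."""
    hits = 0
    for _ in range(trials):
        xbar = sum(random.random() < p for _ in range(n)) / n
        if abs(xbar - p) >= c:
            hits += 1
    return hits / trials

# The estimated probability decreases toward 0 as n increases,
# consistent with the bound sigma^2/(n c^2) from the proof.
for n in [10, 100, 1000]:
    print(n, tail_prob(n))
```

For n = 10 the sample mean misses μ = 0.5 by at least 0.05 most of the time, while for n = 1000 it almost never does.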
2.9
Other Distributions Important in Statistics
Many types of distributions are used in statistics, but three of the most important ones, besides the normal, are the χ², t, and F distributions. We briefly summarize what information is required about these distributions for the remainder of the text. The χ² and F distributions are nonsymmetric, while the t distribution is symmetric about zero.
2.9.1
Chi Squared Distribution
Suppose that $Z_1, Z_2, \ldots, Z_n$ are all independent, standard normal rvs, and consider the rv
$$X = Z_1^2 + Z_2^2 + \cdots + Z_n^2.$$
The rv X is said to have the χ²(n) (chi squared) distribution with n degrees of freedom. Recall Example 2.87 in which the rv Y = Z² was introduced. Clearly, in that example, Y has a χ²(1) distribution with one degree of freedom. It was shown there that
$$f_Y(y) = \frac{1}{\sqrt{2\pi}} \frac{1}{\sqrt{y}} e^{-\frac{1}{2}y}, \quad y \ge 0.$$
Definition 2.128 A chi squared rv with n degrees of freedom is any rv X with pdf
$$f_X(x) = \frac{1}{2^{n/2}\, \Gamma(n/2)}\, x^{n/2 - 1} e^{-\frac{1}{2}x}, \quad x \ge 0.$$
We write X ∼ χ²(n). The pdf involves the gamma function, Γ(x), a special function of applied mathematics. One of the properties of the gamma function is that Γ(n) = (n − 1)! if n is a positive integer. It turns out that $\Gamma(\frac{1}{2}) = \sqrt{\pi}$, and so the two formulas agree if n = 1. The graphs of the pdfs for n = 1, 3, 5 are displayed below.
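The gamma-function facts quoted above are easy to check with Python's standard library (`math.gamma`); this sketch, our own illustration, also confirms that the n = 1 case of Definition 2.128 reduces to the pdf of Y = Z² given earlier.

```python
import math

# Gamma(n) = (n-1)! for positive integers, and Gamma(1/2) = sqrt(pi).
assert math.gamma(5) == math.factorial(4)
assert math.isclose(math.gamma(0.5), math.sqrt(math.pi))

def chi2_pdf(x, n):
    """pdf of the chi squared distribution with n degrees of freedom."""
    return x ** (n / 2 - 1) * math.exp(-x / 2) / (2 ** (n / 2) * math.gamma(n / 2))

# With n = 1 the pdf reduces to f_Y(y) = e^{-y/2} / sqrt(2*pi*y).
y = 1.7
assert math.isclose(chi2_pdf(y, 1), math.exp(-y / 2) / math.sqrt(2 * math.pi * y))
print("all checks pass")
```

The point of the last assertion is that $2^{1/2}\,\Gamma(1/2) = \sqrt{2\pi}$, which is exactly why the two formulas agree when n = 1.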
We derived the mgf of Y in Example 2.87 as $M_Y(t) = \frac{1}{\sqrt{1 - 2t}}$. By independence, if $X = Z_1^2 + Z_2^2 + \cdots + Z_n^2$, then
$$M_X(t) = M_{Z_1^2}(t) M_{Z_2^2}(t) \cdots M_{Z_n^2}(t) = \left( \frac{1}{\sqrt{1 - 2t}} \right)^n = \frac{1}{(1 - 2t)^{n/2}}, \quad t < \frac{1}{2}.$$