A Level Further Mathematics for AQA: Statistics Student Book (AS/A Level) 1316644324, 9781316644508, 9781316644324, 9781316644584, 9781316644614

New 2017 Cambridge A Level Maths and Further Maths resources to help students with learning and revision. Written for th


English Pages 176 [303] Year 2018


Brighter Thinking

A Level Further Mathematics for AQA Statistics Student Book (AS/A Level) Stephen Ward, Paul Fannon, Vesna Kadelburg and Ben Woolley

Contents

Introduction
How to use this resource

1 Discrete random variables
  1: Average and spread of a discrete random variable
  2: Expectation and variance of transformations of discrete random variables
  3: The discrete uniform distribution
2 Poisson distribution
  1: Using the Poisson model
  2: Using the Poisson distribution in hypothesis tests
3 Chi-squared tests
  1: Contingency tables
  2: Yates’ correction
4 Continuous distributions
  1: Continuous random variables
  2: Expectation and variance of continuous random variables
  3: Expectation and variance of functions of a random variable
  4: Sums of independent random variables
  5: Linear combinations of normal variables
  6: Cumulative distribution functions
  7: Piecewise-defined probability density functions
  8: Rectangular distribution
  9: Exponential distribution
  10: Combining discrete and continuous random variables
Focus on … Proof 1
Focus on … Problem solving 1
Focus on … Modelling 1
5 Further hypothesis testing
  1: t-tests
  2: Errors in hypothesis testing
6 Confidence intervals
  1: Confidence intervals
  2: Confidence intervals for the mean when the population variance is unknown
Focus on … Proof 2
Focus on … Problem solving 2
Focus on … Modelling 2
Cross-topic review exercise
AS Level Practice paper
A Level Practice paper
Formulae
Answers
Worked solutions for chapter exercises
  1 Discrete random variables
  2 Poisson distribution
  3 Chi-squared tests
  4 Continuous distributions
  5 Further hypothesis testing
  6 Confidence intervals
Worked solutions for cross-topic review exercises
  Cross-topic review exercises
Acknowledgements

Introduction You have probably been told that mathematics is very useful, yet it can often seem like a lot of techniques that just have to be learnt to answer examination questions. You are now getting to the point where you will start to see where some of these techniques can be applied in solving real problems. However, as well as seeing how maths can be useful, we hope that anyone working through this book will realise that it can also be incredibly frustrating, surprising and ultimately beautiful. The book is woven around three key themes from the new curriculum.

Proof Maths is valued because it trains you to think logically and communicate precisely. At a high level, maths is far less concerned about answers and more about the clear communication of ideas. It is not about being neat – although that might help! It is about creating a coherent argument that other people can easily follow but find difficult to refute. Have you ever tried looking at your own work? If you cannot follow it yourself it is unlikely anybody else will be able to understand it. In maths we communicate using a variety of means – feel free to use combinations of diagrams, words and algebra to aid your argument. And once you have attempted a proof, try presenting it to your peers. Look critically (but positively) at some other people’s attempts. It is only through having your own attempts evaluated and trying to find flaws in other proofs that you will develop sophisticated mathematical thinking. This is why we have included lots of common errors in our Work it out boxes – just in case your friends don’t make any mistakes!

Problem solving Maths is valued because it trains you to look at situations in unusual, creative ways, to persevere and to evaluate solutions along the way. We have been heavily influenced by a great mathematician and maths educator George Polya, who believed that students were not just born with problem-solving skills – they developed them by seeing problems being solved and reflecting on their solutions before trying similar problems. You may not realise it but good mathematicians spend most of their time being stuck. You need to spend some time on problems you can’t do, trying out different possibilities. If after a while you have not cracked it, then look at the solution and try a similar problem. Don’t be disheartened if you cannot get it immediately – in fact, the longer you spend puzzling over a problem the more you will learn from the solution. You may never need to integrate a rational function in the future, but we firmly believe that the problem solving skills you will develop by trying it can be applied to many other situations.

Modelling Maths is valued because it helps us solve real-world problems. However, maths describes ideal situations and the real world is messy! Modelling is about deciding on the important features needed to describe the essence of a situation and turning them into a mathematical form, then using that to make predictions, compare to reality and possibly improve the model. In many situations the technical maths is actually the easy part – especially with modern technology. Deciding which features of reality to include or ignore and anticipating the consequences of these decisions is the hard part. Yet it is amazing how some fairly drastic assumptions – such as pretending a car is a single point or that people’s votes are independent – can result in models that are surprisingly accurate. More than anything else, this book is about making links – links between the different chapters, the topics covered and the themes above, links to other subjects and links to the real world. We hope that you will grow to see maths as one great complex but beautiful web of interlinking ideas. Maths is about so much more than examinations, but we hope that if you take on board these ideas (and do plenty of practice!) you will find maths examinations a much more approachable and possibly even enjoyable experience. However, always remember that the results of what you write down in a few hours by yourself in silence under exam conditions are not the only measure you should consider when judging your mathematical ability – it is only one variable in a much more complicated mathematical model!

How to use this resource Throughout this resource you will notice particular features that are designed to aid your learning. This section provides a brief overview of these features. In this chapter you will learn how to: predict the mean, mode, median and variance of a discrete random variable understand how a linear transformation of a variable changes the mean and variance prove and use the formulae for expectation and variance of a special distribution called the uniform distribution recognise when it is appropriate to use a uniform distribution.    If you are following the A Level course, you will also learn how to: calculate the mean of a discrete random variable after a non-linear transformation.

Learning objectives A short summary of the content that you will learn in each chapter.

WORKED EXAMPLE

The left-hand side shows you how to set out your working. The right-hand side explains the more difficult steps and helps you understand why a particular method was chosen.

PROOF

Step-by-step walkthroughs of standard proofs and methods of proof.

WORK IT OUT Can you identify the correct solution and find the mistakes in the two incorrect solutions?

A Level Mathematics Student Book 1, Chapter 21

You should know how to use the rules of probability.

1 Two events A and B are independent. If P(A)=0.4 and P(B)=0.3, find P(A AND B).

A Level Mathematics Student Book 1, Chapter 21

You should know how to find probabilities of discrete random variables.

2 P(X=x)=kx for x=1,2, 3. Find the value of k.

A Level Mathematics Student Book 1, Chapter 20

You should know how to find the mean, variance and standard deviation of data, including familiarity with formulae involving sigma notation.

3 Find the variance of 2, 5 and 8.

Before you start Points you should know from your previous learning and questions to check that you're ready to start the chapter.

Key point A summary of the most important methods, facts and formulae.

Common error Specific mistakes that are often made. These typically appear next to the point in the Worked example where the error could occur.

Tip Useful guidance, including on ways of calculating or checking and use of technology.

Each chapter ends with a Checklist of learning and understanding and a Mixed practice exercise, which includes past paper questions marked with an icon.

In between chapters, you will find extra sections that bring together topics in a more synoptic way.

FOCUS ON… Unique sections relating to the preceding chapters that develop your skills in proof, problem-solving and modelling.

CROSS-TOPIC REVIEW EXERCISE

Questions covering topics from across the preceding chapters, testing your ability to apply what you have learned. Key terms are picked out in colour within chapters. You can hover over these terms to view their definitions, or find them in the Glossary tab. Towards the end of the resource you will find practice paper questions, short answers to all questions and worked solutions.

Rewind Reminders of where to find useful information from earlier in your study.

Fast forward Links to topics that you may cover in greater detail later in your study.

Focus on… Links to problem-solving, modelling or proof exercises that relate to the topic currently being studied.

Did you know? Interesting or historical information and links with other subjects to improve your awareness about how mathematics contributes to society. Colour coding of exercises

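The questions in the exercises are designed to provide careful progression, ranging from basic fluency to practice questions. They are uniquely colour-coded, as shown below.

1 A sequence is defined by $u_n = 2 \times 3^{n-1}$. Use the principle of mathematical induction to prove that $u_1 + u_2 + \dots + u_n = 3^n - 1$.

2 Show that $1^2 + 2^2 + \dots + n^2 = \frac{n(n+1)(2n+1)}{6}$

3 Show that $1^3 + 2^3 + \dots + n^3 = \frac{n^2(n+1)^2}{4}$

4 Prove by induction that $\frac{1}{1 \times 2} + \frac{1}{2 \times 3} + \frac{1}{3 \times 4} + \dots + \frac{1}{n(n+1)} = \frac{n}{n+1}$

5 Prove by induction that $\frac{1}{1 \times 3} + \frac{1}{3 \times 5} + \frac{1}{5 \times 7} + \dots + \frac{1}{(2n-1) \times (2n+1)} = \frac{n}{2n+1}$

6 Prove that $1 \times 1! + 2 \times 2! + 3 \times 3! + \dots + n \times n! = (n+1)! - 1$

7 Use the principle of mathematical induction to show that $1^2 - 2^2 + 3^2 - 4^2 + \dots + (-1)^{n-1} n^2 = (-1)^{n-1} \frac{n(n+1)}{2}$.

8 Prove that $(n+1) + (n+2) + (n+3) + \dots + (2n) = \frac{1}{2} n(3n+1)$

9 Prove using induction that $\sin\theta + \sin 3\theta + \dots + \sin(2n-1)\theta = \frac{\sin^2 n\theta}{\sin\theta}$, $n \in \mathbb{Z}^+$

10 Prove that $\sum_{k=1}^{n} k \, 2^k = (n-1) 2^{n+1} + 2$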

Black – practice questions which come in several parts, each with subparts i and ii. You only need attempt subpart i at first; subpart ii is essentially the same question, which you can use for further practice if you got part i wrong, for homework, or when you revisit the exercise during revision. Yellow – designed to encourage reflection and discussion Green – practice questions at a basic level Blue – practice questions at an intermediate level Red – practice questions at an advanced level Purple – challenging questions that apply the concept of the current chapter across other areas of maths.

   indicates content that is for A level students only

   indicates a question that requires a calculator    indicates a non-calculator question

1 Discrete random variables

In this chapter you will learn how to:
  predict the mean, mode, median and variance of a discrete random variable
  understand how a linear transformation of a variable changes the mean and variance
  prove and use the formulae for expectation and variance of a special distribution called the uniform distribution
  recognise when it is appropriate to use a uniform distribution.

If you are following the A Level course, you will also learn how to:
  calculate the mean of a discrete random variable after a non-linear transformation.

Before you start…

A Level Mathematics Student Book 1, Chapter 21
You should know how to use the rules of probability.
1 Two events $A$ and $B$ are independent. If $P(A) = 0.4$ and $P(B) = 0.3$, find $P(A \text{ and } B)$.

A Level Mathematics Student Book 1, Chapter 21
You should know how to find probabilities of discrete random variables.
2 $P(X = x) = kx$ for $x = 1, 2, 3$. Find the value of $k$.

A Level Mathematics Student Book 1, Chapter 20
You should know how to find the mean, variance and standard deviation of data, including familiarity with formulae involving sigma notation.
3 Find the variance of 2, 5 and 8.

A Level Further Mathematics Student Book 1, Chapter 11
You should know how to calculate sums of powers of integers.
4 Find and simplify an expression for …

What are discrete random variables? A random variable is a variable that can change every time it is observed – such as the outcome when you roll a dice. A discrete random variable can only take certain values. In A Level Mathematics Student Book 1, Chapter 21, you covered the probability distributions of discrete random variables – a table or rule giving a list of all possible outcomes along with their probabilities.

Tip

Discrete variables don’t have to take integer values. However, the possible distinct values can be listed, though the list may be infinite. For example: if $X$ is the standard UK shoe size of a random adult member of the public, $X$ takes values from a list of distinct sizes and is a discrete random variable. If $Y$ is the exact foot length of a random adult member of the public (in cm), $Y$ takes values in an interval and is a continuous random variable.

Many real-life situations follow probability distributions – such as the velocity of a molecule in a waterfall or the amount of tax paid by an individual. It is extremely difficult to make a prediction about a single observation, but it turns out that you can predict remarkably accurately the overall behaviour of many millions of observations. In this chapter you will see how you can predict the mean and variance of a discrete random variable.

Section 1: Average and spread of a discrete random variable The most commonly used measure of the average of a random variable is the expectation. It is a value representing the mean result if the variable were to be measured an infinite number of times.

Tip The expectation of a random variable does not need to be a value that the variable can actually take.

Key point 1.1 The expectation of a discrete random variable $X$ is written $E(X)$ and calculated as

$E(X) = \sum_i x_i p_i$

where $x_i$ is each possible value that $X$ can take and $p_i$ is the associated probability.

Tip The subscript $i$ in the formula in Key point 1.1 is just a counter referring to each possible value and its associated probability.

You do not need to be able to prove this result, but you might find it helpful to see this proof.

PROOF 1

The mean of $n$ pieces of discrete data, in which the value $x_i$ occurs with frequency $f_i$, is

$\bar{x} = \frac{1}{n} \sum_i f_i x_i$
Start from the definition of the mean.

$\bar{x} = \sum_i \frac{f_i}{n} x_i$
Since $\frac{1}{n}$ is constant you can take it into the sum.

If $n$ is large, $\frac{f_i}{n}$ will tend towards the probability of $x_i$ happening, $p_i$, therefore $\bar{x} \to \sum_i x_i p_i$.
When the sample size tends to infinity, the sample mean becomes the true population mean, $E(X)$.
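The limiting argument in Proof 1 can be illustrated with a short simulation. This is only a sketch: the distribution below is hypothetical (it is not taken from the book), and the function name `sample_mean` is chosen here for illustration.

```python
import random

# Hypothetical distribution, chosen only for illustration.
values = [1, 2, 3, 4]
probabilities = [0.1, 0.3, 0.4, 0.2]

# E(X) from Key point 1.1: the sum of x_i * p_i.
expectation = sum(x * p for x, p in zip(values, probabilities))

def sample_mean(n, rng):
    """Draw n observations of X and return their mean."""
    observations = rng.choices(values, weights=probabilities, k=n)
    return sum(observations) / n

rng = random.Random(0)  # fixed seed so the run is repeatable
for n in (10, 1_000, 100_000):
    print(n, sample_mean(n, rng))
print("expectation:", expectation)
```

As the number of observations grows, the printed sample means should settle close to the expectation, which is the behaviour the proof describes.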

WORKED EXAMPLE 1.1

The random variable has a probability distribution as shown in the table. Calculate

.

Use the values from the distribution in the formula in Key point 1.1.
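The calculation in Key point 1.1 can also be sketched in code. The distribution here is hypothetical (the table from Worked example 1.1 is not reproduced), and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Hypothetical distribution (not the table from Worked example 1.1):
# each key is a value x_i and each entry is its probability p_i.
distribution = {
    1: Fraction(1, 10),
    2: Fraction(3, 10),
    3: Fraction(2, 5),
    4: Fraction(1, 5),
}

def expectation(dist):
    """Return E(X), the sum of x_i * p_i over all possible values."""
    # The probabilities of a valid distribution must total 1.
    assert sum(dist.values()) == 1
    return sum(x * p for x, p in dist.items())

print(expectation(distribution))  # 27/10
```

Note that the result, 2.7, is not a value the variable can actually take.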

As well as knowing the expected average, you may also be interested in how far away from the average you can expect an outcome to be. The variance, $\mathrm{Var}(X)$, of a random variable $X$ is a value representing the degree of variation that would be seen if the variable were to be repeatedly measured an infinite number of times. It is a measure of how spread out the variable is.

Fast forward You will see in Section 2 how to find expectations of other functions of $X$.

Key point 1.2 The variance of a discrete random variable $X$ is written $\mathrm{Var}(X)$ and calculated as

$\mathrm{Var}(X) = E(X^2) - [E(X)]^2$

where $E(X^2) = \sum_i x_i^2 p_i$.

Did you know? Standard deviation – the square root of variance – is a much more meaningful representation of the spread of a variable. So why is variance used at all? The answer is purely to do with mathematical elegance. It turns out that the algebra of variance is far neater than the algebra of standard deviations.

The quantity $E(X^2)$ is the expected value of $X^2$, read as ‘the mean of the squares’. This variance formula is often read as ‘the mean of the squares minus the square of the mean’.

WORKED EXAMPLE 1.2

Calculate $\mathrm{Var}(X)$ for the probability distribution in Worked example 1.1.

From Worked example 1.1:

Use the values from the distribution in the formula in Key point 1.2.
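The ‘mean of the squares minus the square of the mean’ calculation can be sketched as follows. The distribution is again hypothetical, and the helper names are chosen here for illustration.

```python
from fractions import Fraction
from math import sqrt

# Hypothetical distribution (not the table from Worked example 1.1).
distribution = {
    1: Fraction(1, 10),
    2: Fraction(3, 10),
    3: Fraction(2, 5),
    4: Fraction(1, 5),
}

def expectation(dist, g=lambda x: x):
    """Return the sum of g(x_i) * p_i; with the default g this is E(X)."""
    return sum(g(x) * p for x, p in dist.items())

def variance(dist):
    """Return Var(X) = E(X^2) - [E(X)]^2."""
    mean = expectation(dist)
    mean_of_squares = expectation(dist, lambda x: x * x)
    return mean_of_squares - mean ** 2

var = variance(distribution)
print(var)        # 81/100
print(sqrt(var))  # the standard deviation, as a float
```

Working with exact fractions until the final square root mirrors the advice to keep full accuracy until the final answer.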

Tip Many calculators can simplify this process. You normally have to treat the values of the random variable as data and the probabilities as the frequency. Two other less commonly used measures of average are the mode and the median. For data, the mode is the most common result and this extends to variables.

Key point 1.3 The mode of a discrete random variable $X$ is the value of $x$ associated with the largest probability.

For data, the median is the value that has half the data values below it and half above it. You can interpret this in terms of probabilities.

Key point 1.4 The median, $M$, of a discrete random variable $X$ is any value that has

$P(X \le M) \ge \frac{1}{2}$ and $P(X \ge M) \ge \frac{1}{2}$.

If there are two possible values, you have to find their mean.

When there are two possible values and you have to take their mean, the median will take a value different from any observed value of the random variable.

WORKED EXAMPLE 1.3

For the distribution in Worked example 1.1 find: a the mode b the median. a The largest probability is so there are two modes: and . b

You can create a table of $P(X \le x)$.

Look for the first value of $x$ for which $P(X \le x)$ is greater than or equal to $\frac{1}{2}$. You could also check that $P(X \ge x) \ge \frac{1}{2}$ but this is not necessary here.

So the median is .
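The mode and median rules can be sketched in code. This assumes the usual reading of the median condition, namely that $M$ satisfies $P(X \le M) \ge \frac{1}{2}$ and $P(X \ge M) \ge \frac{1}{2}$, with the mean taken when two values qualify. Both distributions are hypothetical, with integer values and strictly positive probabilities.

```python
from fractions import Fraction

def modes(dist):
    """Return every value whose probability equals the largest probability."""
    largest = max(dist.values())
    return [x for x in sorted(dist) if dist[x] == largest]

def median(dist):
    """Return the median; if two values qualify, return their mean."""
    half = Fraction(1, 2)
    candidates = []
    for m in sorted(dist):
        at_most = sum(p for x, p in dist.items() if x <= m)   # P(X <= m)
        at_least = sum(p for x, p in dist.items() if x >= m)  # P(X >= m)
        if at_most >= half and at_least >= half:
            candidates.append(m)
    return Fraction(sum(candidates), len(candidates))

# Hypothetical distributions for illustration.
skewed = {1: Fraction(1, 10), 2: Fraction(3, 10), 3: Fraction(2, 5), 4: Fraction(1, 5)}
even_split = {x: Fraction(1, 4) for x in (1, 2, 3, 4)}

print(modes(skewed), median(skewed))  # [3] 3
print(modes(even_split))              # [1, 2, 3, 4]
print(median(even_split))             # 5/2
```

The second distribution shows both special cases at once: every value is a mode, and the median 2.5 is not a value the variable can take.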

A probability distribution can also be described by a function. WORKED EXAMPLE 1.4

is a random variable that can take values

and where

a Find the value of . b Find the expected mean of

.

c Find the standard deviation of

. Use the fact that the total of all the probabilities must be

a

b Use Key point 1.1.

c

To find the standard deviation you first need to find the variance, which means you need to find and use Key point 1.2.

So

WORK IT OUT 1.1

Although you only write down three significant figures in the working, make sure you use the full accuracy from your calculator to find the final answer.

Find the variance of , the random variable defined by this distribution.

Which is the correct solution? Identify the errors made in the incorrect solutions. A

B

C

EXERCISE 1A 1

Calculate the expectation, mode, median, variance and standard deviation of each of these discrete random variables. a

i

ii

b

i

ii

c

i

ii

d

i ii

2

, ,

A discrete random variable is given by

for

.

a Show that b Find 3

.

.

A discrete random variable has the probability distribution shown and

.

a Find the values of and . b Find the median of . 4

A discrete random variable has its probability given by , where a Show that

.

b Find the exact value of 5

.

.

The probability distribution of a discrete random variable is defined by ,

.

a Find the value of . b Find

.

c Find the standard deviation of . 6

A fair six-sided dice, with sides numbered

is thrown. Find the mean and variance of the

score. 7

The table shows the probability distribution of a discrete random variable .

a Given that

, find the values of and of .

b Find the standard deviation of . 8

A biased dice with four faces is used in a game. A player pays counters to roll the dice. The table shows the possible scores on the dice, the probability of each score and the number of counters the player receives in return for each score. Score Probability Number of counters player receives Find the value of in order for the player to get an expected profit of

9

counters per roll.

Two fair dice labelled with the values to are thrown. The random variable is the difference between the larger and the smaller score, or zero if they are the same. a Copy and complete this table to show the probability distribution of .

b Find c Find

. .

d Find the median of . e Find

.

10 a In a game a player pays an entrance fee of £ . He then selects one number from or and rolls three fair four-sided dice, numbered to . If his chosen number appears on all three dice he wins four times the entrance fee. If his number appears on exactly two of the dice he wins three times the entrance fee. If his number appears on exactly one dice he wins £ . If his number does not appear on any of the dice he wins nothing. Copy and complete the probability table. Profit £ Probability b The game organiser wants to make a profit over many plays of the game. Given that he must charge a whole number of pence, what is the minimum amount the organiser must charge? 11 Viewers are asked to rate a new film on a three-point scale. Their marks are modelled by the random variable as shown.

a The mean, median and mode of are all equal. Find the variance of . b Two independent viewers of the film are both asked their opinion. i What is the probability that their total score is more than ? ii Show that the expectation of their total score is . 12 The number of books borrowed by each person who visits a library is modelled by the random variable .

a Find the mean of . b Show that the expectation of is larger than the median of . c Show that the standard deviation of is less than the median of . d

people visited the library during an audit period. The numbers of books they borrowed are independent of each other. Find: i the probability that exactly three people borrow no books ii the expected number of people who borrow no books.

Section 2: Expectation and variance of transformations of discrete random variables

Linear transformations

You might have noticed a link between parts a and b of question 1 in Exercise 1A. The distributions were very similar but in part b all the $x$-values were multiplied by a constant. All the averages and the standard deviations were also multiplied by that constant, but the variances were multiplied by its square. This is an example of a transformation.

The most common type of transformation is a linear transformation. This is where the new variable is found from the old variable by multiplying by a constant and/or adding on a constant. You might do this, for example, if you change the units of measurement. This kind of change is also known as ‘linear coding’. If you know the original mean and variance and how the data were transformed, you can use a shortcut to find the mean and variance of the new data.

Key point 1.5 If $X$ is a random variable and $Y$ is a new random variable such that $Y = aX + b$, then:

$E(Y) = aE(X) + b$
$\mathrm{Var}(Y) = a^2 \mathrm{Var}(X)$

Fast forward You will prove Key point 1.5 after you have developed a little more theory.

This means that the standard deviation of $Y$ is $|a|$ times the standard deviation of $X$. This makes sense as multiplying the data by $a$ does change how spread out they are, but adding on $b$ does not change the spread.

WORKED EXAMPLE 1.5

A random variable has expectation and variance . Find:

. is a transformation of given by

a the expectation of b the standard deviation of . a

This is just a direct application of Key point 1.5.

b

To find the standard deviation you first need to find the variance of , using Key point 1.5.

Common error It is easy to get confused with the minus sign in the transformations in Worked example 1.5. Remember that both variances and standard deviations are always positive.
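The shortcut for a linear transformation $Y = aX + b$, namely $E(Y) = aE(X) + b$ and $\mathrm{Var}(Y) = a^2 \mathrm{Var}(X)$, can be checked against a direct calculation. The sketch below uses a hypothetical distribution and hypothetical constants, with a negative multiplier to show that the variance still comes out positive.

```python
from fractions import Fraction

def expectation(dist, g=lambda x: x):
    """Return the sum of g(x_i) * p_i."""
    return sum(g(x) * p for x, p in dist.items())

def variance(dist):
    """Return E(X^2) - [E(X)]^2."""
    return expectation(dist, lambda x: x * x) - expectation(dist) ** 2

def transform(dist, g):
    """Build the distribution of g(X), merging values that coincide."""
    new = {}
    for x, p in dist.items():
        new[g(x)] = new.get(g(x), 0) + p
    return new

# Hypothetical distribution and constants for illustration.
dist_x = {1: Fraction(1, 10), 2: Fraction(3, 10), 3: Fraction(2, 5), 4: Fraction(1, 5)}
a, b = -3, 5
dist_y = transform(dist_x, lambda x: a * x + b)

# Direct calculation agrees with the shortcut.
assert expectation(dist_y) == a * expectation(dist_x) + b
assert variance(dist_y) == a ** 2 * variance(dist_x)

print(expectation(dist_y))  # -31/10
print(variance(dist_y))     # 729/100
```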

Non-linear transformations You can also apply non-linear transformations to , such as

,

or

. When you do this

there is no shortcut to finding the mean and variance of the transformed variable. You need to adapt Key point 1.1. Consider the discrete random variable

outcome on a fair six-sided dice. If

, you can

construct the probability distribution for :

The probability of being is just the same as the probability of being . So , i.e. it is

.

Key point 1.6 If $X$ is a discrete random variable with expectation $E(X)$ and $g$ is a function applied to $X$, then

$E(g(X)) = \sum_i g(x_i) p_i$

WORKED EXAMPLE 1.6

The discrete random variable has the distribution shown in the table.

If

°, find:

a b

.

a

Apply Key point 1.6

b

To find

you need

which is

.
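The rule for the expectation of a function of a random variable can be sketched for a fair six-sided dice. The choice of squaring as the function is an assumption made for illustration; it is not necessarily the transformation used in the text.

```python
from fractions import Fraction

# A fair six-sided dice: each face has probability 1/6.
dice = {face: Fraction(1, 6) for face in range(1, 7)}

def expectation_of(g, dist):
    """Return E(g(X)), the sum of g(x_i) * p_i."""
    return sum(g(x) * p for x, p in dist.items())

mean = expectation_of(lambda x: x, dice)
mean_of_squares = expectation_of(lambda x: x ** 2, dice)

print(mean)             # 7/2
print(mean_of_squares)  # 91/6
print(mean ** 2)        # 49/4
```

The last two lines differ, which illustrates why there is no shortcut for non-linear transformations: in general $E(g(X))$ is not $g(E(X))$.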

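You can use Key point 1.6 to prove Key point 1.5.

PROOF 2

Let $Y = aX + b$. Then:

$E(Y) = \sum_i (a x_i + b) p_i$
Apply Key point 1.6 to the function $g(X) = aX + b$.

$E(Y) = a \sum_i x_i p_i + b \sum_i p_i$
You can separate out a sum into its different terms, taking out constant factors.

$E(Y) = aE(X) + b$
Use the fact that $\sum_i p_i = 1$ for any probability distribution and the definition of expectation from Key point 1.1.

You have now established the first part of Key point 1.5. Considering $E(Y^2)$ to get to the variance:

$E(Y^2) = \sum_i (a x_i + b)^2 p_i = \sum_i (a^2 x_i^2 + 2ab x_i + b^2) p_i$
Apply Key point 1.6 to the function $g(X) = (aX + b)^2$ and expand the brackets.

$E(Y^2) = a^2 \sum_i x_i^2 p_i + 2ab \sum_i x_i p_i + b^2 \sum_i p_i$
You can separate out a sum into its different terms, taking out constant factors.

$E(Y^2) = a^2 E(X^2) + 2ab E(X) + b^2$
Use the fact that $\sum_i p_i = 1$ for any probability distribution and the definitions of $E(X)$ and $E(X^2)$.

Using the definition of variance from Key point 1.2:

$\mathrm{Var}(Y) = E(Y^2) - [E(Y)]^2 = a^2 E(X^2) + 2ab E(X) + b^2 - (aE(X) + b)^2$

$\mathrm{Var}(Y) = a^2 E(X^2) + 2ab E(X) + b^2 - a^2 [E(X)]^2 - 2ab E(X) - b^2 = a^2 E(X^2) - a^2 [E(X)]^2$
Expand the brackets and lots of terms cancel!

$\mathrm{Var}(Y) = a^2 \left( E(X^2) - [E(X)]^2 \right) = a^2 \mathrm{Var}(X)$
Taking out a factor of $a^2$ leaves the expression for $\mathrm{Var}(X)$ from Key point 1.2. This completes the proof.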

EXERCISE 1B 1

and a

. Find

and

if:

i ii

b

i ii

c

i ii

d

i ii

e

i ii

2

.

The discrete random variable follows this distribution:

Find a

i ii

b

i ii

c

i ii

d

i

and

if:

.

ii 3

Stephen goes on a mile bike ride every weekend. The distance until he stops for a picnic is modelled by , where and . is the distance remaining after his picnic. Find

4

The rule for converting between degrees Celsius

and

.

and degrees Fahrenheit

When a bread oven is operating it has expected temperature .

is:

with standard deviation

Find the expected temperature and standard deviation in degrees Fahrenheit. 5

The random variable has expectation and variance . If , find the values of and so that the expectation of is zero and the standard deviation is .

6

is a discrete random variable where and . is a transformation of such that . Find and the standard deviation of .

7

is a discrete random variable satisfying

for

.

Find: a the value of b c d e 8

.

The discrete random variable has a distribution given by

for

. a Find, in terms of and , b Hence find, in terms of and , 9

. .

A discrete random variable has equal expectation and standard deviation. is a transformation of such that . Prove that it is only possible for the expectation of to equal the variance of if

10 The St Petersburg Paradox describes a game where a fair coin is tossed repeatedly until a head is found. You win pounds if the first head occurs on the toss. How much should you pay to play this game?

Section 3: The discrete uniform distribution

You have already met some special distributions that occur so often that they are named, for example the binomial and the normal distributions. Another very common distribution is the discrete uniform distribution. This is a distribution in which all the whole numbers from $1$ to $n$ are equally likely and it is given the symbol $U(n)$. For example, $U(6)$ gives the distribution of the outcomes on a fair six-sided dice.

Key point 1.7 If a random variable $X$ follows a discrete uniform distribution $U(n)$, then $P(X = x) = \frac{1}{n}$ for $x = 1, 2, \dots, n$.

If you identify a random variable as following a uniform distribution you can immediately write down the expectation and variance.

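Key point 1.8 If a random variable $X$ follows a discrete uniform distribution $U(n)$ on the whole numbers $1$ to $n$, then $E(X) = \frac{n+1}{2}$ and $\mathrm{Var}(X) = \frac{n^2 - 1}{12}$.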

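Rewind You met the rules for working with indices in A Level Mathematics Student Book 1, Chapter 2.

You can prove the result in Key point 1.8 by using your knowledge of sums of powers of integers.

PROOF 3

If $X \sim U(n)$ then $P(X = x_i) = \frac{1}{n}$ and $E(X) = \sum_i x_i p_i$.
$x_i$ denotes the possible values of $X$, which are $1, 2, \dots, n$.

$E(X) = \sum_{i=1}^{n} i \times \frac{1}{n} = \frac{1}{n} \sum_{i=1}^{n} i$
$\frac{1}{n}$ is a constant so you can take it out of the sum.

$E(X) = \frac{1}{n} \times \frac{n(n+1)}{2} = \frac{n+1}{2}$
Use the result for the sum of the first $n$ positive integers: $\sum_{i=1}^{n} i = \frac{n(n+1)}{2}$.

$E(X^2) = \sum_{i=1}^{n} i^2 \times \frac{1}{n} = \frac{1}{n} \sum_{i=1}^{n} i^2$
All the values of $X$ need to be squared.

$E(X^2) = \frac{1}{n} \times \frac{n(n+1)(2n+1)}{6} = \frac{(n+1)(2n+1)}{6}$
Use the result for $\sum_{i=1}^{n} i^2$.

$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = \frac{(n+1)(2n+1)}{6} - \frac{(n+1)^2}{4} = \frac{(n+1)(n-1)}{12} = \frac{n^2 - 1}{12}$
Use the formula for variance.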
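The standard results for the discrete uniform distribution on the whole numbers 1 to $n$, $E(X) = \frac{n+1}{2}$ and $\mathrm{Var}(X) = \frac{n^2 - 1}{12}$, can be checked numerically for a range of $n$. This is a sketch rather than a proof, and the function names are chosen here for illustration.

```python
from fractions import Fraction

def uniform(n):
    """Distribution in which 1, 2, ..., n are equally likely."""
    return {x: Fraction(1, n) for x in range(1, n + 1)}

def mean_and_variance(dist):
    """Return (E(X), Var(X)) computed directly from the definitions."""
    mean = sum(x * p for x, p in dist.items())
    mean_of_squares = sum(x * x * p for x, p in dist.items())
    return mean, mean_of_squares - mean ** 2

# Compare the direct calculation with the closed-form results.
for n in range(1, 51):
    mean, var = mean_and_variance(uniform(n))
    assert mean == Fraction(n + 1, 2)
    assert var == Fraction(n * n - 1, 12)

print(mean_and_variance(uniform(6)))  # (Fraction(7, 2), Fraction(35, 12))
```

The final line gives the mean and variance of the score on a fair six-sided dice.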

In Section 2 you saw how to find the expectation and variance of a linear transformation of a discrete random variable. You can find the expectation and variance of a linear transformation of a discrete uniform distribution in the same way. WORKED EXAMPLE 1.7

The discrete random variable is equally likely to take any even value from

to

inclusive. Find

the variance of . where

The values of are where . So is a linear transformation of Apply Key point 1.8. Apply Key point 1.5.


. You can write these as .

,

EXERCISE 1C 1

Find the mean and variance of these distributions. a

i ii

b

i ii

2

A fair spinner has sides labelled

. Find the expected mean and standard deviation of the

results of the spinner. 3

A fair dice has sides labelled of throwing the dice.

4

a The random variable is equally likely to take any integer value between can be written as

. Find the expectation and standard deviation of the outcome

where

and . Show that this

.

b Hence find the variance of . 5

A string of

Christmas lights starts with a plug then contains a light every

from the plug.

One light is broken. Assuming all bulbs are equally likely to break, what are the expected mean and variance of the distance of the broken light from the plug? 6

The random variable is equally likely to take the value of any odd number between and inclusive. Find the variance of .

7

The discrete random variable takes values variance of .

8 9

and

. Find the expectation and

. Find .

A random number, , is chosen from the fractions Prove that

10

but

. Prove that

. is always divisible by

.

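Checklist of learning and understanding

The expectation of a discrete random variable $X$ is written $E(X)$ and calculated as $E(X) = \sum_i x_i p_i$.

The variance of a discrete random variable $X$ is written $\mathrm{Var}(X)$ and calculated as $\mathrm{Var}(X) = E(X^2) - [E(X)]^2$ where $E(X^2) = \sum_i x_i^2 p_i$.

The mode of $X$ is the value of $x$ associated with the largest probability.

The median, $M$, is any value which has $P(X \le M) \ge \frac{1}{2}$ and $P(X \ge M) \ge \frac{1}{2}$. If there are two possible values, you have to find their mean.

If $Y = aX + b$, then:
$E(Y) = aE(X) + b$
$\mathrm{Var}(Y) = a^2 \mathrm{Var}(X)$

A discrete uniform distribution models situations in which all discrete outcomes are equally likely. If $X \sim U(n)$, then $P(X = x) = \frac{1}{n}$ for $x = 1, 2, \dots, n$ and $E(X) = \frac{n+1}{2}$ and $\mathrm{Var}(X) = \frac{n^2 - 1}{12}$.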

Mixed practice 1 1

A discrete random variable has options.

and

. Find

. Choose from these

A B C D 2

A discrete random variable has a distribution defined by

for

. Find

. Choose from these options. A B C D 3

A drawer contains three white socks and five black socks. Two socks are drawn without replacement. is the number of black socks drawn. a Find the probability distribution of . b Find

4

.

A fair six-sided dice is thrown once. The random variable is calculated as half the result if the dice shows an even number, or one higher than the result if the dice shows an odd number. a Write down a table representing the probability distribution of . b Find c Find

. .

d Find the mode of . e Find the median of . 5

6

a

. Find the expectation and variance of .

b

is the discrete random variable that is equally likely to take any integer value between and . Find and .

c

is the discrete random variable that is equally likely to take any even value between and . Find and .

The random variable follows this distribution:

a Write down the median of . b If

, find the values of and .

c Hence find 7

and show that

.

is a discrete random variable with

and

.

. Find

and the

standard deviation of . 8

The random variable has expectation and variance . If and so that the expectation of is and the standard deviation is

9

, find the values of .

is a discrete random variable that can take the value or . a If

, find the standard deviation of .

b

. Find

and

.

10 A fair dice is thrown until a has been thrown or three throws have been made. is the discrete random variable representing the number of throws made. a Write down, in tabular form, the distribution of . b Find

.

c Find the median of . d The number of points awarded in the game, , is given by

. Find the variance of .

11 a A four-sided dice labelled with the values to is rolled twice. Write down, in a table, the probability distribution of , the sum of the two rolls. b Find

and

.

c A four-sided dice is rolled once and the score, , is twice the result. Find the mean and variance of . 12 The discrete random variable follows the distribution. is the expectation of and is the variance of . Find . 13

is a discrete random variable satisfying

for

.

Find, in terms of : a b c d

.

14 A discrete random variable has

. Find

in terms of

15 In a card game a pack of standard playing cards is used. The cards are dealt one at a time until the Queen of Spades (a unique card in the pack) is revealed. a What are the expected mean and standard deviation of the number of cards until the Queen of Spades is revealed? b In the game the player scores

points if the Queen of Spades is the th card revealed.

Find the expected number of points scored.

16 A box contains a large number of pea pods. The number of peas in a pod can be modelled by the random variable . The probability distribution of is shown here: or fewer

or more

a Two pods are picked randomly from the box. Find the probability that the number of peas in each pod is at most . b It is given that

.

i Determine the values of and . ii Hence show that

.

iii Some children play a game with the pods, randomly picking a pod and scoring points depending on the number of peas in the pod. For each pod picked, the number of points scored, , is found by doubling the number of peas in the pod and then subtracting . Find the mean and the standard deviation of . [© AQA 2014] 17 In a computer game, players try to collect five treasures. The number of treasures that Isaac collects in one play of the game is represented by the discrete random variable . The probability distribution of is defined by

a

i Show that

.

ii Calculate the value of iii Show that

. .

iv Find the probability that Isaac collects more than treasures. b The number of points that Isaac scores for collecting treasures is where

.

Calculate the mean and the standard deviation of . [© AQA 2014]

2 Poisson distribution

In this chapter you will learn how to:
use the conditions required for a Poisson distribution to model a situation
use the Poisson formula and calculate Poisson probabilities
calculate the mean, variance and standard deviation of a Poisson variable
use the distribution of the sum of independent Poisson distributions
carry out a hypothesis test of a population mean from a single observation from a Poisson distribution.

Before you start… A Level Mathematics Student Book 1, Chapter 21

You should know how to work with the binomial distribution.

1 Given that

A Level Mathematics Student Book 2, Chapter 20

You should know how to work with conditional probability.

2 Given that find .

Chapter 1

You should know how to find the expectation and variance of discrete random variables.

3 Find

A Level Mathematics Student Book 1, Chapter 22

You should know how to carry out hypothesis tests on the binomial distribution.

4 A coin is tossed times and tails are observed. Use a two-tailed test to determine at the significance level if this coin is biased.

and

, find

and

.

,

for this distribution:

What is the Poisson distribution?

When you are waiting for a bus there are two possible outcomes – at any given moment the bus either arrives or it doesn’t. You can try modelling this situation, using a binomial distribution, but it is not clear what an individual trial is. Instead you have an average rate of success – the number of buses that arrive in a fixed time period.

There are many situations in which you know the average rate of events within a given space or time, in contexts ranging from commercial, such as the number of calls through a telephone exchange per minute, to biological, such as the number of clover plants seen per square metre in a pasture. If the events can be considered independent of each other (so that the probability of each event is not affected by what has already been seen), the number of events in a fixed space or time interval can be modelled by the Poisson distribution.

Section 1: Using the Poisson model

The Poisson distribution is commonly used when these conditions hold:
the events occur singly (one at a time)
the events are independent of each other
the average rate of events (conventionally called lambda, $\lambda$) is constant.

If these conditions are satisfied then the discrete random variable ‘number of events, $X$’, follows the Poisson distribution with mean $\lambda$. You write this as $X \sim \mathrm{Po}(\lambda)$.

Tip If a question mentions average rate of success, or events occurring at a constant rate, you should use the Poisson distribution. If you can identify a fixed number of trials, you should use the binomial distribution.

The Poisson distribution can also be a useful approximate model for discrete random variables in other situations. However, if the stated conditions are not met, the suitability of the model can only be established by looking empirically at data.

Once you have identified that a situation follows a Poisson distribution, you can use facts about the probability of a certain number of events, the expected number of events and the expected variance.

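Key point 2.1 If a random variable $X$ follows a Poisson distribution $X \sim \mathrm{Po}(\lambda)$, then:

$P(X = x) = \dfrac{e^{-\lambda}\lambda^{x}}{x!}$ for $x = 0, 1, 2, \ldots$

$E(X) = \lambda$

$\mathrm{Var}(X) = \lambda$

These formulae will be given in your formula book.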
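The Poisson formulae can also be checked numerically. The following is a minimal Python sketch, not part of the course material: the function names `poisson_pmf` and `poisson_cdf` and the mean `2.4` are illustrative assumptions. It evaluates the probability formula directly and confirms that the mean and variance both come out equal to the parameter.

```python
from math import exp, factorial


def poisson_pmf(x: int, lam: float) -> float:
    """Return P(X = x) when X ~ Po(lam)."""
    return exp(-lam) * lam ** x / factorial(x)


def poisson_cdf(x: int, lam: float) -> float:
    """Return the cumulative probability P(X <= x) when X ~ Po(lam)."""
    return sum(poisson_pmf(k, lam) for k in range(x + 1))


lam = 2.4  # illustrative mean, not a value taken from the text

# The sums are infinite in theory; 60 terms leaves a negligible tail here.
probabilities = [poisson_pmf(k, lam) for k in range(60)]
mean = sum(k * p for k, p in enumerate(probabilities))
variance = sum(k * k * p for k, p in enumerate(probabilities)) - mean ** 2

print(f"P(X = 3)  = {poisson_pmf(3, lam):.4f}")
print(f"P(X <= 3) = {poisson_cdf(3, lam):.4f}")
print(f"mean = {mean:.4f}, variance = {variance:.4f}")
```

In an exam you would use your calculator's Poisson functions instead; the sketch only shows what those functions are computing.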

Common error Remember that $0! = 1$, not $0$.

Notice that the values of the mean and variance are equal for the Poisson distribution. This is something you look out for when determining if data are likely to fit a Poisson model, although this in itself is not sufficient to decide, because there are other distributions with this feature. A typical Poisson distribution is shown here:

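[Figure: bar chart of a typical Poisson distribution, showing probability p (vertical axis, 0 to 0.4) against x (horizontal axis, 0 to 6).]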

Notice that:
the mean rate does not have to be a whole number
the distribution is not symmetric
the graph, in theory, should continue on to infinite values of $x$, but the probabilities of very large values of $x$ get very small.

WORKED EXAMPLE 2.1

Recordable accidents occur in a factory at an average rate of every year, independently of each other. Find the probability that in a given year exactly recordable accidents occurred. Let be the number of recordable accidents in a year:

Define the random variable. Give the probability distribution. Write down the probability required, and calculate the answer.

The Poisson distribution is scalable. For example, if the number of butterflies seen on a flower in minutes follows a Poisson distribution with mean , then the number of butterflies seen on a flower in minutes follows a Poisson distribution with mean

, the number of butterflies seen on a flower in minutes follows

a Poisson distribution with mean , and so on.
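The scaling idea can be sketched in Python. The numbers below are hypothetical (a mean of 1.5 events per 10 minutes), and `scaled_mean` is an illustrative helper, not a standard function: the mean is simply multiplied by the ratio of the interval lengths before the Poisson formula is applied.

```python
from math import exp, factorial


def poisson_pmf(x: int, lam: float) -> float:
    """Return P(X = x) when X ~ Po(lam)."""
    return exp(-lam) * lam ** x / factorial(x)


def scaled_mean(mean: float, interval: float, new_interval: float) -> float:
    """Mean count over new_interval, given the mean count over interval."""
    return mean * new_interval / interval


mean_per_10_minutes = 1.5  # hypothetical rate, chosen only for illustration

for minutes in (10, 20, 5):
    lam = scaled_mean(mean_per_10_minutes, 10, minutes)
    print(f"{minutes} minutes: mean = {lam}, P(no events) = {poisson_pmf(0, lam):.4f}")
```

The sketch assumes the rate stays constant over the whole period, which is one of the conditions for the Poisson model.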

Tip Learn how to use your calculator to find Poisson probabilities, $P(X = x)$, and cumulative probabilities, $P(X \le x)$.

WORKED EXAMPLE 2.2

If there are, on average, buses per hour arriving at a bus stop, find the probability that more than buses arrive in minutes. Let be the number of buses in minutes:

Define the random variable. Give the probability distribution. Write down the probability required. To use your calculator you must relate this probability to .
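A 'more than' probability has to be rewritten in terms of a cumulative probability. Here is a hedged Python sketch of that step with made-up numbers (6 arrivals per hour, a 15 minute window, more than 2 arrivals); the names `poisson_cdf` and `prob_more_than` are assumptions for illustration.

```python
from math import exp, factorial


def poisson_cdf(x: int, lam: float) -> float:
    """Return P(X <= x) when X ~ Po(lam)."""
    return sum(exp(-lam) * lam ** k / factorial(k) for k in range(x + 1))


def prob_more_than(x: int, lam: float) -> float:
    """Return P(X > x) using the complement 1 - P(X <= x)."""
    return 1 - poisson_cdf(x, lam)


rate_per_hour = 6  # hypothetical average rate
minutes = 15       # hypothetical observation window
lam = rate_per_hour * minutes / 60  # rescale the mean to the window

p = prob_more_than(2, lam)
print(f"mean for {minutes} minutes = {lam}")
print(f"P(X > 2) = 1 - P(X <= 2) = {p:.4f}")
```

Note that 'more than 2' excludes 2 itself, so the cumulative probability subtracted is $P(X \le 2)$, not $P(X \le 1)$.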

The scalability of the Poisson distribution is a consequence of a more general result. If two independent variables both follow a Poisson distribution then so does their sum.

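Key point 2.2 If random variables $X$ and $Y$ follow independent Poisson distributions such that $X \sim \mathrm{Po}(\lambda)$ and $Y \sim \mathrm{Po}(\mu)$, and $Z = X + Y$, then $Z \sim \mathrm{Po}(\lambda + \mu)$.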
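The result about sums of independent Poisson variables can be checked numerically before reading the proof. This Python sketch uses illustrative means 1.2 and 2.3 (not values from the text) and a hypothetical helper `pmf_of_sum`, which adds up the probabilities of every way the two variables can total $z$ and compares the answer with a single Poisson distribution whose mean is the sum of the two means.

```python
from math import exp, factorial


def poisson_pmf(x: int, lam: float) -> float:
    """Return P(X = x) when X ~ Po(lam)."""
    return exp(-lam) * lam ** x / factorial(x)


def pmf_of_sum(z: int, lam: float, mu: float) -> float:
    """P(X + Y = z) for independent X ~ Po(lam), Y ~ Po(mu).

    Sums P(X = x) * P(Y = z - x) over every possible split of z.
    """
    return sum(poisson_pmf(x, lam) * poisson_pmf(z - x, mu) for x in range(z + 1))


lam, mu = 1.2, 2.3  # illustrative means

for z in range(6):
    direct = pmf_of_sum(z, lam, mu)
    combined = poisson_pmf(z, lam + mu)
    print(f"z = {z}: sum over splits = {direct:.6f}, Po({lam + mu}) gives {combined:.6f}")
```

A numerical match for a few values is not a proof, but it shows what the proof has to establish.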

Although you do not need to know the proof of the result in Key point 2.2, it does show an interesting link with the binomial expansion. PROOF 4

Consider all the different ways in which can take the value . If then If then , etc. Rewrite in sigma notation to keep the expression shorter. Use the formula for the Poisson distribution. You can take out factors of and the sum since they are constants.

from

You are close to having a binomial coefficient. Multiply by in the sum to get to this, but then you have to divide by too. Replace the factorials with a binomial coefficient. You can recognise the sum as a binomial expansion.

This is a Poisson distribution with mean

.

WORKED EXAMPLE 2.3

Hywel receives an average of emails and texts each hour. These are the only types of message he receives.

a Assuming that the emails and the texts follow independent Poisson distributions, find the probability that he receives more than messages in an hour.
b Explain why the assumption that the emails and texts form independent Poisson distributions is unlikely to be true.

a Use Key point 2.2 to combine the two Poisson distributions. You need to write the required probability in terms of a cumulative probability to use the calculator function.
b The rate of arrival of messages is unlikely to be constant – there will probably be more at some times of the day than at others. Within each distribution messages are not likely to be independent as they may occur as part of a conversation. The two distributions are also probably not independent of each other, as times when more emails arrive might be similar to times when more texts might arrive.

Common error Sometimes people think that the mean rate in a Poisson distribution has to be a whole number. This is not the case.

WORK IT OUT 2.1 The number of errors in a computer code is believed to follow a Poisson distribution with a mean of errors per lines of code. Find the probability that there are more than errors in lines of code. Which is the correct solution? Identify the errors made in the incorrect solutions.

A

If is the number of errors in

lines, then

.

B

If is the number of errors in

lines, then

.

More than errors in

lines is equivalent to more than error in

lines, so you need

C

EXERCISE 2A 1

State the distribution of the variable in each of these situations. a Cars pass under a motorway bridge at an average rate of per i The number of cars passing under the bridge in one minute. ii The number of cars passing under the bridge in b Leaks occur in water pipes at an average rate of . i The number of leaks in ii The number of leaks in c

worms are found on average in a i The number of worms found in a

Calculate these probabilities. a If i ii

:

per kilometre.

.

ii The number of worms found in a 2

seconds.

area of a garden. area of garden. by

area of garden.

second period.

b If

:

i ii c If

:

i ii d If

:

i ii e If

:

i ii 3

A random variable follows a Poisson distribution with mean . Copy and complete this table of probabilities, giving the results to three significant figures.

4

From a particular observatory, shooting stars are observed in the night sky at an average rate of one every five minutes. Assuming that this rate is constant and that shooting stars occur (and are observed) independently of each other, what is the probability that more than are seen over a period of one hour?

5

When examining blood from a healthy individual, under a microscope, a haematologist knows he should see on average four white blood cells in each high power field. Find the probability that blood from a healthy individual will show: a seven white blood cells in a single high power field b a total of white blood cells in six high power fields, selected independently.

6

A wire manufacturer is looking for flaws. Experience suggests that there are on average flaws per metre in the wire. a Determine the probability that there is exactly one flaw in one metre of the wire. b Determine the probability that there is at least one flaw in metres of the wire.

7

The random variable has a Poisson distribution with mean . Calculate: a b c d

8

The number of eagles observed in a forest in one day follows a Poisson distribution with mean . a Find the probability that more than three eagles will be observed on a given day. b Given that at least one eagle is observed on a particular day, find the probability that exactly two eagles are seen that day.

9

The random variable follows a Poisson distribution. Given that a the mean of the distribution b

.

, find:

10 Let be a random variable with a Poisson distribution, such that to estimate , giving your answer to three significant figures.

. Use technology

11 The number of emails Sarah receives per day follows a Poisson distribution with mean . Let be the number of emails received in one day and the number of emails received in a sevenday week. a Calculate

and

.

b Find the probability that Sarah receives emails every day in a seven-day week. c Explain why this is not the same as

.

12 The number of mistakes a teacher makes while marking homework has a Poisson distribution with a mean of errors per piece of homework. a Find the probability that there are at least two marking errors in a randomly chosen piece of homework. b Find the most likely number of marking errors occurring in a piece of homework. Justify your answer. c Find the probability that in a class of students fewer than half of them have errors in their marking.

13 A car company has two limousines that it hires out by the day. The number of requests per day has a Poisson distribution with mean requests per day. a Find the probability that neither limousine is hired on any given day. b Find the probability that some requests have to be denied on any given day. c If each limousine is to be used equally, on how many days in a period of days would you expect a particular limousine to be in use?

14 The random variable follows a Poisson distribution with mean . Given that , find the exact value of .

15 The random variable follows a Poisson distribution with mean . a Show that . b Given that , find the value of such that .

Section 2: Using the Poisson distribution in hypothesis tests

If it is known that a variable follows a Poisson distribution you can use data to make inferences about the value of the mean. To do this you use a hypothesis test. First, you need to work out the $p$-value: the probability of getting the observed result or more extreme, assuming that the null hypothesis is true. You can then compare this to the significance level to determine whether or not to reject the null hypothesis.

For a one-tailed test, compare the calculated probability to the significance level directly. For a two-tailed test, you usually find the probability of one tail and compare it to half of the significance level.

WORKED EXAMPLE 2.4

The number of telephone calls received by a company follows a Poisson distribution. Over long experience it is thought that the mean is calls per hour. After a redesign of their website it is found that they got calls in an hour. Test at the significance level if this provides significant evidence of a change in the mean number of calls per hour. If

It is a two-tailed test because you are looking for a change in either direction.

Calculate the probability of the observed outcome or more extreme, assuming that $H_0$ is true.

This is more than half of the significance level, so do not reject $H_0$.

Compare the upper tail to half of the significance level, since this is a two-tailed test. If you want the $p$-value, double the probability.

There is insufficient evidence to suggest that the mean number of calls has changed from per hour.

Write a conclusion within the context of the question.
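The testing procedure can be summarised as a short Python sketch. The figures used (a null mean of 7, an observation of 11, a 5% significance level) are hypothetical stand-ins and not the values from the worked example, and `poisson_test` is an illustrative helper. For a two-tailed test it follows the approach described in this section: find one tail and compare it with half of the significance level.

```python
from math import exp, factorial


def poisson_cdf(x: int, lam: float) -> float:
    """Return P(X <= x) when X ~ Po(lam); zero for negative x."""
    if x < 0:
        return 0.0
    return sum(exp(-lam) * lam ** k / factorial(k) for k in range(x + 1))


def poisson_test(observed: int, lam0: float, alpha: float, tail: str):
    """Return (tail probability, reject H0?) for a single Poisson observation.

    tail is "upper" (H1: mean > lam0), "lower" (H1: mean < lam0)
    or "two" (H1: mean != lam0).
    """
    upper = 1 - poisson_cdf(observed - 1, lam0)  # P(X >= observed)
    lower = poisson_cdf(observed, lam0)          # P(X <= observed)
    if tail == "upper":
        return upper, upper < alpha
    if tail == "lower":
        return lower, lower < alpha
    if tail == "two":
        # Use the tail on the same side as the observation, against alpha / 2.
        tail_probability = upper if observed > lam0 else lower
        return tail_probability, tail_probability < alpha / 2
    raise ValueError("tail must be 'upper', 'lower' or 'two'")


# Hypothetical data: H0 mean 7, observed 11, two-tailed test at 5%.
tail_probability, reject = poisson_test(11, 7, 0.05, "two")
print(f"tail probability = {tail_probability:.4f}")
print("reject H0" if reject else "do not reject H0")
```

The conclusion should still be written in the context of the question; the code only supplies the probability and the comparison.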

WORK IT OUT 2.2 is the random variable ‘number of absences per day in a school’. It is thought to follow a Poisson distribution with mean . Following a change in the registration system, the number of absences over five days was . Test at the significance level if the change in the registration system has affected the average rate of absences. Which is the correct solution? Identify the errors made in the incorrect solutions.

A

, Under

,

.

If there are

absences over five days, this is a rate of eight per day, so you need . This is more than so you cannot reject . The average rate is absences per day.

B

, Let be the number of absences in five days. Under

,

.

. Since this is a two-tailed test you must double this to get a -value of . This is less than the significance level so you can reject . There is evidence at the significance level that the average rate has changed from absences per day.

C

, .

so reject

.

EXERCISE 2B 1

Conduct these hypothesis tests based on the given observation. You can assume that the data follow a Poisson distribution. Use the a

significance level.

i ii

b

i ii

c

i ii

d

i ii

2

Find the critical region (the set of values for which the null hypothesis is rejected) at the significance level if: a

i ii

b

i ii

c

i ii

3

.

It is known that a sample of radium emits alpha particles per millisecond. A second sample of the same size and shape emits alpha particles in a millisecond. Test at the significance level whether this sample has the same emission rate as radium.

4

a Over a long period it is believed that the average number of cars travelling past a traffic light follows a Poisson distribution, with cars per minute. After some roadworks, it is thought that the number of cars passing is lower. In a one-minute observation only cars pass the traffic light. Find the $p$-value of this observation and hence decide at the significance level if the roadworks have caused a decrease in traffic levels.

b Suggest two reasons why a Poisson distribution might not be appropriate. 5

a The number, , of accidents per month on a road is studied. The mean number of accidents per month is with standard deviation . Explain why this supports the suggestion that the number of accidents follows a Poisson distribution. b Assume that does indeed follow a Poisson distribution. It is thought that adding a speed camera will reduce the average number of accidents from . In the month after the camera was added there were accidents. Test at the significance level if this is evidence of a reduction in the average number of accidents.

6

The numbers of mistakes in nine pieces of a student’s homework are shown:

a Estimate the mean and standard deviation of the number of mistakes, based upon these data. b Hence explain why the Poisson distribution is a plausible model.

c After a study skills session the student produced a piece of work with mistakes. You can assume that the number of mistakes does follow a Poisson distribution. Test at the significance level if the mean number of mistakes is lower than the value found in part a. 7

The number of bees visiting a flower is thought to follow a Poisson distribution with mean per minute. a Describe in context two conditions that must be met for the Poisson distribution to be an appropriate model for the arrival of bees. b After a new hedge has been planted it is thought that the number of bees arriving will increase. In minutes bees visit the flower. Test at the significance level if there is evidence that the number of bees has increased.

8

The number of leaks in a pipe is known to follow a Poisson distribution with mean leaks per km. After the water pressure was changed, an inspection of of pipe revealed leaks. Has there been a change in the mean number of leaks? Test, using the significance level.

9

It is known from long experience that earthquakes occur in a particular town once every four months. Environmentalists believe that a change in the way oil is extracted from a well will increase the number of earthquakes. They monitor the activity for one year and six earthquakes occur. a Test at the significance level whether the number of earthquakes has increased from the long-term trend, stating your $p$-value. b They continue to monitor earthquake activity and the following year six earthquakes also occur. Test at the significance level whether the number of earthquakes has increased from the long-term trend, stating your $p$-value.

10 The discrete random variable follows . A single observation is used to test against . What is the smallest value of for which will be rejected at the significance level when the observation is ?

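Checklist of learning and understanding

The Poisson distribution is commonly used when these conditions hold:
the events occur singly (one at a time)
the events are independent of each other
the average rate of events (conventionally called $\lambda$) is constant.

If $X \sim \mathrm{Po}(\lambda)$, then:
$P(X = x) = \dfrac{e^{-\lambda}\lambda^{x}}{x!}$ for $x = 0, 1, 2, \ldots$
$E(X) = \lambda$
$\mathrm{Var}(X) = \lambda$

If $X \sim \mathrm{Po}(\lambda)$, $Y \sim \mathrm{Po}(\mu)$ and $Z = X + Y$, where $X$ and $Y$ are independent, then $Z \sim \mathrm{Po}(\lambda + \mu)$.

You can use the Poisson distribution to conduct a hypothesis test to see if it suggests that the mean rate has changed.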

Mixed practice 2 1

The number of complaints in a shop in any hour while it is open follows a Poisson distribution with mean per hour. Find the probability that in a three-hour shift there are fewer than complaints, giving your answer to three significant figures. Choose from these options. A B C D

2

A random variable follows a Poisson distribution with standard deviation . Find three significant figures. Choose from these options.

to

A B C D 3

The random variable is the number of robins that visit a bird table each hour. The random variable is the number of thrushes that visit a bird table each hour. These are the only types of bird that visit the table. It is believed that

and

.

is the random variable ‘Number of birds visiting the table each hour’. a Stating a necessary assumption, write down the distribution of . b Find the probability that no birds visit the table in one hour. c Find 4

is the random variable ‘number of burgers ordered per hour in a restaurant’. It is thought that . a Write down two conditions required for the Poisson distribution to model data. b Find c During a ‘happy hour’ special offer the number of burgers sold increased to . Test at the significance level whether the special offer has increased the average rate of burgers ordered from .

5

Salah is sowing flower seeds in his garden. He scatters seeds randomly so that the number of seeds falling on any particular region is a random variable with a Poisson distribution, with mean value proportional to the area. He intends to sow fifty thousand seeds over an area of .
a Calculate the expected number of seeds falling on a region.
b Calculate the probability that a given area receives no seeds.

6

a If write down and .

b Hence find 7

where is the expected standard deviation of .

Seven observations of the random variable , the number of power surges per day in a power cable, are shown:

a Estimate the mean and standard deviation of , based upon these observations. b Use your answer to part a to explain why the Poisson distribution is a plausible model for . c When a new brand of cable is used it is observed that there are power surges in five days. Does this suggest that the new brand has a different average rate of power surges to your answer in part a? Use a significance level.

8

A receptionist at a hotel answers on average phone calls a day.

a Find the probability that on a particular day she will answer more than phone calls.
b Find the probability that she will answer more than phone calls every day during a five-day week.

9

During the month of August in Bangalore, India, there are on average rainy days.

a Find the probability that there are fewer than seven rainy days during the month of August in a particular year. b Find the probability that, in ten consecutive years, exactly five have fewer than seven rainy days in August. 10 The random variable follows a Poisson distribution. Given that

, find:

a the mean of the distribution b

.

11 a Given that b

and

find the value of .

. Find the possible values of such that

c If

and

.

, express in terms of .

12 A geyser erupts randomly. The eruptions at any given time are independent of one another and can be modelled, using a Poisson distribution with mean per day.
a Determine the probability that there will be exactly one eruption between and .
b Determine the probability that there are more than eruptions during one day.
c Determine the probability that there are no eruptions in the minutes Naomi spends watching the geyser.
d Find the probability that the first eruption of a day occurs between and .
e If each eruption produces litres of water, find the expected volume of water produced in a week.
f Determine the probability that there will be at least one eruption in at least six out of the eight hours the geyser is open for public viewing.

g Given that there is at least one eruption in an hour, find the probability that there is exactly one eruption.
13 In a particular town, rainstorms occur at an average rate of two per week and can be modelled, using a Poisson distribution.
a What is the probability of at least eight rainstorms occurring during a particular four-week period?
b Given that the probability of at least one rainstorm occurring in a period of complete weeks is greater than , find the least possible value of .
14 Patients arrive at random at an emergency room in a hospital at the rate of per hour throughout the day.
a Find the probability that exactly four patients will arrive at the emergency room between and .
b Given that fewer than patients arrive in one hour, find the probability that more than arrive.

15 It is thought that . A single observation of takes the value . Does this provide evidence at the significance level that the average rate has decreased? Support your answer by writing down the $p$-value of the observation.
16 Based on long experience a gardener knows that birds tend to arrive in his garden at an average rate of per hour.
a State two assumptions required to model the birds’ arrival, using a Poisson distribution. Are these reasonable assumptions?
b If these assumptions do hold, find the probability of observing more than birds in an hour.
The gardener plants some new flowers. He wants to know if this changes the birds’ behaviour.
c If is the true average rate of arrival of birds after the new flowers have been planted, write down suitable null and alternative hypotheses for answering the gardener’s question.
d If birds are observed in an hour, what is the conclusion of the test at significance?

17 A water company believes that pipes have leaks per km, following a Poisson distribution. After increasing water pressure they are concerned that there are more leaks. They find leaks in a section of pipe. Does this provide significant evidence at the significance level to suggest that the mean number of leaks has increased?
18 A shop has four copies of the magazine Ballroom Dancing delivered each week. Any unsold copies are returned. The demand for the magazine follows a Poisson distribution with mean requests per week. a Calculate the probability that the shop cannot meet the demand in a given week. b Find the most probable number of magazines sold in one week. c Find the expected number of magazines sold in one week. d Determine the smallest number of copies of the magazine that should be ordered each week to ensure that the demand is met with a probability of at least .
19 Annette is a senior typist and makes an average of mistakes per letter. Bruno is a trainee typist and makes an average of mistakes per letter. Assume that the number of mistakes made by any typist follows a Poisson distribution. a Calculate the probability that on a particular letter: i Annette makes exactly three mistakes ii Bruno makes exactly three mistakes. b Annette types of all the letters.

i Find the probability that a randomly chosen letter contains exactly three mistakes. ii Given that a letter contains exactly three mistakes, find the probability that it was typed by Annette. c Annette and Bruno type one letter each. Given that the two letters contain a total of three mistakes, find the probability that Annette made more mistakes than Bruno.
20 The number of worms in a square metre in a forest satisfies the distribution . A scientist samples many square-metre areas but only records areas where some worms are observed. What is the mean value of her observations?
21 Mohammed is offered a week’s trial with a view to being permanently employed to service bicycles in Robyn’s bicycle shop. The number of bicycles brought in to be serviced can be modelled by a Poisson distribution with mean per day. a Find the probability that, on Mohammed’s first day, the number of bicycles brought in to be serviced is: i or fewer ii more than iii exactly . b Before starting work, Mohammed told his mother that he hoped that, during his first week ( days), the number of bicycles brought in to be serviced would be:
at least , otherwise Robyn might decide that there was not enough work to justify permanently employing him
not more than , so that he would not have to work too hard.
Find the probability that Mohammed’s hopes will be met. [© AQA 2011]
22 At a Roman site, coins are found at an average rate of coin per . Assume that the number of coins found can be modelled by a Poisson distribution. a Determine the probability that, in an area of : i at most coins are found ii exactly coins are found. b Determine the probability that more than coins are found in an area of . c Bronze brooches are less common than coins at this site, and are found at an average rate of brooch per . The number of these brooches found is independent of the number of coins found. Assume that the number of bronze brooches found can also be modelled by a Poisson distribution. i Determine the probability that the total number of coins and bronze brooches found in an area of is at least . ii Sometimes, Romans buried a hoard of several coins together. They did not usually bury several bronze brooches together. State, with a reason, which of the number of coins found or the number of bronze brooches found is likely to be better modelled by a Poisson distribution. [© AQA 2013]

3 Chi-squared tests

In this chapter you will learn how to:
check if two variables are dependent.
If you are following the A Level course, you will also learn how to:
use Yates’ correction, a way of improving the method for checking if two variables are dependent.

Before you start… A Level Mathematics Student Book 1, Chapter 22

You should know how to conduct hypothesis tests.

1 A coin is tossed times and heads are observed. Does this provide evidence (at a significance level) that the coin is biased towards heads?

A Level Mathematics Student Book 1, Chapter 21

You should know how to calculate probabilities for independent events.

2 The probability of Andrew scoring a goal is and the probability of Helen scoring a goal is . Given that these outcomes are independent, what is the probability that they both score a goal?

A Level Mathematics Student Book 2, Chapter 3

You should know how to evaluate expressions including the modulus function.

3 Evaluate

.

Independent? One common question you can ask in a statistical situation is whether or not two variables are dependent – for example, do future earnings depend on A Level choices? In this chapter you will look at a statistical test to answer this type of question.

Did you know? You might have already met a test to see if two variables are correlated. This is related to independence, but it is not quite the same. For example, the scatter graph shows the results of a psychology experiment where people are asked to estimate the size of an angle, and the time taken for them to do so is measured. The two variables are not correlated (there is no linear trend) but they are not independent – people who spend longer making the estimate seem to have more tightly clustered estimates.

[Figure: scatter graph of estimated angle (degrees) against time (s).]

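It turns out that if two variables are independent then they will definitely be uncorrelated, but the reverse is not true. You can write: $\text{independent} \Rightarrow \text{uncorrelated}$, but $\text{uncorrelated} \nRightarrow \text{independent}$.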

Section 1: Contingency tables

In this section you will try to design a hypothesis test that decides whether two variables are dependent:
$H_0$: The two variables are independent.
$H_1$: The two variables are dependent.

Tip Choose $H_0$ to be that the variables are independent because you can use that to calculate

expected values. You cannot use the fact that two variables are dependent to calculate expected values unless you are given more information about what that dependence is.

To describe the two variables you use contingency tables that list how often each combination of variables occurs. For example, this table illustrates the results of a survey of young families. The observed value in cell $i$ is called $O_i$.

[Table: observed frequencies, number of bedrooms (rows) against number of children (columns).]

This is a contingency table. Notice that each cell contains actual frequencies rather than probabilities or proportions. You are going to need a way of measuring how far this is away from the numbers you would expect if the two variables were independent. To do this you look at the totals.

[Table: the same observed frequencies, with row and column totals.]

Based on the sample, the probability of having bedrooms or fewer is and the probability of having children is . If the two variables are independent, the probability of both occurring is the product of these probabilities, so the probability of two children and two bedrooms is . In a sample of size you would then expect there to be families with two children and two bedrooms. The expected frequency in cell $i$ is called $E_i$.

Tip Expected frequencies do not have to be whole numbers.

Key point 3.1 In a contingency table, the expected frequency in cell $i$ is
$E_i = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$
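The calculation in Key point 3.1 can be sketched in a few lines of Python. The function name and the observed frequencies here are assumptions made up for illustration; they are not the survey data.

```python
def expected_frequencies(observed):
    """Return the table of expected frequencies under independence."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    grand_total = sum(row_totals)
    # Each cell: row total x column total / grand total.
    return [[r * c / grand_total for c in column_totals] for r in row_totals]

# Hypothetical observed frequencies (made up for illustration only).
observed = [[10, 20, 30],
            [20, 20, 20]]
expected = expected_frequencies(observed)
print(expected)
```

The row and column totals of the result match those of the observed table, which is a useful check on the arithmetic.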

You can create another contingency table containing all the expected frequencies.

[Table: expected frequencies, number of bedrooms (rows) against number of children (columns)]

There are several possible measures of the difference between observed and expected values. The measure you need to know is called chi-squared ($\chi^2$).

Tip Notice that the row totals and the column totals are the same as for the original data. This is a useful check.

Key point 3.2 The chi-squared value that gives the difference between the observed values, $O_i$, and the expected values, $E_i$, is
$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$
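As a minimal sketch of Key point 3.2, the sum can be taken over every cell of the two tables. The function name and both tables are assumptions made up for illustration.

```python
def chi_squared(observed, expected):
    """Sum (O - E)^2 / E over every cell of the table."""
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

# Hypothetical tables (made up for illustration only).
observed = [[10, 20, 30], [20, 20, 20]]
expected = [[15.0, 20.0, 25.0], [15.0, 20.0, 25.0]]
statistic = chi_squared(observed, expected)
print(round(statistic, 3))
```

Cells where the observed and expected values agree contribute nothing to the total, so the statistic grows only where the data depart from independence.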

Large values of $\chi^2$ indicate a big difference between observed and expected data. Is the value for the given data large enough to conclude that number of bedrooms and number of children are not independent? To decide, you need to know the distribution of $\chi^2$ to see how likely the observed value is. This distribution has a single parameter that depends on the number of cells in the table and that, for historical reasons, is called the degrees of freedom. It is often given the symbol $\nu$ (lowercase Greek letter ‘nu’) or DF.

Key point 3.3 In an $m$ by $n$ contingency table, the number of degrees of freedom is
$\nu = (m - 1)(n - 1)$

If the null hypothesis is true (that the variables are independent), $\chi^2$ approximately follows the chi-squared distribution with $\nu$ degrees of freedom, written as the $\chi^2_\nu$ distribution. However, this approximation is only valid if all the expected frequencies in the contingency table are greater than 5.

Key point 3.4 If the null hypothesis is true and the expected value $E_i > 5$ for all $i$, then the chi-squared value is approximately distributed as
$\sum \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_\nu$
This will be given in your formula book.

Tip Only expected values need to be above 5. Observed values are irrelevant.

In the survey results, not all of the expected values are above 5. When this happens you need to combine some rows or columns in a way that is sensible in context. The most obvious way with the example given is to combine the ‘2 children’ group with the ‘3 or more children’ group.

[Table: observed frequencies after combining the two groups, number of bedrooms (rows) against number of children (columns)]

You can then create the new contingency table of expected values.

Tip You can find the expected values by adding up the corresponding expected values from the original table. You don’t have to recalculate the frequencies, using Key point 3.1.
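The Tip about adding expected values can be checked with a short sketch. The function name and the table of expected frequencies are assumptions made up for illustration; the last column is the one with values that are too small.

```python
def combine_columns(table, first, second):
    """Merge column `second` into column `first` by adding cell values."""
    merged = []
    for row in table:
        new_row = [value for index, value in enumerate(row) if index != second]
        # After removing `second`, later columns shift one place left.
        target = first if first < second else first - 1
        new_row[target] += row[second]
        merged.append(new_row)
    return merged

# Hypothetical expected frequencies (made up for illustration only).
expected = [[12.0, 9.0, 6.0, 3.0],
            [8.0, 6.0, 4.0, 2.0]]
combined = combine_columns(expected, 2, 3)
print(combined)
```

Recalculating from the new column totals with Key point 3.1 gives the same values, which is why adding the original expected values is enough.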

[Table: expected frequencies after combining the two groups, number of bedrooms (rows) against number of children (columns)]

You can find the contributions of each cell to the total chi-squared value.

[Table: contribution of each cell, $\frac{(O_i - E_i)^2}{E_i}$, number of bedrooms (rows) against number of children (columns)]

Totalling these contributions gives the value of $\chi^2$, and $\nu = 4$. You can compare this value to critical values given in the formula book.

The highlighted value, 9.488, gives the critical value for a test at the 5% significance level with four degrees of freedom. The column is headed 0.95 because 95% of chi-squared values with 4 degrees of freedom are below this value, therefore 5% are higher. The calculated value of $\chi^2$ is higher than 9.488, so you reject the null hypothesis and conclude that number of bedrooms and number of children are dependent variables.

The contingency table showing the contributions of each cell to the $\chi^2$ sum shows some cells have a much larger value of $\frac{(O_i - E_i)^2}{E_i}$ than others.

You can use this to analyse which combinations of the variables are very different from the expected frequencies. This can give you further insight into what is happening in the situation being investigated. In the example you can see that two or more children in houses with four or more bedrooms makes the largest contribution to the $\chi^2$ value. You could interpret this as meaning that large families preferring large houses is a big factor in why number of children and number of bedrooms are dependent.

Tip Some calculators allow you to do the chi-squared test automatically and provide you with the $p$-value. This alternative approach is acceptable.
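The $p$-value a calculator reports is the probability of a chi-squared value at least as large as the one calculated. A rough sketch of how it could be computed with only the Python standard library is shown here; the function name is an assumption, and the series used is a standard one for the incomplete gamma function rather than anything specific to this course.

```python
import math

def chi_squared_upper_tail(x, df):
    """P(chi-squared with df degrees of freedom exceeds x), for x > 0."""
    a, half_x = df / 2, x / 2
    # Series for the regularised lower incomplete gamma function P(a, x/2).
    term = 1 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= half_x / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(half_x) - half_x - math.lgamma(a))
    return 1 - lower

# 9.488 is the critical value quoted in the text for four degrees of freedom.
p_value = chi_squared_upper_tail(9.488, 4)
print(round(p_value, 3))
```

Comparing the $p$-value with the significance level gives the same decision as comparing the calculated $\chi^2$ with the critical value.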

WORKED EXAMPLE 3.1

Determine at the significance level whether or not the colour of a car sold by a dealership is independent of the gender of the purchaser.

[Table: observed frequencies, car colour (Blue, Red, Green, Silver) against gender (Male, Female), with totals]

$H_0$: Gender and car colour are independent.
$H_1$: Gender and car colour are not independent.
Set up hypotheses.

The expected values are:

[Table: expected frequencies, car colour (Blue, Red, Green, Silver) against gender (Male, Female)]

Find expected values, using Key point 3.1. Check that all the expected frequency values are above 5, which they are in this case. Also check that the row and column totals match the row and column totals in the original table of observed values.

Find the chi-squared value, using Key point 3.2.

Find the number of degrees of freedom, using Key point 3.3: $\nu = (4 - 1)(2 - 1) = 3$.

The critical value is more than the calculated value of $\chi^2$.
Use the formula book to find the critical value. Remember that $\chi^2$ is a measure of distance between observed and expected values, so your calculated distance is less than the critical distance.

Therefore do not reject $H_0$; the data set is consistent with gender and car colour being independent.

Worked example 3.2 shows how to deal with combining groups.

WORKED EXAMPLE 3.2

This contingency table shows the favourite sports played by different age groups in a sample at a sports centre.

[Table: observed frequencies, favourite sport (Soccer, Basketball, Swimming, Tennis) against age group]

a Test at the significance level whether preferred sport and age are independent, showing the contributions of each cell.
b Interpret your results in context.

a $H_0$: Age and preferred sport are independent.
$H_1$: Age and preferred sport are not independent.
Set up hypotheses.

The expected values are:

[Table: expected frequencies, favourite sport (Soccer, Basketball, Swimming, Tennis) against age group]

Find expected values, using Key point 3.1. Check that all the expected values are above 5, which they are not in this case. Notice that the row and column totals are the same as those in the observed data table.

Several cells in one age group have an expected frequency less than 5, so combine this column with another column:

[Table: expected frequencies after combining the two age groups]

The most obvious choice is to combine the two age groups concerned.

And the corresponding observed values are:

[Table: observed frequencies after combining the two age groups]

Use Key point 3.2.

The contributions of each cell are:

[Table: contribution of each cell to $\chi^2$, favourite sport against age group]

The critical value is less than the calculated value of $\chi^2$. Therefore reject $H_0$; preferred sport and age are dependent.
Compare with the critical value from the formula book and conclude.

b The main contributions to $\chi^2$ come from basketball and swimming. It appears swimming is more popular than would be expected amongst younger children, whilst basketball is more popular than would be expected amongst older children.
Use the contributions of each cell to find the most important factors.

You might have to construct a contingency table from given information.

WORKED EXAMPLE 3.3

A biologist claims that the mobility of fish is dependent on their breeding ground. A sample of fish was taken, with equal numbers from each of the two breeding grounds, Ellesmere and Duxbury, studied. A test was used to classify the fish as sedentary, normal or highly mobile. In Ellesmere half of the fish were classified as highly mobile and one-fifth as normal. In Duxbury one quarter of the fish were classified as normal and of the Duxbury fish were classified as sedentary. Test the biologist’s claim at the significance level.

$H_0$: Mobility and breeding ground are independent.
$H_1$: Mobility and breeding ground are dependent.
First write the null and alternative hypotheses.

Observed values:

[Table: observed frequencies, breeding ground (Ellesmere, Duxbury) against mobility (Sedentary, Normal, Highly mobile)]

Create a contingency table. Each location has half of the fish in the sample. Turn the given proportions in each location into frequencies and use the fact that each row adds up to the number of fish in that location to complete the table.

Expected values:

[Table: expected frequencies, breeding ground (Ellesmere, Duxbury) against mobility (Sedentary, Normal, Highly mobile)]

You can use Key point 3.1 to calculate the expected values. All expected values are larger than 5 so no combining is required.

Find the value of $\chi^2$, using Key point 3.2.

Find the number of degrees of freedom, using Key point 3.3: $\nu = (2 - 1)(3 - 1) = 2$.

The calculated value of $\chi^2$ is less than the critical value, so do not reject $H_0$. There is no significant evidence that mobility depends upon breeding ground.

EXERCISE 3A 1

Test these contingency tables to see if the two variables are dependent at the significance level. State carefully the number of degrees of freedom and the value of $\chi^2$. In part b, combine suitable columns to make all expected frequencies greater than 5. a

i

Exam grade or

, or

or

Mr Archer Teacher

Ms Baker Mrs Chui

ii

Time working hours

hours

hours

Male Gender Female b

i

Age

Social media followers

ii

Cost

Red Colour Green Blue 2

A Physics teacher wants to investigate whether or not there is any association between the Physics grade her students get and the Mathematics course they study. She collects data for a random sample of students over several years. The results are given in this table. or lower Further Maths Maths AS or

or

or

No Maths
a State the null and alternative hypotheses.
b Calculate the expected frequencies.
c Calculate the value of $\chi^2$ and write down the number of degrees of freedom.
d Test at the significance level whether the Physics grade is independent of the Mathematics course studied. Show clearly how you arrived at your conclusion. Interpret your results in context.

3

A random sample of

books was taken from a library. One third of the books were fiction and

the rest were non-fiction. The reading level of each book was assessed as elementary, moderate or advanced. of the non-fiction books were classified as advanced and were classified as elementary. One quarter of the fiction books were classified as elementary. There were the same number of moderate fiction and moderate non-fiction books. Conduct an appropriate test to determine, at the significance level, if there is evidence to suggest that reading level depends on whether a book is fiction or non-fiction. 4

James wanted to know whether people being late, early or on time to school depends upon their mode of transport. This partially filled contingency table shows his results, based on asking students. Early

On time

Late

Total

Walk Car Other Total
a Copy and complete the contingency table.
b Calculate the value of $\chi^2$ for this data.
c Conduct an appropriate test at the significance level to answer James’ question.
d What assumptions does James have to make in conducting this test?

5

The owner of a beauty salon wants to find out whether there is any association between the number of times in a year people visit the salon and the amount of money they spend on each visit. He collects these data for a random sample of clients. Number of visits

Amount spent per visit £

Is there evidence, at the

level of significance, that there is some association between the

number of visits and the amount of money spent? Interpret your result in context. 6

A drugs manufacturer claims that the speed of recovery from a certain illness is higher for people who take a higher dose of their new drug. They provide these data for a sample of patients. No drug taken days days days

Single dose

Double dose

days

Test whether there is evidence for the manufacturer’s claim at the level of significance. Interpret your results in context.

7

A company is investigating their gender equality policies. As a part of this investigation they collect data on salaries, to the nearest pound, for a random sample of employees, as shown in this table.

Male

Female

£ £ £ £ £

a Assuming salary is independent of gender, calculate the corresponding expected frequencies. b Carry out a suitable test to determine whether salary is independent of gender. State and justify your conclusion at the 8

a Prove that b A

level of significance. , where

contingency table has

. . Find the largest possible sample size.

c Find the largest sample size that will produce a significant result in the chi-squared test at the significance level, assuming that all cells have an expected frequency of at least 5. 9

A researcher believes that these percentages are the true proportions of people voting for different political parties based on their gender: Male

Female

Party A Party B Party C
a Show that gender and voting intention are dependent.
b Show that if a sample of size follows these proportions, it will not provide a significant result in the chi-squared test at significance.
c Calculate an estimate of the smallest sample size required to find significant evidence that gender and preferred political party are dependent, using a significance level.
d Explain why your answer to part c is only an estimate.

10 This contingency table has some blank spaces.

First factor

A Second factor

B C

Total a Copy the table and fill in the blanks.

Total

b Hence explain why it can be said that this contingency table has degrees of freedom. 11 Explain why the formula for chi-squared contains: a squaring before summing b dividing by the expected value.

Section 2: Yates’ correction
It turns out that when the number of degrees of freedom $\nu = 1$ (i.e. a $2 \times 2$ contingency table) then the approximation that $\sum \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_\nu$ is not very good. To improve upon this you use an alternative formula, called Yates’ correction.

Key point 3.5 Yates’ correction:
$\chi^2_{\text{Yates}} = \sum \frac{(|O_i - E_i| - 0.5)^2}{E_i}$
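To see the effect of Yates’ correction, here is a minimal sketch comparing the corrected and uncorrected statistics for one $2 \times 2$ table. The function name and the frequencies are assumptions made up for illustration; 3.841 is the standard 5% critical value for one degree of freedom.

```python
def chi_squared_yates(observed, expected):
    """Sum (|O - E| - 0.5)^2 / E over every cell of a 2 x 2 table."""
    return sum((abs(o - e) - 0.5) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for o, e in zip(obs_row, exp_row))

# Hypothetical 2 x 2 tables (made up for illustration only).
observed = [[30, 20], [20, 30]]
expected = [[25.0, 25.0], [25.0, 25.0]]

corrected = chi_squared_yates(observed, expected)
uncorrected = sum((o - e) ** 2 / e
                  for obs_row, exp_row in zip(observed, expected)
                  for o, e in zip(obs_row, exp_row))
print(round(corrected, 2), round(uncorrected, 2))
```

In this made-up case the uncorrected value is above 3.841 but the corrected value is below it, so the correction changes the conclusion of the test.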

Rewind You met the modulus function, $|x|$, in A Level Mathematics Student Book 2, Chapter 3.

WORKED EXAMPLE 3.4

This contingency table shows the results of people in a driving test, along with their gender.

[Table: observed frequencies, driving test result (Pass, Fail) against gender (Male, Female)]

Test at the significance level if the outcome of the test is independent of gender.

$H_0$: Gender and result are independent.
$H_1$: Gender and result are not independent.
Set up hypotheses.

The expected values are:

[Table: expected frequencies, driving test result (Pass, Fail) against gender (Male, Female)]

Find the expected values, using Key point 3.1.

Use Key point 3.5.

$\nu = 1$, so use the critical value for one degree of freedom.

You cannot reject the null hypothesis; there is no significant evidence that the test outcome depends on gender.

WORK IT OUT 3.1 Test at the

significance level if there is any association between teacher and test result: Teacher

Mr A

Total

Mrs B

Pass

Result

Fail Total

Which is the correct solution? Identify the errors made in the incorrect solutions. A

If there is no association then each cell will be the same, so the expected values are: Teacher Mr

Mrs

Pass Result

Fail

So The critical value when

is

, so you can reject

; the result does not

depend on the teacher. B

: The result depends on the teacher. : The result does not depend on the teacher. The expected values are: Teacher Mr

Mrs

Pass Result

Fail

Using Yates’ correction:

The critical value is , which is more than the calculated value, so do not reject the result does depend on the teacher. C

: Teacher and result are independent. : Teacher and result are dependent. The expected values are: Teacher Mr Pass Result

Fail

Using Yates correction:

Mrs

;

The critical value is , which is more than the calculated value, so do not reject the result and the teacher are independent.

;

Sometimes the need for Yates’ correction only becomes clear after you combine rows and columns of a contingency table.

WORKED EXAMPLE 3.5

This contingency table shows the location and ownership status of a random sample of houses.

[Table: observed frequencies, location (Urban, Rural) against ownership status (Owned outright, Owned with mortgage, Rented)]

Does this sample provide evidence at the significance level that ownership status depends on location?

$H_0$: Ownership status and location are independent.
$H_1$: Ownership status and location are dependent.
First write the null and alternative hypotheses.

Expected values:

[Table: expected frequencies, location (Urban, Rural) against ownership status (Owned outright, Owned with mortgage, Rented)]

Use Key point 3.1 to find the expected values.

Since there are two cells with an expectation below 5, you must combine rows or columns. The most reasonable combination here is the two types of owned category.

Combining the two owned categories gives observed data:

[Table: observed frequencies, location (Urban, Rural) against ownership status (Owned, Rented)]

The expected data is:

[Table: expected frequencies, location (Urban, Rural) against ownership status (Owned, Rented)]

You do not need to use Key point 3.1 again. You can just add the expected values of the appropriate cells in the previous table.

Since there is now only one degree of freedom, it is appropriate to use Yates’ correction.

The critical value is greater than the calculated value of $\chi^2$, so you do not reject $H_0$. There is no significant evidence that ownership status depends on location.

EXERCISE 3B 1

Use Yates’ correction to test these contingency tables for evidence of association, using significance. a

i

ii

b

i

ii

2

Gregor Mendel, the founder of modern genetics, carefully observed peas and found these results: Wrinkled

Round

Yellow Green Show that the round or wrinkled appearance of the pea is independent of the colour at the level of significance.

Did you know? These results are actually suspiciously close to being perfect – some people believe Mendel faked his results. However, it is not possible to conduct a hypothesis test to check this. How is statistics used to check for authenticity in results? In particular, how is Benford’s law used to check tax returns?

3

A scientist wanted to find out if a colleague could tell whether tea or milk was put in the cup first when tea was prepared for her. The results are shown here. Tea first

Milk first

Likes Dislikes Determine, at the level of significance, whether the colleague’s enjoyment is independent of whether tea or milk is added first.

Did you know? ‘The lady tasting tea’ was one of the experiments reported by eminent statistician Ronald Fisher in his book, The Design of Experiments. He used a variant on the chi-squared test, called Fisher’s exact test.

4

This table shows the number of books in libraries in rural and urban locations.

Number of books to Rural Urban

Conduct a test at the significance level to determine if the number of books differs between rural and urban libraries.

5

These data show the number of murders each year and the amount spent on horror films in the cinema across the last years in the UK. Amount spent on horror films in million £ to Number of murders Test at the significance level to see if there is an association between the amount spent on horror films and the number of murders each year. Does this provide evidence that watching horror films encourages people to commit murder?

6

Admissions to the six largest departments in Berkeley, a university in California, followed this pattern.

Accepted

Rejected

Male Female a Conduct a chi-squared test at significance to show that acceptance patterns depend on gender. Is a higher percentage of men or women admitted? Is this evidence of bias? b Data from six departments are shown here. Men Department

Admitted

Women Rejected

Admitted

Rejected

Total

Total

Conduct a test at the significance level to determine if acceptance patterns vary in different departments. In how many departments is the proportion of men admitted higher than the proportion of women admitted?

Did you know? This effect is called Simpson’s Paradox. You have to be very careful when using statistics to support arguments!

7

Explain why the null hypothesis in a chi-squared test cannot be ‘The two variables are dependent’.

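Checklist of learning and understanding
The $\chi^2$ distribution provides a very important method for deciding if two variables are independent.
If the variables are independent, you use the formula $E_i = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$ to find the expected values in each cell.
The test statistic used is $\chi^2$, and $\nu$ is the number of degrees of freedom, calculated as (rows $- 1$) $\times$ (columns $- 1$).
If $\nu > 1$, then $\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$.
If $\nu = 1$, then you use an alternative formula, called Yates’ correction: $\chi^2_{\text{Yates}} = \sum \frac{(|O_i - E_i| - 0.5)^2}{E_i}$.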

Mixed practice 3 1

What is the number of degrees of freedom for a

contingency table?

Choose from these options. A B C D More information needed. 2

A

contingency table has all expected frequencies larger than and a chi-squared value of

. What is the largest range of values of for which there is evidence that the two factors are not independent at significance? Choose from these options. A B C D 3

The area manager of a bank obtained information on randomly selected loans made by the bank during the previous two years.

The loan outcomes were categorised as ‘Satisfactory’ or as a ‘Bad debt’. The loan recipient types were categorised as ‘Individual’, ‘Small business’ or ‘Large business’. Recipient Individual Outcome

Small business

Large business

Satisfactory Bad debt

Using a $\chi^2$ distribution and the level of significance, test whether the outcome of a loan is independent of the type of recipient. Interpret your conclusion in the context of the question. [© AQA 2013]

4

This contingency table shows the data on hair colour and eye colour for a sample of children. Eye colour Blue Hair colour

Green

Brown

Brown Blonde

a Assuming that hair colour and eye colour are independent, calculate the expected frequencies.

b Calculate the value of the $\chi^2$ statistic for this data and state the number of degrees of freedom.

c Perform a suitable hypothesis test at the level of significance to decide whether hair colour and eye colour are independent. State your hypotheses and your conclusion clearly.

5

A nurse thinks that she has noticed that more boys are born at certain times of the year. She records the data for babies born in her hospital in one year. Spring

Summer

Autumn

Winter

Boy Girl

Test at the significance level whether her data gives evidence for any association between the gender of the baby and the time of the year. You must show all your working clearly.

6

Find the value of the appropriate chi-squared test statistic (to three significant figures) for this contingency table.

Choose from these options. A B C D 7

A large estate agency would like all the properties that it handles to be sold within three months. A manager wants to know whether the type of property affects the time taken to sell it. The data for a random sample of properties sold are tabulated here. Type of property Flat

Terraced

Semidetached

Detached

Total

Sold within three months Sold in more than three months Total

a Conduct a test, at the level of significance, to determine whether there is an association between the type of property and the time taken to sell it. Explain why it is necessary to combine two columns before carrying out this test. b The manager plans to spend extra money on advertising for one type of property in an attempt to increase the number sold within three months. Explain why the manager might choose: i terraced properties ii flats. [© AQA 2013]

8

Fiona, a lecturer in a school of engineering, believes that there is an association between the class of degree obtained by her students and the grades they had achieved in A Level Mathematics. In order to investigate her belief, she collected the relevant data on the performances of a random sample of recent graduates who had achieved grades or in A level Mathematics. These data are tabulated here. Class of degree

Total

A Level grade Total

a Conduct a $\chi^2$ test, at the level of significance, to determine whether Fiona’s belief is justified.
b Make two comments on the degree performance of those students in the sample who achieved a grade in A Level Mathematics. [© AQA 2012]

9

An organisation kept details of sideswipe accidents involving heavy goods vehicles (HGVs) during 2006. The type of each sideswipe accident was recorded as changing lane to the left, changing lane to the right or overtaking moving vehicle. The HGV involved was identified as either British registered (right-hand drive) or foreign registered (left-hand drive). The table summarises details for a random sample of sideswipe accidents.

Type of sideswipe accident Changing lane to the left

Changing lane to the right

Total

Overtaking moving vehicle

British registered HGV Foreign registered HGV Total

a

i Investigate, at the significance level, whether the type of sideswipe accident is independent of whether the HGV involved was British registered or foreign registered. ii Describe any differences found in the type of sideswipe accident between British registered and foreign registered HGVs.

b A further random sample of serious HGV accidents was investigated. It was found that of these involved drivers who were years of age or younger. Of these accidents, resulted in prosecution for a driving offence. Of the other accidents, which involved drivers over the age of years, resulted in prosecution for a driving offence.
i Form a $2 \times 2$ contingency table from this information.

ii Carry out a test, at the significance level, to investigate whether the age of the driver is independent of whether a prosecution for a driving offence resulted. Interpret your conclusion in context.

[© AQA 2011] 10 The director of a large company wants to know whether there is any association between the ages of her staff and the departments they work in. The table shows the data for a sample of employees.

Accounts Personnel Marketing Communications

Perform a suitable test at the level of significance to decide whether there is any association between age and department.

11 The table shows the experience of a bank over a long period of the types of loan that they give and whether they are repaid or defaulted (i.e. not repaid).

Repaid

Defaulted

Personal Mortgage Business

a Show that whether or not a loan gets repaid depends on the type of loan. b A statistician wants to sample loans at random. Show that dependence between the two values, using significance.

would not show

c Find the smallest whole number for which the sample would be expected to show dependence at significance. You can assume that all expected values are above 5.
d State an additional assumption required for your calculations in parts b and c.

12 Research was carried out to investigate a possible connection between weekly alcohol consumption and development of Type 2 diabetes. In the research report, it was stated that a sample of women, aged between and , was studied and that of these women went on to develop Type 2 diabetes. The women were categorised according to their average level of weekly alcohol consumption. This was measured, in grams of alcohol per week, as ‘less than ’, ‘between and ’ or ‘more than 30’. The results are summarised in the table.

Type 2 diabetes developed Yes

No

Less than Average level of weekly alcohol consumption

Between and More than

a Test, at the level of significance, whether the development of Type 2 diabetes is independent of the average level of weekly alcohol consumption.
b A medical reviewer for a newspaper read the report and then stated that people should increase their weekly alcohol consumption in order to decrease their chance of developing Type 2 diabetes. Make two comments on his statement, referring to both the study and the sources of association, if any, identified when carrying out the test in part a.
c In fact, women were involved in the research but the frequencies in the resulting contingency table had been divided by in order to make the calculation simpler. The test in part a was therefore repeated using the correct frequencies. For this test, state:
i the critical value
ii the value of the test statistic
iii the conclusion.
[© AQA 2010]

4 Continuous distributions
In this chapter you will learn how to:
describe probabilities of continuous random variables
calculate expected statistics of continuous random variables and functions of continuous random variables
find the median, mode and quartiles of continuous random variables
find the expected statistics of the sum of two continuous random variables
work with the sum of two normally distributed random variables.
If you are following the A Level course, you will also learn how to:
convert between probability density function, $f(x)$, and cumulative probability function, $F(x)$
use distributions of random variables that are part discrete and part continuous
use two new probability distributions, the rectangular and the exponential
use the cumulative distribution function to find the distribution of the function of a random variable.

Before you start …

Chapter 1
You should know how to calculate expectations and variances for discrete distributions.
1 Find , given that has the distribution:

A Level Mathematics Student Book 1, Chapter 20
You should know the meaning of the statistical measures covered in AS Level Mathematics.
2 Find the interquartile range of:

A Level Mathematics Student Book 1, Chapter 14
You should know how to integrate all functions from AS Level Mathematics.
3 Find

A Level Mathematics Student Book 2, Chapters 9 and 11
You should know how to integrate all functions from A Level Mathematics.
4 Find

A Level Mathematics Student Book 2, Chapter 20
You should know how to use the rules of probability including conditional probability.
5 Given that and , find .

A Level Mathematics Student Book 2, Chapter 21
You should be able to perform calculations with a normal distribution.
6 Given that , find .

From discrete to continuous

In Chapter 1 you saw that being able to describe random variables allowed you to make predictions about their properties. However, a major limitation was that the methods in Chapter 1 only applied to discrete variables. In reality, many variables you are interested in, such as height, weight and time, are continuous variables. In this chapter you will extend the methods from Chapter 1 to work with continuous random variables.

Section 1: Continuous random variables
Consider these data for the masses of several bags of rice with the same labelled mass.

[Table: mass against frequency]

Not all of the data in a category has a mass of exactly the category value. A bag with a mass slightly above or below that value would be included in this category. It is impossible to list all the different possible actual masses, and it is impossible to measure the mass absolutely accurately. When you collect continuous data, you have to put it into groups. This means that you cannot talk about the probability of a single value of a continuous random variable (CRV). You can only talk about the probability of the CRV being in a specified interval.

[Figure: number line from 5.05 to 5.15 with the value $x$ = 5.087 954 6 marked]

A useful way of representing probabilities of a CRV is as an area under a graph. The probability of a single value would correspond to the ‘area’ of a vertical line, which would be zero. However, you can find the probability of the CRV being in any interval by integration. The function which you have to integrate is called the probability density function (PDF), and it is often denoted $f(x)$. The defining feature of $f(x)$ is that the area between two values is the probability of the CRV falling between those two values.

Tip For a continuous random variable, it does not matter whether you use strict inequalities $(a < X < b)$ or inclusive inequalities $(a \le X \le b)$.

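Key point 4.1 For a continuous random variable $X$ with probability density function $f(x)$:
$P(a < X < b) = \int_a^b f(x)\,dx$

[Figure: graph of $y = f(x)$ with the area between $x = a$ and $x = b$ shaded and labelled $P(a < X < b)$]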

As with discrete probabilities, the total probability over all cases must equal 1. Also, no probability can ever be negative. This provides two requirements for a function to be a probability density function.

Key point 4.2 For $f(x)$ to be a probability density function, it must satisfy:
$f(x) \ge 0$, for all $x$
$\int_{-\infty}^{\infty} f(x)\,dx = 1$

Tip The limits $-\infty$ and $\infty$ represent the fact that, in theory, a continuous random variable can take any real value. In practice, the limits of the integral are set to the lowest and the highest value the variable can take.
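The two requirements in Key point 4.2 and the interval probability in Key point 4.1 can be checked numerically. This is only a sketch: the density $f(x) = 2x$ on $0 \le x \le 1$ and the function names are assumptions chosen for illustration, and Simpson's rule stands in for the exact integration you would do by hand.

```python
def integrate(f, a, b, steps=1000):
    """Approximate the integral of f from a to b with Simpson's rule."""
    h = (b - a) / steps  # steps must be even for Simpson's rule
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

# Hypothetical PDF (an assumption for illustration): f(x) = 2x on [0, 1].
def pdf(x):
    return 2 * x if 0 <= x <= 1 else 0.0

# Requirement 1: the density is never negative (checked on a grid).
non_negative = all(pdf(i / 100) >= 0 for i in range(-100, 201))
# Requirement 2: the total area is 1, using the lowest and highest values.
total_area = integrate(pdf, 0, 1)
# Key point 4.1: P(0.5 < X < 1) is the area between 0.5 and 1.
probability = integrate(pdf, 0.5, 1)
print(non_negative, round(total_area, 6), round(probability, 6))
```

The limits 0 and 1 are used instead of $-\infty$ and $\infty$ because this density is zero everywhere else.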

WORKED EXAMPLE 4.1

The continuous random variable shown has probability density function:

f(x)

O

1

x

a Find the value of . b Find the probability of being between a

and

.

The total area is . The limits are and because the PDF is only non-zero between and .

b

Use the formula in Key point 4.1 and substitute the value of found in part a.

EXERCISE 4A 1

For each of these distributions find the possible values of the unknown parameter . a

i ii

b

i ii

c

i

ii d

i ii

e

i ii

f

i

ii

g

i ii

h

i

ii 2

In each part, a continuous random variable has the given probability density function. a i Find ii Find b i Find ii Find c i Find ii Find

3

.

In each part, a continuous random variable has the given probability density function. a i Find if ii Find if b i Find if ii Find if

c i Find if ii Find if 4

A model predicts that the angle, , an alpha particle is deflected by a nucleus is modelled by the PDF a Find the value of the constant b

alpha particles are fired at a nucleus. Assuming that the model is correct, estimate the number of alpha particles deflected by less than .

5

The probability density function of finding a seed at a distance from a tree is proportional to . The minimum distance seed is found from the tree is being found more than

6

. Find the probability of a seed

from the tree.

A random variable has PDF Find the exact value of

7

Given that the continuous random variable has probability density function , find the interquartile range of .

8

A continuous random variable has probability density function a Find in terms of if b Find in terms of if

9

The continuous random variable has probability density function

for

otherwise. The probability of two independent observations of both being above

and is

.

Find the values of and of . 10

The continuous random variable has probability density function

. Find

. 11 The continuous random variable has probability density function given by . Prove that there is only one possible value of , and state its value.

for

Section 2: Expectation and variance of continuous random variables

The expressions for expectation and variance of continuous random variables all involve integration.

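Key point 4.3 The expectation and variance of a continuous random variable $X$ with probability density function $f(x)$ are:

$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$

$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = \int_{-\infty}^{\infty} x^2 f(x)\,dx - [E(X)]^2$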

Tip You might notice that the expressions for $E(X)$ and $\mathrm{Var}(X)$ look similar to those for discrete random variables, but with integration instead of summation signs. This is because there is a link between sums and integrals.

You need to evaluate these integrals over the whole domain of the probability density function.

WORKED EXAMPLE 4.2

A continuous random variable has pdf:

Find $E(X)$ and the standard deviation of $X$.

You can do the definite integration on your calculator.

To find the standard deviation you must first find $\mathrm{Var}(X)$, which requires you to find $E(X^2)$.
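As a rough check on the order of the steps (find $E(X)$, then $E(X^2)$, then the variance, then the standard deviation), here is a minimal numerical sketch. The density $3x^2$ on $[0, 1]$ is an assumed example and `integrate` is a hypothetical helper, not something defined in the text.

```python
import math


def integrate(g, lo, hi, n=10_000):
    """Midpoint-rule approximation to the integral of g from lo to hi."""
    h = (hi - lo) / n
    return h * sum(g(lo + (i + 0.5) * h) for i in range(n))


def f(x):
    """Hypothetical pdf (assumed for illustration): 3x^2 on [0, 1]."""
    return 3 * x**2


# E(X) is the integral of x f(x) over the domain of the pdf.
mean = integrate(lambda x: x * f(x), 0, 1)

# E(X^2) is the integral of x^2 f(x) over the same domain.
mean_of_square = integrate(lambda x: x**2 * f(x), 0, 1)

# Var(X) = E(X^2) - [E(X)]^2, and the standard deviation is its square root.
variance = mean_of_square - mean**2
standard_deviation = math.sqrt(variance)

print(round(mean, 4), round(variance, 4), round(standard_deviation, 4))
```

For this assumed density the exact values are $E(X) = \frac{3}{4}$ and $\mathrm{Var}(X) = \frac{3}{5} - \frac{9}{16} = \frac{3}{80}$.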

It is also possible to find the median and mode for a continuous distribution. The defining feature of the median is that half of the data should be below this value and half above. You can interpret this in terms of probability.

Key point 4.4 If you represent the median of a continuous random variable $X$ with PDF $f(x)$ by $m$, then it satisfies

$\int_{-\infty}^{m} f(x)\,dx = \frac{1}{2}$

The mode is the value of $x$ at the maximum value of $f(x)$.

Common error Don't forget to look at the end points of the function when finding the mode.

You can use similar ideas to find the quartiles (or any other percentile). For example, if $Q_1$ is the lower quartile and $Q_3$ is the upper quartile then

$\int_{-\infty}^{Q_1} f(x)\,dx = 0.25$ and $\int_{-\infty}^{Q_3} f(x)\,dx = 0.75$

Although the lower limit is written as minus infinity, in practice it starts from the lowest value for which the probability density function is defined.

WORKED EXAMPLE 4.3

Find the median and mode of the random variable with probability density function

For median :

Use the formula in Key point 4.4 with the lower limit as .

Using your calculator:

This is a cubic equation. Use your calculator to solve it. For the mode, check for a maximum point. This could be where the derivative is zero or at an end point.

Hence the mode is .

The largest of these three numbers is

.
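The median condition in Key point 4.4 and the end-point check for the mode can be sketched numerically. The density $3x^2$ on $[0, 1]$ is an assumed example (its maximum happens to lie at an end point), and the bisection and grid search are just one possible way to do what a calculator's equation solver would do.

```python
def integrate(g, lo, hi, n=2_000):
    """Midpoint-rule approximation to the integral of g from lo to hi."""
    h = (hi - lo) / n
    return h * sum(g(lo + (i + 0.5) * h) for i in range(n))


def f(x):
    """Hypothetical pdf (assumed for illustration): 3x^2 on [0, 1]."""
    return 3 * x**2


def area_below(m):
    """Area under f from the lowest possible value (0) up to m."""
    return integrate(f, 0, m)


def find_median(lo=0.0, hi=1.0, tol=1e-9):
    """Bisection: narrow [lo, hi] around the point where the area is 0.5."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if area_below(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


median = find_median()

# Mode: search a grid that includes both end points of the domain.
grid = [i / 1000 for i in range(1001)]
mode = max(grid, key=f)

print(round(median, 4), mode)
```

Here the derivative of the density is never zero inside the interval, so the mode is found only because the end points are included in the search.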

EXERCISE 4B 1

Find function. a

i ii

b

i

ii

c

i

, the median of , the mode of and

if has the given probability density

ii d

i

ii 2

a Given that

, find if:

i

ii b Given that

, find if:

i

ii 3 The continuous random variable has pdf a Find the expected mean of . b Find 4

.

A continuous random variable has pdf a Find the value of the constant . b Find

5

.

Consider the function a Show that, for all values of , the function

satisfies the conditions to be a PDF.

b The random variable has probability density function 6

in terms of .

is a continuous random variable with probability density function a Show that b Given that

7

. Find

Given that .

. , find the exact value of . , is a probability distribution, find

and prove that

Section 3: Expectation and variance of functions of a random variable

Linear transformations

Suppose the average height of students in a class was and their standard deviation was . If they all stood on their -high chairs then the new average height would be , but the range, and any other measure of variability, would not change, so the standard deviation would still be . In other words, if you add a constant on to a variable, you add the same constant on to the expectation, but the variance does not change:

$E(X + c) = E(X) + c$, $\mathrm{Var}(X + c) = \mathrm{Var}(X)$

Rewind You have already met this idea for discrete random variables in Key point 2.5. In this chapter you extend it to continuous random variables.

If, instead, each student were given a magical growing potion that doubled their heights, the new average height would be . However, the range, along with any other measure of variability, would have doubled, so the new standard deviation would be . This means that their variance would change from to . In other words, if you multiply a variable by a constant, you multiply the expectation by the constant and multiply the variance by the constant squared:

$E(aX) = aE(X)$, $\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X)$

Common error It is important to know that this only works for the structure $aX + b$, which is called a linear function. So, for example, $E(X^2)$ cannot be simplified to $[E(X)]^2$.

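Key point 4.5 For a random variable $X$ with expectation $E(X)$ and variance $\mathrm{Var}(X)$, if $a$ and $b$ are constants, then

$E(aX + b) = aE(X) + b$, $\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X)$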
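The rules in Key point 4.5 can be checked exactly on a small example. The sketch below treats a short list of values as the equally likely outcomes of a random variable; the outcomes and the constants are assumptions made up for illustration.

```python
from fractions import Fraction
from statistics import mean, pvariance

# Hypothetical equally likely outcomes of a random variable X.
outcomes = [Fraction(v) for v in (2, 4, 4, 5, 7, 8)]


def transformed(a, b):
    """Return (E(aX + b), Var(aX + b)) computed directly from the outcomes."""
    new_outcomes = [a * x + b for x in outcomes]
    return mean(new_outcomes), pvariance(new_outcomes)


e_x, var_x = mean(outcomes), pvariance(outcomes)

# A negative multiplier, as when a leftover piece is (total - X):
# the expectation changes, but the variance is multiplied by (-1)^2 = 1.
e_left, var_left = transformed(-1, 10)

# Doubling and shifting: the variance is multiplied by 2^2 = 4.
e_double, var_double = transformed(2, 3)

print(e_x, var_x, e_left, var_left, e_double, var_double)
```

The negative multiplier case shows why a variance never becomes negative: the constant is squared before it multiplies the variance.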

WORKED EXAMPLE 4.4

A length of pipe is cut into a long pipe with average length and standard deviation . The leftover piece is used as a short pipe. Find the mean and standard deviation of the short pipe.  Length of long pipe  Length of short pipe

Define your variables. Connect your variables.

Apply Key point 4.5.

So the standard deviation of is also

General transformations

Key point 1.6 stated that for a discrete random variable $E(g(X)) = \sum g(x_i)\,p_i$. You can extend this to continuous random variables by changing the probability into a probability density function and integrating instead of summing. When finding the variance you used the fact that $E(X^2) = \int_{-\infty}^{\infty} x^2 f(x)\,dx$. This can be generalised to any function of $X$.

Common error You will always get a positive variance (since square numbers are always positive), even if the coefficients are negative. If you find you have a negative variance, something has gone wrong!

Key point 4.6 If $X$ is a continuous random variable with pdf $f(x)$, then:

$E(g(X)) = \int_{-\infty}^{\infty} g(x) f(x)\,dx$
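A short numerical sketch of Key point 4.6 follows. It uses an assumed density $2x$ on $[0, 1]$ and the function $g(x) = x^3$ (for instance the volume of a cube of side $X$) to show that $E(g(X))$ generally differs from $g(E(X))$; both the density and the helper `integrate` are illustrative assumptions.

```python
def integrate(g, lo, hi, n=10_000):
    """Midpoint-rule approximation to the integral of g from lo to hi."""
    h = (hi - lo) / n
    return h * sum(g(lo + (i + 0.5) * h) for i in range(n))


def f(x):
    """Hypothetical pdf (assumed for illustration): 2x on [0, 1]."""
    return 2 * x


def expectation_of(g):
    """E(g(X)) = integral of g(x) f(x) over the domain of the pdf."""
    return integrate(lambda x: g(x) * f(x), 0, 1)


expected_side = expectation_of(lambda x: x)        # E(X)
expected_volume = expectation_of(lambda x: x**3)   # E(X^3)
cube_of_expected_side = expected_side**3           # [E(X)]^3

print(round(expected_side, 4), round(expected_volume, 4),
      round(cube_of_expected_side, 4))
```

For this assumed density $E(X) = \frac{2}{3}$ and $E(X^3) = \frac{2}{5}$, whereas $[E(X)]^3 = \frac{8}{27}$.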

WORKED EXAMPLE 4.5

Given that the random variable has probability density function otherwise, find

and

because that is where

EXERCISE 4C Given that a

, find:

i ii

b

i ii

c

i ii

d

i ii

e

i ii

2

Given that

and

. Identify

1

for

. , find:

. The limits are between and is not zero.

a

i ii

b

i ii

c

i ii

d

i ii

e

i ii

3

Given that is a continuous random variable with PDF find: a

for

and otherwise,

i ii

b

i ii

c

i ii

d

i ii

4

The expected distance of a random taxi journey is miles with standard deviation miles. The charge for a taxi journey is £ plus £ per mile (so that, for example, a mile journey would cost £ ). Find: a the expected value b the standard deviation in the charge for a taxi journey.

5

The random variable has

and

. Given that

and

, find

. 6

Daniel has hours of playtime each Sunday afternoon. In that time he either reads or plays games. If the expected amount of time reading is hours with a standard deviation of hours, find: a the expected amount of time playing games b the standard deviation in the amount of time spent playing a game.

7

The side of a cube, , is a continuous random variable with pdf

for

otherwise. a Find

.

b Find the expected volume of the cube.

Common error Notice that the answer to part b is not the cube of the answer to part a.

and

8

The continuous random variable has probability density function

for

and otherwise. Find: a b c 9

The continuous random variable has probability density function and otherwise. a Find b Find

.

.

where is a positive whole number.

for

Section 4: Sums of independent random variables

A tennis racquet is formed by adding together two components – the handle and the head. If both components have their own distribution of length and they are combined together randomly then you have formed a new random variable – the length of the racquet. It is not surprising that the average length of the racquet is the sum of the average lengths of the parts, but with a little thought you can reason that the standard deviation will be less than the sum of the standard deviations of the parts. To get extremely long or extremely short tennis racquets you must have extremes in the same direction for both the handle and the head. This is not very likely. It is more likely that:

• both are close to average
• an extreme value is paired with an average value
• an extreme value in one direction is balanced by another.

Tip The first of the results in Key point 4.7 is true even if $X$ and $Y$ are not independent.

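Key point 4.7 For independent random variables $X$ with expectation $E(X)$ and variance $\mathrm{Var}(X)$, and $Y$ with expectation $E(Y)$ and variance $\mathrm{Var}(Y)$:

$E(X + Y) = E(X) + E(Y)$

$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$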

The results in Key point 4.7 extend to more than two variables.

WORKED EXAMPLE 4.6

The mean thickness of the base of a burger bun is The mean thickness of a burger is

with variance

The mean thickness of the top of a burger bun is

with variance . with variance

.

Find the mean and standard deviation of the total height of a whole burger in a bun, assuming that the thicknesses of the individual parts are independent. Define your variables.

Connect your variables. Apply Key point 4.7.

So the standard deviation of is

.

In Key point 4.7 it was stressed that $X$ and $Y$ have to be independent, but this does not mean that they have to be drawn from different populations. They could be two different observations of the same population, for example, the heights of two different people added together. This is a different variable from the height of one person doubled. Use a subscript to emphasise when there are repeated observations from the same population:

• $X_1 + X_2$ means adding together two different observations of $X$
• $2X$ means observing $X$ once and doubling the result.

The expectation of both of these combinations is the same: $2E(X)$. However, the variance is different.

From Key point 4.7: $\mathrm{Var}(X_1 + X_2) = \mathrm{Var}(X_1) + \mathrm{Var}(X_2) = 2\,\mathrm{Var}(X)$

From Key point 4.5: $\mathrm{Var}(2X) = 4\,\mathrm{Var}(X)$

So the variability of a single observation doubled is greater than the variability of two independent observations added together. This is consistent with the earlier argument about the possibility of independent observations cancelling out extreme values.

You can also combine the results in Key points 4.5 and 4.7 to look at other linear combinations of independent random variables.

WORKED EXAMPLE 4.7

The volume of lemonade Chris purchases at a supermarket is a discrete random variable with mean and standard deviation . The volume of lemonade Chris drinks on the journey home is a continuous random variable with mean and standard deviation . Assume that is independent of . is the random variable: volume of lemonade in ml remaining after the journey home. a Find the expected mean and standard deviation of . b How realistic is the assumption that is independent of ? a

Write the required random variable in terms of the other two random variables. You only have a rule for sums so you need to write it as a sum. Use Key point 4.7. Use Key point 4.5. Use Key point 4.7. Use Key point 4.5. Remember that variance is the square of the standard deviation. So the standard deviation of is

b Although the two variables might reasonably be thought to be independent there are also some reasons to doubt this. If Chris is very thirsty he might buy more lemonade and then drink more. If Chris does not buy any lemonade then he cannot drink any.

Worked example 4.7 illustrates that the theories of Key points 4.5 and 4.7 are applicable to both continuous and discrete random variables, or indeed combinations of the two. It also highlights the counter-intuitive fact that, for independent variables, $\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.
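The difference between adding two independent observations and doubling a single observation, and the fact that variances add even for a difference, can be verified exactly on a small discrete example. The sketch below uses a fair six-sided dice, which is an assumption chosen for illustration, and enumerates every equally likely outcome.

```python
from fractions import Fraction
from itertools import product


def mean_and_variance(outcomes):
    """Exact mean and variance of a list of equally likely outcomes."""
    n = len(outcomes)
    mean = Fraction(sum(outcomes), n)
    variance = sum((Fraction(x) - mean) ** 2 for x in outcomes) / n
    return mean, variance


die = range(1, 7)  # hypothetical fair dice, outcomes 1 to 6

single = mean_and_variance(list(die))

# X1 + X2: two independent rolls added together (36 equally likely pairs).
two_rolls = mean_and_variance([x1 + x2 for x1, x2 in product(die, repeat=2)])

# 2X: one roll doubled.
doubled = mean_and_variance([2 * x for x in die])

# X1 - X2: the difference of two independent rolls.
difference = mean_and_variance([x1 - x2 for x1, x2 in product(die, repeat=2)])

print(single, two_rolls, doubled, difference)
```

Both the sum and the doubled roll have mean 7, but the doubled roll has twice the variance of the sum, and the difference has the same variance as the sum.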

EXERCISE 4D 1

Let and be two independent variables with the expectation and variance of: a

,

,

and

. Find

i ii

b

i ii

c

i ii

d

i ii

e

i .

ii 2

Let and be two independent variables with

,

,

and

. Find:

a b c d 3

. and

are two independent observations of the random variable with

and

. The sample mean, . a Show that b Find 4

, of these two observations is also a random variable defined by

. .

The average mass of a man in an office is

with standard deviation

. The average mass of a

woman in the office is with standard deviation . The empty lift has a mass of . What is the expectation and standard deviation of the total mass of the lift when women and men are inside? 5

A weighted dice has mean outcome with standard deviation . Brian rolls the dice once and doubles the outcome. Camilla rolls the dice twice and adds the results together. Work out the expected mean and standard deviation of the difference between their scores.

6

Exam scores at a large school have mean and standard deviation . Two students are selected at random. Find the expected mean and standard deviation of the difference between their exam scores.

7

Adrian cycles to school with a mean time of minutes and a standard deviation of minutes. Pamela walks to school with a mean time of minutes and a standard deviation of minutes. They each calculate the total time it takes them to get to school over a five day week. Find the expected mean and standard deviation of the difference in the total weekly journey times, assuming journey times are independent.

8

is the random variable mass of a gerbil. Explain the difference between

and

.

Section 5: Linear combinations of normal variables

Although the proof is beyond the scope of this course, it turns out that any linear combination of normal variables will also follow a normal distribution. You can use the methods from Section 4 to find out the parameters of this distribution.

Rewind You studied the normal distribution in A Level Mathematics Student Book 2, Chapter 21.

Key point 4.8 If $X$ and $Y$ are independent random variables following a normal distribution and $Z = aX + bY + c$, then $Z$ also follows a normal distribution.

WORKED EXAMPLE 4.8

Given that

,

and

, find



.

Use Key point 4.5. State the distribution of . Use your calculator to find the probability.

WORKED EXAMPLE 4.9

Given that

and four independent observations of are made, find Express

.

in terms of observations of .

Use Key point 4.7.

State the distribution of

.

Use your calculator to find the probability.
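The calculator steps in Worked examples 4.8 and 4.9 can be mirrored with Python's standard-library `statistics.NormalDist`, whose `+` and `-` operators between two `NormalDist` objects treat them as independent. The parameters below are assumptions made up for illustration, not the values from the worked examples.

```python
from statistics import NormalDist

# Hypothetical independent normal variables (parameters assumed).
X = NormalDist(mu=50, sigma=3)
Y = NormalDist(mu=40, sigma=4)

# D = X - Y is normal with mean 50 - 40 and variance 3^2 + 4^2.
D = X - Y
p_x_exceeds_y = 1 - D.cdf(0)

# X + X is treated as X1 + X2, the sum of two independent observations,
# so its variance is 2 Var(X); X * 2 is one observation doubled,
# so its variance is 4 Var(X).
sum_of_two = X + X
one_doubled = X * 2

# Total of four independent observations of X.
total_of_four = X + X + X + X

print(D.mean, D.stdev, round(p_x_exceeds_y, 4))
print(sum_of_two.variance, one_doubled.variance, total_of_four.stdev)
```

The contrast between `X + X` and `X * 2` is the same distinction as between $X_1 + X_2$ and $2X$ made in Section 4.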

Rewind In Chapter 2 you met the idea that the Poisson distribution was scalable. You can now interpret this as meaning that the sum of two Poisson variables is also Poisson. This is the only other distribution in this course that has this property. However, it only applies to sums of Poisson distributions – not to differences or multiples or linear combinations.

EXERCISE 4E 1

Given that a

i ii

and

, find:

b

i ii

c

i ii

d

i ii

e

i ii

f

i ii

2

, where

is the average of

observations of .

, where is the average of observations of .

An airline has found that the masses of their passengers follow a normal distribution with mean and variance

.

The masses of their hand luggage follow a normal distribution with mean .

and variance

a State the distribution of the total mass of a passenger and their hand luggage and find any necessary parameters. b What is the probability that the total mass of a passenger and their luggage exceeds 3

Evidence suggests that the times Aaron takes to run are normally distributed with mean and standard deviation . The times Bashir takes to run are normally distributed with mean

and standard deviation

.

a Find the mean and standard deviation of the difference Bashir’s times. b Find the probability that Aaron finishes a

between Aaron’s and

race before Bashir.

c What is the probability that Bashir beats Aaron by more than 4

?

?

A machine produces metal rods so that their lengths follow a normal distribution with mean and variance . The rods are checked in batches of six, and a batch is rejected if the average length is less than or more than

.

a Find the distribution, including any necessary parameters, of the mean of a random sample of six rods. b Hence find the probability that a batch is rejected. 5

The distribution of lengths of pipes produced by a machine is normal with mean standard deviation . a What is the probability that a randomly chosen pipe has a length of

or more?

b What is the probability that the average length of a randomly chosen set of type is or more? 6

and

pipes of this

The masses, , of male birds of a certain species are normally distributed with mean and standard deviation . The masses, , of female birds of this species are normally distributed with mean standard deviation . a Find the mean and variance of

and

.

b Find the probability that the mass of a randomly chosen male bird is more than twice the

mass of a randomly chosen female bird. c Find the probability that the total mass of three male birds and female birds (chosen independently) exceeds . 7

A shop sells apples and pears. The masses, in grams, of the apples can be assumed to have a distribution and the masses of the pears, in grams, can be assumed to have a distribution. a Find the probability that the mass of a randomly chosen apple is more than double the mass of a randomly chosen pear. b A shopper buys apples and a pear. Find the probability that the total mass is greater than .

8

The length of a corn snake is normally distributed with mean

.

The probability that a randomly selected sample of corn snakes has an average of above is

.

Find the standard deviation of the length of a corn snake. 9

a In a test, boys have scores that follow the distribution . Girls’ scores follow What is the probability that a randomly chosen boy and a randomly chosen girl differ in scores by less than ?

.

b What is the probability that a randomly chosen boy scores less than three-quarters of the mark of a randomly chosen girl? 10 The daily rainfall in Algebraville follows a normal distribution with mean deviation

and standard

.

On a randomly chosen day, there is a probability of

that the rainfall is greater than

In a randomly chosen seven-day week, there is a probability of is less than .

.

that the mean daily rainfall

a Find the value of and of . b What assumption was required in performing this calculation? How reasonable is this assumption? 11 Anu uses public transport to go to school each morning. The time she waits each morning for the transport is normally distributed with a mean of minutes.

 minutes and a standard deviation of  

a On a specific morning, what is the probability that Anu waits more than

 minutes?

b During a particular week (Monday to Friday), what is the probability that: i her total morning waiting time does not exceed  minutes ii she waits less than

minutes on exactly mornings of the week

iii her average morning waiting time is more than

 minutes?

c Given that the total morning waiting time for the first four days is probability that the average for the week is over  minutes. d Given that Anu’s average morning waiting time in a week is over probability that it is less than

Tip Only consider the last day.

minutes.

 minutes, find the  minutes, find the

Section 6: Cumulative distribution functions

In Key point 4.4 you saw a method for finding the median. This method can be generalised to find any percentile, using a function called a cumulative distribution function. This function has many surprising uses because, unlike a probability density function, it represents a real probability so can be combined using laws of probability.

A cumulative distribution function (CDF) measures the probability of a random variable being less than or equal to a particular value. Normally, if the probability density function is called $f(x)$, the cumulative distribution function is called $F(x)$.

Key point 4.9 For a continuous distribution

$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$

Tip The $t$ in the integral is a dummy variable. You could replace it with any other symbol. The only real variable in this expression is the $x$ in the upper limit, which corresponds to the $x$ in the left hand expression.

Since you can undo integration by differentiation, you can recover the probability density function from $F(x)$.

Key point 4.10 Given that $F(x)$ is the cumulative distribution function, then you can find the probability density function, $f(x)$, using:

$f(x) = \frac{d}{dx}F(x)$

WORKED EXAMPLE 4.10

Find the cumulative distribution function, given that a continuous random variable has probability density function If If

.

If

for

and otherwise.

State when is below and above the range in which is defined. When is above the probability of the random variable being below is , because all observed values are between and .

: Since there is no probability of the random variable being below , the integral starts at .

Once you have the cumulative distribution function you can use it to find the median, quartiles and any other percentiles, since the $p$th percentile is defined as the value $x$ such that $P(X \le x) = p\%$, i.e. $F(x) = \frac{p}{100}$.
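Both uses of the cumulative distribution function (differentiating to recover the pdf, and solving $F(x) = \frac{p}{100}$ for a percentile) can be sketched in a few lines. The CDF $F(x) = x^3$ on $[0, 1]$ is an assumed example, and the numerical derivative and bisection are illustrative stand-ins for the algebra you would normally do by hand.

```python
def F(x):
    """Hypothetical CDF (assumed): 0 below 0, x^3 on [0, 1], 1 above 1."""
    if x < 0:
        return 0.0
    if x > 1:
        return 1.0
    return x**3


def pdf_from_cdf(x, h=1e-6):
    """Central-difference estimate of f(x) = dF/dx (Key point 4.10)."""
    return (F(x + h) - F(x - h)) / (2 * h)


def percentile(p, lo=0.0, hi=1.0, tol=1e-10):
    """Solve F(x) = p / 100 by bisection on the interval [lo, hi]."""
    target = p / 100
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if F(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


density_at_half = pdf_from_cdf(0.5)   # exact derivative is 3 * 0.5^2
lower_quartile = percentile(25)
median = percentile(50)

print(round(density_at_half, 4), round(lower_quartile, 4), round(median, 4))
```

Bisection works here because a CDF never decreases, so there is no need to choose between several candidate solutions inside the interval.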

Rewind You saw that you could do this without explicitly referring to a cumulative distribution function, in Exercise 4B.

WORKED EXAMPLE 4.11

The continuous random variable has cumulative distribution function

a Find the probability density function of b Find the lower quartile of . PDF is the derivative of CDF.

a and otherwise. b At the lower quartile:

is non-zero only if Therefore

.

Lower quartile is

.

th percentile.

Decide which solution to choose.

EXERCISE 4F 1

Find the cumulative distribution function for each of these probability density functions, and hence find the median of the distribution. a

i ii

b

i

ii 2

Given each continuous cumulative distribution function, find the probability density function and the median.

a

i

ii

b

i

ii

3

Find the exact value of the for

percentile of the continuous random variable

and otherwise.

4 A continuous random variable has cumulative distribution function a Find the value of . b Find the probability density function. c Find the median of the distribution.

that has pdf

Section 7: Piecewise-defined probability density functions

A probability density function can have different function rules on different parts of its domain. Such a function is said to be defined piecewise. All the techniques from Sections 1–6 still apply; however, when evaluating definite integrals you need to split them into several parts.

Rewind You already met this idea in the context of kinematics in A Level Mathematics Student Book 1, Chapter 16.

WORKED EXAMPLE 4.12

A continuous random variable has probability density function

a Sketch

.

b Find the value of . c Find a

.

[Figure: sketch of $f(x)$ against $x$, with the $x$-axis marked from 0 to 8 and the height $k$ marked on the vertical axis]

b Using the fact that the total area under the graph of $f(x)$ must be $1$: The area is now made up of two separate parts, so you need to work out the two separate areas and add these together.

c Again, you need to split the integral for

into two parts.

You might be able to evaluate definite integrals on your calculator.
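Splitting an integral at the point where the rule changes can be sketched as follows. The piecewise density used here (proportional to $x$ on $[0, 2]$ and to $4 - x$ on $(2, 4]$) is an assumed example, not the function from Worked example 4.12, and `integrate` is a hypothetical helper.

```python
def integrate(g, lo, hi, n=10_000):
    """Midpoint-rule approximation to the integral of g from lo to hi."""
    h = (hi - lo) / n
    return h * sum(g(lo + (i + 0.5) * h) for i in range(n))


def shape(x):
    """Unscaled hypothetical density: x on [0, 2], 4 - x on (2, 4], else 0."""
    if 0 <= x <= 2:
        return x
    if 2 < x <= 4:
        return 4 - x
    return 0.0


# Find k: split the area at x = 2, add the two parts, and require k * area = 1.
area = integrate(shape, 0, 2) + integrate(shape, 2, 4)
k = 1 / area


def f(x):
    """The pdf: the unscaled shape multiplied by the constant k."""
    return k * shape(x)


# E(X): the integral of x f(x) is split at the same point.
mean = (integrate(lambda x: x * f(x), 0, 2)
        + integrate(lambda x: x * f(x), 2, 4))

print(round(k, 4), round(mean, 4))
```

For this assumed shape the two areas are each 2, so $k = \frac{1}{4}$, and the mean is 2 by symmetry.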

WORKED EXAMPLE 4.13

A random variable has probability density function

a Find the median of . b Find the cumulative distribution function of . a If the median is , then If

You don’t know whether the median is in in , so you need to try both cases.

:

or

Remember to check that any solution you find is in the correct interval. In this case, neither of these values can be the median. If

: You need to split the probability into two parts: .

So the median is b When

The median must be between and .

.

:

You need to look at the two parts of the domain separately.

When:

You need to split the probability into two parts to use the different expressions. Use found.

Remember to write out the full expression for .

So,

EXERCISE 4G 1

A continuous random variable has probability density function

a Sketch the graph of

, which you have already

.

b Find the value of . c Find the value of such that 2

.

A random variable has cumulative distribution function

a Find the median of . b Find the mean and the variance of . 3

A continuous random variable has probability density function

a Show that

.

b Write down the value of

.

c Find the upper quartile of . d Find the cumulative distribution function of . 4

The continuous random variable

is defined by the probability density function

a Find the value of . b Sketch the probability density function. c Find

.

d Find the median of 5

.

Function is defined by

a Show that is a valid probability density function. b Find the variance of a random variable whose probability density function is . 6

The continuous random variable has probability density function

a Find the value of . b Find the expectation of .

c Find the cumulative distribution function of . d Find the median of . e Find the lower quartile of . 7

A continuous random variable has probability density function

a Sketch b Show that c Find

. . .

d Find the exact value of

Section 8: Rectangular distribution

The rectangular distribution is related to the discrete uniform distribution. It is a distribution where any equally sized part of the domain has an equal probability of occurring. It is defined by the endpoints of the domain, $a$ and $b$. The probability density function is a constant, and this constant must be chosen so that the total area under the graph is $1$.

Rewind You met the discrete uniform distribution in Chapter 1.

Key point 4.11 If $X$ follows a rectangular distribution between $a$ and $b$, then $f(x) = \frac{1}{b - a}$ for $a \le x \le b$.

Tip The easiest way to get this result is not to use integration, but to realise that the graph forms a rectangle with width $b - a$ and total area $1$.

[Figure: rectangle between $x = a$ and $x = b$ with width $b - a$, height $\frac{1}{b - a}$ and area 1]

You can find the mean of this distribution by using integration.

WORKED EXAMPLE 4.14

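Prove that if $X$ is a random variable following a rectangular distribution over $[a, b]$ with $a < b$ then $E(X) = \frac{a + b}{2}$.

$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$
Use the definition of expectation from Key point 4.3.

$= \int_a^b \frac{x}{b - a}\,dx$
Use the PDF of a rectangular distribution from Key point 4.11.

$= \left[\frac{x^2}{2(b - a)}\right]_a^b = \frac{b^2 - a^2}{2(b - a)}$
Use the laws of integration.

$= \frac{(b - a)(b + a)}{2(b - a)}$
Use the difference of two squares.

$= \frac{a + b}{2}$
Since $b \neq a$, you can cancel the factor $(b - a)$.

You can use a similar method to find the variance.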

Fast forward You are asked to prove this in question 6 in Exercise 4H.

Key point 4.12 Given that $X$ is a random variable following a rectangular distribution over $[a, b]$:

$E(X) = \frac{a + b}{2}$

$\mathrm{Var}(X) = \frac{(b - a)^2}{12}$
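The formulae in Key point 4.12 can be compared with direct numerical integration of the constant density. The end points $a = 2$ and $b = 5$ are assumptions chosen for illustration, and `integrate` is a hypothetical helper.

```python
def integrate(g, lo, hi, n=10_000):
    """Midpoint-rule approximation to the integral of g from lo to hi."""
    h = (hi - lo) / n
    return h * sum(g(lo + (i + 0.5) * h) for i in range(n))


a, b = 2, 5  # hypothetical end points of the rectangular distribution


def f(x):
    """Rectangular pdf: constant height 1 / (b - a) on [a, b]."""
    return 1 / (b - a)


# Mean and variance from the integral definitions (Key point 4.3).
mean = integrate(lambda x: x * f(x), a, b)
variance = integrate(lambda x: x**2 * f(x), a, b) - mean**2

# The same quantities from the formulae in Key point 4.12.
formula_mean = (a + b) / 2
formula_variance = (b - a) ** 2 / 12

print(round(mean, 4), formula_mean, round(variance, 4), formula_variance)
```

A probability such as $P(X < 3)$ needs no integration at all: it is the area of a rectangle of width $3 - a$ and height $\frac{1}{b - a}$.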

WORKED EXAMPLE 4.15

When a measurement is quoted to the nearest

it is equally likely to be anywhere within

of the stated value. A large number of measurements of different objects, all of which round to , are made and their accurate values noted. a Find the probability that an object quoted as being away from

to the nearest

is actually more than

.

b Find the standard deviation of the difference between the quoted value and the true value (with quoted values below the true value giving a negative difference) a

. follows a rectangular distribution over . Required probability is .

Define variables. Identify the distribution. Write the required distribution in mathematical terms. Use areas of rectangles rather than integration.

b

Use Key point 4.5.

Use the formula for 4.12. So, standard deviation =

=

from Key point

EXERCISE 4H 1

Find these probabilities. In parts a to d, follows a rectangular distribution over a

b

c

ii

;

i

;

ii

;

ii d

i ii

e

;

i

i

.

; ; ; ;

i When a measurement is quoted to the nearest cm it is equally likely to be anywhere within of the stated value. Find the probability that a measurement quoted as being to the nearest cm is actually above

.

ii A car’s milometer shows the number of completed miles it has done. Jerry’s car shows miles. What is the probability that it will show miles in the next miles? 2

Find the expected mean and standard deviation of: a

b

i

given that it follows a rectangular distribution over

ii

given that it follows a rectangular distribution over

i the true value of a result quoted as being

to the nearest

ii the true age of a boy who (honestly) describes himself as eighteen years old. 3

A piece is cut off one end of a log of length anywhere along the log, find:

. Given that the cut is equally likely to be made

a the probability that the length of the piece is less than b the expected mean and standard deviation of the length of the piece. 4

A string of length is randomly cut into two pieces. Find the probability that the length of the shorter piece is less than .

5

Five random numbers are selected from the interval smaller than .

6

a Prove, using integration, that the variance of the rectangular distribution between and is

b Hence prove that the ratio 7

. Find the probability that they are all

is independent of and , stating its value.

A rod of length is cut into two parts. The position of the cut is uniformly distributed along the length of the rod. Find the mean and standard deviation of the length of the shorter part.

Section 9: Exponential distribution

When you model the waiting interval until a first success in a Poisson-type situation you can use the exponential distribution. It is defined by the number of successes in a unit interval of time, $\lambda$, and it is written as $X \sim \mathrm{Exp}(\lambda)$. Since the waiting interval is a continuous variable, the probability distribution is described using a probability density function.

Rewind You met the Poisson distribution in Chapter 2.

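Key point 4.13 Given that $X \sim \mathrm{Exp}(\lambda)$, then:

$f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$, and $f(x) = 0$ otherwise.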

You can find the mean and the variance of the exponential distribution by using integration.

Key point 4.14 Given that $X \sim \mathrm{Exp}(\lambda)$, then:

$E(X) = \frac{1}{\lambda}$

$\mathrm{Var}(X) = \frac{1}{\lambda^2}$

Fast forward You are asked to prove the formula for the variance in question 10 in Exercise 4I.

PROOF 5

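Prove that if $X \sim \mathrm{Exp}(\lambda)$, then $E(X) = \frac{1}{\lambda}$.

$E(X) = \int_0^{\infty} x \lambda e^{-\lambda x}\,dx$
Start from the definition of expectation (Key point 4.3). The integral starts from $0$ as this is the lower limit of the probability distribution.

Identify: $u = x$ and $\frac{dv}{dx} = \lambda e^{-\lambda x}$
You need to use integration by parts. As usual when doing integration by parts, start by identifying $u$ and $\frac{dv}{dx}$ ...

So $\frac{du}{dx} = 1$ and $v = -e^{-\lambda x}$
... then find $\frac{du}{dx}$ and $v$.

So $E(X) = \left[-x e^{-\lambda x}\right]_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\,dx$
Use $\int u \frac{dv}{dx}\,dx = uv - \int v \frac{du}{dx}\,dx$.

$= 0 + \left[-\frac{1}{\lambda} e^{-\lambda x}\right]_0^{\infty}$
When $x = 0$ the square bracket term is $0$. It is less obvious what happens when $x \to \infty$, but it turns out that $e^{-\lambda x}$ goes to zero faster than $x$ goes to infinity, so overall it is zero at both limits.

$= 0 - \left(-\frac{1}{\lambda}\right) = \frac{1}{\lambda}$
When $x \to \infty$, $e^{-\lambda x} \to 0$. When $x = 0$, $e^{-\lambda x} = 1$.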

WORKED EXAMPLE 4.16

The number of leaks in any miles of pipes in a sewer system follows a Poisson distribution with mean . a Find the probability that the first leak will be found in the first half mile. b Find the variance of the distance until the first leak is found. a

Define variables. Identify the distribution. Since the number of leaks in miles follows a Poisson distribution, the distance until the first leak will follow an exponential distribution. To find the parameter, you need to find the number of leaks per unit of distance (miles); here it is . Write the required probability in mathematical terms and use the probability density function.

b

Use the formula for

from Key point 4.14.

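You could also be asked to find a probability of a variable with an exponential distribution being greater than a particular value. You can do this by integration, but it is useful to know the cumulative distribution function. You can find this using integration. If $X \sim \mathrm{Exp}(\lambda)$ then

$P(X \le x) = \int_0^x \lambda e^{-\lambda t}\,dt = \left[-e^{-\lambda t}\right]_0^x = 1 - e^{-\lambda x}$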

Rewind You met integration by parts in A Level Mathematics Student Book 2, Chapter 11.

Key point 4.15 If $X \sim \mathrm{Exp}(\lambda)$, then $F(x) = P(X \le x) = 1 - e^{-\lambda x}$ for $x \ge 0$.

The exponential distribution also has a property called memorylessness. Prior waiting does not change how long you are likely to wait for an event. This means that as well as measuring the amount of time until a first event, it also measures the interval between events, as shown in Worked example 4.17.

WORKED EXAMPLE 4.17

During the summer Tanis sneezes on average two times every hour.
a State an assumption that must be made to model the time until the next sneeze by using an exponential distribution.
b Assuming that the time until the next sneeze can be modelled by an exponential distribution, find the exact probability that Tanis goes more than ninety minutes after waking up before sneezing.

a You must assume that sneezes occur independently of each other.

b

Define variables. Identify the distribution. The exponential can be used with

any starting point, so the fact that it is time after waking is not important. Write the required probability in mathematical terms and use the cumulative distribution function. Remember that units are hours.
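The calculation in Worked example 4.17, and the memorylessness property it relies on, can be checked with a short Python sketch. The rate (two sneezes per hour) and the ninety-minute wait come from the example; the function name and the 45 minutes already waited are assumptions chosen purely for illustration.

```python
from math import exp, isclose


def exp_tail(t, rate):
    """P(T > t) for T ~ Exp(rate), with t >= 0."""
    return exp(-rate * t)


rate = 2.0  # sneezes per hour, from the worked example

# Probability of going more than 1.5 hours without sneezing: e^{-3}.
p_no_sneeze = exp_tail(1.5, rate)

# Memorylessness: having already waited s hours does not change the
# probability of waiting a further t hours.
s = 0.75  # assumed time already waited (hours), illustration only
t = 1.5
conditional = exp_tail(s + t, rate) / exp_tail(s, rate)

print("P(T > 1.5)              =", p_no_sneeze)
print("P(T > s + 1.5 | T > s)  =", conditional)
print("equal:", isclose(conditional, p_no_sneeze))
```

The conditional probability equals the unconditional one because $e^{-\lambda(s+t)} / e^{-\lambda s} = e^{-\lambda t}$, whatever the value of $s$.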

EXERCISE 4I 1

Find these probabilities. a

if

i

if

ii b

i ii

c

if if

i waiting more than seconds for an emission from a radioactive substance that emits three alpha particles per minute on average ii waiting less than fifteen minutes for a bus that comes three times per hour on average.

2

Find the expected mean and standard deviation of: a

i ii

b

i the distance travelled in a car before reaching the first pot hole if pot holes along a certain road are spread independently at an average rate of per kilometre ii the time from the beginning of the day until the first phone call at a call centre that receives an average of calls per hour.

3

The number of emails Khaled receives in an hour follows a Poisson distribution with mean . What is the probability that the next email arrives in less than minutes?

4

Birds arrive at a feeding table independently, at an average rate of per hour. a Find the probability that two birds will arrive in the next ten minutes. b Find the probability that there is more than a ten-minute wait before the next bird arrives. c Find the expected mean and standard deviation of the time (in minutes) spent waiting for a bird.

5

When Ben walks down a particular street, he meets people he knows at an average rate of three every 5 minutes. Different meetings are independent of each other. What is the probability that Ben has to walk for more than minutes before he meets a person he knows?

6

The probability of waiting less than minutes for a bus is . If the waiting time is modelled by an exponential distribution, find the probability of waiting more than minutes.

7

The probability of waiting more than minutes for a phone call is . Find an expression for the mean waiting time for a phone call in terms of and , assuming the waiting time can be modelled by the exponential distribution.

8

Show that the probability of a variable with an exponential distribution taking a value larger than its mean is independent of $\lambda$.

9

The number of buses arriving at a bus stop in an hour follows a Poisson distribution with mean .

a Name the distribution which models the time, in minutes, Amanda has to wait until the next bus arrives. State any necessary parameters.

b Given that Amanda has already been waiting for minutes, find the probability that she has to wait at least minutes.

c Show that the answer in part b is the same as the probability that Amanda has to wait at least minutes.

10

Prove that if , then .

11

is the number of successes that occur in one unit of time, so that successes that occur in units of time.

. is the number of

a Write down the distribution of . b Find , giving your answer in terms of and . follows an exponential distribution . c Explain why

.

d Hence prove that the probability density function of is

.

Section 10: Combining discrete and continuous random variables

It is possible for a random variable to be discrete in some parts of its domain and continuous in other parts of its domain. For example, a doctor might measure the masses of babies less than as precisely as possible (creating a continuous part of the random variable) but masses above might be measured to the nearest (creating a discrete part of the random variable). If this is the case you apply all the rules learnt in this chapter and in Chapter 1 but using sums over the discrete part of the random variable and integrals over the continuous part of the random variable.
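The rule "integrate over the continuous part, sum over the discrete part" can be sketched in Python. The distribution used here is hypothetical and is not one from this book: the continuous part has density $kx$ for $0 < x < 1$, and the discrete part has $P(X = 1) = 0.2$ and $P(X = 2) = 0.3$. The helper `integrate` is a simple midpoint rule written for this illustration.

```python
def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation to the integral of f from a to b."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


# Hypothetical mixed distribution, assumed for illustration only.
low, high = 0.0, 1.0            # continuous part: density k * x on (0, 1)
discrete = {1.0: 0.2, 2.0: 0.3}  # discrete part: P(X = 1), P(X = 2)


def shape(x):
    """Continuous density up to the unknown constant k."""
    return x


# Total probability = integral over continuous part + sum over discrete part = 1.
k = (1.0 - sum(discrete.values())) / integrate(shape, low, high)


def pdf(x):
    return k * shape(x)


# Expectation: integral of x f(x) over the continuous part
# plus the sum of x P(X = x) over the discrete part.
mean = integrate(lambda x: x * pdf(x), low, high) + sum(
    x * p for x, p in discrete.items()
)

# Variance uses E(X^2) - [E(X)]^2, with E(X^2) split in the same way.
mean_sq = integrate(lambda x: x**2 * pdf(x), low, high) + sum(
    x**2 * p for x, p in discrete.items()
)
variance = mean_sq - mean**2

print("k =", k)
print("E(X) =", mean)
print("Var(X) =", variance)
```

In this made-up example the end point $x = 1$ of the continuous part is also one of the discrete values, which causes no difficulty in the calculation.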

Tip Notice that in Worked example 4.18 the end point of the continuous part of the variable is a part of the discrete random variable. You might worry about situations like this, but it is perfectly possible to define random variables in this way.

WORKED EXAMPLE 4.18

The random variable can only take the values If

the variable has PDF

When

or

then

, or .

. .

a Find the value of . b Find

.

a Total probability is:

So

therefore

The total probability is an integral over the continuous range plus a sum over the discrete range.

Use the fact that the total probability equals 1. You need to split the expectation into an integral over the continuous part of the variable and a sum over the discrete part.

b

EXERCISE 4J 1

Find and a

b

for these mixed probability distributions.

i

for

ii

for

i

for

ii

for

for for for for

c

d

2

i

for

for

ii

for

for

i

for

ii

for

for for

The random variable is defined for

and for

probability density function is given by

. Between

. It is also known that

and is

the

. Find:

a the value of b c d 3

.

The random variable is defined for Between that P

and is .

and for

.

the probability density function is given by

. It is also known

a Find an expression for in terms of . b Given that 4

, find

.

The mixed random variable can take any value between and to . The distribution is defined by:

as well as integer values from

for for a Find the value of . b Find 5

.

The mixed random variable can take any values from to and the discrete values and . It has cumulative distribution function: for for Find: a b c d

6

.

A mixed random variable can take the discrete values and . It has cumulative distribution function: for for Find: a the values of and

and and continuous values between

b 7

.

The mixed random variable can take any values between and . Between

and , and the discrete values

and it has probability density function

. When

or

.

Prove that there is only one possible value of and find its value. 8

An athletics coach records the

time of a squad of junior sprinters. He records the time of

anyone who runs between and seconds as precisely as he can. Anyone who runs between and seconds gets their time recorded to the nearest tenth of a second. He models the time recorded by this probability distribution: for

for

a Find the value of . b Find

, giving your answer to three decimal places.

c Find the standard deviation of . The true times of the athletes have probability density function: for for d Find the value of . e Find

and comment on your answer in relation to part b.

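Checklist of learning and understanding

• The probability of a continuous random variable taking any single value is a meaningless concept, but it is possible to work with the probability of it being in a given range. To do this you use a probability density function, $f(x)$, such that the area under the curve represents probability. The total area is therefore 1, and the function is never negative.
• The summation formulae for the expectation of discrete random variables become integrals for continuous random variables:
$$E(X) = \int x f(x)\,\mathrm{d}x, \qquad E(X^2) = \int x^2 f(x)\,\mathrm{d}x$$
$\operatorname{Var}(X)$ is still $E(X^2) - [E(X)]^2$.
• The expectation and variance of a linear transformation are given by:
$$E(aX + b) = aE(X) + b, \qquad \operatorname{Var}(aX + b) = a^2 \operatorname{Var}(X)$$
• If $X$ and $Y$ are independent random variables then
$$\operatorname{Var}(aX \pm bY) = a^2 \operatorname{Var}(X) + b^2 \operatorname{Var}(Y)$$
• If $X$ and $Y$ are independent random variables following a normal distribution and $Z = aX + bY$, then $Z$ also follows a normal distribution.
• The expectation of a function of a continuous random variable is given by:
$$E(g(X)) = \int g(x) f(x)\,\mathrm{d}x$$
• The cumulative distribution function, $F(x)$, gives the probability of the random variable taking a value less than or equal to $x$. For a continuous distribution with PDF $f(x)$:
$$F(x) = \int_{-\infty}^{x} f(t)\,\mathrm{d}t$$
• The main uses of cumulative distribution functions are to find percentiles of a distribution and to convert from a distribution of one continuous random variable to a distribution of a function of that variable.
• If $X$ follows a rectangular distribution between $a$ and $b$, then
$$f(x) = \frac{1}{b - a} \text{ for } a \leqslant x \leqslant b, \qquad E(X) = \frac{a + b}{2}, \qquad \operatorname{Var}(X) = \frac{(b - a)^2}{12}$$
• If $X \sim \operatorname{Exp}(\lambda)$, then for $x \geqslant 0$
$$f(x) = \lambda e^{-\lambda x}, \qquad F(x) = 1 - e^{-\lambda x}, \qquad E(X) = \frac{1}{\lambda}, \qquad \operatorname{Var}(X) = \frac{1}{\lambda^2}$$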

Mixed practice 4 1

A continuous random variable has probability density function value of

for

. Find the exact

.

Choose from these options. A B C D 2

and are independent random variables. has mean and standard deviation . has mean and standard deviation . Given that , what is the standard deviation of ? Choose from these options. A B C D

3

The continuous random variable has PDF

and otherwise.

a Find the cumulative distribution function of . b Find

.

c Find 4

.

Given that is a continuous random variable with PDF

, find:

a the value of b the expectation of c the variance of . 5

The Jones’ expected spend on their garden is £ with a variance of £ . This is paid for out of a bank account containing £ .

a What is the standard deviation in the amount remaining in the bank account after the garden has been paid for?

However much the Jones’ spend on their garden, the Smiths will spend twice as much plus £ .

b What is the expected amount that the Smiths will spend?

c What is the standard deviation in the amount that the Smiths will spend?

6

a If is a continuous random variable with PDF

and

, find

the value of the constants and . b Evaluate: i ii 7

The continuous random variable has probability density function and otherwise. a Find the cumulative distribution function of . b Find the exact value of the median of .

8

Given that the continuous random variable has PDF

and

otherwise, find the interquartile range of . You will have to make appropriate use of technology. 9

The probability density function for the continuous random variable is

for

and

otherwise. a Find the value of . b Find

.

c Find

.

d Find the exact value of

.

10 A doctor measures the masses of babies. If a baby has a mass between and the mass is recorded as accurately as possible. If the mass is between and the mass is recorded to the nearest . The doctor models the recorded masses using the random variable with probability distribution defined by:

for

.

There are no masses recorded outside of the range from

to

.

a Write down the value of . b Hence find the value of . c Find d Find

. .

11 The time taken, in minutes, to wash the dishes is modelled by a random variable with expectation and standard deviation . The time taken, in minutes, to clean the table is modelled by a random variable with expectation and standard deviation . In this model and are considered to be independent. a Before leaving Hassan must wash the dishes then clean the table. is the total time this takes. Find the expectation and standard deviation of .

b When Alice visits the jobs can be shared. Hassan washes the dishes and Alice cleans the table. is the time Alice has to wait after finishing cleaning the table before they can leave. Find the expected mean and standard deviation in . c Is the assumption that and are independent reasonable in these situations? d Hassan keeps a record of the total time he spends washing the dishes over days. He assumes that the times taken each day are independent and models the total time in the days using the random variable the standard deviation of ?

. For what values of will

be more than

times

12 The times Markus takes to answer a multiple choice question are normally distributed with mean and standard deviation . He has one hour to complete a test consisting of questions. Assuming the questions are independent, find the probability that Markus does not complete the test in time.

13 The masses of men in a factory are known to be normally distributed with mean and standard deviation . There is an elevator with a maximum recommended load of . With men in the elevator, calculate the probability that their combined mass exceeds the maximum recommended load.

14 Davina makes bracelets by threading purple and yellow beads. Each bracelet consists of seven randomly selected purple beads and four randomly selected yellow beads. The diameters of the beads are normally distributed with standard deviation . The average diameter of a purple bead is and the average diameter of a yellow bead is . Find the probability that the length of a bracelet is less than .

15 The masses of the parents at a primary school are normally distributed with mean variance , and the masses of the children are normally distributed with mean

and and

variance . Let the random variable represent the combined mass of two randomly chosen parents and the random variable represent the combined mass of four randomly chosen children. a Find the mean and variance of

.

b Find the probability that four children weigh more than two parents. 16 A random variable has cumulative distribution function given by

The diagram shows the graph of

.

[Graph of the cumulative distribution function: $x$-axis from 0 to 10, $y$-axis from 0 to 1.]

a Find

.

b Find the median of . c Find the probability density function for . d Show that the mean of is

.

You are given that the variance of is

.

e Find the probability that the mean of a random sample of

values of is greater than .

17 The number of beta particles emitted by a radioactive substance follows a Poisson distribution. The probability of observing no particles in hours is .

a Find the expected waiting time until the first beta particle is observed.

b Find the probability of waiting more than minutes to observe a beta particle.

c Given that no particles have been observed in the first minutes, find the probability that it takes more than hours to observe a beta particle.

18

is a continuous random variable following a rectangular distribution between and , with . a Prove that

.

b Find the cumulative distribution function of . c Two independent observations of are made. Find an expression for the probability that the maximum of these two observations is less than where . 19 The humidity of air is measured by a weather station. It can only take values from to inclusive. It is modelled by a mixed random variable, , with these properties:

Between and

has PDF:

. a Find the value of . b Find

.

c Find the median of . 20 The continuous random variable has cumulative distribution function . Find the probability that in four observations of more than two observations take a value of less than . 21 The continuous random variable has CDF Find the values of , and .

. The median of is

.

22 The marks students scored in a Mathematics test follow a normal distribution with mean and variance . The marks of the same group of students in an English test follow a normal

distribution with mean

and variance

.

a Find the probability that a randomly chosen student scored a higher mark in English than in Mathematics. b Find the probability that the average English mark of a class of their average Mathematics mark.

students is higher than

23 The continuous random variable has probability density function:

a Show that

.

b What is the probability that the random variable has a value that lies between and ? Give your answer in terms of . c Find the mean and variance of the distribution. Give your answers in terms of . The random variable represents the lifetime, in years, of a certain type of battery. d Find the probability that a battery lasts more than six months. A calculator is fitted with three of these batteries. Each battery fails independently of the other two. e Find the probability that at the end of six months: i none of the batteries has failed ii exactly one of the batteries has failed. 24 The random variable has probability density function defined by:

a Sketch the graph of . b Find the exact value of

.

c Prove that the distribution function , for

, is defined by

.

d Hence, or otherwise: i find ii show that the median, , of satisfies the equation

.

e Calculate the value of the median of , giving your answer to three decimal places. [© AQA, 2012] 25 The continuous random variable has probability density function defined by:

a Sketch the graph of .

b Show that: i ii

.

c Hence write down the exact value of: i the interquartile range of ii the median, , of . d Find the exact value of

. [© AQA, 2011]

FOCUS ON … PROOF 1

Sums of discrete independent random variables In this section, you will prove this important result:

Rewind You studied discrete random variables in Chapter 1. If and are discrete independent random variables, then . You need to know: (Theorem 1) (Theorem 2) If and are independent random variables, then (Theorem 3) (Theorem 4) A finite double sum of a sum can be split into two sums:

(Theorem 5)

PROOF 6

On each line, state which of these theorems are being applied. 1

Theorem ____

2

Theorem ____

3

Theorem ____

4

Properties of sums

5

Theorem ___ and Theorem ___

6

Theorem ____

7

Theorem ____

QUESTIONS Use techniques similar to those in Proof 6 to answer these questions. and are independent discrete random variables. 1

Prove that

.

2

a Prove that b Hence prove that

3

Prove that

. . .

FOCUS ON … PROBLEM SOLVING 1

Finding the parameters of a distribution

Often you are not told directly the parameters of a distribution, but have to infer them from given information. If this is the case, sometimes the equations will be impossible to solve directly, so you have to use technology to solve them.

WORKED EXAMPLE

In a Poisson distribution the probability of two events occurring is $0.1$. Find the probability of one event occurring.

If $X \sim \operatorname{Po}(\lambda)$ then $P(X = 2) = \dfrac{e^{-\lambda} \lambda^2}{2} = 0.1$
Write the information given in terms of $\lambda$, the parameter of the Poisson distribution.

[Graph of $y = \dfrac{e^{-\lambda} \lambda^2}{2}$ against $\lambda$, meeting the line $y = 0.1$ at $(0.605, 0.1)$ and $(4.708, 0.1)$.]
This equation is not solvable using standard functions, so you can instead sketch it. You can use graphing technology to find the intersection points.

So $\lambda = 0.605$ or $\lambda = 4.708$.

$P(X = 1) = \lambda e^{-\lambda} = 0.330$ or $0.0425$ (3 s.f.)
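If graphing technology is not to hand, the same intersection points can be found numerically. The Python sketch below uses bisection, which is one possible method rather than the one used in this book. The target probability $0.1$ is read from the intersection points shown in the worked example, and the search brackets $(0.01, 2)$ and $(2, 10)$ are assumptions based on the curve $\frac{1}{2}\lambda^2 e^{-\lambda}$ having its maximum at $\lambda = 2$.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Po(lam)."""
    return exp(-lam) * lam**k / factorial(k)


def bisect(f, lo, hi, tol=1e-10):
    """Find a root of f in [lo, hi], assuming f changes sign on the interval."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("f must change sign on [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid                 # root lies in the left half
        else:
            lo, f_lo = mid, f_mid    # root lies in the right half
    return (lo + hi) / 2


target = 0.1  # P(X = 2), read from the graph in the worked example


def equation(lam):
    return poisson_pmf(2, lam) - target


# One root lies on each side of the maximum at lam = 2.
roots = [bisect(equation, 0.01, 2.0), bisect(equation, 2.0, 10.0)]

for lam in roots:
    print("lambda =", round(lam, 3), " P(X = 1) =", round(poisson_pmf(1, lam), 4))
```

Because the equation has two solutions, the information given is not enough to fix a single value of $\lambda$, so both resulting probabilities must be quoted.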

QUESTIONS

1 Given that and , find .

2 Given that and , find .

3 Given that and to decimal places, find .

4 The probability of a biased coin showing a head is $p$.

a In tosses, one head is observed. Show that the probability of this happening is .

b In another tosses, two heads are observed. Show that the probability of this happening, and the observation in part a happening, is .

c By using technology or otherwise, find the value of $p$ that maximises the probability of getting three heads in tosses.

Did you know? This type of method is called maximum likelihood estimation and is a very powerful tool in advanced statistics. You could research and list the uses of this.

FOCUS ON … MODELLING 1

Situations for the Poisson distribution

The Poisson distribution is frequently used to model situations in which there is a rate of events. However, it can be applied incorrectly because there are several conditions that must be met:
• the process must be random, so that it is not totally predictable
• there must be a constant average rate, not something that changes in different areas or over time
• the events must be independent of each other.

QUESTIONS

To help you to understand the required conditions in context, here are some examples of real-life situations. Comment on whether the Poisson distribution would be an appropriate model in each of these situations. Where the Poisson is not appropriate, state which conditions are not met.

Did you know? In several of the situations in which the Poisson conditions are not perfectly met, in reality statisticians still use the Poisson model to make useful predictions. This is because all models are imperfect and the errors in estimating the average rate might well be larger than the errors caused by a weak dependency between the events. When interpreting models it is vital to understand the sources and scales of uncertainty in the output.

1 The number of fish in a volume of an ocean where fish occur at an average rate of per .

2 The number of signals received in an hour by a mobile phone from a communication mast when a signal is received every seconds.

3 The number of beta particles emitted every minute by a radioactive substance that emits on average beta particle every seconds.

4 The waiting time for a bus when one arrives on average every minutes.

5 The number of errors in pages of a textbook if there is an average of error on every pages.

6 The number of fish caught in ten hours in a small pond if an average of fish are caught every hour.

7 The number of girls in randomly selected people if it is expected that of the population are girls.

8 A binomial distribution when $n$ is very large.

Tip You might want to use technology to confirm your answer to question 8.

5 Further hypothesis testing

In this chapter you will learn how to:
• interpret the different types of errors that can be made while conducting hypothesis tests, called type I and type II errors
• calculate the probability of a type I error based on a Poisson distribution
• calculate the probability of a type I error based on a binomial distribution.

If you are following the A Level course, you will also learn how to:
• use a new type of hypothesis test for the mean, called a $t$-test
• calculate the probability of a type II error based on a Poisson distribution
• calculate the probability of a type II error based on a binomial distribution
• calculate the probability of type I and type II errors based on a normal distribution.

Before you start … A Level Mathematics Student Book 1, Chapter 22

You should know how to calculate unbiased estimates of the population variance.

1 Find an unbiased estimate of the variance of a population based on this sample: .

A Level Mathematics Student Book 1, Chapter 22

You should know how to conduct hypothesis tests, using the binomial distribution.

2 A six-sided dice is rolled five times and three sixes are observed. Test at the significance level if this provides evidence that the dice is biased towards rolling more sixes than a fair dice.

Chapter 2

You should know how to conduct hypothesis tests, using the Poisson distribution.

3 The number of bees arriving at a flower is modelled by a Poisson distribution. If six bees arrive in one minute, does this provide evidence at the significance level that the true mean is greater than ?

A Level Mathematics Student Book 2, Chapter 21

You should know how to conduct calculations with the normal distribution.

4 If , find .

A Level Mathematics Student Book 2, Chapter 22

You should know how to conduct hypothesis tests with the normal distribution.

5 A sample of objects drawn from a normal distribution with a standard deviation of has a mean of . Conduct a two-tailed test at significance to decide if this provides significant evidence of a change from a mean of .

Rewind You studied hypothesis testing using the normal distribution in A Level Mathematics Student Book 2, Chapter 22.

Realistic hypothesis testing

When studying hypothesis testing using the normal distribution (the $Z$-test), you might have wondered about the conditions required for using it. It is a test in which you are uncertain about the population mean but you do know the population variance. This is not a situation that occurs very frequently. More often, you need to use the sample to estimate the variance of the population. To do this, you use a $t$-test.

One of the reasons hypothesis tests are so important in modern statistics is that they try to give a probability of certain types of error. You will see in this chapter which types of errors are controlled and which are not, and how to calculate their probabilities.

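Section 1: $t$-tests

In a $Z$-test to see if the population mean has changed from $\mu_0$ you are testing the hypotheses:

$H_0\colon \mu = \mu_0$, $H_1\colon \mu \neq \mu_0$

You calculate the $Z$-score for $\bar{x}$, your mean of a sample of size $n$:

$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$

where $\bar{x}$ and $n$ are found from the sample while $\mu_0$ is the value in the null hypothesis and $\sigma$, the population standard deviation, is assumed to be the same as a previously held value. You can use the fact that $Z \sim N(0, 1)$ to do calculations with this statistic.

Tip $\bar{X}$ is a random variable representing the sample mean. $\bar{x}$ is a particular observation of this random variable.

If you do not know $\sigma$ (or have reason to believe that it has changed) you must instead estimate it from the data. An appropriate way to estimate this is using the square root of the unbiased estimate of the variance, $s$. You can then construct a $T$-score:

$$T = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

[Figure: a normal distribution with mean $\mu$ and standard deviation $\sigma$, with sample values $x_1$, $x_2$, $x_3$, their mean $\bar{x}$ and their standard deviation $s$ marked.]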

$Z$-scores very rarely take values far from zero. However, if your sample just happens to have a very small standard deviation (for example, if the sample is $x_1$, $x_2$ and $x_3$ in the graph shown) then the $T$-score can get quite large. This highlights that $T$ does not follow a normal distribution. The likelihood of getting a very tightly clustered sample depends on $n$. At low values of $n$ this possible clustering has a very big effect, but at large values of $n$ the sample standard deviation is a very good approximation of the population standard deviation. This means that there are lots of different $t$-distributions, depending on the value of $n$.

Tip Although the $t$-distribution gives the distribution of the random variable $T$, conventionally it is written with a lower case $t$.

[Figure: the $Z$-distribution compared with $t$-distributions for $n = 7$ and $n = 4$.]

As the value of $n$ grows, the $t$-distribution gets closer and closer to the $Z$-distribution.
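The difference between the two scores can be seen in a small simulation. The Python sketch below draws repeated samples from a standard normal population and computes, for each sample, a score using the true standard deviation and a score using the sample's own estimate. The sample sizes (4 and 30), the number of trials, the random seed and the cut-off of 3 for an "extreme" score are all arbitrary choices made for this illustration.

```python
import random
from math import sqrt


def simulate_scores(n, trials, seed=0):
    """Return lists of Z-scores and T-scores for samples of size n from N(0, 1)."""
    rng = random.Random(seed)
    z_scores, t_scores = [], []
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        x_bar = sum(sample) / n
        # Square root of the unbiased estimate of the variance (divisor n - 1).
        s = sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))
        z_scores.append(x_bar / (1.0 / sqrt(n)))  # uses the known sigma = 1
        t_scores.append(x_bar / (s / sqrt(n)))    # uses the estimate s
    return z_scores, t_scores


def tail_fraction(scores, cutoff=3.0):
    """Proportion of scores whose modulus exceeds the cutoff."""
    return sum(abs(score) > cutoff for score in scores) / len(scores)


z4, t4 = simulate_scores(n=4, trials=5_000)
z30, t30 = simulate_scores(n=30, trials=5_000)

print("n = 4:  extreme Z:", tail_fraction(z4), " extreme T:", tail_fraction(t4))
print("n = 30: extreme Z:", tail_fraction(z30), " extreme T:", tail_fraction(t30))
```

Extreme $T$-scores are far more common than extreme $Z$-scores when the sample is small, and the gap narrows as the sample size grows.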

Key point 5.1 The $t$-test is based on the test statistic:

$$T = \frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t_{n-1}$$

This will be given in your formula book.

You might wonder why there is an $n - 1$ in the formula in Key point 5.1. Rather than using the value of $n$ to describe the $t$-distribution, conventionally you use the degrees of freedom, $\nu$. Because you fix one parameter, the population mean, when you are doing a $t$-test you use the formula $\nu = n - 1$.

Tip Some graphical calculators can perform a $t$-test. You should state the test statistic, its distribution and the $p$-value from your calculator. If the mean and standard deviation of the sample are not given in the question you should state those too: your calculator will find them in the process of performing the $t$-test.

To conduct a $t$-test, you calculate the value of $T$ and then look up the critical value from the table given in the formula book. This still requires some work as the information is given in terms of cumulative probabilities. To do a one-tailed test with significance level $\alpha$, you look up the column headed by $1 - \alpha$ in the table. To do a two-tailed test with significance level $\alpha$, you look up the column headed by $1 - \frac{\alpha}{2}$ in the table. If the modulus of your $T$-score is more than this value, you reject $H_0$.

[Figure: a one-tailed test, with area $1 - \alpha$ below the critical value and $\alpha$ in the tail; and a two-tailed test, with area $\frac{\alpha}{2}$ in each tail.]
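The steps of a two-tailed $t$-test can be sketched in Python using only the standard library. The sample, the hypothesised mean of 250 and the 5% significance level are made-up values for illustration and are not taken from any example in this book. The critical value $2.776$ is the standard tabulated value for $\nu = 4$ in the $0.975$ column; it is supplied by hand here, just as you would look it up in the formula book.

```python
from math import sqrt
from statistics import mean, stdev


def t_statistic(sample, mu0):
    """Return the T-score for H0: mu = mu0 and the degrees of freedom."""
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)  # square root of the unbiased variance estimate
    return (x_bar - mu0) / (s / sqrt(n)), n - 1


# Hypothetical data, assumed for illustration only.
sample = [248, 251, 243, 256, 247]
mu0 = 250

t_value, dof = t_statistic(sample, mu0)

# Two-tailed test at 5%: look up the 0.975 column with nu = 4.
critical_value = 2.776
reject_h0 = abs(t_value) > critical_value

print("T =", round(t_value, 3), "with", dof, "degrees of freedom")
print("reject H0" if reject_h0 else "insufficient evidence to reject H0")
```

Here the modulus of the $T$-score is well below the critical value, so there is insufficient evidence at the 5% level that the mean differs from 250. The test assumes the underlying population is normally distributed.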

WORKED EXAMPLE 5.1

The label of a pre-packaged steak claims that it has a mass of steaks is taken and their masses are:

. A random sample of

.

Test at the significance level whether the label’s claim is accurate, stating any assumptions you need to make. mass of a steak in

Define variables. You must use the -test since the true variance is unknown, but to do this test the underlying distribution must be normal.

Assume that

State the hypotheses. State the test statistic and its distribution.

From your calculator:

Find sample statistics from your calculator. Since you do not know the true population variance, you use an unbiased estimate, .

So

Calculate your $T$-score.

The critical value is .
To find the critical value when and for a two-tailed test you look in the column in the table given in the formula book.

, therefore do not reject $H_0$; there is no significant evidence to doubt the label’s claim.
Compare your calculated $T$-score with the critical value and conclude, putting the conclusion into context.

Common error Conclusions are often stated without a sense of statistical uncertainty. For example, it would be wrong to state that the conclusion of the test in Worked example 5.1 is: ‘The label is correct.’

EXERCISE 5A 1

In each of these situations it is believed that is normally distributed. Decide the result of the test if it is conducted at the significance level. a

;

i

;

ii b

;

i

;

ii c

i

Data:

Data:

ii 2

John believes that the average time taken for his computer to start is seconds. To test his belief, he records the times (in seconds) taken for the computer to start:

a State suitable hypotheses. b Test John’s belief at the

significance level.

c Justify your choice of test, including any assumptions required. 3

Michael regularly buys packets of tea. He has noticed recently that he gets more cups of tea than usual out of one packet, and suspects that the packets contain more than on average. He weighs eight packets and finds that their mean mass is and the standard deviation of their masses is .

a Find the unbiased estimate of the variance of the masses, based on Michael’s sample.

b Assuming that the masses are normally distributed, test Michael’s suspicion at the level of significance.

4

The crawling ages of babies in a nursery are recorded. The sample has mean months and standard deviation months. A parenting book claims that the average age for babies crawling is months. Test at the level whether babies in the nursery crawl significantly earlier than average, assuming that the distribution of crawling ages is normal.

5

Penelope thinks that cleaning the kettle will decrease the amount of time it takes to boil ( seconds). She knows that the average boiling time before cleaning is seconds. After cleaning, she boils the kettle times and summarises the results as:

a State suitable hypotheses. b Test Penelope’s idea at the 6

significance level.

A national survey of athletics clubs found that the mean time for a -year-old athlete to run is . A coach believes that athletes in his club are faster than average. To test his belief he collects the times for athletes from his club and summarises the results in this table. Time,

Frequency

a Calculate an estimate of the mean time for the athletes in the club. b Find an unbiased estimate of the population variance based on this sample. c Test the coach’s belief at the

level of significance.

d State what assumption you have made about the distribution of the athletes’ times. 7

The lengths of bananas are found to follow a normal distribution with mean Roland has recently changed banana supplier and wants to test whether their mean length is different. He takes a random sample of bananas and obtains these summary statistics:

a State suitable hypotheses for Roland’s test.

b Test at the significance level whether the data support the hypothesis that the mean length of Roland’s bananas is different from . c Roland’s assistant Sonia suggests that they should test whether the mean length of bananas from the new supplier is less than . i State suitable hypotheses for Sonia’s test. ii Find the outcome of Sonia’s test at the 8

significance level.

The manufacturers of tins of soup claim that the tins contain, on average, of soup. Aki wants to test if this is an accurate claim. She samples tins of soup and finds that they have a mean of and an unbiased estimate of the population standard deviation of . a State appropriate null and alternative hypotheses. b For what values of will Aki reject the null hypothesis at the

significance level?

Section 2: Errors in hypothesis testing

Defining type I and type II errors

The acceptable conclusions to a hypothesis test are:
1 sufficient evidence to reject $H_0$ at the significance level
2 insufficient evidence to reject $H_0$ at the significance level.

It is always possible that these conclusions are wrong.

If the first conclusion is wrong (i.e. you have rejected $H_0$ while it was true) it is called a type I error (spoken as ‘type one error’).

If the second conclusion is wrong (i.e. you have failed to reject $H_0$ when you should have done) it is called a type II error (spoken as ‘type two error’).

Key point 5.2 In hypothesis tests, type I and type II errors can be summarised as:

                          Reality
                          $H_0$ is true     $H_0$ is false
Claim   $H_0$ is true     correct           type II error
        $H_0$ is false    type I error      correct

WORKED EXAMPLE 5.2

In a court case, defendants are presumed innocent.
a What are the null and alternative hypotheses in this situation?
b What would a type I error be in this situation?
c What would a type II error be in this situation?

a $H_0$: Defendant is innocent. $H_1$: Defendant is guilty.

b A type I error would be saying that an innocent person is guilty.
A type I error is rejecting $H_0$ when it was true.

c A type II error would be saying that a guilty person is innocent.
A type II error is not rejecting $H_0$ when it was false.

Probability of type I errors You cannot eliminate these errors, but you can find the probability that they occur. For a type I error to occur, the test statistic must fall within the rejection region while was true. The critical region is designed so that this probability is the significance level.

Key point 5.3 In a hypothesis test, the probability of a type I error is equal to the actual significance level.

The phrase ‘actual significance level’ is used because you might find when testing a discrete random variable that you cannot create a critical region that has exactly the desired significance level. If you are asked to design a test with a stated significance level, the discrete nature of the variable might force you to create a critical region whose actual significance level is only near to the stated one. Conventionally, you would choose the largest significance level you can that is less than the stated level.

You might not be told the significance level of a test, in which case you need to use a formula to calculate it.

Key point 5.4 In a hypothesis test:

$$P(\text{type I error}) = P(\text{rejecting } H_0 \mid H_0 \text{ is true})$$

WORKED EXAMPLE 5.3

and these hypotheses are tested: The null hypothesis is rejected if

. Find the probability of a type I error in this test. Use the definition of a type I error.

Use the tables for the Poisson distribution from the formula book.

WORKED EXAMPLE 5.4

Derren wants to test whether a six-sided dice is biased towards rolling sixes. He rolls a dice times. a State appropriate null and alternative hypotheses. b If Derren is using a

significance level, how many sixes would he have to see to conclude at

significance that the dice is biased? c For the test proposed in part b, find the probability of a type I error. a If is the probability of rolling a then:

b If is the number of sixes rolled then if is true,

: Use your calculator to work out the probabilities from the binomial distribution. Since you are looking for evidence of

, you need to find

as this is an event as extreme as observed, or more extreme in the direction of the alternative hypothesis, i.e. it is the -value of an observed .

The dice can be said to be biased at the significance level if or more sixes are observed. c The probability of a type I error is .

You need to find the first value of that has a value less than . You can just read this from the table. It is the probability of observing five or more sixes while

the null hypothesis is true.
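The idea of an actual significance level can be checked numerically. The sketch below is not the calculation from the worked example: it assumes a hypothetical test in which a dice is rolled 20 times, with $H_0$: $p = \frac{1}{6}$ against $H_1$: $p > \frac{1}{6}$ at a nominal 5% level, and the function names are illustrative only. It finds the smallest number of sixes whose upper-tail probability is below 5%, then reports that tail probability, which is the actual significance level and therefore the probability of a type I error.

```python
from math import comb


def binomial_upper_tail(n, p, k):
    """Return P(X >= k) for X ~ B(n, p)."""
    return sum(comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(k, n + 1))


def upper_critical_value(n, p, alpha):
    """Return the smallest k with P(X >= k) < alpha, or None if there is none."""
    for k in range(n + 1):
        if binomial_upper_tail(n, p, k) < alpha:
            return k
    return None


# Assumed values for illustration only (not taken from the text).
n_rolls, p_null, nominal_level = 20, 1 / 6, 0.05

critical = upper_critical_value(n_rolls, p_null, nominal_level)
actual_level = binomial_upper_tail(n_rolls, p_null, critical)

print("Reject H0 if the number of sixes is at least", critical)
print("Actual significance level = P(type I error) =", round(actual_level, 4))
```

Under these assumed values the critical region is 7 or more sixes and the actual significance level is about 3.7%, which is below the nominal 5% because the binomial distribution is discrete.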

Probability of type II errors
If the true mean is anything other than that suggested by the null hypothesis and you have not rejected the null hypothesis, then you have made a type II error (see Key point 5.2). The probability of a type II error depends upon the true value of the population parameter. Suppose you are testing at the significance level the null hypothesis with a standard deviation of . If the true mean were you would expect to be able to detect this very easily. If the true mean were you might have greater difficulty distinguishing this from . If you knew the true mean, then you could find the probability of an observation of this distribution falling in the acceptance region for $H_0$.

Rewind You could also answer the type of test shown in Worked example 5.4 using a chi-squared test (see Chapter 3 for a reminder of using a chi-squared test) with two categories: six and not six. However, this would have only one degree of freedom, so the fact that the chi-squared test is only approximate is particularly problematic. It is preferable to use an exact binomial test, as shown.

The following diagrams show the effect of different true population means on this test. In the first diagram the true mean is , so anything that falls in the red region gets this right, but falling in the blue regions results in a type I error. In the second, third and fourth diagrams the true mean is getting further and further away from . Anything falling in the red regions picks this up, but anything falling in the blue regions fails to detect that the true mean is not . All of these are now type II errors.

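[Figure: four diagrams of the distribution of the test statistic, each divided into a lower rejection region, a central acceptance region and an upper rejection region. Panel 1: $H_0$ is true ($\mu = 120$), with 2.5% in each rejection region (5% type I error) and 95% correctly not rejecting $H_0$. Panel 2: $H_0$ is untrue, small difference ($\mu = 120.4$). Panel 3: $H_0$ is untrue, medium difference ($\mu = 130$). Panel 4: $H_0$ is untrue, large difference ($\mu = 160$). Panels 2 to 4 mark the areas corresponding to a type II error and to correctly rejecting $H_0$.]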

Key point 5.5 In a hypothesis test:

$$P(\text{type II error}) = P(\text{not rejecting } H_0 \mid H_0 \text{ is false})$$

WORKED EXAMPLE 5.5

Internet speeds to a household are normally distributed with a standard deviation of . The internet provider claims that the average speed of an internet connection has increased above its long-term value of . A sample is taken on occasions and a hypothesis test is conducted at the significance level. Find the probability of a type II error if the true average speed is .

is the continuous random variable speed of internet connection

Define variables.

State hypotheses.

,

State the test statistic and its distribution (assuming is true).


Decide the range of that falls into the tailed acceptance region. So do not reject

if

.

State the acceptance region.

Use the definition of a type II error.
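The steps of a type II error calculation for a normal mean can be sketched in code. The values below are assumptions chosen purely for illustration (they are not the values of the worked example): $H_0$: $\mu = 20$ against $H_1$: $\mu > 20$, with $\sigma = 4$, $n = 16$, a 5% significance level and a true mean of 22. The function name is hypothetical. The sketch first finds the boundary of the acceptance region under $H_0$, then finds the probability that the sample mean falls inside that region when the true mean applies.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_upper_tail(mu0, sigma, n, alpha, true_mu):
    """One-tailed (upper) test of a normal mean with known sigma.

    Returns the critical value for the sample mean and P(type II error).
    """
    standard_error = sigma / sqrt(n)
    # Under H0 the sample mean is N(mu0, sigma^2 / n); reject H0 above this value.
    critical = NormalDist(mu0, standard_error).inv_cdf(1 - alpha)
    # Type II error: the sample mean lands in the acceptance region
    # although the true mean is true_mu.
    beta = NormalDist(true_mu, standard_error).cdf(critical)
    return critical, beta


# Assumed values for illustration only.
critical, beta = type_ii_upper_tail(mu0=20, sigma=4, n=16, alpha=0.05, true_mu=22)
power = 1 - beta

print("Do not reject H0 if the sample mean is below", round(critical, 3))
print("P(type II error) =", round(beta, 3), " power =", round(power, 3))
```

With these assumed values the probability of a type II error is roughly 0.36, so the test detects the shift in the mean only about 64% of the time.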

An important concept when studying tests is the power of a test. It is defined as the probability of rejecting $H_0$ when it is false, so it is the probability of not making a type II error.

Key point 5.6 In a hypothesis test:

$$\text{Power} = 1 - P(\text{type II error})$$

WORKED EXAMPLE 5.6

A call centre believes that it receives calls at an average rate of per hour. To test this it looks at the number of calls in a two-hour period. If that number is greater than or lower than , it rejects the hypothesis that the average rate is per hour. Given that the actual rate of calls is calls per hour, find the power of the test. number of calls in two hours

You are considering the actual rate rather than the rate under the null hypothesis.

To find the probability of a type II error you look at how likely is to fall into the acceptance region.
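A power calculation for a Poisson test follows the same pattern and can be sketched as below. The numbers are assumptions for illustration, not those of the worked example: under $H_0$ the mean number of calls in two hours is taken to be 10, $H_0$ is rejected if the count is below 5 or above 15, and the actual mean is taken to be 14. The helper names are hypothetical.

```python
from math import exp, factorial


def poisson_cdf(k, lam):
    """Return P(X <= k) for X ~ Po(lam); gives 0 when k is negative."""
    return sum(exp(-lam) * lam**r / factorial(r) for r in range(k + 1))


def prob_in_acceptance_region(lower, upper, lam):
    """Return P(lower <= X <= upper) for X ~ Po(lam)."""
    return poisson_cdf(upper, lam) - poisson_cdf(lower - 1, lam)


# Assumed values for illustration only: acceptance region is 5 <= X <= 15.
null_mean, true_mean = 10, 14

significance = 1 - prob_in_acceptance_region(5, 15, null_mean)  # P(type I error)
beta = prob_in_acceptance_region(5, 15, true_mean)              # P(type II error)
power = 1 - beta

print("Significance level:", round(significance, 4))
print("P(type II error):", round(beta, 4), " power:", round(power, 4))
```

Note that the significance level uses the rate under the null hypothesis, while the type II error and the power use the actual rate. Here the significance level is about 7.8% and the power is about 0.33.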

EXERCISE 5B 1

Given that a

i ii

, find the probability of a type I error for each of these situations. ; reject ; reject

if if

b

c

i

; reject

if

i

; reject

if

Given that a

b

3

b

ii

; reject

if

i

; reject

if

; reject

i

; reject

if

or

ii

; reject

if

or

, find the probability of a type I error for each of these situations.

i

; reject

if

ii

; reject

if

i

; reject

if

; reject

i

or

if

with

or

b

i

with

i

with

,

significance; ; significance;

; ,

,

with

ii

; ,

with

ii c

,

with

ii

significance; ; significance;

; ,

significance; ;

significance;

. In reality,

.

. In reality, . In reality,

. .

. In reality, In reality, . In reality,

. . .

Given that , find the probability of a type II error for each of these situations. Find also the power of the test. a

b

c

i

b

; real

if

; real

i

; reject

if

; real

ii

; reject

if

; real

i

; reject

if

or

; real

or

; real

; reject

if

, find the probability of a type II error in each of these situations.

i

; reject

if

; true

ii

; reject

if

; true

i

; reject

if

; true

ii c

if

; reject

Given that a

; reject

ii

ii 6

if

Find the probability of a type II error for each of these situations. The sample mean is being tested and the sample size, , is specified in each case. a

5

or

, find the probability of a type I error in each of these situations. if

ii 4

if

; reject

Given that a

; reject

or

i

ii c

if

ii

ii 2

; reject

; reject

if

; true

i

; reject

if

or

; true

ii

; reject

if

or

; true

7

What are the advantages and disadvantages of increasing the significance level of a hypothesis test?

8

A television magician tries to trick an audience into believing that a coin is biased. He records himself tossing a fair coin many hundreds of times until he tosses ten heads in a row. He then shows the audience the film containing only the ten heads being tossed.
a State the null and alternative hypotheses in this situation.
b If an audience member believed that the coin is biased, is this an example of a type I or a type II error?

9

A textbook says that there is positive correlation between two variables if the sample correlation coefficient is more than . Describe, in this context, what is meant by:
a a type I error
b a type II error.

10 A student conducts a binomial hypothesis test to see if a six-sided dice is fair. He rolls the dice times and if he sees more than sixes, he will claim that the dice is biased.
a Describe, in the context of this test, what is meant by:
i a type I error
ii a type II error.
b State two changes to the test that would make a type II error less likely.

11 The numbers of people arriving at a health club follow a Poisson distribution with mean per hour. After a new swimming pool is opened, the management want to test whether the number of people visiting the club has increased.
a State suitable null and alternative hypotheses.
b They decide to record the number of people arriving at the club during a randomly chosen hour, and to reject the null hypothesis if this number is larger than . Find the significance level of this test. Comment on your result.

12 A long-term study suggests that traffic accidents at a particular junction occur randomly at a constant rate of per week. After new traffic lights are installed, it is believed that the number of accidents has decreased. The number of accidents over a -week period is recorded.
a Let denote the average number of accidents in a -week period. State suitable hypotheses involving .
b It is decided to reject the null hypothesis if the number of accidents recorded is less than or equal to . Find the probability of making a type I error.
c The average number of accidents has in fact decreased to per week. Find the probability of making a type II error in this test.

13 The masses of eggs are known to be normally distributed with standard deviation . Dhalia wants to test whether eggs produced by her hens have mass greater than on average.
a State suitable null and alternative hypotheses to test Dhalia’s idea.
Dhalia weighs eggs and finds that their average mass is .
b Test at the significance level whether Dhalia’s eggs have mass greater than on average. State your conclusion clearly.
c Write down the probability of making a type I error in this test.
d What is the smallest average mass of the eggs that would lead Dhalia to reject the null hypothesis?
e Given that the average mass of Dhalia’s eggs is actually , find the power of the test.

14 A coin is flipped times. It is decided that it is a biased coin if or heads are observed.
a State the null and alternative hypotheses.
b Find the significance level of the test.
c Given that the true probability of flipping a head is , find the probability of a type II error as a function of .
d Show that the probability of a type II error is maximised when .

15 A population is known to have a normal distribution with a variance of and an unknown mean . It is proposed to test the hypotheses using the mean of a sample of size .
a Find the appropriate critical regions corresponding to a significance level of:
i
ii
b Given that the true population mean is , calculate the probability of making a type II error when the level of significance is:
i
ii

16 The number of worms in a square metre of forest is known to follow a Poisson distribution. The mean is thought to be . This is rejected if no worms are observed when a square metre is observed. If the true mean is , find an expression in terms of for the power of this test.

Checklist of learning and understanding
A $t$-test is a way of testing to see if a sample provides evidence of a change in the population mean from a previously held belief. It is based on the $t$-score:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

A type I error is falsely rejecting $H_0$.
A type II error is not rejecting $H_0$ when it is false.

Mixed practice 5 1

Find the value of the -score (to three significant figures) for these data when testing the null hypothesis .

Choose from these options. A B C D 2

What is the definition of the significance level in a hypothesis test of a continuous parameter? Choose from these options. A B C D

3

A chemist collects data on the volume of should be .

produced in a reaction. He believes that it

a Write down null and alternative hypotheses for the chemist’s belief. He measures the reaction times and gets these results: . b Find the -score for these data. c Hence conduct a -test at the

significance level.

d What assumption have you made in conducting a -test? 4

a Give the definition of a type II error. A one-tailed -test is conducted for the hypotheses: , It is known that It is decided to reject

if a mean of observations is less than

.

b Find the probability of a type I error in this test. In reality,

.

c Find the probability of a type II error in this test. d What could be done to decrease the probability of a type II error without changing the probability of a type I error? 5

The number of beta particles emitted in one second by an isotope of a radioactive element is known to follow a distribution. A theory suggests that , but a physicist believes that this might be an underestimate.
a State the null and alternative hypotheses.
b The physicist decides that he will reject the null hypothesis if he sees more than beta particles in a five-second period. What is the significance level of this test?
c In reality, . Find the power of this test.

6

A union representative wishes to test a company’s claim that it pays an average salary of £

. She suspects that the company pays less than this.

a Write down null and alternative hypotheses for her test. The union representative takes a random sample of employees and finds their wages in thousands of pounds. Her results are summarised here:

b Find an unbiased estimate of the variance of . c What is the -score for her results? d Conduct a -test at the 7

A coin is flipped

significance level to test her suspicion.

times and it is decided that it is a biased coin if more than heads or

fewer than heads are observed. a State the null and alternative hypotheses. b Find the significance level of the test. c If the coin is actually biased so that heads occur test. 8

of the time, find the power of the

Safeerah regularly cycles to and from work. She has a steel-framed bicycle that weighs . Her mean journey time for the round trip is minutes. Her friend, Josh, has a carbonframed bicycle that weighs . Safeerah is thinking of buying a carbon-framed bicycle to reduce her journey time, and Josh agrees to lend her his bicycle so that she can try it. a The carbon-framed bicycle is sold using the slogan: ‘Less weight means more speed’. Safeerah, who weighs , is expecting that the per cent reduction in bicycle mass will substantially reduce her journey times. Josh tells her not to expect this as the resultant mass reduction is actually closer to per cent. Justify Josh’s figure of per cent. b Safeerah records her journey times with the carbon-framed bicycle on

typical days as:

Assuming that these times may be regarded as a random sample from a normal distribution, test, at the significance level, whether her mean journey time with the carbon-framed bicycle is less than minutes. [© AQA 2013] 9

A company manufactures bath panels. The bath panels should be deep, but a small amount of variability is acceptable. The depths are known to be normally distributed with

standard deviation

.

a In order to check that the mean depth is , Amir takes a random sample of bath panels from the current production and measures their depths, in millimetres, with these results.

Test whether the current mean is

, using the

significance level.

b Isabella, a manager, tells Amir that, in order to check whether the current mean is , it is necessary to take a larger sample. Amir therefore takes a random sample of size

from the current production and finds that the mean depth is

Test whether the current mean is the significance level.

.

, using the data from this second sample and

c It is proposed to carry out hypothesis tests at regular intervals to check that the mean remains at . Amir proposes that the tests be based on random samples of size , but Isabella favours random samples of size . Explain which, if either, sample size would lead to a smaller risk: i of a type I error ii of a type II error. [© AQA 2011] 10 A town council wanted residents to apply for grants that were available for home insulation. In a trial, a random sample of residents was encouraged, either in a letter or by a phone call, to apply for the grants. The outcomes are shown in the table.

Applied for grant

Did not apply for grant

Total

Letter Phone call Total

a The council believed that a phone call was more effective than a letter in encouraging people to apply for a grant. Use a -test to investigate this belief at the significance level. b After the trial, all the residents in the town were encouraged, either in a letter or by a phone call, to apply for the grants. It was found that there was no association between the method of encouragement and the outcome. State, with a reason, whether a type I error, a type II error or neither occurred in carrying out the test in part a. [© AQA 2013]

6 Confidence intervals
In this chapter you will learn how to:
estimate the interval in which a population parameter lies, called a confidence interval
estimate the confidence interval when the population variance is known
use confidence intervals to conduct hypothesis tests.
If you are following the A Level course, you will also learn how to:
find confidence intervals when the population variance is unknown, using the $t$-distribution.

Before you start…
A Level Mathematics Student Book 1, Chapter 22: You should know how to calculate unbiased estimates of the population variance.
1 Calculate an unbiased estimate of the variance of a population based on this sample: .

A Level Mathematics Student Book 2, Chapter 21: You should know how to conduct calculations with the normal distribution.
2 Given that , find .

A Level Mathematics Student Book 2, Chapter 22: You should know how to conduct hypothesis tests with the normal distribution.
3 A sample of objects drawn from a normal distribution with a standard deviation of has a mean of . Conduct a two-tailed test at significance to decide if this provides significant evidence of a change from a mean of .

Chapter 5: You should know how to conduct $t$-tests.
4 Based on this sample, test to see if there is evidence that at the significance level: .

What is the best way of describing an estimate? If you want to estimate a population parameter, which is better: having a single value that is very unlikely to be correct, or having a range of values that is very likely to contain the population parameter? The latter is usually preferable, and is called a confidence interval. In this chapter you will learn how to construct confidence intervals for the mean in different situations and see how such intervals can be interpreted.

Section 1: Confidence intervals
A single value calculated from a sample used to estimate a population parameter is called a point estimate. You are trying to find an interval that has a specified probability of including the true population value of the statistic you are interested in. This interval is called a confidence interval and the specified probability is called the confidence level.

For example, given the data , you can calculate the sample mean, which is . However, it is very unlikely that the mean of the population this sample was drawn from is exactly . You will now develop a method that will allow you to say with confidence that the population mean is somewhere between and . This does not mean that there is a probability of that the true mean is between and , but rather that of confidence intervals constructed from samples like this would contain the true mean.

To develop the theory, you are going to look at creating confidence levels, which are the default choice. Suppose you are estimating the population mean, using the sample mean . Initially, you will only consider random variables drawn from a normal distribution so that

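$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$, where $\mu$ is the population mean (the thing you want to find) and $\sigma$ is the standard deviation in one observation of $X$. You can find, in terms of $\mu$ and $\sigma$, a region symmetrical about $\mu$ that has a 95% probability of containing $\bar{X}$.

[Figure: normal curve of $\bar{X}$ centred on $\mu$, with 95% of the area between the lower bound and the upper bound and 2.5% in each tail.]

You can find the $z$-score of the upper bound. Using the symmetry of the situation, you find that 2.5% of the distribution is above the upper bound, so the $z$-score is $\Phi^{-1}(0.975) = 1.96$. You can say that:

$$P(-1.96 < Z < 1.96) = 0.95$$

You can use the fact that $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$:

$$P\left(-1.96 < \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} < 1.96\right) = 0.95$$

You can rearrange the inequalities to focus on $\mu$:

$$P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$

Rewind You saw in A Level Mathematics Student Book 2, Chapter 21, that $\Phi^{-1}(p)$ is the inverse normal distribution that tells you the $z$-score that results in the cumulative probability $p$.

Be warned: although this looks like it is a statement about the probability of $\mu$, in your derivation you treated $\mu$ as a constant so it is meaningless to talk about a probability of $\mu$. This statement is still concerned with the probability distribution of $\bar{X}$.

So, if the sample mean is $\bar{x}$, your 95% confidence interval for $\mu$ is:

$$\bar{x} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + 1.96\frac{\sigma}{\sqrt{n}}$$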

Tip The quantity $\frac{\sigma}{\sqrt{n}}$ is sometimes referred to as the standard error.

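You can generalise this method to other confidence levels. To find a $c\%$ confidence interval you can find the critical $z$-score geometrically, using the properties of this graph.

[Figure: standard normal curve of $Z$ with the central $c\%$ split into two halves of $\frac{c}{2}\%$ either side of the mean, 50% of the area below the mean, and the critical $z$-score marked at the upper boundary.]

From this diagram you can see that the critical $z$-score is the one where there is a probability of $0.5 + \frac{1}{2} \times \frac{c}{100}$ of being below it.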

Tip This process creates a symmetric interval around the sample mean. It is also possible to create a non-symmetric interval, but that is beyond the scope of this course.

Tip Some calculators can find these confidence intervals for you.

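Key point 6.1 A $c\%$ symmetric confidence interval for the population mean $\mu$ is:

$$\bar{x} - z\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z\frac{\sigma}{\sqrt{n}}$$

where $\bar{x}$ is the sample mean, $\sigma$ is the standard deviation in one observation of $X$, $n$ is the sample size and $z = \Phi^{-1}\left(0.5 + \frac{1}{2} \times \frac{c}{100}\right)$.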

WORKED EXAMPLE 6.1

The masses of fish in a pond are known to have a normal distribution with a standard deviation . The mean mass of fish from the pond is found to be . a Find a

confidence interval for the mean mass of all the fish in the pond.

b Guidance from a vet suggests that the pond is an unsuitable environment if the mean mass of the fish is below . Does your confidence interval suggest that the environment is unsuitable?

of

a For a

confidence interval

So confidence interval is

, which is

Use your calculator to find the score associated with a confidence interval.

. b A mean mass of is within the confidence interval, so the confidence interval does not necessarily suggest that the environment is unsuitable.

You need to consider whether a true mean of is consistent with your confidence interval.
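The calculation in Key point 6.1 can be sketched in code. The figures used here are assumptions for illustration (a sample mean of 50 from 9 observations with known $\sigma = 6$, and a threshold of 47), not the values of the worked example, and the function name is hypothetical. The confidence level is passed as a proportion, so 0.95 stands for 95%.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, level):
    """Symmetric confidence interval for the mean when sigma is known.

    level is the confidence level as a proportion, e.g. 0.95 for 95%.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)  # critical z-score
    half_width = z * sigma / sqrt(n)           # z times the standard error
    return sample_mean - half_width, sample_mean + half_width


# Assumed values for illustration only.
lower, upper = z_confidence_interval(sample_mean=50, sigma=6, n=9, level=0.95)
threshold = 47
threshold_is_consistent = lower <= threshold <= upper

print("95% confidence interval:", round(lower, 2), "to", round(upper, 2))
print("Is a true mean of", threshold, "consistent with the interval?",
      threshold_is_consistent)
```

With these assumed values the interval is approximately 46.08 to 53.92, so a true mean of 47 is consistent with the interval.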

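You do not need to know the centre of the interval to find the width of the confidence interval. From Key point 6.1 you know that the confidence interval goes from $\bar{x} - z\frac{\sigma}{\sqrt{n}}$ to $\bar{x} + z\frac{\sigma}{\sqrt{n}}$. Therefore its width is $2z\frac{\sigma}{\sqrt{n}}$.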

Fast forward The inference in part b of Worked example 6.1 is effectively a type of hypothesis test. You will see later in this chapter how you can quantify the significance level when using a confidence interval to perform a hypothesis test.

Key point 6.2 The width of a confidence interval is $2z\frac{\sigma}{\sqrt{n}}$.

WORKED EXAMPLE 6.2

The results in a test are known to be normally distributed with a standard deviation of many people need to be tested to find an For an

. How

confidence interval with a width of less than ?

confidence interval

Find the -score associated with an interval.

confidence

Set up an inequality.

At least

people need to be tested.
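The inequality used to find a minimum sample size can be sketched as follows. The values are assumptions for illustration only ($\sigma = 10$, a 90% confidence level and a required width of less than 5), and the function name is hypothetical. Rearranging $2z\frac{\sigma}{\sqrt{n}} < w$ gives $n > \left(\frac{2z\sigma}{w}\right)^2$, and the code takes the smallest whole number that satisfies this strict inequality.

```python
from math import floor, sqrt
from statistics import NormalDist


def min_sample_size(sigma, level, max_width):
    """Smallest n for which the interval width 2*z*sigma/sqrt(n) is below max_width."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    bound = (2 * z * sigma / max_width) ** 2  # the width condition requires n > bound
    return floor(bound) + 1


def interval_width(sigma, level, n):
    """Width of the symmetric confidence interval for the mean (sigma known)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return 2 * z * sigma / sqrt(n)


# Assumed values for illustration only.
n_required = min_sample_size(sigma=10, level=0.90, max_width=5)
print("Minimum sample size:", n_required)
print("Width at that sample size:", round(interval_width(10, 0.90, n_required), 3))
```

With these assumed values at least 44 observations are needed; a sample of 43 gives a width slightly above 5.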

If the sample size is sufficiently large (greater than ) and you do not know the true variance, you need to use the unbiased estimate of the variance as a substitute for the true variance.

You can use confidence intervals to conduct hypothesis tests. For example, if you find a $c\%$ confidence interval you can use it to conduct a two-tailed hypothesis test at the $(100 - c)\%$ significance level.

WORKED EXAMPLE 6.3

A vet is measuring the masses of a breed of dog

. Her data are summarised here:

It can be assumed that the masses follow a normal distribution. a Find a confidence interval for the mean. b A textbook claims that the average mass of this breed is

. Conduct a hypothesis test at the

significance level to decide if this sample suggests that the textbook figure is incorrect. a

First you need to work out the sample statistics.

You need to find the unbiased estimate of the variance.

You can use the formula from Key point 6.1 to find the appropriate -score. So

b

You can then use the expression in Key point 6.1, substituting for .

, The true mean being is consistent with the confidence interval found, so you do not reject at the significance level. There is not significant evidence that the textbook is incorrect.

You must take care not to draw false inferences from confidence intervals. It is important to know the types of error that can be made, as shown in Worked example 6.4.

WORKED EXAMPLE 6.4

Ramon works out a a

confidence interval for the population mean as

of any observed data will be between

. He claims that:

and

b the probability that the population mean is between c the median of the population is

to

and

is

.

Decide which of these statements, if any, are correct. Justify your answers.
a This is not necessarily true. The confidence interval is for the mean rather than a single observation. Even if this statement were about the sample mean there would be variations between samples.
b This is not true. The population mean is not a random variable so you cannot talk about a probability associated with it.
c This is not necessarily true. The confidence interval will be centred on the sample mean, which may not equal the population median.
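The correct interpretation of a confidence level, namely the proportion of intervals built in this way that contain the true mean, can be illustrated by simulation. This is a sketch under assumed values (a normal population with mean 100 and standard deviation 15, samples of size 10, a 95% level and 2000 repetitions); the function name and the fixed random seed are illustrative choices, not part of the text.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def coverage_rate(true_mu, sigma, n, level, trials, seed=0):
    """Proportion of simulated confidence intervals that contain true_mu."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        xbar = mean(sample)
        # Each sample gives a different interval; the true mean stays fixed.
        if xbar - half_width < true_mu < xbar + half_width:
            hits += 1
    return hits / trials


# Assumed values for illustration only.
rate = coverage_rate(true_mu=100, sigma=15, n=10, level=0.95, trials=2000)
print("Proportion of intervals containing the true mean:", rate)
```

The proportion printed should be close to 0.95. The randomness lies in the intervals, which change from sample to sample, and not in the population mean.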

EXERCISE 6A 1

1 Find the -value for these symmetric confidence levels: a b

2

.

Find the required symmetric confidence interval for the population mean for the summarised data. You can assume that the data are taken from a normal distribution with known variance. a

i ii

,

, ,

; ,

confidence interval ;

confidence interval

b

,

i

,

ii 3

,

;

,

;

confidence interval confidence interval

Copy and complete this table. You can assume that the data are taken from a normal distribution with known variance and that the confidence level is symmetric. Confidence level a

i



ii b

d

4









i





ii





i



ii e



i ii

c



Lower bound of Upper bound of interval interval



i



ii











The blood oxygen levels (measured as percentages) of an individual are known to be normally distributed with a standard deviation of . Based upon six readings, Niamh finds that her blood oxygen levels are on average a Find a

.

symmetric confidence interval for Niamh’s true blood oxygen level.

b A doctor needs to be called if the true mean oxygen level falls below interval suggest that the true oxygen level is below ? 5

. Does the confidence

The birth masses of male babies in a hospital are known to be normally distributed with variance

.

a Find a symmetric confidence interval for the average birth mass if a random sample of ten male babies have an average mass of . b If average birth masses are below then an investigation must be conducted. Based upon this confidence interval, should an investigation be conducted? 6

A data set is summarised here:

Find a

symmetric confidence interval for the mean, assuming that the data are drawn from a

normal distribution. 7

a A sample of people in a town have an average wage of £ with an unbiased estimate of the population variance of million. The wages follow a normal distribution. Find a symmetric confidence interval for the mean wage in the town. b Is there significant evidence (at £ ?

8

significance) that the mean wage in this town is different from

When a scientist measures the concentration of a solution, the measurement obtained can be assumed to be a normally distributed random variable with standard deviation . a He makes independent measurements of the concentration of a particular solution and correctly calculates the confidence interval for the true value as . Determine the confidence level of this interval. b The scientist claims that this means that of sample means will be between this a correct interpretation of the confidence interval? Justify your answer.

and

. Is

c He is now given a different solution and is asked to determine a confidence interval for its concentration. The symmetric confidence interval is required to have a width less than . Find the minimum number of measurements required. 9

A supermarket wishes to estimate the average amount spent shopping each week by single men. It is

known that the amount spent has a normal distribution with standard deviation €

. What is the

smallest sample required so that the margin of error (the difference between the centre of the interval and the boundary) for an symmetric confidence interval is less than € ? 10 A physicist wishes to find a confidence interval for the mean voltage of some batteries. She therefore randomly selects batteries and measures their voltages. Based on her results, she obtains the confidence interval [ ]. The voltages of batteries are known to be normally distributed with a standard deviation of . a Find the value of . b Assuming that the same confidence interval had been obtained from measuring would be its level of confidence?

batteries, what

c A

confidence interval for the mean voltage of a different brand of batteries is found to be [ ]. Is there significant evidence that the second brand of battery has a higher voltage than the first brand of battery?

11 a A set of data items produces a confidence interval for the mean of ( ). You can assume that the data are drawn from a normally distributed population. Given that , find the confidence level, giving your answer to two significant figures. b Jasmine wants to test these hypotheses: Use the given confidence interval to conduct a hypothesis test, stating the significance level. 12 From experience it is known that the variance in the increase between marks in a beginning-of-year test and an end-of-year test is . A random sample of four students in Mr Jack’s class was selected and the results in the two tests were recorded. Alma

Brenda

Ciaron

Dominique

Beginning of year End of year a Assuming that the difference can be modelled by a normal distribution with variance

, find a

symmetric confidence interval for the mean increase. b How could the width of the confidence interval be decreased? c Do these data provide evidence at the the school average of a

13 Which of these statements are true for a There is a probability of

significance level that Mr Jack’s class is doing better than

-mark increase? symmetric confidence intervals of the mean?

that the true mean is within the interval.

b If you were to repeat the sampling process

times,

of the intervals would contain the true

mean. c Once the interval has been created there is a the interval. d On average e

chance that the next sample mean will be within

of intervals created in this way contain the true mean.

of sample means will fall within this interval.

14 For a given sample, which will be larger; an symmetric confidence interval for the mean?

symmetric confidence interval for the mean or a

Section 2: Confidence intervals for the mean when the population variance is unknown
In many real-life situations, when finding an estimate for the population mean you do not know the true population variance; you estimate it from the sample variance, $s^2$. This means that the statistic $\frac{\bar{X} - \mu}{s / \sqrt{n}}$ does not follow the normal distribution, but rather the $t$-distribution (as long as $X$ follows a normal distribution). In Section 1 you assumed that when the sample size is large the difference between the $t$-distribution and the normal distribution is sufficiently small that it can be ignored. In this section you will look at how you can adapt the theory from Section 1 when sample sizes are small, less than about .

Rewind The $t$-distribution and associated calculations were covered in Chapter 5. Remember that the number of degrees of freedom is given by $\nu = n - 1$.

You can follow a similar analysis to the one leading to Key point 6.1 to get a formula for a confidence interval using a t-distribution when the sample size is small.

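Key point 6.3
If the estimated variance is found from the sample and the sample size is small, the $c\%$ symmetric confidence interval for the population mean is given by:

$$\bar{x} - t\frac{s}{\sqrt{n}} < \mu < \bar{x} + t\frac{s}{\sqrt{n}}$$

where $\bar{x}$ is the sample mean, $s^2$ is the unbiased estimate of the variance, $n$ is the sample size and $t$ is chosen so that $P(T_{n-1} < t) = 0.5 + \frac{c}{200}$. The sample must be drawn from a normal distribution.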
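The calculation described in Key point 6.3 can be sketched in a few lines of Python. This is only an illustration: the function name `t_confidence_interval` and the four data values are hypothetical, and the t-value is assumed to have been read from the percentage points table rather than computed.

```python
from math import sqrt
from statistics import mean, variance


def t_confidence_interval(data, t_value):
    """Return (lower, upper) for the mean using x_bar +/- t * s / sqrt(n)."""
    n = len(data)
    x_bar = mean(data)
    s = sqrt(variance(data))  # variance() divides by n - 1 (unbiased estimate)
    half_width = t_value * s / sqrt(n)
    return x_bar - half_width, x_bar + half_width


# Hypothetical sample of four values; 3.182 is the 97.5 percentage point
# of the t-distribution with 3 degrees of freedom, as read from tables.
lower, upper = t_confidence_interval([10, 12, 14, 16], 3.182)
print(round(lower, 2), round(upper, 2))
```

Passing the t-value in as an argument keeps the sketch close to the by-hand method: the only step left to tables or a calculator is finding the percentage point for $n - 1$ degrees of freedom.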

Tip
You can find the value of $t$ from some calculators or by using the percentage points table in the formula book. For example, if you are looking at a 95% symmetric confidence interval, that means that there is 97.5% below the upper bound of the interval so you use the 97.5 percentage point.

Figure: distribution curve showing the central 95%, with 2.5% in each tail and 97.5% below the upper bound.
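If neither tables nor a suitable calculator are to hand, the percentage point can be approximated numerically. The sketch below is one possible approach, not the method used in the formula book: it integrates the t-distribution density with Simpson's rule and then searches for the value with 0.5 + c/200 of the area below it. All function names are hypothetical, and it is intended for the small numbers of degrees of freedom discussed in this section.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Probability density function of the t-distribution with df degrees of freedom."""
    coeff = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T < x) for x >= 0: Simpson's rule on [0, x] plus 0.5 for the lower half."""
    if x == 0:
        return 0.5
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_percentage_point(confidence_percent, df):
    """Find t with P(T < t) = 0.5 + c/200 by bisection (c is the confidence level in %)."""
    target = 0.5 + confidence_percent / 200
    low, high = 0.0, 1.0
    while t_cdf(high, df) < target:  # widen the bracket until it contains t
        high *= 2
    for _ in range(50):
        mid = (low + high) / 2
        if t_cdf(mid, df) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2


# 95% symmetric interval with 7 degrees of freedom: the 97.5 percentage point.
print(round(t_percentage_point(95, 7), 3))
```

The printed value can be compared with the entry in the percentage points table for 7 degrees of freedom.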

WORKED EXAMPLE 6.5

Find a confidence interval for the mean of the data, assuming that the data is drawn from a normal distribution.

Find the sample mean and unbiased estimate of the variance.
Find the number of degrees of freedom.
Use tables to find the t-score associated with a symmetric confidence interval when . If there is within the confidence interval then there is below the upper bound.
th percentage point of is
Apply the formula from Key point 6.3.

EXERCISE 6B

1 Find the required symmetric confidence interval for the population mean for these data, some of which have been summarised. You can assume that the data are taken from a normal distribution.
a i ; confidence interval
ii ; confidence interval
b i ; confidence interval
ii ; confidence interval
c i ; confidence interval
ii ; confidence interval

2 A garden contains a large number of rose bushes. A random sample of eight bushes was taken and the heights in cm were measured and the data were summarised as: ,
a State an assumption that is necessary to find a confidence interval for the mean height of rose bushes.
b Find the sample mean.
c Find an unbiased estimate for the population variance.
d Find an symmetric confidence interval for the mean height of rose bushes in the garden.

3

A sample of three randomly selected students are found to have an unbiased estimate of the population variance of in the amount of time they watch television each weekday. Based upon this sample, the symmetric confidence interval for the mean time a student spends watching television is calculated as . It can be assumed that the times follow a normal distribution. a Find the mean time spent watching television. b Find the confidence level of the interval. c A newspaper report on this study claims that most students watch between and hours of television each day. Is this a reasonable conclusion from this confidence interval? Explain your answer.

4

The random variable is normally distributed with mean . A random sample of observations is taken on , and it is found that:

A symmetric confidence interval is calculated for this sample. Find the confidence level for this interval.

5

The lifetime of a printer cartridge, measured in pages, is believed to be approximately normally distributed. The lifetimes of randomly chosen printer cartridges are measured and the results are:

A symmetric confidence interval for the mean was found to be

.

a Find the value of .
b What is the confidence level of this interval?
c The manufacturer claims that the lifetime of the printer cartridge is at least pages. Is the confidence interval found consistent with this claim?

6

The times taken for four people to complete a crossword puzzle are measured and the results are shown in this table.

Person: John, Diane, David, Jane
Time (minutes):

a Find a confidence interval for the true population mean, assuming that the times follow a normal distribution.
b The newspaper says that the average time to complete the crossword is more than minutes.
i State suitable null and alternative hypotheses for this test.
ii Use your confidence interval from part a to determine the conclusion to this hypothesis test at the significance level.

7

The masses of four burgers, in grams, before and after being cooked for one minute, are measured:

Burger
Before cooking
After cooking

A symmetric confidence interval for the mean mass loss was found to include values from . It can be assumed that the masses follow a normal distribution.
a Find the value of .
b Find the confidence level of this interval.

8

The temperature of a block of wood minutes after being lifted out of liquid nitrogen is measured and then the experiment is repeated. The results are and .
a Assuming that the temperatures are normally distributed, find a confidence interval for the mean temperature of a block of wood minutes after being lifted out of liquid nitrogen.
b A different block of wood is subjected to the same experiment and the results are and , where . A second confidence interval is created. Prove that the two confidence intervals overlap for all values of .

Checklist of learning and understanding

A confidence interval for the mean is a range of possible values for the population mean, along with a confidence level.

If the true population variance is known and the sample mean follows a normal distribution then the $c\%$ confidence interval takes the form:

$$\bar{x} - z\frac{\sigma}{\sqrt{n}} < \mu < \bar{x} + z\frac{\sigma}{\sqrt{n}}$$

where $\Phi(z) = 0.5 + \frac{c}{200}$.

The width of the confidence interval is given by $2z\frac{\sigma}{\sqrt{n}}$.

When carrying out a hypothesis test or finding a confidence interval for the mean, if the sample size is sufficiently large and you do not know the true variance, you can use the unbiased estimate of the variance as a substitute for the true variance.

If the estimated variance is found from the sample and the sample size is small, the $c\%$ confidence interval for the population mean is given by:

$$\bar{x} - t\frac{s}{\sqrt{n}} < \mu < \bar{x} + t\frac{s}{\sqrt{n}}$$

where $t$ is chosen so that $P(T_{n-1} < t) = 0.5 + \frac{c}{200}$. The sample must be drawn from a normal distribution.
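Several of the questions that follow ask for the smallest sample size that makes a known-variance interval narrower than a given width. A minimal sketch of that calculation is shown here, assuming the width is 2zσ/√n with z taken from the standard normal distribution. The function names and the numerical values (σ = 5, a 95% interval, target width 4) are hypothetical.

```python
from math import floor, sqrt
from statistics import NormalDist


def z_for_confidence(confidence_percent):
    """z such that Phi(z) = 0.5 + c/200 for a c% symmetric interval."""
    return NormalDist().inv_cdf(0.5 + confidence_percent / 200)


def interval_width(sigma, n, confidence_percent):
    """Width 2 * z * sigma / sqrt(n) of the known-variance interval."""
    return 2 * z_for_confidence(confidence_percent) * sigma / sqrt(n)


def smallest_n_for_width(sigma, max_width, confidence_percent):
    """Smallest n whose interval width is strictly less than max_width."""
    z = z_for_confidence(confidence_percent)
    bound = (2 * z * sigma / max_width) ** 2  # the condition is n > bound
    return floor(bound) + 1


# Hypothetical values: sigma = 5, 95% interval, target width below 4.
n = smallest_n_for_width(5, 4, 95)
print(n, round(interval_width(5, n, 95), 3))
```

Rearranging the width formula gives a lower bound on n; taking the next integer above that bound handles the "less than" condition even when the bound happens to be a whole number.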

Mixed practice 6 1

The mass of a particular breed of dog is known to be normally distributed with variance . The masses of a random sample of dogs from this breed are found. What is the smallest value of required to make the confidence interval for the mean mass less than wide? Choose from these options. A B C D

2

A data set taken from a normal distribution is summarised as: , ,
a Calculate the unbiased estimate of the variance of these data.
b Find a confidence interval for the mean.
c Conduct a two-tailed test at significance to determine if there is a change from

3

The masses of bananas are investigated. The masses of a random sample of of these bananas were measured and the mean was found to be with an unbiased variance of . It is assumed that the masses follow a normal distribution. Find a

symmetric confidence interval for .

4

The time taken for a mechanic to replace a set of brake pads on a car is recorded. In a week she changes sets of brake pads and minutes and . Assuming that the times are normally distributed, calculate a symmetric confidence interval for the mean time taken for the mechanic to replace a set of brake pads.

5

The pH of a river is believed to be normally distributed with a standard deviation of . What is the smallest number of samples that should be taken to get a confidence interval for the mean with a width of less than ?

6

A random sample of four students in a school was selected and the results they got in two tests were recorded:

Alma, Brenda, Ciaron, Dominique
Beginning of year:
End of year:

a Find a symmetric confidence interval for the mean increase in marks from the beginning of year until the end of year, assuming that the differences follow a normal distribution.
b Hence conduct a test at the significance level to see if the results have changed between the beginning and end of the year.

7

The random variable is normally distributed with mean and standard deviation . A random sample of observations of has a mean of .
a Find a confidence interval for .
b It is believed that . Determine whether or not this is consistent with your confidence interval for .

8

From experience it is known that the variance in the mass decrease during a diet is . A random sample of four people was selected and their masses before and after their diet were recorded.

Bobby, Sam, Francis, Alex
Before diet:
After diet:

a Assuming that the mass loss follows a normal distribution, find a confidence interval for the mean mass loss during the diet.
b Hence conduct a test at significance to see if the diet results in a change in mass.

9 A sample of eggs are weighed and the masses in grams are:

















Assuming that these masses form a random sample from a normal population, calculate:
a unbiased estimates of the mean and variance of this population
b a confidence interval for the mean.

10 a i A confidence interval for a population mean, , is to be constructed. What is the probability that the interval will not include the value of ?
ii If such confidence intervals are constructed from separate random samples from the same population, find the probability that at least one of them will not include .
b Jurgen can run metres in a mean time of seconds. His coach changes his training programme to concentrate on his starting speed. After following the new training programme, a random sample of of Jurgen’s -metre running times has mean seconds and standard deviation seconds.
i Assuming Jurgen’s -metre times are normally distributed, construct a confidence interval for his new mean time to run metres, giving the limits to three decimal places.
ii Use the confidence limits to decide whether there is significant evidence that the new training programme has been effective. Justify your decision.
[© AQA 2015]

FOCUS ON … PROOF 2

Proving the expectation and variance of the binomial distribution

In A Level Mathematics Student Book 2, Chapter 21, you used the formulae for the mean and variance of the binomial distribution: If $X \sim \mathrm{B}(n, p)$, then $E(X) = np$ and $\mathrm{Var}(X) = np(1 - p)$.

In this section you will prove these facts. You need to know the formula for binomial probabilities and the binomial expansion. One part of the proof also involves differentiation using the chain rule.

Rewind
Refer to A Level Mathematics Student Book 2 for revision on the binomial distribution and on the chain rule.
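Before working through the proof it can be reassuring to check the two results numerically for a particular case. The sketch below computes the mean and variance of a binomial distribution directly from the definition, as sums over all possible values, so that they can be compared with np and np(1 − p). It is a check, not a proof; the function names and the parameters n = 10, p = 0.3 are hypothetical.

```python
from math import comb


def binomial_pmf(r, n, p):
    """P(X = r) for X ~ B(n, p)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def binomial_mean_and_variance(n, p):
    """Compute E(X) and Var(X) directly from the definition as sums over r."""
    mean = sum(r * binomial_pmf(r, n, p) for r in range(n + 1))
    mean_sq = sum(r * r * binomial_pmf(r, n, p) for r in range(n + 1))
    return mean, mean_sq - mean**2


# Hypothetical parameters n = 10, p = 0.3
mean, var = binomial_mean_and_variance(10, 0.3)
print(round(mean, 6), round(var, 6))  # compare with np = 3 and np(1 - p) = 2.1
```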

QUESTIONS

1 Expand where is a positive integer.
2 Use your result from question 1 to prove that if is the probability of success and is the probability of failure, then:
3 Explain why .
4 By differentiating with respect to and treating as a constant, prove that .
5 a By writing the binomial coefficient in terms of factorials, explain why .
b Hence prove that .
6 a Show that .
b Hence prove that .

FOCUS ON … PROBLEM SOLVING 2

Investigating confidence intervals

A common misconception concerns what is meant by the confidence level of a confidence interval. In this section you will use spreadsheets to simulate the construction of confidence intervals to gain a better understanding of them. With many statistics problems, the ability to simulate the situation is an extremely useful tool in getting started. The screenshots show the syntax of some common spreadsheets, although you might need to adapt this for the program you are using.

The formula for the endpoints of a 95% symmetric confidence interval where the population variance is known is approximately $\bar{x} \pm 1.96\frac{\sigma}{\sqrt{n}}$.

Use a spreadsheet to create 10 random numbers generated from the normal distribution $N(20, 5^2)$:

Spreadsheet screenshot: the first sample occupies one row of ten observations, the first eight of which are shown as 27.62, 20.34, 22.50, 10.06, 20.75, 18.76, 26.51 and 10.41. Each observation is generated with =NORMINV(RAND(),20,5). The syntax is NORMINV(probability, mean, standard_dev).

Then find the mean of the sample and use the formula to find the confidence interval:

Spreadsheet screenshot: row 3 holds the first sample in columns B to K (13.20, 26.32, 17.06, 14.70, 19.77, 13.47, 21.86, 21.30, 20.27, 15.74), its mean 18.37 in column L, the lower limit 15.27 in column M and the upper limit in column N, entered as =L3+1.96*5/SQRT(10).

Tip
Some spreadsheets have the option of generating random numbers from a given distribution. If your spreadsheet does not have this facility, then you can still use a random number generator which provides random numbers from the rectangular distribution between 0 and 1; most spreadsheets do have this function. You might have to think about why the formula shown then provides random numbers from the normal distribution; it is not obvious.

Check if the confidence interval does contain the true mean, which was 20:

Spreadsheet screenshot: with mean 20.97 in column L, lower limit 17.87 in column M and upper limit 24.07 in column N, the Check column O contains =IF(AND(M3<20,N3>20),1,0). The syntax is IF(logical_test, [value_if_true], [value_if_false]).

Then copy this all down to consider 200 samples, all of size 10. Count how many do contain the true mean.

Spreadsheet screenshot: the first three rows show means 20.80, 21.52 and 21.94 with confidence intervals (17.70, 23.90), (18.42, 24.62) and (18.84, 25.04), each with Check value 1.00. The count is found with =SUM(O3:O202), using the syntax SUM(number1, [number 2], …).
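The same simulation can be sketched outside a spreadsheet. The Python version below mirrors the spreadsheet steps: draw a sample of size 10 from a normal distribution with mean 20 and standard deviation 5, form the interval using 1.96 × 5/√10, and record whether it contains 20. The function names, the fixed random seed and the choice of 2000 repetitions are assumptions made for this illustration.

```python
import random
from math import sqrt
from statistics import mean

TRUE_MEAN, SIGMA, SAMPLE_SIZE = 20, 5, 10  # values taken from the spreadsheet formulas


def interval_contains_mean(rng):
    """Draw one sample and report whether its 95% interval contains TRUE_MEAN."""
    sample = [rng.gauss(TRUE_MEAN, SIGMA) for _ in range(SAMPLE_SIZE)]
    x_bar = mean(sample)
    half_width = 1.96 * SIGMA / sqrt(SAMPLE_SIZE)
    return x_bar - half_width < TRUE_MEAN < x_bar + half_width


def coverage(num_samples, seed=1):
    """Fraction of simulated intervals that contain the true mean."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable runs
    hits = sum(interval_contains_mean(rng) for _ in range(num_samples))
    return hits / num_samples


print(coverage(2000))
```

Changing the seed or the number of samples shows how much the observed proportion varies from run to run, which is useful background for the questions that follow.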

QUESTIONS 1

For each sample, can you say with certainty whether or not the true mean is within the calculated confidence interval?

2

What percentage of the calculated confidence intervals contain the true mean?

3

If, instead of using the true standard deviation, the sample standard deviation is used, then a t-interval is required.
a How does this affect the width of the confidence intervals?
b How does this affect your answer to question 2?

4

Adapt the spreadsheet to create two samples of size from a distribution. For each of these samples, create a confidence interval. Note whether the two confidence intervals overlap. Repeat for lots of pairs of samples of size from a distribution. What percentage of the pairs have overlapping confidence intervals?

5

Repeat the investigation from question 4, but this time using one sample of size from a distribution and one sample of size from a distribution.

Tip
The purpose of questions 4 and 5 is to highlight that it is not a good idea to use the overlap of two confidence intervals to test whether the means of two distributions are the same, as the significance level is not obvious.
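The overlap investigation in questions 4 and 5 can also be sketched in Python. The version below counts how often two 95% known-variance intervals overlap when the two samples are drawn independently. The function names, the seed and every numerical setting (samples of size 10 from a normal distribution with mean 20 and standard deviation 5) are hypothetical choices for illustration, not the values set in the questions.

```python
import random
from math import sqrt
from statistics import mean


def interval(sample, sigma):
    """95% known-variance interval x_bar +/- 1.96 * sigma / sqrt(n)."""
    x_bar = mean(sample)
    half_width = 1.96 * sigma / sqrt(len(sample))
    return x_bar - half_width, x_bar + half_width


def overlap_rate(mu1, mu2, sigma, n, trials, seed=1):
    """Fraction of sample pairs whose two intervals overlap."""
    rng = random.Random(seed)
    overlaps = 0
    for _ in range(trials):
        low1, high1 = interval([rng.gauss(mu1, sigma) for _ in range(n)], sigma)
        low2, high2 = interval([rng.gauss(mu2, sigma) for _ in range(n)], sigma)
        if low1 < high2 and low2 < high1:
            overlaps += 1
    return overlaps / trials


# Hypothetical settings: both samples of size 10 from N(20, 5^2).
print(overlap_rate(20, 20, 5, 10, trials=2000))
```

Running it with equal and then with different means gives a feel for why the overlap rate does not correspond in any simple way to the confidence level of the individual intervals.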

FOCUS ON … MODELLING 2

Simulating the t-distribution

The normal distribution and the t-distribution are very closely related. To get a better feel for their similarities and differences, this exercise investigates the shapes of these two distributions.

QUESTIONS

1 Use a spreadsheet to create a list of samples of size , taken from the normal distribution .

2 Find the mean of each sample. Is the mean of all the means of each sample zero?

Tip
In Excel you can create a random number from the distribution using the syntax “ ” or the function provided by the Data Analysis toolpak.

3 Find the standard deviation of these samples. Is the mean of the standard deviations of each sample approximately ?

4

Construct the -score for each sample mean using the formula:

Plot each of these -scores on a histogram. What do you observe?

Tip If you are using Excel, you might want to use the Data Analysis toolpak to create the histogram. 5

Construct the -score for each sample mean using the formula:

Plot each of these -scores on a histogram. What do you observe? 6

Repeat questions 1 to 5 using a list of samples of size taken from the distribution. What do you observe?

Based on this exercise, you should see that for small sample sizes there is a noticeable difference between z-scores and t-scores, necessitating the use of the t-distribution. However, for larger sample sizes the differences are small compared to most other sources of uncertainty, so the normal distribution can be used as an approximation to the t-distribution.
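A rough Python version of this exercise is sketched below. Rather than plotting histograms, it compares how often the two kinds of score fall outside ±1.96: one score standardises each sample mean with the true standard deviation, the other with the sample standard deviation. The standard normal population, the sample sizes 5 and 50, the 5000 repetitions and the function name are all assumptions made for this illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev


def tail_fractions(sample_size, num_samples, seed=1):
    """Fractions of z-scores and t-scores with absolute value above 1.96."""
    rng = random.Random(seed)
    z_tail = t_tail = 0
    for _ in range(num_samples):
        sample = [rng.gauss(0, 1) for _ in range(sample_size)]
        x_bar = mean(sample)
        z = x_bar / (1 / sqrt(sample_size))              # uses the true sigma = 1
        t = x_bar / (stdev(sample) / sqrt(sample_size))  # uses the sample estimate s
        z_tail += abs(z) > 1.96
        t_tail += abs(t) > 1.96
    return z_tail / num_samples, t_tail / num_samples


# Hypothetical choices: 5000 samples of size 5, then of size 50, from N(0, 1).
print(tail_fractions(5, 5000))
print(tail_fractions(50, 5000))
```

For the small samples the second fraction should come out clearly larger than the first, reflecting the heavier tails of the t-distribution; for the larger samples the two fractions should be close.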

CROSS-TOPIC REVIEW EXERCISE 1

The questions in this exercise cover AS Level material only.

1 The discrete random variable can only take the values and . If , find . Choose from these options.
A B C D

2

The length of an athlete’s long jump is modelled by a normal distribution with standard deviation . A sample of jumps is measured. What will be the width (to three significant figures) of a confidence interval for the mean? Choose from these options. A B C D

3

A continuous random variable has probability density function defined by

a Sketch the graph of .
b Show that the value of is .
c i Write down the median value of .
ii Calculate the value of the lower quartile of .
[© AQA 2013]

4

The numbers of people studying Mathematics at different levels in a sample of students from two different schools were recorded.

Columns: North Academy, South High School
Rows: No Maths, Single Maths, Further Maths

a Conduct an appropriate test to show that there is evidence at the significance level that the level of Mathematics studied depends on the school attended.
b What assumptions are required to make the conclusion of the test in part a valid?
c Jane says that she would be more likely to study Further Mathematics if she attended North Academy. Is this a valid inference from the data? Justify your answer.

5

The continuous random variable has probability density function given by for and otherwise.
a Show that .
It is given that .
b Hence find the values of and .
c Find .
d Find the value of .

6

Andrew travels to a meeting. His journey consists of two independent parts: a section by car and a section by train. The amount of time spent on the car section is modelled by the random variable and the amount of time spent on the train section is modelled by the random variable . All times are in minutes. Based on long experience, Andrew knows that the average time spent on the car section is minutes with standard deviation minutes, and the average time spent on the train section is minutes with standard deviation minutes.
a Assuming that there is no waiting time, find the expectation and standard deviation in Andrew’s total journey time.
b For the meeting Andrew gets paid £ plus £ per hour he spends travelling. Find the expectation and standard deviation in the amount Andrew gets paid.

7

For the year 2014, this table summarises the masses, kilograms, of a random sample of women residing in a particular city who are aged between years and years.

Mass ( )
Number of women
Total

a Calculate estimates of the mean and the standard deviation of these masses.
b i Construct a confidence interval for the mean mass of women residing in the city, who are aged between years and years.
ii Hence comment on a claim that the mean mass of women residing in the city, who are aged between years and years has increased from that of in 1965.
[© AQA 2014]

8

Two independent random variables have normal distributions, and .
a State the distribution of , including any necessary parameters.
b Find .

9

At a remote hospital, in an area where there are many venomous snakes, the number of patients during one week requiring treatment after a venomous snake bite may be modelled by a Poisson distribution with mean .
a For this hospital, find the probability that:
i no more than patient requires treatment after a venomous snake bite during a particular week
ii at least patients require treatment after a venomous snake bite during a particular period of weeks
iii more than patients but fewer than patients require treatment after a venomous snake bite during a particular period of weeks.

b Each patient who has been bitten by a venomous snake is treated with a single dose of an anti-venom which is effective against the venoms of all the snakes common in that area. The anti-venom is expensive and has a limited shelf life, so that a delivery of fresh anti-venom is made at -week intervals. The hospital stores just enough anti-venom so that the probability that it runs out of anti-venom before the next delivery is less than per cent. Quoting probabilities to justify your answer, state how many doses of anti-venom the hospital should have in its store immediately after a delivery of fresh anti-venom.
[© AQA 2015]

10 Dana, a researcher in the USA, investigated game-related stress for sports officials in inter-school baseball, basketball and soccer. The officials involved in this investigation were categorised as either adopting an approach (AP) coping style or an avoidance (AV) coping style when dealing with game-related stress. Table 1 summarises the results of this investigation.

Table 1
Coping style (columns): AP, AV
Sport (rows): Baseball, Basketball, Soccer

You may assume that the officials involved in this investigation represent a random sample.

a Use the information in Table 1 to complete the contingency table, Table 2, with frequencies that could be analysed to investigate whether the coping style used by officials is associated with the sport involved.

Table 2
Coping style (columns): AP, AV
Sport (rows): Baseball, Basketball, Soccer

b Examine, using the level of significance, whether the coping style used by officials is associated with the sport involved.
c By comparing observed and expected frequencies, identify, in context, two important facts concerning coping style and sport involved.
[© AQA 2014, adapted]

11 The probability distribution of a discrete random variable is given by:

a Find in terms of .
b Show that .
c What is the largest possible value of the variance of ?

12 A sample of size is drawn from a normally distributed population with standard deviation . A confidence interval for the mean was correctly calculated to be . Find:
a the unbiased estimate of the population mean
b the value of .

13 In a diamond mine, the number of diamonds found per cubic metre of material mined is known to follow a Poisson distribution with mean .
a If , find the probability of finding:
i diamonds in of mining
ii diamond in each of two sections of mining.
b To be economically viable a diamond mine needs more than diamonds per cubic metre. To survey a potential new mine the owner examines a sample.
i State appropriate null and alternative hypotheses in terms of .
ii The survey results show that the sample contains diamonds. Conduct a hypothesis test at the significance level.
iii What is the probability of a type I error in this context?
iv Why might the mine owner choose to use a significance level rather than a significance level when conducting this test?

14 A receptionist answers phone calls for a company.
a State two conditions needed for the number of phone calls answered in an hour to be modelled by the Poisson distribution.
b Explain why these conditions are unlikely to be met in this situation.
For a certain period of time you can now assume that the number of phone calls answered in an hour can indeed be modelled by the distribution.
c Find the standard deviation in the number of phone calls answered.
d Find the probability that fewer than phone calls are answered in a -hour shift.
e Find the longest time for which the probability that no phone calls are answered is at least .

15 The continuous random variable has probability density function given by for and otherwise. Find:
a the value of
b
c
d
e the median of .

16 The volume of lemonade in a can produced at a factory follows a normal distribution with standard deviation . A quality control test takes a random, independent sample of cans. The factory manager claims that the cans should, on average, contain .
a If the true mean is , find the probability that exactly cans contain less than .
b Jane decides that if or more cans in the sample contain less than she will reject the batch.
i State in this context what is meant by a type I error.
ii Find the probability of a type I error in Jane’s test.
c The mean of the sample is found to be .
i Construct a confidence interval for the true mean of the cans, giving your answer to decimal place.
ii Phillip uses the confidence interval from part c i to determine whether the cans do come from a population with a mean of less than . What conclusion does Phillip draw and what is the significance level of his conclusion?

17 Members of a library may borrow up to books. Past experience has shown that the number of books borrowed, , follows the distribution shown in the table.

a Find the probability that a member borrows more than books.
b Assume that the numbers of books borrowed by two particular members are independent. Find the probability that one of these members borrows more than books and that the other borrows fewer than books.
c Show that the mean of is , and calculate the variance of .
d One of the library staff notices that the values of the mean and the variance of are similar and suggests that a Poisson distribution could be used to model . Without further calculations, give two reasons why a Poisson distribution would not be suitable to model .
e The library introduces a fee of pence for each book borrowed. Assuming that the probabilities do not change, calculate:
i the mean amount that will be paid by a member
ii the standard deviation of the amount that will be paid by a member.
[© AQA 2016]

CROSS-TOPIC REVIEW EXERCISE 2

1 It is assumed that people arrive in a queue randomly and at a constant average rate of per minute. The random variable is the time, in minutes, between people arriving in the queue.
a State the distribution of , including any parameters.
b Find the probability that there is a gap of between and minutes.
c What is the expected standard deviation of ?

2 In a particular town, a survey was conducted on a sample of residents aged years to years. The survey questioned these residents to discover the age at which they had left full-time education and the greatest rate of income tax that they were paying at the time of the survey. The summarised data obtained from the survey are shown in the table.

Greatest rate of income tax paid (rows): Zero, Basic, Higher, Total
Age when leaving education (years) (columns): or less, or , or more, Total

a Use a $\chi^2$-test, at the level of significance, to investigate whether there is an association between age when leaving education and greatest rate of income tax paid.
b It is believed that residents of this town who had left education at a later age were more likely to be paying the higher rate of income tax. Comment on this belief.
[© AQA 2015]

3

A digital thermometer measures temperatures in degrees Celsius. The thermometer rounds down the actual temperature to one decimal place, so that, for example, and are both shown as . The error, , resulting from this rounding down can be modelled by a rectangular distribution with the following probability density function.

a State the value of .
b Find the probability that the error resulting from this rounding down is greater than .
c i State the value for .
ii Use integration to find the value for .
iii Hence find the value for the standard deviation of .
[© AQA 2016]

4

Julie, a driving instructor, believes that the first-time performances of her students in their driving tests are associated with their ages. Julie’s records of her students’ first-time performances in their driving tests are shown in the table.

Columns: Age, Pass, Fail

a Use a $\chi^2$-test at the level of significance to investigate Julie’s belief.
b Interpret your result in part a as it relates to the age group.
[© AQA 2010]

5

The random variable represents the number of soft drinks Manuel purchases while eating a burger. Manuel models using the distribution.
a Find the standard deviation of .
is the amount Manuel spends on his meal in pounds. If the burger costs £ and each drink costs £ , find:
b i
ii .

6

The discrete random variable follows the distribution and satisfies .
a Find .
b In three independent observations of , find the probability that fewer than two have .

7

The random variable measures the number of minutes Cauchy spends on a mobile phone each month. The mean of is with standard deviation . Cauchy is on a contract with a fixed charge of £ each month, then per minute.
a Find the mean and the variance in , the amount of money in pounds that Cauchy spends each month on his mobile phone.
b Cauchy has a budget of £ per month for his phone. Anything that he does not spend on his phone he saves. Find the mean and variance in , the amount saved in pounds each month.

8

South Riding Alarms (SRA) maintains household burglar-alarm systems. The company aims to carry out an annual service of a system in a mean time of minutes. Technicians who carry out an annual service must record the times at which they start and finish the service.
a Gary is employed as a technician by SRA and his manager, Rajul, calculates the times taken for annual services carried out by Gary. The results, in minutes, are as follows:
Assume that these times may be regarded as a random sample from a normal distribution. Carry out a hypothesis test, at the significance level, to examine whether the mean time for an annual service carried out by Gary is minutes.
b Rajul suspects that Gary may be taking longer than minutes on average to carry out an annual service. Rajul therefore calculates the times taken for annual services carried out by Gary. Assume that these times may also be regarded as a random sample from a normal distribution but with a standard deviation of minutes. Find the highest value of the sample mean which would not support Rajul’s suspicion at the significance level. Give your answer to two decimal places.
[© AQA 2014]

9

The time taken to complete a test is modelled by the normal distribution. The average score on this test is with standard deviation . A sample of students in a school take the test and if their average is above it will be decided that the school is doing better than the rest of the population.
a Explain why the normal distribution is a plausible model for the test results.
b Assuming that the standard deviation is still , find the significance level of this test.
c If the true mean of students in the school is , find the power of the test.
d If the true mean of the students was higher than , would the power of the test be higher or lower? Explain your answer. No further calculations are required.

10 The time in seconds between errors in a piano performance is modelled by an exponential distribution, exp .
a The probability that there is an error in any seconds is . Find the value of .
b Find the probability that there is no error in any seconds.

c Find the expected time until the first error.

11 Judith, the village postmistress, believes that, since moving the post office counter into the local pharmacy, the mean daily number of customers that she serves has increased from . In order to investigate her belief, she counts the number of customers that she serves on randomly selected days, with the following results.

Stating a necessary distributional assumption, test Judith’s belief at the level of significance.
[© AQA 2010]

12 It is claimed that a new drug is effective in the prevention of sickness in holiday-makers. A sample of holiday-makers was surveyed, with the following results.

Columns: Sickness, No sickness, Total
Rows: Drug taken, No drug taken, Total

Assuming that the holiday-makers are a random sample, use a $\chi^2$ test, at the level of significance, to investigate the claim.
[© AQA, 2010]

13 The discrete random variable follows the distribution and satisfies .
a Find the value of .
b Show that .

14 Lorraine bought a new golf club. She then practised with this club by using it to hit golf balls on a golf range. After several such practice sessions, she believed that there had been no change from metres in the mean distance that she had achieved when using her old club. To investigate this belief, she measured, at her next practice session, the distance, metres, of each of a random sample of shots with her new club. Her results gave

Investigate Lorraine’s belief at the level of significance, stating any assumption that you make.
[© AQA 2010]

15 Wellgrove village has a main road running through it that has a speed limit. The villagers were concerned that many vehicles travelled too fast through the village, and so they set up a device for measuring the speed of vehicles on this main road. This device indicated that the mean speed of vehicles travelling through Wellgrove was . In an attempt to reduce the mean speed of vehicles travelling through Wellgrove, life-size photographs of a police officer were erected next to the road on the approaches to the village. The speed, , of a sample of vehicles was then measured and the following data obtained.

a State an assumption that must be made about the sample in order to carry out a hypothesis test to investigate whether the desired reduction in mean speed had occurred.
b Given that the assumption that you stated in part a is valid, carry out such a test, using the level of significance.
c Explain, in the context of this question, the meaning of:
i a type I error
ii a type II error.
[© AQA 2015]

16 The discrete random variable satisfies this distribution:

a If

, find the possible values of .

b For the larger value of , find the value of 

.

17 Long-term observations suggest that the number of cars passing the school gates follows a Poisson distribution with a mean of cars per minute. Following the opening of a new supermarket at the end of the road, the head teacher wishes to find out whether this mean has increased. She sends a group of students to count the cars passing the school gates during a -minute interval. Let be the number of cars passing the school gates in a -minute interval, so that .

a Write down suitable null and alternative hypotheses.
b Find the critical region for the test at the significance level.
c The students counted cars. State the conclusion of the test.

In reality, the mean number of cars has increased to per minute.

d Find the probability that the test results in a type II error.

18 Groups of visitors arrive at a museum randomly, at a constant average rate of per hour. The director wants to find out whether this rate is smaller on rainy days. She randomly selects a rainy day and records the number of groups arriving over a -hour period. She then conducts a hypothesis test, using these hypotheses: , where is the population mean number of groups arriving at the museum in a -hour period.

a Write down the value of .

The manager decides that she will reject the null hypothesis if the number of visitor groups arriving in the -hour period is less than or equal to .

b Find the probability of a type I error in this test.

The number of visitor groups in fact decreases to per hour on a rainy day.

c Find the power of the test.

19 A physicist measures a quantity associated with the spin of an electron, . She takes independent readings that have mean . She calculates an unbiased estimate of the variance as . She assumes that this quantity follows a normal distribution.

a Find a confidence interval for the true mean of , giving your answer to a suitable level of accuracy.
b The random variable is defined as . Write down a confidence interval for the mean of .
c A theory predicts that the true value of is exactly . Is the confidence interval found in part a consistent with the theory?
d The physicist repeats her experiment three times. Each experiment consists of independent readings followed by finding a confidence interval.
i What is the probability that at least two of these confidence intervals do contain the true mean?
ii What is the probability that all of these confidence intervals are above the true mean?

AS LEVEL PRACTICE PAPER
45 minutes, 40 marks

1 The number of beetles in a forest can be modelled by a Poisson distribution with parameter beetles per square metre. Find the probability, to three significant figures, that in a area there are fewer than beetles.
Choose from these options.
A B C D
[1 mark]

2 The discrete random variable has a probability distribution given by for and otherwise. Find .
Choose from these options.
A B C D
[1 mark]

3 The discrete random variable has this distribution:





a Find the value of . [1 mark]
b Find . [1 mark]
c Find the standard deviation of . [4 marks]

4 Sarah models the number of buses arriving at a bus stop using a Poisson distribution. is the number of Route buses arriving in an hour and is the number of Route buses arriving in an hour. Sarah models these as being independent with and .

a Given that , state in context an interpretation of the variable and write down its distribution, including any parameters. [2 marks]
b Find the probability that or fewer buses arrive in an hour. [2 marks]
c Give one reason why the assumption that and are independent is unlikely to be the case. [1 mark]

To check her model, Sarah counts the buses arriving in randomly selected hours.

d Use suitable calculations to determine if a Poisson model is feasible.

5 The continuous random variable has probability density function given by for and otherwise.

a Find the value of .

[3 marks]

b Show that

[5 marks]

c Find 6

[4 marks]

median of . .

[3 marks]

This table shows the results of a survey in a school about weekly hours spent watching TV. Test at the significance level whether school year and hours spent watching TV are independent.

School year
Hours

[5 marks]

7 a The number of leaks in a pipe is known by a water company to follow a Poisson distribution with mean leaks per . A new contractor claims that they can reduce the number of leaks. After they have maintained the pipes for some time, a random stretch of pipe is investigated and found to have leaks. Test the contractor’s claim at the significance level. [5 marks]

b It is decided that if three or fewer leaks are found in , then the contractor has reduced the number of leaks. What is the probability of a type I error? [2 marks]

A LEVEL PRACTICE PAPER
60 minutes, 50 marks

1 The number of beta particles emitted by a radioactive isotope follows a Poisson distribution. On average, beta particles are emitted each second. What is the probability (to three significant figures) that the second beta particle is emitted between and seconds after the first beta particle is observed?
Choose from these options.
A B C D
[1 mark]

2 The discrete random variable has a probability distribution given by for and otherwise. Find the median of .
Choose from these options.
A B C D
[1 mark]

3 The discrete random variable follows the distribution shown.

a Find . [1 mark]
b Find . [2 marks]
c Write down the value of . [1 mark]

4 The contingency table shows information about whether a random sample of people have music lessons, and their gender.

          Music lessons    No music lessons
Female
Male

a State the null and alternative hypotheses when conducting a chi-squared test for independence. [2 marks]
b Write down the number of degrees of freedom in this test. [1 mark]
c Conduct a chi-squared test at the significance level to see if there is a link between gender and choice of lessons. [5 marks]

5 A continuous random variable has probability density function given by

a Find the exact value of . [3 marks]
b Find and , giving your answers to three significant figures. [4 marks]
c Find the standard deviation of , giving your answer to three significant figures. [4 marks]

6 A researcher is testing if a new swimming technique is more effective. She knows the average time of swimmers in her club using the old technique is seconds. After training swimmers with the new technique she times them over and summarises their times in seconds:

Lower times are considered better in swimming.

a Show that the unbiased estimate of the variance is to two decimal places. [2 marks]
b Write down appropriate null and alternative hypotheses to test if the new swimming technique is effective. [2 marks]
c Write down the number of degrees of freedom in the test. [1 mark]
d Investigate, using the significance level, whether the new technique improved the mean time. [4 marks]
e State one assumption required for your test to be valid. Comment on how reasonable the assumption is in this context. [2 marks]

7 The number of phone calls received by an IT helpline is known to follow a Poisson distribution. It is thought to receive a mean of phone calls per hour. A change to the IT system is designed to encourage fewer phone calls to the helpline. If there are phone calls or fewer in a -hour period, the change will be deemed successful.

a Find the probability of a type I error in this process. [3 marks]
b In reality the number of phone calls was per hour. Find the probability of a type II error. [3 marks]

8 When a scientist records the volume of acid required to neutralise a solution she records her results to the nearest millilitre. For example, if she records a volume of , she believes that the true volume required is somewhere in between and with all possibilities equally likely. The error, , is a random variable defined as the true volume of acid required to neutralise the solution minus the recorded volume.

a State an appropriate distribution to model , including its parameters. [2 marks]
b Find the probability that the magnitude of the error, , is less than . [1 mark]
c Find the probability that in two independent observations the magnitude of the error is less than . [2 marks]
d Hence find the probability density function of the random variable , the maximum magnitude of the error in two observations. [3 marks]

FORMULAE

Probability

Standard deviation

Discrete distributions Distribution of

Mean

Variance

Binomial Poisson

Sampling distributions For a random sample variance :

of independent observations from a distribution having mean and

For a random sample of observations from

:

Distribution-free (non-parametric) tests Contingency tables:

is approximately distributed as

TABLE 1 Percentage points of the Student’s t-distribution

The table gives the values of satisfying , where is a random variable having the Student’s t-distribution with degrees of freedom.

[Table body not recovered. Rows are indexed by degrees of freedom: 1 to 40, then 45 to 100 in steps of 5, then 125, 150 and 200.]

TABLE 2 Percentage points of the χ2 distribution

The table gives the values of satisfying , where is a random variable having the χ2 distribution with degrees of freedom.

[Table body not recovered. Rows are indexed by degrees of freedom: 1 to 40, then 45 to 100 in steps of 5.]

Answers 1 Discrete random variables BEFORE YOU START 1 2 3 4

WORK IT OUT 1.1 Solution B is correct.

EXERCISE 1A 1 Answers are given to a i ii b i ii c i

ii

d i ii 2 a Proof. b 3 a b 4 a Proof. b 5 a b c

where appropriate.

6 7 a b 8 9 a

b c d e 10 a

Profit £ Probability

b £ 11 a b i ii Proof. 12 a b c d i ii

EXERCISE 1B 1 a i ii b i ii c i ii d i ii e i ii 2 a i ii

b i ii c i ii d i ii 3 4 5 6 7 a b c d e 8 a b 9 Proof. 10 Any value allowed, there is no upper limit.

EXERCISE 1C 1 a i ii b i ii 2 3 4 a Proof; b 5 6 7 8 9 Proof.

10 Proof.

MIXED PRACTICE 1 1 C 2 C 3 a

b 4 a

b c d e 5 a    b c 6 a b c 7 8 9 a b 10 a

b c d 11 a

b c 12

13 a b c d 14 15 a b 16 a b i ii Proof. iii 17 a i Proof. ii iii Proof. iv b

2 Poisson distribution BEFORE YOU START 1 2 3 4 Do not reject

.

WORK IT OUT 2.1 Solution A is correct.

EXERCISE 2A In this exercise answers are given to 1 a i ii b i ii c i ii 2 a i ii b i ii c i ii d i ii e i ii 3

4 5 a b 6 a b 7 a

, where appropriate.

b c d 8 a b 9 a b 10 11 a b c There are alternative ways to get

emails in a week other than every day.

12 a b c 13 a b c 14 15 a Proof. b

WORK IT OUT 2.2 Solution B is correct.

EXERCISE 2B In this exercise answers are given to 1 a i Do not reject

-value

ii Do not reject

-value

b i Reject

-value

ii Reject

-value

c i Do not reject

-value

ii Do not reject

-value

d i Reject

-value

ii Reject

-value

2 a i ii b i

, where appropriate.

ii c i ii 3 Reject

-value

4 a -value

. Do not reject

.

b Rate might be different at different times. Cars might not be independent – cars might be travelling together. 5 a b Do not reject

-value

6 a b c Reject

-value

7 a Constant rate over the day. Bees arrive independently. b Reject

-value

8 Do not reject

-value

9 a Do not reject b Reject

.

-value

-value

10

MIXED PRACTICE 2 In this exercise answers are given to

, where appropriate.

1 C 2 C 3 a

.

b c 4 a Independent events. Constant rate of success. b c Reject

-value

.

5 a b 6 a b 7 a b c Do not reject 8 a

-value

b 9 a b 10 a b 11 a b c 12 a b c d e f g 13 a b 14 a b 15 Do not reject

-value

16 a The average rate must be constant. However, you might expect it to vary over different times of the day and with different weather conditions. Birds must arrive independently, but they might come in flocks. b c d Do not reject 17

-value

18 a b c d 19 a i ii b i ii

c 20 21 a i ii iii b 22 a i ii b c i ii The coins buried in a hoard are no longer independent. The Poisson assumption requires independence, so brooches are more likely to be modelled by a Poisson distribution.

3 Chi-squared tests BEFORE YOU START 1 Yes; the -value is 2 3

EXERCISE 3A 1 a i

ii b i ii 2 a

: Physics grade and Mathematics course are independent; : Physics grade and Mathematics course are dependent.

b

or

or

or

Further Maths Maths AS or A No Maths

c d Reject . Significant evidence of association; students studying a higher level of Mathematics tend to do better in physics. 3

critical value . Reject reading level and fiction/non-fiction are dependent.

4 a

Early

. There is significant evidence that the

On time

Late

Total

Walk Car Other



Total

b c Reject

. Significant evidence that lateness depends on the mode of transport.

d He must assume that his data are representative; for example, it was not a day with unusual traffic. He must also assume that the respondents were independent; for example, not lots of students from the same bus.
5 Significant evidence of association; . People who visit more often tend to spend more money on each visit.
6 No significant dependency; . The drug does not appear to be effective in increasing the speed of recovery.
7 a

Male £ £



Female





£ £





£





b No significant evidence of dependency; 8 a Proof. b c 9 a Proof. b Proof;

.

c d There will be random variation within the sample. 10 a

First factor

Second factor

Total

A

12

12

1

5

B

10

1

4

5

C

3

2

25

20

Total b Proof. 11 a Proof. b Proof.

WORK IT OUT 3.1 Solution C is correct.

EXERCISE 3B 1 a i ii b i ii 2 Proof; 3 Independent;

.

4 Independent; . Number of books does not seem to differ significantly between rural and urban libraries.
5 There is significant evidence of an association; . However, this does not establish causality.
6 a Proof; . A higher percentage of men are admitted. This appears to be evidence of bias.
b . Two out of six departments have a higher proportion of men accepted.
7 You cannot do a calculation based on two factors being dependent unless you know exactly what that dependency is.

MIXED PRACTICE 3 1 D 2 B 3

; no significant evidence of association.

4 a

b c Do not reject 5

. No significant evidence that hair colour and eye colour are dependent.

; do not reject

. There is no evidence of association.

6 A 7 a

; significant evidence of association.

b i There are more of them. ii Largest contribution to 8 a

.

; Fiona’s belief is justified.

b Fewer than expected gained Class . More than expected gained Class 2ii. 9 a i

; significant evidence of association.

ii Accidents involving changing lane to the left are less likely, and accidents involving changing lane to the right are more likely than expected for foreign registered HGVs. b i

Expected values

Prosecution resulted

No prosecution

years or under Over ii

years

; it appears that they are independent. There is no significant evidence that prosecution is dependent on age.

10

; there is significant evidence of an association between age and department.

11 a Proof. b Proof; c d The sample follows the long-term trend. 12 a

; development of Type 2 diabetes seems to be dependent on average level of weekly alcohol consumption.

b Not generalisable to the whole population. The ‘less than ’ category also has a large contribution. c i No change, ii iii No change.

.

4 Continuous distributions BEFORE YOU START 1 2 3 4 5 6

EXERCISE 4A In this exercise answers are given to s.f., where appropriate. 1 a i ii b i ii c i ii d i ii e i ii f i ii g i ii h i ii 2 a i ii b i ii c i ii 3 a i

ii b i ii c i ii 4 a b 5 6 7 8 a b 9 10 11 Proof;

EXERCISE 4B 1 a i ii

;

;

;

;

b i

;

ii

;

c i

;

ii d i ii 2 a i ii b i ii 3 a b 4 a

; ;

;

;

;

;

;

;

; ; ;

; ; ;

; ;

;

b 5 a Proof. b 6 a Proof. b 7

EXERCISE 4C 1 a i ii b i ii c i ii d i ii e i ii 2 a i ii b i ii c i ii d i ii e i ii 3 a i ii b i ii c i ii d i ii

4 a £ b £ 5 6 a b 7 a b 8 a b c 9 a if is odd,

if is even.

b

EXERCISE 4D 1 a i ii b i ii c i ii d i ii e i ii 2 a b c d 3 a Proof. b 4

;

5 ; 6 ; 7 8

minutes;

minutes

is twice the mass of a single gerbil;

is the sum of the masses of two different gerbils.

EXERCISE 4E 1 a i ii b i ii c i ii d i ii e i ii f i ii 2 a b 3 a b c 4 a b 5 a b 6 a b c 7 a b 8 9 a b 10 a b Assumes that the rainfall each day is independent of the rainfall on other days. This is unlikely to be the case. 11 a b i ii

iii c d

EXERCISE 4F ;

1 a i ii b i ii 2 a i ii b i ii 3 4 a b

, otherwise.

c

EXERCISE 4G 1 a

f(x) 5k

0

0

b c 2 a b 3 a Proof. b c

5

10

x

d

4 a b f(w)

0

0

3

7

w

c d 5 a Proof. b 6 a b

c

d e 7 a

f(x) k

0

0

π – 2

5π – 8

x

b Proof. c d

EXERCISE 4H In this exercise answers are given to 3 s.f., where appropriate.

1 a i ii b i ii c i ii d i ii e i i 2 a i ii b i ii 3 a b 4 5 6 a Proof. b Proof;

.

7

EXERCISE 4I In this exercise answers are given to 1 a i ii b i ii c i ii 2 a i ii b i ii

;

, where appropriate.

3 4 a b c 5 6 7 8 Proof; it equals 9 a Exponential; b c Proof. 10 Proof. a b c Proof. d Proof.

EXERCISE 4J 1 a i ii b i ii c i ii d i ii 2 a b c d 3 a b 4 a

; ; ; ;

b 5 a b c d 6 a b 7 Proof; 8 a b c d e

. Rounding causes a slight underestimate of the true mean time.

MIXED PRACTICE 4 1 2 3 a

b c 4 a b c 5 a £ b £ c £ 6 a b i ii

7 a

b 8 9 a b c d 10 a b c d 11 a b c Probably not. Especially in the situation in part b it is likely that when Alice is finished Hassan might try to speed up. d 12 13 14 15 a b 16 a b

c

d Proof. e 17 a b c 18 a Proof.

b

c 19 a b c 20 21 22 a b 23 a Proof. b c d e i ii 24 a f(x) 3 – 10

x + 7 y =– 40

1 – 5 O

1

5

x

b c Proof. d i ii Proof. e 25 a f(x) 3 – 32

O

1 – 2

b i-ii Proof. c i ii

11

x

d

Focus on … Proof 1 Proof 6 1 Theorem 2. 2 Theorem 3. 3 Theorem 5. 4 Properties of sums. 5 Theorem 4 and Theorem 1 6 Theorem 1. 7 Theorem 4. 1–3  Proof.

Focus on … Problem solving 1 1 0.613 or 0.168 2 0.199 or 0.416 3 8 4 a, b Proof. c 18

Focus on … Modelling 1 1 Not very appropriate. Rate might not be constant in every part of the ocean. The presence of fish might not be independent. 2 Not appropriate. Not a random process. 3 This is well modelled by a Poisson distribution. 4 Not appropriate. This is not a number of events, and the rate might not be constant throughout the day. Buses might not be independent. 5 The rate might not be constant, but the Poisson tends to work quite well in these situations. 6 Not appropriate. The number of fish being caught is sufficient that it might have a significant effect on the number of fish remaining in the pond. 7 This is well modelled by the Poisson distribution (and indeed is used in a derivation of the chi-squared statistic). 8 This is well modelled by the Poisson distribution.

5 Further hypothesis testing BEFORE YOU START 1 2 Reject 3 Do not reject 4 5 Do not reject

.

EXERCISE 5A 1 a i Reject ii Reject b i Do not reject ii Do not reject c i Do not reject ii Do not reject 2 a b Reject c True variance unknown. Assume that times are normally distributed. 3 a b Do not reject

.

4 Reject 5 a b Do not reject 6 a b c Reject

.

d Assume that they are normally distributed. 7 a b Do not reject c i ii Reject 8 a b

EXERCISE 5B 1 a i ii

.

.

b i ii c i ii 2 a i ii b i ii c i ii 3 a i ii b i ii 4 a i ii b i ii c i ii 5 a i ii b i ii c i ii 6 a i ii b i ii c i ii 7 Decreases the risk of a type II error, but increases the risk of a type I error. 8 a

: Coin is fair,

: coin is biased.

b Type I error. 9 a Claiming that there is correlation when none really exists.

b Not recognising correlation when there is underlying correlation. 10 a i Claiming that the dice is biased when it is not. ii Claiming that the dice is not biased when it is. b For example: Roll the dice more times, look for more than sixes, consider other numbers, do a chi-squared test.

11 a b

. This is very small and requires extreme evidence before change is found. This does not seem to be required in this situation.

12 a b c 13 a b Do not reject

. There is not enough evidence that Dhalia’s eggs are heavier.

c d e 14 a b c d Proof. 15 a i ii b i ii 16

MIXED PRACTICE 5 1 C 2 A 3 a b c Do not reject d Assume that the data are drawn from a normal distribution. 4 a Not rejecting

when it is false.

b c d Increase the sample size.

5 a b c 6 a b c d Reject

.

7 a b c
8 a Proof; b . No significant evidence that mean journey time is minutes.
9 a . No significant evidence to doubt that the mean is .
b . Significant evidence that the mean is not equal to .
c i Neither. Risk of a type I error is regardless of sample size.

ii Larger sample size leads to a smaller risk of a type II error. 10 a

. evidence of association between method of receiving information and outcome.

b Type I error.

Chapter 6 Confidence intervals BEFORE YOU START 1 2 3 Do not reject 4 No significant evidence.

EXERCISE 6A 1 a b 2 a i ii b i ii 3

Confidence Lower bound of level interval

a

Upper bound of interval

i ii

b

i ii

c

i ii

d

i ii

e

i ii

4 a b It is plausible that the true oxygen level is

.

5 a b Yes 6 7 a b No significant evidence of a difference in the mean wage. 8 a b No. This is a confidence interval for the population mean, not sample means. c 9 10 a b

c No, since the confidence intervals overlap (although it is quite difficult to find the significance level). 11 a b Reject

at the

significance level.

12 a b Increase the sample size. c Yes 13 a False. b False. c False. d True. e False. 14

EXERCISE 6B 1 a i ii b i ii c i ii 2 a Assume heights are normally distributed. b c d 3 a b c No. The confidence interval is for the mean value for an individual. The sample is too small for meaningful generalisations. 4 5 a b c Yes 6 a b i ii Do not reject 7 a

.

b 8 a b Proof.

MIXED PRACTICE 6 1 D 2 a b c Do not reject 3 4 5 6 a b Significant evidence that the results are different. 7 a b No. the given probability suggests that

which does not fall in the confidence interval.

8 a b Do not reject 9 a b 10 a i ii b i ii New programme seems to have been effective.

Focus on … Proof 2 1 $\binom{n}{0}p^{n} + \binom{n}{1}p^{n-1}q + \binom{n}{2}p^{n-2}q^{2} + \dots + \binom{n}{n}q^{n}$ 2–3 Proof. 4 Proof. 5–6 Proof.

Focus on … Problem solving 2 1 Yes. No probability is involved. 2 About 95%. 3 a They tend to be wider. b No change. 4 About 99.4%. 5 About 30%.

Focus on … Modelling 2 1 Investigation. 2 It should be close to zero. 3 No, it should be about 0.7. 4 It looks a lot like a normal distribution. 5 The shape looks like a normal distribution, but it is much wider – it extends to t-scores above 3 and below −3. 6 The standard deviation is much closer to 1 and the t-scores histogram is very similar to the z-scores histogram.

Cross-topic review exercise 1 1 B 2 D 3 a f(x) 9k

O

3

4

x

b Proof. c i ii 4 a

; degrees of freedom

.

b Random, representative sample from each school. c No. Just because there is dependency does not mean that there is causality. 5 a Proof. b c d 6 a

minutes;

b £

minutes

£

7 a Mean:

; s.d.:

b i ii There is reason to doubt the claim. 8 a b 9 a i ii iii b 10 a

AP

AV

Baseball Basketball Soccer

b

; number of degrees of freedom ; significant evidence that coping strategy is associated with sport involved.

c Soccer officials are far less likely than expected to use an AV coping style. Baseball officials are far more likely than expected to use an AV coping style.

11 a b Proof. c 12 a b 13 a i ii b i ii Do not reject null hypothesis. iii iv Making a type II error, rejecting a genuine opportunity, might turn out to be very costly. Further tests can always be done to be more certain. 14 a Phone calls are independent of each other, there is a constant average rate of phone calls. b For example: The same customer might call back, breaking independence. The rate during office hours might be different from the rate during the night. c

(

)

d e

hours ( s.f.).

15 a b c d e 16 a b i

or more cans containing less than

even though the mean is actually

.

ii c i ii Significant evidence that the cans contain less than

on average; significance level is

17 a b c Proof; d No probability of books borrowed and no probability of more than books borrowed. e i ii

.

Cross-topic review exercise 2 1 a b c 2 a

; significant evidence of association.

b Belief is supported at the

level of significance.

3 a b c i ii iii
4 Evidence to support Julie’s belief at significance level. More students than expected in the age group pass their test first time.

5 a b i ii 6 a b 7 a b 8 a

; insufficient evidence to reject null hypothesis.

b 9 a Most students will be close to the average, with fewer and fewer students getting scores as you move further from the mean. b c d Power would increase because the test will be more likely to pick up the difference from

.

10 a b c 11

; sufficient evidence to support Judith’s belief. Assumption that the population is normally distributed.

12 No evidence at significance to support the claim that the drug is effective against the sickness.

13 a

b Proof. 14

. Evidence to support Lorraine’s belief. Assume that the distances follow a normal distribution.

15 a Random sample. b

; significant evidence that mean speed has reduced.

c i Concluding that the mean speed has reduced when in fact it has not. ii Concluding that the mean speed is still

when in fact it has reduced.

16 a b 17 a b c Sufficient evidence that the mean number of cars has increased. d 18 a b c 19 a b c No d i ii

AS Level Practice Paper
1 C
2 C
3 a 0.1 b 0.8 c 0.8
4 a Total number of buses arriving in an hour; $T \sim \mathrm{Po}(7.5)$. b 0.0591 (3 s.f.) c For example: Both are dependent on traffic. d Not feasible. Mean $\neq$ variance.
5 a $\frac{1}{9}$ b Proof; median $= 2.38$ (3 s.f.); $E(X) = 2.25$. c 0.338 (3 s.f.)
6 $\chi^2 = 4.22$, $\nu = 6$; do not reject $H_0$. No significant evidence of an association.
7 a $p$-value $= 0.0212$; reject $H_0$. Significant evidence that the contractors have reduced the mean number of leaks. b 0.0212 (3 s.f.)
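The reasoning behind answer 4 d (a Poisson model is only feasible when the sample mean and variance are close) can be sketched in Python. The counts below are hypothetical, invented purely for illustration because the survey data are not reproduced here, and the 25% tolerance is an arbitrary assumption rather than a rule from the book.

```python
from statistics import mean, variance


def poisson_plausible(counts, tolerance=0.25):
    """Compare sample mean and unbiased sample variance.

    A Poisson variable has mean equal to variance, so a large gap between
    the two sample statistics argues against a Poisson model.
    `tolerance` is a hypothetical rule-of-thumb threshold (an assumption).
    """
    m = float(mean(counts))
    v = float(variance(counts))  # unbiased estimate, divides by n - 1
    plausible = abs(v - m) <= tolerance * m
    return m, v, plausible


# Hypothetical hourly bus counts (not taken from the question).
hypothetical_counts = [7, 8, 7, 8, 7, 8, 7, 8, 7, 8]
m, v, ok = poisson_plausible(hypothetical_counts)
print(f"mean = {m:.3f}, variance = {v:.3f}, Poisson plausible: {ok}")
```

With these invented counts the variance is far smaller than the mean, which is the kind of mismatch that leads to a "not feasible" conclusion.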

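A Level Practice Paper
1 B
2 D
3 a 53 b 59 c 5
4 a $H_0$: Gender and lessons are independent; $H_1$: Gender and lessons are not independent. b 1 c $\chi^2_{\text{Yates}} = 4.84$; reject $H_0$.
5 a $\frac{1}{\ln 2}$ b $E(X) = 1.44$ (3 s.f.); $\mathrm{Var}(X) = 0.0827$ (3 s.f.) c 0.144 (3 s.f.)
6 a Proof. b $H_0$: $\mu = 35$; $H_1$: $\mu$ … 1) $= \int_1^2 (px + q)\,dx = \left[\frac{px^2}{2} + qx\right]_1^2 = 1 - \frac{p}{2} - q = \frac{1}{4}$
d Using the formula for $\mathrm{Var}(X)$ and substituting in the values for $p$ and $q$ from part b:
$\mathrm{Var}(X) = E(X^2) - (E(X))^2 = \int_0^2 (px^3 + qx^2)\,dx - \left(\frac{2}{3}\right)^2 = 4p + \frac{8q}{3} - \frac{4}{9} = \frac{2}{9}$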

6

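a $X = C + T$
$E(X) = E(C) + E(T) = 20 + 100 = 120$ minutes
$\sigma(X) = \sqrt{\sigma^2(C) + \sigma^2(T)} = \sqrt{5^2 + 10^2} = 11.2$ minutes (3 s.f.)
b $X = C + T$, $Y = 200 + \frac{10}{60}X = 200 + \frac{1}{6}X$
Substituting in the values from part a:
$E(Y) = 200 + \frac{1}{6}E(X) =$ £220
$\sigma(Y) = \sqrt{\frac{1}{36}\sigma^2(X)} = \frac{1}{6}\sigma(X) =$ £1.86 (3 s.f.)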
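The two steps in this solution (adding independent variables, then applying a linear transformation) can be checked numerically. The following is a minimal Python sketch; the function names are illustrative choices and not notation from the book, and independence of C and T is assumed, as the solution itself assumes when it adds the variances.

```python
import math


def sum_of_independent(mean_a, sd_a, mean_b, sd_b):
    """Mean and standard deviation of A + B for independent A and B."""
    return mean_a + mean_b, math.sqrt(sd_a ** 2 + sd_b ** 2)


def linear_transform(mean_x, sd_x, a, b):
    """Mean and standard deviation of Y = a + bX."""
    return a + b * mean_x, abs(b) * sd_x


# Values read from the worked solution: C has mean 20 and s.d. 5 (minutes),
# T has mean 100 and s.d. 10 (minutes).
mean_x, sd_x = sum_of_independent(20.0, 5.0, 100.0, 10.0)

# Y = 200 + (10/60) X, the cost in pounds.
mean_y, sd_y = linear_transform(mean_x, sd_x, 200.0, 10.0 / 60.0)

print(f"E(X) = {mean_x:.1f} min, sd(X) = {sd_x:.3g} min")
print(f"E(Y) = £{mean_y:.2f}, sd(Y) = £{sd_y:.3g}")
```

Note that the additive constant 200 shifts the mean but leaves the standard deviation unchanged, which is why only the factor 1/6 appears in the last line of the solution.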

7

a Mean $= \frac{10\,065}{160} = 62.9$ kg (3 s.f.); standard deviation $= 12.3$ kg (3 s.f.)
b i $Z = \Phi^{-1}(0.99) = 2.33$ (3 s.f.)
$Z\frac{s}{\sqrt{n}} = 2.27$ (3 s.f.)
$\bar{x} - Z\frac{s}{\sqrt{n}}$ …

$P(X > r) = 1 - P(X \leqslant r) = 1 - e^{-5}\sum_{k=0}^{r}\frac{5^{k}}{k!}$    Result
$P(X > 5) = 1 - P(X \leqslant 5) = 0.384\,039\,349$    $> 10\%$, so do not reject $H_0$.
$P(X > 6) = 1 - P(X \leqslant 6) = 0.237\,865\,41$    $> 10\%$, so do not reject $H_0$.
$P(X > 7) = 1 - P(X \leqslant 7) = 0.133\,371\,678$    $> 10\%$, so do not reject $H_0$.
$P(X > 8) = 1 - P(X \leqslant 8) = 0.068\,093\,639$

… $P(X > 8) = 0.0681$ or $6.81\%$ (3 s.f.)
iv To reduce the likelihood of making a type II error. Making such an error would result in a lost opportunity for the owner, as they would find significant evidence to suggest that the mine is not economically viable when, in fact, it is.
14 a Phone calls are independent of each other. There is a constant average rate of phone calls.
b For example: The same customer might call back, breaking independence. The rate during office hours might be different from the rate during the night.
c $\sigma(X) = \sqrt{\lambda} = \sqrt{4.5} = 2.12$ (3 s.f.)
d Let $X$ represent the number of phone calls answered in a 2-hour shift, $X \sim \mathrm{Po}(2 \times 4.5 = 9)$.
P(X … 70
b $X$ represents the number of cars passing the school gates in a 10-minute interval, $X \sim \mathrm{Po}(70)$.
$P(X \geqslant x) = \sum_{n=x}^{\infty}\frac{70^{n}e^{-70}}{n!} \leqslant 0.1 \Rightarrow X \geqslant 82$
c Reject $H_0$. There is sufficient evidence that the mean number of cars has increased.
d The mean number of cars passing is 12 per minute, so now $X \sim \mathrm{Po}(120)$.
$P(X \leqslant 81) = \sum_{n=0}^{81}\frac{120^{n}e^{-120}}{n!} = 1.01 \times 10^{-4}$ (3 s.f.)
18 a $\lambda_0 = 3 \times 16 = 48$
b

$P(X \leqslant 35) = \sum_{n=0}^{35}\frac{48^{n}e^{-48}}{n!} = 0.0309$ (3 s.f.)
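The Poisson sums used in the solutions to questions 17 and 18 above can be evaluated without tables. The following is a rough Python sketch using only the standard library; the helper names are illustrative assumptions, and each term is computed in log space so that values such as $120^{81}$ do not overflow.

```python
import math


def poisson_pmf(n, lam):
    """P(X = n) for X ~ Po(lam), evaluated via logarithms for stability."""
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Po(lam); an empty sum (k < 0) gives 0."""
    return sum(poisson_pmf(n, lam) for n in range(k + 1))


def upper_critical_value(lam, alpha):
    """Smallest x with P(X >= x) <= alpha under X ~ Po(lam)."""
    x = 0
    while 1.0 - poisson_cdf(x - 1, lam) > alpha:
        x += 1
    return x


# Question 17 b: one-tailed test for an increase, Po(70), 10% level.
critical = upper_critical_value(70, 0.10)
# Question 17 d: type II error, i.e. missing the critical region when Po(120) holds.
type_two = poisson_cdf(critical - 1, 120)
# Question 18 b: type I error, rejecting when X <= 35 under Po(48).
type_one = poisson_cdf(35, 48)

print(f"critical region: X >= {critical}")
print(f"P(type II error) = {type_two:.3g}")
print(f"P(type I error)  = {type_one:.3g}")
```

The printed values can be compared with the figures quoted in the worked solutions above.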

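c The number of visitor groups is now 12 per hour, so if $X$ represents the number of groups arriving in a 3-hour period, $X \sim \mathrm{Po}(3 \times 12 = 36)$.
$P(X > 35) = \sum_{n=36}^{\infty}\frac{36^{n}e^{-36}}{n!} \approx 0.522$
Power $= 1 - P(X > 35) = 0.478$ (3 s.f.)
19 a Interval is $\bar{X} - Z \times \frac{s}{\sqrt{n}}$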