Probability and Statistics (Tanton)


THINKING MATHEMATICS A Refreshingly Clear Reference Series for Teachers and Students and all those seeking True and Joyous Understanding!

Volume 8

PROBABILITY AND STATISTICS LEVEL:

TABLE OF CONTENTS

PART 1: BASIC PROBABILITY THEORY
Simplistic Overview
Naïve Probability Theory
  OR
  AND
  Sequence Principle
The Empirical Model
  Law of Large Numbers
  Monte Carlo
Expected Value
Conditional Probability
Bayes’ Theorem
PROBLEM SET I

PART 2: COUNTING
COUNTING PRINCIPLES
  The Multiplication Principle
  Factorials
  The Labeling Principle
  Multi-stage Labeling
  Fun with Poker
PASCAL’S TRIANGLE
  A Grid of Numbers
  The Binomial Theorem
PROBLEM SET II

PART 3: STATISTICS
Displaying and Summarising Data
Measures of Central Tendency
Measures of Dispersion
Scatter Plots
Lines of Best Fit
Correlation Coefficient
Null Hypothesis
Distributions
Central Limit Theorem
Normal Distribution
68-95-99.7 Rule
z-scores
Roulette
Confidence Intervals
P-values
Gallup Polls
Sampling
Chi-Squared Test
Quality Control
Run Tests
Rank Correlation
PROBLEM SET III

PART 4: ADVANCED TOPICS
Random Variables
  Sums, differences, multiples
  Connection to Central Limit Theorem
Cereal Box Problem
Geometric Distribution
Binomial Distribution
Proportions
Student’s t-distribution
Chi-Squared Distribution
Chebyshev’s Inequality
Law of Large Numbers

PROBABILITY AND STATISTICS

Informal Course Notes

PART I of IV James Tanton © 2007 James Tanton

CONTENTS:
Simplistic Overview
Naïve Probability Theory
  OR
  AND
  Sequence Principle
The Empirical Model
  Law of Large Numbers
  Monte Carlo
Expected Value
Conditional Probability
Bayes’ Theorem


PART ONE:

SIMPLISTIC OVERVIEW

Probability and Statistics represent two sides of the same coin.

PROBABILITY: Explores what can be said about an unknown sample from a known collection of objects.
e.g. We know all possible combinations from rolling a pair of dice. What is the most likely outcome?

STATISTICS: Explores what can be said about an unknown collection from a known sample.
e.g. We surveyed 100 people and found that 37 chewed gum. What does this say about the gum-chewing habits of the entire nation?

BASIC IDEAS:

Probability: If a situation can be described in terms of possible outcomes that are deemed equally likely, then the probability of any one particular outcome occurring is defined to be:

\[ \text{Prob} = \frac{1}{\text{Total number of outcomes}} \]

e.g. The possible outcomes from rolling a die are: 1, 2, 3, 4, 5, 6. Each is usually deemed equally likely. Then:

\[ \text{Prob}(3) = \frac{1}{6}, \qquad \text{Prob}(5) = \frac{1}{6}, \quad \text{etc.} \]

Probability relies on the ability to COUNT things.
e.g. Four cards are dealt from a deck. What’s the probability of getting four aces? This problem relies on being able to count all 4-card hands. (A bit tricky.)
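To see how counting drives the four-aces question, here is a small Python sketch. It assumes a standard 52-card deck (the notes do not specify the deck) and that every 4-card hand is equally likely, with the order of the cards ignored.

```python
from fractions import Fraction
from math import comb

# Number of distinct 4-card hands from a standard 52-card deck (order ignored).
# The 52-card deck is an assumption; the text only says "a deck".
total_hands = comb(52, 4)

# Exactly one of those hands consists of the four aces.
p_four_aces = Fraction(1, total_hands)

print(total_hands)   # 270725
print(p_four_aces)   # 1/270725
```

Using `Fraction` keeps the answer exact rather than a rounded decimal.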

STATISTICS: There are two branches:

Descriptive Statistics is concerned with methods of collecting, tabulating and summarizing data.
Inferential Statistics is concerned with making inferences and predictions based on collected data.

e.g. A medical study records the heights of 100 eight-year-olds.
• average height = a statistic
• tallest height = a statistic
• third to shortest height = a statistic
THESE ARE ALL DESCRIPTIVE
• Making a judgment about whether a particular child’s height is abnormal is an INFERENTIAL JUDGMENT

COMMENT: The word “statistik” was coined by German political scientist Gottfried Achenwall (1719-1772) to mean “a summary of how things stand.” It is based on the Latin word stare meaning “to stand.”

HISTORY

PROBABILITY

The start of probability theory can, essentially, be pinpointed to a single moment in time. In 1654 French nobleman Chevalier de Méré wrote to prominent mathematician Blaise Pascal asking for advice on the following problem:

Two friends each lay down $100 in a friendly “best of seven” tennis match. But rain interrupts play after just four games, with one person having won three games and the other just one. How should the $200 be divvied up between the two players so as to properly reflect the likelihood of each winning?

Pascal shared this problem with Pierre de Fermat. Both solved it independently using different techniques. Through this problem, probability theory was born.

Comment: Italian mathematician Girolamo Cardano (1501-1576) actually worked with ideas akin to probability theory before this but did not publish his work. And of course gambling games have been in existence for centuries and scholars have wondered about their results. But the first definitive analysis of “chance” began with the work of Pascal and Fermat.

STATISTICS

The study of statistics (descriptive statistics, at least) is ancient.

3050 B.C.E.    Egyptians collated data on population wealth
2300 B.C.E.    Ancient Chinese did the same
594 B.C.E.     Greeks took a census for tax collection
309 B.C.E.     Greeks took a census for population figures
Later          Romans kept census records, birth and death records, conducted geographic surveys, etc.
Middle Ages    Very little done

The start of inferential statistics can be pinpointed to:

1662    John Graunt analysed birth and death records to create “life tables” which were used to predict life expectancies of different social groups.
1790    First U.S. census taken.

GETTING OUR FEET WET: For fun let’s go back and analyse de Méré’s problem. Let’s imagine, like Pascal and Fermat, we are seeing it for the first time. How would you like to approach it?

Recall: Best of 7 games, but only 4 games played.
Player A has won 3 games.
Player B has won 1 game.
How best to divvy up a $200 pot?

COMMENT: There are a number of interesting concepts at play in this example. If you have some familiarity with these terms, you may be able to identify in your work the Law of Large Numbers, the definition of the probability of an event, and the notion of expected value. We’ll, of course, talk about these concepts in detail later in these notes.
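One way to get a feel for de Méré’s problem is to let a computer finish the interrupted match many times. The Python sketch below is only one possible approach, and it rests on an assumption the problem statement does not make explicit: that each remaining game is a fair 50/50 toss between the players. The function name `estimate_a_wins` is made up for this illustration.

```python
import random

def estimate_a_wins(trials=100_000, seed=0):
    """Estimate the chance that player A (3 wins) beats player B (1 win)
    in a best-of-seven, ASSUMING every remaining game is a fair 50/50 toss."""
    rng = random.Random(seed)
    a_total = 0
    for _ in range(trials):
        a, b = 3, 1                      # score when the rain arrives
        while a < 4 and b < 4:           # first to four wins takes the match
            if rng.random() < 0.5:
                a += 1
            else:
                b += 1
        a_total += (a == 4)
    return a_total / trials

share = estimate_a_wins()
print(f"Estimated P(A wins) = {share:.3f}")
print(f"A's share of the $200 pot is roughly ${200 * share:.0f}")
```

Splitting the pot in proportion to the estimated chance of winning is itself a modelling choice; it is the expected-value idea mentioned in the comment above.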

NAÏVE PROBABILITY THEORY

Recall the basic principle of probability: If a situation can be described in terms of possible outcomes that are deemed equally likely, then the probability of one particular outcome occurring is defined to be:

\[ \text{Prob} = \frac{1}{\text{Total number of outcomes}} \]

Example: In rolling a die there are six possible outcomes: 1, 2, 3, 4, 5, 6, each deemed equally likely. Then:

\[ \text{Prob}(4) = \frac{1}{6}, \qquad \text{Prob}(6) = \frac{1}{6}, \quad \text{etc.} \]

Definition: The set of all possible outcomes of an experiment is called the sample space.

Example:
In tossing a coin … Sample Space = {H, T}
In rolling a die … Sample Space = {1, 2, 3, 4, 5, 6}
Ascertaining someone’s age (in years) … Sample Space = {0, 1, 2, …, 120(?)}

Definition: An event is a set of outcomes (or just a single outcome).

Example: In rolling a die: Sample Space = {1, 2, 3, 4, 5, 6}

An event could be: {2, 4, 6} (rolling an even number)
or: {3} (rolling a three)
or: {1, 2, 3, 4, 5, 6} (rolling any number!)

Definition: Given a sample space S for an experiment and an event E, the probability of E occurring is defined to be:

\[ P(E) = \frac{\#\text{ elements of } E}{\#\text{ elements of } S} \]

(This is, of course, assuming that the sample space has just a finite number of elements, and that every single outcome is “equally likely.”)

Example: In rolling a die … S = {1, 2, 3, 4, 5, 6}

The probability of rolling an even number, E = {2, 4, 6}, is:

\[ P(\text{even}) = \frac{\#\text{ elements of } E}{\#\text{ elements of } S} = \frac{3}{6} = \frac{1}{2}. \]

Note, also:

\[ P(\{3\}) = \frac{1}{6} \]

\[ P(\{1, 2, 4, 5\}) = \frac{4}{6} = \frac{2}{3} \]

\[ P(\text{rolling a 7}) = \frac{0}{6} = 0 \]

\[ P(\text{rolling any number}) = \frac{6}{6} = 1 \]

COMMENT: We always have \( 0 \le P(E) \le 1 \).
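The definition of P(E) translates almost word for word into code. The Python sketch below uses a helper named `prob` (a name chosen here for illustration, not taken from the notes) and assumes, as the definition does, a finite sample space of equally likely outcomes.

```python
from fractions import Fraction

def prob(event, sample_space):
    """Naive probability: favourable outcomes over all outcomes,
    assuming every outcome in sample_space is equally likely."""
    event = set(event) & set(sample_space)   # outcomes outside S cannot occur
    return Fraction(len(event), len(sample_space))

S = {1, 2, 3, 4, 5, 6}
print(prob({2, 4, 6}, S))     # 1/2
print(prob({3}, S))           # 1/6
print(prob({1, 2, 4, 5}, S))  # 2/3
print(prob({7}, S))           # 0
print(prob(S, S))             # 1
```

Because an event can contain no fewer than zero and no more than all of the outcomes, the result always lands between 0 and 1.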

WARNING: This naïve approach to probability assumes each individual outcome is “equally likely.” THIS IS NOT AN EASY CONCEPT!

For example: In rolling a pair of dice and computing their sum, the set of all possible outcomes is:

S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}

But these individual events, “2”, “3”, …, “12”, are not equally likely. Somehow we are meant to know that the underlying “equally likely” quantity here is not the sums 2, 3, …, 12, but the pairs of numbers behind each sum, with order considered important! There are 36 possible ordered pairs:

1-1  1-2  1-3  1-4  1-5  1-6
2-1  2-2  2-3  2-4  2-5  2-6
3-1  3-2  3-3  3-4  3-5  3-6
4-1  4-2  4-3  4-4  4-5  4-6
5-1  5-2  5-3  5-4  5-5  5-6
6-1  6-2  6-3  6-4  6-5  6-6

These are the entities deemed “equally likely.” Now, only one of these pairs gives a sum of “2”, so:

\[ P(2) = \frac{1}{36} \]

Also, we see:

\[ P(3) = \frac{2}{36} = \frac{1}{18}, \qquad P(4) = \frac{3}{36} = \frac{1}{12} \]

EXERCISE: Write down P(5), P(6), P(7), P(8), P(9), P(10), P(11) and P(12).

[QUESTION: Is this correct? Is the order of the pair indeed important? Or should 6-3, say, be deemed equivalent to 3-6? How is one meant to know?]
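If you would like to check your answers, the ordered-pair model is easy to enumerate by machine. This Python sketch takes the 36 ordered pairs as the equally likely outcomes (the very assumption the question above asks you to examine); the helper name `p_sum` is made up for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 ordered pairs (first die, second die), each deemed equally likely.
pairs = list(product(range(1, 7), repeat=2))

# Count how many ordered pairs produce each sum.
counts = Counter(a + b for a, b in pairs)

def p_sum(total):
    """Probability that the two dice add up to `total`."""
    return Fraction(counts[total], len(pairs))

print(p_sum(2), p_sum(3), p_sum(4))   # 1/36 1/18 1/12
```

Calling `p_sum` on the remaining totals gives the values the exercise asks for under this model.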

This might not seem too much of an issue here, but in more complicated examples one might be given a sample space, such as S = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, and one is meant to somehow “know” whether or not these outcomes are “equally likely,” or whether or not this sample space is a result of a more fundamentally “equally likely” set. THIS IS VERY PERTURBING TO A MATHEMATICIAN! This is equally perturbing to students, and it should be!

Example: Consider the command: Pick a whole number at random. What does this mean? Is each number “equally likely”? Does the term “equally likely” even apply?

Example: THE WALLET GAME
Two people decide to play the following game:

Each pulls out her wallet. Whoever possesses the lesser amount of money in her wallet wins. Her prize? The contents of the other player’s wallet.

Each person can reason: “I stand to win more than I lose. Thus the game is in my favour!” A game can’t be favourable simultaneously to both players! Something is very strange here. What are the odds of winning? Is everything really balanced and “equally likely”?

ADDITIONAL COMMENT: The very act of defining probability in terms of “equally likely” events is circular: using the term “equally likely” assumes you already know what probability means! The very basis of naïve probability is philosophically flawed.

AND YET ANOTHER COMMENT: We defined, for a sample space S and an event E, the probability of E occurring as:

\[ P(E) = \frac{\#\text{ elements of } E}{\#\text{ elements of } S} \]

If we change our wording here and say:

\[ P(E) = \frac{\text{size of } E}{\text{size of } S} \]

then we can extend our notion of probability to geometric settings where “size” of a set is taken to be its “area.”

For example, a circle sits inside a square of side length 2 inches. We can ask: If a point inside the square is chosen at random, what are the chances of it landing outside the circle?

The area of the shaded region, the region of interest (the event E), is \( 2^2 - \pi \cdot 1^2 = 4 - \pi \) and the area of the entire square (the sample space S) is \( 2^2 = 4 \). Then the probability we seek is:

\[ P(E) = \frac{\text{size of } E}{\text{size of } S} = \frac{4 - \pi}{4} \]

QUESTION: What does “equally likely” mean in this setting? Is the probability of picking any specific point zero? If so, is the probability of picking any point from a collection of points also zero?
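The area ratio can be compared with an experiment. The Python sketch below interprets “chosen at random” as choosing the two coordinates independently and uniformly, which is an assumption and not the only possible reading. It places the square on [-1, 1] × [-1, 1] with the circle of radius 1 at its centre; the function name is hypothetical.

```python
import math
import random

def estimate_outside_circle(trials=200_000, seed=1):
    """Fraction of uniformly random points in the 2-by-2 square
    that land outside the inscribed circle of radius 1."""
    rng = random.Random(seed)
    outside = 0
    for _ in range(trials):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y > 1.0:          # farther than 1 from the centre
            outside += 1
    return outside / trials

exact = (4 - math.pi) / 4
print(f"area ratio = {exact:.4f}, simulated = {estimate_outside_circle():.4f}")
```

The two numbers should be close but not identical; the simulation only approximates the area ratio.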

And while we are mired in philosophical woes, consider the following disturbing problem:

EXERCISE: BERTRAND’S PARADOX
A chord is chosen at random in a circle. What is the probability that the chord is longer than the side-length of an inscribed equilateral triangle?

Answer 1: By rotating the circle we may as well assume that one end of the chosen chord is positioned at the left end of the circle. Then we can see that the chord will be longer than the side of an inscribed equilateral triangle if its second end lies in the shaded portion shown.

This represents \( \frac{1}{3} \) of the circumference of the circle. Thus the probability we seek is:

\[ P = \frac{1}{3} \]

This argument is mathematically sound and the result is absolutely correct.

Answer 2: By rotating the circle we may as well assume that the chosen chord is horizontal. Then the chosen chord will be longer than the side-length of an inscribed equilateral triangle if its mid-point lies on the shaded portion of the diameter shown:

An exercise in geometry shows that this represents \( \frac{1}{2} \) of the diameter. Thus the probability we seek is:

\[ P = \frac{1}{2} \]

This argument is mathematically sound and the result is absolutely correct!!! □

The problem here is that the term “at random” is absolutely vague! The first answer defines “at random” to mean: select a point on the circumference of the circle and connect it with a given previously chosen point. The second solution assumes “at random” means: draw a circle on the floor and roll a broom handle from one side of the room across the circle. It is possible to define “at random” by many different means for this problem and arrive at different, but absolutely valid, answers. (If one draws a circle on a piece of paper and drops straws from above onto the figure, one finds that about \( \frac{1}{4} \) of them give chords of the length we seek!)

Examples like these paradoxes alerted mathematicians to the problems with beginning approaches to probability theory. Terms such as “equally likely” and “at random” and even “probability” itself, are fundamentally vague notions. No wonder one’s intuition is so challenged by this subject!
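To watch the paradox happen, one can simulate several meanings of “at random” side by side. In the Python sketch below the first two models follow Answer 1 and Answer 2. The third model, a chord midpoint chosen uniformly over the area of the disc, is an assumption added here as one common way to obtain a value near one quarter; it is not claimed to describe what dropped straws really do. All function names are made up for this illustration, and the circle has radius 1.

```python
import math
import random

SIDE = math.sqrt(3)   # side of an equilateral triangle inscribed in a unit circle

def chord_from_endpoints(rng):
    # Model 1: both endpoints chosen uniformly on the circumference.
    t1 = rng.uniform(0, 2 * math.pi)
    t2 = rng.uniform(0, 2 * math.pi)
    return 2 * abs(math.sin((t1 - t2) / 2))

def chord_from_diameter_point(rng):
    # Model 2: midpoint at a uniformly random distance d from the centre.
    d = rng.random()
    return 2 * math.sqrt(1 - d * d)

def chord_from_disc_point(rng):
    # Model 3 (ASSUMPTION): midpoint uniform over the disc's area, so the
    # squared distance from the centre is uniform on [0, 1).
    d_squared = rng.random()
    return 2 * math.sqrt(1 - d_squared)

def estimate(model, trials=200_000, seed=2):
    """Fraction of simulated chords longer than the triangle's side."""
    rng = random.Random(seed)
    return sum(model(rng) > SIDE for _ in range(trials)) / trials

for model in (chord_from_endpoints, chord_from_diameter_point, chord_from_disc_point):
    print(model.__name__, round(estimate(model), 3))
```

Three sensible-looking procedures, three different answers: the code makes the vagueness of “at random” concrete.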

For fun … SICHERMAN DICE: Most people believe that ordered pairs of values are the fundamental “equally likely” entities to be considered when rolling a pair of dice and computing their sum. In this case, the results of rolling two dice can be nicely displayed in a table:

We see now that \( P(3) = \frac{2}{36} \), \( P(7) = \frac{6}{36} \) and so forth.

Suppose instead we roll two dice, one numbered 1-2-2-3-3-4 and the other 1-3-4-5-6-8. Complete the following addition table and verify that these dice give exactly the same probabilities for any given sum as ordinary dice.

CHALLENGE: Is there a way to renumber a pair of tetrahedral dice so that the probability of any given sum is the same as from a pair of “ordinary” tetrahedral dice (numbered 1-2-3-4 and 1-2-3-4)?

COMMENT: These dice were discovered by Col. George Sicherman in the 1970s.
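After filling in the addition table by hand, the claim about the Sicherman dice can be double-checked by brute force. This Python sketch again treats the 36 ordered pairs of faces as equally likely; `sum_counts` is a helper name invented for this illustration.

```python
from collections import Counter
from itertools import product

def sum_counts(die_a, die_b):
    """Count how many of the equally likely ordered pairs give each sum."""
    return Counter(a + b for a, b in product(die_a, die_b))

ordinary = sum_counts(range(1, 7), range(1, 7))
sicherman = sum_counts([1, 2, 2, 3, 3, 4], [1, 3, 4, 5, 6, 8])

print(sorted(sicherman.items()))
print(ordinary == sicherman)   # True
```

The same function could be pointed at candidate tetrahedral dice as a starting point for the challenge.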

NAÏVE PROBABILITY THEORY CONTINUED …

Putting philosophical woes on hold for now …

Given a (finite) sample space S and an event A, we have defined the probability of event A occurring as:

\[ P(A) = \frac{\text{size of } A}{\text{size of } S} \]

Next, we need to explore the possibilities of combining actions.

“Definition:” Two actions are said to be independent if the outcomes of one action in no way affect the outcomes of the other.

Example: Tossing a coin and rolling a die are independent events. Here the sample space is the set of twelve pairs:

(H, 1) (H, 2) (H, 3) (H, 4) (H, 5) (H, 6)
(T, 1) (T, 2) (T, 3) (T, 4) (T, 5) (T, 6)

Example: Picking a card from a deck of cards, destroying it, and then picking a second card from the deck, are NOT independent events: The result of the first action affects possible outcomes for the second. For instance, picking the ace of spades first no longer allows the ace of spades to be chosen second.

Example: Deciding what to wear and the weather forecast are not independent events.

COMMENT: We generally rely on our intuitive understanding of the world to conclude whether or not two events are independent. (Again, this can be a difficult issue.) This notion, as it stands, is just as vague and problematic as the term “equally likely.”

THE USE OF THE WORD “OR”

Let’s examine a basic example of two independent actions:
ACTION 1: Toss a coin
ACTION 2: Roll a die

The set of all possible outcomes can be displayed in a tree diagram:

The twelve different outcomes are now explicit. It is easy to compute probabilities. For example:

\[ P(\{\text{H, even}\}) = \frac{3}{12}, \qquad P(\{\text{T, 5 or 6}\}) = \frac{2}{12} \]

\[ P(\{\text{H, even}\} \text{ OR } \{\text{T, 5 or 6}\}) = \frac{3 + 2}{12} = \frac{3}{12} + \frac{2}{12} = P(\{\text{H, even}\}) + P(\{\text{T, 5 or 6}\}) \]

This illustrates a general principle:

If
A = one set of desired outcomes
B = a second set of desired outcomes
and these two sets share no outcomes in common, then:

P(A or B) = P(A) + P(B)

Example: Let
A = “head and any number”
B = “tail and 3”

Then

\[ P(A \cup B) = P(A) + P(B) = \frac{6}{12} + \frac{1}{12} = \frac{7}{12} \]

Comment: The union symbol ∪ is interpreted as “or.”

EXERCISE: What if A and B do share outcomes in common? Use the following Venn diagram to explain the formula: \( P(A \cup B) = P(A) + P(B) - P(A \cap B) \).

EXERCISE: Find P({H, odd} or {H, 1, 2, or 3}).
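Both versions of the “OR” rule can be tested on the twelve coin-and-die outcomes using Python sets. In the sketch below, A and B are the sets from the example above, while C (“any coin, even number”) is an extra set introduced here purely to show an overlapping case; the helper `P` assumes all twelve outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

S = set(product("HT", range(1, 7)))     # the twelve (coin, die) outcomes

def P(event):
    """Naive probability of an event within the sample space S."""
    return Fraction(len(event & S), len(S))

A = {(c, n) for c, n in S if c == "H"}              # head and any number
B = {(c, n) for c, n in S if c == "T" and n == 3}   # tail and 3
C = {(c, n) for c, n in S if n % 2 == 0}            # hypothetical: any coin, even number

assert not (A & B)                           # A and B share no outcomes
print(P(A | B), P(A) + P(B))                 # 7/12 7/12
print(P(A | C), P(A) + P(C) - P(A & C))      # 3/4 3/4
```

When the sets overlap, the shared outcomes would be counted twice by plain addition, which is why the intersection is subtracted.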

THE USE OF THE WORD “AND”

There is a second model called the “square model” that is useful for analyzing probabilities.

Example: Suppose 100 people walk down a garden path that leads to a fork. A left turn leads to house A, a right turn to house B.

Assume that there is a 50% chance that a person will turn one way over another. In this set-up we’d expect, basically, 50 people to end up at house A and 50 people at house B. The following diagram of one hundred dots (for 100 people) depicts this outcome:

The number “100” here is immaterial. The point is that if a square is used to denote the entire population of people walking down the path, then half the area of the square (half the people) ends up with result “A”, and the second half of the square with result “B.” This was a very simple example. Let’s practice some more complicated scenarios.


EXERCISE: Folk walk down the following system of paths. Use the square model to compute the fraction of people that end up at house A, at house B, and at house C. [Assume that each choice encountered at a fork in the path is equally likely.]


EXERCISE: Folk walk down the following system of paths. Use the square model to compute the fraction of people that end up at each house. [Again assume that each choice encountered at a fork in the path is equally likely.]

EXAMPLE: I roll a die and then toss a coin. What are the chances of getting an even number followed by a head?

Answer: Think of this as a path-walking problem with two houses labeled “WANT” and “DON’T WANT.” The forks in the road represent the options that can occur (each with a 50% chance of occurring):

This leads to the square model diagram:

We see that the desired outcome represents one quarter (half of a half) of the square:

\[ P(\text{even AND head}) = \frac{1}{4} \]

□

EXAMPLE: I toss a quarter, then I toss a dime, and then I roll a die. What are the chances of receiving HEAD, HEAD, and “5 or 6”?

Answer: Here’s the garden path:

This gives the square model:

We have:

\[ P(\text{H and H and } \{5, 6\}) = \frac{2}{6} \text{ of } \frac{1}{2} \text{ of } \frac{1}{2} \text{ of the square} = \frac{2}{6} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{12} \]

□
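The product in this example can be compared against a direct count. The Python sketch below lists every (quarter, dime, die) outcome, assumes all 24 are equally likely, and checks that counting agrees with multiplying the three separate probabilities.

```python
from fractions import Fraction
from itertools import product

# Every (quarter, dime, die) outcome: 2 * 2 * 6 = 24, all deemed equally likely.
outcomes = list(product("HT", "HT", range(1, 7)))

# The outcomes we want: HEAD, HEAD, and a 5 or a 6.
wanted = [(q, d, n) for q, d, n in outcomes if q == "H" and d == "H" and n in (5, 6)]

by_counting = Fraction(len(wanted), len(outcomes))
by_multiplying = Fraction(1, 2) * Fraction(1, 2) * Fraction(2, 6)

print(by_counting, by_multiplying)   # 1/12 1/12
```

The agreement is no accident: with independent actions, the combined sample space is the full grid of combinations, so counts multiply.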

We see the “multiplication principle” at hand: If A represents the set of desired outcomes for one action, and B the set of desired outcomes for a second action, and these actions are independent, then:

P(A and B) = P(A) × P(B)

In summary:

“OR” means “addition” (for sets of outcomes with nothing in common)
“AND” means “multiplication” (for independent actions)

ASIDE: Consider a garden path that leads to two possible houses, A and B. Suppose there are an infinite number of three-way forks, each with left turn leading to A, right turn leading to B, and straight path leading to the next fork. Create a beautiful depiction of the square model for this situation that makes it visually clear that this formula must be true:

\[ \frac{1}{3} + \frac{1}{9} + \frac{1}{27} + \frac{1}{81} + \cdots = \frac{1}{2} \]

ANOTHER CONFUSING MATTER … DOES THE ORDER IN WHICH ONE COMPLETES TASKS MATTER?

For example, in tossing a coin and rolling a die there are three possibilities:
1. Toss the coin and then roll the die.
2. Roll the die and then toss the coin.
3. Roll the die and toss the coin simultaneously.

Standing back and thinking about this, one would likely say that all three scenarios are philosophically equivalent, even though the tree diagrams and square diagrams for possibilities 1. and 2. are different. (Draw them! Is it possible to draw a tree diagram for simultaneous actions?) It’s good to spell things out and make explicit the following …

SEQUENCE PRINCIPLE: If two actions are independent, then performing the two actions simultaneously is philosophically equivalent to performing them one at a time (and it does not matter in which order one opts to do them).

The following example is typical of the difficulties that can arise:

EXAMPLE: I roll a pair of dice. What are the chances of getting a “6” and a “2”?

Answer: It is best to avoid thinking of events that occur simultaneously and instead tease them apart into a sequence of actions. Here we can imagine rolling one die and then the other. Now it is clear that there are two desirable possibilities:

Roll a 6 and then roll a 2
OR
Roll a 2 and then roll a 6

Using “+” for OR and “×” for AND, the probability we seek is:

\[ P = P(6) \times P(2) + P(2) \times P(6) = \frac{1}{6} \cdot \frac{1}{6} + \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{18} \]

□

EXERCISE: I roll three dice simultaneously. What are the chances of seeing two 6s and one 5?

EXERCISE: In rolling two dice, what are the chances of getting a “2” and a “2”?

This is essentially all there is to naïve probability theory. Of course, there are subtle issues to explore, and we’ll come to those in applications, but for now, let’s practice the basic ideas.

EXERCISE: Suppose A is an event for some sample space S. Explain the following formula:

P(not A) = 1 – P(A)

EXERCISE: The probability that any one person will be bitten at least once in life by a dog is \( \frac{1}{20} \). The probability of being bitten by a cat is \( \frac{1}{50} \). Find:
a) The probability that a person will be bitten by both a cat and a dog some time in life.
b) The probability that a person will never be bitten by a dog.
c) The probability that someone will be bitten by a cat or a dog but not both.

EXERCISE: Three dice are tossed simultaneously. What are the chances of:
a) Receiving three 1s?
b) Receiving no 1s?
c) Receiving two 1s and one 2?
Three dice and two coins are tossed simultaneously. What are the chances of:
d) Receiving three 1s and two Hs?
e) Receiving no 1s and no Hs?

EXERCISE: A couple has two children.
a) Draw a tree diagram illustrating all possibilities re gender.
b) What is the probability that the couple has two boys?
c) What is the probability that the couple has a child of each gender?
[Assume that the chances of having a boy match those of having a girl.]

EXERCISE:
a) You know that Jenny has two children and that her first child is a boy. What is the probability that her other child is a girl?
b) You know that Mike has two children and that one is a boy. What are the chances that the other child is a girl?

EXERCISE: Lulu has four children and you are told that at least one of the four is a boy. What is the probability that …
a) Exactly two of her children are boys?
b) At least two are boys?

EXERCISE: Two dice are rolled.
a) What is the probability that their sum is odd? Even?
b) What is the probability that their product is odd? Even?
c) When rolling a pair of dice, the most likely sum is “7.” What is the most likely product?

EXERCISE: A pop-quiz has 10 multiple choice questions, each with choices: A, B, C or D. I didn’t study for this quiz and decided to circle the answers at random.
a) What are the chances that I will get all 10 questions right?
b) What are the chances that I will get 9 out of 10 correct?
c) What are the chances that I will get 8 out of 10 correct?
d) What are the chances that I will get at least three right?

EXAMPLE: A CLIFF HANGER
Dorothy is not feeling good. She stands at the edge of a cliff, conveniently labeled position 1, with an infinite expanse of land at her back (conveniently labeled in steps 2, 3, 4, …). Regard location 0 as “off the cliff.”

Dorothy lays her fate in the toss(es) of a coin. She pulls out a quarter to toss and decides:

If it lands HEADS, I will step forward (to my doom).
If it lands TAILS, I will step backwards one place and toss again.

She does this repeatedly, stepping forward with each land of HEADS and backwards with each land of TAILS, until she either meets her doom or ends up wandering forever in the infinite expanse behind her. What are the chances that Dorothy will walk over the cliff?

Answer: We’ll answer this in a series of steps. Let \( p = p(1 \to 0) \) be the probability, when standing at position 1, of Dorothy eventually reaching position zero. (This may either be by stepping one pace forward right away, or taking a step back and then two paces forward, and so forth.) It is this value p that we seek. Define \( p(2 \to 1) \), \( p(3 \to 2) \), and so on the same way.

STEP 1: Can you see that \( p(1 \to 0) \) and \( p(2 \to 1) \) and \( p(3 \to 2) \) and so on, are each, philosophically, the same problem and so have the same value p?

Notice that:

\[
\begin{aligned}
p(1 \to 0) &= p(\text{stepping forward right away OR stepping back to position 2 and reaching 0 sometime later}) \\
&= p(\text{stepping forward right away}) + p(\text{stepping back AND moving } 2 \to 0) \\
&= \frac{1}{2} + \frac{1}{2}\, p(2 \to 0) \\
&= \frac{1}{2} + \frac{1}{2}\, p(\text{moving from 2 to 1 AND moving from 1 to 0}) \\
&= \frac{1}{2} + \frac{1}{2}\, p(2 \to 1) \cdot p(1 \to 0)
\end{aligned}
\]

This leads to the quadratic equation:

\[ p = \frac{1}{2} + \frac{1}{2} p^2 \]

STEP 2: Solve for p. Is Dorothy’s doom certain? □

EXERCISE: Explain why \( p(N \to 0) = p(N \to N-1) \times \cdots \times p(2 \to 1) \times p(1 \to 0) = 1 \times \cdots \times 1 \times 1 = 1 \). What is this saying?

EXERCISE: GAMBLER’S RUIN
A gambler repeatedly plays a simple game: a 50% chance of winning a dollar, a 50% chance of losing a dollar. If she starts with $N, what are the chances of her losing all her money?

COMMENT: We are flirting with the notion of a “random walk” and have essentially proven that, in a one-dimensional random walk, the walker will visit each and every cell of the line an infinite number of times. Feel free to conduct some internet research on this topic.
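Dorothy’s walk is easy to act out on a computer, with one honest caveat: a program cannot wait forever, so the Python sketch below gives up after a fixed number of tosses (the cap `max_tosses` is an arbitrary choice made here) and counts any walk still going as “not yet fallen.” It therefore slightly underestimates the true chance. The function names are invented for this illustration, and a fair coin is assumed.

```python
import random

def falls_off(rng, start=1, max_tosses=10_000):
    """Simulate one walk: HEADS steps toward the cliff, TAILS steps away.
    Returns True if position 0 is reached within max_tosses tosses."""
    position = start
    for _ in range(max_tosses):
        position += -1 if rng.random() < 0.5 else 1
        if position == 0:
            return True
    return False          # still wandering when we stopped watching

def estimate_doom(trials=2_000, seed=3):
    """Fraction of simulated walks that reach position 0 before the cap."""
    rng = random.Random(seed)
    return sum(falls_off(rng) for _ in range(trials)) / trials

print(estimate_doom())
```

Raising `max_tosses` pushes the estimate closer to the value found by solving the quadratic. Calling `falls_off(rng, start=N)` gives a rough experiment for the gambler’s ruin exercise as well.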

EXAMPLE: A SMARTER GAMBLER
Another gambler repeatedly plays the same simple game with a 50% chance of winning a dollar, 50% chance of losing a dollar. She starts with $4. She decides she will stop playing when she either reaches $0 (loses all her money) or gets to $10. What are her chances of reaching $10?

Answer (almost): Let \( p(N) \) = probability of reaching $10 starting with $N in hand. We certainly have \( p(0) = 0 \) (we’ve already lost our money and have no chance of reaching $10) and \( p(10) = 1 \) (with $10 in hand we are guaranteed having $10!). Now, suppose N is between 1 and 9 inclusive. Then:

\[
\begin{aligned}
p(N) &= p(\text{lose a dollar AND play with } \$(N-1) \text{ OR win a dollar AND play with } \$(N+1)) \\
&= \frac{1}{2}\, p(N-1) + \frac{1}{2}\, p(N+1)
\end{aligned}
\]

Thus each number \( p(N) \) is the average of its two neighboring values.

CHALLENGE: If \( p(1) = x \), show that this means that \( p(2) = 2x \), and that \( p(3) = 3x \), and so forth. What must be the value of x?

And, so, what is the value of \( p(4) \)? □

EXERCISE: Back to Dorothy … Suppose she uses a weighted coin that has only a \( \frac{1}{3} \) chance of landing HEADS and a \( \frac{2}{3} \) chance of landing TAILS. Show that her chance of survival after a possibly infinite number of tosses is now 50%!
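Returning to the smarter gambler: the rule that each p(N) is the average of its two neighbours can also be explored numerically. The Python sketch below is a different technique from the algebra the challenge asks for. It simply starts from a guess and keeps re-averaging while holding p(0) = 0 and p(10) = 1 fixed. The function name and the number of sweeps are choices made here for illustration.

```python
def reach_goal_probabilities(goal=10, sweeps=5_000):
    """Repeatedly replace each interior p(N) by the average of its neighbours,
    keeping the boundary values p(0) = 0 and p(goal) = 1 fixed."""
    p = [0.0] * (goal + 1)
    p[goal] = 1.0
    for _ in range(sweeps):
        for n in range(1, goal):
            p[n] = 0.5 * p[n - 1] + 0.5 * p[n + 1]
    return p

p = reach_goal_probabilities()
print([round(value, 3) for value in p])
```

The printed values settle into an evenly spaced pattern, which is what the challenge asks you to prove.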

A not-so-exciting example:

EXERCISE: A bag contains:
Two red balls
Three white balls
Six blue balls
A ball is chosen at random. What is the probability that the ball is:
a) Blue?
b) Either red or blue?
c) Neither red nor blue?

A not-unexciting example:

EXERCISE: A bag contains one Red, one Blue and one White ball. John picks out a ball at random. If it is red, he wins. If it is blue, he loses. If it is white, he puts the ball back, and adds to the bag another red ball and another blue ball. (So the bag now contains 2 Rs, 2 Bs and one W.) John then chooses a ball at random. If it is red, he wins now. If it is blue, he loses. If it is white, he puts the ball back and adds another red ball and another blue ball and picks again. John keeps doing this until he either wins or loses. Use this problem to establish the bizarre formula:

\[ \frac{1}{3} + \frac{1}{3} \cdot \frac{2}{5} + \frac{1}{3} \cdot \frac{1}{5} \cdot \frac{3}{7} + \frac{1}{3} \cdot \frac{1}{5} \cdot \frac{1}{7} \cdot \frac{4}{9} + \cdots = \frac{1}{2} \]
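Before hunting for the explanation, it can be reassuring to see the partial sums of the bizarre formula creep toward one half. The Python sketch below adds up the terms exactly, reading the k-th term as “white on every earlier draw, then red on draw k” with k red, k blue and one white ball in the bag on that draw. The function name is made up for this illustration.

```python
from fractions import Fraction

def p_john_wins_within(rounds):
    """Exact probability that John draws red on or before the given round."""
    total = Fraction(0)
    still_playing = Fraction(1)           # chance every earlier draw was white
    for k in range(1, rounds + 1):
        balls = 2 * k + 1                 # k red, k blue, 1 white in round k
        total += still_playing * Fraction(k, balls)
        still_playing *= Fraction(1, balls)
    return total

for rounds in (1, 2, 3, 4, 10):
    print(rounds, float(p_john_wins_within(rounds)))
```

A numerical check is not a proof, of course. The exercise is asking why the limit must be exactly one half.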

EXAMPLE: NON-TRANSITIVE DICE
Here are designs for four dice: A, B, C and D.

You and a friend decide to play the following game.

Your friend chooses a die and then you choose a die. You each roll your chosen die. Whoever receives the larger number wins.

Show that if you choose the die to the left of your friend’s choice (or choose die D if your friend chooses die A) you will win this game two-thirds of the time. That is, show that:
die A beats die B two-thirds of the time
die B beats die C two-thirds of the time
die C beats die D two-thirds of the time
die D beats die A two-thirds of the time

HINT: The following table shows all possible wins if dice A and B are rolled:

We indeed see that “A beats B” two-thirds of the time.

EXERCISE: ANOTHER MAGICAL PROPERTY OF A MAGIC SQUARE
Here’s the standard 3x3 magic square. (Each row, column and diagonal has the same sum of 15.)

Player A picks a number at random from row A, player B a number at random from row B, and player C a number at random from row C. Let’s say that “A beats B” if A’s number is larger than B’s.
a) Show that chances are A will beat B.
b) Show that chances are B will beat C.
c) Show that chances are C will beat A!
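The three comparisons can be counted by machine once the square is written down. The picture of the square is not reproduced in this text, so the Python sketch below ASSUMES the common layout with rows 8-1-6, 3-5-7 and 4-9-2 labelled A, B and C from top to bottom; a different orientation of the square may change the results. The helper name `p_beats` is invented for this illustration.

```python
from fractions import Fraction
from itertools import product

# ASSUMPTION: rows A, B, C are 8-1-6, 3-5-7, 4-9-2. The original figure is
# not available here, so this layout is a guess at the intended square.
rows = {"A": (8, 1, 6), "B": (3, 5, 7), "C": (4, 9, 2)}

def p_beats(first, second):
    """Chance that a random number from row `first` exceeds one from row `second`."""
    wins = sum(x > y for x, y in product(rows[first], rows[second]))
    return Fraction(wins, 9)            # 3 * 3 equally likely pairings

for first, second in (("A", "B"), ("B", "C"), ("C", "A")):
    print(first, "beats", second, "with probability", p_beats(first, second))
```

Each probability comes out above one half under this layout, so the “beats” relation runs in a circle, just as with the non-transitive dice.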


ANOTHER CLASSIC … THE BIRTHDAY PROBLEM Two random people are kidnapped. What are the chances that their birthdays land on the same day of the year? Give your answer as a percentage to one decimal place. Answer:

Three random people are kidnapped. What are the chances that at least two of them have the same birthday? Answer:

Four people are kidnapped. What are the chances that at least two of them have the same birthday? Answer:
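Once you have tried these by hand, a few lines of code can check your answers and extend them to larger groups. This is a sketch under the usual simplifying ASSUMPTIONS (365 equally likely birthdays, no leap years, people's birthdays independent of one another); the function name `shared_birthday_probability` is made up here.

```python
def shared_birthday_probability(people, days=365):
    """Chance that at least two of `people` share a birthday, assuming equally likely days."""
    p_all_different = 1.0
    for k in range(people):
        # Person k+1 must avoid the k birthdays already taken.
        p_all_different *= (days - k) / days
    return 1.0 - p_all_different

for n in (2, 3, 4, 23):
    print(n, f"{100 * shared_birthday_probability(n):.1f}%")
```

The trick is to compute the chance that all birthdays differ and subtract it from 1, which is far easier than counting the matches directly.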


If you are game … fill out the following table:


THE EMPIRICAL MODEL

One way to attempt computing probability values is to repeat an experiment a large number of times and see how often the desired outcome occurs. This is usually the only option available if a particular experiment is extremely complicated or difficult to analyse. For example:

What is the probability that 5 letters chosen at random spell an English word?

The best thing to do would probably be to have a computer generate 10,000 examples of five letters chosen at random and match each of these against its built-in dictionary to see what proportion of them are English words. This won’t give an exact answer to our question, but we suspect it would be a very close approximation.

We are using a principle here called the “Law of Large Numbers.” This law makes intuitive sense and is often assumed without explicit mention. In the nineteenth and twentieth centuries, as mathematicians attempted to put probability theory on a sound, rigorous footing, one of the significant “check-points” of their work was the ability to prove this Law of Large Numbers as true according to the axioms of their theory. Here’s the principle:

LAW OF LARGE NUMBERS

The more times a random phenomenon is performed, the more closely the proportion of trials in which a particular desired outcome occurs tends to approximate the true probability of that outcome occurring.

Example: If you toss a coin some number of times, you would expect approximately half of the tosses to be HEADS and half TAILS.
With 10 tosses, it is unlikely you will receive exactly half of each. (Try it!)
With 100 tosses, the proportion of heads would be closer to 50%.
With 1000 tosses, significantly closer to 50%.
Even better with 100000000000000000 tosses!
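If tossing a real coin 1000 times does not appeal, a computer can do the tossing. The sketch below is one way to watch the proportion of heads settle down; it ASSUMES that Python's pseudo-random generator is a fair enough stand-in for a coin, and the name `heads_proportion` is made up for the illustration.

```python
import random

def heads_proportion(tosses, rng):
    """Toss a simulated fair coin `tosses` times and return the fraction of heads."""
    heads = sum(1 for _ in range(tosses) if rng.random() < 0.5)
    return heads / tosses

rng = random.Random(2024)  # fixed seed so that a run can be repeated exactly
for n in (10, 100, 1_000, 100_000):
    print(n, heads_proportion(n, rng))
```

Change the seed and run it again: the small runs jump about, while the large runs stay stubbornly close to one half.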


Comment: Many gamblers incorrectly interpret this result as follows:

If a run of trials did not produce the desired outcome, then the chance of that outcome occurring on the very next trial is increased.

Example: You toss a coin nine times and get four HEADS and five TAILS. That the next toss will be HEADS is thus almost certain. NOT TRUE!

Example: You toss a coin 999 times and get TAILS every time. The chance that the next toss will be HEADS, alas, is still only 50%.

Gamblers often feel that a string of losses must produce a win on the next turn.

MONTE CARLO METHOD: The act of repeating an experiment multiple times to determine an approximate value for a probability is often called the “Monte Carlo Method.” Many casinos determined odds for their games simply by playing them multiple times and observing the frequency of outcomes. (How do you determine the chances of winning a hand of blackjack? An easy method is to observe the play of a large number of games.)

Aside: Read Bringing Down the House to learn how MIT students tipped the odds of blackjack in their favour by certain tactical plays.

One can use the Monte Carlo Method to work out areas of complicated regions.

Example: Here is an aerial photograph of an oil spill.

It is known that the area of the rectangular region photographed is 40 square kilometers.


One can compute (a fairly accurate approximation to) the area by digitizing the photograph and having a computer select 10,000 points at random in the photograph. If, say, 6,473 of those points land in the shaded region, then we can say that the area of the spill is ……?

ACTIVITY: In a group of three, call one person player 0, one person player 2 and the remaining person player 3. Each person places his right hand behind his back and secretly holds up one, two or three fingers. The players then show their hands. If all three numbers match, player 3 receives a point. If exactly two of the three numbers match, then player 2 receives a point. If there are no matches, player 0 receives a point.
a) Play this game a large number of times and tally points in a table. From your data, who seems to have the largest chance of receiving a point in any single game? Estimate that probability of winning. Estimate the chances of a win for each of the remaining two players.
b) Use theory to determine the actual probability of a win for each of the three players.
c) Suppose players 3, 2 and 0 are awarded a, b and c points, respectively, for each win, instead of one point. Choose values for a, b and c that make this game “fair.”

EXERCISE:
a) A bag contains four balls, each colored either red or blue. Jenny pulls out two balls at random and gets a pair of blue balls. She returns the balls to the bag, gives it a shake, and pulls out another pair of balls. She does this 100 times, recording the results along the way:
BB = 52 times
BR = 48 times
RR = 0 times
Most likely, how many blue balls and how many red balls are in that bag?
b) Suppose, instead, Jenny obtained the result:
BB = 16 times
BR = 70 times
RR = 14 times
What might you conclude about the colors of the balls?
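For Jenny's bag, one way to reason is to work out what each possible bag would produce in theory and then see which bag best matches her tallies. The sketch below lists the exact chances of BB, BR and RR for every mix of four balls; the function name `pair_probabilities` is invented here, and the comparison with Jenny's data is left to you.

```python
from fractions import Fraction
from itertools import combinations

def pair_probabilities(blue, red):
    """Exact chances of drawing BB, BR or RR when two balls are pulled out together."""
    balls = ["B"] * blue + ["R"] * red
    pairs = list(combinations(range(len(balls)), 2))  # equally likely pairs of balls
    counts = {"BB": 0, "BR": 0, "RR": 0}
    for i, j in pairs:
        key = "".join(sorted(balls[i] + balls[j]))  # gives "BB", "BR" or "RR"
        counts[key] += 1
    return {k: Fraction(v, len(pairs)) for k, v in counts.items()}

for blue in range(5):
    print(blue, "blue,", 4 - blue, "red:", pair_probabilities(blue, 4 - blue))
```

This is the empirical model run backwards: the observed proportions are used to guess which theoretical model is behind them.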


OPTIONAL EXERCISE: BUFFON NEEDLE PROBLEM
Do some internet research on how one can use probability theory to approximate π.

EXAMPLE: KRUSKAL COUNT
In the early 1980s, Princeton physicist Martin Kruskal discovered a remarkable mathematical property all passages of written text seem to possess. This phenomenon is now referred to as Kruskal’s count. To illustrate, consider the familiar nursery rhyme:

Twinkle twinkle little star,
How I wonder what you are,
Up above the world so high,
Like a diamond in the sky.
Twinkle twinkle little star,
How I wonder what you are.

Perform the following steps:
1. Select any word from the first or second line and count the number of letters it contains.
2. Count that many words forward through the passage to land on a new word. (For example, choosing the word star, with four letters, will transport you to the word what.)
3. Count the number of letters in the new word, and move forward again that many places.
4. Repeat this procedure until you can go no further (that is, counting forward would take you off the nursery rhyme).
5. Observe the final word on which you have landed.

Surprisingly, no matter on which word you start this counting task, the procedure always takes you to the same word in the final line, namely, the word you. Kruskal observed that this same phenomenon seems to occur with any sufficiently large piece of text: counting forward in this way from any choice of beginning word lands you at the same place at the end of the page. This provides an amusing activity for several people to perform simultaneously, all working with the same text, but starting with different choices of initial word. Why does this seem to work?
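The counting procedure is easy to hand to a computer, which makes it painless to try every starting word (or a different text). Here is a minimal sketch for the nursery rhyme; the names `RHYME` and `kruskal_final_index` are invented for the illustration, and punctuation is left out so that only letters are counted.

```python
RHYME = (
    "Twinkle twinkle little star "
    "How I wonder what you are "
    "Up above the world so high "
    "Like a diamond in the sky "
    "Twinkle twinkle little star "
    "How I wonder what you are"
).split()

def kruskal_final_index(words, start):
    """Follow the counting rule from position `start` and return the last position reached."""
    position = start
    # Keep jumping while the next jump still lands inside the text.
    while position + len(words[position]) < len(words):
        position += len(words[position])
    return position

# The first two lines of the rhyme are the first 10 words.
finals = {kruskal_final_index(RHYME, start) for start in range(10)}
print(finals, {RHYME[i] for i in finals})
```

Swapping in a longer passage for `RHYME` is a good way to explore the closing question of why the paths tend to merge.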


EXPECTED VALUE Suppose you play a game with some monetary values associated with it. For example:

Flip a coin. If it comes up HEADS, you win $2. If it comes up TAILS, you lose $1.

Definition: The expected value of a game is the average profit (or loss) one would expect if the game were played a large number of times.

For example, suppose we played the above game 200 times. Then, on average, we would expect to win $2 one hundred times and lose $1 one hundred times.

\[ \text{Average win} = \frac{2 + 2 + \cdots + 2 + (-1) + (-1) + \cdots + (-1)}{200} = \frac{2 \times 100 + (-1) \times 100}{200} = 2 \times \frac{1}{2} + (-1) \times \frac{1}{2} = \frac{1}{2} = 50 \text{ cents} \]

Thus, we’d expect to win 50 cents per game. COMMENT: Many textbook questions phrase a game like the one described above as follows:

You pay $1 to play the following game: Flip a coin. If it is HEADS, win $3. If it is TAILS, you win nothing. Do you see that it is exactly the same game as before? The idea of “having to pay first” often offers a point of confusion for students. It is good to tease such questions apart and list the end outcomes explicitly: If HEADS – I’m up $2 overall If TAILS - I am down $1 overall Now it is clear how to handle analysis of the game.


EXAMPLE: Imagine the following simple dice game:
Roll 1: You win $10
Roll 2: You win $5
Roll 3, 4, 5 or 6: You lose $3
Is this game worth playing?

Answer: Imagine playing 600 rounds. (Why did I choose the number 600?) On average, we’d expect:
100 times a win of $10
100 times a win of $5
400 times a loss of $3

\[ \text{Average profit} = \frac{100 \times 10 + 100 \times 5 + 400 \times (-3)}{600} = \frac{1}{6} \times 10 + \frac{1}{6} \times 5 + \frac{4}{6} \times (-3) = 0.50 \]

This game is in your favour. It is worth playing. You can expect to win, on average, 50 cents per game. □
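Both average-profit computations so far follow the same recipe: multiply each payoff by its chance and add. A short sketch of that recipe is below; the function name `expected_value` and the way each game is written as a list of (payoff, probability) pairs are choices made for this illustration.

```python
from fractions import Fraction

def expected_value(outcomes):
    """Add up payoff * probability over a list of (payoff, probability) pairs."""
    return sum(payoff * probability for payoff, probability in outcomes)

# The coin game: win $2 on HEADS, lose $1 on TAILS.
coin_game = [(2, Fraction(1, 2)), (-1, Fraction(1, 2))]

# The dice game: $10 for a 1, $5 for a 2, lose $3 on 3, 4, 5 or 6.
dice_game = [(10, Fraction(1, 6)), (5, Fraction(1, 6)), (-3, Fraction(4, 6))]

print(expected_value(coin_game), expected_value(dice_game))
```

Any game with finitely many outcomes can be fed to the same function, provided the listed probabilities add up to 1.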

Note: In this calculation we see the appearance of the probabilities of each outcome multiplied by their respective values of outcomes. In general … If a game offers values \( x_1, x_2, \ldots, x_n \) with probabilities \( p_1, p_2, \ldots, p_n \), then the expected value of the game is:

\[ x_1 p_1 + x_2 p_2 + \cdots + x_n p_n \]

This number is often denoted µ or E.

Exercise: Find the expected value of tossing a pair of dice.
Answer:


□

EXAMPLE: A coin is tossed. If H comes up on the first toss, you win $1. If H first appears on the second toss, you lose $1. If H first appears on the third toss, you win $1. And so on. (Basically: You win $1 if H first appears on an odd toss, and lose $1 if H first appears on an even toss.) What is the expected value of this game?

Answer: The probability of first getting heads on the first toss is \( \frac{1}{2} \).
The probability of first getting heads on the second toss is: \( \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} \). (Why?)
The probability of first getting heads on the third toss is: \( \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8} \).
And so on.

Thus:

\[ \mu = 1 \times \frac{1}{2} + (-1) \times \frac{1}{4} + 1 \times \frac{1}{8} + (-1) \times \frac{1}{16} + \cdots = \frac{1}{2} - \frac{1}{4} + \frac{1}{8} - \frac{1}{16} + \cdots \]

One can use the “geometric series” formula to evaluate this infinite sum. Another approach (a neat trick) is to multiply this sum by two:

\[ 2\mu = 2\left( \frac{1}{2} - \frac{1}{4} + \frac{1}{8} - \frac{1}{16} + \frac{1}{32} - \cdots \right) = 1 - \frac{1}{2} + \frac{1}{4} - \frac{1}{8} + \frac{1}{16} - \cdots = 1 - \left( \frac{1}{2} - \frac{1}{4} + \frac{1}{8} - \frac{1}{16} + \cdots \right) = 1 - \mu \]

So \( 2\mu = 1 - \mu \), giving \( \mu = \frac{1}{3} \). Done!

□
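For those who prefer the geometric series route mentioned in this example, here is that computation written out. The letters a (first term) and r (common ratio) are labels introduced just for this note; the formula used is the standard one for an infinite geometric series with |r| < 1.

```latex
\[
\mu = \frac{1}{2} - \frac{1}{4} + \frac{1}{8} - \frac{1}{16} + \cdots
    = \sum_{k=0}^{\infty} a r^{k}
\quad \text{with } a = \frac{1}{2},\; r = -\frac{1}{2}
\]
\[
\mu = \frac{a}{1 - r} = \frac{1/2}{1 + 1/2} = \frac{1/2}{3/2} = \frac{1}{3}
\]
```

Both routes land on the same value, which is a comforting check on the neat trick.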


EXERCISE: a) Consider the example we first studied in this section, phrased as the textbooks tend to phrase it:

You pay $1 to play the following game: Flip a coin. If it is HEADS, win $3. If it is TAILS, you win nothing. What is the expected profit in playing this game? A student decides to answer the problem as follows:

Well … \( \mu = \frac{1}{2} \cdot 3 + \frac{1}{2} \cdot 0 = 1.50 \) dollars. But since we paid a dollar, we must subtract $1 from this amount. The expected profit is therefore 50 cents.

This agrees with our previous answer. Coincidence?

COMMENT: Some students prefer to follow this approach to questions like these.

b) A casino offers a game in which one can win \( x_1, x_2, x_3 \) or \( x_4 \) dollars with probabilities \( p_1, p_2, p_3, p_4 \) respectively. Suppose the expected value of this game is

\[ \mu = x_1 p_1 + x_2 p_2 + x_3 p_3 + x_4 p_4 . \]

As a promotion, the casino decides to offer “bonus night” during which all payouts increase by $2. (Thus the payouts are now \( x_1 + 2, x_2 + 2, x_3 + 2, x_4 + 2 \) dollars.) Prove, mathematically, that µ increases by 2. Does the mathematics used here also explain the result of part a)?


TWO TIDBITS Here’s a seemingly paradoxical exercise: EXERCISE: In Tiny-Town, 90% of the city cabs are purple and the remaining 10% are blue. A crime was committed and an eye-witness claims she saw a blue cab at the scene. Subsequent tests showed that this witness is correct in her observations four times out of five, that is, 80% of the time. What are the chances that the cab at the scene really was blue?

COMMENT: The answer is surprisingly small!
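One way to see why is to weigh the two different ways the witness could come to say “blue.” The sketch below does this with exact fractions. It ASSUMES the witness is right 80% of the time whichever colour the cab really is, and that the cab at the scene is equally likely to be any cab in town; the function name `chance_cab_was_blue` is made up for the illustration.

```python
from fractions import Fraction

def chance_cab_was_blue(p_blue=Fraction(1, 10), p_correct=Fraction(4, 5)):
    """Chance the cab really was blue, given that the witness says it was blue."""
    says_blue_and_is_blue = p_blue * p_correct                 # blue cab, seen correctly
    says_blue_but_is_purple = (1 - p_blue) * (1 - p_correct)   # purple cab, seen wrongly
    return says_blue_and_is_blue / (says_blue_and_is_blue + says_blue_but_is_purple)

print(chance_cab_was_blue())
```

Try changing `p_blue` to see how strongly the answer depends on how rare blue cabs are.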


EXAMPLE: MONTY HALL PROBLEM
Named after the host of a popular American TV game show Let’s Make a Deal!, the Monty Hall problem is a classic puzzler often used to test initiates in the field of probability theory. It goes as follows:

On a game show three closed doors stand before you. The host informs you that a cash prize lies behind one of the doors, and nothing behind the other two. You select a door, but before you open it, the host quickly opens one of the remaining two doors to show you that the prize is not there. He now gives you the chance to change your mind and open instead the third remaining door.

The question is: What should you do? Should you stay with your original choice of door, or switch to the other option? Is there any advantage to switching?

One’s typical first reaction to this puzzle is that there is no advantage at all to switching: since two doors remain with only one containing a prize, the chance of selecting the correct door, either by staying with the chosen door or switching, is always 50 percent. Surprisingly, this reasoning is not correct, for it makes no use of the subtle information the host presents to you, which you can actually use to your advantage.

a) Play the game with a partner using playing cards as “doors” (one black and two red). Take turns being host and being contestant. What do you notice about your choices as host?
b) Explain why your odds of winning double if you choose to always “switch” rather than “stick.”
c) Suppose the host presents you with 100 doors with only one containing a prize. You reach for a door but just before you open it, the host reveals to you the empty contents of 98 other doors. There are now two closed doors, one with your hand on it. The host then offers you the chance to change your mind and open instead the remaining closed door. Should you “stick” with your original choice or “switch”?
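If no partner or playing cards are at hand, the game can be played many times over by computer. The sketch below is one possible simulation, ASSUMING the host knows where the prize is and never reveals it; the function names `play` and `win_rate` and the `doors` parameter are invented for the illustration.

```python
import random

def play(switch, rng, doors=3):
    """Play one game and return True if the contestant ends up with the prize."""
    prize = rng.randrange(doors)
    choice = rng.randrange(doors)
    # The host opens every door except the contestant's and one other,
    # never revealing the prize (with 3 doors he opens exactly one).
    if choice == prize:
        other = rng.choice([d for d in range(doors) if d != choice])
    else:
        other = prize
    final = other if switch else choice
    return final == prize

def win_rate(switch, trials=100_000, seed=0, doors=3):
    """Fraction of `trials` simulated games won with the given strategy."""
    rng = random.Random(seed)
    return sum(1 for _ in range(trials) if play(switch, rng, doors)) / trials

print("stick :", win_rate(switch=False))
print("switch:", win_rate(switch=True))
```

Calling `win_rate(True, doors=100)` gives a feel for part c) before you reason it out.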


For the bold … Consider the following variation of the game:

A game contains four doors, one with a fabulous cash prize behind it, the remaining three empty. The host invites you to select a door and you place your hand on its knob. At this stage, Monty opens one of the remaining three doors and shows its empty contents. He offers you the chance to “stick” with your current door choice or “switch” to one of the remaining two closed doors. You make your choice. Next, Monty opens a second door to reveal its empty contents. Two closed doors remain, one with your hand on its knob. He again offers you the chance to “stick” or “switch.” At this stage the game ends and you accept the consequences. There are four strategies to this two-stage game: Stick-Stick; Stick-Switch; Switch- Stick; Switch-Switch. Which of these four possibilities gives you the greatest chance of winning?

FOR THE BOLDER …

In our analysis of the original (and subsequent) Monty Hall problems we made two assumptions:
i) Monty does know behind which door the prize lies.
ii) Monty has no preference as to which non-prize door he opens when faced with a choice.

Let’s consider some alternative scenarios:

Assume that Monty knows where the prize lies but has a preference as to which non-prize door he opens when faced with a choice. (For an example of preference, suppose the doors are numbered 1, 2, and 3 and Monty will always open the lowest numbered door he can.) In this scenario, switching is not always better! Sometimes a stick is just as good! Matters depend on individual plays. For example, suppose the contestant reaches for door number 1. If Monty opens door number 3 to reveal a non-prize, then the contestant should switch for a certain win. (What stopped Monty from opening door 2 if that was his preference?) If, on the other hand, Monty did open door 2 to reveal a non-prize, then there is no advantage to sticking or switching. (Monty’s action here reveals no information about the possible location of the prize.)

EXPERIMENT: Conduct a card experiment with a friend mimicking this scenario. How often can your friend deduce for certain the location of the black card? In general, what are the odds of your friend winning this version of the game?

Assume that Monty has no knowledge of the location of the prize and, by luck, opened a non-prize door. There is no advantage to sticking or to switching: each produces a 50% chance of winning. To see why, imagine playing the game 30 times. On average, the contestant will have his hand on the correct door for 10 of those games, and on an incorrect door for 20 of those games. In those 20 games, Monty will accidentally reveal the prize half the time, and so we must reject those 10 occurrences. (We are told that this did not happen.) So we are left with 20 games to ponder, 10 of each type. Sticking leads to a win for 10 of those 20. Switching leads to a win for the remaining 10 of the 20.

CHALLENGE: Consider the final scenario in which Monty does not know the location of the prize but will open the lowest numbered door available to him. It turns out to be a non-prize. Should the contestant stick or switch, or does it depend?
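The “imagine 30 games” count for the scenario in which Monty does not know where the prize is can be mirrored by listing every equally likely case. The sketch below does so, ASSUMING Monty opens one of the two unchosen doors at random; the function name `ignorant_monty` and its variable names are invented for the illustration.

```python
from fractions import Fraction
from itertools import product

def ignorant_monty():
    """Exact stick and switch chances when Monty opens an unchosen door at random."""
    chosen = 0  # by symmetry we may fix the contestant's door
    stick = switch = kept = 0
    # Prize location (3 ways) and Monty's door (2 ways): 6 equally likely cases.
    for prize, opened in product(range(3), (1, 2)):
        if opened == prize:
            continue  # Monty revealed the prize, so this game is rejected
        kept += 1
        remaining = 3 - chosen - opened  # the door neither chosen nor opened
        stick += (prize == chosen)
        switch += (prize == remaining)
    return Fraction(stick, kept), Fraction(switch, kept)

print(ignorant_monty())
```

Changing how `opened` is selected is one way to start experimenting with the CHALLENGE.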


CONDITIONAL PROBABILITY

Analysis of the Monty Hall problem flirts with the difficulty of what to do when partial information is revealed in a situation. This leads us to the notion of “conditional probability.”

Definition: The probability of an event occurring given knowledge that another event has already occurred is called a conditional probability.

Example: Two cards are drawn at random from a deck. Knowledge of the colour of the first card will affect the likelihood that the second card is red. Specifically:

If we are told that the first card was black, then: \( P(\text{second card red}) = \frac{26}{51} \)

If we are told that the first card was red, then: \( P(\text{second card red}) = \frac{25}{51} \)

If we are told nothing about the colour of the first card, then: \( P(\text{second card red}) = \frac{26}{52} = \frac{1}{2} \)
[The first card might just as well still be in the deck.]

Notation: The probability that event A will occur given knowledge that event B has already occurred is denoted: P(A|B)

e.g. We have: \( P(\text{second card red} \mid \text{first card black}) = \frac{26}{51} \)


We can conduct a “thought activity” to determine a formula for P(A|B). Imagine running the experiment a large number of times and observing the number of times B occurs.

We want P(A|B), the proportion of times A occurs among those times B has already happened. That is, we want the number of times both A and B occurred compared to the number of times just B occurred. This suggests:

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \]

[Recall: The intersection symbol ∩ means ‘and.’]


Example: Draw a card from a deck. A friend tells you that the card is red. What is the probability that it is an ace?

Answer:

\[ P(\text{ace} \mid \text{red}) = \frac{P(\text{ace and red})}{P(\text{red})} = \frac{P(\text{red ace})}{P(\text{red})} = \frac{2/52}{1/2} = \frac{1}{13} \]

[And this makes sense since among the 26 red cards, two are aces.]

□
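The formula can be applied mechanically whenever all outcomes are equally likely: count the outcomes in B, count those in both A and B, and divide. Here is a sketch of that bookkeeping for the red-ace example; the function name `conditional` and the way the deck is written as (rank, suit) pairs are choices made for this illustration.

```python
from fractions import Fraction

def conditional(event_a, event_b, sample_space):
    """P(A|B) = P(A and B) / P(B) for a list of equally likely outcomes."""
    n = len(sample_space)
    p_b = Fraction(sum(1 for s in sample_space if event_b(s)), n)
    p_a_and_b = Fraction(sum(1 for s in sample_space if event_a(s) and event_b(s)), n)
    return p_a_and_b / p_b

# 52 cards as (rank, suit) pairs; rank 1 is the ace, hearts and diamonds are red.
deck = [(rank, suit) for rank in range(1, 14) for suit in "SHDC"]

def is_ace(card):
    return card[0] == 1

def is_red(card):
    return card[1] in "HD"

print(conditional(is_ace, is_red, deck))
```

The same function works for a die if you replace `deck` with the list 1 to 6 and supply your own events.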

Exercise: A die is rolled. Someone yells out that the answer is odd. Given this information, what is the probability that the roll was a “3”? a “4”? Answer these questions by practicing the formula for conditional probability (and then check that the answers make sense!)


CONDITIONAL PROBABILITY AND INDEPENDENT EVENTS Recall that two actions are independent if the outcomes of one in no way affect the outcomes of the other. So, if A and B are independent events, we’d expect then: P(A|B) = P(A) [Knowledge of B occurring in no way affects the likelihood of A occurring.] We can prove this mathematically.

Recall, for independent events, we have: \( P(A \text{ and } B) = P(A) \times P(B) \). Thus:

\[ P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{P(A) \cdot P(B)}{P(B)} = P(A) \]

Example: A coin is tossed and a die is rolled. What is the probability of getting a HEAD given that the die rolled a “6”?

Answer:

\[ P(\text{HEAD} \mid \text{SIX}) = \frac{P(\text{HEAD and SIX})}{P(\text{SIX})} = \frac{\frac{1}{2} \times \frac{1}{6}}{\frac{1}{6}} = \frac{1}{2} \]

□

COMMENT: In a sophisticated theory of probability, one defines a “measure” on a set of objects that defines what “at random” means for the problem at hand. This obviates the issue of “equally likely” and begins to put the theory on sound logical footing. Mathematicians then take the relation \( P(A \mid B) = P(A) \) as the definition of what it means for two events A and B to be independent, that is, A and B are said to be independent if the “measure” P(A|B) equals the “measure” P(A).


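BAYES’ THEOREM (1763)

What is the relationship between P(A|B) and P(B|A)?

e.g. Draw a card from a deck. Then:

\[ P(\text{ace} \mid \text{red}) = \frac{1}{13} \qquad P(\text{red} \mid \text{ace}) = \frac{1}{2} \]

What is the connection between these two numbers?

RECALL:

\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad \text{and} \quad P(B \mid A) = \frac{P(B \cap A)}{P(A)} \]

We have:

\[ P(B \mid A) = \frac{P(B \cap A)}{P(A)} = \frac{P(A \cap B)}{P(A)} = \frac{P(A \cap B)}{P(B)} \cdot \frac{P(B)}{P(A)} = P(A \mid B) \cdot \frac{P(B)}{P(A)} \]

That is:

\[ P(B \mid A) = P(A \mid B) \cdot \frac{P(B)}{P(A)} \]

Example:

\[ P(\text{red} \mid \text{ace}) = P(\text{ace} \mid \text{red}) \cdot \frac{P(\text{red})}{P(\text{ace})} = \frac{1}{13} \cdot \frac{1/2}{1/13} = \frac{1}{2} \]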


More generally … Suppose \( B_1 \) and \( B_2 \) are two non-overlapping events that cover the whole sample space.

e.g. \( B_1 \) = getting a red card
\( B_2 \) = getting a black card

Suppose A is another event.

e.g. A = getting an ace

Then:

BAYES’ THEOREM:

\[ P(B_1 \mid A) = \frac{P(A \mid B_1) \cdot P(B_1)}{P(A \mid B_1) \cdot P(B_1) + P(A \mid B_2) \cdot P(B_2)} \]

This looks worse than it is. It is also fairly straightforward to prove. Here’s some blank space to write out the proof: Proof:

□


Let’s do an example:

EXAMPLE: Bag 1 contains 5 red and 2 white balls. Bag 2 contains 7 red and 4 white balls. A bag is selected at random. A ball is selected at random from that bag. You are told the ball is red. What is the probability that the ball came from bag 1?

Answer: Let:
\( B_1 \) = ball comes from bag 1
\( B_2 \) = ball comes from bag 2
A = ball is red

We want \( P(B_1 \mid A) \), the probability that the ball came from bag 1 given that it is red. According to Bayes’ theorem:

\[ P(B_1 \mid A) = \frac{P(A \mid B_1) P(B_1)}{P(A \mid B_1) P(B_1) + P(A \mid B_2) P(B_2)} = \frac{\frac{5}{7} \cdot \frac{1}{2}}{\frac{5}{7} \cdot \frac{1}{2} + \frac{7}{11} \cdot \frac{1}{2}} = \frac{55}{104} \]

□ Would you have guessed this answer? The theorem looks complicated, but it allows you to compute some nasty problems with relative ease.
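The arithmetic in the theorem is easy to hand to a computer, which helps when the fractions get nastier. Here is a sketch that reproduces the two-bag example with exact fractions; the function name `bayes` and its list-based inputs are choices made for this illustration.

```python
from fractions import Fraction

def bayes(priors, likelihoods, index=0):
    """P(B_index | A), given each P(B_i) and P(A | B_i) for events B_i that split the sample space."""
    p_a = sum(p * l for p, l in zip(priors, likelihoods))  # the denominator, P(A)
    return priors[index] * likelihoods[index] / p_a

priors = [Fraction(1, 2), Fraction(1, 2)]         # P(bag 1), P(bag 2)
likelihoods = [Fraction(5, 7), Fraction(7, 11)]   # P(red | bag 1), P(red | bag 2)

print(bayes(priors, likelihoods))
```

Because the inputs are lists, the same function handles three or more bags, as long as the priors add up to 1.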


Here’s a simple, but philosophically confusing, example:

EXERCISE: TWO CARD PARADOX
One card is red on one side and black on the other. A second card is red on both sides. Both cards are put in a bag and one is pulled out at random. You see that one side of the chosen card is red. What is the probability that the other side of this chosen card is also red?
a) Answer this question first by using logical reasoning.
b) Answer this question a second time using Bayes’ theorem.

Answers:

EXERCISE: Yale psychologists have coined the term “cognitive dissonance” for the act of devaluing an object after being told it is not available, and conducted an experiment with monkeys to show that they too might engage in cognitive dissonance. (See http://www.nytimes.com/2008/04/08/science/08tier.html?_r=1&8dpc&oref=slogin ). Scientists had discovered that monkeys prefer red, green and blue M&M’s over all other colors. Monkeys were then given two M&M’s of different colors, say one red and one green. The monkeys would grab one candy, say the red one, and then have the other one taken away. Next the monkey would be offered another two M&M’s but of the colors it had not eaten, in our example, blue and green. The monkey had already experienced the green M&M being taken away, and the scientists found that about two thirds of the monkeys opted to take the blue one instead. Had the majority of monkeys indeed devalued the green M&M given the previous loss?

Show that this is a mathematical result and not a psychological result. That is, show that, of all the monkeys that prefer red M&M’s over green M&M’s, two thirds of them also prefer blue M&M’s over green M&M’s, irrespective of whatever experiment is to be conducted!
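A brute-force count may help you get started. The sketch below ASSUMES that every monkey ranks the three colours in a strict order and that all six possible rankings are equally common; the names `orderings` and `prefers` are invented for the illustration.

```python
from fractions import Fraction
from itertools import permutations

# Each ordering lists the colours from most preferred to least preferred.
orderings = list(permutations(["red", "green", "blue"]))

def prefers(order, x, y):
    """True if colour x comes before colour y in this monkey's ranking."""
    return order.index(x) < order.index(y)

red_over_green = [o for o in orderings if prefers(o, "red", "green")]
also_blue_over_green = [o for o in red_over_green if prefers(o, "blue", "green")]

print(Fraction(len(also_blue_over_green), len(red_over_green)))
```

Listing the rankings by hand and crossing out those that put green above red is the pencil-and-paper version of the same count.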


PROBLEM SET I

Question 1: Analyse and answer the following variation of de Méré’s problem:

Two players play a series of games for the “best out of five.” The winner is to receive a prize of $1000. After three games of play, in which the first player had won one game and the other player two games, the match was interrupted by an earthquake. How should the $1000 be divvied up between the two players so as to properly reflect their likelihoods of having won the series? (Assume that each player has a 50% chance of winning any particular game.)

Question 2: Repeat question 1 but this time assume that the first player has only a 10% chance of winning any individual game.

Question 3: 8640 people walk down the following garden path. At each fork, equal numbers of people take each option. Find the number of people that end up in each of the houses A, B, C, and D.


Question 4: Assume that exactly 50% of children born are boys and 50% are girls. A couple has three children.
a) Draw a tree diagram displaying the possible genders of their three children.
b) What are the chances that the couple has three boys?
c) What are the chances that the couple has at least one boy?
d) What are the chances that the couple has exactly two boys?

Suppose that we are now told that their first child was a girl.
e) What are the chances that the other two children are also girls?
f) What are the chances that at least one of their three children is a girl?

Question 5: Billy’s girlfriend has a dimple on her left cheek (there is a 1/100 chance that this occurs), blue eyes (there is a 1/100 chance this occurs), and likes math (there is a 1/100 chance that this occurs). He says that his girlfriend is “one in a million.” Is he correct?

Question 6: A card is drawn at random from a deck of 52 playing cards.
a) Describe the sample space if suits are not considered relevant.
b) Describe the sample space if suits and numerical value are considered relevant.
c) Describe the sample space if the value of the card is considered irrelevant.

Question 7: A card is drawn at random from a deck of 52 cards. What is the probability of:
a) Drawing an ace?
b) Drawing a club?
c) Drawing the ace of clubs?
d) Drawing an ace or a club?
e) Drawing neither an ace nor a club?
f) Drawing any suit except clubs?
g) Drawing the three of clubs or the king of diamonds or any heart?


Question 8: An urn contains 5 red balls, 8 blue balls, and 7 white balls. A ball is selected at random. What are the chances of:
a) selecting a red ball?
b) not selecting a blue ball?
c) selecting a ball that is white or blue?

After the ball is selected, it is returned to the urn, and the experiment is repeated. What are the chances, in the run of these two experiments, of
d) selecting a red ball followed by another red ball?
e) selecting two balls of the same colour?

Question 9: The chance that someone gets bitten by a dog at least once in life is 0.02. The chance of being hit by a meteorite at least once in life is 0.001. The chance of stepping in gum at least once in life is 0.99. What is the probability …
a) Of being hit by a meteorite and being bitten by a dog in your life?
b) Of never being hit by a meteorite?
c) Of never stepping in gum and never being hit by a meteorite?
d) Of all three events happening in your life?
e) Of none of these events happening in your life?

Question 10: M&M’s come in six colors. Here’s a table showing the probability that a randomly chosen M&M has a particular colour:

Colour       Brown  Red  Yellow  Green  Orange  Blue
Probability  30%    20%  20%     10%    10%

a) Fill in the missing number for blue.
b) What are the chances that an M&M chosen at random is either brown or red?


c) What are the chances that two M&M’s chosen at random from an extremely large bag are both blue?
d) What are the chances that three M&M’s chosen at random from an extremely large bag are all blue?

I’ve been told that the colour distribution for Peanut M&M’s is different. Obtain a bag of Peanut M&M’s and use your sample to make estimates for the entries in the following table:

Colour       Brown  Red  Yellow  Green  Orange  Blue
Probability

Question 11 (ANNOYING – BUT INTERESTING): It is said that a Friday falls on the 13th day of the month 48 times every 28 years. a) Verify this calculation. b) What is the probability that a randomly chosen Friday is a “Friday the 13th”? Question 12: (HARD-ISH, BUT REALLY INTERESTING!) There are eight possible outcomes in tossing a coin three times: HHH, HHT, HTH, THH, HTT, THT, TTH, and TTT. Two players decide to play the following game. Player A chooses the sequence HHH and player B the sequence THH. A coin is tossed repeatedly until one of these sequences appears. For example, the coin might produce T, T, H, T, H, H and player B wins. If the coin produces the sequence H, T, T, H, H, H then player A wins. a) Play the game 10 times. Does player B seem to win the majority of times? b) Explain why player B has the advantage. c) Suppose instead player A chooses the sequence HHT and player B the sequence THH. Play the game 10 times. Does player B again win the majority of times? Can you explain why?


d) Here’s a table showing all the options A could choose and what B chooses in response.

If A chooses this …   Then B chooses this …
HHH                   THH
HHT                   THH
HTH                   HHT
THH                   TTH
TTH                   HTT
THT                   TTH
HTT                   HHT
TTT                   HTT
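Ten plays per row is a small sample, so it can be instructive to let a computer play each row thousands of times as well. The sketch below is one possible simulation of the game in this question, ASSUMING a fair coin; the names `b_wins`, `b_win_rate` and `TABLE` are invented for the illustration.

```python
import random

def b_wins(seq_a, seq_b, rng):
    """Toss a fair coin until seq_a or seq_b appears; return True if seq_b comes first."""
    recent = ""
    while True:
        recent = (recent + rng.choice("HT"))[-3:]  # remember only the last three tosses
        if recent == seq_a:
            return False
        if recent == seq_b:
            return True

def b_win_rate(seq_a, seq_b, trials=20_000, seed=0):
    """Fraction of `trials` simulated games won by player B."""
    rng = random.Random(seed)
    return sum(1 for _ in range(trials) if b_wins(seq_a, seq_b, rng)) / trials

# A's choice mapped to B's response, copied from the table.
TABLE = {"HHH": "THH", "HHT": "THH", "HTH": "HHT", "THH": "TTH",
         "TTH": "HTT", "THT": "TTH", "HTT": "HHT", "TTT": "HTT"}

for a, b in TABLE.items():
    print(a, b, b_win_rate(a, b))
```

Compare these long-run rates with your own ten plays: how far off can a sample of ten be?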

Play the game 10 times for each of the eight rows in the table. Verify that B wins the majority of times in each case. Show me the results you obtained.

Question 13: A bag contains a red ball and a white ball. Jodie takes out a ball at random. If it is red she wins. If it is white, she then moves to a bag that contains two red balls and a single white ball, and pulls out a ball. If it is red, she wins. If it is white, she then moves to a bag that contains three red balls and a single white ball, and pulls out a ball. If it is red, she wins. If it is white, she then moves on to … There are an infinite number of bags available to her, and she keeps playing this game until she eventually wins. Explain how this hypothetical scenario “proves” the following equation:

\[ \frac{1}{2} + \frac{1}{2} \times \frac{2}{3} + \frac{1}{2} \times \frac{1}{3} \times \frac{3}{4} + \frac{1}{2} \times \frac{1}{3} \times \frac{1}{4} \times \frac{4}{5} + \cdots = 1 \]

(That is, in math notation, we’ve established: \( \sum_{n=1}^{\infty} \frac{n}{(n+1)!} = 1 \).)


Question 14: A bag contains a red ball, a blue ball, and a white ball. Schuyler pulls a ball out at random. If it is red, he wins. If it is blue, he loses. If it is white, then he moves on to a bag that contains two red, two blue and one white ball. He pulls one out at random. If it is red, he wins. If it is blue, he loses. If it is white he moves on to a bag that contains four red, four blue, and one white ball. And so on, with double the number of red balls and double the number of blue balls from bag to bag. a) Explain why Schuyler’s chances of winning this game are ½. b) Write an interesting infinite sum based on this hypothetical scenario whose value is ½.

Question 15: Three dice are tossed simultaneously. What are the chances of rolling …
a) three sixes?
b) two sixes and a one?
c) no sixes?
d) at least one six?

Question 16: Three dice and two coins are tossed simultaneously. What are the chances of receiving …
a) three sixes and two heads?
b) no sixes and two heads?
c) no sixes and no heads?

Question 17: Five cards are drawn from a deck of 52 cards:
a) Explain why the chances of pulling out five cards that are all hearts is:

\[ \frac{1}{4} \times \frac{12}{51} \times \frac{11}{50} \times \frac{10}{49} \times \frac{9}{48} \approx 0.05\% \]

b) Find the probability of pulling out five black cards.
c) Show that the probability of pulling out four Kings among the five cards is close to 0.002%.
d) Show that the probability of pulling out three Kings and two Queens is close to 0.001%.


Question 18: Consider the following magic square:

Player A chooses a number at random from the first row, player B chooses a number at random from the second row, and player C chooses a number at random from the third row.
a) What are the chances that player B’s number is higher than player A’s?
b) What are the chances that player C’s number is higher than player B’s?
c) What are the chances that player A’s number is higher than player C’s?
[In this game of chance, B has the advantage over A, C has the advantage over B, and A has the advantage over C!]

Question 19:
a) Eleven numbers are arranged in a line. The first number is 0, the last number is 0, and every number in between is the average of its two neighbors. What are the 11 numbers and why?
b) Eleven numbers are arranged in a line. The first number is 0, the last number is 1, and every number in between is the average of its two neighbors. What are the 11 numbers and why?

Question 20: You are a game show contestant and the game show host presents to you 100 boxes. She tells you that inside one box lies a fabulous prize and all the remaining boxes are empty. You select a box at random and are about to open it when the host interrupts you and opens 98 boxes to reveal to you their emptiness. This leaves two boxes: the one you selected and one other.


You are now given the chance to “stick” with the box you first chose, or to “switch” and open instead the second box.
a) If you decide to stick, what are your chances of winning the prize?
b) If you decide to switch, what are your chances of winning the game?

Suppose the game show host opens only 97 boxes. This leaves three boxes: the one you first selected and two others. The host now gives you the choice to either “stick” with your original box or to switch to either one of the remaining boxes.
c) If you decide to stick, what are your chances of winning the prize?
d) If you switch to a different box, what now are your chances of winning?

Question 21: In a game, if a outcomes are deemed “favorable” and the remaining b possible outcomes “unfavorable,” then folk may say – in horse racing circles in particular – that the odds in favor of winning are “a to b”, or alternatively that the odds against are “b to a.” For example, in rolling a die the odds in favor of rolling a 6 are 1:5. The odds against rolling a 5 or a 6 are 4:2 (which could be reduced to 2:1). In a horse race if the odds against a horse are 7:2, this means that bookies believe that the horse has only a 2/9 chance of winning.

CORRECT or INCORRECT?
a) A bookie at a horse race says that the odds against a particular horse are 5:8. This means that the probability the horse will win the race is 8/13.
b) A game yields a 30% chance of a win. The odds against winning the game are thus 7:3.
c) In casting a die, the odds in favor of rolling a number smaller than 5 are 2:1.
d) In tossing a coin twice, the odds against receiving two heads are 3:1.

Question 22: Bag 1 contains 13 red balls and 14 blue balls. Bag 2 contains 12 red balls and 7 blue balls. A bag is selected at random and a ball is pulled out of that bag at random. We are told that the ball is red. What is the probability that the ball came from bag 1?


Question 23: A die is tossed. What is the probability that the result is a number less than 4 if …
a) We are told no other information?
b) We are told that the result was an odd number?
c) We are told that the result wasn’t 5?
d) We are told that the result wasn’t 1?

Question 24:
a) Two ordinary dice are tossed. What is the probability of NOT getting a total of 7 or 11?
b) Two Sicherman dice are tossed. What is the probability of NOT getting a total of 7 or 11?

Question 25: One bag contains 4 red and 5 white balls. A second bag contains 3 red and 6 white balls. A ball is drawn from each bag. What is the probability that …
a) Both balls are white?
b) Both balls are red?
c) One is white and the other is red?

Question 26: One bag contains 2 red and 3 white balls. A second bag contains 3 red and 1 white balls. A ball is drawn from each bag. Suppose we are told that one ball chosen was red. What are the chances that the second ball is also red?

Question 27: A bag contains 5 red and 4 white balls. A ball is selected and then, without replacing the first ball, a second ball is selected. I tell you that the second ball is white. What is the probability that the first ball was white?

Question 28: You play a simple coin-tossing game. If the coin lands heads, you win $3. If it lands tails, you must pay $1.
a) If you play this game 100 times, how much money do you expect to have?
b) What is the expected value of this simple game?


Question 29: Roll a die. If it comes up even, you win that many dollars. If it comes up odd, you must pay that many dollars. (For example, a roll of “4” wins you four dollars. With a roll of “5,” you lose five dollars.) What is the expected value of this game? Would you want to play it?

Question 30: A coin is tossed once, possibly twice. If a head appears on the first toss, you win $10 and the game stops. If it lands tails, the coin is tossed again. If the second toss lands heads you win $4, otherwise you pay $20.
a) If you played this game 100 times, on average, how many times will you win $10? How many times will you win $4? How many times will you lose?
b) What is the expected value of this game? Would you want to play it?

Question 31: A gambling game is called “fair” if its expected value is zero. A die is rolled. If it rolls 1, 2, 3, or 4, you win $300. If it rolls 5 or 6 you lose $x. Find a value of x that makes this game fair.

Question 32: A gambling game is called “fair” if its expected value is zero. A coin is tossed three times. If at least two heads appear, you win $100. If exactly one head appears, you win $50. If no head appears, you lose $x. Find a value of x that makes this game fair.

Question 33: A die is rolled. If it lands 1 you win $10. If it lands 2 you win $300. If it lands 3 you win $1. If it lands 4 you lose $500. If it lands 5 or 6 you win $x. Find a value of x so that the expected value of this game is fifty cents.

Question 34: There is a one-in-twenty-million chance of winning the lottery. This week’s jackpot is $50,000,000. If it costs $1 to buy a ticket, what is the expected value of this game? Are the odds in your favour? [ASIDE: They are! But what fact of social behaviour are we ignoring in this argument? Why should one still not bother to buy a lottery ticket even if the prize is so high as to give the impression that the odds are in your favour?]


Question 35: PSEUDO-RANDOM NUMBERS
It is not possible to generate truly random numbers with a computer program alone, since any program follows a predetermined set of instructions, but it is possible to create a list that appears to be random. Several methods for doing so exist. One of the best known is the “middle-square method” developed in 1946 by John von Neumann. It works as follows:
Step 1: Select a four-digit number.
Step 2: Square the number to produce an eight-digit number. (You might have to place zeros at the front of the number to get eight digits.)
Step 3: Use the middle four digits of this eight-digit number as the next number in the sequence.
REPEAT
This procedure produces a seemingly random list of numbers between 0 and 9999.
a) Verify that starting with the number 7254 yields the sequence: 7254, 6205, 5020, 2004, 0160, 0256, 0655, 4290
b) What happens if, instead, you start with the initial number 1049?
This procedure (and, in fact, every procedure that currently exists) is not without flaw.
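The three steps of the middle-square method can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `middle_square` and its parameters are our own choices, not part of the original notes.

```python
def middle_square(seed: int, count: int) -> list[str]:
    """Return `count` four-digit terms of the middle-square sequence, beginning with `seed`."""
    terms = []
    value = seed
    for _ in range(count):
        terms.append(f"{value:04d}")        # record the term, padded to four digits
        squared = f"{value * value:08d}"    # Step 2: square, padding with leading zeros to eight digits
        value = int(squared[2:6])           # Step 3: keep the middle four digits
    return terms


sequence = middle_square(7254, 8)
print(", ".join(sequence))
```

Changing the seed in the last lines is an easy way to experiment with part (b).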


SOME MTEL-TYPE QUESTIONS

Question 36: The inner-most circle has diameter 6 inches, and each circle thereafter has diameter 4 inches greater than the previous circle.

a) What is the ratio of area D to area B?
b) What is the ratio of the perimeter of the largest circle in the diagram to the perimeter of the smallest circle?
c) A dart is randomly thrown at this target. What is the probability that it lands in either region A or region C?

Question 37: A survey displays milk preferences amongst men and women:

                 Men   Women
Whole Milk        10       3
2% Milk           18      16
Non-Fat            7      15
No Preference      6      12

a) A woman who was surveyed is chosen at random. What is the probability she prefers whole milk? b) A person who likes whole milk is chosen at random. What is the chance that this person is male?


Question 38: AE = 2AC = 3DE = 4BC

a) Is the ratio CD:AB greater than or less than one-third? b) A point is chosen in AE at random. What is the probability that it will lie in BD ?

Question 39: I select a card at random from a standard deck of 52 cards and then a second card. I put the remaining 50 cards aside and lay the two selected cards face down on the table-top in front of me. I look at the first card. It is black. Knowing this, what is the probability that the second card is also black?

Question 40: At a party, 30% of the people present select red as their favourite colour, 40% select blue, and 30% select yellow. If a person is chosen at random, what are the chances that he or she does NOT prefer yellow?

Question 41: A computer is programmed to select a single digit 1, 2, 3, 4, 5, 6, 7, 8, or 9 at random.
a) What are the chances that the computer will select an even digit?
b) The computer selects two digits, one after the other. What are the chances of obtaining an odd digit followed by an even digit?


PROBABILITY AND STATISTICS

Informal Course Notes

PART II of IV James Tanton © 2007 James Tanton

CONTENTS:

COUNTING PRINCIPLES
The Multiplication Principle
Factorials
The Labeling Principle
Multi-stage Labeling
Fun with Poker

PASCAL’S TRIANGLE
A Grid of Numbers
The Binomial Theorem

Exercises

PART TWO: 2

THE MULTIPLICATION PRINCIPLE Here’s a very simple puzzle: There are three major highways from Adelaide to Brisbane, and four major highways from Brisbane to Canberra.

How many different routes can one take to travel from Adelaide to Canberra? The answer to this question is clearly 12. But pause for a moment and ask yourself why? Is it obvious that the number of routes from A to C really is 3 × 4 , that is, three groups of four? Make sure you are comfortable that “multiplication” really is the right arithmetic operation here (as opposed to direct addition). Let’s take the puzzle up a notch: Suppose there are also six major highways from Canberra to Darwin.

How many different routes are there from A to D? Be sure that you are convinced the answer is given by multiplication: #routes = 3 × 4 × 6 = 72
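If the multiplication still feels suspicious, one can simply list every route. The sketch below does so in Python; the highway names (`AB1`, `BC1`, and so on) are made up purely for illustration.

```python
from itertools import product

# Hypothetical names for the highways on each leg of the trip.
adelaide_to_brisbane = [f"AB{i}" for i in range(1, 4)]   # three highways
brisbane_to_canberra = [f"BC{i}" for i in range(1, 5)]   # four highways
canberra_to_darwin = [f"CD{i}" for i in range(1, 7)]     # six highways

# A route is one choice of highway for each leg, made independently.
routes_a_to_c = list(product(adelaide_to_brisbane, brisbane_to_canberra))
routes_a_to_d = list(product(adelaide_to_brisbane, brisbane_to_canberra, canberra_to_darwin))

print(len(routes_a_to_c))   # number of routes from A to C
print(len(routes_a_to_d))   # number of routes from A to D
```

Each route from A to C pairs one of the three first-leg choices with one of the four second-leg choices, which is exactly “three groups of four.”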


EXERCISE: I own five different shirts, four different pairs of trousers and two sets of shoes. How many different outfits could you see me in?

EXERCISE: There are ten possible movies I can see and ten possible snacks I can eat whilst at the movies. I am going to see a film tonight and I will eat a snack. How many choices do I have in all for a movie/snack combo?

We have the …

THE MULTIPLICATION PRINCIPLE
If there are a ways to complete one task and b ways to complete a second task, and the outcomes of the first task in no way affect the choices made for the second task, then the number of different ways to complete both tasks is a × b. This principle readily extends to the completion of more than two tasks.

EXERCISE: Explain the clause stated in the middle of the multiplication principle. What could happen if different outcomes from the first task affect choices available for the second task? Give a concrete example.


FACTORIALS: In how many ways can six people stand in a line? Answer: There are six possibilities for the task of placing someone in the first spot, five possibilities for who to place second, four for third, three for fourth, two for fifth and one for sixth. By the multiplication principle there are thus: 6 × 5 × 4 × 3 × 2 × 1 = 720 ways to complete the task of lining up all six people.

□

Definition: The product of integers from 1 to N is called “N factorial” and is denoted N!. These numbers grow very large very quickly:

1! = 1
2! = 2 × 1 = 2
3! = 3 × 2 × 1 = 6
4! = 4 × 3 × 2 × 1 = 24
5! = 5 × 4 × 3 × 2 × 1 = 120
6! = 6 × 5 × 4 × 3 × 2 × 1 = 720
7! = 7 × 6 × 5 × 4 × 3 × 2 × 1 = 5040
8! = 8 × 7 × 6 × 5 × 4 × 3 × 2 × 1 = 40320

COMMENT: In 1729, at the age of 22, Swiss mathematician Leonhard Euler found a formula for a function that generalizes the factorial function. It is known today as the “Gamma Function.” The curious thing is that you can put fractional and irrational values into his gamma function and obtain meaningful answers. Euler discovered, for instance, that ½! equals $\frac{\sqrt{\pi}}{2}$. Very strange!

EXERCISE: What is the highest factorial your calculator can handle?
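Euler's value for ½! can be checked numerically. The sketch below uses Python's standard `math.gamma`, relying on the usual convention that Gamma(n + 1) = n!, so that ½! corresponds to Gamma(3/2).

```python
import math

# With the convention Gamma(n + 1) = n!, the value (1/2)! is Gamma(3/2).
half_factorial = math.gamma(1.5)
euler_value = math.sqrt(math.pi) / 2

print(half_factorial)
print(euler_value)

# The same function reproduces the ordinary factorials 0!, 1!, ..., 8!.
matches_factorials = all(
    math.isclose(math.gamma(n + 1), math.factorial(n)) for n in range(9)
)
print(matches_factorials)
```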


WORD GAMES: EXAMPLE: My name is JIM. In how many ways can one rearrange the letters of my name? Answer 1: By brute force we can list all possibilities and see that there are six arrangements: JIM JMI MIJ MJI IMJ IJM Answer 2: We can use the multiplication principle. We have three slots to fill: ___ ___ ___ The first task is to fill the first slot with a letter. There are 3 ways to complete this task. The second task is to fill the second slot. There are 2 ways to complete this task. (Once the first slot is filled, there are only two choices of letters to use for the second slot.) The third task is to fill the third slot. There is only 1 way to complete this task (once slots one and two are filled).

By the multiplication principle, there are thus 3 × 2 × 1 = 3! ways to complete this task. □
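The brute-force list in Answer 1 can also be produced by machine. Here is a minimal Python sketch using the standard `itertools.permutations`; it lists the arrangements of JIM and compares the count with 3!.

```python
from itertools import permutations
from math import factorial

# Every ordering of the three distinct letters J, I, M.
arrangements = ["".join(p) for p in permutations("JIM")]

print(arrangements)
print(len(arrangements), factorial(3))
```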

EXERCISE: In how many ways can one arrange the letters HOUSE ?

EXERCISE: How many ways are there to rearrange the letters BOB? Assume the Bs are indistinguishable.

Comment: One can certainly answer this second exercise by brute force – just list the possibilities. But is there a sophisticated way to think about how to handle the repeated letter? Think about this before reading on.


PRACTICE EXERCISE: In how many ways can one arrange the letters HOUSES?

Certainly if the Ss were distinguishable – written, say, as S₁ and S₂ – then the problem is easy to answer: There are 6! ways to rearrange the letters HOUS₁ES₂. The list of arrangements might begin:

HOUS₁ES₂
HOUS₂ES₁
OHUS₁S₂E
OHUS₂S₁E
S₁S₂UEOH
S₂S₁UEOH
⋮

But notice, if the Ss are no longer distinguishable, then pairs in this list of answers “collapse” to give the same arrangement. We must alter our answer by a factor of two and so the number of arrangements of the word HOUSES is: $\frac{6!}{2} = 360$.

QUESTION: What is this “2” on the denominator? To properly understand it, work out the answer to this next problem:

EXERCISE: How many ways are there to rearrange the letters of the word CHEESE? Think about this before reading on.


Answer: If the three Es are distinct – written E₁, E₂, and E₃, say – then there are 6! ways to rearrange the letters CHE₁E₂SE₃. But the three Es can be rearranged 3! = 6 different ways within any one particular arrangement of letters. These six arrangements would be seen as the same if the Es were no longer distinct:

HE₁E₂SCE₃
HE₁E₃SCE₂
HE₂E₁SCE₃
HE₃E₁SCE₂
HE₃E₂SCE₁
HE₂E₃SCE₁

→ HEESCE

Thus we must divide our answer of 6! by 3! to account for the groupings of six that become identical. There are thus $\frac{6!}{3!} = 120$ ways to arrange the letters of CHEESE. □

Comment: The number of ways to rearrange the letters HOUSES is $\frac{6!}{2!}$. The “2” on the denominator is really 2!.

EXERCISE: Explain why the number of ways to arrange the letters of the word CHEESES is $\frac{7!}{3!\,2!}$.

EXERCISE: In how many ways can one arrange the letters CHEEEEESIEST? How about of CHEESIESTESSNESS?
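The “divide by the factorial of each repeat” rule can be tested against brute force on short words. The following Python sketch is illustrative only; the helper names `count_arrangements` and `brute_force_count` are our own.

```python
from collections import Counter
from itertools import permutations
from math import factorial


def count_arrangements(word: str) -> int:
    """Count distinct arrangements: (number of letters)! divided by (repeats)! for each letter."""
    total = factorial(len(word))
    for repeats in Counter(word).values():
        total //= factorial(repeats)
    return total


def brute_force_count(word: str) -> int:
    """List every ordering and discard duplicates (only practical for short words)."""
    return len(set(permutations(word)))


for word in ["HOUSES", "CHEESE", "CHEESES"]:
    print(word, count_arrangements(word), brute_force_count(word))
```

The brute-force count grows far too quickly to use on long words, which is precisely why the formula is worth having.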

Comment: Consider the word DOODLED. Its letters can be arranged in $\frac{7!}{2!\,3!}$ different ways, with 2! in the denominator arising from the fact that there are two Os, and the 3! from the three Ds. If we wished, we could also include in the denominator a 1! (which equals 1) for the fact that there is a single L in the word and another 1! for the single E. Thus the number of ways to arrange the letters DOODLED might be better written:

$\frac{7!}{2!\,3!\,1!\,1!}$

This has the advantage of offering a “self check”: the numbers in the denominator should sum to the number in the numerator. Let’s take this further … Each number in the denominator corresponds to the number of times a letter appears in the original word: O two times, D thrice, E once and L once. The letter P appears zero times so we could actually write:

$\frac{7!}{2!\,3!\,1!\,1!\,0!}$

Also, the letter J appears zero times as well, so perhaps we should write:

$\frac{7!}{2!\,3!\,1!\,1!\,0!\,0!}$

and so on. THIS IS ALL FINE AND CONSISTENT IF WE CHOOSE TO DEFINE 0! TO BE THE NUMBER 1. It is for this reason that mathematicians set 0! = 1. Even if one is being silly, the formulas still remain correct.


THE LABELING PRINCIPLE We can rephrase the letter-arranging problem. Again consider the word CHEESIEST. Rearranging these letters corresponds to assigning letters to nine slots:

1 slot is to be “labeled” C
1 slot is to be labeled H
3 slots are to be labeled E
2 slots are to be labeled S
1 slot is to be labeled I
1 slot is to be labeled T

We know the answer to the problem is: $\frac{9!}{1!\,1!\,3!\,2!\,1!\,1!}$.

This is the same problem as the following: Nine people are to be given hats. One is to be given a cranberry-red hat (C), one is to be given a hot-pink hat (H), three emerald-green hats (E), two sky-blue hats (S), one an indigo-blue hat (I), and one a teal hat (T). How many ways? We see that rearranging letters is equivalent to assigning labels to distinct objects (people or specific slots) and the answer to the problem is the fraction with numerator the number of objects, factorialised, and denominator given by the counts of objects with each label, factorialised. We have:


THE LABELING PRINCIPLE
Each of N distinct objects is to be given a label. If $k_1$ of them are to have label “1,” $k_2$ label “2,” and so on, all the way to $k_r$ of them label “r,” then the total number of ways to assign all labels is given by:

$\frac{N!}{k_1!\,k_2! \cdots k_r!}$
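As a sanity check on the labeling principle, the sketch below computes the formula and, for a small case, compares it with a direct enumeration of every possible assignment of labels. The function names `labelings` and `brute_force_labelings` are hypothetical, chosen only for this illustration.

```python
from itertools import product
from math import factorial


def labelings(*label_counts: int) -> int:
    """N! / (k1! k2! ... kr!), where N is the sum of the label counts."""
    total = factorial(sum(label_counts))
    for k in label_counts:
        total //= factorial(k)
    return total


def brute_force_labelings(*label_counts: int) -> int:
    """Try every assignment of labels to N objects; keep those with the required counts."""
    n = sum(label_counts)
    labels = range(len(label_counts))
    return sum(
        1
        for assignment in product(labels, repeat=n)
        if all(assignment.count(label) == k for label, k in zip(labels, label_counts))
    )


# Five objects: two labeled "1", two labeled "2", one labeled "3".
print(labelings(2, 2, 1), brute_force_labelings(2, 2, 1))

# The hat problem (equivalently, the letters of CHEESIEST).
print(labelings(1, 1, 3, 2, 1, 1))
```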

This is an extremely powerful result.

SOME EXAMPLES:

1. Four people from a group of ten are needed for a committee. In how many different ways can a committee be formed?
Answer: The ten folk are to be labeled as follows: 4 as “on the committee” and 6 as “off.” The answer must be $\frac{10!}{4!\,6!}$. □

2. Fifteen horses run a race. How many possibilities are there for first, second, and third place?
Answer: One horse will be labeled “first,” one will be labeled “second,” one “third,” and twelve will be labeled “losers.” The answer must be: $\frac{15!}{1!\,1!\,1!\,12!}$. □

3. A “feel good” running race has 20 participants. Three will be deemed “equal first place winners,” five will be deemed “equal second place winners,” and the rest will be deemed “equal third place winners.” How many different outcomes can occur?
Answer: Easy! $\frac{20!}{3!\,5!\,12!}$. □

4. From an office of 20 people, two committees are needed. The first committee shall have 7 members, one of whom shall be the chair and one the treasurer. The second committee shall have 8 members. This committee will have 3 co-chairs, 2 co-secretaries and 1 treasurer. In how many ways can this be done?


Answer: Keep track of the labels. Here they are:
1 person will be labeled “chair of first committee”
1 person will be labeled “treasurer of first committee”
5 people will be labeled “ordinary members of first committee”
3 people will be labeled “co-chairs of second committee”
2 people will be labeled “co-secretaries of second committee”
1 person will be labeled “treasurer of second committee”
2 people will be labeled “ordinary members of second committee”
5 people will be labeled “lucky”: they are on neither committee.
(Self check: 1 + 1 + 5 + 3 + 2 + 1 + 2 + 5 = 20.) The total number of possibilities is thus:

$\frac{20!}{1!\,1!\,5!\,3!\,2!\,1!\,2!\,5!}$. Easy!

□

COMMENT: Students are usually taught to distinguish between a “permutation” and a “combination.” They differ by whether or not the order of terms is important. This is unnecessarily confusing – and somewhat artificial. People would call the first example a combination. They would call the second example a permutation. There are no names for examples 3 and 4!
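For readers who do know the traditional vocabulary, the sketch below checks that the labeling answers to examples 1 and 2 agree with the standard “combination” and “permutation” counts. It uses Python's built-in `math.comb` and `math.perm` (available from Python 3.8); the variable names are ours.

```python
from math import comb, factorial, perm

# Example 1: 4 people labeled "on the committee", 6 labeled "off".
committee_by_labeling = factorial(10) // (factorial(4) * factorial(6))

# Example 2: one horse each labeled "first", "second", "third"; 12 labeled "losers".
podium_by_labeling = factorial(15) // (
    factorial(1) * factorial(1) * factorial(1) * factorial(12)
)

print(committee_by_labeling, comb(10, 4))   # labeling count vs. "combination" count
print(podium_by_labeling, perm(15, 3))      # labeling count vs. "permutation" count
```

Both traditional formulas are special cases of the one labeling formula, which is the point of the comment above.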

COMMENT: The formula $\frac{N!}{k_1!\,k_2! \cdots k_r!}$ with $k_1 + k_2 + \cdots + k_r = N$ is called a generalized combinatorial coefficient. It is denoted:

$\binom{N}{k_1 \; k_2 \; \cdots \; k_r}$

For example, $\binom{6}{2 \; 3 \; 1} = \frac{6!}{2!\,3!\,1!} = 60$.


EXAMPLE: In how many different ways can one arrange seven As and nine Bs?
Answer: We have sixteen “slots,” seven of which are to be labeled “A” and nine to be labeled “B.” This gives $\frac{16!}{7!\,9!}$ possible arrangements.

EXAMPLE: Ten circles are drawn in a row. In how many different ways can we color two of them black and leave the rest white?
Answer: Two circles are to be “labeled” black and eight as white. There are $\frac{10!}{2!\,8!} = \frac{10 \times 9}{2} = 45$ possibilities.

CHALLENGE: How many solutions are there to the equation 8 = a + b + c if each of a , b and c is a positive integer or zero? HINT:


IF YOU REALLY ARE WORRIED ABOUT ORDER …

1. “SELECTION WITHOUT ORDER” IS JUST LABELING
An example will explain:

Suppose 5 people are to be chosen from 12 and the order in which folk are chosen is not important. In how many ways can this be done?
Answer: 5 people will be labeled “chosen” and 7 “not chosen.” There are $\frac{12!}{5!\,7!}$ ways to accomplish this task. □

2. “SELECTION WITH ORDER” IS JUST LABELING
An example will again explain:

Suppose 5 people are to be chosen from 12 for a team and the order in which they are chosen is considered important. In how many ways can this be done?
Answer: We have:
1 person labeled “first”
1 person labeled “second”
1 person labeled “third”
1 person labeled “fourth”
1 person labeled “fifth”
7 people labeled “not chosen”
This can be done in $\frac{12!}{1!\,1!\,1!\,1!\,1!\,7!}$ ways.

□

Again … there is no need to fuss about order. Just come up with the labeling scheme that is appropriate for the problem.

EXERCISE: Coming full circle … Explain, using the labeling principle, why the number of ways to arrange six people in a line is 6! (which is really $\frac{6!}{1!\,1!\,1!\,1!\,1!\,1!}$).


MULTI-STAGE LABELING Although the labeling principle helps remove the confusion of order vs. non-order, many standard “arrangement” problems still possess a level of complication that is delicate. For example, consider the following typical standardized test problem:

In how many ways can one arrange the letters of the word ORANGE if the first and last letters must each be a vowel?

This is not a straightforward labeling problem as some objects are given preferred status over others: the vowels require a different type of consideration from the consonants. It is really a two-stage challenge:
STAGE 1: Contend with the vowels
STAGE 2: Contend with the remaining letters
Each of these stages can be handled separately. The Multiplication Principle tells us to then multiply the results.

Solution:
STAGE 1: One vowel shall be labeled “first position,” one “last position” and one shall be labeled “placed with the consonants.” There are $\frac{3!}{1!\,1!\,1!} = 6$ ways to complete stage 1.
STAGE 2: We now have four “consonants,” R, N, G, and the remaining vowel, to label as second, third, fourth and fifth. There are $\frac{4!}{1!\,1!\,1!\,1!} = 24$ ways to accomplish stage 2.
Thus there are 6 × 24 = 144 desired arrangements of ORANGE.

□
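The two-stage count for ORANGE is small enough to confirm by listing all 720 orderings and keeping those that begin and end with a vowel. A minimal Python sketch (the variable names are ours):

```python
from itertools import permutations

VOWELS = set("AEIOU")

# Keep only the orderings whose first and last letters are both vowels.
valid = [
    "".join(p)
    for p in permutations("ORANGE")
    if p[0] in VOWELS and p[-1] in VOWELS
]

print(len(valid))    # compare with 6 x 24
print(valid[:5])     # a few sample arrangements
```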

COMMENT: Many might prefer to present the answer as a six-stage process:

Do you see what is meant by this diagram?

EXAMPLE: A company would like to send out a team of five plumbers to a construction site. They will send two expert plumbers and three trainee plumbers. If there are a total of 10 expert plumbers available and 8 trainees, how many different teams are possible?
Answer: This too is a two-stage process:
STAGE 1: Select the experts
There are $\frac{10!}{2!\,8!}$ possible ways to label two expert plumbers as “chosen” and the rest “not chosen.”
STAGE 2: Select the trainees
There are $\frac{8!}{3!\,5!}$ possible ways to label three trainees as “chosen” and the rest “not chosen.”
By the multiplication principle, there are thus $\frac{10!}{2!\,8!} \times \frac{8!}{3!\,5!}$ possible teams.

□
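To see the two stages of the plumber problem concretely, the sketch below builds every possible team and compares the count with the product of the two labeling counts. The plumber names are placeholders invented for the illustration.

```python
from itertools import combinations
from math import comb

experts = [f"expert{i}" for i in range(1, 11)]    # 10 hypothetical expert plumbers
trainees = [f"trainee{i}" for i in range(1, 9)]   # 8 hypothetical trainees

# Stage 1 picks two experts; stage 2 independently picks three trainees.
teams = [
    (expert_pair, trainee_trio)
    for expert_pair in combinations(experts, 2)
    for trainee_trio in combinations(trainees, 3)
]

print(len(teams))
print(comb(10, 2) * comb(8, 3))   # 10!/(2!8!) x 8!/(3!5!)
```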

COMMENT: Notice that we have no control over who is labeled “expert” and who is labeled “trainee.” We only have control over the labels “chosen” and “not chosen.” That there are some fixed, previously assigned labels is a hint that this must be dealt with as a multi-stage problem.

EXAMPLE: In how many ways can one arrange the letters ABCDE so that A is never at the beginning or the end?

We’ll give three answers to this problem, even though most people would prefer to answer the question using just the first approach we present. (We offer two more approaches just to illustrate that there are multiple ways to approach these problems.)

Answer 1: Think of this as a five-stage process! Deal with the first letter, deal with the last letter, deal with the second letter, deal with the third letter, and deal with the fourth letter. There are 4 choices for the first letter (anything but A), then 3 for the last letter, and then 3, 2 and 1 choices for the second, third and fourth letters. By the multiplication principle, we multiply the results: 4 × 3 × 3 × 2 × 1 = 72.


Answer 2: The letters B, C, D, E have a different status than A.
STAGE 1: Place the letter A
There are 3 possible locations for this letter.
STAGE 2: Place the remaining letters
There are four remaining positions for four letters. They can be placed in these positions in $\frac{4!}{1!\,1!\,1!\,1!} = 24$ ways.
Thus there are 3 × 24 = 72 desired arrangements.

Answer 3: There are five slots in which to place letters with the two end slots having a different status than the middle three.
STAGE 1: Fill the end slots
There are four letters to work with, yielding $\frac{4!}{1!\,1!\,2!} = 12$ possibilities. (The labels here are “first slot,” “last slot” and “not used.”)
STAGE 2: Fill the middle three slots
There are three letters to work with, yielding $\frac{3!}{1!\,1!\,1!} = 6$ possibilities.
By the multiplication principle we have 12 × 6 = 72 permissible arrangements. □
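All of the answers to the ABCDE problem can be checked against a direct enumeration. The Python sketch below simply lists the 120 orderings and discards those with A at either end.

```python
from itertools import permutations

# Orderings of ABCDE in which A is neither first nor last.
allowed = [
    "".join(p)
    for p in permutations("ABCDE")
    if p[0] != "A" and p[-1] != "A"
]

print(len(allowed))
print(3 * 24, 12 * 6)   # the stage-by-stage products from Answers 2 and 3
```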

EXAMPLE: Six people, Albert, Bilbert, Cuthbert, Dilbert, Egbert and Filbert, are to sit in a circle. How many different arrangements are possible if rotations of the same arrangement are considered equivalent?

Answer: This question is tricky in that there are no clear “labels” associated with the question: there is no clear “first” seat or “second” seat, and so forth. We can think of it as a multi-stage process nonetheless by having the men take a seat one at a time: Albert must sit somewhere. He can sit anywhere (since all rotations are deemed equivalent) and there is thus only 1 action for him to take. Bilbert now has 5 options: take the seat one place to Albert’s left, two places to his left, and so on. Cuthbert has 4 options. Dilbert has 3. Egbert has 2. Filbert has 1. Thus by the multiplication principle, there are 1 × 5 × 4 × 3 × 2 × 1 = 120 possible arrangements.

We can be a little slicker and think of this as a two-stage process:
STAGE 1: Albert takes a seat
There is only 1 option: Albert takes any seat.
STAGE 2: The remaining five each take a seat
This is a labeling problem as Albert’s position now defines five labels: one place to his left, two places to his left, and so on. There are thus $\frac{5!}{1!\,1!\,1!\,1!\,1!} = 120$ possibilities.
By the multiplication principle there are 1 × 120 = 120 possible configurations. □
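One way to test the circular count is to list all 6! line-ups, rotate each so that Albert comes first, and count how many distinct results remain. This is a sketch under our own conventions: the men are represented by their initials and the helper `rotate_albert_first` is hypothetical.

```python
from itertools import permutations

people = "ABCDEF"   # initials of Albert, Bilbert, Cuthbert, Dilbert, Egbert, Filbert


def rotate_albert_first(seating: tuple) -> tuple:
    """Rotate a circular seating so that Albert ('A') occupies the first position."""
    start = seating.index("A")
    return seating[start:] + seating[:start]


# Seatings that differ only by a rotation collapse to the same canonical form.
distinct_circles = {rotate_albert_first(p) for p in permutations(people)}

print(len(distinct_circles))
```

Each circular arrangement arises from exactly six line-ups (one per rotation), so the count is 720 ÷ 6.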


FUN WITH POKER HANDS
One plays poker with a deck of 52 cards, which come in 4 suits (hearts, clubs, spades, diamonds) with 13 values per suit (A, 2, 3, …, 10, J, Q, K). In poker one is dealt five cards and certain combinations of cards are deemed valuable. For example, a “four of a kind” consists of four cards of the same value and a fifth card of arbitrary value. A “full house” is a set of three cards of one value and two cards of a second value. A “flush” is a set of five cards of the same suit. The order in which one holds the cards in one’s hand is immaterial.

EXAMPLE: How many flushes are possible in poker?
Answer: Again this is a multi-stage problem with each stage being its own separate labeling problem. One way to help tease apart stages is to imagine that you’ve been given the task of writing a computer program to create poker hands. How will you instruct the computer to create a flush?

First of all, there are four suits – hearts, spades, clubs and diamonds – and we need to choose one to use for our flush. That is, we need to label one suit as “used” and three suits as “not used.” There are $\frac{4!}{1!\,3!} = 4$ ways to do this.

Second stage: Now that we have a suit, we need to choose five cards from the 13 cards of that suit to use for our hand. Again, this is a labeling problem: label five cards as “used” and eight cards as “not used.” There are $\frac{13!}{5!\,8!} = 1287$ ways to do this.

By the multiplication principle there are 4 × 1287 = 5148 ways to complete both stages. That is, there are 5148 possible flushes. □

Comment: There are $\frac{52!}{5!\,47!} = 2598960$ five-card hands in total in poker. (Why?) The chances of being dealt a flush are thus: $\frac{5148}{2598960} \approx 0.20\%$.
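The arithmetic for the flush count is easy to reproduce. This short Python sketch uses the built-in `math.comb` for the two-label counts; the variable names are ours.

```python
from math import comb

suit_choices = comb(4, 1)       # 4!/(1!3!): label one suit "used"
card_choices = comb(13, 5)      # 13!/(5!8!): label five cards of that suit "used"
flushes = suit_choices * card_choices

total_hands = comb(52, 5)       # 52!/(5!47!): label five cards of the deck "dealt"

print(flushes, total_hands)
print(f"{flushes / total_hands:.2%}")
```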


EXAMPLE: How many full houses are possible in poker?
Answer: This problem is really a three-stage labeling issue. First we must select which of the thirteen card values – A, 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K – is going to be used for the triple, which will be used for the double, and which 11 values are going to be ignored. There are $\frac{13!}{1!\,1!\,11!} = 13 \times 12 = 156$ ways to accomplish this task. Among the four cards of the value selected for the triple, three will be used for the triple and one will be ignored. There are $\frac{4!}{3!\,1!} = 4$ ways to accomplish this task. Among the four cards of the value selected for the double, two will be used and two will be ignored. There are $\frac{4!}{2!\,2!} = 6$ ways to accomplish this. By the multiplication principle, there are 156 × 4 × 6 = 3744 possible full houses. □

COMMENT: High-school teacher Sam Miskin recently used this labeling method to count poker hands with his high-school students. To count how many “one pair hands” there are (that is, hands with one pair of cards of the same numerical value and three remaining cards each of different value) he found it instructive to bring 13 students to the front of the room and hand each student the four cards of one value from a single deck of cards. He then asked the remaining students to select which of the thirteen students should be the “pair” and which three should be the “singles.” He had the remaining nine students return to their seats. He then asked the “pair” student to raise his four cards in the air and asked the seated students to select which two of the four should be used for the pair. He then asked each of the three “single” students in turn to hold up their cards while the seated students selected one of the four cards to make a singleton.


This process made the multi-stage procedure clear to all and the count of possible one pair hands, namely,

$\frac{13!}{1!\,3!\,9!} \times \frac{4!}{2!\,2!} \times 4 \times 4 \times 4$

readily apparent.

EXERCISE: “Two pair” consists of two cards of one value, two cards of a different value, and a fifth card of a third value. What are the chances of being dealt two-pair in poker?
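The full-house and one-pair counts described above can be evaluated stage by stage. The sketch below assumes a small helper, here called `labelings`, that computes N!/(k₁!k₂!⋯kᵣ!); the helper and variable names are our own.

```python
from math import comb, factorial


def labelings(*label_counts: int) -> int:
    """N! / (k1! k2! ... kr!), where N is the sum of the label counts."""
    total = factorial(sum(label_counts))
    for k in label_counts:
        total //= factorial(k)
    return total


# Full house: choose triple value and double value, then 3 of 4 cards, then 2 of 4 cards.
full_houses = labelings(1, 1, 11) * labelings(3, 1) * labelings(2, 2)

# One pair: choose pair value and three single values, 2 of 4 cards for the pair,
# then one of four cards for each of the three singles.
one_pair_hands = labelings(1, 3, 9) * labelings(2, 2) * 4 * 4 * 4

total_hands = comb(52, 5)
print(full_houses, f"{full_houses / total_hands:.3%}")
print(one_pair_hands, f"{one_pair_hands / total_hands:.1%}")
```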

EXAMPLE: A “straight” consists of five cards with values forming a string of five consecutive values (with no “wrap around”). For example, 45678, A2345 and 10JQKA are considered straights, but KQA23 is not. (Suits are immaterial for straights.) How many different straights are there in poker?
Answer: A straight can begin with A, 2, 3, 4, 5, 6, 7, 8, 9 or 10. We must first select which of these values is to be the start of our straight. There are 10 choices. For the starting value we must select which of the four suits it will be. There are 4 choices. There are also 4 choices for the suit of the second card in the straight, 4 for the third, 4 for the fourth, and 4 for the fifth. By the multiplication principle, the total number of straights is: 10 × 4 × 4 × 4 × 4 × 4 = 10240.

The chances of being dealt a straight are about 0.39%.

□
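The straight count can be confirmed by actually building the hands: pick a starting value, then a suit for each of the five cards. In this sketch the single-letter suit codes and the list `run` (the thirteen values with the ace repeated at the top) are our own conventions. As in the count above, straights whose five cards share a suit are included.

```python
from itertools import product
from math import comb

VALUES = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["H", "C", "S", "D"]

run = VALUES + ["A"]   # the ace may sit below the 2 or above the king

straights = set()
for start in range(10):                        # starting values A, 2, ..., 10
    values = run[start:start + 5]              # five consecutive values
    for suits in product(SUITS, repeat=5):     # a suit for each of the five cards
        straights.add(frozenset(zip(values, suits)))

print(len(straights))
print(f"{len(straights) / comb(52, 5):.2%}")
```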


Another popular gambling game …

EXERCISE: KENO
In a KENO game ten numbers are selected at random from the numbers 1 through 80. Players of the game submit tickets beforehand selecting anywhere from 1 to 10 numbers. They win prizes according to the number of matches they receive.
a) Poindexter plays a “10-spot game,” meaning that he selects 10 numbers on his ticket. What are his chances of obtaining 10 matches out of 10? What are his chances of receiving 9 matches out of 10?
b) Bilbert pays $1 to play a “4-spot” game, meaning that he selects 4 numbers on his ticket. The payouts for the 4-spot game are as follows:
Match all four: $50
Match three out of four: $5
Match two out of four: $1
What is the expected value of this 4-spot game?


A GRID OF NUMBERS Here’s a famous puzzle:

Starting at the top-left cell marked S and taking horizontal steps one place only to the right or vertical steps downwards only, how many different paths are there to the location marked E?

Play with this puzzle for a while before reading on. As you play, perhaps contemplate the following questions:
1. Given the location of the point E, is the grid shown in the diagram unnecessarily large?
2. Marking in different paths from S to E is awfully complicated. One could first count paths to other cells, ones easier to handle, and look for patterns. For example, how many distinct paths are there from S to any cell on the top row? Write the answers in those cells. How many distinct paths take you to any cell in the leftmost column? To cells in the second row? Second column? Third row?
3. If you are willing to trust patterns, can you make a good guess as to the answer to the original puzzle?


There are two ways to approach this puzzle.

APPROACH NUMBER 1: FORMULAS

Every path from S to E can be described by a sequence of letters R and D. For example, the path given in the diagram can be described by the sequence:

RDRRDDRRRDRR

This sequence contains eight Rs and four Ds. Moreover, any sequence of eight Rs and four Ds corresponds to a path from S to E.

Exercise: Mark in on the diagram the paths given by DDRRRRRDRRRD and RRRRRRRRDDDD.

Thus the number of paths from S to E matches the number of ways to arrange twelve letters: eight Rs and four Ds. (That is, to label twelve slots with eight Rs and four Ds.) The answer to the original puzzle is:

$$\frac{12!}{8!\,4!} = 495 \text{ paths.}$$

Exercise: How many paths are there from S to the bottom-right cell of the grid?

Exercise: Suppose the cell E is a steps to the right of S and b steps down from S. Show that the number of paths from S to E is given by:

$$\binom{a+b}{a\ \ b} = \frac{(a+b)!}{a!\,b!}$$

In fact, number the rows 0, 1, 2, … (with the top row being the zero-th row) and number the columns 0, 1, 2, … (with the leftmost column being the zero-th column). The cell E in the original diagram thus has position row 4, column 8, and the number of paths to it is $\frac{12!}{4!\,8!}$. (Paths to this cell involve 8 Rs and 4 Ds.)


In general, numbering rows and columns this way, the cell in row a and column b requires b Rs and a Ds to get to it and so the number of paths to it is:

$$\frac{(a+b)!}{a!\,b!}$$

Exercise: Is this formula still correct for the cells in the zero-th row? In the zero-th column? (Good thing we set 0! = 1.) What value should we place in the cell labeled S (row zero, column zero)? How many paths should we say start at S and end at S?

APPROACH 2: PATTERNS

If you fill in the answers for the number of paths to each cell, the following grid of numbers appears:

(The exercise above suggests that the position labeled S should also be assigned the number 1.)

Exercise: Explain why the table is symmetrical about the southeast diagonal line.

We have the formula $\frac{(a+b)!}{a!\,b!}$ for the entry in the a-th row and b-th column (starting the counts at zero).

Have you noticed that each entry in an interior cell is the sum of two numbers: the number just above the cell and the number just to the left of the cell? This makes sense in terms of counting paths. Consider the circled cell. To reach this cell one can either first reach the cell just above (there are 15 ways to do this) and then step down, or reach the cell just to its left (there are 20 ways to accomplish this) and then step right. This gives a total of 15 + 20 = 35 paths to the circled cell.

Exercise: Use this observation to fill in the remainder of the table.
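The "above plus left" observation can be turned into a short computation. The sketch below fills a grid using only that rule and compares every entry with the factorial formula. The function name `path_grid` and the 5 × 9 grid size are assumptions made for illustration (the original diagram is not reproduced here); the size is just large enough to contain row 4, column 8.

```python
from math import comb

def path_grid(rows, cols):
    """Count right/down paths from the top-left cell S to every cell."""
    # Every cell in the top row and the leftmost column has exactly one path.
    grid = [[1] * cols for _ in range(rows)]
    for a in range(1, rows):
        for b in range(1, cols):
            # Paths arriving from the cell above plus paths arriving from the left.
            grid[a][b] = grid[a - 1][b] + grid[a][b - 1]
    return grid

grid = path_grid(5, 9)  # rows 0..4, columns 0..8 (assumed size)
for row in grid:
    print(row)

# Each entry should agree with (a + b)! / (a! b!).
matches_formula = all(
    grid[a][b] == comb(a + b, a) for a in range(5) for b in range(9)
)
print(matches_formula, grid[4][8])
```

Row 3, column 4 of this grid has 15 above it and 20 to its left, matching the numbers quoted for the circled cell.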

This grid of numbers possesses a number of curious properties. For example, start at any “1” on the top row, head down any number of cells, and then turn right for one cell to create a stocking. The number in the toe of the stocking always equals the sum of the numbers in the leg of the stocking.

There is a horizontal version of this stocking property also. Exercise: Explain why the stocking property works. HINT: Use the fact that the number in the toe is really the sum of two other numbers in the grid.
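A numerical check is no substitute for the explanation the exercise asks for, but it can build confidence that the stocking property holds. The sketch below uses the formula for the entry in row a, column b; the helper names `entry` and `stocking_holds` are hypothetical.

```python
from math import comb

def entry(a, b):
    """Entry of the grid in row a, column b (counts start at zero)."""
    return comb(a + b, a)

def stocking_holds(column, length):
    """Check one stocking: a leg of `length` cells heading down `column`
    from the top row, with the toe one cell to the right of the last leg cell."""
    leg = sum(entry(a, column) for a in range(length))
    toe = entry(length - 1, column + 1)
    return leg == toe

# Try every stocking with a leg of up to 7 cells in columns 0..5.
print(all(stocking_holds(c, n) for c in range(6) for n in range(1, 8)))
```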


The grid of numbers appearing in the cells is just the grid of numbers made famous by French mathematician Blaise Pascal (1623-1662) for his work in probability theory.

Each row of this triangle is a diagonal of the grid. Regard the top row of the triangle (the single “1”) as the zero-th row. (Then the sixth row of the triangle, for example, is 1 6 15 20 15 6 1.) In any row of the triangle call the left-most entry the zero-th entry. (Thus in the sixth row, “1” is the zero-th entry, “6” is the first entry, “15” is the second entry, and so on.) Then the formula for the entry in the n-th row k places in from the left (and n − k places from the right) is:

$$\frac{n!}{k!\,(n-k)!}$$

COMMENT: Make sure you understand that this is correct. Consider an entry in the a-th row and b-th column of the grid of numbers. Then the formula for that entry is $\frac{(a+b)!}{a!\,b!}$. But “a + b” is the number of the diagonal on which that cell belongs, “a” is the number of places in from one end of the diagonal, and “b” the number of places in from the other end of the diagonal.

Exercise:
a) Draw a grid of squares and mark the cell in the 3-rd row and 4-th column. Verify that it is indeed on the 7-th diagonal of the grid, 3 and 4 places in from each end.
b) Consider the entry on the 5-th row of Pascal’s triangle, 2 places in from the left. Find the corresponding cell in your grid of squares. What are “a” and “b” for this cell and what does “5” correspond to?

For the grid of numbers we saw that every entry was the sum of the entry just above it and just to the left of it. In Pascal’s triangle this translates to: EACH INTERIOR ENTRY IS THE SUM OF THE TWO ENTRIES ABOVE IT


Exercise:
a) Without doing the computation, explain why the sum of entries in the bottom row shown above will turn out to be double the sum of the entries in the row just above it. HINT: Each entry in the bottom row is the sum of two entries in the row above it.
b) Explain why the sum of entries in the n-th row is sure to be $2^n$. (Remember to call the top row of Pascal’s triangle, the one with a single “1”, the zero-th row.)

Exercise: Explain why each alternating sum in Pascal’s triangle, beyond the zero-th row, is zero:

1 − 1 = 0
1 − 2 + 1 = 0
1 − 3 + 3 − 1 = 0
1 − 4 + 6 − 4 + 1 = 0
1 − 5 + 10 − 10 + 5 − 1 = 0
⋮

The following property is strange. Look at the powers of 11:

$11^0 = 1$
$11^1 = 11$
$11^2 = 121$
$11^3 = 1331$
$11^4 = 14641$
$11^5 = 161051 = 1 \mid 5 \mid 10 \mid 10 \mid 5 \mid 1$

Any guesses as to why these powers appear as rows of Pascal’s triangle?


THE BINOMIAL THEOREM Recall the act of expanding brackets: “Select one term from each set of parentheses and make sure to collect all possible combinations” e.g.

(a + b + c)( x + y )( p + q + r )( s + t + u + v) = axps + ayqu + cxps + ⋯

Imagine expanding the quantity:

$$(x+y)^5 = (x+y)(x+y)(x+y)(x+y)(x+y)$$

The term $x^5$ will appear once by choosing the term “x” from each set of parentheses. The term $x^4 y$ will appear five times:

once by choosing x, x, x, x and then y;
once by choosing x, x, x, y and then x;
once by choosing x, x, y, x and then x;
once by choosing x, y, x, x and then x;
once by choosing y, x, x, x and then x.

That is, $x^4 y$ will appear the same number of times as it is possible to arrange four xs and one y. This can be done in $\frac{5!}{4!\,1!} = 5$ ways, which is also the number of paths in the square grid from S to row 4, column 1. This is an entry of Pascal’s triangle.

The term $x^3 y^2$ will appear as many times as it is possible to arrange three xs and two ys, that is, $\frac{5!}{3!\,2!} = 10$ times. The term $x^2 y^3$ appears ten times, the term $x y^4$ five times, and the term $y^5$ once. We have:

$$(x+y)^5 = x^5 + 5x^4y + 10x^3y^2 + 10x^2y^3 + 5xy^4 + y^5.$$

The numbers 1, 5, 10, 10, 5, 1 are the entries of the fifth row of Pascal’s triangle.


We have:

$(x+y)^0 = 1$
$(x+y)^1 = x + y$
$(x+y)^2 = x^2 + 2xy + y^2$
$(x+y)^3 = x^3 + 3x^2y + 3xy^2 + y^3$
$(x+y)^4 = x^4 + 4x^3y + 6x^2y^2 + 4xy^3 + y^4$

and so on.

SOME FUN …

1. Put x = 10 and y = 1. Notice that, for instance:

$$11^4 = (10+1)^4 = 10^4 + 4\cdot 10^3 + 6\cdot 10^2 + 4\cdot 10 + 1 = 10000 + 4000 + 600 + 40 + 1 = 14641.$$

This explains the connection with the powers of 11.

2. Put x = 1 and y = 1. Notice that, for instance:

$$2^4 = (1+1)^4 = 1^4 + 4\cdot 1^3 + 6\cdot 1^2 + 4\cdot 1 + 1 = 1 + 4 + 6 + 4 + 1.$$

This explains, again, why the sum of the entries in a row of Pascal’s triangle is a power of two.

3. Put x = 1 and y = −1. Notice that, for instance:

$$0 = (1-1)^4 = 1^4 + 4\cdot 1^3\cdot(-1) + 6\cdot 1^2\cdot(-1)^2 + 4\cdot 1\cdot(-1)^3 + 1\cdot(-1)^4 = 1 - 4 + 6 - 4 + 1.$$

This explains – again – why the alternating sum of entries in a row of Pascal’s triangle is always zero.
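The three substitutions can be checked for several rows at once. In this sketch the helper names `pascal_row` and `binomial_expand` are hypothetical; the second one simply adds up the terms of the expansion using the entries of the n-th row as coefficients.

```python
from math import comb

def pascal_row(n):
    """Entries of the n-th row of Pascal's triangle (top row is row zero)."""
    return [comb(n, k) for k in range(n + 1)]

def binomial_expand(x, y, n):
    """Sum the terms (row entry) * x^(n-k) * y^k of the expansion of (x + y)^n."""
    return sum(c * x ** (n - k) * y ** k for k, c in enumerate(pascal_row(n)))

for n in range(1, 8):
    assert binomial_expand(10, 1, n) == 11 ** n   # powers of 11
    assert binomial_expand(1, 1, n) == 2 ** n     # row sums
    assert binomial_expand(1, -1, n) == 0         # alternating sums

print(pascal_row(5), binomial_expand(10, 1, 5))
```

For n = 5 the digits of 161051 no longer match the row 1, 5, 10, 10, 5, 1 one for one, because the two-digit entries carry into their neighbours when the terms are added.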


Stated formally, we have …

Binomial Theorem:

$$(x+y)^n = x^n + \frac{n!}{(n-1)!\,1!}\,x^{n-1}y + \frac{n!}{(n-2)!\,2!}\,x^{n-2}y^2 + \cdots + \frac{n!}{a!\,b!}\,x^a y^b + \cdots + y^n$$

The coefficients are the entries of the n-th row of Pascal’s triangle.

COMMENT: We have used the notation $\binom{n}{a\ \ b}$ for the expression $\frac{n!}{a!\,b!}$. Thus the binomial theorem can be written:

$$(x+y)^n = \binom{n}{n\ \ 0}x^n + \binom{n}{n-1\ \ 1}x^{n-1}y + \binom{n}{n-2\ \ 2}x^{n-2}y^2 + \cdots + \binom{n}{a\ \ b}x^a y^b + \cdots + \binom{n}{0\ \ n}y^n$$

Often mathematicians suppress one of the terms in the notation and write just $\binom{n}{a}$ for $\binom{n}{a\ \ b}$. (We must have b = n − a.) For example, $\binom{7}{5} = \binom{7}{5\ \ 2} = \frac{7!}{5!\,2!}$. Thus the binomial theorem might be written:

$$(x+y)^n = \binom{n}{n}x^n + \binom{n}{n-1}x^{n-1}y + \binom{n}{n-2}x^{n-2}y^2 + \cdots + \binom{n}{a}x^a y^b + \cdots + \binom{n}{0}y^n$$

COMMENT: The entries of Pascal’s triangle, $\binom{n}{a}$, are also called “binomial coefficients.”


PART II PROBLEMS

Question 42: How many different paths are there from A to G?

Question 43: The word BOOKKEEPING is the only word in the English language with three consecutive double letters. In how many ways can one arrange the letters of this word?

Question 44: In how many ways can you arrange the letters of your full name?

Question 45: Evaluate the following expressions:
a) $\frac{800!}{799!}$
b) $\frac{15!}{13!\,2!\,0!}$
c) $\frac{87!}{89!}$

Simplify the following expressions as far as possible:
d) $\frac{N!}{N!}$
e) $\frac{N!}{(N-1)!}$
f) $\frac{n!}{(n-2)!}$
g) $\frac{1}{k+1}\cdot\frac{(k+2)!}{k!}$
h) $\frac{n!\,(n-2)!}{\left((n-1)!\right)^2}$


Question 46:
a) Suppose a and b are positive integers with a + b = n. Show that:
$$\binom{n}{a\ \ b} = \binom{n-1}{a-1\ \ b} + \binom{n-1}{a\ \ b-1}$$
b) Suppose a, b and c are three positive integers with a + b + c = n. Show that:
$$\binom{n}{a\ \ b\ \ c} = \binom{n-1}{a-1\ \ b\ \ c} + \binom{n-1}{a\ \ b-1\ \ c} + \binom{n-1}{a\ \ b\ \ c-1}$$

Question 47:
a) A mathematics department has 10 members. Four people are to be selected for a committee. In how many different ways can this be done?
b) An English department has 10 members. Four people are needed for a committee and in that committee one person needs to be the chair. In how many different ways can the department form a committee of four with one chair?
c) A Medieval Tibetan Poetry department has 10 members. Four people are needed for a committee which has two co-chairs. In how many different ways can one form a committee of four with two co-chairs?
d) A Dramatic Arts of Left-Handed Mimists Department has 10 members. Two committees are to be formed: one with four members with two co-chairs, and one with three members and a single chair. In how many different ways can this be done? Assume no person is on both committees.

Question 48: Twelve horses run in a race.
a) A ribbon will be presented to each of first, second, third, and fourth place. In how many possible ways can this be done?
b) Make a comment as to why there is absolutely no need to mention or even think of “permutations” when answering questions like these.

Question 49:
a) In how many different ways can one arrange five As and five Bs?
b) A coin is tossed 10 times. In how many different ways could exactly five heads appear?


Question 50: In how many ways can 10 people sit on a bench if only four seats are available?

Question 51: Five pink marbles, two red marbles, and three rose marbles are to be arranged in a row. If marbles of the same colour are identical, in how many different ways can these marbles be arranged?

Question 52:
a) Twelve white dots lie in a row. Two are to be coloured red. In how many ways can this be done?
b) Consider the equation 10 = x + y + z. How many solutions does it have if each variable is to be a positive integer or zero?

Question 53:
a) In how many ways can the letters ABCDEFGH be arranged?
b) In how many ways can the letters ABCDEFGH be arranged with letter G appearing somewhere to the left of letter D?
c) In how many ways can the letters ABCDEFGH be arranged with the letters F and H not adjacent?

Question 54:
a) Hats are to be distributed to 20 people at a party. Five hats are red, five hats are blue, and 10 hats are purple. In how many different ways can this be done? (Assume the people are mingling and moving about.)
b) If the 20 people are clones and cannot be distinguished, in how many essentially different ways can these hats be distributed?

Question 55: Let’s establish the formulas from the textbooks …
a) Suppose r objects are to be selected from a collection of n objects with the order in which they are selected considered important. Use the labeling principle to show that this can be done in ${}_nP_r = \frac{n!}{(n-r)!}$ different ways.
b) Suppose r objects are to be selected from a collection of n objects without regard to order. Use the labeling principle to show that this can be done in ${}_nC_r = \frac{n!}{r!\,(n-r)!}$ different ways.
NOW FORGET THESE FORMULAS. You don’t ever need them!


Question 56: a) In poker “three of a kind” is a set of three cards of the same value with neither of the two remaining cards that value (or of value equal to each other). What is the probability of being dealt three-of-a-kind? b) What is the probability of being dealt “four of a kind” in poker?

Question 57: Consider the question …

In how many different ways can 8 people sit around a round table?

This is a vague question. What does “different” mean?
a) Answer the question if the chairs of the table are marked North, Northeast, East, Southeast, South, …, Northwest.
b) Answer the question if the chairs are not marked so that two different rotations of the same arrangement of people would be considered the same.
c) Answer the question under the assumption that rotations are considered the same and reflections about a diameter of the table are considered the same.
Suppose two particular people must not sit next to one another. Answer each of the questions a), b) and c) with this added restriction. (HINT: First count the number of arrangements with that couple seated together.)

Question 58: EXTREMELY HARD An ice-cream stand offers the “mega-bowl special”: twelve scoops in a bowl from a choice of twelve possible flavors. How many different mega-bowl combinations does it offer?
COMMENT: The problem here is that scoops, like the clones of question 54b), are indistinguishable AND you are not told how many scoops there are to be of a particular label (flavor). Problems like these are hard and fall under the category of what is called “multi-choosing.”


Question 59: A committee of five must be formed from five men and seven women.
a) How many committees can be formed if gender is irrelevant?
b) How many committees can be formed if there must be exactly two women on the committee?
c) How many committees can be formed if one particular man must be on the committee and one particular woman must not be on the committee?
d) How many committees can be formed if one particular couple (one man and one woman) can’t be together on the committee?

Question 60:
a) From 10 people k are needed for a committee. Write down a formula for the number of ways this can be done.
b) Suppose we want our formula to hold NO MATTER WHAT. Put k = 11 into your formula. What value should (−1)! have so that your formula is correct for the number of ways to select 11 people from 10 for a committee?

Question 61:
a) Prove that the product of any 3 consecutive integers is sure to be divisible by 3! = 6.
b) Prove that the product of any 7 consecutive integers is sure to be divisible by 7! = 5040.
c) Prove that the product of any k consecutive integers is sure to be divisible by k!
HINT: Consider the problem: k people from N are to be selected for a committee. In how many ways can this be done? We know that the answer to this must be a whole number. Thus $\frac{N!}{k!\,(N-k)!}$ is always a whole number.

Question 62: A “factorian” is a number that equals the sum of its digits factorialised. For example, 145 is a factorian since 1! + 4! + 5! = 145. The number 1 is a factorian, as is the number 2. (We have 1! = 1 and 2! = 2.) There is only one other factorian. What is it?
CHALLENGE: Prove that there are only four factorians.


PROBABILITY AND STATISTICS

Informal Course Notes

PART III of IV James Tanton © 2007 James Tanton

CONTENTS:
Displaying and Summarising Data
Measures of Central Tendency
Measures of Dispersion
Scatter Plots
Lines of Best Fit
Correlation Coefficient
Null Hypothesis
Distributions
Central Limit Theorem
Normal Distribution
68-95-99.7 Rule
z-scores
Roulette
Confidence Intervals
P-values
Gallup Polls
Sampling
Chi-Squared test
Quality Control
Run Tests
Rank Correlation
Exercises


DISPLAYING AND SUMMARIZING DATA

The practice and the study of the tools and techniques for collecting, displaying and summarizing numerical information is called descriptive statistics. For example, a medical study might record the blood types of 100 university students and present the information obtained in a list, a table, or a diagram of some kind. Inferences and conclusions might then be drawn from the information presented. For this example the data, in and of itself, is not numerical but rather categorical (the categories type A, type B, type AB and type O are examined) but the count of entries that fall into each particular category is numerical. In other examples, numerical information might adopt a potentially continuous array of values (such as age, height, or weight) and might be divided into categories for ease (height 30-36 inches, 37-42 inches, 43-48 inches, etc., for instance). Once categories have been established, there are a number of standard methods for presenting and summarizing data.

Presenting Data A frequency distribution is a table or chart that shows the count (number) of individuals in each possible category considered. For example, of the 100 students tested above, suppose 20 have blood type A, 27 blood type B, 16 blood type AB and 37 blood type O. This information can be summarized in a frequency table, or a bar chart or a pie graph.
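As a rough illustration, the blood-type counts above can be laid out as a frequency table with a crude text bar chart. The sketch below uses only the counts given in the example; the variable names and the layout are illustrative choices.

```python
# Counts of each blood type among the 100 students in the example.
counts = {"A": 20, "B": 27, "AB": 16, "O": 37}
total = sum(counts.values())

# Relative frequency of each category (its share of a pie graph).
shares = {blood_type: count / total for blood_type, count in counts.items()}

print("type | count | share | bar")
for blood_type, count in counts.items():
    bar = "#" * count  # one mark per student
    print(f"{blood_type:>4} | {count:5d} | {shares[blood_type]:5.0%} | {bar}")
```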


Notice for the bar chart that the individual “bars” are separated to emphasize the distinct nature of the different categories. If the data presented comes from a continuous array of values, then the bars are drawn without separation. In this situation, the bar chart is called a histogram.

A frequency polygon is a histogram with a polygonal line drawn in connecting the midpoints at the height of each bar (with end points touching the axis). Tables of whole number values are sometimes displayed via stem-and-leaf plots. Each number is divided into two parts: the units digit (the “leaf”), and the set of digits to its left (the “stem”) (or perhaps the hundreds are separated from the tens and units together, or some other variation.)

COMMENT: Each stem and leaf plot should be accompanied with a “legend” indicating how the numbers are split. For example, to the table above we might attach the comment “ 4|0 = forty.”


Describing Data Statisticians use three broad features to swiftly describe data. 1. The General Shape of a Histogram: Distributions rarely conform to exact shapes, but statisticians still find it useful to describe general features of the shape of a histogram.

2. Measures of Central Tendency: Statisticians usually seek some means to identify a single measurement that, in some sense, represents the “middle value” or “most typical” value of an entire data set. Four different measures of “central tendency” are commonly used (which we shall describe).

3. Measures of Dispersion: Statisticians also seek some means to measure the “spread” of a data set. If data is tightly clustered about a single central value, then that central value is a “meaningful” representative of the entire data set. If, on the other hand, data is widely scattered across a spectrum of values, attributing meaning to a value of central tendency must be done with care.


In detail … MEASURES OF CENTRAL TENDENCY A measure of central tendency is a single measurement that, in some sense, is typical of the entire data set. It represents the approximate “centre” of the frequency distribution. Four different measures of central tendency are in common use today:

Mean or Average [Usually denoted by the Greek letter µ]

This is simply the arithmetic mean (average) of the data values at hand. It is found by summing together all the data values and dividing by the total number of measurements.

Example: Consider the data set 5, 6, 9, 9. Then the mean is:

$$\mu = \frac{5+6+9+9}{4} = 7.25.$$

Example: If in a study the value 7 occurs 32 times and the value 9 occurs 25 times, the mean is:

$$\mu = \frac{7\times 32 + 9\times 25}{57} \approx 7.88$$

CHALLENGE EXERCISE: Prove that the sum of the differences of each data value from the mean is always zero. (For example, for the data set 5, 6, 9, 9 we have (5 − 7.25) + (6 − 7.25) + (9 − 7.25) + (9 − 7.25) = 0.)

The mean is the most commonly used measure of central tendency.

Exercise: Some texts might give the following formula for mean:

$$\mu = \frac{f_1 x_1 + f_2 x_2 + \cdots + f_n x_n}{f_1 + f_2 + \cdots + f_n}$$

Can you interpret what the symbols in this formula mean and why the formula is correct?


Mode The mode is the value in the data set that occurs most often. Example: For the ten data values 3, 6, 5, 3, 1, 6, 5, 3, 8, 3 the mode is 3. Example: For the data set 4, 5, 8, 8 the mode is 8. Example: The data set 5, 5, 6, 6, 9, 9, 3, 3, 10, 10 has no mode. Example: The data set 1, 1, 1, 1, 5, 5, 7, 7, 7, 7, 8, 8, 9, 9, 9 is bimodal. For non-numerical data (such as colours, or letters of the alphabet) the mode is the only measure of central tendency available.

Median

Arrange the data set in increasing order. Then the median is the middle value of the sequence of data values.

Example: The median of the data set 3, 3, 5, 6, 7, 16, 16, 19, 37 is 7.

If the data set contains an even number of entries, then the average of the middle two values is taken as the median.

Example: The median of 3, 4, 4, 5, 8, 8, 10, 12 is $\frac{5+8}{2} = 6.5$.

The median is useful for finding the value at the center of the distribution. It divides the data set into two equally sized groups.


Midrange

The midrange of a set of data is the average of the smallest and largest values.

Example: The midrange of the data set 5, 6, 9, 9 is $\frac{5+9}{2} = 7$.

The midrange provides a quick estimate to a central value. It is easy to compute, but is highly affected by extremely low or high values in the data set.

COMMENT: MTEL (the Massachusetts licensure exam) likes to toy with the interplay between these different measures of central tendency.

EXERCISE:
a) Find FIVE data values with median = 10, mode = 10 and mean = 1000.
b) Now find five data values with median = 10, mode = 1000 and mean = 10.
c) Can you find five data values with median = 1000, mode = 10, mean = 10?

EXERCISE: Repeat the previous exercise but this time for SIX data values.

EXERCISE: Scientists observe the speed of different turtles. Their observation of a number of turtles yields a data set with:
Mean = 3.0 ft/min
Mode = 2.9 ft/min
Midrange = 3.2 ft/min
Median = 3.1 ft/min
On her walk back home from the lab, one scientist finds a turtle with ground speed 1000 ft/min. How would the addition of this extra data value to the data set likely affect the mean, median, mode, and midrange of the data?
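The four measures of central tendency described in this section can be computed side by side for the running data set 5, 6, 9, 9. This is a minimal sketch using Python's standard `statistics` module (version 3.8 or later for `multimode`); the `midrange` helper is not part of that module and is defined here for illustration.

```python
from statistics import mean, median, multimode

def midrange(values):
    """Average of the smallest and largest values."""
    return (min(values) + max(values)) / 2

data = [5, 6, 9, 9]

print("mean     =", mean(data))
print("median   =", median(data))      # average of the middle two values
print("mode(s)  =", multimode(data))   # list of the most frequent values
print("midrange =", midrange(data))
```

Note that `multimode` returns every value tied for most frequent, so for a data set such as 5, 5, 6, 6, 9, 9 it returns all three values rather than reporting "no mode."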


ASIDE: SIMPSON’S PARADOX

Two students, Albert and Bilbert, each took a sample of math questions over a series of two days. There were 100 questions in total and Albert scored 65% and Bilbert 64% overall. So Albert proved himself a better test taker. But here are the scores day-by-day:

FIRST DAY: Albert = 71%, Bilbert = 80%
SECOND DAY: Albert = 50%, Bilbert = 57%

So each day Bilbert did a better job than Albert, but did not beat Albert overall! How is this possible? The following table shows raw data of their test results.

This paradox arises because Albert and Bilbert did not complete the same number of questions each day and the averages we computed are not equally weighted. This curious phenomenon is known as Simpson’s paradox, after the statistician Edward Simpson, who described it in 1951. A famous instance arose in the 1970s in the examination of graduate school admission rates for men and women into UC Berkeley.
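To see the weighting effect concretely, here is a small sketch. The raw counts in it are an assumption: they are hypothetical figures chosen to reproduce the percentages quoted above (100 questions each in total), and the original table may differ.

```python
# Hypothetical (correct, attempted) counts per day, chosen so that the
# daily and overall percentages match those quoted in the text.
results = {
    "Albert":  {"day 1": (50, 70), "day 2": (15, 30)},
    "Bilbert": {"day 1": (24, 30), "day 2": (40, 70)},
}

def rate(correct, attempted):
    return correct / attempted

def overall(name):
    """Overall success rate: total correct divided by total attempted."""
    days = results[name].values()
    return rate(sum(c for c, _ in days), sum(n for _, n in days))

for name, days in results.items():
    daily = ", ".join(f"{day}: {rate(c, n):.0%}" for day, (c, n) in days.items())
    print(f"{name:8s} {daily}, overall: {overall(name):.0%}")
```

With these counts Albert attempts most of his questions on the day when both students score well, while Bilbert attempts most of his on the day when both score poorly, which is what pulls Bilbert's overall figure below Albert's.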


MEASURES OF DISPERSION How well clustered about the central value is a set of data? Is the data “spread out” or “tight” about this value?

Example: Consider the following two sets of data, each with µ = 5.22:

DATA SET I: 5.0 5.1 5.2 5.4 5.4 (µ = 5.22)
DATA SET II: 0.1 2.2 2.3 8.9 12.6 (µ = 5.22)

These sets are very different! We need methods for quantifying scatter about a mean. There are several approaches.

Range

The range of a data set is simply the difference between the lowest and highest values in the set.

Example: The range of the data set 5, 6, 9, 9 is 9 − 5 = 4.

The range is a very simplistic measure of dispersion, and does not reveal any information about how the data values are distributed. It is highly affected by extremely low or high values in the data set. However, the range is often a useful measurement in practical daily issues. For example, weather forecasts usually give the range of temperatures to expect for the day.


Deviation from the Mean

Example: Consider the data set 5, 6, 9, 9 which has mean µ = 7.25. We can measure the deviation of each data point from the mean:

|5 − 7.25| = 2.25
|6 − 7.25| = 1.25
|9 − 7.25| = 1.75
|9 − 7.25| = 1.75

The average of these deviations gives a good measure of overall scatter. Here, the average deviation is:

$$\frac{|5-7.25| + |6-7.25| + |9-7.25| + |9-7.25|}{4} = \frac{2.25 + 1.25 + 1.75 + 1.75}{4} = 1.75.$$

Thus we can say: “The data set 5, 6, 9, 9 has mean µ = 7.25 with average deviation from the mean of 1.75.”

A SUBTLE POINT …

A subtle point should be noted. Given n data values $x_1, x_2, \ldots, x_n$, one first computes the mean µ, and then the n deviations $|x_1 - \mu|, |x_2 - \mu|, \ldots, |x_n - \mu|$. The first n − 1 of these quantities could turn out to be of any value, but once they are computed the value of the n-th quantity is forced: the data set must conform to the mean µ.

Exercise: As an example, suppose I tell you that a set of three data values has mean µ = 3 and two of the data values are 1 and 5. Do you know the third data value?


Thus there are only n − 1 “independent” computations to be made. For this reason mathematicians choose to divide the sum of deviations by n − 1 rather than n. Thus a measure of scatter for the data set 5, 6, 9, 9, for example, is computed:

$$\frac{|5-7.25| + |6-7.25| + |9-7.25| + |9-7.25|}{3} \approx 2.33.$$

NOTE: When dealing with thousands of data points, dividing by n − 1 as opposed to dividing by n will have very little effect. The results will be practically the same.

WARNING: Textbooks are confused about this. Some texts will choose to divide by n while others will choose to divide by n − 1. Watch out for this when you read different books. Mathematicians prefer to divide by n − 1.
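The two conventions are easy to compare directly. In this sketch the function name `average_deviation` and its `divisor_offset` parameter are hypothetical; an offset of 0 divides by n and an offset of 1 divides by n − 1.

```python
def average_deviation(values, divisor_offset=0):
    """Sum of absolute deviations from the mean, divided by n - divisor_offset."""
    mu = sum(values) / len(values)
    total = sum(abs(x - mu) for x in values)
    return total / (len(values) - divisor_offset)

data = [5, 6, 9, 9]
print(average_deviation(data))      # dividing by n
print(average_deviation(data, 1))   # dividing by n - 1

# With many data points the two conventions nearly agree.
large = list(range(1, 10001))
print(average_deviation(large), average_deviation(large, 1))
```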


ANOTHER MEASURE OF DISPERSION

Working with absolute values in mathematical equations is difficult. [CHALLENGE: Solve | x − 2 | −3x − 5 − x = 7 .] Another way to work with positive quantities is to square values, rather than take absolute values. One can later apply a square root if desired.

The variance of a data set is the sum of all deviations squared, divided by one less than the number of data values.

Example: The variance of the four data values 5, 6, 9, 9 with mean 7.25 is:

$$\frac{(5-7.25)^2 + (6-7.25)^2 + (9-7.25)^2 + (9-7.25)^2}{3} = 4.25.$$

Variance is usually denoted by the symbol $\sigma^2$ (read “sigma squared”). For n data values $x_1, x_2, \ldots, x_n$ variance is given by the formula:

$$\sigma^2 = \frac{(x_1-\mu)^2 + (x_2-\mu)^2 + \cdots + (x_n-\mu)^2}{n-1}.$$

To nullify the effect of squaring, mathematicians will next take a square root. The standard deviation of a set of n data values $x_1, x_2, \ldots, x_n$ is:

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{(x_1-\mu)^2 + (x_2-\mu)^2 + \cdots + (x_n-\mu)^2}{n-1}}.$$

Example: The standard deviation of the four data values 5, 6, 9, 9 is $\sigma = \sqrt{4.25} \approx 2.06$.

NOTE: The mean µ is still computed by dividing by n:

$$\mu = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
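The variance and standard deviation of the example can be reproduced directly from the formulas. The sketch below also compares the result with `statistics.variance` and `statistics.stdev` from Python's standard library, which likewise divide by n − 1; the variable names are illustrative.

```python
from statistics import mean, stdev, variance

data = [5, 6, 9, 9]
mu = mean(data)  # the mean itself is still computed by dividing by n

# Sum of squared deviations divided by one less than the number of values.
sigma_squared = sum((x - mu) ** 2 for x in data) / (len(data) - 1)
sigma = sigma_squared ** 0.5

print(sigma_squared, round(sigma, 2))

# The library functions use the same n - 1 convention.
print(variance(data), round(stdev(data), 2))
```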


COMMENT: If the data is given in terms of inches, say, then σ 2 is in terms of inches squared, but σ is back to being in terms of inches. Standard deviation σ always has the same units as the original data set.

Standard deviation is the most commonly used measure of dispersion.

EXERCISE: Compute the standard deviation of the rolls of a die. Answer:

□


A COMMENT ON OUTLIERS

The median of a set of data is a value which divides the data into two equal “halves.” (If the number of data values is even, then exactly 50% of the data values lie below the median, and 50% above. If the number of data values is odd, then close to 50% of the data values lie below and above.) The median of the lower “half” of data values is called the first quartile, denoted $Q_1$, and the median of the upper half the third quartile, $Q_3$. (And the median itself can be called the second quartile, $Q_2$.) These quartiles divide the data into approximately 25% blocks.

COMMENT: There is some confusion here in the literature. Some texts insist that the quartile values correspond to actual data values and some don’t. Some handle the cases of an odd number of data values differently than others. For large data sets these differences are negligible and hence the lack of uniformity. It does cause concern, however, for standardized test makers who will have students work with small data sets for which the differences can be striking. One must examine any particular author’s protocols with care with regard to this matter. (See also questions 63 and 66.)

The interquartile range IQR of a set of data is the value $Q_3 - Q_1$.

EXAMPLE: The data 2 4 4 7 10 12 25 has

$Q_1 = 4$, $Q_2 = 7$, $Q_3 = 12$, IQR = 12 − 4 = 8

An outlier is a data value that seems too large or too small to be coherent with the data set. (Maybe an error occurred during the experiment, or a value was recorded incorrectly, for example.) This is a subjective call and often statisticians will bring a little more solidity to this notion by declaring:

A data value is suspected of being an outlier if its value is 1.5 × IQR or more above the third quartile or 1.5 × IQR or more below the first quartile.


In our example 2 4 4 7 10 12 25 we have:

$Q_3 + 1.5 \times \text{IQR} = 12 + 1.5 \times 8 = 24$
$Q_1 - 1.5 \times \text{IQR} = 4 - 1.5 \times 8 = -8$

As the value 25 is larger than 24 it might be considered an outlier. As 2 is larger than −8, its value would likely be accepted. Folks like to identify outliers as they may adversely affect the analysis of data: the mean and standard deviation of a data set change with the inclusion of outliers.
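The 1.5 × IQR rule can be sketched in a few lines. The quartile convention used here is an assumption (one of the several conventions mentioned above): the first and third quartiles are the medians of the lower and upper halves, with the middle value left out when the number of data values is odd. The function names are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3), taking Q1 and Q3 as medians of the lower and
    upper halves and leaving out the middle value when the count is odd."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]
    upper = ordered[(n + 1) // 2 :]
    return median(lower), median(ordered), median(upper)

def suspected_outliers(values):
    """Values lying 1.5 * IQR or more beyond the first or third quartile."""
    q1, _, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x <= low or x >= high]

data = [2, 4, 4, 7, 10, 12, 25]
print(quartiles(data))
print(suspected_outliers(data))
```

A different quartile convention could give different fences for a small data set like this one, which is exactly the concern raised in the comment on the literature.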


SCATTER PLOTS A scientist records the pH level of a reactive solution every 10 minutes. She records the data on a graph.

A graph such as this displaying the measurements of two quantities, here pH level and time, is called a scatter diagram. A scatter diagram can show whether there seems to be some relationship between the two quantities. In this example, it looks as though there is a fairly good linear relationship of positive slope between pH and time. The following scatter diagram between IQ levels and shoe size suggests no relationship between these quantities:

PART THREE: 17

If a scatter diagram suggests a linear relationship of positive slope, then we say that the two quantities depicted are positively correlated. If the relationship seems to be linear of negative slope then we say that they are negatively correlated.

EXERCISE: Find a group of friends. Draw a scatter diagram for shoe size and height. Any correlation? [Warning: Men’s and women’s shoe sizes are computed differently. Perhaps use a yard stick to measure the lengths of people’s feet in inches?]

PART THREE: 18

LINES OF BEST FIT Suppose we have some data, for x and y values, that looks as though it is linearly correlated.

We want to determine an equation for the line that fits the data well. There are two approaches:
1. Just “eyeball” one.
2. Use mathematics to derive the equation of the line that fits the data well in some sense.

Notice: In conducting an experiment, one usually has complete control of one variable, the x variable. For example, in measuring pH levels, one has control of the times that the measurements are taken, but not of the pH levels one reads. Thus, deviations of data points from a line of best fit should be measured as vertical segments – variations of the y-values – with no deviation horizontally. For this reason, people look for lines that minimize vertical deviations only (or, to avoid absolute values, the squares of the vertical deviations). Here’s one method for doing this, the least squares method. We’ll explain it with an example.

PART THREE: 19

EXAMPLE: Here are three data points: (1,2) (2,5) (6,8)

Choose a line that minimizes the squares of the vertical deviations.

Answer: One thing that seems reasonable (and turns out to be a true property of the general theory) is that a line of best fit would properly represent the data and go through the “most average” data point. Let:

$$\bar{x} = \text{average of the x-values} = \frac{1+2+6}{3} = 3$$

$$\bar{y} = \text{average of the y-values} = \frac{2+5+8}{3} = 5$$

So the line should go through the point (3, 5).

PART THREE: 20

Now the question is: What should the slope of this line be? If we call the slope m then the equation of the line will be:

$$\frac{y-5}{x-3} = m$$

That is:

$$y = m(x-3) + 5$$

Let’s work out the y-values of this line for the given x-values and compare them to the actual y-values of the data points:

x-value    y-value on the line    actual y-value
1          5 − 2m                 2
2          5 − m                  5
6          5 + 3m                 8

The sum of the differences squared is:

$$(5-2m-2)^2 + (5-m-5)^2 + (5+3m-8)^2 = 14m^2 - 30m + 18$$

This has smallest value when “$m = \frac{-b}{2a}$”, that is, when $m = \frac{30}{28} = \frac{15}{14}$.

So the line of best fit (by minimizing squared differences) is:

$$y = \frac{15}{14}(x-3) + 5$$
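As a quick numerical check of this example, the short Python sketch below evaluates the sum of squared vertical deviations for a few slopes of the line y = m(x − 3) + 5. The helper name is made up for this illustration; it is a sanity check, not part of the method itself.

```python
points = [(1, 2), (2, 5), (6, 8)]

def squared_deviation(m):
    """Sum of squared vertical gaps between the data and y = m(x - 3) + 5."""
    return sum((m * (x - 3) + 5 - y) ** 2 for x, y in points)

best = 15 / 14
for m in (1.0, best, 1.1):
    print(f"m = {m:.4f}  sum of squares = {squared_deviation(m):.4f}")
```

The slope 15/14 gives a smaller sum than the nearby slopes 1.0 and 1.1, as the quadratic predicts.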

PART THREE: 21

Definition: The process of finding a line of best fit is called regression. The method of choosing a line of best fit by minimizing squares of differences is called the least squares method. For completeness, here are the general formulas for the least squares method: LEAST SQUARES METHOD Suppose we have N data points in a scatter diagram:

Let:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_N}{N} \qquad \bar{y} = \frac{y_1 + y_2 + \cdots + y_N}{N}$$

Let:

$$S_{xx} = \frac{(x_1-\bar{x})^2 + (x_2-\bar{x})^2 + \cdots + (x_N-\bar{x})^2}{N-1}$$

(called the variance of the x-values).

$$S_{yy} = \frac{(y_1-\bar{y})^2 + (y_2-\bar{y})^2 + \cdots + (y_N-\bar{y})^2}{N-1}$$

(called the variance of the y-values).

$$S_{xy} = \frac{(x_1-\bar{x})(y_1-\bar{y}) + (x_2-\bar{x})(y_2-\bar{y}) + \cdots + (x_N-\bar{x})(y_N-\bar{y})}{N-1}$$

(called the covariance of the x- and y-values).

Then the line of best fit goes through the point $(\bar{x}, \bar{y})$ and has slope $\frac{S_{xy}}{S_{xx}}$. The equation is:

$$y - \bar{y} = \frac{S_{xy}}{S_{xx}}\,(x - \bar{x}).$$

PART THREE: 22

Example: For our three data points:

we have:

$$\bar{x} = \frac{1+2+6}{3} = 3 \qquad \bar{y} = \frac{2+5+8}{3} = 5$$

$$S_{xx} = \frac{(1-3)^2 + (2-3)^2 + (6-3)^2}{2} = 7$$

$$S_{yy} = \frac{(2-5)^2 + (5-5)^2 + (8-5)^2}{2} = 9$$

$$S_{xy} = \frac{(1-3)(2-5) + (2-3)(5-5) + (6-3)(8-5)}{2} = 7.5$$

The line of best fit goes through (3,5) and has slope $\frac{7.5}{7} = \frac{15}{14}$ - just as we have seen. □
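The general formulas translate directly into code. Here is a minimal Python sketch, with function and variable names chosen only for this illustration, that computes the averages, the two variances and the covariance for any list of points, then reads off the slope of the line of best fit.

```python
def least_squares_summary(points):
    """Return (x_bar, y_bar, S_xx, S_yy, S_xy) for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    # Each sum is divided by N - 1, as in the formulas in the text.
    s_xx = sum((x - x_bar) ** 2 for x, _ in points) / (n - 1)
    s_yy = sum((y - y_bar) ** 2 for _, y in points) / (n - 1)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points) / (n - 1)
    return x_bar, y_bar, s_xx, s_yy, s_xy

points = [(1, 2), (2, 5), (6, 8)]
x_bar, y_bar, s_xx, s_yy, s_xy = least_squares_summary(points)
slope = s_xy / s_xx
print("line passes through:", (x_bar, y_bar))
print("S_xx, S_yy, S_xy:", s_xx, s_yy, s_xy)
print("slope:", slope)
```

For the three example points this gives the point (3, 5) and the slope 15/14 found earlier.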

Question: Why did we compute S yy ? Answer: It is involved in answering the question: How good is the fit really?

PART THREE: 23

MEASURING THE DEGREE OF FIT: THE CORRELATION COEFFICIENT. Here are some data values:

We chose a line y = mx + b that made the sum of deviations squared:

$$D = (y_1 - (mx_1 + b))^2 + (y_2 - (mx_2 + b))^2 + \cdots + (y_N - (mx_N + b))^2$$

the smallest. This quantity reflects the amount of variation of the points about the regression line. Now:

$$T = (y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \cdots + (y_N - \bar{y})^2$$

represents the amount of variation of the y-values in general – in the sense of measuring the amount of variation about the mean given by the horizontal line $y = \bar{y}$. Since the regression line is designed to be better than any other line, we necessarily have: D ≤ T.

This prompts one to think of the proportion:

$$\frac{T-D}{T}$$

This is a number guaranteed to be between 0 and 1.

If $\frac{T-D}{T}$ equals 1, then this is saying that D = 0, which means that there is no scatter about the regression line. That is, all data points lie exactly on a line.

If $\frac{T-D}{T}$ equals 0, then this is saying that T = D. That is, the amount of scatter about the regression line is no different than the amount of scatter in general. That is, computing a regression line has no effect on scatter, and so there is no relationship between the x- and y-values of any significance.

Since $\frac{T-D}{T}$ is never a negative number we give it a name that is never a negative quantity:

Definition: $R^2 = \frac{T-D}{T}$

A tedious (but not difficult) exercise in algebra shows that this quantity is given by the formula:

$$R^2 = \frac{(S_{xy})^2}{S_{xx}\,S_{yy}}$$

Usually people take the square root of this quantity:

$$R = \pm\sqrt{\frac{(S_{xy})^2}{S_{xx}\,S_{yy}}}$$

choosing the + sign to indicate data has a positive slope and the – sign to indicate negative slope.

The number R is called the correlation coefficient of the data.

PART THREE: 25

Example: Let’s compute the correlation coefficient of our data:

Since the data has positive slope:

$$R = +\sqrt{\frac{(S_{xy})^2}{S_{xx}\,S_{yy}}} = \sqrt{\frac{(7.5)^2}{7 \cdot 9}} \approx 0.94$$

This is very good. [Of course, with just three data points there is little information to go on. CHALLENGE: Explain why the correlation coefficient will have value R = 1 - indicating perfect fit – if we work with a data set of just two data points.]
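To see that the two descriptions of R² agree, the following Python sketch computes it both ways for the three example points: once from the scatter quantities T and D, and once from the formula in terms of S_xx, S_yy and S_xy. Variable names are illustrative only.

```python
import math

points = [(1, 2), (2, 5), (6, 8)]
n = len(points)
x_bar = sum(x for x, _ in points) / n
y_bar = sum(y for _, y in points) / n
s_xx = sum((x - x_bar) ** 2 for x, _ in points) / (n - 1)
s_yy = sum((y - y_bar) ** 2 for _, y in points) / (n - 1)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points) / (n - 1)

# Line of best fit y = m x + b through (x_bar, y_bar).
m = s_xy / s_xx
b = y_bar - m * x_bar

# D: scatter about the regression line.  T: scatter about the mean.
D = sum((y - (m * x + b)) ** 2 for x, y in points)
T = sum((y - y_bar) ** 2 for _, y in points)

r_squared_from_scatter = (T - D) / T
r_squared_from_formula = s_xy ** 2 / (s_xx * s_yy)

# R takes the sign of the slope (equivalently, the sign of S_xy).
r = math.copysign(math.sqrt(r_squared_from_formula), s_xy)

print(f"R^2 from T and D:  {r_squared_from_scatter:.4f}")
print(f"R^2 from formula:  {r_squared_from_formula:.4f}")
print(f"R:                 {r:.4f}")
```

Both routes give the same R², and R comes out a little under 0.95.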

One wants a correlation coefficient pretty close to 1 or to -1. A value around 0.85 or higher (or -0.85 and lower) is usually deemed “good.” One wouldn’t want to make predictions of interpolation or extrapolation with poor fitting lines.

EXERCISE: Consider the following data.

a) Use the least squares method to find a line of best fit.
b) Find the correlation coefficient for this line.
c) Does it seem reasonable to use this line of best fit for general analysis?
d) Make a prediction as to the y-value of the data when x = 1.7. (Interpolation)
e) Make a prediction for the y-value when x = 13.2. (Extrapolation)

PART THREE: 26

A WORD OF WARNING It’s always wise to LOOK at a data set before diving in and completing a linear regression. For example, although we can certainly find a line of best fit to the data shown, it would have little meaning. (We might wish to find a quadratic or an exponential curve to fit the data.)

If you suspect data fits a curve of the form $y = ac^x$, taking logarithms gives

$$\log y = x \log c + \log a,$$

a straight line relationship between x and log y. Perform a linear regression (via the methods of this section) on the table of x and log y values …

Suppose we obtain a line of best fit log y = mx + b with:

m = 1.3
b = −0.2

This gives: log y = 1.3x − 0.2 and so

$$y = \left(10^{1.3}\right)^x \cdot 10^{-0.2} = 0.63 \cdot 19.95^x.$$

If you suspect data follows a curve of the form $y = ax^2$, take square roots and fit a line to the data $\sqrt{y}$ and x.

And so forth.
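The logarithm trick can be sketched in Python as follows. The data set here is hypothetical (points generated from y = 3·2^x purely to have something to fit), and the helper names are made up; the last lines simply convert the slope 1.3 and intercept −0.2 quoted above back into the constants a and c.

```python
import math

def fit_line(points):
    """Least squares slope m and intercept b for a list of (x, y) pairs."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    m = s_xy / s_xx  # the N - 1 divisors cancel in this ratio
    return m, y_bar - m * x_bar

def fit_exponential(points):
    """Fit y = a * c**x by fitting a line to the pairs (x, log10 y)."""
    m, b = fit_line([(x, math.log10(y)) for x, y in points])
    return 10 ** b, 10 ** m  # a, c

# Hypothetical data lying exactly on y = 3 * 2**x.
sample = [(x, 3 * 2 ** x) for x in range(5)]
a, c = fit_exponential(sample)
print(f"recovered a = {a:.3f}, c = {c:.3f}")

# Back-transforming the line log y = 1.3x - 0.2 from the text.
a_text, c_text = 10 ** -0.2, 10 ** 1.3
print(f"a = {a_text:.2f}, c = {c_text:.2f}")
```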

PART THREE: 27

INFERENTIAL STATISTICS THE NULL HYPOTHESIS Here’s something “fun” … PUZZLE: A disease is spreading across the country at an alarming rate. Fifty percent of the people who get it, get better on their own. The remaining fifty percent die. Two serums, A and B, have been developed hurriedly and little time has been given to test them. The only information available right now is:
• 3 patients with the disease who were given serum A all survived.
• 7 out of 8 patients who were given serum B survived.

You have just learned that you have the disease. Which serum should you take? Some comments … “Three-out-of-three” is a 100% success rate, but only three test patients isn’t much to go on. “Seven-out-of-eight” is not perfect, but it is a larger sample. Which seems more promising?

KEY IDEA:
• Assume that serum A has no effect and ask: How likely is it that 3/3 people would naturally survive on their own?
• Assume that serum B has no effect. How likely is it that 7/8 people would survive on their own?

PART THREE: 28

Answers: If serum A has no effect then the chances that three people would all naturally survive is:

$$\frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8} = 12.5\%.$$

If serum B has no effect, then the chances that 7/8 people survive is:

$$8 \times \left(\frac{1}{2}\right)^8 = \frac{1}{32} = 3.125\%$$

(The factor 8 counts the eight possible choices for the one patient who does not survive.) It is quite unlikely that we’d see 7/8 people surviving if serum B had no effect. (Far more unlikely than seeing 3/3 people survive.) We conclude then: there is a good chance that serum B is having an effect. I would take serum B. □
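The two probabilities in this answer can be reproduced with a small Python sketch. It assumes, as the puzzle does, that each patient independently has a 50% chance of surviving when the serum has no effect; the function name is made up for the illustration.

```python
from math import comb

def chance_of_exactly(survivors, patients, p=0.5):
    """Probability that exactly `survivors` of `patients` survive,
    when each survives independently with probability p."""
    return comb(patients, survivors) * p ** survivors * (1 - p) ** (patients - survivors)

serum_a = chance_of_exactly(3, 3)
serum_b = chance_of_exactly(7, 8)
print(f"3 of 3 survive by luck alone: {serum_a:.3%}")
print(f"7 of 8 survive by luck alone: {serum_b:.3%}")
```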

The act of assuming that there is no effect at play and working to see where that assumption leads is called testing the null hypothesis.

PART THREE: 29

EXAMPLE: You have a suspect coin in hand.
a) You toss the coin 10 times and get ten HEADS in a row. Would you likely conclude that the coin is biased?
b) Suppose instead when you tossed the coin you got nine HEADS out of ten. Are you still likely to conclude that the coin is biased?
c) What if you got 8 HEADS out of ten? Just seven?

Answer: Let’s test the null hypothesis and assume for the moment that the coin is fair (that is, that nothing suspect is going on).

a) What are the chances of naturally getting 10/10 heads with a fair coin?

$$\left(\frac{1}{2}\right)^{10} = \frac{1}{1024} \approx 0.1\%$$

With 99.9% confidence I would say that the coin is biased. (Note: There is a 0.1% chance that I am wrong.)

b) What are the chances of receiving 9/10 heads with a fair coin?

$$\frac{10!}{1!\,9!} \cdot \left(\frac{1}{2}\right)^{10} \approx 1.0\%$$

In this case I would say, with 99.0% confidence, that the coin is biased.

c) What are the chances of receiving 8/10 heads naturally?

$$\frac{10!}{2!\,8!} \cdot \left(\frac{1}{2}\right)^{10} \approx 4.4\%$$

With 95.6% confidence I would say that the coin is biased. COMMENT: There is an issue of wording here. The chance of seeing eight or more heads, as one might phrase a test for bias, is greater than 4.4%.

In fact, let’s make a table:

PART THREE: 30

With 7/10 heads I am less confident to conclude that the coin is biased. Even less so for 6/10 heads. □

NOTICE: We’ve presented here a table displaying the likelihood of each and every possible outcome. This is an example of a distribution. Here we have also talked about confidence in making some kind of inference about the meaning of a result. We’re now in the thick of inferential statistics.

[Comment: The distribution in the above table is called the binomial distribution.]
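A table of this kind is easy to generate. The Python sketch below lists the chance of each possible number of heads in ten tosses of a fair coin, and also the chance of eight or more heads mentioned in the comment on wording. It is an illustration only and is not meant to reproduce the exact layout of the table above.

```python
from math import comb

TOSSES = 10

# Probability of exactly h heads in 10 tosses of a fair coin.
distribution = {h: comb(TOSSES, h) / 2 ** TOSSES for h in range(TOSSES + 1)}

for heads, probability in distribution.items():
    print(f"{heads:2d} heads: {probability:6.1%}")

# "Eight or more heads" adds the chances for 8, 9 and 10 heads.
at_least_8 = sum(distribution[h] for h in range(8, TOSSES + 1))
print(f"8 or more heads: {at_least_8:.1%}")
```

The chance of eight or more heads is 56/1024, a little under 5.5%, which is indeed greater than the 4.4% for exactly eight.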

PART THREE: 31

DISTRIBUTIONS Loose Definition: A distribution is a table or a diagram that illustrates the frequency of measurements or counts from an experiment or study. Histograms lead to distributions. For example, consider the following histogram displaying the heights of 1000 people:

If the heights of the bars are percentages, not actual counts, then this diagram has total area 1. We can make the information displayed on the diagram more precise by choosing small category intervals.

and smaller and smaller … In the limit we get a smooth curve of area 1, the “height distribution curve.”

PART THREE: 32

Of course, one cannot do this in practice, but we do like to think that human heights follow some kind of smooth distribution curve of area 1. Then we like to say … the probability that someone chosen at random has height 72" ≤ x ≤ 78" is given by …

P (72 ≤ x ≤ 78) = the fraction of area above the interval 72 ≤ x ≤ 78

Since the area of the whole curve is 1, this fraction of area matches actual area above the interval 72 ≤ x ≤ 78 .

PART THREE: 33

Better Definition: A distribution for a quantity X (such as height or foot length) is a curve with area 1 such that the probability that a randomly chosen value for X lies between a and b is: P (a ≤ X ≤ b) = the area under the curve from a to b.

This is abstract! Usually in practice, one doesn’t actually know any formula for the distribution curve a quantity X seems to follow. ARTIFICIAL EXERCISE: Suppose people’s ages among the world’s population is distributed as follows:

a) Verify that the area under this distribution is indeed 1. b) A person is chosen at random. According to this model, what are the chances that this person is between 20 and 30 years old?

PART THREE: 34

Comment: People like to use the following adjectives for distributions:

PART THREE: 35

BIG QUESTION … How does one find or estimate the distribution for a quantity? Answer: Take some samples and make a guess based on what you observe.

But there is something deeper going on …

ACTIVITY: What is the distribution of heights of the people in this class? What are the mean and the standard deviation? Here’s a curious idea: What if we took another class somewhere in the nation with the same number of participants and worked out their mean height? And say we did this for a third class as well. Actually for 1000 other classes! We’d expect all the means we calculate to be close to each other.

The means have their own distribution.

Here’s the curious thing …

PART THREE: 36

In the 1700s, when scientists conducted experiments multiple times and computed the average result or aggregate result over many runs of the same experiment, they noticed that the means always seem to follow a bell-shaped curve – no matter what type of experiment was being conducted:

They thought this odd. Human height comes in a bell-shaped curve. (The height of each human is the aggregate effect of growth rates of a collection of cells. Thus each human is the mean result of a “collection of experiments.”) The lengths of carrots come in a bell-shaped curve. (Each carrot is the aggregate result of cell growth.)

Scholars began work on identifying this curve and finding a formula for it. [These scholars include Gauss (~ 1820), Laplace (1818) and Lyapunov (1901).] This special curve is today called the normal distribution. These scholars managed to prove the famous “central limit theorem,” which we shall discuss next.

For those interested … The normal distribution follows the formula

$$y = \frac{1}{\sqrt{2\pi\sigma^2}}\; e^{-\frac{1}{2} \cdot \frac{(x-\mu)^2}{\sigma^2}}$$

where µ is the mean of the original experiment and σ is the standard deviation of the original experiment.

PART THREE: 37

THE CENTRAL LIMIT THEOREM ACTIVITY: Here’s a simple activity you can perform with 20 of your closest friends to illustrate the Central Limit Theorem in action.

Have each person roll a single die FOUR times and compute the average of the four values obtained. Repeat three more times.

      First Roll    Second Roll    Third Roll    Fourth Roll    AVERAGE
1.
2.
3.
4.

Also, for the 16 rolls of the die recorded above, list the total number of 1s, 2s, 3s, 4s, 5s, and 6s that occurred.

1s    2s    3s    4s    5s    6s

On the board, create two charts: one for the raw data and one for the means. Have each person come to the board and place a dot on the left chart for each of the 16 rolls of his or her die, and one dot for each of the four average values on the right chart. Do you see distributions akin to the diagrams below? Is the distribution of dots on the left close to uniform? Is the distribution on the right approximately normal?

PART THREE: 38

Here’s the theorem: CENTRAL LIMIT THEOREM:

Suppose a quantity has some kind of distribution with mean µ and standard deviation σ. Take a sample of N measurements of this quantity and calculate its mean. Do this repeatedly for many different samples of size N. Then the means of all the samples will closely follow a normal distribution with:

mean = µ
standard deviation = $\frac{\sigma}{\sqrt{N}}$

The mathematical proof of this result is HARD!! But the idea behind it is fairly clear. The last statement of the theorem states, essentially: “The larger the sample size, the less deviation one obtains.” Example: Suppose you are looking at the heights of people.
• Select 10 people at a time and plot their means: Expect a lot of spread.
• Select 100 people at a time and plot their means: Expect less spread.
• Select 1000 people at a time and plot their means: Expect even less spread.
• Select 6.5 billion people at a time (that is, the entire world’s population!) and plot their means: Expect no spread!

People say: “Error decreases as sample size increases.”

PART THREE: 40

EXAMPLE: A manufacturer makes light bulbs. Their average bulb lasts µ = 55.0 hours with standard deviation σ = 1.8 hours. They ship boxes of 100 bulbs to their distributors. Suppose we open a box and compute the average life span of all 100 bulbs. We repeat this act for many many more boxes. Then, according to the central limit theorem, the average box lifespan has:

mean = µ = 55.0 hours
standard deviation = $\frac{\sigma}{\sqrt{100}} = \frac{1.8}{10}$ = 0.18 hours

OPTIONAL ACTIVITY FOR THOSE WITH COMPUTER SKILLS
a) Have a computer select ten random numbers from the range 1 to 100, and compute the mean of those ten numbers. Have the computer do this one hundred times and then plot the means. Does the distribution look bell-shaped?
b) Repeat this exercise but this time have the computer select twenty numbers at random each time and compute their mean. Does the resulting curve look more “normal”?
c) Repeat this exercise but this time have the computer select fifty numbers each time and compute their mean. How does the distribution appear?
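One possible way to carry out this activity is sketched below in Python, using only the standard library. The function name, the fixed random seed and the choice to summarise the spread with a standard deviation (rather than draw a plot) are all assumptions made for this sketch; plotting the list of means with any charting tool would complete the exercise.

```python
import random
import statistics

def sample_means(sample_size, repeats=100, low=1, high=100, seed=0):
    """Draw `repeats` samples of `sample_size` whole numbers from low..high
    and return the list of their means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(repeats):
        sample = [rng.randint(low, high) for _ in range(sample_size)]
        means.append(statistics.mean(sample))
    return means

for size in (10, 20, 50):
    means = sample_means(size)
    print(
        f"sample size {size:2d}: "
        f"mean of means = {statistics.mean(means):6.2f}, "
        f"spread of means = {statistics.stdev(means):5.2f}"
    )
```

The means should cluster near 50.5, and their spread should shrink as the sample size grows, in line with the theorem.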

PART THREE: 41

UNDERSTANDING THE NORMAL DISTRIBUTION The normal curve is symmetrical, bell-shaped, and of area 1.

Now … we haven’t talked about how to compute the mean and standard deviation of a continuum of values (that is, over a smooth curve); only how to compute these for a finite set of discrete values.

[Recall: $\mu = \frac{x_1 + x_2 + \cdots + x_N}{N}$ and $\sigma = \sqrt{\frac{(x_1-\mu)^2 + \cdots + (x_N-\mu)^2}{N-1}}$]

One way to deal with a continuous set is to approximate it as a finite collection of values as though it is a histogram:

One computes the mean and standard deviation for these finite values, and then takes the limit as one repeats this for a finer and finer histogram. (This is calculus.)

PART THREE: 42

The upshot is … For the normal distribution: mean = µ = “central value” standard deviation = σ = “measure of spread”

PART THREE: 43

THE 68 – 95 – 99.7 RULE: For the normal distribution … 68% of the data lies within one standard deviation of the mean:

95% of the data lies within two standard deviations of the mean:

99.7% of the data lies within three standard deviations of the mean:
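The three percentages in this rule can be checked with the error function from Python's standard `math` module. The sketch relies on the standard identity that the area of a normal curve within k standard deviations of its mean equals erf(k/√2); the function name is made up for the illustration.

```python
import math

def fraction_within(k):
    """Area of a normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(k):.1%}")
```

The more precise figures are about 68.3%, 95.4% and 99.7%, which the rule rounds to 68, 95 and 99.7.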

PART THREE: 44

EXERCISE: A species of carrot has average length 12.3 cm with standard deviation 0.8 cm. Assume carrots are normally distributed. a) What are the chances that a carrot chosen at random lies in the range 11.5 cm to 13.1 cm.? b) What are the chances that a carrot chosen at random is longer than 13.9cm?

Answer:

□

PART THREE: 45

Z-SCORES The following example leads to an important concept: EXAMPLE: Here are the college test scores on math – along with John’s scores.

On which subject did John do best? Worst? [Assume scores are normally distributed.] Answer: Notice that John is 60 points below the mean in algebra, that is, 1½ standard deviations below. This is not good. Notice that John is 70 points above the mean in calculus, that is, 2σ above. This is very good! John is 20 points above the mean in statistics, that is, 0.8σ above. This is fairly good.

Even though John got the lowest number score in calculus, this was his best result! □ In this example we were able to compare different scores by bringing them to the same standard: the number of standard deviations above or below the mean. This leads to the notion of a z-score:

PART THREE: 46

Definition: If x is the value of an experiment which has mean µ and standard deviation σ, then its z-score is:

$$z = \frac{x-\mu}{\sigma}$$

This is the number of standard deviations x lies above or below the mean.

Example: John’s z-scores are:

Algebra: $z = \frac{670-730}{40} = -1\tfrac{1}{2}$ (That is, 1½ standard deviations below the mean.)

Calculus: $z = \frac{450-380}{35} = 2$ (That is, 2 standard deviations above the mean.)

Statistics: $z = \frac{660-640}{25} = 0.8$ (That is, 0.8 standard deviations above the mean.) □
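The same comparison takes only a few lines of Python. The scores, means and standard deviations below are the ones used in John's example; the dictionary layout and the function name are just one way to organise them.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# subject: (John's score, mean, standard deviation)
results = {
    "Algebra": (670, 730, 40),
    "Calculus": (450, 380, 35),
    "Statistics": (660, 640, 25),
}

z_scores = {subject: z_score(*values) for subject, values in results.items()}
for subject, z in z_scores.items():
    print(f"{subject:10s} z = {z:+.1f}")

best_subject = max(z_scores, key=z_scores.get)
print("best result:", best_subject)
```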

NOTE: The transformation $z = \frac{x-\mu}{\sigma}$ converts the points x = µ − σ, x = µ and x = µ + σ into z = −1, z = 0, and z = +1, and thus transforms a distribution centered about µ with a spread of σ into a distribution centered about 0 with a spread of 1.

So …

PART THREE: 47

The distribution of the z-scores of a normal distribution is a normal distribution with mean 0 and standard deviation 1. (A normal distribution with µ = 0 and σ = 1 is called the standard normal distribution.) It is convenient to define the following function: Definition: ϕ(z) = the area under the standard normal distribution (µ = 0 and σ = 1) to the left of the value z.

(This is called the cumulative distribution function for the standard normal distribution.) Books publish tables of ϕ(z) values. (Such tables are also easy to find on the internet.) For example: ϕ(0) = 0.500 (Do you see why?)

According to one of these tables we also have:

ϕ(1) = 0.841
ϕ(1.72) = 0.957
ϕ(−3.0) = 0.0013

PART THREE: 48

EXAMPLE: The wealth index of a floogle is normally distributed with mean 0 and standard deviation 1. A floogle is selected at random. What is the probability that its wealth index is between 1.00 and 1.72?

Answer:

Probability = area between 1 and 1.72
= ϕ(1.72) − ϕ(1)
= 0.957 − 0.841
= 0.116
= 11.6%

□

EXAMPLE: The wealth index of a woogle is normally distributed with mean 18 and standard deviation 4. A woogle is selected at random. What is the probability that its wealth index is between 22 and 24.88? Answer: We need to convert this to a problem about a normal distribution with mean 0 and standard deviation 1. That is, we need to work with z-scores.

x = 22: $z = \frac{22-18}{4} = 1$

x = 24.88: $z = \frac{24.88-18}{4} = 1.72$

So …

P(between 22 and 24.88) = ϕ(1.72) − ϕ(1) = 11.6%.

□
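Instead of a printed table, the area to the left of z can be computed from the error function in Python's `math` module. The sketch below assumes the standard identity ϕ(z) = ½(1 + erf(z/√2)) and uses made-up function names; it then repeats the floogle and woogle calculations.

```python
import math

def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def probability_between(low, high, mean=0.0, sd=1.0):
    """P(low <= X <= high) for a normally distributed X, via z-scores."""
    z_low = (low - mean) / sd
    z_high = (high - mean) / sd
    return phi(z_high) - phi(z_low)

floogle = probability_between(1.00, 1.72)
woogle = probability_between(22, 24.88, mean=18, sd=4)
print(f"floogle: {floogle:.3f}")
print(f"woogle:  {woogle:.3f}")
```

Both come out at about 0.116, matching the 11.6% found from the table.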

PART THREE: 49

COMMENT: Sometimes texts will only give values for the areas between 0 and a positive value z as shown:

Exercise: Find a text that gives values as described here. Use it to compute: a) ϕ (0.2) b) ϕ (−0.2) c) ϕ (1.16) d) ϕ (−2.21)

PART THREE: 50

A QUICK TASTE OF HYPOTHESIS TESTING EXAMPLE: It is generally believed that parsnip length is normally distributed with mean 18.5 cm and standard deviation 3.2 cm. Johnny found a parsnip next to a nuclear power plant that is 25.0 cm long. Should we conclude that something unusual happened to the parsnip? Answer: What are the chances of finding a parsnip of that length under normal circumstances? (NULL HYPOTHESIS!!)

Recall the 68-95-99.7 rule. The chances that we naturally find ourselves in the shaded region is only 2.5%. Thus, with 97.5% confidence we can say that something strange happened to the parsnip! □

PART THREE: 51

THE CENTRAL LIMIT THEOREM … CONTINUED Recall the result … CENTRAL LIMIT THEOREM:

Suppose a quantity has some kind of distribution with mean µ and standard deviation σ. Take a sample of N measurements of this quantity and calculate its mean. Do this repeatedly for many different samples of size N. Then the means of all the samples will closely follow a normal distribution with:

mean = µ
standard deviation = $\frac{\sigma}{\sqrt{N}}$

And recall that the normal distribution has 95% of its values lying within two standard deviations of its mean, and 99.7% of its values within three standard deviations. These are the key ideas behind statistical hypothesis testing. One example … ANALYSIS OF ROULETTE: A Roulette wheel has 18 red spaces, 18 black spaces and two green spaces. In the simplest version of the game one can place a $1 bet on either “red” or “black.” If your colour comes up, you win $1. If it doesn’t, you lose $1. (Thus the two green spaces give a slight advantage to the house.) So …

Your chances of winning $1 are 18/38.

Your chances of losing $1 are 20/38.

PART THREE: 52

In one round of the game your expected profit is:

$$\mu = 1 \cdot \frac{18}{38} + (-1) \cdot \frac{20}{38} = -0.053$$

So, on average, you lose 5.3 cents per bet. What’s the standard deviation here? It is not clear what this means in this case (what are the data points one is referring to here?) but one can reason as follows: If we were to play 38 games, then we would expect, on average, to receive the data value +1 eighteen times and the data value -1 twenty times. The standard deviation from the mean of -0.053 is given by:

$$\sigma^2 = \frac{(1-(-0.053))^2 + \cdots + (1-(-0.053))^2 + (-1-(-0.053))^2 + \cdots + (-1-(-0.053))^2}{37}$$

$$= \frac{(1-(-0.053))^2 \cdot 18 + (-1-(-0.053))^2 \cdot 20}{37} = 1.0242$$

$$\sigma = \sqrt{1.0242} = 1.0120$$

So, for a single roulette bet:

µ = −0.053
σ = 1.012

Now … suppose I am a habitual gambler and go to the casino every night to play 100 rounds of roulette. By the central limit theorem the results of my nightly activity closely follow a normal distribution with:

µ = −0.053
σ = $\frac{1.012}{\sqrt{100}}$ = 0.1012

PART THREE: 53

Almost all of my nightly results (99.7% of them) lie within the range:

µ − 3σ to µ + 3σ i.e. within the range -0.3566 to 0.2506. For 100 bets of a dollar each, this means that for almost all evenings my winnings lie between -$35.66 and $25.06. Although I often lose, I also often win. This keeps me coming back!

FROM THE CASINO’S POINT OF VIEW … I am not the only person playing the game each night. They may see something like 100 000 rounds of roulette being played per night. These samples of 100 000 rounds per night closely follow a normal distribution with:

µ = −0.053
σ = $\frac{1.012}{\sqrt{100\,000}}$ = 0.003

So … on almost all evenings (99.7% of them) the outcomes lie within the range:

µ − 3σ to µ + 3σ, i.e. within the range -0.062 to -0.044.

With 100 000 bets of $1 this corresponds to a nightly win for gamblers between the range -$6200 and -$4400. The casino sees an almost guaranteed profit somewhere between $4400 and $6200 per night from Roulette alone.
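The roulette figures can be recomputed with the Python sketch below. It follows the reasoning in the text (38 equally likely outcomes, dividing by 37 for the variance) but works with unrounded values, so the last digits differ slightly from the hand calculation. The names used are illustrative only.

```python
import math

RED, BLACK, GREEN = 18, 18, 2
SPACES = RED + BLACK + GREEN

# Expected profit on a single $1 bet on red: win on red, lose otherwise.
mu = (RED * 1 + (BLACK + GREEN) * -1) / SPACES

# Spread of the 38 equally likely outcomes, dividing by 37 as the text does.
variance = (RED * (1 - mu) ** 2 + (BLACK + GREEN) * (-1 - mu) ** 2) / (SPACES - 1)
sigma = math.sqrt(variance)

def three_sigma_range(rounds):
    """Range holding about 99.7% of the average profit per bet over `rounds` bets."""
    spread = sigma / math.sqrt(rounds)
    return mu - 3 * spread, mu + 3 * spread

print(f"single bet: mu = {mu:.3f}, sigma = {sigma:.3f}")
for rounds in (100, 100_000):
    low, high = three_sigma_range(rounds)
    print(f"{rounds} rounds: total winnings from {low * rounds:,.0f} to {high * rounds:,.0f} dollars")
```

For 100 rounds the range straddles zero, while for 100 000 rounds both ends are negative for the gamblers, which is the casino's near-certain profit.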

PART THREE: 54

CONFIDENCE INTERVALS Often people are comfortable making statements that have a 95% chance of being correct! EXAMPLE: It is generally believed that the lengths of carrots are normally distributed with mean 18.5 cm and standard deviation 3.2 cm. Johnny found a carrot growing next to the nuclear power plant that is 25.0 cm long. He wants to say that something unusual happened to this carrot. With what level of confidence can he say this? Answer: We know that 95% of data in a normal distribution lies within two standard deviations from the mean. In this case, 95% of carrot lengths should lie within the range 18.5 ± 6.4 cm., that is, in the interval [12.1 , 24.9] cm. Johnny’s carrot has length outside of this range. There is only a 5% chance that this would happen naturally. Thus, with 95% confidence, Johnny can say that something strange happened to his carrot. □ EXAMPLE: The average mass of all planets in the universe is not known, but scientific theories do suggest that planet masses vary with a standard deviation of the order σ = 3000 units. Astronomers have observed the motion of 24 planets and have calculated their average mass to be m = 27650 units. What can we say about the value of µ , the mean mass of all planets in the universe? Answer: For sample sizes of 24 planets we expect to obtain a distribution very close to the normal distribution with: mean = µ standard deviation =

$\frac{\sigma}{\sqrt{24}} = \frac{3000}{\sqrt{24}}$ = 612

We obtained: m = 27650.

PART THREE: 55

Now, in general, there is a 95% chance that m lies within two standard deviations of µ .

By the same token, there is a 95% chance that µ lies within two standard deviations of m!! (If m is within two units of µ , then µ is within two units of m.) So we can say, with 95% confidence, that the true value for the mean µ lies somewhere in the interval 27650 ± 1224 , that is, in the range [26426, 28874] We call [26426, 28874] the 95% confidence interval.

□

IN GENERAL: Suppose a population has unknown mean µ and known standard deviation σ. A sample of size N yields an average value m. Then, with a 95% level of confidence, we can say that the true mean µ lies somewhere in the interval:

$$\left[\, m - 2\frac{\sigma}{\sqrt{N}},\; m + 2\frac{\sigma}{\sqrt{N}} \,\right]$$

This is called the 95% confidence interval for the mean. EXAMPLE: A newspaper reports that the average height of an Australian male is 175 ± 8 cm with 95% confidence. What does this mean? Answer: We are 95% sure that the true mean for the entire population of Australian males lies somewhere between 167 and 183 cm. NOTE: It DOES NOT mean that 95% of Australian men have height somewhere between 167 and 183 cm. THIS IS A COMMON MISCONCEPTION!

PART THREE: 56

EXAMPLE: A manufacturer of pipes should produce pipes with diameter 3 inches. However, the manufacturing equipment is not perfect and has a standard deviation of σ = 0.02 inches in the pipes it produces. Inspectors select 100 pipes at random and find their average diameter to be 2.98 inches. Are they pleased? Answer: The true mean µ lies somewhere in the interval:

$$\left[\, 2.98 - 2\frac{0.02}{\sqrt{100}},\; 2.98 + 2\frac{0.02}{\sqrt{100}} \,\right] = [2.976,\ 2.984]$$

with 95% confidence. Things don’t look good. It is very unlikely that the manufacturer is producing pipes with the required mean of 3 inches. □

EXAMPLE: A filter removes dust from my living room. The amount of dust it removes from day to day has standard deviation σ = 0.3 mg. I measured the weight of dust collected over three days and got the figures: 13.2 mg, 13.7 mg, 12.9 mg. Compute the 95% confidence interval for the true average weight of dust removed per day. Answer: The sample has mean

$$m = \frac{13.2 + 13.7 + 12.9}{3} = 13.27 \text{ mg}$$

The 95% confidence interval is

$$m \pm 2\frac{\sigma}{\sqrt{3}} = 13.27 \pm 0.35 \text{ mg.}$$

□
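The confidence interval calculations in the planet, pipe and dust examples all follow the same recipe, which the Python sketch below packages as one function. The function name is an assumption of this sketch, and it uses the "two standard deviations" rule of thumb from the text rather than the slightly more precise factor 1.96.

```python
import math

def confidence_interval_95(sample_mean, sigma, n):
    """Approximate 95% confidence interval for the true mean:
    sample mean plus or minus 2 * sigma / sqrt(n)."""
    half_width = 2 * sigma / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width

planets = confidence_interval_95(27650, 3000, 24)
pipes = confidence_interval_95(2.98, 0.02, 100)

dust_readings = [13.2, 13.7, 12.9]
dust_mean = sum(dust_readings) / len(dust_readings)
dust = confidence_interval_95(dust_mean, 0.3, len(dust_readings))

print(f"planets: [{planets[0]:.0f}, {planets[1]:.0f}]")
print(f"pipes:   [{pipes[0]:.3f}, {pipes[1]:.3f}]")
print(f"dust:    [{dust[0]:.2f}, {dust[1]:.2f}] mg")
```

The planet interval differs from the one in the text by about one unit at each end, because the text rounds σ/√24 to 612 before doubling.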

QUESTION: What is a 99.7% confidence interval? How would I have to adjust my calculations to find this level of confidence?

PART THREE: 57

P-VALUES Sometimes we like to be more specific and follow our hunches about what we think the value of a mean µ should actually be. EXAMPLE: Suppose a population has a normal distribution.

µ = unknown, but we suspect it has value 12
σ = 3

A sample of size 40 was found to have average value m = 12.98. What do we now think about our hunch that µ = 12? Answer: Let’s assume that µ really is 12 and ask how likely it is that we would have obtained a sample mean of m = 12.98. Now samples of size 40 have means that follow a normal distribution with:

mean = 12
standard deviation = $\frac{\sigma}{\sqrt{40}} = \frac{3}{\sqrt{40}}$ = 0.47

Now 95% of the means should lie between 11.06 and 12.94.

We got the value 12.98. There is a 2½% chance that we’d land in this range if the true mean really were µ =12. We reject our hunch that µ =12 with 97.5% certainty that this is the right thing to do.

□

PART THREE: 58

Some people like to go further: In the same example: If µ =12, what is the probability that we’d get a sample mean of m = 12.98 or more? Answer:

Convert this to a z-score so that we can look up values on a table of standard normal distribution values:

$$z = \frac{12.98 - 12}{0.47} = 2.09$$

According to the tables, this region has area 0.5000 – 0.4817 = 0.0183 = 1.83% This value is called the p-value of the sample mean 12.98 for the assumption that µ =12. The chance of getting a sample mean of 12.98 or more under the assumption that µ =12 is extremely low. We reject the claim that µ =12 with 98.17% level of confidence that we are doing the right thing.
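A right-tail p-value can also be computed directly, without a table. The Python sketch below assumes the standard identity for the area to the left of z in terms of the error function, and its function names are made up for the illustration. Because it does not round the standard deviation to 0.47 or the z-score to 2.09, its answer is slightly larger than the table-based 1.83%.

```python
import math

def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def right_tail_p_value(sample_mean, assumed_mean, sigma, n):
    """Chance of a sample mean at least this large if the assumed mean is correct."""
    z = (sample_mean - assumed_mean) / (sigma / math.sqrt(n))
    return 1 - phi(z)

p = right_tail_p_value(12.98, 12, 3, 40)
print(f"right-tail p-value: {p:.2%}")
```

Either way the p-value is roughly 2%, so the conclusion about the hunch µ = 12 is unchanged.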

PART THREE: 59

We’ve just computed a “right-tail p-value” for a given sample mean. One can also compute left-tail p-values. EXERCISE: Suppose a population has a normal distribution.

µ = unknown, but we suspect it has value 12 σ =3 A sample of size 40 was found to have average value m = 11.03. Find the left-tail p-value for this sample. Would you reject the claim that µ =12?

A “p-value” gives a measure of the likelihood that we would obtain a sample mean at least as extreme as the one we observed, under the assumption that beliefs about the true value of the mean are correct.


GALLUP POLLS and the like:
There is another version of the central limit theorem that comes in handy for interpreting Gallup Polls, for example. Suppose we are interested in determining what percentage of a population has a certain characteristic, say, predicting the percentage of Americans that will vote Republican at the next elections. Call the proportion of the population with the desired property p. Our goal is to estimate the value of p.

Suppose we take a sample of size N from the population and find that percentage $\hat{p}$ of the sample has the desired property. (For example, we interview just 1000 Americans and find that 32% say they will likely vote Republican.)

CENTRAL LIMIT THEOREM: Version II
Let p be the (unknown) percentage of a population possessing a certain characteristic. If samples of size N are taken and the values $\hat{p}$ are computed for those samples, then: The values $\hat{p}$ have a distribution that is approximately normal, and the larger the sample size N the better the approximation. The mean and standard deviation of the distribution of $\hat{p}$ values are:

$\mu = p \qquad \sigma = \sqrt{\frac{p(100 - p)}{N}}$

(Here, p is given as a percentage.)

Given a particular value $\hat{p}$ for one sample, the 95% confidence level for that sample is:

$[\hat{p} - 2\sigma,\ \hat{p} + 2\sigma]$

where σ is the standard deviation computed with $\hat{p}$ rather than p. With 95% confidence we can say that the true value p lies within this range.


EXAMPLE: Of 1500 adult Americans that were polled, 3.2% of them said they had an overall thoroughly pleasurable experience studying math in high-school. Estimate the proportion of ALL adult Americans that will say the same.

Answer: The sample percentage is $\hat{p} = 3.2$, so we have

$\sigma = \sqrt{\frac{3.2 \times 96.8}{1500}} = 0.454$

With 95% confidence we can say that the percentage of Americans who claim to have enjoyed high-school math lies in the range [2.292, 4.108]. □
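The same interval can be reproduced with a few lines of code. This is a minimal sketch under the two-standard-deviation rule used in the text; the function name `poll_interval_95` is hypothetical. Since it does not round σ to 0.454 before doubling, its endpoints may differ from the ones above in the last decimal place.

```python
import math

def poll_interval_95(sample_percent, n):
    """Approximate 95% range (in percentage points) for the population percentage."""
    sigma = math.sqrt(sample_percent * (100 - sample_percent) / n)
    return sample_percent - 2 * sigma, sample_percent + 2 * sigma

low, high = poll_interval_95(3.2, 1500)
print(f"[{low:.3f}, {high:.3f}]")
```

The quantity 2σ computed inside the function is what a newspaper would report as the margin of error.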

COMMENT: IN THE NEWSPAPER …
Newspaper reports usually don’t mention a “confidence level” but rather a “margin of error.” They mean by this the same thing – with the understanding that we are talking about a 95% confidence level. For example, if a journalist writes …

“According to our survey, 36% of Americans are now afraid to eat cheese. The margin of error in this report is 2.6 percentage points.”

The journalist means by this that, with 95% confidence, we can say that the true percentage of Americans afraid to eat cheese lies in the interval [33.4, 38.6].

WARNING: Some organizations prefer to use 90% confidence levels (phrased in terms of margins of error). For example, the Bureau of Labor Statistics reports unemployment figures with 90% confidence.

POLLS CAN BE MISLEADING!
There is a large sub-branch of statistics concerned with the issue of how to select an appropriately representative sample. By considering this very question, one may begin to influence the types of people one may accept to interview for a survey. For example, in studying the shopping habits of Americans one may think to go to the local mall to interview shoppers. Right there you have a bias in your sample – you are considering only people who like to shop at malls!

A famous historical example of an erroneous prediction based on biased sampling occurred during the 1936 U.S. presidential elections. The popular magazine Literary Digest, as part of the sensationalism leading up to the election, conducted a poll to predict the outcome of the race. After interviewing a sample of eligible voters, chosen by drawing names at random from telephone books from across the nation, the editors of the publication concluded that the election was a foregone conclusion - Alfred Landon was to win with a comfortable lead - and they subsequently published much editorial commentary to this effect. It turned out, however, that Landon’s opponent, Franklin Roosevelt, won the election by a landslide. Members of the Digest did not realize that they had worked with a biased sample: only affluent Americans could afford telephones at the time of the Great Depression and be listed in telephone books. This was a class of voter then more likely to vote Republican. Consequently, the Digest’s prediction was erroneous. The publication folded in 1938 due to both the sampling fiasco and the difficult times of the depression.

Today, a number of sampling methods are commonly used to help ensure that no bias occurs. These methods include:

Random Sampling Each subject of the population is assigned a number, and numbers are generated randomly with the aid of a computer to select members.

Systematic Sampling Each subject of the population is assigned a number, and, starting at a random number, every kth member from then on is selected. For example, one might select every 23rd person, starting with the 533rd member.

Stratified Sampling When a population is naturally divided into groups (such as male/female, or age by decade), selecting a random sample from within each group produces what is called a “stratified sample.” Samples produced this way are used to ensure representatives of each subgroup are present in the study. For example, in a study involving college freshmen and sophomores, one might select twenty-five students at random from each group – freshman males, freshman females, sophomore males and sophomore females – to make a sample of one hundred students.

Cluster Sampling If an intact subgroup of a population is used as a representative sample of the entire population then the sample is called a cluster sample. For example, the set of all freshman females might be used to represent the population of all college students for the purposes of one study, or the 12 eggs in one carton of eggs as representative of all the eggs handled by a particular supermarket.

The list goes on!

PRACTICE EXERCISE:
a) Take a deck of cards and divide it into four suits. Randomly select 2 cards from each suit and list the eight cards here:
Which of the above types of sampling is this? Is there an 8-card sample that would not arise via this method?

b) Shuffle the deck of cards. Select two suits by randomly drawing two cards from the deck. Now separate those two suits from the deck. Randomly select four cards from each of those two suits. Write the sample of eight cards here: This is an example of a “multistage random sample.” Can all possible samples arise this way? c) Describe a procedure for selecting 8 cards from a deck that illustrates the method of systematic sampling.

Statisticians often prefer to work with a simple random sample (SRS) scheme: not only does each person in the population have equal chance of being selected, but every combination of people has the same chance of appearing as any other combination. For example, random sampling and systematic sampling can be SRS, but stratified sampling and cluster sampling are certainly not.

The central limit theorem is assuming that samples chosen are simple random samples.
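To make the differences between the sampling schemes concrete, here is a minimal sketch using Python's standard `random` module. The population (the numbers 1 to 100), the split into two groups, and all function names are hypothetical choices made for illustration; the systematic version shown starts at a random point within the first block, which is one common variant.

```python
import random

def simple_random_sample(population, k, rng):
    """Every k-member combination is equally likely to be chosen."""
    return rng.sample(population, k)

def systematic_sample(population, k, rng):
    """Random starting point, then every step-th member from then on."""
    step = len(population) // k
    start = rng.randrange(step)
    return [population[start + i * step] for i in range(k)]

def stratified_sample(groups, per_group, rng):
    """A random sample of fixed size taken from within each group."""
    return [member for group in groups for member in rng.sample(group, per_group)]

rng = random.Random(0)                        # fixed seed, only so the sketch is repeatable
population = list(range(1, 101))              # hypothetical population numbered 1..100
groups = [population[:50], population[50:]]   # hypothetical split into two strata

print(simple_random_sample(population, 8, rng))
print(systematic_sample(population, 8, rng))
print(stratified_sample(groups, 4, rng))
```

Note that the stratified sample always contains exactly four members from each group, so many eight-member combinations can never appear.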


There is a wonderful student exercise being bandied about teaching conferences. Sadly, I have not been able to track down the originator of this idea to assign proper credit. As you will see, the following activity can be attempted on the first day of class, and as student sophistication grows, can be repeated with more powerful tools.

ACTIVITY: HOW MANY RED BOOKS ARE IN THE LIBRARY?
Your mission is to simply give a reasonable answer to this question! Working in teams of three, formulate an approach to an answer by following the three steps outlined below. These are precisely the steps a statistician must follow when she commences a new type of project.

STEP 1: DESCRIBING THE DATA
The question to be explored is vague. Can you come up with some reasonable points of clarification? What does it mean for a book to be red? Go to the library and look at some books. Try to come up with a consensus within your team as to what “red-ness” should mean. Write down your definition of a red book. Does the phrase “in the library” need to be clarified?

STEP 2: COLLECTING DATA
Use a method that your team believes will give the best approximate answer to the question. Describe your method and approach.

STEP 3: CONCLUSION Write down your answer. Approximately how many books in the library are red?


LATER … Select a sampling method that seems appropriate for garnering a non-biased sample of 200 or more library books. Count the number in that sample that fit your definition of being red, and express this count as a percentage p of the sample. Using

$\mu = p \qquad \sigma = \sqrt{\frac{p(100 - p)}{N}}$

write a 95% confidence interval for the true proportion of library books that are red.


TWO-VARIABLE ANALYSIS: CHI-SQUARED TESTS It’s easiest just to begin with an example: EXAMPLE: Is there any correlation between hair colour and eye colour? A team goes out and examines a random sample of people. The data collected is displayed in the following contingency table:

Does there seem to be some connection? Answer: Here’s one way to think about this. First, compute each row sum, column sum and grand total.

We see that 38 out of the 110 people examined had blue eyes. That is, 38/110 = 34.5% of the sample had blue eyes. We also see that, of the people examined, 37 were blonde, 46 brown haired and 27 red haired.


If there is absolutely no influence of hair colour on eye colour, then we’d expect 34.5% of the blonde population to have blue eyes, 34.5% of the brown haired population to have blue eyes, and 34.5% of the red haired population to have blue eyes. That is,

THE EXPECTED FREQUENCY (aka count) OF BLONDE PEOPLE WITH BLUE EYES IS: $\frac{38 \times 37}{110} = 12.8$

THE EXPECTED FREQUENCY OF BROWN HAIRED PEOPLE WITH BLUE EYES IS: $\frac{38 \times 46}{110} = 15.9$

THE EXPECTED FREQUENCY OF RED HAIRED PEOPLE WITH BLUE EYES IS: $\frac{38 \times 27}{110} = 9.3$

How many green-eyed people with red hair would we expect? $\frac{27 \times 25}{110} = 6.1$

(The proportion of our sample with red hair is 27/110. This proportion of the 25 green eyed folk should have red hair.)

In general:

The expected frequency of the entry in the i-th row and j-th column is given by:

$\frac{\text{row } i \text{ total} \times \text{col } j \text{ total}}{\text{grand total}}$
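The general rule is easy to check against the figures already worked out. The sketch below uses only the row and column totals quoted in the text (38 blue-eyed, 25 green-eyed, and hair totals 37, 46, 27 out of 110); the function name `expected_frequency` is made up for illustration.

```python
def expected_frequency(row_total, col_total, grand_total):
    """Expected count for one cell if the two qualities are unrelated."""
    return row_total * col_total / grand_total

grand_total = 110
blue_eyes, green_eyes = 38, 25                         # eye-colour totals quoted in the text
hair_totals = {"blonde": 37, "brown": 46, "red": 27}   # hair-colour totals quoted in the text

for hair, total in hair_totals.items():
    value = expected_frequency(blue_eyes, total, grand_total)
    print(f"{hair} with blue eyes: {value:.1f}")

red_green = expected_frequency(green_eyes, hair_totals["red"], grand_total)
print(f"red with green eyes: {red_green:.1f}")
```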


We can go and fill in all the expected frequencies under the assumption that there is no relationship between the two qualities.

Looking at this it seems that our observed frequencies are vastly different from the expected frequencies (for no relationship). It seems that something is going on. To make this more precise … The CHI-SQUARED STATISTIC for a table of observed frequencies (o) and expected frequencies (e) is:

$\chi^2 = \sum \frac{(o - e)^2}{e}$, where the sum runs over all entries of the table.

In our example …

$\chi^2 = \frac{(23 - 12.8)^2}{12.8} + \frac{(8 - 15.9)^2}{15.9} + \cdots + \frac{(5 - 6.1)^2}{6.1} = 17.88$

Now … if there truly were no relationship twixt the two variables then you’d expect the observed values to be very close to the expected values, that is: A χ 2 value close to zero suggests no connection. A large χ 2 value suggests something is going on.
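For a table of any size, the whole calculation (totals, expected frequencies, and the sum) can be sketched in a few lines. The table used below is made up purely to exercise the function, since the full table of observed counts from the example is not reproduced in this text; the name `chi_squared` is likewise an illustrative choice.

```python
def chi_squared(observed):
    """Chi-squared statistic for a table (list of rows) of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under "no relationship": row total x column total / grand total
            e = row_totals[i] * col_totals[j] / grand_total
            statistic += (o - e) ** 2 / e
    return statistic

made_up_table = [[10, 20],
                 [20, 10]]        # hypothetical counts, not the data from the example
stat = chi_squared(made_up_table)
print(stat)
```

In this made-up table every expected count is 15, so each of the four cells contributes 25/15 to the sum.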


In our example, we seem to have a large χ² value. Statisticians have done the mathematics on the χ² statistic and have computed the distribution one would expect it to follow. [NASTY MATH!] Each table of a different size has its own chi-squared distribution. A table with r rows and c columns is said to have

ν = (r − 1)(c − 1) degrees of freedom.

E.g. In our example we have ν = 2 × 2 = 4 degrees of freedom.

(COMMENT: Why r − 1 and c − 1? We know that all r rows add to 100% of the data. Thus if you know the sum of the first r − 1 rows, then you do not need to be given the sum of the r-th row. Its value is not free. Ditto for the columns.)

See the internet or most any stats book for a table of χ² values for each value of ν.

In our example: χ² = 17.88 for ν = 4. According to the table of values, this value lies between $\chi^2_{0.995}$ and $\chi^2_{0.999}$.

If there were no connection between the two variables, the chance of obtaining a χ² value this large would be below 0.5%. Thus, with 99.5% confidence, we can say that there is some kind of correlation between eye colour and hair colour.

NOTE: There is no claim as to what that connection is. Further independent analysis is required.


EXERCISE: Analyse this (fictional) data from interviews with 5406 fifteen-year olds.

CAVEAT: LOW EXPECTED FREQUENCIES (values lower than 5) TEND TO SKEW CHI-SQUARED TESTS. Statisticians have the rule of thumb that if more than 20% of the entries in a table have expected values less than 5, then the test is unreliable.


QUALITY CONTROL
EXAMPLE: A pipe manufacturer makes pipes of diameter 3 inches. Consumers will tolerate a spread of values with standard deviation σ = 0.06 inches. To test the quality of their manufacturing techniques, each day a random sample of 10 pipes is selected and their mean diameter is computed. Here are the results of twelve days of data:

DAY    1     2     3     4     5     6     7     8     9     10    11    12
Mean   2.98  3.01  3.04  2.97  2.99  3.01  3.05  3.04  3.07  3.08  3.06  3.09

Is the wear and tear of the production equipment having an effect?

Answer: Let’s plot the data. Also, the mean is meant to be 3.00 so let’s plot that line as well:

It looks like the data is drifting away from the target mean. To make this more precise…

Samples of size ten should have mean 3.00 and standard deviation $\frac{0.06}{\sqrt{10}} = 0.02$. 99.7% of the results should lie within three standard deviations of this mean, that is, between 2.94 and 3.06. That the data is drifting above the critical line of +3 standard deviations suggests that quality is “out of control.” (Day 9 is the first day of concern.) □
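The three-standard-deviation check can be automated. The following is a minimal sketch using the twelve daily means from the example; the variable names are illustrative. It keeps the unrounded value of 0.06/√10 (a little under 0.02), so its upper limit sits slightly below 3.06.

```python
import math

daily_means = [2.98, 3.01, 3.04, 2.97, 2.99, 3.01, 3.05, 3.04, 3.07, 3.08, 3.06, 3.09]
target, sigma, sample_size = 3.00, 0.06, 10

standard_error = sigma / math.sqrt(sample_size)   # spread expected for means of 10 pipes
upper_limit = target + 3 * standard_error
lower_limit = target - 3 * standard_error

# Days (numbered from 1) whose sample mean falls outside the three-sigma band
out_of_control_days = [day for day, mean in enumerate(daily_means, start=1)
                       if not lower_limit <= mean <= upper_limit]
first_day_of_concern = out_of_control_days[0] if out_of_control_days else None
print(first_day_of_concern)
```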


EXAMPLE: Two machines each produce 1 000 bolts per day. The following table shows the number of defective bolts each machine manufactured over a ten day period.

MACHINE 1   42  37  18  37  17  26  35  21  18  17
MACHINE 2   44  36  23  41  24  25  31  35  23  21

Using only basic techniques, is one machine significantly less reliable than the other?

Answer: There isn’t much to work with on this problem. One approach is to count totals:

Over 10 days, machine 1 produced 268 defective bolts out of 10 000. Percentage: 2.68%
Over 10 days, machine 2 produced 303 defective bolts out of 10 000. Percentage: 3.03%

These seem on par.

ANOTHER APPROACH: Perform a COUNTS TEST. Here’s the list showing which machine produced the greatest number of defective bolts per day:

2 1 2 2 2 1 1 2 2 2

Machine 2 is listed seven times out of the ten days. If there is no difference in the quality of the machines, that is, if each is equally likely to be listed on a day as having produced the most defective bolts for that day, then this sequence is akin to the sequence of Hs and Ts in flipping a coin. Is it unusual to get seven Hs in a run of ten flips? That is, is the fact that machine 2 is listed seven times at all significant?


Question: What are the chances of receiving seven heads in flipping a coin ten times?

Answer: $\frac{10!}{3!\,7!} \cdot \frac{1}{2^{10}} \approx 11.7\%$.

It can happen. (It occurs about 12% of the time.) This is not considered “rare” enough to be significant. So we would say that there is no significant evidence to suggest that machine 2 is behaving differently to machine 1. □

COMMENT: We usually look for events that are “rare”, say have a 5% chance of occurring, to say, with 95% confidence, that something unusual is occurring.

For example, suppose machine 2 was listed NINE times out of the ten in the string. The chance of this “naturally” occurring is $\frac{10!}{1!\,9!} \cdot \frac{1}{2^{10}} \approx 1\%$, so we would conclude, with 99% confidence, that machine 2 is indeed less reliable than machine 1.
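Both coin-flip chances above come from the same count-of-arrangements formula, which can be sketched with `math.comb` from Python's standard library (available from Python 3.8). The function name `chance_of_exact_count` is made up for illustration.

```python
from math import comb

def chance_of_exact_count(k, n):
    """Chance of exactly k heads in n flips of a fair coin."""
    return comb(n, k) / 2 ** n

seven_of_ten = chance_of_exact_count(7, 10)
nine_of_ten = chance_of_exact_count(9, 10)
print(f"{seven_of_ten:.1%}  {nine_of_ten:.1%}")
```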


RUN TESTS FOR RANDOMNESS
Suppose some activity has two possible outcomes: A or B. e.g.

Toss a coin: H or T
Roll a die: Even or Odd
Height of a person: Above the mean or Below the mean

Suppose we perform the activity a number of times and record the sequence of As and Bs that result: e.g.

A A | B B B | A | B | A A A A A | B B B | A A | B | A A A

Definition: A run is a string of repeated letters in the sequence. One usually separates runs with a “|” to make them easier to see. In the example above there are nine runs.

TWO COMMENTS:
a) A sequence with a large number of runs suggests that the sequence of As and Bs generated is not truly random. For instance, the following sequence has the maximal possible number of runs. You would be unlikely to believe it to be a random sequence:
A|B|A|B|A|B|A|B|A|B|A|B|A|B|A
b) A sequence with very few runs doesn’t seem that random either.
AAAAAAAAA|BBBBBBBB|AAAAAAAAAAA
There seems to be “too much clustering.”


So … the count of runs in a sequence should, in some way, give an indication of just how random that sequence is.

Some mathematical facts … Suppose in a string of N symbols we have a As and b Bs. (So N = a + b.) There are $\frac{N!}{a!\,b!}$ possible ways to arrange these As and Bs.

List them all and count the number of runs in each possible example. Mathematicians have proven that the count of runs has mean and standard deviation given by these formulae:

$\mu = \frac{2ab}{N} + 1$

$\sigma = \sqrt{\frac{2ab\,(2ab - N)}{N^2 (N - 1)}}$

They have also shown that if a and b are each 7 or greater, then 95% of the run counts lie within two standard deviations of this mean. COMMENT: This is using the version of standard deviation with “N” in the denominator rather than “N-1.”
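The "list them all and count" idea can be carried out by brute force and compared with the formulae. The sketch below does this for four As and three Bs, values chosen arbitrarily for illustration; the function names are hypothetical. It uses the "N"-denominator version of the standard deviation, in line with the comment above.

```python
from itertools import combinations
from math import sqrt

def count_runs(sequence):
    """Number of runs: one more than the number of places where the symbol changes."""
    return 1 + sum(1 for x, y in zip(sequence, sequence[1:]) if x != y)

def brute_force_mean_sd(a, b):
    """Mean and standard deviation of the run count over every arrangement."""
    n = a + b
    counts = []
    for a_positions in combinations(range(n), a):    # choose where the As go
        arrangement = ["A" if i in a_positions else "B" for i in range(n)]
        counts.append(count_runs(arrangement))
    mean = sum(counts) / len(counts)
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)   # "N" in the denominator
    return mean, sqrt(variance)

def formula_mean_sd(a, b):
    """The mean and standard deviation given by the formulae."""
    n = a + b
    mean = 2 * a * b / n + 1
    sd = sqrt(2 * a * b * (2 * a * b - n) / (n ** 2 * (n - 1)))
    return mean, sd

print(brute_force_mean_sd(4, 3))
print(formula_mean_sd(4, 3))
```

The two printed pairs should agree up to floating-point rounding.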

EXERCISE: a) Write down all the possible ways to list three As and two Bs. b) Count the runs in each c) Find the mean and the standard deviation of the count of runs. Verify that the above formulae give the same values.


EXAMPLE: Consider the following string: HHHTTHHHTTTHTTTT
How likely is it that this came from flipping a coin?

Answer: We have a = 7 heads and b = 9 tails. Here N = 16. There are 6 runs. Now, according to the previous result, the runs should follow a distribution with:

µ = 8.875
σ = 1.9

The count of six runs is within the range of two standard deviations from the mean. We cannot conclude that this example is unusual. □

EXAMPLE: Consider the following string: HTHHTHTHTTHHHTHT
How likely is it that this came from flipping a coin?

Answer: We have a = 9 heads, b = 7 tails, and N = 16. There are 12 runs. Again:

µ = 8.875
σ = 1.9

The number 12 is within 2 standard deviations from the mean. We cannot conclude that this sequence is not random. □


EXAMPLE: Consider the following string: HHHHHTTTTTTHHTTT How likely is it that this came from flipping a coin ? Answer: a = 7 heads; b = 9 tails; N = 16. There are 4 runs. Again:

µ = 8.875 σ = 1.9

The count of 4 runs is more than two standard deviations below the mean. With 95% confidence we can say that this sequence was not produced by a random phenomenon. □
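The three coin examples can be re-checked mechanically. Here is a minimal sketch of the runs test as described, applied to the three strings from the examples; the function name `runs_test` and its return format are illustrative choices, not a standard API.

```python
from math import sqrt

def runs_test(sequence):
    """Return (runs, mean, sd, within_two_sd) for a sequence of two distinct symbols."""
    symbols = sorted(set(sequence))
    if len(symbols) != 2:
        raise ValueError("the sequence must contain exactly two different symbols")
    a = sequence.count(symbols[0])
    b = sequence.count(symbols[1])
    n = a + b
    runs = 1 + sum(1 for x, y in zip(sequence, sequence[1:]) if x != y)
    mean = 2 * a * b / n + 1
    sd = sqrt(2 * a * b * (2 * a * b - n) / (n ** 2 * (n - 1)))
    return runs, mean, sd, abs(runs - mean) <= 2 * sd

for flips in ("HHHTTHHHTTTHTTTT", "HTHHTHTHTTHHHTHT", "HHHHHTTTTTTHHTTT"):
    runs, mean, sd, within = runs_test(flips)
    print(flips, runs, mean, round(sd, 2), within)
```

Only the last string, with its four runs, falls outside the two-standard-deviation band.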


TWO APPLICATIONS

ABOVE- and BELOW- the MEDIAN TEST
To determine whether or not a set of numerical data is “random”:
a) Write the data in the order it was collected.
b) Compute the median of the data.
c) Write “A” or “B” next to each data point to indicate whether that point is above or below the median. (If an entry has the same value as the median, then omit it.)
d) Do a runs test on the sequence of As and Bs.
If the data really was generated by a random phenomenon, then the sequence of As and Bs produced should be random.

EXAMPLE: Here’s some data. Does it seem random? Use the above/below median test.

16 12 23 18 37 21 13 14 30 79 11

Answer: We need to find the median. (Unfortunately, this means ordering the data!):

11 12 13 14 16 18 21 23 30 37 79

median = 18. Now here’s the sequence in terms of aboves and belows:

B B A * A A B B A A B

(The star indicates the omitted value.)


We have: a = 5, b = 5 with N = 10. There are 5 runs. (I know that these a- and b-values are a bit low, but let’s follow the test anyway just for the fun of it!) This gives:

µ = 6
σ = 1.49

The value of 5 runs is not outside two standard deviations from the mean. The data seems to be following a random phenomenon. □
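Steps (b) to (d) of the test translate directly into code. This is a minimal sketch using `statistics.median` from the standard library and the eleven data values from the example; the function name `above_below_sequence` is made up for illustration.

```python
from statistics import median

def above_below_sequence(data):
    """'A' for values above the median, 'B' for below; values equal to it are omitted."""
    m = median(data)
    return "".join("A" if x > m else "B" for x in data if x != m)

data = [16, 12, 23, 18, 37, 21, 13, 14, 30, 79, 11]   # in the order collected
sequence = above_below_sequence(data)
runs = 1 + sum(1 for x, y in zip(sequence, sequence[1:]) if x != y)
print(sequence, runs)
```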

DIFFERENCE IN POPULATIONS TEST Suppose two samples of sizes m and n are denoted: a1 a2 a3 … am b1 b2 … bn

To decide whether or not the two samples came from the same type of population, arrange all m + n values in increasing order. (If some values are repeated, choose an order among them at random.) Record a sequence of As and Bs to show which sample each data point came from. If the resulting sequence of As and Bs is random, then we can conclude that the samples are not really different and come from the same source. If the sequence is not random, then no such conclusion can be made.
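The merging step can be sketched as follows. The two short samples are made up purely to show the mechanics, and the function name `source_sequence` is hypothetical; ties are broken at random by attaching a random number to each value before sorting.

```python
import random

def source_sequence(sample_a, sample_b, rng=None):
    """Merge two samples in increasing order and record which sample each value came from."""
    rng = rng or random.Random()
    # The middle entry is a random tie-breaker, so equal values are ordered at random.
    tagged = [(value, rng.random(), "A") for value in sample_a]
    tagged += [(value, rng.random(), "B") for value in sample_b]
    return "".join(label for _, _, label in sorted(tagged))

made_up_a = [1, 4, 5]     # hypothetical sample, no ties with the other one
made_up_b = [2, 3, 6]     # hypothetical sample
labels = source_sequence(made_up_a, made_up_b)
print(labels)
```

The resulting string of As and Bs is then fed to an ordinary runs test.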


EXAMPLE: Twelve people from a mall were interviewed for their ages. Call these the M values:

13 18 34 17 16 30 13 47 37 35 15 35

Twelve people at an art museum were interviewed for their ages. Call these the A values:

45 52 17 28 41 63 48 23 38 60 40 40

Are these ages from the same type of population?

Answer: Arrange the data in numerical order and keep track of which are Ms and which are As.

13 13 15 16 17 17 18 23 28 30 34 35 35 37 38 40 40 41 45 47 48 52 60 63
M  M  M  M  A  M  M  A  A  M  M  M  M  M  A  A  A  A  A  M  A  A  A  A

We have: a = 12, b = 12, N = 24. There are 8 runs. For these values:

µ = 13
σ = 2.40

The value of 8 runs is more than two standard deviations away from the mean. With 95% confidence we can say that these two sets of data are not coming from the same type of population! □

Here’s a fun example:

EXAMPLE: Here are the first twenty digits of π:

3 1 4 1 5 9 2 6 5 3 5 8 9 7 9 3 2 3 8 4

Do they seem random?

Answer: Do the median test: One checks that the median is 4.5. The sequence of Aboves and Belows is:

BBBB|AA|B|AA|B|AAAAA|BBB|A|B

Here: a = 10, b = 10, N = 20. There are 9 runs. We have:

µ = 11 σ = 2.17

The value of 9 runs is within two standard deviations of the mean. This sequence looks random! □

EXERCISE:
a) Write a sequence of Hs and Ts twenty symbols long that looks random to you. (The number of Hs need not be the same as the number of Ts.) Perform the runs test. Is your sequence “random”?
b) Flip a coin 20 times and record results. Perform a runs test for randomness on your sequence!


RANK CORRELATION
Here’s an opportunity to offer students a challenging exercise that illustrates the way tools and ideas in statistics are created.

THE PROBLEM: Five men – Albert, Bilbert, Cuthbert, Dilbert and Egbert – take part in a singing contest and are ranked by two judges 1 – 5 (with “1” as best and “5” as least favored). For example, a possible outcome of the contest might be:

          Albert  Bilbert  Cuthbert  Dilbert  Egbert
Judge 1     1        4        3         5       2
Judge 2     3        5        2         4       1

If the judges followed purely objective assessment criteria and were completely free of personal preferences, then we would expect the two rankings to be identical. If, on the other hand, the judges followed no set procedures for their ranking schemes and assigned rankings in a random fashion, then we would expect very little or no correlation between the two lists. In the example presented above we seem to be somewhere between these two extremes.

THE CHALLENGE: Develop an “index” that takes two lists of rankings from two judges and, from those lists, applies some formula or algorithm and computes a numerical value, which we shall call R. We would like R to have the following properties:
i) 0 ≤ R ≤ 1
ii) R has value 1 if the two lists are identical.
iii) R has value 0 if the two lists are in complete disagreement. (e.g. The first judge lists the candidates in the order 1, 2, 3, 4, 5 and the second judge in the order 5, 4, 3, 2, 1.)

Compute the value of your “Rank Correlation Coefficient” for the example above and interpret the results.


Here are some possible approaches:

APPROACH 1: Given two lists compute the difference of scores of each contestant, square, and sum. This gives a number D. In our example, we have:

$D = (1 - 3)^2 + (4 - 5)^2 + (3 - 2)^2 + (5 - 4)^2 + (2 - 1)^2 = 8$

The largest value D can possess (for a list of five numbers) is 40 and this occurs when the rankings are in reverse order. (Why?) The smallest value D can possess is 0, and this occurs when the orders are in complete agreement. So set:

$R = 1 - \frac{D}{40}$

This does the trick. In our example, $R = 1 - \frac{8}{40} = 0.8$, which indicates some disagreement.

Comment: This is the approach Charles Spearman took in 1904. He defined his index to be $\rho = 1 - 2 \cdot \frac{D}{M}$, where M is the maximum value D could be for two lists n entries long. Here ρ = 1 corresponds to complete agreement and ρ = −1 to complete disagreement.

Note: One can show that $M = \frac{n(n^2 - 1)}{3}$ and this occurs if the two lists are in reverse order of one another. [To see this, show what happens to the value D if two numbers in one list are swapped. Show that the value of D increases if we swap two elements that aren’t already in reverse order.]
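Approach 1 and Spearman's index can be sketched in code as follows. The two rank lists are the ones appearing in the calculation of D above (Judge 1: 1, 4, 3, 5, 2 and Judge 2: 3, 5, 2, 4, 1); the function names are illustrative.

```python
def squared_difference_sum(ranks_1, ranks_2):
    """D: the sum of squared differences between the two judges' ranks."""
    return sum((r1 - r2) ** 2 for r1, r2 in zip(ranks_1, ranks_2))

def spearman_rho(ranks_1, ranks_2):
    """rho = 1 - 2 D / M, with M = n (n^2 - 1) / 3 the largest possible D."""
    n = len(ranks_1)
    d = squared_difference_sum(ranks_1, ranks_2)
    m = n * (n ** 2 - 1) / 3
    return 1 - 2 * d / m

judge_1 = [1, 4, 3, 5, 2]
judge_2 = [3, 5, 2, 4, 1]

d = squared_difference_sum(judge_1, judge_2)
r = 1 - d / 40                      # the index R of Approach 1, for five contestants
rho = spearman_rho(judge_1, judge_2)
print(d, r, rho)
```

Reversed lists such as 1, 2, 3, 4, 5 against 5, 4, 3, 2, 1 give D = 40, so R = 0 and ρ = −1, matching the stated extremes.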


APPROACH 2: Use absolute values instead of squaring in the previous approach. What is the maximal value D can obtain in this case and when does it occur?

APPROACH 3: We can reorder the names of the contestants so that the list of ranks for the first judge is 1, 2, 3, 4, 5. The list of ranks for the second judge changes accordingly.

          Albert  Egbert  Cuthbert  Bilbert  Dilbert
Judge 1     1       2        3         4        5
Judge 2     3       1        2         5        4

Now look at each contestant in turn along the second row. Count the number of scores to the right of each entrant with a lower score. In our example, according to Judge 2, Albert has TWO lower scores to his right. Egbert has ZERO lower scores to his right, Cuthbert ZERO, Bilbert ONE, Dilbert ZERO. Summing these scores gives a value S = 2 + 0 + 0 + 1 + 0 = 3. If the rankings were in perfect agreement, then S would have value 0. If they were in perfect disagreement (in reverse order), then S would have value 10, and this is maximal. Set

$R = 1 - \frac{S}{10}$

In our example, R = 0.7, indicating some disagreement.

Comment: In 1938 M. G. Kendall took an approach similar to this one.

****

Many approaches, of course, are possible. The difficulty in this work is determining when and how a maximal value for a count occurs (and generalizing this to a list of n contestants and not just five). [Approach 2 is problematic in this regard.]
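The counting in Approach 3 can be sketched as below, using Judge 2's reordered ranks 3, 1, 2, 5, 4. The function name is made up for illustration. The general maximum n(n − 1)/2 used here is not stated in the text; it is simply the number of pairs of contestants, which equals 10 when n = 5.

```python
def lower_scores_to_right(ranks):
    """S: for each entry, count the later entries with a lower score, and sum."""
    return sum(1
               for i in range(len(ranks))
               for j in range(i + 1, len(ranks))
               if ranks[j] < ranks[i])

judge_2_reordered = [3, 1, 2, 5, 4]    # Judge 2's ranks once Judge 1 reads 1, 2, 3, 4, 5
n = len(judge_2_reordered)
max_s = n * (n - 1) // 2               # every pair out of order: 10 for five contestants

s = lower_scores_to_right(judge_2_reordered)
r = 1 - s / max_s
print(s, r)
```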


PROBLEM SET III

Question 63: PERCENTILES and QUARTILES
One of the 99 values that divide a set of data in numerical order into 100 equal parts is called a percentile. For example, the 90th percentile is the data value such that 90 percent of the data points are below that value. Often scores in standardized tests are presented in terms of percentiles. For example, if 525 students take an exam and 95% of the students receive a score lower than 74 (and some student actually did earn a score of 74), then the 95th percentile for the exam is 74.

It is often convenient to divide data sets into four equal parts. The lower (or first) quartile, denoted Q1, is the 25th percentile. The middle (or second) quartile, Q2, the median, is the 50th percentile, and the upper (or third) quartile, Q3, is the 75th percentile. (COMMENT: As we have seen on page 14, there is some confusion over the definition of a quartile. Notice that this paragraph defines it as the 25th percentile, and so, here at least, a quartile should correspond to a data value.)

The following table shows test scores for 120 participants:

Score                   Number of Participants
97                      3
95                      1
89                      8
88                      10
86                      2
85                      6
83                      1
80                      31
Between 70 and 79       28
Below 70                30

a) What is the 90th percentile for this data?
b) What is the third quartile for this data?
c) What is the median for this data?
d) From the information presented is it possible to determine the mode? The mean? The midrange?


Question 64:
a) Find an example of SIX data points with: Mean = 1000, Median = 10, Mode = 10
b) Find an example of SIX data points with: Mean = 10, Median = 10, Mode = 1000
c) Find an example, if possible, of SIX data points with: Mean = 10, Median = 1000, Mode = 10

Question 65: QUIRKY AVERAGES
“The average American has one ovary and one testicle.”
“The average square on a checkerboard is grey.”
“The average roll on a die is 3½.”
“On average, each planet of the solar system has 722 million human inhabitants.”
Each of these statements is technically correct, but meaningless! (There is no roll of 3½ on a die, checkerboard squares are either black or white, who lives on Pluto? and as for the average American … well… ) Come up with two more quirky mis-uses of the average.


Question 66: BOX PLOTS A “box plot” is a quick graphical representation showing the range of a data set, the median of the data set, and the medians of the lower half and of the upper half of the data set. For example, the following picture is a box plot:

We see that the data ranges from 20 to 70 and that the three divider marks at 35 or so (the left end of the box), at 42 or so (the line within the box), and at 65 or so (the right end of the box) divide the data into four groups each representing 25% of the data.

a) What is the range and median according to the following box plot? 50%-75% of the data lies within which range of values?

b) Draw a box plot for the following set of data: 1, 1, 1, 3, 4, 6, 6, 6, 6, 7, 10, 14. Following page 14 … Be clear in your mind that the left end of the box lies at position 2 and the right end at position 6.5. (The line within the box is the median of the data. The lines at the ends of the box are the medians of the lower and upper halves of the data.)


Question 67: STEM AND LEAF PLOTS
Often data, given as whole numbers, is presented via a “stem-and-leaf plot.” Each number is divided into two parts: the unit’s digit (the “leaf”) and the digits to the left of the unit (the “stem”). In one column all the stems are listed, and in the second, all the corresponding leaves. For example, the data set:

22, 23, 26, 31, 31, 31, 38, 42, 63, 69, 69, 127, 129, 131

is presented:

 2 | 2, 3, 6
 3 | 1, 1, 1, 8
 4 | 2
 6 | 3, 9, 9
12 | 7, 9
13 | 1
KEY: 3|8 = 38

a) Draw a stem-and-leaf plot for the data: 113, 113, 114, 115, 116, 123, 123, 130, 130, 203, 308, 308, 319

b) Find the mean, median, and mode for the following data presented via stem-and-leaf:

 0 | 1, 1, 4
 2 | 2, 3, 3
 4 | 2, 2, 2
 8 | 1, 8, 8
14 | 0
18 | 9, 9
KEY: 18|9 = 189

COMMENT: Other types of stem-and-leaf plots are possible. For example, one might use the key 13|04 = 1304.

COMMENT: Stem and leaf plots are used to give a quick sense of a shape as to how the data is distributed.
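Splitting each number into stem and leaf is a one-line calculation, as the following minimal sketch shows for the example data set (22, 23, 26, …, 131). It assumes non-negative whole numbers and the key used in the example (the leaf is the units digit); the function name `stem_and_leaf` is made up for illustration.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Group whole numbers by stem (all digits but the last) with the units digits as leaves."""
    plot = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)    # e.g. 127 -> stem 12, leaf 7
        plot[stem].append(leaf)
    return dict(plot)

data = [22, 23, 26, 31, 31, 31, 38, 42, 63, 69, 69, 127, 129, 131]
plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem:>3} | {', '.join(str(leaf) for leaf in leaves)}")
```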


Question 68: Find the mean and standard deviation of the following data set: 5.6 5.2 4.6 5.7 4.9 6.4

Question 69: Consider the following set of data:

x    Y
2    3
3    5
5    12
6    20

Draw a scatter diagram for this data. Find the equation of the line of best fit for this data. Sketch that line on the scatter diagram. Visually – does this line seem to fit the data well? Compute the correlation coefficient for this data. What does this say about the fit of the line?

Question 70: Consider the following set of data:

Y    X
1    1.94
3    5.98
4    8.04
6    12.02

Draw a scatter diagram for this data. Find the equation of the line of best fit for this data. Sketch that line on the scatter diagram. Visually – does this line seem to fit the data well? Compute the correlation coefficient for this data. What does this say about the fit of the line?


SOME MTEL-TYPE QUESTIONS

Question 71: a) Find the mean, median, and mode of the following test scores:

Score    Number of Students
100      3
95       2
93       1
92       2
87       3
81       1
79       4
60       1

b) Another student later took the test and scored just three points. Describe, in words only, what effect such a low-value additional data point will have on each of the mean, median, and mode. Question 72: In what way is the following graph misleading?


Question 73: Here is a stem-and-leaf plot. What is the mode of this data set?

(Here the data values are: 120, 120, 120, 130, …, 615)

Question 74: The median of a data set is significantly larger than the mean of the data set. What could cause this?
(A) There are a few exceptionally small values in the data set
(B) There are a few exceptionally large values in the data set
(C) The data values are tightly clustered around one value
(D) The data values are evenly spread across a range of values

Question 75: Draw a reasonably accurate pie chart for the following data:


Question 76: A sample of people at a mall were measured for their heights. The results are displayed in the following histogram.

Based on this data, what are the chances that a person at the mall selected at random is less than 61 inches tall?

Question 77: An investment company offers fifteen different investment options, varying from low-risk to high-risk plans. The average rates of return on these plans have been:

5% 5% 5% 5% 5% 5% 10% 10% 10% 15% 15% 15% 30% 90% 200%

a) Display this data by any visual means of your choice.
b) Compute the mean, mode, and median of this data.

A representative of this investment firm is talking with a potential new client. When speaking about the central tendency of the company’s return-rate figures, would you advise the representative to speak about the mean, the mode, or the median of the data values? Choose one and give justification for your choice.


Question 78: Scientists come up with an equation of best fit: y = 0.98 x + 0.02

with correlation coefficient r = 0.01 . Would they want to use this equation to predict values for y ? Explain. Question 79: Here is some data displayed on a graph.

Which of the following seems like a reasonable line of best fit? (A) 2 y − 6t = 90

(B) 6t + 2 y = 90

(C) 180 − 2 y − 6t = 0

(D) 3t − y − 90 = 0

What seems to be a reasonable value for the correlation coefficient? (A) 1.28

(B) -0.01

(C) – 0.95

(D) 0.90

Question 80: Here is some data displayed on a graph.

The line of best fit is: y = 45 − 2.1t

What does this model predict for the y-value at t = 2.5 ?


BACK TO NON-MTEL QUESTIONS …

Question 81: A terrible disease is sweeping across the nation at an alarming rate. Only 10% of people who catch the disease survive. Two experimental serums have hurriedly been developed but only limited testing has been done on them. Two people with the disease were given serum A and both survived. Six people with the disease were given serum B and four survived.
a) Assuming that serum A had no effect, show that the chance of two out of two people naturally surviving the disease is 1%.
b) Assuming that serum B had no effect, show that the chance of exactly four out of six people naturally surviving the disease is 0.12%.
c) Given these figures, which serum is more likely to have had a better effect on recovery?

Question 82: Another terrible disease is sweeping across the nation at an alarming rate. 50% of people who catch the disease survive. Three experimental serums have hurriedly been developed but only limited testing has been done on them.
Serum A: Three people with the disease were given the serum and all three survived.
Serum B: Ten people with the disease were given the serum and eight survived.
Serum C: Five people with the disease were given the serum and four survived.
You’ve just contracted the disease! Based on this limited information, which of the three serums would you take and why?
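The two figures quoted in Question 81 can be checked with a few lines of Python. This is only a sketch: it assumes each patient survives independently with the same 10% chance (a binomial model), and the helper name `binomial_probability` is made up for illustration.

```python
from math import comb

def binomial_probability(successes: int, trials: int, p: float) -> float:
    """Chance of exactly `successes` survivors among `trials` patients,
    assuming each survives independently with probability p."""
    return comb(trials, successes) * p**successes * (1 - p) ** (trials - successes)

# Question 81: the natural survival rate is 10%.
serum_a = binomial_probability(2, 2, 0.10)  # two out of two survive
serum_b = binomial_probability(4, 6, 0.10)  # exactly four out of six survive

print(f"Serum A, no effect assumed: {serum_a:.2%}")
print(f"Serum B, no effect assumed: {serum_b:.2%}")
```

Under these assumptions the script reproduces the 1% and 0.12% stated in parts a) and b); part c) is still left for you to think about.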


Question 83: The following diagram gives the distribution of the number of minutes Australian women can tolerate left-handed 8 year-old boys whistling while chewing gum.

a) Verify that the area under this curve is one unit.

According to this distribution … If an Australian woman is chosen at random, what is the probability that she can tolerate whistling
b) for a length of time between 5 and 15 minutes?
c) for less than one minute?
d) between 6 and 8 minutes?
e) for more than 3 minutes?

Question 84: A coin is tossed 8 times. Complete the following table.

# Heads that appear    Probability
8                      0.391%
7                      3.125%
6                      10.938%
5
4
3
2
1
0

Question 85: You suspect a coin is biased. You toss it 12 times and heads appear ten times. With what level of confidence would you say that the coin is indeed biased?

Question 86: SOMETHING REALLY COOL !!! American pennies are biased!
a) Stand 20 American pennies on edge on a table and then bang the table so that they all fall over. Count the number of heads that appear. What do you notice? Repeat.
b) Spin 20 American pennies and let them come to rest. Count the number of tails that appear. What do you notice? Repeat.


Question 87: You suspect a die is biased towards landing “6”. You toss the die six times and get a six three of those times.
a) What are the chances of obtaining exactly three sixes in six rolls of a fair die?
b) Would you conclude that the die above is biased? If so, with what level of confidence would you make such a claim?

Question 88: A company manufactures bolts. If 5% of the bolts they produce are defective, what are the chances that, of four bolts chosen at random:
a) all are defective?
b) all but one are defective?
c) none are defective?

Question 89: The distribution of weights of woogles is known to be symmetrical and triangular, with mean 50 pounds and range ±20 pounds.

a) A woogle is selected at random. What is the probability that its weight is between 30 and 50 pounds? b) A woogle is selected at random. What is the probability that its weight is over 60 pounds? c) Find the value c so that 75% of the woogles have weight between 30 and c pounds.


Question 90: Use a table of values for the normal distribution curve (with mean 0 and standard deviation 1) to find the area under the curve between:
a) z = 0 and z = 1.4
b) z = -0.70 and z = 0
c) z = 1.1 and z = 1.2
d) z = -1.8 and z = 0.6
e) all z values below -0.1
f) all z values above 0.1

Question 91: The mean weight of 1000 high-school students is 147 pounds with standard deviation 17 pounds. Assume the weights are normally distributed.
a) How many students weigh between 130 and 164 pounds?
b) How many students weigh between 113 and 181 pounds?
Use z-scores and the table of values for the normal distribution curve (with mean 0 and standard deviation 1) to find …
c) The number of students who weigh between 147 and 152 pounds.
d) The number of students who weigh between 132 and 150 pounds.

Question 92: A company produces cars that last an average of 10.2 years on the road with standard deviation 4.3 years. You buy a car from the company. Assuming that car ages are normally distributed, what are the chances that your car will be on the road for over 20 years?

Question 93: A company produces light bulbs with average lifespan 40 hours (standard deviation 4 hours).
a) Consumer advocates across the nation each test 100 light bulbs and calculate the average life span of the bulb according to their samples. To a close approximation, what mean do they obtain with what standard deviation?
b) A year later they repeat the experiment but this time testing 1000 light bulbs each. To a close approximation, what mean do they obtain with what standard deviation?


Question 94: In a hand of Blackjack one has a 45% chance of winning a dollar and a 55% chance of losing a dollar. Following the same analysis as we did in class for the game of Roulette … a) Show that the mean and standard deviation for a single hand of Blackjack are given by: µ = −0.10 σ = 1.00 (This issue of whether to divide by “n” or “n-1” for standard deviation is annoying. Here I divided by n-1.) b) A habitual gambler attends the casino every night and plays 100 hands of blackjack, betting a dollar each and every time. Find the range of winnings she can expect (99.7% of the time) for each night of gambling. c) The casino sees 100,000 hands of blackjack played per night. Find the range of profit they can expect (99.7% of the time) per night from Blackjack. Question 95: A simple dice game is played as follows: Roll a six, win $4. Roll anything else, lose a dollar. a) Show that the mean and standard deviation for a single play of this game are: µ = −0.167 σ = 2.041 b) A habitual gambler plays 50 rounds of this game every day. Find the range of winnings she can expect (99.7% of the time) for each day of gambling. c) The casino sees 1 000 000 rounds of this game per day. Find the range of profit they can expect (99.7% of the time) per day from this game.
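The remark in Question 94 about dividing by “n” or “n-1” can be seen numerically. The sketch below is an illustration only: it assumes each game is represented by an idealised list of outcomes (100 hands of Blackjack with 45 wins and 55 losses, six rolls of the dice game with one win and five losses), which appears to be consistent with the figures quoted in the two questions but is my guess at the set-up.

```python
from statistics import mean, pstdev, stdev

# Assumed idealised outcome lists (hypothetical representation of each game).
blackjack = [1] * 45 + [-1] * 55   # Question 94: win $1 or lose $1
dice_game = [4] + [-1] * 5         # Question 95: win $4 or lose $1

for name, outcomes in [("Blackjack", blackjack), ("Dice game", dice_game)]:
    print(
        name,
        "| mean:", round(mean(outcomes), 3),
        "| SD dividing by n-1:", round(stdev(outcomes), 3),
        "| SD dividing by n:", round(pstdev(outcomes), 3),
    )
```

Here `statistics.stdev` divides by n-1 and `statistics.pstdev` divides by n, so the two conventions can be compared side by side.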


Question 96: Rabbits who eat carrots only have weights that are normally distributed with mean 12.5 pounds and standard deviation 3.2 pounds. a) Attilla is a rabbit weighing 15.2 pounds. Is this unusual? b) Priscilla is a rabbit weighing 19.1 pounds. Does Priscilla eat only carrots? With what level of confidence do you answer this question? Question 97: All men with the name of JIM have a “handsomeness value” that is normally distributed with mean 86.6 and standard deviation 2.3. a) What proportion of men named Jim have handsomeness value 82 or less? b) What proportion of men named Jim have handsomeness value 93.5 or higher? c) Your instructor has handsomeness value of 100. How many standard deviations above the norm is this? Question 98: The mean weight of floogles is not known, but it is known that their weights vary with standard deviation σ = 4 units. A biologist measured the weights of 60 floogles and found that her sample had mean m = 143.2 . Find the 95% confidence level for the mean weight of all floogles.

Question 99: Gibgobs have heat factors that vary about some unknown mean with standard deviation σ = 12. A scientist measured the heat factor of four gibgobs. He obtained the values:

133, 146, 137 and 140

Find the 95% confidence level for the mean heat factor of all gibgobs.

Question 100: A soccer ball company is meant to produce soccer balls of radius 12.4 cm with an error of at most ±0.4 cm. They set their machines so that the balls they produce have mean µ = 12.4 with standard deviation σ = 0.2. They produce 500 balls per day. How many balls per day must be rejected?


Question 101: Suppose a population has normal distribution.

µ is unknown, but it is suspected to have value 4.
σ = 0.03
A sample of size 50 was found to have average value m = 4.01.

What do you think about the suspicion that µ = 4?

Question 102: Find the p-value for the sample mean of 4.01 in the previous example.

Question 103: A company produces cables with breaking strengths of mean 1800 lb and standard deviation 100 lb. A new manufacturing technique, however, claims to increase the average breaking strength. To test this claim, a sample of 50 cables is examined and is found to have mean breaking strength 1845 lb.
a) If the new technique had no effect on breaking strength, how likely is it that a batch of 50 cables would have an average breaking strength of 1845 lb?
b) Do you think the new technique had an effect?

Question 104: Two drugs, A and B, are being tested for possible cure to a disease. The following data has been collected thus far. Does there seem to be any correlation worth pursuing?


Question 105: Does there seem to be a correlation between favourite colour and shoe size?

Question 106: The following table shows test scores of students in a physics course and the same students in a math course. Does there seem to be a correlation between math and physics proficiency?

Question 107: According to this data do you think there is some connection between marital status and performance in a Prob. and Statistics course?


Question 108: A ball bearing company produces balls with a diameter, hopefully, of mean value µ = 11.5 mm with a tolerance given by standard deviation σ = 0.2. Each day the company selects 20 balls at random and computes the mean of that sample. Over the course of three weeks they collected the following values:

11.602 11.547 11.312 11.449 11.401 11.608 11.471 11.453 11.446 11.522 11.664 11.823 11.629 11.602 11.756 11.707 11.612 11.628 11.602 11.816 11.812

a) Is there a trend moving away from the mean 11.5?
b) What is the first data value that deviates from the 99.7% (three standard deviation) range from the mean?

Question 109: Twenty-five people were asked to try a new type of gum. They were asked whether or not they liked it: yes or no. The results are as follows:

YYNNNNYYYNYNN YNNNNNYYYYNN

How many runs are in this sequence? Does this sequence appear random?

Question 110:
a) In how many different ways can one arrange 3 As and 3 Bs?
b) List all the ways and count the number of runs that appear in each.
c) What is the mean number of runs and what is the standard deviation for the number of runs?

Question 111: Here are the first 20 digits of √2. Do they seem random?

14142135623730950488

Question 112: Use the above/below median test to determine whether or not the following list of data values appears random:

8 15 9 12 10 7 11 8 13 9 11


Question 113: a) Write a list of Hs and Ts that seem random to you. Do a list that is 20 long. (The number of Hs and Ts do not have to be the same.) Test your sequence for randomness. b) Flip a coin 20 times and record the list of Hs and Ts that result. Test the sequence for randomness.

SOME MTEL-TYPE QUESTIONS

Question 114: According to a statistical survey: “Australian men have blood pressure 118 ± 13 ms (with 95% confidence)” What does this mean?
(A) 95% of all Australian men have blood pressure 118.
(B) 95% of all Australian men have blood pressure under 131.
(C) 95% of all Australian men have blood pressure between 105 and 131.
(D) There is a 95% chance that the average blood pressure of all Australian men lies somewhere between 105 and 131.

Question 115: A survey displays milk preferences amongst men and women:

                 Men    Women
Whole Milk       10     3
2% Milk          18     16
Non-Fat          7      15
No Preference    6      12

Does there seem to be a correlation between gender and milk preference? (Actually, MTEL will not ask you to do a chi-squared analysis, but do this one in any case!)


Question 116: Administrators at a local grocery store conduct a survey on the average number of gallons of milk a customer buys per day. They interview N customers per day, and thus work with samples of size N. Administrators later decide to interview 3N customers per day, thereby tripling the sample sizes. What effect will this have on the sampling error?
(A) No effect
(B) Decrease sampling error
(C) Increase sampling error
(D) Not enough information to say.

COMMENT: The term “sampling error” is vague here. The question is really …

Which would yield least spread of data values (that is, least standard deviation): Calculating the mean of samples of size N or calculating the mean of samples of size 3N?

PROBABILITY AND STATISTICS

Informal Course Notes

PART IV of IV

BRIEF INTRODUCTION TO MORE ADVANCED THINKING (and filling in some gaps!) James Tanton © 2008 James Tanton

CONTENTS:
Mean and Variance Revisited: The Human Perspective
One Data Set
Two Data Sets
Playing with Formulas
Vectors
Mean and Variance: The Perspective of the Gods
Random Variables
Sum, differences, multiples
Connection to Central Limit Theorem
Cereal Box Problem
Geometric Distribution
Binomial Distribution
Proportions
Student’s t-distribution
Chi Squared distribution
Chebyshev’s Inequality
Law of Large Numbers

PART FOUR:

2

MEAN AND VARIANCE REVISITED: THE HUMAN PERSPECTIVE

ONE SET OF DATA: A geometric perspective

Suppose we run an experiment and gain from it n data values:

x1 , x2 ,… , xn

and, as mere mortals, we know nothing more about the situation than these n values. (That is, we have no understanding about what to expect from the experiment such as the mean value, the variation from the mean, the underlying frequency of data values behind the scenes, etc.) But if the experiment were “ideal,” meaning that outcomes were absolutely and utterly repeatable, then we would expect no variation in data values at all. This means that all measurements would adopt exactly the same value q, say. How close is our data ( x1 , x2 ,… , xn ) to an ideal ( q, q,… , q ) ? To answer this question we seek a value q so that the point M = ( q, q,… , q ) is as close as possible to our point P = ( x1 , x2 ,… , xn ) . We want to choose a value q that minimizes the distance:

$$PM = \sqrt{(x_1 - q)^2 + (x_2 - q)^2 + \cdots + (x_n - q)^2}$$

It is easier just to minimize the quantity under the square root sign. Now

$$(x_1 - q)^2 + (x_2 - q)^2 + \cdots + (x_n - q)^2 = nq^2 - 2(x_1 + \cdots + x_n)\,q + (x_1^2 + \cdots + x_n^2)$$

is a quadratic in q and has minimum value for:

$$q = \frac{2(x_1 + \cdots + x_n)}{2n} = \frac{x_1 + \cdots + x_n}{n} = \bar{x},$$

the data’s mean.
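To see the claim “the mean is the value of q that minimizes the distance” in action, here is a small brute-force check. It is a sketch only: the data values are made up for illustration, and the grid search is just a crude stand-in for the algebra above.

```python
def squared_distance(data, q):
    """Sum of squared deviations of the data values from the candidate q."""
    return sum((x - q) ** 2 for x in data)

data = [2, 4, 4, 4, 5, 5, 7, 9]          # made-up illustrative data
x_bar = sum(data) / len(data)            # the data mean

# Try every q from 0.00 to 10.00 in steps of 0.01 and keep the best one.
candidates = [i / 100 for i in range(0, 1001)]
best_q = min(candidates, key=lambda q: squared_distance(data, q))

print("data mean:", x_bar)
print("best q found on the grid:", best_q)
print("squared distance at best q:", squared_distance(data, best_q))
```

Moving q slightly to either side of the mean increases the sum of squares, which is what the quadratic in q predicts.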


The minimum value under the square root sign thus occurs when $q = \bar{x}$ and the minimum value is:

$$(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2$$

This is a sum of individual deviations, squared, and in and of itself is a measure of the total spread of values. Dividing by n gives an average spread. The resulting quantity is the VARIANCE of the data:

$$\mathrm{Var}(x_1, \ldots, x_n) = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}$$

COMMENT: If the data is a measurement of length, say, then each xi has a unit of meters perhaps and so Var ( x1 ,… , xn ) has units of meters squared. It is handy to have a measure of spread in the same units as the data. For this reason, folk take the square root of variance and call the result STANDARD DEVIATION:

$$\sigma(x_1, \ldots, x_n) = \sqrt{\mathrm{Var}(x_1, \ldots, x_n)} = \sqrt{\frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}}$$

COMMENT: As we have seen, many texts alter these definitions slightly. Mathematicians note the following:

FACT: $(x_1 - \bar{x}) + (x_2 - \bar{x}) + \cdots + (x_n - \bar{x})$ equals zero. (EXERCISE: Show this!)

Thus if one knows the value of n − 1 of the terms $(x_1 - \bar{x}), (x_2 - \bar{x}), \ldots, (x_n - \bar{x})$, then one can deduce the value of the nth one from the fact that their sum should be zero.

So among the values $(x_1 - \bar{x}), (x_2 - \bar{x}), \ldots, (x_n - \bar{x})$, there are only n − 1 real pieces of information. To reflect this, many choose to divide by n − 1 rather than n and set

$$\mathrm{Var}(x_1, \ldots, x_n) = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$

and

$$\sigma(x_1, \ldots, x_n) = \sqrt{\frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$$

IN THIS CHAPTER OF THE NOTES WE SHALL DIVIDE BY n.
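The two conventions are easy to compare on a computer. The following sketch uses Python’s standard `statistics` module, where `pvariance` divides by n and `variance` divides by n − 1; the data values are made up for illustration.

```python
from statistics import mean, pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]     # made-up illustrative data
n = len(data)
x_bar = mean(data)

# FACT from the text: the deviations from the data mean sum to zero.
deviations = [x - x_bar for x in data]
print("sum of deviations:", sum(deviations))

var_n = pvariance(data)             # divide by n (convention of this chapter)
var_n_minus_1 = variance(data)      # divide by n - 1 (convention of many texts)
print("variance dividing by n:    ", var_n)
print("variance dividing by n - 1:", var_n_minus_1)
```

The two answers differ only by the factor n/(n − 1), so for large data sets the choice of convention matters very little.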


TWO SETS OF DATA: A summary and a geometric interpretation

Suppose we run an experiment and record two sets of data values from it. (For example, our experiment could be to ask passersby for their heights and their shoe sizes.) We have data values:

x1 , x2 ,… , xn
y1 , y2 ,… , yn

We can plot the points ( xi , yi ) on a diagram to create a SCATTER PLOT.

The plot might reveal a relationship (CORRELATION) between the data values. If there seems to be a linear correlation, then one might be interested in finding a straight line that approximates the data points well.

LINE OF BEST FIT: It seems reasonable to believe that the “best” line for the data should go through the most average point for the data: $(\bar{x}, \bar{y})$. So the line of best fit should have an equation of the form:

$$y - \bar{y} = m(x - \bar{x})$$

for some best slope m yet to be determined. The line of best fit should minimize the total sum of squared deviations from that line. So for the data point $x_i$ the line of best fit predicts the value $m(x_i - \bar{x}) + \bar{y}$ compared to the actual data value $y_i$. We need a value m that minimizes:

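$$\begin{aligned}
D &= \bigl(y_1 - \bar{y} - m(x_1 - \bar{x})\bigr)^2 + \cdots + \bigl(y_n - \bar{y} - m(x_n - \bar{x})\bigr)^2 \\
  &= m^2\bigl((x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2\bigr) - 2m\bigl((x_1 - \bar{x})(y_1 - \bar{y}) + \cdots + (x_n - \bar{x})(y_n - \bar{y})\bigr) \\
  &\quad + \bigl((y_1 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2\bigr)
\end{aligned}$$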

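This is a quadratic in m and has minimum value for:

$$m = \frac{(x_1 - \bar{x})(y_1 - \bar{y}) + \cdots + (x_n - \bar{x})(y_n - \bar{y})}{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}$$

Folk define:

$$S_{xx} = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n} = \mathrm{Var}(x_1, \ldots, x_n)$$

$$S_{xy} = \frac{(x_1 - \bar{x})(y_1 - \bar{y}) + \cdots + (x_n - \bar{x})(y_n - \bar{y})}{n}$$

$$S_{yy} = \frac{(y_1 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n} = \mathrm{Var}(y_1, \ldots, y_n)$$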

And the line of best fit (LEAST SQUARES METHOD) is:

$$y - \bar{y} = \frac{S_{xy}}{S_{xx}}\,(x - \bar{x})$$
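The least squares recipe translates directly into code. The sketch below is one possible way to organise the computation; the function name `least_squares_line` and the four data points are made up for illustration, and the sums are divided by n as in this chapter (the factor cancels in the slope anyway).

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the line through (x-bar, y-bar)
    whose slope is S_xy / S_xx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar   # rewrite y - y_bar = m (x - x_bar) as y = m x + b
    return slope, intercept

xs = [1, 2, 3, 4]      # made-up illustrative data
ys = [2, 4, 5, 9]
slope, intercept = least_squares_line(xs, ys)
print(f"line of best fit: y = {slope:.3f} x + {intercept:.3f}")
```

By construction the fitted line passes through the “most average point” of the data, which is a quick sanity check on any output.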

CORRELATION COEFFICIENT: We created a line that minimizes the total amount of scattering D of y-values about that line. Here:

$$D = \bigl(y_1 - \bar{y} - m(x_1 - \bar{x})\bigr)^2 + \cdots + \bigl(y_n - \bar{y} - m(x_n - \bar{x})\bigr)^2$$

The quantity

$$T = (y_1 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2$$

is a measure of the amount of scattering of y-values in general. We can also view this as the amount of scattering about the horizontal line $y = \bar{y}$, which is not the line of best fit. Since D is the minimal value for all lines, we have $D \le T$. As we have seen, the proportion $\frac{T - D}{T}$, with value between 0 and 1, is a measure of “desired scattering.” To make sense of this note that:

$\frac{T - D}{T} = 0$ means $T = D$, which says that the amount of scattering about a supposed line of best fit is no different from the amount of scattering in general. THERE IS NO CORRELATION between the data values at all.

$\frac{T - D}{T} = 1$ means $D = 0$, which says that there is absolutely no scatter about the line of best fit, that is, the data fits this line exactly. We have PERFECT LINEAR CORRELATION between the data values.

An exercise in algebra gives:

$$\frac{T - D}{T} = \frac{(S_{xy})^2}{S_{xx} S_{yy}}$$

and people usually denote this quantity $R^2$, calling it THE CORRELATION COEFFICIENT.

COMMENT: Actually people usually set

$$R = \pm\sqrt{\frac{(S_{xy})^2}{S_{xx} S_{yy}}}$$

using the + sign if the slope m is positive and the – sign if m is negative.

IF $(x_1 - \bar{x})(y_1 - \bar{y}) + \cdots + (x_n - \bar{x})(y_n - \bar{y}) = 0$ THEN THERE IS ABSOLUTELY NO CORRELATION BETWEEN DATA VALUES.
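The “exercise in algebra” can at least be checked numerically. The sketch below computes the proportion (T − D)/T directly from the scatter and compares it with the formula in terms of S_xx, S_xy and S_yy. The data values and variable names are made up for illustration.

```python
from math import sqrt

xs = [1, 2, 3, 4]      # made-up illustrative data
ys = [2, 4, 5, 9]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

s_xx = sum((x - x_bar) ** 2 for x in xs) / n
s_yy = sum((y - y_bar) ** 2 for y in ys) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
m = s_xy / s_xx                                   # slope of the line of best fit

# T: scatter about the horizontal line y = y_bar.
T = sum((y - y_bar) ** 2 for y in ys)
# D: scatter about the line of best fit.
D = sum((y - y_bar - m * (x - x_bar)) ** 2 for x, y in zip(xs, ys))

r_squared_from_scatter = (T - D) / T
r_squared_from_formula = s_xy ** 2 / (s_xx * s_yy)
# Attach the sign of the slope, as described in the COMMENT.
R = sqrt(r_squared_from_formula) if m >= 0 else -sqrt(r_squared_from_formula)

print("(T - D) / T          :", round(r_squared_from_scatter, 6))
print("S_xy^2 / (S_xx S_yy) :", round(r_squared_from_formula, 6))
print("R                    :", round(R, 6))
```

The two routes agree up to rounding error, as the algebra promises.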


PLAYING WITH FORMULAS FOR THE FUN OF IT:

Given two sets of data values from an experiment (which we label set X and set Y):

X : x1 , x2 ,… , xn
Y : y1 , y2 ,… , yn

we can create new data sets by adding or multiplying all pairs of values (which we label X + Y and XY ):

X + Y : x1 + y1 , x1 + y2 ,… , xn + yn
XY : x1 y1 , x1 y2 ,… , xn yn

NOTE: If there are originally n data values for X and for Y, there are $n^2$ values for X + Y and for XY .

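The mean of X + Y : The average value of the X + Y data set is:

$$\begin{aligned}
\frac{(x_1 + y_1) + (x_1 + y_2) + \cdots + (x_n + y_n)}{n^2}
&= \frac{nx_1 + (y_1 + \cdots + y_n) + nx_2 + (y_1 + \cdots + y_n) + \cdots + nx_n + (y_1 + \cdots + y_n)}{n^2} \\
&= \frac{n^2\bar{x} + n^2\bar{y}}{n^2} \\
&= \bar{x} + \bar{y}
\end{aligned}$$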

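The mean of XY : The average value of the XY data set is:

$$\begin{aligned}
\frac{x_1 y_1 + x_1 y_2 + \cdots + x_n y_n}{n^2}
&= \frac{x_1(y_1 + \cdots + y_n) + x_2(y_1 + \cdots + y_n) + \cdots + x_n(y_1 + \cdots + y_n)}{n^2} \\
&= \frac{nx_1\bar{y} + nx_2\bar{y} + \cdots + nx_n\bar{y}}{n^2} \\
&= \frac{n^2\,\bar{x}\,\bar{y}}{n^2} = \bar{x}\,\bar{y}
\end{aligned}$$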

The variance of X + Y : The variance of the X + Y data set, about its data mean of $\bar{x} + \bar{y}$, is a little long and scary looking, but actually not too tricky to work out. Here goes. The sum of the squared deviations over all $n^2$ pairs is:

$$\begin{aligned}
&(x_1 + y_1 - \bar{x} - \bar{y})^2 + (x_1 + y_2 - \bar{x} - \bar{y})^2 + \cdots + (x_n + y_n - \bar{x} - \bar{y})^2 \\
&\quad = (x_1 - \bar{x})^2 + 2(x_1 - \bar{x})(y_1 - \bar{y}) + (y_1 - \bar{y})^2 \\
&\qquad + (x_1 - \bar{x})^2 + 2(x_1 - \bar{x})(y_2 - \bar{y}) + (y_2 - \bar{y})^2 \\
&\qquad + \cdots + (x_n - \bar{x})^2 + 2(x_n - \bar{x})(y_n - \bar{y}) + (y_n - \bar{y})^2 \\
&\quad = n(x_1 - \bar{x})^2 + \cdots + n(x_n - \bar{x})^2 + n(y_1 - \bar{y})^2 + \cdots + n(y_n - \bar{y})^2 \\
&\qquad + 2(x_1 - \bar{x})\bigl((y_1 - \bar{y}) + (y_2 - \bar{y}) + \cdots + (y_n - \bar{y})\bigr) \\
&\qquad + 2(x_2 - \bar{x})\bigl((y_1 - \bar{y}) + (y_2 - \bar{y}) + \cdots + (y_n - \bar{y})\bigr) \\
&\qquad \;\;\vdots \\
&\qquad + 2(x_n - \bar{x})\bigl((y_1 - \bar{y}) + (y_2 - \bar{y}) + \cdots + (y_n - \bar{y})\bigr) \\
&\quad = n(x_1 - \bar{x})^2 + \cdots + n(x_n - \bar{x})^2 + n(y_1 - \bar{y})^2 + \cdots + n(y_n - \bar{y})^2 + 0 + 0 + \cdots + 0
\end{aligned}$$

[We used the fact that $(y_1 - \bar{y}) + (y_2 - \bar{y}) + \cdots + (y_n - \bar{y}) = 0$.] Divide by $n^2$ to get:

$$\begin{aligned}
\mathrm{Var}(X + Y) &= \frac{n(x_1 - \bar{x})^2 + \cdots + n(x_n - \bar{x})^2}{n^2} + \frac{n(y_1 - \bar{y})^2 + \cdots + n(y_n - \bar{y})^2}{n^2} \\
&= \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n} + \frac{(y_1 - \bar{y})^2 + \cdots + (y_n - \bar{y})^2}{n} \\
&= \mathrm{Var}(X) + \mathrm{Var}(Y)
\end{aligned}$$

We have: FOR ANY TWO DATA SETS X AND Y:

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$$

(for variance computed with respect to the data mean).

COMMENT: There is no easy formula for Var(XY). (Try it!)
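These three results about the data sets X + Y and XY (built from all n² pairs) are easy to test on a small example. The sketch below uses made-up data and Python’s standard library; `pvariance` is the “divide by the number of values” variance.

```python
from itertools import product
from statistics import mean, pvariance

X = [1, 2, 3, 4]       # made-up illustrative data
Y = [2, 4, 5, 9]

# All n^2 pairwise sums and products, as in the definition of X + Y and XY.
sums = [x + y for x, y in product(X, Y)]
products = [x * y for x, y in product(X, Y)]

print("mean of X + Y:", mean(sums), " vs  mean X + mean Y:", mean(X) + mean(Y))
print("mean of XY   :", mean(products), " vs  mean X * mean Y:", mean(X) * mean(Y))
print("Var(X + Y)   :", pvariance(sums), " vs  Var X + Var Y:", pvariance(X) + pvariance(Y))

# No such luck for the product, as the COMMENT warns.
print("Var(XY)      :", pvariance(products), " vs  Var X * Var Y:", pvariance(X) * pvariance(Y))
```

The first three lines should show matching pairs of numbers; the last line shows that the variance of the products is not simply the product of the variances.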

ASIDE ON VECTORS: Given a set of data values x1 , x2 ,… , xn form the vector

$$v_x = \langle x_1 - \bar{x},\; x_2 - \bar{x},\; \ldots,\; x_n - \bar{x} \rangle$$

This vector has the property that its entries sum to zero. Our formulas can be rewritten in terms of vector notation. For example,

$$\mathrm{Var}(X) = \frac{\|v_x\|^2}{n} \qquad \sigma(X) = \frac{\|v_x\|}{\sqrt{n}} \qquad S_{xy} = \frac{v_x \cdot v_y}{n}$$

$$R^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}} = \frac{(v_x \cdot v_y)^2}{\|v_x\|^2\,\|v_y\|^2} = \left(\frac{v_x}{\|v_x\|} \cdot \frac{v_y}{\|v_y\|}\right)^2 = \cos^2\theta$$

where θ is the angle between the vectors $v_x$ and $v_y$.
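The vector point of view is short to code up. In the sketch below the helper names (`deviation_vector`, `dot`, `norm`) and the data are made up for illustration; the point is only that the cosine of the angle between the two deviation vectors, squared, is the R² of the previous section.

```python
from math import sqrt

def deviation_vector(data):
    """The vector <x_1 - x_bar, ..., x_n - x_bar>."""
    x_bar = sum(data) / len(data)
    return [x - x_bar for x in data]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return sqrt(dot(u, u))

v_x = deviation_vector([1, 2, 3, 4])   # made-up illustrative data
v_y = deviation_vector([2, 4, 5, 9])

cos_theta = dot(v_x, v_y) / (norm(v_x) * norm(v_y))
print("entries of v_x sum to:", sum(v_x))
print("cos(theta)  :", round(cos_theta, 6))
print("cos^2(theta):", round(cos_theta ** 2, 6))
```

A value of cos θ near ±1 means the two deviation vectors are nearly parallel, which is the geometric picture of strong linear correlation.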


MEAN AND VARIANCE: THE PERSPECTIVE OF THE GODS

Let’s now assume that we are omniscient and are fully aware of all information about all experiments ever run. For any experiment we now assume we know all possible values that can occur and the likelihood of each and every particular value actually occurring. That is, we know the PROBABILITY DISTRIBUTION of any given experiment.

Definition: A (discrete) RANDOM VARIABLE X is a set of values x1 , x2 , x3 ,… along with a set of probabilities p1 , p2 , p3 ,… with pi representing the chances of the value xi actually appearing. (We have that the sum of probabilities p1 + p2 + p3 + ⋯ is 1.)

COMMENT: We are usually not God-like and do not know the true nature of a random variable. For example, who really knows the probability distribution of the number of hummingbirds that will visit someone’s feeder during a given hour of the day while the homeowner happens to be watching the TV tuned to a prime-numbered station? On occasion we mere mortals do have glimpses into the world of the Gods. For example, let X be the random variable: the value shown by a roll of a fair die. We know the values of this random variable are 1, 2, 3, 4, 5, 6 with probability distribution

$$\tfrac{1}{6},\; \tfrac{1}{6},\; \tfrac{1}{6},\; \tfrac{1}{6},\; \tfrac{1}{6},\; \tfrac{1}{6}.$$

Definition: The EXPECTED VALUE, denoted E(X) or µ , of a random variable X is the quantity:

$$\mu = E(X) = p_1 x_1 + p_2 x_2 + p_3 x_3 + \cdots$$

This is the Gods’ version of the “average value.” To see why, imagine we ran the experiment n times. Then pi , in an ideal setting, is the proportion of times we can expect to see the value xi . So for n runs of the experiment we should expect to see xi a total of npi times. The average result we expect to see, in the ideal case, is thus:

$$\frac{p_1 n x_1 + p_2 n x_2 + p_3 n x_3 + \cdots}{n} = p_1 x_1 + p_2 x_2 + p_3 x_3 + \cdots = E(X)$$

The expected value of rolling a die is

$$\mu = \tfrac{1}{6}\cdot 1 + \tfrac{1}{6}\cdot 2 + \tfrac{1}{6}\cdot 3 + \tfrac{1}{6}\cdot 4 + \tfrac{1}{6}\cdot 5 + \tfrac{1}{6}\cdot 6 = 3.5$$
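A discrete random variable is just a list of values paired with probabilities, so a dictionary models it nicely. The sketch below is one way to do this; the function name `expected_value` is my own choice, and exact fractions are used so that no rounding creeps in.

```python
from fractions import Fraction

def expected_value(distribution):
    """distribution maps each value x_i to its probability p_i."""
    if sum(distribution.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(p * x for x, p in distribution.items())

# A fair die: values 1..6, each with probability 1/6.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

mu = expected_value(fair_die)
print("E(X) =", mu, "=", float(mu))
```

The same function works for any finite distribution, for example a biased die, provided the probabilities supplied add up to one.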

COMMENT: We need to be careful to distinguish between the ideal God-like situation and the non-ideal mortal reality. When running an experiment, such as rolling a die, it is unlikely our data values will ever actually have mean value E(X)! (Try it. Roll a die six times and compute the average result. It probably isn’t 3.5 - though it can happen!)

Suppose we do run an experiment n times and obtain n data values: x1 , x2 ,… , xn . God knows the value of the “true mean” µ = E ( X ). We don’t know the true mean. We can only compute the data mean:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

We can hope that the data mean $\bar{x}$ is a close approximation to the true mean µ .

COMMENT: We saw for the data mean that the values $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$ sum to zero and so represent only n − 1 truly independent values. In the God world, the values $x_1 - \mu, x_2 - \mu, \ldots, x_n - \mu$ are not guaranteed to sum to zero and so actually do represent n independent values. So … From the perspective of the Gods, dividing by n , rather than by n − 1 , is always appropriate.

Moving on … As mere mortals we defined the variance as the average of the squared deviations from the data mean. The Gods’ analogue of variance is thus:

$$\mathrm{Var}(X) = (x_1 - \mu)^2 p_1 + (x_2 - \mu)^2 p_2 + \cdots$$

And the standard deviation of a random variable is:

$$\sigma(X) = \sqrt{\mathrm{Var}(X)} = \sqrt{(x_1 - \mu)^2 p_1 + (x_2 - \mu)^2 p_2 + \cdots}$$
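The same dictionary idea gives the variance and standard deviation of a random variable. This is a sketch under the definitions just stated; the function names are my own, and exact fractions are used for the variance before taking a floating-point square root.

```python
from fractions import Fraction
from math import sqrt

def expected_value(distribution):
    """Sum of p_i * x_i over a {value: probability} dictionary."""
    return sum(p * x for x, p in distribution.items())

def variance(distribution):
    """Sum of p_i * (x_i - mu)^2, the probability-weighted squared deviation."""
    mu = expected_value(distribution)
    return sum(p * (x - mu) ** 2 for x, p in distribution.items())

fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

var_die = variance(fair_die)
sigma_die = sqrt(var_die)
print("Var(X) =", var_die, "=", round(float(var_die), 3))
print("sigma  =", round(sigma_die, 3))
```

For the fair die this gives a variance of 35/12, roughly 2.92, and a standard deviation of roughly 1.71.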

Playing with formulas: If X is a random variable and k is a constant, then kX is the random variable with all values multiplied by k with the same underlying probability distribution; and X + k is the random variable with all values increased by k. (See the section on sums, differences and multiples of random variables for more.) We shall prove in the next section:

THEOREM:

$$E(kX) = kE(X) \qquad \mathrm{Var}(kX) = k^2\,\mathrm{Var}(X)$$

$$E(X + k) = E(X) + k \qquad \mathrm{Var}(X + k) = \mathrm{Var}(X)$$

If X and Y are two random variables then X + Y is the random variable with values $x_i + y_j$ where $x_i$ is a value adopted by X, $y_j$ is a value adopted by Y, and the probability associated with $x_i + y_j$ is $P(X = x_i \text{ and } Y = y_j)$. How one computes this probability depends on the nature of X and Y. The nicest situation of all would be if, as in naïve probability theory, the word “and” continues to translate into an action of multiplication.

Definition: Two random variables are INDEPENDENT if $P(X = x_i \text{ and } Y = y_j)$ equals the product $P(X = x_i) \cdot P(Y = y_j)$.

For example, if X is the roll of a die and Y is the spin of a spinner with numbers 1 through 10 (each equally likely), then there are 60 equally likely outcomes for the pair (X, Y) and $P(X = 3 \text{ and } Y = 8)$, say, equals $\frac{1}{60}$. This equals $\frac{1}{6} \cdot \frac{1}{10}$, which is $P(X = 3) \cdot P(Y = 8)$. Here X and Y are independent.

THEOREM: If X and Y are independent, then

$$E(X + Y) = E(X) + E(Y) \qquad E(XY) = E(X) \cdot E(Y)$$

$$\mathrm{Var}(X + Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) \qquad \mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$$

Proof: Next section.

□
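Before reading the proofs, one can check these identities exactly for the die-and-spinner example. The sketch below is an illustration, not a proof: the helper `combine` is a made-up name, and it builds the distribution of X + Y, X − Y or XY by multiplying probabilities, which is precisely the independence assumption.

```python
from fractions import Fraction
from itertools import product

def expected_value(dist):
    return sum(p * x for x, p in dist.items())

def variance(dist):
    mu = expected_value(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items())

def combine(dist_x, dist_y, op):
    """Distribution of op(X, Y) for INDEPENDENT X and Y:
    each pair (x, y) gets probability P(X = x) * P(Y = y)."""
    result = {}
    for (x, px), (y, py) in product(dist_x.items(), dist_y.items()):
        value = op(x, y)
        result[value] = result.get(value, 0) + px * py
    return result

die = {v: Fraction(1, 6) for v in range(1, 7)}        # X: roll of a fair die
spinner = {v: Fraction(1, 10) for v in range(1, 11)}  # Y: spinner numbered 1..10

total = combine(die, spinner, lambda x, y: x + y)
difference = combine(die, spinner, lambda x, y: x - y)
prod_dist = combine(die, spinner, lambda x, y: x * y)
tripled = {3 * x: p for x, p in die.items()}          # the random variable 3X

print("E(X+Y) =", expected_value(total), " E(X)+E(Y) =", expected_value(die) + expected_value(spinner))
print("E(XY)  =", expected_value(prod_dist), " E(X)E(Y) =", expected_value(die) * expected_value(spinner))
print("Var(X+Y) =", variance(total), " Var(X-Y) =", variance(difference))
print("Var(3X)  =", variance(tripled), " 9 Var(X) =", 9 * variance(die))
```

Because exact fractions are used, the two sides of each identity print as identical fractions rather than as nearly equal decimals.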

COMMENT: It is curious that Var ( X + Y ) = Var ( X ) + Var (Y ) is always true in our mortal world (calculated with data means) and not always true for the God-world (calculated with actual means). We require the condition that X and Y should be independent for this result to hold in the God world. What is the difference? In our mortal world we examined two data sets:

x1 , x2 ,… , xn
y1 , y2 ,… , yn

We did not know the true means of the underlying random variables X and Y, but calculated instead just the data means $\bar{x}$ and $\bar{y}$:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}x_1 + \frac{1}{n}x_2 + \cdots + \frac{1}{n}x_n$$

$$\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n} = \frac{1}{n}y_1 + \frac{1}{n}y_2 + \cdots + \frac{1}{n}y_n$$

But these expressions each look like the expected value of a random variable. Let’s create our own, human, random variables: X’ and Y’:

X’ has values x1 , x2 ,… , xn with probabilities $\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}$
Y’ has values y1 , y2 ,… , yn with probabilities $\frac{1}{n}, \frac{1}{n}, \ldots, \frac{1}{n}$

Then E(X’) is the data mean $\bar{x}$ and E(Y’) is the data mean $\bar{y}$. X’ and Y’ aren’t the real random variables lurking behind the data sets, but they are the ones we “see” by only looking at the data. We next calculated the sum of data values:

x1 + y1 , x1 + y2 ,… , xn + yn

This gives $n^2$ values and we computed their mean as

$$\frac{(x_1 + y_1) + (x_1 + y_2) + \cdots + (x_n + y_n)}{n^2}.$$

But in doing this we tacitly assumed that each of the pairs in this sum, $x_i + y_j$, has the same frequency as any other pair. That is, we assumed each pair is equally likely, and so, since there are $n^2$ pairs in all, each pair $x_i + y_j$ comes with probability $\frac{1}{n^2}$.

But this makes X’ and Y’ independent random variables:

$$P(X' = x_i \text{ and } Y' = y_j) = \frac{1}{n^2}$$

$$P(X' = x_i) \cdot P(Y' = y_j) = \frac{1}{n} \cdot \frac{1}{n} = \frac{1}{n^2}$$

So

$$P(X' = x_i \text{ and } Y' = y_j) = P(X' = x_i) \cdot P(Y' = y_j).$$

Our human constructs X’ and Y’ obey all the conditions of the Gods and so obey:

$$\mathrm{Var}(X' + Y') = \mathrm{Var}(X') + \mathrm{Var}(Y')$$

We did not realize at the time, but in Part I of these notes we were implicitly drawn to mimicking the ways of the Gods! (Such is always the wont of mankind?)


RANDOM VARIABLES: THE THEORY AND PROOFS

Loosely speaking … A random variable X is a quantity whose value is not known but whose probability of taking a particular value or a range of values is known. Thus random variables come a priori with a probability distribution function in mind (at least in principle). A random variable is said to be discrete if it adopts only finitely many possible values (each having a known probability of occurring) or a list of possible values. It is continuous if it can adopt a continuous range of values with probability values P ( a ≤ X ≤ b) known and given by a probability distribution curve.

Example: A roll of a die is a random variable X. It can have values 1, 2, 3, 4, 5, or 6 each with probability defined to be $\frac{1}{6}$.

Example: A roll of a biased die could be a random variable Y with probability distribution:

Example: The height of a person chosen at random is a (continuous) random variable H with probability distribution assumed to be a normal curve.

PART FOUR: 16

Definition: Suppose a discrete random variable X has value x with probability P(x). Then the expected value (or mean) of the random variable is:

$$\mu = E(X) = \sum x \cdot P(x)$$

(Here $\sum$ denotes summation. So if X takes on values $x_1, x_2, \ldots, x_n$ with probabilities $p_1, p_2, \ldots, p_n$, then this expression is stating: $\mu = E(X) = x_1 p_1 + x_2 p_2 + \cdots + x_n p_n$.)

Example: The expected value of rolling an ordinary die is, as before,

$$E(X) = \frac{1}{6} \cdot 1 + \frac{1}{6} \cdot 2 + \frac{1}{6} \cdot 3 + \frac{1}{6} \cdot 4 + \frac{1}{6} \cdot 5 + \frac{1}{6} \cdot 6 = 3.5$$

COMMENT: If the random variable is continuous, then the summation is replaced by an integral (“continuous summation”).

Generalising the definition of variance and standard deviation …

Definition: The variance of a discrete random variable X is:

$$\sigma^2 = Var(X) = \sum (x - \mu)^2 P(x)$$

Its standard deviation is $\sigma = SD(X) = \sqrt{Var(X)}$. (Again, there is an integral analogue for continuous random variables.)

Example: The variance of rolling an ordinary die is

$$\sigma^2 = (1 - 3.5)^2 \tfrac{1}{6} + (2 - 3.5)^2 \tfrac{1}{6} + (3 - 3.5)^2 \tfrac{1}{6} + (4 - 3.5)^2 \tfrac{1}{6} + (5 - 3.5)^2 \tfrac{1}{6} + (6 - 3.5)^2 \tfrac{1}{6} = \frac{35}{12} \approx 2.92$$

and so its standard deviation is $\sigma = \sqrt{35/12} \approx 1.71$.

NOTE: The “n − 1” versus “n” issue comes into play here! This definition assumes the “divide by n” convention.

NOTE: $Var(X) = E\left((X - \mu)^2\right)$
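The two definitions translate directly into code. The following is a minimal sketch in which a distribution is stored as a dictionary mapping each value to its probability; the helper names `expected_value` and `variance` are illustrative choices, not notation from these notes.

```python
from fractions import Fraction
from math import sqrt

def expected_value(dist):
    """E(X) = sum of x * P(x) over all values x."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var(X) = sum of (x - mu)^2 * P(x), the 'divide by n' convention."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

# An ordinary die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

mean_die = expected_value(die)
var_die = variance(die)
sd_die = sqrt(var_die)

print(mean_die, var_die, sd_die)
```

Using exact fractions avoids rounding in the mean and variance; only the final square root is a floating-point number.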

PART FOUR: 17

SUMS, DIFFERENCES and MULTIPLES OF RANDOM VARIABLES

If all the numbers on a die are doubled, then we’d expect the mean value of a roll to double (from 3.5 to 7) and the standard deviation (spread) of values to double as well (from about 1.71 to about 3.42).

$$E(2X) = 2E(X) \qquad SD(2X) = 2\,SD(X)$$

(By “2X” we mean a new random variable with values double those of X, but with the same underlying probability distribution.) In general:

$$E(aX) = aE(X) \qquad Var(aX) = a^2 Var(X)$$

so that

$$SD(aX) = |a|\, SD(X)$$

Proof: If X has values $x_1, x_2, \ldots, x_n$ with probabilities $p_1, p_2, \ldots, p_n$, then aX has values $ax_1, ax_2, \ldots, ax_n$ with probabilities $p_1, p_2, \ldots, p_n$. Thus:

$$E(aX) = ax_1 p_1 + ax_2 p_2 + \cdots + ax_n p_n = a(x_1 p_1 + \cdots + x_n p_n) = aE(X)$$

If $\mu = E(X)$ then

$$Var(aX) = (ax_1 - a\mu)^2 p_1 + \cdots + (ax_n - a\mu)^2 p_n = a^2 \left( (x_1 - \mu)^2 p_1 + \cdots + (x_n - \mu)^2 p_n \right) = a^2 Var(X)$$

□

NOTE: These proofs are more compactly written in $\sum$ notation:

$$E(aX) = \sum ax \cdot P(x) = a \sum x \cdot P(x) = aE(X)$$

$$Var(aX) = \sum (ax - a\mu)^2 P(x) = a^2 \sum (x - \mu)^2 P(x) = a^2 Var(X)$$

We shall follow this style of presentation in the proofs that follow. (But of course it is always possible – and usually helpful – to translate the lines in these proofs back to the form without $\sum$ notation.)

PART FOUR: 18

Suppose we add “3” to all the values of the die. Then we’d expect the mean roll to increase by three (from 3.5 to 6.5) but the spread of values, the standard deviation, not to change:

$$E(X + 3) = E(X) + 3 \qquad Var(X + 3) = Var(X)$$

In general:

$$E(X + c) = E(X) + c \qquad Var(X + c) = Var(X)$$

Proof:

$$E(X + c) = \sum (x + c) P(x) = \sum x P(x) + c \sum P(x) = E(X) + c \cdot 1 = E(X) + c$$

(Note that all probabilities sum to 1. Thus $\sum P(x) = 1$.)

If we set $E(X) = \mu$ we have just established that $E(X + c) = \mu + c$. We thus have:

$$Var(X + c) = \sum \left( (x + c) - (\mu + c) \right)^2 P(x) = \sum (x - \mu)^2 P(x) = Var(X)$$

□
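The scaling and shifting rules can be confirmed on the die. This is a small sketch, with `transform` a hypothetical helper that builds the distribution of aX + c by relabelling values while keeping their probabilities.

```python
from fractions import Fraction

def mean(dist):
    return sum(x * p for x, p in dist.items())

def var(dist):
    mu = mean(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

def transform(dist, a, c):
    """Distribution of aX + c: new values, same probabilities."""
    out = {}
    for x, p in dist.items():
        # Accumulate in case two values land on the same number (e.g. a = 0).
        out[a * x + c] = out.get(a * x + c, 0) + p
    return out

die = {face: Fraction(1, 6) for face in range(1, 7)}
doubled = transform(die, 2, 0)   # the random variable 2X
shifted = transform(die, 1, 3)   # the random variable X + 3

print(mean(doubled), var(doubled))   # expect 2*E(X) and 4*Var(X)
print(mean(shifted), var(shifted))   # expect E(X)+3 and Var(X)
```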

Let X be the random variable associated to rolling a die, and Y the random variable of rolling the die a second time. (OR X and Y can be the random variables associated to rolling two different dice simultaneously.) Then X+Y is the random variable that corresponds to the sum of the two rolls and XY the random variable that corresponds to the product of the two rolls.

Definition: Two (discrete) random variables X and Y with probability distributions P ( X = x) and P (Y = y ) , respectively, are said to be independent if the probability distribution P(X = x and Y = y) is given by the product P ( X = x) ⋅ P (Y = y ) .

Example: If X and Y are the rolls of two separate dice, then

$$P(X = 3 \text{ and } Y = 2) = \frac{1}{36} = \frac{1}{6} \cdot \frac{1}{6} = P(X = 3) \cdot P(Y = 2).$$

This is true for all values (not just X = 3 and Y = 2) and so X and Y are independent.

PART FOUR: 19

The following formulas are true, but they take some work to establish:

For independent random variables X and Y:

$$E(X + Y) = E(X) + E(Y)$$

$$E(XY) = E(X) \cdot E(Y)$$

$$Var(X + Y) = Var(X) + Var(Y)$$

Example: For X and Y the rolls of two dice,

$$\begin{aligned}
E(X + Y) &= 2P(X = 1 \text{ and } Y = 1) + 3P(X = 1 \text{ and } Y = 2) + 3P(X = 2 \text{ and } Y = 1) + \cdots + 12P(X = 6 \text{ and } Y = 6) \\
&= 2 \cdot \frac{1}{36} + 3 \cdot \frac{1}{36} + 3 \cdot \frac{1}{36} + \cdots + 12 \cdot \frac{1}{36} \\
&= 7
\end{aligned}$$

And $E(X) + E(Y) = 3.5 + 3.5 = 7$.

Notice: From the third line we have that

$$Var(X - Y) = Var(X + (-Y)) = Var(X) + Var(-Y) = Var(X) + (-1)^2 Var(Y) = Var(X) + Var(Y)$$

That is,

Var ( X − Y ) = Var ( X ) + Var (Y )
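For two dice all of these identities can be verified by brute force over the 36 equally likely pairs. The sketch below does so with exact fractions; `expect` and `var_of` are illustrative helper names.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p_pair = Fraction(1, 36)   # independence: P(X=x and Y=y) = 1/6 * 1/6

def expect(f):
    """E(f(X, Y)), summing over all 36 pairs of rolls."""
    return sum(f(x, y) * p_pair for x, y in product(faces, faces))

def var_of(f):
    """Var(f(X, Y)) = E((f - E(f))^2)."""
    mu = expect(f)
    return expect(lambda x, y: (f(x, y) - mu) ** 2)

e_sum = expect(lambda x, y: x + y)     # E(X + Y)
e_prod = expect(lambda x, y: x * y)    # E(XY)
v_sum = var_of(lambda x, y: x + y)     # Var(X + Y)
v_diff = var_of(lambda x, y: x - y)    # Var(X - Y)
v_one = var_of(lambda x, y: x)         # Var(X) for a single die

print(e_sum, e_prod, v_sum, v_diff, 2 * v_one)
```

Note that the variance of the difference comes out equal to the variance of the sum, as the last identity predicts.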

PART FOUR: 20

Proof (OPTIONAL READING): Suppose s is a given value and we wish to compute the probability that the sum X + Y equals s. We can do this by listing all the pairs of x-values and y-values that sum to s. Suppose this list appears:

$$x_1 + y_1 = s, \quad x_2 + y_2 = s, \quad \ldots, \quad x_k + y_k = s$$

Then

$$\begin{aligned}
P(X + Y = s) &= P(X = x_1, Y = y_1 \text{ OR } X = x_2, Y = y_2 \text{ OR } \ldots \text{ OR } X = x_k, Y = y_k) \\
&= \sum_{x + y = s} P(X = x, Y = y) \\
&= \sum_{x + y = s} P(X = x) P(Y = y)
\end{aligned}$$

where $\sum_{x + y = s}$ denotes summation over all the pairs of x- and y-values that sum to s.

Note that the double sum $\sum_s \sum_{x + y = s}$ is the same as summing over all x- and y-values: $\sum_x \sum_y$.

Using these facts we can now see:

$$\begin{aligned}
E(X + Y) &= \sum_s s P(X + Y = s) \\
&= \sum_s \sum_{x + y = s} s P(X = x) P(Y = y) \\
&= \sum_x \sum_y (x + y) P(X = x) P(Y = y) \\
&= \sum_x \sum_y \left( x P(X = x) P(Y = y) + y P(X = x) P(Y = y) \right) \\
&= \sum_x x P(X = x) \sum_y P(Y = y) + \sum_y y P(Y = y) \sum_x P(X = x) \\
&= \sum_x x P(X = x) \cdot 1 + \sum_y y P(Y = y) \cdot 1 \\
&= E(X) + E(Y)
\end{aligned}$$

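PART FOUR: 21

Also:

$$\begin{aligned}
E(XY) &= \sum_s s P(XY = s) \\
&= \sum_s \sum_{xy = s} xy P(X = x \text{ and } Y = y) \\
&= \sum_x \sum_y xy P(X = x \text{ and } Y = y) \\
&= \sum_x \sum_y xy P(X = x) P(Y = y) \\
&= \sum_x \left( x P(X = x) \sum_y y P(Y = y) \right) \\
&= \sum_x x P(X = x) E(Y) \\
&= E(Y) \sum_x x P(X = x) \\
&= E(X) E(Y)
\end{aligned}$$

(The fourth line is where the independence of X and Y is used.)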

Finally, if X and Y are independent with E ( X ) = µ and E (Y ) = ν , then:

$$\begin{aligned}
Var(X + Y) &= E\left( (X + Y - \mu - \nu)^2 \right) \\
&= E\left( (X - \mu + Y - \nu)^2 \right) \\
&= E\left( (X - \mu)^2 + 2(X - \mu)(Y - \nu) + (Y - \nu)^2 \right) \\
&= E\left( (X - \mu)^2 \right) + 2E(X - \mu)E(Y - \nu) + E\left( (Y - \nu)^2 \right) \\
&= Var(X) + 2(\mu - \mu)(\nu - \nu) + Var(Y) \\
&= Var(X) + Var(Y)
\end{aligned}$$

□

CHALLENGE EXERCISE: Show that $Var(X) = E\left( (X - \mu)^2 \right) = E(X^2) - (E(X))^2$.

PART FOUR: 22

CONNECTION TO THE CENTRAL LIMIT THEOREM

Suppose $X_1, X_2, \ldots, X_n$ is a collection of independent random variables that represent the results of running an experiment n times. Let

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

This is the average result. The central limit theorem states that if each of the random variables $X_1, X_2, \ldots, X_n$ has mean $\mu$ and standard deviation $\sigma$, then the probability distribution of $\bar{X}$ is well approximated by the normal distribution with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$.

The hard part of this theorem is proving the approximation to the normal curve. Calculating the mean and standard deviation of $\bar{X}$ is now easy!

$$E(\bar{X}) = E\left( \frac{1}{n}(X_1 + \cdots + X_n) \right) = \frac{1}{n}\left( E(X_1) + \cdots + E(X_n) \right) = \frac{1}{n}(\mu + \cdots + \mu) = \mu$$

$$Var(\bar{X}) = \frac{1}{n^2}\left( Var(X_1) + \cdots + Var(X_n) \right) = \frac{1}{n^2}(\sigma^2 + \cdots + \sigma^2) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}$$

and so $SD(\bar{X}) = \frac{\sigma}{\sqrt{n}}$.

PART FOUR: 23

THE CEREAL BOX PROBLEM

Krunchy-Munch Cereal company has placed a prize in every third box. How many boxes of cereal can I expect to buy before seeing a prize?

The chance of finding a prize in a given box is $p = \frac{1}{3}$ and the chance of failing to see a prize is $q = \frac{2}{3}$. We have:

The probability of finding a prize in the first box you try is $p$.
The probability of first finding the prize in the second box you try is $qp$.
The probability of first finding the prize in the third box you try is $q^2 p$,
and so on.

Let X be the random variable which is the box number you open when you first find the prize. Then:

$$E(X) = 1 \cdot p + 2 \cdot qp + 3 \cdot q^2 p + 4 \cdot q^3 p + \cdots$$

COMMENT: Here X can have one of an infinite set of discrete values. It is also called a discrete random variable.

To evaluate this sum we need to make use of the famous geometric formula.

ASIDE: Proving that $1 + x + x^2 + x^3 + \cdots = \frac{1}{1 - x}$

Suppose we wish to find the value of this infinite sum. Let’s call its value S:

$$1 + x + x^2 + x^3 + \cdots = S$$

Multiply through by x:

$$x + x^2 + x^3 + x^4 + \cdots = xS$$

That is:

$$S - 1 = xS$$

Solving gives:

$$S = \frac{1}{1 - x}$$

COMMENT: We’ve actually only proven the statement: IF the sum $1 + x + x^2 + x^3 + \cdots$ has a finite value, then that value must be $\frac{1}{1 - x}$. In a calculus course, one proves that the sum does indeed converge to a finite answer for values $-1 < x < 1$.

PART FOUR: 24

Returning to our cereal box problem we have:

$$E(X) = 1 \cdot p + 2 \cdot qp + 3 \cdot q^2 p + 4 \cdot q^3 p + \cdots$$

Notice that $qE(X) = qp + 2q^2 p + 3q^3 p + 4q^4 p + \cdots$, from which it follows that:

$$E(X) - qE(X) = p + qp + q^2 p + q^3 p + \cdots$$

That is:

$$(1 - q)E(X) = p(1 + q + q^2 + q^3 + \cdots) = p \cdot \frac{1}{1 - q}$$

That is:

$$pE(X) = p \cdot \frac{1}{p} = 1$$

and so:

$$E(X) = \frac{1}{p}$$

With $p = \frac{1}{3}$ this has value 3.

We can expect to buy three boxes before seeing a prize.
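The infinite sum can also be approached numerically by adding up enough terms. This is a rough sketch; the cut-off of 200 terms is an arbitrary choice, safe here because the terms shrink geometrically.

```python
def expected_boxes(p, terms=200):
    """Partial sum of 1*p + 2*q*p + 3*q^2*p + ... with q = 1 - p."""
    q = 1 - p
    return sum(n * q ** (n - 1) * p for n in range(1, terms + 1))

approx = expected_boxes(1 / 3)
print(approx, 1 / (1 / 3))   # partial sum versus the closed form 1/p
```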

FOR THE BOLD: Suppose Krunchy-Munch Cereal company actually has three different prizes and each box contains one of these prizes. (Assume there is a one-third chance of finding any particular prize in a given box.)

Show that one can expect to buy $3\left(1 + \frac{1}{2} + \frac{1}{3}\right) = 5.5$ boxes before seeing all three prizes.

[In general: If there are n prizes to be had, show that one can expect to buy $n\left(1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}\right)$ boxes before seeing all n prizes.]
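Before attempting the proof, it may help to see the formula agree with an experiment. The sketch below evaluates the stated formula exactly and compares it with a simulation of buying boxes until every prize has appeared; the function names, seed and number of trials are illustrative assumptions.

```python
import random
from fractions import Fraction

def expected_boxes_for_all(n):
    """The formula from the exercise: n * (1 + 1/2 + ... + 1/n)."""
    return n * sum(Fraction(1, k) for k in range(1, n + 1))

def simulate_boxes_for_all(n, rng):
    """Buy boxes, each with a uniformly random prize, until all n are seen."""
    seen, boxes = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        boxes += 1
    return boxes

rng = random.Random(0)   # fixed seed so the run is repeatable
trials = 20000
sim_average = sum(simulate_boxes_for_all(3, rng) for _ in range(trials)) / trials

print(expected_boxes_for_all(3), sim_average)
```

The simulation is only supporting evidence; it does not replace the argument the exercise asks for.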

PART FOUR: 25

THE GEOMETRIC DISTRIBUTION

The cereal box problem is an example of a general situation. Suppose an experiment is run and the probability of observing a “success” is p and of observing a “failure” is q = 1 − p. (For example, in tossing a coin with “heads” deemed a success, we have $p = q = \frac{1}{2}$. In rolling a die with “six” deemed a success, we have $p = \frac{1}{6}$ and $q = \frac{5}{6}$.)

Let X be the random variable: X = the number of runs of the experiment needed for seeing a first success. There is a p chance that one will see a success on the first run, so X has value 1 with probability p. The probability that X = 2 is qp . (Fail and then succeed.) The probability that X= 3 is q 2 p . (Two failures then a success.) And so on. The probability distribution associated to X appears:

This is called the geometric distribution. We have $P(X = n) = q^{n-1} p$. We have seen on the previous page that $E(X) = \frac{1}{p}$. With some work it is possible to show that $Var(X) = \frac{q}{p^2}$. (One uses the formula $Var(X) = E(X^2) - (E(X))^2$ and shows that

$$E(X^2) = 1p + 4qp + 9q^2 p + 16q^3 p + \cdots = \frac{1 + q}{(1 - q)^2}.)$$

PART FOUR: 26

In summary: For the geometric distribution:

$$P(X = n) = q^{n-1} p \qquad E(X) = \frac{1}{p} \qquad Var(X) = \frac{q}{p^2}$$
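These summary formulas can be checked numerically by truncating the infinite sums. The sketch below uses the die example (success means rolling a six, so p = 1/6); the cut-off of 2000 terms is an arbitrary choice that leaves a negligible tail.

```python
def geometric_moments(p, terms=2000):
    """Approximate E(X) and Var(X) for P(X = n) = q^(n-1) * p by partial sums."""
    q = 1 - p
    probs = [(n, q ** (n - 1) * p) for n in range(1, terms + 1)]
    mean = sum(n * pr for n, pr in probs)
    second_moment = sum(n * n * pr for n, pr in probs)
    return mean, second_moment - mean ** 2   # Var(X) = E(X^2) - E(X)^2

p = 1 / 6
mean, variance = geometric_moments(p)
print(mean, 1 / p)                   # compare with 1/p
print(variance, (1 - p) / p ** 2)    # compare with q/p^2
```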

EXAMPLE: You would like to know the meaning of life, but only 1 in 100 people on this planet know the answer. You decide to ask each person you meet until you find someone who can tell you the answer.
a) How many people do you expect to meet before finding someone with the answer?
b) What is the probability that you will find the person with the answer within the first three people you meet?

Answer: This is a geometric probability situation with p = 0.01 and q = 0.99.

a) $E(X) = \frac{1}{p} = 100$. We can expect to meet 100 people before finding the answer.
b) $P(X \le 3) = P(X = 1) + P(X = 2) + P(X = 3) = p + qp + q^2 p \approx 2.97\%$.

□

The geometric probability distribution has a nice feature:

$$\begin{aligned}
P(X > n) &= 1 - P(X \le n) \\
&= 1 - p - qp - \cdots - q^{n-1} p \\
&= 1 - p(1 + q + \cdots + q^{n-1}) \\
&= 1 - p\left( \frac{1 - q^n}{1 - q} \right) \\
&= 1 - (1 - q^n) \\
&= q^n
\end{aligned}$$

The probability that it will take me more than 100 people to find the meaning of life is thus $(0.99)^{100} \approx 36.6\%$.
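The calculations in this example are short enough to reproduce directly. In the sketch below, `geom_pmf` and `geom_tail` are illustrative names for $P(X = n)$ and $P(X > n)$.

```python
def geom_pmf(n, p):
    """P(X = n) = q^(n-1) * p: n - 1 failures followed by a success."""
    return (1 - p) ** (n - 1) * p

def geom_tail(n, p):
    """P(X > n) = q^n: the first n runs all fail."""
    return (1 - p) ** n

p = 0.01
within_three = sum(geom_pmf(n, p) for n in (1, 2, 3))   # part (b)
more_than_100 = geom_tail(100, p)

# The tail formula should agree with 1 minus the summed probabilities.
direct_tail = 1 - sum(geom_pmf(n, p) for n in range(1, 101))

print(within_three, more_than_100, direct_tail)
```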

PART FOUR: 27

THE BINOMIAL DISTRIBUTION

In tossing a coin ten times we have already worked out the distribution of probabilities of obtaining exactly 10, 9, 8, …, 2, 1 and no heads. (See part 1.) This is a specific example of a more general situation.

Suppose an experiment has probability p of producing a “success” and probability q = 1 − p of producing a “failure.” Let $B_n(k)$ be the probability of producing exactly k successes in a run of n experiments.

[For example, in tossing a coin 10 times, $B_{10}(3) = \frac{10!}{3!\,7!} \left( \frac{1}{2} \right)^3 \left( \frac{1}{2} \right)^7 \approx 11.7\%$.]

For each value n we have a series of probability values, one for each of k = 0 to k = n. Each of these distributions of probabilities is called a binomial distribution. In general, we have:

$$B_n(k) = \frac{n!}{k!\,(n - k)!}\, p^k q^{n-k}$$

In our studies of the binomial theorem in part II, we saw that:

$$(p + q)^n = p^n + np^{n-1} q + \cdots + \frac{n!}{k!\,(n - k)!}\, p^k q^{n-k} + \cdots + q^n$$

Hence the name of the binomial distribution. We shall prove:

Consider the binomial distribution of n trials. If X is the random variable that counts how many successes occur, then, as we have seen:

$$P(X = k) = \frac{n!}{k!\,(n - k)!}\, p^k q^{n-k}$$

For this random variable we have:

$$E(X) = np \qquad Var(X) = npq$$

PART FOUR: 28 Proof: First consider a single run of the experiment ( n = 1 ). There is either 1 success or zero successes:

In this simple case:

$$E(X) = 1 \cdot p + 0 \cdot q = p$$

Using $Var(X) = \sum (x - \mu)^2 P(x)$ with $\mu = p$ we have:

$$Var(X) = (1 - p)^2 p + (0 - p)^2 q = q^2 p + p^2 q = pq(p + q) = pq$$

Now consider the situation of n runs of the experiment. Let

X = the count of successes in all n experiments
$X_1$ = the count of success in just the first experiment
$X_2$ = the count of success in just the second experiment
⋮
$X_n$ = the count of success in just the last experiment

Then $X = X_1 + X_2 + \cdots + X_n$. As $X_1, X_2, \ldots, X_n$ are independent we have:

$$E(X) = E(X_1) + E(X_2) + \cdots + E(X_n) = p + p + \cdots + p = np$$

$$Var(X) = Var(X_1) + Var(X_2) + \cdots + Var(X_n) = pq + pq + \cdots + pq = npq$$

This completes the proof.

□

PART FOUR: 29

EXAMPLE: Returning to my pursuit of the meaning of life … Suppose I invite 40 people at random over for a party. What are the mean and standard deviation of the number of people at my party who know the answer to my question? What is the probability that 2 or 3 people among this group have the answer?

Answer: Here p = 0.01 and q = 0.99. Among a sample of 40 people we can expect:

$$E(X) = 40p = 0.4 \qquad SD(X) = \sqrt{npq} = \sqrt{40 \times 0.01 \times 0.99} \approx 0.63$$

Also

$$\begin{aligned}
P(X = 2 \text{ or } 3) &= P(X = 2) + P(X = 3) \\
&= \frac{40!}{2!\,38!} (0.01)^2 (0.99)^{38} + \frac{40!}{3!\,37!} (0.01)^3 (0.99)^{37} \\
&\approx 6.0\%
\end{aligned}$$

□
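The party example can be recomputed in a few lines. This sketch relies on `math.comb` from the Python standard library for the binomial coefficient; `binom_pmf` is an illustrative name for $B_n(k)$.

```python
from math import comb, sqrt

def binom_pmf(k, n, p):
    """B_n(k): probability of exactly k successes in n runs."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 40, 0.01
mean = n * p                   # E(X) = np
sd = sqrt(n * p * (1 - p))     # SD(X) = sqrt(npq)
two_or_three = binom_pmf(2, n, p) + binom_pmf(3, n, p)

# Sanity check: the probabilities for k = 0..n should sum to 1.
total = sum(binom_pmf(k, n, p) for k in range(n + 1))

print(mean, sd, two_or_three, total)
```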

CONNECTION TO THE NORMAL CURVE

Imagine that as a hospital administrator I need 100 samples of O- blood. There will be 450 donors over the next week, but only 6% of people have this blood type. In order to find the probability of obtaining exactly 100 such donors we need to compute:

$$\frac{450!}{100!\,350!} (0.06)^{100} (0.94)^{350}$$

This is extraordinarily unwieldy to compute by hand. Fortunately, for large samples (and we’ll explain what we mean by “large” in a moment) the binomial distribution can be well approximated by a normal curve with:

$$\mu = np \qquad \sigma = \sqrt{npq}$$

We can then use the tables of the normal curve values in order to approximate the probabilities we need.

PART FOUR: 30

WHY SHOULD THE NORMAL CURVE COME INTO PLAY?

There are two reasons. Firstly, the binomial distribution analyses results of running experiments multiple times (the probability of obtaining k = 0, k = 1, …, k = n successes over n runs). If n is large, the central limit theorem states that the situation should begin to approximate a normal curve.

Secondly, and this is the true reason, there is a connection between factorials and the number e. Stirling proved that:

$$n! \approx \sqrt{2\pi n} \cdot n^n \cdot e^{-n}$$

with the approximation only improving as n grows larger. Thus it seems plausible (and is in fact the case) that the formulas one obtains from the binomial distribution begin to look like a formula for a curve of the form $y = e^{-x^2}$, a normal curve.

[Calculus gives a hint as to why Stirling’s formula might be true. Notice that:

$$\ln n! = \ln 1 + \ln 2 + \ln 3 + \cdots + \ln n$$

appears as the sum of areas of rectangles under the curve $y = \ln x$. Thus:

$$\ln n! \approx \int_1^n \ln x \, dx = \left[ x \ln x - x \right]_1^n = n \ln n - n + 1 \approx n \ln n - n$$

which gives $n! \approx n^n e^{-n}$. (Stirling gave a more refined version of this argument.)]

HOW LARGE IS LARGE ENOUGH?

The normal curve extends infinitely far both to the left and to the right, but 99.7% of the region under the curve lies within 3σ of the mean µ. The two tails constitute only 0.3% of the region under the curve. The binomial distribution, however, only has probability values for k = 0, k = 1, …, k = n, and does not extend infinitely far to the left and to the right. In fact, it does not have values for k less than zero or for k greater than n. We would like these “non-existent” regions to match the two tails of the normal curve so that the mismatch of “missing” region is negligible.

For the binomial distribution $\mu = np$ and $\sigma = \sqrt{npq}$, and we would like: $\mu + 3\sigma < n$ and $0 < \mu - 3\sigma$ so that the two tails correspond to moving beyond n and below 0.

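PART FOUR: 31

The statement $0 < \mu - 3\sigma$ gives:

$$np > 3\sqrt{npq}$$

Squaring both sides and dividing through by np gives:

$$np > 9q$$

Since q ≤ 1, this is guaranteed to hold if:

$$np > 9$$

In the same way, the other relation, $\mu + 3\sigma < n$, is guaranteed to hold if:

$$nq > 9$$

For simplicity, folk usually work with the number 10 to say:

If np ≥ 10 and nq ≥ 10 then the binomial distribution can be approximated as a normal distribution.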
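With a computer the exact binomial sums are no longer out of reach, so the quality of the approximation can be inspected directly. The sketch below compares an exact cumulative probability with its normal-curve estimate for the blood-donor numbers (n = 450, p = 0.06). Two assumptions are worth flagging: the normal curve's cumulative area is computed from `math.erf` rather than read from a table, and the estimate uses a half-unit "continuity correction" (30.5 in place of 30), a standard refinement that is not discussed in these notes.

```python
from math import comb, erf, sqrt

def binom_cdf(k, n, p):
    """Exact P(X <= k) for the binomial distribution."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def normal_cdf(x, mu, sigma):
    """Area under the normal curve to the left of x."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def rule_of_ten(n, p):
    """The rule of thumb: np >= 10 and nq >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10

n, p = 450, 0.06
mu, sigma = n * p, sqrt(n * p * (1 - p))

exact = binom_cdf(30, n, p)
approx = normal_cdf(30.5, mu, sigma)   # continuity correction (assumption)

print(rule_of_ten(n, p), exact, approx)
print(rule_of_ten(40, 0.01))           # the party example fails the rule
```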

PART FOUR: 32

PROPORTIONS: POPULATION PROPORTIONS and SAMPLE PROPORTIONS

Suppose we’re interested in finding the proportion of Americans who can whistle. Call this true, and unknown, proportion p. To estimate p we could collect a sample of, say, 1000 people and find the sample proportion $\hat{p}$ of those who can whistle. To understand how good an estimate $\hat{p}$ is for p, we can analyse the distribution of $\hat{p}$ values. This turns out to be governed by a binomial distribution. To see why, note the following:

The probability of selecting an American at random who can whistle is p. Since the population of Americans is so large, removing this first person from the pool really won’t alter the probability of choosing a second American who can whistle. The chances are still, for all meaningful purposes, p. Thus, selecting n = 1000 Americans is equivalent to running an experiment 1000 times with p chance of “success” and q = 1 − p chance of failure.

The quantity $n\hat{p}$ counts the number of successes and so must have a binomial distribution. It has mean np and standard deviation $\sqrt{npq}$. Dividing by n gives the distribution of $\hat{p}$. It has:

$$\mu = p \qquad \sigma = \sqrt{\frac{pq}{n}}$$

Moreover, if np and nq are each ≥ 10, this distribution is approximately normal.

This is the “central limit theorem: Version II” from part III of these notes.

PART FOUR: 33

STUDENT’S t-DISTRIBUTION

Throughout section III of these notes, in calculating confidence intervals and the like, we assumed that the standard deviation σ of a distribution is known. This is rarely the case. In many cases we approximate σ with the value $\hat{\sigma}$ given by the standard deviation of the sample at hand.

EXAMPLE: Of 1500 adult Americans that were polled, 3.2% said that they had a thoroughly enjoyable experience studying math in high-school. Estimate the proportion of ALL adult Americans that will say the same.

We answered this question in part III as follows:

Answer: We have

$$\hat{p} = 3.2 \qquad \hat{\sigma} = \sqrt{\frac{3.2 \times 96.8}{1500}} = 0.454$$

(both measured in percent). Using $\hat{\sigma}$ as an approximation for σ, with 95% confidence we can say that the percentage of Americans who felt this way about math lies within $2\hat{\sigma}$ of $\hat{p}$, that is, in the range [2.292, 4.108]. □
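The arithmetic behind this interval can be sketched as follows. The code assumes the 95% range is taken as two standard deviations either side of the sample percentage, which is what the quoted endpoints correspond to; `proportion_interval` is an illustrative name.

```python
from math import sqrt

def proportion_interval(p_hat, n, z=2.0):
    """Range p_hat +/- z * sigma_hat, with p_hat given in percent.

    z = 2.0 is an assumption matching the endpoints quoted in the example.
    """
    sigma_hat = sqrt(p_hat * (100 - p_hat) / n)
    return p_hat - z * sigma_hat, p_hat + z * sigma_hat, sigma_hat

low, high, sigma_hat = proportion_interval(3.2, 1500)
print(round(sigma_hat, 3), round(low, 3), round(high, 3))
```

Small differences in the last decimal place from the figures in the text come from rounding $\hat{\sigma}$ before or after multiplying by 2.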

Gosset noticed, when performing statistical checks in his brewing company, that unacceptable errors occurred in his analyses by making use of these approximate values for σ. He set to work analysing the distribution of $\hat{\sigma}$ values in their own right, in the same way we analyse the distribution of sample means $\bar{x}$ via the central limit theorem. To be more precise, Gosset realized that since we always need to convert entities to their z-score, it is best to analyse the distribution of the quantities:

$$t = \frac{\bar{x} - \mu}{\hat{\sigma} / \sqrt{n}}$$

He completely determined the mathematics of these entities, showing that, for each n (the size of the sample) they follow their own distribution curves called Student’s t-distribution with n − 1 degrees of freedom. (Gosset published under the pseudonym Student.)

PART FOUR: 34 Thus, one can create more accurate confidence intervals for means from a sample by working with tables from Student’s t-distributions rather than work with the normal distribution and approximate values for the standard deviation.

EXAMPLE: The speeds of 23 cars along a particular road with posted speed limit 40 mph were recorded. Their mean speed was $\bar{x} = 41.0$ mph with $\hat{\sigma} = 4.23$ mph. Is there reason to believe that the mean speed of all cars along this road is greater than 40 mph?

Answer: Assuming that car speeds follow something akin to a normal distribution (we can plot the results and see if they appear “bell shaped”) we can follow Gosset’s model. We’ll make the assumption that µ = 40.0 and see what we conclude. Here:

$$t = \frac{41.0 - 40.0}{4.23 / \sqrt{23}} \approx 1.13$$

There are n − 1 = 22 degrees of freedom. According to a table of t-distributions:

$$P(t_{22} > 1.13) = 0.136 = 13.6\%$$

This is not sufficiently rare to conclude that receiving a sample mean of 41.0 is unusual under the assumption that the true mean is 40.0. That is, we have no reason to reject the idea that the true mean is 40.0 mph. □
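In place of a table, the tail area can be estimated numerically. The sketch below computes the t value for the car-speed example and then integrates the standard density formula for Student's t-distribution (a formula not derived in these notes) with the trapezoid rule. The result is an approximation and may differ slightly from a printed table.

```python
from math import gamma, pi, sqrt

def t_statistic(x_bar, mu, sigma_hat, n):
    """t = (x_bar - mu) / (sigma_hat / sqrt(n))."""
    return (x_bar - mu) / (sigma_hat / sqrt(n))

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=20000):
    """P(T > t) for t >= 0: one half minus the area between 0 and t."""
    h = t / steps
    inner = sum(t_pdf(i * h, df) for i in range(1, steps))
    area = h * (inner + (t_pdf(0, df) + t_pdf(t, df)) / 2)   # trapezoid rule
    return 0.5 - area

t = t_statistic(41.0, 40.0, 4.23, 23)
tail = t_upper_tail(t, df=22)
print(round(t, 2), round(tail, 3))
```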

PART FOUR: 35

COMPARING TWO MEANS

Two companies make light bulbs: company 1 and company 2. We’d like to know if there is any difference between the mean life-times of the bulbs they each produce. We take a sample of bulbs of size $n_1$ from company 1 and compute their sample mean $\bar{x}_1$, and a sample of size $n_2$ from company 2 and compute their sample mean $\bar{x}_2$. What does the difference $\bar{x}_1 - \bar{x}_2$ of these two sample means tell us about the difference $\mu_1 - \mu_2$ of the two true means?

Let $\bar{X}_1$ be the random variable of possible $\bar{x}_1$ values, and define $\bar{X}_2$ similarly. By the central limit theorem:

$$E(\bar{X}_1) = \mu_1 \qquad SD(\bar{X}_1) = \frac{\sigma_1}{\sqrt{n_1}}$$

$$E(\bar{X}_2) = \mu_2 \qquad SD(\bar{X}_2) = \frac{\sigma_2}{\sqrt{n_2}}$$

We also have:

$$E(\bar{X}_1 - \bar{X}_2) = E(\bar{X}_1) - E(\bar{X}_2) = \mu_1 - \mu_2$$

$$SD(\bar{X}_1 - \bar{X}_2) = \sqrt{Var(\bar{X}_1 - \bar{X}_2)} = \sqrt{Var(\bar{X}_1) + Var(\bar{X}_2)} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

This is, of course, assuming that we know the true standard deviations. Without this knowledge, we can still test if the difference $\mu_1 - \mu_2 = 0$ by using Student’s t-distribution, computing the value:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\hat{\sigma}_1^2}{n_1} + \dfrac{\hat{\sigma}_2^2}{n_2}}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\hat{\sigma}_1^2}{n_1} + \dfrac{\hat{\sigma}_2^2}{n_2}}}$$

and seeing if this value is an acceptable “distance” from $\mu_1 - \mu_2 = 0$. Details are omitted here, but the gist is clear.

COMMENT: One of the details omitted is determining the number of degrees of freedom that are appropriate for this problem. Because we have a mix of sample sizes, the count of degrees of freedom is complicated considerably.
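The t value itself is simple to compute from summary statistics. The sketch below does only that step, under the hypothesis $\mu_1 - \mu_2 = 0$; the light-bulb figures are invented for illustration, and the degrees-of-freedom question mentioned in the comment above is deliberately left out.

```python
from math import sqrt

def two_sample_t(x_bar1, sigma_hat1, n1, x_bar2, sigma_hat2, n2):
    """t = (x_bar1 - x_bar2) / sqrt(sigma_hat1^2/n1 + sigma_hat2^2/n2),
    computed under the hypothesis that mu1 - mu2 = 0."""
    standard_error = sqrt(sigma_hat1 ** 2 / n1 + sigma_hat2 ** 2 / n2)
    return (x_bar1 - x_bar2) / standard_error

# Hypothetical bulb life-times in hours (made-up numbers).
t = two_sample_t(1010.0, 30.0, 36, 1000.0, 40.0, 64)
print(t)
```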

PART FOUR: 36

A COMMENT ON CHI-SQUARED DISTRIBUTION

If $X_1, X_2, \ldots, X_\nu$ are independent random variables, each with the standard normal probability distribution, then situations can arise in which one wishes to study the random variable (or close variations of it):

$$X_1^2 + X_2^2 + \cdots + X_\nu^2$$

[This arises in our study of contingency tables.] The mathematics of this random variable is well understood and its probability distribution is known. It is called the chi squared distribution with ν degrees of freedom. This distribution has $\mu = \nu$ and $\sigma = \sqrt{2\nu}$.

CHEBYSHEV’S INEQUALITY

Here is an astounding fact:

For any probability distribution, if the mean of the associated random variable is µ and its standard deviation σ, then the area under the curve at least k standard deviations from the mean is no more than $\frac{1}{k^2}$.

Its proof is swift! We have:

$$\begin{aligned}
\sigma^2 = \sum_{\text{all } x} (x - \mu)^2 P(x) &\ge \sum_{|x - \mu| \ge k\sigma} (x - \mu)^2 P(x) \\
&\ge \sum_{|x - \mu| \ge k\sigma} k^2 \sigma^2 P(x) \\
&= k^2 \sigma^2 \sum_{|x - \mu| \ge k\sigma} P(x) \\
&= k^2 \sigma^2 P(|x - \mu| \ge k\sigma)
\end{aligned}$$

and so:

$$P(|x - \mu| \ge k\sigma) \le \frac{1}{k^2}$$

CHALLENGE: Prove, more generally, $P(|x - \mu| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2}$.
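The inequality is easy to test on a concrete distribution. The sketch below checks it for a fair die at a few values of k (the particular values of k are arbitrary choices); the bound is usually far from tight, which is the price of holding for every distribution.

```python
from fractions import Fraction
from math import sqrt

die = {face: Fraction(1, 6) for face in range(1, 7)}
mu = sum(x * p for x, p in die.items())
sigma = sqrt(sum((x - mu) ** 2 * p for x, p in die.items()))

def tail_probability(k):
    """P(|X - mu| >= k * sigma) for the fair die."""
    return sum(p for x, p in die.items() if abs(x - mu) >= k * sigma)

# For each k: (actual tail probability, Chebyshev bound 1/k^2).
checks = {k: (float(tail_probability(k)), 1 / k ** 2) for k in (1, 1.2, 1.4, 2)}

for k, (tail, bound) in checks.items():
    print(k, tail, bound, tail <= bound)
```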

PART FOUR: 37

LAW OF LARGE NUMBERS

One of the great “check points” of probability theory was the proof of the “intuitively obvious” result, the law of large numbers. It showed that all was on the right track. In formal language, here is the result:

Suppose $X_1, X_2, \ldots, X_n$ are independent random variables each of mean µ and standard deviation σ. We usually think of these random variables as the results of running the same experiment n different times. Let $S_n = X_1 + X_2 + \cdots + X_n$ (so that $\frac{S_n}{n}$ is the average result). Then

$$\lim_{n \to \infty} P\left( \left| \frac{S_n}{n} - \mu \right| > \varepsilon \right) = 0.$$

That is, the probability that the average $\frac{S_n}{n}$ differs from the true mean µ by more than some error value ε goes to zero as n grows, no matter the degree of error ε you wish to tolerate.

COMMENT: This is not actually saying that $\lim_{n \to \infty} \frac{S_n}{n}$ equals µ, only, in some sense, that $\lim_{n \to \infty} \frac{S_n}{n}$ equals µ with probability 1. (Very strange! But curious issues do arise in the theory. For example, the chances of choosing a whole number by selecting a point at random along a number line is zero, even though integers are themselves valid points to be selected!)

Proof: By the results of pages 17-19 we have: $E\left( \frac{S_n}{n} \right) = \mu$ and $Var\left( \frac{S_n}{n} \right) = \frac{\sigma^2}{n}$. By Chebyshev’s inequality:

$$P\left( \left| \frac{S_n}{n} - \mu \right| \ge \varepsilon \right) \le \frac{\sigma^2}{n\varepsilon^2}.$$

And so $\lim_{n \to \infty} P\left( \left| \frac{S_n}{n} - \mu \right| > \varepsilon \right) = 0$ since clearly $\lim_{n \to \infty} \frac{\sigma^2}{n\varepsilon^2} = 0$.

□
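A simulation shows the law at work. The sketch below rolls a fair die n times, records how often the average misses 3.5 by more than ε, and prints that frequency next to the Chebyshev bound $\sigma^2 / (n\varepsilon^2)$ from the proof. The tolerance ε = 0.25, the seed and the number of trials are arbitrary illustrative choices.

```python
import random

rng = random.Random(1)          # fixed seed so the run is repeatable
mu, var = 3.5, 35 / 12          # mean and variance of one die roll
eps, trials = 0.25, 1000        # illustrative tolerance and trial count

def miss_frequency(n):
    """Fraction of trials in which |S_n/n - mu| > eps."""
    misses = 0
    for _ in range(trials):
        average = sum(rng.randint(1, 6) for _ in range(n)) / n
        if abs(average - mu) > eps:
            misses += 1
    return misses / trials

# For each n: (observed miss frequency, Chebyshev bound).
results = {n: (miss_frequency(n), var / (n * eps ** 2)) for n in (10, 100, 1000)}

for n, (freq, bound) in results.items():
    print(n, freq, round(bound, 4))
```

For small n the bound exceeds 1 and says nothing; it is the 1/n decay that drives the limit to zero.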

PART FOUR: 38