Theory Meets Data. A Data Scientist s Handbook to Statistics DRAFT. Ani Adhikari. Editor: Dibya Jyoti Ghosh

Theory Meets Data A Data Scietist s Hadbook to Statistics Ai Adhikari Editor: Dibya Jyoti Ghosh DRAFT Cotributors: Shreya Agarwal, Thomas Athoy, Bryaie Bach, Adith Balamuruga, Betty Chag, Aditya Gadhi, Dibya Jyoti Ghosh, Edward Huag, Jiayi Huag, J. Westo Hughes, Arvid Iyegar, Adrew Lixie, Rahil Mathur, Nishaad Navkal, Kyle Nguye, Christopher Sauceda, Roha Sigh, Parth Sighal, Maxwell Weistei, Yu Xia, Athoy Xia, Lig Xie

Cotets 1 Averages 3 1.1 What is a average?..................................... 3 1.2 Perturbig the list...................................... 3 1.3 Bouds o the Average................................... 4 1.4 Averagig averages..................................... 4 1.5 Aother way to calculate the average........................... 5 1.6 Questios........................................... 6 2 Deviatios 8 2.1 What is Stadard Deviatio?................................ 8 2.2 Variace........................................... 10 2.3 Questios........................................... 11 3 Bouds 13 3.1 Markov s Iequality..................................... 13 3.2 Chebychev s Iequality................................... 17 3.3 Questios........................................... 19 4 Probability 21 4.1 Probability.......................................... 21 4.2 Examples: Samplig with Replacemet.......................... 24 4.3 The Gambler s Rule..................................... 26 4.4 The Birthday Problem.................................... 29 4.5 Questios........................................... 32 5 Samplig 34 5.1 Samplig With Replacemet................................ 34 5.2 Samplig Without Replacemet.............................. 36 5.3 Radom Permutatios................................... 39 5.4 Questios........................................... 41 6 Radom Variables 43 6.1 Radom Variables...................................... 43 6.2 Probability distributio................................... 44 6.3 Fuctios of Radom Variables.............................. 46 6.4 Expectatio.......................................... 47 6.5 Stadard Deviatio ad Bouds.............................. 48 6.6 Boudig Tail Probabilities................................. 49 1

2 CONTENTS 6.7 Questios........................................... 50 7 Sums of Radom Variables 52 7.1 Joit Distributios...................................... 52 7.2 The Expectatio of a Sum.................................. 53 7.3 The Variace of a Sum................................... 55 7.4 Questios........................................... 59 8 Correlatio 61 8.1 The Correlatio Coefficiet................................. 61 8.2 Liear Trasformatios................................... 62 8.3 Bouds o Correlatio................................... 63 9 The Regressio Lie 65 9.1 Mea Squared Error..................................... 65 9.2 Miimizig the Mea Squared Error........................... 65 9.3 The Best Itercept for a Fixed Slope............................ 66 9.4 The Best Slope........................................ 66 9.5 The Equatio of the Regressio Lie........................... 67 9.6 Notes............................................. 68 10 Appedix 69 10.1 Summatio Notatio.................................... 69

Chapter 1 Averages Whe aalyzig data, oe of the first thigs we d like to kow about is the ceter of the data. The average or the mea 1 of a list of umbers, is a measure that is used to represet a "cetral" value of the dataset. As we will see, there is more tha oe reasoable defiitio of "cetral". The average is oe of these. 1.1 What is a average? For a list of umbers x 1,x 2,...,x, we defie the average x (x with a bar above it, read as "x bar"). Defiitio 1 Average x = 1 x i (1.1) 1 I other words, we take uequal umbers, stitch them all together, ad the split them ito equal pieces, each whose size is the mea. I this represetatio, we ca cosider the mea to be the "equalizig value" of the data. Recall that costats ca be moved through the sum, ad so we ca rearrage our defiitio of the average to the followig: x = 1 x i (1.2) illustratig that the equalizatio process ca take place before we pool all the values together. 1.2 Perturbig the list Uderstadig the equalizig or "smoothig" property of averages allows us to quickly see how the average chages whe you chage etries i the list. Suppose a list of dollar amouts has 100 etries i it, ad oe of the etries goes up by $500. Whe you take the average of the ew list, 1 "Mea" is aother ame for "average", so other sources may use µ to represet certai types of averages. That s the Greek letter m, read as "mu". 3

4 CHAPTER 1. AVERAGES those additioal 500 dollars will get split evely 100 ways, ad so the average will go up by $5. No algebra eeded. All you eed is the amout of chage to the etry ad the umber of items i the list. If you wat to work through the algebra, of course you ca. Say that the first value i our list, x 1, becomes k. The chage c i our average (that is, the differece betwee the averages of our first dataset x: (x 1,x 2,...,x ) ad our secod dataset y: (k,x 2,...,x ) is give by c = x ȳ = x i y i 1 1 = x 1 + x i k x i 2 2 = x 1 k c = x 1 k (1.3) This calculatio cofirms that the oly thigs we eed to kow to determie c are the chage to the etry (x 1 k), ad the total umber of values beig averaged (). As a cosequece, we have the followig observatio: the more values we have i our list, the smaller the effect the chage to a sigle etry ca have. 1.3 Bouds o the Average How big or small ca the average be? A atural aswer is that the average will be somewhere i betwee the smallest ad largest value i the list. You ca formally establish these lower ad upper bouds o the average. Let m be the miimum value of a list x 1,x 2,...,x. The by defiitio for all x j s we have x j 1 x i 1 m x m x m A similar assertio ca be made for the maximum M of a list of umbers, but we leave that proof to the reader. Thus for ay list of umbers x 1,...,x with miimum m, maximum M, ad average x, 1.4 Averagig averages m x M Say you have two lists of umbers x 1,x 2,...,x ad y 1,y 2,...y m, ad say they have averages of x ad ȳ respectively. How would we go about fidig the average of a combied list of all + m etries together?

1.5. ANOTHER WAY TO CALCULATE THE AVERAGE 5 A commo miscoceptio is that we ca just average the two averages(sum ad divide by two), but a simple example shows this does t hold. The average maratho time at the Rio Olympics was 2 hours ad 15 miutes, ad the average maratho time for amateur ruers is 4 hours ad 19 miutes. If we create a group of all amateur ruers as well as the Rio Olympics marathoers, do you thik that the average time of that group would be 3 hours ad 17 miutes, halfway betwee the two times? We hope ot! Clearly, we eed to take ito accout the fact that there are may more amateur maratho ruers tha Rio Olympias. To see how to do this, let us retur to our two lists x 1,x 2,...,x ad y 1,y 2,...y m. You ca figure out the average of the overall list (ofte called the "pooled" list) by rememberig that the average is a equalizer. The cotributio of the x s to the total pot will be their sum, which is x. The cotributio of all the y s will be mȳ. So the average of the pooled list will be x + mȳ + m Not formal eough for you? Ok, the let s deote the average of the etire list x 1,x 2...x,y 1,y 2,...y m by A. The we see A = 1 + m ( x i + = 1 + m ( = x + mȳ + m = + m x + y i ) x i + m m y i ) m + mȳ (1.4) Rather tha just just averagig the two averages, we first have to "weight" them accordig to the correspodig umber of etries. 1.5 Aother way to calculate the average Now that we kow how to put two lists together ad fid the average of the combied list, we have aother way of fidig the average of ay list. Cosider the list 7, 7, 7, 8, 8. You ca thik of this as a pooled list, if you pool the list 7, 7, 7 ad the list 8, 8. The average of the first list is 7, ad the average of the secod list is 8. So the average of the pooled list 7, 7, 7, 8, 8 is 3 5 7 + 2 5 8 (1.5) What gets used i this calculatio? We eed the two distict values i the list, amely 7 ad 8. We also eed the proportio of each of those values i the list. The average of the list ca be thought of as the average of the distict values weighted by their proportios. You ca exted this formally to fid the average of a list x 1,x 2,...x with lots of repeatig values. Say there are k distict values v 1,v 2,...v k i our list, appearig respectively with frequecies 1, 2,... k. I the list i our umerical example above, = 5 ad there are two distict values, so k = 2. The two distict values are v 1 = 7 ad v 2 = 8.

6 CHAPTER 1. AVERAGES It should ow be apparet that the average of the list is x = k i v i (1.6) Try to do the math that proves this! Also ote that for each i, the proportio of times v i appears i the list is p i = i. So the average ca be expressed as x = k p i v i (1.7) As before, that s the average of the distict values i the list, weighted by their proportios. This way of expressig the average shows you that if 3/5 of a list cosists of the value 7, ad the other 2/5 cosists of the value 8, the the average will be 3 5 7 + 2 5 8 (1.8) regardless of whether the list has 5 etries or 500. I other words, the list cosistig of 300 7 s ad 200 8 s has the same average as the list 7, 7, 7, 8, 8. 1.6 Questios 1. Prove the followig two simple but very useful facts about averages. a) If all the etries i a fiite list of umbers are the same,the the average is equal to the commo value of the etries. b) If a fiite list of umbers cosists oly of 0 s ad 1 s, the the average of the list is the proportio of 1 s i the list. 2. Cosider the list {1,2,...,}, where is a positive iteger. a) Guess the average of the list ad give a ituitive explaatio for your guess. b) Prove that your guess i part a is correct. c) Let i be a elemet of the list; i other words, suppose i is a iteger such that 1 i. Suppose the elemet i gets replaced by 0. By how much does the average chage? If you followed what we did i class,you should be able to just write dow this aswer ad explai it without calculatio. d) Start with the origial list {1,2,...,} ad delete a elemet i. What is the average of the ew list? 3. Suppose you are i a class that has the followig gradig scheme: 70% of the grade comes evely from two exams: a midterm ad a fial 20% comes from homework 10% comes from quizzes

1.6. QUESTIONS 7 You have received a average score of 93% o your homework ad 75% o your quizzes. O the midterm, you scored 82%. Write dow a formula for the miimal percetage score you eed o the fial to achieve a overall score of 90% i this course. You do ot eed to evaluate this expressio. 4. A dataset cosists of the list {x 1,x 2,...,x } ad has average x. Someoe is goig to pick a elemet of the list ad I have to guess its value.i have decided that my guess will be a costat c, regardless of which elemet is picked. Therefore if the elemet picked is x i, the error that I make will be x i c. Defie the mea squared error of my guess to be mse c = 1 (x i c) 2 Show that the miimum value of mse c over all c is attaied whe c = x, i two differet ways: a) I the defiitio of mse c, replace x i c by (x i x) + ( x c) ad use algebra. b) Use the defiitio of mse c ad calculus. 5. Suppose all the etries i a fiite list of umbers are equal. Prove that the average of the list is equal to the commo value of the etries. 6. Let x 1,x 2,...,x be a list of umbers, ad let x be the average of the list.which of the followig statemets must be true? There might be more tha oe such statemet, or oe, or oe; a) At least half of the umbers o the list must be bigger tha x. b) Half of the umbers o the list must be bigger tha x. c) Some of the umbers o the list must be bigger tha x. d) Not all of the umbers o the list ca be bigger tha x. 7. Suppose the list of umbers {x 1,x 2,...,x } has average x ad the list {y 1,y 2,...,y m } has average ȳ. Cosider the combied list of + m etries {x 1,x 2,...,x,y 1,y 2,...,y m }. Write a formula for the average of this combied list, i terms of x, ȳ,, ad m. You do ot have to prove your aswer. 8. Let {x 1,x 2,...,x } be a list of umbers ad let x deote the average of the list. Let a ad b be two costats, ad for each i such that 1 i, let y i = ax i + b. Cosider the ew list {y 1,y 2,...,y }, ad let the average of this list be ȳ. Prove a formula for ȳ i terms of a, b, ad x. 9. Let be a positive iteger. Cosider the list of eve umbers {2,4,6,...,2}. What is the average of this list? Prove your aswer. 10. Let {x 1,x 2,...,x } be a list of umbers with average x, ad let c be a costat. Show that 1 (x i c) 2 = 1 (x i x) 2 + ( x c) 2

Chapter 2 Deviatios 2.1 What is Stadard Deviatio? Two lists of data with the same average ca look quite differet. For example, the lists 5, 5, 5, 5 ad 3, 3, 7, 7 both have 5 as their average. But while all of the etries are equal to 5, oe of the etries i the secod list is 5. The secod list is more "spread out" tha the first. The list 1, 1, 9, 9 also has 5 as its average, ad it is eve more "spread out" tha the 3, 3, 7, 7. To see how far the umbers o a list are from their average, it is atural to look at distaces. Suppose the list is x 1,x 2,...,x with average x. For each idex i i the rage 1 through defie the ith deviatio from the mea to be d i = x i x To see how big the deviatios are, it is atural to take the average of all these deviatios. Let s try it out. Average Deviatio = d = 1 d i = 1 (x i x) = 1 ( x i = 1 ( x x) = x x x) Oh o! Sice all positive "distaces" offset all egative oes whe added together, the average deviatio from mea for ay data sets is always equal to 0. While that s correct, it s ot helpful for our purpose, which is to fid roughly how far off the umbers ca be from the mea. We have to fid a way past this problem of cacellatio. We have to esure that all distaces are o-egative. There are two time-hoored ways of doig this. The first is to take the absolute value of each distace. But the absolute value fuctio has some mathematical properties that make it complicated to work with for example, it is ot differetiable at 0. The other way of gettig rid of mius sigs is to calculate squares. So let us fid the average of the squared deviatios from mea That is a o-egative umber, but ufortuately it has differet 8 = 0

2.1. WHAT IS STANDARD DEVIATION? 9 uits from the origial list. For example if the umbers were moey i dollars, the deviatios would also be i dollars (though possibly egative), ad squared deviatios would be i squared dollars which is a difficult uit to iterpret. So, oce we have foud the average of the squared deviatios, we must the take the square root to get back to the origial uits. This motivates the defiitio of the stadard deviatio of the list. The stadard deviatio (SD) is the root mea square of the deviatios from average. Read that defiitio backwards, ad you ll see that it s a formula for how to calculate the SD. Here is the defiitio usig otatio. Defiitio 2 Stadard Deviatio s = the stadard deviatio = the umber of values x i = each value i the list x = the mea of the list SD = s = 1 (x i x) 2 Example 2: Studets Scores A class of 18 studets took a maths test.their scores are as below 82 63 81 95 79 90 80 75 64 74 88 72 87 77 82 78 89 84 Work out the stadard deviatio of studets scores. Solutio: 1. Calculate the Mea x i x = (82 + 63 + 81 +... + 89 + 84) = 18 = 1440 18 x = 80 2. Calculate the Average Squared Distace from the Mea

10 CHAPTER 2. DEVIATIONS For each value, subtract the mea ad square the result. We the fid the average of all these squared differeces: 1 (x i x) 2 1 = 1 18 ((82 80)2 + (63 80) 2 + (81 80) 2 +... + (89 80) 2 + (84 80) 2 ) = 1228 18 3. Fially, take the square root: 1228 s = 18 = 8.260 We say that the etries i the list are aroud 80, give or take about 8.3. Later i this chapter we will see precisely what that statemet meas. 2.2 Variace The stadard deviatio is the root mea square (r.m.s.) of deviatios from average. The quatity iside the square root is the mea square of deviatios from average ad is kow as the variace of the list. Variace has uits that are hard to uderstad, as we have see. But it has excellet mathematical properties. So if you wat to fid a SD, ofte a good move is to first fid the variace ad the take the square root. Defiitio 3 variace The variace of the list x 1,x 2,...,x is s 2 = 1 (x i x) 2 Calculatig the variace based o its formal defiitio ivolves a great deal of computatio which must be carried out with a calculator or computer. I this sectio, we ll develop a formula that allows us to compute variace much faster. Start with the formal defiitio of variace, expad the square iside the sum, ad the collect

2.3. QUESTIONS 11 terms. s 2 = 1 = 1 = 1 = 1 = 1 = 1 s 2 = 1 (x i x) 2 (x 2 i 2x i x + x 2 ) x 2 i 1 x i 2 2 x 2x i x + 1 x i + x 2 x 2 i 2 x 2 + x 2 x 2 i x 2 x 2 i x 2 x 2 Defiitio 4 Computatioal Formula for Variace Variace(x 1,x 2,...,x ) = 1 x 2 i x 2 (2.1) Thus the variace is the average of the squares, mius the square of the average. This formula shows that give ay two of x, s 2, ad x i 2 ; we ca always figure out the third oe. This turs out to be useful whe combiig datasets. Liear Trasformatios A "liear trasformatio" is a fuctio of the form y = ax +b. Its graph is a straight lie. Liear trasformatios arise aturally whe we chage uits of measuremet. For example, legths i cetimeters are 2.5 times legths i iches. Temperatures i degrees Fahreheit are (9/5) times temperature i degrees Celsius, plus 32 degrees. So it is useful to uderstad how averages ad SDs behave uder liear trasformatios of the variable. Let the dataset {x 1,x 2,...,x } have average x ad SD s x. For costats a ad b such that a! = 0, let y i = ax i + b for all i. Defiitio 5 Average ad SD of a Liear Trasformatio ȳ = a x + b ad s y = a s x. The steps of the proof are outlied i Exercise 2. 2.3 Questios 1. Cosider a list of umbers x = {x 1,x 2,...,x }

12 CHAPTER 2. DEVIATIONS a) If all the etries i x are the same, the what is the variace of this list? b) Suppose some proportio p of the umbers i the list are 1 ad the remaiig 1 p proportio of the umbers are 0. For istace, if the list had 10 umbers ad p = 0.4, the 4 of the umbers would be 1 ad the remaiig 6 would be 0. Show that the stadard deviatio of the list is p(1 p). 2. Suppose we have a list x = {x 1,x 2,...,x } ad costats a ad b. Let µ be the mea of the list, ad σ the stadard deviatio. I what follows, we will be creatig ew lists by usig x, a, ad b. The otatio y = f (x) meas that y i = f (x i ) for each i such that 1 i. a) What is the stadard deviatio of y = ax, i terms of a, σ, ad µ? b) What is the stadard deviatio of y = x + b, i terms of b, σ, ad µ? c) What is the stadard deviatio of y = ax + b, i terms of a, b, σ, ad µ? 3. Suppose we have a class cosistig of studets. This class has two sectios, A ad B. Sectio A has m studets ad sectio B has m studets. I the two parts below, you will fid the computatioal formula for variace to be quite useful. a) Let = 100 ad suppose Sectio A had 70 studets. Sectio A s studets have a average score of 60 with a stadard deviatio of 10. Sectio B s studets have a average score of 89 with a stadard deviatio of 6. Fid the mea ad stadard deviatio of studet scores across the etire class. You do ot have to simplify the arithmetic. b) Suppose that sectio A has studets ad B has m studets. The average of sectio A is µ A ad the stadard deviatio is σ A. For Sectio B, the average ad stadard deviatio are µ B ad σ B. Fid the mea ad stadard deviatio of studet scores across the etire class, i terms of, m, µ A, µ B, σ A, ad σ B. 4. Let {x 1,x 2,...,x } be a list of umbers with mea µ ad stadard deviatio σ. True or false (if true, prove it; if false, explai why): σ 2 = 1 x i (x i µ) 5. A populatio cosists of me ad wome (yes, the same umber of each). The heights of the me have a average of µ m ad a SD of σ m. The heights of the wome have a average of µ w ad a SD of σ w. Fid a formula for the SD of the heights of all 2 people, i terms of µ m, µ w, σ m, ad σ w. 6. A list x cosists oly of 0 s ad 1 s. A proportio p of the etries have the value 1 ad the remaiig proportio (1 p) have the value 0. Let a ad b be two costats with b > a. Cosider the list y defied by y = (b a)x + a. This meas that each etry of y is created by first multiplyig the correspodig etry of x by (b a) ad the addig a to the result. a) What are the values i the list y, ad what are their proportios? b) Fid the simplest formula you ca for the average of the list y i terms of a, b, ad p. c) Fid the simplest formula you ca for the SD of the list y i terms of a, b, ad p.

Chapter 3 Bouds 3.1 Markov s Iequality As data scietists, oe questio that we have to be able to aswer is, "If we kow the average of a dataset, what iformatio are we gaiig about that dataset?" I this sectio, we are goig to see what we ca say about a dataset if all we kow is its average. Is half of a dataset above average? For example, suppose you kow that you have scored above the average o a test. Does that mea you are i the top half of scores o the test? Not ecessarily, as we ca see i a simple example with just four studets i a class. Suppose the scores are 10, 70, 80, ad 90. The the average is 62.5, ad 75% of the list is above average. What proportio of the data are far above average? Now suppose you have a set of rocks whose average weight is 2 pouds. Based o this iformatio, what ca use say about the proportio of rocks that weigh 10 pouds or more? Of course you ca t say what the proportio is exactly, because you do t have eough iformatio. But it is atural to thik that the proportio ca t be large, sice 10 pouds is bigger tha the average 2 pouds. While it is ot possible to say exactly what the proportio is, or eve approximately, it turs out that it is possible to say that it ca t be very large. I fact, a famous iequality due to the Russia mathematicia Adrey Markov (1856-1922) says that the proportio ca be o bigger tha 1/5. Here is how it works. Markov s Boud A boud is a upper or lower limit o how large a value ca be. A lower boud is a lower limit; the value ca be o less tha that. A upper boud is a upper limit; the value ca be o more tha tha that. Markov s boud says that if the data are o-egative, the for ay positive umber k, the proportio of the data that are at least as large as k times the average ca be o more tha 1/k. Thus Markov provides a upper boud o the proportio. We will prove the boud later i the sectio. For ow, assume it is true ad apply it to our list of weights of rocks. 13

14 CHAPTER 3. BOUNDS The data are weights, which are o-egative. So Markov s iequality applies. The average weight is 2 pouds, ad we are lookig at the proportio that are 10 pouds or more. That is, we are lookig at the proportio that are at least as large as 5 times the average. Markov s boud is that the proportio ca be o bigger tha 1/5. What proportio have weights greater tha 23 pouds? To use Markov s boud, ote that 23 pouds is 23/2 = 11.5 times the average. Thus Markov s boud says that the proportio of rocks that weigh more tha 23 pouds ca be o more tha 1/11.5. Here is a detail to ote. The proportio of rocks that weigh more tha 23 pouds is less tha the proportio that weigh 23 pouds or more, because the secod set icludes those that weigh exactly 23 pouds as well. Markov gives a upper boud o the proportio i the secod set. So it is also a upper boud o the first. Aother detail: What does Markov say about the proportio that is bigger tha half the average? Plug i k = 1/2 to see that Markov s boud is 2. I other words, the boud says that the proportio of data that are greater tha half the average is o more tha 2. While that is correct, it is also completely useless. Ay proportio is o more tha 1. We do t eed a calculatio to tell us that it ca be o more tha 2. The lesso is the Markov s boud is ot useful for small k, ad especially for k < 1. It is oly iterestig whe you are lookig at data that are quite a bit larger tha average. Markov s Iequality: Formal Statemet Suppose that a list of o-egative umbers x 1,x 2,...,x has average x. Markov s Iequality gives a upper boud o the proportio of etries that are greater tha some positive iteger c: For all positive values c, the proportio of etries that are at least as large as c ca be o more tha x/c. Defiitio 6 Markov s Iequality For ay list of o-egative umbers with mea x, Proportio(x c) x c This is what Markov s Iequality looks like graphically:

3.1. MARKOV S INEQUALITY 15 The graph shows the distributio of the data. Notice that the horizotal axis starts at 0; the data are o-egative. The shaded area is the proportio of etries that are greater tha or equal to c. Markov s Iequality tells us that this area is at most x c. Relatio to our origial statemet of Markov s boud. For a list of o-egative umbers, what ca you say about the proportio of etries that are at least 10 times the mea? Our calculatio usig Markov s boud would say that the proportio ca be o more tha 1/10. To see that this also follows from the formal statemet, let x deote the average of the list. We are lookig for the proportio of etries greater tha 10 x. x Applyig Markov s Iequality with c = 10 x, we get a boud of 10 x = 10 1. Therefore, at most oe-teth of all etries i the list are greater tha te times the mea, which is exactly what we got by our old calculatio. Proof To prove the statemet, we will start by writig the proportio as a cout divided by : Proportio(x c) = #{i : x i c} (3.1) The set {i : x i c} cosists of of all the etries that are greater tha or equal to c. The # sig couts the umber of items i that set, givig us the total umber of etries that are at least c. That cout divided by the umber of total etries gives us the proportio of etries that are at least c. Let x 1,x 2,...,x be o-egative umbers with average x, ad c > 0. We have to show that #{i : x i c} Ready? Here we go. Step 1. We will start by splittig the sum of all the etries ito two pieces: the sum of all the etries that are less tha c, ad the sum of all the etries that are at least c. Remember that the x c

16 CHAPTER 3. BOUNDS sum of all the etries i the dataset is x. x = i:x i <c x i = x i + Step 2. I the first sum, all the etries are at least 0, sice the dataset is o-egative. I the secod sum, all the etries are at least c. So ow our calculatio becomes: x = i:x i <c x i i:x i c = x i + i:x i <c i:x i c 0 + c i:x i c x i x i Step 3. Almost doe! The first sum above is 0. The secod sum is just the costat c multiplied by the umber of terms i the sum. The umber of terms is the umber of idices i for which x i c. I other words, the umber of terms is the umber of data poits that are at least c. x = i:x i <c x i = x i + i:x i <c i:x i c 0 + c = c i:x i c i:x i c = #{i : x i c} c x i Step 4. Fially, divide both sides by ad the by c. You re doe! This is the same as what we are tryig to prove: x c #{i : x i c} #{i : x i c} x c (3.2) (3.3)

3.2. CHEBYCHEV S INEQUALITY 17 3.2 Chebychev s Iequality Markov s iequality gave us a way to boud the tail o-egative distributio, usig oly the mea. "Tails" of lists are sets of etries that start far away from the ceter ad go out eve further. The stadard deviatio of a list measures spread aroud the mea. Could we tighte our boud o the tail ay more if we also kew the SD of the list? The Weatherma Cosider a weatherma i Norther Alaska iterested i examiig temperatures, where temperatures are cold ad stay that way. Suppose that after some ivestigatio, we fid that the average temperature x is -25 C, ad that the SD of the temperatures s is 5 C. Norther Alaskas prefer temperatures betwee -15 C ad -35 C, ad we d like to figure out a way to measure the proportio of days which lie i this zoe. Ituitively, it makes sese that we re less likely to see temperatures further away from the mea (as the mea is a measure of cetrality). Furthermore, we would expect that the smaller the stadard deviatio is, the less likely we are to see temperatures that are far away, sice a small stadard deviatio idicates closeess to the mea. As we saw with Markov s boud, there s o way without lookig at the umbers to calculate the exact proportio, but we ca boud the proportio of days with temperatures betwee -15C ad -35 C. For ispiratio, we look to Markov s metor, Putafy Chebychev 1, whose theorem claims that the proportio of days which do ot fall betwee -15 ad -35 is at most 1/4. Equivaletly, at least 3/4 fall withi the rage. Here is how this works. Chebychev s Boud Chebychev s iequality states that the proportio of etries which are at least k stadard deviatios away from the mea is at most 1. Here k is ay positive umber, ad eed ot be a iteger. k 2 I our weather example, we were lookig for items outside of -15 ad -35. Both of these are 2 stadard deviatios away from our mea, -25 ( 15 ( 25) 5 = 2 ad 35 ( 25) 5 = 2). Thus, the proportio of temperatures which are ot i the rage ( 35, 15) is at most 1 = 1 2 2 4. It is importat to ote that Chebychev s iequality works for all datasets, ot just o-egative datasets like Markov s iequality. Also ote that just as we observed with Markov s iequality, small values of k do t lead to iterestig bouds. For example, Chebychev s iequality says that the proportio of etries that are at least half a SD away from the mea is at most 1/(1/2) 2 = 4. Sice we already kow that the proportio is at most 1, the iequality is t tellig us aythig. Chebychev s boud is iterestig for tails, that is, etries that are far away from the mea. That is, Chebychev s boud is iterestig whe k is large. 1 Chebychev is a trascriptio from Russia: you may see it as Chebyshev, Chebysheff, Chebyshov, Tchebychev, Tchebycheff,Tschebyschev, Tschebyschef, or Tschebyscheff

18 CHAPTER 3. BOUNDS Defiitio 7 Chebychev s Iequality Suppose the list x 1,x 2,...,x has average x ad SD s. Let k be ay positive umber. The the proportio of etries that are at least k SDs away from the mea is at most 1/k 2. That is, Proportio{i : x i x ks} 1 k 2 Proportio{i : x i is outside x ± ks} 1 k 2 We will prove the boud after makig a few observatios about its use. First, ote that i order to fid Chebychev s boud, you eed both the mea ad the SD, whereas to use Markov s boud you just eed the mea of a o-egative list. Whe both iequalities apply, Chebychev ofte provides tighter bouds tha Markov. But Chebychev s boud requires more iformatio tha Markov s. Also ote that the coditio x i x ks is equivalet to x i x k s The quatity x i x s is ofte deoted z i, ad measures "how may SDs above average" the value x i is. If z i is egative, the x i is a egative umber of SDs above average; that meas it is below average. If z i is 0 the x i is exactly at the average. The umber z i is called x i i stadard uits, or the z-score of x i. Proof of Chebychev s Boud: The oly boud we kow so far is Markov s. It says that for a list of o-egative umbers, Proportio(i : x i c) x c How ca we use this to establish Chebychev s boud for all lists? Let s begi by rewritig the proportio i Chebychev s boud: Proportio(i: x i is outside x± ks) = Proportio(i: x i x ks ) = Proportio(i: (x i x) 2 k 2 s 2 ) This works because x i x is a o-egative umber, ad therefore, by squarig both sides, x i x ks is equivalet to (x i x) 2 k 2 s 2. The list (x 1 x) 2,(x 2 x) 2,...,(x x) 2 is the list of squared deviatios from mea, so its average is the variace s 2 (look up the defiitio of variace ad you ll see that this is true). Also, the list of squared deviatios is o-egative, so Markov s iequality applies to it. By applyig Markov s iequality to the list of squared deviatios, we get

3.3. QUESTIONS 19 = Proportio(i: (x i x) 2 k 2 s 2 ) s2 k 2 s 2 = 1 k 2 That s Chebychev s boud. The importace of Chebychev s boud is that it applies to all datasets. Thus for example we ca say that o matter what the list looks like, the proportio of etries that are at least 3 SDs away from the mea is at most 1/9. The proportio that are at least 4 SDs away from the mea is at most 1/16, ad so o. I other words, o matter what the list, the bulk of the etries lie i the rage "average plus or mius a few SDs." That is the power of Chebychev. 3.3 Questios 1. Suppose a list of umbers x = {x 1,...,x } has mea µ x ad stadard deviatio σ x. We say that a umber y is withi z stadard deviatios of the mea if µ x zσ x < y < µ x + zσ x. a) Let c be smallest umber of stadard deviatios away from µ x we must go to esure the rage (µ x cσ x,µ x + cσ x ) cotais at least 50% of the data i x. What is c? b) Suppose that a BART ride from Berkeley to Sa Fracisco takes a mea time of 38 miutes with a stadard deviatio of 4 miutes. If you wat to make the claim At least 90% of BART rides from Berkeley to Sa Fracisco take betwee ad miutes, what umbers should be used to fill i the blaks? 2. At a elemetary school, 45 childre are raisig moey for charity. The teacher has 20 cady bars, ad has promised to give oe cady bar to each child who raises $5 or more. The average amout raised by the childre is $2. Does the teacher have eough cady bars to keep her promise? Why or why ot? 3. A list of icomes has mea $75,000 ad SD $25,000. Give the best upper boud you ca for the proportio of icomes that are more tha $150,000. 4. A list of icomes has a average of $60,000 ad a SD of $40,000. Let p be the proportio of icomes that are over $200,000. a) What, if aythig, does Markov s iequality say about p? b) What, if aythig, does Chebychev s iequality say about p? c) Is either of the aswers to parts (a) ad (b) more iformative about p tha the other? Explai your aswer. 5. A list of test scores has a average of 55 ad ad SD of 10. What ca you say about the proportio of scores that are i the iterval (30,80)? 6. A list of test scores has a average of 55 ad a SD of 10. What ca you say about the proportio of scores i the iterval (25,95)? 7. A class of 58 studets takes a true-false quiz cosistig of 20 questios. Each aswer will get a score of 1 if it is correct ad 1 otherwise; o other score is possible. The GSIs keep track of the umber of aswers each studet gets correct. The average of these 58 umbers is 16.1 ad the SD is 2.3.

20 CHAPTER 3. BOUNDS I each of the followig parts, fid the quatity if it is possible to do so with the iformatio give. If it is ot possible, explai why ot. a) the average umber of aswers that were aythig other tha correc b) the SD of the umber of aswers that were aythig other tha correct c) the average score o the test d) the SD of scores o the test

Chapter 4 Probability Probability theory is a disciplie rooted deeply i the real world ad i mathematics. We use probabilities ad statistics to represet itegral parts of our lives, as diverse as the chace of rai o the weather app, battig averages for our local baseball teams, or the success rate of a medical treatmet. Through the laguage of statistics, we ca cocisely describe a situatio ad make predictios about what s to come. By buildig o the basic structures of probability laid out i this chapter, we will be able to uderstad how these probabilities combie. Uderstadig probability will make us better equipped to calculate likelihoods ad make decisios without takig uecessary risks. A ote to the reader: the examples i this text have bee desiged with the itetio that readers follow alog by doig the calculatios. Please do t just read the text like a ovel. Thaks! 4.1 Probability We ca use probability to measure the likelihood of a evet occurrig. Suppose a experimet ca result i exactly oe of several possible outcomes. I data sciece, we will almost ivariably be lookig at a fiite set of possible outcomes. I what follows, you ca just assume that the set of all possible outcomes is fiite. Oe way to defie the probability of a evet is as a proportio of the umber of favorable outcomes relative to the total umber of outcomes. This defiitio makes sese oly uder the assumptio that all outcomes are equally likely. For example, if you are rollig a six-sided die ad believe that all sides have the same chace of beig rolled, the the set of all possible outcomes is {1,2,3,4,5,6} ad the chace of the evet "the umber of spots is a multiple of 3" is #{3,6} #{1,2,3,4,5,6} = 2 6 = 1 3 (The word favorable i this cotext refers to the evet you are studyig, ad is ot ecessarily a "good" evet.) Probability is always betwee 0 (correspodig to a impossible evet) ad 1 (correspodig to a certai evet). I probability theory it is stadard to deote evets by the "early" letters of the alphabet, such as A, B, ad so o. 21

22 CHAPTER 4. PROBABILITY Defiitio 8 Probability of A, assumig equally likely outcomes P (A) = Number of outcomes favorable to A Total umber of outcomes Example 1. You are drawig a item out of a box. I the box there is oe gree teis ball, oe orage soccer ball, ad two white golf balls. The box cotais othig else. You are equally likely to pick ay of the balls. What s the probability that: 1. You pick a orage ball? 2. You pick a golf ball? 3. You pick a ball? 4. You pick a golf ball that is red? Solutio Favorable Outcomes Probability = Total Outcomes. 1. There is oe orage ball i the box. So there is oly oe favorable outcome out of the four total possible outcomes. Therefore, the probability of pickig a orage ball = 1 4 2. Now, there are two favorable outcomes as there are two golf balls. Thus, the probability of pickig a golf ball is 2 4 = 1 2 3. We kow that there are oly balls i the box. Therefore, pickig a ball is a certai evet. So, the probability of pickig a ball = 1. 4. There is o golf ball i the box that is red. Therefore, pickig a red golf ball is a impossible evet. Thus, the probability = 0. Partitioig evets Whe computig probabilities, it is atural to break evets up ito simpler evets ad the combie the probabilities of the simpler evets. Example 2. suppose the distributio of age (measured i completed years) i a populatio is as follows: Age 20-34 35-49 50-64 65-79 80-100 % of People 20 20 30 20 10 Suppose oe perso is picked at radom. What is the chace that the perso is a seior citize (age 65 or older)? Note o termiology: I this text, the term "at radom" will mea "all outcomes are equally likely". I our example, a "outcome" is a perso. We re assumig that all the people are equally likely to be picked. Uder this assumptio it s quite atural to say that the aswer is 30%, as that s the percet of seior citizes i the populatio from which the draw is made. That s correct, ad we will ow break dow the argumet ito fier detail. The evet "the perso picked is a seior citize" partitios ito two simpler evets: the perso s age is either i the rage 65-79 or 80-100. I a partitio, oly oe of the evets ca occur. Whe

4.1. PROBABILITY 23 age is measured i completed years, a perso ca t be i both age groups 65-70 ad 80-100. We say that the two evets i a partitio are "mutually exclusive". Each excludes the other. Now "seior citize" partitios ito "age 65-79 or 80-100". The chace of pickig a seior citize is the sum of the chaces of the two groups: 20% + 10% = 30%. Defiitio 9 Additio Rule. P (A or B) = P (A) + P (B) if A ad B are mutually exclusive Evets that Satisfy Multiple Coditios Example 3. Suppose you draw two times at radom without replacemet from a box that cotais oe ticket each of the colors Red, Blue, ad Gree. What is the chace that you get the Blue ticket first, ad the the Red? Solutio. "At radom without replacemet" meas that all tickets are equally likely to be draw, ad oce you have draw a ticket, you do t replace it i the box before you draw the ext oe. Uder these assumptios the possible pairs you ca draw are RB, RG, BG, BR, GB, ad GR. You ca t get the same color twice. The outcome we wat is BR. So the chace is 1/6. Easy eough. But oe agai, the aswer merits further examiatio. The chace of gettig the Blue ticket o the first draw is 1/3. So if you imagie ruig this experimet over ad over agai, the Blue ticket will appear o the first draw about 1/3 of the time. Amog those times, the Red ticket will appear o the ext draw about 1/2 the time. So the chace of BR ca be thought of as 1 2 of 1 3 = 1 3 1 2 = 1 6 Thus the probability that two evets both occur (that is, B o the first draw ad R o the secod) is a fractio of a fractio. The more coditios you place o a evet, the smaller its chace becomes. Defiitio 10 Multiplicatio Rule. P (A ad B) = P (A) P (B give that A has happeed) Example 4. What s the probability that you get a head followed by a tail whe you flip two cois? You ca assume the cois are fair. Solutio The chace of gettig a head o the first toss is 1/2. Sice the outcome of the first toss does t affect outcomes for the secod (a atural assumptio, that turs out to be fie i practice), the chace that the secod toss is a tail is 1/2 o matter how the first toss came out. So the aswer is 1 2 1 2 = 1 4 You ca also solve this problem by eumeratig all the outcomes: #{HT } #{HH,HT,T H,T T } = 1 4

24 CHAPTER 4. PROBABILITY Example 5. Solutio What s the probability that you get a head ad a tail whe you flip two cois? Notice the differece betwee this example ad Example 3. I this oe, the order i which the two faces appear is t specified. So the evet icludes them appearig i ay order. So the evet "a head ad a tail" partitios ito "HT or TH", which is a partitio because the two cois ca t show HT as well as TH o the same pair of tosses. So P (a head ad a tail) = 1 4 + 1 4 = 1 2 4.2 Examples: Samplig with Replacemet Suppose you have a fiite populatio from which you sample repeatedly. We will defie radom samplig to mea that at each stage, every elemet has the same chace of beig selected. Formally, this is sometimes called samplig uiformly at radom. Whe you sample repeatedly, you have to specify whether or ot the tickets that you have already draw out cotiue to be part of the populatio. If they do, you are samplig with replacemet. A commo way to visualize this is to imagie each member of the populatio beig represeted by oe ticket i a box. Whe samplig with replacemet, you shuffle all the tickets ad draw oe, the replace i the box ad repeat the process. Oe example where this is a good model is rollig a fair 6-sided die. A atural assumptio is that if the die is rolled oce ad the outcome is a 5, the ext roll ca still yield ay of the umbers 1 through 6 with equal probability. Hece rollig a die is like samplig at radom with replacemet from {1,2,3,4,5,6}. Example 1. Rollig A Die. A fair 6-sided die has umbers from 1 to 6. Each time it is rolled, the outcome will be a umber from 1 to 6. The probability of gettig ay of the six umbers is the same, which is 1/6. No roll affects the outcome of ay other roll. (i) Suppose the die is rolled oce. What is the probability of rollig a 1 ad a 2? (ii) If the die is rolled oce, what is the probability of rollig a 1 or a 2? (iii) If the die is rolled twice, what is the probability of rollig a 1 o the first roll ad a 2 o the secod roll? Solutio (i) The chace of gettig both 1 ad 2 o the same roll is 0 sice the outcome could oly be oe of the two umbers. (ii) The chace of gettig either 1 or 2 o the same roll is #{2,6} #{1,2,3,4,5,6} = 2 6 Aother way to solve this is to ote that "the roll shows 1" ad "the roll shows 2" are mutually exclusive, so by the additio rule, the chace that the roll shows 1 or 2 is 1 6 + 1 6 = 2 6

4.2. EXAMPLES: SAMPLING WITH REPLACEMENT 25 (iii) By the multiplicatio rule, the aswer is the chace of gettig a 1 o the first roll times the chace of gettig a 2 o the secod roll give that 1 appeared o the first roll. Sice o roll affects ay other, both chaces are 1/6. So the aswer is 1/36. Example 2. A die is rolled 3 times. What is the probability that the face 1 ever appears i ay of the rolls? Solutio Let s break the questio ito simpler problems. What is the chace that 1 does ot appear i a sigle roll? The possible faces that ca appear i a sigle roll, excludig 1, are 2,3,4,5, ad 6. Therefore, the probability of ot gettig 1 i a sigle roll of die = 5 6 Sice we are rollig a die, the chace of ot gettig 1 is the same o each subsequet roll. Sice we wat "ot 1" to occur o each of the three rolls, the aswer will be "a fractio of a fractio of a fractio" by the multiplicatio rule: The probability that 1 does ot appear i ay of 3 rolls = 6 5 5 6 6 5 = ( 5 6 )3 1 st roll 2 d roll 3 rd roll Example 3. A die is rolled times. What is the chace that oly faces 2,4 or 6 appear? Solutio The chace that either 2,4, or 6 appears i a sigle roll = 3 6 Sice we are rollig a die, the chace that either 2,4, or 6 appears i a sigle roll is the same i subsequet rolls. Therefore, chace that oly 2,4, or 6 appear i rolls = ( 3 6 ) = ( 1 2 ) Example 4. faces? A die is rolled two times. What is the probability that the two rolls had differet Solutio. To uderstad the problem, we ca thik i the followig way: The first roll ca be ay of 1,2,3,4,5 or 6. Hece, we will accept ay face for the first roll sice all faces are favorable. I the secod roll, the face should be aythig but first roll ad thus, it ca be ay of five differet faces. Probability of gettig ay of the six faces i the first roll = 6 6 = 1 O the secod roll: Probability of gettig ay face but the face that appeared o the first roll = 5 6 Probability that the two rolls had differet faces = 6 6 5 6 = 5 6 Example 5. There are 20 studets i a class. A computer program selects a radom sample of studets by drawig 5 studets at radom with replacemet. What is the chace that a particular studet is amog the 5 selected studets? Solutio. Sice it is difficult to eumerate every possible case that icludes a particular studet, we look at its complemet ad see if it is simpler to work with.

26 CHAPTER 4. PROBABILITY Because we are samplig with replacemet, the probability that the studet is selected o ay particular draw is ot affected by what happeed o other draws. So: The probability that a particular studet is ot selected i a sigle draw = ( 20 1 20 ) = 19 20 The probability that a particular studet is ot selected i all five draws (which is the etire sample) = ( 20 19)5 The probability of a particular studet gettig selected i the sample = 1 - Probability that the studet is ot selected i the sample = 1 ( 19 20 )5 Geeralizatio: Total umber of studets = N Sample size = Probability that a particular studet is ot selected = ( N 1 N ) = (1 1 N ) Probability of a particular studet gettig selected = 1 - probability that a particular studet is ot selected = 1 (1 1 N ) 4.3 The Gambler s Rule So far, we ve oly applied probabilities to small games, fidig the chaces of evets occurrig i dice ad coi games with a small umber of evets. Now, we ll combie all the ideas preseted to examie the mechaics of a real world gamblig sceario. The Game. Say you are playig a game where N people put i a bet, ad oe perso is chose at radom to wi the whole pot. What is the chace that you will wi if you play oce? What is the chace that you will wi at least oce, if you play times? Usig the cocepts we have leared from probability with replacemet, we ca fid a good strategy about how we ca approach this game. Placig Bets We eed to state our assumptios. For what follows, N ad are itegers greater tha 1. The mai assumptio is that whe you are playig this gamblig game times, you have a chace of wiig N 1 each time you play, regardless of the outcomes of all other times. These are the oly assumptios eeded. If you play oce, P (you wi oe bet) = 1 N From this, we ca already coclude that the chace you will lose oe bet is 1 N 1 probability of your losig is the chace that you are ot able to wi. because the P (lose oe bet) = 1 1 N Kowig the probability we ca lose oe bet brigs us to a scary questio: What is the chace that you will lose times straight? I such situatios it s always a good idea to start out with a small fixed value of, ad the see if you ca geeralize. If = 2, we are tryig to fid the chace that you lose 2 bets. Two

4.3. THE GAMBLER S RULE 27 coditios have to be satisfied: you have to lose the first bet, the you have to lose the secod as well. Remember that bets remai uaffected by the results of other bets. So by the multiplicatio rule, the chace is P (lose 2 bets) = (1 1 N ) (1 1 N ) Now, fidig the chace that we lose times straight is simple. Just repeat the reasoig above. Because of our assumptios, we ca coclude that: P (lose all bets) = (1 1 N ) We ow have the chace of losig all bets. But the chace that we had origially set out to fid was the chace of wiig at least oe bet out of the bets. How do we go about fidig that? At this poit, may studets will be quite dumbfouded ad try to use some other facy probabilistic method ivolvig combiatios or what ot, but the aswer to this problem is simple. The complemet of losig all of the bets is wiig at least oe bet. That s all that s eeded! P (wi at least oe bet) = 1 (1 1 N ) How to Get a Fair Chace? Whe you flip a coi, you get a 50% chace of ladig heads ad a 50% chace of ladig tails. We say that this is a fair chace as there is o differece i chace betwee ladig either outcome. How may bets do you thik it will take to give you a fair chace of wiig at least oe out of the bets? Come up with a guess ad save it for the ed, whe we have solved the problem. You ll be able to see how your ituitio matches up with the aswer! We have to solve for the smallest for which the chace that you wi at least oe of the times is at least 1/2. Remember that N is fixed. Mathematically: This is the same as 1 (1 1 N ) 1 2 1 2 (1 1 N ) I order to isolate, let s take the logarithm of both sides. (Note that "logarithm" meas "atural logarithm" here; we wo t be takig logs to the base 10 at this level of math.) Sice the logarithm is a strictly icreasig fuctio, it preserves the iequality. log 1 2 log(1 1 N ) Now, to isolate, we have to divide both sides by log(1 N 1 ). Remember that we must flip the iequality because log(1 N 1 ) is egative! log 1 2 log(1 1 N )

28 CHAPTER 4. PROBABILITY That gives us our boud: log 1 2 log(1 1 N ) We ve ow come up with a solutio to our origial problem, although it does t really give us a good uderstadig of how large this value is. So, let s try to approximate it. Approximatig To fid a approximatio to the smallest that satisfies the coditio above, we ll start by examiig the log fuctio. Let s try to approximate the value of log(1 x) for small, positive x. We kow that the log of umbers close to 1 is close to zero (recall that log(1) = 0 ad log is a cotiuous fuctio). Let us draw a graph of the fuctio f (s) = log(s) alog with its taget lie at s = 1. Now, for a small positive x, plot ad label the three followig poits o this graph: A: ((1 x),0) B: ((1 x),log(1 x) C: (1,0) Do you otice somethig about these poits? They produce a triagle; ot just ay triagle, but approximately a 45-45-90 right triagle. That s because the derivative of the log fuctio at s = 0 is 1, so the taget lie is a 45 degree lie. The two legs of the right triagle are equal, ad oe of them is clearly equal to x. That s the distace betwee A ad C. Therefore, the other leg is also approximately x, ad we already kow that it s log(1 x). So log(1 x) x for small positive x