Discrete Mathematics and Probability Theory Fall 2016 Walrand Probability: An Overview


CS 70 Discrete Mathematics and Probability Theory Fall 2016 Walrand
Probability: An Overview

Probability is a fascinating theory. It provides a precise, clean, and useful model of uncertainty. The successes of Probability Theory in Computer Science are remarkable: data science, machine learning, artificial intelligence, voice and image recognition, and communication theory are based on that theory. The objective of these notes is to introduce the key ideas of Probability Theory on simple examples. Hopefully, this overview will help you see the forest as you explore its different trees in the course.

1 Pick a Marble

Setup

Imagine a bag with 100 marbles that are identical, except for their color. Among those, 10 are blue, 20 are red, 30 are green, and 40 are white. You shake the bag and pick a marble without looking.

Probability

You will certainly agree that the odds that you picked a green marble are 30 out of 100. Similarly, the odds that you picked a blue marble are 10 out of 100. We say that the probability that the marble is green is 30/100 = 0.3. We write Pr[green] = 0.3.

Interpretation

What does this mean precisely? Well, this is not really that obvious. Two interpretations are useful.

The first interpretation is a subjective willingness to bet on the outcome. Imagine the following game of chance. You bet some amount and you get a fixed payout if the marble is green. How much are you willing to bet? I would be willing to bet about 30% of that payout.

The second interpretation is frequentist. It says that if you were to repeat this experiment (shaking the bag with 100 marbles and picking a marble without looking), you would pick a green marble about 30% of the time. Note that this is an interpretation at this point, not a theorem.

Additivity

Consider the event "the marble is blue or green". The odds of that event are 40/100. We write Pr[blue or green] = 0.4. Note that Pr[blue or green] = Pr[blue] + Pr[green]. This is not surprising since the number of marbles that are blue or green is the sum of the number of blue marbles plus the number of green marbles. We say that probability is additive.

Conditional Probability

Assume that the marble you picked is blue or green. What are the odds that it is blue or red? Well, since you picked one of the 40 marbles that are either blue or green, that marble is blue or red only if it is one of the 10 blue marbles. Since 10 out of the 40 blue or green marbles are blue, we see that the odds that you picked a blue or red marble, given that you picked a blue or green marble, are 10/40. We say that the conditional probability of "blue or red" given "blue or green" is 10/40. We write Pr[blue or red | blue or green] = 10/40.
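
The counting arguments used so far can be checked mechanically. The following Python sketch counts marbles to compute probabilities and conditional probabilities; the names `BAG`, `pr`, and `pr_given` are illustrative choices, not notation from the course.

```python
from fractions import Fraction

# Bag contents from the Setup paragraph: color -> number of marbles.
BAG = {"blue": 10, "red": 20, "green": 30, "white": 40}
TOTAL = sum(BAG.values())


def pr(event):
    """Probability that the picked marble's color belongs to the set `event`."""
    return Fraction(sum(BAG[color] for color in event), TOTAL)


def pr_given(event, condition):
    """Pr[event | condition], obtained by counting only marbles in `condition`."""
    return pr(set(event) & set(condition)) / pr(condition)


p_green = pr({"green"})                                   # 30/100
p_blue_or_green = pr({"blue", "green"})                   # 40/100
p_conditional = pr_given({"blue", "red"}, {"blue", "green"})  # 10/40

print(p_green, p_blue_or_green, p_conditional)
```

Exact fractions are used instead of floating-point numbers so that identities such as additivity hold with equality.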

Note that

\[ \Pr[\text{blue or red} \mid \text{blue or green}] = \frac{\Pr[(\text{blue or red}) \text{ and } (\text{blue or green})]}{\Pr[\text{blue or green}]} = \frac{\Pr[\text{blue}]}{\Pr[\text{blue or green}]}. \]

Bayes' Rule

Assume that we paint a black dot on half of the blue and half of the red marbles, and also on 20% of the green and 20% of the white marbles. You pick a marble at random and are told that the marble has a black dot. What are the odds that the marble is red? To answer this question, we note that there are 5 + 10 + 6 + 8 = 29 marbles with a black dot, out of which 10 are red. Thus, the answer is 10/29.

This calculation is an example of Bayes' Rule. The idea is that one specified Pr[black dot | blue] = 0.5, and similarly for the other colors. One also knows Pr[blue] = 0.1, and similarly for the other colors. The calculation determines Pr[red | black dot], which in a sense is the reverse of the specification. A similar calculation determines the likelihood of a disease (e.g., flu) given a symptom (e.g., fever). Here, the symptom is the black dot and the disease is the color of the marble.

Random Variable

Say that you get $8.00 if you pick a blue marble, $5.00 if it is red, $2.00 if it is green, and $2.00 if it is white. The amount you get is then a function of the color of the marble you picked. This function is fixed. Let us call the function X(·). Thus, X(blue) = 8 and X(white) = 2, and so on. We call X a random variable. Thus, we say that a random variable is a real-valued function of the outcome of a random experiment.

Here, the random experiment is choosing a marble. The outcome is the color of the marble. We have specified all the possible outcomes: blue, red, green, white. Also, we know the probability of each outcome. For instance, Pr[blue] = 0.1. The set of outcomes and their probability specifies the random experiment. The function X assigns a real number to each outcome. Note that the values assigned to different outcomes do not have to be different. Here, X(green) = X(white) = 2.

Distribution

Assume that we are interested only in how much you get, not in the details of the experiment that produces that gain.
In that case, we can describe X by saying that X = 8 with probability 0.1 (which is the probability you pick a blue marble), X = 5 with probability 0.2, and X = 2 with probability 0.7 (the probability that you pick a green or white marble). Thus, the possible values of X are 8, 5, 2 and their probabilities are 0.1, 0.2, 0.7, respectively. These values and their probabilities are called the distribution of the random variable X.

Expectation

Imagine that you repeat the experiment (shake, pick, collect X) a very large number N of times. The frequentist interpretation suggests that the fraction of the times that you collect 8 is 0.1, that you collect 5 is 0.2, and that you collect 2 is 0.7. Thus, you collect 8 about 0.1N times, 5 about 0.2N times, and 2 about 0.7N times. Hence, the total amount you collect over the N experiments is about

\[ 8 \times 0.1N + 5 \times 0.2N + 2 \times 0.7N = (8 \times 0.1 + 5 \times 0.2 + 2 \times 0.7)N. \]

Accordingly, the average amount you collect per experiment is

\[ 8 \times 0.1 + 5 \times 0.2 + 2 \times 0.7. \]

We call this value the expectation of X and we write it as E[X]. We also call E[X] the mean value or the expected value of X. Thus,

\[ E[X] = 8 \times 0.1 + 5 \times 0.2 + 2 \times 0.7 = 3.2. \]

That is, E[X] is the sum of the values of X multiplied by their probability.

Function

Would you rather play the game (pick a marble and get X) or get $3.20 without playing the game? The answer depends on a key factor that the economists call the utility that you have for money. To make the situation a bit more dramatic, say that you can either get $1.00 for sure, or play a game that pays a larger prize with probability 0.01 and $0.00 otherwise. What do you prefer? Many people tend to choose to play the game. In fact, many people play the California Lottery, where the odds of winning $100M are minuscule.

Let h(x) be the utility that you have for $x. Say that (this is a silly example, but it will illustrate a point) h(8) = 10 and h(5) = h(3.2) = h(2) = 0. For instance, say that with $8.00 you can buy a ticket to go see the latest Pokemon movie you crave, and that you cannot do anything of comparable value with less than $8.00. Then we find that, after playing the marble game, h(X) = 10 with probability 0.1 and h(X) = 0 with probability 0.9. Hence,

\[ E[h(X)] = 10 \times 0.1 + 0 \times 0.9 = 1. \]

On the other hand, if you don't play the game and get $3.20, then h(3.2) = 0. Thus, you would rather play the game. Similarly, people play the lottery because winning would change their life, presumably for the better, whereas losing $1.00 does not affect their life.

Note that we calculated E[h(X)] by finding the distribution (recall that this means the set of possible values and their probability) of h(X). We could have calculated E[h(X)] directly from the distribution of X:

\[ E[h(X)] = h(8)\Pr[X = 8] + h(5)\Pr[X = 5] + h(2)\Pr[X = 2] = 10 \times 0.1 + 0 \times 0.2 + 0 \times 0.7 = 1. \]

This is a simple observation, but it is convenient. In a similar way, we could have computed E[h(X)] by looking at the outcomes of the marble picking game:

\[ E[h(X)] = h(X(\text{blue}))\Pr[\text{blue}] + h(X(\text{red}))\Pr[\text{red}] + h(X(\text{green}))\Pr[\text{green}] + h(X(\text{white}))\Pr[\text{white}] = h(8) \times 0.1 + h(5) \times 0.2 + h(2) \times 0.3 + h(2) \times 0.4. \]

Indeed, these three different ways of calculating E[h(X)] correspond to different ways of summing the possible ways of getting the values of h(X): summing over the values of h(X), or the values of X, or the outcomes.

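
The three ways of computing E[h(X)] can be compared side by side in a short Python sketch. The marble probabilities, the payoffs X, and the utility h are taken from the text; the helper names `distribution` and `expectation` are assumptions made for illustration.

```python
from fractions import Fraction

# Outcome probabilities and payoffs from the marble example.
PR = {"blue": Fraction(1, 10), "red": Fraction(2, 10),
      "green": Fraction(3, 10), "white": Fraction(4, 10)}
X = {"blue": 8, "red": 5, "green": 2, "white": 2}


def h(x):
    """The (deliberately silly) utility from the text: only $8.00 is worth anything."""
    return 10 if x == 8 else 0


def distribution(f):
    """Distribution of f(outcome): value -> total probability of that value."""
    dist = {}
    for color, p in PR.items():
        value = f(color)
        dist[value] = dist.get(value, 0) + p
    return dist


def expectation(dist):
    """Sum of the values multiplied by their probabilities."""
    return sum(value * p for value, p in dist.items())


dist_x = distribution(lambda c: X[c])        # {8: 1/10, 5: 1/5, 2: 7/10}
e_x = expectation(dist_x)                    # 16/5, i.e. 3.2

# Way 1: sum over the values of h(X).
way1 = expectation(distribution(lambda c: h(X[c])))
# Way 2: sum over the values of X.
way2 = sum(h(x) * p for x, p in dist_x.items())
# Way 3: sum over the outcomes (colors).
way3 = sum(h(X[c]) * p for c, p in PR.items())

print(e_x, way1, way2, way3)
```

All three sums regroup the same four terms, which is why they agree.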
Variance

We saw that one can describe a random variable X by its distribution. A summary of that distribution is the mean value E[X]. However, our discussion of the utility shows that this description is a bit crude and may not suffice to decide whether to play a game of chance. For instance, the expected gain of playing the lottery is negative. You would not play a game where you are certain to lose.

The mean value does not say anything about the uncertainty of X, i.e., its variability. Here, by variability we mean that if we play the game many times, we observe a variety of values of X. The variance is a one-number summary of variability. The variance of X is defined by

\[ \mathrm{var}[X] = E[(X - E[X])^2]. \]

The intuition is that if X is almost always close to E[X], then the variance is small; otherwise, it is large.

In our marble example, E[X] = 3.2. Since X = 8, 5, or 2 with probability 0.1, 0.2, 0.7, respectively, we see that

\[ \mathrm{var}[X] = E[(X - E[X])^2] = (8 - 3.2)^2 \Pr[X = 8] + (5 - 3.2)^2 \Pr[X = 5] + (2 - 3.2)^2 \Pr[X = 2] \]
\[ = (8 - 3.2)^2 \times 0.1 + (5 - 3.2)^2 \times 0.2 + (2 - 3.2)^2 \times 0.7 = 2.304 + 0.648 + 1.008 = 3.96. \]

The square root of the variance is called the standard deviation and we denote it by \(\sigma_X\). Here, \(\sigma_X = \sqrt{3.96} \approx 1.99\).
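
As a check on the arithmetic, the variance and standard deviation of X can be computed directly from its distribution. This is a minimal sketch; the function names `mean` and `variance` are illustrative.

```python
import math
from fractions import Fraction

# Distribution of X from the marble example: value -> probability.
DIST_X = {8: Fraction(1, 10), 5: Fraction(2, 10), 2: Fraction(7, 10)}


def mean(dist):
    """E[X]: the values of X weighted by their probabilities."""
    return sum(value * p for value, p in dist.items())


def variance(dist):
    """var[X] = E[(X - E[X])^2], computed term by term."""
    m = mean(dist)
    return sum((value - m) ** 2 * p for value, p in dist.items())


var_x = variance(DIST_X)        # 99/25, i.e. 3.96
sigma_x = math.sqrt(var_x)      # about 1.99

print(float(var_x), round(sigma_x, 2))
```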


[Figure 1: Linear Regression of Y over X (brown)]

[Figure 2: Quadratic Regression of Y over X (purple)]

Linear Regression

Consider once again our bag of marbles. Define another random variable Y by Y(blue) = 1, Y(red) = 1, Y(green) = 3, and Y(white) = 4. Thus, each outcome (i.e., color) is assigned two numbers: X and Y. In another context, each person is associated with a height and a weight. Say that you want to guess the weight of a person from his/her height. How do you do it? Here, we want to guess Y from the value of X.

Here, a picture helps. Figure 1 shows the values of X and Y associated with the four possible outcomes. For instance, the blue outcome is associated with X(blue) = 8 and Y(blue) = 1. The figure also shows the probability of the different outcomes.

We want a simple formula to provide a guess of Y based on X. In fact, we want a formula of the form \(\hat{Y} = a + bX\). Here, \(\hat{Y}\) is our guess for Y based on the value of X. Also, a and b are some constants. This formula corresponds to the line shown in the figure.

We choose a and b so that the guess \(\hat{Y}\) tends to be close to Y. This means that the line should be close to the actual points (X, Y) in the figure. Thus, \(|\hat{Y} - Y|\) should be small. We make this precise by requiring that \(E[(\hat{Y} - Y)^2]\) be as small as possible. That is, we choose a and b to minimize

\[ E[(\hat{Y} - Y)^2] = E[(a + bX - Y)^2]. \]

We explain in the lectures that the best choice of a and b is such that

\[ \hat{Y} = E[Y] + \frac{\mathrm{cov}(X,Y)}{\mathrm{var}[X]} (X - E[X]), \]

where \(\mathrm{cov}(X,Y) = E[XY] - E[X]E[Y]\).

Quadratic Regression

In the previous section, we estimated Y by using a linear function a + bX of X, as shown in Figure 1. Figure 2 suggests that a quadratic estimate \(c + dX + eX^2\) is better than a linear estimate, i.e., that it is closer to the pairs (X, Y). In the lectures, we explain how to find the best values of c, d, e.
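
The linear regression formula can be evaluated on the marble example. The sketch below uses the values of X and Y as stated in the text and computes the constants a and b, together with the mean squared error of the resulting guess; the helper `expect` and the variable names are assumptions made for illustration.

```python
from fractions import Fraction

# Outcome probabilities and the two random variables from the marble example.
PR = {"blue": Fraction(1, 10), "red": Fraction(2, 10),
      "green": Fraction(3, 10), "white": Fraction(4, 10)}
X = {"blue": 8, "red": 5, "green": 2, "white": 2}
Y = {"blue": 1, "red": 1, "green": 3, "white": 4}


def expect(f):
    """E[f(outcome)], summing over the four colors."""
    return sum(f(color) * p for color, p in PR.items())


e_x = expect(lambda c: X[c])
e_y = expect(lambda c: Y[c])
var_x = expect(lambda c: (X[c] - e_x) ** 2)
cov_xy = expect(lambda c: X[c] * Y[c]) - e_x * e_y

# Y_hat = E[Y] + cov(X, Y) / var[X] * (X - E[X]) = a + b X
b = cov_xy / var_x          # -6/11
a = e_y - b * e_x           # 50/11


def y_hat(x):
    """Linear guess of Y from the value x of X."""
    return a + b * x


# Mean squared error E[(Y_hat - Y)^2] of the linear guess.
mse = expect(lambda c: (y_hat(X[c]) - Y[c]) ** 2)   # 21/55

print(a, b, mse)
```

The negative slope reflects that, in this example, the outcomes with the larger payoff X have the smaller value of Y.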


[Figure 3: Conditional Expectation of Y given X (green)]

Conditional Expectation

What if we could choose any function of X instead of being limited to linear or quadratic functions? Figure 3 shows the best possible function g(X) of X to estimate Y. We explain in the lectures how to calculate that function, called the conditional expectation of Y given X.

2 Flip Coins

So far, we looked at one or two random variables. In this section, we explore many random variables.

Setup

You have a coin. When you flip it, there are two possible outcomes: heads (H) and tails (T). Let p = Pr[H], so that \(\Pr[T] = 1 - p\). For instance, the coin could be biased with p = 0.6, so that heads is more likely than tails.

Independence

Say that you flip the coin twice. There are four possible outcomes for this experiment: HH, HT, TH, and TT. Here, HT means that the first flip produces H and the second T, and similarly for the other outcomes. If we recall the definition of conditional probability, we have

\[ \Pr[\text{first flip yields } H \mid \text{second flip yields } H] = \frac{\Pr[(\text{first flip yields } H) \text{ and } (\text{second flip yields } H)]}{\Pr[\text{second flip yields } H]} = \frac{\Pr[HH]}{p}. \]

In the last step, we used the fact that the probability that the second flip yields H is p. Now, it is reasonable to assume that the likelihood that the first flip yields H does not depend on the fact that the second flip yields H, and that this likelihood is then p. Hence, we are led to the conclusion that \(p = \Pr[HH]/p\), so that \(\Pr[HH] = p^2\). This assumption is called the independence of the coin flips. Similar reasoning leads to the conclusion that

\[ \Pr[HT] = p(1 - p), \quad \Pr[TH] = (1 - p)p, \quad \Pr[TT] = (1 - p)^2. \]

Let X = 1 when the first flip is H and X = 0 when it is T. Also, let Y = 1 when the second flip is H and Y = 0 when it is T. Then we see that Pr[X = 1] = Pr[Y = 1] = p and Pr[X = 1, Y = 1] = Pr[X = 1]Pr[Y = 1]. Also, Pr[X = 1, Y = 0] = Pr[X = 1]Pr[Y = 0]. More generally,

\[ \Pr[X = a, Y = b] = \Pr[X = a]\Pr[Y = b] \quad \text{for all } a, b. \]

Two random variables with that property are said to be independent.
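
The product property that defines independence can be tested on any table of joint probabilities. In the Python sketch below, `two_flips` is the table derived above for two flips with p = 0.6, while `copied_flip` is a hypothetical contrast case (not from the notes) in which Y always equals X; the function names are illustrative.

```python
from fractions import Fraction


def marginals(joint):
    """Marginal distributions of X and Y from a table (x, y) -> probability."""
    px, py = {}, {}
    for (x, y), pr in joint.items():
        px[x] = px.get(x, 0) + pr
        py[y] = py.get(y, 0) + pr
    return px, py


def is_independent(joint):
    """True when Pr[X = a, Y = b] = Pr[X = a] Pr[Y = b] for all a, b."""
    px, py = marginals(joint)
    return all(joint.get((a, b), 0) == px[a] * py[b] for a in px for b in py)


p = Fraction(3, 5)   # the biased coin with p = 0.6
q = 1 - p

# Joint table for two flips: Pr[HH] = p^2, Pr[HT] = p(1 - p), and so on.
two_flips = {(1, 1): p * p, (1, 0): p * q, (0, 1): q * p, (0, 0): q * q}

# Hypothetical contrast: the "second flip" just copies the first one.
copied_flip = {(1, 1): p, (0, 0): q}

print(is_independent(two_flips), is_independent(copied_flip))
```

In the copied case, Pr[X = 1, Y = 1] = p while Pr[X = 1]Pr[Y = 1] = p², so the product property fails whenever p is not 0 or 1.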

Variance of Sum

Let X and Y be independent random variables. We show in the lectures that var[X + Y] = var[X] + var[Y]. More generally, if \(X_1, \ldots, X_n\) are random variables such that any two of them are independent, then

\[ \mathrm{var}[X_1 + \cdots + X_n] = \mathrm{var}[X_1] + \cdots + \mathrm{var}[X_n]. \]

Moreover, we will see that \(\mathrm{var}[aX] = a^2 \mathrm{var}[X]\) for any random variable X and any constant a. Consequently, we see that

\[ \mathrm{var}\left[\frac{X_1 + \cdots + X_n}{n}\right] = \frac{\mathrm{var}[X_1] + \cdots + \mathrm{var}[X_n]}{n^2}. \]

In particular, if \(\mathrm{var}[X_m] = \sigma^2\) for \(m = 1, \ldots, n\), we have

\[ \mathrm{var}\left[\frac{X_1 + \cdots + X_n}{n}\right] = \frac{\mathrm{var}[X_1] + \cdots + \mathrm{var}[X_n]}{n^2} = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}. \]

Chebyshev's Inequality

Flip a coin n times and let \(X_m = 1\) if flip m yields H and \(X_m = 0\) otherwise. Then

\[ \mathrm{var}[X_m] = E[(X_m - E[X_m])^2] = E[(X_m - p)^2] = (1 - p)^2 \Pr[X_m = 1] + (0 - p)^2 \Pr[X_m = 0] = (1 - p)^2 p + p^2 (1 - p) = p(1 - p). \]

Accordingly, in view of the previous section,

\[ \mathrm{var}\left[\frac{X_1 + \cdots + X_n}{n}\right] = \frac{p(1 - p)}{n}. \]

Thus, when n is large, the variance of \(A_n := (X_1 + \cdots + X_n)/n\) is very small. This suggests that the random variable \(A_n\) tends to be very close to its mean value, which happens to be p. Thus, we expect the fraction of heads \(A_n\) in n coin flips to be close to p.

To make this idea precise, Chebyshev developed an inequality which says that

\[ \Pr[|X - E[X]| \ge \varepsilon] \le \frac{\mathrm{var}[X]}{\varepsilon^2}. \]

We prove this inequality in the lectures. Thus, the likelihood that a random variable X differs from its mean E[X] by at least ε is small if var[X] is small. If we apply this inequality to \(A_n\), we find that

\[ \Pr[|A_n - p| \ge \varepsilon] \le \frac{p(1 - p)}{n\varepsilon^2}. \]

Note that \(p(1 - p) \le 1/4\) for any value of p. Consequently, we see that

\[ \Pr[|A_n - p| \ge \varepsilon] \le \frac{1}{4n\varepsilon^2}. \]

Confidence Interval

Say that you do not know the value of p = Pr[H]. To estimate it, you flip the coin n times and note the fraction \(A_n\) of heads. The last inequality holds. Let us choose ε so that the right-hand side of the inequality is 0.05 = 1/20. That is, we choose ε so that \(4n\varepsilon^2 = 20\), i.e., \(\varepsilon^2 = 5/n\) or \(\varepsilon = \sqrt{5/n} \approx 2.25/\sqrt{n}\). Hence, the previous inequality with that value of ε implies that

\[ \Pr\left[|A_n - p| \ge \frac{2.25}{\sqrt{n}}\right] \le 0.05, \quad \text{so that} \quad \Pr\left[|A_n - p| \le \frac{2.25}{\sqrt{n}}\right] \ge 95\%. \]

Now, since \(|A_n - p| \le \delta\) if and only if \(p \in [A_n - \delta, A_n + \delta]\), we conclude that

\[ \Pr\left[p \in \left[A_n - \frac{2.25}{\sqrt{n}}, A_n + \frac{2.25}{\sqrt{n}}\right]\right] \ge 95\%. \]

For instance, say that \(n = 10^4\) and \(A_n = 0.31\). We then conclude that

\[ \Pr\left[p \in \left[0.31 - \frac{2.25}{100}, 0.31 + \frac{2.25}{100}\right]\right] \ge 95\%, \quad \text{so that} \quad \Pr[p \in [0.2875, 0.3325]] \ge 95\%. \]

We say that [0.2875, 0.3325] is a 95%-confidence interval for p. As you can see, the width of the confidence interval decreases like \(1/\sqrt{n}\). This example is the basis for the estimates in public opinion surveys.

Time until first H

We flip the coin until we get the first H. How many times do we need to flip the coin, on average? Let β be that average number of flips. That number of flips is 1 if the first flip is H, which occurs with probability p. If the first flip is T, which occurs with probability \(1 - p\), then the process starts afresh and we need to flip the coin β more times, on average. Thus,

\[ \beta = p \times 1 + (1 - p)(1 + \beta). \]

Solving, we find β = 1/p.

Time until two consecutive Hs

We flip the coin until we get two consecutive Hs. How many times do we need to flip the coin, on average? Let β be that average number of flips. Let also β(H) be the average number of additional flips until two consecutive Hs, given that the last flip is H. Then we claim that

\[ \beta = p(1 + \beta(H)) + (1 - p)(1 + \beta), \]
\[ \beta(H) = p \times 1 + (1 - p)(1 + \beta). \]

The first identity can be seen by noting that if the first flip is H, then after that first flip one needs β(H) additional flips, on average, since the last flip was H. However, if the first flip is T, then after the first flip one needs β additional flips, on average. The second identity can be justified similarly. Solving, one finds

\[ \beta = \frac{1}{p} + \frac{1}{p^2}. \]
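
The two identities for β and β(H) form a pair of linear equations, and the solving step can be carried out explicitly. The Python sketch below substitutes the second identity into the first and then checks that the result satisfies both identities; the function names are illustrative, and exact fractions are used so that the checks are equalities.

```python
from fractions import Fraction


def expected_flips_first_head(p):
    """Solution of beta = p * 1 + (1 - p) * (1 + beta), i.e. beta = 1 / p."""
    return 1 / p


def expected_flips_two_heads(p):
    """Solve the pair of identities for (beta, beta_H).

    The second identity simplifies to beta_H = 1 + (1 - p) * beta.
    Substituting into the first gives beta * p**2 = 1 + p.
    """
    beta = (1 + p) / (p * p)
    beta_h = 1 + (1 - p) * beta
    return beta, beta_h


def satisfies_identities(p, beta, beta_h):
    """Plug a candidate solution back into both identities from the text."""
    first = beta == p * (1 + beta_h) + (1 - p) * (1 + beta)
    second = beta_h == p * 1 + (1 - p) * (1 + beta)
    return first and second


fair = Fraction(1, 2)
beta, beta_h = expected_flips_two_heads(fair)

print(expected_flips_first_head(fair), beta, beta_h)
```

For a fair coin this gives 2 flips on average until the first H and 6 flips until two consecutive Hs, in agreement with 1/p and 1/p + 1/p².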
