Discrete Mathematics and Probability Theory Spring 2013 Anant Sahai Lecture 18
EECS 70 Discrete Mathematics and Probability Theory, Spring 2013, Anant Sahai, Lecture 18

Inference

One of the major uses of probability is to provide a systematic framework to perform inference under uncertainty. A few specific applications are:

- communications: Information bits are sent over a noisy physical channel (wireless, DSL phone line, etc.). From the received symbols, one wants to make a decision about what bits were transmitted.
- control: A spacecraft needs to be landed on the moon. From noisy measurements by motion sensors, one wants to estimate the current position of the spacecraft relative to the moon surface so that appropriate controls can be applied.
- object recognition: From an image containing an object, one wants to recognize what type of object it is.
- speech recognition: From hearing noisy utterances, one wants to recognize what is being said.
- investing: By observing past performance of a stock, one wants to estimate its intrinsic quality and hence make a decision on whether and how much to invest in it.

All of the above problems can be modeled with the following ingredients:

- A random variable X representing the hidden quantity, not directly observed but in which one is interested. X can be the value of an information bit in a communication scenario, the position of the spacecraft in the control application, or the object class in the recognition problem.
- Random variables Y_1, Y_2, ..., Y_n representing the observations. They may be the outputs of a noisy channel at different times, pixel values of an image, values of the stocks on successive days, etc.
- The distribution of X, called the prior distribution. This can be interpreted as the knowledge about X before seeing the observations.
- The conditional distribution of Y_1, ..., Y_n given X. This models the noise or randomness in the observations.

Since the observations are noisy, there is in general no hope of knowing what the exact value of X is given the observations. Instead, all knowledge about X can be summarized by the conditional distribution of X given the observations. We don't know what the exact value of X is, but the conditional distribution tells us which values of X are more likely and which are less likely. Based on this information, intelligent decisions can be made.
Inference Example 1: Multi-armed Bandits

Question: You walk into a casino. There are several slot machines (bandits). You know some have odds very favorable to you, some have less favorable odds, and some have very poor odds. However, you don't know which are which. You start playing on some of them, and by observing the outcomes, you want to learn which is which so that you can intelligently figure out which machine to play on (or not play at all, which may be the most intelligent decision).

Stripped-down version: Suppose there are n biased coins. Coin i has probability p_i of coming up Heads; however, you don't know which is which. You randomly pick one coin and flip it. If the coin comes up Heads you win $1, and if it comes up Tails you lose $1. What is the probability of winning? What is the probability of winning on the next flip given you have observed a Heads with this coin? Given you have observed two Heads in a row, would you bet on the next flip?

Modeling using Random Variables

Let X be the coin randomly chosen, and Y_j be the indicator r.v. for the event that the jth flip of this randomly chosen coin comes up Heads. Since we don't know which coin we have chosen, X is the hidden quantity. The Y_j's are the observations.

Predicting the first flip

The first question asks for Pr[Y_1 = 1]. First we calculate the joint distribution of X and Y_1:

Pr[X = i, Y_1 = H] = Pr[X = i] Pr[Y_1 = H | X = i] = p_i / n.    (1)

[Note: We are abusing notation here by writing "Y_1 = H" rather than "Y_1 = 1" for the event that the first coin toss comes up Heads. We are doing this to make things clearer, even though strictly speaking a random variable should take on only real values.]

Summing the joint distribution over i, we get:

Pr[Y_1 = H] = ∑_{i=1}^n Pr[X = i, Y_1 = H] = (1/n) ∑_{i=1}^n p_i.    (2)

Note that combining the above two equations, we are in effect using the fact that:

Pr[Y_1 = H] = ∑_{i=1}^n Pr[X = i] Pr[Y_1 = H | X = i].    (3)

This is just the Total Probability Rule for events applied to random variables. Once you get familiar with this type of calculation, you can bypass the intermediate calculation of the joint distribution and directly write down equation (3).
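As a quick numerical sanity check (a sketch, not part of the original note), the total probability rule (3) can be evaluated directly. The coin biases below are the same illustrative values p_1 = 2/3, p_2 = 1/2, p_3 = 1/5 used in the numerical example later in this note:

```python
# Heads probabilities for n = 3 coins (example values from the note's illustration).
p = [2/3, 1/2, 1/5]
n = len(p)

# Total probability rule (3): Pr[Y1 = H] = sum_i Pr[X = i] * Pr[Y1 = H | X = i],
# with the uniform prior Pr[X = i] = 1/n for the randomly chosen coin.
pr_y1_heads = sum((1/n) * p_i for p_i in p)

print(pr_y1_heads)  # (2/3 + 1/2 + 1/5) / 3 = 41/90 ≈ 0.4556
```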
Predicting the second flip after observing the first

Now, given that we observed Y_1 = H, we have learned something about the randomly chosen coin X. This knowledge is captured by the conditional distribution

Pr[X = i | Y_1 = H] = Pr[X = i, Y_1 = H] / Pr[Y_1 = H] = p_i / ∑_{j=1}^n p_j,
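This posterior update is a one-liner in code. The sketch below (variable names are ours, coin biases are the note's example values) normalizes the joint probabilities from equation (1) by the evidence from equation (2):

```python
p = [2/3, 1/2, 1/5]               # Heads probability of each coin (example values)
n = len(p)

# Joint: Pr[X = i, Y1 = H] = p_i / n,  equation (1).
joint = [p_i / n for p_i in p]

# Evidence: Pr[Y1 = H] = (1/n) * sum_i p_i,  equation (2).
evidence = sum(joint)

# Posterior: Pr[X = i | Y1 = H] = p_i / sum_j p_j.  The 1/n prior cancels.
posterior = [j / evidence for j in joint]

print(posterior)  # more posterior weight on coins with larger p_i
```

Note how the uniform prior 1/n cancels in the ratio, matching the closed form p_i / ∑_j p_j.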
using eqs. (1) and (2). Note that when we substitute eq. (1) into the above equation, we are in effect using:

Pr[X = i | Y_1 = H] = Pr[X = i] Pr[Y_1 = H | X = i] / Pr[Y_1 = H].

This is just Bayes' rule for events applied to random variables. Just as for events, this rule has the interpretation of updating knowledge based on the observation: {(i, Pr[X = i]) : i = 1, ..., n} is the prior distribution of the hidden X; {(i, Pr[X = i | Y_1 = H]) : i = 1, ..., n} is the posterior distribution of X given the observation. Bayes' rule updates the prior distribution to yield the posterior distribution.

Now we can calculate the probability of winning using this same coin in the second flip:

Pr[Y_2 = H | Y_1 = H] = ∑_{i=1}^n Pr[X = i | Y_1 = H] Pr[Y_2 = H | X = i, Y_1 = H].    (4)

This can be interpreted as the total probability rule (3) but in a new probability space with all the probabilities under the additional condition Y_1 = H. You should try to verify this formula from first principles.

Now let us calculate the various probabilities on the right hand side of (4). The probability Pr[X = i | Y_1 = H] is just the posterior distribution of X given the observation, which we have already calculated above. What about the probability Pr[Y_2 = H | X = i, Y_1 = H]? There are two conditioning events: X = i and Y_1 = H. But here is the thing: once we know that the unknown coin is coin i, then knowing that the first flip is a Heads is redundant and provides no further statistical information about the outcome of the second flip: the probability of getting a Heads on the second flip is just p_i. In other words,

Pr[Y_2 = H | X = i, Y_1 = H] = Pr[Y_2 = H | X = i] = p_i.    (5)

The events Y_1 = H and Y_2 = H are said to be independent conditional on the event X = i. Since in fact Y_1 = a and Y_2 = b are independent given X = i for all a, b, i, we will say that the random variables Y_1 and Y_2 are independent given the random variable X.

Definition 18.1 (Conditional Independence): Two events A and B are said to be conditionally independent given a third event C if

Pr[A ∩ B | C] = Pr[A | C] · Pr[B | C].
Two random variables X and Y are said to be conditionally independent given a third random variable Z if for every a, b, c,

Pr[X = a, Y = b | Z = c] = Pr[X = a | Z = c] · Pr[Y = b | Z = c].

Going back to our coin example, note that the r.v.'s Y_1 and Y_2 are definitely not independent. Knowing the outcome of Y_1 tells us some information about the identity of the coin (X) and hence allows us to infer something about Y_2. However, if we already know X, then the outcomes of the different flips Y_1 and Y_2 are independent.

Now substituting (5) into (4), we get the probability of winning using this coin in the second flip:

Pr[Y_2 = H | Y_1 = H] = ∑_{i=1}^n Pr[X = i | Y_1 = H] Pr[Y_2 = H | X = i] = ∑_{i=1}^n p_i^2 / ∑_{i=1}^n p_i.

It can be shown (using the Cauchy-Schwarz inequality) that n ∑_i p_i^2 ≥ (∑_i p_i)^2, which implies that

Pr[Y_2 = H | Y_1 = H] = ∑_i p_i^2 / ∑_i p_i ≥ (1/n) ∑_i p_i = Pr[Y_1 = H].
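The same pattern extends to any number of observed Heads: after k Heads in a row the posterior puts weight p_i^k / ∑_j p_j^k on coin i, so the predicted probability of another Heads is ∑_i p_i^(k+1) / ∑_i p_i^k. The sketch below (the helper name is ours) checks numerically that each observed Heads can only raise this prediction, using the note's example biases; the first three values reproduce the 0.46, 0.54, 0.58 figures quoted in the numerical illustration:

```python
def pr_next_heads(p, k):
    """Pr[next flip = H | k Heads observed in a row]
       = sum_i p_i^(k+1) / sum_i p_i^k.
    For k = 0 the denominator is n, recovering the prior prediction (1/n) sum_i p_i."""
    num = sum(pi ** (k + 1) for pi in p)
    den = sum(pi ** k for pi in p)
    return num / den

p = [2/3, 1/2, 1/5]                       # example coin biases from the note
preds = [pr_next_heads(p, k) for k in range(4)]
print(preds)  # increasing sequence, approaching max(p) = 2/3 from below

# Monotonicity, a consequence of the Cauchy-Schwarz inequality applied to
# the vectors (p_i^(k/2)) and (p_i^(k/2 + 1)): each Heads raises the prediction.
assert all(a <= b for a, b in zip(preds, preds[1:]))
```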
Figure 1: The conditional distributions of X given no observations, 1 Heads, and 2 Heads.

Thus our observation of a Heads on the first flip increases the probability that the second toss is Heads. This, of course, is intuitively reasonable, because the posterior distribution puts larger weight on the coins with larger values of p_i.

Predicting the third flip after observing the first two

Using Bayes' rule and the total probability rule, we can compute the posterior distribution of X given that we observed two Heads in a row:

Pr[X = i | Y_1 = H, Y_2 = H]
= Pr[X = i] Pr[Y_1 = H, Y_2 = H | X = i] / Pr[Y_1 = H, Y_2 = H]
= Pr[X = i] Pr[Y_1 = H, Y_2 = H | X = i] / ∑_{j=1}^n Pr[X = j] Pr[Y_1 = H, Y_2 = H | X = j]
= Pr[X = i] Pr[Y_1 = H | X = i] Pr[Y_2 = H | X = i] / ∑_{j=1}^n Pr[X = j] Pr[Y_1 = H | X = j] Pr[Y_2 = H | X = j]
= p_i^2 / ∑_{j=1}^n p_j^2.

The probability of getting a win on the third flip using the same coin is then:

Pr[Y_3 = H | Y_1 = H, Y_2 = H]
= ∑_{i=1}^n Pr[X = i | Y_1 = H, Y_2 = H] Pr[Y_3 = H | X = i, Y_1 = H, Y_2 = H]
= ∑_{i=1}^n Pr[X = i | Y_1 = H, Y_2 = H] Pr[Y_3 = H | X = i]
= ∑_i p_i^3 / ∑_i p_i^2.

Again, it can be shown that ∑_i p_i^3 / ∑_i p_i^2 ≥ ∑_i p_i^2 / ∑_i p_i, so the probability of seeing another Heads on the next flip has again increased. If we continue this process further (conditioning on having seen more and more Heads), the probability of Heads on the next flip will keep increasing towards the limit p_max = max_i p_i.

As a numerical illustration, suppose n = 3 and the three coins have Heads probabilities p_1 = 2/3, p_2 = 1/2, p_3 = 1/5. The conditional distributions of X after observing no flip, one Heads and two Heads in a row
are shown in Figure 1. Note that as more Heads are observed, the conditional distribution is increasingly concentrated on coin 1 with p_1 = 2/3: we are increasingly certain that the coin chosen is the best coin. The corresponding probabilities of winning on the next flip after observing no flip, one Heads and two Heads in a row are 0.46, 0.54 and 0.58 respectively. The conditional probability of winning gets better and better (approaching 2/3 in the limit).

Inference Example 2: Communication over a Noisy Channel

Question: I have one bit of information that I want to communicate over a noisy channel. The noisy channel flips each one of my transmitted symbols independently with probability p < 0.5. How much improvement in performance do I get by repeating my transmission n times?

Comment: In an earlier lecture note, we also considered a communication problem and gave some examples of error-correcting codes. However, the models for the communication channel are different. There, we put a bound on the maximum number of flips the channel can make. Here, we do not put such bounds a priori but instead impose a bound on the probability that each bit is flipped (so that the expected number of bits flipped is np). Since there is no bound on the maximum number of flips the channel can make, there is no guarantee that the receiver will always decode correctly. Instead, one has to be satisfied with being able to decode correctly with high probability, e.g., probability of error < 0.01.

Figure 2: The system diagram for the communication problem.

Modeling

The situation is shown in Figure 2. Let X (= 0 or 1) be the value of the information bit I want to transmit. Assume that X is equally likely to be 0 or 1 (this is the prior). The received symbol on the ith repetition of X is

Y_i = X + Z_i mod 2,    i = 1, 2, ..., n,

with Z_i = 1 with probability p and Z_i = 0 with probability 1 - p. Note that Y_i is different from X if and only if Z_i = 1. Thus, the transmitted symbol is flipped with probability p. The Z_i's are assumed to be mutually independent across different repetitions of X and also independent of X.
The Z_i's can be interpreted as noise. Note that the received symbols Y_i's are not independent; they all contain information about the transmitted bit X. However, given X, they are (conditionally) independent, since then they only depend on the noise Z_i.
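To make the channel model concrete, here is a minimal simulation sketch (parameter values and variable names are illustrative, with p = 0.25 as used in the error analysis later): each channel use adds an independent Bernoulli(p) noise bit mod 2, so the empirical flip rate concentrates around p.

```python
import random

random.seed(0)                   # fixed seed so the run is reproducible
p, n_trials = 0.25, 100_000
x = 1                            # the transmitted bit (illustrative choice)

# Channel: Y_i = X + Z_i mod 2, with Z_i = 1 w.p. p, independent per use.
z = [1 if random.random() < p else 0 for _ in range(n_trials)]
y = [(x + zi) % 2 for zi in z]

flip_rate = sum(yi != x for yi in y) / n_trials
print(flip_rate)                 # close to p = 0.25
```

The received symbol differs from X exactly when the noise bit is 1, which is why the flip rate is just the empirical frequency of Z_i = 1.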
Decision rule

First, we have to figure out what decision rule to use at the receiver, i.e., given each of the 2^n possible received sequences, Y_1 = b_1, Y_2 = b_2, ..., Y_n = b_n, how should the receiver guess what value of X was transmitted? A natural rule is the maximum a posteriori (MAP) rule: guess the value a for which the conditional probability of X = a given the observations is the largest among all a. More explicitly:

a = { 0 if Pr[X = 0 | Y_1 = b_1, ..., Y_n = b_n] ≥ Pr[X = 1 | Y_1 = b_1, ..., Y_n = b_n]
    { 1 otherwise.    (6)

Now, let's reformulate this rule so that it looks cleaner. By Bayes' rule, we have

Pr[X = 0 | Y_1 = b_1, ..., Y_n = b_n]
= Pr[X = 0] Pr[Y_1 = b_1, ..., Y_n = b_n | X = 0] / Pr[Y_1 = b_1, ..., Y_n = b_n]
= Pr[X = 0] Pr[Y_1 = b_1 | X = 0] Pr[Y_2 = b_2 | X = 0] ··· Pr[Y_n = b_n | X = 0] / Pr[Y_1 = b_1, ..., Y_n = b_n].    (7)

In the second step, we are using the fact that the observations Y_i's are conditionally independent given X. (Why?) Similarly,

Pr[X = 1 | Y_1 = b_1, ..., Y_n = b_n]
= Pr[X = 1] Pr[Y_1 = b_1, ..., Y_n = b_n | X = 1] / Pr[Y_1 = b_1, ..., Y_n = b_n]    (8)
= Pr[X = 1] Pr[Y_1 = b_1 | X = 1] Pr[Y_2 = b_2 | X = 1] ··· Pr[Y_n = b_n | X = 1] / Pr[Y_1 = b_1, ..., Y_n = b_n].    (9)

An equivalent way of describing the MAP rule is that it computes the ratio of these conditional probabilities and checks whether it is greater than or less than 1. If it is greater than (or equal to) 1, then guess that a 0 was transmitted; otherwise guess that a 1 was transmitted. (This ratio indicates how likely a 0 is compared to a 1, and is called the likelihood ratio.) Dividing (7) by (9), and recalling that we are assuming Pr[X = 1] = Pr[X = 0], the likelihood ratio L is:

L = ∏_{i=1}^n Pr[Y_i = b_i | X = 0] / Pr[Y_i = b_i | X = 1].    (10)

Note that we didn't have to compute Pr[Y_1 = b_1, ..., Y_n = b_n], since it appears in both of the conditional probabilities and gets canceled out when computing the ratio. Now,

Pr[Y_i = b_i | X = 0] / Pr[Y_i = b_i | X = 1] = { p/(1-p) if b_i = 1
                                                { (1-p)/p if b_i = 0.

In other words, L has a factor of p/(1-p) < 1 for every 1 received and a factor of (1-p)/p > 1 for every 0 received. So the likelihood ratio L is greater than 1 if and only if the number of 0's is greater than the number of 1's.
Thus, the decision rule is simply a majority rule: guess that a 0 was transmitted if the number of 0's in the received sequence is at least as large as the number of 1's, otherwise guess that a 1 was transmitted. Note that in deriving this rule, we assumed that Pr[X = 0] = Pr[X = 1] = 0.5. When the prior distribution is not uniform, the MAP rule is no longer a simple majority rule. Exercise: derive the MAP rule in the general case.
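The equivalence between the likelihood-ratio form (10) of the MAP rule and the majority rule can be checked exhaustively on short sequences. In the sketch below (function names and the p = 0.25 value are ours), exact rational arithmetic is used so that tied sequences give L exactly equal to 1:

```python
from fractions import Fraction
from itertools import product

def map_decode(bits, p):
    """MAP rule under a uniform prior: compute the likelihood ratio
    L = prod_i Pr[Y_i = b_i | X = 0] / Pr[Y_i = b_i | X = 1]
    and guess 0 iff L >= 1 (equation (10) with the factors p/(1-p) and (1-p)/p)."""
    L = Fraction(1)
    for b in bits:
        L *= p / (1 - p) if b == 1 else (1 - p) / p
    return 0 if L >= 1 else 1

def majority_decode(bits):
    """Guess 0 if the number of 0's is at least as large as the number of 1's."""
    return 0 if bits.count(0) >= bits.count(1) else 1

p = Fraction(1, 4)                      # channel flip probability (example value)
for n in range(1, 6):
    for bits in product((0, 1), repeat=n):
        assert map_decode(bits, p) == majority_decode(bits)
print("MAP rule and majority rule agree on every sequence up to n = 5")
```

Using `Fraction` here is a deliberate choice: with floats, a balanced sequence can yield L = 0.999... instead of exactly 1, which would break the tie case.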
Error probability analysis

What is the probability that the guess is incorrect? This is just the probability of the event E that the number of flips by the noisy channel is greater than n/2. So the error probability of our majority rule is:

Pr[E] = Pr[∑_{i=1}^n Z_i > n/2] = ∑_{k > n/2} C(n, k) p^k (1-p)^(n-k),

recognizing that the random variable S := ∑_{i=1}^n Z_i has a binomial distribution with parameters n and p. This gives an expression for the error probability that can be numerically evaluated for given values of n. Given a target error probability of, say, 0.01, one can then compute the smallest number of repetitions n needed to achieve the target error probability.[1]

As in the hashing application we looked at earlier in the course, we are interested in a more explicit relationship between n and the error probability to get a better intuition of the problem. The above expression is too cumbersome for this purpose. Instead, notice that n/2 is greater than the mean np of S and hence the error event is related to the tail of the distribution of S. One can therefore apply Chebyshev's inequality to bound the error probability:

Pr[S > n/2] ≤ Pr[|S - np| ≥ n(1/2 - p)] ≤ Var(S) / (n(1/2 - p))^2 = p(1-p) / (n(1/2 - p)^2),

using the fact that Var(S) = ∑_{i=1}^n Var(Z_i) = np(1-p). The important thing to note is that the error probability decreases with n, so indeed by repeating more times the performance improves (as one would expect!). For a given target error probability of, say, 0.01, one needs to repeat no more than

n = 100 · p(1-p) / (1/2 - p)^2

times. For p = 0.25, this evaluates to n = 300.

Exercise: compare the bound with the actual error probability. You will see that the bound is rather pessimistic, and actually one can repeat many fewer times to get an error probability of 0.01. In an upper-division course such as CS 174 or EECS 126, you can learn about much better bounds on error probabilities like this.

[1] Needless to say, one does not want to repeat more times than is necessary, as we are using more time to transmit each information bit and the rate of communication is slowed down.
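The exercise can be started numerically. The sketch below (function names are ours) evaluates both the exact binomial tail and the Chebyshev bound at p = 0.25 and n = 300: the bound sits exactly at the 0.01 target there, while the exact error probability is many orders of magnitude smaller.

```python
from math import comb

def exact_error(n, p):
    """Pr[majority decoding fails] = Pr[S > n/2], with S ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

def chebyshev_bound(n, p):
    """The bound p(1-p) / (n (1/2 - p)^2) derived via Chebyshev's inequality."""
    return p * (1 - p) / (n * (0.5 - p) ** 2)

n, p = 300, 0.25
print(chebyshev_bound(n, p))    # 0.01, exactly the target
print(exact_error(n, p))        # far smaller: the bound is very pessimistic
```

A short search over n with `exact_error` would reveal how many fewer repetitions actually suffice for a 0.01 error probability.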
More informationApplication to Random Graphs
A Applicatio to Radom Graphs Brachig processes have a umber of iterestig ad importat applicatios. We shall cosider oe of the most famous of them, the Erdős-Réyi radom graph theory. 1 Defiitio A.1. Let
More informationMathematical Induction
Mathematical Iductio Itroductio Mathematical iductio, or just iductio, is a proof techique. Suppose that for every atural umber, P() is a statemet. We wish to show that all statemets P() are true. I a
More informationFall 2013 MTH431/531 Real analysis Section Notes
Fall 013 MTH431/531 Real aalysis Sectio 8.1-8. Notes Yi Su 013.11.1 1. Defiitio of uiform covergece. We look at a sequece of fuctios f (x) ad study the coverget property. Notice we have two parameters
More informationTopic 9: Sampling Distributions of Estimators
Topic 9: Samplig Distributios of Estimators Course 003, 2018 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be
More informationLecture 2 February 8, 2016
MIT 6.854/8.45: Advaced Algorithms Sprig 206 Prof. Akur Moitra Lecture 2 February 8, 206 Scribe: Calvi Huag, Lih V. Nguye I this lecture, we aalyze the problem of schedulig equal size tasks arrivig olie
More informationLecture 2: Concentration Bounds
CSE 52: Desig ad Aalysis of Algorithms I Sprig 206 Lecture 2: Cocetratio Bouds Lecturer: Shaya Oveis Ghara March 30th Scribe: Syuzaa Sargsya Disclaimer: These otes have ot bee subjected to the usual scrutiy
More informationf X (12) = Pr(X = 12) = Pr({(6, 6)}) = 1/36
Probability Distributios A Example With Dice If X is a radom variable o sample space S, the the probability that X takes o the value c is Similarly, Pr(X = c) = Pr({s S X(s) = c}) Pr(X c) = Pr({s S X(s)
More informationThe random coding argument: digital communication
EECS 70 Discrete Mathematics ad Probability Theory Sprig 2013 Aat Sahai Note 24 The radom codig argumet: digital commuicatio I terms of math, this ote is about a powerful ad surprisig use of probability
More informationLecture 2: Probability, Random Variables and Probability Distributions. GENOME 560, Spring 2017 Doug Fowler, GS
Lecture 2: Probability, Radom Variables ad Probability Distributios GENOME 560, Sprig 2017 Doug Fowler, GS (dfowler@uw.edu) 1 Course Aoucemets Problem Set 1 will be posted Due ext Thursday before class
More informationAMS570 Lecture Notes #2
AMS570 Lecture Notes # Review of Probability (cotiued) Probability distributios. () Biomial distributio Biomial Experimet: ) It cosists of trials ) Each trial results i of possible outcomes, S or F 3)
More informationLet us consider the following problem to warm up towards a more general statement.
Lecture 4: Sequeces with repetitios, distributig idetical objects amog distict parties, the biomial theorem, ad some properties of biomial coefficiets Refereces: Relevat parts of chapter 15 of the Math
More informationSTAT 350 Handout 19 Sampling Distribution, Central Limit Theorem (6.6)
STAT 350 Hadout 9 Samplig Distributio, Cetral Limit Theorem (6.6) A radom sample is a sequece of radom variables X, X 2,, X that are idepedet ad idetically distributed. o This property is ofte abbreviated
More informationStat 421-SP2012 Interval Estimation Section
Stat 41-SP01 Iterval Estimatio Sectio 11.1-11. We ow uderstad (Chapter 10) how to fid poit estimators of a ukow parameter. o However, a poit estimate does ot provide ay iformatio about the ucertaity (possible
More informationNUMERICAL METHODS FOR SOLVING EQUATIONS
Mathematics Revisio Guides Numerical Methods for Solvig Equatios Page 1 of 11 M.K. HOME TUITION Mathematics Revisio Guides Level: GCSE Higher Tier NUMERICAL METHODS FOR SOLVING EQUATIONS Versio:. Date:
More information6.883: Online Methods in Machine Learning Alexander Rakhlin
6.883: Olie Methods i Machie Learig Alexader Rakhli LECTURES 5 AND 6. THE EXPERTS SETTING. EXPONENTIAL WEIGHTS All the algorithms preseted so far halluciate the future values as radom draws ad the perform
More informationSeunghee Ye Ma 8: Week 5 Oct 28
Week 5 Summary I Sectio, we go over the Mea Value Theorem ad its applicatios. I Sectio 2, we will recap what we have covered so far this term. Topics Page Mea Value Theorem. Applicatios of the Mea Value
More informationLecture 7: Channel coding theorem for discrete-time continuous memoryless channel
Lecture 7: Chael codig theorem for discrete-time cotiuous memoryless chael Lectured by Dr. Saif K. Mohammed Scribed by Mirsad Čirkić Iformatio Theory for Wireless Commuicatio ITWC Sprig 202 Let us first
More informationSequences and Series of Functions
Chapter 6 Sequeces ad Series of Fuctios 6.1. Covergece of a Sequece of Fuctios Poitwise Covergece. Defiitio 6.1. Let, for each N, fuctio f : A R be defied. If, for each x A, the sequece (f (x)) coverges
More informationThis is an introductory course in Analysis of Variance and Design of Experiments.
1 Notes for M 384E, Wedesday, Jauary 21, 2009 (Please ote: I will ot pass out hard-copy class otes i future classes. If there are writte class otes, they will be posted o the web by the ight before class
More informationPRACTICE PROBLEMS FOR THE FINAL
PRACTICE PROBLEMS FOR THE FINAL Math 36Q Fall 25 Professor Hoh Below is a list of practice questios for the Fial Exam. I would suggest also goig over the practice problems ad exams for Exam ad Exam 2 to
More informationFACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING. Lectures
FACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING Lectures MODULE 5 STATISTICS II. Mea ad stadard error of sample data. Biomial distributio. Normal distributio 4. Samplig 5. Cofidece itervals
More information1 Statement of the Game
ANALYSIS OF THE CHOW-ROBBINS GAME JON LU May 10, 2016 Abstract Flip a coi repeatedly ad stop wheever you wat. Your payoff is the proportio of heads ad you wish to maximize this payoff i expectatio. I this
More informationAda Boost, Risk Bounds, Concentration Inequalities. 1 AdaBoost and Estimates of Conditional Probabilities
CS8B/Stat4B Sprig 008) Statistical Learig Theory Lecture: Ada Boost, Risk Bouds, Cocetratio Iequalities Lecturer: Peter Bartlett Scribe: Subhrasu Maji AdaBoost ad Estimates of Coditioal Probabilities We
More informationSECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES
SECTION 1.5 : SUMMATION NOTATION + WORK WITH SEQUENCES Read Sectio 1.5 (pages 5 9) Overview I Sectio 1.5 we lear to work with summatio otatio ad formulas. We will also itroduce a brief overview of sequeces,
More informationTopic 9: Sampling Distributions of Estimators
Topic 9: Samplig Distributios of Estimators Course 003, 2018 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be
More informationLet us give one more example of MLE. Example 3. The uniform distribution U[0, θ] on the interval [0, θ] has p.d.f.
Lecture 5 Let us give oe more example of MLE. Example 3. The uiform distributio U[0, ] o the iterval [0, ] has p.d.f. { 1 f(x =, 0 x, 0, otherwise The likelihood fuctio ϕ( = f(x i = 1 I(X 1,..., X [0,
More information( ) = p and P( i = b) = q.
MATH 540 Radom Walks Part 1 A radom walk X is special stochastic process that measures the height (or value) of a particle that radomly moves upward or dowward certai fixed amouts o each uit icremet of
More informationVector Quantization: a Limiting Case of EM
. Itroductio & defiitios Assume that you are give a data set X = { x j }, j { 2,,, }, of d -dimesioal vectors. The vector quatizatio (VQ) problem requires that we fid a set of prototype vectors Z = { z
More informationJanuary 25, 2017 INTRODUCTION TO MATHEMATICAL STATISTICS
Jauary 25, 207 INTRODUCTION TO MATHEMATICAL STATISTICS Abstract. A basic itroductio to statistics assumig kowledge of probability theory.. Probability I a typical udergraduate problem i probability, we
More informationRandom Variables, Sampling and Estimation
Chapter 1 Radom Variables, Samplig ad Estimatio 1.1 Itroductio This chapter will cover the most importat basic statistical theory you eed i order to uderstad the ecoometric material that will be comig
More informationCS / MCS 401 Homework 3 grader solutions
CS / MCS 401 Homework 3 grader solutios assigmet due July 6, 016 writte by Jāis Lazovskis maximum poits: 33 Some questios from CLRS. Questios marked with a asterisk were ot graded. 1 Use the defiitio of
More informationOPTIMAL ALGORITHMS -- SUPPLEMENTAL NOTES
OPTIMAL ALGORITHMS -- SUPPLEMENTAL NOTES Peter M. Maurer Why Hashig is θ(). As i biary search, hashig assumes that keys are stored i a array which is idexed by a iteger. However, hashig attempts to bypass
More information