Probability and MLE.


1 Probability and MLE

2 (brief) intro to probability

3 Basic notations
Random variable: refers to an element / event whose status is unknown, e.g. A = "it will rain tomorrow"
Domain (usually denoted by Ω): the set of values a random variable can take:
- A = "The stock market will go up this year": Binary
- A = "Number of Steelers wins in 2015": Discrete
- A = "% change in Google stock in 2015": Continuous

4 Axioms of probability (Kolmogorov's axioms)
A variety of useful facts can be derived from just three axioms:
1. $0 \le P(A) \le 1$
2. $P(\text{true}) = 1$, $P(\text{false}) = 0$
3. $P(A \lor B) = P(A) + P(B) - P(A \land B)$
There have been several other attempts to provide a foundation for probability theory. Kolmogorov's axioms are the most widely used.
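The third axiom can be checked by direct counting on a small finite sample space. The sketch below assumes a fair six-sided die and two illustrative events; the die, the events and the helper `prob` are not from the slides.

```python
from fractions import Fraction

# Hypothetical sample space: one roll of a fair six-sided die.
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event (a subset of omega) under the uniform measure."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}   # "the roll is even"
B = {4, 5, 6}   # "the roll is greater than 3"

lhs = prob(A | B)                        # P(A or B)
rhs = prob(A) + prob(B) - prob(A & B)    # P(A) + P(B) - P(A and B)
print(lhs, rhs)
```

Exact fractions are used so that both sides can be compared without rounding error.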

5 Priors
Degree of belief in an event in the absence of any other information
P(rain tomorrow) = 0.2
P(no rain tomorrow) = 0.8
[Figure: chart split into "No rain" and "Rain"]

6 Conditional probability
P(A = 1 | B = 1): the fraction of cases where A is true if B is true
[Figure: P(A) = 0.2, P(A | B) = 0.5]

7 Conditional probability
In some cases, given knowledge of one or more random variables, we can improve upon our prior belief of another random variable
For example:
p(slept in movie) = 0.5
p(slept in movie | liked movie) = 1/4
p(didn't sleep in movie | liked movie) = 3/4
[Figure: overlapping regions "Slept" and "Liked"]

8 Joint distributions
The probability that a set of random variables will take a specific value is their joint distribution.
Notation: P(A ∧ B) or P(A,B)
Example: P(liked movie, slept)
If we assume independence then P(A,B) = P(A)P(B)
However, in many cases such an assumption may be too strong (more later in the class)

9 Joint distribution (cont)
P(class size > 20) = 0.6
P(summer) = 0.4
P(class size > 20, summer) = ?
Evaluation of classes:
Size  Time  Eval
30    R     2
70    R     1
12    S     2
8     S     3
56    R     1
24    S     2
10    S     3
23    R     3
9     R     2
45    R     1

10 Joint distribution (cont)
P(class size > 20) = 0.6
P(summer) = 0.4
P(class size > 20, summer) = 0.1
(In the evaluation table, only the summer class with 24 students satisfies both conditions: 1 of 10 records.)

11 Joint distribution (cont)
P(class size > 20) = 0.6
P(eval = 1) = 0.3
P(class size > 20, eval = 1) = 0.3
(In the evaluation table, all three classes with eval = 1, of sizes 70, 56 and 45, have more than 20 students.)
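The probabilities on these slides can be reproduced by counting rows of the evaluation table. This is a minimal sketch: the helper `prob` is a hypothetical name, and reading the time code "S" as summer follows the slide, while "R" is only assumed to mean a regular term.

```python
# Evaluation table from the slides: (size, time, eval).
# "S" = summer; "R" is assumed to mean a regular (non-summer) term.
records = [
    (30, "R", 2), (70, "R", 1), (12, "S", 2), (8, "S", 3), (56, "R", 1),
    (24, "S", 2), (10, "S", 3), (23, "R", 3), (9, "R", 2), (45, "R", 1),
]

def prob(predicate):
    """Fraction of records for which predicate(size, time, eval) is true."""
    return sum(1 for r in records if predicate(*r)) / len(records)

p_large = prob(lambda size, time, ev: size > 20)
p_summer = prob(lambda size, time, ev: time == "S")
p_eval1 = prob(lambda size, time, ev: ev == 1)
p_large_summer = prob(lambda size, time, ev: size > 20 and time == "S")
p_large_eval1 = prob(lambda size, time, ev: size > 20 and ev == 1)

print(p_large, p_summer, p_large_summer)   # marginals and joint
print(p_large * p_summer)                  # value the joint would take under independence
```

The joint P(class size > 20, summer) differs from the product of the marginals, so class size and term are not independent in this table.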


13 Chain rule
The joint distribution can be specified in terms of conditional probability:
P(A,B) = P(A | B) * P(B)
Together with Bayes rule (which is actually derived from it) this is one of the most powerful rules in probabilistic reasoning

14 Bayes rule
One of the most important rules for this class.
Derived from the chain rule: P(A,B) = P(A | B)P(B) = P(B | A)P(A)
Thus,
$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$
Thomas Bayes was an English clergyman who set out his theory of probability in 1764.
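As a quick sanity check, the chain rule and Bayes rule can be applied to the class-size numbers quoted earlier in the deck (P(class size > 20) = 0.6, P(summer) = 0.4, joint = 0.1). The variable names below are illustrative only.

```python
from fractions import Fraction

p_large = Fraction(6, 10)    # P(class size > 20)
p_summer = Fraction(4, 10)   # P(summer)
p_joint = Fraction(1, 10)    # P(class size > 20, summer)

# Chain rule rearranged: P(A | B) = P(A, B) / P(B)
p_large_given_summer = p_joint / p_summer
p_summer_given_large = p_joint / p_large

# Bayes rule: P(summer | large) = P(large | summer) P(summer) / P(large)
bayes = p_large_given_summer * p_summer / p_large

print(p_large_given_summer, p_summer_given_large, bayes)
```

Both routes to P(summer | class size > 20) agree, which is what the derivation from the chain rule predicts.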

15 Bayes rule (cont)
Often it would be useful to derive the rule a bit further:
$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} = \frac{P(B \mid A)\, P(A)}{\sum_{A} P(B \mid A)\, P(A)}$
This results from: $P(B) = \sum_{A} P(B, A)$
[Figure: B split into the regions P(B, A=1) and P(B, A=0)]


16 Recall: Your first consulting job
A billionaire from the suburbs of Seattle asks you a question:
- He says: I have a coin, if I flip it, what's the probability it will fall with the head up?
- You say: Please flip it a few times.
- You say: The probability is 3/5, because that is the frequency of heads in all flips.
- He says: But can I put money on this estimate?
- You say: Ummm... maybe not. Not enough flips (less than sample complexity).


17 What about prior knowledge?
Billionaire says: Wait, I know that the coin is "close" to 50-50. What can you do for me now?
You say: I can learn it the Bayesian way
Rather than estimating a single θ, we obtain a distribution over possible values of θ
[Figure: distribution over θ before data and after data]

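18 Bayesian Learning
Use Bayes rule: $P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$
Or equivalently: $P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$ (posterior ∝ likelihood × prior)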

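19 AIDS test (Bayes rule)
Data:
- Approximately 0.1% are infected
- Test detects all infections
- Test reports positive for 1% healthy people
Probability of having AIDS if test is positive:
$P(a = 1 \mid t = 1) = \frac{P(t = 1 \mid a = 1)\, P(a = 1)}{P(t = 1 \mid a = 1)\, P(a = 1) + P(t = 1 \mid a = 0)\, P(a = 0)} = \frac{1 \cdot 0.001}{1 \cdot 0.001 + 0.01 \cdot 0.999} \approx 0.091$
Only 9%!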
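The 9% figure follows from Bayes rule with the denominator expanded over the two cases (infected, healthy). A minimal sketch with the slide's numbers follows; the function and parameter names are assumptions made for illustration.

```python
def posterior_positive(prior, sensitivity, false_positive_rate):
    """P(infected | positive test), using Bayes rule with the expanded denominator."""
    evidence = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / evidence

# Slide data: 0.1% infected, test detects all infections, 1% false positives.
p = posterior_positive(prior=0.001, sensitivity=1.0, false_positive_rate=0.01)
print(f"{p:.3f}")
```

The posterior is small because healthy people vastly outnumber infected ones, so even a 1% false positive rate produces about ten times more false positives than true positives.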


21 Prior distribution
From where do we get the prior?
- Represents expert knowledge (philosophical approach)
- Simple posterior form (engineer's approach)
Uninformative priors:
- Uniform distribution
Conjugate priors:
- Closed-form representation of posterior
- P(θ) and P(θ | D) have the same algebraic form as a function of θ

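22 Conjugate Prior
P(θ) and P(θ | D) have the same form as a function of θ
Eg. 1 Coin flip problem
Likelihood given Bernoulli model: $P(D \mid \theta) = \theta^{\alpha_H} (1 - \theta)^{\alpha_T}$, where α_H and α_T are the observed numbers of heads and tails
If prior is Beta distribution, $P(\theta) = \frac{\theta^{\beta_H - 1} (1 - \theta)^{\beta_T - 1}}{B(\beta_H, \beta_T)} \sim \text{Beta}(\beta_H, \beta_T)$
Then posterior is Beta distribution: $P(\theta \mid D) \sim \text{Beta}(\beta_H + \alpha_H, \beta_T + \alpha_T)$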

23 Beta distribution
Beta(β_H, β_T): more concentrated as values of β_H, β_T increase

24 Beta conjugate prior
As n = α_H + α_T increases: as we get more samples, the effect of the prior is "washed out"
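The "washing out" of the prior can be illustrated with the standard Beta-Bernoulli update: a Beta(β_H, β_T) prior combined with α_H observed heads and α_T observed tails gives a Beta(β_H + α_H, β_T + α_T) posterior. The sketch below assumes a hypothetical Beta(50, 50) prior (a coin believed to be close to 50-50) and data whose head frequency stays at 3/5; neither number comes from the slides.

```python
def beta_posterior(beta_h, beta_t, alpha_h, alpha_t):
    """Beta(beta_h, beta_t) prior plus (alpha_h heads, alpha_t tails) -> posterior parameters."""
    return beta_h + alpha_h, beta_t + alpha_t

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

prior = (50, 50)   # hypothetical prior: strong belief that the coin is close to 50-50
for n in (5, 50, 5000):
    heads = 3 * n // 5   # keep the observed frequency of heads at 3/5
    a, b = beta_posterior(*prior, heads, n - heads)
    print(n, a, b, round(beta_mean(a, b), 3))
```

With few flips the posterior mean stays near the prior's 0.5; with many flips it approaches the observed frequency 0.6.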

25 Conjugate Prior
P(θ) and P(θ | D) have the same form
Eg. 2 Dice roll problem (6 outcomes instead of 2)
Likelihood is ~ Multinomial(θ = {θ_1, θ_2, ..., θ_k})
If prior is Dirichlet distribution, then posterior is Dirichlet distribution
For Multinomial, conjugate prior is Dirichlet distribution.
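For the dice case, the standard conjugate update adds the observed count of each face to the corresponding Dirichlet parameter. The sketch below uses a hypothetical uniform prior and made-up roll counts; the function names are illustrative.

```python
def dirichlet_posterior(prior_params, observed_counts):
    """Dirichlet(prior_params) plus multinomial counts -> Dirichlet(prior_params + counts)."""
    if len(prior_params) != len(observed_counts):
        raise ValueError("prior and observations must cover the same outcomes")
    return [p + c for p, c in zip(prior_params, observed_counts)]

def dirichlet_mean(params):
    """Mean of a Dirichlet distribution: each parameter divided by their sum."""
    total = sum(params)
    return [p / total for p in params]

prior = [1, 1, 1, 1, 1, 1]   # hypothetical uniform prior over the six faces
rolls = [3, 1, 0, 2, 4, 2]   # hypothetical counts of faces 1..6 in 12 rolls
posterior = dirichlet_posterior(prior, rolls)
print(posterior)
print(dirichlet_mean(posterior))
```

Face 3 was never observed, yet its posterior mean is not zero: the prior acts like one extra pseudo-observation per face.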

26 Posterior Distribution
The approach seen so far is what is known as a Bayesian approach
Prior information is encoded as a distribution over possible values of the parameter
Using the Bayes rule, you get an updated posterior distribution over parameters, which you provide with flourish to the Billionaire
But the billionaire is not impressed
- Distribution? I just asked for one number: is it 3/5, 1/2, what is it?
- How do we go from a distribution over parameters, to a single estimate of the true parameters?

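27 Maximum A Posteriori Estimation
Choose θ that maximizes a posterior probability:
$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} P(D \mid \theta)\, P(\theta)$
MAP estimate of probability of head, for α_H observed heads, α_T observed tails and a Beta(β_H, β_T) prior:
$\hat{\theta}_{MAP} = \frac{\alpha_H + \beta_H - 1}{\alpha_H + \beta_H + \alpha_T + \beta_T - 2}$
(Mode of Beta distribution)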
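To see how the MAP estimate differs from the raw frequency, the sketch below compares the two for the 3-heads, 2-tails example. It uses the standard mode of a Beta(a, b) distribution, (a - 1) / (a + b - 2), which holds for a, b > 1; the prior parameters are hypothetical.

```python
def mle_heads(alpha_h, alpha_t):
    """Maximum likelihood estimate: the observed frequency of heads."""
    return alpha_h / (alpha_h + alpha_t)

def map_heads(alpha_h, alpha_t, beta_h, beta_t):
    """Mode of the Beta(beta_h + alpha_h, beta_t + alpha_t) posterior."""
    a, b = beta_h + alpha_h, beta_t + alpha_t
    if a <= 1 or b <= 1:
        raise ValueError("the interior mode formula needs both posterior parameters > 1")
    return (a - 1) / (a + b - 2)

print(mle_heads(3, 2))            # frequency of heads
print(map_heads(3, 2, 50, 50))    # hypothetical strong 50-50 prior pulls the estimate toward 0.5
print(map_heads(3, 2, 1, 1))      # uniform Beta(1, 1) prior
```

With the uniform Beta(1, 1) prior the MAP estimate coincides with the maximum likelihood estimate.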

28 Density estimation

29 Density Estimation
A Density Estimator learns a mapping from a set of attributes to a probability
[Diagram: input data for a variable or a set of variables → Density Estimator → Probability]

30 Density estimation
Estimate the distribution (or conditional distribution) of a random variable
Types of variables:
- Binary: coin flip, alarm
- Discrete: dice, car model year
- Continuous: height, weight, temp., ...

31 When do we need to estimate densities?
Density estimators are critical ingredients in several of the ML algorithms we will discuss
In some cases these are combined with other inference types for more involved algorithms (e.g. EM), while in others they are part of a more general process (learning in BNs and HMMs)

32 Density estimation
Binary and discrete variables: Easy: Just count!
Continuous variables: Harder (but just a bit): Fit a model

33 Learning a density estimator for discrete variables
$\hat{P}(x_i = u) = \frac{\#\text{records in which } x_i = u}{\text{total number of records}}$
A trivial learning algorithm!
But why is this true?
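The counting estimator takes only a few lines. The sketch below assumes a hypothetical record of five coin flips with three heads, matching the 3/5 example from the consulting story; the function name is illustrative.

```python
from collections import Counter

def estimate_distribution(values):
    """P_hat(x = u) = (# records in which x = u) / (total number of records)."""
    if not values:
        raise ValueError("need at least one record")
    counts = Counter(values)
    total = len(values)
    return {u: c / total for u, c in counts.items()}

flips = ["H", "H", "T", "H", "T"]   # hypothetical record of five coin flips
p_hat = estimate_distribution(flips)
print(p_hat)
```

Values that never occur in the data get no entry at all, i.e. an implicit probability of zero, which is one reason priors become useful when samples are scarce.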

34 Maximum Likelihood Principle
We can define the likelihood of the data given the model as follows:
$\hat{P}(\text{dataset} \mid M) = \hat{P}(x_1 \wedge x_2 \wedge \dots \wedge x_n \mid M) = \prod_{k=1}^{n} \hat{P}(x_k \mid M)$
M is our model (usually a collection of parameters)
For example, M is:
- The probability of "head" for a coin flip
- The probabilities of observing 1, 2, 3, 4 and 5 for a dice
- etc.

35 Maximum Likelihood Principle
Our goal is to determine the values for the parameters in M
We can do this by maximizing the probability of generating the observed samples
For example, let θ be the probabilities for a coin flip
Then $L(x_1, \dots, x_n \mid \theta) = p(x_1 \mid \theta) \cdots p(x_n \mid \theta)$
The observations (different flips) are assumed to be independent
For such a coin flip with P(H) = q the best assignment for θ_H is
$\hat{q} = \arg\max_q L(x_1, \dots, x_n \mid q) = \frac{\#H}{\#\text{samples}}$
Why?

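36 Maximum Likelihood Principle: Binary variables
For a binary random variable A with P(A=1) = q: $\hat{q} = \arg\max_q P(D \mid M) = \frac{\#1}{\#\text{samples}}$
Why?
Data likelihood: $P(D \mid M) = q^{n_1} (1 - q)^{n_2}$, where n_1 is the number of samples with A = 1 and n_2 the number with A = 0
We would like to find: $\arg\max_q\; q^{n_1} (1 - q)^{n_2}$
(Omitting terms that do not depend on q)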

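37 Maximum Likelihood Principle
Data likelihood: $P(D \mid M) = q^{n_1} (1 - q)^{n_2}$
We would like to find: $\arg\max_q\; q^{n_1} (1 - q)^{n_2}$
Setting the derivative with respect to q to zero:
$\frac{\partial}{\partial q}\, q^{n_1} (1 - q)^{n_2} = n_1 q^{n_1 - 1} (1 - q)^{n_2} - n_2\, q^{n_1} (1 - q)^{n_2 - 1} = 0$
$\Rightarrow q^{n_1 - 1} (1 - q)^{n_2 - 1} \left( n_1 (1 - q) - n_2 q \right) = 0$
$\Rightarrow n_1 (1 - q) - n_2 q = 0$
$\Rightarrow n_1 = n_1 q + n_2 q$
$\Rightarrow q = \frac{n_1}{n_1 + n_2}$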
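The closed-form result q = n_1 / (n_1 + n_2) can be checked numerically by evaluating the likelihood on a grid of candidate values. This is only a sanity check under hypothetical counts (3 ones, 2 zeros), not part of the derivation.

```python
def likelihood(q, n1, n2):
    """Data likelihood q^n1 * (1 - q)^n2, omitting terms that do not depend on q."""
    return q ** n1 * (1.0 - q) ** n2

n1, n2 = 3, 2                                # hypothetical counts of 1s and 0s
grid = [i / 1000 for i in range(1, 1000)]    # candidate values of q in (0, 1)
q_best = max(grid, key=lambda q: likelihood(q, n1, n2))

print(q_best, n1 / (n1 + n2))
```

The grid maximizer matches the analytic solution up to the grid resolution.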

38 Log Probabilities
When working with products, probabilities of entire datasets often get too small. A possible solution is to use the log of probabilities, often termed "log likelihood":
$\log \hat{P}(\text{dataset} \mid M) = \log \prod_{k=1}^{n} \hat{P}(x_k \mid M) = \sum_{k=1}^{n} \log \hat{P}(x_k \mid M)$
Maximizing this likelihood function is the same as maximizing P(dataset | M), because the log is monotonically increasing
In some cases moving to log space would also make computation easier (for example, removing the exponents)
[Figure: log values between 0 and 1]
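The underflow problem is easy to reproduce with double-precision floats. The sketch below assumes a hypothetical dataset of 2000 flips and a head probability of 0.6; the direct product collapses to zero while the sum of logs remains an ordinary number.

```python
import math

q = 0.6                   # hypothetical probability of heads
flips = [1, 0] * 1000     # hypothetical dataset: 2000 flips, half heads

def prob(x):
    """Probability of a single observation under the Bernoulli model."""
    return q if x == 1 else 1.0 - q

likelihood = 1.0
for x in flips:
    likelihood *= prob(x)   # product of 2000 numbers below 1

log_likelihood = sum(math.log(prob(x)) for x in flips)

print(likelihood)       # underflows to 0.0 in double precision
print(log_likelihood)   # stays finite
```

Once the product has underflowed, every parameter value looks equally bad, so the maximization has to be done in log space.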

39 How much do grad students sleep?
Let's try to estimate the distribution of the time students spend sleeping (outside class).

40 Possible statistics
X: Sleep time
Mean of X: E{X} = 7.03
Variance of X: Var{X} = E{(X-E{X})^2}
[Figure: histogram "Sleep", Frequency vs. Hours]

41 Covariance: Sleep vs. GPA
Co-Variance of X1, X2: Covariance{X1,X2} = E{(X1-E{X1})(X2-E{X2})}
[Figure: scatter plot "Sleep / GPA", GPA vs. Sleep hours]

42 Statistical Models
Statistical models attempt to characterize properties of the population of interest
For example, we might believe that repeated measurements follow a normal (Gaussian) distribution with some mean µ and variance σ², x ~ N(µ, σ²), where
$p(x \mid \Theta) = \frac{1}{\sqrt{2 \pi \sigma^2}}\, e^{-\frac{(x - \mu)^2}{2 \sigma^2}}$
and Θ = (µ, σ²) defines the parameters (mean and variance) of the model.
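The Gaussian density translates directly into code. In this sketch the function name and the evaluation points are illustrative; `sigma2` denotes the variance σ², not the standard deviation.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at x; sigma2 is the variance and must be positive."""
    if sigma2 <= 0:
        raise ValueError("variance must be positive")
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

# Density at the mean versus one unit away, for a hypothetical mean of 7 and unit variance.
print(gaussian_pdf(7.0, mu=7.0, sigma2=1.0))
print(gaussian_pdf(8.0, mu=7.0, sigma2=1.0))
```

The density peaks at the mean and is symmetric around it.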

43 The Parameters of Our Model
A statistical model is a collection of distributions; the parameters specify individual distributions, x ~ N(µ, σ²)
We need to adjust the parameters so that the resulting distribution fits the data well


45 Computing the parameters of our model
Let's assume a Gaussian distribution for our sleep data
How do we compute the parameters of the model?
[Figure: histogram "Sleep", Frequency vs. Hours]

46 Maximum Likelihood Principle
We can fit statistical models by maximizing the probability of generating the observed samples:
$L(x_1, \dots, x_n \mid \Theta) = p(x_1 \mid \Theta) \cdots p(x_n \mid \Theta)$ (the samples are assumed to be independent)
In the Gaussian case we simply set the mean and the variance to the sample mean and the sample variance:
$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$, $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$
Why?
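A minimal sketch of the Gaussian maximum likelihood fit follows. The sleep data are hypothetical (the values behind the slide's histogram are not available), and the variance deliberately uses the 1/n normalization of the maximum likelihood estimate rather than the unbiased 1/(n-1) version.

```python
def gaussian_mle(samples):
    """Return (mu_hat, sigma2_hat): the sample mean and the 1/n sample variance."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    mu = sum(samples) / n
    sigma2 = sum((x - mu) ** 2 for x in samples) / n
    return mu, sigma2

sleep_hours = [6.0, 7.5, 8.0, 5.5, 7.0, 9.0, 6.5]   # hypothetical data, not from the slides
mu_hat, sigma2_hat = gaussian_mle(sleep_hours)
print(mu_hat, sigma2_hat)
```

The fitted pair (mu_hat, sigma2_hat) selects one member of the Gaussian family, namely the one under which the observed samples are most probable.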

47 Density estimation
Binary and discrete variables: Easy: Just count!
Continuous variables: Harder (but just a bit): Fit a model
But what if we only have very few samples?

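48 MLE vs. MAP
Maximum Likelihood estimation (MLE): choose the value that maximizes the probability of observed data
$\hat{\theta}_{MLE} = \arg\max_{\theta} P(D \mid \theta)$
Maximum a posteriori (MAP) estimation: choose the value that is most probable given observed data and prior belief
$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} P(D \mid \theta)\, P(\theta)$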

49 Important points
- Random variables
- Chain rule
- Bayes rule
- Joint distribution, independence, conditional independence
- MLE

50 Assume we performed n coin flips and used the outcome to learn the probability of heads, defined as q. In the questions below assume that 0 < q < 1 unless stated otherwise.
1. We have performed an additional coin flip and learned a new probability for heads, q1, based on the n+1 observations. The following holds:
   a. q1 = q
   b. q1 ≠ q
   c. it depends on q and the value of the new observation
2. We have performed two additional coin flips and learned a new probability for heads, q1, based on the n+2 observations. The following holds:
   a. q1 = q
   b. q1 ≠ q
   c. it depends on q and the values of the new observations
3. Now assume that 0.6 < q < 1. Similar to (2) we have performed two additional coin flips and learned a new probability for heads, q1, based on the n+2 observations. The following holds:
   a. q1 = q
   b. q1 ≠ q
   c. it depends on q and the values of the new observations

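51 Probability Density Function
Discrete distributions: probabilities P(X = x), with $\sum_x P(X = x) = 1$
Continuous: density f(x), with $P(a \le X \le b) = \int_a^b f(x)\, dx$
Cumulative Density Function (CDF): $F(a) = P(X \le a) = \int_{-\infty}^{a} f(x)\, dx$
[Figure: density f(x) over x, with F(a) the area to the left of a]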

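52 Cumulative Density Functions
Total probability: $\int_{-\infty}^{\infty} f(x)\, dx = 1$
Probability Density Function (PDF): $f(x) = \frac{d}{dx} F(x)$
Properties: F(x) is non-decreasing, with $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$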

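53 Expectations
Mean/Expected Value: $E[X] = \int x\, f(x)\, dx$
Variance: $\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - E[X]^2$
In general: $E[g(X)] = \int g(x)\, f(x)\, dx$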

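54 Multivariate
Joint for (x,y): $p(x, y)$
Marginal: $p(x) = \int p(x, y)\, dy$
Conditionals: $p(x \mid y) = \frac{p(x, y)}{p(y)}$
Chain rule: $p(x, y) = p(x \mid y)\, p(y) = p(y \mid x)\, p(x)$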

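55 Bayes Rule
Standard form: $p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)}$
Replacing the bottom: $p(x \mid y) = \frac{p(y \mid x)\, p(x)}{\int p(y \mid x')\, p(x')\, dx'}$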

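56 Binomial
Distribution: $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$ for k = 0, ..., n
Mean/Var: $E[X] = np$, $\text{Var}(X) = np(1 - p)$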
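For X ~ Binomial(n, p), the standard results are P(X = k) = C(n, k) p^k (1 - p)^(n - k), mean np and variance np(1 - p). The sketch below checks the mean and variance numerically from the probability mass function; the parameters n = 10 and p = 0.3 are hypothetical.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

n, p = 10, 0.3   # hypothetical parameters
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(sum(pmf), mean, var)
```

Up to floating-point error, the probabilities sum to 1 and the computed moments agree with np and np(1 - p).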

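57 Uniform
Anything is equally likely in the region [a,b]
Distribution: $f(x) = \frac{1}{b - a}$ for $a \le x \le b$, and 0 otherwise
Mean/Var: $E[X] = \frac{a + b}{2}$, $\text{Var}(X) = \frac{(b - a)^2}{12}$
[Figure: flat density between a and b]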

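58 Gaussian (Normal)
If I look at the height of women in country xx, it will look approximately Gaussian
Small random noise errors look Gaussian/Normal
Distribution: $f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}}\, e^{-\frac{(x - \mu)^2}{2 \sigma^2}}$
Mean/var: $E[X] = \mu$, $\text{Var}(X) = \sigma^2$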

59 Why Do People Use Gaussians
Central Limit Theorem (loosely):
- Sum of a large number of IID random variables is approximately Gaussian
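A small simulation can make the Central Limit Theorem statement concrete. The sketch below sums 30 IID Uniform(0, 1) draws many times and compares the result with Gaussian behaviour; the number of terms, the number of repetitions and the seed are arbitrary choices made for illustration.

```python
import random
import statistics

random.seed(0)   # fixed seed so the sketch is reproducible

def sum_of_uniforms(n_terms):
    """Sum of n_terms IID Uniform(0, 1) draws."""
    return sum(random.random() for _ in range(n_terms))

n_terms = 30
sums = [sum_of_uniforms(n_terms) for _ in range(20000)]

mean = statistics.fmean(sums)      # theory: n_terms * 1/2
var = statistics.pvariance(sums)   # theory: n_terms * 1/12
sd = var ** 0.5

# For a Gaussian, roughly 68% of the mass lies within one standard deviation of the mean.
within_one_sd = sum(abs(s - mean) <= sd for s in sums) / len(sums)
print(mean, var, within_one_sd)
```

Although each term is uniform, the fraction of sums within one standard deviation comes out close to the Gaussian value.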

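60 Multivariate Gaussians
Distribution for vector x
PDF: $p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$, where d is the dimension of x, µ the mean vector and Σ the covariance matrix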

61 Multivariate Gaussians
$\text{cov}(x_1, x_2) = \frac{1}{n} \sum_{i=1}^{n} (x_{1,i} - \mu_1)(x_{2,i} - \mu_2)$
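The sample covariance with 1/n normalization can be computed directly from its definition. The sleep and GPA values below are hypothetical and only illustrate an anti-correlated pair; the function name is illustrative.

```python
def covariance(xs, ys):
    """cov(x1, x2) = (1/n) * sum_i (x1_i - mean1) * (x2_i - mean2)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

sleep = [5.0, 6.0, 7.0, 8.0, 9.0]   # hypothetical sleep hours
gpa = [3.9, 3.6, 3.5, 3.2, 3.0]     # hypothetical GPAs
print(covariance(sleep, gpa))       # negative: more sleep goes with lower GPA in this toy data
print(covariance(sleep, sleep))     # covariance of a variable with itself is its variance
```

The sign of the covariance gives the direction of the linear relationship; its magnitude depends on the units of both variables.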

62 Covariance examples
[Figure: three scatter plots]
- Anti-correlated (Covariance: -9.2)
- Correlated (Covariance: value missing in source)
- Independent (almost) (Covariance: 0.6)

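63 Sum of Gaussians
The sum of two Gaussians is a Gaussian: for independent $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$, $X + Y \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$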


10-701 Machine Learning, http://www.cs.cmu.edu/~epxing/class/10701-15f/
Organizational info: all up-to-date info is on the course web page (follow links from my page).
Instructors:
- Eric Xing
- Ziv Bar-Joseph
TAs: See