Maximum A Posteriori (MAP). CS 109 Lecture 22, May 16th, 2016
Previously in CS109
Game of Estimators: Maximum Likelihood. Non-spoiler: this didn't happen
Side Plot: $\arg\max_\theta f(\theta) = \arg\max_\theta \log f(\theta)$. Mother of optimizations?
Reviving an Old Story Line: The Multinomial Distribution. $X \sim \text{Mult}(p_1, \dots, p_k)$: $p(x_1, \dots, x_k) = \frac{n!}{x_1! \cdots x_k!} \, p_1^{x_1} \cdots p_k^{x_k}$
Machine Learning So Far
Maximum Likelihood of Data
- Consider n i.i.d. random variables $X_1, X_2, \dots, X_n$, where $X_i$ is a sample from density function $f(X_i \mid \theta)$
- $L(\theta) = \prod_{i=1}^{n} f(X_i \mid \theta)$
- $LL(\theta) = \log L(\theta) = \log \prod_{i=1}^{n} f(X_i \mid \theta) = \sum_{i=1}^{n} \log f(X_i \mid \theta)$
- $\theta_{MLE} = \arg\max_\theta LL(\theta)$
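A minimal sketch of this in code. The Poisson model and the sample values below are illustrative assumptions, not from the slides: maximizing LL(θ) numerically recovers the familiar closed-form MLE, the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

# Hypothetical data: i.i.d. samples from Poisson(theta); the MLE of theta is the sample mean.
np.random.seed(1)
samples = poisson.rvs(mu=3.5, size=200)

def neg_log_likelihood(theta):
    # -LL(theta) = -sum_i log f(X_i | theta), negated so we can minimize
    return -poisson.logpmf(samples, theta).sum()

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 20), method="bounded")
print("numerical MLE:", result.x)
print("sample mean:  ", samples.mean())
```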
MLE to Linear Regression
- How do you fit this line? Assume: $Y = \theta X + Z$, where $Z \sim N(0, \sigma^2)$
- Calculate the MLE of $\theta$: $\hat{\theta} = \arg\min_\theta \sum_{i=1}^{m} (Y_i - \theta X_i)^2$
- This is an algorithm called linear regression. Learn more about it later.
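A short sketch of the same point. The simulated data and the true slope 2.5 are assumptions for illustration: under the Gaussian-noise model, the MLE of θ is exactly the least-squares slope.

```python
import numpy as np

# Simulate Y = theta*X + Z with Z ~ N(0, sigma^2); values here are illustrative assumptions.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + rng.normal(0, 1.0, size=100)

# Closed-form least-squares slope (no intercept): argmin_theta sum_i (y_i - theta*x_i)^2
theta_hat = (x @ y) / (x @ x)
print("least-squares / MLE slope:", theta_hat)
```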
Watch it Online
Episode 22: The Song of The Last Estimator
The Song of the Last Estimator
Something rotten in the world of MLE
Foreshadowing...
Need a Volunteer. So good to see you again!
Two Envelopes
- I have two envelopes, will allow you to have one. One contains $X, the other contains $2X.
- Select an envelope. Open it! Now, would you like to switch for the other envelope?
- To help you decide, compute E[$ in other envelope]. Let Y = $ in envelope you selected:
  $E[\$ \text{ in other envelope}] = \frac{1}{2} \cdot \frac{Y}{2} + \frac{1}{2} \cdot 2Y = \frac{5}{4} Y$
- Before opening the envelope, you think either envelope is equally good. So, what happened by opening the envelope? And does it really make sense to switch?
Thinking Deeper About Two Envelopes
- The two envelopes problem set-up: two envelopes, one contains $X, the other contains $2X. You select an envelope and open it.
- Let Y = $ in envelope you selected. Let Z = $ in other envelope.
  $E[Z \mid Y] = \frac{1}{2} \cdot \frac{Y}{2} + \frac{1}{2} \cdot 2Y = \frac{5}{4} Y$
- E[Z | Y] above assumes all values X (where 0 < X < ∞) are equally likely.
- Note: there are infinitely many values of X. So this is not a true probability distribution over X (doesn't integrate to 1).
All Values are Equally Likely? [Plot of p(X) vs. X: infinitely many values of X (powers of two), all "equally likely"]
Subjectivity of Probability
- Belief about contents of envelopes: since the implied distribution over X is not a true probability distribution, what is our distribution over X?
- Frequentist: play the game infinitely many times and see how often different values come up. Problem: I only allow you to play the game once.
- Bayesian probability: have a prior belief of the distribution for X (or anything for that matter). Prior belief is a subjective probability. By extension, all probabilities are subjective.
- Allows us to answer the question when we have no/limited data. E.g., probability a coin you've never flipped lands on heads. As we get more data, the prior belief is swamped by data.
Subjectivity of Probability [Plot: a subjective prior density p(X) over X]
The Envelope, Please
- Bayesian: have a prior distribution over X, P(X). Let Y = $ in envelope you selected. Let Z = $ in other envelope.
- Open your envelope to determine Y. If Y > E[Z | Y], keep your envelope; otherwise switch.
- No inconsistency! Opening the envelope provides data to compute P(X | Y) and thereby compute E[Z | Y].
- Of course, there's the issue of how you determined your prior distribution over X. Bayesian: doesn't matter how you determined the prior, but you must have one (whatever it is).
- Imagine if the envelope you opened contained $20.01.
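To make the "compute E[Z | Y], then decide" rule concrete, here is a minimal sketch. The prior below (X uniform on the integers 1 to 100) is purely an illustrative assumption; the slides only require that you have some proper prior.

```python
# Two-envelopes decision rule under an (assumed) proper prior over X.
def prior(x):
    # Illustrative prior: X uniform over the integers 1..100
    return 1.0 / 100 if (1 <= x <= 100 and float(x).is_integer()) else 0.0

def expected_other(y):
    """E[Z | Y = y]: the opened envelope held either X (other holds 2y) or 2X (other holds y/2)."""
    p_small = prior(y)       # case: we opened the $X envelope, so X = y
    p_large = prior(y / 2)   # case: we opened the $2X envelope, so X = y/2
    total = p_small + p_large
    if total == 0:
        return None          # y impossible under this prior
    return (p_small * 2 * y + p_large * y / 2) / total

for y in [2, 60, 150]:
    ez = expected_other(y)
    print(f"Y = ${y}: E[Z|Y] = {ez}, switch = {ez is not None and ez > y}")
```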
The Dreaded Half Cent
Envelope Summary: Probabilities are beliefs. Incorporating prior beliefs is useful
Especially for one-shot learning
One-Shot Learning. Single training example: [image]. Test set: [images]
Priors for Parameter Estimation?
Flash Back: Bayes' Theorem
- Bayes' Theorem (θ = model parameters, D = data): $P(\theta \mid D) = \frac{P(D \mid \theta) \, P(\theta)}{P(D)}$, i.e., Posterior ∝ Likelihood × Prior
- Likelihood: you've seen this before (in the context of MLE). Probability of data given the probability model (parameter θ).
- Prior: before seeing any data, what is the belief about the model, i.e., what is the distribution over parameters θ.
- Posterior: after seeing data, what is the belief about the model. After data D is observed, we have a posterior distribution P(θ | D) over parameters θ conditioned on the data. Use this to predict new data.
Computing P(θ | D)
- Bayes' Theorem (θ = model parameters, D = data): $P(\theta \mid D) = \frac{P(D \mid \theta) \, P(\theta)}{P(D)}$
- We have the prior P(θ) and can compute P(D | θ). But how do we calculate P(D)?
- Complicated answer: $P(D) = \int P(D \mid \theta) \, P(\theta) \, d\theta$
- Easy answer: it does not depend on θ, so ignore it. It's just a constant that forces P(θ | D) to integrate to 1.
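A small numerical sketch of the "just a normalizing constant" point. The Beta(2, 2) prior and the 7-heads-in-10-flips data are assumed for illustration: compute the unnormalized posterior on a grid, then divide by its integral, which is exactly P(D).

```python
import numpy as np
from scipy.stats import beta, binom

thetas = np.linspace(0, 1, 1001)
prior = beta.pdf(thetas, 2, 2)            # P(theta), an assumed prior
likelihood = binom.pmf(7, 10, thetas)     # P(D | theta) for D = 7 heads in 10 flips
unnormalized = likelihood * prior

# P(D) = integral over theta of P(D | theta) P(theta), approximated numerically
p_data = np.trapz(unnormalized, thetas)
posterior = unnormalized / p_data         # now integrates to 1

print("P(D) ≈", p_data)
print("check: posterior integrates to", np.trapz(posterior, thetas))
```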
Most important slide of today
Maximum A Posteriori
- Recall the Maximum Likelihood Estimator (MLE) of θ: $\theta_{MLE} = \arg\max_\theta \prod_{i=1}^{n} f(X_i \mid \theta)$
- Maximum A Posteriori (MAP) estimator of θ:
  $\theta_{MAP} = \arg\max_\theta f(\theta \mid X_1, X_2, \dots, X_n) = \arg\max_\theta \frac{f(X_1, \dots, X_n \mid \theta) \, g(\theta)}{h(X_1, \dots, X_n)} = \arg\max_\theta g(\theta) \prod_{i=1}^{n} f(X_i \mid \theta)$
  where g(θ) is the prior distribution of θ.
- As before, it can often be more convenient to use the log:
  $\theta_{MAP} = \arg\max_\theta \left( \log(g(\theta)) + \sum_{i=1}^{n} \log(f(X_i \mid \theta)) \right)$
- The MAP estimate is the mode of the posterior distribution.
Maximum A Posteriori
Choose the value of θ (the estimated parameter) that maximizes the log prior plus the sum of log likelihoods:
$\theta_{MAP} = \arg\max_\theta \left( \log(g(\theta)) + \sum_{i=1}^{n} \log(f(X_i \mid \theta)) \right)$
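A minimal sketch of using this formula directly. The coin-flip data and the Beta(3, 3) prior are assumptions for illustration: maximize log g(θ) plus the summed log likelihood numerically, and compare against the closed-form mode of the Beta posterior.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import beta, bernoulli

np.random.seed(0)
data = bernoulli.rvs(0.7, size=20)    # hypothetical coin flips
a, b = 3, 3                           # hypothetical Beta prior hyperparameters

def neg_log_posterior(theta):
    log_prior = beta.logpdf(theta, a, b)
    log_likelihood = bernoulli.logpmf(data, theta).sum()
    return -(log_prior + log_likelihood)   # negated so we can minimize

result = minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6), method="bounded")
n, m = data.sum(), len(data) - data.sum()
closed_form = (n + a - 1) / (len(data) + a + b - 2)   # mode of Beta(a + n, b + m)
print("numerical MAP:", result.x)
print("closed form:  ", closed_form)
```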
Gotta get that intuition
For Beta and Bernoulli
- Prior: θ ~ Beta(a, b); D = {n heads, m tails}. Estimate p.
- $f(\theta \mid D) = \frac{P(D \mid \theta) \, f(\theta)}{P(D)} = \frac{C_1 \, \theta^n (1-\theta)^m \cdot C_2 \, \theta^{a-1} (1-\theta)^{b-1}}{P(D)} = C_3 \, \theta^{n+a-1} (1-\theta)^{m+b-1}$
- By definition, f(θ | D) is Beta(a + n, b + m).
- $\theta_{MAP} = \arg\max_\theta f(\theta \mid D) = \arg\max_\theta \left[ (n + a - 1)\log\theta + (m + b - 1)\log(1-\theta) \right]$
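One short step the slide leaves implicit: setting the derivative of the bracketed log posterior to zero gives the mode of the Beta(a + n, b + m) posterior.

```latex
\frac{d}{d\theta}\Big[(n + a - 1)\log\theta + (m + b - 1)\log(1-\theta)\Big]
  = \frac{n + a - 1}{\theta} - \frac{m + b - 1}{1 - \theta} = 0
\quad\Longrightarrow\quad
\theta_{MAP} = \frac{n + a - 1}{n + m + a + b - 2}
```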
Hyperparameters
- Hyperparameters a, b are fixed.
- Prior: p ~ Beta(a, b). Data distribution: $X_i \sim \text{Bern}(p)$ for $X_1, X_2, \dots, X_n$.
- MAP will estimate the most likely value of p for this model.
Where do Ya Get Them P(θ)?
- θ is the probability a coin turns up heads. Model θ with 2 different priors:
- P1(θ) is Beta(3, 8) (blue); P2(θ) is Beta(7, 4) (red). They look pretty different!
- Now flip 100 coins; get 58 heads and 42 tails. What do the posteriors look like?
It's Like Having Twins. argmax returns the mode. As long as we collect enough data, the posteriors will converge to the true value!
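A quick numerical check of this claim, using the slide's own numbers (58 heads, 42 tails) with the two priors Beta(3, 8) and Beta(7, 4):

```python
from scipy.stats import beta

priors = {"Beta(3, 8)": (3, 8), "Beta(7, 4)": (7, 4)}
n_heads, n_tails = 58, 42

for name, (a, b) in priors.items():
    post_a, post_b = a + n_heads, b + n_tails
    mode = (post_a - 1) / (post_a + post_b - 2)      # MAP estimate = mode of the posterior
    std = beta.std(post_a, post_b)
    print(f"prior {name} -> posterior Beta({post_a}, {post_b}): mode = {mode:.3f}, std = {std:.3f}")
# Despite very different priors, the two posterior modes land near 0.55 and 0.59.
```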
Conjugate Distributions Without Tears
- Just for review: have a coin with unknown probability θ of heads. Our prior (subjective) belief is that θ ~ Beta(a, b).
- Now flip the coin k = n + m times, getting n heads, m tails.
- Posterior density: (θ | n heads, m tails) ~ Beta(a + n, b + m). Beta is conjugate for Bernoulli, Binomial, Geometric, and Negative Binomial.
- a and b are called hyperparameters: saw (a + b - 2) imaginary trials, of those (a - 1) are successes.
- For a coin you never flipped before, use Beta(x, x) to denote you think the coin is likely to be fair. How strongly you feel the coin is fair is a function of x.
Mo Beta
Gonna Need Priors
Parameter          Distribution for Parameter
Bernoulli p        Beta
Binomial p         Beta
Poisson λ          Gamma
Exponential λ      Gamma
Multinomial p_i    Dirichlet
Normal µ           Normal
Normal σ²          Inverse Gamma
Don't need to know Inverse Gamma. But it will know you.
Multinomial is Multiple Times the Fun
- Dirichlet(a_1, a_2, ..., a_m) distribution: conjugate for Multinomial. Dirichlet generalizes Beta in the same way Multinomial generalizes Bernoulli.
- $f(X_1 = x_1, X_2 = x_2, \dots, X_m = x_m) = K \prod_{i=1}^{m} x_i^{a_i - 1}$
- Intuitive understanding of hyperparameters: saw $\sum_{i=1}^{m} a_i - m$ imaginary trials, with (a_i - 1) of outcome i.
- Updating to get the posterior distribution: after observing $n_1 + n_2 + \dots + n_m$ new trials, with n_i of outcome i ...
- ... the posterior distribution is Dirichlet(a_1 + n_1, a_2 + n_2, ..., a_m + n_m).
Best Short Film in the Dirichlet Category. And now a cool animation of Dirichlet(a, a, a). This is actually log density (but you get the idea). Thanks Wikipedia!
Example: Estimating Die Parameters
Your Happy Laplace
- Recall example of 6-sided die rolls: X ~ Multinomial(p_1, p_2, p_3, p_4, p_5, p_6). Roll n = 12 times.
- Result: 3 ones, 2 twos, 0 threes, 3 fours, 1 five, 3 sixes.
- MLE: p_1 = 3/12, p_2 = 2/12, p_3 = 0/12, p_4 = 3/12, p_5 = 1/12, p_6 = 3/12
- A Dirichlet prior allows us to pretend we saw each outcome k times before. MAP estimate: $p_i = \frac{X_i + k}{n + mk}$
- Laplace's "law of succession": idea above with k = 1. Laplace estimate: $p_i = \frac{X_i + 1}{n + m}$
- Laplace: p_1 = 4/18, p_2 = 3/18, p_3 = 1/18, p_4 = 4/18, p_5 = 2/18, p_6 = 4/18
- No longer have 0 probability of rolling a three!
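The same arithmetic in a few lines of code, using the counts from the slide:

```python
import numpy as np

counts = np.array([3, 2, 0, 3, 1, 3])   # 3 ones, 2 twos, 0 threes, 3 fours, 1 five, 3 sixes
n, m = counts.sum(), len(counts)         # n = 12 rolls, m = 6 outcomes
k = 1                                    # Laplace's law of succession: pretend each outcome was seen once

mle = counts / n
laplace = (counts + k) / (n + m * k)
print("MLE:    ", mle)                   # 3/12, 2/12, 0/12, 3/12, 1/12, 3/12
print("Laplace:", laplace)               # 4/18, 3/18, 1/18, 4/18, 2/18, 4/18
```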
Good Times with Gamma
- Gamma(k, θ) distribution: conjugate for Poisson. Also conjugate for Exponential, but we won't delve into that.
- Intuitive understanding of hyperparameters: saw k total imaginary events during θ prior time periods.
- Updating to get the posterior distribution: after observing n events during the next t time periods ...
- ... the posterior distribution is Gamma(k + n, θ + t).
- Example: Gamma(10, 5). Saw 10 events in 5 time periods, like observing at rate λ = 2. Now see 11 events in the next 2 time periods → Gamma(21, 7), equivalent to an updated rate λ = 3.
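The slide's Gamma(10, 5) example as a tiny sketch, treating the second hyperparameter as "prior time periods observed" as described above:

```python
k, theta = 10, 5          # prior: 10 imaginary events over 5 time periods (rate about 2)
n, t = 11, 2              # data: 11 events observed over the next 2 time periods

k_post, theta_post = k + n, theta + t
print(f"posterior: Gamma({k_post}, {theta_post})")     # Gamma(21, 7)
print("updated rate estimate:", k_post / theta_post)   # 3.0
```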
Is Peer Grading Accurate Enough? Peer grading on Coursera HCI: 31,067 peer grades for 3,607 students. Tuned Models of Peer Assessment. C. Piech, J. Huang, A. Ng, D. Koller
Is Peer Grading Accurate Enough?
1. Defined random variables for: true grade (s_i) for assignment i; observed score (z_i^j) for assignment i by grader j; bias (b_j) for each grader j; variance (r_j) for each grader j.
2. Designed a probabilistic model that defined the distributions for all random variables:
   $z_i^j \sim N(\mu = s_i + b_j,\ \sigma = \sqrt{r_j})$
   $s_i \sim N(\mu_0, \sigma_0)$
   $b_j \sim N(0, \eta_0)$, where $\eta_0$ is a hyperparameter
   $r_j \sim \text{InvGamma}(\alpha_0, \beta_0)$
Tuned Models of Peer Assessment. C. Piech, J. Huang, A. Ng, D. Koller
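A minimal generative sketch of this model, for intuition only; the hyperparameter values and the problem sizes below are illustrative assumptions, not the ones used in the Piech et al. paper.

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, sigma0 = 70.0, 10.0         # assumed prior over true grades s_i
eta0 = 3.0                       # assumed prior std of grader bias b_j
alpha0, beta0 = 3.0, 10.0        # assumed Inverse Gamma prior over grader variance r_j

n_assignments, n_graders = 5, 4
s = rng.normal(mu0, sigma0, size=n_assignments)             # true grades
b = rng.normal(0.0, eta0, size=n_graders)                   # grader biases
r = 1.0 / rng.gamma(alpha0, 1.0 / beta0, size=n_graders)    # InvGamma(alpha0, beta0) draws

# Observed score z_ij ~ N(s_i + b_j, sqrt(r_j))
z = np.array([[rng.normal(s[i] + b[j], np.sqrt(r[j]))
               for j in range(n_graders)] for i in range(n_assignments)])
print(np.round(z, 1))
```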
Is Peer Grading Accurate Enough?
1. Defined random variables for: true grade (s_i) for assignment i; observed score (z_i^j) for assignment i; bias (b_j) for each grader j; variance (r_j) for each grader j.
2. Designed a probabilistic model that defined the distributions for all random variables.
3. Found variable assignments using MAP estimation given the observed data.
Tuned Models of Peer Assessment. C. Piech, J. Huang, A. Ng, D. Koller
The last estimator has risen
Next time: Machine Learning algorithms
It's Normal to Be Normal
- Normal(µ_0, σ_0²) distribution: conjugate for Normal (with unknown µ, known σ²).
- Intuitive understanding of hyperparameters: a priori, believe the true µ is distributed ~ N(µ_0, σ_0²).
- Updating to get the posterior distribution: after observing n data points ...
- ... the posterior distribution for µ is:
  $N\left( \left( \frac{\mu_0}{\sigma_0^2} + \frac{\sum_{i=1}^{n} x_i}{\sigma^2} \right) \Big/ \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \right),\ \left( \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \right)^{-1} \right)$
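A few lines implementing this update; the prior parameters, the known σ², and the observations are assumed for illustration:

```python
import numpy as np

mu0, sigma0_sq = 0.0, 4.0                  # assumed prior: mu ~ N(0, 4)
sigma_sq = 1.0                             # assumed known data variance
x = np.array([1.2, 0.8, 1.5, 0.9, 1.1])    # hypothetical observations
n = len(x)

precision = 1 / sigma0_sq + n / sigma_sq   # reciprocal of the posterior variance
post_var = 1 / precision
post_mean = (mu0 / sigma0_sq + x.sum() / sigma_sq) * post_var
print(f"posterior for mu: N({post_mean:.3f}, {post_var:.3f})")
```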