MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012



Linear Regression. As the first step, we need to decide how we are going to represent the function f. One example: polynomial curve fitting. Notation:

$f(x, \mathbf{w}) = w_0 + w_1 x + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j = \mathbf{w}^T \mathbf{x}$

Now, given a training set, how do we pick, or learn, the parameters w?
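To make the representation concrete, here is a minimal Python sketch (not from the original slides; the helper names and the example weights are illustrative assumptions):

import numpy as np

def poly_features(x, M):
    # Map a scalar input x to the feature vector (1, x, x^2, ..., x^M).
    return np.array([x ** j for j in range(M + 1)])

def f(x, w):
    # f(x, w) = sum_j w_j * x^j = w^T phi(x)
    return w @ poly_features(x, len(w) - 1)

w = np.array([0.5, -1.0, 0.0, 2.0])   # an arbitrary cubic, M = 3
print(f(1.5, w))                      # 0.5 - 1.0*1.5 + 2.0*1.5**3 = 5.75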

Least Squares Loss Function

$L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \big(f(x_i, \mathbf{w}) - y_i\big)^2$

Learning by Gradient Descent. How to choose w in order to minimize L(w)? The general idea is to start with some initial guess for w, and then repeatedly change w to make L(w) smaller, until we converge to a value of w that minimizes L(w). Gradient descent is a natural search algorithm that updates w in the direction of steepest decrease of L:

$w_j \leftarrow w_j - \alpha \frac{\partial L(\mathbf{w})}{\partial w_j}$

where $\alpha$ is the learning rate. Now let us calculate the partial derivative. First consider one training instance (x, y), so the sum in L can be ignored:

$\frac{\partial L(\mathbf{w})}{\partial w_j} = \big(f(x, \mathbf{w}) - y\big)\,\frac{\partial}{\partial w_j}\Big(\sum_{j'=0}^{M} w_{j'} x_{j'} - y\Big) = \big(f(x, \mathbf{w}) - y\big)\, x_j$

so the update for one example is

$w_j \leftarrow w_j - \alpha \big(f(x, \mathbf{w}) - y\big)\, x_j$

Intuition: the update is proportional to the error term (f(x, w) - y). Thus for training examples whose prediction is already close to the actual value y, there is little need to change the parameters; in contrast, for examples with a large error, a larger change to the parameters will be made. This rule is called the LMS (least mean squares) update rule, aka the Widrow-Hoff learning rule.
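A minimal sketch of this single-example update, reusing the poly_features and f helpers above (alpha is the learning rate; the function name is mine, not from the slides):

def lms_update(w, x, y, alpha):
    # Widrow-Hoff / LMS rule for one instance (x, y):
    # w_j <- w_j - alpha * (f(x, w) - y) * x_j, applied to all j at once.
    phi = poly_features(x, len(w) - 1)
    error = w @ phi - y               # the error term f(x, w) - y
    return w - alpha * error * phi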

Learning by Gradient Descent (cont.) Now consider a training data set rather than only one example. There are two ways:

Batch gradient descent: scan through the entire training set before taking a single update step.
Repeat until convergence { $w_j \leftarrow w_j - \alpha \sum_{i=1}^{N} \big(f(x_i, \mathbf{w}) - y_i\big)\, x_{ij}$ (for every j) }

Stochastic gradient descent: start making progress right away, and continue to make progress with each example it looks at. Much faster than batch gradient descent. The parameters will keep oscillating around the minimum of L, but in practice most of the values near the minimum will be reasonably good approximations to the true minimum.
Repeat until convergence { for each training example $(x_i, y_i)$ { $w_j \leftarrow w_j - \alpha \big(f(x_i, \mathbf{w}) - y_i\big)\, x_{ij}$ (for every j) } }

Particularly when the training set is large, stochastic gradient descent is often preferred.
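A rough sketch contrasting the two loops, again under the assumptions above (fixed iteration counts stand in for a real convergence test):

def batch_gd(X, Y, w, alpha, n_iters=1000):
    # Each step uses the gradient summed over the entire training set.
    for _ in range(n_iters):
        grad = sum((f(x, w) - y) * poly_features(x, len(w) - 1) for x, y in zip(X, Y))
        w = w - alpha * grad
    return w

def stochastic_gd(X, Y, w, alpha, n_epochs=100):
    # One update per example; w oscillates near the minimum but gets close quickly.
    for _ in range(n_epochs):
        for x, y in zip(X, Y):
            w = lms_update(w, x, y, alpha)
    return w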

0th Order Polynomial (* a number of the following slides are from Bishop's slides)

1st Order Polynomial

3rd Order Polynomial

9th Order Polynomial

Over-fitting. Root-Mean-Square (RMS) Error:

$E_{RMS} = \sqrt{2 L(\mathbf{w}^*) / N}$
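In code this is just (a small sketch under the same assumptions as above):

def rms_error(X, Y, w):
    # E_RMS = sqrt(2 L(w) / N); dividing by N makes different data set sizes comparable.
    L = 0.5 * sum((f(x, w) - y) ** 2 for x, y in zip(X, Y))
    return np.sqrt(2 * L / len(X))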

Polynomial Coefficients

Data Set Size: 9th Order Polynomial (fits for two different training-set sizes)

Regularization. Penalize large coefficient values:

$L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \big(f(x_i, \mathbf{w}) - y_i\big)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$
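A sketch of the penalized loss and its gradient (lam stands for λ; whether to penalize w_0 is a modeling choice, here it is penalized as in the formula above):

def regularized_loss(X, Y, w, lam):
    data_term = 0.5 * sum((f(x, w) - y) ** 2 for x, y in zip(X, Y))
    return data_term + 0.5 * lam * (w @ w)

def regularized_grad(X, Y, w, lam):
    # Same gradient as before plus lam * w, which shrinks large coefficients.
    grad = sum((f(x, w) - y) * poly_features(x, len(w) - 1) for x, y in zip(X, Y))
    return grad + lam * w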

Regularization (fitted curves for different regularization strengths)

Regularization: $E_{RMS}$ vs. $\ln \lambda$

Polynomial Coefficients

Let us go back to the simple case. Linear function:

$f(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_p x_p = \sum_{j=0}^{p} w_j x_j = \mathbf{w}^T \mathbf{x}$ (with $x_0 = 1$)

And the loss function:

$L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \big(f(\mathbf{x}_i, \mathbf{w}) - y_i\big)^2$

Should all examples be treated equally? How about if we assign different weights to different examples?

Locally Weighted Linear Regression. Fit w to minimize

$L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \alpha_i \big(f(\mathbf{x}_i, \mathbf{w}) - y_i\big)^2$

where $\alpha_i$ is a non-negative weight for $\mathbf{x}_i$:

$\alpha_i = \exp\Big(-\frac{(\mathbf{x}_i - \mathbf{x}_q)^2}{2\tau^2}\Big)$

and $\mathbf{x}_q$ is the example whose y we want to predict. With this, we have our first example of a non-parametric algorithm: the number of parameters is not fixed and may grow with the size of the training set.
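One way to implement this for a query point x_q is to form the weights and solve the weighted least-squares problem in closed form; this is a sketch under that assumption (the closed-form solve is not prescribed by the slide, and tau is the bandwidth):

def locally_weighted_predict(X, Y, x_q, tau, M=3):
    # Weight each training example by its closeness to the query point x_q.
    Phi = np.array([poly_features(x, M) for x in X])              # N x (M+1) design matrix
    a = np.exp(-((np.asarray(X) - x_q) ** 2) / (2 * tau ** 2))    # weights alpha_i
    A = np.diag(a)
    # Minimize sum_i alpha_i (phi(x_i)^T w - y_i)^2 via the weighted normal equations.
    w = np.linalg.solve(Phi.T @ A @ Phi, Phi.T @ A @ np.asarray(Y))
    return w @ poly_features(x_q, M)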

Probabilistic Interpretation. Let us assume that each example has an error term $\varepsilon_i$ to represent unmodeled effects:

$y_i = \mathbf{w}^T \mathbf{x}_i + \varepsilon_i$

Also assume that the $\varepsilon_i$ are IID Gaussian:

$p(\varepsilon_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big(-\frac{\varepsilon_i^2}{2\sigma^2}\Big)$

Then we have

$p(y_i \mid \mathbf{x}_i; \mathbf{w}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big(-\frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2}\Big)$

Given data $X = \{\mathbf{x}_i\}$ and their corresponding $Y$, the likelihood is

$L(\mathbf{w}) = p(Y \mid X; \mathbf{w}) = \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i; \mathbf{w}) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big(-\frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2}\Big)$

Maximum (log-)likelihood. Log likelihood:

$\ell(\mathbf{w}) = \log p(Y \mid X; \mathbf{w}) = \log \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big(-\frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2}\Big) = N \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$

Maximizing this is the same as minimizing least squares.

* Under these probabilistic assumptions, minimizing the least-squares regression loss corresponds to maximum likelihood estimation of w.

Probability Distributions

The Rules of Probability. Sum rule: $p(X) = \sum_{Y} p(X, Y)$. Product rule: $p(X, Y) = p(Y \mid X)\, p(X)$.

Probability Densities

The Gaussian Distribution: $\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(x - \mu)^2}{2\sigma^2}\Big)$

Gaussian Mean and Variance: $E[x] = \mu$, $\operatorname{var}[x] = \sigma^2$

The Multivariate Gaussian: $\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\Big)$

Examples: plots of bivariate Gaussians with covariance matrices such as $\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} 0.6 & 0 \\ 0 & 0.6 \end{pmatrix}$

Gaussian Parameter Estimation. Likelihood function for i.i.d. data $\mathbf{x} = (x_1, \dots, x_N)^T$:

$p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)$

Maximum (Log) Likelihood:

$\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2$

Properties of $\mu_{ML}$ and $\sigma^2_{ML}$: the ML estimate of the mean is unbiased, but the ML estimate of the variance is biased, $E[\sigma^2_{ML}] = \frac{N-1}{N}\sigma^2$.

Binary Variables (1). Coin flipping: heads = 1, tails = 0, with $p(x = 1 \mid \mu) = \mu$. Bernoulli distribution:

$\operatorname{Bern}(x \mid \mu) = \mu^x (1 - \mu)^{1 - x}$

Binary Variables (2). N coin flips: Binomial distribution for the number of heads m:

$\operatorname{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^m (1 - \mu)^{N - m}$

Binomial Distribution

Parameter Estimation (1). ML for Bernoulli. Given a data set $D = \{x_1, \dots, x_N\}$ with m heads (1) and N - m tails (0), maximizing $p(D \mid \mu) = \prod_{n} \mu^{x_n} (1 - \mu)^{1 - x_n}$ gives $\mu_{ML} = m / N$.

Parameter Estimation (2). Example: if every observed toss in D lands heads, then $\mu_{ML} = 1$ and the prediction is that all future tosses will land heads up. Overfitting to D!
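A tiny sketch of the problem (the data set of three heads is my own illustrative choice):

def bernoulli_mle(D):
    # mu_ML = m / N, the fraction of observed tosses that landed heads.
    return sum(D) / len(D)

D = [1, 1, 1]                  # three tosses, all heads
print(bernoulli_mle(D))        # 1.0, i.e. predict every future toss is heads: overfitting to D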

Beta Distribution. Distribution over $\mu \in [0, 1]$:

$\operatorname{Beta}(\mu \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, \mu^{a-1} (1 - \mu)^{b-1}$

Bayesian Bernoulli. The Beta distribution provides the conjugate prior for the Bernoulli distribution.

Beta Distribution

Prior × Likelihood = Posterior: $\operatorname{Beta}(\mu \mid a, b) \times \mu^{m} (1 - \mu)^{l} \;\propto\; \operatorname{Beta}(\mu \mid a + m, b + l)$, where m is the number of heads and l the number of tails.

Properties of the Posterior. As the size of the data set, N, increases, the posterior becomes more sharply peaked and its mean approaches the maximum likelihood estimate.

Prediction under the Posterior. What is the probability that the next coin toss will land heads up?

$p(x = 1 \mid D) = \int_0^1 p(x = 1 \mid \mu)\, p(\mu \mid D)\, d\mu = E[\mu \mid D] = \frac{m + a}{m + a + l + b}$
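A sketch of the Beta-Bernoulli computation (the hyperparameters a = b = 2 and the toy data are assumptions for illustration): with prior Beta(mu | a, b) and m heads, l tails observed, the posterior is Beta(mu | a + m, b + l) and the predictive probability of heads is its mean.

def beta_bernoulli_posterior(a, b, D):
    # Conjugate update: the counts of heads (m) and tails (l) simply add to the Beta parameters.
    m = sum(D)
    l = len(D) - m
    return a + m, b + l

def predict_heads(a_post, b_post):
    # p(x = 1 | D) = E[mu | D] = a_post / (a_post + b_post)
    return a_post / (a_post + b_post)

a_post, b_post = beta_bernoulli_posterior(a=2, b=2, D=[1, 1, 1])
print(predict_heads(a_post, b_post))   # 5/7 ~ 0.71, instead of the MLE's overconfident 1.0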

Multinomial Variables. 1-of-K coding scheme: $\mathbf{x}$ is a K-dimensional vector with exactly one element $x_k = 1$ and all others 0, with

$p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}$, where $\mu_k \ge 0$ and $\sum_k \mu_k = 1$.

ML Parameter Estimation. Given a data set $D = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$, maximize the log likelihood subject to the constraint $\sum_k \mu_k = 1$; to ensure the constraint holds, use a Lagrange multiplier $\lambda$. The result is $\mu_k^{ML} = m_k / N$, where $m_k$ is the number of observations of category k; the derivation is sketched below.
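The standard derivation behind this slide, reconstructed rather than copied from the original, is:

\begin{aligned}
\log p(D \mid \boldsymbol{\mu}) &= \sum_{n=1}^{N} \sum_{k=1}^{K} x_{nk} \log \mu_k = \sum_{k=1}^{K} m_k \log \mu_k, \qquad m_k = \sum_{n} x_{nk} \\
\frac{\partial}{\partial \mu_k}\Big( \sum_{k} m_k \log \mu_k + \lambda \big(\textstyle\sum_{k} \mu_k - 1\big) \Big) &= \frac{m_k}{\mu_k} + \lambda = 0 \;\Rightarrow\; \mu_k = -\frac{m_k}{\lambda} \\
\sum_{k} \mu_k = 1 \;\Rightarrow\; \lambda = -N \;&\Rightarrow\; \mu_k^{ML} = \frac{m_k}{N}
\end{aligned}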

The Multinomial Distribution

The Dirichlet Distribution. Conjugate prior for the multinomial distribution.
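A sketch paralleling the Beta-Bernoulli case (the concrete prior and counts are illustrative assumptions): with prior Dir(mu | alpha) and observed counts m_1, ..., m_K, the posterior is Dir(mu | alpha_1 + m_1, ..., alpha_K + m_K).

import numpy as np

def dirichlet_posterior(alpha, counts):
    # Conjugate update: the observed category counts simply add to the Dirichlet parameters.
    return np.asarray(alpha, dtype=float) + np.asarray(counts, dtype=float)

def predictive_probs(alpha_post):
    # p(x_k = 1 | D) = E[mu_k | D] = alpha_post_k / sum_j alpha_post_j
    return alpha_post / alpha_post.sum()

alpha_post = dirichlet_posterior(alpha=[1, 1, 1], counts=[5, 0, 2])
print(predictive_probs(alpha_post))    # [0.6, 0.1, 0.3]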

Bayesian Multinomial (1)

Bayesian Multinomial (2)

Thanks! Jie Tang, DCST. http://keg.cs.tsinghua.edu.cn/jietang/ http://arnetminer.org Email: jietang@tsinghua.edu.cn