MLE and Bayesian Estimation
Jie Tang
Department of Computer Science & Technology, Tsinghua University
Linear Regression
As the first step, we need to decide how we are going to represent the function f. One example: polynomial curve fitting:

$f(x, \mathbf{w}) = w_0 + w_1 x + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j = \mathbf{w}^T \mathbf{x}$

Now, given a training set, how do we pick, or learn, the parameters w?
Least Squares Loss Function

$L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \left( f(x_i, \mathbf{w}) - y_i \right)^2$
Learning by Gradient Descent
How to choose w in order to minimize L(w)? The general idea is to start with some initial guess for w, and then repeatedly change w to make L(w) smaller, until we converge to a value of w that minimizes L(w). Gradient descent is a natural search algorithm that updates w in the direction of steepest decrease of L:

$w_j \leftarrow w_j - \alpha \frac{\partial L(\mathbf{w})}{\partial w_j}$

where $\alpha$ is the learning rate. Now let us calculate the partial derivative. First consider one training instance (x, y), so the sum in L can be ignored:

$\frac{\partial L(\mathbf{w})}{\partial w_j} = \frac{\partial}{\partial w_j} \frac{1}{2}\left( f(x, \mathbf{w}) - y \right)^2 = \left( f(x, \mathbf{w}) - y \right) \frac{\partial}{\partial w_j} \sum_{j=0}^{M} w_j x_j = \left( f(x, \mathbf{w}) - y \right) x_j$

which gives the update

$w_j \leftarrow w_j + \alpha \left( y - f(x, \mathbf{w}) \right) x_j$

Intuition: the update is proportional to the error term $(f(x, \mathbf{w}) - y)$. For training examples whose prediction is close to the actual value y, there is little need to change the parameters; in contrast, for examples with a large error, a larger change to the parameters will be made. This rule is called the LMS (least mean squares) update rule, aka the Widrow-Hoff learning rule.
Learning by Gradient Descent (cont.)
Then consider a training data set rather than only one example, in two ways:

Batch gradient descent: scan through the entire training set before taking a single step of update.
Repeat until convergence { $w_j \leftarrow w_j + \alpha \sum_{i=1}^{N} (y_i - f(x_i, \mathbf{w}))\, x_{ij}$ }

Stochastic gradient descent: start making progress right away, and continue to make progress with each example it looks at. Much faster than batch gradient descent. The parameters will keep oscillating around the minimum of L, but in practice most of the values near the minimum will be reasonably good approximations to the true minimum.
Repeat until convergence { for each $(x_i, y_i)$ { $w_j \leftarrow w_j + \alpha (y_i - f(x_i, \mathbf{w}))\, x_{ij}$ } }

Particularly when the training set is large, stochastic gradient descent is often preferred.
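To make the two schemes concrete, here is a minimal NumPy sketch of both update rules, fitting $f(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \mathbf{x}$; the function names, learning rate, and synthetic data are illustrative assumptions, not from the slides:

```python
import numpy as np

def batch_gd(X, y, lr=0.01, epochs=500):
    """Batch gradient descent: one update per full pass over the data."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # LMS update: w_j += lr * sum_i (y_i - f(x_i, w)) * x_ij
        w += lr * X.T @ (y - X @ w)
    return w

def stochastic_gd(X, y, lr=0.01, epochs=50):
    """Stochastic gradient descent: update after every single example."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            w += lr * (y[i] - X[i] @ w) * X[i]
    return w

# Tiny demo on synthetic data (the constant column x_0 = 1 acts as the bias term)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(-1, 1, 100)])
y = X @ np.array([0.5, 2.0]) + 0.1 * rng.standard_normal(100)
print(batch_gd(X, y))       # both should approach [0.5, 2.0]
print(stochastic_gd(X, y))
```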
0th Order Polynomial
* A number of the following pages are from Bishop's slides
1st Order Polynomial
3rd Order Polynomial
9th Order Polynomial
Over-fitting
Root-Mean-Square (RMS) Error: $E_{RMS} = \sqrt{2 L(\mathbf{w}) / N}$
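A short sketch of how $E_{RMS}$ exposes over-fitting, using the $\sin(2\pi x)$ toy data from Bishop's figures; the function names and data sizes here are illustrative assumptions:

```python
import numpy as np

def rms_error(w, x, y):
    """E_RMS = sqrt(2 L(w) / N) for the polynomial f(x, w) = sum_j w_j x^j."""
    Phi = np.vander(x, len(w), increasing=True)
    resid = Phi @ w - y
    return np.sqrt(resid @ resid / len(y))   # = sqrt(2 * (1/2) * SSE / N)

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 10)
x_test = np.linspace(0, 1, 100)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.standard_normal(10)
y_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.standard_normal(100)

for M in (0, 1, 3, 9):
    Phi = np.vander(x_train, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
    # M = 9 fits the 10 training points almost exactly but does badly on test data
    print(M, rms_error(w, x_train, y_train), rms_error(w, x_test, y_test))
```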
Polynomial Coefficients
Data Set Size: 9th Order Polynomial
Data Set Size: 9th Order Polynomial
Regularization
Penalize large coefficient values:

$L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \left( f(x_i, \mathbf{w}) - y_i \right)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$
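This penalized loss has a closed-form minimizer, given by the penalized normal equations $(\Phi^T \Phi + \lambda I)\mathbf{w} = \Phi^T \mathbf{y}$. A minimal sketch, assuming the same $\sin(2\pi x)$ toy data (names and data are illustrative, not from the slides):

```python
import numpy as np

def fit_polynomial_ridge(x, y, M, lam):
    """Minimize (1/2) sum_i (f(x_i, w) - y_i)^2 + (lam/2) ||w||^2 in closed form."""
    Phi = np.vander(x, M + 1, increasing=True)   # columns: x^0, x^1, ..., x^M
    # Penalized normal equations: (Phi^T Phi + lam I) w = Phi^T y
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ y)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)
print(fit_polynomial_ridge(x, y, M=9, lam=1e-10))  # essentially unregularized: huge coefficients
print(fit_polynomial_ridge(x, y, M=9, lam=1e-3))   # penalized: much smaller coefficients
```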
Regularization
Regularization
Regularization: $E_{RMS}$ vs. $\ln \lambda$
Polynomial Coefficients
Let us go back to the simple case
Linear function:

$f(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \cdots + w_p x_p = \mathbf{w}^T \mathbf{x}$

and the loss function:

$L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \left( f(\mathbf{x}_i, \mathbf{w}) - y_i \right)^2$

Should all examples be treated equally? How about if we assign different weights to different examples?
Locally Weighted Linear Regression
Fit w to minimize

$L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \alpha_i \left( f(\mathbf{x}_i, \mathbf{w}) - y_i \right)^2$

where $\alpha_i$ is a non-negative weight for $\mathbf{x}_i$:

$\alpha_i = \exp\left( -\frac{(x_i - x_q)^2}{2\tau^2} \right)$

where $x_q$ is the example whose y we want to predict. With this, we have our first example of a non-parametric algorithm: the number of parameters is not fixed and may grow with the size of the training set.
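Because the weights depend on the query point $x_q$, a fresh weighted least-squares problem is solved for every prediction, which is exactly why the method is non-parametric. A minimal NumPy sketch (function names, bandwidth, and data are illustrative assumptions):

```python
import numpy as np

def lwlr_predict(x_q, X, y, tau=0.3):
    """Locally weighted linear regression: fit w around the query x_q using
    weights alpha_i = exp(-(x_i - x_q)^2 / (2 tau^2)), then predict at x_q."""
    alpha = np.exp(-((X[:, 1] - x_q) ** 2) / (2 * tau ** 2))
    WX = X * alpha[:, None]                   # rows of X scaled by alpha_i
    # Weighted normal equations: X^T diag(alpha) X w = X^T diag(alpha) y
    w = np.linalg.solve(WX.T @ X, WX.T @ y)
    return np.array([1.0, x_q]) @ w

rng = np.random.default_rng(0)
xs = np.sort(rng.uniform(-3, 3, 200))
X = np.column_stack([np.ones_like(xs), xs])   # bias column plus raw feature
y = np.sin(xs) + 0.1 * rng.standard_normal(200)
print(lwlr_predict(1.5, X, y))                # close to sin(1.5) ~ 0.997
```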
Probabilistic Interpretation
Let us assume that each example has an error term $\varepsilon_i$ to represent unmodeled effects:

$y_i = \mathbf{w}^T \mathbf{x}_i + \varepsilon_i$

Also assume that the $\varepsilon_i$ are i.i.d. Gaussian:

$p(\varepsilon_i) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{\varepsilon_i^2}{2\sigma^2} \right)$

Then we have

$p(y_i \mid \mathbf{x}_i; \mathbf{w}) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2} \right)$

Given data $X = \{\mathbf{x}_i\}$ and their corresponding $Y$:

$L(\mathbf{w}) = p(Y \mid X; \mathbf{w}) = \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i; \mathbf{w}) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2} \right)$
Maximum (log-)Likelihood
Log likelihood:

$\log L(\mathbf{w}) = \log \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2} \right) = N \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$

Maximizing this is the same as minimizing least squares.
* Under this probabilistic assumption, minimizing the least squares loss corresponds to maximum likelihood estimation of w.
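A quick numerical sanity check of this equivalence: the least-squares solution should give a higher Gaussian log-likelihood than any other w. The names, noise level, and data below are illustrative assumptions:

```python
import numpy as np

def gaussian_log_likelihood(w, X, y, sigma=0.1):
    """log p(Y|X; w) = N log(1/(sqrt(2 pi) sigma)) - (1/(2 sigma^2)) sum (y_i - w^T x_i)^2"""
    resid = y - X @ w
    N = len(y)
    return N * np.log(1 / (np.sqrt(2 * np.pi) * sigma)) - resid @ resid / (2 * sigma ** 2)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])
y = X @ np.array([0.5, 2.0]) + 0.1 * rng.standard_normal(50)

w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)       # least-squares solution
w_perturbed = w_ls + np.array([0.05, -0.05])
print(gaussian_log_likelihood(w_ls, X, y))          # higher
print(gaussian_log_likelihood(w_perturbed, X, y))   # lower: any other w is less likely
```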
Probability Distributions
The Rules of Probability
Sum Rule: $p(X) = \sum_Y p(X, Y)$
Product Rule: $p(X, Y) = p(Y \mid X)\, p(X)$
Probability Densities
The Gaussian Distribution
$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$
Gaussian Mean and Variance
$\mathbb{E}[x] = \mu, \quad \mathrm{var}[x] = \sigma^2$
The Multivariate Gaussian
$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)$
Examples: $\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, $\Sigma = \begin{pmatrix} 0.6 & 0 \\ 0 & 0.6 \end{pmatrix}$, $\Sigma = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$
Examples
Examples
Gaussian Parameter Estimation
Likelihood function: $p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)$
Maximum (Log) Likelihood
$\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n, \quad \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2$
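These two estimates are just the sample mean and the (biased) sample variance, as a minimal sketch shows; the sample size and true parameters below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)   # true mu = 2.0, sigma^2 = 2.25

mu_ml = x.mean()                                  # mu_ML = (1/N) sum_n x_n
var_ml = ((x - mu_ml) ** 2).mean()                # sigma^2_ML: biased, divides by N
var_unbiased = var_ml * len(x) / (len(x) - 1)     # correcting the (N-1)/N bias
print(mu_ml, var_ml, var_unbiased)
```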
Properties of $\mu_{ML}$ and $\sigma^2_{ML}$
$\mathbb{E}[\mu_{ML}] = \mu$ (unbiased), $\quad \mathbb{E}[\sigma^2_{ML}] = \frac{N-1}{N} \sigma^2$ (biased)
Binary Variables (1)
Coin flipping: heads = 1, tails = 0, with $p(x = 1 \mid \mu) = \mu$.
Bernoulli Distribution: $\mathrm{Bern}(x \mid \mu) = \mu^x (1 - \mu)^{1 - x}$
Binary Variables (2)
N coin flips: the number of heads m follows the Binomial Distribution
$\mathrm{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^m (1 - \mu)^{N - m}$
Binomial Distribution
Parameter Estimation (1)
ML for Bernoulli. Given a data set $D = \{x_1, \ldots, x_N\}$ of observed tosses, the maximum likelihood estimate is $\mu_{ML} = \frac{1}{N} \sum_n x_n = \frac{m}{N}$, where m is the number of heads.
Parameter Estimation (2)
Example: observing heads on every toss (e.g., $D = \{1, 1, 1\}$) gives $\mu_{ML} = 1$. Prediction: all future tosses will land heads up. This is overfitting to D.
Beta Distribution
Distribution over $\mu \in [0, 1]$:
$\mathrm{Beta}(\mu \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \mu^{a - 1} (1 - \mu)^{b - 1}$
Bayesian Bernoulli
The Beta distribution provides the conjugate prior for the Bernoulli distribution.
Beta Distribution
Prior × Likelihood = Posterior
$p(\mu \mid m, l, a, b) \propto \mu^{m + a - 1} (1 - \mu)^{l + b - 1}$, i.e., the posterior is $\mathrm{Beta}(\mu \mid a + m, b + l)$, where m is the number of heads and l the number of tails.
Properties of the Posterior
As the size of the data set, N, increases, the posterior becomes more sharply peaked and its variance shrinks.
Prediction under the Posterior
What is the probability that the next coin toss will land heads up?
$p(x = 1 \mid D) = \int_0^1 \mu\, p(\mu \mid D)\, d\mu = \frac{m + a}{m + a + l + b}$
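The conjugate update and the posterior predictive fit in a few lines of code. A minimal sketch, assuming a Beta(2, 2) prior and the three-heads data from the overfitting example (these choices are illustrative):

```python
def beta_bernoulli_update(a, b, tosses):
    """Posterior after observing tosses: Beta(a + m, b + l),
    where m = #heads, l = #tails (Beta is conjugate to Bernoulli)."""
    m = sum(tosses)
    l = len(tosses) - m
    return a + m, b + l

a, b = 2, 2                           # prior Beta(2, 2): mildly favors fair coins
tosses = [1, 1, 1]                    # three heads in a row
a_post, b_post = beta_bernoulli_update(a, b, tosses)

p_heads = a_post / (a_post + b_post)  # p(x=1|D) = (m + a) / (m + a + l + b)
print(p_heads)                        # 5/7 ~ 0.714, not the overfit ML answer 1.0
```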
Multinomial Variables
1-of-K coding scheme: $\mathbf{x} = (0, \ldots, 0, 1, 0, \ldots, 0)^T$ with $\sum_k x_k = 1$, and $p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K} \mu_k^{x_k}$, where $\sum_k \mu_k = 1$.
ML Parameter Estimation
Given a data set D of N observations, maximize the likelihood subject to the constraint $\sum_k \mu_k = 1$; using a Lagrange multiplier gives $\mu_k^{ML} = \frac{m_k}{N}$, where $m_k$ is the number of observations of outcome k.
The Multinomial Distribution
$\mathrm{Mult}(m_1, \ldots, m_K \mid \boldsymbol{\mu}, N) = \binom{N}{m_1 m_2 \cdots m_K} \prod_{k=1}^{K} \mu_k^{m_k}$
The Dirichlet Distribution
Conjugate prior for the multinomial distribution:
$\mathrm{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1) \cdots \Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}, \quad \alpha_0 = \sum_k \alpha_k$
Bayesian Multinomial (1)
$p(\boldsymbol{\mu} \mid D, \boldsymbol{\alpha}) \propto p(D \mid \boldsymbol{\mu})\, p(\boldsymbol{\mu} \mid \boldsymbol{\alpha}) \propto \prod_{k=1}^{K} \mu_k^{\alpha_k + m_k - 1}$, i.e., the posterior is $\mathrm{Dir}(\boldsymbol{\mu} \mid \boldsymbol{\alpha} + \mathbf{m})$.
Bayesian Multinomial (2)
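The Dirichlet-multinomial update mirrors the Beta-Bernoulli one: add the observed counts to the prior pseudo-counts. A minimal sketch, with an illustrative uniform prior and made-up counts:

```python
import numpy as np

def dirichlet_update(alpha, counts):
    """Posterior over multinomial parameters: Dir(alpha + m),
    where m_k counts how often outcome k was observed."""
    return np.asarray(alpha) + np.asarray(counts)

alpha = np.array([1.0, 1.0, 1.0])     # uniform Dirichlet prior over a 3-sided die
counts = np.array([5, 0, 2])          # observed outcome counts m_k
alpha_post = dirichlet_update(alpha, counts)

# Posterior predictive for each outcome: (alpha_k + m_k) / sum_j (alpha_j + m_j)
print(alpha_post / alpha_post.sum())  # [0.6, 0.1, 0.3]: no outcome gets probability 0
```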
Thanks!
Jie Tang, DCST
http://keg.cs.tsinghua.edu.cn/jietang/
http://arnetminer.org
Email: jietang@tsinghua.edu.cn