MLE and Bayesian Estimation. Jie Tang Department of Computer Science & Technology Tsinghua University 2012



1 MLE and Bayesian Estimation. Jie Tang, Department of Computer Science & Technology, Tsinghua University, 2012

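2 Linear Regression. As the first step, we need to decide how we are going to represent the function f. One example is polynomial curve fitting: $f(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j$. Notation: the linear case $w_0 + w_1 x$ is written $\mathbf{w}^T \mathbf{x}$, with $\mathbf{x} = (1, x)^T$. Now, given a training set, how do we pick, or learn, the parameters w?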
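A minimal sketch of evaluating the polynomial model for a scalar input. The function name `poly_predict` and the coefficient values are made up for illustration; they are not taken from the slides.

```python
def poly_predict(x, w):
    """Evaluate f(x, w) = sum_j w[j] * x**j for a scalar input x."""
    return sum(w_j * x ** j for j, w_j in enumerate(w))


# Hypothetical coefficients of a 2nd-order polynomial: 1 + 2x + 3x^2
w = [1.0, 2.0, 3.0]
value = poly_predict(2.0, w)  # 1 + 2*2 + 3*4
print(value)
```

The order M of the polynomial is simply `len(w) - 1`, so the same function covers the 0th, 1st, 3rd and 9th order fits.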

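3 Least Squares Loss Function: $L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} (f(x_i, \mathbf{w}) - y_i)^2$, where $f(x_i, \mathbf{w})$ is the prediction for example i and $y_i$ is its target value.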
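The loss can be sketched in a few lines. This assumes a linear model with two coefficients and a tiny made-up data set; the names `predict` and `least_squares_loss` are illustrative only.

```python
def predict(x, w):
    """Linear model f(x, w) = w[0] + w[1] * x (an assumed model for this sketch)."""
    return w[0] + w[1] * x


def least_squares_loss(xs, ys, w):
    """L(w) = 1/2 * sum_i (f(x_i, w) - y_i)^2."""
    return 0.5 * sum((predict(x, w) - y) ** 2 for x, y in zip(xs, ys))


# Made-up data generated by y = 1 + 2x
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]

loss_exact = least_squares_loss(xs, ys, [1.0, 2.0])  # perfect fit
loss_off = least_squares_loss(xs, ys, [0.0, 2.0])    # every residual is -1
print(loss_exact, loss_off)
```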

4 Learning by Gradient Descent. How do we choose w in order to minimize L(w)? The general idea is to start with some initial guess for w and repeatedly change w to make L(w) smaller, until we converge to a value of w that minimizes L(w). Gradient descent is a natural search algorithm that updates w in the direction of steepest decrease of L:

$w_j := w_j - \eta \frac{\partial}{\partial w_j} L(\mathbf{w})$

where $\eta$ is the learning rate. Now let us calculate the partial derivative. First consider one training instance (x, y), so the sum in L can be ignored:

$\frac{\partial}{\partial w_j} L(\mathbf{w}) = \frac{\partial}{\partial w_j} \frac{1}{2} (f(\mathbf{x}, \mathbf{w}) - y)^2 = (f(\mathbf{x}, \mathbf{w}) - y) \frac{\partial}{\partial w_j} \left( \sum_{i=0}^{M} w_i x_i - y \right) = (f(\mathbf{x}, \mathbf{w}) - y) x_j$

so the update for a single example is

$w_j := w_j - \eta (f(\mathbf{x}, \mathbf{w}) - y) x_j$

This rule is called the LMS (least mean squares) update rule, also known as the Widrow-Hoff learning rule. Intuition: the update is proportional to the error term (f(x, w) - y). Thus, for training examples whose prediction is close to the actual value y, there is little need to change the parameters; in contrast, a larger change is made when the prediction is far from y.
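A minimal sketch of one LMS step on a single training example. The learning rate is called `eta` here, the feature vector is assumed to carry a constant first entry for the intercept, and the numbers are made up.

```python
def lms_update(w, x, y, eta):
    """One LMS (Widrow-Hoff) step; x is assumed to include the constant feature x[0] = 1."""
    error = sum(w_j * x_j for w_j, x_j in zip(w, x)) - y  # f(x, w) - y
    # Move each coefficient against the gradient (f(x, w) - y) * x_j.
    return [w_j - eta * error * x_j for w_j, x_j in zip(w, x)]


w = [0.0, 0.0]
x, y = [1.0, 2.0], 4.0
w_new = lms_update(w, x, y, eta=0.1)          # large error, visible change
w_same = lms_update([0.0, 2.0], x, y, eta=0.1)  # prediction already equals y
print(w_new, w_same)
```

The second call illustrates the intuition: when the prediction already matches y, the error term is zero and the parameters do not move.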

5 Learning by Gradient Descent (cont.) Then consider a training data set rather than only one example. There are two ways:

Batch gradient descent: scan through the entire training set before taking a single update step.
Repeat until convergence { $w_j := w_j - \eta \sum_{i=1}^{N} (f(\mathbf{x}_i, \mathbf{w}) - y_i) x_{i,j}$ (for every j) }

Stochastic gradient descent: start making progress right away, and continue to make progress with each example it looks at. It is often much faster than batch gradient descent. The parameters will keep oscillating around the minimum of L, but in practice most of the values near the minimum will be reasonably good approximations to the true minimum.
Repeat until convergence { For each example $(\mathbf{x}_i, y_i)$ { $w_j := w_j - \eta (f(\mathbf{x}_i, \mathbf{w}) - y_i) x_{i,j}$ (for every j) } }

Particularly when the training set is large, stochastic gradient descent is often preferred.
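The two variants can be sketched side by side. The data set, the learning rate `eta`, the fixed number of passes (used instead of a real convergence test) and the function names are all assumptions for this illustration. The stochastic version visits examples in a fixed order to stay reproducible; in practice the order is usually shuffled.

```python
def predict(x, w):
    """Linear model f(x, w) = w^T x."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def batch_gd(X, Y, eta=0.05, epochs=1000):
    """Batch gradient descent: one update per full pass over the data."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        errors = [predict(x, w) - y for x, y in zip(X, Y)]
        grad = [sum(e * x[j] for e, x in zip(errors, X)) for j in range(len(w))]
        w = [w_j - eta * g_j for w_j, g_j in zip(w, grad)]
    return w


def sgd(X, Y, eta=0.05, epochs=1000):
    """Stochastic gradient descent: one update per training example."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, y in zip(X, Y):
            error = predict(x, w) - y
            w = [w_j - eta * error * x_j for w_j, x_j in zip(w, x)]
    return w


# Made-up, noise-free data from y = 1 + 2x; the first feature is the constant 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = [1.0, 3.0, 5.0, 7.0]

w_batch = batch_gd(X, Y)
w_sgd = sgd(X, Y)
print(w_batch, w_sgd)
```

On this noise-free toy problem both variants approach the same coefficients. With noisy data and a constant learning rate, the stochastic version would keep oscillating near the minimum, as described above.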

6 0th Order Polynomial (* A number of the following pages are from Bishop's slides)

7 1st Order Polynomial

8 3rd Order Polynomial

9 9th Order Polynomial

10 Over-fitting. Root-Mean-Square (RMS) Error: $E_{RMS} = \sqrt{2 L(\mathbf{w}) / N}$
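A short sketch of the RMS error, assuming the least squares loss carries the usual factor 1/2. The function name and the numbers are made up.

```python
import math


def rms_error(predictions, targets):
    """E_RMS = sqrt(2 * L / N), with L = 1/2 * sum of squared errors."""
    n = len(targets)
    loss = 0.5 * sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return math.sqrt(2.0 * loss / n)


# Squared errors are 0, 0 and 4, so the mean squared error is 4/3.
e_rms = rms_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(e_rms)
```

Dividing by N makes data sets of different sizes comparable, and the square root puts the error on the same scale as the target y.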

11 Polynomial Coefficients

12 Data Set Size: 9th Order Polynomial

13 Data Set Size: 9th Order Polynomial

14 Regularization. Penalize large coefficient values: $L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} (f(x_i, \mathbf{w}) - y_i)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$
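A minimal sketch of how the penalty changes the gradient step, assuming a linear model and made-up data. The penalty adds `lam * w_j` to each gradient component. This sketch penalizes every coefficient, including the intercept, which is a simplifying assumption; `eta`, `lam` and the function names are illustrative.

```python
def predict(x, w):
    """Linear model f(x, w) = w^T x."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def fit_regularized(X, Y, lam, eta=0.02, steps=2000):
    """Gradient descent on 1/2 * sum (f - y)^2 + lam/2 * ||w||^2."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        errors = [predict(x, w) - y for x, y in zip(X, Y)]
        grad = [
            sum(e * x[j] for e, x in zip(errors, X)) + lam * w[j]
            for j in range(len(w))
        ]
        w = [w_j - eta * g_j for w_j, g_j in zip(w, grad)]
    return w


# Made-up data from y = 1 + 2x; the first feature is the constant 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = [1.0, 3.0, 5.0, 7.0]

w_plain = fit_regularized(X, Y, lam=0.0)   # ordinary least squares
w_reg = fit_regularized(X, Y, lam=10.0)    # coefficients shrink toward zero
print(w_plain, w_reg)
```

A larger `lam` trades a worse fit on the training data for smaller coefficients, which is the effect the regularized polynomial fits illustrate.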

15 Regularization

16 Regularization

17 Regularization: $E_{RMS}$ vs. $\ln \lambda$

18 Polynomial Coefficients

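19 Let us go back to the simple case. Linear function: $f(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \dots + w_p x_p = \sum_{j=0}^{p} w_j x_j = \mathbf{w}^T \mathbf{x}$ (taking $x_0 = 1$). And the loss function: $L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} (f(\mathbf{x}_i, \mathbf{w}) - y_i)^2$. Should all examples be treated equally? How about if we assign different weights to different examples?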

20 Locally Weighted Linear Regression. Fit w to minimize $L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \alpha_i (f(\mathbf{x}_i, \mathbf{w}) - y_i)^2$, where $\alpha_i$ is a non-negative weight for $\mathbf{x}_i$: $\alpha_i = \exp\left( -\frac{(\mathbf{x}_i - \mathbf{x}_q)^2}{2 \tau^2} \right)$. Here $\mathbf{x}_q$ is the example whose y we want to predict. With this, we have the first example of a non-parametric algorithm: the number of parameters is not fixed and may grow with the size of the training set.
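A minimal sketch of locally weighted linear regression for a scalar input, assuming a Gaussian-shaped weight with bandwidth `tau` and a model with intercept and slope. The weighted least squares problem is solved in closed form for the two coefficients; the function name and data are made up.

```python
import math


def lwr_predict(xs, ys, x_q, tau):
    """Predict y at the query point x_q from a locally weighted straight-line fit."""
    # Weight of each training example: close to 1 near x_q, close to 0 far away.
    a = [math.exp(-((x - x_q) ** 2) / (2.0 * tau ** 2)) for x in xs]
    s = sum(a)
    sx = sum(a_i * x for a_i, x in zip(a, xs))
    sy = sum(a_i * y for a_i, y in zip(a, ys))
    sxx = sum(a_i * x * x for a_i, x in zip(a, xs))
    sxy = sum(a_i * x * y for a_i, x, y in zip(a, xs, ys))
    # Closed-form minimizer of the weighted loss for w0 + w1 * x.
    w1 = (s * sxy - sx * sy) / (s * sxx - sx * sx)
    w0 = (sy - w1 * sx) / s
    return w0 + w1 * x_q


# Made-up data: a straight line and a parabola sampled on the same grid.
xs = [0.5 * k for k in range(9)]          # 0.0, 0.5, ..., 4.0
ys_line = [1.0 + 2.0 * x for x in xs]
ys_curve = [x * x for x in xs]

pred_line = lwr_predict(xs, ys_line, x_q=1.5, tau=1.0)
pred_curve = lwr_predict(xs, ys_curve, x_q=2.0, tau=0.5)
print(pred_line, pred_curve)
```

Note that the fit is redone for every query point, so the whole training set has to be kept around at prediction time. That is what makes the method non-parametric.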

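21 Probabilistic Interpretation. Let us assume that each example has an error term $\varepsilon_i$ to represent unmodeled effects: $y_i = \mathbf{w}^T \mathbf{x}_i + \varepsilon_i$. Also assume that the $\varepsilon_i$ are IID, e.g., Gaussian with zero mean and variance $\sigma^2$: $p(\varepsilon_i) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{\varepsilon_i^2}{2\sigma^2} \right)$. Then we have $p(y_i \mid \mathbf{x}_i; \mathbf{w}) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2} \right)$. Given data $X = \{\mathbf{x}_i\}$ and their corresponding $Y = \{y_i\}$, the likelihood (here L denotes the likelihood, not the loss) is $L(\mathbf{w}) = p(Y \mid X; \mathbf{w}) = \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i; \mathbf{w}) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2} \right)$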

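22 Maximum (log-)likelihood. Log likelihood: $\log L(\mathbf{w}) = \log p(Y \mid X; \mathbf{w}) = \sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i; \mathbf{w}) = \sum_{i=1}^{N} \log \left[ \frac{1}{\sqrt{2\pi}\sigma} \exp\left( -\frac{(y_i - \mathbf{w}^T \mathbf{x}_i)^2}{2\sigma^2} \right) \right] = N \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{N} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$. Maximizing this over w is the same as minimizing the least squares loss. * Under certain probabilistic assumptions, least squares regression corresponds to maximum likelihood estimation of w.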
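The relation between the two objectives can be checked numerically. This sketch assumes Gaussian noise with a known standard deviation `sigma`, a linear model, and made-up data; the function names are illustrative.

```python
import math


def predict(x, w):
    """Linear model f(x, w) = w^T x."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def least_squares_loss(X, Y, w):
    """1/2 * sum_i (y_i - w^T x_i)^2."""
    return 0.5 * sum((y - predict(x, w)) ** 2 for x, y in zip(X, Y))


def log_likelihood(X, Y, w, sigma):
    """Sum of log Gaussian densities of y_i around the prediction w^T x_i."""
    total = 0.0
    for x, y in zip(X, Y):
        total += -math.log(math.sqrt(2.0 * math.pi) * sigma)
        total += -((y - predict(x, w)) ** 2) / (2.0 * sigma ** 2)
    return total


# Made-up noisy data; the first feature is the constant 1.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
Y = [1.1, 2.9, 5.2]
sigma = 0.5

w_good, w_bad = [1.0, 2.0], [0.0, 1.0]
ll_good = log_likelihood(X, Y, w_good, sigma)
ll_bad = log_likelihood(X, Y, w_bad, sigma)

# The log likelihood equals a constant minus loss / sigma^2.
constant = len(Y) * math.log(1.0 / (math.sqrt(2.0 * math.pi) * sigma))
reconstructed = constant - least_squares_loss(X, Y, w_good) / sigma ** 2
print(ll_good, ll_bad, reconstructed)
```

Because the constant does not depend on w, the w with the smaller squared error always has the larger log likelihood, whatever value `sigma` takes.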

23 Probability Distributions

24 The Rules of Probability: Sum Rule, Product Rule

25 Probability Densities

26 The Gaussian Distribution

27 Gaussian Mean and Variance

28 The Multivariate Gaussian

29 Examples

30 Examples

31 Examples

32 Gaussian Parameter Estimation: Likelihood function

33 Maximum (Log) Likelihood

34 Properties of $\mu_{ML}$ and $\sigma^2_{ML}$
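For a univariate Gaussian, the maximum likelihood estimates are the sample mean and the sample variance with a 1/N factor. The mean estimate is unbiased, while the variance estimate is biased low by a factor (N-1)/N. A minimal sketch with made-up data (the function name is illustrative):

```python
def gaussian_mle(xs):
    """Maximum likelihood estimates of the mean and variance of a 1-D Gaussian."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # note: divides by N, not N - 1
    return mu, var


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_ml, var_ml = gaussian_mle(data)

# Rescaling by N / (N - 1) removes the bias of the ML variance estimate.
var_unbiased = var_ml * len(data) / (len(data) - 1)
print(mu_ml, var_ml, var_unbiased)
```

The gap between the two variance estimates vanishes as N grows, so the bias matters mostly for small data sets.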

35 Binary Variables (1). Coin flipping: heads=1, tails=0. Bernoulli Distribution

36 Binary Variables (2). N coin flips: Binomial Distribution

37 Binomial Distribution

38 Parameter Estimation (1). ML for Bernoulli, given a data set D of observed coin flips

39 Parameter Estimation (2). Example: if every toss in D lands heads up, the ML estimate predicts that all future tosses will land heads up. Overfitting to D
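The overfitting effect is easy to reproduce. For a Bernoulli variable, the ML estimate of the heads probability is the fraction of heads in the data. The sketch below uses made-up tosses and an illustrative function name.

```python
def bernoulli_mle(tosses):
    """ML estimate of the heads probability: number of heads divided by N."""
    return sum(tosses) / len(tosses)


mu_all_heads = bernoulli_mle([1, 1, 1])   # three tosses, all heads
mu_mixed = bernoulli_mle([1, 0, 1, 1])    # three heads, one tail
print(mu_all_heads, mu_mixed)
```

With only three heads observed, the estimate assigns zero probability to tails, which is the overfitting problem that motivates the Bayesian treatment.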

40 Beta Distribution. Distribution over $\mu \in [0, 1]$.

41 Bayesian Bernoulli. The Beta distribution provides the conjugate prior for the Bernoulli distribution.

42 Beta Distribution

43 Prior × Likelihood = Posterior

44 Properties of the Posterior. As the size of the data set, N, increases

45 Prediction under the Posterior. What is the probability that the next coin toss will land heads up?
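With a Beta(a, b) prior, observing m heads and l tails gives a Beta(a + m, b + l) posterior, and the predictive probability of heads is the posterior mean (a + m) / (a + m + b + l). A minimal sketch, with a made-up prior and made-up tosses (the function names are illustrative):

```python
def beta_posterior(a, b, tosses):
    """Update Beta(a, b) with observed tosses (1 = heads, 0 = tails)."""
    heads = sum(tosses)
    tails = len(tosses) - heads
    return a + heads, b + tails


def predict_heads(a, b):
    """Probability that the next toss is heads under Beta(a, b): the mean a / (a + b)."""
    return a / (a + b)


# Hypothetical prior Beta(2, 2) and three observed heads.
a_post, b_post = beta_posterior(2.0, 2.0, [1, 1, 1])
p_heads = predict_heads(a_post, b_post)
print(a_post, b_post, p_heads)
```

Unlike the ML estimate, the prediction stays below 1 after three heads. As N grows, the counts dominate the prior parameters and the prediction approaches the ML estimate.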

46 Multinomial Variables. 1-of-K coding scheme:

47 ML Parameter Estimation. Given a data set of 1-of-K coded observations, enforce the normalization constraint on the parameters with a Lagrange multiplier
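Solving the constrained maximization with a Lagrange multiplier gives the familiar result that each parameter is the fraction of observations in that category. A minimal sketch on made-up 1-of-K coded data (the function name is illustrative):

```python
def multinomial_mle(one_hot_rows):
    """ML estimate for 1-of-K coded data: the count of category k divided by N."""
    n = len(one_hot_rows)
    k = len(one_hot_rows[0])
    counts = [sum(row[j] for row in one_hot_rows) for j in range(k)]
    return [c / n for c in counts]


# Four made-up observations over K = 3 categories.
data = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
mu = multinomial_mle(data)
print(mu)
```

The estimates sum to one by construction, which is exactly the constraint the multiplier enforces.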

48 The Multinomial Distribution

49 The Dirichlet Distribution. Conjugate prior for the multinomial distribution.

50 Bayesian Multinomial (1)

51 Bayesian Multinomial (2)
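Because the Dirichlet is conjugate to the multinomial, the posterior is again a Dirichlet whose parameters are the prior parameters plus the observed counts. A minimal sketch, assuming a made-up uniform prior and made-up counts (the function names are illustrative):

```python
def dirichlet_posterior(alpha, counts):
    """Posterior Dirichlet parameters: prior parameter plus observed count per category."""
    return [a_k + m_k for a_k, m_k in zip(alpha, counts)]


def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution: each parameter divided by their sum."""
    total = sum(alpha)
    return [a_k / total for a_k in alpha]


# Hypothetical prior Dirichlet(1, 1, 1) and counts over K = 3 categories.
alpha_post = dirichlet_posterior([1.0, 1.0, 1.0], [2, 1, 1])
mean_post = dirichlet_mean(alpha_post)
print(alpha_post, mean_post)
```

The prior parameters act like pseudo-counts, so no category ends up with probability zero even if it was never observed.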

52 Thanks! Jie Tang, DCST. Email:
