The conjugate prior to a Bernoulli is. A) Bernoulli B) Gaussian C) Beta D) none of the above



1 The conjugate prior to a Bernoulli is A) Bernoulli B) Gaussian C) Beta D) none of the above

2 The conjugate prior to a Gaussian is A) Bernoulli B) Gaussian C) Beta D) none of the above

3 MAP estimates A) $\arg\max_\theta p(\theta \mid D)$ B) $\arg\max_\theta p(D \mid \theta)$ C) $\arg\max_\theta p(D \mid \theta)\,p(\theta)$ D) None of the above

4 MLE estimates A) $\arg\max_\theta p(\theta \mid D)$ B) $\arg\max_\theta p(D \mid \theta)$ C) $\arg\max_\theta p(D \mid \theta)\,p(\theta)$ D) None of the above

5 Consistent estimator A consistent estimator (or asymptotically consistent estimator) is an estimator, that is, a rule for computing estimates of a parameter θ, with the property that as the number of data points used increases indefinitely, the resulting sequence of estimates converges in probability to the true parameter θ.

6 Which is consistent for our coin-flipping example? A) MLE B) MAP C) Both D) Neither. MLE uses $P(D \mid \theta)$; MAP uses $P(\theta \mid D) \propto P(D \mid \theta)\,p(\theta)$.
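One way to explore this question is a small simulation. The sketch below assumes a Beta(a, b) prior on θ; the choice a = b = 2, the true θ, the seed, and the function names are illustrative assumptions, not taken from the slides. It compares the two estimates as the number of flips grows.

```python
import random


def mle_estimate(heads: int, n: int) -> float:
    """MLE for a coin: the observed fraction of heads."""
    return heads / n


def map_estimate(heads: int, n: int, a: float = 2.0, b: float = 2.0) -> float:
    """Posterior mode under an assumed Beta(a, b) prior (needs a, b > 1)."""
    return (heads + a - 1) / (n + a + b - 2)


def count_heads(theta: float, n: int, seed: int = 0) -> int:
    """Simulate n flips of a coin with P(heads) = theta."""
    rng = random.Random(seed)
    return sum(rng.random() < theta for _ in range(n))


true_theta = 0.3  # hypothetical true parameter
for n in (10, 1_000, 100_000):
    h = count_heads(true_theta, n)
    print(n, mle_estimate(h, n), map_estimate(h, n))
```

With a = b = 2 the gap between the two estimates is at most 1/(n + 2), so under these assumptions both sequences approach the same limit as n grows.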

7 Covariance Given random variables X and Y with joint density p(x, y) and means $E(X) = \mu_1$, $E(Y) = \mu_2$, the covariance of X and Y is $\mathrm{cov}(X, Y) = E[(X - \mu_1)(Y - \mu_2)]$. Equivalently, $\mathrm{cov}(X, Y) = E(XY) - E(X)E(Y)$; the proof follows easily from the definition. Note that $\mathrm{cov}(X, X) = \mathrm{var}(X)$.

8 Covariance If X and Y are independent then cov(X, Y) = 0. A) True B) False. If cov(X, Y) = 0 then X and Y are independent. A) True B) False

9 Covariance If X and Y are independent then cov(X, Y) = 0. Proof: Independence of X and Y implies that E(XY) = E(X)E(Y). Remark: The converse is NOT true in general. It can happen that the covariance is 0 but X and Y are highly dependent. (Try to think of an example.) For the bivariate normal case the converse does hold.
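One possible example for the remark above, worked exactly with fractions: take X uniform on {-1, 0, 1} and Y = X². This particular distribution and the helper name `expectation` are illustrative choices, not taken from the slides.

```python
from fractions import Fraction


def expectation(pmf, f):
    """E[f(X)] for a discrete X given as {value: probability}."""
    return sum(p * f(x) for x, p in pmf.items())


# X is uniform on {-1, 0, 1}; Y = X^2 is a deterministic function of X.
pmf_x = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}

e_x = expectation(pmf_x, lambda x: x)
e_y = expectation(pmf_x, lambda x: x * x)
e_xy = expectation(pmf_x, lambda x: x * (x * x))

cov_xy = e_xy - e_x * e_y  # cov(X, Y) = E(XY) - E(X)E(Y)

# Dependence check: P(Y = 0 | X = 0) = 1, but P(Y = 0) is only 1/3.
p_y_is_0 = sum(p for x, p in pmf_x.items() if x * x == 0)

print(cov_xy, p_y_is_0)
```

The covariance is exactly zero even though knowing X determines Y completely.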

10 Introduction to regression. Mostly by Andrew W. Moore, but with modifications by Lyle Ungar. Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to the source repository of Andrew's tutorials. Comments and corrections gratefully received. Copyright 2001, 2003, Andrew W. Moore

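11 Two interpretations of regression. Linear regression: $\hat{y} = w \cdot x$. Probabilistic/Bayesian (MLE and MAP): $y \sim N(w \cdot x, \sigma^2)$, with MLE $\arg\max_w p(D \mid w)$ and MAP $\arg\max_w p(D \mid w)\,p(w)$; here $p(D \mid w)$ means $p(y \mid w, x)$. Error minimization: $\|y - w \cdot X\|_p^p + \lambda \|w\|_q^q$. But first, we'll look at Gaussians.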

12 Single-Parameter Linear Regression

13 Linear Regression [Figure: scatter plot of the data with a fitted line of slope w] Inputs and outputs: $x_1 = 1, y_1 = 1$; $x_2 = 3, y_2 = 2.2$; $x_3 = 2, y_3 = 2$; $x_4 = 1.5, y_4 = 1.9$; $x_5 = 4, y_5 = 3.1$. Linear regression assumes that the expected value of the output given an input, $E[y \mid x]$, is linear. Simplest case: Out(x) = wx for some unknown w. Given the data, we can estimate w.

14 1-parameter linear regression Assume that the data is formed by $y_i = w x_i + \text{noise}_i$, where the noise signals are independent and the noise has a normal distribution with mean 0 and unknown variance $\sigma^2$. Then $p(y \mid w, x)$ has a normal distribution with mean $wx$ and variance $\sigma^2$.

15 Bayesian Linear Regression $p(y \mid w, x)$ = Normal(mean: $wx$, variance: $\sigma^2$), that is, $y \sim N(wx, \sigma^2)$. We have a set of data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$. We want to infer w from the data: $p(w \mid x_1, x_2, x_3, \ldots, x_n, y_1, y_2, \ldots, y_n) = P(w \mid D)$. You can use Bayes' rule to work out a posterior distribution for w given the data. Or you could do Maximum Likelihood Estimation.

16 Maximum likelihood estimation of w MLE asks: "For which value of w is this data most likely to have happened?" <=> For what w is $p(y_1, y_2, \ldots, y_n \mid w, x_1, x_2, x_3, \ldots, x_n)$ maximized? <=> For what w is $\prod_{i=1}^{n} p(y_i \mid w, x_i)$ maximized?

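17 For what w is $\prod_{i=1}^{n} p(y_i \mid w, x_i)$ maximized?
For what w is $\prod_{i=1}^{n} \exp\left(-\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^2\right)$ maximized?
For what w is $\sum_{i=1}^{n} -\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^2$ maximized?
For what w is $\sum_{i=1}^{n} (y_i - w x_i)^2$ minimized?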

18 First result MLE with Gaussian noise is the same as minimizing the $L_2$ error: $\arg\min_w \sum_{i=1}^{n} (y_i - w x_i)^2$

19 Linear Regression The maximum likelihood w is the one that minimizes the sum of squares of residuals: $E(w) = \sum_i (y_i - w x_i)^2 = \sum_i y_i^2 - \left(2 \sum_i x_i y_i\right) w + \left(\sum_i x_i^2\right) w^2$. We want to minimize a quadratic function of w.

20 Linear Regression Easy to show the sum of squares is minimized when $w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$. The maximum likelihood model is Out(x) = wx. We can use it for prediction.

21 Linear Regression Easy to show the sum of squares is minimized when $w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$. The maximum likelihood model is Out(x) = wx. We can use it for prediction. [Figure: a distribution p(w) plotted against w] Note: In Bayesian stats you'd have ended up with a probability distribution of w, and predictions would have given a probability distribution of expected output. It is often useful to know your confidence. Max likelihood can give some kinds of confidence too.
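The closed form above is short enough to check numerically. This is a minimal sketch; the data values and the function names `fit_w` and `sum_squared_error` are hypothetical, chosen only for illustration.

```python
def fit_w(xs, ys):
    """Maximum likelihood slope for Out(x) = w * x: sum(x*y) / sum(x*x)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def sum_squared_error(w, xs, ys):
    """E(w) = sum of squared residuals for slope w."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, roughly y = 2x with small deviations.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

w_mle = fit_w(xs, ys)
print(w_mle, sum_squared_error(w_mle, xs, ys))
print(w_mle * 4.0)  # prediction Out(4) under the fitted model
```

Nudging w in either direction away from `w_mle` should only increase the sum of squares, which is a quick sanity check on the formula.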

22 But what about MAP? MLE: $\arg\max_w \prod_{i=1}^{n} p(y_i \mid w, x_i)$. MAP: $\arg\max_w \prod_{i=1}^{n} p(y_i \mid w, x_i)\,p(w)$.

23 But what about MAP? MAP: $\arg\max_w \prod_{i=1}^{n} p(y_i \mid w, x_i)\,p(w)$. We assumed $y_i \sim N(w x_i, \sigma^2)$. Now add a prior assumption that $w \sim N(0, \gamma^2)$.

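24 For what w is $\prod_{i=1}^{n} p(y_i \mid w, x_i)\,p(w)$ maximized?
For what w is $\left[\prod_{i=1}^{n} \exp\left(-\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^2\right)\right] \exp\left(-\frac{1}{2}\left(\frac{w}{\gamma}\right)^2\right)$ maximized?
For what w is $\sum_{i=1}^{n} -\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^2 - \frac{1}{2}\left(\frac{w}{\gamma}\right)^2$ maximized?
For what w is $\sum_{i=1}^{n} (y_i - w x_i)^2 + \left(\frac{\sigma w}{\gamma}\right)^2$ minimized?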

25 Second result MAP with a Gaussian prior on w is the same as minimizing the $L_2$ error plus an $L_2$ penalty on w: $\arg\min_w \sum_{i=1}^{n} (y_i - w x_i)^2 + \lambda w^2$, where $\lambda = \sigma^2 / \gamma^2$. This is called ridge regression, shrinkage, or regularization.
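For the single-parameter case the penalized objective also has a closed form: setting the derivative of $\sum_i (y_i - w x_i)^2 + \lambda w^2$ to zero gives $w = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$. The sketch below checks this on hypothetical data; the function names and values are illustrative assumptions.

```python
def fit_ridge_w(xs, ys, lam):
    """Minimizer of sum((y - w*x)^2) + lam * w^2 for a single weight w."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def ridge_objective(w, xs, ys, lam):
    """L2 error plus L2 penalty on w."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) + lam * w * w


# Hypothetical data and penalty strengths.
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]

for lam in (0.0, 1.0, 14.0, 100.0):
    print(lam, fit_ridge_w(xs, ys, lam))
```

With `lam = 0` this reduces to the maximum likelihood slope, and larger values of `lam` (a tighter prior, smaller γ) shrink w toward zero.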

26 The speed of lectures is A) too slow B) good C) too fast

27 Multivariate Linear Regression

28 Multivariate Regression What if the inputs are vectors? [Figure: 2-d input example with axes $x_1$ and $x_2$] Dataset has form $(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), (\mathbf{x}_3, y_3), \ldots, (\mathbf{x}_n, y_n)$.

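29 Multivariate Regression Write matrix X and y thus: $X = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_n \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}$, $y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$. (There are n data points. Each input has p components.) The linear regression model assumes a vector w such that $\mathrm{Out}(\mathbf{x}) = \mathbf{x} \cdot w = w_1 x[1] + w_2 x[2] + \ldots + w_p x[p]$. The max. likelihood w is $w = (X^T X)^{-1} (X^T y)$.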

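30 Multivariate Regression Write matrix X and Y thus: $X = \begin{bmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_R \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & & & \vdots \\ x_{R1} & x_{R2} & \cdots & x_{Rm} \end{bmatrix}$, $Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_R \end{bmatrix}$. (There are R datapoints. Each input has m components.) The linear regression model assumes a vector w such that $\mathrm{Out}(\mathbf{x}) = w^T \mathbf{x} = w_1 x[1] + w_2 x[2] + \ldots + w_m x[m]$. The max. likelihood w is $w = (X^T X)^{-1} (X^T Y)$. IMPORTANT EXERCISE: PROVE IT!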

31 Multivariate Regression (con't) The max. likelihood w is $w = (X^T X)^{-1} (X^T Y)$. $X^T X$ is an m x m matrix: its i,j'th element is $\sum_{k=1}^{R} x_{ki} x_{kj}$. $X^T Y$ is an m-element vector: its i'th element is $\sum_{k=1}^{R} x_{ki} y_k$.
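The element-wise sums above translate directly into code. The following is a minimal pure-Python sketch; the helper names (`xtx`, `xty`, `solve_linear_system`, `fit_weights`) and the tiny dataset are hypothetical. It solves the system $(X^T X) w = X^T Y$ by Gaussian elimination rather than forming the inverse explicitly, which gives the same w when $X^T X$ is invertible.

```python
def xtx(rows):
    """X^T X: element (i, j) is the sum over datapoints k of x_ki * x_kj."""
    m = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]


def xty(rows, ys):
    """X^T Y: element i is the sum over datapoints k of x_ki * y_k."""
    m = len(rows[0])
    return [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(m)]


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_weights(rows, ys):
    """Maximum likelihood w from the normal equations (X^T X) w = X^T Y."""
    return solve_linear_system(xtx(rows), xty(rows, ys))


# Hypothetical noise-free data generated from y = 2*x[1] + 3*x[2].
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
ys = [2.0, 3.0, 5.0, 7.0]
print(fit_weights(rows, ys))
```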

32 Constant Term in Linear Regression

33 What about a constant term? We may expect linear data that does not go through the origin. Statisticians and Neural Net Folks all agree on a simple obvious hack. Can you guess??

34 The constant term The trick is to create a fake input $X_0$ that always takes the value 1. Before (table columns $X_1$, $X_2$, Y): $Y = w_1 X_1 + w_2 X_2$ has to be a poor model. After (table columns $X_0$, $X_1$, $X_2$, Y): $Y = w_0 X_0 + w_1 X_1 + w_2 X_2 = w_0 + w_1 X_1 + w_2 X_2$ has a fine constant term. In this example, you should be able to see the MLE $w_0$, $w_1$ and $w_2$ by inspection.
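A small sketch of the fake-input trick with a single real input, so that the 2 x 2 normal equations can be solved by hand with Cramer's rule. The data (generated from y = 10 + 2x) and the function names are hypothetical.

```python
def fit_through_origin(xs, ys):
    """Model Y = w1 * X1 with no constant term."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def fit_with_constant(xs, ys):
    """Model Y = w0 * X0 + w1 * X1 where the fake input X0 is always 1."""
    rows = [(1.0, x) for x in xs]                  # prepend X0 = 1
    s00 = sum(r[0] * r[0] for r in rows)           # entries of X^T X
    s01 = sum(r[0] * r[1] for r in rows)
    s11 = sum(r[1] * r[1] for r in rows)
    t0 = sum(r[0] * y for r, y in zip(rows, ys))   # entries of X^T Y
    t1 = sum(r[1] * y for r, y in zip(rows, ys))
    det = s00 * s11 - s01 * s01
    w0 = (t0 * s11 - s01 * t1) / det               # Cramer's rule
    w1 = (s00 * t1 - s01 * t0) / det
    return w0, w1


# Hypothetical data lying exactly on y = 10 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [12.0, 14.0, 16.0, 18.0]

print(fit_through_origin(xs, ys))   # forced through the origin: poor fit
print(fit_with_constant(xs, ys))    # recovers the offset and the slope
```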

35 Heteroscedasticity... Linear Regression with varying noise

36 Regression with varying noise Suppose you know the variance of the noise that was added to each datapoint. [Table: $x_i$, $y_i$, $\sigma_i^2$ for each datapoint. Figure: the datapoints plotted for x from 0 to 3 and y from 0 to 3, each labeled with its σ.] Assume $y_i \sim N(w x_i, \sigma_i^2)$.

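37 MLE estimation with varying noise
$\arg\max_w \log p(y_1, y_2, \ldots, y_R \mid x_1, x_2, \ldots, x_R, \sigma_1^2, \sigma_2^2, \ldots, \sigma_R^2, w)$
$= \arg\min_w \sum_{i=1}^{R} \frac{(y_i - w x_i)^2}{\sigma_i^2}$ (assuming independence among noise and then plugging in equation for Gaussian and simplifying)
$= w$ such that $\sum_{i=1}^{R} \frac{x_i (y_i - w x_i)}{\sigma_i^2} = 0$ (setting dLL/dw equal to zero)
$= \frac{\sum_{i=1}^{R} x_i y_i / \sigma_i^2}{\sum_{i=1}^{R} x_i^2 / \sigma_i^2}$ (trivial algebra)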

38 This is Weighted Regression We are asking to minimize the weighted sum of squares $\arg\min_w \sum_{i=1}^{R} \frac{(y_i - w x_i)^2}{\sigma_i^2}$, where the weight for the i'th datapoint is $\frac{1}{\sigma_i^2}$. [Figure: the datapoints plotted for x from 0 to 3 and y from 0 to 3, each labeled with its σ.]
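A minimal sketch of the weighted estimate, assuming the per-point noise variances are known. The data values and the function name `fit_weighted_w` are hypothetical.

```python
def fit_weighted_w(xs, ys, variances):
    """Minimizer of sum((y - w*x)^2 / var): each point is weighted by 1/var."""
    num = sum(x * y / v for x, y, v in zip(xs, ys, variances))
    den = sum(x * x / v for x, v in zip(xs, variances))
    return num / den


# Hypothetical data: the second point is noisier, so it counts for less.
xs = [1.0, 2.0]
ys = [1.0, 4.0]

print(fit_weighted_w(xs, ys, [1.0, 4.0]))  # noisy second point is downweighted
print(fit_weighted_w(xs, ys, [2.0, 2.0]))  # equal variances: ordinary least squares
```

When all variances are equal the weights cancel and the estimate matches the unweighted $\sum_i x_i y_i / \sum_i x_i^2$.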

39 Non-linear Regression

40 Non-linear Regression Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g.: [Table: $x_i$, $y_i$ for each datapoint. Figure: the datapoints plotted for x from 0 to 3 and y from 0 to 3.] Assume $y_i \sim N(\sqrt{w + x_i}, \sigma^2)$.

41 Non-linear MLE estimation
$\arg\max_w \log p(y_1, y_2, \ldots, y_R \mid x_1, x_2, \ldots, x_R, \sigma, w)$
$= \arg\min_w \sum_{i=1}^{R} \left(y_i - \sqrt{w + x_i}\right)^2$ (assuming i.i.d. and then plugging in equation for Gaussian and simplifying)
$= w$ such that $\sum_{i=1}^{R} \frac{y_i - \sqrt{w + x_i}}{\sqrt{w + x_i}} = 0$ (setting dLL/dw equal to zero)

42 Non-linear MLE estimation
$\arg\max_w \log p(y_1, y_2, \ldots, y_R \mid x_1, x_2, \ldots, x_R, \sigma, w)$
$= \arg\min_w \sum_{i=1}^{R} \left(y_i - \sqrt{w + x_i}\right)^2$ (assuming i.i.d. and then plugging in equation for Gaussian and simplifying)
$= w$ such that $\sum_{i=1}^{R} \frac{y_i - \sqrt{w + x_i}}{\sqrt{w + x_i}} = 0$ (setting dLL/dw equal to zero)
We're down the algebraic toilet. So guess what we do?

43 Non-linear MLE estimation
$\arg\max_w \log p(y_1, y_2, \ldots, y_R \mid x_1, x_2, \ldots, x_R, \sigma, w)$
$= \arg\min_w \sum_{i=1}^{R} \left(y_i - \sqrt{w + x_i}\right)^2$ (assuming i.i.d. and then plugging in equation for Gaussian and simplifying)
$= w$ such that $\sum_{i=1}^{R} \frac{y_i - \sqrt{w + x_i}}{\sqrt{w + x_i}} = 0$ (setting dLL/dw equal to zero)
We're down the algebraic toilet. So guess what we do? Common (but not only) approach: numerical solutions: Line Search, Simulated Annealing, Gradient Descent, Conjugate Gradient, Levenberg-Marquardt, Newton's Method. Also, special-purpose statistical-optimization-specific tricks such as E.M. (see Gaussian Mixtures lecture for introduction).
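As an illustration of one of the numerical approaches listed above, here is a plain gradient descent sketch. It assumes the model mean is sqrt(w + x); the synthetic noise-free data, the starting point, the step size, and the function names are all hypothetical choices made for this example.

```python
import math


def gradient(w, xs, ys):
    """d/dw of sum((y - sqrt(w + x))^2), which is -sum((y - s) / s), s = sqrt(w + x)."""
    return -sum(
        (y - math.sqrt(w + x)) / math.sqrt(w + x) for x, y in zip(xs, ys)
    )


def fit_w(xs, ys, w_start=1.0, learning_rate=0.5, steps=500):
    """Fixed-step gradient descent on the sum of squared residuals."""
    w = w_start
    for _ in range(steps):
        w -= learning_rate * gradient(w, xs, ys)
    return w


# Synthetic, noise-free data generated with a hypothetical true w of 2.
true_w = 2.0
xs = [0.5, 1.0, 2.0, 3.0]
ys = [math.sqrt(true_w + x) for x in xs]

w_hat = fit_w(xs, ys)
print(w_hat)
```

A fixed step size works here because the objective is smooth and convex near the solution for this data; on real problems a line search or one of the other listed methods is usually a safer choice.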

44 Polynomial Regression

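45 Polynomial Regression So far we've mainly been dealing with linear regression. [Table: columns $X_1$, $X_2$, Y.] X is the matrix whose rows are the inputs $\mathbf{x}_k$ and y is the vector of outputs, e.g. $\mathbf{x}_1 = (3, 2)$, $y_1 = 7$. Z is the matrix whose rows are $\mathbf{z}_k = (1, x_{k1}, x_{k2})$, e.g. $\mathbf{z}_1 = (1, 3, 2)$. $\beta = (Z^T Z)^{-1} (Z^T y)$ and $y^{est} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$.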

46 Quadratic Regression It's trivial to do linear fits of fixed nonlinear basis functions. [Table: columns $X_1$, $X_2$, Y.] As before, e.g. $\mathbf{x}_1 = (3, 2)$, $y_1 = 7$. Z is the matrix whose rows are $\mathbf{z} = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2)$. $\beta = (Z^T Z)^{-1} (Z^T y)$ and $y^{est} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_1 x_2 + \beta_5 x_2^2$.

47 Quadratic Regression It's trivial to do linear fits of fixed nonlinear basis functions. Z is the matrix whose rows are $\mathbf{z} = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2)$, with $\beta = (Z^T Z)^{-1} (Z^T y)$ and $y^{est} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_1 x_2 + \beta_5 x_2^2$. Each component of a z vector is called a term. Each column of the Z matrix is called a term column. How many terms in a quadratic regression with m inputs? 1 constant term; m linear terms; (m+1)-choose-2 = m(m+1)/2 quadratic terms; (m+2)-choose-2 terms in total = $O(m^2)$. Note that solving $\beta = (Z^T Z)^{-1} (Z^T y)$ is thus $O(m^6)$.
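The quadratic term vector can be generated for any number of inputs. This is a minimal sketch; the function name `quadratic_terms` and the example input are illustrative assumptions.

```python
from itertools import combinations_with_replacement
from math import comb


def quadratic_terms(x):
    """Return z = (1, linear terms, all pairwise products x_i * x_j with i <= j)."""
    z = [1.0]                                   # 1 constant term
    z.extend(x)                                 # m linear terms
    z.extend(a * b for a, b in combinations_with_replacement(x, 2))
    return z                                    # m(m+1)/2 quadratic terms appended


z = quadratic_terms((3.0, 2.0))
print(z)

# The total term count should match (m+2)-choose-2 for any m.
for m in (1, 2, 5, 10):
    assert len(quadratic_terms([1.0] * m)) == comb(m + 2, 2)
```

For two inputs the terms come out in the order $(1, x_1, x_2, x_1^2, x_1 x_2, x_2^2)$; once Z is built from these rows, the fit itself is ordinary linear regression on Z.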

48 Q'th-degree polynomial Regression [Table: columns $X_1$, $X_2$, Y.] As before, e.g. $\mathbf{x}_1 = (3, 2)$, $y_1 = 7$. Z is the matrix whose rows are $\mathbf{z}$ = (all products of powers of inputs in which the sum of powers is Q or less). $\beta = (Z^T Z)^{-1} (Z^T y)$ and $y^{est} = \beta_0 + \beta_1 x_1 + \ldots$

49 m inputs, degree Q: how many terms?
= the number of unique terms of the form $x_1^{q_1} x_2^{q_2} \cdots x_m^{q_m}$ where $\sum_{i=1}^{m} q_i \le Q$
= the number of unique terms of the form $1^{q_0} x_1^{q_1} x_2^{q_2} \cdots x_m^{q_m}$ where $\sum_{i=0}^{m} q_i = Q$
= the number of lists of non-negative integers $[q_0, q_1, q_2, \ldots, q_m]$ in which $\sum_i q_i = Q$
= the number of ways of placing Q red disks on a row of squares of length Q+m
= (Q+m)-choose-Q
Example: $q_0 = 2$, $q_1 = 2$, $q_2 = 0$, $q_3 = 4$, $q_4 = 3$, so Q = 11, m = 4.
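The counting argument can be checked by brute force for small cases. In this sketch the function names are hypothetical; the enumeration simply lists every exponent tuple $(q_1, \ldots, q_m)$ whose sum is at most Q.

```python
from itertools import product
from math import comb


def count_terms_formula(m: int, q: int) -> int:
    """Number of terms of degree at most q in m inputs: (q + m)-choose-q."""
    return comb(q + m, q)


def count_terms_brute_force(m: int, q: int) -> int:
    """Count exponent tuples (q_1, ..., q_m) with non-negative entries summing to <= q."""
    return sum(1 for exps in product(range(q + 1), repeat=m) if sum(exps) <= q)


for m, q in [(2, 2), (3, 2), (2, 5), (4, 3)]:
    print(m, q, count_terms_formula(m, q), count_terms_brute_force(m, q))
```

For m = 2 and Q = 2 both counts give 6, matching the six terms of the quadratic regression with two inputs.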

50 What we have seen MLE with Gaussian noise is the same as minimizing the $L_2$ error. Other noise models will give other loss functions. MAP with a Gaussian prior adds a penalty to the $L_2$ error, giving ridge regression. Other priors will give different penalties. One can make nonlinear relations linear by transforming the features: polynomial regression; Radial Basis Functions (RBF) will be covered later.
