Statistical learning


Aubrey Hardy

1 Statistical learning
Model the data generation process.
Learn the model parameters.
Criterion to optimize: likelihood of the dataset (maximization).
Maximum Likelihood (ML) estimation: dataset $X = \{x_1, \dots, x_n\}$, statistical model $p(x;\theta)$ ($\theta$: parameters):
\[ \theta_{ML} = \arg\max_{\theta} p(X;\theta), \qquad p(X;\theta) = p(x_1, \dots, x_n;\theta) \]
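As a minimal sketch of ML estimation, assume the model $p(x;\theta)$ is a one-dimensional Gaussian with $\theta = (\mu, \sigma^2)$ and that the samples are i.i.d. (both are assumptions made only for this illustration; the function names are hypothetical). In that case the maximizer has a closed form: the sample mean and the biased sample variance.

```python
import math
import random

def gaussian_log_likelihood(data, mu, sigma2):
    """ln p(X; theta) for i.i.d. data under N(mu, sigma2)."""
    n = len(data)
    sq = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - sq / (2 * sigma2)

def gaussian_ml_estimate(data):
    """Closed-form ML estimate: sample mean and biased sample variance."""
    n = len(data)
    mu = sum(data) / n
    sigma2 = sum((x - mu) ** 2 for x in data) / n
    return mu, sigma2

# Synthetic dataset (assumed true parameters: mean 2.0, std 1.5).
random.seed(0)
X = [random.gauss(2.0, 1.5) for _ in range(500)]
mu_ml, sigma2_ml = gaussian_ml_estimate(X)
best = gaussian_log_likelihood(X, mu_ml, sigma2_ml)
```

Any other parameter value, for example `mu_ml + 0.1`, gives a log-likelihood no larger than `best`.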

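2 Bayesian Learning
Assign priors $p(\theta)$ on the parameters of the model $p(x;\theta)$.
Priors: a flexible way to impose constraints on the model parameters.
Bayesian learning: given dataset X, compute the posterior $p(\theta \mid X)$:
\[ P(\theta \mid X) = \frac{P(X \mid \theta)\,P(\theta)}{P(X)} \]
MAP solution:
\[ \theta_{MAP} = \arg\max_{\theta} P(\theta \mid X) = \arg\max_{\theta} P(X \mid \theta)\,P(\theta) \]
The prior may itself contain hyperparameters: $p(\theta;\lambda)$.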
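A small worked case of the MAP solution, under assumptions chosen only for illustration: the data are i.i.d. Gaussian with known variance `sigma2`, and the unknown mean has a Gaussian prior N(m0, s02). The posterior is then Gaussian and its mode is a weighted average of the prior mean and the data. The function name and hyperparameter names are hypothetical.

```python
def map_gaussian_mean(data, sigma2, m0, s02):
    """MAP estimate of the mean of N(mu, sigma2) under the prior mu ~ N(m0, s02).

    Maximizes P(X | mu) P(mu); for this conjugate pair the maximizer is
    (s02 * sum(x) + sigma2 * m0) / (n * s02 + sigma2).
    """
    n = len(data)
    return (s02 * sum(data) + sigma2 * m0) / (n * s02 + sigma2)

data = [1.0, 2.0, 3.0]
ml_mean = sum(data) / len(data)                                   # ML estimate, ignores the prior
map_tight = map_gaussian_mean(data, sigma2=1.0, m0=0.0, s02=1.0)  # informative prior
map_flat = map_gaussian_mean(data, sigma2=1.0, m0=0.0, s02=1e9)   # nearly flat prior
```

With the informative prior the estimate is pulled from the ML mean toward `m0`; with a nearly flat prior it approaches the ML estimate.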

3 Graphical Models
Graphical representation of the data generation process.
Represent dependencies between random variables (r.v.).
Each node corresponds to one r.v.
Each directed edge denotes dependence of the r.v. at the end of the edge on the r.v. at the beginning of the edge.
[Figure: nodes f, h, g with edges f → h, f → g, h → g, annotated with p(f), p(h | f), p(g | h, f)]
\[ p(g,h,f) = p(g \mid h,f)\,p(h \mid f)\,p(f) \]

4 Graphical Models
At each node, the conditional probability $p(X_i \mid pa_i)$ is provided ($pa_i$: the parents of node i).
For directed acyclic graphs (Bayesian networks) it holds:
\[ P(X_1, \dots, X_d) = \prod_{i=1}^{d} P(X_i \mid pa_i; \theta) \]
The model parameters θ exist in the conditional probabilities.

5 Types of problems in Graphical Models
- Inference: compute distributions among RVs (joint, marginal, conditional) (use Bayes' theorem and marginalization)
- Parameter estimation
- Combination of the above
Given a dataset, RVs are distinguished into observed and hidden (or latent).
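Inference by marginalization and Bayes' theorem can be sketched on the three-node network f → h, (h, f) → g from slide 3. All variables are assumed binary and the probability tables below are made-up numbers for illustration only.

```python
from itertools import product

# Hypothetical conditional probability tables for binary f, h, g.
p_f = {0: 0.7, 1: 0.3}
p_h_given_f = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # indexed [f][h]
p_g_given_hf = {                                            # indexed [(h, f)][g]
    (0, 0): {0: 0.8, 1: 0.2},
    (0, 1): {0: 0.5, 1: 0.5},
    (1, 0): {0: 0.3, 1: 0.7},
    (1, 1): {0: 0.1, 1: 0.9},
}

def joint(g, h, f):
    """p(g, h, f) = p(g | h, f) p(h | f) p(f)."""
    return p_g_given_hf[(h, f)][g] * p_h_given_f[f][h] * p_f[f]

def marginal_g(g):
    """p(g): sum the joint over the unobserved h and f."""
    return sum(joint(g, h, f) for h, f in product((0, 1), repeat=2))

def posterior_f_given_g(f, g):
    """p(f | g) via Bayes' theorem: marginalize h, divide by p(g)."""
    return sum(joint(g, h, f) for h in (0, 1)) / marginal_g(g)

total = sum(joint(g, h, f) for g, h, f in product((0, 1), repeat=3))
```

Here g plays the role of an observed variable while f and h are hidden; `posterior_f_given_g(1, 1)` is the probability that f = 1 after observing g = 1.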

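6 Maximum Likelihood (ML) Estimation in Graphical Models
Dataset: $Y = \{y_1, \dots, y_n\}$ (observations)
\[ p(Y;\theta) = p(y_1, \dots, y_n;\theta) = \prod_{i=1}^{n} p(y_i;\theta) \]
Log-likelihood:
\[ L(Y;\theta) = \sum_{i=1}^{n} \ln p(y_i;\theta), \qquad \theta_{ML} = \arg\max_{\theta} \sum_{i=1}^{n} \ln p(y_i;\theta) \]
The likelihood can be defined by marginalizing the hidden RVs x:
\[ \ln p(y;\theta) = \ln \int_x p(y,x;\theta)\,dx \]
P(Y,X): complete likelihood.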

7 EM algorithm (Expectation-Maximization)
Iterative method for maximizing the likelihood P(Y;Θ) (wrt Θ) when there exist hidden variables $x = (x_1, \dots, x_L)$.
Starting from a parameter vector $\Theta^{(0)}$, two steps at each iteration:
E-step: compute the posterior $P(x \mid Y;\Theta^{(t)})$ (inference).
M-step: estimate Θ by maximizing the expected complete log-likelihood:
\[ \Theta^{(t+1)} = \arg\max_{\Theta} \int_x P(x \mid Y;\Theta^{(t)}) \ln p(Y,x;\Theta)\,dx \]
At each EM iteration the likelihood increases (or stays the same). Typically we terminate at a local maximum.
Strong dependence on the initial parameters $\Theta^{(0)}$.
The expected complete log-likelihood is easy to compute in some cases (e.g. mixture models).
Problem: in several models the posterior P(x | Y;Θ) cannot be computed, so EM cannot be applied.

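8 Gaussian distribution
$x = (x_1, \dots, x_d)^T$:
\[ N(x;\mu,\Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) \right\} \]
Mean $\mu = (\mu_1, \dots, \mu_d)^T$, $\mu = E[x]$.
Covariance matrix $\Sigma = E[(x-\mu)(x-\mu)^T]$ (symmetric, positive definite).
$T = \Sigma^{-1}$ (precision matrix).
Models cloud-shaped data.
Graphical model: node x with $p(x;\mu,\Sigma) = N(\mu,\Sigma)$.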

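9 Cases for Σ
(a) Σ full
(b) Σ diagonal: statistical independence among the $x_i$:
\[ \phi(x) = \prod_{i=1}^{d} \phi_i(x_i), \qquad \phi_i(z) = \frac{1}{(2\pi\sigma_i^2)^{1/2}} \exp\left( -\frac{(z-\mu_i)^2}{2\sigma_i^2} \right) \]
(c) $\Sigma = \sigma^2 I$ (spherical):
\[ \phi(x) = \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left( -\frac{\|x-\mu\|^2}{2\sigma^2} \right) \]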

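10 Mixture models
\[ f(x) = \sum_{j=1}^{M} \pi_j\,\phi_j(x;\theta_j) \]
M pdf components $\phi_j(x)$.
Mixing weights: $\pi_1, \pi_2, \dots, \pi_M$ (priors).
Data generation: select a component j (using the priors), then sample from $\phi_j(x)$.
Gaussian mixture models (GMMs): the components $\phi_j(x)$ are Gaussian with $\theta_j = (\mu_j, \Sigma_j)$.
Graphical model: π → z → x, with $p(x \mid z_k = 1; \mu_k, \Sigma_k) = N(\mu_k, \Sigma_k)$.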

11 GMMs can approximate an arbitrary pdf if the number of components becomes arbitrarily large.
Posterior distribution:
\[ P(j \mid x) = \frac{\pi_j\,\phi_j(x;\theta_j)}{f(x)} \]
Can be used for clustering.
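The use of the posterior for clustering can be sketched as follows, assuming one-dimensional Gaussian components and hand-picked parameters (both assumptions for illustration; the function names are hypothetical). A point is assigned to the component with the largest posterior probability.

```python
import math

def normal_pdf(x, mu, var):
    """Density of the 1-D Gaussian N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_pdf(x, weights, means, variances):
    """f(x) = sum_j pi_j * phi_j(x)."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def posterior(x, weights, means, variances):
    """P(j | x) = pi_j * phi_j(x) / f(x) for every component j."""
    fx = mixture_pdf(x, weights, means, variances)
    return [w * normal_pdf(x, m, v) / fx for w, m, v in zip(weights, means, variances)]

def cluster(x, weights, means, variances):
    """Index of the component with the largest posterior probability."""
    post = posterior(x, weights, means, variances)
    return max(range(len(post)), key=post.__getitem__)

# Hypothetical two-component mixture.
weights, means, variances = [0.4, 0.6], [-2.0, 3.0], [0.5, 1.0]
labels = [cluster(x, weights, means, variances) for x in (-2.1, 3.2)]
```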

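12 EM for GMMs
Dataset $X = \{x_1, \dots, x_N\}$; M is given in advance.
Hidden variables: $z_i = (z_{i1}, \dots, z_{iM})$, where $z_{ij} = 1$ means that $x_i$ has been generated by $\phi_j$.
GMMs can be trained through EM.
EM applies easily since we can compute $P(z_{ij} = 1 \mid x_i)$:
\[ P(j \mid x_i) = \frac{\pi_j\,\phi_j(x_i;\theta_j)}{f(x_i)} \]
Graphical model: π → z → x, with $p(x \mid z_k = 1; \mu_k, \Sigma_k) = N(\mu_k, \Sigma_k)$.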

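13 EM for Mixture Models
At each iteration t:
E-step: compute the distribution of $z_i$:
\[ P^{(t)}(j \mid x_i) = \frac{\pi_j^{(t)}\,\phi_j(x_i;\theta_j^{(t)})}{\sum_{m=1}^{M} \pi_m^{(t)}\,\phi_m(x_i;\theta_m^{(t)})} \]
Define the expected complete log-likelihood:
\[ Q(\Theta;\Theta^{(t)}) = \sum_{i=1}^{n} \sum_{j=1}^{M} P^{(t)}(j \mid x_i)\,\log\big( \pi_j\,\phi_j(x_i;\theta_j) \big) \]
M-step: update the parameters:
\[ \Theta^{(t+1)} = \arg\max_{\Theta} Q(\Theta;\Theta^{(t)}) \]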

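14 EM for GMMs
M-step for Gaussian components (closed form solution):
\[ \mu_j^{(t+1)} = \frac{\sum_{i=1}^{n} P^{(t)}(j \mid x_i)\,x_i}{\sum_{i=1}^{n} P^{(t)}(j \mid x_i)} \]
\[ \Sigma_j^{(t+1)} = \frac{\sum_{i=1}^{n} P^{(t)}(j \mid x_i)\,(x_i - \mu_j^{(t+1)})(x_i - \mu_j^{(t+1)})^T}{\sum_{i=1}^{n} P^{(t)}(j \mid x_i)} \]
\[ \pi_j^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} P^{(t)}(j \mid x_i) \]
EM guarantees that:
\[ L(\Theta^{(t+1)}) \ge L(\Theta^{(t)}) \]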
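The E-step and the closed-form M-step can be sketched for the one-dimensional case, where each covariance reduces to a scalar variance. This is an illustrative sketch on synthetic data, not a production implementation: the data, the initial parameters, and the function names are all assumptions, and no safeguards against collapsing variances are included.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(data, pi, mu, var):
    """L = sum_i ln sum_j pi_j N(x_i; mu_j, var_j)."""
    return sum(
        math.log(sum(p * normal_pdf(x, m, v) for p, m, v in zip(pi, mu, var)))
        for x in data
    )

def em_step(data, pi, mu, var):
    M, n = len(pi), len(data)
    # E-step: responsibilities P(j | x_i) under the current parameters.
    resp = []
    for x in data:
        w = [pi[j] * normal_pdf(x, mu[j], var[j]) for j in range(M)]
        s = sum(w)
        resp.append([wj / s for wj in w])
    # M-step: responsibility-weighted mean, variance and mixing weight.
    new_pi, new_mu, new_var = [], [], []
    for j in range(M):
        nj = sum(r[j] for r in resp)
        m = sum(r[j] * x for r, x in zip(resp, data)) / nj
        v = sum(r[j] * (x - m) ** 2 for r, x in zip(resp, data)) / nj
        new_pi.append(nj / n)
        new_mu.append(m)
        new_var.append(v)
    return new_pi, new_mu, new_var

# Synthetic data from two well-separated Gaussians (assumed for the demo).
random.seed(1)
data = [random.gauss(-2.0, 0.7) for _ in range(200)]
data += [random.gauss(3.0, 1.0) for _ in range(300)]

pi, mu, var = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]   # initial parameters
history = [log_likelihood(data, pi, mu, var)]
for _ in range(50):
    pi, mu, var = em_step(data, pi, mu, var)
    history.append(log_likelihood(data, pi, mu, var))
```

The list `history` records the log-likelihood after every iteration, so the monotonicity guarantee can be checked directly. A different initialization may converge to a different local maximum.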

15 EM: local maxima

16 Other Mixture Models
Mixture of multinomials (discrete data)
Mixture of Student-t (robust to outliers)
Regression mixture models (time series clustering)
Spatial mixture models (image segmentation)

17 Likelihood maximization: the variational approach
Let x be the hidden RVs and Y the observations in a graphical model. For every pdf q(x) it holds that:
\[ L(Y;\Theta) = KL\big( q(x) \,\|\, p(x \mid Y;\Theta) \big) + E_q[\ln p(Y,x;\Theta)] - E_q[\ln q(x)] \]
\[ KL\big( q(x) \,\|\, p(x \mid Y;\Theta) \big) \ge 0 \quad \text{(Kullback-Leibler distance)} \]
$F(Y;q,\Theta) = E_q[\ln p(Y,x;\Theta)] - E_q[\ln q(x)]$: lower bound (variational bound) of L(Y;Θ).
q(x): variational approximation of p(x | Y;Θ).
\[ L(Y;\Theta) \ge F(Y;q,\Theta) \]
Equality holds when q(x) = p(x | Y;Θ). Then F = expected complete log-likelihood + a quantity independent of Θ.
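The decomposition of the log-likelihood into the variational bound plus a KL term can be checked numerically on a toy model. The sketch below assumes a single discrete hidden variable x with three states and a made-up joint table for one fixed observation Y; the integrals become sums.

```python
import math

# Hypothetical joint values p(Y, x; Theta) for a fixed Y and x in {0, 1, 2}.
joint = [0.10, 0.25, 0.05]

evidence = sum(joint)                       # p(Y; Theta), hidden x marginalized
log_lik = math.log(evidence)                # L(Y; Theta)
posterior = [p / evidence for p in joint]   # p(x | Y; Theta)

def bound(q):
    """F = E_q[ln p(Y, x)] - E_q[ln q(x)]."""
    return sum(qi * (math.log(pi) - math.log(qi)) for qi, pi in zip(q, joint) if qi > 0)

def kl(q, p):
    """KL(q || p) for discrete distributions."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.3, 0.2]             # an arbitrary approximating distribution
gap = log_lik - bound(q)        # should equal KL(q || posterior)
```

For any valid `q` the gap equals `kl(q, posterior)` and is non-negative; choosing `q = posterior` closes the gap.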

18 Variational EM (Neal & Hinton, 1998)
Maximize the variational bound F(Y;q,Θ) wrt q and Θ, instead of maximizing L(Y;Θ) wrt Θ.
VE-step:
\[ q^{(t+1)} = \arg\max_{q} F(q, \Theta^{(t)}) \]
VM-step:
\[ \Theta^{(t+1)} = \arg\max_{\Theta} F(q^{(t+1)}, \Theta) \]
The maximum of F(q,Θ) wrt q occurs for q(x) = p(x | Y;Θ) (then the VE-step is the E-step and the VM-step is the M-step, i.e. the EM algorithm).
When p(x | Y) cannot be computed analytically, in the VE-step we use approximations that just increase F (without maximizing F).

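19 Update q (VE-step)
\[ F(Y;q,\Theta) = \int_x q(x) \ln p(Y,x;\Theta)\,dx - \int_x q(x) \ln q(x)\,dx \]
1) Parametric form q(x;λ): the parameters λ are updated in the E-step.
2) Mean field approximation:
\[ q(x) = q(x_1, x_2, \dots, x_L) = \prod_{i=1}^{L} q_i(x_i) \]
Solution:
\[ q_i(x_i) = \frac{\exp \langle \ln p(x,Y) \rangle_{k \ne i}}{\int \exp \langle \ln p(x,Y) \rangle_{k \ne i}\,dx_i} \]
Non-linear system of equations.
$\langle \cdot \rangle_{k \ne i}$: expectation wrt all $q_k(x_k)$, except for $q_i(x_i)$.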

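20 Bayesian GMMs
\[ f(y) = \sum_{j=1}^{M} \pi_j\,N(y;\mu_j, T_j) \]
Priors on the parameters θ: π, $\mu = \{\mu_j\}$, $T = \{T_j\}$, which become RVs.
Conjugate priors are selected: p(θ | Y) has the same form as p(θ).
\[ \pi = (\pi_1, \dots, \pi_M) \sim \mathrm{Dirichlet}(a_1, \dots, a_M) \]
\[ \mu_j \sim N(m, S), \qquad p(\mu) = \prod_{j=1}^{M} p(\mu_j) \]
\[ T_j \sim \mathrm{Wishart}(v, V), \qquad p(T) = \prod_{j=1}^{M} p(T_j) \]
\[ p(y \mid x_k = 1, \mu, T) = N(\mu_k, T_k) \]
[Figure: graphical model of the Bayesian GMM with nodes π, x, y, μ, T and hyperparameters a; m, S; v, V]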

21 Bayesian GMMs
Prior p(μ) (almost uniform) discourages solutions having GMM components in the same region (desirable feature).
Prior p(T) = Wishart(v, V) discourages solutions with GMM components having covariance Σ very different from V (undesirable feature).
Prior p(π) prevents redundant GMM components from being eliminated (undesirable feature).

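22 Variational learning of Bayesian GMMs
Maximize F wrt q (no model parameters θ to be learnt).
Dataset $Y = \{y_i\}$, i = 1, ..., N.
M components: $N(\mu_j, T_j)$, j = 1, ..., M, with $\pi = (\pi_1, \pi_2, \dots, \pi_M)$.
Hidden RVs: $x = \{x_i\}$ (i = 1, ..., N), π, $\mu = \{\mu_j\}$, $T = \{T_j\}$ (j = 1, ..., M).
\[ p(y \mid x_k = 1, \mu, T) = N(\mu_k, T_k) \]
[Figure: Bayesian mixture distribution, graphical model with nodes π, x, y, μ, T and hyperparameters a; m, S; v, V]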

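23 Variational learning of Bayesian GMMs (Attias, NIPS 1999)
Mean field approximation q(x,π,μ,T) = q(x) q(θ):
\[ q(x) = \prod_{i=1}^{n} q(x_i) \]
\[ q(\theta) = q(\pi,\mu,T) = q(\pi)\,q(\mu)\,q(T) = q(\pi) \prod_{j=1}^{M} q(\mu_j) \prod_{j=1}^{M} q(T_j) \]
Non-linear system of equations for q(z) = q(x) q(θ), solved using an iterative update method:
\[ q_l(z_l) = \frac{\exp \langle \ln p(z,y) \rangle_{k \ne l}}{\int \exp \langle \ln p(z,y) \rangle_{k \ne l}\,dz_l} \]
Each factor keeps the form of its conjugate prior: $q(x_i)$ is a discrete distribution over $x_i = (x_{i1}, \dots, x_{iM})$, q(π) is Dirichlet, each $q(\mu_j)$ is Gaussian and each $q(T_j)$ is Wishart.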

24 Variational learning of Bayesian GMMs
In mixture models, by setting some $\pi_j = 0$ we can change the model order (number of components).
Prior p(π) prevents elimination of redundant components.
Corduneanu & Bishop, AISTATS 2001: the $\pi_j$ are considered as parameters and not RVs (the Dirichlet prior on π is removed).
Starting from a large number of components, we maximize F(q(x,μ,T);π) wrt
- q(x,μ,T) (VE-step) (mean field approximation)
- π (VM-step)
For the redundant components, the VM-step update will give $\pi_j = 0$ (the component will be eliminated from the GMM).

26 Unsupervised Dimensionality Reduction
Feature extraction: new features are created by combining the original features.
The new features are usually fewer than the original ones.
No class labels available (unsupervised).
Purpose:
- Avoid the curse of dimensionality
- Reduce the amount of time and memory required by data mining algorithms
- Allow data to be more easily visualized
- May help to eliminate irrelevant features or reduce noise
Techniques:
- Principal Component Analysis (optimal linear approach)
- Non-linear approaches (e.g. autoencoders, Kernel PCA)

27 PCA: Principal Component Analysis
Principal components: vectors originating from the center of the dataset (data centering: use (x - m) instead of x, m = mean(x)).
Principal component #1 points in the direction of the largest variance.
Each subsequent principal component is orthogonal to the previous ones and points in the direction of the largest variance of the residual subspace.

28 PCA: 2D Gaussian dataset

29 1st PCA axis

30 2nd PCA axis

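31 PCA algorithm
Given data $\{x_1, \dots, x_n\}$, compute the covariance matrix Σ:
\[ \Sigma = \frac{1}{n} \sum_{i=1}^{n} (x_i - m)(x_i - m)^T, \qquad m = \frac{1}{n} \sum_{i=1}^{n} x_i \]
PCA basis vectors = the eigenvectors of Σ (d×d).
$\{\lambda_i, u_i\}_{i=1..d}$ = eigenvalues/eigenvectors of Σ, with $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$ (eigenvalue ordering).
Select $\{\lambda_i, u_i\}_{i=1..q}$ (top q principal components).
Usually a few $\lambda_i$ are large and the remaining ones quite small.
Larger eigenvalue: more important eigenvector.
How to select q? By the percentage of explained variance.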

32 PCA algorithm
$W = [u_1\ u_2\ \dots\ u_q]^T$ (q×d projection matrix)
$Z = X W^T$ (n×q data projections, with X centered)
$X_{rec} = Z W$ (n×d reconstructions of X from Z)
PCA is the optimal linear projection: it minimizes the reconstruction error
\[ \sum_{i=1}^{n} \| x_i - x_i^{rec} \|^2 \]
In autoencoders we also minimize the same criterion, but the model is non-linear.
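The PCA steps (centering, covariance, eigendecomposition, projection, reconstruction) can be sketched with NumPy as below. The dataset is synthetic and the function name `pca` is hypothetical; the covariance uses the 1/n normalization, and the mean is added back after reconstruction so that `X_rec` is comparable with the uncentered `X`.

```python
import numpy as np

def pca(X, q):
    """Return (W, Z, X_rec, explained) for the top-q principal components."""
    m = X.mean(axis=0)
    Xc = X - m                               # data centering
    cov = Xc.T @ Xc / X.shape[0]             # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder: largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :q].T                     # q x d projection matrix
    Z = Xc @ W.T                             # n x q projections
    X_rec = Z @ W + m                        # n x d reconstructions
    explained = eigvals[:q].sum() / eigvals.sum()
    return W, Z, X_rec, explained

# Synthetic 2D Gaussian dataset (parameters assumed for the demo).
rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -1.0], [[3.0, 1.2], [1.2, 1.0]], size=400)
W, Z, X_rec, explained = pca(X, q=1)
mean_sq_error = np.mean(np.sum((X - X_rec) ** 2, axis=1))
```

The mean squared reconstruction error equals the sum of the discarded eigenvalues, which is why `explained` (the fraction of variance kept) is a natural criterion for choosing q.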

33 PCA example: face analysis
Original dataset: images 256x256

34 PCA example: face analysis
Principal components: 25 top eigenfaces

35 PCA example: face analysis
Reconstruction examples

36 $z = (z_1, \dots, z_q)$, $x = (x_1, \dots, x_d)$

38 Matrix C contains the parameters W and σ to be adjusted.

39 Setting σ → 0 we obtain the deterministic PCA solution.

42 Probabilistic PCA: EM algorithm
E-step:
M-step:
Parameters W, σ should be initialized (randomly).
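The E-step and M-step formulas are not present in the slide text. As a hedged sketch, the code below uses the standard EM updates for probabilistic PCA from Tipping and Bishop, which may differ in notation from the original slide. Assumptions: the data are already centered, W is a d×q matrix (the transpose of the projection-matrix convention on slide 32), and the marginal covariance is C = W Wᵀ + σ² I. The dataset and function names are made up for the demo.

```python
import numpy as np

def ppca_log_likelihood(Xc, W, sigma2):
    """Marginal log-likelihood of centered data under N(0, W W^T + sigma2 I)."""
    n, d = Xc.shape
    C = W @ W.T + sigma2 * np.eye(d)
    S = Xc.T @ Xc / n
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * n * (d * np.log(2 * np.pi) + logdet + np.trace(np.linalg.solve(C, S)))

def ppca_em_step(Xc, W, sigma2):
    n, d = Xc.shape
    q = W.shape[1]
    # E-step: posterior moments of the latent z_i given x_i.
    M = W.T @ W + sigma2 * np.eye(q)
    Minv = np.linalg.inv(M)
    Ez = Xc @ W @ Minv                         # rows are E[z_i]
    sum_Ezz = n * sigma2 * Minv + Ez.T @ Ez    # sum_i E[z_i z_i^T]
    # M-step: closed-form updates of W and sigma^2.
    W_new = (Xc.T @ Ez) @ np.linalg.inv(sum_Ezz)
    sigma2_new = (
        np.sum(Xc ** 2)
        - 2.0 * np.sum(Ez * (Xc @ W_new))
        + np.trace(sum_Ezz @ W_new.T @ W_new)
    ) / (n * d)
    return W_new, sigma2_new

# Synthetic data: one latent dimension embedded in 3D plus isotropic noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 1))
X = latent @ np.array([[2.0, 1.0, -1.0]]) + 0.3 * rng.normal(size=(300, 3))
Xc = X - X.mean(axis=0)

W, sigma2 = rng.normal(size=(3, 1)), 1.0       # random initialization
ll_history = [ppca_log_likelihood(Xc, W, sigma2)]
for _ in range(200):
    W, sigma2 = ppca_em_step(Xc, W, sigma2)
    ll_history.append(ppca_log_likelihood(Xc, W, sigma2))
```

As with any EM procedure, the recorded log-likelihood should not decrease between iterations, and the estimated `sigma2` should settle near the noise variance used to generate the data.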
