Probabilistic Unsupervised Learning

HT2015: SC4 Statistical Data Mining and Machine Learning
Dino Sejdinovic, Department of Statistics, Oxford

Probabilistic Methods

Algorithmic approach: Data → Algorithm → Analysis/Interpretation.
Probabilistic modelling approach: Unobserved process → Generative Model → Data → Analysis/Interpretation.

Mixture models suppose that our dataset $X$ was created by sampling iid from $K$ distinct populations (called mixture components). Typical samples in population $k$ can be modelled using a distribution $F_{\mu_k}$ with density $f(x \mid \mu_k)$. For a concrete example, consider a Gaussian with unknown mean $\mu_k$ and known diagonal covariance $\sigma^2 I$,

$$f(x \mid \mu_k) = (2\pi\sigma^2)^{-p/2} \exp\left(-\frac{1}{2\sigma^2}\|x - \mu_k\|_2^2\right).$$

Generative model: for $i = 1, 2, \ldots, n$:

First determine which population item $i$ came from (independently):
$$Z_i \sim \mathrm{Discrete}(\pi_1, \ldots, \pi_K), \quad \text{i.e. } P(Z_i = k) = \pi_k,$$
where the mixing proportions satisfy $\pi_k \ge 0$ for each $k$ and $\sum_{k=1}^K \pi_k = 1$.

If $Z_i = k$, then $X_i = (X_{i1}, \ldots, X_{ip})$ is sampled (independently) from the corresponding population distribution:
$$X_i \mid Z_i = k \sim F_{\mu_k}.$$

We observe that $X_i = x_i$ for each $i$, and would like to learn about the unknown parameters of the process.
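The two-step generative model above can be sketched in NumPy. This is a minimal illustration, not code from the slides: the function name `sample_mixture`, the seed, and the example parameter values are all assumptions chosen for the demonstration.

```python
import numpy as np

def sample_mixture(n, pis, mus, sigma, seed=0):
    """Draw n points from a mixture of spherical Gaussians N(mu_k, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    pis = np.asarray(pis, dtype=float)   # mixing proportions, shape (K,)
    mus = np.asarray(mus, dtype=float)   # component means, shape (K, p)
    # Step 1: Z_i ~ Discrete(pi_1, ..., pi_K), independently for each i.
    z = rng.choice(len(pis), size=n, p=pis)
    # Step 2: X_i | Z_i = k ~ N(mu_k, sigma^2 I).
    x = mus[z] + sigma * rng.standard_normal((n, mus.shape[1]))
    return x, z

# Illustrative values only: K = 2 components in p = 2 dimensions.
x, z = sample_mixture(500, pis=[0.3, 0.7], mus=[[-5.0, 0.0], [5.0, 0.0]], sigma=1.0)
```

In practice only `x` would be observed; `z` is returned here so that the latent labels can be compared with what an inference procedure recovers.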

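Unknowns to learn given data are:
Parameters: $\pi_1, \ldots, \pi_K \in [0, 1]$, $\mu_1, \ldots, \mu_K \in \mathbb{R}^p$, as well as
Latent variables: $z_1, \ldots, z_n$.

The joint probability over all cluster indicator variables $\{Z_i\}$ is:
$$p_Z((z_i)_{i=1}^n) = \prod_{i=1}^n \pi_{z_i} = \prod_{i=1}^n \prod_{k=1}^K \pi_k^{\mathbb{1}(z_i = k)}.$$

The joint density at observations $X_i = x_i$ given $Z_i = z_i$ is:
$$p_X((x_i)_{i=1}^n \mid (Z_i = z_i)_{i=1}^n) = \prod_{i=1}^n \prod_{k=1}^K f(x_i \mid \mu_k)^{\mathbb{1}(z_i = k)}.$$

So the joint probability/density [1] is:
$$p_{X,Z}((x_i, z_i)_{i=1}^n) = \prod_{i=1}^n \prod_{k=1}^K \left(\pi_k f(x_i \mid \mu_k)\right)^{\mathbb{1}(z_i = k)}.$$

[1] In this course we will treat probabilities and densities equivalently for notational simplicity. In general, the quantity is a density with respect to the product base measure, where the base measure is the counting measure for discrete variables and Lebesgue measure for continuous variables.

Posterior Distribution

Suppose we know the parameters $(\pi_k, \mu_k)_{k=1}^K$. $Z_i$ is a random variable and its posterior distribution given the data set $X$ is:
$$Q_{ik} := p(Z_i = k \mid x_i) = \frac{p(Z_i = k, x_i)}{p(x_i)} = \frac{\pi_k f(x_i \mid \mu_k)}{\sum_{j=1}^K \pi_j f(x_i \mid \mu_j)},$$
where the marginal probability of the $i$-th instance is:
$$p(x_i) = \sum_{j=1}^K p(Z_i = j, x_i) = \sum_{j=1}^K \pi_j f(x_i \mid \mu_j).$$

The posterior probability $Q_{ik}$ of $Z_i = k$ is called the responsibility of mixture component $k$ for data point $x_i$. The posterior distribution softly partitions the dataset among the components.

Maximum Likelihood

How can we learn about the parameters $\theta = (\pi_k, \mu_k)_{k=1}^K$ from data? Standard statistical methodology asks for the maximum likelihood estimator (MLE). The goal is to maximize the marginal probability of the data over the parameters:
$$\hat\theta_{ML} = \operatorname*{argmax}_{\theta} p(X \mid \theta) = \operatorname*{argmax}_{(\pi_k, \mu_k)_{k=1}^K} \prod_{i=1}^n p(x_i \mid (\pi_k, \mu_k)_{k=1}^K) = \operatorname*{argmax}_{(\pi_k, \mu_k)_{k=1}^K} \prod_{i=1}^n \sum_{k=1}^K \pi_k f(x_i \mid \mu_k) = \operatorname*{argmax}_{(\pi_k, \mu_k)_{k=1}^K} \underbrace{\sum_{i=1}^n \log \sum_{k=1}^K \pi_k f(x_i \mid \mu_k)}_{=:\, \ell((\pi_k, \mu_k)_{k=1}^K)}.$$

Marginal log-likelihood:
$$\ell((\pi_k, \mu_k)_{k=1}^K) := \log p(X \mid (\pi_k, \mu_k)_{k=1}^K) = \sum_{i=1}^n \log \sum_{k=1}^K \pi_k f(x_i \mid \mu_k).$$

The gradient w.r.t. $\mu_k$:
$$\nabla_{\mu_k} \ell((\pi_k, \mu_k)_{k=1}^K) = \sum_{i=1}^n \frac{\pi_k f(x_i \mid \mu_k)}{\sum_{j=1}^K \pi_j f(x_i \mid \mu_j)} \nabla_{\mu_k} \log f(x_i \mid \mu_k) = \sum_{i=1}^n Q_{ik} \nabla_{\mu_k} \log f(x_i \mid \mu_k).$$

Setting this to zero is difficult to solve, as $Q_{ik}$ depends implicitly on $\mu_k$.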
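The responsibilities $Q_{ik}$ and the marginal log-likelihood can be computed together. The sketch below assumes the spherical Gaussian density with known $\sigma$ from the example; the function names and the toy inputs are illustrative, and the log-sum-exp trick is an implementation choice for numerical stability rather than something stated on the slides.

```python
import numpy as np

def log_joint(x, pis, mus, sigma):
    """Return log(pi_k * f(x_i | mu_k)) for all i, k as an (n, K) array."""
    p = x.shape[1]
    sq_dist = ((x[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
    log_f = -0.5 * p * np.log(2 * np.pi * sigma**2) - sq_dist / (2 * sigma**2)
    return np.log(pis)[None, :] + log_f

def responsibilities_and_loglik(x, pis, mus, sigma):
    """Return Q (n, K) and the marginal log-likelihood sum_i log p(x_i)."""
    lj = log_joint(x, pis, mus, sigma)
    m = lj.max(axis=1, keepdims=True)                       # log-sum-exp shift
    log_px = m[:, 0] + np.log(np.exp(lj - m).sum(axis=1))   # log p(x_i)
    Q = np.exp(lj - log_px[:, None])                        # posterior p(Z_i = k | x_i)
    return Q, float(log_px.sum())

# Toy 1-D data: the third point lies exactly midway between the two means.
x = np.array([[0.0], [4.0], [2.0]])
pis = np.array([0.5, 0.5])
mus = np.array([[0.0], [4.0]])
Q, loglik = responsibilities_and_loglik(x, pis, mus, sigma=1.0)
```

Each row of `Q` sums to one, and the midway point is shared equally between the two components, which is the soft partition described above.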

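Maximum Likelihood

$$\sum_{i=1}^n Q_{ik} \nabla_{\mu_k} \log f(x_i \mid \mu_k) = 0$$

What if we ignore the dependence of $Q_{ik}$ on the parameters? Taking the mixture of Gaussians with covariance $\sigma^2 I$ as an example,
$$\sum_{i=1}^n Q_{ik} \nabla_{\mu_k}\left(-\frac{p}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\|x_i - \mu_k\|_2^2\right) = \frac{1}{\sigma^2}\sum_{i=1}^n Q_{ik}(x_i - \mu_k) = \frac{1}{\sigma^2}\left(\sum_{i=1}^n Q_{ik} x_i - \mu_k \sum_{i=1}^n Q_{ik}\right) = 0,$$
$$\mu_k^{ML?} = \frac{\sum_{i=1}^n Q_{ik} x_i}{\sum_{i=1}^n Q_{ik}}.$$

The estimate is a weighted average of data points, where the estimated mean of cluster $k$ uses its responsibilities to data points as weights.

Makes sense: suppose we knew that data point $x_i$ came from population $z_i$. Then $Q_{i z_i} = 1$ and $Q_{ik} = 0$ for $k \neq z_i$, and:
$$\mu_k^{ML?} = \frac{\sum_{i: z_i = k} x_i}{\sum_{i: z_i = k} 1} = \mathrm{avg}\{x_i : z_i = k\}.$$
Our best guess of the originating population is given by $Q_{ik}$.

Gradient w.r.t. the mixing proportion $\pi_k$ (including a Lagrange multiplier term $\lambda\left(\sum_k \pi_k - 1\right)$ to enforce the constraint $\sum_k \pi_k = 1$):
$$\nabla_{\pi_k}\left(\ell((\pi_k, \mu_k)_{k=1}^K) - \lambda\Big(\sum_{k=1}^K \pi_k - 1\Big)\right) = \sum_{i=1}^n \frac{f(x_i \mid \mu_k)}{\sum_{j=1}^K \pi_j f(x_i \mid \mu_j)} - \lambda = \sum_{i=1}^n \frac{Q_{ik}}{\pi_k} - \lambda = 0 \quad\Rightarrow\quad \pi_k \propto \sum_{i=1}^n Q_{ik}.$$

Note:
$$\sum_{k=1}^K \sum_{i=1}^n Q_{ik} = \sum_{i=1}^n \underbrace{\sum_{k=1}^K Q_{ik}}_{=1} = n, \qquad \text{so} \qquad \pi_k^{ML?} = \frac{\sum_{i=1}^n Q_{ik}}{n}.$$

Again makes sense: the estimate is simply (our best guess of) the proportion of data points coming from population $k$.

EM Algorithm

Putting all the derivations together, we get an iterative algorithm for learning about the unknowns in the mixture model.

Start with some initial parameters $(\pi_k^{(0)}, \mu_k^{(0)})_{k=1}^K$.
Iterate for $t = 1, 2, \ldots$:

Expectation Step:
$$Q_{ik}^{(t)} := \frac{\pi_k^{(t-1)} f(x_i \mid \mu_k^{(t-1)})}{\sum_{j=1}^K \pi_j^{(t-1)} f(x_i \mid \mu_j^{(t-1)})}$$

Maximization Step:
$$\pi_k^{(t)} = \frac{\sum_{i=1}^n Q_{ik}^{(t)}}{n}, \qquad \mu_k^{(t)} = \frac{\sum_{i=1}^n Q_{ik}^{(t)} x_i}{\sum_{i=1}^n Q_{ik}^{(t)}}$$

Will the algorithm converge? What does it converge to?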
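The E and M steps above translate almost line for line into NumPy. The following is a minimal sketch under the slide's assumptions (spherical Gaussians with known, shared $\sigma$); the function name, the initialisation from randomly chosen data points, the fixed iteration count, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def em_spherical_gmm(x, K, sigma, n_iter=100, seed=0):
    """EM for a mixture of N(mu_k, sigma^2 I) with known sigma."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    pis = np.full(K, 1.0 / K)                               # pi_k^(0): uniform
    mus = x[rng.choice(n, size=K, replace=False)].copy()    # mu_k^(0): data points
    logliks = []
    for _ in range(n_iter):
        # E step: responsibilities Q_ik under the current parameters.
        sq = ((x[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
        lj = np.log(pis) - sq / (2 * sigma**2) - 0.5 * p * np.log(2 * np.pi * sigma**2)
        m = lj.max(axis=1, keepdims=True)
        log_px = m[:, 0] + np.log(np.exp(lj - m).sum(axis=1))
        Q = np.exp(lj - log_px[:, None])
        logliks.append(log_px.sum())    # log-likelihood before this M step
        # M step: pi_k = sum_i Q_ik / n, mu_k = responsibility-weighted mean.
        Nk = Q.sum(axis=0)
        pis = Nk / n
        mus = (Q.T @ x) / Nk[:, None]
    return pis, mus, np.array(logliks)

# Synthetic 1-D data from two well-separated populations (illustrative values).
data_rng = np.random.default_rng(1)
x = np.concatenate([data_rng.normal(-5.0, 1.0, (200, 1)),
                    data_rng.normal(5.0, 1.0, (200, 1))])
pis_hat, mus_hat, logliks = em_spherical_gmm(x, K=2, sigma=1.0)
```

Recording the log-likelihood at every iteration gives an empirical answer to the convergence question: the sequence in `logliks` should never decrease.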

Likelihood Surface for a Simple Example

[Figure: (left) $n = 200$ data points from a mixture of two 1D Gaussians with $\pi_1 = \pi_2 = 0.5$, $\sigma = 5$ and $\mu_1 = 10$, $\mu_2 = \ldots$. (right) Log-likelihood surface $\ell(\mu_1, \mu_2)$, all other parameters being assumed known; axes are $\mu_1$ and $\mu_2$. The surface shows two symmetric modes, reflecting the unidentifiability of the parameters. Textbook Figure 11.6, produced by mixGaussLikSurfaceDemo.]

Unidentifiability

Note that mixture models are not identifiable, which means there are many settings of the parameters which have the same likelihood. Specifically, in a mixture model with $K$ components, there are $K!$ equivalent parameter settings, which differ merely by permuting the labels of the hidden states. See Figure 11.6 for an illustration. The existence of equivalent global modes does not matter when computing a single point estimate, such as the ML or MAP estimate, but it does complicate Bayesian inference, as we will see in a later section.

Unfortunately, even finding just one of these global modes is computationally difficult. The EM algorithm is only guaranteed to find a local mode. A variety of methods can be used to increase the chance of finding a good local optimum. The simplest, and most widely used, is to perform multiple random restarts.

Example: Mixture of 3 Gaussians

[Figure: scatter plots of the data, axes data[,1] and data[,2]. Panels: after 1st E and M step (Iteration 1); after 5th E and M step (Iteration 5).]

K-means algorithm

There is a variant of the EM algorithm for GMMs known as the K-means algorithm, which we now discuss. Consider a GMM in which we make the following assumptions: $\Sigma_k = \sigma^2 I_D$ is fixed, and $\pi_k = 1/K$ is fixed, so only the cluster centers, $\mu_k \in \mathbb{R}^D$, have to be estimated. Now consider an approximation to EM in which we make the approximation
$$p(z_i = k \mid x_i, \theta) \approx \mathbb{I}(k = z_i^*) \qquad (11.61)$$
where $z_i^* = \operatorname*{argmax}_k p(z_i = k \mid x_i, \theta)$. This is sometimes called hard EM, since we are making a hard assignment of points to clusters. Since we assumed an equal spherical covariance matrix for each cluster, the most probable cluster for $x_i$ can be computed by finding the nearest prototype:
$$z_i^* = \operatorname*{argmin}_k \|x_i - \mu_k\|_2^2 \qquad (11.62)$$
Hence in each E step, we must find the Euclidean distance between $N$ data points and $K$ cluster centers, which takes $O(NKD)$ time. However, this can be sped up using various techniques, such as applying the triangle inequality to avoid some redundant computations [Elk03]. Given the hard cluster assignments, the M step updates each cluster center by computing the mean of all points assigned to it.

EM Algorithm

In a maximum likelihood framework, the objective function is the log likelihood,
$$\ell(\theta) = \sum_{i=1}^n \log \sum_{k=1}^K \pi_k f(x_i \mid \mu_k).$$
Direct maximization is not feasible.

Consider another objective function $F(\theta, q)$ such that:
$$F(\theta, q) \le \ell(\theta) \ \text{ for all } \theta, q, \qquad \max_q F(\theta, q) = \ell(\theta).$$
$F(\theta, q)$ is a lower bound on the log likelihood.

We can construct an alternating maximization algorithm as follows. For $t = 1, 2, \ldots$ until convergence:
$$q^{(t)} := \operatorname*{argmax}_q F(\theta^{(t-1)}, q)$$
$$\theta^{(t)} := \operatorname*{argmax}_\theta F(\theta, q^{(t)})$$
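The hard-EM (K-means) variant described above, with its nearest-prototype E step (11.62) and mean-update M step, can be sketched as follows. The function name, the initialisation from random data points, the stopping rule, and the synthetic data are assumptions for illustration; the distance computation is the plain $O(NKD)$ version without the triangle-inequality speed-up.

```python
import numpy as np

def kmeans_hard_em(x, K, n_iter=100, seed=0):
    """K-means viewed as hard EM with fixed pi_k = 1/K and Sigma_k = sigma^2 I."""
    rng = np.random.default_rng(seed)
    mus = x[rng.choice(len(x), size=K, replace=False)].copy()
    for _ in range(n_iter):
        # Hard E step: z_i = argmin_k ||x_i - mu_k||^2, costing O(NKD).
        sq = ((x[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
        z = sq.argmin(axis=1)
        # M step: each center becomes the mean of the points assigned to it.
        new_mus = mus.copy()
        for k in range(K):
            if np.any(z == k):          # leave an empty cluster's center unchanged
                new_mus[k] = x[z == k].mean(axis=0)
        if np.allclose(new_mus, mus):   # centers stopped moving
            break
        mus = new_mus
    return mus, z

# Two well-separated 2-D clusters (illustrative values).
data_rng = np.random.default_rng(2)
x = np.concatenate([data_rng.normal([-5.0, 0.0], 1.0, (150, 2)),
                    data_rng.normal([5.0, 0.0], 1.0, (150, 2))])
centers, labels = kmeans_hard_em(x, K=2)
```

Because only a local optimum is guaranteed, a common practice is to call such a routine with several different seeds and keep the run with the smallest total squared distance.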

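The lower bound we use is called the variational free energy. $q$ is a probability mass function for a distribution over $z := (z_i)_{i=1}^n$.
$$F(\theta, q) = \mathbb{E}_q[\log p(X, z \mid \theta) - \log q(z)]$$
$$= \mathbb{E}_q\left[\left(\sum_{i=1}^n \sum_{k=1}^K \mathbb{1}(z_i = k)\left(\log \pi_k + \log f(x_i \mid \mu_k)\right)\right) - \log q(z)\right]$$
$$= \sum_z q(z)\left[\left(\sum_{i=1}^n \sum_{k=1}^K \mathbb{1}(z_i = k)\left(\log \pi_k + \log f(x_i \mid \mu_k)\right)\right) - \log q(z)\right]$$

Solving for q

Gradient of $F$ w.r.t. $q$ (with a Lagrange multiplier for $\sum_z q(z) = 1$):
$$\nabla_{q(z)} F(\theta, q) = \sum_{i=1}^n \sum_{k=1}^K \mathbb{1}(z_i = k)\left(\log \pi_k + \log f(x_i \mid \mu_k)\right) - \log q(z) - 1 - \lambda$$
$$= \sum_{i=1}^n \left(\log \pi_{z_i} + \log f(x_i \mid \mu_{z_i})\right) - \log q(z) - 1 - \lambda = 0$$
$$\Rightarrow\quad q^*(z) \propto \prod_{i=1}^n \pi_{z_i} f(x_i \mid \mu_{z_i}),$$
$$q^*(z) = \frac{\prod_{i=1}^n \pi_{z_i} f(x_i \mid \mu_{z_i})}{\sum_{z'} \prod_{i=1}^n \pi_{z'_i} f(x_i \mid \mu_{z'_i})} = \prod_{i=1}^n \frac{\pi_{z_i} f(x_i \mid \mu_{z_i})}{\sum_{k=1}^K \pi_k f(x_i \mid \mu_k)} = \prod_{i=1}^n p(z_i \mid x_i, \theta).$$

The optimal $q^*$ is simply the posterior distribution for fixed $\theta$. Plugging the optimal $q^*$ into the variational free energy,
$$F(\theta, q^*) = \sum_{i=1}^n \log \sum_{k=1}^K \pi_k f(x_i \mid \mu_k) = \ell(\theta).$$

Solving for θ

Setting the derivative with respect to $\mu_k$ to 0,
$$\nabla_{\mu_k} F(\theta, q) = \sum_z q(z) \sum_{i=1}^n \mathbb{1}(z_i = k) \nabla_{\mu_k} \log f(x_i \mid \mu_k) = \sum_{i=1}^n q(z_i = k) \nabla_{\mu_k} \log f(x_i \mid \mu_k) = 0.$$
This equation can be solved quite easily. E.g., for a mixture of Gaussians,
$$\mu_k = \frac{\sum_{i=1}^n q(z_i = k) x_i}{\sum_{i=1}^n q(z_i = k)}.$$
If it cannot be solved exactly, we can use a gradient ascent algorithm:
$$\mu_k \leftarrow \mu_k + \alpha \sum_{i=1}^n q(z_i = k) \nabla_{\mu_k} \log f(x_i \mid \mu_k).$$
A similar derivation gives the optimal $\pi_k$ as before.

EM Algorithm

Start with some initial parameters $(\pi_k^{(0)}, \mu_k^{(0)})_{k=1}^K$.
Iterate for $t = 1, 2, \ldots$:

Expectation Step:
$$q^{(t)}(z_i = k) := \frac{\pi_k^{(t-1)} f(x_i \mid \mu_k^{(t-1)})}{\sum_{j=1}^K \pi_j^{(t-1)} f(x_i \mid \mu_j^{(t-1)})} = \mathbb{E}_{p(z_i \mid x_i, \theta^{(t-1)})}[\mathbb{1}(z_i = k)]$$

Maximization Step:
$$\pi_k^{(t)} = \frac{\sum_{i=1}^n q^{(t)}(z_i = k)}{n}, \qquad \mu_k^{(t)} = \frac{\sum_{i=1}^n q^{(t)}(z_i = k)\, x_i}{\sum_{i=1}^n q^{(t)}(z_i = k)}$$

Each step increases the log likelihood:
$$\ell(\theta^{(t-1)}) = F(\theta^{(t-1)}, q^{(t)}) \le F(\theta^{(t)}, q^{(t)}) \le F(\theta^{(t)}, q^{(t+1)}) = \ell(\theta^{(t)}).$$

The additional assumption that $\nabla_\theta^2 F(\theta^{(t)}, q^{(t)})$ are negative definite with eigenvalues $< -\epsilon < 0$ implies that $\theta^{(t)} \to \theta^*$, where $\theta^*$ is a local MLE.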
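The two defining properties of the free energy, $F(\theta, q) \le \ell(\theta)$ with equality at the posterior, can be checked numerically. The sketch below restricts $q$ to distributions that factorise over data points, $q(z) = \prod_i q_i(z_i)$, which is an assumption made to keep the sum over $z$ tractable (the optimal $q^*$ derived above has this form). Function names and the toy parameters are illustrative.

```python
import numpy as np

def log_joint(x, pis, mus, sigma):
    """log(pi_k * f(x_i | mu_k)) for spherical Gaussians, shape (n, K)."""
    p = x.shape[1]
    sq = ((x[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)
    return np.log(pis) - 0.5 * p * np.log(2 * np.pi * sigma**2) - sq / (2 * sigma**2)

def free_energy(q, lj):
    """F = sum_i sum_k q_ik * (lj_ik - log q_ik) for a factorised q (0 log 0 = 0)."""
    safe_log_q = np.log(np.where(q > 0, q, 1.0))
    return float((q * (lj - safe_log_q)).sum())

rng = np.random.default_rng(0)
x = rng.normal(0.0, 3.0, (50, 2))                 # arbitrary toy data
pis = np.array([0.2, 0.5, 0.3])
mus = np.array([[-2.0, 0.0], [0.0, 1.0], [3.0, -1.0]])
lj = log_joint(x, pis, mus, sigma=1.0)

# Log-likelihood and posterior responsibilities via log-sum-exp.
m = lj.max(axis=1, keepdims=True)
log_px = m[:, 0] + np.log(np.exp(lj - m).sum(axis=1))
loglik = float(log_px.sum())
q_post = np.exp(lj - log_px[:, None])

q_rand = rng.dirichlet(np.ones(len(pis)), size=len(x))   # an arbitrary q
F_post = free_energy(q_post, lj)    # should match loglik
F_rand = free_energy(q_rand, lj)    # should fall below loglik
```

The gap `loglik - F_rand` is the sum over data points of the KL divergences from each $q_i$ to the corresponding posterior, which is why it is non-negative and vanishes only at the posterior.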

Notes on Probabilistic Approach and EM Algorithm

Some good things:
Guaranteed convergence to locally optimal parameters.
Formal reasoning about uncertainties, using both Bayes' Theorem and maximum likelihood theory.
Rich language of probability theory to express a wide range of generative models, and straightforward derivation of algorithms for ML estimation.

Some bad things:
Can get stuck in local optima, so multiple starts are recommended.
Slower and more expensive than K-means.
Choice of K is still problematic, but a rich array of methods for model selection comes to the rescue.

Flexible Gaussian mixtures

We can allow each cluster to have its own mean and covariance structure, which allows greater flexibility in the model.

[Figure panels: Different covariances; Identical covariances; Different, but diagonal covariances; Identical and spherical covariances.]

Probabilistic PCA

A probabilistic model related to PCA has the following generative model: for $i = 1, 2, \ldots, n$:

Let $k < n, p$ be given. Let $Y_i$ be a (latent) $k$-dimensional normally distributed random variable with 0 mean and identity covariance:
$$Y_i \sim \mathcal{N}(0, I_k)$$

We model the distribution of the $i$th data point given $Y_i$ as a $p$-dimensional normal:
$$X_i \sim \mathcal{N}(\mu + L Y_i, \sigma^2 I)$$
where the parameters are a vector $\mu \in \mathbb{R}^p$, a matrix $L \in \mathbb{R}^{p \times k}$ and $\sigma^2 > 0$.

[Figure labels: PPCA latents; PCA projection; principal subspace. Figures by M. Sahani.]
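The PPCA generative model above can be simulated directly. Integrating out $Y_i$ gives the marginal $X_i \sim \mathcal{N}(\mu, L L^\top + \sigma^2 I)$, a standard consequence of the model that the slides do not spell out; the sketch checks it empirically. The function name and the parameter values are assumptions for illustration.

```python
import numpy as np

def sample_ppca(n, mu, L, sigma, seed=0):
    """Sample n points from PPCA: Y ~ N(0, I_k), X | Y ~ N(mu + L Y, sigma^2 I_p)."""
    rng = np.random.default_rng(seed)
    p, k = L.shape
    y = rng.standard_normal((n, k))             # latent coordinates Y_i
    noise = sigma * rng.standard_normal((n, p)) # isotropic observation noise
    x = mu + y @ L.T + noise                    # rows are mu + L y_i + noise_i
    return x, y

# Illustrative parameters: p = 3 observed dimensions, k = 1 latent dimension.
mu = np.array([1.0, -1.0, 0.0])
L = np.array([[2.0], [1.0], [0.0]])
sigma = 0.5
x, y = sample_ppca(100_000, mu, L, sigma)

empirical_cov = np.cov(x, rowvar=False)
implied_cov = L @ L.T + sigma**2 * np.eye(3)    # marginal covariance of X
```

Comparing `empirical_cov` with `implied_cov` shows how the low-rank term $L L^\top$ carries the structured variation while $\sigma^2 I$ accounts for noise in every direction.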

Mixture of probabilistic PCAs

[Figure labels: PPCA latents; PPCA latent prior; PPCA noise; PPCA posterior; PPCA projection; principal subspace. Figures by M. Sahani.]

We have learnt two types of unsupervised learning techniques:
Dimensionality reduction, e.g. PCA, MDS, Isomap.
Clustering, e.g. K-means, linkage and mixture models.

Probabilistic models allow us to construct more complex models from simpler pieces. A mixture of probabilistic PCAs allows both clustering and dimensionality reduction at the same time:
$$Z_i \sim \mathrm{Discrete}(\pi_1, \ldots, \pi_K)$$
$$Y_i \sim \mathcal{N}(0, I_d)$$
$$X_i \mid Z_i = k, Y_i = y_i \sim \mathcal{N}(\mu_k + L_k y_i, \sigma^2 I_p)$$

This allows flexible modelling of covariance structure without using too many parameters (Ghahramani and Hinton, 1996).

Further Reading: Unsupervised Learning

Hastie et al, Chapter 14.
James et al, Chapter 10.
Ripley, Chapter 9.
Tukey, John W. (1980). We need both exploratory and confirmatory. The American Statistician 34 (1).
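The remark about parameter economy can be made concrete with a quick count. The sketch below compares the number of free covariance parameters in a mixture with unrestricted covariances against a mixture of PPCAs with covariance $L_k L_k^\top + \sigma^2 I_p$. It is a rough illustration under stated simplifications: a single shared $\sigma^2$ as in the model above, and no correction for the rotational redundancy in each $L_k$. The function names and the example sizes are assumptions.

```python
def full_cov_params(p, K=1):
    """Free parameters in K unrestricted symmetric p x p covariance matrices."""
    return K * p * (p + 1) // 2

def ppca_cov_params(p, d, K=1):
    """Entries of K loading matrices L_k (each p x d) plus one shared sigma^2."""
    return K * p * d + 1

# Example sizes (illustrative): p = 100 observed dims, d = 5 latent dims, K = 3.
n_full = full_cov_params(100, K=3)
n_ppca = ppca_cov_params(100, 5, K=3)
```

For these sizes the unrestricted mixture needs 15150 covariance parameters against 1501 for the mixture of PPCAs, and the gap widens as $p$ grows because the first count is quadratic in $p$ and the second is linear.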
