Probabilistic Unsupervised Learning


Statistical Data Mining and Machine Learning, Hilary Term 2016
Dino Sejdinovic, Department of Statistics, Oxford
Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml

Probabilistic Methods

Algorithmic approach: Data -> Algorithm -> Analysis/Interpretation.
Probabilistic modelling approach: Unobserved process -> Generative Model -> Data -> Analysis/Interpretation.

Mixture Models

Mixture models suppose that our dataset X was created by sampling iid from K distinct populations (called mixture components). Samples in population k can be modelled using a distribution F_{μ_k} with density f(x|μ_k), where μ_k is the model parameter for the k-th component. For a concrete example, consider a Gaussian with unknown mean μ_k and known diagonal covariance σ²I:

f(x|μ_k) = (2πσ²)^{-p/2} exp( -(1/(2σ²)) ||x - μ_k||²_2 ).

Generative model: for i = 1, 2, ..., n:
First determine the assignment variable independently for each data item i:

Z_i ~ Discrete(π_1, ..., π_K), i.e., P(Z_i = k) = π_k,

where the mixing proportions (additional model parameters) satisfy π_k ≥ 0 for each k and Σ_{k=1}^K π_k = 1.
Given the assignment Z_i = k, the data item X_i = (X_i^(1), ..., X_i^(p)) is sampled (independently) from the corresponding k-th component:

X_i | Z_i = k ~ f(x|μ_k).

We observe X_i = x_i for each i but not the Z_i (latent variables), and would like to infer the parameters.
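The generative story above is easy to simulate. The following sketch (not from the slides; all parameter values are made up for illustration) samples a synthetic dataset from a K-component spherical Gaussian mixture in Python/NumPy:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: K = 3 spherical Gaussian components in p = 2 dimensions.
K, p, n = 3, 2, 500
pi = np.array([0.5, 0.3, 0.2])                          # mixing proportions, sum to 1
mu = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])    # component means mu_k
sigma2 = 1.5                                            # known spherical covariance sigma^2 I

# Z_i ~ Discrete(pi), then X_i | Z_i = k ~ N(mu_k, sigma^2 I)
Z = rng.choice(K, size=n, p=pi)                         # latent assignments (hidden when fitting)
X = mu[Z] + np.sqrt(sigma2) * rng.standard_normal((n, p))

print(X.shape)   # (500, 2): the observed data; the Z_i would not be observed in practice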

Mixture Models: Joint pmf/pdf of observed and latent variables

Unknowns to learn given data are
- Parameters: θ = (π_k, μ_k)_{k=1}^K, where π_1, ..., π_K ∈ [0, 1] and μ_1, ..., μ_K ∈ R^p, and
- Latent variables: z_1, ..., z_n.

The joint probability over all cluster indicator variables {Z_i} is the probability mass function (1):

p_Z((z_i)_{i=1}^n) = Π_{i=1}^n π_{z_i} = Π_{i=1}^n Π_{k=1}^K π_k^{1(z_i=k)}.

The joint density of the observations X_i = x_i given Z_i = z_i is:

p_X((x_i)_{i=1}^n | (Z_i = z_i)_{i=1}^n) = Π_{i=1}^n f(x_i|μ_{z_i}) = Π_{i=1}^n Π_{k=1}^K f(x_i|μ_k)^{1(z_i=k)}.

Together, the joint pmf/pdf of observed and latent variables is:

p_{X,Z}((x_i, z_i)_{i=1}^n) = p_Z((z_i)_{i=1}^n) p_X((x_i)_{i=1}^n | (Z_i = z_i)_{i=1}^n) = Π_{i=1}^n Π_{k=1}^K (π_k f(x_i|μ_k))^{1(z_i=k)},

and the marginal density of x_i (the resulting model on the observed data) is:

p(x_i) = Σ_{j=1}^K p(z_i = j, x_i) = Σ_{j=1}^K π_j f(x_i|μ_j).

(1) In this course we will treat probability mass functions and densities in the same way for notational simplicity. Strictly speaking, p_{X,Z} is a density with respect to the product base measure, where the base measure is the counting measure for discrete variables and Lebesgue for continuous variables.

Table 11.1 of Murphy (2012), a summary of some popular directed latent variable models. Here Prod. means product, so Prod. Discrete in the likelihood means a factored distribution of the form Π_j Cat(x_ij|z_i), and Prod. Gaussian means a factored distribution of the form Π_j N(x_ij|z_i). PCA stands for principal components analysis; ICA stands for independent components analysis.

p(x_i|z_i)        p(z_i)            Name                                 Section
MVN               Discrete          Mixture of Gaussians                 11.2.1
Prod. Discrete    Discrete          Mixture of multinomials              11.2.2
Prod. Gaussian    Prod. Gaussian    Factor analysis / probabilistic PCA  12.1.5
Prod. Gaussian    Prod. Laplace     Probabilistic ICA / sparse coding    12.6
Prod. Discrete    Prod. Gaussian    Multinomial PCA                      27.2.3
Prod. Discrete    Dirichlet         Latent Dirichlet allocation          27.3
Prod. Noisy-OR    Prod. Bernoulli   BN20 / QMR                           10.2.3
Prod. Bernoulli   Prod. Bernoulli   Sigmoid belief net                   27.7

Mixture Models: Gaussian Mixtures with Unequal Covariances

The most widely used mixture model is the mixture of Gaussians (MOG), also called a Gaussian mixture model or GMM (Murphy, 2012, Section 11.2.1). Each base distribution in the mixture is a multivariate Gaussian with mean μ_k and covariance matrix Σ_k, so the model has the form

p(x) = Σ_{k=1}^K π_k f(x|(μ_k, Σ_k)),

where θ = (π_k, μ_k, Σ_k)_{k=1}^K are all the model parameters and

f(x|(μ_k, Σ_k)) = (2π)^{-p/2} |Σ_k|^{-1/2} exp( -(1/2) (x - μ_k)^T Σ_k^{-1} (x - μ_k) ).

[Figure 11.3 from Murphy, 2012, Ch. 11: a mixture of 3 Gaussians in 2d. (a) Contours of constant probability for each component in the mixture. (b) A surface plot of the overall density. Based on Figure 2.23 of Bishop (2006).]

Mixture Models: Responsibility

Suppose we know the parameters θ = (π_k, μ_k)_{k=1}^K. Z_i is a random variable, and its conditional distribution given the data set X is:

Q_ik := p(z_i = k | x_i) = p(z_i = k, x_i) / p(x_i) = π_k f(x_i|μ_k) / Σ_{j=1}^K π_j f(x_i|μ_j).

The conditional probability Q_ik is called the responsibility of mixture component k for data point x_i. These conditionals softly partition the dataset among the K components: Σ_{k=1}^K Q_ik = 1.
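Given the parameters, the responsibilities can be computed in a few lines. The helper below is an illustrative sketch for the spherical-covariance case (the function name and the use of scipy are my own choices, not from the slides):

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mu, sigma2):
    # Q[i, k] = pi_k f(x_i|mu_k) / sum_j pi_j f(x_i|mu_j) for a spherical Gaussian mixture
    n, p = X.shape
    K = len(pi)
    dens = np.column_stack([
        multivariate_normal.pdf(X, mean=mu[k], cov=sigma2 * np.eye(p)) for k in range(K)
    ])                                   # n x K matrix of f(x_i|mu_k)
    weighted = dens * pi                 # scale column k by pi_k
    return weighted / weighted.sum(axis=1, keepdims=True)   # each row sums to 1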

Mixture Models: Maximum Likelihood

How can we learn about the parameters θ = (π_k, μ_k)_{k=1}^K from data? Standard statistical methodology asks for the maximum likelihood estimator (MLE). The goal is to maximise the marginal probability of the data over the parameters:

θ̂_ML = argmax_{θ=(π_k,μ_k)_{k=1}^K} p(X|θ)
      = argmax_{(π_k,μ_k)_{k=1}^K} Π_{i=1}^n p(x_i|(π_k,μ_k)_{k=1}^K)
      = argmax_{(π_k,μ_k)_{k=1}^K} Π_{i=1}^n Σ_{k=1}^K π_k f(x_i|μ_k)
      = argmax_{(π_k,μ_k)_{k=1}^K} Σ_{i=1}^n log Σ_{k=1}^K π_k f(x_i|μ_k),

where the last expression is the marginal log-likelihood

l((π_k, μ_k)_{k=1}^K) := log p(X|(π_k, μ_k)_{k=1}^K) = Σ_{i=1}^n log Σ_{k=1}^K π_k f(x_i|μ_k).

The gradient w.r.t. μ_k:

∇_{μ_k} l((π_k, μ_k)_{k=1}^K) = Σ_{i=1}^n [ π_k f(x_i|μ_k) / Σ_{j=1}^K π_j f(x_i|μ_j) ] ∇_{μ_k} log f(x_i|μ_k)
                              = Σ_{i=1}^n Q_ik ∇_{μ_k} log f(x_i|μ_k).

Difficult to solve, as Q_ik depends implicitly on μ_k.

Likelihood Surface for a Simple Example

If the latent variables z_i were all observed, we would have a unimodal likelihood surface; but when we marginalise out the latents, the likelihood surface becomes multimodal: there is no unique MLE.

[Figure 11.6 from Murphy, 2012: (left) n = 200 data points sampled from a mixture of two 1D Gaussians, with π_1 = π_2 = 0.5, σ = 5, μ_1 = -10 and μ_2 = 10. (right) Observed-data log-likelihood surface l(μ_1, μ_2), with all other parameters set to their true values. The two symmetric modes reflect the unidentifiability of the parameters.]

Note that mixture models are not identifiable: many settings of the parameters have the same likelihood. Specifically, in a mixture model with K components there are K! equivalent parameter settings, which differ merely by a permutation of the component labels.
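A surface like the one in the figure can be reproduced by evaluating the marginal log-likelihood l on a grid of (μ_1, μ_2) values. A hedged sketch of the evaluation itself for the spherical-covariance case (function name and use of scipy are my own choices, not from the slides):

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def marginal_log_likelihood(X, pi, mu, sigma2):
    # l(theta) = sum_i log sum_k pi_k f(x_i|mu_k), computed with logsumexp for stability
    n, p = X.shape
    K = len(pi)
    log_terms = np.column_stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=sigma2 * np.eye(p))
        for k in range(K)
    ])                                     # n x K matrix of log(pi_k f(x_i|mu_k))
    return logsumexp(log_terms, axis=1).sum()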

Mixture Models: Maximum Likelihood

Recall that we would like to solve

∇_{μ_k} l((π_k, μ_k)_{k=1}^K) = Σ_{i=1}^n Q_ik ∇_{μ_k} log f(x_i|μ_k) = 0.

What if we ignore the dependence of Q_ik on the parameters? Taking the mixture of Gaussians with covariance σ²I as an example,

Σ_{i=1}^n Q_ik ∇_{μ_k} ( -(p/2) log(2πσ²) - (1/(2σ²)) ||x_i - μ_k||²_2 )
  = (1/σ²) Σ_{i=1}^n Q_ik (x_i - μ_k)
  = (1/σ²) ( Σ_{i=1}^n Q_ik x_i - μ_k Σ_{i=1}^n Q_ik ) = 0

⟹ μ_k^ML ?= Σ_{i=1}^n Q_ik x_i / Σ_{i=1}^n Q_ik.

The estimate is a weighted average of data points, where the estimated mean of cluster k uses its responsibilities to data points as weights. Makes sense: suppose we knew that data point x_i came from population z_i. Then Q_{i z_i} = 1 and Q_ik = 0 for k ≠ z_i, and

μ_k^ML ?= Σ_{i: z_i = k} x_i / Σ_{i: z_i = k} 1 = avg{x_i : z_i = k}.

Our best guess of the originating population is given by Q_ik. A soft K-means algorithm?

Gradient w.r.t. the mixing proportion π_k (including a Lagrange multiplier term λ(Σ_{k=1}^K π_k - 1) to enforce the constraint Σ_{k=1}^K π_k = 1):

∂/∂π_k [ l((π_k, μ_k)_{k=1}^K) - λ( Σ_{k=1}^K π_k - 1 ) ]
  = Σ_{i=1}^n f(x_i|μ_k) / Σ_{j=1}^K π_j f(x_i|μ_j) - λ
  = Σ_{i=1}^n Q_ik / π_k - λ = 0,

so π_k = (1/λ) Σ_{i=1}^n Q_ik. Summing over k and using Σ_{k=1}^K Q_ik = 1 and Σ_{k=1}^K π_k = 1 gives λ = n, hence

π_k^ML ?= (1/n) Σ_{i=1}^n Q_ik.

Again makes sense: the estimate is simply (our best guess of) the proportion of data points coming from population k.

Mixture Models: The EM Algorithm

Putting all the derivations together, we get an iterative algorithm for learning about the unknowns in the mixture model. Start with some initial parameters (π_k^(0), μ_k^(0))_{k=1}^K and iterate for t = 1, 2, ...:

Expectation Step:
Q_ik^(t) := π_k^(t-1) f(x_i|μ_k^(t-1)) / Σ_{j=1}^K π_j^(t-1) f(x_i|μ_j^(t-1)).

Maximization Step:
π_k^(t) = (1/n) Σ_{i=1}^n Q_ik^(t),    μ_k^(t) = Σ_{i=1}^n Q_ik^(t) x_i / Σ_{i=1}^n Q_ik^(t).

Will the algorithm converge? What does it converge to?
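The two updates above fit in a few lines of code. The sketch below assumes, as in the slides' running example, a known spherical covariance σ²I and estimates only the mixing proportions and means; initialisation and names are my own choices, not a prescribed implementation:

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, sigma2=1.0, n_iter=100, seed=0):
    # EM for a Gaussian mixture with known covariance sigma^2 * I.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pi = np.full(K, 1.0 / K)                         # pi_k^(0): uniform
    mu = X[rng.choice(n, size=K, replace=False)]     # mu_k^(0): K random data points
    for _ in range(n_iter):
        # E-step: responsibilities Q[i, k]
        dens = np.column_stack([
            multivariate_normal.pdf(X, mean=mu[k], cov=sigma2 * np.eye(p)) for k in range(K)
        ])
        Q = dens * pi
        Q /= Q.sum(axis=1, keepdims=True)
        # M-step: update mixing proportions and means
        Nk = Q.sum(axis=0)                           # sum_i Q_ik
        pi = Nk / n                                  # pi_k = (1/n) sum_i Q_ik
        mu = (Q.T @ X) / Nk[:, None]                 # mu_k = sum_i Q_ik x_i / sum_i Q_ik
    return pi, mu, Q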

[Figures: an example with 3 clusters (axes X1, X2), showing the fit after the 1st, 2nd, 3rd and 4th E and M steps (iterations 1-4).]

[Figure: the fit after the 5th E and M step (iteration 5).]

The EM Algorithm

In a maximum likelihood framework, the objective function is the log likelihood,

l(θ) = Σ_{i=1}^n log Σ_{k=1}^K π_k f(x_i|μ_k).

Direct maximisation is not feasible. Consider another objective function F(θ, q), where q is any probability distribution on the latent variables z, such that:

F(θ, q) ≤ l(θ) for all θ, q,
max_q F(θ, q) = l(θ),

i.e., F(θ, q) is a lower bound on the log likelihood. We can construct an alternating maximisation algorithm as follows. For t = 1, 2, ... until convergence:

q^(t) := argmax_q F(θ^(t-1), q)
θ^(t) := argmax_θ F(θ, q^(t))

The EM Algorithm - Solving for q

The lower bound we use is called the variational free energy. q is a probability mass function for a distribution over z := (z_i)_{i=1}^n:

F(θ, q) = E_q[ log p(x, z|θ) - log q(z) ]
        = E_q[ ( Σ_{i=1}^n Σ_{k=1}^K 1(z_i = k)(log π_k + log f(x_i|μ_k)) ) - log q(z) ]
        = Σ_z q(z) [ ( Σ_{i=1}^n Σ_{k=1}^K 1(z_i = k)(log π_k + log f(x_i|μ_k)) ) - log q(z) ].

Lemma: F(θ, q) ≤ l(θ) for all q and for all θ.
Lemma: F(θ, q) = l(θ) for q(z) = p(z|x, θ).

In combination with the previous Lemma, this implies that q(z) = p(z|x, θ) maximizes F(θ, q) for fixed θ, i.e., the optimal q is simply the conditional distribution of the latents given the data and that fixed θ. In the mixture model,

q*(z) = p(z|x, θ) = p(z, x|θ) / p(x|θ)
      = Π_{i=1}^n π_{z_i} f(x_i|μ_{z_i}) / Σ_{z'} Π_{i=1}^n π_{z'_i} f(x_i|μ_{z'_i})
      = Π_{i=1}^n [ π_{z_i} f(x_i|μ_{z_i}) / Σ_{k=1}^K π_k f(x_i|μ_k) ]
      = Π_{i=1}^n p(z_i|x_i, θ).
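The first lemma is stated without proof; the standard one-line argument, added here as a sketch rather than taken from the slides, is Jensen's inequality applied to the concave logarithm:

\begin{align*}
l(\theta) &= \log p(x \mid \theta) = \log \sum_{z} q(z)\,\frac{p(x, z \mid \theta)}{q(z)} \\
          &\ge \sum_{z} q(z)\,\log \frac{p(x, z \mid \theta)}{q(z)} \qquad \text{(Jensen's inequality, $\log$ concave)} \\
          &= \mathbb{E}_q\!\left[\log p(x, z \mid \theta) - \log q(z)\right] = F(\theta, q),
\end{align*}

with equality if and only if p(x, z|θ)/q(z) is constant in z, i.e. q(z) = p(z|x, θ), which also gives the second lemma.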

The EM Algorithm - Solving for θ

Setting the derivative with respect to μ_k to 0,

∇_{μ_k} F(θ, q) = Σ_z q(z) Σ_{i=1}^n 1(z_i = k) ∇_{μ_k} log f(x_i|μ_k)
                = Σ_{i=1}^n q(z_i = k) ∇_{μ_k} log f(x_i|μ_k) = 0.

This equation can often be solved quite easily. E.g., for a mixture of Gaussians,

μ_k = Σ_{i=1}^n q(z_i = k) x_i / Σ_{i=1}^n q(z_i = k).

If it cannot be solved exactly, we can use a gradient ascent algorithm:

μ_k ← μ_k + α Σ_{i=1}^n q(z_i = k) ∇_{μ_k} log f(x_i|μ_k).

A similar derivation gives the optimal π_k as before.

The EM Algorithm

Start with some initial parameters (π_k^(0), μ_k^(0))_{k=1}^K and iterate for t = 1, 2, ...:

Expectation Step:
q^(t)(z_i = k) := p(z_i = k|x_i, θ^(t-1)) = π_k^(t-1) f(x_i|μ_k^(t-1)) / Σ_{j=1}^K π_j^(t-1) f(x_i|μ_j^(t-1)).

Maximization Step:
π_k^(t) = (1/n) Σ_{i=1}^n q^(t)(z_i = k),    μ_k^(t) = Σ_{i=1}^n q^(t)(z_i = k) x_i / Σ_{i=1}^n q^(t)(z_i = k).

Theorem: The EM algorithm monotonically increases the log likelihood.
Proof: l(θ^(t-1)) = F(θ^(t-1), q^(t)) ≤ F(θ^(t), q^(t)) ≤ F(θ^(t), q^(t+1)) = l(θ^(t)).

An additional assumption, that the Hessians ∇²_θ F(θ^(t), q^(t)) are negative definite with eigenvalues less than -ε < 0, implies that θ^(t) → θ*, where θ* is a local MLE.

Notes on the Probabilistic Approach and the EM Algorithm

Some good things:
- Guaranteed convergence to locally optimal parameters.
- Formal reasoning about uncertainties, using both Bayes Theorem and maximum likelihood theory.
- Rich language of probability theory to express a wide range of generative models, and straightforward derivation of algorithms for ML estimation.

Some bad things:
- Can get stuck in local optima, so multiple starts are recommended.
- Slower and more expensive than K-means.
- Choice of K still problematic, but a rich array of methods for model selection comes to the rescue.

Flexible Gaussian Mixture Models

We can allow each cluster to have its own mean and covariance structure to enable greater flexibility in the model (see the usage sketch after this list):
- Different covariances
- Identical covariances
- Different, but diagonal covariances
- Identical and spherical covariances
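These covariance structures correspond to familiar software options. As a hedged usage sketch (the data below are a random placeholder), scikit-learn's GaussianMixture exposes closely related choices through its covariance_type argument ('full', 'tied', 'diag', 'spherical'); note that its 'spherical' option gives each component its own single variance rather than one value shared across components, and BIC is one of the model-selection tools alluded to above:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).standard_normal((500, 2))   # placeholder data; use your own dataset

# Fit mixtures with different covariance structures and compare them by BIC.
# n_init > 1 addresses the multiple-restarts recommendation above.
for cov_type in ["full", "tied", "diag", "spherical"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type,
                          n_init=5, random_state=0).fit(X)
    print(cov_type, gmm.bic(X))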

Probabilistic PCA

A probabilistic model related to PCA has the following generative model: for i = 1, 2, ..., n:
Let k < n, p be given. Let Y_i be a (latent) k-dimensional normally distributed random variable with zero mean and identity covariance:

Y_i ~ N(0, I_k).

We model the distribution of the i-th data point given Y_i as a p-dimensional normal:

X_i ~ N(μ + L Y_i, σ²I),

where the parameters are a vector μ ∈ R^p, a matrix L ∈ R^{p×k} and σ² > 0.

[Figures from M. Sahani's UCL course on Unsupervised Learning: PPCA latents, PPCA latent prior, PPCA noise, PCA projection, and the principal subspace.]
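A minimal simulation of this generative model (all dimensions and parameter values below are invented for illustration) also makes the implied marginal covariance L L^T + σ²I easy to check empirically:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: n data points in p dimensions, k latent dimensions (k < p).
n, p, k = 2000, 5, 2
mu = np.zeros(p)
L = rng.standard_normal((p, k))      # loading matrix L in R^{p x k}
sigma2 = 0.1

# PPCA generative model: Y_i ~ N(0, I_k), X_i | Y_i = y_i ~ N(mu + L y_i, sigma^2 I_p)
Y = rng.standard_normal((n, k))
X = mu + Y @ L.T + np.sqrt(sigma2) * rng.standard_normal((n, p))

# Marginally X_i ~ N(mu, L L^T + sigma^2 I_p); the empirical covariance should be close.
print(np.abs(np.cov(X, rowvar=False) - (L @ L.T + sigma2 * np.eye(p))).max())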

Mixture of Probabilistic PCAs

[Figures from M. Sahani's UCL course on Unsupervised Learning: PPCA latents, PPCA latent prior, PPCA posterior, PPCA noise, PPCA projection, and the principal subspace.]

We have learnt two types of unsupervised learning techniques:
- Dimensionality reduction, e.g. PCA, MDS, Isomap.
- Clustering, e.g. K-means, linkage and mixture models.

Probabilistic models allow us to construct more complex models from simpler pieces. A mixture of probabilistic PCAs allows both clustering and dimensionality reduction at the same time:

Z_i ~ Discrete(π_1, ..., π_K)
Y_i ~ N(0, I_d)
X_i | Z_i = k, Y_i = y_i ~ N(μ_k + L_k y_i, σ²I_p)

This allows flexible modelling of covariance structure without using too many parameters (Ghahramani and Hinton, 1996).

Further Reading

- Hastie et al, Chapter 14.
- James et al, Chapter 10.
- Ripley, Chapter 9.
- Tukey, John W. (1980). We need both exploratory and confirmatory. The American Statistician 34 (1): 23-25.