Outline. CSCI-567: Machine Learning (Spring 2019). Prof. Victor Adamchik. Mar. 26, 2019


CSCI-567: Machine Learning (Spring 2019). Gaussian mixture models. Prof. Victor Adamchik, U. of Southern California. March 26, 2019.

Outline
1 Gaussian mixture models: Motivation and Model; EM algorithm; EM applied to GMMs
2 Density estimation
3 Naive Bayes Revisited

Gaussian mixture models

GMM is a probabilistic approach to clustering. We want to come up with a probabilistic model $p$ to explain how the data is generated. We will model each region with a Gaussian distribution. To generate a point, we first randomly pick one of the Gaussian components, then draw a point according to this Gaussian.
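This two-stage generative story is easy to state in code. Below is a minimal sampling sketch in NumPy; the mixture parameters are illustrative values chosen for this example, not numbers from the lecture.

```python
import numpy as np

def sample_gmm(weights, means, covs, n, seed=0):
    """Sample n points from a GMM: first pick a component k with probability
    weights[k], then draw from N(means[k], covs[k])."""
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(weights), size=n, p=weights)   # latent component per point
    X = np.array([rng.multivariate_normal(means[k], covs[k]) for k in ks])
    return X, ks

# three illustrative 2-D components
weights = [0.5, 0.3, 0.2]
means = [np.zeros(2), np.array([4.0, 4.0]), np.array([-4.0, 4.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 2.0])]
X, z = sample_gmm(weights, means, covs, n=500)
```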

GMM: formal definition

A GMM has the following density function:
$$p(x) = \sum_{k=1}^{K} \omega_k N(x \mid \mu_k, \Sigma_k) = \sum_{k=1}^{K} \frac{\omega_k}{\sqrt{(2\pi)^D |\Sigma_k|}} e^{-\frac{1}{2}(x-\mu_k)^{\mathrm T} \Sigma_k^{-1}(x-\mu_k)}$$
where
- $K$: the number of Gaussian components (the same as the number of clusters we want)
- $\mu_k$ and $\Sigma_k$: mean and covariance matrix of the $k$-th Gaussian
- $\omega_1, \ldots, \omega_K$: mixture weights; they represent how much each component contributes to the final distribution and satisfy two properties: $\omega_k > 0$ for all $k$, and $\sum_k \omega_k = 1$.

An example

The conditional distributions are
$$p(x \mid z = \text{red}) = N(x \mid \mu_1, \Sigma_1), \quad p(x \mid z = \text{blue}) = N(x \mid \mu_2, \Sigma_2), \quad p(x \mid z = \text{green}) = N(x \mid \mu_3, \Sigma_3)$$
Here $z$ is the hidden (latent) variable. The marginal distribution is
$$p(x) = p(\text{red}) N(x \mid \mu_1, \Sigma_1) + p(\text{blue}) N(x \mid \mu_2, \Sigma_2) + p(\text{green}) N(x \mid \mu_3, \Sigma_3)$$

Learning GMMs

Learning a GMM means finding all the parameters $\theta = \{\omega_k, \mu_k, \Sigma_k\}_{k=1}^K$. How do we learn these parameters? An obvious attempt is maximum-likelihood estimation (MLE): find
$$\operatorname*{argmax}_\theta \ln \prod_{n=1}^N p(x_n; \theta) = \operatorname*{argmax}_\theta \sum_{n=1}^N \ln p(x_n; \theta) =: \operatorname*{argmax}_\theta P(\theta)$$
The problem is intractable in general (a non-concave problem, and there are latent variables). One solution is to still apply GD/SGD, but a much more effective approach is the Expectation-Maximization (EM) algorithm.

Preview of EM for learning GMMs

Step 0: Initialize $\omega_k, \mu_k, \Sigma_k$ for each $k \in [K]$.
Step 1 (E-Step): update the soft assignments, fixing the parameters:
$$\gamma_{nk} = p(z_n = k \mid x_n) \propto \omega_k N(x_n \mid \mu_k, \Sigma_k)$$
Step 2 (M-Step): update the model parameters, fixing the assignments:
$$\omega_k = \frac{\sum_n \gamma_{nk}}{N}, \quad \mu_k = \frac{\sum_n \gamma_{nk} x_n}{\sum_n \gamma_{nk}}, \quad \Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^{\mathrm T}$$
Step 3: return to Step 1 if not converged.
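To make the density formula concrete, here is a small sketch that evaluates $p(x)$ for a mixture, assuming NumPy and SciPy are available; the parameter values are again illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """Evaluate p(x) = sum_k w_k N(x | mu_k, Sigma_k) at a single point x."""
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=S)
               for w, m, S in zip(weights, means, covs))

weights = [0.5, 0.3, 0.2]
means = [np.zeros(2), np.array([4.0, 4.0]), np.array([-4.0, 4.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 2.0])]
print(gmm_density(np.array([0.0, 0.0]), weights, means, covs))
```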

EM algorithm

In general, EM is a heuristic for solving MLE with latent variables (not just for GMMs), i.e. for finding the maximizer of
$$P(\theta) = \sum_{n=1}^N \ln p(x_n; \theta)$$
where $\theta$ is the parameters of a general probabilistic model, the $x_n$'s are observed random variables, and the $z_n$'s are latent variables. Again, directly maximizing this objective is intractable.

EM is a general algorithm for dealing with hidden data:
- EM is an optimization strategy for objective functions that can be interpreted as likelihoods in the presence of missing data.
- EM is much simpler than gradient methods: there is no need to choose a step size.
- EM is an iterative algorithm with two steps. E-step: fill in the hidden values using inference. M-step: apply the standard MLE method to the completed data.
We will prove that EM always converges to a local optimum of the likelihood.

High level idea

Keep maximizing a lower bound of $P(\theta)$ that is more manageable.

Derivation of EM

Finding the lower bound of $P(\theta)$:
$$\ln p(x_n; \theta) = \ln \frac{p(x_n, z_n; \theta)}{p(z_n \mid x_n; \theta)} \quad \text{(true for any } z_n\text{)}$$
$$= \mathbb{E}_{z_n \sim q_n}\left[\ln \frac{p(x_n, z_n; \theta)}{p(z_n \mid x_n; \theta)}\right] \quad \text{(true for any distribution } q_n\text{)}$$
Let us recall the definition of expectation, $\mathbb{E}_{z \sim q}[f(z)] = \sum_z q(z) f(z)$, and entropy, $H(q) = -\mathbb{E}_{z \sim q}[\ln q(z)] = -\sum_z q(z) \ln q(z)$.
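For readers who want the two definitions above in executable form, here is a tiny NumPy helper (my own illustrative example, not part of the slides) computing $\mathbb{E}_{z\sim q}[f(z)]$ and $H(q)$ for a discrete distribution.

```python
import numpy as np

def expectation(q, f):
    """E_{z~q}[f(z)] = sum_z q(z) f(z) for a discrete q."""
    return float(np.dot(q, f))

def entropy(q):
    """H(q) = -sum_z q(z) ln q(z), using the convention 0 * ln 0 = 0."""
    q = np.asarray(q, dtype=float)
    nz = q > 0
    return float(-np.sum(q[nz] * np.log(q[nz])))

q = np.array([0.5, 0.3, 0.2])    # a distribution over three latent values
f = np.array([1.0, 2.0, 4.0])    # any function of z
print(expectation(q, f), entropy(q))
```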

Derivation of EM

Continuing to find the lower bound of $P(\theta)$:
$$\ln p(x_n; \theta) = \mathbb{E}_{z_n \sim q_n}\left[\ln \frac{p(x_n, z_n; \theta)}{p(z_n \mid x_n; \theta)}\right]$$
$$= \mathbb{E}_{z_n \sim q_n}[\ln p(x_n, z_n; \theta)] - \mathbb{E}_{z_n \sim q_n}[\ln q_n(z_n)] - \mathbb{E}_{z_n \sim q_n}\left[\ln \frac{p(z_n \mid x_n; \theta)}{q_n(z_n)}\right]$$
$$= \mathbb{E}_{z_n \sim q_n}[\ln p(x_n, z_n; \theta)] + H(q_n) - \mathbb{E}_{z_n \sim q_n}\left[\ln \frac{p(z_n \mid x_n; \theta)}{q_n(z_n)}\right] \quad (H \text{ is entropy})$$
$$\geq \mathbb{E}_{z_n \sim q_n}[\ln p(x_n, z_n; \theta)] + H(q_n) - \ln \mathbb{E}_{z_n \sim q_n}\left[\frac{p(z_n \mid x_n; \theta)}{q_n(z_n)}\right] \quad \text{(Jensen's inequality)}$$

Jensen's inequality

Claim: $\mathbb{E}[\ln X] \leq \ln \mathbb{E}[X]$.
Proof (for a uniform distribution over the values $x_1, \ldots, x_N$). By the definition $\mathbb{E}[X] = \frac{1}{N}(x_1 + x_2 + \cdots + x_N)$, it follows that
$$\mathbb{E}[\ln X] = \frac{1}{N}(\ln x_1 + \cdots + \ln x_N) = \ln \left(\prod_{n=1}^N x_n\right)^{1/N} \leq \ln \left(\frac{1}{N}\sum_{n=1}^N x_n\right) = \ln \mathbb{E}[X]$$
This is the AM-GM inequality. For $N = 2$, it is just $(\sqrt{x_1} - \sqrt{x_2})^2 \geq 0$.

Derivation of EM: alternately maximize the lower bound

After applying Jensen's inequality, we obtain
$$\ln p(x_n; \theta) \geq \mathbb{E}_{z_n \sim q_n}[\ln p(x_n, z_n; \theta)] + H(q_n) - \ln \mathbb{E}_{z_n \sim q_n}\left[\frac{p(z_n \mid x_n; \theta)}{q_n(z_n)}\right]$$
Next, we observe that
$$\mathbb{E}_{z_n \sim q_n}\left[\frac{p(z_n \mid x_n; \theta)}{q_n(z_n)}\right] = \sum_{z_n} q_n(z_n) \frac{p(z_n \mid x_n; \theta)}{q_n(z_n)} = \sum_{z_n} p(z_n \mid x_n; \theta) = 1$$
It follows that
$$\ln p(x_n; \theta) \geq \mathbb{E}_{z_n \sim q_n}[\ln p(x_n, z_n; \theta)] + H(q_n)$$
We have found a lower bound for the log-likelihood function:
$$P(\theta) = \sum_{n=1}^N \ln p(x_n; \theta) \geq \sum_{n=1}^N \left(\mathbb{E}_{z_n \sim q_n}[\ln p(x_n, z_n; \theta)] + H(q_n)\right) =: F(\theta, \{q_n\})$$
This holds for any $\{q_n\}$, so how do we choose? Naturally, the one that maximizes the lower bound, i.e. the tightest lower bound! This is similar to K-means: we will alternately maximize $F$ over $\{q_n\}$ and $\theta$.
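A quick numerical sanity check of the claim $\mathbb{E}[\ln X] \leq \ln \mathbb{E}[X]$ under the empirical (uniform) distribution used in the proof; the data here are arbitrary positive samples generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 5.0, size=1000)   # positive values so the logarithm is defined

lhs = np.mean(np.log(x))   # E[ln X] under the uniform distribution over the samples
rhs = np.log(np.mean(x))   # ln E[X]
print(lhs, rhs, lhs <= rhs)  # Jensen's inequality says this prints True
```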

Pictorial explanation

$P(\theta)$ is non-concave, but $F(\theta, \{q_n^t\})$ often is concave and easy to maximize.

Maximizing over $\{q_n\}$

Fix $\theta^t$ and maximize $F$ over $\{q_n\}$:
$$\operatorname*{argmax}_{q_n} \; \mathbb{E}_{z_n \sim q_n}[\ln p(x_n, z_n; \theta^t)] + H(q_n) = \operatorname*{argmax}_{q_n} \sum_{k=1}^K q_n(k) \ln p(x_n, z_n = k; \theta^t) - \sum_{k=1}^K q_n(k) \ln q_n(k)$$
subject to the conditions $q_n(k) \geq 0$ and $\sum_k q_n(k) = 1$. Next, write down the Lagrangian and then apply the KKT conditions.

The solution to
$$\operatorname*{argmax}_{q_n} \; \mathbb{E}_{z_n \sim q_n}[\ln p(x_n, z_n; \theta^t)] + H(q_n)$$
is (you should verify it yourself)
$$q_n^t(z_n = k) = p(z_n = k \mid x_n; \theta^t),$$
i.e., the posterior distribution of $z_n$ given $x_n$ and $\theta^t$. So at $\theta^t$ we have found the tightest lower bound $F(\theta, \{q_n^t\})$: $F(\theta, \{q_n^t\}) \leq P(\theta)$ for all $\theta$, and $F(\theta^t, \{q_n^t\}) = P(\theta^t)$.

Maximizing over $\theta$

Fix $\{q_n^t\}$ and maximize $F$ over $\theta$ (note that $H(q_n^t)$ is independent of $\theta$):
$$\operatorname*{argmax}_\theta F(\theta, \{q_n^t\}) = \operatorname*{argmax}_\theta \sum_{n=1}^N \mathbb{E}_{z_n \sim q_n^t}[\ln p(x_n, z_n; \theta)] =: \operatorname*{argmax}_\theta Q(\theta; \theta^t)$$
where the $\{q_n^t\}$ are computed via $\theta^t$. $Q$ is called the (expected) complete likelihood and is usually more tractable, since the $z_n$ are not latent variables anymore.

General EM algorithm

Step 0: Initialize $\theta^1$, set $t = 1$.
Step 1 (E-Step): update the posterior of the latent variables, $q_n^t = p(\cdot \mid x_n; \theta^t)$, and obtain the expectation of the complete likelihood
$$Q(\theta; \theta^t) = \sum_{n=1}^N \mathbb{E}_{z_n \sim q_n^t}[\ln p(x_n, z_n; \theta)]$$
Step 2 (M-Step): update the model parameters via maximization:
$$\theta^{t+1} = \operatorname*{argmax}_\theta Q(\theta; \theta^t)$$
Step 3: $t \leftarrow t + 1$ and return to Step 1 if not converged.

Pictorial explanation

$P(\theta)$ is non-concave, but $Q(\theta; \theta^t)$ often is concave and easy to maximize. Moreover,
$$P(\theta^{t+1}) \geq F(\theta^{t+1}; \{q_n^t\}) \geq F(\theta^t; \{q_n^t\}) = P(\theta^t)$$
so EM always increases the objective value and will converge to some local maximum (similar to K-means).

Apply EM to learn GMMs

E-Step:
$$q_n^t(z_n = k) = p(z_n = k \mid x_n; \theta^t) \propto p(z_n = k; \theta^t)\, p(x_n \mid z_n = k; \theta^t) = \omega_k^t N(x_n \mid \mu_k^t, \Sigma_k^t)$$
This computes the soft assignment $\gamma_{nk} = q_n^t(z_n = k)$, i.e. the conditional probability of $x_n$ belonging to cluster $k$.

M-Step:
$$Q(\theta, \theta^t) = \sum_{n=1}^N \mathbb{E}_{z_n \sim q_n^t}[\ln p(x_n, z_n; \theta)] = \sum_{n=1}^N \mathbb{E}_{z_n \sim q_n^t}[\ln p(z_n; \theta) + \ln p(x_n \mid z_n; \theta)] = \sum_{n=1}^N \sum_{k=1}^K \gamma_{nk}\left(\ln \omega_k + \ln N(x_n \mid \mu_k, \Sigma_k)\right)$$
To find $\omega_1, \ldots, \omega_K$, solve
$$\operatorname*{argmax}_{\omega} \sum_{n=1}^N \sum_{k=1}^K \gamma_{nk} \ln \omega_k$$
To find each $\mu_k, \Sigma_k$, solve
$$\operatorname*{argmax}_{\mu_k, \Sigma_k} \sum_{n=1}^N \gamma_{nk} \ln N(x_n \mid \mu_k, \Sigma_k)$$
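Putting the E-step and M-step together, here is a minimal NumPy/SciPy sketch of EM for a GMM. It follows the updates above but omits things a real implementation would need (log-space responsibilities, covariance regularization beyond a small jitter, and a convergence test).

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Minimal EM loop for a GMM with full covariances (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.full(K, 1.0 / K)                          # mixture weights
    mu = X[rng.choice(N, K, replace=False)]          # initialize means at data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iter):
        # E-step: gamma[n, k] proportional to w_k * N(x_n | mu_k, Sigma_k)
        gamma = np.stack([w[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                          for k in range(K)], axis=1)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: closed-form updates from the weighted sufficient statistics
        Nk = gamma.sum(axis=0)
        w = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return w, mu, Sigma, gamma
```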

M-Step continued

The solutions to the previous two problems are very natural (compare to the earlier preview of EM for GMMs): for each $k$,
$$\omega_k = \frac{\sum_n \gamma_{nk}}{N}, \quad \text{i.e. the weighted fraction of examples belonging to cluster } k$$
$$\mu_k = \frac{\sum_n \gamma_{nk} x_n}{\sum_n \gamma_{nk}}, \quad \text{i.e. the weighted average of examples belonging to cluster } k$$
$$\Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}(x_n - \mu_k)(x_n - \mu_k)^{\mathrm T}, \quad \text{i.e. the weighted covariance of examples belonging to cluster } k$$

GMM: putting it together

EM for clustering:
Step 0: Initialize $\omega_k, \mu_k, \Sigma_k$ for each $k \in [K]$.
Step 1 (E-Step): update the soft assignments, fixing the parameters: $\gamma_{nk} = p(z_n = k \mid x_n) \propto \omega_k N(x_n \mid \mu_k, \Sigma_k)$.
Step 2 (M-Step): update the model parameters, fixing the assignments: $\omega_k = \frac{\sum_n \gamma_{nk}}{N}$, $\mu_k = \frac{\sum_n \gamma_{nk} x_n}{\sum_n \gamma_{nk}}$, $\Sigma_k = \frac{1}{\sum_n \gamma_{nk}} \sum_n \gamma_{nk}(x_n - \mu_k)(x_n - \mu_k)^{\mathrm T}$.
Step 3: return to Step 1 if not converged.

Connection to K-means

K-means is in fact a special case of EM for a simplified GMM: let $\Sigma_k = \sigma^2 I$ for some fixed $\sigma$, so only the $\omega_k$ and $\mu_k$ are parameters. The likelihood is
$$\prod_{n=1}^N p(x_n; \theta) = \prod_{n=1}^N \sum_{k=1}^K p(z_n = k)\, N(x_n \mid \mu_k, \sigma^2 I)$$
If we assume hard assignments, $p(z_n = k) = 1$ if $k = C(n)$ (and $0$ otherwise), then
$$\prod_{n=1}^N p(x_n; \theta) = \prod_{n=1}^N N(x_n \mid \mu_{C(n)}, \sigma^2 I) \propto \prod_{n=1}^N \exp\left(-\frac{1}{2\sigma^2}\|x_n - \mu_{C(n)}\|_2^2\right)$$
and maximizing it is equivalent to
$$\operatorname*{argmin}_{\mu, C} \sum_{n=1}^N \|x_n - \mu_{C(n)}\|_2^2,$$
the K-means objective. GMM is a soft version of K-means, and it provides a probabilistic interpretation of the data. (A hard-assignment code sketch follows the outline below.)

Outline
1 Gaussian mixture models
2 Density estimation: Parametric models; Nonparametric models
3 Naive Bayes Revisited
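To make the connection concrete, here is K-means written as hard-assignment EM with $\Sigma_k = \sigma^2 I$; this is an illustrative sketch, not code from the lecture.

```python
import numpy as np

def kmeans_as_hard_em(X, K, n_iter=50, seed=0):
    """K-means viewed as EM with hard assignments and shared spherical covariance."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        # "E-step": hard-assign each point to the nearest mean
        C = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1), axis=1)
        # "M-step": each mean becomes the average of its assigned points
        mu = np.array([X[C == k].mean(axis=0) if np.any(C == k) else mu[k]
                       for k in range(K)])
    return mu, C
```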

Density estimation

Observe that what we have done indirectly for clustering with GMMs is: given a training set $x_1, \ldots, x_N$, estimate a density function $p$ that could have generated this dataset via $x_n \overset{\text{i.i.d.}}{\sim} p$. This is exactly the problem of density estimation, another important unsupervised learning problem. It is useful for many downstream applications: we have seen clustering already and will see more applications today; these applications also provide a way to measure the quality of the density estimator.

Parametric generative models

Parametric estimation assumes a generative model parametrized by $\theta$: $p(x) = p(x; \theta)$. Examples:
- GMM: $p(x; \theta) = \sum_{k=1}^K \omega_k N(x \mid \mu_k, \Sigma_k)$, where $\theta = \{\omega_k, \mu_k, \Sigma_k\}$
- Multinomial (for 1-D examples with $K$ possible values): $p(x = k; \theta) = \theta_k$, where $\theta$ is a distribution over $K$ elements.
The size of $\theta$ is independent of the training set size, so the model is parametric.

Parametric methods

Again, we apply MLE to learn the parameters $\theta$:
$$\theta^* = \operatorname*{argmax}_\theta \sum_{n=1}^N \ln p(x_n; \theta)$$
For some cases this is intractable and we can use EM to approximately solve the MLE (e.g. GMMs). For some other cases it admits a simple closed-form solution (e.g. the multinomial).

MLE for the multinomial

$$\operatorname*{argmax}_\theta \sum_{n=1}^N \ln p(x = x_n; \theta) = \operatorname*{argmax}_\theta \sum_{k=1}^K \sum_{n: x_n = k} \ln \theta_k = \operatorname*{argmax}_\theta \sum_{n=1}^N \ln \theta_{x_n} = \operatorname*{argmax}_\theta \sum_{k=1}^K z_k \ln \theta_k$$
where $z_k = |\{n : x_n = k\}|$ is the number of examples with value $k$. The solution is simply
$$\theta_k = \frac{z_k}{\sum_{k'} z_{k'}} = \frac{z_k}{N},$$
i.e. the fraction of examples with value $k$.
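The multinomial closed form is essentially one line of code. A small sketch of the counting estimator, on toy data chosen for illustration:

```python
import numpy as np

def multinomial_mle(x, K):
    """Closed-form MLE for the multinomial: theta_k = z_k / N, with z_k the count of value k."""
    counts = np.bincount(np.asarray(x), minlength=K)
    return counts / len(x)

x = [0, 2, 2, 1, 0, 2, 2, 1, 0, 2]     # toy data with K = 3 possible values
print(multinomial_mle(x, K=3))         # -> [0.3 0.2 0.5]
```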

Nonparametric models

Can we estimate $p$ without assuming a fixed generative model? The high level idea is to construct something similar to a histogram: for each data point, create a "hump" via a kernel, and sum up all the humps; more data near a point means a higher hump. [Picture from Wikipedia.]

Kernel density estimation (KDE) is a common approach to nonparametric density estimation. Here "kernel" means something different from what we have seen for kernel functions. We focus on the 1-D continuous case.

Kernel

KDE with a kernel $K(x): \mathbb{R} \to \mathbb{R}$ centered at the data points $x_n$:
$$p(x) = \frac{1}{N} \sum_{n=1}^N K(x - x_n)$$
Properties of a kernel:
- symmetry: $K(x) = K(-x)$
- $\int K(x)\,dx = 1$; this ensures that $p$ is a density function.

Different kernels

There are many choices for $K$, for example:
- $K(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, the standard Gaussian density
- $K(x) = \frac{1}{2} \mathbb{I}[|x| \leq 1]$ (the boxcar kernel)
- $K(x) = \frac{3}{4} \max\{1 - x^2, 0\}$ (the Epanechnikov kernel)
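A minimal 1-D KDE sketch with the Gaussian kernel (it already includes the bandwidth $h$ discussed on the next slides; set $h = 1$ to recover the plain formula above). The data are synthetic, for illustration only.

```python
import numpy as np

def kde(x_query, data, h=0.3):
    """1-D KDE with a Gaussian kernel: p(x) = (1 / (N h)) * sum_n K((x - x_n) / h)."""
    u = (np.asarray(x_query)[:, None] - np.asarray(data)[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.mean(axis=1) / h

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(1, 1.0, 200)])
print(kde(np.linspace(-4, 4, 9), data, h=0.3))
```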

Bandwidth

If $K(x)$ is a kernel, then for any $h > 0$,
$$K_h(u) = \frac{1}{h} K\!\left(\frac{u}{h}\right) \quad \text{(stretching the kernel)}$$
can be used as a kernel too (verify the two properties yourself). A larger $h$ will smooth the density; a small $h$ will yield a density that is spiky and very hard to interpret. So a general KDE is determined by both the kernel $K$ and the bandwidth $h$:
$$p(x) = \frac{1}{N} \sum_{n=1}^N K_h(x - x_n) = \frac{1}{Nh} \sum_{n=1}^N K\!\left(\frac{x - x_n}{h}\right)$$
$x_n$ controls the center of each hump; $h$ controls the width/variance of the humps.

Effect of bandwidth

[Picture from Wikipedia: KDE with a Gaussian kernel; the gray curve is the ground truth; red: $h = 0.05$; black: $h = 0.337$; green: $h = 2$.]

Bandwidth selection

Selecting $h$ is a deep topic:
- one can do cross-validation based on downstream applications (a simple cross-validation sketch follows the outline below);
- there are theoretically-motivated approaches: find a value of $h$ that minimizes the error between the estimated density and the true density,
$$\mathbb{E}\left[(p_{\text{KDE}}(x) - p(x))^2\right] = \left(\mathbb{E}[p_{\text{KDE}}(x)] - p(x)\right)^2 + \mathrm{Var}[p_{\text{KDE}}(x)]$$
This expression is an example of the bias-variance tradeoff, which we saw in an earlier lecture.

Outline
1 Gaussian mixture models
2 Density estimation
3 Naive Bayes Revisited: Setup and assumption; Connection to logistic regression; Generative and Discriminative Models
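As referenced above, one simple data-driven way to pick $h$ (not necessarily the theoretically-motivated rules the slide alludes to) is to maximize the leave-one-out log-likelihood of the KDE. A sketch with synthetic data:

```python
import numpy as np

def loo_log_likelihood(data, h):
    """Leave-one-out log-likelihood of a Gaussian-kernel KDE with bandwidth h."""
    data = np.asarray(data)
    N = len(data)
    u = (data[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    np.fill_diagonal(K, 0.0)                     # drop each point's own hump
    p_loo = K.sum(axis=1) / ((N - 1) * h)
    return float(np.sum(np.log(p_loo + 1e-300)))

rng = np.random.default_rng(0)
data = rng.normal(size=200)
candidates = [0.05, 0.1, 0.2, 0.337, 0.5, 1.0, 2.0]
print(max(candidates, key=lambda h: loo_log_likelihood(data, h)))
```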

Bayes optimal classifier

Suppose the data $(x, y)$ is drawn from a joint distribution $p(x, y)$. Then the Bayes optimal classifier is
$$f^*(x) = \operatorname*{argmax}_{c \in [C]} p(c \mid x),$$
i.e. predict the class with the largest conditional probability. $p(x, y)$ is of course unknown, but we can estimate it, which is exactly a density estimation problem! Observe that $p(x, y) = p(y)\,p(x \mid y)$. To estimate $p(x \mid y = c)$ for some $c \in [C]$, we are doing density estimation using the data with label $y = c$.

Discrete features

For a label $c \in [C]$,
$$p(y = c) = \frac{|\{n : y_n = c\}|}{N}$$
For each possible value $k$ of a discrete feature $d$,
$$p(x_d = k \mid y = c) = \frac{|\{n : x_{nd} = k, y_n = c\}|}{|\{n : y_n = c\}|}$$

Continuous features

If a feature is continuous, we can do parametric estimation, e.g. via a Gaussian,
$$p(x_d = x \mid y = c) = \frac{1}{\sqrt{2\pi}\,\sigma_{cd}} \exp\left(-\frac{(x - \mu_{cd})^2}{2\sigma_{cd}^2}\right)$$
where $\mu_{cd}$ and $\sigma_{cd}^2$ are the empirical mean and variance of feature $d$ among all examples with label $c$; or nonparametric estimation, e.g. via a kernel $K$ and bandwidth $h$:
$$p(x_d = x \mid y = c) = \frac{1}{|\{n : y_n = c\}|} \sum_{n : y_n = c} K_h(x - x_{nd})$$

How to predict?

Using the Naive Bayes assumption
$$p(x \mid y = c) = \prod_{d=1}^D p(x_d \mid y = c),$$
the prediction for a new example $x$ is
$$\operatorname*{argmax}_{c \in [C]} p(y = c \mid x) = \operatorname*{argmax}_c \frac{p(x \mid y = c)\,p(y = c)}{p(x)} = \operatorname*{argmax}_c p(y = c) \prod_{d=1}^D p(x_d \mid y = c) = \operatorname*{argmax}_c \left[\ln p(y = c) + \sum_{d=1}^D \ln p(x_d \mid y = c)\right]$$
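For the continuous/Gaussian case, the estimation and prediction steps fit in a few lines. A minimal Gaussian Naive Bayes sketch (class labels assumed to be 0, ..., C-1; the small variance floor is an addition of this sketch to avoid division by zero):

```python
import numpy as np

def fit_gaussian_nb(X, y, C):
    """Per-class priors plus per-class, per-feature Gaussian mean and variance."""
    priors = np.array([np.mean(y == c) for c in range(C)])
    mu = np.array([X[y == c].mean(axis=0) for c in range(C)])          # (C, D)
    var = np.array([X[y == c].var(axis=0) + 1e-9 for c in range(C)])   # variance floor
    return priors, mu, var

def predict_gaussian_nb(X, priors, mu, var):
    """argmax_c [ log p(y=c) + sum_d log N(x_d | mu_cd, var_cd) ]."""
    log_lik = -0.5 * (np.log(2 * np.pi * var)[None, :, :]
                      + (X[:, None, :] - mu[None, :, :]) ** 2 / var[None, :, :])
    scores = np.log(priors)[None, :] + log_lik.sum(axis=2)             # (N, C)
    return scores.argmax(axis=1)
```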

Naive Bayes

For discrete features, plugging in the previous MLE estimates gives
$$\operatorname*{argmax}_c p(y = c \mid x) = \operatorname*{argmax}_c \left[\ln p(y = c) + \sum_{d=1}^D \ln p(x_d \mid y = c)\right] = \operatorname*{argmax}_c \left[\ln |\{n : y_n = c\}| + \sum_{d=1}^D \ln \frac{|\{n : x_{nd} = x_d, y_n = c\}|}{|\{n : y_n = c\}|}\right]$$
For continuous features with a Gaussian model,
$$\operatorname*{argmax}_c p(y = c \mid x) = \operatorname*{argmax}_c \left[\ln |\{n : y_n = c\}| + \sum_{d=1}^D \ln \left(\frac{1}{\sqrt{2\pi}\,\sigma_{cd}} \exp\left(-\frac{(x_d - \mu_{cd})^2}{2\sigma_{cd}^2}\right)\right)\right] = \operatorname*{argmax}_c \left[\ln |\{n : y_n = c\}| - \sum_{d=1}^D \left(\ln \sigma_{cd} + \frac{(x_d - \mu_{cd})^2}{2\sigma_{cd}^2}\right)\right]$$

Connection to logistic regression

Let us fix the variance of each feature to be $\sigma$ (i.e. it is no longer a parameter of the model); then, dropping class-independent terms, the prediction becomes
$$\operatorname*{argmax}_c p(y = c \mid x) = \operatorname*{argmax}_c \left[\ln |\{n : y_n = c\}| - \sum_{d=1}^D \left(\ln \sigma + \frac{(x_d - \mu_{cd})^2}{2\sigma^2}\right)\right] = \operatorname*{argmax}_c \left[\ln |\{n : y_n = c\}| - \frac{\|x\|_2^2}{2\sigma^2} - \sum_{d=1}^D \frac{\mu_{cd}^2}{2\sigma^2} + \sum_{d=1}^D \frac{\mu_{cd}}{\sigma^2} x_d\right]$$
$$= \operatorname*{argmax}_c \left[w_{c0} + \sum_{d=1}^D w_{cd} x_d\right] = \operatorname*{argmax}_c w_c^{\mathrm T} x \quad \text{(a linear classifier!)}$$
where we denote
$$w_{c0} = \ln |\{n : y_n = c\}| - \sum_{d=1}^D \frac{\mu_{cd}^2}{2\sigma^2} \quad \text{and} \quad w_{cd} = \frac{\mu_{cd}}{\sigma^2}.$$
You can verify that $p(y = c \mid x) \propto e^{w_c^{\mathrm T} x}$. This is exactly the softmax function, the same model we used for a probabilistic interpretation of logistic regression!

So what is different then? They learn the parameters in different ways:
- both via MLE, but one on $p(y = c \mid x)$ and the other on $p(x, y)$;
- the solutions are different: logistic regression has no closed form, while Naive Bayes admits a simple closed form.
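The derivation above turns shared-variance Gaussian Naive Bayes into an explicit set of linear weights. A small sketch computing $w_{c0}$ and $w_{cd}$ from data (labels assumed to be 0, ..., C-1; sigma is treated as a fixed constant, as in the slide):

```python
import numpy as np

def nb_to_linear_weights(X, y, C, sigma=1.0):
    """Shared-variance Gaussian NB as a linear classifier:
    w_cd = mu_cd / sigma^2,  w_c0 = ln |{n: y_n = c}| - sum_d mu_cd^2 / (2 sigma^2)."""
    counts = np.array([np.sum(y == c) for c in range(C)])
    mu = np.array([X[y == c].mean(axis=0) for c in range(C)])   # (C, D)
    W = mu / sigma ** 2
    b = np.log(counts) - (mu ** 2).sum(axis=1) / (2 * sigma ** 2)
    return W, b

def predict_linear(X, W, b):
    # the class-independent ||x||^2 term is dropped, as in the derivation
    return (X @ W.T + b).argmax(axis=1)
```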

Two different modeling paradigms

Suppose the training data comes from an unknown joint probabilistic model $p(x, y)$. There are two kinds of classification models in machine learning: generative models and discriminative models. The difference lies in the models they assume for the data:
- the generative approach requires specifying a model for the joint distribution (such as Naive Bayes), and thus maximizes the joint likelihood $\sum_n \log p(x_n, y_n)$;
- the discriminative approach requires only specifying a model for the conditional distribution (such as logistic regression), and thus maximizes the conditional likelihood $\sum_n \log p(y_n \mid x_n)$.
Sometimes modeling by the discriminative approach is easier; sometimes parameter estimation by the generative approach is easier.

Generative model vs. discriminative model

             Discriminative model          Generative model
Example      logistic regression           naive Bayes
Model        conditional p(y|x)            joint p(x,y) (might induce the same p(y|x))
Learning     MLE                           MLE
Accuracy     usually better for large N    usually better for small N
Remark       more flexible                 can generate data after learning

Example: determining sex (man or woman) based on measurements

Generative approach: propose a model of the joint distribution of $x =$ height and $y =$ sex.

[Figure: scatter plots of weight vs. height for our data, red = female, blue = male, together with a small table of (sex, height) measurements.]

Intuition: we will model how heights vary according to a Gaussian within each sub-population (male and female). Note: this is similar to Naive Bayes for detecting spam emails.

Model of the joint distribution

$$p(x, y) = p(y)\,p(x \mid y) = \begin{cases} p_1 \cdot \frac{1}{\sqrt{2\pi\sigma_1^2}}\, e^{-\frac{(x - \mu_1)^2}{2\sigma_1^2}} & \text{if } y = 1 \\[4pt] p_2 \cdot \frac{1}{\sqrt{2\pi\sigma_2^2}}\, e^{-\frac{(x - \mu_2)^2}{2\sigma_2^2}} & \text{if } y = 2 \end{cases}$$
where $p_1 + p_2 = 1$ are the two prior probabilities that $x$ is given the label 1 or 2, respectively, and $p(x \mid y)$ is assumed to be Gaussian. [Figure: scatter plot of weight vs. height, red = female, blue = male.]

Parameter estimation

The likelihood of the training data $D = \{(x_n, y_n)\}_{n=1}^N$ with $y_n \in \{1, 2\}$ is
$$\log P(D) = \sum_n \log p(x_n, y_n) = \sum_{n: y_n = 1} \log\left(p_1 \frac{1}{\sqrt{2\pi\sigma_1^2}}\, e^{-\frac{(x_n - \mu_1)^2}{2\sigma_1^2}}\right) + \sum_{n: y_n = 2} \log\left(p_2 \frac{1}{\sqrt{2\pi\sigma_2^2}}\, e^{-\frac{(x_n - \mu_2)^2}{2\sigma_2^2}}\right)$$
Maximize the likelihood function:
$$(p_1, p_2, \mu_1, \mu_2, \sigma_1, \sigma_2) = \operatorname*{argmax} \log P(D)$$

Decision boundary

The decision boundary between the two classes is defined by
$$p(y = 1 \mid x) \geq p(y = 2 \mid x),$$
which is equivalent to
$$p(x \mid y = 1)\,p(y = 1) \geq p(x \mid y = 2)\,p(y = 2).$$
Namely,
$$-\frac{(x - \mu_1)^2}{2\sigma_1^2} - \log(\sqrt{2\pi}\,\sigma_1) + \log p_1 \geq -\frac{(x - \mu_2)^2}{2\sigma_2^2} - \log(\sqrt{2\pi}\,\sigma_2) + \log p_2$$
This is quadratic in $x$: it follows, for some $a$, $b$ and $c$, that
$$a x^2 + b x + c \geq 0.$$
The decision boundary is not linear!

Example of nonlinear decision boundary

[Figure: "Parabolic Boundary".] Note: the boundary is characterized by a quadratic function, giving rise to the shape of a parabolic curve.
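The MLE for this two-class, one-feature model is closed form (per-class counts, means, and variances), and the class scores below differ by a quadratic in $x$, which is where the parabolic boundary comes from. The data in the usage snippet are synthetic, for illustration only.

```python
import numpy as np

def fit_two_gaussians(x, y):
    """Closed-form MLE: class prior p_c and per-class (mu_c, sigma_c), c in {1, 2}."""
    params = {}
    for c in (1, 2):
        xc = x[y == c]
        params[c] = dict(p=len(xc) / len(x), mu=xc.mean(), sigma=xc.std())
    return params

def class_score(x, p, mu, sigma):
    """log p(x, y=c) up to class-independent constants; quadratic in x."""
    return -(x - mu) ** 2 / (2 * sigma ** 2) - np.log(sigma) + np.log(p)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(64, 2.5, 100), rng.normal(70, 3.0, 100)])
y = np.array([1] * 100 + [2] * 100)
theta = fit_two_gaussians(x, y)
pred = np.where(class_score(x, **theta[1]) >= class_score(x, **theta[2]), 1, 2)
```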

A special case

What if we assume the two Gaussians have the same variance? We will get a linear decision boundary. From the previous slide,
$$-\frac{(x - \mu_1)^2}{2\sigma_1^2} - \log(\sqrt{2\pi}\,\sigma_1) + \log p_1 \geq -\frac{(x - \mu_2)^2}{2\sigma_2^2} - \log(\sqrt{2\pi}\,\sigma_2) + \log p_2$$
Setting $\sigma_1 = \sigma_2$, the quadratic terms cancel and we obtain
$$b x + c \geq 0.$$
Note: equal variances across two different categories could be a very strong assumption. For example, the plot suggests that the male population has a slightly bigger variance (i.e., a bigger ellipse) than the female population.