Motivation
Expectation maximization
Subhransu Maji
CMPSCI 689: Machine Learning
14 April 2015

Suppose you are building a naive Bayes spam classifier. After you are done, your boss tells you that there is no money to label the data. You have a probabilistic model that assumes labelled data, but you don't have any labels. Can you still do something? Amazingly, you can!

- Treat the labels as hidden variables and try to learn them simultaneously along with the parameters of the model.
- Expectation Maximization (EM): a broad family of algorithms for solving hidden variable problems.
- In today's lecture we will derive EM algorithms for clustering and naive Bayes classification, and learn why EM works.

Gaussian mixture model for clustering

Suppose data comes from a Gaussian Mixture Model (GMM): you have K clusters, and the data from cluster k is drawn from a Gaussian with mean μ_k and variance σ_k². We will assume that the data comes with labels (we will soon remove this assumption).

Generative story of the data. For each example n = 1, 2, ..., N:
- Choose a label y_n ~ Mult(θ_1, θ_2, ..., θ_K)
- Choose the example x_n ~ N(μ_{y_n}, σ²_{y_n})

Likelihood of the data:

    p(D) = ∏_{n=1}^{N} p(y_n) p(x_n | y_n) = ∏_{n=1}^{N} θ_{y_n} N(x_n; μ_{y_n}, σ²_{y_n})
         = ∏_{n=1}^{N} θ_{y_n} (2π σ²_{y_n})^{−D/2} exp( −‖x_n − μ_{y_n}‖² / (2σ²_{y_n}) )

GMM: known labels

If you knew the labels y_n, then the maximum-likelihood estimates of the parameters are easy:

    θ_k = (1/N) Σ_n [y_n = k]                           // fraction of examples with label k
    μ_k = Σ_n [y_n = k] x_n / Σ_n [y_n = k]             // mean of all the examples with label k
    σ_k² = Σ_n [y_n = k] ‖x_n − μ_k‖² / Σ_n [y_n = k]   // variance of all the examples with label k
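To make the known-labels case concrete, here is a minimal sketch of these maximum-likelihood estimates for 1-D data (illustrative code, not from the slides; the function name and array layout are assumptions):

    import numpy as np

    def gmm_mle_known_labels(X, y, K):
        """ML estimates for a GMM with known labels.
        X: (N,) array of 1-D points; y: (N,) int array with values in {0, ..., K-1}."""
        N = len(X)
        theta = np.zeros(K)  # mixing proportions
        mu = np.zeros(K)     # per-cluster means
        var = np.zeros(K)    # per-cluster variances
        for k in range(K):
            mask = (y == k)
            theta[k] = mask.sum() / N                  # fraction of examples with label k
            mu[k] = X[mask].mean()                     # mean of the examples with label k
            var[k] = ((X[mask] - mu[k]) ** 2).mean()   # variance of the examples with label k
        return theta, mu, var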
GMM: unknown labels

Now suppose you didn't have the labels y_n. Analogous to k-means, one solution is to iterate. Start by guessing the parameters and then repeat the two steps:
- Estimate the labels given the parameters
- Estimate the parameters given the labels

In k-means we assigned each point to a single cluster, also called a hard assignment (point 10 goes to cluster 2). In expectation maximization (EM) we will use a soft assignment (point 10 goes half to cluster 2 and half to cluster 5).

Let's define a random variable z_n = [z_{n,1}, z_{n,2}, ..., z_{n,K}] to denote the assignment vector for the n-th point:
- Hard assignment: only one of the z_{n,k} is 1, the rest are 0
- Soft assignment: the z_{n,k} are positive and sum to 1

Formally, z_{n,k} is the probability that the n-th point goes to cluster k.

GMM: parameter estimation

    z_{n,k} = p(y_n = k | x_n) = p(y_n = k, x_n) / p(x_n)
            ∝ p(y_n = k) p(x_n | y_n = k) = θ_k N(x_n; μ_k, σ_k²)

Given a set of parameters (θ_k, μ_k, σ_k²), z_{n,k} is easy to compute. Given the z_{n,k}, we can update the parameters (θ_k, μ_k, σ_k²) as:

    θ_k = (1/N) Σ_n z_{n,k}                          // fraction of examples with label k, fractional
    μ_k = Σ_n z_{n,k} x_n / Σ_n z_{n,k}              // mean of all the examples, fractional
    σ_k² = Σ_n z_{n,k} ‖x_n − μ_k‖² / Σ_n z_{n,k}    // variance of all the examples, fractional

We have replaced the indicator variable [y_n = k] with p(y_n = k), which is the expectation of [y_n = k]. This is our guess of the labels.

GMM: example

Just like k-means, EM is susceptible to local minima. Clustering example (k-means vs. GMM): http://nbviewer.ipython.org/github/nicta/mlss/tree/master/clustering/

The EM framework

We have data with observations x_n and hidden variables y_n, and would like to estimate the parameters θ. The likelihood of the data and hidden variables is p(x_n, y_n | θ). Only the x_n are known, so we can compute the data likelihood by marginalizing out the y_n:

    p(D) = ∏_n p(x_n | θ) = ∏_n Σ_{y_n} p(x_n, y_n | θ)

Parameter estimation by maximizing the log-likelihood:

    θ_ML = argmax_θ Σ_n log Σ_{y_n} p(x_n, y_n | θ)   // hard to maximize since the sum is inside the log
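A minimal sketch of one such iteration for a 1-D GMM, mirroring the soft assignments and fractional updates above (illustrative code; the name em_step and the array layout are assumptions):

    import numpy as np

    def em_step(X, theta, mu, var):
        """One EM iteration for a 1-D GMM. X: (N,); theta, mu, var: (K,)."""
        # E step: z[n, k] = p(y_n = k | x_n) ∝ theta_k * N(x_n; mu_k, var_k)
        pdf = np.exp(-(X[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        z = theta * pdf
        z /= z.sum(axis=1, keepdims=True)       # each row now sums to 1 (soft assignment)
        # M step: the indicators [y_n = k] are replaced by the soft weights z[n, k]
        Nk = z.sum(axis=0)                      # effective number of points per cluster
        theta = Nk / len(X)                     # fractional proportions
        mu = (z * X[:, None]).sum(axis=0) / Nk  # fractional means
        var = (z * (X[:, None] - mu) ** 2).sum(axis=0) / Nk  # fractional variances
        return theta, mu, var, z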
Jensen's inequality

Given a concave function f and a set of weights λ_i ≥ 0 with Σ_i λ_i = 1, Jensen's inequality states that

    f( Σ_i λ_i x_i ) ≥ Σ_i λ_i f(x_i)

This is a direct consequence of concavity:

    f(ax + by) ≥ a f(x) + b f(y)   when a ≥ 0, b ≥ 0, a + b = 1

[Figure: a concave f, with the chord value a f(x) + b f(y) lying below f(ax + by).]

The EM framework

Construct a lower bound on the log-likelihood using Jensen's inequality (with f = log, λ = q(y_n), x = p(x_n, y_n | θ) / q(y_n)):

    L(θ) = Σ_n log Σ_{y_n} p(x_n, y_n | θ)
         = Σ_n log Σ_{y_n} q(y_n) [ p(x_n, y_n | θ) / q(y_n) ]
         ≥ Σ_n Σ_{y_n} q(y_n) log [ p(x_n, y_n | θ) / q(y_n) ]                // Jensen's inequality
         = Σ_n Σ_{y_n} [ q(y_n) log p(x_n, y_n | θ) − q(y_n) log q(y_n) ] ≜ L̂(θ)

Maximize the lower bound; the second term is independent of θ, so:

    argmax_θ Σ_n Σ_{y_n} q(y_n) log p(x_n, y_n | θ)

Lower bound illustrated

Maximizing the lower bound increases the value of the original function if the lower bound touches the function at the current value, i.e. L(θ^t) = L̂(θ^t).

[Figure: L(θ) with lower bounds L̂(θ^t) and L̂(θ^{t+1}) touching it at θ^t and θ^{t+1}.]

An optimal lower bound

Any choice of the probability distribution q(y) is valid as long as the lower bound touches the function at the current estimate of θ. We can then pick the optimal q(y) by maximizing the lower bound:

    argmax_q Σ_n Σ_{y_n} [ q(y_n) log p(x_n, y_n | θ) − q(y_n) log q(y_n) ]

This gives us q(y_n) = p(y_n | x_n, θ^t). (Proof: use Lagrange multipliers with a sum-to-one constraint.) This is the distribution of the hidden variables conditioned on the data and the current estimate of the parameters, and it is exactly what we computed in the GMM example.
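As a quick numeric sanity check of Jensen's inequality for the concave function f = log (an illustrative snippet, not part of the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    lam = rng.dirichlet(np.ones(5))       # weights λ_i ≥ 0 that sum to 1
    x = rng.uniform(0.1, 10.0, size=5)    # positive points so log is defined
    lhs = np.log(np.dot(lam, x))          # f(Σ λ_i x_i)
    rhs = np.dot(lam, np.log(x))          # Σ λ_i f(x_i)
    assert lhs >= rhs                     # concavity of log guarantees this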
The EM algorithm

We have data with observations x_n and hidden variables y_n, and would like to estimate the parameters θ of the distribution p(x | θ).

EM algorithm:
- Initialize the parameters θ randomly
- Iterate between the following two steps:
  - E step: compute the probability distribution over the hidden variables, q(y_n) = p(y_n | x_n, θ)
  - M step: maximize the lower bound, θ ← argmax_θ Σ_n Σ_{y_n} q(y_n) log p(x_n, y_n | θ)

The EM algorithm is a great candidate when the M step can be done easily but p(x | θ) cannot be easily optimized over θ. E.g., for GMMs it was easy to compute the means and variances given the memberships.

Naive Bayes: revisited

Consider the binary prediction problem. Let the data be distributed according to a probability distribution:

    p_θ(y, x) = p_θ(y, x_1, x_2, ..., x_D)

We can simplify this using the chain rule of probability:

    p_θ(y, x) = p_θ(y) p_θ(x_1 | y) p_θ(x_2 | x_1, y) ... p_θ(x_D | x_1, x_2, ..., x_{D−1}, y)
              = p_θ(y) ∏_{d=1}^{D} p_θ(x_d | x_1, x_2, ..., x_{d−1}, y)

Naive Bayes assumption:

    p_θ(x_d | x_{d′}, y) = p_θ(x_d | y)   ∀ d′ ≠ d

E.g., the words "free" and "money" are independent given "spam".

Naive Bayes: a simple case

Case: binary labels and binary features. Probability of the data:

    p_θ(y) = Bernoulli(θ_0)
    p_θ(x_d | y = +1) = Bernoulli(θ_d^+)
    p_θ(x_d | y = −1) = Bernoulli(θ_d^−)

    p_θ(y, x) = p_θ(y) ∏_d p_θ(x_d | y)   // 1 + 2D parameters
              = θ_0^[y=+1] (1 − θ_0)^[y=−1]
                × ∏_d (θ_d^+)^[x_d=1, y=+1] (1 − θ_d^+)^[x_d=0, y=+1]   // label +1
                × ∏_d (θ_d^−)^[x_d=1, y=−1] (1 − θ_d^−)^[x_d=0, y=−1]   // label −1

Naive Bayes: parameter estimation

Given data we can estimate the parameters by maximizing the data likelihood. The maximum-likelihood estimates are:

    θ̂_0 = (1/N) Σ_n [y_n = +1]                              // fraction of the data with label +1
    θ̂_d^+ = Σ_n [x_{n,d} = 1, y_n = +1] / Σ_n [y_n = +1]    // fraction of instances with x_d = 1 among the +1
    θ̂_d^− = Σ_n [x_{n,d} = 1, y_n = −1] / Σ_n [y_n = −1]    // fraction of instances with x_d = 1 among the −1
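These counts translate directly into code; a minimal sketch (illustrative names) assuming X is an (N, D) 0/1 matrix and y an (N,) array of ±1 labels:

    import numpy as np

    def nb_mle(X, y):
        """Maximum-likelihood naive Bayes estimates from labelled binary data."""
        pos = (y == +1)
        theta0 = pos.mean()               # fraction of the data with label +1
        theta_pos = X[pos].mean(axis=0)   # fraction with x_d = 1 among the +1
        theta_neg = X[~pos].mean(axis=0)  # fraction with x_d = 1 among the -1
        return theta0, theta_pos, theta_neg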
Naive Bayes: EM

Now suppose you don't have the labels y_n.
- Initialize the parameters θ randomly
- E step: compute the distribution over the hidden variables q(y_n):

    q(y_n = +1) = p(y_n = +1 | x_n, θ) ∝ θ_0 ∏_d (θ_d^+)^[x_{n,d}=1] (1 − θ_d^+)^[x_{n,d}=0]

- M step: estimate θ given the guesses:

    θ_0 = (1/N) Σ_n q(y_n = +1)                                // fraction of the data with label +1, fractional
    θ_d^+ = Σ_n [x_{n,d} = 1] q(y_n = +1) / Σ_n q(y_n = +1)    // fraction of instances with x_d = 1 among the +1, fractional
    θ_d^− = Σ_n [x_{n,d} = 1] q(y_n = −1) / Σ_n q(y_n = −1)    // fraction of instances with x_d = 1 among the −1, fractional

(A runnable sketch of this loop is given at the end of these notes.)

Summary

Expectation maximization:
- A general technique to estimate the parameters of probabilistic models when some observations are hidden
- EM iterates between estimating the hidden variables and optimizing the parameters given the hidden variables
- EM can be seen as maximizing a lower bound on the data log-likelihood; we used Jensen's inequality to switch the log-sum to a sum-log
- EM can be used for learning:
  - mixtures of distributions for clustering, e.g. GMMs
  - parameters of hidden Markov models (next lecture)
  - topic models in NLP
  - probabilistic PCA

Slides credit
- Some of the slides are based on the CIML book by Hal Daumé III
- The figure for the EM lower bound is based on https://cxwangyi.wordpress.com/2008/11/
- The clustering example (k-means vs. GMM) is from http://nbviewer.ipython.org/github/nicta/mlss/tree/master/clustering/
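The naive Bayes EM loop above, as a minimal end-to-end sketch (illustrative names, no smoothing; for large D the products should be computed in log space to avoid underflow):

    import numpy as np

    def nb_em(X, n_iters=50, seed=0):
        """EM for naive Bayes with hidden ±1 labels. X: (N, D) 0/1 matrix."""
        rng = np.random.default_rng(seed)
        N, D = X.shape
        theta0 = 0.5
        theta_pos = rng.uniform(0.25, 0.75, size=D)  # random initialization
        theta_neg = rng.uniform(0.25, 0.75, size=D)
        for _ in range(n_iters):
            # E step: unnormalized p(y = ±1, x_n) under the naive Bayes model
            lik_pos = theta0 * np.prod(theta_pos**X * (1 - theta_pos)**(1 - X), axis=1)
            lik_neg = (1 - theta0) * np.prod(theta_neg**X * (1 - theta_neg)**(1 - X), axis=1)
            q = lik_pos / (lik_pos + lik_neg)        # q(y_n = +1)
            # M step: fractional counts replace the indicators
            theta0 = q.mean()
            theta_pos = (q[:, None] * X).sum(axis=0) / q.sum()
            theta_neg = ((1 - q)[:, None] * X).sum(axis=0) / (1 - q).sum()
        return theta0, theta_pos, theta_neg, q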