Introduction To Machine Learning
David Sontag, New York University
Lecture 21, April 14, 2016
Expectation maximization

Algorithm is as follows:
1. Write down the complete log-likelihood log p(x, z; θ) in such a way that it is linear in z
2. Initialize θ^0, e.g. at random or using a good first guess
3. Repeat until convergence:
   θ^{t+1} = argmax_θ Σ_{m=1}^M E_{p(z^m | x^m; θ^t)} [ log p(x^m, Z; θ) ]

Notice that log p(x^m, Z; θ) is a random function because Z is unknown
By linearity of expectation, the objective decomposes into expectation terms and data terms
The E step corresponds to computing the objective (i.e., the expectations)
The M step corresponds to maximizing the objective
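As a reading aid (not part of the lecture), here is a minimal Python sketch of the outer EM loop above; `compute_posteriors` and `maximize_expected_ll` are hypothetical placeholders for the model-specific E and M steps, which the mixture-model slides below make concrete.

```python
def em(data, theta0, compute_posteriors, maximize_expected_ll, n_iters=100):
    """Generic EM loop: alternate the E step (posteriors over the latent Z
    under the current parameters) and the M step (maximize the expected
    complete log-likelihood)."""
    theta = theta0
    for t in range(n_iters):
        # E step: q[m] = p(z^m | x^m; theta^t) for each data point m
        q = [compute_posteriors(x_m, theta) for x_m in data]
        # M step: theta^{t+1} = argmax_theta sum_m E_q[log p(x^m, Z; theta)]
        theta = maximize_expected_ll(data, q)
    return theta
```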
Derivation of EM algorithm

[Figure (from the EM tutorial by Sean Borman): plot of L(θ) and the lower bound l(θ | θ^n) against θ, marking θ^n and θ^{n+1}.]

Graphical interpretation of a single iteration of the EM algorithm: the function l(θ | θ^n) is bounded above by the likelihood function L(θ). The functions are equal at θ = θ^n. The EM algorithm chooses θ^{n+1} as the value of θ for which l(θ | θ^n) is maximized.
Application to mixture models

[Plate diagram: θ (prior distribution over topics) → z_d (topic of doc d) → w_id (word), with β (topic-word distributions) → w_id; i = 1 to N, d = 1 to D.]

This model is a type of (discrete) mixture model
Called multinomial naive Bayes (a word can appear multiple times)
Document is generated from a single topic
EM for mixture models

[Plate diagram as on the previous slide: θ (prior distribution over topics) → z_d (topic of doc d) → w_id (word), with β (topic-word distributions) → w_id; i = 1 to N, d = 1 to D.]

The complete likelihood is p(w, Z; θ, β) = ∏_{d=1}^D p(w_d, Z_d; θ, β), where
   p(w_d, Z_d; θ, β) = θ_{Z_d} ∏_{i=1}^N β_{Z_d, w_{id}}
Trick #1: re-write this as
   p(w_d, Z_d; θ, β) = ∏_{k=1}^K θ_k^{1[Z_d = k]} ∏_{i=1}^N ∏_{k=1}^K β_{k, w_{id}}^{1[Z_d = k]}
EM for mixture models

Thus, the complete log-likelihood is:
   log p(w, Z; θ, β) = Σ_{d=1}^D ( Σ_{k=1}^K 1[Z_d = k] log θ_k + Σ_{i=1}^N Σ_{k=1}^K 1[Z_d = k] log β_{k, w_{id}} )
In the E step, we take the expectation of the complete log-likelihood with respect to p(z | w; θ^t, β^t), applying linearity of expectation, i.e.
   E_{p(z | w; θ^t, β^t)}[log p(w, z; θ, β)] = Σ_{d=1}^D ( Σ_{k=1}^K p(z_d = k | w; θ^t, β^t) log θ_k + Σ_{i=1}^N Σ_{k=1}^K p(z_d = k | w; θ^t, β^t) log β_{k, w_{id}} )
In the M step, we maximize this with respect to θ and β
EM for mixture models

Just as with complete data, this maximization can be done in closed form
First, re-write the expected complete log-likelihood from
   Σ_{d=1}^D ( Σ_{k=1}^K p(z_d = k | w; θ^t, β^t) log θ_k + Σ_{i=1}^N Σ_{k=1}^K p(z_d = k | w; θ^t, β^t) log β_{k, w_{id}} )
to
   Σ_{k=1}^K log θ_k Σ_{d=1}^D p(z_d = k | w_d; θ^t, β^t) + Σ_{k=1}^K Σ_{w=1}^W log β_{k,w} Σ_{d=1}^D N_{dw} p(z_d = k | w_d; θ^t, β^t)
where N_{dw} is the number of times word w appears in document d
We then have that
   θ_k^{t+1} = Σ_{d=1}^D p(z_d = k | w_d; θ^t, β^t) / Σ_{k̂=1}^K Σ_{d=1}^D p(z_d = k̂ | w_d; θ^t, β^t)
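A minimal NumPy sketch of these EM updates for the multinomial naive Bayes mixture, assuming the corpus is given as a document-word count matrix N_dw; the variable names and the small smoothing constant are ours, not from the lecture.

```python
import numpy as np

def em_multinomial_mixture(N_dw, K, n_iters=100, seed=0):
    """EM for the multinomial naive Bayes mixture on these slides.
    N_dw: (D, W) array with N_dw[d, w] = count of word w in document d.
    Returns the topic prior theta (K,), topic-word distributions beta (K, W),
    and the per-document posteriors q (D, K)."""
    rng = np.random.default_rng(seed)
    D, W = N_dw.shape
    theta = np.full(K, 1.0 / K)                # prior over topics
    beta = rng.dirichlet(np.ones(W), size=K)   # topic-word distributions

    for _ in range(n_iters):
        # E step: q[d, k] = p(z_d = k | w_d; theta^t, beta^t), computed in log space
        log_post = np.log(theta)[None, :] + N_dw @ np.log(beta).T   # (D, K)
        log_post -= log_post.max(axis=1, keepdims=True)
        q = np.exp(log_post)
        q /= q.sum(axis=1, keepdims=True)

        # M step: closed-form maximizers of the expected complete log-likelihood
        theta = q.sum(axis=0) / D                 # theta_k^{t+1} = (1/D) sum_d q[d, k]
        beta = q.T @ N_dw + 1e-12                 # beta_{k,w} proportional to sum_d N_dw q[d, k]
        beta /= beta.sum(axis=1, keepdims=True)

    return theta, beta, q
```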
Latent Dirichlet allocation (LDA)

Topic models are powerful tools for exploring large data sets and for making inferences about the content of documents
[Figure (from a poster by David Sontag and Daniel Roy, "Complexity of Inference in Latent Dirichlet Allocation"): documents mapped to topics β_t = p(w | z = t), e.g. a "politics" topic (president, obama, washington, religion), a "religion" topic (hindu, judaism, ethics, buddhism), and a "sports" topic (baseball, soccer, basketball, football); a new document's words w_1, ..., w_N are mapped to a distribution over topics, e.g. weather .50, finance .49, sports .01, answering "What is this document about?"]
Many applications in information retrieval, document summarization, and classification
LDA is one of the simplest and most widely used topic models
Generative model for a document in LDA

1. Sample the document's topic distribution θ (aka topic vector):
   θ ~ Dirichlet(α_{1:T})
   where the {α_t}_{t=1}^T are fixed hyperparameters. Thus θ is a distribution over T topics with mean E[θ_t] = α_t / Σ_{t'} α_{t'}
2. For i = 1 to N, sample the topic z_i of the i-th word:
   z_i | θ ~ θ
3. ... and then sample the actual word w_i from the z_i-th topic:
   w_i | z_i ~ β_{z_i}
   where {β_t}_{t=1}^T are the topics (a fixed collection of distributions on words)
Generative model for a document in LDA

1. Sample the document's topic distribution θ (aka topic vector):
   θ ~ Dirichlet(α_{1:T})
   where the {α_t}_{t=1}^T are hyperparameters. The Dirichlet density, defined over the simplex Δ = { θ ∈ R^T : θ_t ≥ 0, Σ_{t=1}^T θ_t = 1 }, is:
   p(θ_1, ..., θ_T) ∝ ∏_{t=1}^T θ_t^{α_t - 1}

[Figure: for T = 3 (with θ_3 = 1 - θ_1 - θ_2), surface plots of log Pr(θ) over (θ_1, θ_2) for two different settings of α_1 = α_2 = α_3.]
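A quick numerical illustration (ours, not from the lecture) of how the Dirichlet hyperparameters shape θ: small symmetric α puts mass near the corners of the simplex (sparse topic vectors), while large α concentrates it near the uniform distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
for alpha in (0.1, 1.0, 10.0):
    # Five draws of theta = (theta_1, theta_2, theta_3) with alpha_1 = alpha_2 = alpha_3 = alpha
    samples = rng.dirichlet(alpha * np.ones(3), size=5)
    print(f"alpha = {alpha}:\n{np.round(samples, 3)}")
```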
Generative model for a document in LDA

3. ... and then sample the actual word w_i from the z_i-th topic:
   w_i | z_i ~ β_{z_i}
   where {β_t}_{t=1}^T are the topics (a fixed collection of distributions on words)

[Figure (as on the earlier LDA slide): example topics β_t = p(w | z = t), e.g. "politics", "religion", and "sports" topics with their highest-probability words.]
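Putting the three steps together, here is a minimal Python sketch of the LDA generative process; the toy topics and hyperparameters in the usage example are made up for illustration.

```python
import numpy as np

def sample_document_lda(alpha, beta, N, rng):
    """Generate one document of N words from LDA.
    alpha: (T,) Dirichlet hyperparameters; beta: (T, W) topic-word distributions."""
    theta = rng.dirichlet(alpha)                         # 1. topic vector for this document
    z = rng.choice(len(alpha), size=N, p=theta)          # 2. topic of each word
    w = np.array([rng.choice(beta.shape[1], p=beta[t])   # 3. word drawn from its topic
                  for t in z])
    return theta, z, w

# Toy usage: T = 2 topics over a W = 4 word vocabulary (made-up numbers)
rng = np.random.default_rng(0)
beta = np.array([[0.70, 0.20, 0.05, 0.05],
                 [0.05, 0.05, 0.20, 0.70]])
theta, z, w = sample_document_lda(np.array([1.0, 1.0]), beta, N=10, rng=rng)
print(theta, z, w)
```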
Example of using LDA

[Figure (from Blei, "Introduction to Probabilistic Topic Models", 2011): topics β_1, ..., β_T (e.g. "gene, dna, genetic", "life, evolve, organism", "brain, neuron, nerve", "data, number, computer"), a document with its per-word topic assignments z_1d, ..., z_Nd highlighted, and the document's topic proportions θ_d.]
Plate notation for LDA model

[Plate diagram: α (Dirichlet hyperparameters) → θ_d (topic distribution for document d) → z_id (topic of word i of doc d) → w_id (word), with β (topic-word distributions) → w_id; i = 1 to N, d = 1 to D.]

Variables within a plate are replicated in a conditionally independent manner
Comparison of mixture and admixture models

[Plate diagrams: left, the mixture model (θ, prior distribution over topics → z_d, topic of doc d → w_id, word; β, topic-word distributions → w_id); right, LDA (α, Dirichlet hyperparameters → θ_d, topic distribution for document d → z_id, topic of word i of doc d → w_id, word; β, topic-word distributions → w_id); i = 1 to N, d = 1 to D in both.]

Model on left is a mixture model
Called multinomial naive Bayes (a word can appear multiple times)
Document is generated from a single topic
Model on right (LDA) is an admixture model
Document is generated from a distribution over topics
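To make the contrast explicit, here is a small sketch (ours, not from the lecture) of the two generative processes side by side: the mixture model draws one topic per document, while the admixture model (LDA) draws one topic per word from a document-specific topic distribution.

```python
import numpy as np

def sample_doc_mixture(theta, beta, N, rng):
    """Mixture model (multinomial naive Bayes): a single topic per document."""
    z_d = rng.choice(len(theta), p=theta)                  # one topic for the whole doc
    return rng.choice(beta.shape[1], size=N, p=beta[z_d])  # all N words from that topic

def sample_doc_admixture(alpha, beta, N, rng):
    """Admixture model (LDA): per-document topic distribution, one topic per word."""
    theta_d = rng.dirichlet(alpha)                         # document-specific topic vector
    z = rng.choice(len(alpha), size=N, p=theta_d)          # one topic per word
    return np.array([rng.choice(beta.shape[1], p=beta[t]) for t in z])
```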