Introduction To Machine Learning


Introduction To Machine Learning
David Sontag, New York University
Lecture 21, April 14, 2016

Expectation maximization

The algorithm is as follows:

1. Write down the complete log-likelihood log p(x, z; θ) in such a way that it is linear in z.
2. Initialize θ^0, e.g. at random or using a good first guess.
3. Repeat until convergence:

   \theta^{t+1} = \arg\max_{\theta} \sum_{m=1}^{M} \mathbb{E}_{p(z_m \mid x_m; \theta^t)}\big[\log p(x_m, Z_m; \theta)\big]

Notice that log p(x_m, Z_m; θ) is a random function because Z_m is unknown. By linearity of expectation, the objective decomposes into expectation terms and data terms. The E step corresponds to computing the objective (i.e., the expectations); the M step corresponds to maximizing the objective.
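Concretely, when the latent variable z_m takes one of K discrete values, linearity of expectation turns each term of the objective into a weighted sum of complete-data log-likelihoods (this expansion is implicit in the slide, not written on it):

\mathbb{E}_{p(z_m \mid x_m; \theta^t)}\big[\log p(x_m, Z_m; \theta)\big] = \sum_{k=1}^{K} p(z_m = k \mid x_m; \theta^t)\, \log p(x_m, z_m = k; \theta)

so the E step amounts to computing the posterior weights p(z_m = k | x_m; θ^t), and the M step maximizes the resulting weighted complete-data log-likelihood.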

Derivation of EM algorithm

[Figure, from the EM tutorial by Sean Borman: graphical interpretation of a single iteration of the EM algorithm. The function l(θ | θ_n) is bounded above by the likelihood function L(θ), and the two functions are equal at θ = θ_n. The EM algorithm chooses θ_{n+1} as the value of θ for which l(θ | θ_n) is a maximum; since L(θ) ≥ l(θ | θ_n), this guarantees L(θ_{n+1}) ≥ L(θ_n).]
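For completeness, one standard way to construct the bound l(θ | θ_n) shown in the figure, writing L(θ) = log p(x; θ) (this derivation is only sketched here; it follows Jensen's inequality and matches the update on the previous slide, but is not spelled out in the transcript):

L(\theta) = \log \sum_{z} p(x, z; \theta) = \log \sum_{z} p(z \mid x; \theta_n)\, \frac{p(x, z; \theta)}{p(z \mid x; \theta_n)} \;\ge\; \sum_{z} p(z \mid x; \theta_n) \log \frac{p(x, z; \theta)}{p(z \mid x; \theta_n)} \;=:\; l(\theta \mid \theta_n)

with equality at θ = θ_n. Maximizing l(θ | θ_n) over θ is equivalent to maximizing \sum_z p(z \mid x; \theta_n) \log p(x, z; \theta), which is exactly the M-step objective.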

Application to mixture models

Graphical model (variables as on the slide): θ: prior distribution over topics; β: topic-word distributions; z_d: topic of doc d; w_id: word i of doc d, for i = 1 to N and d = 1 to D.

This model is a type of (discrete) mixture model, called multinomial naive Bayes (a word can appear multiple times). The document is generated from a single topic.

EM for mixture models

(Same model as above: θ: prior distribution over topics; β: topic-word distributions; z_d: topic of doc d; w_id: word i of doc d.)

The complete likelihood is p(w, Z; \theta, \beta) = \prod_{d=1}^{D} p(w_d, Z_d; \theta, \beta), where

p(w_d, Z_d; \theta, \beta) = \theta_{Z_d} \prod_{i=1}^{N} \beta_{Z_d,\, w_{id}}

Trick #1: re-write this as

p(w_d, Z_d; \theta, \beta) = \prod_{k=1}^{K} \theta_k^{1[Z_d = k]} \prod_{i=1}^{N} \prod_{k=1}^{K} \beta_{k,\, w_{id}}^{1[Z_d = k]}

EM for mixture models

Thus, the complete log-likelihood is:

\log p(w, Z; \theta, \beta) = \sum_{d=1}^{D} \left( \sum_{k=1}^{K} 1[Z_d = k] \log \theta_k + \sum_{i=1}^{N} \sum_{k=1}^{K} 1[Z_d = k] \log \beta_{k,\, w_{id}} \right)

In the E step, we take the expectation of the complete log-likelihood with respect to p(z | w; θ^t, β^t), applying linearity of expectation, i.e.

\mathbb{E}_{p(z \mid w; \theta^t, \beta^t)}[\log p(w, z; \theta, \beta)] = \sum_{d=1}^{D} \left( \sum_{k=1}^{K} p(z_d = k \mid w; \theta^t, \beta^t) \log \theta_k + \sum_{i=1}^{N} \sum_{k=1}^{K} p(z_d = k \mid w; \theta^t, \beta^t) \log \beta_{k,\, w_{id}} \right)

In the M step, we maximize this with respect to θ and β.
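The E step therefore requires the per-document posteriors p(z_d = k | w; θ^t, β^t). The slide does not write this step out, but it follows directly from Bayes' rule applied to the complete likelihood above:

p(z_d = k \mid w_d; \theta^t, \beta^t) = \frac{\theta^t_k \prod_{i=1}^{N} \beta^t_{k,\, w_{id}}}{\sum_{k'=1}^{K} \theta^t_{k'} \prod_{i=1}^{N} \beta^t_{k',\, w_{id}}}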

EM for mixture models

Just as with complete data, this maximization can be done in closed form. First, re-write the expected complete log-likelihood from

\sum_{d=1}^{D} \left( \sum_{k=1}^{K} p(z_d = k \mid w; \theta^t, \beta^t) \log \theta_k + \sum_{i=1}^{N} \sum_{k=1}^{K} p(z_d = k \mid w; \theta^t, \beta^t) \log \beta_{k,\, w_{id}} \right)

to

\sum_{k=1}^{K} \log \theta_k \sum_{d=1}^{D} p(z_d = k \mid w_d; \theta^t, \beta^t) + \sum_{k=1}^{K} \sum_{w=1}^{W} \log \beta_{k, w} \sum_{d=1}^{D} N_{dw}\, p(z_d = k \mid w_d; \theta^t, \beta^t)

where W is the vocabulary size and N_{dw} is the number of times word w appears in document d. We then have that

\theta^{t+1}_k = \frac{\sum_{d=1}^{D} p(z_d = k \mid w_d; \theta^t, \beta^t)}{\sum_{\hat{k}=1}^{K} \sum_{d=1}^{D} p(z_d = \hat{k} \mid w_d; \theta^t, \beta^t)}
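A minimal NumPy sketch of these E and M steps for the multinomial naive Bayes mixture (the function name and interface are illustrative, not from the lecture; documents are assumed to be given as a D × W matrix of word counts N_{dw}):

```python
import numpy as np

def em_multinomial_mixture(N, K, n_iters=100, seed=0):
    """EM for a multinomial naive Bayes mixture.

    N: (D, W) array of word counts, N[d, w] = count of word w in document d.
    K: number of topics / mixture components.
    Returns (theta, beta, resp), where resp[d, k] = p(z_d = k | w_d; theta, beta).
    """
    rng = np.random.default_rng(seed)
    D, W = N.shape
    theta = np.full(K, 1.0 / K)               # prior over topics
    beta = rng.dirichlet(np.ones(W), size=K)  # topic-word distributions, shape (K, W)

    for _ in range(n_iters):
        # E step: log p(z_d = k, w_d) = log theta_k + sum_w N_dw log beta_kw
        log_joint = np.log(theta)[None, :] + N @ np.log(beta).T   # shape (D, K)
        log_joint -= log_joint.max(axis=1, keepdims=True)         # numerical stability
        resp = np.exp(log_joint)
        resp /= resp.sum(axis=1, keepdims=True)                   # p(z_d = k | w_d)

        # M step: closed-form updates from the expected complete log-likelihood
        theta = resp.sum(axis=0) / D          # theta_k^{t+1} as on the slide
        beta = resp.T @ N                     # expected word counts per topic
        beta /= beta.sum(axis=1, keepdims=True)

    return theta, beta, resp
```

In practice one would add a small pseudocount when normalizing beta to avoid zero probabilities for words never assigned to a topic; that detail is not part of the lecture's derivation.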

Latent Dirichlet allocation (LDA)

Topic models are powerful tools for exploring large data sets and for making inferences about the content of documents.

[Figure, from the poster "Complexity of Inference in Latent Dirichlet Allocation" (David Sontag, Daniel Roy; NYU, Cambridge): example topics β_t = p(w | z = t), e.g. a politics topic (politics .0100, president .0095, obama .0090, washington .0085, religion .0060), a religion topic (religion .0500, hindu .0092, judaism .0080, ethics .0075, buddhism .0016), and a sports topic (sports .0105, baseball .0100, soccer .0055, basketball .0050, football .0045). Almost all uses of topic models (e.g., for unsupervised learning, information retrieval, classification) require probabilistic inference: given a new document with words w_1, ..., w_N, what is this document about, i.e., what is its distribution over topics (e.g., weather .50, finance .49, sports .01)?]

Many applications in information retrieval, document summarization, and classification.

LDA is one of the simplest and most widely used topic models.

Generative model for a document in LDA

1. Sample the document's topic distribution θ (aka topic vector):

   \theta \sim \mathrm{Dirichlet}(\alpha_{1:T})

   where the {α_t}_{t=1}^T are fixed hyperparameters. Thus θ is a distribution over T topics with mean \bar{\theta}_t = \alpha_t / \sum_{t'} \alpha_{t'}.

2. For i = 1 to N, sample the topic z_i of the i-th word:

   z_i \mid \theta \sim \theta

3. ... and then sample the actual word w_i from the z_i-th topic:

   w_i \mid z_i \sim \beta_{z_i}

   where {β_t}_{t=1}^T are the topics (a fixed collection of distributions on words).
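A minimal NumPy sketch of this three-step generative process (the function name and the toy vocabulary and parameter values are illustrative, not from the lecture):

```python
import numpy as np

def sample_lda_document(alpha, beta, vocab, N, rng=None):
    """Sample one document from the LDA generative model.

    alpha: (T,) Dirichlet hyperparameters.
    beta:  (T, W) topic-word distributions (each row sums to 1).
    vocab: list of W word strings.
    N:     number of words to generate.
    """
    rng = rng or np.random.default_rng()
    theta = rng.dirichlet(alpha)               # step 1: this document's topic distribution
    topics, words = [], []
    for _ in range(N):
        z = rng.choice(len(alpha), p=theta)    # step 2: topic of word i
        w = rng.choice(len(vocab), p=beta[z])  # step 3: word drawn from topic z
        topics.append(z)
        words.append(vocab[w])
    return theta, topics, words

# Toy example with T = 2 topics over a 4-word vocabulary (numbers are made up).
vocab = ["game", "team", "election", "vote"]
beta = np.array([[0.45, 0.45, 0.05, 0.05],    # a "sports"-like topic
                 [0.05, 0.05, 0.45, 0.45]])   # a "politics"-like topic
theta, topics, words = sample_lda_document(np.array([0.5, 0.5]), beta, vocab, N=10)
print(np.round(theta, 2), words)
```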

Generative model for a document in LDA

1. Sample the document's topic distribution θ (aka topic vector):

   \theta \sim \mathrm{Dirichlet}(\alpha_{1:T})

   where the {α_t}_{t=1}^T are hyperparameters. The Dirichlet density, defined over the simplex \Delta = \{ \theta \in \mathbb{R}^T : \theta_t \ge 0 \;\forall t, \; \sum_{t=1}^T \theta_t = 1 \}, is:

   p(\theta_1, \ldots, \theta_T) \propto \prod_{t=1}^{T} \theta_t^{\alpha_t - 1}

   For example, for T = 3 (so θ_3 = 1 − θ_1 − θ_2):

   [Figure: two surface plots of log Pr(θ) over (θ_1, θ_2), for two different settings of α_1 = α_2 = α_3.]
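To get a feel for how the hyperparameters shape the density plotted on the slide, a quick sketch (the α values below are chosen for illustration): a small symmetric α concentrates mass near the corners of the simplex (sparse topic vectors), while a large α concentrates it near the uniform vector.

```python
import numpy as np

rng = np.random.default_rng(0)
for a in (0.1, 1.0, 10.0):                 # symmetric alpha_1 = alpha_2 = alpha_3 = a
    samples = rng.dirichlet(np.full(3, a), size=5)
    print(f"alpha = {a}:")
    print(np.round(samples, 2))            # each row is one draw (theta_1, theta_2, theta_3)
```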

Generative model for a document in LDA

3. ... and then sample the actual word w_i from the z_i-th topic:

   w_i \mid z_i \sim \beta_{z_i}

   where {β_t}_{t=1}^T are the topics (a fixed collection of distributions on words).

[Figure: the example topics β_t = p(w | z = t) from the earlier LDA slide (the politics, religion, and sports topics) and the inference question for a new document: "What is this document about?"]

Example of using LDA

[Figure, from Blei, "Introduction to Probabilistic Topic Models", 2011: topics β_1, ..., β_T with top words such as gene 0.04, dna 0.02, genetic 0.01; life 0.02, evolve 0.01, organism 0.01; brain 0.04, neuron 0.02, nerve 0.01; and data 0.02, number 0.02, computer 0.01; a document whose words are assigned to topics z_{1d}, ..., z_{Nd}; and the document's topic proportions θ_d.]

Plate notation for LDA model

Variables: α: Dirichlet hyperparameters; θ_d: topic distribution for document d; β: topic-word distributions; z_id: topic of word i of doc d; w_id: word i of doc d, for i = 1 to N and d = 1 to D.

Variables within a plate are replicated in a conditionally independent manner.
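The plate diagram compactly encodes the following joint distribution over a corpus (this factorization is not written on the slide, but follows from the generative process and the replication semantics of plates, treating β as fixed):

p(\theta_{1:D}, z, w \mid \alpha, \beta) = \prod_{d=1}^{D} p(\theta_d \mid \alpha) \prod_{i=1}^{N} p(z_{id} \mid \theta_d)\, p(w_{id} \mid \beta_{z_{id}})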

Comparison of mixture and admixture models

Left model: θ: prior distribution over topics; β: topic-word distributions; z_d: topic of doc d; w_id: word i of doc d (i = 1 to N, d = 1 to D).

Right model: α: Dirichlet hyperparameters; θ_d: topic distribution for document d; β: topic-word distributions; z_id: topic of word i of doc d; w_id: word i of doc d (i = 1 to N, d = 1 to D).

The model on the left is a mixture model, called multinomial naive Bayes (a word can appear multiple times); the document is generated from a single topic. The model on the right (LDA) is an admixture model; the document is generated from a distribution over topics.
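One way to make the mixture vs. admixture distinction precise (not written on the slide, but implied by the two graphical models, with β treated as fixed) is to compare the marginal likelihood of a document under each model:

Mixture (multinomial naive Bayes): p(w_d \mid \theta, \beta) = \sum_{k=1}^{K} \theta_k \prod_{i=1}^{N} \beta_{k,\, w_{id}}

Admixture (LDA): p(w_d \mid \alpha, \beta) = \int_{\Delta} p(\theta_d \mid \alpha) \prod_{i=1}^{N} \left( \sum_{k=1}^{K} \theta_{d,k}\, \beta_{k,\, w_{id}} \right) d\theta_d

In the mixture, one topic is shared by every word in the document; in the admixture, each word draws its own topic from the document-specific distribution θ_d.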