LDA with Amortized Inference

Nanbo Sun

Abstract

This report describes how to frame Latent Dirichlet Allocation (LDA) as a Variational Auto-Encoder (VAE) and use Amortized Variational Inference (AVI) to optimize it.

1. We introduce LDA and use Mean Field Variational Inference (MFVI) to optimize it.
2. We collapse the topic assignments z in the LDA model, because we cannot backpropagate through a categorical variable.
3. We introduce the Gaussian VAE with AVI.
4. We frame the collapsed LDA as a VAE and do AVI.

1 Prerequisite

Before reading this report, you must read the following papers carefully.

1. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
2. Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
3. Figurnov, M., Mohamed, S., & Mnih, A. (2018). Implicit reparameterization gradients. arXiv preprint arXiv:1805.08498.

2 LDA with MFVI

2.1 Generative Process

For document d with K topics and V unique words,

1. draw a topic mixture θ ~ Dir(α), where θ is a K-vector and Σ_k θ_k = 1;
2. for each of the N_d word counts, independently
   (a) draw a topic z_n ~ Mult(θ);
   (b) draw a word w_n ~ p(w_n | z_n, β), where β is a K × V matrix, each row of which defines a multinomial distribution over the vocabulary; β_ij is the probability that word j appears given topic i.
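To make the generative process above concrete, here is a minimal numpy sketch that samples one document. The sizes K, V, N_d and the prior α are made-up toy values, not anything from the report.

    import numpy as np

    rng = np.random.default_rng(0)

    K, V, N_d = 3, 10, 20                       # toy sizes: topics, vocabulary, words in document d
    alpha = np.full(K, 0.5)                     # symmetric Dirichlet prior
    beta = rng.dirichlet(np.ones(V), size=K)    # K x V topic-word matrix, each row sums to 1

    theta = rng.dirichlet(alpha)                # 1. topic mixture for the document
    words = []
    for n in range(N_d):
        z_n = rng.choice(K, p=theta)            # 2a. topic assignment z_n ~ Mult(theta)
        w_n = rng.choice(V, p=beta[z_n])        # 2b. word w_n ~ Mult(beta_{z_n})
        words.append(w_n)
    print(words)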

Figure 1: LDA model (α → θ_d → z_{d,n} → w_{d,n} ← β, with plates over the N_d words and the D documents).

Figure 2: Variational distribution used to approximate the posterior in LDA (γ_d → θ_d and φ_{d,n} → z_{d,n}, with plates over the N_d words and the D documents).

2.2 Constructing the Lower Bound

From Figure 2, the variational distribution used to approximate the true posterior factorizes as

q(θ, z | γ, φ) = q(θ | γ) ∏_{n=1}^N q(z_n | φ_n).

The lower bound L(γ, φ; α, β) of the single-document log likelihood log p(w | α, β) (this is also why the document subscript is dropped for simplicity hereafter) is computed using Jensen's inequality as follows:

log p(w | α, β) = log ∫ Σ_z p(θ, z, w | α, β) dθ
= log ∫ Σ_z p(θ, z, w | α, β) q(θ, z | γ, φ) / q(θ, z | γ, φ) dθ
= log E_q[ p(θ, z, w | α, β) / q(θ, z | γ, φ) ]
≥ E_q[ log p(θ, z, w | α, β) ] − E_q[ log q(θ, z | γ, φ) ]
≜ L(γ, φ; α, β).    (1)

The difference between the log likelihood and its lower bound can be shown to be exactly the KL divergence between the variational distribution and the true posterior:

log p(w | α, β) − L(γ, φ; α, β)
= E_q[ log p(w | α, β) ] − E_q[ log p(θ, z, w | α, β) ] + E_q[ log q(θ, z | γ, φ) ]
= E_q[ log ( p(w | α, β) q(θ, z | γ, φ) / p(θ, z, w | α, β) ) ]
= E_q[ log ( q(θ, z | γ, φ) / p(θ, z | w, α, β) ) ]
= D_KL( q(θ, z | γ, φ) ‖ p(θ, z | w, α, β) ).

Therefore, maximizing the lower bound is equivalent to minimizing the KL divergence. That is, the variational distribution automatically approaches the true posterior as we maximize the lower bound.

2.3 Expanding the Lower Bound

To maximize the lower bound, we first need to spell out the lower bound L(γ, φ; α, β) in terms of the model parameters α, β and the variational parameters γ, φ. Continuing from (1), we have

L(γ, φ; α, β) = E_q[ log p(θ, z, w | α, β) ] − E_q[ log q(θ, z | γ, φ) ]
= E_q[ log ( p(θ | α) p(z | θ) p(w | z, β) / ( q(θ | γ) q(z | φ) ) ) ]
= E_q[ log p(θ | α) ] + E_q[ log p(z | θ) ] + E_q[ log p(w | z, β) ] − E_q[ log q(θ | γ) ] − E_q[ log q(z | φ) ].    (2)

We now further expand each of the five terms in (2). The first term is

E_q[ log p(θ | α) ] = E_q[ log ( Γ(Σ_i α_i) / ∏_k Γ(α_k) · ∏_k θ_k^{α_k − 1} ) ]
= Σ_k (α_k − 1) E_q[ log θ_k ] + log Γ(Σ_i α_i) − Σ_k log Γ(α_k)
= Σ_k (α_k − 1) ( Ψ(γ_k) − Ψ(Σ_i γ_i) ) + log Γ(Σ_i α_i) − Σ_k log Γ(α_k),

where Ψ is the digamma function, the first derivative of the log Gamma function. The final line is due to the following property of the Dirichlet distribution as a member of the exponential family: if θ ~ Dir(α), then E_{p(θ|α)}[ log θ_i ] = Ψ(α_i) − Ψ(Σ_j α_j).

The second term is

E_q[ log p(z | θ) ] = E_q[ log ∏_{n=1}^N p(z_n | θ) ]
= E_q[ log ∏_{n=1}^N ∏_k θ_k^{1(z_n = k)} ]
= Σ_{n=1}^N Σ_k E_q[ 1(z_n = k) ] E_q[ log θ_k ]
= Σ_{n=1}^N Σ_k φ_{n,k} ( Ψ(γ_k) − Ψ(Σ_i γ_i) ),

where φ_{n,k} is the probability of the nth word being produced by topic k, and 1(·) is the indicator function.

We expand the third term as

E_q[ log p(w | z, β) ] = E_q[ log ∏_{n=1}^N p(w_n | z_n, β) ]
= E_q[ log ∏_{n=1}^N ∏_k ∏_{v=1}^V β_{k,v}^{1(z_n = k) 1(w_n = v)} ]
= Σ_{n=1}^N Σ_k Σ_{v=1}^V E_q[ 1(z_n = k) ] 1(w_n = v) log β_{k,v}
= Σ_{n=1}^N Σ_k Σ_{v=1}^V φ_{n,k} 1(w_n = v) log β_{k,v}.

Very similar to the first term, the fourth term is expanded as

E_q[ log q(θ | γ) ] = Σ_k (γ_k − 1) ( Ψ(γ_k) − Ψ(Σ_i γ_i) ) + log Γ(Σ_k γ_k) − Σ_k log Γ(γ_k).

Finally, the fifth term is expanded as

E_q[ log q(z | φ) ] = E_q[ log ∏_{n=1}^N q(z_n | φ_n) ]
= E_q[ log ∏_{n=1}^N ∏_k φ_{n,k}^{1(z_n = k)} ]
= Σ_{n=1}^N Σ_k E_q[ 1(z_n = k) ] log φ_{n,k}
= Σ_{n=1}^N Σ_k φ_{n,k} log φ_{n,k}.
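The expansions above use the Dirichlet identity E_q[ log θ_k ] = Ψ(γ_k) − Ψ(Σ_i γ_i) repeatedly. A minimal Monte Carlo check of that identity, assuming numpy and scipy are available and using a made-up γ:

    import numpy as np
    from scipy.special import digamma

    rng = np.random.default_rng(0)
    gamma = np.array([2.0, 5.0, 1.0])             # toy variational Dirichlet parameter

    analytic = digamma(gamma) - digamma(gamma.sum())
    theta = rng.dirichlet(gamma, size=200_000)    # samples from Dir(gamma)
    monte_carlo = np.log(theta).mean(axis=0)

    print(analytic)                               # exact E[log theta_k]
    print(monte_carlo)                            # should agree to a few decimal places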

Putting the five terms together, the fully expanded lower bound is

L(γ, φ; α, β) = Σ_k (α_k − 1) ( Ψ(γ_k) − Ψ(Σ_i γ_i) ) + log Γ(Σ_i α_i) − Σ_k log Γ(α_k)
+ Σ_{n=1}^N Σ_k φ_{n,k} ( Ψ(γ_k) − Ψ(Σ_i γ_i) )
+ Σ_{n=1}^N Σ_k Σ_{v=1}^V φ_{n,k} 1(w_n = v) log β_{k,v}    (3)
− Σ_k (γ_k − 1) ( Ψ(γ_k) − Ψ(Σ_i γ_i) ) − log Γ(Σ_k γ_k) + Σ_k log Γ(γ_k)
− Σ_{n=1}^N Σ_k φ_{n,k} log φ_{n,k}.

2.4 Maximizing the Lower Bound

In this section, we maximize the lower bound w.r.t. the variational parameters φ and γ. Recall that as the maximization runs, the KL divergence between the variational distribution and the true posterior drops (this is the E-step of the variational EM algorithm).

2.4.1 Variational Multinomial

We first maximize Equation (3) w.r.t. φ_{n,k}. Since Σ_k φ_{n,k} = 1, this is a constrained optimization problem that can be solved by the Lagrange multiplier method. The Lagrangian w.r.t. φ_{n,k} is

L_[φ_{n,k}] = φ_{n,k} ( Ψ(γ_k) − Ψ(Σ_i γ_i) ) + φ_{n,k} log β_{k,v} − φ_{n,k} log φ_{n,k} + λ_n ( Σ_i φ_{n,i} − 1 ),

where λ_n is the Lagrange multiplier (v is the index of the observed word w_n). Taking the derivative, we get

∂L_[φ_{n,k}] / ∂φ_{n,k} = Ψ(γ_k) − Ψ(Σ_i γ_i) + log β_{k,v} − log φ_{n,k} − 1 + λ_n.

Setting this derivative to zero yields

φ_{n,k} = β_{k,v} exp( Ψ(γ_k) − Ψ(Σ_i γ_i) + λ_n − 1 ) ∝ β_{k,v} exp( Ψ(γ_k) − Ψ(Σ_i γ_i) ).

2.4.2 Variational Dirichlet

Now we maximize Equation (3) w.r.t. γ_k, the kth component of the Dirichlet parameter. Only the terms containing γ_k are retained:

L_[γ] = Σ_k (α_k − 1) ( Ψ(γ_k) − Ψ(Σ_i γ_i) )
+ Σ_{n=1}^N Σ_k φ_{n,k} ( Ψ(γ_k) − Ψ(Σ_i γ_i) )
− Σ_k (γ_k − 1) ( Ψ(γ_k) − Ψ(Σ_i γ_i) ) − log Γ(Σ_i γ_i) + Σ_k log Γ(γ_k).

Taking the derivative w.r.t. γ_k, we have

∂L_[γ] / ∂γ_k = (α_k − 1) ( Ψ′(γ_k) − Ψ′(Σ_i γ_i) )
+ Σ_{n=1}^N φ_{n,k} ( Ψ′(γ_k) − Ψ′(Σ_i γ_i) )
− ( Ψ(γ_k) − Ψ(Σ_i γ_i) ) − (γ_k − 1) ( Ψ′(γ_k) − Ψ′(Σ_i γ_i) )
− Ψ(Σ_i γ_i) + Ψ(γ_k)
= ( Ψ′(γ_k) − Ψ′(Σ_i γ_i) ) ( α_k + Σ_{n=1}^N φ_{n,k} − γ_k ).

Setting it to zero, we have

γ_k = α_k + Σ_{n=1}^N φ_{n,k}.

2.5 Estimating Model Parameters

The previous section is the E-step of the variational EM algorithm, where we tighten the lower bound w.r.t. the variational parameters; this section is the M-step, where we maximize the lower bound w.r.t. the model parameters α and β. Now we add back the document subscript to consider the whole corpus. By the assumed exchangeability of the documents, the overall log likelihood of the corpus is just the sum of all the documents' log likelihoods, and the overall lower bound is just the sum of the individual lower bounds.

Again, only the terms involving β are left in the overall lower bound. Adding the Lagrange multipliers λ_k for the constraints Σ_v β_{k,v} = 1, we obtain

L_[β] = Σ_{d=1}^D Σ_{n=1}^{N_d} Σ_k Σ_{v=1}^V φ_{d,n,k} 1(w_{d,n} = v) log β_{k,v} + Σ_k λ_k ( Σ_{v=1}^V β_{k,v} − 1 ).

Taking the derivative w.r.t. β_{k,v} and setting it to zero, we have

∂L_[β] / ∂β_{k,v} = Σ_{d=1}^D Σ_{n=1}^{N_d} φ_{d,n,k} 1(w_{d,n} = v) (1 / β_{k,v}) + λ_k = 0
β_{k,v} = −(1 / λ_k) Σ_{d=1}^D Σ_{n=1}^{N_d} φ_{d,n,k} 1(w_{d,n} = v)
∝ Σ_{d=1}^D Σ_{n=1}^{N_d} φ_{d,n,k} 1(w_{d,n} = v).

Similarly, for α, we have

L_[α] = Σ_{d=1}^D [ Σ_k (α_k − 1) ( Ψ(γ_{d,k}) − Ψ(Σ_i γ_{d,i}) ) + log Γ(Σ_i α_i) − Σ_k log Γ(α_k) ]
∂L_[α] / ∂α_k = Σ_{d=1}^D [ Ψ(γ_{d,k}) − Ψ(Σ_i γ_{d,i}) + Ψ(Σ_i α_i) − Ψ(α_k) ]
= Σ_{d=1}^D ( Ψ(γ_{d,k}) − Ψ(Σ_i γ_{d,i}) ) + D ( Ψ(Σ_i α_i) − Ψ(α_k) ).

Since the derivative also depends on the other α_{k′}, k′ ≠ k, we compute the Hessian

∂²L_[α] / ∂α_k ∂α_{k′} = D Ψ′(Σ_i α_i) − D δ(k, k′) Ψ′(α_k),

and notice that its form allows for the linear-time Newton-Raphson algorithm.
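Putting the E-step updates for φ_{n,k} and γ_k from Section 2.4 together with the M-step update for β gives the variational EM loop. The following is a minimal numpy sketch on a toy corpus of word-id lists; α is held fixed rather than updated by Newton-Raphson, and the function and variable names are illustrative only.

    import numpy as np
    from scipy.special import digamma

    def variational_em(docs, K, V, alpha=0.1, n_em=50, n_e=20, seed=0):
        rng = np.random.default_rng(seed)
        beta = rng.dirichlet(np.ones(V), size=K)              # K x V topic-word probabilities
        for _ in range(n_em):
            beta_acc = np.zeros((K, V))
            for words in docs:                                 # E-step, one document at a time
                N = len(words)
                phi = np.full((N, K), 1.0 / K)
                gamma = alpha + (N / K) * np.ones(K)
                for _ in range(n_e):                           # coordinate ascent on phi and gamma
                    e_log_theta = digamma(gamma) - digamma(gamma.sum())
                    phi = beta[:, words].T * np.exp(e_log_theta)   # phi_{n,k} prop. to beta_{k,w_n} exp(E[log theta_k])
                    phi /= phi.sum(axis=1, keepdims=True)
                    gamma = alpha + phi.sum(axis=0)            # gamma_k = alpha_k + sum_n phi_{n,k}
                for n, w in enumerate(words):                  # sufficient statistics for beta
                    beta_acc[:, w] += phi[n]
            beta = beta_acc / beta_acc.sum(axis=1, keepdims=True)  # M-step: beta_{k,v} prop. to sum_{d,n} phi_{d,n,k} 1(w_{d,n}=v)
        return beta

    # toy corpus: word ids from a vocabulary of size 6
    docs = [[0, 0, 1, 2], [3, 4, 4, 5], [0, 1, 1, 2, 2]]
    beta = variational_em(docs, K=2, V=6)
    print(beta.round(3))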

3 Collapsed LDA (without topics z)

3.1 Generative Process

For document d with K topics and V unique words,

1. draw a topic mixture θ ~ Dir(α), where θ is a K-vector and Σ_k θ_k = 1;
2. for each of the N_d word counts, independently draw a word w_n ~ p(w_n | θ, β), where β is a K × V matrix, each row of which defines a multinomial distribution over the vocabulary; β_ij is the probability that word j appears given topic i.

Figure 3: Collapsed LDA model (α → θ_d → w_{d,n} ← β, with plates over the N_d words and the D documents).

Figure 4: Variational distribution used to approximate the posterior in collapsed LDA (γ_d → θ_d, with a plate over the D documents).

3.2 Comparing LDA vs Collapsed LDA

Table 1: Comparing LDA vs collapsed LDA.

                          LDA                                                 collapsed LDA
    prior                 p(θ) = Dir(α)                                       p(θ) = Dir(α)
    likelihood            p(w | z = k, β) = Cat(β_k)                          p(w | θ, β) = Cat(θβ)
    posterior             p(z, θ | w)                                         p(θ | w)
    approximate posterior q(z, θ | φ, γ) = q(z | φ) q(θ | γ) = Cat(φ) Dir(γ)  q(θ | γ) = Dir(γ)

Proof of the likelihood for collapsed LDA:

p(w_{dn} | θ_d, β) = Σ_{z_{dn}} p(z_{dn}, w_{dn} | θ_d, β) = Σ_{z_{dn}} p(z_{dn} | θ_d) p(w_{dn} | z_{dn}, β) = Σ_k θ_{d,k} β_{k, w_{dn}}.
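Numerically, the collapsed likelihood is just a matrix-vector product: the per-word distribution is the mixture θβ. A tiny numpy illustration with made-up θ and β:

    import numpy as np

    theta = np.array([0.7, 0.3])                # document-level topic mixture (K = 2)
    beta = np.array([[0.5, 0.4, 0.1],           # K x V topic-word probabilities (V = 3)
                     [0.1, 0.1, 0.8]])
    word_probs = theta @ beta                   # p(w = v | theta, beta) = sum_k theta_k beta_{k,v}
    print(word_probs, word_probs.sum())         # [0.38 0.31 0.31] 1.0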

4 Gaussian VAE with AVI

4.1 Cost Function

Figure 5: SVI, with per-data-point variational parameters φ_i (θ, φ_i → z_i → x_i, plate over N).

Figure 6: VAE, with variational parameters φ shared (amortized) across data points (θ, φ → z_i → x_i, plate over N).

In both figures: circles are random variables, shaded circles are observed random variables, and unshaded circles are hidden random variables; N is the number of samples, θ are the generative model parameters, and φ are the variational approximation parameters.

L(θ, φ; x^(i)) = E_{q_φ(z | x^(i))}[ − log q_φ(z | x^(i)) + log p_θ(z, x^(i)) ]
= log p_θ(x^(i)) − D_KL( q_φ(z | x^(i)) ‖ p_θ(z | x^(i)) )                         (lower bound)
= − D_KL( q_φ(z | x^(i)) ‖ p_θ(z, x^(i)) )                                         (joint-contrastive)
= E_{q_φ(z | x^(i))}[ log p_θ(x^(i) | z) ] − D_KL( q_φ(z | x^(i)) ‖ p_θ(z) )       (prior-contrastive)
= Reconstruction − Regularization.

L(θ, φ; x^(i)) is the variational lower bound on the marginal log likelihood of the data point x^(i). The sum Σ_{i=1}^N L(θ, φ; x^(i)) is the evidence lower bound objective (ELBO).

4.2 Optimization

We want to maximize L(θ, φ; x^(i)) in order to maximize the marginal likelihood log p_θ(x^(1), ..., x^(N)) of the data. Therefore, we need to differentiate and optimize L(θ, φ; x^(i)) with respect to θ and φ.

Differentiating with respect to θ:

∇_θ L(θ, φ; x^(i)) = ∇_θ E_{q_φ(z | x^(i))}[ − log q_φ(z | x^(i)) + log p_θ(z, x^(i)) ]
= E_{q_φ(z | x^(i))}[ − ∇_θ log q_φ(z | x^(i)) + ∇_θ log p_θ(z, x^(i)) ]     (we can differentiate inside the expectation, since q does not depend on θ)
= E_{q_φ(z | x^(i))}[ 0 + ∇_θ p_θ(z, x^(i)) / p_θ(z, x^(i)) ].

Differentiating with respect to φ:

∇_φ L(θ, φ; x^(i)) = ∇_φ E_{q_φ(z | x^(i))}[ − log q_φ(z | x^(i)) + log p_θ(z, x^(i)) ]
≠ E_{q_φ(z | x^(i))}[ ∇_φ ( − log q_φ(z | x^(i)) + log p_θ(z, x^(i)) ) ]     (we cannot move the gradient inside the expectation, since the expectation itself is taken under q_φ)

4.2.1 Score Function Estimators

Idea: differentiate the density q_φ(z).

∇_φ E_{q_φ(z)}[ f(z) ] = ∇_φ ∫ f(z) q_φ(z) dz
= ∫ f(z) ∇_φ q_φ(z) dz                                  (differentiate the density q_φ(z))
= ∫ f(z) q_φ(z) ∇_φ log q_φ(z) dz                       (REINFORCE / log-derivative trick)
= E_{q_φ(z)}[ f(z) ∇_φ log q_φ(z) ]                     (score function estimator)
≈ (1/L) Σ_{l=1}^L f(z^(l)) ∇_φ log q_φ(z^(l)),  z^(l) ~ q_φ(z)    (Monte Carlo approximation)

This estimator has been used in LDA with MFVI or SVI.
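The following is a minimal sketch of the score function estimator on a toy case where the exact gradient is known: f(z) = z² and q_φ(z) = N(z; φ, 1), so ∇_φ E_q[f(z)] = 2φ. The example is illustrative only; the large sample count hints at the estimator's high variance.

    import numpy as np

    rng = np.random.default_rng(0)
    phi = 1.5                                    # variational parameter (mean of q)
    L = 100_000

    z = rng.normal(loc=phi, scale=1.0, size=L)   # z^(l) ~ q_phi(z) = N(phi, 1)
    f = z ** 2                                   # f(z) = z^2
    score = z - phi                              # d/dphi log N(z; phi, 1) = (z - phi) / 1
    grad_est = np.mean(f * score)                # (1/L) sum_l f(z^(l)) * score(z^(l))

    print(grad_est)                              # noisy estimate of the true gradient 2 * phi = 3.0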

4.2.2 Pathwise Gradient Estimators

Explicit Reparameterization Gradients

Idea:
- Differentiate the function f(z).
- Apply a standardization function ε = S_φ(z), with ε ~ q(ε), to remove the dependence of q_φ(z) on φ.
- S_φ should be continuously differentiable w.r.t. its parameters and its argument, and invertible, so that z = S_φ^{-1}(ε).

∇_φ E_{q_φ(z)}[ f(z) ] = ∇_φ ∫ f(z) q_φ(z) dz
= ∇_φ ∫ f(S_φ^{-1}(ε)) q_φ(S_φ^{-1}(ε)) | dS_φ^{-1}(ε) / dε | dε      (integration by substitution)
= ∇_φ ∫ f(S_φ^{-1}(ε)) q(ε) dε,    since q(ε) = q_φ(S_φ^{-1}(ε)) | dS_φ^{-1}(ε) / dε |
= E_{q(ε)}[ ∇_φ f(S_φ^{-1}(ε)) ]
= E_{q(ε)}[ ∇_z f(S_φ^{-1}(ε)) ∇_φ S_φ^{-1}(ε) ].                     (chain rule)

Implicit Reparameterization Gradients

Idea: start from the standardization ε = S_φ(z) and take the gradient of both sides w.r.t. φ, using the chain rule:

0 = ∇_φ S_φ(z) + ∇_z S_φ(z) ∇_φ z    ⇒    ∇_φ z = − ( ∇_z S_φ(z) )^{-1} ∇_φ S_φ(z).

Then

∇_φ E_{q_φ(z)}[ f(z) ] = ∇_φ ∫ f(S_φ^{-1}(ε)) q(ε) dε                 (integration by substitution, as above)
= E_{q(ε)}[ ∇_z f(z) ∇_φ z ] evaluated at z = S_φ^{-1}(ε)
= ∫ ∇_z f(z) ∇_φ z q_φ(z) dz
= E_{q_φ(z)}[ ∇_z f(z) ∇_φ z ],    with ∇_φ z = − ( ∇_z S_φ(z) )^{-1} ∇_φ S_φ(z).
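PyTorch (assumed available here) exposes both ideas through rsample(): Normal.rsample applies the explicit standardization z = μ + σε, while Dirichlet.rsample provides reparameterized gradients through a pathwise derivative of the sampler rather than an explicitly inverted standardization function. A minimal sketch, with toy parameters and toy objectives:

    import torch

    # Explicit reparameterization: Gaussian, z = mu + sigma * eps with eps ~ N(0, 1)
    mu = torch.tensor(0.5, requires_grad=True)
    log_sigma = torch.tensor(0.0, requires_grad=True)
    q_gauss = torch.distributions.Normal(mu, log_sigma.exp())
    z = q_gauss.rsample((1000,))              # pathwise / reparameterized samples
    loss = (z ** 2).mean()                    # Monte Carlo estimate of E_q[z^2]
    loss.backward()
    print(mu.grad, log_sigma.grad)            # gradients flow through the samples

    # Implicit-style reparameterization: Dirichlet, gradients flow through rsample()
    # even though no explicit inverse standardization is formed
    gamma = torch.tensor([2.0, 5.0, 1.0], requires_grad=True)
    q_dir = torch.distributions.Dirichlet(gamma)
    theta = q_dir.rsample((1000,))
    loss = theta[:, 0].mean()                 # Monte Carlo estimate of E_q[theta_0] = gamma_0 / sum(gamma)
    loss.backward()
    print(gamma.grad)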

Comparison between explicit and implicit reparameterization: the explicit estimator differentiates through the inverse standardization function S_φ^{-1}(ε), so it requires S_φ to be invertible in closed form; the implicit estimator only requires ∇S_φ(z), which is what makes it applicable to distributions such as the Gamma and the Dirichlet used below.

4.3 Model

Table 2: Comparing LDA vs collapsed LDA vs Gaussian VAE.

                          LDA                             collapsed LDA           Gaussian VAE
    prior                 p(θ) = Dir(α)                   p(θ) = Dir(α)           p_θ(z) = N(0, I)
    likelihood            p(w | z = k, β) = Cat(β_k)      p(w | θ, β) = Cat(θβ)   p_θ(x | z) = N(μ, σ²I)
    posterior             p(z, θ | w)                     p(θ | w)                p_θ(z | x)
    approximate posterior q(z, θ | φ, γ) = Cat(φ) Dir(γ)  q(θ | γ) = Dir(γ)       q_φ(z | x) = N(μ, σ²I)

L(θ, φ; x^(i)) = E_{q_φ(z | x^(i))}[ log p_θ(x^(i) | z) ] − D_KL( q_φ(z | x^(i)) ‖ p_θ(z) )

L_B(θ, φ; x^(i)) = (1/L) Σ_{l=1}^L log p_θ( x^(i) | z^(i,l) ) − D_KL( q_φ(z | x^(i)) ‖ p_θ(z) )
= (1/L) Σ_{l=1}^L log N( x^(i); μ^(i,l), σ²I ) − D_KL( q_φ(z | x^(i)) ‖ p_θ(z) )
= − (1/L) Σ_{l=1}^L ‖ x^(i) − μ^(i,l) ‖² / (2σ²) − D_KL( q_φ(z | x^(i)) ‖ p_θ(z) )      (discarding some constants)
= − (1/L) Σ_{l=1}^L ‖ x^(i) − μ^(i,l) ‖² / (2σ²) + (1/2) Σ_{j=1}^J ( 1 + log σ_j² − μ_j² − σ_j² ),

where μ^(i,l), σ are the decoder outputs, μ_j, σ_j are the encoder outputs, and J is the dimension of z. Maximizing L_B therefore amounts to minimizing a mean squared reconstruction error plus a regularization term (MSE + Regularization).

Figure 7: Gaussian VAE.

Encoder (parameters φ):
μ^(i), σ^(i) = MLP_φ( x^(i) )

Explicit reparameterization:
ε^(l) ~ p(ε) = N(0, I)
z^(i,l) = S_φ^{-1}( ε^(l), x^(i) ) = μ^(i) + σ^(i) ⊙ ε^(l)

Implicit reparameterization:
z^(i,l) ~ q_φ( z | x^(i) ) = N( z; μ^(i), σ^{2(i)} I )

Decoder (parameters θ), without reparameterization:
μ, σ = MLP_θ( z^(i,l) )
x^(i) ~ p_θ( x^(i) | z^(i,l) ) = N( x; μ, σ² I )

(No reparameterization is needed on the decoder side, since the expectation in the lower bound is taken over q_φ(z | x^(i)), not over p_θ.)
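A minimal PyTorch sketch of the pieces in Figure 7, using made-up layer sizes (the report only specifies the networks as MLPs) and the explicit reparameterization; the loss is the negative of the estimator above, i.e. the squared reconstruction error plus the closed-form Gaussian KL.

    import torch
    import torch.nn as nn

    class GaussianVAE(nn.Module):
        def __init__(self, x_dim=784, z_dim=20, h_dim=200):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
            self.enc_mu = nn.Linear(h_dim, z_dim)          # encoder MLP_phi: x -> mu
            self.enc_logvar = nn.Linear(h_dim, z_dim)      # encoder MLP_phi: x -> log sigma^2
            self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                     nn.Linear(h_dim, x_dim))   # decoder MLP_theta: z -> mu_x

        def forward(self, x):
            h = self.enc(x)
            mu, logvar = self.enc_mu(h), self.enc_logvar(h)
            eps = torch.randn_like(mu)                     # explicit reparameterization
            z = mu + (0.5 * logvar).exp() * eps            # z = mu + sigma * eps
            x_mu = self.dec(z)
            recon = ((x - x_mu) ** 2).sum(dim=1)           # Gaussian reconstruction term (up to constants)
            kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1)   # closed-form KL to N(0, I)
            return (recon + kl).mean()                     # negative ELBO, averaged over the batch

    vae = GaussianVAE()
    x = torch.rand(16, 784)                                # toy batch
    loss = vae(x)
    loss.backward()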

5 Collapsed LDA as VAE with AVI

5.1 Model

Table 3: Comparing LDA vs collapsed LDA vs Gaussian VAE (repeated from Table 2 for reference).

                          LDA                             collapsed LDA           Gaussian VAE
    prior                 p(θ) = Dir(α)                   p(θ) = Dir(α)           p_θ(z) = N(0, I)
    likelihood            p(w | z = k, β) = Cat(β_k)      p(w | θ, β) = Cat(θβ)   p_θ(x | z) = N(μ, σ²I)
    posterior             p(z, θ | w)                     p(θ | w)                p_θ(z | x)
    approximate posterior q(z, θ | φ, γ) = Cat(φ) Dir(γ)  q(θ | γ) = Dir(γ)       q_φ(z | x) = N(μ, σ²I)

For the collapsed LDA the role of the latent variable z is played by the topic mixture θ, the observation x^(i) is the document w^(i), and the generative parameter is β:

L(β, φ; w^(i)) = E_{q_φ(θ | w^(i))}[ log p( w^(i) | θ, β ) ] − D_KL( q_φ(θ | w^(i)) ‖ p(θ) )

L_B(β, φ; w^(i)) = (1/L) Σ_{l=1}^L log p( w^(i) | θ^(i,l), β ) − D_KL( q_φ(θ | w^(i)) ‖ p(θ) )
= (1/L) Σ_{l=1}^L log Cat( w^(i); θ^(i,l) β ) − D_KL( Dir(θ; γ^(i)) ‖ Dir(θ; α) ).

Figure 8: LDA VAE.

5.1.1 Encoder

Implicit reparameterization:
γ^(i) = MLP_φ( w^(i) )
θ^(i,l) ~ q_φ( θ | w^(i) ) = Dir( θ; γ^(i) )

Decoder (β), without reparameterization:
w^(i) ~ p( w | θ^(i,l), β ) = Cat( w; θ^(i,l) β )
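A minimal PyTorch sketch of this encoder/decoder pair, assuming documents arrive as bag-of-words count vectors of length V; Dirichlet.rsample() supplies the (implicit-style) reparameterized gradients and kl_divergence gives the Dirichlet-to-Dirichlet KL term in closed form. The layer sizes, the softmax parameterization of β, and the small constants for numerical stability are choices of this sketch, not specified in the report.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.distributions import Dirichlet, kl_divergence

    class LDAVAE(nn.Module):
        def __init__(self, V=1000, K=20, h_dim=100, alpha=0.1):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(V, h_dim), nn.Softplus(),
                                         nn.Linear(h_dim, K), nn.Softplus())   # MLP_phi: w -> gamma > 0
            self.beta_logits = nn.Parameter(torch.randn(K, V))                 # unconstrained topic-word weights
            self.register_buffer("alpha", torch.full((K,), alpha))             # Dirichlet prior parameter

        def forward(self, bow):                        # bow: batch of word-count vectors, shape (B, V)
            gamma = self.encoder(bow) + 1e-3           # keep concentrations strictly positive
            q_theta = Dirichlet(gamma)
            theta = q_theta.rsample()                  # reparameterized sample of the topic mixture
            beta = F.softmax(self.beta_logits, dim=1)  # rows of beta are distributions over the vocabulary
            word_probs = theta @ beta                  # p(w = v | theta, beta) = (theta beta)_v
            recon = -(bow * (word_probs + 1e-10).log()).sum(dim=1)    # - sum_n log Cat(w_n; theta beta)
            kl = kl_divergence(q_theta, Dirichlet(self.alpha.expand_as(gamma)))
            return (recon + kl).mean()                 # negative ELBO

    model = LDAVAE(V=50, K=5)
    bow = torch.randint(0, 3, (8, 50)).float()         # toy batch of count vectors
    loss = model(bow)
    loss.backward()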
