LDA with Amortized Inference
LDA with Amortized Inference

Nanbo Sun

Abstract

This report describes how to frame Latent Dirichlet Allocation (LDA) as a Variational Auto-Encoder (VAE) and use Amortized Variational Inference (AVI) to optimize it.

1. We introduce LDA and use Mean Field Variational Inference (MFVI) to optimize it.
2. We collapse the topics z in the LDA model, because we cannot backpropagate through a categorical variable.
3. We introduce the Gaussian VAE with AVI.
4. We frame the collapsed LDA as a VAE and optimize it with AVI.

1 Prerequisite

Before reading this report, you must read the following papers carefully.

1. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan), 993-1022.
2. Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
3. Figurnov, M., Mohamed, S., & Mnih, A. (2018). Implicit reparameterization gradients. arXiv preprint arXiv:1805.08498.

2 LDA with MFVI

2.1 Generative Process

For document d with K topics and V unique words,

1. draw a topic mixture \theta \sim \mathrm{Dir}(\alpha), where \theta is a K-vector and \sum_k \theta_k = 1;
2. for each of the N_d word counts, independently
   (a) draw a topic z_n \sim \mathrm{Mult}(\theta);
   (b) draw a word w_n \sim p(w_n \mid z_n, \beta), where \beta is a K x V matrix, each row of which defines a multinomial distribution over the vocabulary; \beta_{ij} is the probability that word j appears given topic i.
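The generative process above can be simulated directly. The following sketch is not part of the original report; the toy dimensions and array names are illustrative assumptions. It samples one document from the LDA model with NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, N_d = 3, 10, 8                      # topics, vocabulary size, words in the document
alpha = np.full(K, 0.5)                   # Dirichlet prior over topic mixtures
beta = rng.dirichlet(np.ones(V), size=K)  # K x V topic-word matrix, rows sum to 1

theta = rng.dirichlet(alpha)              # 1. topic mixture for the document
words = []
for _ in range(N_d):                      # 2. for each word count, independently
    z_n = rng.choice(K, p=theta)          #    (a) draw a topic
    w_n = rng.choice(V, p=beta[z_n])      #    (b) draw a word given that topic
    words.append(w_n)

print(theta, words)
```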
Figure 1: LDA model (plate notation: \alpha \to \theta_d \to z_{d,n} \to w_{d,n} \leftarrow \beta, with plates N_d and D).

Figure 2: Variational distribution used to approximate the posterior in LDA (\gamma_d \to \theta_d and \phi_{d,n} \to z_{d,n}, with plates N_d and D).

2.2 Constructing the Lower Bound

From Figure 2, the variational distribution used to approximate the true posterior factorizes as

q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n).

The lower bound L(\gamma, \phi; \alpha, \beta) on the single-document log likelihood \log p(w \mid \alpha, \beta) (since we work with a single document here, the document subscript d is dropped for simplicity hereafter) is computed using Jensen's inequality as follows:

\log p(w \mid \alpha, \beta)
  = \log \int \sum_z p(\theta, z, w \mid \alpha, \beta)\, d\theta
  = \log \int \sum_z \frac{p(\theta, z, w \mid \alpha, \beta)\, q(\theta, z \mid \gamma, \phi)}{q(\theta, z \mid \gamma, \phi)}\, d\theta
  \ge \int \sum_z q(\theta, z \mid \gamma, \phi) \log \frac{p(\theta, z, w \mid \alpha, \beta)}{q(\theta, z \mid \gamma, \phi)}\, d\theta
  = E_q[\log p(\theta, z, w \mid \alpha, \beta)] - E_q[\log q(\theta, z \mid \gamma, \phi)]
  \equiv L(\gamma, \phi; \alpha, \beta).                                                     (1)

The difference between the log likelihood and its lower bound can in fact be shown to be the KL divergence between the variational distribution and the true posterior:

\log p(w \mid \alpha, \beta) - L(\gamma, \phi; \alpha, \beta)
  = E_q[\log p(w \mid \alpha, \beta)] - E_q[\log p(\theta, z, w \mid \alpha, \beta)] + E_q[\log q(\theta, z \mid \gamma, \phi)]
  = E_q\Big[\log \frac{p(w \mid \alpha, \beta)\, q(\theta, z \mid \gamma, \phi)}{p(\theta, z, w \mid \alpha, \beta)}\Big]
  = E_q\Big[\log \frac{q(\theta, z \mid \gamma, \phi)}{p(\theta, z \mid w, \alpha, \beta)}\Big]
  = D_{KL}\big(q(\theta, z \mid \gamma, \phi) \,\|\, p(\theta, z \mid w, \alpha, \beta)\big).

Therefore, maximizing the lower bound is equivalent to minimizing the KL divergence. That is, the variational distribution automatically approaches the true posterior as we maximize the lower bound.

2.3 Expanding the Lower Bound

To maximize the lower bound, we first need to spell out the lower bound L(\gamma, \phi; \alpha, \beta) in terms of the model parameters \alpha, \beta and the variational parameters \gamma, \phi. Continuing from (1), we have

L(\gamma, \phi; \alpha, \beta)
  = E_q[\log p(\theta, z, w \mid \alpha, \beta)] - E_q[\log q(\theta, z \mid \gamma, \phi)]
  = E_q\Big[\log \frac{p(\theta \mid \alpha)\, p(z \mid \theta)\, p(w \mid z, \beta)}{q(\theta \mid \gamma)\, q(z \mid \phi)}\Big]
  = E_q[\log p(\theta \mid \alpha)] + E_q[\log p(z \mid \theta)] + E_q[\log p(w \mid z, \beta)]
    - E_q[\log q(\theta \mid \gamma)] - E_q[\log q(z \mid \phi)].                             (2)

We now further expand each of the five terms in (2). The first term is

E_q[\log p(\theta \mid \alpha)]
  = E_q\Big[\log \frac{\Gamma(\sum_{i=1}^{K} \alpha_i)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}\Big]
  = \sum_{k=1}^{K} (\alpha_k - 1)\, E_q[\log \theta_k] + \log \Gamma\Big(\sum_{i=1}^{K} \alpha_i\Big) - \sum_{k=1}^{K} \log \Gamma(\alpha_k)
  = \sum_{k=1}^{K} (\alpha_k - 1)\Big(\Psi(\gamma_k) - \Psi\Big(\sum_{i=1}^{K} \gamma_i\Big)\Big) + \log \Gamma\Big(\sum_{i=1}^{K} \alpha_i\Big) - \sum_{k=1}^{K} \log \Gamma(\alpha_k),

where \Psi is the digamma function, the first derivative of the log Gamma function. The final line is due to the following property of the Dirichlet distribution as a member of the exponential family: if \theta \sim \mathrm{Dir}(\alpha), then

E_{p(\theta \mid \alpha)}[\log \theta_i] = \Psi(\alpha_i) - \Psi\Big(\sum_{i=1}^{K} \alpha_i\Big).
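The Dirichlet expectation used in the last step is easy to check numerically. The sketch below is an illustrative check (not from the report) that compares a Monte Carlo estimate of E[log \theta_i] against \Psi(\alpha_i) - \Psi(\sum_j \alpha_j).

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
alpha = np.array([0.7, 1.5, 3.0])

samples = rng.dirichlet(alpha, size=200_000)      # draws from Dir(alpha)
mc_estimate = np.log(samples).mean(axis=0)        # Monte Carlo E[log theta_i]
analytic = digamma(alpha) - digamma(alpha.sum())  # exponential-family identity

print(mc_estimate)   # approximately equal to `analytic`
print(analytic)
```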
The second term is

E_q[\log p(z \mid \theta)]
  = E_q\Big[\log \prod_{n=1}^{N} p(z_n \mid \theta)\Big]
  = E_q\Big[\log \prod_{n=1}^{N} \prod_{k=1}^{K} \theta_k^{1[z_n = k]}\Big]
  = \sum_{n=1}^{N} \sum_{k=1}^{K} E_q[1[z_n = k]]\, E_q[\log \theta_k]
  = \sum_{n=1}^{N} \sum_{k=1}^{K} \phi_{n,k} \Big(\Psi(\gamma_k) - \Psi\Big(\sum_{i=1}^{K} \gamma_i\Big)\Big),

where \phi_{n,k} is the probability of the nth word being produced by topic k, and 1[\cdot] is the indicator function. We expand the third term as

E_q[\log p(w \mid z, \beta)]
  = E_q\Big[\log \prod_{n=1}^{N} p(w_n \mid z_n, \beta)\Big]
  = E_q\Big[\log \prod_{n=1}^{N} \prod_{k=1}^{K} \prod_{v=1}^{V} \beta_{k,v}^{1[z_n = k]\, 1[w_n = v]}\Big]
  = \sum_{n=1}^{N} \sum_{k=1}^{K} \sum_{v=1}^{V} E_q[1[z_n = k]]\, 1[w_n = v] \log \beta_{k,v}
  = \sum_{n=1}^{N} \sum_{k=1}^{K} \sum_{v=1}^{V} \phi_{n,k}\, 1[w_n = v] \log \beta_{k,v}.

Very similar to the first term, the fourth term is expanded as

E_q[\log q(\theta \mid \gamma)]
  = \sum_{k=1}^{K} (\gamma_k - 1)\Big(\Psi(\gamma_k) - \Psi\Big(\sum_{i=1}^{K} \gamma_i\Big)\Big) + \log \Gamma\Big(\sum_{i=1}^{K} \gamma_i\Big) - \sum_{k=1}^{K} \log \Gamma(\gamma_k).

Finally, the fifth term is expanded as

E_q[\log q(z \mid \phi)]
  = E_q\Big[\log \prod_{n=1}^{N} q(z_n \mid \phi_n)\Big]
  = E_q\Big[\log \prod_{n=1}^{N} \prod_{k=1}^{K} \phi_{n,k}^{1[z_n = k]}\Big]
  = \sum_{n=1}^{N} \sum_{k=1}^{K} E_q[1[z_n = k]] \log \phi_{n,k}
  = \sum_{n=1}^{N} \sum_{k=1}^{K} \phi_{n,k} \log \phi_{n,k}.
Therefore, the fully expanded lower bound is

L(\gamma, \phi; \alpha, \beta)
  = \sum_{k=1}^{K} (\alpha_k - 1)\Big(\Psi(\gamma_k) - \Psi\Big(\sum_{i=1}^{K} \gamma_i\Big)\Big) + \log \Gamma\Big(\sum_{i=1}^{K} \alpha_i\Big) - \sum_{k=1}^{K} \log \Gamma(\alpha_k)
  + \sum_{n=1}^{N} \sum_{k=1}^{K} \phi_{n,k} \Big(\Psi(\gamma_k) - \Psi\Big(\sum_{i=1}^{K} \gamma_i\Big)\Big)
  + \sum_{n=1}^{N} \sum_{k=1}^{K} \sum_{v=1}^{V} \phi_{n,k}\, 1[w_n = v] \log \beta_{k,v}                                      (3)
  - \sum_{k=1}^{K} (\gamma_k - 1)\Big(\Psi(\gamma_k) - \Psi\Big(\sum_{i=1}^{K} \gamma_i\Big)\Big) - \log \Gamma\Big(\sum_{i=1}^{K} \gamma_i\Big) + \sum_{k=1}^{K} \log \Gamma(\gamma_k)
  - \sum_{n=1}^{N} \sum_{k=1}^{K} \phi_{n,k} \log \phi_{n,k}.

2.4 Maximizing the Lower Bound

In this section, we maximize the lower bound w.r.t. the variational parameters \phi and \gamma. Recall that as the maximization runs, the KL divergence between the variational distribution and the true posterior drops. This is the E-step of the variational EM algorithm.

2.4.1 Variational Multinomial

We first maximize Equation (3) w.r.t. \phi_{n,k}. Since \sum_k \phi_{n,k} = 1, this is a constrained optimization problem that can be solved by the Lagrange multiplier method. The Lagrangian w.r.t. \phi_{n,k} is

L_{[\phi_{n,k}]}
  = \phi_{n,k} \Big(\Psi(\gamma_k) - \Psi\Big(\sum_{i=1}^{K} \gamma_i\Big)\Big) + \phi_{n,k} \log \beta_{k,v} - \phi_{n,k} \log \phi_{n,k} + \lambda_n \Big(\sum_{i=1}^{K} \phi_{n,i} - 1\Big),

where \lambda_n is the Lagrange multiplier and v is the vocabulary index of word w_n. Taking the derivative, we get

\frac{\partial L_{[\phi_{n,k}]}}{\partial \phi_{n,k}}
  = \Psi(\gamma_k) - \Psi\Big(\sum_{i=1}^{K} \gamma_i\Big) + \log \beta_{k,v} - \log \phi_{n,k} - 1 + \lambda_n.

Setting this derivative to zero yields

\phi_{n,k} = \beta_{k,v} \exp\Big(\Psi(\gamma_k) - \Psi\Big(\sum_{i=1}^{K} \gamma_i\Big) + \lambda_n - 1\Big)
  \;\propto\; \beta_{k,v} \exp\Big(\Psi(\gamma_k) - \Psi\Big(\sum_{i=1}^{K} \gamma_i\Big)\Big).
2.4.2 Variational Dirichlet

Now we maximize Equation (3) w.r.t. \gamma_k, the kth component of the Dirichlet parameter. Only the terms containing \gamma_k are retained:

L_{[\gamma]}
  = \sum_{k=1}^{K} (\alpha_k - 1)\Big(\Psi(\gamma_k) - \Psi\Big(\sum_{i=1}^{K} \gamma_i\Big)\Big)
  + \sum_{n=1}^{N} \sum_{k=1}^{K} \phi_{n,k}\Big(\Psi(\gamma_k) - \Psi\Big(\sum_{i=1}^{K} \gamma_i\Big)\Big)
  - \sum_{k=1}^{K} (\gamma_k - 1)\Big(\Psi(\gamma_k) - \Psi\Big(\sum_{i=1}^{K} \gamma_i\Big)\Big)
  - \log \Gamma\Big(\sum_{i=1}^{K} \gamma_i\Big) + \sum_{k=1}^{K} \log \Gamma(\gamma_k).

Taking the derivative w.r.t. \gamma_k, we have

\frac{\partial L_{[\gamma]}}{\partial \gamma_k}
  = \Big(\Psi'(\gamma_k) - \Psi'\Big(\sum_{i=1}^{K} \gamma_i\Big)\Big)(\alpha_k - 1)
  + \sum_{n=1}^{N} \phi_{n,k}\Big(\Psi'(\gamma_k) - \Psi'\Big(\sum_{i=1}^{K} \gamma_i\Big)\Big)
  - \Big(\Psi(\gamma_k) - \Psi\Big(\sum_{i=1}^{K} \gamma_i\Big)\Big)
  - (\gamma_k - 1)\Big(\Psi'(\gamma_k) - \Psi'\Big(\sum_{i=1}^{K} \gamma_i\Big)\Big)
  - \Psi\Big(\sum_{i=1}^{K} \gamma_i\Big) + \Psi(\gamma_k)
  = \Big(\Psi'(\gamma_k) - \Psi'\Big(\sum_{i=1}^{K} \gamma_i\Big)\Big)\Big(\alpha_k + \sum_{n=1}^{N} \phi_{n,k} - \gamma_k\Big).

Setting it to zero, we have

\gamma_k = \alpha_k + \sum_{n=1}^{N} \phi_{n,k}.
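The two updates above are alternated until the variational parameters stop changing. The sketch below is a minimal per-document E-step with hypothetical variable names (not code from the report); it assumes a document given as a list of word indices w, a fixed topic-word matrix beta, and prior alpha.

```python
import numpy as np
from scipy.special import digamma

def e_step(w, beta, alpha, n_iter=50):
    """Per-document coordinate ascent for the variational parameters (phi, gamma)."""
    K = beta.shape[0]
    N = len(w)
    phi = np.full((N, K), 1.0 / K)            # q(z_n) initialised uniformly
    gamma = alpha + N / K                     # common initialisation
    for _ in range(n_iter):
        # phi_{n,k} ∝ beta_{k, w_n} * exp(digamma(gamma_k) - digamma(sum_i gamma_i))
        exp_E_log_theta = np.exp(digamma(gamma) - digamma(gamma.sum()))
        phi = beta[:, w].T * exp_E_log_theta  # shape (N, K), unnormalised
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_k = alpha_k + sum_n phi_{n,k}
        gamma = alpha + phi.sum(axis=0)
    return phi, gamma

# Example: K = 2 topics, V = 4 words, document w = [0, 3, 3, 1]
rng = np.random.default_rng(0)
beta = rng.dirichlet(np.ones(4), size=2)
phi, gamma = e_step([0, 3, 3, 1], beta, alpha=np.ones(2))
print(phi, gamma)
```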
2.5 Estimating Model Parameters

The previous section is the E-step of the variational EM algorithm, where we tighten the lower bound w.r.t. the variational parameters; this section is the M-step, where we maximize the lower bound w.r.t. the model parameters \alpha and \beta. We now add back the document subscript to consider the whole corpus.

By the assumed exchangeability of the documents, the overall log likelihood of the corpus is just the sum of all the documents' log likelihoods, and the overall lower bound is just the sum of the individual lower bounds. Again, only the terms involving \beta are retained in the overall lower bound. Adding the Lagrange multipliers, we obtain

L_{[\beta]}
  = \sum_{d=1}^{D} \sum_{n=1}^{N_d} \sum_{k=1}^{K} \sum_{v=1}^{V} \phi_{d,n,k}\, 1[w_{d,n} = v] \log \beta_{k,v}
  + \sum_{k=1}^{K} \lambda_k \Big(\sum_{v=1}^{V} \beta_{k,v} - 1\Big).

Taking the derivative w.r.t. \beta_{k,v} and setting it to zero, we have

\frac{\partial L_{[\beta]}}{\partial \beta_{k,v}}
  = \sum_{d=1}^{D} \sum_{n=1}^{N_d} \phi_{d,n,k}\, 1[w_{d,n} = v]\, \frac{1}{\beta_{k,v}} + \lambda_k = 0
  \quad\Longrightarrow\quad
  \beta_{k,v} = -\frac{1}{\lambda_k} \sum_{d=1}^{D} \sum_{n=1}^{N_d} \phi_{d,n,k}\, 1[w_{d,n} = v]
  \;\propto\; \sum_{d=1}^{D} \sum_{n=1}^{N_d} \phi_{d,n,k}\, 1[w_{d,n} = v].

Similarly, for \alpha, we have

L_{[\alpha]}
  = \sum_{d=1}^{D} \Bigg[\sum_{k=1}^{K} (\alpha_k - 1)\Big(\Psi(\gamma_{d,k}) - \Psi\Big(\sum_{i=1}^{K} \gamma_{d,i}\Big)\Big)
  + \log \Gamma\Big(\sum_{i=1}^{K} \alpha_i\Big) - \sum_{k=1}^{K} \log \Gamma(\alpha_k)\Bigg],

\frac{\partial L_{[\alpha]}}{\partial \alpha_k}
  = \sum_{d=1}^{D} \Big(\Psi(\gamma_{d,k}) - \Psi\Big(\sum_{i=1}^{K} \gamma_{d,i}\Big)\Big)
  + D\Big(\Psi\Big(\sum_{i=1}^{K} \alpha_i\Big) - \Psi(\alpha_k)\Big).

Since the derivative also depends on the other \alpha_{k'}, k' \ne k, we compute the Hessian

\frac{\partial^2 L_{[\alpha]}}{\partial \alpha_k\, \partial \alpha_{k'}}
  = D\, \Psi'\Big(\sum_{i=1}^{K} \alpha_i\Big) - D\, \delta(k, k')\, \Psi'(\alpha_k),

and notice that its form (a diagonal matrix plus a constant) allows for the linear-time Newton-Raphson algorithm.
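Given the per-document \phi's from the E-step, the \beta update is a normalized count of expected word-topic assignments. The sketch below uses illustrative names and is not code from the report; the \alpha update via linear-time Newton-Raphson is omitted here.

```python
import numpy as np

def m_step_beta(docs, phis, K, V):
    """M-step for beta: beta_{k,v} ∝ sum_d sum_n phi_{d,n,k} * 1[w_{d,n} = v].

    docs : list of documents, each a list/array of word indices
    phis : list of (N_d, K) arrays of variational topic responsibilities from the E-step
    """
    beta = np.zeros((K, V))
    for w, phi in zip(docs, phis):
        for n, v in enumerate(w):
            beta[:, v] += phi[n]                 # expected assignment of word n to each topic
    beta /= beta.sum(axis=1, keepdims=True)      # normalise each topic's distribution over words
    return beta
```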
3 Collapsed LDA without topics z

3.1 Generative Process

For document d with K topics and V unique words,

1. draw a topic mixture \theta \sim \mathrm{Dir}(\alpha), where \theta is a K-vector and \sum_k \theta_k = 1;
2. for each of the N_d word counts, independently draw a word w_n \sim p(w_n \mid \theta, \beta), where \beta is a K x V matrix, each row of which defines a multinomial distribution over the vocabulary; \beta_{ij} is the probability that word j appears given topic i.

Figure 3: Collapsed LDA model (plate notation: \alpha \to \theta_d \to w_{d,n} \leftarrow \beta, with plates N_d and D).

Figure 4: Variational distribution used to approximate the posterior in collapsed LDA (\gamma_d \to \theta_d, with plate D).

3.2 Comparing LDA vs Collapsed LDA

Table 1: Comparing LDA vs collapsed LDA.

                      | LDA                                                           | collapsed LDA
prior                 | p(\theta) = \mathrm{Dir}(\alpha)                              | p(\theta) = \mathrm{Dir}(\alpha)
likelihood            | p(w \mid z = k, \beta) = \mathrm{Cat}(\beta_k)                | p(w \mid \theta, \beta) = \mathrm{Cat}(\theta\beta)
posterior             | p(z, \theta \mid w)                                           | p(\theta \mid w)
approx. posterior     | q(z, \theta \mid \phi, \gamma) = q(z \mid \phi)\, q(\theta \mid \gamma) = \mathrm{Cat}(\phi)\, \mathrm{Dir}(\gamma) | q(\theta \mid \gamma) = \mathrm{Dir}(\gamma)

Proof of the likelihood for collapsed LDA:

p(w_{dn} \mid \theta_d, \beta)
  = \sum_{z_{dn}} p(z_{dn}, w_{dn} \mid \theta_d, \beta)
  = \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta)
  = \sum_{k} \theta_{dk}\, \beta_{k, w_{dn}}
  = (\theta_d \beta)_{w_{dn}}.
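The collapsed likelihood is just a matrix-vector product: marginalizing the topic indicator gives p(w = v \mid \theta, \beta) = (\theta\beta)_v. A quick numerical check (illustrative, not from the report):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 3, 5
beta = rng.dirichlet(np.ones(V), size=K)   # K x V topic-word matrix
theta = rng.dirichlet(np.ones(K))          # topic mixture for one document

# explicit marginalisation over z: sum_k theta_k * beta_{k,v}
marginal = sum(theta[k] * beta[k] for k in range(K))
# collapsed likelihood: Cat(theta @ beta)
collapsed = theta @ beta

assert np.allclose(marginal, collapsed)
print(collapsed, collapsed.sum())          # a valid distribution over the V words
```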
4 Gaussian VAE with AVI

4.1 Cost Function

Figure 5: SVI (graphical model with \theta \to z \to x, over N samples).

Figure 6: VAE (graphical model with \theta \to z \to x and an inference arrow x \to z parameterized by \phi, over N samples).

Notation: circles denote random variables; shaded circles are observed random variables; unshaded circles are hidden random variables; N is the number of samples; \theta are the generative model parameters; \phi are the variational approximation parameters.

L(\theta, \phi; x^{(i)})
  = E_{q_\phi(z \mid x^{(i)})}\big[-\log q_\phi(z \mid x^{(i)}) + \log p_\theta(z, x^{(i)})\big]
  = \log p_\theta(x^{(i)}) - D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z \mid x^{(i)})\big)        (lower bound)
  = -D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z, x^{(i)})\big)                                    (joint-contrastive, with the unnormalized joint in the second argument)
  = E_{q_\phi(z \mid x^{(i)})}\big[\log p_\theta(x^{(i)} \mid z)\big] - D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)\big)   (prior-contrastive)
  = \text{Reconstruction} - \text{Regularization}.

L(\theta, \phi; x^{(i)}) is the variational lower bound on the marginal log likelihood of the data point x^{(i)}. The sum \sum_{i=1}^{N} L(\theta, \phi; x^{(i)}) is the evidence lower bound objective (ELBO).
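For a toy model where everything is Gaussian and one-dimensional, the prior-contrastive form can be estimated by Monte Carlo and compared to the exact marginal likelihood. The sketch below uses an illustrative model and names (not from the report): p(z) = N(0, 1), p(x | z) = N(z, 1), and q(z | x) = N(m, s^2).

```python
import numpy as np

rng = np.random.default_rng(0)
x = 1.3                                   # a single observed data point
m, s = 0.6, 0.8                           # variational parameters of q(z | x) = N(m, s^2)

def log_normal(v, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - (v - mean)**2 / (2 * std**2)

z = rng.normal(m, s, size=100_000)        # samples from q(z | x)
reconstruction = log_normal(x, z, 1.0).mean()        # E_q[log p(x | z)]
kl = np.log(1.0 / s) + (s**2 + m**2) / 2 - 0.5       # closed-form KL(N(m, s^2) || N(0, 1))
elbo = reconstruction - kl                           # prior-contrastive lower bound

print(elbo)                               # <= log p(x), with equality only for the exact posterior
print(log_normal(x, 0.0, np.sqrt(2.0)))   # true marginal log likelihood log N(x; 0, 2)
```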
4.2 Optimization

We want to maximize L(\theta, \phi; x^{(i)}) in order to maximize the marginal likelihood \log p_\theta(x^{(1)}, \ldots, x^{(N)}) of the data. Therefore, we need to differentiate and optimize L(\theta, \phi; x^{(i)}) with respect to \theta and \phi.

Differentiate with respect to \theta:

\nabla_\theta L(\theta, \phi; x^{(i)})
  = \nabla_\theta E_{q_\phi(z \mid x^{(i)})}\big[-\log q_\phi(z \mid x^{(i)}) + \log p_\theta(z, x^{(i)})\big]
  = E_{q_\phi(z \mid x^{(i)})}\big[-\nabla_\theta \log q_\phi(z \mid x^{(i)}) + \nabla_\theta \log p_\theta(z, x^{(i)})\big]        (differentiate inside the expectation)
  = E_{q_\phi(z \mid x^{(i)})}\Big[0 + \frac{\nabla_\theta p_\theta(z, x^{(i)})}{p_\theta(z, x^{(i)})}\Big].

Differentiate with respect to \phi:

\nabla_\phi L(\theta, \phi; x^{(i)})
  = \nabla_\phi E_{q_\phi(z \mid x^{(i)})}\big[-\log q_\phi(z \mid x^{(i)}) + \log p_\theta(z, x^{(i)})\big]
  \ne E_{q_\phi(z \mid x^{(i)})}\big[\nabla_\phi(\cdots)\big]        (the gradient cannot be moved inside the expectation, because the expectation itself depends on \phi)

Score Function Estimators

Idea: differentiate the density q.

\nabla_\phi E_{q_\phi(z)}[f(z)]
  = \nabla_\phi \int f(z)\, q_\phi(z)\, dz
  = \int f(z)\, \nabla_\phi q_\phi(z)\, dz                                         (differentiate the density q)
  = \int f(z)\, q_\phi(z)\, \nabla_\phi \log q_\phi(z)\, dz                        (REINFORCE trick)
  = E_{q_\phi(z)}\big[f(z)\, \nabla_\phi \log q_\phi(z)\big]                       (score function estimator)
  \approx \frac{1}{L} \sum_{l=1}^{L} f(z^{(l)})\, \nabla_\phi \log q_\phi(z^{(l)})  (Monte Carlo approximation)

This estimator has been used in LDA with MFVI or SVI.
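The score function estimator works for any q from which we can sample and whose log density we can differentiate. The sketch below is illustrative (not from the report): it estimates \nabla_m E_{N(z; m, 1)}[z^2], whose true value is 2m, using the REINFORCE identity.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 0.5                                   # variational parameter (mean of q)
L = 100_000

z = rng.normal(m, 1.0, size=L)            # z^(l) ~ q(z) = N(m, 1)
f = z**2                                  # f(z)
score = z - m                             # d/dm log N(z; m, 1) = (z - m) / 1
grad_estimate = np.mean(f * score)        # (1/L) sum_l f(z^(l)) * score(z^(l))

print(grad_estimate)                      # approximately 2 * m = 1.0
```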
Pathwise Gradient Estimators

Explicit Reparameterization Gradients

Idea:
- Differentiate the function f.
- Apply a standardization function S_\phi(z) = \epsilon \sim q(\epsilon) to remove the dependence of q on \phi.
- S should be continuously differentiable w.r.t. its parameters and argument, and invertible, so that z = S_\phi^{-1}(\epsilon).

\nabla_\phi E_{q_\phi(z)}[f(z)]
  = \nabla_\phi \int f(z)\, q_\phi(z)\, dz
  = \nabla_\phi \int f(S_\phi^{-1}(\epsilon))\, q_\phi(S_\phi^{-1}(\epsilon))\, \Big|\frac{dS_\phi^{-1}(\epsilon)}{d\epsilon}\Big|\, d\epsilon     (integration by substitution)
  = \nabla_\phi \int f(S_\phi^{-1}(\epsilon))\, q(\epsilon)\, d\epsilon       \Big(\text{since } q(\epsilon) = q_\phi(S_\phi^{-1}(\epsilon))\, \Big|\frac{dS_\phi^{-1}(\epsilon)}{d\epsilon}\Big|\Big)
  = E_{q(\epsilon)}\big[\nabla_\phi f(S_\phi^{-1}(\epsilon))\big]
  = E_{q(\epsilon)}\big[\nabla_z f(z)\big|_{z = S_\phi^{-1}(\epsilon)}\, \nabla_\phi S_\phi^{-1}(\epsilon)\big]          (chain rule)

Implicit Reparameterization Gradients

Idea: differentiate the identity S_\phi(z) = \epsilon in total. Since \nabla_\phi S_\phi(z) + \nabla_z S_\phi(z)\, \nabla_\phi z = 0, we get

\nabla_\phi z = -\big(\nabla_z S_\phi(z)\big)^{-1}\, \nabla_\phi S_\phi(z),

so the gradient can be computed without inverting the standardization function. Taking the gradient and applying the chain rule,

\nabla_\phi E_{q_\phi(z)}[f(z)]
  = E_{q(\epsilon)}\big[\nabla_z f(S_\phi^{-1}(\epsilon))\, \nabla_\phi S_\phi^{-1}(\epsilon)\big]
  = E_{q(\epsilon)}\big[-\nabla_z f(S_\phi^{-1}(\epsilon))\, \big(\nabla_z S_\phi(z)\big)^{-1}\, \nabla_\phi S_\phi(z)\big]
  = -E_{q_\phi(z)}\big[\nabla_z f(z)\, \big(\nabla_z S_\phi(z)\big)^{-1}\, \nabla_\phi S_\phi(z)\big],

where the last step changes the variable of integration back from \epsilon to z.
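For the same toy expectation as above, the explicit reparameterization estimator writes z = m + \epsilon with \epsilon \sim N(0, 1), so the gradient passes through the sample. The sketch below is illustrative (not from the report) and typically shows a much lower variance than the score function estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 0.5
L = 100_000

eps = rng.normal(0.0, 1.0, size=L)        # eps ~ q(eps) = N(0, 1)
z = m + eps                               # z = S^{-1}(eps) = m + eps (sigma fixed to 1)
df_dz = 2 * z                             # d f / d z for f(z) = z^2
dz_dm = 1.0                               # d z / d m (chain-rule factor)
pathwise = df_dz * dz_dm                  # per-sample pathwise gradient

score = (z - m) * z**2                    # per-sample score-function gradient, for comparison
print(pathwise.mean(), pathwise.std())    # ~2m, small standard deviation
print(score.mean(), score.std())          # ~2m, much larger standard deviation
```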
Comparison between explicit and implicit reparameterization: explicit reparameterization needs an invertible standardization function (and evaluates its inverse), while implicit reparameterization only differentiates the standardization function itself, which makes it applicable to distributions such as the Gamma and Dirichlet whose standardization functions are not analytically invertible.

4.3 Model

Table 2: Comparing LDA vs collapsed LDA vs Gaussian VAE.

                      | LDA                                              | collapsed LDA                                    | VAE
prior                 | p(\theta) = \mathrm{Dir}(\alpha)                 | p(\theta) = \mathrm{Dir}(\alpha)                 | p_\theta(z) = N(0, I)
likelihood            | p(w \mid z = k, \beta) = \mathrm{Cat}(\beta_k)   | p(w \mid \theta, \beta) = \mathrm{Cat}(\theta\beta) | p_\theta(x \mid z) = N(\mu, \sigma^2 I)
posterior             | p(z, \theta \mid w)                              | p(\theta \mid w)                                 | p_\theta(z \mid x)
approx. posterior     | q(z, \theta \mid \phi, \gamma) = \mathrm{Cat}(\phi)\, \mathrm{Dir}(\gamma) | q(\theta \mid \gamma) = \mathrm{Dir}(\gamma) | q_\phi(z \mid x) = N(\mu, \sigma^2 I)

L(\theta, \phi; x^{(i)})
  = E_{q_\phi(z \mid x^{(i)})}\big[\log p_\theta(x^{(i)} \mid z)\big] - D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)\big)

L^B(\theta, \phi; x^{(i)})
  = \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)} \mid z^{(i,l)}) - D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)\big)
  = \frac{1}{L} \sum_{l=1}^{L} \log N(x^{(i)}; \mu, \sigma^2 I) - D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)\big)
  = -\frac{1}{L} \sum_{l=1}^{L} \frac{\|x^{(i)} - \mu\|^2}{2\sigma^2} - D_{KL}\big(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)\big)      (discarding some constants)
  = -\frac{1}{L} \sum_{l=1}^{L} \frac{\|x^{(i)} - \mu\|^2}{2\sigma^2} + \frac{1}{2} \sum_{j=1}^{J} \big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big),

so maximizing L^B amounts to minimizing MSE + Regularization (up to scaling and constants), where \mu, \sigma in the reconstruction term are decoder outputs and \mu_j, \sigma_j in the KL term are encoder outputs.
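The final expression is the familiar Gaussian VAE loss. The sketch below is a minimal NumPy version with illustrative names (not from the report); it computes the negative bound for one data point from encoder outputs (mu_q, logvar_q) and the decoder mean mu_x, assuming a single Monte Carlo sample and a fixed decoder variance of 1.

```python
import numpy as np

def neg_elbo(x, mu_x, mu_q, logvar_q):
    """Negative single-sample bound: reconstruction MSE plus KL regularisation.

    x        : observed data point, shape (D,)
    mu_x     : decoder mean for the sampled z, shape (D,)
    mu_q     : encoder mean, shape (J,)
    logvar_q : encoder log-variance, shape (J,)
    """
    mse = 0.5 * np.sum((x - mu_x) ** 2)                             # -log N(x; mu_x, I) up to a constant
    kl = -0.5 * np.sum(1 + logvar_q - mu_q**2 - np.exp(logvar_q))   # KL(q(z|x) || N(0, I)) in closed form
    return mse + kl

x = np.array([0.2, -1.0, 0.5])
print(neg_elbo(x, mu_x=np.array([0.1, -0.8, 0.4]),
               mu_q=np.array([0.3, -0.2]), logvar_q=np.array([-0.5, 0.1])))
```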
Figure 7: Gaussian VAE (encoder-decoder architecture).

Encoder (\phi)

\mu^{(i)}, \sigma^{(i)} = \mathrm{MLP}_\phi(x^{(i)})

Explicit reparameterization:
  \epsilon^{(l)} \sim p(\epsilon) = N(0, I)
  z^{(i,l)} = S_\phi^{-1}(\epsilon^{(l)}; x^{(i)}) = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(l)}

Implicit reparameterization:
  z^{(i,l)} \sim q_\phi(z \mid x^{(i)}) = N(z; \mu^{(i)}, \sigma^{(i)2} I)

Decoder (\theta)

Without reparameterization:
  x^{(i)} \sim p_\theta(x \mid z^{(i,l)}) = N(x; \mu, \sigma^2 I)

With reparameterization:
  \mu, \sigma = \mathrm{MLP}_\theta(z^{(i,l)})
  x^{(i)} \sim p_\theta(x \mid z^{(i,l)}) = N(x; \mu, \sigma^2 I)
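Putting the encoder, the explicit reparameterization, and the decoder together gives the usual Gaussian VAE training step. The PyTorch sketch below is a minimal illustration under assumed layer sizes; it is not code from the report.

```python
import torch
import torch.nn as nn

class GaussianVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=200):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh())
        self.enc_mu = nn.Linear(h_dim, z_dim)       # encoder mean
        self.enc_logvar = nn.Linear(h_dim, z_dim)   # encoder log-variance
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim))  # decoder mean (variance fixed to 1)

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        eps = torch.randn_like(mu)                  # eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps      # explicit reparameterization
        x_mu = self.dec(z)
        return x_mu, mu, logvar

def loss_fn(x, x_mu, mu, logvar):
    recon = 0.5 * ((x - x_mu) ** 2).sum(dim=1)                      # MSE reconstruction term
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1)    # closed-form KL to N(0, I)
    return (recon + kl).mean()

# one optimisation step on a random minibatch (shapes only, no real data)
model = GaussianVAE()
x = torch.randn(32, 784)
loss = loss_fn(x, *model(x))
loss.backward()
```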
5 Collapsed LDA as VAE with AVI

5.1 Model

Table 3: Comparing LDA vs collapsed LDA vs Gaussian VAE (repeated from Table 2 for reference).

                      | LDA                                              | collapsed LDA                                    | VAE
prior                 | p(\theta) = \mathrm{Dir}(\alpha)                 | p(\theta) = \mathrm{Dir}(\alpha)                 | p_\theta(z) = N(0, I)
likelihood            | p(w \mid z = k, \beta) = \mathrm{Cat}(\beta_k)   | p(w \mid \theta, \beta) = \mathrm{Cat}(\theta\beta) | p_\theta(x \mid z) = N(\mu, \sigma^2 I)
posterior             | p(z, \theta \mid w)                              | p(\theta \mid w)                                 | p_\theta(z \mid x)
approx. posterior     | q(z, \theta \mid \phi, \gamma) = \mathrm{Cat}(\phi)\, \mathrm{Dir}(\gamma) | q(\theta \mid \gamma) = \mathrm{Dir}(\gamma) | q_\phi(z \mid x) = N(\mu, \sigma^2 I)

For the collapsed LDA, the latent variable is the topic mixture \theta and the generative parameters are \beta, so the bound for document w^{(i)} is

L(\beta, \phi; w^{(i)})
  = E_{q_\phi(\theta \mid w^{(i)})}\big[\log p(w^{(i)} \mid \theta, \beta)\big] - D_{KL}\big(q_\phi(\theta \mid w^{(i)}) \,\|\, p(\theta)\big)

L^B(\beta, \phi; w^{(i)})
  = \frac{1}{L} \sum_{l=1}^{L} \log p(w^{(i)} \mid \theta^{(i,l)}, \beta) - D_{KL}\big(q_\phi(\theta \mid w^{(i)}) \,\|\, p(\theta)\big)
  = \frac{1}{L} \sum_{l=1}^{L} \log \mathrm{Cat}(w^{(i)}; \theta^{(i,l)} \beta) - D_{KL}\big(\mathrm{Dir}(\theta; \gamma^{(i)}) \,\|\, \mathrm{Dir}(\theta; \alpha)\big).

Figure 8: LDA VAE (encoder-decoder architecture).
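The regularization term in the bound above is a KL divergence between two Dirichlet distributions, which has a closed form. The sketch below (illustrative, not from the report) computes it and checks it against a Monte Carlo estimate.

```python
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_logpdf(theta, a):
    """Log density of Dir(a) evaluated at each row of theta."""
    return gammaln(a.sum()) - gammaln(a).sum() + ((a - 1) * np.log(theta)).sum(axis=-1)

def kl_dirichlet(gamma, alpha):
    """KL( Dir(gamma) || Dir(alpha) ) in closed form."""
    g0 = gamma.sum()
    return (gammaln(g0) - gammaln(gamma).sum()
            - gammaln(alpha.sum()) + gammaln(alpha).sum()
            + np.sum((gamma - alpha) * (digamma(gamma) - digamma(g0))))

gamma = np.array([2.0, 0.8, 1.5])   # variational Dirichlet parameters (encoder output)
alpha = np.array([0.5, 0.5, 0.5])   # prior Dirichlet parameters

rng = np.random.default_rng(0)
theta = rng.dirichlet(gamma, size=200_000)
mc = np.mean(dirichlet_logpdf(theta, gamma) - dirichlet_logpdf(theta, alpha))

print(kl_dirichlet(gamma, alpha), mc)   # closed form vs Monte Carlo, should agree closely
```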
5.1.1 Encoder (\phi)

Implicit reparameterization:
  \gamma^{(i)} = \mathrm{MLP}_\phi(w^{(i)})
  \theta^{(i,l)} \sim q_\phi(\theta \mid w^{(i)}) = \mathrm{Dir}(\theta; \gamma^{(i)})

5.1.2 Decoder (\beta)

Without reparameterization:
  w^{(i)} \sim p(w \mid \theta^{(i,l)}, \beta) = \mathrm{Cat}(w; \theta^{(i,l)} \beta)
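A minimal PyTorch sketch of this encoder-decoder pair follows (illustrative sizes and names, not code from the report). To my knowledge, torch.distributions.Dirichlet provides rsample, which uses implicit reparameterization gradients, and the Dirichlet-Dirichlet KL is available in closed form via kl_divergence; if your version differs, substitute the closed-form KL from the previous sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Dirichlet, kl_divergence

class LdaVAE(nn.Module):
    def __init__(self, vocab_size, n_topics, hidden=100, alpha=0.5):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.Tanh(),
                                     nn.Linear(hidden, n_topics))
        self.beta_logits = nn.Parameter(torch.randn(n_topics, vocab_size))  # unnormalised topics
        self.register_buffer("alpha", torch.full((n_topics,), alpha))       # Dirichlet prior

    def forward(self, bow):
        # bow: (batch, vocab_size) bag-of-words counts w^(i)
        gamma = F.softplus(self.encoder(bow)) + 1e-4    # gamma^(i) = MLP_phi(w^(i)), kept positive
        q_theta = Dirichlet(gamma)
        theta = q_theta.rsample()                       # implicit reparameterization gradients
        beta = F.softmax(self.beta_logits, dim=1)       # rows are topic-word distributions
        log_word_probs = torch.log(theta @ beta + 1e-10)            # log of Cat(w; theta beta) probabilities
        reconstruction = (bow * log_word_probs).sum(dim=1)          # sum of per-word log likelihoods
        kl = kl_divergence(q_theta, Dirichlet(self.alpha.expand_as(gamma)))
        return -(reconstruction - kl).mean()            # negative ELBO, to be minimised

# one training step on a random bag-of-words minibatch (shapes only, no real corpus)
model = LdaVAE(vocab_size=2000, n_topics=20)
bow = torch.randint(0, 3, (16, 2000)).float()
loss = model(bow)
loss.backward()
```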