Lecture 19, November 19, 2012


Machine Learning 10-701/15-781, Fall 2012. Latent Space Analysis: SVD and Topic Models. Eric Xing. Lecture 19, November 19, 2012. Reading: Tutorial on Topic Model @ ACL.

We are inundated with data (from images.google.cn). Humans cannot afford to deal with (e.g., search, browse, or measure similarity over) a huge number of text and media documents. We need computers to help out.

A task: say we want to have a mapping of documents, so that we can:
- Compare similarity
- Classify contents
- Cluster/group/categorize docs
- Distill semantics and perspectives

Representation: Bag of Words

Example document: "As for the Arabian and Palestinian voices that are against the current negotiations and the so-called peace process, they are not against peace per se, but rather for their well-founded predictions that Israel would NOT give an inch of the West Bank (and most probably the same for Golan Heights) back to the Arabs. An 8 months of "negotiations" in Madrid, and Washington proved these predictions. Now many will jump on me saying why are you blaming Israelis for no-result negotiations. I would say why would the Arabs stall the negotiations, what do they have to lose?"

Extracted words: Arabian, negotiations, against, peace, Israel, Arabs, blaming, ...

Each document is a vector in the word space:
- Ignore the order of words in a document; only counts matter!
- A high-dimensional and sparse representation
- Not efficient for text processing tasks, e.g., search, document classification, or similarity measure
- Not effective for browsing
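To make the representation concrete, here is a minimal sketch (plain Python; the tiny document list is made up for illustration, not from the lecture) that turns raw documents into bag-of-words count vectors over a shared vocabulary:

```python
from collections import Counter

def bag_of_words(docs):
    """Turn raw documents into count vectors over a shared vocabulary.

    Word order is discarded: only per-document counts are kept, which is
    exactly the bag-of-words assumption described above.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors

# Toy example (made-up documents)
docs = ["money bank loan bank", "river stream bank", "loan money money"]
vocab, X = bag_of_words(docs)
print(vocab)   # ['bank', 'loan', 'money', 'river', 'stream']
print(X)       # [[2, 1, 1, 0, 0], [1, 0, 0, 1, 1], [0, 1, 2, 0, 0]]
```

The resulting vectors are exactly the high-dimensional, sparse representation the slide warns about: most entries are zero, and all word-order information is gone.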

Subspace analysis

X (m x n, the document-term matrix) = T (m x k) * Λ (k x k) * D^T (k x n), where:
- T: the cluster/topic/basis (subspace) distributions
- Λ: a priori weights
- D^T: memberships (coordinates)

Depending on the constraints placed on the factors, we get different models:
- Clustering: (0,1) matrix
- LSI/NMF: arbitrary matrices
- Topic Models: stochastic matrix
- Sparse coding: arbitrary sparse matrices

An example: (figure)

Principal Component Analysis

The new variables/dimensions:
- Are linear combinations of the original ones
- Are uncorrelated with one another (orthogonal in the original dimension space)
- Capture as much of the original variance in the data as possible
- Are called Principal Components

(Figure: data plotted against Original Variable A and Original Variable B, with PC1 and PC2 overlaid.)

Principal components are orthogonal directions of greatest variance in the data; projections along PC1 discriminate the data most along any one axis.
- The first principal component is the direction of greatest variability (covariance) in the data.
- The second is the next orthogonal (uncorrelated) direction of greatest variability: first remove all the variability along the first component, then find the next direction of greatest variability.
- And so on.

Computing the Components

- The projection of a vector x onto an axis (dimension) u is u^T x.
- The direction of greatest variability is the one in which the average square of the projection is greatest: maximize u^T X X^T u subject to u^T u = 1.
- Construct the Lagrangian u^T X X^T u - λ u^T u and set the vector of partial derivatives to zero: X X^T u - λ u = (X X^T - λ I) u = 0.
- Since u ≠ 0, u must be an eigenvector of X X^T with eigenvalue λ.
- λ is the principal eigenvalue of the correlation matrix C = X X^T; the eigenvalue denotes the amount of variability captured along that dimension.
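As a minimal numerical sketch of the eigenvector computation above (numpy only; the data matrix is made up, with rows as centered features and columns as samples, matching the X X^T convention on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 2 features x 200 samples, with correlated rows (made up for illustration)
X = rng.normal(size=(2, 200))
X[1] = 0.8 * X[0] + 0.2 * X[1]
X = X - X.mean(axis=1, keepdims=True)         # center each feature

C = X @ X.T                                   # (scaled) covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)          # eigh: symmetric eigendecomposition
order = np.argsort(eigvals)[::-1]             # sort by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

u1 = eigvecs[:, 0]                            # first principal direction
print("fraction of variance captured by PC1:", eigvals[0] / eigvals.sum())
projections = u1 @ X                          # coordinates of samples along PC1
```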

Computing the Components (continued)

- Similarly for the next axis, and so on.
- So the new axes are the eigenvectors of the matrix of correlations of the original variables, which captures the similarities of the original variables based on how data samples project onto them.
- Geometrically: centering followed by rotation, i.e., a linear transformation.

Eigenvalues & Eigenvectors

- For symmetric matrices, eigenvectors for distinct eigenvalues are orthogonal: S v_{1,2} = λ_{1,2} v_{1,2} and λ_1 ≠ λ_2 imply v_1 · v_2 = 0.
- All eigenvalues of a real symmetric matrix are real: if |S - λI| = 0 and S = S^T, then λ ∈ R.
- All eigenvalues of a positive semidefinite matrix are non-negative: if w^T S w ≥ 0 for all w ∈ R^n, then S v = λ v implies λ ≥ 0.

Eigen/diagonal Decomposition

- Let S be a square matrix with m linearly independent eigenvectors (a "non-defective" matrix).
- Theorem: there exists an eigen decomposition S = U Λ U^{-1}, where Λ is diagonal; it is unique for distinct eigenvalues (cf. the matrix diagonalization theorem).
- The columns of U are eigenvectors of S.
- The diagonal elements of Λ are the eigenvalues of S.

PCs, Variance and Least-Squares

- The first PC retains the greatest amount of variation in the sample.
- The k-th PC retains the k-th greatest fraction of the variation in the sample.
- The k-th largest eigenvalue of the correlation matrix C is the variance in the sample along the k-th PC.
- The least-squares view: PCs are a series of linear least-squares fits to a sample, each orthogonal to all previous ones.

The Corpora Matrix

X is a term-document count matrix: rows are words (Word 1, ..., Word 5), columns are documents (Doc 1, Doc 2, Doc 3), and each entry counts the occurrences of a word in a document. (Example count table on slide.)

Singular Value Decomposition

For an m x n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows:

A = U Σ V^T, where U is m x m, Σ is m x n, and V is n x n.

- The columns of U are orthogonal eigenvectors of A A^T.
- The columns of V are orthogonal eigenvectors of A^T A.
- The eigenvalues λ_1, ..., λ_r of A A^T are also the eigenvalues of A^T A.
- Σ = diag(σ_1, ..., σ_r) with σ_i = sqrt(λ_i); the σ_i are the singular values.

SVD and PCA

- The first root is called the principal eigenvalue, which has an associated orthonormal (u^T u = 1) eigenvector u.
- Subsequent roots are ordered such that λ_1 > λ_2 > ... > λ_M, with rank(D) non-zero values.
- The eigenvectors form an orthonormal basis, i.e., u_i^T u_j = δ_ij.
- The eigenvalue decomposition of X X^T is U Σ U^T, where U = [u_1, u_2, ..., u_M] and Σ = diag[λ_1, λ_2, ..., λ_M].
- Similarly, the eigenvalue decomposition of X^T X is V Σ V^T.
- The SVD is closely related to the above: X = U Σ^{1/2} V^T, with left eigenvectors U, right eigenvectors V, and singular values equal to the square roots of the eigenvalues.
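A short sketch (numpy, with a made-up centered data matrix X) of the relation just stated: the left singular vectors of X match the eigenvectors of X X^T, and the squared singular values match its eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 50))
X = X - X.mean(axis=1, keepdims=True)            # center the features (rows)

# Eigendecomposition of X X^T
eigvals, U_eig = np.linalg.eigh(X @ X.T)
eigvals, U_eig = eigvals[::-1], U_eig[:, ::-1]   # sort by decreasing eigenvalue

# SVD of X
U_svd, s, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(s**2, eigvals))                 # squared singular values = eigenvalues
print(np.allclose(np.abs(U_svd), np.abs(U_eig)))  # same directions, up to sign
```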

How Many PCs?

- For n original dimensions, the sample covariance matrix is n x n, and has up to n eigenvectors. So n PCs.
- Where does dimensionality reduction come from? We can ignore the components of lesser significance.

(Figure: scree plot of variance (%) against PC1 through PC10.)

- You do lose some information, but if the eigenvalues are small, you don't lose much:
  - n dimensions in the original data
  - calculate n eigenvectors and eigenvalues
  - choose only the first p eigenvectors, based on their eigenvalues
  - the final data set has only p dimensions
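One common way to pick p is sketched below, under the assumption that we keep enough components to explain a chosen fraction of the total variance (the 0.9 threshold is an arbitrary illustration, not a value from the slide):

```python
import numpy as np

def choose_num_pcs(eigvals, threshold=0.9):
    """Smallest p whose leading eigenvalues explain `threshold` of total variance."""
    eigvals = np.sort(eigvals)[::-1]
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cumulative, threshold) + 1)

print(choose_num_pcs(np.array([5.0, 2.0, 1.0, 0.5, 0.25]), threshold=0.9))  # -> 3
```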

(Figure: K is the number of singular values used; "within 40 threshold".)

Summary: Latent Semantic Indexing (Deerwester et al., 1990)

X (m x n, the document-term matrix) ≈ T (m x k) * Λ (k x k) * D^T (k x n), i.e., each document vector is approximated as a weighted combination of the k latent basis vectors.

However:
- LSI does not define a properly normalized probability distribution over observed and latent entities.
- It does not support probabilistic reasoning under uncertainty and data fusion.
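A minimal LSI sketch under the same assumptions (numpy only; the toy counts are made up for illustration): truncate the SVD of the term-document matrix to k components and compare documents in the latent space:

```python
import numpy as np

# Toy term-document count matrix (5 words x 3 docs, made up)
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 4, 1],
              [0, 3, 0],
              [1, 1, 2]], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
T_k, Lam_k, D_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]   # X ~ T_k Lam_k D_k

doc_coords = (Lam_k @ D_k).T                            # documents in the k-dim latent space

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(doc_coords[0], doc_coords[1]))   # similarity of Doc 1 and Doc 2 in latent space
print(cosine(doc_coords[0], doc_coords[2]))
```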

Connecting Probability Models to Data

- Generative model: P(Data | Parameters), from the probabilistic model to real-world data.
- Inference: P(Parameters | Data), from real-world data back to the model.

Latent Semantic Structure in a Graphical Model

Latent structure l, observed words w.
- Distribution over words: P(w) = Σ_l P(w, l)
- Inferring latent structure: P(l | w) = P(w | l) P(l) / P(w)

How to Model Semantics?

Q: What is it about?
A: Mainly MT, with syntax, some learning.

Admixing proportion: MT 0.6, Syntax 0.3, Learning 0.1.
- MT: source, target, SMT, alignment, score, BLEU, ...
- Syntax: parse tree, noun phrase, grammar, CFG, ...
- Learning: likelihood, EM, hidden, parameters, estimation, argmax, ...

Example document: "A Hierarchical Phrase-Based Model for Statistical Machine Translation. We present a statistical phrase-based translation model that uses hierarchical phrases, phrases that contain sub-phrases. The model is formally a synchronous context-free grammar but is learned from a bitext without any syntactic information. Thus it can be seen as a shift to the formal machinery of syntax-based translation systems without any linguistic commitment. In our experiments using BLEU as a metric, the hierarchical phrase-based model achieves a relative improvement of 7.5% over Pharaoh, a state-of-the-art phrase-based system."

Topics: unigram distributions over the vocabulary.

Why is this Useful?

Q: What is it about?
A: Mainly MT, with syntax, some learning (admixing proportion: MT 0.6, Syntax 0.3, Learning 0.1).

Q: Give me similar documents?
A: A structured way of browsing the collection.

Other tasks:
- Dimensionality reduction: TF-IDF vs. topic mixing proportion
- Classification, clustering, and more

Words in Contexts

"It was a nice shot."

Words in Contexts (cont'd)

"... the opposition Labor Party fared even worse, with a predicted 35 seats, seven less than last election."

A possible generative process of a document

Mixture components (distributions over vocabulary elements), e.g., TOPIC 1 and TOPIC 2, and an admixing weight vector θ that represents each component's contribution.

DOCUMENT 1: money bank bank loan river stream bank money river bank money bank loan money stream bank money bank bank loan river stream bank money river bank money bank loan bank money stream

DOCUMENT 2: river stream bank stream bank money loan river stream loan bank river bank bank stream river loan bank stream bank money loan river stream bank stream bank money river stream loan bank river bank money bank stream river bank stream bank money

Bayesian approach: use priors.
- Admixture weights ~ Dirichlet(α)
- Mixture components ~ Dirichlet(Γ)

Probabilistic LSI (Hofmann, 1999)

(Plate diagram: topics β_k; for each of M documents, topic proportions θ_d; for each of N words, a topic indicator z_n and an observed word w_n.)

Probabilistic LSI

A "generative" model:
- Models each word in a document as a sample from a mixture model.
- Each word is generated from a single topic; different words in the document may be generated from different topics.
- A topic is characterized by a distribution over words.
- Each document is represented as a list of admixing proportions for the components (i.e., the topic vector θ).

Latent Dirichlet Allocation (Blei, Ng and Jordan, 2003)

Essentially a Bayesian pLSI.

(Plate diagram: a Dirichlet prior η over the K topics β_k; for each of M documents, θ ~ Dirichlet(α); for each of N words, a topic indicator z_n and an observed word w_n.)

LDA Generative model:
- Models each word in a document as a sample from a mixture model.
- Each word is generated from a single topic; different words in the document may be generated from different topics.
- A topic is characterized by a distribution over words.
- Each document is represented as a list of admixing proportions for the components (i.e., the topic vector).
- The topic vectors and the word rates each follow a Dirichlet prior: essentially a Bayesian pLSI.

Topic Models = Mixed Membership Models = Admixture

Generating a document:
1. Draw θ from the prior.
2. For each word n:
   - Draw z_n from Multinomial(θ).
   - Draw w_n | z_n, {β_k} from Multinomial(β_{z_n}).

A sampling sketch of this process is given below. Which prior to use?
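As a concrete illustration of the generative steps above, here is a minimal sketch (numpy; the number of topics K, the vocabulary size, the document length, and the Dirichlet hyperparameters are all made-up toy values) that samples one document with a Dirichlet prior on θ:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, N = 3, 8, 20          # number of topics, vocabulary size, words per document (toy values)
alpha = np.full(K, 0.5)     # Dirichlet prior on topic proportions
eta = np.full(V, 0.1)       # Dirichlet prior on each topic's word distribution

beta = rng.dirichlet(eta, size=K)                 # K topics, each a distribution over V words

# Generate one document
theta = rng.dirichlet(alpha)                      # 1. draw topic proportions from the prior
z = rng.choice(K, size=N, p=theta)                # 2a. draw a topic indicator per word
w = np.array([rng.choice(V, p=beta[zn]) for zn in z])  # 2b. draw each word from its topic

print("theta:", np.round(theta, 2))
print("z:", z)
print("w:", w)
```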

Choices of Priors

- Dirichlet (LDA) (Blei et al., 2003)
  - Conjugate prior means efficient inference.
  - Can only capture variations in each topic's intensity independently.
- Logistic Normal (CTM = LoNTAM) (Blei & Lafferty 2005; Ahmed & Xing 2006)
  - Captures the intuition that some topics are highly correlated and can rise up in intensity together.
  - Not a conjugate prior, which implies hard inference.
- Nested CRP (Blei et al., 2005)
  - Defines a hierarchy on topics.

Generative Semantics of LoNTAM

Generating a document:
1. Draw γ ~ N_{K-1}(µ, Σ) and set γ_K = 0, so that θ ~ LN_K(µ, Σ) with θ_i = exp(γ_i - C(γ)), where
   C(γ) = log(1 + Σ_{i=1}^{K-1} e^{γ_i})
   is the log partition function (normalization constant).
2. For each word n:
   - Draw z_n from Multinomial(θ).
   - Draw w_n | z_n, {β_k} from Multinomial(β_{z_n}).
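A small sketch of the logistic-normal draw in step 1 (numpy; µ and Σ are made-up toy values), mapping a Gaussian vector to a point on the simplex:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
mu = np.zeros(K - 1)
Sigma = 0.5 * np.eye(K - 1) + 0.5              # correlated toy covariance

gamma = rng.multivariate_normal(mu, Sigma)     # gamma ~ N_{K-1}(mu, Sigma)
gamma = np.append(gamma, 0.0)                  # fix gamma_K = 0
C = np.log(np.sum(np.exp(gamma)))              # log partition: log(1 + sum_{i<K} e^{gamma_i})
theta = np.exp(gamma - C)                      # theta lies on the (K-1)-simplex

print(np.round(theta, 3), theta.sum())         # the entries sum to 1
```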

Outcomes from a topic model

The topics β in a corpus:
- There is no name for each topic; you need to name it!
- There is no objective measure of good/bad.
- The topics that get shown are the good ones; there are many trivial, meaningless, or redundant ones, and you need to manually prune the results.
- How many topics?

Outcomes from a topic model

The topic vector θ of each doc:
- Creates an embedding of documents in a topic space.
- There is no ground truth for θ against which to measure the quality of inference.
- But on top of θ it is possible to define an objective measure of goodness, such as classification error, retrieval of similar documents, clustering, etc.
- There is, however, no consensus on whether these tasks bear the true value of topic models.

Outcomes from a topic model

The per-word topic indicator z:
- Not very useful under the bag-of-words representation, because of the loss of ordering.
- But it is possible to define simple probabilistic linguistic constraints (e.g., bigrams) over z and get potentially interesting results [Griffiths, Steyvers, Blei, & Tenenbaum, 2004].

Outcomes from a topic model

The topic graph (when using the CTM): kind of interesting for understanding/visualizing large corpora. (Figure from [David Blei, MLSS09].)

Outcomes from a topic model

Topic change trends. (Figure from [David Blei, MLSS09].)

The Big Picture

(Figure: an unstructured collection of documents is mapped, via topic discovery, to a structured topic network; dimensionality reduction takes documents from the word simplex to the topic space, e.g., a simplex.)

Computation on LDA

Inference: given a document D,
- Posterior: P(Θ | µ, Σ, β, D)
- Evaluation: P(D | µ, Σ, β)

Learning: given a collection of documents {D_i},
- Parameter estimation: argmax_{µ, Σ, β} Σ_i log P(D_i | µ, Σ, β)

Exact Bayesian inference on LDA is intractable

A possible query: p(θ_n | D) = ? and p(z_{n,m} | D) = ?

Closed-form solution?

p(θ_n | D) = p(θ_n, D) / p(D), where the joint requires summing over all topic assignments {z_{n,m}} and integrating over the topic vectors and topic distributions:

p(D) = Σ_{{z_{n,m}}} ∫ Π_n Π_m p(x_{n,m} | z_{n,m}, β) p(z_{n,m} | θ_n) p(θ_n | α) p(β | G) dθ_1 ... dθ_N dβ

The sum in the denominator is over T^n terms, and the integral is over n k-dimensional topic vectors.

Approximate Inference

- Variational inference
  - Mean-field approximation (Blei et al.)
  - Expectation propagation (Minka et al.)
  - Variational 2nd-order Taylor approximation (Ahmed and Xing)
- Markov Chain Monte Carlo
  - Gibbs sampling (Griffiths et al.)

Collapsed Gibbs sampling (Tom Griffiths & Mark Steyvers)

- Integrate out θ.
- For the variables z = z_1, z_2, ..., z_n:
  - Draw z_i^{(t+1)} from P(z_i | z_{-i}, w), where z_{-i} = z_1^{(t+1)}, ..., z_{i-1}^{(t+1)}, z_{i+1}^{(t)}, ..., z_n^{(t)}.

Gibbs sampling

- We need the full conditional distributions for the variables.
- Since we only sample z, we need

  P(z_i = j | z_{-i}, w) ∝ (n^{(w_i)}_{-i,j} + η) / (n^{(·)}_{-i,j} + W η) × (n^{(d_i)}_{-i,j} + α) / (n^{(d_i)}_{-i,·} + T α)

  where n^{(w_i)}_{-i,j} is the number of times word w_i is assigned to topic j, and n^{(d_i)}_{-i,j} is the number of times topic j is used in document d_i (both counts excluding the current token i); W is the vocabulary size and T the number of topics.
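A compact sketch of this update (numpy; the toy corpus, the symmetric hyperparameters alpha and eta, and the number of sweeps are made-up illustration values, not from the lecture):

```python
import numpy as np

def collapsed_gibbs_lda(docs, V, T, alpha=0.1, eta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA: theta and the topics are integrated out,
    and only the per-token topic assignments z are sampled."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_wt = np.zeros((V, T))          # word-topic counts
    n_dt = np.zeros((D, T))          # document-topic counts
    n_t = np.zeros(T)                # tokens assigned to each topic
    z = [rng.integers(T, size=len(doc)) for doc in docs]   # random initial assignments

    for d, doc in enumerate(docs):                         # initialize counts
        for i, w in enumerate(doc):
            t = z[d][i]
            n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                                # remove current token from counts
                n_wt[w, t] -= 1; n_dt[d, t] -= 1; n_t[t] -= 1
                # Full conditional: (word-topic term) x (document-topic term);
                # the document-length denominator is constant in t and cancels on normalization.
                p = (n_wt[w] + eta) / (n_t + V * eta) * (n_dt[d] + alpha)
                t = rng.choice(T, p=p / p.sum())           # resample this token's topic
                z[d][i] = t
                n_wt[w, t] += 1; n_dt[d, t] += 1; n_t[t] += 1
    return z, n_wt, n_dt

# Toy corpus: word ids in a vocabulary of size 5 (made up)
docs = [[0, 1, 0, 2, 1], [3, 4, 3, 3, 4], [0, 2, 1, 0], [4, 3, 4, 2]]
z, n_wt, n_dt = collapsed_gibbs_lda(docs, V=5, T=2)
print(n_dt)   # per-document topic usage after sampling
```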

Gibbs sampling: a worked example

(Table: a toy corpus where each token i has a word w_i (MATHEMATICS, KNOWLEDGE, RESEARCH, WORK, SCIENTIFIC, JOY, ...), a document id d_i, and a topic assignment z_i. Starting from the assignments at iteration 1, each sweep resamples z_i for one token at a time from its full conditional given all other assignments, updating the word-topic and document-topic counts; after many sweeps (iteration 1000 on the final slide) the assignments settle into a sample from the posterior.)

Learning a TM

- Maximum likelihood estimation: we need statistics on the topic-specific word assignments (due to z), the topic vector distribution (due to θ), etc.; e.g., the update formula for topic k re-estimates it from these statistics.
- These are hidden variables, therefore we need an EM algorithm (also known as data augmentation, or DA, in the Monte Carlo paradigm).
- This is a reduce step in a parallel implementation.
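As a rough illustration of the kind of statistic such an update consumes (a sketch under the assumption of a standard LDA-style M-step, not the exact formula on the slide): topic k's word distribution is re-estimated from the expected number of times each word is assigned to topic k:

```python
import numpy as np

def m_step_topics(expected_counts, eta=0.01):
    """Re-estimate each topic's word distribution from expected topic-word
    assignment counts (rows: topics, columns: vocabulary words).
    The small smoothing term eta plays the role of the Dirichlet prior."""
    smoothed = expected_counts + eta
    return smoothed / smoothed.sum(axis=1, keepdims=True)

# Toy expected counts for 2 topics over a 4-word vocabulary (made up)
counts = np.array([[10.0, 5.0, 0.5, 0.5],
                   [0.5, 0.5, 8.0, 6.0]])
beta = m_step_topics(counts)
print(np.round(beta, 3))   # each row sums to 1
```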

Conclusion

- GM-based topic models are cool:
  - Flexible
  - Modular
  - Interactive
- There are many ways of implementing topic models: unsupervised, supervised.
- Efficient inference/learning algorithms:
  - GMF, with Laplace approximation for non-conjugate distributions
  - MCMC
- Many applications:
  - Word-sense disambiguation
  - Image understanding
  - Network inference