Learning MN Parameters with Approximation. Sargur Srihari

1 Learning MN Parameters with Approximation
Sargur Srihari, srihari@cedar.buffalo.edu

2 Topics
- Iterative exact learning of MN parameters
- Difficulty with exact methods
- Approximate methods
  - Approximate inference: Belief Propagation, MAP-based learning
  - Alternative objective functions: pseudo-likelihood, contrastive objectives (contrastive divergence, margin-based)

3 Exact Learning of MN Parameters θ
Model: P_θ(χ) = (1/Z(θ)) exp( Σ_{i=1}^k θ_i f_i(D_i) ), where D_i is the scope of feature f_i, and
   Z(θ) = Σ_ξ exp( Σ_i θ_i f_i(ξ) ).
Gradient-ascent loop:
1. Initialize θ.
2. Run inference: compute Z(θ) and the expected feature counts
   E_θ[f_i] = (1/Z(θ)) Σ_ξ f_i(ξ) exp( Σ_j θ_j f_j(ξ) ).
3. Compute the gradient of the log-likelihood ℓ:
   ∂ℓ(θ; D)/∂θ_i = E_D[f_i(χ)] − E_θ[f_i],   ∇ℓ(θ) = ( ∂ℓ/∂θ_1, ..., ∂ℓ/∂θ_k ).
4. Update θ: θ^{t+1} ← θ^t + η ∇ℓ(θ^t; D).
5. If the optimum is not reached (‖θ^{t+1} − θ^t‖ > δ), return to step 2; otherwise stop.
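A minimal sketch of this loop for a toy log-linear MN, small enough that Z(θ) and E_θ[f_i] can be computed by brute-force enumeration. The edge features, data set, and step size below are illustrative assumptions, not taken from the slides.

```python
import itertools
import numpy as np

# Tiny log-linear MN over 3 binary variables with assumed pairwise features:
# f_i(x) = 1 if both endpoints of edge i are 1, else 0.
edges = [(0, 1), (1, 2), (2, 0)]

def features(x):
    return np.array([float(x[a] == 1 and x[b] == 1) for a, b in edges])

def log_partition_and_expectations(theta):
    # Brute-force enumeration: only feasible for very small models.
    states = list(itertools.product([0, 1], repeat=3))
    scores = np.array([theta @ features(x) for x in states])
    Z = np.exp(scores).sum()
    probs = np.exp(scores) / Z
    E_f = sum(p * features(x) for p, x in zip(probs, states))
    return np.log(Z), E_f

# Illustrative data set D (assumed).
D = [(1, 1, 0), (1, 1, 1), (0, 0, 0), (1, 0, 1)]
E_D = np.mean([features(x) for x in D], axis=0)

theta = np.zeros(len(edges))
eta = 0.5
for t in range(200):
    logZ, E_theta = log_partition_and_expectations(theta)
    grad = E_D - E_theta           # d/d theta_i of average log-likelihood
    theta += eta * grad            # gradient-ascent update
    if np.linalg.norm(eta * grad) < 1e-6:   # stop when the update is small
        break

print("learned theta:", theta)
```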

4 Difficulty with Exact Methods
Exact parameter estimation methods assume the ability to compute
1. the partition function Z(θ), and
2. the expectations E_{P_θ}[f_i].
In many applications the structure of the network does not allow exact computation of these terms. In image segmentation, for example, grid networks lead to exponential-size clusters for exact inference: the cluster graph is a clique tree with overlapping factors.

5 Approximate Methods for Learning θ
1. Use approximate inference for queries of P_θ
   - Decouples inference from parameter learning: inference is a black box
   - But the approximation may interfere with learning: non-convergence of inference can lead to oscillating estimates of the gradient and no learning convergence
2. Use an approximate objective function
   - One whose optimization doesn't require much inference
   - Approximately optimizing the likelihood function can be reformulated as exactly optimizing an approximate objective

6 Approximate Methods (Taxonomy)
1. Approximate inference methods
   1. Belief Propagation
   2. MAP-based learning
2. Alternative objective functions
   1. Pseudo-likelihood and its generalizations
   2. Contrastive optimization criteria
      1. Contrastive divergence
      2. Margin-based training

7 Approximate Inference: Belief Propagation
A popular approach to approximate inference is belief propagation (BP). Given a model resulting from a learning procedure, an algorithm from this family is used to answer queries, and a model trained with the same inference algorithm used at query time can be better than a model trained with exact inference. BP is run within each iteration of gradient ascent to compute the expected feature counts E_{P_θ}[f_i].

8 Example: Direct (Gibbs) vs. BP (Clique Tree)
1. Gibbs distribution over the loop A-B-C-D with factors φ_1(A,B), φ_2(B,C), φ_3(C,D), φ_4(D,A):
   P̃_Φ(A,B,C,D) = φ_1(A,B) φ_2(B,C) φ_3(C,D) φ_4(D,A)   (unnormalized distribution)
   P(A,B,C,D) = (1/Z) φ_1(A,B) φ_2(B,C) φ_3(C,D) φ_4(D,A),
   where Z = Σ_{A,B,C,D} φ_1(A,B) φ_2(B,C) φ_3(C,D) φ_4(D,A) = 7,201,840.
2. Clique tree (triangulated): clusters C_1 = {A,B,D} and C_2 = {B,C,D}, sepset S_{1,2} = {B,D}.
   Initial potentials: each ψ contains every factor over its arguments,
   ψ_1(A,B,D) = φ_1(A,B) φ_4(D,A),   ψ_2(B,C,D) = φ_2(B,C) φ_3(C,D).
Computing clique beliefs β_i and sepset beliefs μ_{i,j}:
   β_1(A,B,D) = P̃_Φ(A,B,D) = Σ_C φ_1(A,B) φ_2(B,C) φ_3(C,D) φ_4(D,A),
      e.g., β_1(a0,b0,d0) = 300,000 + 300,000 = 600,000
   μ_{1,2}(B,D) = Σ_{C_1 \ S_{1,2}} β_1(A,B,D) = Σ_A β_1(A,B,D),
      e.g., μ_{1,2}(b0,d0) = 600,000 + 200 = 600,200
   β_2(B,C,D) = P̃_Φ(B,C,D) = Σ_A φ_1(A,B) φ_2(B,C) φ_3(C,D) φ_4(D,A),
      e.g., β_2(b0,c0,d0) = 300,000 + 100 = 300,100
(The slide also tabulates all clique and sepset beliefs; the full table is omitted here.)
Verifying inference: the calibrated clique-tree beliefs reproduce the unnormalized measure, e.g.,
   P̃_Φ(a1,b0,c1,d0) = 100 = β_1(a1,b0,d0) β_2(b0,c1,d0) / μ_{1,2}(b0,d0).
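The identity used in the verification step, P̃_Φ = β_1 β_2 / μ_{1,2}, can be checked numerically. The sketch below uses random placeholder factor values (not the slide's factor table) and verifies the identity by brute-force enumeration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random positive pairwise factors over binary variables (placeholder values,
# not the numbers from the slide's worked example).
phi1 = rng.uniform(1, 10, size=(2, 2))   # phi1(A, B)
phi2 = rng.uniform(1, 10, size=(2, 2))   # phi2(B, C)
phi3 = rng.uniform(1, 10, size=(2, 2))   # phi3(C, D)
phi4 = rng.uniform(1, 10, size=(2, 2))   # phi4(D, A)

def P_tilde(a, b, c, d):
    return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]

# Clique beliefs for C1 = {A,B,D}, C2 = {B,C,D} and sepset S12 = {B,D},
# obtained by summing the unnormalized measure over the missing variable(s).
beta1 = lambda a, b, d: sum(P_tilde(a, b, c, d) for c in (0, 1))
beta2 = lambda b, c, d: sum(P_tilde(a, b, c, d) for a in (0, 1))
mu12 = lambda b, d: sum(P_tilde(a, b, c, d) for a in (0, 1) for c in (0, 1))

# Verify P_tilde = beta1 * beta2 / mu12 for every assignment.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            for d in (0, 1):
                lhs = P_tilde(a, b, c, d)
                rhs = beta1(a, b, d) * beta2(b, c, d) / mu12(b, d)
                assert np.isclose(lhs, rhs)
print("Clique-tree reparameterization verified.")
```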

9 Difficulty with Belief Propagation
In principle, BP is run in each gradient-ascent iteration to compute the E_{P_θ}[f_i] used in the gradient. Due to the family-preservation property, each feature must be a subset of some cluster C_i in the cluster graph; hence, to compute E_{P_θ}[f_i], we can compute BP marginals over C_i. But BP often does not converge:
- The derived marginals often oscillate, so the final results depend on where we stop
- As a result, the computed gradient is unstable
- This hurts the convergence properties of gradient ascent, even more severely with line search

10 Convergent Alternatives to Belief Propagation
Three methods:
1. Pseudo-moment matching: reformulate the task of learning with approximate inference as optimizing an alternative objective
2. Maximum entropy approximation (CAMEL): a general derivation that reformulates maximum likelihood with BP as a unified optimization problem with an approximate objective
3. MAP-based learning: approximate the expected feature counts with their counts in the single MAP assignment of the current MN

11 Pseudo-moment Matching
Begin with an analysis of the fixed points of learning. Converged BP beliefs must satisfy
   β_i(c_i) = P̂(c_i),   or equivalently   E_{β_i}[f_{C_i}] = E_D[f_{C_i}],
i.e., the convergence point is a set of beliefs that match the data.
We then define, for each sepset S_{i,j} between C_i and C_j, potentials of the form
   φ_i = β_i / μ_{i,j}
(dividing each sepset belief into one of its adjacent clusters), and use the final set of potentials as the parameterization of the Markov network.
Example clique tree: C_1 = {A,B,D}, sepset S_{1,2} = {B,D}, C_2 = {B,C,D}.
This provides a closed-form solution for both inference and learning, but it cannot be used with parameter regularization, non-table factors, or CRFs.
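A small sketch of the closed-form construction, assuming a two-cluster clique tree C_1 = {A,B}, sepset {B}, C_2 = {B,C} and an illustrative data set: the converged beliefs are just the empirical cluster marginals, and the factors are read off by dividing the sepset belief into one of the two adjacent clusters.

```python
import numpy as np

# Illustrative data over binary (A, B, C); clique tree C1={A,B}, sepset {B}, C2={B,C}.
data = [(0, 0, 0), (0, 1, 0), (1, 1, 1), (1, 1, 0), (0, 0, 1), (1, 0, 0)]
M = len(data)

# Empirical cluster and sepset beliefs (what converged BP beliefs must match).
beta1 = np.zeros((2, 2)); beta2 = np.zeros((2, 2)); mu = np.zeros(2)
for a, b, c in data:
    beta1[a, b] += 1 / M
    beta2[b, c] += 1 / M
    mu[b] += 1 / M

# Pseudo-moment matching: read off factors in closed form from the beliefs,
# dividing the sepset belief mu(B) into one of the adjacent clusters.
phi1 = beta1                      # phi_1(A, B) = beta_1(A, B)
phi2 = beta2 / mu[:, None]        # phi_2(B, C) = beta_2(B, C) / mu(B)

# Check: the resulting Markov network reproduces the empirical cluster marginals.
joint = np.einsum('ab,bc->abc', phi1, phi2)
joint /= joint.sum()
assert np.allclose(joint.sum(axis=2), beta1)
assert np.allclose(joint.sum(axis=0), beta2)
print("Pseudo-moment matching recovers the empirical cluster marginals.")
```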

12 BP and Maximum Entropy
The maximum-entropy dual of maximum likelihood:
   Find Q(χ) maximizing H_Q(χ)
   subject to E_Q[f_i] = E_D[f_i], i = 1, ..., k.
This is made tractable by approximating H_Q and E_Q:
- H_Q: given a cluster graph U with clusters C_i and sepsets S_{i,j},
   H_Q(χ) ≈ Σ_{C_i ∈ U} H_{β_i}(C_i) − Σ_{(C_i, C_j) ∈ U} H_{μ_{i,j}}(S_{i,j})
- E_Q[f_i]: replaced by E_{β_i}[f_i]
The approximation is exact when the cluster graph is a tree. The method is known as CAMEL (Constrained Approximate Maximum Entropy Learning).

13 Example of Max-Ent Learning
Pairwise Markov network over binary variables A, B, C with three clusters: C_1 = {A,B}, C_2 = {B,C}, C_3 = {C,A}.
Log-linear model with features (for x, y an instance of C_i):
   f_00(x,y) = 1 if x = 0, y = 0, and 0 otherwise
   f_11(x,y) = 1 if x = 1, y = 1, and 0 otherwise
Three data instances of (A,B,C): (0,0,0), (0,1,0), (1,0,0).
Unnormalized feature counts, pooled over all clusters:
   E_P̂[f_00] = (3 + 1 + 1)/3 = 5/3
   E_P̂[f_11] = (0 + 0 + 0)/3 = 0
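A quick brute-force check of these pooled counts (an illustrative sketch; the cluster indexing is the one listed above).

```python
import numpy as np

# The three data instances over (A, B, C) from the slide.
data = [(0, 0, 0), (0, 1, 0), (1, 0, 0)]
clusters = [(0, 1), (1, 2), (2, 0)]   # C1={A,B}, C2={B,C}, C3={C,A}

f00 = lambda x, y: 1.0 if (x, y) == (0, 0) else 0.0
f11 = lambda x, y: 1.0 if (x, y) == (1, 1) else 0.0

# Pool feature counts over all clusters, averaged over the data set.
E_f00 = np.mean([sum(f00(d[i], d[j]) for i, j in clusters) for d in data])
E_f11 = np.mean([sum(f11(d[i], d[j]) for i, j in clusters) for d in data])
print(E_f00, E_f11)   # 5/3 and 0, matching the slide
```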

14 CAMEL Optimization Problem
The optimization problem maximizes the approximate entropy
   Σ_{C_i ∈ U} H_{β_i}(C_i) − Σ_{(C_i, C_j) ∈ U} H_{μ_{i,j}}(S_{i,j})
over the beliefs, with two types of constraints:
- Type 1 constraints: E_Q[f_i] = E_D[f_i], i = 1, ..., k (with E_Q approximated by the cluster beliefs E_{β_i})
- Type 2 constraints: marginal-consistency (calibration) constraints from the cluster-graph approximation

15 CAMEL Solutions
The CAMEL optimization is a constrained maximization problem with linear constraints and a non-concave objective. Several solution algorithms exist; one introduces Lagrange multipliers for all constraints and optimizes over the resulting new variables.

16 Sampling-based Learning
The partition function Z(θ) is a summation over an exponentially large space. Reformulate it as an expectation with respect to a distribution Q(χ):
   Z(θ) = Σ_ξ exp( Σ_i θ_i f_i(ξ) )
        = Σ_ξ Q(ξ) · (1/Q(ξ)) exp( Σ_i θ_i f_i(ξ) )
        = E_Q[ (1/Q(χ)) exp( Σ_i θ_i f_i(χ) ) ]
This is precisely the form of an importance-sampling estimator: we can generate samples from Q and correct by weights. Simplify by choosing Q to be P_{θ0} for some θ0:
   Z(θ) = E_{P_{θ0}}[ Z(θ0) exp( Σ_i θ_i f_i(χ) ) / exp( Σ_i θ0_i f_i(χ) ) ]
        = Z(θ0) E_{P_{θ0}}[ exp( Σ_i (θ_i − θ0_i) f_i(χ) ) ]

17 Importance Sampling
Samples {z^(l)} are drawn from a simpler proposal distribution q(z):
   E[f] = ∫ f(z) p(z) dz = ∫ f(z) (p(z)/q(z)) q(z) dz ≈ (1/L) Σ_{l=1}^L (p(z^(l))/q(z^(l))) f(z^(l))
Unlike rejection sampling, all of the samples are retained. Each sample is weighted by the ratio r_l = p(z^(l))/q(z^(l)), known as an importance weight, which corrects the bias introduced by sampling from the wrong distribution.

18 Samples to Approximate ln Z(θ)
Choose Q to be P_{θ0} for some parameters θ0. Sample instances ξ^1, ..., ξ^K from P_{θ0} to approximate the log-partition function as
   ln Z(θ) ≈ ln Z(θ0) + ln [ (1/K) Σ_{k=1}^K exp( Σ_i (θ_i − θ0_i) f_i(ξ^k) ) ]
Plug this into
   (1/M) ℓ(θ : D) = Σ_i θ_i E_D[f_i(χ)] − ln Z(θ)
and optimize: use a sampling procedure to generate samples from the current parameter set θ^t, then use gradient ascent to find a θ^{t+1} that improves the log-likelihood based on the samples.
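A sketch of this estimator on a toy model where ln Z(θ) can also be computed exactly for comparison. The model, features, θ0, and θ are illustrative assumptions; here the samples from P_{θ0} are drawn exactly by enumeration, whereas in practice they would come from an MCMC procedure such as Gibbs sampling.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
edges = [(0, 1), (1, 2), (2, 0)]            # tiny illustrative model
features = lambda x: np.array([float(x[a] == 1 and x[b] == 1) for a, b in edges])
states = list(itertools.product([0, 1], repeat=3))
F = np.array([features(x) for x in states])  # feature matrix, one row per assignment

def log_Z(theta):
    # Exact log-partition by enumeration: gives ln Z(theta0) and a reference value.
    return np.log(np.exp(F @ theta).sum())

theta0 = np.array([0.2, -0.1, 0.4])          # sampling distribution P_{theta0}
theta  = np.array([1.0,  0.5, -0.3])         # parameters whose ln Z we want

# Draw K samples from P_{theta0} (exact sampling here since the model is tiny).
p0 = np.exp(F @ theta0); p0 /= p0.sum()
idx = rng.choice(len(states), size=5000, p=p0)

# ln Z(theta) ~= ln Z(theta0) + ln (1/K) sum_k exp((theta - theta0) . f(xi_k))
est = log_Z(theta0) + np.log(np.mean(np.exp(F[idx] @ (theta - theta0))))
print("estimate:", est, " exact:", log_Z(theta))
```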

19 MAP-based Learning
Another approach to inference in learning: approximate the expected feature counts with the counts in the single MAP assignment of the current MN. The approximate gradient at θ is
   E_D[f_i(χ)] − f_i(ξ^MAP(θ)),
where ξ^MAP(θ) = argmax_ξ P(ξ | θ) is the MAP assignment given the current set of parameters θ. This approach is also called Viterbi training. It is equivalent to exact optimization of the approximate objective
   (1/M) ℓ(θ : D) − ln P(ξ^MAP(θ) | θ).
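A minimal sketch of Viterbi training on the same kind of toy model, with the MAP assignment found by brute force (in practice a MAP solver would be used); the data and step size are illustrative assumptions.

```python
import itertools
import numpy as np

edges = [(0, 1), (1, 2), (2, 0)]             # tiny illustrative model
features = lambda x: np.array([float(x[a] == 1 and x[b] == 1) for a, b in edges])
states = list(itertools.product([0, 1], repeat=3))

def map_assignment(theta):
    # argmax_xi theta . f(xi); brute force here, a MAP solver in practice.
    return max(states, key=lambda x: theta @ features(x))

D = [(1, 1, 1), (1, 1, 0), (0, 0, 0)]        # illustrative data
E_D = np.mean([features(x) for x in D], axis=0)

theta = np.zeros(3)
for t in range(50):
    grad = E_D - features(map_assignment(theta))   # Viterbi-training gradient
    theta += 0.1 * grad
print("theta after Viterbi training:", theta)
```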

20 Alternative Objectives
A class of approximations is obtained by replacing the objective with a more tractable one. For a single instance ξ,
   ℓ(θ : ξ) = ln P̃(ξ | θ) − ln Z(θ).
Expanding the partition function,
   ℓ(θ : ξ) = ln P̃(ξ | θ) − ln Σ_{ξ'} P̃(ξ' | θ).
Maximizing ℓ increases the distance (contrast) between two terms:
- First term: the log-measure (unnormalized log-probability) of ξ
- Second term: an aggregate measure of all instances, summing over all possible values of the dummy variable ξ'

21 Contrastive Objective
Since the log-measure increases with the parameters, we increase the parameters with positive empirical expectations in ξ and decrease those with negative empirical expectations. The second term, however, balances the first. The key difficulty is that the second term sums over exponentially many instances in Val(χ) and requires inference in the network.
Approach: increase the log-measure of the data instances relative to a more tractable set of other instances, one not requiring summation over an exponential space.

22 Two Approaches to Increasing the Probability Gap
1. Pseudolikelihood and its generalizations: simplify the likelihood by replacing it with a product of local conditional likelihoods
2. Contrastive optimization: contrast the data with a randomly perturbed set of neighbors

23 Pseudolikelihood for Tractability
Consider the likelihood of a single instance ξ, written via the chain rule:
   P(ξ) = Π_{j=1}^n P(x_j | x_1, ..., x_{j-1})
(from the product rule, P(x_1,x_2) = P(x_1) P(x_2|x_1) and P(x_1,x_2,x_3) = P(x_1) P(x_2,x_3|x_1) = P(x_1) P(x_2|x_1) P(x_3|x_1,x_2)).
Approximate it by replacing each term with the conditional probability of x_j given all the other variables:
   P(ξ) ≈ Π_{j=1}^n P(x_j | x_1, ..., x_{j-1}, x_{j+1}, ..., x_n)
This gives the pseudolikelihood objective
   ℓ_PL(θ : D) = (1/M) Σ_m Σ_j ln P(x_j[m] | x_{-j}[m], θ),
where x_{-j} stands for (x_1, ..., x_{j-1}, x_{j+1}, ..., x_n). It eliminates the exponential summation:
   P(x_j | x_{-j}) = P(x_j, x_{-j}) / P(x_{-j}) = P̃(x_j, x_{-j}) / Σ_{x_j'} P̃(x_j', x_{-j})
requires only a summation over X_j.

24 Pseudolikelihood is Concave
The pseudolikelihood objective of a single data instance ξ:
   Σ_j ln P(x_j | x_{-j}) = Σ_j [ ln P̃(x_j, x_{-j}) − ln Σ_{x_j'} P̃(x_j', x_{-j}) ]
                          = Σ_j [ ln P̃(ξ) − ln Σ_{x_j'} P̃(x_j', x_{-j}) ],
where the second term is a local partition function. Each term in the summation simplifies to
   ln P(x_j | x_{-j}) = Σ_{i : X_j ∈ Scope[f_i]} θ_i f_i(x_j, x_{-j}) − ln Σ_{x_j'} exp( Σ_{i : X_j ∈ Scope[f_i]} θ_i f_i(x_j', x_{-j}) ).
Each term is a log-conditional-likelihood term for an MN over a single variable X_j conditioned on the rest, so the function is concave in the parameters θ. Its gradient is
   ∂/∂θ_i ln P(x_j | x_{-j}) = f_i(x_j, x_{-j}) − E_{x_j' ∼ P_θ(X_j | x_{-j})}[ f_i(x_j', x_{-j}) ].
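A sketch of the pseudolikelihood objective and its gradient for a toy binary pairwise model, followed by gradient ascent on ℓ_PL; note that each conditional requires summing only over the two values of X_j. The model, data, and step size are assumptions for illustration.

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 0)]             # tiny illustrative model, binary variables
n = 3
features = lambda x: np.array([float(x[a] == 1 and x[b] == 1) for a, b in edges])

def pseudolikelihood_and_grad(theta, data):
    ll, grad = 0.0, np.zeros_like(theta)
    for x in data:
        for j in range(n):
            # Score every value of X_j with the rest of the instance held fixed.
            variants = []
            for v in (0, 1):
                xv = list(x); xv[j] = v
                variants.append(features(tuple(xv)))
            scores = np.array([theta @ f for f in variants])
            logZ_j = np.logaddexp(scores[0], scores[1])      # local partition function
            cond = np.exp(scores - logZ_j)                   # P_theta(X_j | x_{-j})
            ll += theta @ features(x) - logZ_j
            grad += features(x) - (cond[0] * variants[0] + cond[1] * variants[1])
    M = len(data)
    return ll / M, grad / M

D = [(1, 1, 0), (0, 0, 0), (1, 1, 1), (0, 1, 0)]
theta = np.zeros(3)
for t in range(300):                                         # gradient ascent on l_PL
    _, g = pseudolikelihood_and_grad(theta, D)
    theta += 0.2 * g
ll, _ = pseudolikelihood_and_grad(theta, D)
print("pseudolikelihood:", ll, "theta:", theta)
```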

25 Contrastive Optimization
Likelihood and pseudolikelihood,
   ℓ(θ : ξ) = ln P̃(ξ | θ) − ln Σ_{ξ'} P̃(ξ' | θ),
both attempt to increase the log-probability gap between D and a set of other instances. Based on this intuition, a range of methods has been developed: by driving the probability of the observed data higher relative to other instances, we tune the parameters to predict the data better.

26 Contrastive Optimization Criteria
Aim to maximize the log-probability gap. Consider a single training instance ξ and maximize the contrastive objective
   ln P̃(ξ | θ) − ln P̃(ξ' | θ),
where ξ' is some other instance. This takes the simple form
   θᵀ [ f(ξ) − f(ξ') ],
which is a linear function of θ and hence unbounded, so the choice of ξ' has to change throughout the optimization. Two methods for choosing ξ':
1. Contrastive divergence
2. Margin-based training

27 Contrastive Divergence
The popularity of this method has grown; it is used in deep learning for training layers of RBMs. Contrast the data instances D with a set of randomly perturbed neighbors D⁻, and maximize
   ℓ_CD(θ : D ∥ D⁻) = E_{ξ∼P̃_D}[ ln P̃_θ(ξ) ] − E_{ξ∼P̃_{D⁻}}[ ln P̃_θ(ξ) ],
where P̃_D and P̃_{D⁻} are the empirical distributions relative to D and D⁻. We want the model to give high probability to the instances in D relative to the perturbed instances in D⁻.

28 Generating D⁻ for Contrastive Divergence
The contrasted instances D⁻ differ at different stages of the search. Given the current parameters θ, generate samples D⁻ from P_θ using Gibbs sampling: initialize from the instances in D and run the Markov chain only a few steps to define D⁻. The gradient of the objective is
   ∂/∂θ_i ℓ_CD(θ : D ∥ D⁻) = E_{P̃_D}[ f_i(χ) ] − E_{P̃_{D⁻}}[ f_i(χ) ].
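A minimal CD-1 sketch for a toy binary pairwise model: each negative instance in D⁻ is produced by one Gibbs sweep started from a data instance, and the parameters follow the gradient E_{P̃_D}[f_i] − E_{P̃_{D⁻}}[f_i]. The model, data, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
edges = [(0, 1), (1, 2), (2, 0)]             # tiny illustrative model, binary variables
n = 3
features = lambda x: np.array([float(x[a] == 1 and x[b] == 1) for a, b in edges])

def gibbs_steps(x, theta, steps):
    # A few sweeps of Gibbs sampling starting from a data instance.
    x = list(x)
    for _ in range(steps):
        for j in range(n):
            scores = []
            for v in (0, 1):
                x[j] = v
                scores.append(theta @ features(tuple(x)))
            p1 = 1.0 / (1.0 + np.exp(scores[0] - scores[1]))  # P(X_j = 1 | rest)
            x[j] = int(rng.random() < p1)
    return tuple(x)

D = [(1, 1, 0), (1, 1, 1), (0, 0, 0), (1, 0, 0)]
E_D = np.mean([features(x) for x in D], axis=0)

theta = np.zeros(3)
for t in range(200):
    # D-: perturbed neighbors of D, obtained by a short Gibbs chain from each instance.
    D_neg = [gibbs_steps(x, theta, steps=1) for x in D]
    grad = E_D - np.mean([features(x) for x in D_neg], axis=0)
    theta += 0.1 * grad
print("theta after CD-1:", theta)
```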

29 Margin-Based Training
When the goal is MAP assignment, an SVM-based method can be used. The training set consists of pairs D = {(y[m], x[m])}_{m=1}^M. Given an observation x[m], we would like the learned model to give the highest probability to y[m], so we maximize the margin
   ln P_θ(y[m] | x[m]) − max_{y ≠ y[m]} ln P_θ(y | x[m]),
the difference between the log-probability of the target assignment y[m] and that of the next-best assignment.
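A rough sketch of the margin idea for a tiny conditional model with two binary labels. Since the log-partition function of the conditional distribution cancels in the difference, the margin reduces to θᵀ[f(y[m], x[m]) − f(y, x[m])]; the perceptron-style hinge update below is only an illustrative stand-in for a full structured-SVM formulation, and all features and data here are assumptions.

```python
import itertools
import numpy as np

# Conditional model over binary labels y = (y1, y2) given a scalar feature x.
def features(y, x):
    # Assumed joint features: a label-label interaction and label-observation terms.
    return np.array([float(y[0] == y[1]), y[0] * x, y[1] * x])

D = [((1, 1), 2.0), ((0, 0), -1.5), ((1, 0), 0.5)]   # assumed pairs (y[m], x[m])
labels = list(itertools.product([0, 1], repeat=2))

theta = np.zeros(3)
for t in range(100):
    for y_m, x_m in D:
        # Margin = score of target minus score of the best competing assignment;
        # the log-partition ln Z(x) cancels from the difference.
        best_other = max((y for y in labels if y != y_m),
                         key=lambda y: theta @ features(y, x_m))
        delta = features(y_m, x_m) - features(best_other, x_m)
        if theta @ delta < 1.0:               # hinge-style update (a sketch)
            theta += 0.1 * delta
print("theta:", theta)
```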
