Learning MN Parameters with Approximation. Sargur Srihari
Topics

- Iterative exact learning of MN parameters
- Difficulty with exact methods
- Approximate methods
  - Approximate inference: belief propagation, MAP-based learning
  - Alternative objective functions: pseudo-likelihood, contrastive divergence, margin-based training
Exact Learning of MN Parameters θ

The exact iterative learning loop:

1. Initialize θ.
2. Run inference to compute Z(θ) and the expected feature counts, where
   P_θ(χ) = (1/Z(θ)) exp( Σ_{i=1}^k θ_i f_i(χ) ),
   Z(θ) = Σ_ξ exp( Σ_i θ_i f_i(ξ) ),
   E_θ[f_i] = (1/Z(θ)) Σ_ξ f_i(ξ) exp( Σ_j θ_j f_j(ξ) ).
3. Compute the gradient of the log-likelihood l:
   ∂l(θ; D)/∂θ_i = E_D[f_i(χ)] − E_θ[f_i].
4. Update: θ^(t+1) ← θ^t + η ∇l(θ^t; D).
5. If no optimum is reached, repeat; stop when ||θ^(t+1) − θ^t|| ≤ δ.
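The loop above can be sketched in a few lines for a toy model where exact inference is just enumeration. The three pairwise features and the four data instances below are made up for illustration; the step size and iteration count are arbitrary.

```python
import itertools, math

# Hypothetical toy model: 3 binary variables, one feature per variable pair;
# f_i(x) = 1 when both variables of the pair are 1. Data is illustrative.
pairs = [(0, 1), (1, 2), (0, 2)]
data = [(1, 1, 0), (1, 1, 1), (0, 1, 1), (1, 0, 1)]

def feats(x):
    return [1.0 if x[a] == 1 and x[b] == 1 else 0.0 for a, b in pairs]

def log_Z_and_expectations(theta):
    # Exact inference by brute-force enumeration over all 2^3 assignments
    Z, exp_f = 0.0, [0.0] * len(pairs)
    for x in itertools.product([0, 1], repeat=3):
        w = math.exp(sum(t * f for t, f in zip(theta, feats(x))))
        Z += w
        for i, f in enumerate(feats(x)):
            exp_f[i] += w * f
    return math.log(Z), [e / Z for e in exp_f]

emp = [sum(feats(x)[i] for x in data) / len(data) for i in range(len(pairs))]
theta = [0.0] * len(pairs)
for _ in range(200):                     # gradient ascent on the log-likelihood
    logZ, exp_f = log_Z_and_expectations(theta)
    grad = [e_d - e_t for e_d, e_t in zip(emp, exp_f)]
    theta = [t + 0.5 * g for t, g in zip(theta, grad)]

logZ, exp_f = log_Z_and_expectations(theta)
print([round(e, 3) for e in exp_f])      # each entry ≈ 0.5, the empirical count
```

At convergence the model's expected feature counts match the empirical counts, which is exactly the zero-gradient condition E_D[f_i] = E_θ[f_i].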
Difficulty with Exact Methods

- Exact parameter-estimation methods assume the ability to compute
  1. the partition function Z(θ), and
  2. the expectations E_{P_θ}[f_i]
- In many applications the network structure does not allow exact computation of these terms
- In image segmentation, grid networks lead to exponential-size clusters for exact inference (the cluster graph is a clique tree with overlapping factors)
Approximate Methods for Learning θ

1. Use approximate inference for queries of P_θ
   - Decouples inference from learning parameters: inference is a black box
   - But the approximation may interfere with learning: non-convergence of inference can lead to oscillating estimates of the gradient and no learning convergence
2. Use an approximate objective function
   - One whose optimization doesn't require much inference
   - Approximately optimizing the likelihood function can often be reformulated as exactly optimizing an approximate objective
Approximate Methods (Taxonomy)

1. Approximate inference methods
   1. Belief propagation
   2. MAP-based learning
2. Alternative objective functions
   1. Pseudo-likelihood and its generalizations
   2. Contrastive optimization criteria
      1. Contrastive divergence
      2. Margin-based training
Approximate Inference: Belief Propagation

- A popular approach to approximate inference is belief propagation
- Given a model resulting from a learning procedure, an algorithm from this family is used to answer queries
- A model trained with the same approximate inference algorithm used at query time often performs better than a model trained with exact inference!
- BP is run within each iteration of gradient ascent to compute the expected feature counts E_{P_θ}[f_i]
Example: Direct (Gibbs) vs BP (Clique Tree)

1. Gibbs distribution over the loop A–B–C–D with factors φ_1(A,B), φ_2(B,C), φ_3(C,D), φ_4(D,A):
   P̃_Φ(A,B,C,D) = φ_1(A,B) φ_2(B,C) φ_3(C,D) φ_4(D,A)
   P(A,B,C,D) = (1/Z) P̃_Φ(A,B,C,D), where Z = Σ_{A,B,C,D} φ_1(A,B) φ_2(B,C) φ_3(C,D) φ_4(D,A) = 7,201,840

2. Clique tree (after triangulation): clusters C_1 = {A,B,D} and C_2 = {B,C,D} with sepset S_{1,2} = {B,D}. Each initial potential ψ collects every factor involving its arguments:
   ψ_1(A,B,D) = φ_1(A,B) φ_4(D,A),  ψ_2(B,C,D) = φ_2(B,C) φ_3(C,D)

Computing clique beliefs β_i and sepset beliefs μ_{i,j}:
   β_1(A,B,D) = P̃_Φ(A,B,D) = Σ_C φ_1(A,B) φ_2(B,C) φ_3(C,D) φ_4(D,A)
     e.g., β_1(a0,b0,d0) = 300,000 + 300,000 = 600,000
   μ_{1,2}(B,D) = Σ_{C_1 − S_{1,2}} β_1(A,B,D) = Σ_A β_1(A,B,D)
     e.g., μ_{1,2}(b0,d0) = 600,000 + 200 = 600,200
   β_2(B,C,D) = P̃_Φ(B,C,D) = Σ_A φ_1(A,B) φ_2(B,C) φ_3(C,D) φ_4(D,A)
     e.g., β_2(b0,c0,d0) = 300,000 + 100 = 300,100

Verifying inference from the calibrated beliefs:
   P̃_Φ(a1,b0,c1,d0) = β_1(a1,b0,d0) β_2(b0,c1,d0) / μ_{1,2}(b0,d0) = 100

(The slide's full tables of all clique and sepset beliefs for every assignment are omitted here.)
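The slide's numbers can be reproduced directly. The factor tables below are an assumption: they are the standard "Misconception" example factors (Koller & Friedman), which yield exactly the values quoted in the slide (Z = 7,201,840, β_1(a0,b0,d0) = 600,000, and so on).

```python
import itertools

# Assumed factor tables for the loop A-B-C-D (standard Misconception example)
phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}       # phi1(A,B)
phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}     # phi2(B,C)
phi3 = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}     # phi3(C,D)
phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}     # phi4(D,A)

def p_tilde(a, b, c, d):                                    # unnormalized measure
    return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]

Z = sum(p_tilde(*x) for x in itertools.product([0, 1], repeat=4))

# Clique beliefs and sepset belief for C1={A,B,D}, C2={B,C,D}, S12={B,D}
beta1 = {(a, b, d): sum(p_tilde(a, b, c, d) for c in [0, 1])
         for a, b, d in itertools.product([0, 1], repeat=3)}
beta2 = {(b, c, d): sum(p_tilde(a, b, c, d) for a in [0, 1])
         for b, c, d in itertools.product([0, 1], repeat=3)}
mu12 = {(b, d): sum(beta1[a, b, d] for a in [0, 1])
        for b, d in itertools.product([0, 1], repeat=2)}

print(Z, beta1[0, 0, 0], mu12[0, 0], beta2[0, 0, 0])
# The calibrated beliefs reparameterize the distribution:
assert beta1[1, 0, 0] * beta2[0, 1, 0] / mu12[0, 0] == p_tilde(1, 0, 1, 0)
```

The final assertion is the slide's verification step: the product of clique beliefs divided by the sepset belief recovers the unnormalized measure, here 100 for (a1, b0, c1, d0).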
Difficulty with Belief Propagation

- In principle, BP is run in each gradient-ascent iteration to compute the E_{P_θ}[f_i] used in the gradient computation
- By the family-preservation property, each feature must be a subset of some cluster C_i in the cluster graph; hence to compute E_{P_θ}[f_i] we can use the BP marginals over C_i
- But BP often does not converge
  - The derived marginals oscillate, and the final results depend on where we stop
- As a result the computed gradient is unstable, which hurts the convergence properties of gradient ascent
  - The problem is even more severe with line search
Convergent Alternatives to Belief Propagation

Three methods:
1. Pseudo-moment matching: reformulate the task of learning with approximate inference as optimizing an alternative objective
2. Maximum-entropy approximation (CAMEL): a general derivation that reformulates maximum likelihood with BP as a unified optimization problem with an approximate objective
3. MAP-based learning: approximate the expected feature counts with their counts in the single MAP assignment of the current MN
Pseudo-moment Matching

- Begin with an analysis of the fixed points of learning
- Converged BP beliefs must satisfy
  β_i(c_i) = P̂(c_i), or equivalently E_{β_i(C_i)}[f_{C_i}] = E_D[f_{C_i}]
  i.e., the convergence point is a set of beliefs that match the data
- We can then read off potentials in closed form from the calibrated beliefs, dividing each sepset belief for S_{i,j} between C_i and C_j into one of its adjacent clusters, e.g.
  φ_i = β_i / μ_{i,j}
- We use this final set of potentials as the parameterization of the Markov network
- Provides a closed-form solution for both inference and learning
- Cannot be used with parameter regularization, non-table factors, or CRFs
- Example cluster tree: C_1 = {A,B,D}, S_{1,2} = {B,D}, C_2 = {B,C,D}
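A minimal sketch of the closed-form construction, on the two-cluster tree above: the beliefs are just empirical marginals, and the potentials are β_1 and β_2/μ_{1,2}. The dataset is made up, and deliberately chosen so that the empirical distribution factorizes over this clique tree (here C is a deterministic copy of B), so the product of potentials reproduces it exactly.

```python
from collections import Counter
import itertools

# Hypothetical dataset over (A,B,C,D); chosen so the empirical distribution
# factorizes over the clique tree C1={A,B,D} -- {B,D} -- C2={B,C,D}
data = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 0),
        (0, 1, 1, 1), (1, 0, 0, 1), (1, 1, 1, 1)]
N = len(data)

def marg(idx):                       # empirical marginal over the given indices
    c = Counter(tuple(x[i] for i in idx) for x in data)
    return {k: v / N for k, v in c.items()}

beta1 = marg((0, 1, 3))              # beliefs = empirical marginals P^(A,B,D)
beta2 = marg((1, 2, 3))              # P^(B,C,D)
mu12 = marg((1, 3))                  # sepset belief P^(B,D)

# Pseudo-moment matching: use the calibrated beliefs directly as potentials,
# dividing the sepset belief into one of its adjacent clusters
phi1 = beta1
phi2 = {k: beta2[k] / mu12[k[0], k[2]] for k in beta2}

emp = marg((0, 1, 2, 3))
for a, b, c, d in itertools.product([0, 1], repeat=4):
    q = phi1.get((a, b, d), 0.0) * phi2.get((b, c, d), 0.0)
    assert abs(q - emp.get((a, b, c, d), 0.0)) < 1e-12
print("closed-form potentials reproduce the empirical distribution")
```

On real data the empirical distribution generally does not factorize over the tree, and the same construction then yields the maximum-likelihood projection onto the model family rather than an exact match.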
BP and Maximum Entropy

- Maximum-entropy dual of maximum likelihood:
  find Q(χ) maximizing H_Q(χ), subject to E_Q[f_i] = E_D[f_i], i = 1,…,k
- Made tractable by approximating H_Q and E_Q:
  - H_Q: given a cluster graph U with clusters C_i and sepsets S_{i,j},
    H_Q(χ) ≈ Σ_{C_i ∈ U} H_{β_i}(C_i) − Σ_{(C_i—C_j) ∈ U} H_{μ_{i,j}}(S_{i,j})
  - E_Q[f_i]: replace by E_{β_i}[f_i]
- The approximation is exact when the cluster graph is a tree
- The method is known as CAMEL (Constrained Approximate Maximum Entropy Learning)
Example of Max-Ent Learning

- Pairwise Markov network over A–B–C; variables are binary
- Three clusters: C_1 = {A,B}, C_2 = {B,C}, C_3 = {C,A}
- Log-linear model with features (for x,y an instance of C_i):
  f_00(x,y) = 1 if x = 0, y = 0, and 0 otherwise
  f_11(x,y) = 1 if x = 1, y = 1, and 0 otherwise
- Three data instances (A,B,C): (0,0,0), (0,1,0), (1,0,0)
- Feature counts, pooled over all clusters, are
  E_P̂[f_00] = (3 + 1 + 1)/3 = 5/3
  E_P̂[f_11] = (0 + 0 + 0)/3 = 0
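The pooled counts in the slide can be checked directly; the snippet below just counts, per instance, in how many of the three clusters both variables take the given value, then averages over the instances.

```python
# The slide's pooled feature counts, computed directly from its data
clusters = [(0, 1), (1, 2), (2, 0)]            # C1={A,B}, C2={B,C}, C3={C,A}
data = [(0, 0, 0), (0, 1, 0), (1, 0, 0)]

def pooled_count(data, value):
    # f_vv fires when both variables of a cluster equal `value`; counts are
    # pooled over the three clusters and averaged over the data instances
    total = sum(1 for x in data for i, j in clusters
                if x[i] == value and x[j] == value)
    return total / len(data)

print(pooled_count(data, 0), pooled_count(data, 1))   # 5/3 and 0.0
```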
CAMEL Optimization Problem

The optimization problem takes the form of maximizing the approximate entropy subject to two types of constraints:
- Type 1 (moment matching): E_Q[f_i] = E_D[f_i], i = 1,…,k
- Type 2 (calibration): the beliefs of the cluster-graph approximation must be locally consistent, i.e., neighboring cluster beliefs agree on their sepset marginals
CAMEL Solutions

- CAMEL optimization is a constrained maximization problem with linear constraints and a non-concave objective
- There are several solution algorithms; one introduces Lagrange multipliers for all constraints and optimizes over the resulting new variables
Sampling-based Learning

- The partition function Z(θ) is a summation over an exponentially large space
- Reformulate it as an expectation w.r.t. a distribution Q(χ):
  Z(θ) = Σ_ξ exp( Σ_i θ_i f_i(ξ) ) = Σ_ξ Q(ξ) · (1/Q(ξ)) exp( Σ_i θ_i f_i(ξ) )
       = E_Q[ (1/Q(χ)) exp( Σ_i θ_i f_i(χ) ) ]
- This is precisely the form of an importance-sampling estimator: we can generate samples from Q and correct by weights
- Simplify by choosing Q to be P_{θ0} for some θ0:
  Z(θ) = E_{P_{θ0}}[ Z(θ0) exp( Σ_i θ_i f_i(χ) ) / exp( Σ_i θ0_i f_i(χ) ) ]
       = Z(θ0) E_{P_{θ0}}[ exp( Σ_i (θ_i − θ0_i) f_i(χ) ) ]
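The identity Z(θ) = Z(θ0) E_{P_{θ0}}[exp(Σ_i (θ_i − θ0_i) f_i)] can be sanity-checked on a toy model small enough that Z is also computable exactly. The features and parameter values below are made up; the model is small enough that we can sample from P_{θ0} by enumeration (in practice one would use MCMC).

```python
import itertools, math, random

random.seed(0)

# Toy log-linear model over 2 binary variables with hypothetical features
feats = lambda x: [float(x[0]), float(x[1]), float(x[0] * x[1])]

def Z(theta):   # exact partition function by enumeration (tiny model only)
    return sum(math.exp(sum(t * f for t, f in zip(theta, feats(x))))
               for x in itertools.product([0, 1], repeat=2))

theta0 = [0.2, -0.1, 0.5]
theta = [1.0, 0.4, -0.3]

# Draw samples from P_theta0 (exactly, since the space is tiny)
space = list(itertools.product([0, 1], repeat=2))
w0 = [math.exp(sum(t * f for t, f in zip(theta0, feats(x)))) for x in space]
samples = random.choices(space, weights=w0, k=20000)

# Z(theta) = Z(theta0) * E_{P_theta0}[ exp( sum_i (theta_i - theta0_i) f_i ) ]
est = Z(theta0) * sum(
    math.exp(sum((t - t0) * f for t, t0, f in zip(theta, theta0, feats(x))))
    for x in samples) / len(samples)

print(round(est, 2), round(Z(theta), 2))   # estimate ≈ exact value
```

The estimator is accurate here because θ is close to θ0; as the two parameter vectors drift apart, the importance weights degenerate and far more samples are needed.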
Importance Sampling

- Samples {z^(l)} are drawn from a simpler proposal distribution q(z):
  E[f] = ∫ f(z) p(z) dz = ∫ f(z) (p(z)/q(z)) q(z) dz ≈ (1/L) Σ_{l=1}^L (p(z^(l))/q(z^(l))) f(z^(l))
- Unlike rejection sampling, all of the samples are retained
- Samples are weighted by the ratios r_l = p(z^(l))/q(z^(l)), known as importance weights, which correct the bias introduced by sampling from the wrong distribution
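As a minimal sketch of the estimator above: the target p, proposal q, and test function f(z) = z below are illustrative choices (p = N(2,1), q = N(0,2)); the weighted average recovers E_p[z] = 2 even though all samples come from q.

```python
import math, random

random.seed(1)

# Importance sampling sketch: estimate E_p[f] using samples from a proposal q.
# Here p = N(2, 1), q = N(0, 2), f(z) = z; all choices are illustrative.
p = lambda z: math.exp(-0.5 * (z - 2.0) ** 2) / math.sqrt(2 * math.pi)
q = lambda z: math.exp(-0.5 * (z / 2.0) ** 2) / (2.0 * math.sqrt(2 * math.pi))

zs = [random.gauss(0.0, 2.0) for _ in range(100000)]
weights = [p(z) / q(z) for z in zs]              # importance weights r_l
est = sum(w * z for w, z in zip(weights, zs)) / len(zs)
print(round(est, 2))                             # ≈ 2.0, the true E_p[z]
```

Note the proposal has heavier tails than the target (σ = 2 vs 1), which keeps the importance weights bounded; a proposal with lighter tails than p would make the estimator's variance blow up.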
Samples to Approximate ln Z(θ)

- Choose Q to be P_{θ0} for some parameters θ0
- Sample instances ξ_1,…,ξ_K from P_{θ0} to approximate the log-partition function as
  ln Z(θ) ≈ ln Z(θ0) + ln[ (1/K) Σ_{k=1}^K exp( Σ_i (θ_i − θ0_i) f_i(ξ_k) ) ]
- Plug this into
  (1/M) l(θ : D) = Σ_i θ_i E_D[f_i] − ln Z(θ)
  and optimize
- Use a sampling procedure to generate samples from the current parameter set θ^t, then use gradient ascent to find a θ^(t+1) that improves the log-likelihood based on the samples
MAP-based Learning

- Another approach to inference within learning: approximate the expected feature counts with the counts in the single MAP assignment of the current MN
- The approximate gradient at θ is
  E_D[f_i(χ)] − f_i(ξ_MAP(θ))
  where ξ_MAP(θ) = arg max_ξ P(ξ | θ) is the MAP assignment given the current set of parameters θ
- This approach is also called Viterbi training
- It is equivalent to exact optimization of the approximate objective
  (1/M) Σ_m ln P̃(ξ[m] | θ) − ln P̃(ξ_MAP(θ) | θ)
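One Viterbi-style gradient step can be sketched as follows. The agreement features, data, and parameters are made up; the MAP assignment is found by enumeration here, though the point of the method is that in practice MAP inference is often much cheaper than computing expectations.

```python
import itertools

# Viterbi approximation to the gradient: replace E_theta[f_i] by the feature
# counts of the single MAP assignment (toy model, made-up data)
pairs = [(0, 1), (1, 2)]
feats = lambda x: [1.0 if x[a] == x[b] else 0.0 for a, b in pairs]  # agreement
data = [(0, 0, 0), (1, 1, 1), (0, 0, 1)]
theta = [0.8, -0.2]

def score(x):                        # log of the unnormalized measure
    return sum(t * f for t, f in zip(theta, feats(x)))

xmap = max(itertools.product([0, 1], repeat=3), key=score)  # MAP by enumeration
emp = [sum(feats(x)[i] for x in data) / len(data) for i in range(len(pairs))]
grad = [e - f for e, f in zip(emp, feats(xmap))]
print(xmap, grad)                    # approximate gradient E_D[f] - f(xi_MAP)
```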
Alternative Objectives

- A class of approximations is obtained by replacing the objective with a more tractable one
- In the case of a single instance ξ:
  l(θ : ξ) = ln P̃(ξ | θ) − ln Z(θ)
- Expanding the partition function:
  l(θ : ξ) = ln P̃(ξ | θ) − ln Σ_{ξ'} P̃(ξ' | θ)
- Maximizing l amounts to increasing the distance (contrast) between the two terms
  - First term: the unnormalized log-measure of ξ
  - Second term: an aggregate measure of all instances, summing over all possible values of the dummy variable ξ'
Contrastive Objective

- Since the log-measure increases with the parameters, we increase parameters with positive empirical expectations in ξ and decrease parameters with negative empirical expectations
- However, the second term balances the first
- Key difficulty: the second term sums over exponentially many instances in Val(χ) and requires inference in the network
- Approach: increase the log-measure of the data instances relative to a more tractable set of other instances, one not requiring summation over an exponential space
Two Approaches to Increasing the Probability Gap

1. Pseudolikelihood and its generalizations: simplify the likelihood into per-variable conditional likelihoods
2. Contrastive optimization: contrast the data with a randomly perturbed set of neighbors
Pseudolikelihood for Tractability

- Consider the likelihood of a single instance ξ. By the product (chain) rule, e.g. P(x_1,x_2) = P(x_1) P(x_2 | x_1) and P(x_1,x_2,x_3) = P(x_1) P(x_2,x_3 | x_1) = P(x_1) P(x_2 | x_1) P(x_3 | x_1,x_2):
  P(ξ) = Π_{j=1}^n P(x_j | x_1,…,x_{j−1})
- Approximate by replacing each term with the conditional probability of x_j given all the other variables:
  P(ξ) ≈ Π_{j=1}^n P(x_j | x_1,…,x_{j−1}, x_{j+1},…,x_n) = Π_j P(x_j | x_{−j})
  where x_{−j} stands for x_1,…,x_{j−1}, x_{j+1},…,x_n
- This gives the pseudolikelihood objective
  l_PL(θ : D) = (1/M) Σ_m Σ_j ln P(x_j[m] | x_{−j}[m], θ)
- which eliminates the exponential summation:
  P(x_j | x_{−j}) = P(x_j, x_{−j}) / P(x_{−j}) = P̃(x_j, x_{−j}) / Σ_{x_j'} P̃(x_j', x_{−j})
  requires only a summation over X_j
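A minimal sketch of the objective, on a made-up pairwise model: each conditional P(x_j | x_{−j}) is computed from the unnormalized measure with a sum over just the two values of X_j, never the full partition function.

```python
import math

# Pseudolikelihood for a toy pairwise model: each conditional needs only a
# sum over the single variable X_j, never the full partition function.
# The features, parameters, and data are made up for illustration.
pairs = [(0, 1), (1, 2), (0, 2)]
theta = [0.5, -0.3, 0.2]
data = [(1, 1, 0), (0, 1, 1), (1, 0, 1)]

def log_p_tilde(x):                  # log of the unnormalized measure
    return sum(t for t, (a, b) in zip(theta, pairs) if x[a] == 1 and x[b] == 1)

def log_cond(x, j):                  # ln P(x_j | x_-j) via a 2-term sum
    terms = [log_p_tilde(x[:j] + (v,) + x[j + 1:]) for v in (0, 1)]
    return log_p_tilde(x) - math.log(sum(math.exp(t) for t in terms))

pll = sum(log_cond(x, j) for x in data for j in range(3)) / len(data)
print(round(pll, 4))                 # the pseudolikelihood objective l_PL
```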
Pseudolikelihood Is Concave

- Pseudolikelihood objective for a single data instance ξ:
  Σ_j ln P(x_j | x_{−j}) = Σ_j [ ln P̃(x_j, x_{−j}) − ln Σ_{x_j'} P̃(x_j', x_{−j}) ]
                        = Σ_j [ ln P̃(ξ) − ln Σ_{x_j'} P̃(x_j', x_{−j}) ]
  (the second term plays the role of a partition function)
- Each term in the summation simplifies to
  ln P(x_j | x_{−j}) = Σ_{i : X_j ∈ Scope[f_i]} θ_i f_i(x_j, x_{−j}) − ln Σ_{x_j'} exp( Σ_{i : X_j ∈ Scope[f_i]} θ_i f_i(x_j', x_{−j}) )
- Each term is a log-conditional-likelihood term for a MN over a single variable X_j conditioned on the rest, so the function is concave in the parameters θ
- The gradient is
  ∂/∂θ_i ln P(x_j | x_{−j}) = f_i(x_j, x_{−j}) − E_{x_j' ~ P_θ(X_j | x_{−j})}[ f_i(x_j', x_{−j}) ]
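The gradient formula above can be checked numerically: the sketch below (toy pairwise features, made-up instance and parameters) compares it against finite differences of the pseudolikelihood objective.

```python
import math

# Checking the pseudolikelihood gradient formula against finite differences
# (toy model; the pairwise features, data point, and parameters are made up)
pairs = [(0, 1), (1, 2)]
feats = lambda x: [1.0 if x[a] == 1 and x[b] == 1 else 0.0 for a, b in pairs]
x = (1, 1, 0)

def log_pl(theta):
    total = 0.0
    for j in range(3):
        variants = [x[:j] + (v,) + x[j + 1:] for v in (0, 1)]
        scores = [sum(t * f for t, f in zip(theta, feats(y))) for y in variants]
        total += scores[x[j]] - math.log(sum(math.exp(s) for s in scores))
    return total

def grad_pl(theta):
    # grad_i = sum_j [ f_i(x) - E_{x_j' ~ P(X_j | x_-j)} f_i(x_j', x_-j) ]
    g = [0.0] * len(theta)
    for j in range(3):
        variants = [x[:j] + (v,) + x[j + 1:] for v in (0, 1)]
        scores = [sum(t * f for t, f in zip(theta, feats(y))) for y in variants]
        zj = sum(math.exp(s) for s in scores)
        probs = [math.exp(s) / zj for s in scores]
        for i in range(len(theta)):
            g[i] += feats(x)[i] - sum(p * feats(y)[i]
                                      for p, y in zip(probs, variants))
    return g

theta, eps = [0.3, -0.7], 1e-6
num = [(log_pl([t + eps * (k == i) for k, t in enumerate(theta)])
        - log_pl(theta)) / eps for i in range(len(theta))]
print([round(a - b, 4) for a, b in zip(grad_pl(theta), num)])  # both entries ≈ 0
```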
Contrastive Optimization

- Likelihood (and pseudolikelihood):
  l(θ : ξ) = ln P̃(ξ | θ) − ln Σ_{ξ'} P̃(ξ' | θ)
- Both attempt to increase the log-probability gap between the data D and a set of other instances
- Based on this intuition, a range of methods has been developed: by driving the probability of the observed data higher relative to other instances, we tune the parameters to predict the data better
Contrastive Optimization Criteria

- Aim to maximize the log-probability gap
- Consider a single training instance ξ, and maximize the contrastive objective
  ln P̃(ξ | θ) − ln P̃(ξ' | θ)
  where ξ' is some other instance
- This takes the simple form θ^T [f(ξ) − f(ξ')], which is a linear function of θ and hence unbounded, so the choice of ξ' has to change throughout the optimization
- Two methods for choosing ξ':
  1. Contrastive divergence
  2. Margin-based training
Contrastive Divergence

- The popularity of this method has grown; it is used in deep learning for training layers of RBMs
- Contrast the data instances D with a set of randomly perturbed neighbors D⁻, maximizing
  l_CD(θ : D ∥ D⁻) = E_{ξ ~ P̃_D}[ ln P̃_θ(ξ) ] − E_{ξ ~ P̃_{D⁻}}[ ln P̃_θ(ξ) ]
  where P̃_D and P̃_{D⁻} are the empirical distributions relative to D and D⁻
- We want the model to give high probability to the instances in D relative to the perturbed instances in D⁻
Generating D⁻ for Contrastive Divergence

- The contrasted instances D⁻ differ at different stages of the search
- Given the current parameters θ, generate samples D⁻ from P_θ using Gibbs sampling
  - Initialize from the instances in D, and run the Markov chain only a few steps to define D⁻
- The gradient of the objective is
  ∂/∂θ_i l_CD(θ : D ∥ D⁻) = E_{P̃_D}[ f_i(χ) ] − E_{P̃_{D⁻}}[ f_i(χ) ]
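One CD-1 gradient estimate can be sketched as follows: start a Gibbs chain at each data instance, run a single sweep to obtain D⁻, and take the difference of empirical feature counts. The toy pairwise model, data, and parameters are made up for illustration.

```python
import math, random

random.seed(0)

# One contrastive-divergence gradient estimate (CD-1) for a toy pairwise model.
# Negative samples D- come from a single Gibbs sweep started at the data.
pairs = [(0, 1), (1, 2), (0, 2)]
feats = lambda x: [1.0 if x[a] == 1 and x[b] == 1 else 0.0 for a, b in pairs]
data = [(1, 1, 1), (1, 1, 0), (0, 1, 1), (1, 0, 1)]
theta = [0.5, 0.5, 0.5]

def gibbs_sweep(x):
    x = list(x)
    for j in range(3):               # resample each variable given the rest
        scores = []
        for v in (0, 1):
            x[j] = v
            scores.append(math.exp(sum(t * f for t, f in zip(theta, feats(x)))))
        p1 = scores[1] / (scores[0] + scores[1])
        x[j] = 1 if random.random() < p1 else 0
    return tuple(x)

neg = [gibbs_sweep(x) for x in data]         # D-, the perturbed neighbors
mean = lambda S, i: sum(feats(x)[i] for x in S) / len(S)
grad = [mean(data, i) - mean(neg, i) for i in range(len(pairs))]
print(grad)                                  # E_D[f_i] - E_{D-}[f_i]
```

Running the chain longer (CD-k with larger k) moves D⁻ closer to samples from P_θ, trading computation for a less biased gradient.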
Margin-Based Training

- When the goal is a MAP assignment, an SVM-based method can be used
- The training set consists of pairs D = {(y[m], x[m])}_{m=1}^M
- Given observation x[m], we would like the learned model to give the highest probability to y[m]
- Maximize the margin
  ln P_θ(y[m] | x[m]) − max_{y ≠ y[m]} ln P_θ(y | x[m])
  the difference between the log-probability of the target assignment y[m] and that of the next-best assignment
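The margin computation can be sketched for a tiny conditional model: score the target labeling, score the best competing labeling (by enumeration here), and take the gap. The features, data point, and parameters below are all made up.

```python
import itertools

# Margin sketch: the gap between the score of the target labeling y[m] and
# the best competing labeling (toy conditional model; all values made up)
def score(y, x, theta):
    # label-observation agreement plus pairwise agreement between labels
    node = sum(theta[0] for j in range(3) if y[j] == x[j])
    edge = sum(theta[1] for j in range(2) if y[j] == y[j + 1])
    return node + edge

x, y_true = (1, 0, 1), (1, 0, 1)
theta = [1.0, 0.2]

labelings = list(itertools.product([0, 1], repeat=3))
best_other = max((y for y in labelings if y != y_true),
                 key=lambda y: score(y, x, theta))
margin = score(y_true, x, theta) - score(best_other, x, theta)
print(best_other, margin)            # positive margin: target beats runner-up
```

Margin-based training maximizes this gap (typically in a regularized hinge-loss form), and only requires finding the best competing assignment, i.e., MAP-style inference, rather than computing a partition function.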
More informationChapter 20. Deep Generative Models
Peng et al.: Deep Learning and Practice 1 Chapter 20 Deep Generative Models Peng et al.: Deep Learning and Practice 2 Generative Models Models that are able to Provide an estimate of the probability distribution
More informationRepresentation of undirected GM. Kayhan Batmanghelich
Representation of undirected GM Kayhan Batmanghelich Review Review: Directed Graphical Model Represent distribution of the form ny p(x 1,,X n = p(x i (X i i=1 Factorizes in terms of local conditional probabilities
More informationTraining an RBM: Contrastive Divergence. Sargur N. Srihari
Training an RBM: Contrastive Divergence Sargur N. srihari@cedar.buffalo.edu Topics in Partition Function Definition of Partition Function 1. The log-likelihood gradient 2. Stochastic axiu likelihood and
More informationFrom Bayesian Networks to Markov Networks. Sargur Srihari
From Bayesian Networks to Markov Networks Sargur srihari@cedar.buffalo.edu 1 Topics Bayesian Networks and Markov Networks From BN to MN: Moralized graphs From MN to BN: Chordal graphs 2 Bayesian Networks
More informationSupport Vector Machines
Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationKyle Reing University of Southern California April 18, 2018
Renormalization Group and Information Theory Kyle Reing University of Southern California April 18, 2018 Overview Renormalization Group Overview Information Theoretic Preliminaries Real Space Mutual Information
More informationHidden Markov Models
CS769 Spring 2010 Advanced Natural Language Processing Hidden Markov Models Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Part-of-Speech Tagging The goal of Part-of-Speech (POS) tagging is to label each
More informationLecture 13 : Variational Inference: Mean Field Approximation
10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1
More informationRestricted Boltzmann Machines
Restricted Boltzmann Machines Boltzmann Machine(BM) A Boltzmann machine extends a stochastic Hopfield network to include hidden units. It has binary (0 or 1) visible vector unit x and hidden (latent) vector
More informationGradient-Based Learning. Sargur N. Srihari
Gradient-Based Learning Sargur N. srihari@cedar.buffalo.edu 1 Topics Overview 1. Example: Learning XOR 2. Gradient-Based Learning 3. Hidden Units 4. Architecture Design 5. Backpropagation and Other Differentiation
More informationMarkov Networks.
Markov Networks www.biostat.wisc.edu/~dpage/cs760/ Goals for the lecture you should understand the following concepts Markov network syntax Markov network semantics Potential functions Partition function
More informationVariable Elimination: Algorithm
Variable Elimination: Algorithm Sargur srihari@cedar.buffalo.edu 1 Topics 1. Types of Inference Algorithms 2. Variable Elimination: the Basic ideas 3. Variable Elimination Sum-Product VE Algorithm Sum-Product
More informationThe Origin of Deep Learning. Lili Mou Jan, 2015
The Origin of Deep Learning Lili Mou Jan, 2015 Acknowledgment Most of the materials come from G. E. Hinton s online course. Outline Introduction Preliminary Boltzmann Machines and RBMs Deep Belief Nets
More informationChapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1)
HW 1 due today Parameter Estimation Biometrics CSE 190 Lecture 7 Today s lecture was on the blackboard. These slides are an alternative presentation of the material. CSE190, Winter10 CSE190, Winter10 Chapter
More informationPart 1: Expectation Propagation
Chalmers Machine Learning Summer School Approximate message passing and biomedicine Part 1: Expectation Propagation Tom Heskes Machine Learning Group, Institute for Computing and Information Sciences Radboud
More informationSemi-Markov/Graph Cuts
Semi-Markov/Graph Cuts Alireza Shafaei University of British Columbia August, 2015 1 / 30 A Quick Review For a general chain-structured UGM we have: n n p(x 1, x 2,..., x n ) φ i (x i ) φ i,i 1 (x i, x
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationInference in Bayesian Networks
Andrea Passerini passerini@disi.unitn.it Machine Learning Inference in graphical models Description Assume we have evidence e on the state of a subset of variables E in the model (i.e. Bayesian Network)
More informationMachine Learning Basics: Maximum Likelihood Estimation
Machine Learning Basics: Maximum Likelihood Estimation Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics 1. Learning
More informationClustering with k-means and Gaussian mixture distributions
Clustering with k-means and Gaussian mixture distributions Machine Learning and Category Representation 2014-2015 Jakob Verbeek, ovember 21, 2014 Course website: http://lear.inrialpes.fr/~verbeek/mlcr.14.15
More informationLecture 8: PGM Inference
15 September 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I 1 Variable elimination Max-product Sum-product 2 LP Relaxations QP Relaxations 3 Marginal and MAP X1 X2 X3 X4
More informationBasic math for biology
Basic math for biology Lei Li Florida State University, Feb 6, 2002 The EM algorithm: setup Parametric models: {P θ }. Data: full data (Y, X); partial data Y. Missing data: X. Likelihood and maximum likelihood
More information