
Generative and Discriminative Approaches to Graphical Models
CMSC 35900 Topics in AI, Lecture 2
Yasemin Altun
January 26, 2007

Review of Inference on Graphical Models

The elimination algorithm finds a single marginal probability p(x_f) by:
- choosing an elimination ordering I such that f is the last node,
- keeping track of the active potentials,
- eliminating a node i by removing all active potentials referencing i, taking their product, summing over x_i, and putting the resulting message on the active list.

Now consider trees with multiple queries. Inference on trees is a special case of the Junction Tree (Sum-Product) algorithm:
- choose a root node arbitrarily,
- a natural elimination order is a depth-first traversal of the tree.

Inference on Trees for a Single Query

[Figure: elimination of node j, which sends message m_{ji}(x_i) toward the root after collecting m_{kj}(x_j) and m_{lj}(x_j) from its children k and l.]

Elimination of node j: if j is an evidence node, \phi^E(x_j) = 1[x_j = \bar{x}_j]; otherwise \phi^E(x_j) = 1.

m_{ji}(x_i) = \sum_{x_j} \phi^E(x_j)\, \phi(x_i, x_j) \prod_{k \in c(j)} m_{kj}(x_j)

The marginal of the root x_r is proportional to the product of its incoming messages:

p(x_r \mid \bar{x}_E) \propto \phi^E(x_r) \prod_{k \in c(r)} m_{kr}(x_r)
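
The following is a minimal sketch of this single-query message passing in Python/NumPy. The tree, the shared pairwise potential, and the evidence are illustrative toy values, not from the lecture.

```python
# A minimal sketch of single-query message passing on a tree, using NumPy.
# The tree structure, potentials, and evidence below are illustrative only.
import numpy as np

K = 2  # number of states per node
# Tree: node 0 is the root r; parent/child lists define the edges.
children = {0: [1, 2], 1: [3], 2: [], 3: []}
# Shared pairwise potential phi(x_parent, x_child) for simplicity.
phi = np.array([[1.0, 0.5],
                [0.5, 2.0]])
# Evidence potentials phi^E(x_j): indicator for observed nodes, all-ones otherwise.
evidence = {3: 1}                      # node 3 observed in state 1
phi_E = {j: np.ones(K) for j in children}
for j, val in evidence.items():
    phi_E[j] = np.eye(K)[val]

def message(j, i):
    """m_{ji}(x_i) = sum_{x_j} phi^E(x_j) phi(x_i, x_j) prod_{k in c(j)} m_{kj}(x_j)."""
    incoming = np.ones(K)
    for k in children[j]:
        incoming *= message(k, j)
    return phi @ (phi_E[j] * incoming)

root = 0
belief = phi_E[root] * np.prod([message(c, root) for c in children[root]], axis=0)
p_root = belief / belief.sum()         # p(x_r | evidence)
print(p_root)
```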

Inference on Trees for Multiple Queries

[Figure: a four-node tree with X_1 as root and descendants X_2, X_3, X_4, annotated with messages m_{12}(x_2), m_{32}(x_2), m_{42}(x_2), m_{23}(x_3), m_{24}(x_4).]
(a) Choosing X_1 as the root, X_2 has all the messages from the nodes below it.
(b) To compute p(x_2), the message m_{12}(x_2) is also needed.
(c), (d) One pass from the leaves to the root and one pass backward give all the messages required to compute every node marginal.

Inference on HMMs

[Figure: HMM graphical model with hidden chain y_{i-1}, y_i, y_{i+1} and observations x_{i-1}, x_i, x_{i+1}.]

Hidden Markov Models are directed sequence models. Let x = (x_1, ..., x_l) and y = (y_1, ..., y_l) be sequences, with x_i \in O, where O is the set of possible observations, and y_i \in Y, where Y is the set of possible labels.

\pi_\sigma = p(y_1 = \sigma; \theta)  -- initial state probabilities
T_{\sigma,\sigma'} = p(y_{i+1} = \sigma \mid y_i = \sigma'; \theta)  -- transition probabilities, for \sigma, \sigma' \in Y
O_{\sigma,u} = p(x_i = u \mid y_i = \sigma; \theta)  -- observation probabilities, for \sigma \in Y, u \in O

P(x, y; \theta) = \pi_{y_1} \prod_{i=1}^{l} O_{y_i, x_i} \prod_{i=1}^{l-1} T_{y_{i+1}, y_i}

Forward-Backward Algorithm for HMMs

Set y_l as the root; the x are evidence.

Forward pass: \alpha_i(\sigma) = m_{(i-1)i}(y_i = \sigma)

p(\bar{x}_{1:i}, y_i = \sigma) = \alpha_i(\sigma) = \Big( \sum_{\sigma' \in Y} \alpha_{i-1}(\sigma')\, T_{\sigma,\sigma'} \Big) O_{\sigma, \bar{x}_i}

Backward pass:

p(\bar{x}_{i+1:l} \mid y_i = \sigma) = \beta_i(\sigma) = \sum_{\sigma' \in Y} T_{\sigma',\sigma}\, O_{\sigma', \bar{x}_{i+1}}\, \beta_{i+1}(\sigma')

Combining the messages:

P(y_i = \sigma \mid \bar{x}_{1:l}) = \frac{p(\bar{x}_{1:l} \mid y_i)\, p(y_i)}{p(\bar{x}_{1:l})} \propto p(\bar{x}_{1:i} \mid y_i)\, p(\bar{x}_{i+1:l} \mid y_i)\, p(y_i) = \alpha_i(\sigma)\, \beta_i(\sigma)
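
As a concrete illustration, here is a short NumPy sketch of the two recursions above. The parameters pi, T, O and the observation sequence are toy values chosen for the example.

```python
# A minimal NumPy sketch of the forward-backward recursions; toy parameters only.
import numpy as np

S, V = 2, 3                    # |Y| hidden states, |O| observation symbols
pi = np.array([0.6, 0.4])      # pi_sigma = p(y_1 = sigma)
T = np.array([[0.7, 0.4],      # T[s, s'] = p(y_{i+1} = s | y_i = s')
              [0.3, 0.6]])
O = np.array([[0.5, 0.4, 0.1], # O[s, u] = p(x_i = u | y_i = s)
              [0.1, 0.3, 0.6]])
x = [0, 2, 1, 2]               # observed sequence x_1..x_l
l = len(x)

alpha = np.zeros((l, S))
alpha[0] = pi * O[:, x[0]]                          # alpha_1(s) = pi_s O_{s,x_1}
for i in range(1, l):
    alpha[i] = (T @ alpha[i - 1]) * O[:, x[i]]      # (sum_{s'} T[s,s'] alpha_{i-1}(s')) O_{s,x_i}

beta = np.zeros((l, S))
beta[-1] = 1.0
for i in range(l - 2, -1, -1):
    beta[i] = T.T @ (O[:, x[i + 1]] * beta[i + 1])  # sum_{s'} T[s',s] O_{s',x_{i+1}} beta_{i+1}(s')

posterior = alpha * beta
posterior /= posterior.sum(axis=1, keepdims=True)   # P(y_i = s | x_{1:l})
print(posterior)
```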

Generalization to the Junction Tree Algorithm

Data structure: generate a tree whose nodes are the maximal cliques. Then we obtain the clique marginals p(x_C \mid \bar{x}_E) using sum-product, and we can marginalize over a clique to get individual marginal probabilities. Marginalizing over different cliques containing node i should yield the same p(x_i \mid \bar{x}_E).

There are multiple clique trees, so we need a data structure that preserves global consistency via local consistency. A junction tree is a clique tree in which, for each pair of cliques C_i, C_j sharing variables C_s, the set C_s is contained in every clique on the path between C_i and C_j (the running intersection property). A junction tree exists for a graph iff the graph is triangulated, and one can be obtained from an elimination ordering.

Message-Passing Protocol: a clique C_i can send a message to a neighbor C_j once it has received messages from all its other neighbors.

Junction Tree Algorithm

- Moralize the graph if it is directed.
- Triangulate the graph G (using the variable elimination algorithm).
- Construct a junction tree J from the triangulated graph via a maximum-weight spanning tree over the cliques.
- Initialize the potential of each clique C of J as the product of all factors from the model assigned to C.
- Propagate local messages on the junction tree.

Maximizing instead of Summing

Finding maximum probability configurations: Elimination and Belief Propagation both summed over all possible values of the (non-query, non-evidence) nodes to get a marginal probability. What if we wanted to maximize over the non-query, non-evidence nodes to find the probability of the single best setting consistent with any query and evidence?

Multiplication distributes over max as well as sum: max(ab, ac) = a max(b, c). So we can do the MAP computation using exactly the same algorithm, replacing sums with max:

\max_x p(x) = \max_{x_1} \max_{x_2} \max_{x_3} \max_{x_4} \max_{x_5} p(x_1) p(x_2 \mid x_1) p(x_3 \mid x_1) p(x_4 \mid x_2) p(x_5 \mid x_3) p(x_6 \mid x_2, x_5)
= \max_{x_1} p(x_1) \max_{x_2} p(x_2 \mid x_1) \max_{x_3} p(x_3 \mid x_1) \max_{x_4} p(x_4 \mid x_2) \max_{x_5} p(x_5 \mid x_3) p(x_6 \mid x_2, x_5)

This is known as the maximum a-posteriori or MAP configuration. It turns out that (on trees) we can use an algorithm exactly like belief propagation to solve this problem.

[Figure: example directed model over X_1, ..., X_6 used in the factorization above.]
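
On a chain, this max-product recursion becomes the Viterbi algorithm. Below is a sketch under that assumption, written to be called with the toy HMM parameters (pi, T, O, x) from the forward-backward sketch earlier; it replaces the sums in the forward pass with max and keeps back-pointers for decoding.

```python
# A sketch of max-product on a chain (Viterbi). Toy usage:
#   y_map, p_map = viterbi(pi, T, O, x)   # with pi, T, O, x as in the forward-backward sketch
import numpy as np

def viterbi(pi, T, O, x):
    l, S = len(x), len(pi)
    delta = np.zeros((l, S))            # delta_i(s) = max over y_{1:i-1} of p(y_{1:i-1}, y_i=s, x_{1:i})
    back = np.zeros((l, S), dtype=int)  # argmax bookkeeping for decoding
    delta[0] = pi * O[:, x[0]]
    for i in range(1, l):
        scores = T * delta[i - 1]       # scores[s, s'] = T[s, s'] * delta_{i-1}(s')
        back[i] = scores.argmax(axis=1)
        delta[i] = scores.max(axis=1) * O[:, x[i]]
    # Backtrack the MAP label sequence.
    y = [int(delta[-1].argmax())]
    for i in range(l - 1, 0, -1):
        y.append(int(back[i, y[-1]]))
    return list(reversed(y)), float(delta[-1].max())
```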

Learning

Parameter Learning: given the graph structure, how do we get the conditional probability distributions P_\theta(X_i \mid X_{\pi_i})?
- Parameters \theta are unknown constants.
- Given fully observed training data D, find the estimates maximizing the (penalized) log-likelihood:
  \hat{\theta} = \arg\max_\theta \log P(D \mid \theta)\; (-\, \lambda\, r(\theta))
- Parameter estimation with fully observed data.
- Parameter estimation with latent variables: Expectation Maximization (EM).

Structure Learning: given fully observed training data D, how do we get the graph G and its parameters \theta? (Covered after studying the discriminative framework.)

Parameter Estimation with Complete Observations

Random variable X = (X_1, ..., X_n). The data D = (x^1, ..., x^m) is a set of m IID observations. The maximum likelihood estimate (MLE) is found by maximizing the log probability of the data. In Bayes nets,

l(\theta; D) = \log p(D \mid \theta) = \log \prod_{j=1}^m p(x^j \mid \theta) = \sum_{j=1}^m \sum_{i=1}^n \log p(x_i^j \mid x_{\pi_i}^j, \theta_i)

Consider X_i to be a discrete rv with K possible values. Then p(x_i \mid x_{\pi_i}, \theta_i) = \theta_i(x_i, x_{\pi_i}) is a multinomial distribution such that \sum_{x_i} \theta_i(x_i, x_{\pi_i}) = 1.

l(\theta; D) = \sum_i \sum_{x_{i,\pi_i}} N(x_{i,\pi_i}) \log \theta_i(x_i, x_{\pi_i})

where N(x_{i,\pi_i}) = \sum_{j=1}^m 1[x^j_{i,\pi_i} = x_{i,\pi_i}] is the observed count of the assignment x_{i,\pi_i} in the data.

Parameter Estimation with Complete Observations

l(\theta; D) = \sum_i \sum_{x_{i,\pi_i}} N(x_{i,\pi_i}) \log \theta_i(x_i, x_{\pi_i})

Estimation of \theta_i is independent of \theta_k for k \neq i. To estimate \theta_i, ignore the data associated with nodes other than i and \pi_i, and maximize \sum_{x_{i,\pi_i}} N(x_{i,\pi_i}) \log \theta_i(x_i, x_{\pi_i}) with respect to \theta_i. Add a Lagrangian term for the normalization constraint and solve for \theta_i. The ML estimate is the relative frequency:

\hat{\theta}_i(x_i, x_{\pi_i}) = N(x_i, x_{\pi_i}) / N(x_{\pi_i})
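
As an illustration, here is a small sketch of this relative-frequency estimate for one discrete node. The data format (a list of dicts) and the helper name mle_cpt are illustrative choices, not from the lecture.

```python
# A minimal sketch of MLE by relative frequency for one node X_i with parents pi_i.
from collections import Counter

def mle_cpt(data, i, parents):
    """Estimate theta_i(x_i, x_pi) = N(x_i, x_pi) / N(x_pi) from fully observed samples.

    data: list of dicts mapping variable index -> observed value.
    """
    joint = Counter()          # counts N(x_i, x_pi)
    parent_counts = Counter()  # counts N(x_pi)
    for sample in data:
        x_pi = tuple(sample[p] for p in parents)
        joint[(sample[i], x_pi)] += 1
        parent_counts[x_pi] += 1
    return {key: n / parent_counts[key[1]] for key, n in joint.items()}

# Toy example: binary X_2 with single parent X_1.
data = [{1: 0, 2: 0}, {1: 0, 2: 1}, {1: 1, 2: 1}, {1: 1, 2: 1}]
print(mle_cpt(data, i=2, parents=[1]))
# {(0, (0,)): 0.5, (1, (0,)): 0.5, (1, (1,)): 1.0}
```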

Parameter Estimation in HMMs

[Figure: HMM graphical model, as before.]

Recall the HMM model: x = (x_1, ..., x_l) and y = (y_1, ..., y_l) are sequences with x_i \in O (the set of possible observations) and y_i \in Y (the set of possible labels).

\pi_\sigma = p(y_1 = \sigma; \theta)  -- initial state probabilities
T_{\sigma,\sigma'} = p(y_{i+1} = \sigma \mid y_i = \sigma'; \theta)  -- transition probabilities, for \sigma, \sigma' \in Y
O_{\sigma,u} = p(x_i = u \mid y_i = \sigma; \theta)  -- observation probabilities, for \sigma \in Y, u \in O

P(x, y; \theta) = \pi_{y_1} \prod_{i=1}^{l} O_{y_i, x_i} \prod_{i=1}^{l-1} T_{y_{i+1}, y_i}

Parameter Estimation in HMMs

P(x, y; \theta) = \pi_{y_1} \prod_{i=1}^{l} O_{y_i, x_i} \prod_{i=1}^{l-1} T_{y_{i+1}, y_i}

The MLE estimates are relative frequencies over the m labeled sequences (with l_j the length of sequence j):

\pi_\sigma = \frac{\sum_{j=1}^m 1[y^j_1 = \sigma]}{m}

T_{\sigma,\sigma'} = \frac{\sum_{j=1}^m \sum_{i=1}^{l_j - 1} 1[y^j_{i+1} = \sigma \wedge y^j_i = \sigma']}{\sum_{j=1}^m \sum_{i=1}^{l_j - 1} 1[y^j_i = \sigma']}

O_{\sigma,u} = \frac{\sum_{j=1}^m \sum_{i=1}^{l_j} 1[y^j_i = \sigma \wedge x^j_i = u]}{\sum_{j=1}^m \sum_{i=1}^{l_j} 1[y^j_i = \sigma]}
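
A short sketch of these counting estimates, assuming integer-coded states and observations and toy data; the function name hmm_mle is an illustrative choice.

```python
# A sketch of the HMM MLE counts above for fully labeled sequences (x^j, y^j).
import numpy as np

def hmm_mle(sequences, S, V):
    """sequences: list of (x, y) pairs; x and y are equal-length lists of int codes."""
    pi = np.zeros(S)
    T = np.zeros((S, S))   # T[s, s'] = p(y_{i+1} = s | y_i = s')
    O = np.zeros((S, V))   # O[s, u]  = p(x_i = u  | y_i = s)
    for x, y in sequences:
        pi[y[0]] += 1
        for i in range(len(y) - 1):
            T[y[i + 1], y[i]] += 1
        for xi, yi in zip(x, y):
            O[yi, xi] += 1
    # Normalize the counts into conditional distributions.
    pi /= pi.sum()
    T /= T.sum(axis=0, keepdims=True)   # normalize over next state, per previous state
    O /= O.sum(axis=1, keepdims=True)   # normalize over observation symbol, per state
    return pi, T, O

toy = [([0, 2, 1], [0, 0, 1]), ([1, 2], [1, 1])]
print(hmm_mle(toy, S=2, V=3))
```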

MLE with Complete Observations for MRFs

p(x \mid \theta) = \frac{\prod_C \phi_C(x_C)}{Z}, \qquad Z = \sum_{x \in Y^n} \prod_C \phi_C(x_C)

In MRFs, Z couples the parameters, so there is no closed-form solution for \theta. For potential functions \phi_C(x_C) = \exp(\langle f(x_C), \theta_C \rangle),

l(\theta; D) = \sum_{j=1}^m \log p(x^j \mid \theta) = \sum_j \Big( \sum_C \langle f(x^j_C), \theta_C \rangle - \log Z_\theta \Big) = \sum_C \sum_{x_C} N(x_C) \langle f(x_C), \theta_C \rangle - m \log Z_\theta

Take the derivative of l with respect to \theta, equate it to 0, and solve for \theta. The derivative of \log Z_\theta with respect to \theta_C is the expectation E_{x \sim p_\theta}[f(x_C)].
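
To make the gradient concrete, here is a brute-force sketch that computes "empirical feature counts minus m times the model expectation" by enumerating all configurations of a tiny model. The chain structure, one-hot features, and variable names are illustrative; for realistic models the expectation would come from inference, not enumeration.

```python
# A brute-force sketch of the MRF gradient: empirical counts minus m * E_{p_theta}[f(x_C)].
# Only feasible for tiny models; the 3-variable chain below is a toy example.
import itertools
import numpy as np

cliques = [(0, 1), (1, 2)]   # edges of a chain over 3 binary variables
K = 2

def f(x_C):
    """One-hot feature vector over the K*K joint assignments of a pairwise clique."""
    v = np.zeros(K * K)
    v[x_C[0] * K + x_C[1]] = 1.0
    return v

def log_potential(x, theta):
    return sum(f(tuple(x[i] for i in C)) @ theta[C] for C in cliques)

def grad_loglik(data, theta):
    configs = list(itertools.product(range(K), repeat=3))
    weights = np.array([np.exp(log_potential(x, theta)) for x in configs])
    p = weights / weights.sum()                      # p_theta(x) by enumeration
    grad = {}
    for C in cliques:
        empirical = sum(f(tuple(x[i] for i in C)) for x in data)
        expected = sum(pi * f(tuple(x[i] for i in C)) for pi, x in zip(p, configs))
        grad[C] = empirical - len(data) * expected
    return grad

theta = {C: np.zeros(K * K) for C in cliques}
data = [(0, 0, 1), (1, 1, 1)]
print(grad_loglik(data, theta))
```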

Parameter Estimation with Latent Variables

When there are latent variables, we need to marginalize over them, which couples the parameters. Let X be the observed variables and Z the latent variables:

l(\theta; D) = \log \sum_z p(x, z \mid \theta)

The Expectation Maximization (EM) algorithm is a general approach to maximum likelihood parameter estimation (MLE) with latent (hidden) variables. EM is a coordinate-ascent algorithm on a lower bound of the likelihood, equivalently coordinate descent on a Kullback-Leibler (KL) divergence.

EM Algorithm

Since Z is not observed, the log-likelihood of the data is a marginal over Z, which does not decompose:

l(\theta; x) := \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)

Replace this with the expected complete-data log-likelihood with respect to an averaging distribution q(z \mid x); this is linear in l_c(\theta; x, z) := \log p(x, z \mid \theta):

\langle l_c(\theta; x, z) \rangle_q := \sum_z q(z \mid x, \theta) \log p(x, z \mid \theta)

This yields a lower bound on l(\theta; x). Using Jensen's inequality,

l(\theta; x) = \log \sum_z p(x, z \mid \theta) = \log \sum_z q(z \mid x) \frac{p(x, z \mid \theta)}{q(z \mid x)} \geq \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} = L(q, \theta)

EM Algorithm

Until convergence:
- Maximize wrt q (E step): q^{t+1} = \arg\max_q L(q, \theta^t)
- Maximize wrt \theta (M step): \theta^{t+1} = \arg\max_\theta L(q^{t+1}, \theta)

EM Algorithm

Until convergence:
- Maximize wrt q (E step): q^{t+1} = \arg\max_q L(q, \theta^t)
- Maximize wrt \theta (M step): \theta^{t+1} = \arg\max_\theta L(q^{t+1}, \theta)

\arg\max_\theta L(q, \theta) = \arg\max_\theta \Big[ \sum_z q(z \mid x) \log p(x, z \mid \theta) - \sum_z q(z \mid x) \log q(z \mid x) \Big] = \arg\max_\theta \langle l_c(\theta; x, z) \rangle_q

since the entropy term does not depend on \theta.

EM Algorithm

Until convergence:
- Maximize wrt q (E step): q^{t+1} = \arg\max_q L(q, \theta^t)

  l(\theta; x) - L(q, \theta) = \sum_z q(z \mid x) \log p(x \mid \theta) - \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} = D(q(z \mid x) \,\|\, p(z \mid x, \theta))

  Since l(\theta; x) does not depend on q, minimizing l(\theta; x) - L(q, \theta) over q is equivalent to maximizing L(q, \theta). The KL divergence is minimized when q(z \mid x) = p(z \mid x, \theta^t). Intuitively, given p(x, z \mid \theta), the posterior p(z \mid x, \theta^t) is the best guess for the latent variables conditioned on x.

- Maximize wrt \theta (M step): \theta^{t+1} = \arg\max_\theta L(q^{t+1}, \theta)

The M step increases a lower bound on the likelihood; the E step closes the gap, so that l(\theta^t; x) = L(q^{t+1}, \theta^t).
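
To see the two steps in a concrete (non-HMM) setting, here is a minimal EM sketch for a mixture of two biased coins: each data point is the number of heads in n flips of a coin of unknown identity. The data, initialization, and iteration count are illustrative, and the example uses SciPy for the binomial pmf.

```python
# A minimal EM sketch for a two-component binomial (biased coin) mixture; toy data only.
import numpy as np
from scipy.stats import binom

counts = np.array([5, 9, 8, 4, 7])   # heads out of n flips per trial
n = 10
pi, p = np.array([0.5, 0.5]), np.array([0.4, 0.7])  # initial mixing weights and coin biases

for _ in range(50):
    # E step: q(z | x) = posterior responsibility of each coin for each trial.
    lik = np.stack([binom.pmf(counts, n, p_k) for p_k in p], axis=1)  # shape (m, 2)
    q = pi * lik
    q /= q.sum(axis=1, keepdims=True)
    # M step: maximize the expected complete-data log-likelihood
    # (weighted relative frequencies, mirroring the fully observed MLE).
    pi = q.mean(axis=0)
    p = (q * counts[:, None]).sum(axis=0) / (q.sum(axis=0) * n)

print(pi, p)
```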

EM for HMMs

E step: compute expected counts using the Forward-Backward algorithm, q^t = p(z \mid x, \theta^{t-1}).

M step: find the MLE with respect to the expected counts (relative frequencies), \theta^t = \arg\max_\theta \langle l_c(\theta; x, z) \rangle_{q^t}.
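
The sketch below shows one such iteration (this is the Baum-Welch update) on a single sequence, under the same toy parameterization used in the earlier forward-backward sketch; expected transition and emission counts from alpha/beta replace the observed counts of the fully supervised MLE.

```python
# A sketch of one EM (Baum-Welch) iteration for an HMM on a single sequence.
# Toy usage: new_pi, new_T, new_O = baum_welch_step(pi, T, O, x) with the earlier toy values.
import numpy as np

def forward_backward(pi, T, O, x):
    l, S = len(x), len(pi)
    alpha = np.zeros((l, S)); beta = np.ones((l, S))
    alpha[0] = pi * O[:, x[0]]
    for i in range(1, l):
        alpha[i] = (T @ alpha[i - 1]) * O[:, x[i]]
    for i in range(l - 2, -1, -1):
        beta[i] = T.T @ (O[:, x[i + 1]] * beta[i + 1])
    return alpha, beta

def baum_welch_step(pi, T, O, x):
    alpha, beta = forward_backward(pi, T, O, x)
    l, S = len(x), len(pi)
    Z = alpha[-1].sum()                               # p(x_{1:l})
    gamma = alpha * beta / Z                          # gamma[i, s] = p(y_i = s | x)
    xi = np.zeros((l - 1, S, S))                      # xi[i, s, s'] = p(y_{i+1}=s, y_i=s' | x)
    for i in range(l - 1):
        xi[i] = (O[:, x[i + 1]] * beta[i + 1])[:, None] * T * alpha[i][None, :] / Z
    # M step: expected relative frequencies.
    new_pi = gamma[0]
    new_T = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[None, :]
    new_O = np.zeros_like(O)
    for u in range(O.shape[1]):
        new_O[:, u] = gamma[np.array(x) == u].sum(axis=0)
    new_O /= gamma.sum(axis=0)[:, None]
    return new_pi, new_T, new_O
```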

Discriminative Training of HMMs

Maximum Entropy Markov Models (MEMM) [McCFrePer00]. (The generalization to Bayes nets is trivial.)

[Figure 1: (a) the dependency graph for a traditional HMM, where s_{t-1} -> s_t and s_t -> o_t; (b) the dependency graph for the conditional maximum entropy Markov model, where s_t is conditioned on s_{t-1} and o_t.]

a) HMM: P(x, y; \theta) = \pi_{y_1} \prod_{i=1}^{l} O_{y_i, x_i} \prod_{i=1}^{l-1} T_{y_{i+1}, y_i}
b) MEMM: P(y \mid x; \theta) = \pi_{y_1, x_1} \prod_{i=1}^{l-1} T_{y_{i+1}, y_i, x_{i+1}}

Inference and Parameter Estimation in MEMMs

Forward-Backward algorithm:

\alpha_i(\sigma) = \sum_{\sigma' \in Y} \alpha_{i-1}(\sigma')\, T_{\sigma, \sigma', \bar{x}_i}
\beta_i(\sigma) = \sum_{\sigma' \in Y} T_{\sigma', \sigma, \bar{x}_{i+1}}\, \beta_{i+1}(\sigma')

Representation, for \sigma, \sigma' \in Y and \bar{u} \in O:

T_{\sigma, \sigma', \bar{u}} = \frac{1}{Z(\bar{u}, \sigma')} \exp\big( \langle \theta, f(\bar{u}, \sigma', \sigma) \rangle \big), \qquad Z(\bar{u}, \sigma') = \sum_\sigma \exp\big( \langle \theta, f(\bar{u}, \sigma', \sigma) \rangle \big)

Coupling between the components of \theta due to Z means there is no closed-form solution; use a convex optimization technique. Note the difference from undirected models, where Z is a sum over all possible values of all nodes (Y^n); here it is a sum over Y only.
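
The following sketch implements the local MEMM distribution above: a softmax over the next label given the previous label and the current observation. The feature map, dimensions, and parameter values are illustrative placeholders (in practice the features are task-defined indicator functions).

```python
# A sketch of the MEMM local model: T_{sigma, sigma', u} is a softmax over the next label.
import numpy as np

S, D = 3, 8                      # number of labels, feature dimension (illustrative)
rng = np.random.default_rng(0)
theta = rng.normal(size=D)       # placeholder model parameters

def features(u, prev_label, label):
    """Illustrative stand-in for the feature map f(u, sigma', sigma)."""
    v = np.zeros(D)
    v[hash((u, prev_label, label)) % D] = 1.0
    return v

def transition_probs(u, prev_label):
    """Distribution over the next label sigma given (sigma', u): T_{., sigma', u}."""
    scores = np.array([theta @ features(u, prev_label, s) for s in range(S)])
    scores -= scores.max()                 # numerical stability
    Z = np.exp(scores).sum()               # Z(u, sigma') sums over the labels Y only, not Y^n
    return np.exp(scores) / Z

print(transition_probs(u="Brown", prev_label=1))
```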

Generative vs Discriminative Approaches

Generative framework: HMMs model p(x, y).

Advantages:
- Efficient learning algorithm: relative frequencies.
- Can handle missing observations (simply treated as latent variables).
- Naturally incorporates prior knowledge.

Disadvantages:
- Solves a harder problem: modeling p(x \mid y) p(y) rather than p(y \mid x) directly.
- Questionable independence assumption: x_i \perp x_j \mid y_i for j \neq i.
- Limited representation: overlapping features are problematic. Assuming they are independent violates the model:
  p(x_i \mid y_i) = \prod_k p(f_k(x_i) \mid y_i)
  e.g. f_1(u) = "Is u the word Brown?", f_2(u) = "Is u capitalized?"
  Conditioning on the other features instead increases the number of parameters:
  p(x_i \mid y_i) = \prod_k p(f_k(x_i) \mid y_i, f_1(x_i), ..., f_{k-1}(x_i))
- Increased complexity to capture the dependency between x_i and y_{i-1}: p(x_i \mid y_i, y_{i-1}).

Generative vs Discriminative Approaches

Discriminative framework: MEMMs model p(y \mid x).

Advantages:
- Solves the more direct problem.
- Richer representation via kernels or feature selection; conditioned on x, overlapping features are trivial and there is no independence assumption on the observations.
- Lower asymptotic error.

Disadvantages:
- Expensive learning: there is no closed-form solution as in HMMs, so we need iterative optimization together with the Forward-Backward algorithm.
- Missing values in the observations are problematic due to the conditioning.
- Requires more data, O(d), to reach its asymptotic error, whereas generative models require O(log d), for d the number of parameters [NgJor01].