
Probabilistic Graphical Models. Lecture 11: CRFs, Exponential Family. CS/CNS/EE 155, Andreas Krause

Announcements. Homework 2 due today. Project milestones due next Monday (Nov 9): about half the work should be done; 4 pages of writeup, NIPS format (http://nips.cc/paperinformation/stylefiles).

So far: Markov network representation (local/global Markov assumptions; separation; soundness and completeness of separation) and Markov network inference (variable elimination and junction tree inference work exactly as in Bayes nets). How about learning Markov nets?

MLE for Markov nets. Log-likelihood of the data:
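The formula itself did not survive transcription; a standard form, writing the model as a product of clique potentials with partition function Z(θ), is:

l(\theta; D) = \sum_{m=1}^{M} \sum_{c} \log \phi_c(\mathbf{x}_c[m]) - M \log Z(\theta)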

Log-likelihood doesn't decompose. The log-likelihood l(D; θ) is a concave function! But the log-partition function log Z(θ) doesn't decompose over the cliques.

Computing the derivative. [Figure: example network over nodes C, D, I, G, S, L, J, H.] Computing the model marginals P(c_i | θ) requires inference! Can optimize using conjugate gradient, etc.
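A standard form of the derivative, per log-potential entry θ_c(x_c) = log φ_c(x_c), consistent with the slide's point that model marginals are needed:

\frac{\partial l}{\partial \theta_c(\mathbf{x}_c)} = \hat{M}[\mathbf{x}_c] - M \, P(\mathbf{x}_c \mid \theta)

where \hat{M}[\mathbf{x}_c] counts how often the assignment \mathbf{x}_c appears in the data. The second term is a model marginal, hence the need for inference.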

Alternative approach: Iterative Proportional Fitting (IPF). At the optimum, the model clique marginals must match the empirical ones. Solve this fixed-point equation by iterating; must recompute parameters (and re-run inference) every iteration.
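The standard IPF update (the fixed-point iteration the slide refers to):

\phi_c^{(t+1)}(\mathbf{x}_c) = \phi_c^{(t)}(\mathbf{x}_c) \cdot \frac{\hat{P}(\mathbf{x}_c)}{P^{(t)}(\mathbf{x}_c)}

where \hat{P} is the empirical clique marginal and P^{(t)} is the current model marginal.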

Parameter learning for log-linear models. Log-linear model over an undirected graph G: feature functions φ_1(C_1), …, φ_k(C_k) defined over cliques; the domains C_i can overlap. Joint distribution as below. How do we get the weights w_i?
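A standard form of the joint distribution for such a model:

P(\mathbf{x} \mid \mathbf{w}) = \frac{1}{Z(\mathbf{w})} \exp\Big( \sum_{i=1}^{k} w_i \, \phi_i(\mathbf{C}_i) \Big)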

Optimizing parameters. Gradient of the log-likelihood (below). Setting it to zero: w is the MLE exactly when the model's expected feature counts match the empirical feature counts (moment matching).
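The standard gradient for the log-linear model above:

\frac{\partial l(\mathbf{w}; D)}{\partial w_i} = \sum_{m=1}^{M} \phi_i(\mathbf{x}[m]) - M \, \mathbb{E}_{P(\mathbf{x} \mid \mathbf{w})}\big[\phi_i(\mathbf{x})\big]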

Regularization of parameters. Put a prior on the parameters w and perform MAP estimation.
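A common choice (an assumption here; the slide's formula is not in the transcription) is a zero-mean Gaussian prior w ~ N(0, σ²I), which turns MLE into L2-regularized MAP estimation:

\max_{\mathbf{w}} \; l(\mathbf{w}; D) - \frac{1}{2\sigma^2} \lVert \mathbf{w} \rVert^2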

Summary: parameter learning in MNs. MLE in a BN is easy (the score decomposes). MLE in an MN requires inference (the score doesn't decompose). Can optimize using gradient ascent or IPF.

Generative vs. discriminative models. Often, we want to predict Y for inputs X; the Bayes optimal classifier predicts according to P(Y | X). Generative model: model P(Y) and P(X | Y), then use Bayes' rule to compute P(Y | X). Discriminative model: model P(Y | X) directly! Don't model the distribution P(X) over inputs X (so we cannot generate sample inputs). Example: logistic regression.

Example: Logistic Regression.
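The slide's formula is not in the transcription; the standard logistic model is:

P(Y = 1 \mid \mathbf{x}, \mathbf{w}) = \frac{1}{1 + \exp\big( -(w_0 + \mathbf{w}^\top \mathbf{x}) \big)}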

Log-linear conditional random field. Define a log-linear model over the outputs Y; make no assumptions about the inputs X. Feature functions φ_i(C_i, x) are defined over cliques and inputs. Conditional distribution as below.
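A standard form, where y_{C_i} denotes the clique assignment:

P(\mathbf{y} \mid \mathbf{x}, \mathbf{w}) = \frac{1}{Z(\mathbf{x}, \mathbf{w})} \exp\Big( \sum_i w_i \, \phi_i(\mathbf{y}_{C_i}, \mathbf{x}) \Big), \qquad Z(\mathbf{x}, \mathbf{w}) = \sum_{\mathbf{y}'} \exp\Big( \sum_i w_i \, \phi_i(\mathbf{y}'_{C_i}, \mathbf{x}) \Big)

Note that the partition function now depends on the input x.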

Example: CRFs in NLP. [Figure: linear-chain CRF with labels Y_1, …, Y_12 over input words X_1, …, X_12.] "Mrs. Greene spoke today in New York. Green chairs the finance committee." Classify each word as Person, Location, or Other.

Example: CRFs in vision.

Parameter learning for log-linear CRFs. Conditional log-likelihood of the data (below); can maximize using conjugate gradient.
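A standard form of the conditional log-likelihood:

l(\mathbf{w}; D) = \sum_{m=1}^{M} \Big( \sum_i w_i \, \phi_i(\mathbf{y}[m]_{C_i}, \mathbf{x}[m]) - \log Z(\mathbf{x}[m], \mathbf{w}) \Big)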

Gradient of the conditional log-likelihood: partial derivative below. Requires one inference per training example! Can optimize using conjugate gradient.
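The standard partial derivative:

\frac{\partial l}{\partial w_i} = \sum_{m=1}^{M} \Big( \phi_i(\mathbf{y}[m], \mathbf{x}[m]) - \mathbb{E}_{P(\mathbf{y} \mid \mathbf{x}[m], \mathbf{w})}\big[ \phi_i(\mathbf{y}, \mathbf{x}[m]) \big] \Big)

Each expectation is taken in the model conditioned on x[m], so computing the gradient needs one inference run per training example.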

Exponential family distributions. The distributions underlying log-linear models are a special case. More generally, an exponential family distribution is built from: h(x), the base measure; w, the natural parameters; φ(x), the sufficient statistics; A(w), the log-partition function. Here x can be continuous (defined over any set).
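The defining form:

P(\mathbf{x} \mid \mathbf{w}) = h(\mathbf{x}) \exp\big( \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}) - A(\mathbf{w}) \big)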

Examples. Exponential family form of the Gaussian distribution: identify h(x), the natural parameters w, the sufficient statistics φ(x), and the log-partition function A(w). Other examples: Multinomial, Poisson, Exponential, Gamma, Weibull, chi-square, Dirichlet, Geometric, …
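For the univariate Gaussian N(µ, σ²), a standard derivation gives:

h(x) = \frac{1}{\sqrt{2\pi}}, \quad \boldsymbol{\phi}(x) = (x, \, x^2), \quad \mathbf{w} = \Big( \frac{\mu}{\sigma^2}, \, -\frac{1}{2\sigma^2} \Big), \quad A(\mathbf{w}) = \frac{\mu^2}{2\sigma^2} + \log \sigma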

Moments and gradients. There is a correspondence between the moments of the sufficient statistics and the log-partition function (just like in log-linear models): we can compute moments from derivatives, and derivatives from moments! MLE ⇔ moment matching.
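The standard identities:

\nabla_{\mathbf{w}} A(\mathbf{w}) = \mathbb{E}_{P(\mathbf{x} \mid \mathbf{w})}\big[ \boldsymbol{\phi}(\mathbf{x}) \big], \qquad \nabla^2_{\mathbf{w}} A(\mathbf{w}) = \mathrm{Cov}\big[ \boldsymbol{\phi}(\mathbf{x}) \big]

Since a covariance matrix is positive semidefinite, A is convex, which is why the log-likelihood is concave.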

Recall: conjugate priors. Consider parametric families of prior distributions P(θ) = f(θ; α); α is called the hyperparameter of the prior. A prior P(θ) = f(θ; α) is called conjugate for a likelihood function P(D | θ) if P(θ | D) = f(θ; α′): the posterior has the same parametric form, and the hyperparameters are updated based on the data D. Obvious questions (answered later): How do we choose hyperparameters? Why limit ourselves to conjugate priors?

Conjugate priors in the exponential family. Any exponential family likelihood has a conjugate prior (see below).
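The standard construction: for a likelihood P(x | θ) = h(x) exp(θᵀφ(x) − A(θ)), the family

P(\boldsymbol{\theta} \mid \boldsymbol{\alpha}, \nu) \propto \exp\big( \boldsymbol{\theta}^\top \boldsymbol{\alpha} - \nu A(\boldsymbol{\theta}) \big)

is conjugate, with posterior updates α′ = α + Σ_m φ(x[m]) and ν′ = ν + M after observing M data points.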

Maximum entropy interpretation. Theorem: exponential family distributions maximize the entropy over all distributions satisfying the given constraints on the expected sufficient statistics (made precise below).
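The standard statement: among all distributions p with prescribed feature expectations,

\max_{p} \; H(p) \quad \text{s.t.} \quad \mathbb{E}_{p}[\phi_i(\mathbf{x})] = \alpha_i \;\; \forall i \qquad \Longrightarrow \qquad p(\mathbf{x}) \propto \exp\Big( \sum_i w_i \, \phi_i(\mathbf{x}) \Big)

where the weights w_i arise as Lagrange multipliers for the constraints.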

Summary: exponential family. Distributions of the form given above. Most common distributions are exponential family: Multinomial, Gaussian, Poisson, Exponential, Gamma, Weibull, chi-square, Dirichlet, Geometric, …, as well as log-linear Markov networks. All exponential family distributions have a conjugate prior in the exponential family. Moments of the sufficient statistics = derivatives of the log-partition function. Maximum entropy distributions (the "most uncertain" distributions with specified expected sufficient statistics).

Exponential family graphical models. So far, we only defined graphical models over discrete variables. We can define GMs over continuous distributions! For exponential family distributions, much of what we discussed (VE, JT inference, parameter learning, etc.) carries over. Important example: Gaussian networks.

Gaussian distribution. σ = standard deviation, µ = mean.
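The density (not preserved in the transcription):

\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big( -\frac{(x - \mu)^2}{2\sigma^2} \Big)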

Bivariate Gaussian distribution. [Figure: surface and contour plots of a two-dimensional Gaussian density.]

Multivariate Gaussian distribution. Joint distribution over n random variables P(X_1, …, X_n), with covariances σ_jk = E[(X_j − µ_j)(X_k − µ_k)]. If X_j and X_k are independent, then σ_jk = 0.
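The density, for mean vector µ and covariance matrix Σ:

\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{n/2} \, |\Sigma|^{1/2}} \exp\Big( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^\top \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \Big)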

Marginalization. Suppose (X_1, …, X_n) ~ N(µ, Σ). What is P(X_1)? More generally, let A = {i_1, …, i_k} ⊆ {1, …, n} and write X_A = (X_{i_1}, …, X_{i_k}). Then X_A ~ N(µ_A, Σ_AA): just pick out the relevant entries of µ and Σ.

Conditioning. Suppose (X_1, …, X_n) ~ N(µ, Σ), decomposed as (X_A, X_B). What is P(X_A | X_B)? It is again Gaussian: P(X_A = x_A | X_B = x_B) = N(x_A; µ_{A|B}, Σ_{A|B}), where the conditional mean and covariance (below) are computable using linear algebra!
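The standard Gaussian conditioning formulas:

\mu_{A \mid B} = \mu_A + \Sigma_{AB} \Sigma_{BB}^{-1} (\mathbf{x}_B - \mu_B), \qquad \Sigma_{A \mid B} = \Sigma_{AA} - \Sigma_{AB} \Sigma_{BB}^{-1} \Sigma_{BA}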

Conditioning (illustration). [Figure: bivariate Gaussian density; slicing at X_1 = 0.75 and renormalizing gives the conditional P(X_2 | X_1 = 0.75).]

Conditional linear Gaussians.
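The slide's content is not in the transcription; the standard definition: Y is a conditional linear Gaussian of its continuous parents X if its mean depends linearly on them while the variance is fixed:

P(Y \mid \mathbf{x}) = \mathcal{N}\big( y; \; \beta_0 + \boldsymbol{\beta}^\top \mathbf{x}, \; \sigma^2 \big)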

Canonical representation of Gaussians.

Canonical representation. Multivariate Gaussians are in the exponential family! Standard vs. canonical form: µ = Λ⁻¹η, Σ = Λ⁻¹. In standard form, marginalization is easy. Will see: in canonical form, multiplication/conditioning is easy!
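The canonical (information) form, with precision matrix Λ = Σ⁻¹ and potential vector η = Σ⁻¹µ:

P(\mathbf{x}) \propto \exp\Big( \boldsymbol{\eta}^\top \mathbf{x} - \tfrac{1}{2} \mathbf{x}^\top \Lambda \mathbf{x} \Big)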

Gaussian networks. Zeros in the precision matrix Λ indicate missing edges in the log-linear model!

Inference in Gaussian networks. Can compute marginal distributions in O(n³)! For large numbers n of variables, this is still intractable. If the Gaussian network has low treewidth, we can use variable elimination / JT inference! Need to be able to multiply and marginalize factors.

Multiplying factors in Gaussians.
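In canonical form, multiplying two Gaussian factors (after padding both to a common variable scope) just adds their parameters, since the exponents add:

(\boldsymbol{\eta}_1, \Lambda_1) \cdot (\boldsymbol{\eta}_2, \Lambda_2) \;\longrightarrow\; (\boldsymbol{\eta}_1 + \boldsymbol{\eta}_2, \; \Lambda_1 + \Lambda_2)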

Marginalizing in canonical form. Recall the conversion formulas µ = Λ⁻¹η, Σ = Λ⁻¹. The marginal distribution is given below.
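Marginalizing out X_B from a canonical-form Gaussian over (X_A, X_B) gives, by a standard Schur-complement computation:

\boldsymbol{\eta}'_A = \boldsymbol{\eta}_A - \Lambda_{AB} \Lambda_{BB}^{-1} \boldsymbol{\eta}_B, \qquad \Lambda'_A = \Lambda_{AA} - \Lambda_{AB} \Lambda_{BB}^{-1} \Lambda_{BA}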

Variable elimination. In Gaussian Markov networks, variable elimination = Gaussian elimination (fast for low-bandwidth, i.e. low-treewidth, matrices).

Tasks. Read Koller & Friedman, Sections 4.6.1 and 8.1-8.3.