Lecture 9: PGM Learning

Intro. to Stats. Machine Learning, COMP SCI 4401/7401, 13 Oct 2014

Table of Contents
1. Learning parameters in MRFs

Inference and Learning
Given the parameters (of the potentials) and the graph, one can ask for:
- MAP inference: x* = argmax_x P(x)
- Marginal inference: P(x_c) = ∑_{x_{V∖c}} P(x)
How to get the parameters and the graph? Learning.

Learning
- Learn the parameters when the graph is given (Lecture 9):
  - Bayes Nets (directed graphical models)
  - Markov Random Fields (undirected or factor graphical models)
- Structure estimation (learn or estimate the graph structure, Lecture 10)

Parameters for Bayesian networks
For Bayesian networks, P(x_1, ..., x_n) = ∏_{i=1}^n P(x_i | Pa(x_i)).
Parameters: the conditional distributions P(x_i | Pa(x_i)).
[Figure: the student network, with nodes Difficulty (D), Intelligence (I), Grade (G), SAT (S), Letter (L), Happy (H), Job (J) and CPDs P(D), P(I), P(G | D, I), P(S | I), P(L | G), P(J | L, S), P(H | G, J).]
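To make the factorisation concrete, here is a minimal Python sketch that evaluates the joint probability of one assignment as a product of local conditionals for the D → G ← I slice of the student network. All CPT values are hypothetical, chosen only for illustration.

```python
# Hypothetical CPTs for a slice of the student network: Difficulty -> Grade <- Intelligence.
# The numbers are illustrative only, not taken from the lecture.
P_D = {"easy": 0.6, "hard": 0.4}
P_I = {"low": 0.7, "high": 0.3}
P_G_given_DI = {  # P(G | D, I)
    ("easy", "low"):  {"A": 0.30, "B": 0.40, "C": 0.30},
    ("easy", "high"): {"A": 0.90, "B": 0.08, "C": 0.02},
    ("hard", "low"):  {"A": 0.05, "B": 0.25, "C": 0.70},
    ("hard", "high"): {"A": 0.50, "B": 0.30, "C": 0.20},
}

def joint(d, i, g):
    """P(D=d, I=i, G=g) = P(d) * P(i) * P(g | d, i), the product of the local CPDs."""
    return P_D[d] * P_I[i] * P_G_given_DI[(d, i)][g]

print(joint("hard", "high", "A"))  # 0.4 * 0.3 * 0.5 = 0.06
```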

Learning parameters in Bayes Net
Y = Yes, N = No.

Case  D  I  G  S  L  H  J
  1   Y  Y  Y  Y  Y  N  Y
  2   N  N  Y  N  N  Y  N
  3   Y  N  Y  N  N  Y  N
 ...

Estimate the CPDs by counting (relative frequencies, which are the maximum-likelihood estimates):
P(D = d) = N_{D=d} / N_total
P(G = g | D = d, I = i) = N_{G=g, D=d, I=i} / N_{D=d, I=i}
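A minimal sketch of these counting estimates, assuming the data arrives as a small, hypothetical list of fully observed (D, I, G) records mirroring the table above:

```python
from collections import Counter

# Hypothetical fully observed records (D, I, G); Y = Yes, N = No.
data = [
    ("Y", "Y", "Y"),
    ("N", "N", "Y"),
    ("Y", "N", "Y"),
]

n_total = len(data)
count_D = Counter(d for d, i, g in data)
count_DI = Counter((d, i) for d, i, g in data)
count_GDI = Counter((g, d, i) for d, i, g in data)

def p_D(d):
    """P(D = d) = N_{D=d} / N_total."""
    return count_D[d] / n_total

def p_G_given_DI(g, d, i):
    """P(G = g | D = d, I = i) = N_{G=g,D=d,I=i} / N_{D=d,I=i}."""
    return count_GDI[(g, d, i)] / count_DI[(d, i)]

print(p_D("Y"))                     # 2/3
print(p_G_given_DI("Y", "Y", "N"))  # 1/1 = 1.0
```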

Learning parameters in Bayes Net Problems?

Learning parameters in Bayes Net
Problems?
- It does not minimise the classification error.
- There is not much flexibility in the features or in the parameters.

Parameters for MRFs
For MRFs, let V be the set of nodes and C be the set of clusters c. Then

  P(x; θ) = exp(∑_{c∈C} θ_c(x_c)) / Z(θ),   (1)

where the normaliser is Z(θ) = ∑_x exp{∑_{c∈C} θ_c(x_c)}.
Parameters: {θ_c}_{c∈C}.
Inference:
- MAP inference: x* = argmax_x ∑_{c∈C} θ_c(x_c), since log P(x) equals ∑_{c∈C} θ_c(x_c) up to the constant −log Z(θ).
- Marginal inference: P(x_c) = ∑_{x_{V∖c}} P(x).
Learning (parameter estimation): learn θ and the graph structure. Often one assumes θ_c(x_c) = ⟨w, Φ_c(x_c)⟩ and learns w by empirical risk minimisation (ERM).
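As a toy-sized illustration of (1) and of the two inference problems above, the following sketch builds a 3-node pairwise MRF with made-up cluster potentials and computes Z(θ), the MAP assignment and one node marginal by brute-force enumeration (feasible only for tiny graphs; the rest of the lecture is about learning θ):

```python
import itertools

import numpy as np

STATES = [0, 1]
CLUSTERS = [(0, 1), (1, 2)]        # pairwise clusters of a 3-node chain

# Hypothetical cluster potentials theta_c(x_c): reward equal neighbouring states.
theta = {c: np.array([[0.8, -0.8],
                      [-0.8, 0.8]]) for c in CLUSTERS}

def score(x):
    """sum_c theta_c(x_c), the exponent in equation (1)."""
    return sum(theta[(i, j)][x[i], x[j]] for i, j in CLUSTERS)

configs = list(itertools.product(STATES, repeat=3))
Z = sum(np.exp(score(x)) for x in configs)          # normaliser Z(theta)

def prob(x):
    """P(x; theta) = exp(score(x)) / Z(theta)."""
    return np.exp(score(x)) / Z

x_map = max(configs, key=score)                     # MAP inference by enumeration
p_x1 = {s: sum(prob(x) for x in configs if x[1] == s) for s in STATES}  # marginal P(x_1)

print("Z =", Z)
print("MAP =", x_map)
print("P(x_1) =", p_x1)
```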

Parameters for MRFs
In learning, we look for an F that predicts labels well via y* = argmax_{y∈Y} F(x_i, y; w). Given a graph G = (V, E), one often assumes

  F(x, y; w) = ⟨w, Φ(x, y)⟩
             = ⟨w_1, ∑_{i∈V} Φ_i(y^(i), x)⟩ + ⟨w_2, ∑_{(i,j)∈E} Φ_{i,j}(y^(i), y^(j), x)⟩
             = ∑_{i∈V} θ_i(y^(i), x) + ∑_{(i,j)∈E} θ_{i,j}(y^(i), y^(j), x)   (MAP inference)

Here w = [w_1; w_2] and Φ(x, y) = [∑_{i∈V} Φ_i(y^(i), x); ∑_{(i,j)∈E} Φ_{i,j}(y^(i), y^(j), x)].
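As a concrete example of this decomposition, here is a short sketch that scores a labelling of a 3-node chain as a sum of node and edge terms and finds the MAP labelling by brute-force enumeration. The indicator-style features, observations and weights are all hypothetical.

```python
import itertools

import numpy as np

LABELS = [0, 1]
EDGES = [(0, 1), (1, 2)]            # a 3-node chain
x = np.array([0.2, -1.0, 0.7])      # hypothetical per-node observations

# Hypothetical weights: w1 scores (label, [bias, x_i]), w2 scores neighbouring label pairs.
w1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
w2 = np.array([[0.5, -0.5],
               [-0.5, 0.5]])

def F(y):
    """F(x, y; w) = sum_i theta_i(y^(i), x) + sum_(i,j) theta_ij(y^(i), y^(j), x)."""
    node = sum(w1[y[i], 0] + w1[y[i], 1] * x[i] for i in range(len(x)))
    edge = sum(w2[y[i], y[j]] for i, j in EDGES)
    return node + edge

# MAP inference by exhaustive enumeration (fine only for tiny graphs).
y_map = max(itertools.product(LABELS, repeat=len(x)), key=F)
print(y_map, F(y_map))
```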

We want a gap between F(x_i, y_i; w) and the best competing F(x_i, y; w) for y ≠ y_i, that is, a large margin

  F(x_i, y_i; w) − max_{y∈Y, y≠y_i} F(x_i, y; w).

Structured SVM - 1 (Learning parameters in MRFs)
Primal:

  min_{w,ξ} (1/2)‖w‖² + C ∑_{i=1}^m ξ_i   s.t.   (2a)
  ∀i, ∀y ≠ y_i: ⟨w, Φ(x_i, y_i) − Φ(x_i, y)⟩ ≥ Δ(y_i, y) − ξ_i.   (2b)

The dual is a quadratic programming (QP) problem:

  max_α ∑_{i, y≠y_i} Δ(y_i, y) α_{iy} − (1/2) ∑_{i,j} ∑_{y≠y_i, y'≠y_j} α_{iy} α_{jy'} ⟨Φ(x_i, y), Φ(x_j, y')⟩
  s.t. ∀i, ∀y ≠ y_i: α_{iy} ≥ 0;   ∀i: ∑_{y≠y_i} α_{iy} ≤ C.   (3)

(Here Δ(y_i, y) denotes the task loss between the true label y_i and a candidate y.)

Structured SVM - 2 (Learning parameters in MRFs)
The cutting-plane method needs to find the label of the most violated constraint in (2b):

  ŷ_i = argmax_{y∈Y} Δ(y_i, y) + ⟨w, Φ(x_i, y)⟩.   (4)

Given the ŷ_i, one can solve the following relaxed problem (with far fewer constraints):

  min_{w,ξ} (1/2)‖w‖² + C ∑_{i=1}^m ξ_i   s.t.   (5a)
  ∀i: ⟨w, Φ(x_i, y_i) − Φ(x_i, ŷ_i)⟩ ≥ Δ(y_i, ŷ_i) − ξ_i.   (5b)

Structured SVM - 3 (Learning parameters in MRFs)
Simplified overall procedure.
Input: data x_i, labels y_i, sample size m, number of iterations T.
Initialise S_0 = ∅, w_0 = 0 (or a random vector), and t = 0.
for t = 0 to T do
  for i = 1 to m do
    ŷ_i = argmax_{y∈Y, y≠y_i} ⟨w_t, Φ(x_i, y)⟩ + Δ(y_i, y)
    ξ_i = [ Δ(y_i, ŷ_i) + ⟨w_t, Φ(x_i, ŷ_i) − Φ(x_i, y_i)⟩ ]_+
    if ξ_i > 0 then
      increase the constraint set: S_t ← S_t ∪ {ŷ_i}
    end if
  end for
  w_t = ∑_i ∑_{y∈S_t} α_{iy} Φ(x_i, y), where α ← optimise the dual QP with constraint set S_t
end for
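Below is a rough Python sketch of this loop for a toy multiclass problem, where Y is a small finite set, Φ(x, y) places x in the block of class y, and Δ is the 0/1 loss. Everything here is hypothetical, and as a simplification the restricted problem is solved with a few primal subgradient steps instead of the dual QP the slide calls for.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_feats, m = 3, 4, 30
X = rng.normal(size=(m, n_feats))
w_true = rng.normal(size=(n_classes, n_feats))
Y = np.array([int(np.argmax(w_true @ x)) for x in X])   # hypothetical separable labels

def phi(x, y):
    """Joint feature map Phi(x, y): x placed in the block belonging to class y."""
    f = np.zeros(n_classes * n_feats)
    f[y * n_feats:(y + 1) * n_feats] = x
    return f

def delta(y_true, y):
    """0/1 task loss Delta(y_true, y)."""
    return float(y_true != y)

C, T = 1.0, 20
w = np.zeros(n_classes * n_feats)
S = [set() for _ in range(m)]                           # working constraint sets S_t

for t in range(T):
    for i in range(m):
        # most violated constraint: argmax_y <w, Phi(x_i, y)> + Delta(y_i, y)
        scores = [w @ phi(X[i], y) + delta(Y[i], y) for y in range(n_classes)]
        y_hat = int(np.argmax(scores))
        slack = scores[y_hat] - w @ phi(X[i], Y[i])
        if y_hat != Y[i] and slack > 1e-6:
            S[i].add(y_hat)                             # grow the constraint set
    # simplification: subgradient steps on the restricted primal instead of the dual QP
    for _ in range(10):
        grad = w.copy()                                 # gradient of (1/2)||w||^2
        for i in range(m):
            if not S[i]:
                continue
            y_w = max(S[i], key=lambda y: w @ phi(X[i], y) + delta(Y[i], y))
            if w @ (phi(X[i], Y[i]) - phi(X[i], y_w)) < delta(Y[i], y_w):
                grad += C * (phi(X[i], y_w) - phi(X[i], Y[i]))
        w -= 0.05 * grad

acc = np.mean([np.argmax([w @ phi(x, y) for y in range(n_classes)]) == yi
               for x, yi in zip(X, Y)])
print("training accuracy:", acc)
```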

Other
Other approaches use the max-margin principle, such as the Max-Margin Markov Network (M3N), ...

Main types:
- Maximum Entropy (MaxEnt)
- Maximum a Posteriori (MAP)
- Maximum Likelihood (ML)

Maximum Entropy (Learning parameters in MRFs)
Maximum Entropy (ME) estimates w by maximising the entropy. That is,

  w* = argmax_w − ∑_{x∈X, y∈Y} P_w(x, y) ln P_w(x, y).

There is a duality between maximum likelihood and maximum entropy subject to moment-matching constraints on the expectations of the features.

MAP (Learning parameters in MRFs)
Let the likelihood function L(w) be the modelled probability or density for the occurrence of the sample configuration (x_1, y_1), ..., (x_m, y_m) given the probability density P_w parameterised by w. That is,

  L(w) = P_w((x_1, y_1), ..., (x_m, y_m)).

Maximum a Posteriori (MAP) estimates w by maximising L(w) times a prior P(w). That is,

  w* = argmax_w L(w) P(w).   (6)

Assuming {(x_i, y_i)}_{1≤i≤m} are i.i.d. samples from P_w(x, y), (6) becomes

  w* = argmax_w ∏_{1≤i≤m} P_w(x_i, y_i) P(w)
     = argmin_w − ∑_{1≤i≤m} ln P_w(x_i, y_i) − ln P(w).

Maximum Likelihood
Maximum Likelihood (ML) is the special case of MAP in which P(w) is uniform, which means

  w* = argmax_w ∏_{1≤i≤m} P_w(x_i, y_i)
     = argmin_w − ∑_{1≤i≤m} ln P_w(x_i, y_i).

Alternatively, one can replace the joint distribution P_w(x, y) by the conditional distribution P_w(y | x); this gives a discriminative model called Conditional Random Fields (CRFs).

Conditional Random Fields (CRFs) - 1
Assume the conditional distribution over Y given X is in the exponential family, i.e.,

  P(y | x; w) = exp(⟨w, Φ(x, y)⟩) / Z(w | x),   (7)

where

  Z(w | x) = ∑_{y'∈Y} exp(⟨w, Φ(x, y')⟩),   (8)

and Φ(x, y) = [∑_{i∈V} Φ_i(y^(i), x); ∑_{(i,j)∈E} Φ_{i,j}(y^(i), y^(j), x)], w = [w_1; w_2].
More generally, the global feature vector can be decomposed into local features on cliques (fully connected subgraphs).

CRFs - 2 (Learning parameters in MRFs)
Denote (x_1, ..., x_m) by X and (y_1, ..., y_m) by Y. The classical approach is to maximise the conditional likelihood of Y given X, incorporating a prior on the parameters. This is a Maximum a Posteriori (MAP) estimator, which consists of maximising

  P(w | X, Y) ∝ P(w) P(Y | X; w).

From the i.i.d. assumption we have

  P(Y | X; w) = ∏_{i=1}^m P(y_i | x_i; w),

and we impose a Gaussian prior on w:

  P(w) ∝ exp(−‖w‖² / (2σ²)).

CRFs - 3 (Learning parameters in MRFs)
Maximising the posterior distribution can also be seen as minimising the negative log-posterior, which becomes our risk function R(w | X, Y):

  R(w | X, Y) = −ln(P(w) P(Y | X; w)) + c
              = ‖w‖²/(2σ²) + ∑_{i=1}^m [ ln Z(w | x_i) − ⟨Φ(x_i, y_i), w⟩ ] + c,

where c is a constant and the bracketed term is the log loss l_L(x_i, y_i, w), i.e. the negative log-likelihood. Now learning is equivalent to

  w* = argmin_w R(w | X, Y).
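Here is a minimal sketch of this risk for a toy linear-chain CRF. The features and training pairs are hypothetical, and the partition function is computed by brute-force enumeration purely for clarity (real implementations use dynamic programming or message passing, as discussed on a later slide).

```python
import itertools

import numpy as np

LABELS = [0, 1]
SIGMA2 = 1.0                       # variance of the Gaussian prior

def features(x, y):
    """Phi(x, y): per-label sums of the observations plus edge transition counts."""
    node = np.zeros(len(LABELS))
    for xi, yi in zip(x, y):
        node[yi] += xi
    edge = np.zeros((len(LABELS), len(LABELS)))
    for a, b in zip(y[:-1], y[1:]):
        edge[a, b] += 1.0
    return np.concatenate([node, edge.ravel()])

def log_Z(w, x):
    """ln Z(w | x) by enumerating all labellings (toy sizes only)."""
    scores = [w @ features(x, y) for y in itertools.product(LABELS, repeat=len(x))]
    m = max(scores)
    return m + np.log(sum(np.exp(s - m) for s in scores))

def risk(w, X, Y):
    """R(w | X, Y) = ||w||^2/(2 sigma^2) + sum_i [ln Z(w | x_i) - <Phi(x_i, y_i), w>]."""
    r = w @ w / (2 * SIGMA2)
    for x, y in zip(X, Y):
        r += log_Z(w, x) - w @ features(x, y)
    return r

# Hypothetical training pairs (observation sequence, label sequence).
X = [np.array([0.5, -1.2, 0.3]), np.array([1.0, 0.1])]
Y = [(1, 0, 1), (1, 1)]
w = np.zeros(len(LABELS) + len(LABELS) ** 2)
print(risk(w, X, Y))               # at w = 0 this is (3 + 2) * ln 2
```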

CRFs - 4 (Learning parameters in MRFs)
The above is a convex optimisation problem in w, since ln Z(w | x) is a convex function of w. The solution can be obtained by gradient descent, since ln Z(w | x) is also differentiable. We have

  ∇_w R(w | X, Y) = w/σ² − ∑_{i=1}^m ( Φ(x_i, y_i) − ∇_w ln Z(w | x_i) ).

It follows by direct computation that

  ∇_w ln Z(w | x) = E_{y∼P(y|x;w)}[Φ(x, y)].
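The identity ∇_w ln Z(w | x) = E_{y∼P(y|x;w)}[Φ(x, y)] is easy to check numerically. The sketch below (with the same toy feature map as in the previous block, all values hypothetical) compares the expected feature vector against a finite-difference estimate of the gradient of ln Z:

```python
import itertools

import numpy as np

LABELS = [0, 1]

def features(x, y):
    """Toy Phi(x, y): per-label sums of the observations plus edge transition counts."""
    node = np.zeros(len(LABELS))
    for xi, yi in zip(x, y):
        node[yi] += xi
    edge = np.zeros((len(LABELS), len(LABELS)))
    for a, b in zip(y[:-1], y[1:]):
        edge[a, b] += 1.0
    return np.concatenate([node, edge.ravel()])

def all_labellings(n):
    return list(itertools.product(LABELS, repeat=n))

def log_Z(w, x):
    return np.log(sum(np.exp(w @ features(x, y)) for y in all_labellings(len(x))))

def expected_features(w, x):
    """E_{y ~ P(y|x;w)}[Phi(x, y)], the analytic gradient of ln Z(w | x)."""
    ys = all_labellings(len(x))
    logits = np.array([w @ features(x, y) for y in ys])
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return sum(pi * features(x, y) for pi, y in zip(p, ys))

x = np.array([0.4, -0.7, 1.1])                       # hypothetical observations
w = np.random.default_rng(1).normal(size=len(LABELS) + len(LABELS) ** 2)

analytic = expected_features(w, x)
numeric, eps = np.zeros_like(w), 1e-5
for k in range(len(w)):                              # central finite differences
    e = np.zeros_like(w)
    e[k] = eps
    numeric[k] = (log_Z(w + e, x) - log_Z(w - e, x)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))            # ~1e-10: the two gradients agree
```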

CRFs - 5 (Learning parameters in MRFs)
Since Φ(x, y) decomposes over nodes and edges, it is straightforward to show that the expectation also decomposes into expectations over the nodes V and the edges E:

  E_{y∼P(y|x;w)}[Φ(x, y)] = ∑_{i∈V} E_{y^(i)∼P(y^(i)|x;w)}[Φ_i(y^(i), x)]
                            + ∑_{(i,j)∈E} E_{y^(i),y^(j)∼P(y^(i),y^(j)|x;w)}[Φ_{i,j}(y^(i), y^(j), x)],

where the node and edge expectations require the marginals P(y^(i) | x; w) and P(y^(i), y^(j) | x; w). These can be computed exactly by variable elimination or the junction tree algorithm, approximately by e.g. (loopy) belief propagation, or circumvented through sampling.
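For a linear chain, these node marginals come out of the forward-backward (sum-product) recursion. The sketch below uses hypothetical node and edge scores (the edge score is shared across positions just to keep it short) and checks the message-passing result against brute-force enumeration:

```python
import itertools

import numpy as np

K, n = 2, 4                                    # number of labels, chain length
rng = np.random.default_rng(2)
node_score = rng.normal(size=(n, K))           # hypothetical theta_i(y^(i), x)
edge_score = rng.normal(size=(K, K))           # hypothetical theta_ij(y^(i), y^(j)), shared

def node_marginals_fb(node_score, edge_score):
    """P(y^(i) | x; w) on a chain via forward-backward (sum-product) messages."""
    n, K = node_score.shape
    psi_n, psi_e = np.exp(node_score), np.exp(edge_score)   # potentials
    alpha, beta = np.zeros((n, K)), np.zeros((n, K))
    alpha[0] = psi_n[0]
    for i in range(1, n):                      # forward pass
        alpha[i] = psi_n[i] * (alpha[i - 1] @ psi_e)
    beta[-1] = 1.0
    for i in range(n - 2, -1, -1):             # backward pass
        beta[i] = psi_e @ (psi_n[i + 1] * beta[i + 1])
    marg = alpha * beta
    return marg / marg.sum(axis=1, keepdims=True)

def node_marginals_brute(node_score, edge_score):
    """The same marginals by enumerating all K^n labellings (check only)."""
    n, K = node_score.shape
    marg = np.zeros((n, K))
    for y in itertools.product(range(K), repeat=n):
        s = sum(node_score[i, y[i]] for i in range(n))
        s += sum(edge_score[y[i], y[i + 1]] for i in range(n - 1))
        for i in range(n):
            marg[i, y[i]] += np.exp(s)
    return marg / marg.sum(axis=1, keepdims=True)

fb = node_marginals_fb(node_score, edge_score)
brute = node_marginals_brute(node_score, edge_score)
print(np.max(np.abs(fb - brute)))              # ~1e-16: the two computations agree
```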

That's all (Learning parameters in MRFs)
Thanks!