Probabilistic Graphical Models


David Sontag, New York University
Lecture 12, April 23, 2013

What notion of "best" should learning be optimizing? This depends on what we want to do:
1. Density estimation: we are interested in the full distribution (so later we can compute whatever conditional probabilities we want)
2. Specific prediction tasks: we are using the distribution to make a prediction
3. Structure or knowledge discovery: we are interested in the model itself

Density estimation for conditional models

Suppose we want to predict a set of variables Y given some others X, e.g., for segmentation or stereo vision (input: two images; output: disparity).

We concentrate on predicting p(y | x), and use a conditional loss function

loss(x, y, M̂) = −log p̂(y | x).

Since the loss function only depends on p̂(y | x), it suffices to estimate the conditional distribution, not the joint.

Density estimation for conditional models

CRF: p(y | x) = (1/Z(x)) ∏_{c ∈ C} φ_c(x, y_c),   where Z(x) = Σ_ŷ ∏_{c ∈ C} φ_c(x, ŷ_c)

Parameterization as a log-linear model: weights w ∈ R^d, feature vectors f_c(x, y_c) ∈ R^d, and

φ_c(x, y_c; w) = exp( w · f_c(x, y_c) )

Empirical risk minimization with CRFs, i.e. min_M̂ E_D[ loss(x, y, M̂) ]:

w_ML = arg min_w −(1/|D|) Σ_{(x,y) ∈ D} log p(y | x; w)
     = arg max_w Σ_{(x,y) ∈ D} ( Σ_c log φ_c(x, y_c; w) − log Z(x; w) )
     = arg max_w Σ_{(x,y) ∈ D} ( w · Σ_c f_c(x, y_c) − log Z(x; w) )
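
To make the estimation objective concrete, here is a minimal Python sketch (not from the lecture) that evaluates the average conditional log-likelihood by brute-force enumeration of Z(x; w). The helper feature_fn, assumed to return Σ_c f_c(x, y_c) as a sparse dict, and the tiny label set are illustrative assumptions; real CRF training would compute Z(x; w) with dynamic programming.

```python
# Minimal sketch: CRF conditional log-likelihood with a brute-force partition
# function. Only usable for toy label spaces. feature_fn is an assumed helper
# returning sum_c f_c(x, y_c) as a sparse dict {feature_name: count}.
import itertools
import math

def dot(w, f):
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def log_partition(w, feature_fn, x, labels, n):
    # log Z(x; w) by enumerating all |T|^n labelings (log-sum-exp for stability)
    scores = [dot(w, feature_fn(x, list(y)))
              for y in itertools.product(labels, repeat=n)]
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def avg_log_likelihood(w, feature_fn, data, labels):
    # (1/|D|) sum_{(x,y) in D} log p(y | x; w); the ML weights maximize this
    return sum(dot(w, feature_fn(x, y)) - log_partition(w, feature_fn, x, labels, len(y))
               for x, y in data) / len(data)
```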

Application: Part-of-speech tagging

Example sentence: United flies some large jet
Gold tags:        N      V     D    A     N
Word positions:   United_1 flies_2 some_3 large_4 jet_5

Graphical model formulation of POS tagging

Given: a sentence of length n and a tag set T
- one variable for each word, taking values in T
- edge potentials θ(i−1, i, t', t) for all i ≤ n and t', t ∈ T

Example: United_1 flies_2 some_3 large_4 jet_5 with T = {A, D, N, V}

Note: for a probabilistic HMM, θ(i−1, i, t', t) = log( p(w_i | t) p(t | t') )

Features for POS tagging

Edge potentials: fully parameterize (|T| · |T| features and weights), i.e.

θ_{i−1,i}(t', t) = w^T_{t',t}

where the superscript T denotes that these are the weights for the transitions.

Node potentials: introduce features for the presence or absence of certain attributes of each word (e.g., initial letter capitalized, suffix is "ing"), for each possible tag (|T| · #attributes features and weights). This part is conditional on the input sentence!

The edge potential is the same for all edges; likewise, the node potential is the same for all nodes.
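
As a concrete illustration (an assumption, not the lecture's exact feature set), a linear-chain feature function f(x, y) for POS tagging might look like the following sketch, with one feature per tag transition and one per (word attribute, tag) pair:

```python
# Sketch of linear-chain features for POS tagging; the attribute list is
# illustrative only. Features are returned as a sparse dict of counts.
from collections import Counter

def word_attributes(word):
    # Illustrative attributes: word identity, capitalization, "-ing" suffix.
    attrs = ["word=" + word.lower()]
    if word[0].isupper():
        attrs.append("init_cap")
    if word.endswith("ing"):
        attrs.append("suffix=ing")
    return attrs

def chain_features(sentence, tags):
    # f(x, y) = sum_c f_c(x, y_c): node features pair word attributes with the
    # word's tag; edge features count tag transitions (t_{i-1}, t_i).
    f = Counter()
    for i, (word, tag) in enumerate(zip(sentence, tags)):
        for a in word_attributes(word):
            f[("node", a, tag)] += 1
        if i > 0:
            f[("edge", tags[i - 1], tag)] += 1
    return f

# Example from the slides:
# chain_features("United flies some large jet".split(), ["N", "V", "D", "A", "N"])
```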

Structured prediction

Often we learn a model for the purpose of structured prediction, in which given x we predict y by finding the MAP assignment:

arg max_y p̂(y | x)

Rather than learn using log-loss (density estimation), we use a loss function better suited to the specific task. One reasonable choice would be the classification error:

E_{(x,y)~p}[ 1I{ ∃ y' ≠ y s.t. p̂(y' | x) ≥ p̂(y | x) } ]

which is the probability, over all (x, y) pairs sampled from p, that our classifier fails to select the right labels.

If p is in the model family, training with log-loss (density estimation) and training with classification error would perform similarly (given sufficient data). Otherwise, it is better to directly go for what we care about (classification error).

Structured prediction

Consider the empirical risk for 0-1 loss (classification error):

(1/|D|) Σ_{(x,y) ∈ D} 1I{ ∃ y' ≠ y s.t. p̂(y' | x) ≥ p̂(y | x) }

Each constraint p̂(y | x) ≥ p̂(y' | x) is equivalent to

w · Σ_c f_c(x, y'_c) − log Z(x; w) ≤ w · Σ_c f_c(x, y_c) − log Z(x; w)

The log-partition function cancels out on both sides. Re-arranging, we have:

w · ( Σ_c f_c(x, y_c) − Σ_c f_c(x, y'_c) ) ≥ 0

Said differently, the empirical risk is zero when, for all (x, y) ∈ D and all y' ≠ y,

w · ( Σ_c f_c(x, y_c) − Σ_c f_c(x, y'_c) ) > 0.

Structured prediction

Empirical risk is zero when, for all (x, y) ∈ D and all y' ≠ y,

w · ( Σ_c f_c(x, y_c) − Σ_c f_c(x, y'_c) ) > 0.

In the simplest setting, learning corresponds to finding a weight vector w that satisfies all of these constraints (when possible).

This is a linear program (LP)! How many constraints does it have? |D| · |Y|, i.e., exponentially many! Thus, we must avoid explicitly representing this LP.

This lecture is about algorithms for solving this LP (or some variant) in a tractable manner.
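
For intuition, a toy brute-force check of these constraints might look like the sketch below; feature_fn and the label set are assumptions, and the enumeration over Y is only feasible for tiny problems, which is exactly why the rest of the lecture avoids representing the LP explicitly.

```python
# Toy sketch of the (exponentially large) constraint set: for each training
# pair and each competing labeling y' != y, check w . (f(x,y) - f(x,y')) > 0.
import itertools

def dot(w, f):
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def satisfies_all_constraints(w, data, feature_fn, labels):
    # Brute force over |T|^n labelings per example; tiny problems only.
    for x, y in data:
        f_true = feature_fn(x, y)
        for y_alt in itertools.product(labels, repeat=len(y)):
            y_alt = list(y_alt)
            if y_alt == list(y):
                continue
            if dot(w, f_true) - dot(w, feature_fn(x, y_alt)) <= 0:
                return False   # found a violated constraint
    return True
```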

Structured perceptron algorithm

Input: training examples D = {(x^m, y^m)}

Let f(x, y) = Σ_c f_c(x, y_c). Then the constraints that we want to satisfy are

w · ( f(x^m, y^m) − f(x^m, y) ) > 0,   ∀ y ≠ y^m

The perceptron algorithm uses MAP inference in its inner loop:

MAP(x^m; w) = arg max_{y ∈ Y} w · f(x^m, y)

The maximization can often be performed efficiently by using the structure!

The perceptron algorithm is then:
1. Start with w = 0
2. While the weight vector is still changing:
3.   For m = 1, ..., |D|:
4.     y ← MAP(x^m; w)
5.     w ← w + f(x^m, y^m) − f(x^m, y)
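
A compact Python sketch of this loop is below, assuming a sparse-dict feature function f(x, y) and a map_inference(x, w) oracle (e.g., Viterbi for chain-structured models); both helpers are assumptions, not given in the slides.

```python
# Sketch of the structured perceptron with sparse dict weights/features.
from collections import defaultdict

def structured_perceptron(data, feature_fn, map_inference, epochs=10):
    w = defaultdict(float)                        # start with w = 0
    for _ in range(epochs):                       # outer iterations ("epochs")
        changed = False
        for x, y_true in data:
            y_hat = map_inference(x, w)           # MAP(x^m; w) inner loop
            if y_hat != y_true:
                # w <- w + f(x^m, y^m) - f(x^m, y_hat)
                for k, v in feature_fn(x, y_true).items():
                    w[k] += v
                for k, v in feature_fn(x, y_hat).items():
                    w[k] -= v
                changed = True
        if not changed:                           # all constraints satisfied
            break
    return w
```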

Structured perceptron algorithm

If the training data is separable, the perceptron algorithm is guaranteed to find a weight vector which perfectly classifies all of the data.

When the data is separable with margin γ, the number of iterations is at most

( 2R / γ )²,   where R = max_{m,y} || f(x^m, y) ||_2

In practice, one stops after a certain number of outer iterations (called epochs), and uses the average of all weight vectors. The averaging can be understood as a type of regularization to prevent overfitting.
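
The averaging trick mentioned above can be sketched as a small variant of the previous loop; this is an illustrative sketch, not the lecture's exact procedure.

```python
# Sketch of the averaged structured perceptron: accumulate the weight vector
# after every example and return the average over all updates.
from collections import defaultdict

def averaged_structured_perceptron(data, feature_fn, map_inference, epochs=10):
    w = defaultdict(float)
    w_sum = defaultdict(float)
    steps = 0
    for _ in range(epochs):
        for x, y_true in data:
            y_hat = map_inference(x, w)
            if y_hat != y_true:
                for k, v in feature_fn(x, y_true).items():
                    w[k] += v
                for k, v in feature_fn(x, y_hat).items():
                    w[k] -= v
            for k, v in w.items():                # accumulate after each example
                w_sum[k] += v
            steps += 1
    return {k: v / steps for k, v in w_sum.items()}
```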

Allowing slack

We can equivalently write the constraints as

w · ( f(x^m, y^m) − f(x^m, y) ) ≥ 1,   ∀ y ≠ y^m

Suppose there do not exist weights w that satisfy all constraints. Introduce slack variables ξ_m ≥ 0, one per data point, to allow for constraint violations:

w · ( f(x^m, y^m) − f(x^m, y) ) ≥ 1 − ξ_m,   ∀ y ≠ y^m

Then, minimize the sum of the slack variables, min_{ξ ≥ 0} Σ_m ξ_m, subject to the above constraints.

Structural SVM (support vector machine)

min_{w,ξ} Σ_m ξ_m + C ||w||²

subject to:
w · ( f(x^m, y^m) − f(x^m, y) ) ≥ 1 − ξ_m,   ∀ m, y ≠ y^m
ξ_m ≥ 0,   ∀ m

This is a quadratic program (QP). Solving for the slack variables in closed form, we obtain

ξ*_m = max( 0, max_{y ∈ Y} 1 − w · ( f(x^m, y^m) − f(x^m, y) ) )

Thus, we can re-write the whole optimization problem as

min_w Σ_m max( 0, max_{y ∈ Y} 1 − w · ( f(x^m, y^m) − f(x^m, y) ) ) + C ||w||²

Hinge loss

We can view max( 0, max_{y ∈ Y} 1 − w · ( f(x^m, y^m) − f(x^m, y) ) ) as a loss function, called hinge loss.

When w · f(x^m, y^m) ≥ w · f(x^m, y) for all y (i.e., correct prediction), this takes a value between 0 and 1.

When ∃ y such that w · f(x^m, y) ≥ w · f(x^m, y^m) (i.e., incorrect prediction), this takes a value ≥ 1.

Thus, hinge loss always upper bounds the 0-1 loss! Minimizing hinge loss is good because it minimizes an upper bound on the 0-1 loss (prediction error).
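
A brute-force sketch of the structured hinge loss for a single example follows, assuming the same sparse-dict feature function as before and taking the inner max over y ≠ y^m to match the constraints above; real implementations replace the enumeration with (loss-augmented) MAP inference.

```python
# Sketch: structured hinge loss for one example by enumerating Y (toy sizes only).
import itertools

def dot(w, f):
    return sum(w.get(k, 0.0) * v for k, v in f.items())

def structured_hinge_loss(w, feature_fn, x, y_true, labels):
    # max(0, max_{y != y_true} 1 - w.(f(x, y_true) - f(x, y)))
    f_true = feature_fn(x, y_true)
    loss = 0.0
    for y in itertools.product(labels, repeat=len(y_true)):
        y = list(y)
        if y == list(y_true):
            continue
        margin = dot(w, f_true) - dot(w, feature_fn(x, y))
        loss = max(loss, 1.0 - margin)
    return loss
```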

Better Metrics

It doesn't always make sense to penalize all incorrect predictions equally! We can change the constraints to

w · ( f(x^m, y^m) − f(x^m, y) ) ≥ Δ(y, y^m) − ξ_m,   ∀ y,

where Δ(y, y^m) ≥ 0 is a measure of how far the assignment y is from the true assignment y^m. This is called margin scaling (as opposed to slack scaling).

We assume that Δ(y, y) = 0, which allows us to say that the constraint holds for all y, rather than just y ≠ y^m.

A frequently used metric for MRFs is the Hamming distance, where Δ(y, y^m) = Σ_{i ∈ V} 1I[ y_i ≠ y^m_i ].

Structural SVM with margin scaling

min_w Σ_m max_{y ∈ Y} ( Δ(y, y^m) − w · ( f(x^m, y^m) − f(x^m, y) ) ) + C ||w||²

How to solve this? Many methods!
1. Cutting-plane algorithm (Tsochantaridis et al., 2005)
2. Stochastic subgradient method (Ratliff et al., 2007)
3. Dual Loss Primal Weights algorithm (Meshi et al., 2010)
4. Frank-Wolfe algorithm (Lacoste-Julien et al., 2013)

Stochastic subgradient method

min_w Σ_m max_{y ∈ Y} ( Δ(y, y^m) − w · ( f(x^m, y^m) − f(x^m, y) ) ) + C ||w||²

Although this objective is convex, it is not differentiable everywhere. We can use a subgradient method to minimize it (instead of gradient descent).

The subgradient of max_{y ∈ Y} ( Δ(y, y^m) − w · ( f(x^m, y^m) − f(x^m, y) ) ) at w^(t) is f(x^m, ŷ) − f(x^m, y^m), where ŷ is one of the maximizers with respect to w^(t), i.e.

ŷ = arg max_{y ∈ Y} Δ(y, y^m) + w^(t) · f(x^m, y)

This maximization is called loss-augmented MAP inference.
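
One step of this method might be sketched as follows, assuming a loss_augmented_map(x, y_true, w) oracle (a sketch of it follows the next slide) and sparse-dict weights; the step size eta and the per-example scaling of the regularizer are simplifications, not prescriptions from the lecture.

```python
# Sketch: one stochastic subgradient step on the margin-scaled objective.
def subgradient_step(w, x, y_true, feature_fn, loss_augmented_map, C, eta, n):
    y_hat = loss_augmented_map(x, y_true, w)   # arg max_y Delta(y, y^m) + w.f(x, y)
    if y_hat != y_true:
        # Subgradient of the max term: f(x, y_hat) - f(x, y_true)
        for k, v in feature_fn(x, y_hat).items():
            w[k] = w.get(k, 0.0) - eta * v
        for k, v in feature_fn(x, y_true).items():
            w[k] = w.get(k, 0.0) + eta * v
    # Gradient of the regularizer C*||w||^2, spread over the n training examples
    for k in list(w):
        w[k] -= eta * 2.0 * C * w[k] / n
    return w
```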

Loss-augmented inference

ŷ = arg max_{y ∈ Y} Δ(y, y^m) + w^(t) · f(x^m, y)

When Δ(y, y^m) = Σ_{i ∈ V} 1I[ y_i ≠ y^m_i ], this corresponds to adding additional single-node potentials θ_i(y_i) = 1 if y_i ≠ y^m_i, and 0 otherwise.

If MAP inference was previously exactly solvable by a combinatorial algorithm, loss-augmented MAP inference typically is too.

The Hamming distance pushes the MAP solution away from the true assignment y^m.
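
For a chain model, loss-augmented MAP with the Hamming distance is just Viterbi run on augmented node potentials; here is a sketch (dict-based potentials are an assumption of this example, not the lecture's data structures).

```python
# Sketch: loss-augmented MAP on a chain. Add +1 to the node potential of every
# label that disagrees with y_true, then run ordinary Viterbi.
def loss_augmented_viterbi(node_pot, edge_pot, labels, y_true):
    # node_pot[i][t]: node score at position i; edge_pot[(t_prev, t)]: transition score.
    n = len(node_pot)
    aug = [{t: node_pot[i][t] + (0.0 if t == y_true[i] else 1.0) for t in labels}
           for i in range(n)]
    best = [dict() for _ in range(n)]   # best score of any prefix ending in tag t
    back = [dict() for _ in range(n)]   # backpointers
    for t in labels:
        best[0][t] = aug[0][t]
    for i in range(1, n):
        for t in labels:
            scores = {tp: best[i - 1][tp] + edge_pot[(tp, t)] for tp in labels}
            back[i][t] = max(scores, key=scores.get)
            best[i][t] = scores[back[i][t]] + aug[i][t]
    y = [max(best[n - 1], key=best[n - 1].get)]
    for i in range(n - 1, 0, -1):
        y.append(back[i][y[-1]])
    return list(reversed(y))
```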

Cutting-plane algorithm

min_{w,ξ} Σ_m ξ_m + C ||w||²

subject to:
w · ( f(x^m, y^m) − f(x^m, y) ) ≥ Δ(y, y^m) − ξ_m,   ∀ m, y ∈ Y^m
ξ_m ≥ 0,   ∀ m

Start with Y^m = {y^m}. Solve for the optimal w, ξ. Then, look to see whether any of the unused constraints are violated.

To find a violated constraint for data point m, simply solve the loss-augmented inference problem:

ŷ = arg max_{y ∈ Y} Δ(y, y^m) + w · f(x^m, y)

If ŷ ∈ Y^m, do nothing. Otherwise, let Y^m = Y^m ∪ {ŷ}.

Repeat until no new constraints are added. Then we are optimal!
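
A high-level sketch of this loop follows, assuming a solve_restricted_qp(data, active, feature_fn, C) helper that solves the QP over the active constraint sets (e.g., with a generic QP solver) and the loss_augmented_map oracle sketched earlier; both helpers are assumptions.

```python
# Sketch of the cutting-plane loop for the margin-scaled structural SVM.
def cutting_plane(data, feature_fn, loss_augmented_map, solve_restricted_qp, C,
                  max_iters=100):
    active = [{tuple(y)} for (_, y) in data]   # working sets Y^m, initialized to {y^m}
    w = {}
    for _ in range(max_iters):
        # Optimal (w, xi) subject only to the constraints currently in `active`.
        w, xi = solve_restricted_qp(data, active, feature_fn, C)
        added = False
        for m, (x, y_true) in enumerate(data):
            # Most violated constraint via loss-augmented MAP inference.
            y_hat = tuple(loss_augmented_map(x, y_true, w))
            if y_hat not in active[m]:
                active[m].add(y_hat)
                added = True
        if not added:        # no new constraints were added: we are optimal
            break
    return w
```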

Cutting-plane algorithm

One can prove that solving the structural SVM up to ε (additive) accuracy takes a polynomial number of iterations. In practice, the algorithm terminates very quickly.

Block-Coordinate Frank-Wolfe Optimization for Structural SVMs: summary of convergence rates

Convergence rates, given in the number of calls to the oracles, for different optimization algorithms for the structural SVM objective (in the case of a Markov random field structure), to reach a specific accuracy ε measured for different types of gaps, in terms of the number of training examples n, the regularization parameter λ, the size of the label space |Y|, and the maximum feature norm R := max_{i,y} ||ψ_i(y)||_2 (some minor terms ignored for succinctness; table inspired from Zhang et al., 2011). Notice that only the stochastic subgradient method and the proposed block-coordinate Frank-Wolfe algorithm have rates independent of n.

Optimization algorithm | Online | Primal/Dual | Type of guarantee | Oracle type | # Oracle calls
dual extragradient (Taskar et al., 2006) | no | primal-dual | saddle point gap | Bregman projection | O( nR log|Y| / (λε) )
online exponentiated gradient (Collins et al., 2008) | yes | dual | expected dual error | expectation | O( (n + log|Y|) R² / (λε) )
excessive gap reduction (Zhang et al., 2011) | no | primal-dual | duality gap | expectation | O( nR √(log|Y| / (λε)) )
BMRM (Teo et al., 2010) | no | primal | primal error | maximization | O( nR² / (λε) )
1-slack SVM-Struct (Joachims et al., 2009) | no | primal-dual | duality gap | maximization | O( nR² / (λε) )
stochastic subgradient (Shalev-Shwartz et al., 2010a) | yes | primal | primal error w.h.p. | maximization | Õ( R² / (λε) )
this paper: block-coordinate Frank-Wolfe | yes | primal-dual | expected duality gap | maximization | O( R² / (λε) ) (Thm. 3)

Here n = |D| is the number of training examples, and λ is the regularization constant (corresponding to 2C/n in the notation of the earlier slides).

Application to segmentation & support inference (joint work with Nathan Silberman and Rob Fergus)

[Figure: overview of the algorithm, flowing from left to right. Given an input image with raw and inpainted depth maps, surface normals are computed and aligned to the room by finding three dominant orthogonal directions; the pipeline then produces region features, segmentations, structure labels, and support relations.]

Application to machine translation

Word alignment between languages, e.g., aligning the French sentence "le un de les grands objectifs de les consultations est de faire en sorte que la relance profite également à tous" with the English sentence "one of the major objectives of these consultations is to make sure that the recovery benefits all."

[Figure: alignment matrices comparing a Dice-and-distance model with a model using all features (Taskar, Lacoste-Julien, and Klein, 2005).]