Part 4: Conditional Random Fields

Sebastian Nowozin and Christoph H. Lampert
Colorado Springs, 25th June 2011

Problem (Probabilistic Learning)

Let $d(y|x)$ be the (unknown) true conditional distribution. Let $\mathcal{D} = \{(x^1, y^1), \dots, (x^N, y^N)\}$ be i.i.d. samples from $d(x, y)$.

Find a distribution $p(y|x)$ that we can use as a proxy for $d(y|x)$, or: given a parametrized family of distributions $p(y|x, w)$, find the parameter $w$ making $p(y|x, w)$ closest to $d(y|x)$.

Open questions:
- What do we mean by "closest"?
- What is a good candidate for $p(y|x, w)$?
- How do we actually find $w$, conceptually and numerically?

Principle of Parsimony (Parsimony, a.k.a. Occam's razor)

"Pluralitas non est ponenda sine necessitate." (Plurality is not to be posited without necessity.) (William of Ockham)

"We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances." (Isaac Newton)

"Make everything as simple as possible, but not simpler." (Albert Einstein)

"Use the simplest explanation that covers all the facts." (what we'll use)

1) Define which aspects we consider relevant facts about the data.
2) Pick the simplest distribution reflecting that.

Definition (Simplicity = Entropy)
The simplicity of a distribution $p$ is given by its entropy:
$H(p) = -\sum_{z \in \mathcal{Z}} p(z) \log p(z)$

Definition (Relevant Facts = Feature Functions)
By $\phi_i : \mathcal{Z} \to \mathbb{R}$ for $i = 1, \dots, D$ we denote a set of feature functions that express everything we want to be able to model about our data. For example: the gray value of a pixel, a bag-of-words histogram of an image, the time of day an image was taken, a flag indicating whether a pixel is darker than half of its neighbors.

Principle (Maximum Entropy Principle)
Let $z^1, \dots, z^N$ be samples from a distribution $d(z)$. Let $\phi_1, \dots, \phi_D$ be feature functions, and denote by $\mu_i := \frac{1}{N}\sum_n \phi_i(z^n)$ their means over the sample set. The maximum entropy distribution, $p$, is the solution to

$\max_{p \text{ a prob. distr.}} H(p)$ (be as simple as possible) subject to $\mathbb{E}_{z \sim p(z)}[\phi_i(z)] = \mu_i$ (be faithful to what we know).

Theorem (Exponential Family Distribution)
Under some very reasonable conditions, the maximum entropy distribution has the form $p(z) = \frac{1}{Z}\exp\big(\sum_i w_i \phi_i(z)\big)$ for some parameter vector $w = (w_1, \dots, w_D)$ and normalization constant $Z$.
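To make the principle concrete, here is a minimal numpy sketch (not from the slides) that fits an exponential family distribution over a small finite $\mathcal{Z}$ by matching feature means, i.e. gradient descent on the negative log-likelihood; `fit_maxent`, `feature_fn`, and the other names are placeholders chosen for illustration.

```python
import numpy as np

def fit_maxent(samples, feature_fn, z_values, steps=2000, lr=0.1):
    """Fit p(z) = exp(<w, phi(z)>) / Z over a finite set z_values by moment matching.

    samples    : observed z's drawn from d(z)
    feature_fn : phi(z) -> np.ndarray of shape (D,)
    """
    feats = np.array([feature_fn(z) for z in z_values])      # |Z| x D
    mu = np.mean([feature_fn(z) for z in samples], axis=0)   # empirical feature means
    w = np.zeros(feats.shape[1])
    for _ in range(steps):
        scores = feats @ w
        p = np.exp(scores - scores.max())
        p /= p.sum()                                          # current model p(z)
        grad = p @ feats - mu                                 # E_p[phi] - mu
        w -= lr * grad                                        # push model expectations toward mu
    return w
```

At convergence $\mathbb{E}_{z \sim p}[\phi_i(z)] = \mu_i$, which is exactly the constraint in the principle above.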

Example: Let $\mathcal{Z} = \mathbb{R}$, $\phi_1(z) = z$, $\phi_2(z) = z^2$. The exponential family distribution is
$p(z) = \frac{1}{Z(w)} \exp(w_1 z + w_2 z^2) \propto \exp\big(-a(z - b)^2\big)$ for $a = -w_2$, $b = -\frac{w_1}{2 w_2}$ (with $w_2 < 0$).
It's a Gaussian! Given examples $z^1, \dots, z^N$, we can compute $a$ and $b$, and derive $w$.

Example: Let $\mathcal{Z} = \{1, \dots, K\}$ and $\phi_k(z) = [z = k]$ (indicator) for $k = 1, \dots, K$. The exponential family distribution is
$p(z) = \frac{1}{Z(w)} \exp\big(\sum_k w_k \phi_k(z)\big) = \frac{\exp(w_z)}{Z}$ for $z = 1, \dots, K$, with $Z = \exp(w_1) + \dots + \exp(w_K)$.
It's a multinomial!

Example: Let $\mathcal{Z} = \{0, 1\}^{N \times M}$ (an image grid), $\phi_i(y) := y_i$ for each pixel $i$, plus one pairwise feature $\phi_{\mathrm{pair}}(y) := \sum_{i \sim j} y_i y_j$ (summing over all 4-neighbor pairs). The exponential family distribution is
$p(y) = \frac{1}{Z(w)} \exp\big(\langle w, \phi(y)\rangle\big) = \frac{1}{Z(w)} \exp\big(\sum_i w_i y_i + w_{\mathrm{pair}} \sum_{i \sim j} y_i y_j\big)$
It's a (binary) Markov Random Field!

Conditional Random Field Learning

Assume:
- a set of i.i.d. samples $\mathcal{D} = \{(x^n, y^n)\}_{n=1,\dots,N}$ with $(x^n, y^n) \sim d(x, y)$,
- feature functions $\phi(x, y) = \big(\phi_1(x, y), \dots, \phi_D(x, y)\big)$,
- the parametrized family $p(y|x, w) = \frac{1}{Z(x, w)} \exp\big(\langle w, \phi(x, y)\rangle\big)$.

Task: adjust $w$ of $p(y|x, w)$ based on $\mathcal{D}$.

Many possible techniques to do so: Expectation Matching, Maximum Likelihood, Best Approximation, MAP estimation of $w$.
Punchline: they all turn out to be (almost) the same!

Maximum Likelihood Parameter Estimation

Idea: maximize the conditional likelihood of observing outputs $y^1, \dots, y^N$ for inputs $x^1, \dots, x^N$:

$w^* = \operatorname{argmax}_{w \in \mathbb{R}^D} p(y^1, \dots, y^N | x^1, \dots, x^N, w)$
$\stackrel{\text{i.i.d.}}{=} \operatorname{argmax}_{w \in \mathbb{R}^D} \prod_{n=1}^N p(y^n | x^n, w)$
$\stackrel{-\log(\cdot)}{=} \operatorname{argmin}_{w \in \mathbb{R}^D} \Big(-\sum_{n=1}^N \log p(y^n | x^n, w)\Big)$

The last expression is the negative conditional log-likelihood (of $\mathcal{D}$).

Best Approximation

Idea: find $p(y|x, w)$ that is closest to $d(y|x)$.

Definition (Similarity between conditional distributions)
For fixed $x \in \mathcal{X}$, the KL divergence measures the similarity:
$\mathrm{KL}_{\mathrm{cond}}(p\,\|\,d)(x) := \sum_{y \in \mathcal{Y}} d(y|x) \log \frac{d(y|x)}{p(y|x, w)}$

For $x \sim d(x)$, take the expectation:
$\mathrm{KL}_{\mathrm{tot}}(p\,\|\,d) := \mathbb{E}_{x \sim d(x)}\big[\mathrm{KL}_{\mathrm{cond}}(p\,\|\,d)(x)\big] = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} d(x, y) \log \frac{d(y|x)}{p(y|x, w)}$

Best Approximation

Idea: find $p(y|x, w)$ of minimal $\mathrm{KL}_{\mathrm{tot}}$-distance to $d(y|x)$:

$w^* = \operatorname{argmin}_{w \in \mathbb{R}^D} \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} d(x, y) \log \frac{d(y|x)}{p(y|x, w)}$
$\stackrel{\text{drop const.}}{=} \operatorname{argmin}_{w \in \mathbb{R}^D} \Big(-\sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} d(x, y) \log p(y|x, w)\Big)$
$\approx \operatorname{argmin}_{w \in \mathbb{R}^D} \Big(-\sum_{n=1}^N \log p(y^n | x^n, w)\Big)$ (using the samples $(x^n, y^n) \sim d(x, y)$)

Again we arrive at the negative conditional log-likelihood (of $\mathcal{D}$).

MAP Estimation of w

Idea: treat $w$ as a random variable and maximize the posterior probability $p(w|\mathcal{D})$:

$p(w|\mathcal{D}) \stackrel{\text{Bayes}}{=} \frac{p(x^1, y^1, \dots, x^N, y^N | w)\, p(w)}{p(\mathcal{D})} \stackrel{\text{i.i.d.}}{=} p(w) \prod_{n=1}^N \frac{p(y^n | x^n, w)}{p(y^n | x^n)}$

$p(w)$: prior belief on $w$ (cannot be estimated from data).

$w^* = \operatorname{argmax}_{w \in \mathbb{R}^D} p(w|\mathcal{D}) = \operatorname{argmin}_{w \in \mathbb{R}^D} \big[-\log p(w|\mathcal{D})\big]$
$= \operatorname{argmin}_{w \in \mathbb{R}^D} \Big[-\log p(w) - \sum_{n=1}^N \log p(y^n | x^n, w) + \sum_{n=1}^N \log p(y^n | x^n)\Big]$ (the last sum is independent of $w$)
$= \operatorname{argmin}_{w \in \mathbb{R}^D} \Big[-\log p(w) - \sum_{n=1}^N \log p(y^n | x^n, w)\Big]$

Choices for $p(w)$:

$p(w) \propto \text{const.}$ (uniform; in $\mathbb{R}^D$ not really a distribution)
$w^* = \operatorname{argmin}_{w \in \mathbb{R}^D} \Big[-\log p(w) - \sum_{n=1}^N \log p(y^n|x^n, w)\Big] = \operatorname{argmin}_{w \in \mathbb{R}^D} \Big[-\sum_{n=1}^N \log p(y^n|x^n, w) + \text{const.}\Big]$
This is the negative conditional log-likelihood.

$p(w) := \text{const.} \cdot e^{-\frac{1}{2\sigma^2}\|w\|^2}$ (Gaussian)
$w^* = \operatorname{argmin}_{w \in \mathbb{R}^D} \Big[\frac{1}{2\sigma^2}\|w\|^2 - \sum_{n=1}^N \log p(y^n|x^n, w) + \text{const.}\Big]$
This is the regularized negative conditional log-likelihood.

Probabilistic Models for Structured Prediction - Summary

Negative (regularized) conditional log-likelihood (of $\mathcal{D}$):

$L(w) = \frac{1}{2\sigma^2}\|w\|^2 - \sum_{n=1}^N \Big[\langle w, \phi(x^n, y^n)\rangle - \log \sum_{y \in \mathcal{Y}} e^{\langle w, \phi(x^n, y)\rangle}\Big]$

($\sigma^2 \to \infty$ makes it unregularized.)

Probabilistic parameter estimation or training means solving $w^* = \operatorname{argmin}_{w \in \mathbb{R}^D} L(w)$.
This is the same optimization problem as for multi-class logistic regression.
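For the case where $\mathcal{Y}$ is small enough to enumerate (the multi-class logistic regression setting), a minimal numpy sketch of $L(w)$ and its gradient could look as follows; `phi`, `labels`, and the other names are placeholders for illustration, not notation from the slides.

```python
import numpy as np

def neg_cond_log_likelihood(w, data, phi, labels, sigma2=1.0):
    """Regularized negative conditional log-likelihood L(w) and its gradient.

    data   : list of (x, y) training pairs
    phi    : joint feature function phi(x, y) -> np.ndarray of shape (D,)
    labels : all possible outputs y (assumed small enough to enumerate)
    """
    L = np.dot(w, w) / (2.0 * sigma2)
    grad = w / sigma2
    for x, y_obs in data:
        feats = np.array([phi(x, y) for y in labels])   # |Y| x D
        scores = feats @ w                               # <w, phi(x, y)> for all y
        log_Z = np.logaddexp.reduce(scores)              # log sum_y exp(score)
        p = np.exp(scores - log_Z)                       # p(y | x, w)
        L -= np.dot(w, phi(x, y_obs)) - log_Z
        grad -= phi(x, y_obs) - p @ feats                # phi(x, y_obs) - E_y[phi(x, y)]
    return L, grad
```

With $L(w)$ and $\nabla_w L(w)$ in hand, $w^*$ can be found with any gradient-based method, as on the next slide.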

[Figure: contour plots of the negative conditional log-likelihood for a toy example, shown for $\sigma^2 = 0.01$, $\sigma^2 = 0.10$, $\sigma^2 = 1.00$, and $\sigma^2 \to \infty$.]

Steepest Descent Minimization: minimize $L(w)$

input: tolerance $\epsilon > 0$
1: $w_{\mathrm{cur}} \leftarrow 0$
2: repeat
3: $\quad v \leftarrow \nabla_w L(w_{\mathrm{cur}})$
4: $\quad \eta \leftarrow \operatorname{argmin}_{\eta \in \mathbb{R}} L(w_{\mathrm{cur}} - \eta v)$
5: $\quad w_{\mathrm{cur}} \leftarrow w_{\mathrm{cur}} - \eta v$
6: until $\|v\| < \epsilon$
output: $w_{\mathrm{cur}}$

Alternatives: L-BFGS (second-order descent without explicit Hessian), conjugate gradient.
We always need (at least) the gradient of $L$.
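A runnable version of this loop might look like the sketch below (not from the slides); a backtracking search stands in for the exact step-size argmin in line 4, and `L_and_grad` is a placeholder for a function returning $L(w)$ and $\nabla_w L(w)$, e.g. the sketch above.

```python
import numpy as np

def steepest_descent(L_and_grad, dim, tol=1e-6, max_iter=1000):
    """Minimize L by steepest descent with a backtracking line search.

    L_and_grad : function w -> (L(w), grad L(w))
    """
    w = np.zeros(dim)
    for _ in range(max_iter):
        L, v = L_and_grad(w)
        if np.linalg.norm(v) < tol:
            break
        eta = 1.0
        # shrink the step until we see a sufficient decrease (Armijo condition)
        while L_and_grad(w - eta * v)[0] > L - 0.5 * eta * np.dot(v, v):
            eta *= 0.5
            if eta < 1e-12:
                break
        w = w - eta * v
    return w
```

For the toy setting above one could call, e.g., `steepest_descent(lambda w: neg_cond_log_likelihood(w, data, phi, labels), dim=D)`.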

$L(w) = \frac{1}{2\sigma^2}\|w\|^2 - \sum_{n=1}^N \Big[\langle w, \phi(x^n, y^n)\rangle - \log \sum_{y \in \mathcal{Y}} e^{\langle w, \phi(x^n, y)\rangle}\Big]$

$\nabla_w L(w) = \frac{1}{\sigma^2} w - \sum_{n=1}^N \Big[\phi(x^n, y^n) - \sum_{y \in \mathcal{Y}} \frac{e^{\langle w, \phi(x^n, y)\rangle}}{\sum_{\bar{y} \in \mathcal{Y}} e^{\langle w, \phi(x^n, \bar{y})\rangle}}\, \phi(x^n, y)\Big]$
$= \frac{1}{\sigma^2} w - \sum_{n=1}^N \Big[\phi(x^n, y^n) - \sum_{y \in \mathcal{Y}} p(y|x^n, w)\, \phi(x^n, y)\Big]$
$= \frac{1}{\sigma^2} w - \sum_{n=1}^N \Big[\phi(x^n, y^n) - \mathbb{E}_{y \sim p(y|x^n, w)}\, \phi(x^n, y)\Big]$

$\nabla^2_w L(w) = \frac{1}{\sigma^2}\mathrm{Id}_{D \times D} + \sum_{n=1}^N \Big(\mathbb{E}_{y \sim p(y|x^n, w)}\big[\phi(x^n, y)\,\phi(x^n, y)^\top\big] - \mathbb{E}_{y \sim p(y|x^n, w)}\big[\phi(x^n, y)\big]\,\mathbb{E}_{y \sim p(y|x^n, w)}\big[\phi(x^n, y)\big]^\top\Big)$

$L(w) = \frac{1}{2\sigma^2}\|w\|^2 - \sum_{n=1}^N \Big[\langle w, \phi(x^n, y^n)\rangle - \log \sum_{y \in \mathcal{Y}} e^{\langle w, \phi(x^n, y)\rangle}\Big]$

$L$ is $C^\infty$-differentiable on all of $\mathbb{R}^D$.

$\nabla_w L(w) = \frac{1}{\sigma^2} w - \sum_{n=1}^N \Big[\phi(x^n, y^n) - \mathbb{E}_{y \sim p(y|x^n, w)}\, \phi(x^n, y)\Big]$

For $\sigma \to \infty$: if $\mathbb{E}_{y \sim p(y|x^n, w)}\, \phi(x^n, y) = \phi(x^n, y^n)$, then $\nabla_w L(w) = 0$, i.e. a critical point of $L$ (local minimum/maximum/saddle point).

Interpretation: we aim for expectation matching, $\mathbb{E}_{y \sim p}\, \phi(x, y) = \phi(x, y^{\mathrm{obs}})$, but discriminatively: only for $x \in \{x^1, \dots, x^N\}$.

$\nabla^2_w L(w) = \frac{1}{\sigma^2}\mathrm{Id}_{D \times D} + \sum_{n=1}^N \Big(\mathbb{E}_{y \sim p(y|x^n, w)}\big[\phi(x^n, y)\,\phi(x^n, y)^\top\big] - \mathbb{E}_{y \sim p(y|x^n, w)}\big[\phi(x^n, y)\big]\,\mathbb{E}_{y \sim p(y|x^n, w)}\big[\phi(x^n, y)\big]^\top\Big)$

The Hessian matrix is positive definite, so $L(w)$ is convex, and $\nabla_w L(w) = 0$ implies a global minimum.

Milestone I: Probabilistic Training (Conditional Random Fields)

- $p(y|x, w)$ is log-linear in $w \in \mathbb{R}^D$.
- Training: many probabilistic derivations lead to the same optimization problem: minimize the negative conditional log-likelihood $L(w)$.
- $L(w)$ is differentiable and convex; gradient descent will find the global optimum with $\nabla_w L(w) = 0$.
- Same structure as multi-class logistic regression.

For logistic regression, this is where the textbook ends: we're done. For conditional random fields, we're not in safe waters yet!

Task: compute $v = \nabla_w L(w_{\mathrm{cur}})$ and evaluate $L(w_{\mathrm{cur}} - \eta v)$ during the line search:

$L(w) = \frac{1}{2\sigma^2}\|w\|^2 - \sum_{n=1}^N \Big[\langle w, \phi(x^n, y^n)\rangle - \log \sum_{y \in \mathcal{Y}} e^{\langle w, \phi(x^n, y)\rangle}\Big]$
$\nabla_w L(w) = \frac{1}{\sigma^2} w - \sum_{n=1}^N \Big[\phi(x^n, y^n) - \sum_{y \in \mathcal{Y}} p(y|x^n, w)\, \phi(x^n, y)\Big]$

Problem: $\mathcal{Y}$ typically is very (exponentially) large:
- binary image segmentation: $|\mathcal{Y}| = 2^{640 \times 480} \approx 10^{92475}$,
- ranking $N$ images: $|\mathcal{Y}| = N!$, e.g. $N = 1000$: $|\mathcal{Y}| \approx 10^{2568}$.

We must use the structure in $\mathcal{Y}$, or we're lost.

Solving the Training Optimization Problem Numerically

$\nabla_w L(w) = \frac{1}{\sigma^2} w - \sum_{n=1}^N \Big[\phi(x^n, y^n) - \mathbb{E}_{y \sim p(y|x^n, w)}\, \phi(x^n, y)\Big]$

Computing the gradient (naive): $O(K^M N D)$.

$L(w) = \frac{1}{2\sigma^2}\|w\|^2 - \sum_{n=1}^N \Big[\langle w, \phi(x^n, y^n)\rangle - \log Z(x^n, w)\Big]$

Line search (naive): $O(K^M N D)$ per evaluation of $L$.

$N$: number of samples; $D$: dimension of the feature space; $M$: number of output nodes (100s to 1,000,000s); $K$: number of possible labels of each output node (2 to 100s).

Probabilistic Inference to the Rescue

Remember: in a graphical model with factors $\mathcal{F}$, the features decompose:

$\phi(x, y) = \big(\phi_F(x, y_F)\big)_{F \in \mathcal{F}}$
$\mathbb{E}_{y \sim p(y|x, w)}\, \phi(x, y) = \big(\mathbb{E}_{y_F \sim p(y_F|x, w)}\, \phi_F(x, y_F)\big)_{F \in \mathcal{F}}$
$\mathbb{E}_{y_F \sim p(y_F|x, w)}\, \phi_F(x, y_F) = \sum_{y_F \in \mathcal{Y}_F} p(y_F|x, w)\, \phi_F(x, y_F)$

The factor marginals $\mu_F = p(y_F|x, w)$ involve only $K^{|F|}$ terms each, much smaller than the complete joint distribution $p(y|x, w)$, and can be computed or approximated, e.g., with (loopy) belief propagation.
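For a linear-chain model (e.g. the handwriting example later in this part), the factor marginals can be computed exactly with a forward-backward pass. Below is a minimal numpy/scipy sketch under the assumption of a single pairwise log-potential table shared over all edges; the function and argument names are chosen for illustration only.

```python
import numpy as np
from scipy.special import logsumexp

def chain_marginals(unary, pairwise):
    """Exact factor marginals of a linear chain via forward-backward (in log space).

    unary    : (M, K) log-potentials <w, phi_i(y_i, x)>
    pairwise : (K, K) log-potentials <w, phi_ij(y_i, y_j)>, shared over edges
    Returns node marginals (M, K) and edge marginals (M-1, K, K).
    """
    M, K = unary.shape
    alpha = np.zeros((M, K))                  # forward log-messages
    beta = np.zeros((M, K))                   # backward log-messages
    alpha[0] = unary[0]
    for i in range(1, M):
        alpha[i] = unary[i] + logsumexp(alpha[i - 1][:, None] + pairwise, axis=0)
    for i in range(M - 2, -1, -1):
        beta[i] = logsumexp(pairwise + unary[i + 1] + beta[i + 1], axis=1)
    log_Z = logsumexp(alpha[M - 1])
    node = np.exp(alpha + beta - log_Z)       # p(y_i | x, w)
    edge = np.zeros((M - 1, K, K))
    for i in range(M - 1):
        m = alpha[i][:, None] + pairwise + unary[i + 1] + beta[i + 1]
        edge[i] = np.exp(m - log_Z)           # p(y_i, y_{i+1} | x, w)
    return node, edge
```

Plugging these marginals into the decomposition above gives the feature expectation without ever enumerating $\mathcal{Y}$.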

Solving the Training Optimization Problem Numerically

$\nabla_w L(w) = \frac{1}{\sigma^2} w - \sum_{n=1}^N \Big[\phi(x^n, y^n) - \mathbb{E}_{y \sim p(y|x^n, w)}\, \phi(x^n, y)\Big]$

Computing the gradient: $O(K^M N D)$ naively, $O(M K^{F_{\max}} N D)$ using the factorization.

$L(w) = \frac{1}{2\sigma^2}\|w\|^2 - \sum_{n=1}^N \Big[\langle w, \phi(x^n, y^n)\rangle - \log \sum_{y \in \mathcal{Y}} e^{\langle w, \phi(x^n, y)\rangle}\Big]$

Line search: $O(K^M N D)$ naively, $O(M K^{F_{\max}} N D)$ using the factorization, per evaluation of $L$.

$N$: number of samples (10s to 1,000,000s); $D$: dimension of the feature space; $M$: number of output nodes; $K$: number of possible labels of each output node.

What if the training set $\mathcal{D}$ is too large (e.g. millions of examples)?

Simplify the model
- Train a simpler model (smaller factors).
- Less expressive power: results might get worse rather than better.

Subsampling
- Create a random subset $\mathcal{D}' \subset \mathcal{D}$ and train the model using $\mathcal{D}'$.
- Ignores all information in $\mathcal{D} \setminus \mathcal{D}'$.

Parallelize
- Train multiple models in parallel and merge the models.
- Follows the multi-core/GPU trend.
- How to actually merge the models is unclear, and it doesn't really save computation.

What if the training set $\mathcal{D}$ is too large (e.g. millions of examples)?

Stochastic Gradient Descent (SGD)
- Keep maximizing $p(w|\mathcal{D})$.
- In each gradient descent step: create a random subset $\mathcal{D}' \subset \mathcal{D}$, often just 1-3 elements, and follow the approximate gradient

$\tilde{\nabla}_w L(w) = \frac{1}{\sigma^2} w - \sum_{(x^n, y^n) \in \mathcal{D}'} \Big[\phi(x^n, y^n) - \mathbb{E}_{y \sim p(y|x^n, w)}\, \phi(x^n, y)\Big]$

- Line search is no longer possible; extra parameter: step size $\eta$.
- SGD converges to $\operatorname{argmin}_w L(w)$ (if $\eta$ is chosen right).
- SGD needs more iterations, but each one is much faster.

More: L. Bottou, O. Bousquet: "The Tradeoffs of Large Scale Learning", NIPS 2008; also http://leon.bottou.org/research/largescale
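A minimal SGD loop following the slide's approximate gradient might look as follows. This is a sketch: `grad_example` is a placeholder for the per-example term $\phi(x^n, y^n) - \mathbb{E}_{y}\phi(x^n, y)$ (computed by inference as above), and the 1/t step-size decay is one common choice, not something the slides prescribe.

```python
import numpy as np

def sgd_crf(grad_example, data, dim, sigma2=1.0, eta0=0.1, epochs=10, batch_size=1):
    """Stochastic gradient descent on the regularized negative log-likelihood.

    grad_example : function (w, x, y) -> phi(x, y) - E_{y'~p(y'|x,w)}[phi(x, y')]
    """
    data = list(data)
    w = np.zeros(dim)
    t = 0
    for epoch in range(epochs):
        np.random.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            g = w / sigma2                         # regularizer term, as on the slide;
            for x, y in batch:                     # how to scale it vs. the batch size
                g -= grad_example(w, x, y)         # is a design choice
            eta = eta0 / (1.0 + t * eta0 / sigma2) # decaying step size (one common schedule)
            w -= eta * g
            t += 1
    return w
```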

Solving the Training Optimization Problem Numerically

$\nabla_w L(w) = \frac{1}{\sigma^2} w - \sum_{n=1}^N \Big[\phi(x^n, y^n) - \mathbb{E}_{y \sim p(y|x^n, w)}\, \phi(x^n, y)\Big]$

Computing the gradient: $O(K^M N D)$ naively, $O(M K^2 N D)$ if belief propagation is possible.

$L(w) = \frac{1}{2\sigma^2}\|w\|^2 - \sum_{n=1}^N \Big[\langle w, \phi(x^n, y^n)\rangle - \log \sum_{y \in \mathcal{Y}} e^{\langle w, \phi(x^n, y)\rangle}\Big]$

Line search: $O(K^M N D)$ naively, $O(M K^2 N D)$ per evaluation of $L$.

$N$: number of samples; $D$: dimension of the feature space ($\phi_{i,j}$: 1-10s, $\phi_i$: 100s to 10,000s); $M$: number of output nodes; $K$: number of possible labels of each output node.

Typical feature functions in image segmentation:
- $\phi_i(y_i, x) \in \mathbb{R}^{1000}$: local image features, e.g. a bag-of-words histogram; $\langle w_i, \phi_i(y_i, x)\rangle$: local classifier (like logistic regression).
- $\phi_{i,j}(y_i, y_j) = [y_i = y_j] \in \mathbb{R}$: test for same label; $\langle w_{ij}, \phi_{ij}(y_i, y_j)\rangle$: penalizer for label changes (if $w_{ij} > 0$).
- Combined: $\operatorname{argmax}_y p(y|x)$ is a smoothed version of the local cues.

[Figure: original image, local classification, local + smoothness]
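As an illustration of how such unary and pairwise terms combine into one score $\langle w, \phi(x, y)\rangle$, here is a small sketch (not from the slides). It assumes the common construction where the unary feature pairs a per-node image descriptor with the chosen label; all names are placeholders.

```python
import numpy as np

def score(w_unary, w_pair, unary_feats, edges, y):
    """Evaluate <w, phi(x, y)> for a grid CRF with unary + same-label features.

    w_unary     : (K, F) weights, one row per label
    w_pair      : scalar weight on the same-label indicator
    unary_feats : (M, F) local image descriptors, e.g. bag-of-words per pixel/superpixel
    edges       : list of (i, j) 4-neighbor pairs
    y           : (M,) labeling
    """
    s = sum(w_unary[y[i]] @ unary_feats[i] for i in range(len(y)))  # local classifiers
    s += w_pair * sum(y[i] == y[j] for i, j in edges)               # smoothness term
    return s
```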

Typical feature functions in handwriting recognition:
- $\phi_i(y_i, x) \in \mathbb{R}^{1000}$: image representation (pixels, gradients); $\langle w_i, \phi_i(y_i, x)\rangle$: local classifier for whether $x_i$ is the letter $y_i$.
- $\phi_{i,j}(y_i, y_j) = e_{y_i} \otimes e_{y_j} \in \mathbb{R}^{26 \times 26}$: letter/letter indicator; $\langle w_{ij}, \phi_{ij}(y_i, y_j)\rangle$: encourages/suppresses letter combinations.
- Combined: $\operatorname{argmax}_y p(y|x)$ is a corrected version of the local cues.

[Figure: local classification, local + correction]

Typical feature functions in pose estimation:
- $\phi_i(y_i, x) \in \mathbb{R}^{1000}$: local image representation, e.g. HoG; $\langle w_i, \phi_i(y_i, x)\rangle$: local confidence map.
- $\phi_{i,j}(y_i, y_j) = \mathrm{good\_fit}(y_i, y_j) \in \mathbb{R}$: test for geometric fit; $\langle w_{ij}, \phi_{ij}(y_i, y_j)\rangle$: penalizer for unrealistic poses.
- Together: $\operatorname{argmax}_y p(y|x)$ is a sanitized version of the local cues.

[Figure: original image, local classification, local + geometry]

[V. Ferrari, M. Marin-Jimenez, A. Zisserman: Progressive Search Space Reduction for Human Pose Estimation, CVPR 2008.]

Typical feature functions for CRFs in computer vision:
- $\phi_i(y_i, x)$: local representation, high-dimensional; $\langle w_i, \phi_i(y_i, x)\rangle$: local classifier.
- $\phi_{i,j}(y_i, y_j)$: prior knowledge, low-dimensional; $\langle w_{ij}, \phi_{ij}(y_i, y_j)\rangle$: penalizes outliers.

Learning adjusts the parameters:
- unary $w_i$: learn the local classifiers and their importance,
- binary $w_{ij}$: learn the importance of smoothing/penalization.

$\operatorname{argmax}_y p(y|x)$ is a cleaned-up version of the local predictions.

Solving the Training Optimization Problem Numerically

Idea: split the learning of the unary potentials into two parts: the local classifiers, and their importance.

Two-Stage Training
- pre-train $f_{y_i}(x) \mathrel{\hat=} \log p(y_i|x)$,
- use $\phi_i(y_i, x) := f_{y_i}(x) \in \mathbb{R}^K$ (low-dimensional),
- keep $\phi_{ij}(y_i, y_j)$ as before,
- perform CRF learning with $\phi_i$ and $\phi_{ij}$.

Advantages:
- lower-dimensional feature space during inference, hence faster,
- $f_{y_i}(x)$ can be stronger classifiers, e.g. non-linear SVMs.

Disadvantage: if the local classifiers are bad, CRF training cannot fix that.
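A sketch of the hand-off between the two stages (not from the slides): a pre-trained local classifier's log-probabilities become the low-dimensional unary features. The scikit-learn-style `predict_log_proba` interface and all other names are assumptions chosen for illustration.

```python
import numpy as np

def two_stage_unaries(local_clf, images, extract_features):
    """Stage 1 output -> low-dimensional unary features for stage 2 CRF training.

    local_clf        : pre-trained classifier exposing predict_log_proba, e.g. a
                       scikit-learn LogisticRegression (an assumption, not prescribed here)
    images           : list of training images
    extract_features : function image -> (M, F) per-node feature matrix
    Returns a list of (M, K) arrays f(x) = log p(y_i | x), used as phi_i.
    """
    unaries = []
    for img in images:
        feats = extract_features(img)                        # high-dimensional local features
        unaries.append(local_clf.predict_log_proba(feats))   # (M, K) log-probabilities
    return unaries
```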

Solving the Training Optimization Problem Numerically

CRF training is based on gradient-descent optimization. The faster we can do it, the better (more realistic) models we can use:

$\nabla_w L(w) = \frac{1}{\sigma^2} w - \sum_{n=1}^N \Big[\phi(x^n, y^n) - \sum_{y \in \mathcal{Y}} p(y|x^n, w)\, \phi(x^n, y)\Big] \in \mathbb{R}^D$

A lot of research goes into accelerating CRF training:

problem          solution             method(s)
|Y| too large    exploit structure    (loopy) belief propagation
                 smart sampling       contrastive divergence
                 use approximate L    e.g. pseudo-likelihood
N too large      mini-batches         stochastic gradient descent
D too large      trained phi_unary    two-stage training

Summary: CRF Learning

Given: a training set $\{(x^1, y^1), \dots, (x^N, y^N)\} \subset \mathcal{X} \times \mathcal{Y}$ with $(x^n, y^n) \stackrel{\text{i.i.d.}}{\sim} d(x, y)$, and a feature function $\phi: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^D$.

Task: find a parameter vector $w$ such that $\frac{1}{Z}\exp(\langle w, \phi(x, y)\rangle) \approx d(y|x)$.

The CRF solution is derived by minimizing the negative conditional log-likelihood:

$w^* = \operatorname{argmin}_w \frac{1}{2\sigma^2}\|w\|^2 - \sum_{n=1}^N \Big[\langle w, \phi(x^n, y^n)\rangle - \log \sum_{y \in \mathcal{Y}} e^{\langle w, \phi(x^n, y)\rangle}\Big]$

- convex optimization problem: gradient descent works,
- training needs repeated runs of probabilistic inference.

Extra I: Beyond Fully Supervised Learning

So far, training was fully supervised: all variables were observed. In real life, some variables can be unobserved even during training:
- missing labels in the training data,
- latent variables, e.g. part location,
- latent variables, e.g. part occlusion,
- latent variables, e.g. viewpoint.

Graphical model with three types of variables:
- $x \in \mathcal{X}$: always observed,
- $y \in \mathcal{Y}$: observed only during training,
- $z \in \mathcal{Z}$: never observed (latent).

Marginalization over Latent Variables

Construct the conditional likelihood as usual:
$p(y, z|x, w) = \frac{1}{Z(x, w)} \exp\big(\langle w, \phi(x, y, z)\rangle\big)$

Derive $p(y|x, w)$ by marginalizing over $z$:
$p(y|x, w) = \sum_{z \in \mathcal{Z}} p(y, z|x, w) = \frac{1}{Z(x, w)} \sum_{z \in \mathcal{Z}} \exp\big(\langle w, \phi(x, y, z)\rangle\big)$
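For intuition, here is a brute-force numpy sketch of $\log p(y|x, w)$ with $z$ marginalized out. It assumes both $\mathcal{Y}$ and $\mathcal{Z}$ are small enough to enumerate, which real models avoid by exploiting structure; all names are placeholders.

```python
import numpy as np
from scipy.special import logsumexp

def latent_log_likelihood(w, x, y, phi, labels, latent_values):
    """log p(y | x, w) with the latent variable z marginalized out.

    phi           : joint feature function phi(x, y, z) -> np.ndarray of shape (D,)
    labels        : all possible outputs y' (enumerated here for clarity)
    latent_values : all possible values of z (likewise enumerated)
    """
    def scores(y_):
        return np.array([w @ phi(x, y_, z) for z in latent_values])
    log_num = logsumexp(scores(y))                                     # log sum_z exp<w, phi(x,y,z)>
    log_Z = logsumexp(np.concatenate([scores(y_) for y_ in labels]))   # log Z(x, w)
    return log_num - log_Z
```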

Negative regularized conditional log-likelihood:

$L(w) = \lambda\|w\|^2 - \sum_{n=1}^N \log p(y^n|x^n, w)$
$= \lambda\|w\|^2 - \sum_{n=1}^N \log \sum_{z \in \mathcal{Z}} p(y^n, z|x^n, w)$
$= \lambda\|w\|^2 - \sum_{n=1}^N \log \sum_{z \in \mathcal{Z}} \exp\big(\langle w, \phi(x^n, y^n, z)\rangle\big) + \sum_{n=1}^N \log \sum_{z \in \mathcal{Z}} \sum_{y \in \mathcal{Y}} \exp\big(\langle w, \phi(x^n, y, z)\rangle\big)$

$L$ is not convex in $w$, so it can have local minima. There is no agreed-on best way of treating latent variables.