Introduction To Graphical Models


Introduction To Graphical Models. Peter V. Gehler, Max Planck Institute for Intelligent Systems, Tübingen, Germany. ENS/INRIA Summer School, Paris, July 2013.

Extended version in book form: Sebastian Nowozin and Christoph Lampert, Structured Learning and Prediction in Computer Vision, ca. 200 pages, available free online at http://pub.ist.ac.at/~chl/. Slides mainly based on a tutorial version from Christoph. Thanks!

Literature recommendation: David Barber, Bayesian Reasoning and Machine Learning, 670 pages, available free online at http://web4.cs.ucl.ac.uk/staff/d.barber/pmwiki/pmwiki.php?n=brml.online

Standard regression: f : X → R. Inputs x ∈ X can be any kind of objects; the output y is a real number. Structured output learning: f : X → Y. Inputs x ∈ X can be any kind of objects; outputs y ∈ Y are complex (structured) objects.

What is structured output prediction? Ad hoc definition: predicting structured outputs from input data (in contrast to predicting just a single number, as in classification or regression). Natural language processing: automatic translation (output: sentences), sentence parsing (output: parse trees). Bioinformatics: secondary structure prediction (output: bipartite graphs), enzyme function prediction (output: a path in a tree). Speech processing: automatic transcription (output: sentences), text-to-speech (output: audio signal). Robotics: planning (output: sequence of actions). This tutorial: applications and examples from computer vision.

Probabilistic Graphical Models

Example: Human Pose Estimation. Input x ∈ X, output y ∈ Y. Given an image, where is a person and how is it articulated? f : X → Y. The input is an image x, but what is a human pose y ∈ Y precisely?

The pose space Y, example y_head. Body part: y_head = (u, v, θ), where (u, v) is the center and θ the rotation, with (u, v) ∈ {1,...,M} × {1,...,N} and θ ∈ {0, 45, 90, ...}. Entire body: y = (y_head, y_torso, y_left lower arm, ...) ∈ Y.

Human pose: Y_head from X. Image x ∈ X, example y_head, head detector. Idea: have a head classifier (SVM, NN, ...) ψ(y_head, x) ∈ R_+. Evaluate it everywhere and record the score. Repeat for all body parts.

Human pose estimation. Image x ∈ X, prediction y* ∈ Y. Compute y* = (y*_head, y*_torso, ...) = argmax_{y_head, y_torso, ...} ψ(y_head, x) ψ(y_torso, x) ··· = (argmax_{y_head} ψ(y_head, x), argmax_{y_torso} ψ(y_torso, x), ...). Great! Problem solved!?

Idea: connect up the body. In addition to the per-part scores ψ(y_head, x), ψ(y_torso, x), ..., add a head-torso model ψ(y_head, y_torso) ∈ R_+ (and ψ(y_torso, y_arm), ...) to ensure the head is on top of the torso. Compute y* = argmax_{y_head, y_torso, ...} ψ(y_head, x) ψ(y_torso, x) ψ(y_head, y_torso), but this does not decompose anymore! (left image from Ben Sapp)

The recipe. Structured output function, X = anything, Y = anything. 1) Define an auxiliary function g : X × Y → R, e.g. g(x, y) = ∏_i ψ_i(y_i, x) ∏_{i∼j} ψ_ij(y_i, y_j, x). 2) Obtain f : X → Y by maximization: f(x) = argmax_{y∈Y} g(x, y).
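
To make the recipe concrete, here is a minimal Python sketch (toy potentials, not from the lecture) that defines g(x, y) as a product of unary and pairwise potentials for two parts and obtains f(x) by exhaustive maximization; it also shows that the pairwise term breaks the independent per-part argmax.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

K_head, K_torso = 4, 4                     # toy label spaces for two "parts"
psi_head = rng.random(K_head)              # unary potential psi_head(y_head, x), x fixed
psi_torso = rng.random(K_torso)            # unary potential psi_torso(y_torso, x)
psi_pair = rng.random((K_head, K_torso))   # pairwise potential psi(y_head, y_torso)

def g(y):
    """Auxiliary function g(x, y) = product of all potentials (x is fixed here)."""
    y_h, y_t = y
    return psi_head[y_h] * psi_torso[y_t] * psi_pair[y_h, y_t]

# f(x) = argmax_y g(x, y): with the pairwise term, the maximization no longer
# decomposes into independent per-part argmax operations.
y_joint = max(itertools.product(range(K_head), range(K_torso)), key=g)
y_indep = (int(psi_head.argmax()), int(psi_torso.argmax()))
print("joint argmax:", y_joint, "independent argmax:", y_indep)
```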

A probabilistic view. Computer vision problems usually deal with uncertain information: incomplete information (we observe static images, projections, etc.), noisy annotation (wrong or ambiguous cases), ... Uncertainty is captured by (conditional) probability distributions: p(y|x): for input x ∈ X, how likely is y ∈ Y to be the correct output? We can also phrase this as: what is the probability of observing y given x? How strong is our belief in y if we know x?

A probabilistic view on f : X → Y. Structured output function, X = anything, Y = anything. We need to define an auxiliary function g : X × Y → R, e.g. g(x, y) := p(y|x). Then the maximization f(x) = argmax_{y∈Y} g(x, y) = argmax_{y∈Y} p(y|x) becomes maximum a posteriori (MAP) prediction. Interpretation: the MAP estimate y* ∈ Y is the most probable value (there can be multiple).

Probability distributions: p(y) ≥ 0 for all y ∈ Y (positivity), and ∑_{y∈Y} p(y) = 1 (normalization). Example: a binary ("Bernoulli") variable y ∈ Y = {0, 1}: 2 values, 1 degree of freedom. [figure: bar chart of p(y) for y ∈ {0, 1}]

Conditional probability distributions: p(y|x) ≥ 0 for all x ∈ X, y ∈ Y (positivity), and ∑_{y∈Y} p(y|x) = 1 for all x ∈ X (normalization w.r.t. y). For example, binary prediction: X = {images}, y ∈ Y = {0, 1}. For each x: 2 values, 1 degree of freedom, one (or two) functions.

Multi-class prediction, y ∈ Y = {1,...,K}. For each x: K values, K−1 degrees of freedom; K−1 functions, or 1 vector-valued function with K−1 outputs. Typically: K functions, plus explicit normalization. Example: predicting the center point of an object, y ∈ Y = {(1, 1),...,(width, height)}. For each x: |Y| = W·H values; y = (y_1, y_2) ∈ Y_1 × Y_2 with Y_1 = {1,...,width} and Y_2 = {1,...,height}, so for each x: |Y_1|·|Y_2| = W·H values.

Structured objects: predicting M variables jointly, Y = {1,...,K} × {1,...,K} × ··· × {1,...,K}. For each x: K^M values, K^M − 1 degrees of freedom, K^M functions. Example: object detection with a variable-size bounding box, Y ⊂ {1,...,W} × {1,...,H} × {1,...,W} × {1,...,H}, y = (left, top, right, bottom). For each x: (1/4)·W(W−1)·H(H−1) values (millions to billions...).

Example: image denoising, Y = {640 × 480 RGB images}. For each x: 16777216^307200 values in p(y|x), on the order of 10^2,000,000 functions: too much! We cannot consider all possible distributions; we must impose structure.

Probabilistic graphical models. A (probabilistic) graphical model defines a family of probability distributions over a set of random variables, by means of a graph. Popular classes of graphical models: undirected graphical models (Markov random fields), directed graphical models (Bayesian networks), factor graphs; others: chain graphs, influence diagrams, etc. The graph encodes conditional independence assumptions between the variables: with N(i) denoting the neighbors of node i in the graph, p(y_i | y_{V∖{i}}) = p(y_i | y_{N(i)}), where y_{V∖{i}} = (y_1,..., y_{i−1}, y_{i+1},..., y_n).

Example: pictorial structures for articulated pose estimation. [figure: factor graph connecting X to body-part variables Y_top, Y_head, Y_torso, Y_larm, Y_rarm, Y_lhnd, Y_rhnd, Y_lleg, Y_rleg, Y_lfoot, Y_rfoot] In principle, all parts depend on each other: knowing where the head is puts constraints on where the feet can be. But there are conditional independences as specified by the graph: if we know where the left leg is, the left foot's position does not depend on the torso position anymore, etc.: p(y_lfoot | y_top,..., y_torso,..., y_rfoot, x) = p(y_lfoot | y_lleg, x)

Factor graphs. Decomposable output y = (y_1,..., y_|V|). Graph G = (V, F, E), E ⊆ V × F: variable nodes V (circles), factor nodes F (boxes), edges E between variable and factor nodes. Each factor F ∈ F connects a subset of nodes; write F = {v_1,..., v_|F|} and y_F = (y_{v_1},..., y_{v_|F|}). [figure: factor graph over Y_i, Y_j, Y_k, Y_l] Factorization into potentials ψ at the factors: p(y) = (1/Z) ∏_{F∈F} ψ_F(y_F) = (1/Z) ψ_1(y_l) ψ_2(y_j, y_l) ψ_3(y_i, y_j) ψ_4(y_i, y_k, y_l). Z is a normalization constant, called the partition function: Z = ∑_{y∈Y} ∏_{F∈F} ψ_F(y_F).
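
A minimal Python sketch of the factorization above with assumed toy potential tables: it enumerates Y to compute the partition function Z and checks that p(y) = (1/Z) ∏_F ψ_F(y_F) sums to one.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
Ki = Kj = Kk = Kl = 3                       # small state spaces so enumeration is cheap

psi1 = rng.random(Kl)                       # psi_1(y_l)
psi2 = rng.random((Kj, Kl))                 # psi_2(y_j, y_l)
psi3 = rng.random((Ki, Kj))                 # psi_3(y_i, y_j)
psi4 = rng.random((Ki, Kk, Kl))             # psi_4(y_i, y_k, y_l)

def unnormalized(y):
    """Product of all factor potentials for a joint state y = (y_i, y_j, y_k, y_l)."""
    yi, yj, yk, yl = y
    return psi1[yl] * psi2[yj, yl] * psi3[yi, yj] * psi4[yi, yk, yl]

states = list(itertools.product(range(Ki), range(Kj), range(Kk), range(Kl)))
Z = sum(unnormalized(y) for y in states)    # partition function by brute force

def p(y):
    return unnormalized(y) / Z

print("Z =", Z)
print("sum of p(y) over all y =", sum(p(y) for y in states))  # should be 1.0
```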

Conditional distributions. How to model p(y|x)? The potentials also become functions of (part of) x: ψ_F(y_F; x_F) instead of just ψ_F(y_F), and p(y|x) = (1/Z(x)) ∏_{F∈F} ψ_F(y_F; x_F). [figure: factor graph with observed nodes X_i, X_j attached to Y_i, Y_j] The partition function now depends on x: Z(x) = ∑_{y∈Y} ∏_{F∈F} ψ_F(y_F; x_F). Note: x is treated just as an argument, not as a random variable. These models are conditional random fields (CRFs).

Conventions: potentials and energy functions. Assume ψ_F(y_F) > 0. Then instead of potentials we can also work with energies: ψ_F(y_F; x_F) = exp(−E_F(y_F; x_F)), or equivalently E_F(y_F; x_F) = −log(ψ_F(y_F; x_F)). p(y|x) can then be written as p(y|x) = (1/Z(x)) ∏_{F∈F} ψ_F(y_F; x_F) = (1/Z(x)) ∏_{F∈F} exp(−E_F(y_F; x_F)) = (1/Z(x)) exp(−E(y; x)) for E(y; x) = ∑_{F∈F} E_F(y_F; x_F).

Conventions: energy minimization. argmax_y p(y|x) = argmax_{y∈Y} (1/Z(x)) exp(−E(y; x)) = argmax_{y∈Y} exp(−E(y; x)) = argmax_{y∈Y} −E(y; x) = argmin_{y∈Y} E(y; x). MAP prediction can be performed by energy minimization. In practice, one typically models the energy function directly; the probability distribution is uniquely determined by it.

Example: an energy function for image segmentation. Foreground/background image segmentation: X = [0, 255]^(W×H), Y = {0, 1}^(W×H), foreground: y_i = 1, background: y_i = 0; graph: 4-connected grid. Each output pixel depends on the local grayvalue (input) and on the neighboring outputs. Energy function components ("Ising" model): unary terms E_i(y_i = 1, x_i) = 1 − x_i/255 and E_i(y_i = 0, x_i) = x_i/255, so a bright x_i favors y_i = foreground and a dark x_i favors y_i = background; pairwise terms E_ij(0, 0) = E_ij(1, 1) = 0 and E_ij(0, 1) = E_ij(1, 0) = ω for ω > 0, which prefer that neighbors have the same label (smooth labeling).

E(y; x) = ∑_i ( (1 − x_i/255) I(y_i = 1) + (x_i/255) I(y_i = 0) ) + ω ∑_{i∼j} I(y_i ≠ y_j). [figure: input image, segmentation from thresholding, segmentation from minimal energy]
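
A minimal Python sketch of this segmentation energy on a small synthetic grayscale image (the image and ω = 0.3 are assumptions for illustration): it evaluates E(y; x) for the per-pixel thresholding labeling and for an all-background labeling.

```python
import numpy as np

rng = np.random.default_rng(2)
H, W, omega = 6, 8, 0.3
x = np.clip(rng.normal(loc=60, scale=40, size=(H, W)), 0, 255)
x[2:5, 3:7] += 120                          # a bright "foreground" blob
x = np.clip(x, 0, 255)

def energy(y, x, omega):
    """E(y; x) = unary brightness terms + omega * number of disagreeing 4-neighbors."""
    unary = np.where(y == 1, 1.0 - x / 255.0, x / 255.0).sum()
    pairwise = (y[:, 1:] != y[:, :-1]).sum() + (y[1:, :] != y[:-1, :]).sum()
    return unary + omega * pairwise

y_threshold = (x > 127).astype(int)         # per-pixel decision, ignores smoothness
print("E(thresholding)   =", energy(y_threshold, x, omega))
print("E(all background) =", energy(np.zeros((H, W), int), x, omega))
```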

What to do with structured prediction models? Case 1) p(y|x) is known. MAP prediction: predict f : X → Y by solving y* = argmax_{y∈Y} p(y|x) = argmin_{y∈Y} E(y; x). Probabilistic inference: compute the marginal probabilities p(y_F | x) for any factor F, in particular p(y_i | x) for all i ∈ V.

What to do with structured prediction models? Case 2) p(y|x) is unknown, but we have training data. Parameter learning: assume a fixed graph structure and learn the potentials/energies (ψ_F), among other tasks (learning the graph structure, the variables, etc.). This is the topic of Wednesday's lecture.

Example: pictorial structures. [figure: input image x, MAP prediction argmax_y p(y|x), marginals p(y_i|x)] MAP makes a single (structured) prediction (a point estimate): the best overall pose. The marginal probabilities p(y_i | x) give us potential positions and the uncertainty of the individual body parts.

Example: man-made structure detection. [figure: input image x, MAP prediction argmax_y p(y|x), marginals p(y_i|x)] Task: does a pixel depict a man-made structure or not? y_i ∈ {0, 1}. Middle: MAP inference; right: variable marginals. Attention: the maximizers of the marginals need not coincide with the MAP labeling.

Probabilistic Inference. Compute p(y_F | x) and Z(x).

Assume y = (y_i, y_j, y_k, y_l), Y = Y_i × Y_j × Y_k × Y_l, and an energy function E(y; x) compatible with the following factor graph: the chain Y_i - F - Y_j - G - Y_k - H - Y_l. Task 1: for any y ∈ Y, compute p(y|x) using p(y|x) = (1/Z(x)) exp(−E(y; x)). Problem: we don't know Z(x), and computing it using Z(x) = ∑_{y∈Y} exp(−E(y; x)) looks expensive (the sum has |Y_i|·|Y_j|·|Y_k|·|Y_l| terms). A lot of research has been done on how to compute Z(x) efficiently.

Probabilistic inference: belief propagation / message passing. Factor graph: the chain Y_i - F - Y_j - G - Y_k - H - Y_l. For notational simplicity, we drop the dependence on the (fixed) x:

Z = ∑_{y∈Y} exp(−E(y))
  = ∑_{y_i∈Y_i} ∑_{y_j∈Y_j} ∑_{y_k∈Y_k} ∑_{y_l∈Y_l} exp(−(E_F(y_i, y_j) + E_G(y_j, y_k) + E_H(y_k, y_l)))
  = ∑_{y_i} ∑_{y_j} ∑_{y_k} ∑_{y_l} exp(−E_F(y_i, y_j)) exp(−E_G(y_j, y_k)) exp(−E_H(y_k, y_l))
  = ∑_{y_i} ∑_{y_j} exp(−E_F(y_i, y_j)) ∑_{y_k} exp(−E_G(y_j, y_k)) ∑_{y_l} exp(−E_H(y_k, y_l)).

The innermost sum defines a factor-to-variable message r_{H→Y_k} ∈ R^{Y_k}, r_{H→Y_k}(y_k) := ∑_{y_l} exp(−E_H(y_k, y_l)), so

Z = ∑_{y_i} ∑_{y_j} exp(−E_F(y_i, y_j)) ∑_{y_k} exp(−E_G(y_j, y_k)) r_{H→Y_k}(y_k)
  = ∑_{y_i} ∑_{y_j} exp(−E_F(y_i, y_j)) r_{G→Y_j}(y_j)
  = ∑_{y_i} r_{F→Y_i}(y_i).
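
A minimal Python sketch of this elimination order with assumed toy energies: Z is computed on the chain by the three factor-to-variable messages and verified against brute-force enumeration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
K = 4                                        # |Y_i| = |Y_j| = |Y_k| = |Y_l| = K
E_F = rng.random((K, K))                     # E_F(y_i, y_j)
E_G = rng.random((K, K))                     # E_G(y_j, y_k)
E_H = rng.random((K, K))                     # E_H(y_k, y_l)

# Messages, innermost sum first:
r_H_to_k = np.exp(-E_H).sum(axis=1)                 # r_{H->Y_k}(y_k) = sum_{y_l} exp(-E_H)
r_G_to_j = (np.exp(-E_G) * r_H_to_k).sum(axis=1)    # r_{G->Y_j}(y_j)
r_F_to_i = (np.exp(-E_F) * r_G_to_j).sum(axis=1)    # r_{F->Y_i}(y_i)
Z_messages = r_F_to_i.sum()

# Brute force over all K**4 joint states:
Z_brute = sum(
    np.exp(-(E_F[yi, yj] + E_G[yj, yk] + E_H[yk, yl]))
    for yi, yj, yk, yl in itertools.product(range(K), repeat=4)
)
print(Z_messages,_ := Z_brute)              # the two values agree
```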

Example: inference on trees. Factor graph: Y_i - F - Y_j - G - Y_k, with Y_k additionally connected through H to Y_l and through I to Y_m.

Z = ∑_{y∈Y} exp(−E(y))
  = ∑_{y_i∈Y_i} ∑_{y_j∈Y_j} ∑_{y_k∈Y_k} ∑_{y_l∈Y_l} ∑_{y_m∈Y_m} exp(−(E_F(y_i, y_j) + ··· + E_I(y_k, y_m)))
  = ∑_{y_i∈Y_i} ∑_{y_j∈Y_j} exp(−E_F(y_i, y_j)) ∑_{y_k∈Y_k} exp(−E_G(y_j, y_k)) [∑_{y_l∈Y_l} exp(−E_H(y_k, y_l))] [∑_{y_m∈Y_m} exp(−E_I(y_k, y_m))]
  = ∑_{y_i∈Y_i} ∑_{y_j∈Y_j} exp(−E_F(y_i, y_j)) ∑_{y_k∈Y_k} exp(−E_G(y_j, y_k)) (r_{H→Y_k}(y_k) · r_{I→Y_k}(y_k)),

and with the variable-to-factor message q_{Y_k→G}(y_k) := r_{H→Y_k}(y_k) · r_{I→Y_k}(y_k),

Z = ∑_{y_i∈Y_i} ∑_{y_j∈Y_j} exp(−E_F(y_i, y_j)) ∑_{y_k∈Y_k} exp(−E_G(y_j, y_k)) q_{Y_k→G}(y_k).

Factor graph sum-product algorithm. A "message" is a pair of vectors at each factor graph edge (i, F) ∈ E: 1. r_{F→Y_i} ∈ R^{Y_i}: the factor-to-variable message; 2. q_{Y_i→F} ∈ R^{Y_i}: the variable-to-factor message. The algorithm iteratively updates the messages. After convergence, Z and the marginals p(y_F) can be obtained from the messages. This is belief propagation.
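
A minimal Python sketch (reusing the toy chain setup above, an assumption) of how a marginal is read off from the messages on a tree: the belief at a variable is the product of its incoming messages, normalized; a brute-force computation confirms it.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
K = 4
E_F, E_G, E_H = rng.random((3, K, K))        # chain Y_i - F - Y_j - G - Y_k - H - Y_l

# All factor-to-variable messages that reach Y_j:
m_F_to_j = np.exp(-E_F).sum(axis=0)                      # from the Y_i side
m_H_to_k = np.exp(-E_H).sum(axis=1)                      # from the Y_l side
m_G_to_j = (np.exp(-E_G) * m_H_to_k).sum(axis=1)         # from the Y_k / Y_l side

belief_j = m_F_to_j * m_G_to_j
p_j_bp = belief_j / belief_j.sum()                       # p(y_j), normalized belief

# Brute-force check:
p_j_brute = np.zeros(K)
for yi, yj, yk, yl in itertools.product(range(K), repeat=4):
    p_j_brute[yj] += np.exp(-(E_F[yi, yj] + E_G[yj, yk] + E_H[yk, yl]))
p_j_brute /= p_j_brute.sum()
print(np.allclose(p_j_bp, p_j_brute))                    # True
```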

Example: pictorial structures. [figure: tree-structured factor graph connecting X to the body-part variables Y_top, Y_head, Y_torso, arms, hands, legs, and feet] Tree-structured model for articulated pose (Felzenszwalb and Huttenlocher, 2000; Fischler and Elschlager, 1973). Body-part variables with states given by a discretized tuple (x, y, s, θ): (x, y) position, s scale, and θ rotation.

Example: pictorial structures. [figure: input image x, marginals p(y_i|x)] Exact marginals, although the state space is huge and the partition function is therefore a huge sum: Z(x) = ∑_{all bodies y} exp(−E(y; x)).

Belief propagation in loopy graphs. Can we do message passing also in graphs with loops? [figure: grid-structured factor graph over Y_i,..., Y_q with factors A through L] Problem: there is no well-defined leaf-to-root order. Suggested solution: loopy belief propagation (LBP): initialize all messages as constant 1, then pass messages until convergence.

Belief propagation in loopy graphs. [figure: grid-structured factor graph] Loopy belief propagation is very popular, but it has some problems: it might not converge (e.g. it can oscillate), and even if it does, the computed probabilities are only approximate. Many improved message-passing schemes exist (see the tutorial book).

Probabilistic inference: variational inference / mean field. Task: compute the marginals p(y_F | x) for a general p(y|x). Idea: approximate p(y|x) by a simpler q(y) and use the marginals of q instead: q* = argmin_{q∈Q} D_KL(q(y) ‖ p(y|x)). E.g. naive mean field: Q is the set of all distributions of the form q(y) = ∏_{i∈V} q_i(y_i). [figure: fully factorized approximation with one factor q_i per variable]
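
A minimal Python sketch of naive mean field for an assumed toy pairwise model: the factors q_i are updated one at a time, each set proportional to the exponential of the negative expected energy under the other factors.

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 6, 3                                  # 6 variables on a chain, K labels each
unary = rng.random((n, K))                   # E_i(y_i)
pair = rng.random((n - 1, K, K))             # E_{i,i+1}(y_i, y_{i+1})
edges = [(i, i + 1) for i in range(n - 1)]

q = np.full((n, K), 1.0 / K)                 # start from the uniform factorized q

for sweep in range(50):
    for i in range(n):
        # Expected energy of y_i = k under the current q of the neighbors.
        s = unary[i].copy()
        for e, (a, b) in enumerate(edges):
            if a == i:
                s += pair[e] @ q[b]          # sum_{y_b} q_b(y_b) E_ab(y_i, y_b)
            elif b == i:
                s += pair[e].T @ q[a]        # sum_{y_a} q_a(y_a) E_ab(y_a, y_i)
        q[i] = np.exp(-s)
        q[i] /= q[i].sum()                   # keep q_i a distribution

print("approximate marginals q_i(y_i):")
print(np.round(q, 3))
```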

Probabilistic inference: sampling / Markov chain Monte Carlo. Task: compute the marginals p(y_F | x) for a general p(y|x). Idea: rephrase this as computing the expected value of a quantity, E_{y∼p(y|x,w)}[h(x, y)], for some (well-behaved) function h : X × Y → R. For probabilistic inference, this step is easy: set h_{F,z}(x, y) := I(y_F = z); then E_{y∼p(y|x,w)}[h_{F,z}(x, y)] = ∑_{y∈Y} p(y|x) I(y_F = z) = ∑_{y_F∈Y_F} p(y_F|x) I(y_F = z) = p(y_F = z | x).

Expectations can be computed/approximated by sampling: for fixed x, let y^(1), y^(2), ... be i.i.d. samples from p(y|x); then E_{y∼p(y|x)}[h(x, y)] ≈ (1/S) ∑_{s=1}^{S} h(x, y^(s)). The law of large numbers guarantees convergence for S → ∞, and for S independent samples the approximation error is O(1/√S), independent of the dimension of Y. Problem: producing i.i.d. samples y^(s) from p(y|x) is hard. Solution: we can get away with a sequence of dependent samples: Markov chain Monte Carlo (MCMC) sampling.

One example of how to do MCMC sampling: the Gibbs sampler. Initialize y^(0) = (y_1,..., y_d) arbitrarily. For s = 1,..., S: 1. select a variable y_i; 2. re-sample y_i ∼ p(y_i | y^(s−1)_{V∖{i}}, x); 3. output the sample y^(s) = (y^(s−1)_1,..., y^(s−1)_{i−1}, y_i, y^(s−1)_{i+1},..., y^(s−1)_d). The required conditional is p(y_i | y_{V∖{i}}, x) = p(y_i, y_{V∖{i}} | x) / ∑_{y'_i∈Y_i} p(y'_i, y_{V∖{i}} | x) = exp(−E(y_i, y_{V∖{i}}; x)) / ∑_{y'_i∈Y_i} exp(−E(y'_i, y_{V∖{i}}; x)).
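
A minimal Python sketch of the Gibbs sampler for an assumed small binary chain model: each step re-samples one variable from its conditional, and the collected samples approximate the exact marginals (computed by brute force for comparison).

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)
n = 5                                        # binary variables y_1..y_n on a chain
unary = rng.random((n, 2))
pair = rng.random((n - 1, 2, 2))

def energy(y):
    e = sum(unary[i, y[i]] for i in range(n))
    e += sum(pair[i, y[i], y[i + 1]] for i in range(n - 1))
    return e

y = rng.integers(0, 2, size=n)               # arbitrary initialization y^(0)
counts = np.zeros((n, 2))
for s in range(20000):
    i = s % n                                # sweep over the variables
    # p(y_i | y_rest) is proportional to exp(-E) with y_i set to 0 and to 1:
    cond = np.array([np.exp(-energy(np.r_[y[:i], k, y[i + 1:]])) for k in (0, 1)])
    y[i] = rng.choice(2, p=cond / cond.sum())
    if s >= 2000:                            # discard burn-in samples
        counts[np.arange(n), y] += 1
gibbs_marginals = counts / counts.sum(axis=1, keepdims=True)

# Exact marginals by brute force for comparison:
exact = np.zeros((n, 2))
for yy in itertools.product((0, 1), repeat=n):
    w = np.exp(-energy(np.array(yy)))
    for i in range(n):
        exact[i, yy[i]] += w
exact /= exact.sum(axis=1, keepdims=True)
print(np.round(gibbs_marginals, 2))
print(np.round(exact, 2))
```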

MAP Prediction. Compute y* = argmax_y p(y|x).

MAP prediction: belief propagation / message passing. [figure: message-passing schedule on a tree and on a loopy grid] One can also derive message passing algorithms for MAP prediction. In trees: guaranteed to converge to the optimal solution. In loopy graphs: convergence is not guaranteed, approximate solution.
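
A minimal Python sketch of max-product message passing (equivalently min-sum when working with energies) for MAP on an assumed toy chain, with a brute-force check; this is an illustration of the idea, not the lecture's implementation.

```python
import itertools
import numpy as np

rng = np.random.default_rng(7)
n, K = 5, 4
unary = rng.random((n, K))                   # E_i(y_i)
pair = rng.random((n - 1, K, K))             # E_{i,i+1}(y_i, y_{i+1})

def energy(y):
    return sum(unary[i, y[i]] for i in range(n)) + \
           sum(pair[i, y[i], y[i + 1]] for i in range(n - 1))

# Backward pass: m[i][k] = minimal energy of everything to the right of i, given y_i = k
# (in negative-log space, so "max-product" becomes "min-sum").
m = np.zeros((n, K))
back = np.zeros((n - 1, K), dtype=int)
for i in range(n - 2, -1, -1):
    cost = pair[i] + unary[i + 1] + m[i + 1]         # shape (K, K): [y_i, y_{i+1}]
    m[i] = cost.min(axis=1)
    back[i] = cost.argmin(axis=1)

# Forward pass: read off the MAP labeling from the back-pointers.
y_map = [int((unary[0] + m[0]).argmin())]
for i in range(n - 1):
    y_map.append(int(back[i, y_map[-1]]))

y_brute = min(itertools.product(range(K), repeat=n), key=energy)
print(y_map, list(y_brute), energy(y_map), energy(y_brute))
```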

MAP prediction: graph cuts. For loopy graphs, we can find the global optimum only in special cases: binary output variables, Y_i = {0, 1} for i = 1,..., d, and an energy function with only unary and pairwise terms, E(y; x, w) = ∑_i E_i(y_i; x) + ∑_{i∼j} E_{i,j}(y_i, y_j; x). Restriction 1 (positive unary potentials): E_F(y_i; x, w_{t_F}) ≥ 0 (always achievable by reparametrization). Restriction 2 (regular/submodular/attractive pairwise potentials): E_F(y_i, y_j; x, w_{t_F}) = 0 if y_i = y_j, and E_F(y_i, y_j; x, w_{t_F}) = E_F(y_j, y_i; x, w_{t_F}) ≥ 0 otherwise (not always achievable, depends on the task).

Construct an auxiliary undirected graph: one node i per variable i ∈ V, plus two extra nodes, source s and sink t. Edge weights: edge {i, j} gets weight E_F(y_i = 0, y_j = 1; x, w_{t_F}); edge {i, s} gets weight E_F(y_i = 1; x, w_{t_F}); edge {i, t} gets weight E_F(y_i = 0; x, w_{t_F}). Find the minimum s-t-cut; its solution defines the optimal binary labeling of the original energy minimization problem (GraphCuts algorithms). (Approximate) multi-class extensions exist, see the tutorial book.

GraphCuts example. Image segmentation energy: E(y; x) = ∑_i ( (1 − x_i/255) I(y_i = 1) + (x_i/255) I(y_i = 0) ) + ω ∑_{i∼j} I(y_i ≠ y_j). All conditions to apply GraphCuts are fulfilled: E_i(y_i, x) ≥ 0, E_ij(y_i, y_j) = 0 for y_i = y_j, and E_ij(y_i, y_j) = ω > 0 for y_i ≠ y_j. [figure: input image, thresholding, GraphCuts]
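
A minimal Python sketch (not the lecture's code) of the s-t min-cut construction for this energy, using the networkx library on an assumed random toy image; the convention used here is that pixels ending up on the sink side of the cut receive label 1.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(8)
H, W, omega = 4, 5, 0.3
x = rng.integers(0, 256, size=(H, W)).astype(float)

G = nx.DiGraph()
def node(i, j):
    return i * W + j

for i in range(H):
    for j in range(W):
        G.add_edge("s", node(i, j), capacity=1.0 - x[i, j] / 255.0)  # E_i(y_i = 1)
        G.add_edge(node(i, j), "t", capacity=x[i, j] / 255.0)        # E_i(y_i = 0)
        for di, dj in ((0, 1), (1, 0)):                              # 4-connectivity
            if i + di < H and j + dj < W:
                G.add_edge(node(i, j), node(i + di, j + dj), capacity=omega)
                G.add_edge(node(i + di, j + dj), node(i, j), capacity=omega)

cut_value, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")
y = np.zeros((H, W), dtype=int)
for i in range(H):
    for j in range(W):
        y[i, j] = 0 if node(i, j) in source_side else 1

print("minimum cut value (= minimal energy of this construction):", cut_value)
print(y)
```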

MAP prediction: linear programming relaxation. A more general alternative, Y_i = {1,...,K}: E(y; x) = ∑_i E_i(y_i; x) + ∑_{ij} E_ij(y_i, y_j; x). Linearize the energy using indicator functions: E_i(y_i; x) = ∑_{k=1}^{K} E_i(k; x) I(y_i = k) = ∑_{k=1}^{K} a_{i;k} μ_{i;k}, with a_{i;k} := E_i(k; x), for new variables μ_{i;k} ∈ {0, 1} with ∑_k μ_{i;k} = 1. Likewise E_ij(y_i, y_j; x) = ∑_{k=1}^{K} ∑_{l=1}^{K} E_ij(k, l; x) I(y_i = k) I(y_j = l) = ∑_{k,l} a_{ij;kl} μ_{ij;kl}, with a_{ij;kl} := E_ij(k, l; x), for new variables μ_{ij;kl} ∈ {0, 1} with ∑_l μ_{ij;kl} = μ_{i;k} and ∑_k μ_{ij;kl} = μ_{j;l}.

MAP prediction: linear programming relaxation. Energy minimization becomes μ* := argmin_μ ∑_i ∑_k a_{i;k} μ_{i;k} + ∑_{ij} ∑_{kl} a_{ij;kl} μ_{ij;kl} = argmin_μ ⟨a, μ⟩, subject to μ_{i;k} ∈ {0, 1}, μ_{ij;kl} ∈ {0, 1}, ∑_k μ_{i;k} = 1, ∑_l μ_{ij;kl} = μ_{i;k}, ∑_k μ_{ij;kl} = μ_{j;l}. Integer variables, a linear objective function, and linear constraints: an integer linear program (ILP). Unfortunately, ILPs are in general NP-hard.

MAP prediction: linear programming relaxation. Relaxing the integrality constraints, energy minimization becomes μ* := argmin_μ ∑_i ∑_k a_{i;k} μ_{i;k} + ∑_{ij} ∑_{kl} a_{ij;kl} μ_{ij;kl} = argmin_μ ⟨a, μ⟩, subject to μ_{i;k} ∈ [0, 1], μ_{ij;kl} ∈ [0, 1], ∑_k μ_{i;k} = 1, ∑_l μ_{ij;kl} = μ_{i;k}, ∑_k μ_{ij;kl} = μ_{j;l}. Real-valued variables, a linear objective function, and linear constraints: a linear program (LP) relaxation. LPs can be solved very efficiently, and μ* yields an approximate solution for y*.
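
A minimal Python sketch of the LP relaxation for an assumed two-variable, K-label toy model, solved with scipy.optimize.linprog; the variable ordering and constraint rows below are one possible encoding of the constraints above. For this tiny tree-structured instance the relaxation happens to return an integral μ, matching brute-force MAP.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(9)
K = 3
a1, a2 = rng.random(K), rng.random(K)        # unary costs a_{1;k}, a_{2;l}
a12 = rng.random((K, K))                     # pairwise costs a_{12;kl}

# Variable order: mu_{1;k} (K), mu_{2;l} (K), mu_{12;kl} (K*K, row-major).
c = np.concatenate([a1, a2, a12.ravel()])
n = 2 * K + K * K

A_eq, b_eq = [], []
row = np.zeros(n); row[:K] = 1               # sum_k mu_{1;k} = 1
A_eq.append(row); b_eq.append(1.0)
row = np.zeros(n); row[K:2 * K] = 1          # sum_l mu_{2;l} = 1
A_eq.append(row); b_eq.append(1.0)
for k in range(K):                           # sum_l mu_{12;kl} = mu_{1;k}
    row = np.zeros(n)
    row[2 * K + k * K: 2 * K + (k + 1) * K] = 1
    row[k] = -1
    A_eq.append(row); b_eq.append(0.0)
for l in range(K):                           # sum_k mu_{12;kl} = mu_{2;l}
    row = np.zeros(n)
    row[2 * K + l: 2 * K + K * K: K] = 1
    row[K + l] = -1
    A_eq.append(row); b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, 1))
mu1, mu2 = res.x[:K], res.x[K:2 * K]
y_lp = (int(mu1.argmax()), int(mu2.argmax()))
y_brute = min(itertools.product(range(K), repeat=2),
              key=lambda y: a1[y[0]] + a2[y[1]] + a12[y[0], y[1]])
print("LP objective:", res.fun, "rounded labeling:", y_lp, "brute force:", y_brute)
```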

MAP prediction: custom solutions, e.g. branch-and-bound. Note: we just try to solve an optimization problem, y* = argmin_{y∈Y} E(y; x). We can use any optimization technique that fits the problem. For low-dimensional Y, such as bounding boxes: branch-and-bound.

Optimal Prediction. Predict with a loss function Δ(ȳ, y).

Optimal prediction. The optimal prediction minimizes the expected risk, an expectation: y* = argmin_{ȳ∈Y} ∑_{y∈Y} Δ(ȳ, y) p(y|x) = argmin_{ȳ∈Y} ∑_{y∈Y} Δ(ȳ, y) ∏_F ψ_F(y_F; x). One can think of Δ(ȳ, ·) as another CRF factor and reuse the inference techniques.

Example: Hamming loss. Count the number of mislabeled variables: Δ_H(ȳ, y) = (1/|V|) ∑_{i∈V} I(ȳ_i ≠ y_i). This makes more sense than the 0/1 loss for image segmentation. Optimal: predict the maximizers of the marginals (exercise): y* = (argmax_{y_1} p(y_1|x), argmax_{y_2} p(y_2|x), ...).
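
A minimal Python sketch with a hand-picked toy joint distribution (an assumption) showing that the Hamming-loss-optimal prediction, i.e. the per-variable maximizers of the marginals, can differ from the MAP labeling.

```python
import numpy as np

# Joint distribution p(y1, y2) over two binary variables; rows index y1, columns y2.
p = np.array([[0.4, 0.0],
              [0.3, 0.3]])

y_map = np.unravel_index(p.argmax(), p.shape)            # most probable joint state
p_y1 = p.sum(axis=1)                                     # marginal of y1
p_y2 = p.sum(axis=0)                                     # marginal of y2
y_maxmarg = (int(p_y1.argmax()), int(p_y2.argmax()))     # Hamming-optimal prediction

print("MAP labeling:       ", tuple(int(v) for v in y_map))   # (0, 0)
print("max-marginal labels:", y_maxmarg)                      # (1, 0)
```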

Example: pixel error. Applicable if we can add elements of Y_i (pixel intensities, optical flow vectors, etc.). Sum of squared errors: Δ_Q(ȳ, y) = (1/|V|) ∑_{i∈V} ‖ȳ_i − y_i‖². Used, e.g., in stereo reconstruction and part-based object detection. Optimal: predict the marginal means (exercise): y* = (E_{p(y|x)}[y_1], E_{p(y|x)}[y_2], ...).

Example: task-specific losses. Object detection: bounding boxes, or arbitrary regions. [figure: detection vs. ground truth in an image] Area overlap loss: Δ_AO(ȳ, y) = 1 − area(ȳ ∩ y) / area(ȳ ∪ y). Used, e.g., in the PASCAL VOC challenges for object detection, because it is scale-invariant (no bias for or against big objects).

Summary: inference and prediction. Two main tasks for a given probability distribution p(y|x). Probabilistic inference: compute p(y_I | x) for a subset I of the variables, in particular p(y_i | x); methods: (loopy) belief propagation, variational inference, sampling, ... MAP prediction: identify the y ∈ Y that maximizes p(y|x) (minimizes the energy); methods: (loopy) belief propagation, GraphCuts, LP relaxation, custom solvers, ... Structured prediction comes with structured loss functions Δ : Y × Y → R. The loss function Δ(ȳ, y) is the loss (or cost) for predicting ȳ ∈ Y when y ∈ Y is correct. It is task-specific: use the 0/1 loss, Hamming loss, area overlap, ...

More information: http://ps.is.tue.mpg.de/ (Max Planck Institute for Intelligent Systems). Other groups on campus: Empirical Inference (machine learning), Perceiving Systems (computer vision), Autonomous Motion (robotics).