Introduction To Graphical Models


Introduction To Graphical Models. Peter V. Gehler, Max Planck Institute for Intelligent Systems, Tübingen, Germany. ENS/INRIA Summer School, Paris, July 2013.

Extended version in book form: Sebastian Nowozin and Christoph Lampert, Structured Learning and Prediction in Computer Vision, ca. 200 pages, available free online at http://pub.ist.ac.at/~chl/. Slides mainly based on a tutorial version from Christoph. Thanks!

Literature recommendation: David Barber, Bayesian Reasoning and Machine Learning, 670 pages, available free online at http://web4.cs.ucl.ac.uk/staff/d.barber/pmwiki/pmwiki.php?n=brml.online

Standard regression: f : X → R. Inputs x ∈ X can be any kind of objects; the output y is a real number. Structured output learning: f : X → Y. Inputs x ∈ X can be any kind of objects; outputs y ∈ Y are complex (structured) objects.

What is structured output prediction? Ad hoc definition: predicting structured outputs from input data (in contrast to predicting just a single number, as in classification or regression). Natural language processing: automatic translation (output: sentences), sentence parsing (output: parse trees). Bioinformatics: secondary structure prediction (output: bipartite graphs), enzyme function prediction (output: a path in a tree). Speech processing: automatic transcription (output: sentences), text-to-speech (output: audio signal). Robotics: planning (output: sequence of actions). This tutorial: applications and examples from computer vision.

Probabilistic Graphical Models

Example: Human Pose Estimation. Input x ∈ X, output y ∈ Y. Given an image, where is a person and how is it articulated? f : X → Y. The input is an image x, but what is a human pose y ∈ Y precisely?

The pose space Y, example y_head. Body part: y_head = (u, v, θ), where (u, v) is the center and θ the rotation, with (u, v) ∈ {1,...,M} × {1,...,N} and θ ∈ {0, 45, 90, ...}. Entire body: y = (y_head, y_torso, y_left lower arm, ...) ∈ Y.

Human pose: Y_head from X. Image x ∈ X, example y_head, head detector. Idea: have a head classifier (SVM, NN, ...) ψ(y_head, x) ∈ R_+. Evaluate it everywhere and record the score. Repeat for all body parts.

Human pose estimation. Image x ∈ X, prediction y* ∈ Y. Compute y* = (y*_head, y*_torso, ...) = argmax_{y_head, y_torso, ...} ψ(y_head, x) ψ(y_torso, x) ··· = (argmax_{y_head} ψ(y_head, x), argmax_{y_torso} ψ(y_torso, x), ...). Great! Problem solved!?

Idea: connect up the body. In addition to the per-part scores ψ(y_head, x), ψ(y_torso, x), ..., add a head-torso model ψ(y_head, y_torso) ∈ R_+ (and ψ(y_torso, y_arm), ...) to ensure the head is on top of the torso. Compute y* = argmax_{y_head, y_torso, ...} ψ(y_head, x) ψ(y_torso, x) ψ(y_head, y_torso), but this does not decompose anymore! (left image from Ben Sapp)

The recipe. Structured output function, X = anything, Y = anything. 1) Define an auxiliary function g : X × Y → R, e.g. g(x, y) = ∏_i ψ_i(y_i, x) ∏_{i∼j} ψ_ij(y_i, y_j, x). 2) Obtain f : X → Y by maximization: f(x) = argmax_{y∈Y} g(x, y).
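
To make the recipe concrete, here is a minimal Python sketch (toy potentials, not from the lecture) that defines g(x, y) as a product of unary and pairwise potentials for two parts and obtains f(x) by exhaustive maximization; it also shows that the pairwise term breaks the independent per-part argmax.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

K_head, K_torso = 4, 4                     # toy label spaces for two "parts"
psi_head = rng.random(K_head)              # unary potential psi_head(y_head, x), x fixed
psi_torso = rng.random(K_torso)            # unary potential psi_torso(y_torso, x)
psi_pair = rng.random((K_head, K_torso))   # pairwise potential psi(y_head, y_torso)

def g(y):
    """Auxiliary function g(x, y) = product of all potentials (x is fixed here)."""
    y_h, y_t = y
    return psi_head[y_h] * psi_torso[y_t] * psi_pair[y_h, y_t]

# f(x) = argmax_y g(x, y): with the pairwise term, the maximization no longer
# decomposes into independent per-part argmax operations.
y_joint = max(itertools.product(range(K_head), range(K_torso)), key=g)
y_indep = (int(psi_head.argmax()), int(psi_torso.argmax()))
print("joint argmax:", y_joint, "independent argmax:", y_indep)
```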

A probabilistic view. Computer vision problems usually deal with uncertain information: incomplete information (we observe static images, projections, etc.), noisy annotation (wrong or ambiguous cases), ... Uncertainty is captured by (conditional) probability distributions: p(y|x): for input x ∈ X, how likely is y ∈ Y to be the correct output? We can also phrase this as: what is the probability of observing y given x? How strong is our belief in y if we know x?

A probabilistic view on f : X → Y. Structured output function, X = anything, Y = anything. We need to define an auxiliary function g : X × Y → R, e.g. g(x, y) := p(y|x). Then the maximization f(x) = argmax_{y∈Y} g(x, y) = argmax_{y∈Y} p(y|x) becomes maximum a posteriori (MAP) prediction. Interpretation: the MAP estimate y* ∈ Y is the most probable value (there can be multiple).

Probability distributions: p(y) ≥ 0 for all y ∈ Y (positivity), and ∑_{y∈Y} p(y) = 1 (normalization). Example: a binary ("Bernoulli") variable y ∈ Y = {0, 1}: 2 values, 1 degree of freedom. [figure: bar chart of p(y) for y ∈ {0, 1}]

Conditional probability distributions: p(y|x) ≥ 0 for all x ∈ X, y ∈ Y (positivity), and ∑_{y∈Y} p(y|x) = 1 for all x ∈ X (normalization w.r.t. y). For example, binary prediction: X = {images}, y ∈ Y = {0, 1}. For each x: 2 values, 1 degree of freedom, one (or two) functions.

Multi-class prediction, y ∈ Y = {1,...,K}. For each x: K values, K−1 degrees of freedom; K−1 functions, or 1 vector-valued function with K−1 outputs. Typically: K functions, plus explicit normalization. Example: predicting the center point of an object, y ∈ Y = {(1, 1),...,(width, height)}. For each x: |Y| = W·H values; y = (y_1, y_2) ∈ Y_1 × Y_2 with Y_1 = {1,...,width} and Y_2 = {1,...,height}, so for each x: |Y_1|·|Y_2| = W·H values.

Structured objects: predicting M variables jointly, Y = {1,...,K} × {1,...,K} × ··· × {1,...,K}. For each x: K^M values, K^M − 1 degrees of freedom, K^M functions. Example: object detection with a variable-size bounding box, Y ⊂ {1,...,W} × {1,...,H} × {1,...,W} × {1,...,H}, y = (left, top, right, bottom). For each x: (1/4)·W(W−1)·H(H−1) values (millions to billions...).

Example: image denoising, Y = {640 × 480 RGB images}. For each x: 16777216^307200 values in p(y|x), on the order of 10^2,000,000 functions: too much! We cannot consider all possible distributions; we must impose structure.

Probabilistic graphical models. A (probabilistic) graphical model defines a family of probability distributions over a set of random variables, by means of a graph. Popular classes of graphical models: undirected graphical models (Markov random fields), directed graphical models (Bayesian networks), factor graphs; others: chain graphs, influence diagrams, etc. The graph encodes conditional independence assumptions between the variables: with N(i) denoting the neighbors of node i in the graph, p(y_i | y_{V∖{i}}) = p(y_i | y_{N(i)}), where y_{V∖{i}} = (y_1,..., y_{i−1}, y_{i+1},..., y_n).

Example: pictorial structures for articulated pose estimation. [figure: factor graph connecting X to body-part variables Y_top, Y_head, Y_torso, Y_larm, Y_rarm, Y_lhnd, Y_rhnd, Y_lleg, Y_rleg, Y_lfoot, Y_rfoot] In principle, all parts depend on each other: knowing where the head is puts constraints on where the feet can be. But there are conditional independences as specified by the graph: if we know where the left leg is, the left foot's position does not depend on the torso position anymore, etc.: p(y_lfoot | y_top,..., y_torso,..., y_rfoot, x) = p(y_lfoot | y_lleg, x)

Factor graphs. Decomposable output y = (y_1,..., y_|V|). Graph G = (V, F, E), E ⊆ V × F: variable nodes V (circles), factor nodes F (boxes), edges E between variable and factor nodes. Each factor F ∈ F connects a subset of nodes; write F = {v_1,..., v_|F|} and y_F = (y_{v_1},..., y_{v_|F|}). [figure: factor graph over Y_i, Y_j, Y_k, Y_l] Factorization into potentials ψ at the factors: p(y) = (1/Z) ∏_{F∈F} ψ_F(y_F) = (1/Z) ψ_1(y_l) ψ_2(y_j, y_l) ψ_3(y_i, y_j) ψ_4(y_i, y_k, y_l). Z is a normalization constant, called the partition function: Z = ∑_{y∈Y} ∏_{F∈F} ψ_F(y_F).
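
A minimal Python sketch of the factorization above with assumed toy potential tables: it enumerates Y to compute the partition function Z and checks that p(y) = (1/Z) ∏_F ψ_F(y_F) sums to one.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
Ki = Kj = Kk = Kl = 3                       # small state spaces so enumeration is cheap

psi1 = rng.random(Kl)                       # psi_1(y_l)
psi2 = rng.random((Kj, Kl))                 # psi_2(y_j, y_l)
psi3 = rng.random((Ki, Kj))                 # psi_3(y_i, y_j)
psi4 = rng.random((Ki, Kk, Kl))             # psi_4(y_i, y_k, y_l)

def unnormalized(y):
    """Product of all factor potentials for a joint state y = (y_i, y_j, y_k, y_l)."""
    yi, yj, yk, yl = y
    return psi1[yl] * psi2[yj, yl] * psi3[yi, yj] * psi4[yi, yk, yl]

states = list(itertools.product(range(Ki), range(Kj), range(Kk), range(Kl)))
Z = sum(unnormalized(y) for y in states)    # partition function by brute force

def p(y):
    return unnormalized(y) / Z

print("Z =", Z)
print("sum of p(y) over all y =", sum(p(y) for y in states))  # should be 1.0
```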

Conditional distributions. How to model p(y|x)? The potentials also become functions of (part of) x: ψ_F(y_F; x_F) instead of just ψ_F(y_F), and p(y|x) = (1/Z(x)) ∏_{F∈F} ψ_F(y_F; x_F). [figure: factor graph with observed nodes X_i, X_j attached to Y_i, Y_j] The partition function now depends on x: Z(x) = ∑_{y∈Y} ∏_{F∈F} ψ_F(y_F; x_F). Note: x is treated just as an argument, not as a random variable. These models are conditional random fields (CRFs).

Conventions: potentials and energy functions. Assume ψ_F(y_F) > 0. Then instead of potentials we can also work with energies: ψ_F(y_F; x_F) = exp(−E_F(y_F; x_F)), or equivalently E_F(y_F; x_F) = −log(ψ_F(y_F; x_F)). p(y|x) can then be written as p(y|x) = (1/Z(x)) ∏_{F∈F} ψ_F(y_F; x_F) = (1/Z(x)) ∏_{F∈F} exp(−E_F(y_F; x_F)) = (1/Z(x)) exp(−E(y; x)) for E(y; x) = ∑_{F∈F} E_F(y_F; x_F).

Conventions: energy minimization. argmax_y p(y|x) = argmax_{y∈Y} (1/Z(x)) exp(−E(y; x)) = argmax_{y∈Y} exp(−E(y; x)) = argmax_{y∈Y} −E(y; x) = argmin_{y∈Y} E(y; x). MAP prediction can be performed by energy minimization. In practice, one typically models the energy function directly; the probability distribution is uniquely determined by it.

Example: an energy function for image segmentation. Foreground/background image segmentation: X = [0, 255]^(W×H), Y = {0, 1}^(W×H), foreground: y_i = 1, background: y_i = 0; graph: 4-connected grid. Each output pixel depends on the local grayvalue (input) and on the neighboring outputs. Energy function components ("Ising" model): unary terms E_i(y_i = 1, x_i) = 1 − x_i/255 and E_i(y_i = 0, x_i) = x_i/255, so a bright x_i favors y_i = foreground and a dark x_i favors y_i = background; pairwise terms E_ij(0, 0) = E_ij(1, 1) = 0 and E_ij(0, 1) = E_ij(1, 0) = ω for ω > 0, which prefer that neighbors have the same label (smooth labeling).

E(y; x) = ∑_i ( (1 − x_i/255) I(y_i = 1) + (x_i/255) I(y_i = 0) ) + ω ∑_{i∼j} I(y_i ≠ y_j). [figure: input image, segmentation from thresholding, segmentation from minimal energy]
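
A minimal Python sketch of this segmentation energy on a small synthetic grayscale image (the image and ω = 0.3 are assumptions for illustration): it evaluates E(y; x) for the per-pixel thresholding labeling and for an all-background labeling.

```python
import numpy as np

rng = np.random.default_rng(2)
H, W, omega = 6, 8, 0.3
x = np.clip(rng.normal(loc=60, scale=40, size=(H, W)), 0, 255)
x[2:5, 3:7] += 120                          # a bright "foreground" blob
x = np.clip(x, 0, 255)

def energy(y, x, omega):
    """E(y; x) = unary brightness terms + omega * number of disagreeing 4-neighbors."""
    unary = np.where(y == 1, 1.0 - x / 255.0, x / 255.0).sum()
    pairwise = (y[:, 1:] != y[:, :-1]).sum() + (y[1:, :] != y[:-1, :]).sum()
    return unary + omega * pairwise

y_threshold = (x > 127).astype(int)         # per-pixel decision, ignores smoothness
print("E(thresholding)   =", energy(y_threshold, x, omega))
print("E(all background) =", energy(np.zeros((H, W), int), x, omega))
```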

What to do with structured prediction models? Case 1) p(y|x) is known. MAP prediction: predict f : X → Y by solving y* = argmax_{y∈Y} p(y|x) = argmin_{y∈Y} E(y; x). Probabilistic inference: compute the marginal probabilities p(y_F | x) for any factor F, in particular p(y_i | x) for all i ∈ V.

What to do with structured prediction models? Case 2) p(y|x) is unknown, but we have training data. Parameter learning: assume a fixed graph structure and learn the potentials/energies (ψ_F), among other tasks (learning the graph structure, the variables, etc.). This is the topic of Wednesday's lecture.

Example: pictorial structures. [figure: input image x, MAP prediction argmax_y p(y|x), marginals p(y_i|x)] MAP makes a single (structured) prediction (a point estimate): the best overall pose. The marginal probabilities p(y_i | x) give us potential positions and the uncertainty of the individual body parts.

Example: man-made structure detection. [figure: input image x, MAP prediction argmax_y p(y|x), marginals p(y_i|x)] Task: does a pixel depict a man-made structure or not? y_i ∈ {0, 1}. Middle: MAP inference; right: variable marginals. Attention: the maximizers of the marginals need not coincide with the MAP labeling.

Probabilistic Inference. Compute p(y_F | x) and Z(x).

Assume y = (y_i, y_j, y_k, y_l), Y = Y_i × Y_j × Y_k × Y_l, and an energy function E(y; x) compatible with the following factor graph: the chain Y_i - F - Y_j - G - Y_k - H - Y_l. Task 1: for any y ∈ Y, compute p(y|x) using p(y|x) = (1/Z(x)) exp(−E(y; x)). Problem: we don't know Z(x), and computing it using Z(x) = ∑_{y∈Y} exp(−E(y; x)) looks expensive (the sum has |Y_i|·|Y_j|·|Y_k|·|Y_l| terms). A lot of research has been done on how to compute Z(x) efficiently.

Probabilistic inference: belief propagation / message passing. Factor graph: the chain Y_i - F - Y_j - G - Y_k - H - Y_l. For notational simplicity, we drop the dependence on the (fixed) x:

Z = ∑_{y∈Y} exp(−E(y))
  = ∑_{y_i∈Y_i} ∑_{y_j∈Y_j} ∑_{y_k∈Y_k} ∑_{y_l∈Y_l} exp(−(E_F(y_i, y_j) + E_G(y_j, y_k) + E_H(y_k, y_l)))
  = ∑_{y_i} ∑_{y_j} ∑_{y_k} ∑_{y_l} exp(−E_F(y_i, y_j)) exp(−E_G(y_j, y_k)) exp(−E_H(y_k, y_l))
  = ∑_{y_i} ∑_{y_j} exp(−E_F(y_i, y_j)) ∑_{y_k} exp(−E_G(y_j, y_k)) ∑_{y_l} exp(−E_H(y_k, y_l)).

The innermost sum defines a factor-to-variable message r_{H→Y_k} ∈ R^{Y_k}, r_{H→Y_k}(y_k) := ∑_{y_l} exp(−E_H(y_k, y_l)), so

Z = ∑_{y_i} ∑_{y_j} exp(−E_F(y_i, y_j)) ∑_{y_k} exp(−E_G(y_j, y_k)) r_{H→Y_k}(y_k)
  = ∑_{y_i} ∑_{y_j} exp(−E_F(y_i, y_j)) r_{G→Y_j}(y_j)
  = ∑_{y_i} r_{F→Y_i}(y_i).
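
A minimal Python sketch of this elimination order with assumed toy energies: Z is computed on the chain by the three factor-to-variable messages and verified against brute-force enumeration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
K = 4                                        # |Y_i| = |Y_j| = |Y_k| = |Y_l| = K
E_F = rng.random((K, K))                     # E_F(y_i, y_j)
E_G = rng.random((K, K))                     # E_G(y_j, y_k)
E_H = rng.random((K, K))                     # E_H(y_k, y_l)

# Messages, innermost sum first:
r_H_to_k = np.exp(-E_H).sum(axis=1)                 # r_{H->Y_k}(y_k) = sum_{y_l} exp(-E_H)
r_G_to_j = (np.exp(-E_G) * r_H_to_k).sum(axis=1)    # r_{G->Y_j}(y_j)
r_F_to_i = (np.exp(-E_F) * r_G_to_j).sum(axis=1)    # r_{F->Y_i}(y_i)
Z_messages = r_F_to_i.sum()

# Brute force over all K**4 joint states:
Z_brute = sum(
    np.exp(-(E_F[yi, yj] + E_G[yj, yk] + E_H[yk, yl]))
    for yi, yj, yk, yl in itertools.product(range(K), repeat=4)
)
print(Z_messages,_ := Z_brute)              # the two values agree
```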

Example: inference on trees. Factor graph: Y_i - F - Y_j - G - Y_k, with Y_k additionally connected through H to Y_l and through I to Y_m.

Z = ∑_{y∈Y} exp(−E(y))
  = ∑_{y_i∈Y_i} ∑_{y_j∈Y_j} ∑_{y_k∈Y_k} ∑_{y_l∈Y_l} ∑_{y_m∈Y_m} exp(−(E_F(y_i, y_j) + ··· + E_I(y_k, y_m)))
  = ∑_{y_i∈Y_i} ∑_{y_j∈Y_j} exp(−E_F(y_i, y_j)) ∑_{y_k∈Y_k} exp(−E_G(y_j, y_k)) [∑_{y_l∈Y_l} exp(−E_H(y_k, y_l))] [∑_{y_m∈Y_m} exp(−E_I(y_k, y_m))]
  = ∑_{y_i∈Y_i} ∑_{y_j∈Y_j} exp(−E_F(y_i, y_j)) ∑_{y_k∈Y_k} exp(−E_G(y_j, y_k)) (r_{H→Y_k}(y_k) · r_{I→Y_k}(y_k)),

and with the variable-to-factor message q_{Y_k→G}(y_k) := r_{H→Y_k}(y_k) · r_{I→Y_k}(y_k),

Z = ∑_{y_i∈Y_i} ∑_{y_j∈Y_j} exp(−E_F(y_i, y_j)) ∑_{y_k∈Y_k} exp(−E_G(y_j, y_k)) q_{Y_k→G}(y_k).

Factor graph sum-product algorithm. A "message" is a pair of vectors at each factor graph edge (i, F) ∈ E: 1. r_{F→Y_i} ∈ R^{Y_i}: the factor-to-variable message; 2. q_{Y_i→F} ∈ R^{Y_i}: the variable-to-factor message. The algorithm iteratively updates the messages. After convergence, Z and the marginals p(y_F) can be obtained from the messages. This is belief propagation.
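
A minimal Python sketch (reusing the toy chain setup above, an assumption) of how a marginal is read off from the messages on a tree: the belief at a variable is the product of its incoming messages, normalized; a brute-force computation confirms it.

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
K = 4
E_F, E_G, E_H = rng.random((3, K, K))        # chain Y_i - F - Y_j - G - Y_k - H - Y_l

# All factor-to-variable messages that reach Y_j:
m_F_to_j = np.exp(-E_F).sum(axis=0)                      # from the Y_i side
m_H_to_k = np.exp(-E_H).sum(axis=1)                      # from the Y_l side
m_G_to_j = (np.exp(-E_G) * m_H_to_k).sum(axis=1)         # from the Y_k / Y_l side

belief_j = m_F_to_j * m_G_to_j
p_j_bp = belief_j / belief_j.sum()                       # p(y_j), normalized belief

# Brute-force check:
p_j_brute = np.zeros(K)
for yi, yj, yk, yl in itertools.product(range(K), repeat=4):
    p_j_brute[yj] += np.exp(-(E_F[yi, yj] + E_G[yj, yk] + E_H[yk, yl]))
p_j_brute /= p_j_brute.sum()
print(np.allclose(p_j_bp, p_j_brute))                    # True
```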

Example: pictorial structures. [figure: tree-structured factor graph connecting X to the body-part variables Y_top, Y_head, Y_torso, arms, hands, legs, and feet] Tree-structured model for articulated pose (Felzenszwalb and Huttenlocher, 2000; Fischler and Elschlager, 1973). Body-part variables with states given by a discretized tuple (x, y, s, θ): (x, y) position, s scale, and θ rotation.

Example: pictorial structures. [figure: input image x, marginals p(y_i|x)] Exact marginals, although the state space is huge and the partition function is therefore a huge sum: Z(x) = ∑_{all bodies y} exp(−E(y; x)).

Belief propagation in loopy graphs. Can we do message passing also in graphs with loops? [figure: grid-structured factor graph over Y_i,..., Y_q with factors A through L] Problem: there is no well-defined leaf-to-root order. Suggested solution: loopy belief propagation (LBP): initialize all messages as constant 1, then pass messages until convergence.

Belief propagation in loopy graphs. [figure: grid-structured factor graph] Loopy belief propagation is very popular, but it has some problems: it might not converge (e.g. it can oscillate), and even if it does, the computed probabilities are only approximate. Many improved message-passing schemes exist (see the tutorial book).

Probabilistic inference: variational inference / mean field. Task: compute the marginals p(y_F | x) for a general p(y|x). Idea: approximate p(y|x) by a simpler q(y) and use the marginals of q instead: q* = argmin_{q∈Q} D_KL(q(y) ‖ p(y|x)). E.g. naive mean field: Q is the set of all distributions of the form q(y) = ∏_{i∈V} q_i(y_i). [figure: fully factorized approximation with one factor q_i per variable]
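
A minimal Python sketch of naive mean field for an assumed toy pairwise model: the factors q_i are updated one at a time, each set proportional to the exponential of the negative expected energy under the other factors.

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 6, 3                                  # 6 variables on a chain, K labels each
unary = rng.random((n, K))                   # E_i(y_i)
pair = rng.random((n - 1, K, K))             # E_{i,i+1}(y_i, y_{i+1})
edges = [(i, i + 1) for i in range(n - 1)]

q = np.full((n, K), 1.0 / K)                 # start from the uniform factorized q

for sweep in range(50):
    for i in range(n):
        # Expected energy of y_i = k under the current q of the neighbors.
        s = unary[i].copy()
        for e, (a, b) in enumerate(edges):
            if a == i:
                s += pair[e] @ q[b]          # sum_{y_b} q_b(y_b) E_ab(y_i, y_b)
            elif b == i:
                s += pair[e].T @ q[a]        # sum_{y_a} q_a(y_a) E_ab(y_a, y_i)
        q[i] = np.exp(-s)
        q[i] /= q[i].sum()                   # keep q_i a distribution

print("approximate marginals q_i(y_i):")
print(np.round(q, 3))
```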

Probabilistic inference: sampling / Markov chain Monte Carlo. Task: compute the marginals p(y_F | x) for a general p(y|x). Idea: rephrase this as computing the expected value of a quantity, E_{y∼p(y|x,w)}[h(x, y)], for some (well-behaved) function h : X × Y → R. For probabilistic inference, this step is easy: set h_{F,z}(x, y) := I(y_F = z); then E_{y∼p(y|x,w)}[h_{F,z}(x, y)] = ∑_{y∈Y} p(y|x) I(y_F = z) = ∑_{y_F∈Y_F} p(y_F|x) I(y_F = z) = p(y_F = z | x).

Expectations can be computed/approximated by sampling: for fixed x, let y^(1), y^(2), ... be i.i.d. samples from p(y|x); then E_{y∼p(y|x)}[h(x, y)] ≈ (1/S) ∑_{s=1}^{S} h(x, y^(s)). The law of large numbers guarantees convergence for S → ∞, and for S independent samples the approximation error is O(1/√S), independent of the dimension of Y. Problem: producing i.i.d. samples y^(s) from p(y|x) is hard. Solution: we can get away with a sequence of dependent samples: Markov chain Monte Carlo (MCMC) sampling.

One example of how to do MCMC sampling: the Gibbs sampler. Initialize y^(0) = (y_1,..., y_d) arbitrarily. For s = 1,..., S: 1. select a variable y_i; 2. re-sample y_i ∼ p(y_i | y^(s−1)_{V∖{i}}, x); 3. output the sample y^(s) = (y^(s−1)_1,..., y^(s−1)_{i−1}, y_i, y^(s−1)_{i+1},..., y^(s−1)_d). The required conditional is p(y_i | y_{V∖{i}}, x) = p(y_i, y_{V∖{i}} | x) / ∑_{y'_i∈Y_i} p(y'_i, y_{V∖{i}} | x) = exp(−E(y_i, y_{V∖{i}}; x)) / ∑_{y'_i∈Y_i} exp(−E(y'_i, y_{V∖{i}}; x)).
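
A minimal Python sketch of the Gibbs sampler for an assumed small binary chain model: each step re-samples one variable from its conditional, and the collected samples approximate the exact marginals (computed by brute force for comparison).

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)
n = 5                                        # binary variables y_1..y_n on a chain
unary = rng.random((n, 2))
pair = rng.random((n - 1, 2, 2))

def energy(y):
    e = sum(unary[i, y[i]] for i in range(n))
    e += sum(pair[i, y[i], y[i + 1]] for i in range(n - 1))
    return e

y = rng.integers(0, 2, size=n)               # arbitrary initialization y^(0)
counts = np.zeros((n, 2))
for s in range(20000):
    i = s % n                                # sweep over the variables
    # p(y_i | y_rest) is proportional to exp(-E) with y_i set to 0 and to 1:
    cond = np.array([np.exp(-energy(np.r_[y[:i], k, y[i + 1:]])) for k in (0, 1)])
    y[i] = rng.choice(2, p=cond / cond.sum())
    if s >= 2000:                            # discard burn-in samples
        counts[np.arange(n), y] += 1
gibbs_marginals = counts / counts.sum(axis=1, keepdims=True)

# Exact marginals by brute force for comparison:
exact = np.zeros((n, 2))
for yy in itertools.product((0, 1), repeat=n):
    w = np.exp(-energy(np.array(yy)))
    for i in range(n):
        exact[i, yy[i]] += w
exact /= exact.sum(axis=1, keepdims=True)
print(np.round(gibbs_marginals, 2))
print(np.round(exact, 2))
```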

MAP Prediction. Compute y* = argmax_y p(y|x).

MAP prediction: belief propagation / message passing. [figure: message-passing schedule on a tree and on a loopy grid] One can also derive message passing algorithms for MAP prediction. In trees: guaranteed to converge to the optimal solution. In loopy graphs: convergence is not guaranteed, approximate solution.
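
A minimal Python sketch of max-product message passing (equivalently min-sum when working with energies) for MAP on an assumed toy chain, with a brute-force check; this is an illustration of the idea, not the lecture's implementation.

```python
import itertools
import numpy as np

rng = np.random.default_rng(7)
n, K = 5, 4
unary = rng.random((n, K))                   # E_i(y_i)
pair = rng.random((n - 1, K, K))             # E_{i,i+1}(y_i, y_{i+1})

def energy(y):
    return sum(unary[i, y[i]] for i in range(n)) + \
           sum(pair[i, y[i], y[i + 1]] for i in range(n - 1))

# Backward pass: m[i][k] = minimal energy of everything to the right of i, given y_i = k
# (in negative-log space, so "max-product" becomes "min-sum").
m = np.zeros((n, K))
back = np.zeros((n - 1, K), dtype=int)
for i in range(n - 2, -1, -1):
    cost = pair[i] + unary[i + 1] + m[i + 1]         # shape (K, K): [y_i, y_{i+1}]
    m[i] = cost.min(axis=1)
    back[i] = cost.argmin(axis=1)

# Forward pass: read off the MAP labeling from the back-pointers.
y_map = [int((unary[0] + m[0]).argmin())]
for i in range(n - 1):
    y_map.append(int(back[i, y_map[-1]]))

y_brute = min(itertools.product(range(K), repeat=n), key=energy)
print(y_map, list(y_brute), energy(y_map), energy(y_brute))
```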

MAP prediction: graph cuts. For loopy graphs, we can find the global optimum only in special cases: binary output variables, Y_i = {0, 1} for i = 1,..., d, and an energy function with only unary and pairwise terms, E(y; x, w) = ∑_i E_i(y_i; x) + ∑_{i∼j} E_{i,j}(y_i, y_j; x). Restriction 1 (positive unary potentials): E_F(y_i; x, w_{t_F}) ≥ 0 (always achievable by reparametrization). Restriction 2 (regular/submodular/attractive pairwise potentials): E_F(y_i, y_j; x, w_{t_F}) = 0 if y_i = y_j, and E_F(y_i, y_j; x, w_{t_F}) = E_F(y_j, y_i; x, w_{t_F}) ≥ 0 otherwise (not always achievable, depends on the task).

Construct an auxiliary undirected graph: one node i per variable i ∈ V, plus two extra nodes, source s and sink t. Edge weights: edge {i, j} gets weight E_F(y_i = 0, y_j = 1; x, w_{t_F}); edge {i, s} gets weight E_F(y_i = 1; x, w_{t_F}); edge {i, t} gets weight E_F(y_i = 0; x, w_{t_F}). Find the minimum s-t-cut; its solution defines the optimal binary labeling of the original energy minimization problem (GraphCuts algorithms). (Approximate) multi-class extensions exist, see the tutorial book.

GraphCuts example. Image segmentation energy: E(y; x) = ∑_i ( (1 − x_i/255) I(y_i = 1) + (x_i/255) I(y_i = 0) ) + ω ∑_{i∼j} I(y_i ≠ y_j). All conditions to apply GraphCuts are fulfilled: E_i(y_i, x) ≥ 0, E_ij(y_i, y_j) = 0 for y_i = y_j, and E_ij(y_i, y_j) = ω > 0 for y_i ≠ y_j. [figure: input image, thresholding, GraphCuts]
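
A minimal Python sketch (not the lecture's code) of the s-t min-cut construction for this energy, using the networkx library on an assumed random toy image; the convention used here is that pixels ending up on the sink side of the cut receive label 1.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(8)
H, W, omega = 4, 5, 0.3
x = rng.integers(0, 256, size=(H, W)).astype(float)

G = nx.DiGraph()
def node(i, j):
    return i * W + j

for i in range(H):
    for j in range(W):
        G.add_edge("s", node(i, j), capacity=1.0 - x[i, j] / 255.0)  # E_i(y_i = 1)
        G.add_edge(node(i, j), "t", capacity=x[i, j] / 255.0)        # E_i(y_i = 0)
        for di, dj in ((0, 1), (1, 0)):                              # 4-connectivity
            if i + di < H and j + dj < W:
                G.add_edge(node(i, j), node(i + di, j + dj), capacity=omega)
                G.add_edge(node(i + di, j + dj), node(i, j), capacity=omega)

cut_value, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")
y = np.zeros((H, W), dtype=int)
for i in range(H):
    for j in range(W):
        y[i, j] = 0 if node(i, j) in source_side else 1

print("minimum cut value (= minimal energy of this construction):", cut_value)
print(y)
```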

MAP prediction: linear programming relaxation. A more general alternative, Y_i = {1,...,K}: E(y; x) = ∑_i E_i(y_i; x) + ∑_{ij} E_ij(y_i, y_j; x). Linearize the energy using indicator functions: E_i(y_i; x) = ∑_{k=1}^{K} E_i(k; x) I(y_i = k) = ∑_{k=1}^{K} a_{i;k} μ_{i;k}, with a_{i;k} := E_i(k; x), for new variables μ_{i;k} ∈ {0, 1} with ∑_k μ_{i;k} = 1. Likewise E_ij(y_i, y_j; x) = ∑_{k=1}^{K} ∑_{l=1}^{K} E_ij(k, l; x) I(y_i = k) I(y_j = l) = ∑_{k,l} a_{ij;kl} μ_{ij;kl}, with a_{ij;kl} := E_ij(k, l; x), for new variables μ_{ij;kl} ∈ {0, 1} with ∑_l μ_{ij;kl} = μ_{i;k} and ∑_k μ_{ij;kl} = μ_{j;l}.

MAP prediction: linear programming relaxation. Energy minimization becomes μ* := argmin_μ ∑_i ∑_k a_{i;k} μ_{i;k} + ∑_{ij} ∑_{kl} a_{ij;kl} μ_{ij;kl} = argmin_μ ⟨a, μ⟩, subject to μ_{i;k} ∈ {0, 1}, μ_{ij;kl} ∈ {0, 1}, ∑_k μ_{i;k} = 1, ∑_l μ_{ij;kl} = μ_{i;k}, ∑_k μ_{ij;kl} = μ_{j;l}. Integer variables, a linear objective function, and linear constraints: an integer linear program (ILP). Unfortunately, ILPs are in general NP-hard.

MAP prediction: linear programming relaxation. Relaxing the integrality constraints, energy minimization becomes μ* := argmin_μ ∑_i ∑_k a_{i;k} μ_{i;k} + ∑_{ij} ∑_{kl} a_{ij;kl} μ_{ij;kl} = argmin_μ ⟨a, μ⟩, subject to μ_{i;k} ∈ [0, 1], μ_{ij;kl} ∈ [0, 1], ∑_k μ_{i;k} = 1, ∑_l μ_{ij;kl} = μ_{i;k}, ∑_k μ_{ij;kl} = μ_{j;l}. Real-valued variables, a linear objective function, and linear constraints: a linear program (LP) relaxation. LPs can be solved very efficiently, and μ* yields an approximate solution for y*.
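
A minimal Python sketch of the LP relaxation for an assumed two-variable, K-label toy model, solved with scipy.optimize.linprog; the variable ordering and constraint rows below are one possible encoding of the constraints above. For this tiny tree-structured instance the relaxation happens to return an integral μ, matching brute-force MAP.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(9)
K = 3
a1, a2 = rng.random(K), rng.random(K)        # unary costs a_{1;k}, a_{2;l}
a12 = rng.random((K, K))                     # pairwise costs a_{12;kl}

# Variable order: mu_{1;k} (K), mu_{2;l} (K), mu_{12;kl} (K*K, row-major).
c = np.concatenate([a1, a2, a12.ravel()])
n = 2 * K + K * K

A_eq, b_eq = [], []
row = np.zeros(n); row[:K] = 1               # sum_k mu_{1;k} = 1
A_eq.append(row); b_eq.append(1.0)
row = np.zeros(n); row[K:2 * K] = 1          # sum_l mu_{2;l} = 1
A_eq.append(row); b_eq.append(1.0)
for k in range(K):                           # sum_l mu_{12;kl} = mu_{1;k}
    row = np.zeros(n)
    row[2 * K + k * K: 2 * K + (k + 1) * K] = 1
    row[k] = -1
    A_eq.append(row); b_eq.append(0.0)
for l in range(K):                           # sum_k mu_{12;kl} = mu_{2;l}
    row = np.zeros(n)
    row[2 * K + l: 2 * K + K * K: K] = 1
    row[K + l] = -1
    A_eq.append(row); b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, 1))
mu1, mu2 = res.x[:K], res.x[K:2 * K]
y_lp = (int(mu1.argmax()), int(mu2.argmax()))
y_brute = min(itertools.product(range(K), repeat=2),
              key=lambda y: a1[y[0]] + a2[y[1]] + a12[y[0], y[1]])
print("LP objective:", res.fun, "rounded labeling:", y_lp, "brute force:", y_brute)
```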

MAP prediction: custom solutions, e.g. branch-and-bound. Note: we just try to solve an optimization problem, y* = argmin_{y∈Y} E(y; x). We can use any optimization technique that fits the problem. For low-dimensional Y, such as bounding boxes: branch-and-bound.

Optimal Prediction. Predict with a loss function Δ(ȳ, y).

Optimal prediction. The optimal prediction minimizes the expected risk, an expectation: y* = argmin_{ȳ∈Y} ∑_{y∈Y} Δ(ȳ, y) p(y|x) = argmin_{ȳ∈Y} ∑_{y∈Y} Δ(ȳ, y) ∏_F ψ_F(y_F; x). One can think of Δ(ȳ, ·) as another CRF factor and reuse the inference techniques.

Example: Hamming loss. Count the number of mislabeled variables: Δ_H(ȳ, y) = (1/|V|) ∑_{i∈V} I(ȳ_i ≠ y_i). This makes more sense than the 0/1 loss for image segmentation. Optimal: predict the maximizers of the marginals (exercise): y* = (argmax_{y_1} p(y_1|x), argmax_{y_2} p(y_2|x), ...).
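
A minimal Python sketch with a hand-picked toy joint distribution (an assumption) showing that the Hamming-loss-optimal prediction, i.e. the per-variable maximizers of the marginals, can differ from the MAP labeling.

```python
import numpy as np

# Joint distribution p(y1, y2) over two binary variables; rows index y1, columns y2.
p = np.array([[0.4, 0.0],
              [0.3, 0.3]])

y_map = np.unravel_index(p.argmax(), p.shape)            # most probable joint state
p_y1 = p.sum(axis=1)                                     # marginal of y1
p_y2 = p.sum(axis=0)                                     # marginal of y2
y_maxmarg = (int(p_y1.argmax()), int(p_y2.argmax()))     # Hamming-optimal prediction

print("MAP labeling:       ", tuple(int(v) for v in y_map))   # (0, 0)
print("max-marginal labels:", y_maxmarg)                      # (1, 0)
```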

Example: pixel error. Applicable if we can add elements of Y_i (pixel intensities, optical flow vectors, etc.). Sum of squared errors: Δ_Q(ȳ, y) = (1/|V|) ∑_{i∈V} ‖ȳ_i − y_i‖². Used, e.g., in stereo reconstruction and part-based object detection. Optimal: predict the marginal means (exercise): y* = (E_{p(y|x)}[y_1], E_{p(y|x)}[y_2], ...).

Example: task-specific losses. Object detection: bounding boxes, or arbitrary regions. [figure: detection vs. ground truth in an image] Area overlap loss: Δ_AO(ȳ, y) = 1 − area(ȳ ∩ y) / area(ȳ ∪ y). Used, e.g., in the PASCAL VOC challenges for object detection, because it is scale-invariant (no bias for or against big objects).

Summary: inference and prediction. Two main tasks for a given probability distribution p(y|x). Probabilistic inference: compute p(y_I | x) for a subset I of the variables, in particular p(y_i | x); methods: (loopy) belief propagation, variational inference, sampling, ... MAP prediction: identify the y ∈ Y that maximizes p(y|x) (minimizes the energy); methods: (loopy) belief propagation, GraphCuts, LP relaxation, custom solvers, ... Structured prediction comes with structured loss functions Δ : Y × Y → R. The loss function Δ(ȳ, y) is the loss (or cost) for predicting ȳ ∈ Y when y ∈ Y is correct. It is task-specific: use the 0/1 loss, Hamming loss, area overlap, ...

More information: http://ps.is.tue.mpg.de/ (Max Planck Institute for Intelligent Systems). Other groups on campus: Empirical Inference (machine learning), Perceiving Systems (computer vision), Autonomous Motion (robotics).