Learning with Structured Inputs and Outputs


1 Learning with Structured Inputs and Outputs. Christoph H. Lampert, IST Austria (Institute of Science and Technology Austria), Vienna. Microsoft Machine Learning and Intelligence School 2015, July 29-August 5, 2015, Saint Petersburg, Russia. Slides: (lots of material courtesy of S. Nowozin)

2 Ad: PhD/PostDoc Positions at IST Austria, Vienna IST Austria Graduate School English language program yr PhD program full salary PostDoc positions in my group machine learning structured output learning transfer learning computer vision visual scene understanding curiosity driven basic research competitive salary, no mandatory teaching,... Internships in 2016: ask me! More information: or ask me during a break

3 Schedule
Monday: Introduction to Graphical Models
Monday: Probabilistic Inference
Monday: MAP Prediction / Energy Minimization
Monday: Structured Loss Functions
Wednesday: Learning Conditional Random Fields
Wednesday: Learning Structured Support Vector Machines
Slides available on my home page: 3 / 82

4 Related tutorial in book form: Foundations and Trends in Computer Graphics and Vision, Now Publishers. Available as PDF. 4 / 82

6 Standard Regression/Classification: f : X → R. Inputs x ∈ X can be any kind of objects; the output y ∈ Y is a number (real or integer). Structured Output Learning: f : X → Y. Inputs x ∈ X can be any kind of objects; outputs y ∈ Y are complex (structured) objects. 5 / 82

7 What is structured data? Ad hoc definition: data that consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together. Examples: text, molecules / chemical structures, documents / hypertext, images. 6 / 82

8 What is structured output prediction? Ad hoc definition: predicting structured outputs from input data (in contrast to predicting just a single number, as in classification or regression).
Natural Language Processing: automatic translation (output: sentences)
Bioinformatics: secondary structure prediction (output: bipartite graphs)
Speech Processing: text-to-speech (output: audio signal)
Robotics: planning (output: sequence of actions)
Information Retrieval: ranking (output: ordered list of documents)
This lecture: mainly examples from Computer Vision. 7 / 82

9 Example: Human Pose Estimation. Given an image x ∈ X, where is the person and how is it articulated? We want a prediction function f : X → Y with image input x, but what is y ∈ Y precisely? Image: Flickr.com user lululemon athletica 8 / 82

16 Example: Human Pose Estimation. Body part (e.g. the head): y_head = (u, v, θ), where (u, v) is the center and θ the rotation, with (u, v) ∈ {1,..., M} × {1,..., N} and θ ∈ {0, 45, 90, ...}. Same for torso, left arm, right arm, ... Entire body: y = (y_head, y_torso, y_left-lower-arm, ...) ∈ Y. 9 / 82

18 Example: Human Pose Estimation. Idea: have a head detector (CNN, SVM, RF, ...), f_head : X → R. Evaluate it for every possible location and record the score. Same construction for all other body parts. 10 / 82

20 Example: Human Pose Estimation. Image x ∈ X, prediction y^best ∈ Y. Put together the body from the individual parts: y^best = (y^best_head, y^best_torso, ...). Each part looks reasonable, but the overall pose makes no sense. 11 / 82

22 Example: Human Pose Estimation. Image: Ben Sapp. Enforce relations between parts, for example: the head must be connected to the torso. Problem: y^best ≠ (y^best_head, y^best_torso, ...); independent decisions for each body part are not optimal anymore. Needs a structured output prediction function f : X → Y. 12 / 82

24 The general recipe. Normal prediction function, X = anything, Y = R: extract a feature vector from x and compute a number from it, e.g. f(x) = ⟨w, φ(x)⟩ + b. Structured output prediction function, X = anything, Y = anything: 1) define an auxiliary function g : X × Y → R, e.g. g(x, y) = Σ_i ψ_i(y_i, x) + Σ_{i,j} ψ_ij(y_i, y_j, x); 2) construct f : X → Y from g, e.g. f(x) = argmax_{y∈Y} g(x, y). Challenges: how to learn g(x, y) from training data? How to compute f(x) from g(x, y)? 13 / 82
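
As a concrete illustration of the recipe, here is a minimal sketch (not from the slides; all function names and numbers are made up) of step 2: computing f(x) = argmax_y g(x, y) by exhaustive search over a tiny label space, with g decomposed into unary and pairwise terms as above.

```python
import itertools
import numpy as np

# Minimal sketch: structured prediction by exhaustive search.
# g(x, y) decomposes into unary scores psi_i(y_i, x) and pairwise
# scores psi_ij(y_i, y_j, x) on a small chain.

def g(x, y, unary, pairwise):
    """Score a complete labeling y = (y_1, ..., y_n)."""
    score = sum(unary[i](y[i], x) for i in range(len(y)))
    score += sum(pairwise[i](y[i], y[i + 1], x) for i in range(len(y) - 1))
    return score

def predict(x, label_set, n_vars, unary, pairwise):
    """f(x) = argmax_y g(x, y), feasible only for tiny label spaces."""
    best_y, best_score = None, -np.inf
    for y in itertools.product(label_set, repeat=n_vars):
        s = g(x, y, unary, pairwise)
        if s > best_score:
            best_y, best_score = y, s
    return best_y

# Toy example: 3 binary variables, unaries prefer the evidence x,
# the pairwise terms prefer neighboring variables to agree.
x = [0.2, 0.9, 0.8]
unary = [lambda yi, x, i=i: (x[i] if yi == 1 else 1 - x[i]) for i in range(3)]
pairwise = [lambda yi, yj, x: 0.5 if yi == yj else 0.0] * 2
print(predict(x, label_set=(0, 1), n_vars=3, unary=unary, pairwise=pairwise))
```

For realistic output spaces Y this enumeration is infeasible, which is exactly why the inference algorithms discussed in the remainder of the lecture are needed.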

25 Part 1: Probabilistic Graphical Models

26 A Probabilistic View. Machine learning almost always deals with uncertain information: training examples are collected randomly (e.g. from the web); annotation is noisy (there can be mistakes, or ambiguous cases); tasks cannot be solved with 100 percent certainty, because of incomplete information ("guess what number I think of") or inherent randomness ("guess a coin toss"). Uncertainty is captured by (conditional) probability distributions, p(y|x): for input x ∈ X, how likely is y ∈ Y the correct output? Probabilities make sense even for deterministic objects: how certain are we that y is the right answer to x? 15 / 82

27 A Probabilistic View on f : X → Y. Structured output function, X = anything, Y = anything. We need to define an auxiliary function g : X × Y → R, e.g. g(x, y) := p(y|x). Then prediction by maximization, f(x) = argmax_{y∈Y} g(x, y) = argmax_{y∈Y} p(y|x), becomes maximum a posteriori (MAP) prediction. Interpretation: the most probable one. If you have to pick a single output y ∈ Y, use the one you are most confident about. 16 / 82

29 Excursus: Probability Distributions. For all y ∈ Y: p(y) ≥ 0 (positivity); Σ_{y∈Y} p(y) = 1 (normalization). Example: binary ("Bernoulli") variable, y ∈ Y = {0, 1}: 2 values, 1 degree of freedom. 17 / 82

30 Excursus: Conditional Probability Distributions. For all x ∈ X, y ∈ Y: p(y|x) ≥ 0 (positivity); for all x ∈ X: Σ_{y∈Y} p(y|x) = 1 (normalization w.r.t. y). For example, binary prediction: X = {images}, y ∈ Y = {1, 7}. For each x: 2 values, 1 d.o.f. → one (or two) functions. 18 / 82

32 Multi-class prediction, y ∈ Y = {1,..., K}: for each x, K values, K−1 d.o.f. → K−1 functions, or 1 vector-valued function with K−1 outputs. Typically: K functions, plus explicit normalization. Example: predicting the center point of an object, y ∈ Y = {(1, 1),..., (width, height)}: for each x, |Y| = W·H values. Write y = (y_1, y_2) ∈ Y_1 × Y_2 with Y_1 = {1,..., width} and Y_2 = {1,..., height}; for each x, |Y_1 × Y_2| = W·H values. Image: Derek Magee, University of Leeds (TU Darmstadt cows dataset) 19 / 82

33 Structured objects: predicting M variables jointly, Y = {1,...,K} × {1,...,K} × ... × {1,...,K}. For each x: K^M values, K^M − 1 d.o.f. → K^M functions. Example: pose estimation, Y_part = {1,..., W} × {1,..., H} × {1,..., 360}, Y = Y_head × Y_left-arm × ... × Y_right-foot. For each x: (360·W·H)^{#body parts} values → many billions of functions. 20 / 82

35 Example: image denoising, Y = {RGB images}. For each x: (255^3)^{#pixels} values in p(y|x), i.e. over 10^{2,000,000} functions: too much! We cannot consider all possible distributions, we must impose structure. 21 / 82

38 Probabilistic Graphical Models. A (probabilistic) graphical model defines a family of probability distributions over a set of random variables, by means of a graph. Popular classes of graphical models: undirected graphical models (Markov random fields), directed graphical models (Bayesian networks), factor graphs, others (chain graphs, influence diagrams, etc.). The graph encodes conditional independence assumptions between the variables: let N(i) be the neighbors of node i in the graph (V, E); then p(y_i | y_{V∖{i}}) = p(y_i | y_{N(i)}), with y_{V∖{i}} = (y_1,..., y_{i−1}, y_{i+1},..., y_n). 22 / 82

39 Example: Pictorial Structures for Articulated Pose Estimation. [Figure: tree-shaped factor graph connecting the image X to body-part variables Y_top, Y_head, Y_torso, Y_arms, Y_hands, Y_legs, Y_feet.] All parts depend on each other: knowing where the head is puts constraints on where the feet can be. But there are conditional independences as specified by the graph: if we know where the left leg is, the left foot's position does not depend on the torso or the head position anymore, etc.: p(y_left-foot | y_top,..., y_torso,..., y_right-foot, x) = p(y_left-foot | y_left-leg, x). 23 / 82

41 Factor Graphs. Decomposable output y = (y_1,..., y_|V|). Graph G = (V, F): variable nodes V, factor nodes F; each factor F ∈ F connects a subset of nodes, write F = {v_1,..., v_|F|} and y_F = (y_{v_1},..., y_{v_|F|}). [Figure: factor graph over Y_i, Y_j, Y_k, Y_l.] The distribution factorizes into potentials ψ_F at the factors: p(y) = (1/Z) Π_{F∈F} ψ_F(y_F), where Z is a normalization constant, called the partition function: Z = Σ_{y∈Y} Π_{F∈F} ψ_F(y_F). 24 / 82
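
For very small models this factorization can be checked directly. A minimal sketch (illustrative only; the potentials and variable sizes are made up) that computes the partition function Z and a single-variable marginal for a tiny chain factor graph by brute-force enumeration:

```python
import itertools
import numpy as np

# Tiny chain factor graph Y_i - F - Y_j - G - Y_k with binary variables.
states = [0, 1]
def psi_F(yi, yj): return np.exp(0.8 if yi == yj else -0.8)
def psi_G(yj, yk): return np.exp(1.2 if yj == yk else -1.2)

def unnormalized(y):
    yi, yj, yk = y
    return psi_F(yi, yj) * psi_G(yj, yk)

# Partition function Z = sum_y prod_F psi_F(y_F)
Z = sum(unnormalized(y) for y in itertools.product(states, repeat=3))

# Joint probability and the single-variable marginal p(y_i)
def p(y): return unnormalized(y) / Z
marg_i = {s: sum(p((s, yj, yk)) for yj in states for yk in states) for s in states}
print("Z =", Z, " p(y_i):", marg_i)
```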

42 Conditional Distributions. How to model p(y|x)? The potentials also become functions of (part of) x: ψ_F(y_F; x_F) instead of just ψ_F(y_F), and p(y|x) = (1/Z(x)) Π_{F∈F} ψ_F(y_F; x_F). [Figure: factor graph with observed nodes X_i, X_j attached to output nodes Y_i, Y_j.] The partition function depends on x: Z(x) = Σ_{y∈Y} Π_{F∈F} ψ_F(y_F; x_F). Note: x is treated just as an argument, not as a random variable. → Conditional random fields (CRFs). 25 / 82

44 Conventions: Potentials and Energy Functions. Assume ψ_F(y_F) > 0. Then instead of potentials we can also work with energies: ψ_F(y_F; x_F) = exp(−E_F(y_F; x_F)), or equivalently E_F(y_F; x_F) = −log ψ_F(y_F; x_F). p(y|x) can be written as a Gibbs distribution: p(y|x) = (1/Z(x)) Π_{F∈F} ψ_F(y_F; x_F) = (1/Z(x)) Π_{F∈F} exp(−E_F(y_F; x_F)) = (1/Z(x)) exp(−E(y; x)) for E(y; x) = Σ_{F∈F} E_F(y_F; x_F). 26 / 82

46 Conventions: Energy Minimization. argmax_{y∈Y} p(y|x) = argmax_{y∈Y} (1/Z(x)) exp(−E(y; x)) = argmax_{y∈Y} exp(−E(y; x)) = argmax_{y∈Y} −E(y; x) = argmin_{y∈Y} E(y; x). MAP prediction can be performed by energy minimization. In practice, one typically just models the energy function; the probability distribution is uniquely determined by it. 27 / 82

47 Example: An Energy Function for Human Pose Estimation. [Figure: tree-shaped factor graph over the body-part variables, as before.] E(y; x) = Σ_{i∈{head, torso, ...}} E_i(y_i; x_i) + Σ_{(i,j)} E_ij(y_i, y_j). Unary factors (depend on one label) encode appearance, e.g. E_head(y; x): does location y in image x look like a head? Pairwise factors (depend on two labels) encode geometry, e.g. E_head-torso(y_head, y_torso): is location y_head above location y_torso? 28 / 82

48 Example: An Energy Function for Image Segmentation. Object segmentation, e.g. horse: input X, output Y (binary mask). Energy function components ("Ising" model): E_i(y_i = 1, x_i) = low if x_i is the right color (e.g. brown), high otherwise; E_i(y_i = 0, x_i) = −E_i(y_i = 1, x_i); E_ij(y_i, y_j) = low if y_i = y_j, high otherwise → prefer that neighbors have the same label → smooth labelings. 29 / 82
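
A minimal sketch of such an Ising-style segmentation energy (illustrative; the per-pixel "is brown" score x and the smoothness weight lam are made up), evaluating E(y; x) on a tiny 4-connected grid:

```python
import numpy as np

# x is a per-pixel score in [0, 1], y is a binary label image.
def energy(y, x, lam=2.0):
    # unary terms: label 1 is cheap where x is large, label 0 where x is small
    unary = np.where(y == 1, 1.0 - x, x).sum()
    # pairwise terms: penalize label changes between 4-neighbors
    pairwise = (y[:, :-1] != y[:, 1:]).sum() + (y[:-1, :] != y[1:, :]).sum()
    return unary + lam * pairwise

x = np.array([[0.9, 0.8, 0.2],
              [0.7, 0.6, 0.1],
              [0.2, 0.1, 0.0]])
y_smooth = (x > 0.5).astype(int)
print(energy(y_smooth, x), energy(1 - y_smooth, x))
```

Energies of exactly this form (binary labels, submodular pairwise terms) are the ones that the GraphCut construction in Part 3 of this lecture minimizes exactly.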

49 What to do with Structured Prediction Models? Case 1) p(y|x) is known. MAP Prediction: predict f : X → Y by optimization, y* = argmax_{y∈Y} p(y|x) = argmin_{y∈Y} E(y, x). Probabilistic Inference: compute marginal probabilities p(y_F|x) for any factor F, in particular p(y_i|x) for all i ∈ V. 30 / 82

50 What to do with Structured Prediction Models? [Figures: input image; the MAP pose argmax_y p(y|x); marginals p(y_i|x) for i = 1,..., 6.] MAP makes a single (structured) prediction: the best overall pose. Marginal probabilities p(y_i|x) give us potential positions and the uncertainty of the individual body parts. Images: Buffy Stickmen dataset (Ferrari et al.) 31 / 82

51 What to do with Structured Prediction Models? Case 2) p(y|x) is unknown, but we have training data. Structure Learning: learn the graph structure from training data. Variable Learning: learn whether to use additional (latent) variables, and which ones (input and output variables are fixed by the task we try to solve). Parameter Learning: assume a fixed factor graph, learn the parameters of the energy. → lecture on Wednesday 32 / 82

52 Part 2: Probabilistic Inference p(y_F|x)

53 Probabilistic Inference Overview
Exact Inference: Belief Propagation on chains; Belief Propagation on trees; junction tree algorithm.
Approximate Inference: Loopy Belief Propagation; MCMC sampling; variational inference / mean field.
34 / 82

56 Probabilistic Inference: Belief Propagation. Assume y = (y_i, y_j, y_k, y_l), Y = Y_i × Y_j × Y_k × Y_l, and an energy function E(y; x) compatible with the chain factor graph Y_i - F - Y_j - G - Y_k - H - Y_l. Task 1: for any y ∈ Y, compute p(y|x), using p(y|x) = (1/Z(x)) e^{−E(y;x)}. Problem: we don't know Z(x), and computing it using Z(x) = Σ_{y∈Y} e^{−E(y;x)} looks expensive (the sum has |Y_i|·|Y_j|·|Y_k|·|Y_l| terms). A lot of research has been done on how to compute Z(x) efficiently. 35 / 82

57 Probabilistic Inference: Belief Propagation. Chain factor graph Y_i - F - Y_j - G - Y_k - H - Y_l. For notational simplicity, we drop the dependence on the (fixed) x:
Z = Σ_{y∈Y} e^{−E(y)}
  = Σ_{y_i∈Y_i} Σ_{y_j∈Y_j} Σ_{y_k∈Y_k} Σ_{y_l∈Y_l} e^{−E(y_i, y_j, y_k, y_l)}
  = Σ_{y_i} Σ_{y_j} Σ_{y_k} Σ_{y_l} e^{−E_F(y_i,y_j) − E_G(y_j,y_k) − E_H(y_k,y_l)}
  = Σ_{y_i} Σ_{y_j} e^{−E_F(y_i,y_j)} Σ_{y_k} e^{−E_G(y_j,y_k)} Σ_{y_l} e^{−E_H(y_k,y_l)}.
Define the message r_{H→Y_k} ∈ R^{|Y_k|}, r_{H→Y_k}(y_k) := Σ_{y_l} e^{−E_H(y_k,y_l)}; then
Z = Σ_{y_i} Σ_{y_j} e^{−E_F(y_i,y_j)} Σ_{y_k} e^{−E_G(y_j,y_k)} r_{H→Y_k}(y_k).
Define r_{G→Y_j}(y_j) := Σ_{y_k} e^{−E_G(y_j,y_k)} r_{H→Y_k}(y_k); then
Z = Σ_{y_i} Σ_{y_j} e^{−E_F(y_i,y_j)} r_{G→Y_j}(y_j).
Define r_{F→Y_i}(y_i) := Σ_{y_j} e^{−E_F(y_i,y_j)} r_{G→Y_j}(y_j); then
Z = Σ_{y_i} r_{F→Y_i}(y_i). 36 / 82
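
This right-to-left message passing is a few lines of numpy. A minimal sketch (illustrative; the table sizes and energies are random) that computes Z for the chain above and checks it against brute-force enumeration; the root marginal p(y_i) comes out as a by-product:

```python
import numpy as np

# Sum-product message passing on the chain Y_i - F - Y_j - G - Y_k - H - Y_l.
rng = np.random.default_rng(0)
E_F = rng.normal(size=(3, 4))   # E_F(y_i, y_j)
E_G = rng.normal(size=(4, 2))   # E_G(y_j, y_k)
E_H = rng.normal(size=(2, 5))   # E_H(y_k, y_l)

# factor-to-variable messages, passed right to left towards Y_i
r_H_to_k = np.exp(-E_H).sum(axis=1)              # r_{H->Y_k}(y_k)
r_G_to_j = np.exp(-E_G) @ r_H_to_k               # r_{G->Y_j}(y_j)
r_F_to_i = np.exp(-E_F) @ r_G_to_j               # r_{F->Y_i}(y_i)
Z = r_F_to_i.sum()

# marginal of the root variable: p(y_i) is proportional to its incoming message
p_i = r_F_to_i / Z

# brute-force check over all joint states
Z_check = sum(np.exp(-E_F[i, j] - E_G[j, k] - E_H[k, l])
              for i in range(3) for j in range(4)
              for k in range(2) for l in range(5))
print(np.isclose(Z, Z_check), p_i)
```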

68 Example: Inference on Trees. Tree factor graph: Y_i - F - Y_j - G - Y_k, with Y_k additionally connected to Y_l through factor H and to Y_m through factor I.
1) Pick a root (here: i). 2) Sort the sums such that parent nodes are left of their children:
Z = Σ_{y∈Y} e^{−E(y)} = Σ_{y_i} Σ_{y_j} Σ_{y_k} Σ_{y_l} Σ_{y_m} e^{−E_F(y_i,y_j) − ... − E_I(y_k,y_m)}
  = Σ_{y_i} Σ_{y_j} e^{−E_F(y_i,y_j)} Σ_{y_k} e^{−E_G(y_j,y_k)} [Σ_{y_l} e^{−E_H(y_k,y_l)}] [Σ_{y_m} e^{−E_I(y_k,y_m)}]
  = Σ_{y_i} Σ_{y_j} e^{−E_F(y_i,y_j)} Σ_{y_k} e^{−E_G(y_j,y_k)} r_{H→Y_k}(y_k) r_{I→Y_k}(y_k).
3) With the variable-to-factor message q_{Y_k→G}(y_k) := r_{H→Y_k}(y_k) · r_{I→Y_k}(y_k):
Z = Σ_{y_i} Σ_{y_j} e^{−E_F(y_i,y_j)} Σ_{y_k} e^{−E_G(y_j,y_k)} q_{Y_k→G}(y_k), etc. 37 / 82

76 Factor Graph Sum-Product Algorithm. "Message": a pair of vectors at each factor graph edge (i, F) ∈ E: 1. r_{F→Y_i} ∈ R^{|Y_i|}: factor-to-variable message; 2. q_{Y_i→F} ∈ R^{|Y_i|}: variable-to-factor message. The algorithm iteratively updates the messages. After convergence, Z and p(y_F) can be obtained from the messages, e.g. p(y_i = ȳ_i) ∝ Π_{F:(i,F)∈E} r_{F→Y_i}(ȳ_i). → (Sum-Product) Belief Propagation. Easier to implement than to explain... 38 / 82

77 Example: Pictorial Structures. [Figure: tree-shaped factor graph over the body-part variables, as before.] Tree-structured model for articulated pose (Felzenszwalb and Huttenlocher, 2000), (Fischler and Elschlager, 1973). Belief propagation is the state of the art for prediction and inference. 39 / 82

78 Example: Pictorial Structures. Marginal probabilities p(y_i|x) give us potential positions and the uncertainty of the body parts. 40 / 82

80 Belief Propagation in Cyclic Graphs. Belief propagation does not work for graphs with cycles or for graphs with factors of size more than 2. We can construct equivalent chain/tree models: [figure: variables Y_i, Y_j, Y_m merged into one node Ỹ] Ỹ = (Y_i, Y_j, Y_m) with state space Ỹ = Y_i × Y_j × Y_m. General procedure: junction tree algorithm. Problem: exponentially growing state space → BP becomes inefficient. 41 / 82

82 Belief Propagation in Cyclic Graphs. What if we do belief propagation even though the graph has cycles? [Figure: grid-structured factor graph over Y_i,..., Y_q with pairwise factors A,..., L.] Problem: there is no well-defined leaf-to-root order, so where do we start? Loopy Belief Propagation (LBP): initialize all messages as constant 1; pass messages using the rules of BP until a stopping criterion is met. 42 / 82

83 Belief Propagation in Cyclic Graphs. Problems: loopy BP might not converge (e.g. it can oscillate), and even if it does, the computed probabilities are only approximate. Several improved schemes exist, some even convergent (but still approximate). Exact inference in general cyclic graphs is #P-hard. 43 / 82

84 Probabilistic Inference: Sampling / Markov Chain Monte Carlo. Task: compute marginals p(y_F|x) for general p(y|x). Idea: rephrase this as computing the expected value of a function, E_{y∼p(y|x,w)}[h(x, y)], for some (well-behaved) function h : X × Y → R. For probabilistic inference this step is easy: set h_{F,z}(x, y) := ⟦y_F = z⟧. Then E_{y∼p(y|x,w)}[h_{F,z}(x, y)] = Σ_{y∈Y} p(y|x) ⟦y_F = z⟧ = Σ_{y_F∈Y_F} p(y_F|x) ⟦y_F = z⟧ = p(y_F = z | x). 44 / 82

86 Probabilistic Inference: Sampling / Markov Chain Monte Carlo. Expectations can be computed/approximated by sampling: for fixed x, let y^(1), y^(2), ... be i.i.d. samples from p(y|x); then E_{y∼p(y|x)}[h(x, y)] ≈ (1/S) Σ_{s=1}^{S} h(x, y^(s)). The law of large numbers guarantees convergence for S → ∞. For S independent samples the approximation error is O(1/√S), independent of the size of Y. Problem: producing i.i.d. samples y^(s) from p(y|x) is hard. Solution: we can get away with a sequence of dependent samples → Markov chain Monte Carlo (MCMC) sampling. 45 / 82

87 Probabilistic Inference: Sampling / Markov Chain Monte Carlo. One example of how to do MCMC sampling: the Gibbs sampler. Initialize y^(1) = (y_1,..., y_d) arbitrarily. For s = 1,..., S: 1. select an index i; 2. re-sample ȳ_i ∼ p(y_i | y^(s)_{V∖{i}}, x); 3. output the sample y^(s+1) = (y^(s)_1,..., y^(s)_{i−1}, ȳ_i, y^(s)_{i+1},..., y^(s)_d). The required conditional is cheap, because Z(x) cancels: p(y_i | y^(s)_{V∖{i}}, x) = p(y_i, y^(s)_{V∖{i}} | x) / Σ_{ȳ_i∈Y_i} p(ȳ_i, y^(s)_{V∖{i}} | x) = e^{−E(y_i, y^(s)_{V∖{i}}, x)} / Σ_{ȳ_i∈Y_i} e^{−E(ȳ_i, y^(s)_{V∖{i}}, x)}. 46 / 82
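
A minimal sketch of this sampler (illustrative; the toy energy, state space, and chain length are made up) that re-samples one variable at a time and uses the visit counts after a burn-in phase to estimate the marginals p(y_i | x):

```python
import numpy as np

def gibbs_marginals(energy, x, d, states, S=5000, burn_in=500, seed=0):
    rng = np.random.default_rng(seed)
    y = np.array([rng.choice(states) for _ in range(d)])
    counts = np.zeros((d, len(states)))
    for s in range(S):
        i = rng.integers(d)                       # 1. select an index
        # 2. re-sample y_i from its conditional p(y_i | y_{V\{i}}, x)
        e = np.array([energy(np.r_[y[:i], v, y[i + 1:]], x) for v in states])
        p = np.exp(-(e - e.min())); p /= p.sum()  # normalize, numerically stabilized
        y[i] = rng.choice(states, p=p)
        if s >= burn_in:                          # 3. record the sample
            for j in range(d):
                counts[j, states.index(y[j])] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# toy chain energy: unaries plus an agreement penalty between neighbors
def E(y, x):
    return float(np.sum(np.where(y == 1, 1 - x, x)) + 0.5 * np.sum(y[:-1] != y[1:]))

x = np.array([0.9, 0.8, 0.2, 0.1])
print(gibbs_marginals(E, x, d=4, states=[0, 1]))
```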

88 Probabilistic Inference: Variational Inference / Mean Field. Task: compute marginals p(y_F|x) for general p(y|x). Idea: approximate p(y|x) by a simpler q(y) and use the marginals from that: q* = argmin_{q∈Q} D_KL(q(y) ‖ p(y|x)), then p(y_F|x) ≈ q*(y_F). [Figures: original model; tree approximation; product of unary factors q_e,..., q_m.] 47 / 82

89 Probabilistic Inference: Variational Inference / Mean Field. Special case: (naive) mean field for p(y|x) = (1/Z(x)) e^{−E(y,x)}: approximate p(y|x) ≈ q(y) = Π_{i∈V} q_i(y_i). There is no closed-form expression for q*, but an optimality condition: q*_i(y_i) ∝ exp(−E_{y∖{y_i} ∼ Q}[E(y, x)]) for Q(y_1,..., y_{i−1}, y_{i+1},..., y_n) = Π_{j≠i} q*_j(y_j). Iterative scheme: initialize the q_i (e.g. uniform); repeat until convergence: for i ∈ V in any order, update q_i while keeping the others fixed. 48 / 82
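
A minimal sketch of these naive mean-field updates (illustrative; the chain model, the energy tables U and P_right, and the iteration count are all made up) for a pairwise energy on a chain, where the expectation in the update rule reduces to matrix-vector products with the neighbors' current q:

```python
import numpy as np

# Naive mean field q(y) = prod_i q_i(y_i) for a chain energy
# E(y) = sum_i U[i][y_i] + sum_i P_right[i][y_i, y_{i+1}].
# Update: q_i(k) proportional to exp(-U[i][k] - sum over neighbors of E_q[pairwise]).
def mean_field(U, P_right, n_iters=50):
    n, K = U.shape
    q = np.full((n, K), 1.0 / K)                 # initialize uniformly
    for _ in range(n_iters):
        for i in range(n):                       # update each q_i in turn
            log_q = -U[i].copy()
            if i + 1 < n:                        # expected pairwise energy to the right
                log_q -= P_right[i] @ q[i + 1]
            if i > 0:                            # ... and to the left
                log_q -= q[i - 1] @ P_right[i - 1]
            log_q -= log_q.max()
            q[i] = np.exp(log_q) / np.exp(log_q).sum()
    return q

n, K = 5, 3
rng = np.random.default_rng(1)
U = rng.normal(size=(n, K))                            # unary energies
P_right = np.stack([2.0 * (1 - np.eye(K))] * (n - 1))  # Potts-like pairwise energies
print(mean_field(U, P_right).round(3))                 # approximate marginals q_i
```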

90 Example: State-of-the-Art in Image Segmentation Image: [Chen et al., Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs, ICLR 2015] see also [Krähenbühl, Koltun. Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials, NIPS 2011] 49 / 82

92 Probabilistic Inference Summary. Task: compute (marginal) probabilities p(y_F|x). Exact probabilistic inference is only possible for certain models: trees/forests (sum-product belief propagation); general graphs (junction tree algorithm, if tractable). Approximate probabilistic inference: many techniques with different properties and guarantees: loopy belief propagation, MCMC sampling (e.g. Gibbs sampling), variational inference (e.g. mean field), ... The best choice depends on the model and the requirements. 50 / 82

93 Part 3: MAP Prediction / Energy Minimization argmax_y p(y|x) / argmin_y E(y, x)

94 MAP Prediction / Energy Minimization
Exact Energy Minimization: Belief Propagation on chains/trees; Graph-Cuts for submodular energies; Integer Linear Programming.
Approximate Energy Minimization: Linear Programming relaxations; local search methods (Iterated Conditional Modes, multi-label Graph Cuts, simulated annealing).
52 / 82

95 Example: Pictorial Structures / Deformable Parts Model. [Figure: tree-shaped factor graph over the body-part variables, as before.] Tree-structured model for articulated pose (Felzenszwalb and Huttenlocher, 2000), (Fischler and Elschlager, 1973), (Yang and Ramanan, 2013), (Pishchulin et al., 2012). 53 / 82

96 Example: Pictorial Structures / Deformable Parts Model. Most likely configuration: y* = argmax_{y∈Y} p(y|x) = argmin_{y∈Y} E(y, x). 54 / 82

97 Energy Minimization: Belief Propagation. Chain model: the same trick as for probabilistic inference, with sums replaced by minimizations (chain factor graph Y_i - F - Y_j - G - Y_k - H - Y_l):
min_y E(y) = min_{y_i,y_j,y_k,y_l} E_F(y_i,y_j) + E_G(y_j,y_k) + E_H(y_k,y_l)
  = min_{y_i,y_j} [ E_F(y_i,y_j) + min_{y_k} [ E_G(y_j,y_k) + min_{y_l} E_H(y_k,y_l) ] ].
Define r_{H→Y_k} ∈ R^{|Y_k|}, r_{H→Y_k}(y_k) := min_{y_l} E_H(y_k,y_l); then
min_y E(y) = min_{y_i,y_j} [ E_F(y_i,y_j) + min_{y_k} [ E_G(y_j,y_k) + r_{H→Y_k}(y_k) ] ].
Define r_{G→Y_j}(y_j) := min_{y_k} [ E_G(y_j,y_k) + r_{H→Y_k}(y_k) ]; then
min_y E(y) = min_{y_i,y_j} [ E_F(y_i,y_j) + r_{G→Y_j}(y_j) ], and so on.
The actual minimizer is recovered by backtracking which choices were optimal. 55 / 82
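
A minimal sketch of this min-sum recursion (illustrative; the chain length, state count, and energy tables are random) that passes messages from right to left and then recovers the minimizing labeling by backtracking:

```python
import numpy as np

# Min-sum dynamic programming on a chain with unary tables U[t] (shape (K,))
# and pairwise tables E[t] of shape (K, K) between positions t and t+1.
def chain_map(U, E):
    n, K = U.shape
    msg = np.zeros((n, K))        # msg[t, k] = min cost of suffix t..n-1 given y_t = k
    argmin_next = np.zeros((n, K), dtype=int)
    msg[n - 1] = U[n - 1]
    for t in range(n - 2, -1, -1):                    # pass messages right to left
        cand = E[t] + msg[t + 1][None, :]             # cand[k, l] = E_t(k, l) + msg_{t+1}(l)
        argmin_next[t] = cand.argmin(axis=1)
        msg[t] = U[t] + cand.min(axis=1)
    y = [int(msg[0].argmin())]                        # best state of the root ...
    for t in range(n - 1):                            # ... then backtrack
        y.append(int(argmin_next[t][y[-1]]))
    return float(msg[0].min()), y

rng = np.random.default_rng(0)
n, K = 6, 4
U = rng.normal(size=(n, K))
E = rng.normal(size=(n - 1, K, K))
print(chain_map(U, E))
```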

104 Energy Minimization: Belief Propagation. Tree models: as in the sum-product case, messages from different subtrees are combined at the variable nodes, e.g. r_{H→Y_k}(y_k) = min_{y_l} E_H(y_k, y_l), r_{I→Y_k}(y_k) = min_{y_m} E_I(y_k, y_m), q_{Y_k→G}(y_k) = r_{H→Y_k}(y_k) + r_{I→Y_k}(y_k). → min-sum belief propagation (the equivalent max-sum form is more common). 56 / 82

105 Belief Propagation in Cyclic Graphs. [Figure: grid-structured factor graph, as before.] Loopy max-sum belief propagation has the same problem as in probabilistic inference: no guarantee of convergence, no guarantee of optimality. Some convergent variants exist, e.g. TRW-S [Kolmogorov, PAMI 2006]. 57 / 82

106 Cyclic Graphs. In general, MAP prediction/energy minimization in models with cycles or higher-order terms is intractable (NP-hard). Some important exceptions: low tree-width [Lauritzen, Spiegelhalter, 1988]; binary states, pairwise submodular interactions [Boykov, Jolly, 2001]; binary states, only pairwise interactions, planar graph [Globerson, Jaakkola, 2006]; special (Potts P^n) higher-order factors [Kohli, Kumar, 2007]; perfect graph structure [Jebara, 2009]. 58 / 82

108 Submodular Energy Functions. Binary variables: Y_i = {0, 1} for all i ∈ V. Energy function with unary and pairwise factors: E(y; x, w) = Σ_{i∈V} E_i(y_i) + Σ_{(i,j)∈E} E_ij(y_i, y_j). Restriction 1 (without loss of generality): E_i(y_i) ≥ 0 (always achievable by adding a constant to E). Restriction 2 (submodularity): E_ij(y_i, y_j) = 0 if y_i = y_j, and E_ij(y_i, y_j) = E_ij(y_j, y_i) ≥ 0 otherwise → neighbors prefer to have the same labels. 59 / 82

109 Graph-Cuts Algorithm for Submodular Energy Minimization [Greig et al., 1989]. If the conditions are fulfilled, energy minimization can be performed by solving an s-t-mincut problem. Construct an auxiliary undirected graph: one node per variable i ∈ V, plus two extra nodes, source s and sink t; weighted edges with weights: edge {i, j}: E_ij(y_i = 0, y_j = 1); edge {i, s}: E_i(y_i = 1); edge {i, t}: E_i(y_i = 0). Find the s-t-cut of minimal weight (polynomial time, using the max-flow/min-cut theorem). From the minimal-weight cut we recover a labeling of minimal energy: y_i = 1 if edge {i, s} is cut, otherwise y_i = 0. 60 / 82
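
A minimal sketch of this construction (illustrative; the unary and pairwise values are made up, and networkx's general-purpose minimum_cut is used in place of a specialized max-flow solver) for a Potts-type binary submodular energy:

```python
import networkx as nx

# Minimize E(y) = sum_i E_i(y_i) + sum_{ij} w_ij [y_i != y_j] via an s-t min cut,
# following the construction on the slide.
def graphcut_minimize(unary, pairwise):
    # unary: dict i -> (E_i(0), E_i(1)); pairwise: dict (i, j) -> w_ij >= 0
    G = nx.DiGraph()
    for i, (e0, e1) in unary.items():
        G.add_edge("s", i, capacity=e1)   # cut if i ends on the sink side (y_i = 1)
        G.add_edge(i, "t", capacity=e0)   # cut if i stays on the source side (y_i = 0)
    for (i, j), w in pairwise.items():
        G.add_edge(i, j, capacity=w)      # undirected penalty: add both directions
        G.add_edge(j, i, capacity=w)
    cut_value, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")
    return cut_value, {i: (1 if i in sink_side else 0) for i in unary}

# toy 1D example: three pixels, the middle unary is ambiguous, smoothing decides
unary = {0: (0.2, 0.8), 1: (0.5, 0.5), 2: (0.9, 0.1)}
pairwise = {(0, 1): 0.3, (1, 2): 0.3}
print(graphcut_minimize(unary, pairwise))
```

For image-sized problems one would use a dedicated max-flow implementation instead; the graph construction itself stays the same.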

110 Integer Linear Programming (ILP). General energy E(y) = Σ_F E_F(y_F): variables with more than 2 states, higher-order factors (more than 2 variables), non-submodular factors. Formulate it as an integer linear program (ILP): linear objective function, linear constraints, and the variables to optimize over are integer-valued. ILPs are in general NP-hard, but some individual instances can be solved. Standard toolboxes: e.g. CPLEX, Gurobi, COIN-OR, ... 61 / 82

112 Integer Linear Programming (ILP). Example: two variables with y_1 = 2 and y_2 = 3, i.e. (y_1, y_2) = (2, 3). Encode the assignment in indicator variables: z_1 ∈ {0,1}^{|Y_1|} with z_{1;k} = 1 ⟺ y_1 = k; z_2 ∈ {0,1}^{|Y_2|} with z_{2;l} = 1 ⟺ y_2 = l; z_12 ∈ {0,1}^{|Y_1 × Y_2|} with z_{12;kl} = 1 ⟺ (y_1 = k and y_2 = l). Constraints: Σ_{k∈Y_1} z_{1;k} = 1, Σ_{l∈Y_2} z_{2;l} = 1, Σ_{(k,l)∈Y_1×Y_2} z_{12;kl} = 1 (indicator property); Σ_{k∈Y_1} z_{12;kl} = z_{2;l} and Σ_{l∈Y_2} z_{12;kl} = z_{1;k} (consistency). 62 / 82

113 Integer Linear Programming (ILP). Example: E(y_1, y_2) = E_1(y_1) + E_12(y_1, y_2) + E_2(y_2). Define coefficient vectors: θ_1 ∈ R^{|Y_1|} with θ_{1;k} = E_1(k); θ_2 ∈ R^{|Y_2|} with θ_{2;l} = E_2(l); θ_12 ∈ R^{|Y_1 × Y_2|} with θ_{12;kl} = E_12(k, l). The energy is a linear function of the unknown z: E(y_1, y_2) = Σ_{i∈V} Σ_{k∈Y_i} θ_{i;k} ⟦y_i = k⟧ + Σ_{(i,j)∈E} Σ_{(k,l)∈Y_i×Y_j} θ_{ij;kl} ⟦y_i = k⟧⟦y_j = l⟧ = Σ_{i∈V} Σ_{k∈Y_i} θ_{i;k} z_{i;k} + Σ_{(i,j)∈E} Σ_{(k,l)∈Y_i×Y_j} θ_{ij;kl} z_{ij;kl}. 63 / 82

115 Integer Linear Programming (ILP).
min_z Σ_{i∈V} Σ_{k∈Y_i} θ_{i;k} z_{i;k} + Σ_{(i,j)∈E} Σ_{(k,l)∈Y_i×Y_j} θ_{ij;kl} z_{ij;kl}
subject to
z_{i;k} ∈ {0, 1} for all i ∈ V, k ∈ Y_i,
z_{ij;kl} ∈ {0, 1} for all (i,j) ∈ E, (k,l) ∈ Y_i × Y_j,
Σ_{k∈Y_i} z_{i;k} = 1 for all i ∈ V,
Σ_{(k,l)∈Y_i×Y_j} z_{ij;kl} = 1 for all (i,j) ∈ E,
Σ_{k∈Y_i} z_{ij;kl} = z_{j;l} for all (i,j) ∈ E, l ∈ Y_j,
Σ_{l∈Y_j} z_{ij;kl} = z_{i;k} for all (i,j) ∈ E, k ∈ Y_i.
NP-hard to solve because of the integrality constraints. 64 / 82

116 Linear Programming (LP) Relaxation. Same objective and constraints as the ILP, but with the integrality constraints relaxed: z_{i;k} ∈ {0, 1} → z_{i;k} ∈ [0, 1] for all i ∈ V, k ∈ Y_i, and z_{ij;kl} ∈ {0, 1} → z_{ij;kl} ∈ [0, 1] for all (i,j) ∈ E, (k,l) ∈ Y_i × Y_j; all normalization and consistency constraints are kept. Relaxing the constraints yields a tractable optimization problem. 65 / 82
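
A minimal sketch of this relaxation for a single pair of variables (illustrative; the state counts, the random energies, and the use of scipy's linprog are choices made for the example, not part of the lecture):

```python
import numpy as np
from scipy.optimize import linprog

K1, K2 = 3, 4
rng = np.random.default_rng(0)
theta1, theta2 = rng.normal(size=K1), rng.normal(size=K2)
theta12 = rng.normal(size=(K1, K2))

n = K1 + K2 + K1 * K2                       # z = (z1, z2, z12.flatten())
c = np.concatenate([theta1, theta2, theta12.ravel()])
A_eq, b_eq = [], []

row = np.zeros(n); row[:K1] = 1             # sum_k z1;k = 1
A_eq.append(row); b_eq.append(1.0)
row = np.zeros(n); row[K1:K1 + K2] = 1      # sum_l z2;l = 1
A_eq.append(row); b_eq.append(1.0)
for l in range(K2):                         # sum_k z12;kl = z2;l
    row = np.zeros(n)
    row[K1 + K2 + l::K2] = 1                # entries z12[k, l] for all k
    row[K1 + l] = -1
    A_eq.append(row); b_eq.append(0.0)
for k in range(K1):                         # sum_l z12;kl = z1;k
    row = np.zeros(n)
    row[K1 + K2 + k * K2: K1 + K2 + (k + 1) * K2] = 1
    row[k] = -1
    A_eq.append(row); b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, 1))
z1, z2 = res.x[:K1].round(3), res.x[K1:K1 + K2].round(3)
print(res.fun, z1, z2)
```

For this tree-structured toy problem the relaxation happens to return an integral solution; on cyclic models it generally does not, which is the rounding issue discussed on the next slide.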

117 Linear Programming (LP) Relaxation. The solution z^LP might have fractional values → no corresponding labeling y ∈ Y → round the LP solution to {0, 1} values. Problem: the rounded solution is usually not optimal, i.e. not identical to the ILP solution. → LP relaxations perform approximate energy minimization. 66 / 82

118 Linear Programming (LP) Relaxation Example: color quantization Example: stereo reconstruction Images: Berkeley Segmentation Dataset 67 / 82

123 Local Search. Avoid getting fractional solutions: energy minimization by local search. [Figure: a sequence of labelings y^0, y^1, y^2, ... and their neighborhoods N(y^0), N(y^1), ... inside Y.] Choose a starting labeling y^0; construct a neighborhood N(y^0) ⊂ Y of labelings; find the minimizer within the neighborhood, y^1 = argmin_{y∈N(y^0)} E(y); iterate until no more changes occur. 68 / 82

126 Iterated Conditional Modes (ICM) [Besag, 1986]. Define local neighborhoods: N_i(y) = {(y_1,..., y_{i−1}, ȳ, y_{i+1},..., y_n) | ȳ ∈ Y_i} for i ∈ V: all labelings reachable from y by changing the value of y_i. ICM procedure: neighborhood N(y) = ∪_{i∈V} N_i(y), all states reachable from y by changing a single variable; y^{t+1} = argmin_{y∈N(y^t)} E(y) by exhaustive search (Σ_i |Y_i| evaluations). Observation: larger neighborhood sizes are better. ICM: |N(y)| is linear in |V| → many iterations are needed to explore the exponentially large Y. Ideal: |N(y)| exponential in |V|, but we must ensure that argmin_{y∈N(y)} E(y) remains tractable. 69 / 82
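
A minimal sketch of ICM (illustrative; the toy chain energy and its numbers are made up), sweeping over the variables and re-optimizing one at a time until no variable changes:

```python
import numpy as np

def icm(energy, states, y_init):
    y = list(y_init)
    changed = True
    while changed:
        changed = False
        for i in range(len(y)):
            # exhaustively try all states of variable i, others fixed
            costs = [energy(y[:i] + [v] + y[i + 1:]) for v in states]
            best = states[int(np.argmin(costs))]
            if best != y[i]:
                y[i], changed = best, True
    return y, energy(y)

# toy chain energy: unaries plus a penalty for label changes
x = np.array([0.9, 0.8, 0.4, 0.1, 0.2])
def E(y):
    y = np.array(y)
    return float(np.sum(np.where(y == 1, 1 - x, x)) + 0.4 * np.sum(y[:-1] != y[1:]))

print(icm(E, states=[0, 1], y_init=[0] * 5))
```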

128 Multilabel Graph-Cut: α-expansion. E(y) with unary and pairwise terms, Y_i = L = {1,..., K} for all i ∈ V (multi-class). Example: semantic segmentation. 70 / 82

129 Multilabel Graph-Cut: α-expansion. E(y) with unary and pairwise terms, Y_i = L = {1,..., K} for all i ∈ V (multi-class). Algorithm: initialize y^0 arbitrarily (e.g. everything label 0); repeat: for each α ∈ L, construct the neighborhood N(y) = {(ȳ_1,..., ȳ_|V|) : ȳ_i ∈ {y_i, α}} (each variable can keep its value or switch to α) and solve y ← argmin_{ȳ∈N(y)} E(ȳ); until y has not changed for a whole iteration. 70 / 82

130 Multilabel Graph-Cut: α-expansion. Theorem [Boykov et al. 2001]: if all pairwise terms are metrics, i.e. for all (i,j) ∈ E: E_ij(k, l) ≥ 0 with E_ij(k, l) = 0 ⟺ k = l, E_ij(k, l) = E_ij(l, k), and E_ij(k, l) ≤ E_ij(k, m) + E_ij(m, l) for all k, l, m, then argmin_{y∈N(y)} E(y) can be solved optimally using GraphCut. Theorem [Veksler 2001]: the solution y^α returned by α-expansion fulfills E(y^α) ≤ 2c · min_{y∈Y} E(y) for c = max_{(i,j)∈E} [ max_{k≠l} E_ij(k, l) / min_{k≠l} E_ij(k, l) ]. 71 / 82

131 Example: Semantic Segmentation. E(y) = Σ_{i∈V} E_i(y_i) + λ Σ_{(i,j)∈E} ⟦y_i ≠ y_j⟧ (Potts model). The pairwise terms satisfy E_ij(k, l) ≥ 0, E_ij(k, l) = 0 ⟺ k = l, E_ij(k, l) = E_ij(l, k), and E_ij(k, l) ≤ E_ij(k, m) + E_ij(m, l); here c = max_{(i,j)∈E} [ max_{k≠l} E_ij(k, l) / min_{k≠l} E_ij(k, l) ] = 1, so we get a factor-2 approximation guarantee: E(y^α) ≤ 2 min_{y∈Y} E(y). 72 / 82

132 Example: Stereo Estimation. E(y) = Σ_{i∈V} E_i(y_i) + λ Σ_{(i,j)∈E} |y_i − y_j|. |y_i − y_j| is a metric; c = max_{(i,j)∈E} [ max_{k≠l} E_ij(k, l) / min_{k≠l} E_ij(k, l) ] = |L| − 1 → weak guarantees, but often close-to-optimal labelings in practice. Images: Middlebury stereo vision dataset 73 / 82

133 MAP Prediction / Energy Minimization Summary. Task: compute argmin_{y∈Y} E(y, x). Exact energy minimization is only possible for certain models: trees/forests (max-sum belief propagation); general graphs (junction tree algorithm, if tractable); submodular energies (GraphCut); general graphs (integer linear programming, if tractable). Approximate energy minimization: many techniques with different properties and guarantees: linear programming relaxations, ICM, α-expansion, ... The best choice depends on the model and the requirements. 74 / 82

134 Part 4: Structured Loss Functions Δ(ȳ, y)

135 Loss function. How to judge if a (structured) prediction is good? Define a loss function Δ : Y × Y → R_+; Δ(ȳ, y) measures the loss incurred by predicting y when ȳ is correct. The loss function is application dependent. 76 / 82

136 Example 1: 0/1 loss. The loss is 0 for a perfect prediction, 1 otherwise: Δ_{0/1}(ȳ, y) = ⟦ȳ ≠ y⟧, i.e. 0 if ȳ = y and 1 otherwise. Every mistake is equally bad. Usually not very useful in structured prediction. 77 / 82

137 Example 2: Hamming loss. Count the number of mislabeled variables: Δ_H(ȳ, y) = (1/|V|) Σ_{i∈V} ⟦ȳ_i ≠ y_i⟧. Used, e.g., for graph labeling tasks. 78 / 82

138 Example 3: Squared error. If we can add elements in Y_i (pixel intensities, optical flow vectors, etc.): sum of squared errors, Δ_Q(ȳ, y) = (1/|V|) Σ_{i∈V} ‖ȳ_i − y_i‖². Used, e.g., in stereo reconstruction and part-based object detection. 79 / 82

139 Example 4: Task-specific losses. Object detection: bounding boxes, or arbitrarily shaped regions. [Figures: detection, ground truth, image.] Intersection-over-union loss: Δ_IoU(ȳ, y) = 1 − area(ȳ ∩ y) / area(ȳ ∪ y). Used, e.g., in the PASCAL VOC challenges for object detection, because of its scale invariance (no bias for or against big objects). 80 / 82
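
A minimal sketch of these loss functions in code (illustrative; the label vectors and box coordinates are made up, and boxes are assumed axis-aligned in (x1, y1, x2, y2) format):

```python
import numpy as np

def loss_01(y_true, y_pred):
    return float(not np.array_equal(y_true, y_pred))

def loss_hamming(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

def loss_squared(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.sum((y_true - y_pred) ** 2, axis=-1)))

def loss_iou(box_true, box_pred):
    ix = max(0.0, min(box_true[2], box_pred[2]) - max(box_true[0], box_pred[0]))
    iy = max(0.0, min(box_true[3], box_pred[3]) - max(box_true[1], box_pred[1]))
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return 1.0 - inter / (area(box_true) + area(box_pred) - inter)

print(loss_01([1, 2, 3], [1, 2, 4]),
      loss_hamming([1, 2, 3], [1, 2, 4]),
      loss_squared([[0, 0], [1, 1]], [[0, 1], [1, 1]]),
      loss_iou((0, 0, 10, 10), (5, 0, 15, 10)))
```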

140 Making Bayes-Optimal Predictions. Given a distribution p(y|x), what is the best way to predict f : X → Y? Bayesian decision theory: pick the f(x) that causes minimal expected loss: f(x) = argmin_{y∈Y} R^Δ(y) for R^Δ(y) = E_{ȳ∼p(ȳ|x)}[Δ(ȳ, y)] = Σ_{ȳ∈Y} Δ(ȳ, y) p(ȳ|x). For many loss functions this is not tractable, but there are exceptions: R^{Δ_{0/1}}(y) = 1 − p(y|x), so f(x) = argmax_y p(y|x); R^{Δ_H}(y) = 1 − (1/|V|) Σ_{i∈V} p(y_i|x), so f(x) = (y_1,..., y_n) with y_i = argmax_{k∈Y_i} p(y_i = k|x). 81 / 82
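
A minimal sketch of this prediction rule for a toy posterior that can be enumerated explicitly (illustrative; the distribution and losses are made up). With the 0/1 loss it returns the MAP labeling, and with the Hamming loss it returns the labeling built from the per-variable marginal maximizers, matching the two special cases above:

```python
import itertools
import numpy as np

# Bayes-optimal prediction: f(x) = argmin_y sum_ybar delta(ybar, y) * p(ybar | x),
# with the argmin taken over all labelings of the given state space.
def bayes_predict(p, delta, states=(0, 1)):
    n = len(next(iter(p)))
    candidates = itertools.product(states, repeat=n)
    risks = {y: sum(delta(ybar, y) * q for ybar, q in p.items()) for y in candidates}
    return min(risks, key=risks.get)

def hamming(ybar, y):
    return np.mean(np.array(ybar) != np.array(y))

def zero_one(ybar, y):
    return float(ybar != y)

# toy posterior over three binary labelings
p = {(0, 0, 1): 0.4, (0, 1, 1): 0.35, (1, 1, 1): 0.25}
print(bayes_predict(p, zero_one))   # MAP labeling
print(bayes_predict(p, hamming))    # per-variable marginal maximizer
```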

141 Summary. Probabilistic Inference: compute p(y_F|x) for a subset F of variables, in particular p(y_i|x); some exact and many approximate techniques. MAP Prediction / Energy Minimization: identify the y ∈ Y that maximizes p(y|x) (minimizes E(y, x)); some exact and many approximate techniques. Loss Function: measure the quality of a prediction by a loss function Δ : Y × Y → R; task specific: 0/1 loss, Hamming loss, intersection over union, ... Optimal prediction rule: f(x) = argmin_{y∈Y} E_{ȳ∼p(ȳ|x)}[Δ(ȳ, y)]. 82 / 82


More information

Probabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April

Probabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April Probabilistic Graphical Models Guest Lecture by Narges Razavian Machine Learning Class April 14 2017 Today What is probabilistic graphical model and why it is useful? Bayesian Networks Basic Inference

More information

Sequential Supervised Learning

Sequential Supervised Learning Sequential Supervised Learning Many Application Problems Require Sequential Learning Part-of of-speech Tagging Information Extraction from the Web Text-to to-speech Mapping Part-of of-speech Tagging Given

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

CS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning

CS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning CS 446 Machine Learning Fall 206 Nov 0, 206 Bayesian Learning Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes Overview Bayesian Learning Naive Bayes Logistic Regression Bayesian Learning So far, we

More information

Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

More information

Probabilistic Graphical Networks: Definitions and Basic Results

Probabilistic Graphical Networks: Definitions and Basic Results This document gives a cursory overview of Probabilistic Graphical Networks. The material has been gleaned from different sources. I make no claim to original authorship of this material. Bayesian Graphical

More information

CSC 412 (Lecture 4): Undirected Graphical Models

CSC 412 (Lecture 4): Undirected Graphical Models CSC 412 (Lecture 4): Undirected Graphical Models Raquel Urtasun University of Toronto Feb 2, 2016 R Urtasun (UofT) CSC 412 Feb 2, 2016 1 / 37 Today Undirected Graphical Models: Semantics of the graph:

More information

Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning

Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning Anton Rodomanov Higher School of Economics, Russia Bayesian methods research group (http://bayesgroup.ru) 14 March

More information

CS Lecture 4. Markov Random Fields

CS Lecture 4. Markov Random Fields CS 6347 Lecture 4 Markov Random Fields Recap Announcements First homework is available on elearning Reminder: Office hours Tuesday from 10am-11am Last Time Bayesian networks Today Markov random fields

More information

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

CS6375: Machine Learning Gautam Kunapuli. Decision Trees Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

More information

CSCE 478/878 Lecture 6: Bayesian Learning

CSCE 478/878 Lecture 6: Bayesian Learning Bayesian Methods Not all hypotheses are created equal (even if they are all consistent with the training data) Outline CSCE 478/878 Lecture 6: Bayesian Learning Stephen D. Scott (Adapted from Tom Mitchell

More information

End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation

End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation Wei Yang, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang Department of Electronic Engineering,

More information

Statistical Problems in Computer Vision

Statistical Problems in Computer Vision V and ML Statistical Problems in omputer Vision Machine Learning and Perception Group Microsoft Research ambridge SI Meeting, ambridge, 26th September 2011 Statistical Problems in omputer Vision Microsoft

More information

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Approximate inference, Sampling & Variational inference Fall Cours 9 November 25

Approximate inference, Sampling & Variational inference Fall Cours 9 November 25 Approimate inference, Sampling & Variational inference Fall 2015 Cours 9 November 25 Enseignant: Guillaume Obozinski Scribe: Basile Clément, Nathan de Lara 9.1 Approimate inference with MCMC 9.1.1 Gibbs

More information

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

More information

Loopy Belief Propagation for Bipartite Maximum Weight b-matching

Loopy Belief Propagation for Bipartite Maximum Weight b-matching Loopy Belief Propagation for Bipartite Maximum Weight b-matching Bert Huang and Tony Jebara Computer Science Department Columbia University New York, NY 10027 Outline 1. Bipartite Weighted b-matching 2.

More information

Markov Networks. l Like Bayes Nets. l Graphical model that describes joint probability distribution using tables (AKA potentials)

Markov Networks. l Like Bayes Nets. l Graphical model that describes joint probability distribution using tables (AKA potentials) Markov Networks l Like Bayes Nets l Graphical model that describes joint probability distribution using tables (AKA potentials) l Nodes are random variables l Labels are outcomes over the variables Markov

More information

UNDERSTANDING BELIEF PROPOGATION AND ITS GENERALIZATIONS

UNDERSTANDING BELIEF PROPOGATION AND ITS GENERALIZATIONS UNDERSTANDING BELIEF PROPOGATION AND ITS GENERALIZATIONS JONATHAN YEDIDIA, WILLIAM FREEMAN, YAIR WEISS 2001 MERL TECH REPORT Kristin Branson and Ian Fasel June 11, 2003 1. Inference Inference problems

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory

More information

Random Field Models for Applications in Computer Vision

Random Field Models for Applications in Computer Vision Random Field Models for Applications in Computer Vision Nazre Batool Post-doctorate Fellow, Team AYIN, INRIA Sophia Antipolis Outline Graphical Models Generative vs. Discriminative Classifiers Markov Random

More information

Machine Learning for Data Science (CS4786) Lecture 24

Machine Learning for Data Science (CS4786) Lecture 24 Machine Learning for Data Science (CS4786) Lecture 24 Graphical Models: Approximate Inference Course Webpage : http://www.cs.cornell.edu/courses/cs4786/2016sp/ BELIEF PROPAGATION OR MESSAGE PASSING Each

More information

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

Bayesian Networks Inference with Probabilistic Graphical Models

Bayesian Networks Inference with Probabilistic Graphical Models 4190.408 2016-Spring Bayesian Networks Inference with Probabilistic Graphical Models Byoung-Tak Zhang intelligence Lab Seoul National University 4190.408 Artificial (2016-Spring) 1 Machine Learning? Learning

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 9 Undirected Models CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due next Wednesday (Nov 4) in class Start early!!! Project milestones due Monday (Nov 9)

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory

More information

CS 188: Artificial Intelligence Spring Announcements

CS 188: Artificial Intelligence Spring Announcements CS 188: Artificial Intelligence Spring 2010 Lecture 22: Nearest Neighbors, Kernels 4/18/2011 Pieter Abbeel UC Berkeley Slides adapted from Dan Klein Announcements On-going: contest (optional and FUN!)

More information

Course 16:198:520: Introduction To Artificial Intelligence Lecture 9. Markov Networks. Abdeslam Boularias. Monday, October 14, 2015

Course 16:198:520: Introduction To Artificial Intelligence Lecture 9. Markov Networks. Abdeslam Boularias. Monday, October 14, 2015 Course 16:198:520: Introduction To Artificial Intelligence Lecture 9 Markov Networks Abdeslam Boularias Monday, October 14, 2015 1 / 58 Overview Bayesian networks, presented in the previous lecture, are

More information

Lecture 13 : Variational Inference: Mean Field Approximation

Lecture 13 : Variational Inference: Mean Field Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1

More information

Probabilistic Graphical Models in Computer Vision (IN2329)

Probabilistic Graphical Models in Computer Vision (IN2329) Probabilistic Graphical Models in Computer Vision (IN2329) Csaba Domokos Summer Semester 2017 9. Human pose estimation & Mean field approximation......................................................................

More information

Markov Networks. l Like Bayes Nets. l Graph model that describes joint probability distribution using tables (AKA potentials)

Markov Networks. l Like Bayes Nets. l Graph model that describes joint probability distribution using tables (AKA potentials) Markov Networks l Like Bayes Nets l Graph model that describes joint probability distribution using tables (AKA potentials) l Nodes are random variables l Labels are outcomes over the variables Markov

More information

Towards Lifelong Machine Learning Multi-Task and Lifelong Learning with Unlabeled Tasks Christoph Lampert

Towards Lifelong Machine Learning Multi-Task and Lifelong Learning with Unlabeled Tasks Christoph Lampert Towards Lifelong Machine Learning Multi-Task and Lifelong Learning with Unlabeled Tasks Christoph Lampert HSE Computer Science Colloquium September 6, 2016 IST Austria (Institute of Science and Technology

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:

More information

13 : Variational Inference: Loopy Belief Propagation

13 : Variational Inference: Loopy Belief Propagation 10-708: Probabilistic Graphical Models 10-708, Spring 2014 13 : Variational Inference: Loopy Belief Propagation Lecturer: Eric P. Xing Scribes: Rajarshi Das, Zhengzhong Liu, Dishan Gupta 1 Introduction

More information

Lecture 6: Graphical Models: Learning

Lecture 6: Graphical Models: Learning Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)

More information

Machine Learning Summer School

Machine Learning Summer School Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,

More information

MAP Examples. Sargur Srihari

MAP Examples. Sargur Srihari MAP Examples Sargur srihari@cedar.buffalo.edu 1 Potts Model CRF for OCR Topics Image segmentation based on energy minimization 2 Examples of MAP Many interesting examples of MAP inference are instances

More information

14 : Theory of Variational Inference: Inner and Outer Approximation

14 : Theory of Variational Inference: Inner and Outer Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 14 : Theory of Variational Inference: Inner and Outer Approximation Lecturer: Eric P. Xing Scribes: Maria Ryskina, Yen-Chia Hsu 1 Introduction

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

Generative Clustering, Topic Modeling, & Bayesian Inference

Generative Clustering, Topic Modeling, & Bayesian Inference Generative Clustering, Topic Modeling, & Bayesian Inference INFO-4604, Applied Machine Learning University of Colorado Boulder December 12-14, 2017 Prof. Michael Paul Unsupervised Naïve Bayes Last week

More information

Probabilistic Machine Learning

Probabilistic Machine Learning Probabilistic Machine Learning Bayesian Nets, MCMC, and more Marek Petrik 4/18/2017 Based on: P. Murphy, K. (2012). Machine Learning: A Probabilistic Perspective. Chapter 10. Conditional Independence Independent

More information