Learning with Structured Inputs and Outputs


1 Learning with Structured Inputs and Outputs. Christoph H. Lampert, IST Austria (Institute of Science and Technology Austria), Vienna. Microsoft Machine Learning and Intelligence School 2015, July 29-August 5, 2015, Saint Petersburg, Russia. Slides: (lots of material courtesy of S. Nowozin)

2 Ad: PhD/PostDoc Positions at IST Austria, Vienna IST Austria Graduate School English language program yr PhD program full salary PostDoc positions in my group machine learning structured output learning transfer learning computer vision visual scene understanding curiosity driven basic research competitive salary, no mandatory teaching,... Internships in 2016: ask me! More information: or ask me during a break

3 Schedule
Monday: Introduction to Graphical Models
Monday: Probabilistic Inference
Monday: MAP Prediction / Energy Minimization
Monday: Structured Loss Functions
Wednesday: Learning Conditional Random Fields
Wednesday: Learning Structured Support Vector Machines
Slides available on my home page: 3 / 82

4 Related tutorial in book form: Foundations and Trends in Computer Graphics and Vision, Now Publishers. Available as PDF. 4 / 82

6 Standard Regression/Classification: f : X → R. Inputs x ∈ X can be any kind of objects; the output y ∈ Y is a number (real or integer). Structured Output Learning: f : X → Y. Inputs x ∈ X can be any kind of objects; outputs y ∈ Y are complex (structured) objects. 5 / 82

7 What is structured data? Ad hoc definition: data that consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together. Examples: text, molecules / chemical structures, documents / hypertext, images. 6 / 82

8 What is structured output prediction? Ad hoc definition: predicting structured outputs from input data (in contrast to predicting just a single number, as in classification or regression).
Natural Language Processing: automatic translation (output: sentences)
Bioinformatics: secondary structure prediction (output: bipartite graphs)
Speech Processing: text-to-speech (output: audio signal)
Robotics: planning (output: sequence of actions)
Information Retrieval: ranking (output: ordered list of documents)
This lecture: mainly examples from Computer Vision. 7 / 82

9 Example: Human Pose Estimation. Given an image x ∈ X, where is the person and how is it articulated? We want a prediction function f : X → Y with image input x, but what is y ∈ Y precisely? Image: Flickr.com user lululemon athletica 8 / 82

16 Example: Human Pose Estimation. Body part (e.g. the head): y_head = (u, v, θ), where (u, v) is the center and θ the rotation, with (u, v) ∈ {1,..., M} × {1,..., N} and θ ∈ {0, 45, 90, ...}. Same for torso, left arm, right arm, ... Entire body: y = (y_head, y_torso, y_left-lower-arm, ...) ∈ Y. 9 / 82

18 Example: Human Pose Estimation. Idea: have a head detector (CNN, SVM, RF, ...), f_head : X → R. Evaluate it for every possible location and record the score. Same construction for all other body parts. 10 / 82

20 Example: Human Pose Estimation. Image x ∈ X, prediction y^best ∈ Y. Put together the body from the individual parts: y^best = (y^best_head, y^best_torso, ...). Each part looks reasonable, but the overall pose makes no sense. 11 / 82

22 Example: Human Pose Estimation. Image: Ben Sapp. Enforce relations between parts, for example: the head must be connected to the torso. Problem: y^best ≠ (y^best_head, y^best_torso, ...); independent decisions for each body part are not optimal anymore. Needs a structured output prediction function f : X → Y. 12 / 82

24 The general recipe. Normal prediction function, X = anything, Y = R: extract a feature vector from x and compute a number from it, e.g. f(x) = ⟨w, φ(x)⟩ + b. Structured output prediction function, X = anything, Y = anything: 1) define an auxiliary function g : X × Y → R, e.g. g(x, y) = Σ_i ψ_i(y_i, x) + Σ_{i,j} ψ_ij(y_i, y_j, x); 2) construct f : X → Y from g, e.g. f(x) = argmax_{y∈Y} g(x, y). Challenges: how to learn g(x, y) from training data? How to compute f(x) from g(x, y)? 13 / 82
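
As a concrete illustration of the recipe, here is a minimal sketch (not from the slides; all function names and numbers are made up) of step 2: computing f(x) = argmax_y g(x, y) by exhaustive search over a tiny label space, with g decomposed into unary and pairwise terms as above.

```python
import itertools
import numpy as np

# Minimal sketch: structured prediction by exhaustive search.
# g(x, y) decomposes into unary scores psi_i(y_i, x) and pairwise
# scores psi_ij(y_i, y_j, x) on a small chain.

def g(x, y, unary, pairwise):
    """Score a complete labeling y = (y_1, ..., y_n)."""
    score = sum(unary[i](y[i], x) for i in range(len(y)))
    score += sum(pairwise[i](y[i], y[i + 1], x) for i in range(len(y) - 1))
    return score

def predict(x, label_set, n_vars, unary, pairwise):
    """f(x) = argmax_y g(x, y), feasible only for tiny label spaces."""
    best_y, best_score = None, -np.inf
    for y in itertools.product(label_set, repeat=n_vars):
        s = g(x, y, unary, pairwise)
        if s > best_score:
            best_y, best_score = y, s
    return best_y

# Toy example: 3 binary variables, unaries prefer the evidence x,
# the pairwise terms prefer neighboring variables to agree.
x = [0.2, 0.9, 0.8]
unary = [lambda yi, x, i=i: (x[i] if yi == 1 else 1 - x[i]) for i in range(3)]
pairwise = [lambda yi, yj, x: 0.5 if yi == yj else 0.0] * 2
print(predict(x, label_set=(0, 1), n_vars=3, unary=unary, pairwise=pairwise))
```

For realistic output spaces Y this enumeration is infeasible, which is exactly why the inference algorithms discussed in the remainder of the lecture are needed.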

25 Part 1: Probabilistic Graphical Models

26 A Probabilistic View. Machine learning almost always deals with uncertain information: training examples are collected randomly (e.g. from the web); annotation is noisy (there can be mistakes, or ambiguous cases); tasks cannot be solved with 100 percent certainty, because of incomplete information ("guess what number I think of") or inherent randomness ("guess a coin toss"). Uncertainty is captured by (conditional) probability distributions, p(y|x): for input x ∈ X, how likely is y ∈ Y the correct output? Probabilities make sense even for deterministic objects: how certain are we that y is the right answer to x? 15 / 82

27 A Probabilistic View on f : X → Y. Structured output function, X = anything, Y = anything. We need to define an auxiliary function g : X × Y → R, e.g. g(x, y) := p(y|x). Then prediction by maximization, f(x) = argmax_{y∈Y} g(x, y) = argmax_{y∈Y} p(y|x), becomes maximum a posteriori (MAP) prediction. Interpretation: the most probable one. If you have to pick a single output y ∈ Y, use the one you are most confident about. 16 / 82

29 Excursus: Probability Distributions. For all y ∈ Y: p(y) ≥ 0 (positivity); Σ_{y∈Y} p(y) = 1 (normalization). Example: binary ("Bernoulli") variable, y ∈ Y = {0, 1}: 2 values, 1 degree of freedom. 17 / 82

30 Excursus: Conditional Probability Distributions. For all x ∈ X, y ∈ Y: p(y|x) ≥ 0 (positivity); for all x ∈ X: Σ_{y∈Y} p(y|x) = 1 (normalization w.r.t. y). For example, binary prediction: X = {images}, y ∈ Y = {1, 7}. For each x: 2 values, 1 d.o.f. → one (or two) functions. 18 / 82

32 Multi-class prediction, y ∈ Y = {1,..., K}: for each x, K values, K−1 d.o.f. → K−1 functions, or 1 vector-valued function with K−1 outputs. Typically: K functions, plus explicit normalization. Example: predicting the center point of an object, y ∈ Y = {(1, 1),..., (width, height)}: for each x, |Y| = W·H values. Write y = (y_1, y_2) ∈ Y_1 × Y_2 with Y_1 = {1,..., width} and Y_2 = {1,..., height}; for each x, |Y_1 × Y_2| = W·H values. Image: Derek Magee, University of Leeds (TU Darmstadt cows dataset) 19 / 82

33 Structured objects: predicting M variables jointly, Y = {1,...,K} × {1,...,K} × ... × {1,...,K}. For each x: K^M values, K^M − 1 d.o.f. → K^M functions. Example: pose estimation, Y_part = {1,..., W} × {1,..., H} × {1,..., 360}, Y = Y_head × Y_left-arm × ... × Y_right-foot. For each x: (360·W·H)^{#body parts} values → many billions of functions. 20 / 82

35 Example: image denoising, Y = {RGB images}. For each x: (255^3)^{#pixels} values in p(y|x), i.e. over 10^{2,000,000} functions: too much! We cannot consider all possible distributions, we must impose structure. 21 / 82

38 Probabilistic Graphical Models. A (probabilistic) graphical model defines a family of probability distributions over a set of random variables, by means of a graph. Popular classes of graphical models: undirected graphical models (Markov random fields), directed graphical models (Bayesian networks), factor graphs, others (chain graphs, influence diagrams, etc.). The graph encodes conditional independence assumptions between the variables: let N(i) be the neighbors of node i in the graph (V, E); then p(y_i | y_{V∖{i}}) = p(y_i | y_{N(i)}), with y_{V∖{i}} = (y_1,..., y_{i−1}, y_{i+1},..., y_n). 22 / 82

39 Example: Pictorial Structures for Articulated Pose Estimation. [Figure: tree-shaped factor graph connecting the image X to body-part variables Y_top, Y_head, Y_torso, Y_arms, Y_hands, Y_legs, Y_feet.] All parts depend on each other: knowing where the head is puts constraints on where the feet can be. But there are conditional independences as specified by the graph: if we know where the left leg is, the left foot's position does not depend on the torso or the head position anymore, etc.: p(y_left-foot | y_top,..., y_torso,..., y_right-foot, x) = p(y_left-foot | y_left-leg, x). 23 / 82

41 Factor Graphs. Decomposable output y = (y_1,..., y_|V|). Graph G = (V, F): variable nodes V, factor nodes F; each factor F ∈ F connects a subset of nodes, write F = {v_1,..., v_|F|} and y_F = (y_{v_1},..., y_{v_|F|}). [Figure: factor graph over Y_i, Y_j, Y_k, Y_l.] The distribution factorizes into potentials ψ_F at the factors: p(y) = (1/Z) Π_{F∈F} ψ_F(y_F), where Z is a normalization constant, called the partition function: Z = Σ_{y∈Y} Π_{F∈F} ψ_F(y_F). 24 / 82
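
For very small models this factorization can be checked directly. A minimal sketch (illustrative only; the potentials and variable sizes are made up) that computes the partition function Z and a single-variable marginal for a tiny chain factor graph by brute-force enumeration:

```python
import itertools
import numpy as np

# Tiny chain factor graph Y_i - F - Y_j - G - Y_k with binary variables.
states = [0, 1]
def psi_F(yi, yj): return np.exp(0.8 if yi == yj else -0.8)
def psi_G(yj, yk): return np.exp(1.2 if yj == yk else -1.2)

def unnormalized(y):
    yi, yj, yk = y
    return psi_F(yi, yj) * psi_G(yj, yk)

# Partition function Z = sum_y prod_F psi_F(y_F)
Z = sum(unnormalized(y) for y in itertools.product(states, repeat=3))

# Joint probability and the single-variable marginal p(y_i)
def p(y): return unnormalized(y) / Z
marg_i = {s: sum(p((s, yj, yk)) for yj in states for yk in states) for s in states}
print("Z =", Z, " p(y_i):", marg_i)
```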

42 Conditional Distributions. How to model p(y|x)? The potentials also become functions of (part of) x: ψ_F(y_F; x_F) instead of just ψ_F(y_F), and p(y|x) = (1/Z(x)) Π_{F∈F} ψ_F(y_F; x_F). [Figure: factor graph with observed nodes X_i, X_j attached to output nodes Y_i, Y_j.] The partition function depends on x: Z(x) = Σ_{y∈Y} Π_{F∈F} ψ_F(y_F; x_F). Note: x is treated just as an argument, not as a random variable. → Conditional random fields (CRFs). 25 / 82

44 Conventions: Potentials and Energy Functions. Assume ψ_F(y_F) > 0. Then instead of potentials we can also work with energies: ψ_F(y_F; x_F) = exp(−E_F(y_F; x_F)), or equivalently E_F(y_F; x_F) = −log ψ_F(y_F; x_F). p(y|x) can be written as a Gibbs distribution: p(y|x) = (1/Z(x)) Π_{F∈F} ψ_F(y_F; x_F) = (1/Z(x)) Π_{F∈F} exp(−E_F(y_F; x_F)) = (1/Z(x)) exp(−E(y; x)) for E(y; x) = Σ_{F∈F} E_F(y_F; x_F). 26 / 82

46 Conventions: Energy Minimization. argmax_{y∈Y} p(y|x) = argmax_{y∈Y} (1/Z(x)) exp(−E(y; x)) = argmax_{y∈Y} exp(−E(y; x)) = argmax_{y∈Y} −E(y; x) = argmin_{y∈Y} E(y; x). MAP prediction can be performed by energy minimization. In practice, one typically just models the energy function; the probability distribution is uniquely determined by it. 27 / 82

47 Example: An Energy Function for Human Pose Estimation. [Figure: tree-shaped factor graph over the body-part variables, as before.] E(y; x) = Σ_{i∈{head, torso, ...}} E_i(y_i; x_i) + Σ_{(i,j)} E_ij(y_i, y_j). Unary factors (depend on one label) encode appearance, e.g. E_head(y; x): does location y in image x look like a head? Pairwise factors (depend on two labels) encode geometry, e.g. E_head-torso(y_head, y_torso): is location y_head above location y_torso? 28 / 82

48 Example: An Energy Function for Image Segmentation. Object segmentation, e.g. horse: input X, output Y (binary mask). Energy function components ("Ising" model): E_i(y_i = 1, x_i) = low if x_i is the right color (e.g. brown), high otherwise; E_i(y_i = 0, x_i) = −E_i(y_i = 1, x_i); E_ij(y_i, y_j) = low if y_i = y_j, high otherwise → prefer that neighbors have the same label → smooth labelings. 29 / 82
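
A minimal sketch of such an Ising-style segmentation energy (illustrative; the per-pixel "is brown" score x and the smoothness weight lam are made up), evaluating E(y; x) on a tiny 4-connected grid:

```python
import numpy as np

# x is a per-pixel score in [0, 1], y is a binary label image.
def energy(y, x, lam=2.0):
    # unary terms: label 1 is cheap where x is large, label 0 where x is small
    unary = np.where(y == 1, 1.0 - x, x).sum()
    # pairwise terms: penalize label changes between 4-neighbors
    pairwise = (y[:, :-1] != y[:, 1:]).sum() + (y[:-1, :] != y[1:, :]).sum()
    return unary + lam * pairwise

x = np.array([[0.9, 0.8, 0.2],
              [0.7, 0.6, 0.1],
              [0.2, 0.1, 0.0]])
y_smooth = (x > 0.5).astype(int)
print(energy(y_smooth, x), energy(1 - y_smooth, x))
```

Energies of exactly this form (binary labels, submodular pairwise terms) are the ones that the GraphCut construction in Part 3 of this lecture minimizes exactly.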

49 What to do with Structured Prediction Models? Case 1) p(y|x) is known. MAP Prediction: predict f : X → Y by optimization, y* = argmax_{y∈Y} p(y|x) = argmin_{y∈Y} E(y, x). Probabilistic Inference: compute marginal probabilities p(y_F|x) for any factor F, in particular p(y_i|x) for all i ∈ V. 30 / 82

50 What to do with Structured Prediction Models? [Figures: input image; the MAP pose argmax_y p(y|x); marginals p(y_i|x) for i = 1,..., 6.] MAP makes a single (structured) prediction: the best overall pose. Marginal probabilities p(y_i|x) give us potential positions and the uncertainty of the individual body parts. Images: Buffy Stickmen dataset (Ferrari et al.) 31 / 82

51 What to do with Structured Prediction Models? Case 2) p(y|x) is unknown, but we have training data. Structure Learning: learn the graph structure from training data. Variable Learning: learn whether to use additional (latent) variables, and which ones (input and output variables are fixed by the task we try to solve). Parameter Learning: assume a fixed factor graph, learn the parameters of the energy. → lecture on Wednesday 32 / 82

52 Part 2: Probabilistic Inference p(y_F|x)

53 Probabilistic Inference Overview
Exact Inference: Belief Propagation on chains; Belief Propagation on trees; junction tree algorithm.
Approximate Inference: Loopy Belief Propagation; MCMC sampling; variational inference / mean field.
34 / 82

56 Probabilistic Inference: Belief Propagation. Assume y = (y_i, y_j, y_k, y_l), Y = Y_i × Y_j × Y_k × Y_l, and an energy function E(y; x) compatible with the chain factor graph Y_i - F - Y_j - G - Y_k - H - Y_l. Task 1: for any y ∈ Y, compute p(y|x), using p(y|x) = (1/Z(x)) e^{−E(y;x)}. Problem: we don't know Z(x), and computing it using Z(x) = Σ_{y∈Y} e^{−E(y;x)} looks expensive (the sum has |Y_i|·|Y_j|·|Y_k|·|Y_l| terms). A lot of research has been done on how to compute Z(x) efficiently. 35 / 82

57 Probabilistic Inference: Belief Propagation. Chain factor graph Y_i - F - Y_j - G - Y_k - H - Y_l. For notational simplicity, we drop the dependence on the (fixed) x:
Z = Σ_{y∈Y} e^{−E(y)}
  = Σ_{y_i∈Y_i} Σ_{y_j∈Y_j} Σ_{y_k∈Y_k} Σ_{y_l∈Y_l} e^{−E(y_i, y_j, y_k, y_l)}
  = Σ_{y_i} Σ_{y_j} Σ_{y_k} Σ_{y_l} e^{−E_F(y_i,y_j) − E_G(y_j,y_k) − E_H(y_k,y_l)}
  = Σ_{y_i} Σ_{y_j} e^{−E_F(y_i,y_j)} Σ_{y_k} e^{−E_G(y_j,y_k)} Σ_{y_l} e^{−E_H(y_k,y_l)}.
Define the message r_{H→Y_k} ∈ R^{|Y_k|}, r_{H→Y_k}(y_k) := Σ_{y_l} e^{−E_H(y_k,y_l)}; then
Z = Σ_{y_i} Σ_{y_j} e^{−E_F(y_i,y_j)} Σ_{y_k} e^{−E_G(y_j,y_k)} r_{H→Y_k}(y_k).
Define r_{G→Y_j}(y_j) := Σ_{y_k} e^{−E_G(y_j,y_k)} r_{H→Y_k}(y_k); then
Z = Σ_{y_i} Σ_{y_j} e^{−E_F(y_i,y_j)} r_{G→Y_j}(y_j).
Define r_{F→Y_i}(y_i) := Σ_{y_j} e^{−E_F(y_i,y_j)} r_{G→Y_j}(y_j); then
Z = Σ_{y_i} r_{F→Y_i}(y_i). 36 / 82
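
This right-to-left message passing is a few lines of numpy. A minimal sketch (illustrative; the table sizes and energies are random) that computes Z for the chain above and checks it against brute-force enumeration; the root marginal p(y_i) comes out as a by-product:

```python
import numpy as np

# Sum-product message passing on the chain Y_i - F - Y_j - G - Y_k - H - Y_l.
rng = np.random.default_rng(0)
E_F = rng.normal(size=(3, 4))   # E_F(y_i, y_j)
E_G = rng.normal(size=(4, 2))   # E_G(y_j, y_k)
E_H = rng.normal(size=(2, 5))   # E_H(y_k, y_l)

# factor-to-variable messages, passed right to left towards Y_i
r_H_to_k = np.exp(-E_H).sum(axis=1)              # r_{H->Y_k}(y_k)
r_G_to_j = np.exp(-E_G) @ r_H_to_k               # r_{G->Y_j}(y_j)
r_F_to_i = np.exp(-E_F) @ r_G_to_j               # r_{F->Y_i}(y_i)
Z = r_F_to_i.sum()

# marginal of the root variable: p(y_i) is proportional to its incoming message
p_i = r_F_to_i / Z

# brute-force check over all joint states
Z_check = sum(np.exp(-E_F[i, j] - E_G[j, k] - E_H[k, l])
              for i in range(3) for j in range(4)
              for k in range(2) for l in range(5))
print(np.isclose(Z, Z_check), p_i)
```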

68 Example: Inference on Trees. Tree factor graph: Y_i - F - Y_j - G - Y_k, with Y_k additionally connected to Y_l through factor H and to Y_m through factor I.
1) Pick a root (here: i). 2) Sort the sums such that parent nodes are left of their children:
Z = Σ_{y∈Y} e^{−E(y)} = Σ_{y_i} Σ_{y_j} Σ_{y_k} Σ_{y_l} Σ_{y_m} e^{−E_F(y_i,y_j) − ... − E_I(y_k,y_m)}
  = Σ_{y_i} Σ_{y_j} e^{−E_F(y_i,y_j)} Σ_{y_k} e^{−E_G(y_j,y_k)} [Σ_{y_l} e^{−E_H(y_k,y_l)}] [Σ_{y_m} e^{−E_I(y_k,y_m)}]
  = Σ_{y_i} Σ_{y_j} e^{−E_F(y_i,y_j)} Σ_{y_k} e^{−E_G(y_j,y_k)} r_{H→Y_k}(y_k) r_{I→Y_k}(y_k).
3) With the variable-to-factor message q_{Y_k→G}(y_k) := r_{H→Y_k}(y_k) · r_{I→Y_k}(y_k):
Z = Σ_{y_i} Σ_{y_j} e^{−E_F(y_i,y_j)} Σ_{y_k} e^{−E_G(y_j,y_k)} q_{Y_k→G}(y_k), etc. 37 / 82

76 Factor Graph Sum-Product Algorithm. "Message": a pair of vectors at each factor graph edge (i, F) ∈ E: 1. r_{F→Y_i} ∈ R^{|Y_i|}: factor-to-variable message; 2. q_{Y_i→F} ∈ R^{|Y_i|}: variable-to-factor message. The algorithm iteratively updates the messages. After convergence, Z and p(y_F) can be obtained from the messages, e.g. p(y_i = ȳ_i) ∝ Π_{F:(i,F)∈E} r_{F→Y_i}(ȳ_i). → (Sum-Product) Belief Propagation. Easier to implement than to explain... 38 / 82

77 Example: Pictorial Structures. [Figure: tree-shaped factor graph over the body-part variables, as before.] Tree-structured model for articulated pose (Felzenszwalb and Huttenlocher, 2000), (Fischler and Elschlager, 1973). Belief propagation is the state of the art for prediction and inference. 39 / 82

78 Example: Pictorial Structures. Marginal probabilities p(y_i|x) give us potential positions and the uncertainty of the body parts. 40 / 82

80 Belief Propagation in Cyclic Graphs. Belief propagation does not work for graphs with cycles or for graphs with factors of size more than 2. We can construct equivalent chain/tree models: [figure: variables Y_i, Y_j, Y_m merged into one node Ỹ] Ỹ = (Y_i, Y_j, Y_m) with state space Ỹ = Y_i × Y_j × Y_m. General procedure: junction tree algorithm. Problem: exponentially growing state space → BP becomes inefficient. 41 / 82

82 Belief Propagation in Cyclic Graphs. What if we do belief propagation even though the graph has cycles? [Figure: grid-structured factor graph over Y_i,..., Y_q with pairwise factors A,..., L.] Problem: there is no well-defined leaf-to-root order, so where do we start? Loopy Belief Propagation (LBP): initialize all messages as constant 1; pass messages using the rules of BP until a stopping criterion is met. 42 / 82

83 Belief Propagation in Cyclic Graphs. Problems: loopy BP might not converge (e.g. it can oscillate), and even if it does, the computed probabilities are only approximate. Several improved schemes exist, some even convergent (but still approximate). Exact inference in general cyclic graphs is #P-hard. 43 / 82

84 Probabilistic Inference: Sampling / Markov Chain Monte Carlo. Task: compute marginals p(y_F|x) for general p(y|x). Idea: rephrase this as computing the expected value of a function, E_{y∼p(y|x,w)}[h(x, y)], for some (well-behaved) function h : X × Y → R. For probabilistic inference this step is easy: set h_{F,z}(x, y) := ⟦y_F = z⟧. Then E_{y∼p(y|x,w)}[h_{F,z}(x, y)] = Σ_{y∈Y} p(y|x) ⟦y_F = z⟧ = Σ_{y_F∈Y_F} p(y_F|x) ⟦y_F = z⟧ = p(y_F = z | x). 44 / 82

86 Probabilistic Inference: Sampling / Markov Chain Monte Carlo. Expectations can be computed/approximated by sampling: for fixed x, let y^(1), y^(2), ... be i.i.d. samples from p(y|x); then E_{y∼p(y|x)}[h(x, y)] ≈ (1/S) Σ_{s=1}^{S} h(x, y^(s)). The law of large numbers guarantees convergence for S → ∞. For S independent samples the approximation error is O(1/√S), independent of the size of Y. Problem: producing i.i.d. samples y^(s) from p(y|x) is hard. Solution: we can get away with a sequence of dependent samples → Markov chain Monte Carlo (MCMC) sampling. 45 / 82

87 Probabilistic Inference: Sampling / Markov Chain Monte Carlo. One example of how to do MCMC sampling: the Gibbs sampler. Initialize y^(1) = (y_1,..., y_d) arbitrarily. For s = 1,..., S: 1. select an index i; 2. re-sample ȳ_i ∼ p(y_i | y^(s)_{V∖{i}}, x); 3. output the sample y^(s+1) = (y^(s)_1,..., y^(s)_{i−1}, ȳ_i, y^(s)_{i+1},..., y^(s)_d). The required conditional is cheap, because Z(x) cancels: p(y_i | y^(s)_{V∖{i}}, x) = p(y_i, y^(s)_{V∖{i}} | x) / Σ_{ȳ_i∈Y_i} p(ȳ_i, y^(s)_{V∖{i}} | x) = e^{−E(y_i, y^(s)_{V∖{i}}, x)} / Σ_{ȳ_i∈Y_i} e^{−E(ȳ_i, y^(s)_{V∖{i}}, x)}. 46 / 82
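
A minimal sketch of this sampler (illustrative; the toy energy, state space, and chain length are made up) that re-samples one variable at a time and uses the visit counts after a burn-in phase to estimate the marginals p(y_i | x):

```python
import numpy as np

def gibbs_marginals(energy, x, d, states, S=5000, burn_in=500, seed=0):
    rng = np.random.default_rng(seed)
    y = np.array([rng.choice(states) for _ in range(d)])
    counts = np.zeros((d, len(states)))
    for s in range(S):
        i = rng.integers(d)                       # 1. select an index
        # 2. re-sample y_i from its conditional p(y_i | y_{V\{i}}, x)
        e = np.array([energy(np.r_[y[:i], v, y[i + 1:]], x) for v in states])
        p = np.exp(-(e - e.min())); p /= p.sum()  # normalize, numerically stabilized
        y[i] = rng.choice(states, p=p)
        if s >= burn_in:                          # 3. record the sample
            for j in range(d):
                counts[j, states.index(y[j])] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# toy chain energy: unaries plus an agreement penalty between neighbors
def E(y, x):
    return float(np.sum(np.where(y == 1, 1 - x, x)) + 0.5 * np.sum(y[:-1] != y[1:]))

x = np.array([0.9, 0.8, 0.2, 0.1])
print(gibbs_marginals(E, x, d=4, states=[0, 1]))
```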

88 Probabilistic Inference: Variational Inference / Mean Field. Task: compute marginals p(y_F|x) for general p(y|x). Idea: approximate p(y|x) by a simpler q(y) and use the marginals from that: q* = argmin_{q∈Q} D_KL(q(y) ‖ p(y|x)), then p(y_F|x) ≈ q*(y_F). [Figures: original model; tree approximation; product of unary factors q_e,..., q_m.] 47 / 82

89 Probabilistic Inference: Variational Inference / Mean Field. Special case: (naive) mean field for p(y|x) = (1/Z(x)) e^{−E(y,x)}: approximate p(y|x) ≈ q(y) = Π_{i∈V} q_i(y_i). There is no closed-form expression for q*, but an optimality condition: q*_i(y_i) ∝ exp(−E_{y∖{y_i} ∼ Q}[E(y, x)]) for Q(y_1,..., y_{i−1}, y_{i+1},..., y_n) = Π_{j≠i} q*_j(y_j). Iterative scheme: initialize the q_i (e.g. uniform); repeat until convergence: for i ∈ V in any order, update q_i while keeping the others fixed. 48 / 82
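
A minimal sketch of these naive mean-field updates (illustrative; the chain model, the energy tables U and P_right, and the iteration count are all made up) for a pairwise energy on a chain, where the expectation in the update rule reduces to matrix-vector products with the neighbors' current q:

```python
import numpy as np

# Naive mean field q(y) = prod_i q_i(y_i) for a chain energy
# E(y) = sum_i U[i][y_i] + sum_i P_right[i][y_i, y_{i+1}].
# Update: q_i(k) proportional to exp(-U[i][k] - sum over neighbors of E_q[pairwise]).
def mean_field(U, P_right, n_iters=50):
    n, K = U.shape
    q = np.full((n, K), 1.0 / K)                 # initialize uniformly
    for _ in range(n_iters):
        for i in range(n):                       # update each q_i in turn
            log_q = -U[i].copy()
            if i + 1 < n:                        # expected pairwise energy to the right
                log_q -= P_right[i] @ q[i + 1]
            if i > 0:                            # ... and to the left
                log_q -= q[i - 1] @ P_right[i - 1]
            log_q -= log_q.max()
            q[i] = np.exp(log_q) / np.exp(log_q).sum()
    return q

n, K = 5, 3
rng = np.random.default_rng(1)
U = rng.normal(size=(n, K))                            # unary energies
P_right = np.stack([2.0 * (1 - np.eye(K))] * (n - 1))  # Potts-like pairwise energies
print(mean_field(U, P_right).round(3))                 # approximate marginals q_i
```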

90 Example: State-of-the-Art in Image Segmentation Image: [Chen et al., Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs, ICLR 2015] see also [Krähenbühl, Koltun. Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials, NIPS 2011] 49 / 82

92 Probabilistic Inference Summary. Task: compute (marginal) probabilities p(y_F|x). Exact probabilistic inference is only possible for certain models: trees/forests (sum-product belief propagation); general graphs (junction tree algorithm, if tractable). Approximate probabilistic inference: many techniques with different properties and guarantees: loopy belief propagation, MCMC sampling (e.g. Gibbs sampling), variational inference (e.g. mean field), ... The best choice depends on the model and the requirements. 50 / 82

93 Part 3: MAP Prediction / Energy Minimization argmax_y p(y|x) / argmin_y E(y, x)

94 MAP Prediction / Energy Minimization
Exact Energy Minimization: Belief Propagation on chains/trees; Graph-Cuts for submodular energies; Integer Linear Programming.
Approximate Energy Minimization: Linear Programming relaxations; local search methods (Iterated Conditional Modes, multi-label Graph Cuts, simulated annealing).
52 / 82

95 Example: Pictorial Structures / Deformable Parts Model. [Figure: tree-shaped factor graph over the body-part variables, as before.] Tree-structured model for articulated pose (Felzenszwalb and Huttenlocher, 2000), (Fischler and Elschlager, 1973), (Yang and Ramanan, 2013), (Pishchulin et al., 2012). 53 / 82

96 Example: Pictorial Structures / Deformable Parts Model. Most likely configuration: y* = argmax_{y∈Y} p(y|x) = argmin_{y∈Y} E(y, x). 54 / 82

97 Energy Minimization: Belief Propagation. Chain model: the same trick as for probabilistic inference, with sums replaced by minimizations (chain factor graph Y_i - F - Y_j - G - Y_k - H - Y_l):
min_y E(y) = min_{y_i,y_j,y_k,y_l} E_F(y_i,y_j) + E_G(y_j,y_k) + E_H(y_k,y_l)
  = min_{y_i,y_j} [ E_F(y_i,y_j) + min_{y_k} [ E_G(y_j,y_k) + min_{y_l} E_H(y_k,y_l) ] ].
Define r_{H→Y_k} ∈ R^{|Y_k|}, r_{H→Y_k}(y_k) := min_{y_l} E_H(y_k,y_l); then
min_y E(y) = min_{y_i,y_j} [ E_F(y_i,y_j) + min_{y_k} [ E_G(y_j,y_k) + r_{H→Y_k}(y_k) ] ].
Define r_{G→Y_j}(y_j) := min_{y_k} [ E_G(y_j,y_k) + r_{H→Y_k}(y_k) ]; then
min_y E(y) = min_{y_i,y_j} [ E_F(y_i,y_j) + r_{G→Y_j}(y_j) ], and so on.
The actual minimizer is recovered by backtracking which choices were optimal. 55 / 82
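
A minimal sketch of this min-sum recursion (illustrative; the chain length, state count, and energy tables are random) that passes messages from right to left and then recovers the minimizing labeling by backtracking:

```python
import numpy as np

# Min-sum dynamic programming on a chain with unary tables U[t] (shape (K,))
# and pairwise tables E[t] of shape (K, K) between positions t and t+1.
def chain_map(U, E):
    n, K = U.shape
    msg = np.zeros((n, K))        # msg[t, k] = min cost of suffix t..n-1 given y_t = k
    argmin_next = np.zeros((n, K), dtype=int)
    msg[n - 1] = U[n - 1]
    for t in range(n - 2, -1, -1):                    # pass messages right to left
        cand = E[t] + msg[t + 1][None, :]             # cand[k, l] = E_t(k, l) + msg_{t+1}(l)
        argmin_next[t] = cand.argmin(axis=1)
        msg[t] = U[t] + cand.min(axis=1)
    y = [int(msg[0].argmin())]                        # best state of the root ...
    for t in range(n - 1):                            # ... then backtrack
        y.append(int(argmin_next[t][y[-1]]))
    return float(msg[0].min()), y

rng = np.random.default_rng(0)
n, K = 6, 4
U = rng.normal(size=(n, K))
E = rng.normal(size=(n - 1, K, K))
print(chain_map(U, E))
```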

104 Energy Minimization: Belief Propagation. Tree models: as in the sum-product case, messages from different subtrees are combined at the variable nodes, e.g. r_{H→Y_k}(y_k) = min_{y_l} E_H(y_k, y_l), r_{I→Y_k}(y_k) = min_{y_m} E_I(y_k, y_m), q_{Y_k→G}(y_k) = r_{H→Y_k}(y_k) + r_{I→Y_k}(y_k). → min-sum belief propagation (the equivalent max-sum form is more common). 56 / 82

105 Belief Propagation in Cyclic Graphs. [Figure: grid-structured factor graph, as before.] Loopy max-sum belief propagation has the same problem as in probabilistic inference: no guarantee of convergence, no guarantee of optimality. Some convergent variants exist, e.g. TRW-S [Kolmogorov, PAMI 2006]. 57 / 82

106 Cyclic Graphs. In general, MAP prediction/energy minimization in models with cycles or higher-order terms is intractable (NP-hard). Some important exceptions: low tree-width [Lauritzen, Spiegelhalter, 1988]; binary states, pairwise submodular interactions [Boykov, Jolly, 2001]; binary states, only pairwise interactions, planar graph [Globerson, Jaakkola, 2006]; special (Potts P^n) higher-order factors [Kohli, Kumar, 2007]; perfect graph structure [Jebara, 2009]. 58 / 82

108 Submodular Energy Functions. Binary variables: Y_i = {0, 1} for all i ∈ V. Energy function with unary and pairwise factors: E(y; x, w) = Σ_{i∈V} E_i(y_i) + Σ_{(i,j)∈E} E_ij(y_i, y_j). Restriction 1 (without loss of generality): E_i(y_i) ≥ 0 (always achievable by adding a constant to E). Restriction 2 (submodularity): E_ij(y_i, y_j) = 0 if y_i = y_j, and E_ij(y_i, y_j) = E_ij(y_j, y_i) ≥ 0 otherwise → neighbors prefer to have the same labels. 59 / 82

109 Graph-Cuts Algorithm for Submodular Energy Minimization [Greig et al., 1989]. If the conditions are fulfilled, energy minimization can be performed by solving an s-t-mincut problem. Construct an auxiliary undirected graph: one node per variable i ∈ V, plus two extra nodes, source s and sink t; weighted edges with weights: edge {i, j}: E_ij(y_i = 0, y_j = 1); edge {i, s}: E_i(y_i = 1); edge {i, t}: E_i(y_i = 0). Find the s-t-cut of minimal weight (polynomial time, using the max-flow/min-cut theorem). From the minimal-weight cut we recover a labeling of minimal energy: y_i = 1 if edge {i, s} is cut, otherwise y_i = 0. 60 / 82
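
A minimal sketch of this construction (illustrative; the unary and pairwise values are made up, and networkx's general-purpose minimum_cut is used in place of a specialized max-flow solver) for a Potts-type binary submodular energy:

```python
import networkx as nx

# Minimize E(y) = sum_i E_i(y_i) + sum_{ij} w_ij [y_i != y_j] via an s-t min cut,
# following the construction on the slide.
def graphcut_minimize(unary, pairwise):
    # unary: dict i -> (E_i(0), E_i(1)); pairwise: dict (i, j) -> w_ij >= 0
    G = nx.DiGraph()
    for i, (e0, e1) in unary.items():
        G.add_edge("s", i, capacity=e1)   # cut if i ends on the sink side (y_i = 1)
        G.add_edge(i, "t", capacity=e0)   # cut if i stays on the source side (y_i = 0)
    for (i, j), w in pairwise.items():
        G.add_edge(i, j, capacity=w)      # undirected penalty: add both directions
        G.add_edge(j, i, capacity=w)
    cut_value, (source_side, sink_side) = nx.minimum_cut(G, "s", "t")
    return cut_value, {i: (1 if i in sink_side else 0) for i in unary}

# toy 1D example: three pixels, the middle unary is ambiguous, smoothing decides
unary = {0: (0.2, 0.8), 1: (0.5, 0.5), 2: (0.9, 0.1)}
pairwise = {(0, 1): 0.3, (1, 2): 0.3}
print(graphcut_minimize(unary, pairwise))
```

For image-sized problems one would use a dedicated max-flow implementation instead; the graph construction itself stays the same.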

110 Integer Linear Programming (ILP). General energy E(y) = Σ_F E_F(y_F): variables with more than 2 states, higher-order factors (more than 2 variables), non-submodular factors. Formulate it as an integer linear program (ILP): linear objective function, linear constraints, and the variables to optimize over are integer-valued. ILPs are in general NP-hard, but some individual instances can be solved. Standard toolboxes: e.g. CPLEX, Gurobi, COIN-OR, ... 61 / 82

112 Integer Linear Programming (ILP). Example: two variables with y_1 = 2 and y_2 = 3, i.e. (y_1, y_2) = (2, 3). Encode the assignment in indicator variables: z_1 ∈ {0,1}^{|Y_1|} with z_{1;k} = 1 ⟺ y_1 = k; z_2 ∈ {0,1}^{|Y_2|} with z_{2;l} = 1 ⟺ y_2 = l; z_12 ∈ {0,1}^{|Y_1 × Y_2|} with z_{12;kl} = 1 ⟺ (y_1 = k and y_2 = l). Constraints: Σ_{k∈Y_1} z_{1;k} = 1, Σ_{l∈Y_2} z_{2;l} = 1, Σ_{(k,l)∈Y_1×Y_2} z_{12;kl} = 1 (indicator property); Σ_{k∈Y_1} z_{12;kl} = z_{2;l} and Σ_{l∈Y_2} z_{12;kl} = z_{1;k} (consistency). 62 / 82

113 Integer Linear Programming (ILP). Example: E(y_1, y_2) = E_1(y_1) + E_12(y_1, y_2) + E_2(y_2). Define coefficient vectors: θ_1 ∈ R^{|Y_1|} with θ_{1;k} = E_1(k); θ_2 ∈ R^{|Y_2|} with θ_{2;l} = E_2(l); θ_12 ∈ R^{|Y_1 × Y_2|} with θ_{12;kl} = E_12(k, l). The energy is a linear function of the unknown z: E(y_1, y_2) = Σ_{i∈V} Σ_{k∈Y_i} θ_{i;k} ⟦y_i = k⟧ + Σ_{(i,j)∈E} Σ_{(k,l)∈Y_i×Y_j} θ_{ij;kl} ⟦y_i = k⟧⟦y_j = l⟧ = Σ_{i∈V} Σ_{k∈Y_i} θ_{i;k} z_{i;k} + Σ_{(i,j)∈E} Σ_{(k,l)∈Y_i×Y_j} θ_{ij;kl} z_{ij;kl}. 63 / 82

115 Integer Linear Programming (ILP).
min_z Σ_{i∈V} Σ_{k∈Y_i} θ_{i;k} z_{i;k} + Σ_{(i,j)∈E} Σ_{(k,l)∈Y_i×Y_j} θ_{ij;kl} z_{ij;kl}
subject to
z_{i;k} ∈ {0, 1} for all i ∈ V, k ∈ Y_i,
z_{ij;kl} ∈ {0, 1} for all (i,j) ∈ E, (k,l) ∈ Y_i × Y_j,
Σ_{k∈Y_i} z_{i;k} = 1 for all i ∈ V,
Σ_{(k,l)∈Y_i×Y_j} z_{ij;kl} = 1 for all (i,j) ∈ E,
Σ_{k∈Y_i} z_{ij;kl} = z_{j;l} for all (i,j) ∈ E, l ∈ Y_j,
Σ_{l∈Y_j} z_{ij;kl} = z_{i;k} for all (i,j) ∈ E, k ∈ Y_i.
NP-hard to solve because of the integrality constraints. 64 / 82

116 Linear Programming (LP) Relaxation. Same objective and constraints as the ILP, but with the integrality constraints relaxed: z_{i;k} ∈ {0, 1} → z_{i;k} ∈ [0, 1] for all i ∈ V, k ∈ Y_i, and z_{ij;kl} ∈ {0, 1} → z_{ij;kl} ∈ [0, 1] for all (i,j) ∈ E, (k,l) ∈ Y_i × Y_j; all normalization and consistency constraints are kept. Relaxing the constraints yields a tractable optimization problem. 65 / 82
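
A minimal sketch of this relaxation for a single pair of variables (illustrative; the state counts, the random energies, and the use of scipy's linprog are choices made for the example, not part of the lecture):

```python
import numpy as np
from scipy.optimize import linprog

K1, K2 = 3, 4
rng = np.random.default_rng(0)
theta1, theta2 = rng.normal(size=K1), rng.normal(size=K2)
theta12 = rng.normal(size=(K1, K2))

n = K1 + K2 + K1 * K2                       # z = (z1, z2, z12.flatten())
c = np.concatenate([theta1, theta2, theta12.ravel()])
A_eq, b_eq = [], []

row = np.zeros(n); row[:K1] = 1             # sum_k z1;k = 1
A_eq.append(row); b_eq.append(1.0)
row = np.zeros(n); row[K1:K1 + K2] = 1      # sum_l z2;l = 1
A_eq.append(row); b_eq.append(1.0)
for l in range(K2):                         # sum_k z12;kl = z2;l
    row = np.zeros(n)
    row[K1 + K2 + l::K2] = 1                # entries z12[k, l] for all k
    row[K1 + l] = -1
    A_eq.append(row); b_eq.append(0.0)
for k in range(K1):                         # sum_l z12;kl = z1;k
    row = np.zeros(n)
    row[K1 + K2 + k * K2: K1 + K2 + (k + 1) * K2] = 1
    row[k] = -1
    A_eq.append(row); b_eq.append(0.0)

res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, 1))
z1, z2 = res.x[:K1].round(3), res.x[K1:K1 + K2].round(3)
print(res.fun, z1, z2)
```

For this tree-structured toy problem the relaxation happens to return an integral solution; on cyclic models it generally does not, which is the rounding issue discussed on the next slide.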

117 Linear Programming (LP) Relaxation. The solution z^LP might have fractional values → no corresponding labeling y ∈ Y → round the LP solution to {0, 1} values. Problem: the rounded solution is usually not optimal, i.e. not identical to the ILP solution. → LP relaxations perform approximate energy minimization. 66 / 82

118 Linear Programming (LP) Relaxation Example: color quantization Example: stereo reconstruction Images: Berkeley Segmentation Dataset 67 / 82

123 Local Search. Avoid getting fractional solutions: energy minimization by local search. [Figure: a sequence of labelings y^0, y^1, y^2, ... and their neighborhoods N(y^0), N(y^1), ... inside Y.] Choose a starting labeling y^0; construct a neighborhood N(y^0) ⊂ Y of labelings; find the minimizer within the neighborhood, y^1 = argmin_{y∈N(y^0)} E(y); iterate until no more changes occur. 68 / 82

126 Iterated Conditional Modes (ICM) [Besag, 1986]. Define local neighborhoods: N_i(y) = {(y_1,..., y_{i−1}, ȳ, y_{i+1},..., y_n) | ȳ ∈ Y_i} for i ∈ V: all labelings reachable from y by changing the value of y_i. ICM procedure: neighborhood N(y) = ∪_{i∈V} N_i(y), all states reachable from y by changing a single variable; y^{t+1} = argmin_{y∈N(y^t)} E(y) by exhaustive search (Σ_i |Y_i| evaluations). Observation: larger neighborhood sizes are better. ICM: |N(y)| is linear in |V| → many iterations are needed to explore the exponentially large Y. Ideal: |N(y)| exponential in |V|, but we must ensure that argmin_{y∈N(y)} E(y) remains tractable. 69 / 82
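
A minimal sketch of ICM (illustrative; the toy chain energy and its numbers are made up), sweeping over the variables and re-optimizing one at a time until no variable changes:

```python
import numpy as np

def icm(energy, states, y_init):
    y = list(y_init)
    changed = True
    while changed:
        changed = False
        for i in range(len(y)):
            # exhaustively try all states of variable i, others fixed
            costs = [energy(y[:i] + [v] + y[i + 1:]) for v in states]
            best = states[int(np.argmin(costs))]
            if best != y[i]:
                y[i], changed = best, True
    return y, energy(y)

# toy chain energy: unaries plus a penalty for label changes
x = np.array([0.9, 0.8, 0.4, 0.1, 0.2])
def E(y):
    y = np.array(y)
    return float(np.sum(np.where(y == 1, 1 - x, x)) + 0.4 * np.sum(y[:-1] != y[1:]))

print(icm(E, states=[0, 1], y_init=[0] * 5))
```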

128 Multilabel Graph-Cut: α-expansion. E(y) with unary and pairwise terms, Y_i = L = {1,..., K} for all i ∈ V (multi-class). Example: semantic segmentation. 70 / 82

129 Multilabel Graph-Cut: α-expansion. E(y) with unary and pairwise terms, Y_i = L = {1,..., K} for all i ∈ V (multi-class). Algorithm: initialize y^0 arbitrarily (e.g. everything label 0); repeat: for each α ∈ L, construct the neighborhood N(y) = {(ȳ_1,..., ȳ_|V|) : ȳ_i ∈ {y_i, α}} (each variable can keep its value or switch to α) and solve y ← argmin_{ȳ∈N(y)} E(ȳ); until y has not changed for a whole iteration. 70 / 82

130 Multilabel Graph-Cut: α-expansion. Theorem [Boykov et al. 2001]: if all pairwise terms are metrics, i.e. for all (i,j) ∈ E: E_ij(k, l) ≥ 0 with E_ij(k, l) = 0 ⟺ k = l, E_ij(k, l) = E_ij(l, k), and E_ij(k, l) ≤ E_ij(k, m) + E_ij(m, l) for all k, l, m, then argmin_{y∈N(y)} E(y) can be solved optimally using GraphCut. Theorem [Veksler 2001]: the solution y^α returned by α-expansion fulfills E(y^α) ≤ 2c · min_{y∈Y} E(y) for c = max_{(i,j)∈E} [ max_{k≠l} E_ij(k, l) / min_{k≠l} E_ij(k, l) ]. 71 / 82

131 Example: Semantic Segmentation. E(y) = Σ_{i∈V} E_i(y_i) + λ Σ_{(i,j)∈E} ⟦y_i ≠ y_j⟧ (Potts model). The pairwise terms satisfy E_ij(k, l) ≥ 0, E_ij(k, l) = 0 ⟺ k = l, E_ij(k, l) = E_ij(l, k), and E_ij(k, l) ≤ E_ij(k, m) + E_ij(m, l); here c = max_{(i,j)∈E} [ max_{k≠l} E_ij(k, l) / min_{k≠l} E_ij(k, l) ] = 1, so we get a factor-2 approximation guarantee: E(y^α) ≤ 2 min_{y∈Y} E(y). 72 / 82

132 Example: Stereo Estimation. E(y) = Σ_{i∈V} E_i(y_i) + λ Σ_{(i,j)∈E} |y_i − y_j|. |y_i − y_j| is a metric; c = max_{(i,j)∈E} [ max_{k≠l} E_ij(k, l) / min_{k≠l} E_ij(k, l) ] = |L| − 1 → weak guarantees, but often close-to-optimal labelings in practice. Images: Middlebury stereo vision dataset 73 / 82

133 MAP Prediction / Energy Minimization Summary. Task: compute argmin_{y∈Y} E(y, x). Exact energy minimization is only possible for certain models: trees/forests (max-sum belief propagation); general graphs (junction tree algorithm, if tractable); submodular energies (GraphCut); general graphs (integer linear programming, if tractable). Approximate energy minimization: many techniques with different properties and guarantees: linear programming relaxations, ICM, α-expansion, ... The best choice depends on the model and the requirements. 74 / 82

134 Part 4: Structured Loss Functions Δ(ȳ, y)

135 Loss function. How to judge if a (structured) prediction is good? Define a loss function Δ : Y × Y → R_+; Δ(ȳ, y) measures the loss incurred by predicting y when ȳ is correct. The loss function is application dependent. 76 / 82

136 Example 1: 0/1 loss. The loss is 0 for a perfect prediction, 1 otherwise: Δ_{0/1}(ȳ, y) = ⟦ȳ ≠ y⟧, i.e. 0 if ȳ = y and 1 otherwise. Every mistake is equally bad. Usually not very useful in structured prediction. 77 / 82

137 Example 2: Hamming loss. Count the number of mislabeled variables: Δ_H(ȳ, y) = (1/|V|) Σ_{i∈V} ⟦ȳ_i ≠ y_i⟧. Used, e.g., for graph labeling tasks. 78 / 82

138 Example 3: Squared error. If we can add elements in Y_i (pixel intensities, optical flow vectors, etc.): sum of squared errors, Δ_Q(ȳ, y) = (1/|V|) Σ_{i∈V} ‖ȳ_i − y_i‖². Used, e.g., in stereo reconstruction and part-based object detection. 79 / 82

139 Example 4: Task-specific losses. Object detection: bounding boxes, or arbitrarily shaped regions. [Figures: detection, ground truth, image.] Intersection-over-union loss: Δ_IoU(ȳ, y) = 1 − area(ȳ ∩ y) / area(ȳ ∪ y). Used, e.g., in the PASCAL VOC challenges for object detection, because of its scale invariance (no bias for or against big objects). 80 / 82
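
A minimal sketch of these loss functions in code (illustrative; the label vectors and box coordinates are made up, and boxes are assumed axis-aligned in (x1, y1, x2, y2) format):

```python
import numpy as np

def loss_01(y_true, y_pred):
    return float(not np.array_equal(y_true, y_pred))

def loss_hamming(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

def loss_squared(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.sum((y_true - y_pred) ** 2, axis=-1)))

def loss_iou(box_true, box_pred):
    ix = max(0.0, min(box_true[2], box_pred[2]) - max(box_true[0], box_pred[0]))
    iy = max(0.0, min(box_true[3], box_pred[3]) - max(box_true[1], box_pred[1]))
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return 1.0 - inter / (area(box_true) + area(box_pred) - inter)

print(loss_01([1, 2, 3], [1, 2, 4]),
      loss_hamming([1, 2, 3], [1, 2, 4]),
      loss_squared([[0, 0], [1, 1]], [[0, 1], [1, 1]]),
      loss_iou((0, 0, 10, 10), (5, 0, 15, 10)))
```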

140 Making Bayes-Optimal Predictions. Given a distribution p(y|x), what is the best way to predict f : X → Y? Bayesian decision theory: pick the f(x) that causes minimal expected loss: f(x) = argmin_{y∈Y} R^Δ(y) for R^Δ(y) = E_{ȳ∼p(ȳ|x)}[Δ(ȳ, y)] = Σ_{ȳ∈Y} Δ(ȳ, y) p(ȳ|x). For many loss functions this is not tractable, but there are exceptions: R^{Δ_{0/1}}(y) = 1 − p(y|x), so f(x) = argmax_y p(y|x); R^{Δ_H}(y) = 1 − (1/|V|) Σ_{i∈V} p(y_i|x), so f(x) = (y_1,..., y_n) with y_i = argmax_{k∈Y_i} p(y_i = k|x). 81 / 82
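
A minimal sketch of this prediction rule for a toy posterior that can be enumerated explicitly (illustrative; the distribution and losses are made up). With the 0/1 loss it returns the MAP labeling, and with the Hamming loss it returns the labeling built from the per-variable marginal maximizers, matching the two special cases above:

```python
import itertools
import numpy as np

# Bayes-optimal prediction: f(x) = argmin_y sum_ybar delta(ybar, y) * p(ybar | x),
# with the argmin taken over all labelings of the given state space.
def bayes_predict(p, delta, states=(0, 1)):
    n = len(next(iter(p)))
    candidates = itertools.product(states, repeat=n)
    risks = {y: sum(delta(ybar, y) * q for ybar, q in p.items()) for y in candidates}
    return min(risks, key=risks.get)

def hamming(ybar, y):
    return np.mean(np.array(ybar) != np.array(y))

def zero_one(ybar, y):
    return float(ybar != y)

# toy posterior over three binary labelings
p = {(0, 0, 1): 0.4, (0, 1, 1): 0.35, (1, 1, 1): 0.25}
print(bayes_predict(p, zero_one))   # MAP labeling
print(bayes_predict(p, hamming))    # per-variable marginal maximizer
```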

141 Summary. Probabilistic Inference: compute p(y_F|x) for a subset F of variables, in particular p(y_i|x); some exact and many approximate techniques. MAP Prediction / Energy Minimization: identify the y ∈ Y that maximizes p(y|x) (minimizes E(y, x)); some exact and many approximate techniques. Loss Function: measure the quality of a prediction by a loss function Δ : Y × Y → R; task specific: 0/1 loss, Hamming loss, intersection over union, ... Optimal prediction rule: f(x) = argmin_{y∈Y} E_{ȳ∼p(ȳ|x)}[Δ(ȳ, y)]. 82 / 82


More information

Probabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April

Probabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April Probabilistic Graphical Models Guest Lecture by Narges Razavian Machine Learning Class April 14 2017 Today What is probabilistic graphical model and why it is useful? Bayesian Networks Basic Inference

More information

Sequential Supervised Learning

Sequential Supervised Learning Sequential Supervised Learning Many Application Problems Require Sequential Learning Part-of of-speech Tagging Information Extraction from the Web Text-to to-speech Mapping Part-of of-speech Tagging Given

More information

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

CS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning

CS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning CS 446 Machine Learning Fall 206 Nov 0, 206 Bayesian Learning Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes Overview Bayesian Learning Naive Bayes Logistic Regression Bayesian Learning So far, we

More information

Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

More information

Probabilistic Graphical Networks: Definitions and Basic Results

Probabilistic Graphical Networks: Definitions and Basic Results This document gives a cursory overview of Probabilistic Graphical Networks. The material has been gleaned from different sources. I make no claim to original authorship of this material. Bayesian Graphical

More information

CSC 412 (Lecture 4): Undirected Graphical Models

CSC 412 (Lecture 4): Undirected Graphical Models CSC 412 (Lecture 4): Undirected Graphical Models Raquel Urtasun University of Toronto Feb 2, 2016 R Urtasun (UofT) CSC 412 Feb 2, 2016 1 / 37 Today Undirected Graphical Models: Semantics of the graph:

More information

Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning

Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning Introduction to the Tensor Train Decomposition and Its Applications in Machine Learning Anton Rodomanov Higher School of Economics, Russia Bayesian methods research group (http://bayesgroup.ru) 14 March

More information

CS Lecture 4. Markov Random Fields

CS Lecture 4. Markov Random Fields CS 6347 Lecture 4 Markov Random Fields Recap Announcements First homework is available on elearning Reminder: Office hours Tuesday from 10am-11am Last Time Bayesian networks Today Markov random fields

More information

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

CS6375: Machine Learning Gautam Kunapuli. Decision Trees Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

More information

CSCE 478/878 Lecture 6: Bayesian Learning

CSCE 478/878 Lecture 6: Bayesian Learning Bayesian Methods Not all hypotheses are created equal (even if they are all consistent with the training data) Outline CSCE 478/878 Lecture 6: Bayesian Learning Stephen D. Scott (Adapted from Tom Mitchell

More information

End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation

End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation Wei Yang, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang Department of Electronic Engineering,

More information

Statistical Problems in Computer Vision

Statistical Problems in Computer Vision V and ML Statistical Problems in omputer Vision Machine Learning and Perception Group Microsoft Research ambridge SI Meeting, ambridge, 26th September 2011 Statistical Problems in omputer Vision Microsoft

More information

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Approximate inference, Sampling & Variational inference Fall Cours 9 November 25

Approximate inference, Sampling & Variational inference Fall Cours 9 November 25 Approimate inference, Sampling & Variational inference Fall 2015 Cours 9 November 25 Enseignant: Guillaume Obozinski Scribe: Basile Clément, Nathan de Lara 9.1 Approimate inference with MCMC 9.1.1 Gibbs

More information

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

More information

Loopy Belief Propagation for Bipartite Maximum Weight b-matching

Loopy Belief Propagation for Bipartite Maximum Weight b-matching Loopy Belief Propagation for Bipartite Maximum Weight b-matching Bert Huang and Tony Jebara Computer Science Department Columbia University New York, NY 10027 Outline 1. Bipartite Weighted b-matching 2.

More information

Markov Networks. l Like Bayes Nets. l Graphical model that describes joint probability distribution using tables (AKA potentials)

Markov Networks. l Like Bayes Nets. l Graphical model that describes joint probability distribution using tables (AKA potentials) Markov Networks l Like Bayes Nets l Graphical model that describes joint probability distribution using tables (AKA potentials) l Nodes are random variables l Labels are outcomes over the variables Markov

More information

UNDERSTANDING BELIEF PROPOGATION AND ITS GENERALIZATIONS

UNDERSTANDING BELIEF PROPOGATION AND ITS GENERALIZATIONS UNDERSTANDING BELIEF PROPOGATION AND ITS GENERALIZATIONS JONATHAN YEDIDIA, WILLIAM FREEMAN, YAIR WEISS 2001 MERL TECH REPORT Kristin Branson and Ian Fasel June 11, 2003 1. Inference Inference problems

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory

More information

Random Field Models for Applications in Computer Vision

Random Field Models for Applications in Computer Vision Random Field Models for Applications in Computer Vision Nazre Batool Post-doctorate Fellow, Team AYIN, INRIA Sophia Antipolis Outline Graphical Models Generative vs. Discriminative Classifiers Markov Random

More information

Machine Learning for Data Science (CS4786) Lecture 24

Machine Learning for Data Science (CS4786) Lecture 24 Machine Learning for Data Science (CS4786) Lecture 24 Graphical Models: Approximate Inference Course Webpage : http://www.cs.cornell.edu/courses/cs4786/2016sp/ BELIEF PROPAGATION OR MESSAGE PASSING Each

More information

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

Bayesian Networks Inference with Probabilistic Graphical Models

Bayesian Networks Inference with Probabilistic Graphical Models 4190.408 2016-Spring Bayesian Networks Inference with Probabilistic Graphical Models Byoung-Tak Zhang intelligence Lab Seoul National University 4190.408 Artificial (2016-Spring) 1 Machine Learning? Learning

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 9 Undirected Models CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due next Wednesday (Nov 4) in class Start early!!! Project milestones due Monday (Nov 9)

More information

Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory

More information

CS 188: Artificial Intelligence Spring Announcements

CS 188: Artificial Intelligence Spring Announcements CS 188: Artificial Intelligence Spring 2010 Lecture 22: Nearest Neighbors, Kernels 4/18/2011 Pieter Abbeel UC Berkeley Slides adapted from Dan Klein Announcements On-going: contest (optional and FUN!)

More information

Course 16:198:520: Introduction To Artificial Intelligence Lecture 9. Markov Networks. Abdeslam Boularias. Monday, October 14, 2015

Course 16:198:520: Introduction To Artificial Intelligence Lecture 9. Markov Networks. Abdeslam Boularias. Monday, October 14, 2015 Course 16:198:520: Introduction To Artificial Intelligence Lecture 9 Markov Networks Abdeslam Boularias Monday, October 14, 2015 1 / 58 Overview Bayesian networks, presented in the previous lecture, are

More information

Lecture 13 : Variational Inference: Mean Field Approximation

Lecture 13 : Variational Inference: Mean Field Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1

More information

Probabilistic Graphical Models in Computer Vision (IN2329)

Probabilistic Graphical Models in Computer Vision (IN2329) Probabilistic Graphical Models in Computer Vision (IN2329) Csaba Domokos Summer Semester 2017 9. Human pose estimation & Mean field approximation......................................................................

More information

Markov Networks. l Like Bayes Nets. l Graph model that describes joint probability distribution using tables (AKA potentials)

Markov Networks. l Like Bayes Nets. l Graph model that describes joint probability distribution using tables (AKA potentials) Markov Networks l Like Bayes Nets l Graph model that describes joint probability distribution using tables (AKA potentials) l Nodes are random variables l Labels are outcomes over the variables Markov

More information

Towards Lifelong Machine Learning Multi-Task and Lifelong Learning with Unlabeled Tasks Christoph Lampert

Towards Lifelong Machine Learning Multi-Task and Lifelong Learning with Unlabeled Tasks Christoph Lampert Towards Lifelong Machine Learning Multi-Task and Lifelong Learning with Unlabeled Tasks Christoph Lampert HSE Computer Science Colloquium September 6, 2016 IST Austria (Institute of Science and Technology

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:

More information

13 : Variational Inference: Loopy Belief Propagation

13 : Variational Inference: Loopy Belief Propagation 10-708: Probabilistic Graphical Models 10-708, Spring 2014 13 : Variational Inference: Loopy Belief Propagation Lecturer: Eric P. Xing Scribes: Rajarshi Das, Zhengzhong Liu, Dishan Gupta 1 Introduction

More information

Lecture 6: Graphical Models: Learning

Lecture 6: Graphical Models: Learning Lecture 6: Graphical Models: Learning 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering, University of Cambridge February 3rd, 2010 Ghahramani & Rasmussen (CUED)

More information

Machine Learning Summer School

Machine Learning Summer School Machine Learning Summer School Lecture 3: Learning parameters and structure Zoubin Ghahramani zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/ Department of Engineering University of Cambridge,

More information

MAP Examples. Sargur Srihari

MAP Examples. Sargur Srihari MAP Examples Sargur srihari@cedar.buffalo.edu 1 Potts Model CRF for OCR Topics Image segmentation based on energy minimization 2 Examples of MAP Many interesting examples of MAP inference are instances

More information

14 : Theory of Variational Inference: Inner and Outer Approximation

14 : Theory of Variational Inference: Inner and Outer Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 14 : Theory of Variational Inference: Inner and Outer Approximation Lecturer: Eric P. Xing Scribes: Maria Ryskina, Yen-Chia Hsu 1 Introduction

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

Generative Clustering, Topic Modeling, & Bayesian Inference

Generative Clustering, Topic Modeling, & Bayesian Inference Generative Clustering, Topic Modeling, & Bayesian Inference INFO-4604, Applied Machine Learning University of Colorado Boulder December 12-14, 2017 Prof. Michael Paul Unsupervised Naïve Bayes Last week

More information

Probabilistic Machine Learning

Probabilistic Machine Learning Probabilistic Machine Learning Bayesian Nets, MCMC, and more Marek Petrik 4/18/2017 Based on: P. Murphy, K. (2012). Machine Learning: A Probabilistic Perspective. Chapter 10. Conditional Independence Independent

More information