Probabilistic Graphical Models & Applications


1 Probabilistic Graphical Models & Applications: Learning of Graphical Models. Bjoern Andres and Bernt Schiele, Max Planck Institute for Informatics. The slides of today's lecture are authored by, and shown with kind permission of, Christoph H. Lampert of IST Austria.

2 Outline: Conditional Random Fields; Maximum Likelihood for CRFs; Numerical Training; Latent Variables.
Supervised Learning Problem
Given training examples (x^1, y^1), ..., (x^N, y^N) ∈ X × Y:
- X: a collection of interacting random variables, e.g. image pixels
- x ∈ X: one joint assignment, e.g. a specific image, out of all possible joint assignments
- Y: a collection of interacting random variables, e.g. part locations
- y ∈ Y: one joint assignment, e.g. a specific pose, out of all possible joint assignments
(Images: HumanEva dataset)
Goal: be able to make predictions for new inputs, i.e. learn a function f: X → Y.

3 Supervised Learning Problem
[Figure: factor graph over body parts, with variables Y_top, Y_head, Y_torso, Y_rarm, Y_larm, Y_rhnd, Y_lhnd, Y_rleg, Y_lleg, Y_rfoot, Y_lfoot connected to the image X through factors such as F^(1)_top and F^(2)_top,head]
Step 1: define a proper graph structure over X and Y.
Step 2: define a proper parameterization of p(y|x; θ), e.g. p(y|x; θ) = (1/Z) exp( Σ_{i=1}^d θ_i φ_i(x, y) ).
Step 3: learn the parameters θ from training data, e.g. by maximum likelihood (today).
Step 4: for a new x ∈ X, make a prediction, e.g. y* = argmax_{y∈Y} p(y|x; θ*) (next lecture).

4 Parameterization
Goal: define feature functions φ_i: X × Y → R for i = 1, ..., d, or jointly φ: X × Y → R^d.
Main step: for each factor F ∈ F we define some φ_F(x_F, y_F): X_F × Y_F → R^{d_F}, where (x_F, y_F) are those random variables of (x, y) that appear in F.
Example: pose estimation (X: image pixels, Y: part locations)
- φ_head(x, y_i) = the pixels of the image x at the location specified by y_i
- φ_head-torso(y_i, y_j) = the distance between the locations y_i and y_j
Example: image segmentation (X: image pixels, Y: per-pixel foreground/background flag)
- unary terms: φ_i(x_i, y_i) = ( x_i · ⟦y_i = 0⟧, (1 − x_i) · ⟦y_i = 1⟧ )
- pairwise terms: φ_ij(y_i, y_j) = ⟦y_i ≠ y_j⟧
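To make the segmentation feature functions concrete, here is a minimal Python sketch, assuming binary labels y_i ∈ {0, 1} and grayscale pixel values x_i ∈ [0, 1]; the function names are illustrative, not from the lecture:

```python
import numpy as np

def phi_unary(x_i, y_i):
    # unary feature phi_i(x_i, y_i) = ( x_i * [y_i == 0], (1 - x_i) * [y_i == 1] )
    return np.array([x_i * (y_i == 0), (1.0 - x_i) * (y_i == 1)])

def phi_pairwise(y_i, y_j):
    # pairwise feature phi_ij(y_i, y_j) = [y_i != y_j] (indicator of a label change)
    return np.array([float(y_i != y_j)])

# a bright pixel (x_i = 0.9) activates the first feature when labelled 0
# and the second feature when labelled 1
print(phi_unary(0.9, 0))   # [0.9 0. ]
print(phi_unary(0.9, 1))   # [0.  0.1]
print(phi_pairwise(0, 1))  # [1.]
```

The learned weights θ then decide how strongly each of these features rewards or penalizes a labelling.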

5 Parameterization
Goal: define feature functions φ_i: X × Y → R for i = 1, ..., d, or jointly φ: X × Y → R^d.
Two common options (combinations of them also work):
1) combine factors by stacking their vectors:
   φ(x, y) = ( φ_F(x_F, y_F) )_{F∈F},  p(y|x; θ) = (1/Z(x; θ)) exp⟨θ, φ(x, y)⟩ with ⟨θ, φ(x, y)⟩ = Σ_{F∈F} ⟨θ_F, φ_F(x_F, y_F)⟩
2) combine factors by summing their vectors:
   φ(x, y) = Σ_{F∈F} φ_F(x_F, y_F),  p(y|x; θ) = (1/Z(x; θ)) exp⟨θ, φ(x, y)⟩ with ⟨θ, φ(x, y)⟩ = Σ_{F∈F} ⟨θ, φ_F(x_F, y_F)⟩
Result: a log-linear model with parameter vector θ (sometimes w, for weight vector), i.e. a Conditional Random Field.

6-7 Maximum Likelihood Parameter Estimation
Maximize the likelihood of the outputs y^1, ..., y^N for the inputs x^1, ..., x^N:
θ* = argmax_{θ∈R^D} p(y^1, ..., y^N | x^1, ..., x^N; θ)
   (i.i.d.) = argmax_{θ∈R^D} Π_{n=1}^N p(y^n | x^n; θ)
   (take −log) = argmin_{θ∈R^D} −Σ_{n=1}^N log p(y^n | x^n; θ)   (the negative conditional log-likelihood of D)
   = argmin_{θ∈R^D} Σ_{n=1}^N [ −log e^{⟨θ, φ(x^n, y^n)⟩} + log Z(x^n; θ) ]
   = argmin_{θ∈R^D} Σ_{n=1}^N [ −⟨θ, φ(x^n, y^n)⟩ + log Σ_{y∈Y} e^{⟨θ, φ(x^n, y)⟩} ]   (the second term is the log-partition function)

8-9 MAP Estimation of θ
Treat θ as a random variable; maximize the posterior p(θ|D):
p(θ|D) (Bayes) = p(x^1, y^1, ..., x^N, y^N | θ) p(θ) / p(D)
       (i.i.d.) = (p(θ) / p(D)) Π_{n=1}^N p(y^n | x^n; θ) p(x^n)
p(θ): prior belief on θ (cannot be estimated from data).
θ* = argmax_{θ∈R^D} p(θ|D) = argmin_{θ∈R^D} [ −log p(θ|D) ]
   = argmin_{θ∈R^D} [ −log p(θ) − Σ_{n=1}^N log p(y^n | x^n; θ) − Σ_{n=1}^N log p(x^n) ]   (the last term is independent of θ)
   = argmin_{θ∈R^D} [ −log p(θ) − Σ_{n=1}^N log p(y^n | x^n; θ) ]

10 θ* = argmin_{θ∈R^D} [ −log p(θ) − Σ_{n=1}^N log p(y^n | x^n; θ) ]
Choices for p(θ):
- p(θ) = const. for all θ ∈ R^D (uniform; on R^D not really a distribution):
  θ* = argmin_{θ∈R^D} [ −Σ_{n=1}^N log p(y^n | x^n; θ) + const. ]   (the negative conditional log-likelihood)
- p(θ) := const. · e^{−(λ/2) ‖θ‖²} (Gaussian):
  θ* = argmin_{θ∈R^D} [ (λ/2) ‖θ‖² − Σ_{n=1}^N log p(y^n | x^n; θ) + const. ]   (the regularized negative conditional log-likelihood)

11 Probabilistic Models for Structured Prediction: Summary
Negative (regularized) conditional log-likelihood (of D):
L(θ) = (λ/2) ‖θ‖² + Σ_{n=1}^N [ −⟨θ, φ(x^n, y^n)⟩ + log Σ_{y∈Y} e^{⟨θ, φ(x^n, y)⟩} ]
(λ → 0 makes it unregularized.)
Probabilistic parameter estimation or training means solving θ* = argmin_{θ∈R^D} L(θ).
This is the same optimization problem as for multi-class logistic regression.
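For intuition, on a toy problem where Y is small enough to enumerate, L(θ) can be evaluated exactly by brute force. A minimal sketch (the joint feature map and data below are made up for illustration, not from the lecture):

```python
import numpy as np
from itertools import product

def neg_cond_log_likelihood(theta, data, phi, labels, num_vars, lam=0.0):
    """L(theta) = lam/2 ||theta||^2
                  + sum_n [ -<theta, phi(x^n, y^n)> + log sum_y exp<theta, phi(x^n, y)> ],
    with the log-partition function computed by enumerating Y = labels^num_vars."""
    L = 0.5 * lam * np.dot(theta, theta)
    for x, y in data:
        scores = np.array([np.dot(theta, phi(x, yy))
                           for yy in product(labels, repeat=num_vars)])
        L += -np.dot(theta, phi(x, y)) + np.logaddexp.reduce(scores)
    return L

# toy joint feature map for two binary output variables:
# two unary-style features and one label-change indicator
phi = lambda x, y: np.array([x[0] * y[0], x[1] * y[1], float(y[0] != y[1])])
data = [((0.9, 0.8), (1, 1)), ((0.1, 0.2), (0, 0))]
print(neg_cond_log_likelihood(np.zeros(3), data, phi, labels=(0, 1), num_vars=2))
```

This enumeration is only feasible for tiny Y; the later slides address what to do when Y is exponentially large.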

12 Negative Conditional Log-Likelihood (Toy Example)
[Figure: plots of the negative conditional log-likelihood for a toy example]

13 Steepest Descent Minimization
Goal: minimize L(θ). Input: tolerance ε > 0.
1: θ_cur ← 0
2: repeat
3:   v ← ∇_θ L(θ_cur)
4:   η ← argmin_{η∈R} L(θ_cur − ηv)
5:   θ_cur ← θ_cur − ηv
6: until ‖v‖ < ε
Output: θ_cur
Alternatives: L-BFGS (second-order descent without an explicit Hessian), conjugate gradient.
We always need (at least) the gradient of L.
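A minimal steepest-descent loop matching the pseudocode above, with a backtracking line search standing in for the exact argmin over η (a sketch; the quadratic test function at the bottom is only a stand-in for L(θ)):

```python
import numpy as np

def steepest_descent(L, grad_L, theta0, eps=1e-6, max_iter=1000):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        v = grad_L(theta)                     # step 3: v <- grad L(theta_cur)
        if np.linalg.norm(v) < eps:           # step 6: stop when ||v|| < eps
            break
        eta = 1.0                             # step 4: approximate line search (backtracking)
        while L(theta - eta * v) > L(theta) - 1e-4 * eta * np.dot(v, v):
            eta *= 0.5
            if eta < 1e-12:
                break
        theta = theta - eta * v               # step 5: theta_cur <- theta_cur - eta * v
    return theta

# stand-in objective: a smooth convex function with minimum at (1, 1, 1)
L = lambda th: float(np.sum((th - 1.0) ** 2))
grad_L = lambda th: 2.0 * (th - 1.0)
print(steepest_descent(L, grad_L, np.zeros(3)))   # approximately [1. 1. 1.]
```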

14-21 Steepest Descent Minimization: minimize L(θ)
[Figures: a sequence of plots of the negative log-likelihood over w (σ² = 0.10 and other values of σ²), showing successive steepest-descent steps approaching the minimum]

22 L(θ) = (λ/2) ‖θ‖² + Σ_{n=1}^N [ −⟨θ, φ(x^n, y^n)⟩ + log Σ_{y∈Y} e^{⟨θ, φ(x^n, y)⟩} ]
∇_θ L(θ) = λθ − Σ_{n=1}^N [ φ(x^n, y^n) − Σ_{y∈Y} ( e^{⟨θ, φ(x^n, y)⟩} / Σ_{ȳ∈Y} e^{⟨θ, φ(x^n, ȳ)⟩} ) φ(x^n, y) ]
         = λθ − Σ_{n=1}^N [ φ(x^n, y^n) − Σ_{y∈Y} p(y|x^n; θ) φ(x^n, y) ]
         = λθ − Σ_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n;θ)} φ(x^n, y) ]
∇²_θ L(θ) = λ Id_{D×D} + Σ_{n=1}^N Cov_{y∼p(y|x^n;θ)}[ φ(x^n, y) ]
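The same brute-force enumeration gives the gradient on a toy model: build p(y|x^n; θ) over all of Y, take the expected feature vector, and subtract it from the observed features. A sketch with an illustrative feature map:

```python
import numpy as np
from itertools import product

def gradient(theta, data, phi, labels, num_vars, lam=0.0):
    """grad L(theta) = lam*theta - sum_n [ phi(x^n, y^n) - E_{y ~ p(y|x^n;theta)} phi(x^n, y) ]."""
    g = lam * theta
    for x, y_obs in data:
        ys = list(product(labels, repeat=num_vars))
        feats = np.array([phi(x, y) for y in ys])            # |Y| x D matrix
        scores = feats @ theta
        p = np.exp(scores - np.logaddexp.reduce(scores))     # p(y | x; theta) for every y
        expected = p @ feats                                  # E_y[ phi(x, y) ]
        g -= phi(x, y_obs) - expected
    return g

phi = lambda x, y: np.array([x[0] * y[0], x[1] * y[1], float(y[0] != y[1])])
data = [((0.9, 0.8), (1, 1))]
print(gradient(np.zeros(3), data, phi, labels=(0, 1), num_vars=2))
```

A finite-difference check against the brute-force L(θ) from the earlier sketch is an easy way to validate such a gradient.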

23 L(θ) = (λ/2) ‖θ‖² + Σ_{n=1}^N [ −⟨θ, φ(x^n, y^n)⟩ + log Σ_{y∈Y} e^{⟨θ, φ(x^n, y)⟩} ]
L is continuous (not discrete) and C∞-differentiable on all of R^D.
[Figure: slice through the objective value for θ_x ∈ [−3, 5], θ_y = 0]

24 ∇_θ L(θ) = λθ − Σ_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n;θ)} φ(x^n, y) ]
For λ = 0: if E_{y∼p(y|x^n;θ)} φ(x^n, y) = φ(x^n, y^n) for all n, then ∇_θ L(θ) = 0, i.e. θ is a critical point of L (local minimum/maximum/saddle point).
Interpretation: we want the model distribution to match the empirical one: E_{y∼p(y|x;θ)} φ(x, y) = φ(x, y_obs).
E.g. image segmentation:
- φ_unary: the correct amount of foreground vs. background
- φ_pairwise: the correct amount of fg/bg transitions (smoothness)

25 L(θ) = λid D D + N E y p(y x n ;θ) { φ(x n, y)φ(x n, y) } positive definite Hessian matrix L(θ) is convex θ L(θ) = 0 implies global minimum slice through objective value (θ x [ 3, 5], θ y = 0) 17 / 44

26-27 Milestone I: Probabilistic Training (Conditional Random Fields)
- p(y|x; θ) is log-linear in θ ∈ R^D.
- Training: minimize the (regularized) negative conditional log-likelihood L(θ).
- L(θ) is differentiable and convex; gradient descent will find the global optimum with ∇_θ L(θ) = 0.
- Same structure as multi-class logistic regression (which corresponds to no structure and Y = {1, ..., K}).
For logistic regression, this is where the textbook ends. We're done.
For conditional random fields, we're not in safe waters yet!

28-29 Solving the Training Optimization Problem Numerically
Task: compute v = ∇_θ L(θ_cur) and evaluate L(θ_cur + ηv), with
L(θ) = (λ/2) ‖θ‖² + Σ_{n=1}^N [ −⟨θ, φ(x^n, y^n)⟩ + log Σ_{y∈Y} e^{⟨θ, φ(x^n, y)⟩} ]
∇_θ L(θ) = λθ − Σ_{n=1}^N [ φ(x^n, y^n) − Σ_{y∈Y} p(y|x^n; θ) φ(x^n, y) ]
Problem: Y is typically very (exponentially) large:
- binary image segmentation: |Y| = 2^(number of pixels)
- ranking N images: |Y| = N!, e.g. for N = 1000 this is astronomically large.
We must use the structure in Y, or we're lost.

30-31 Solving the Training Optimization Problem Numerically
∇_θ L(θ) = λθ − Σ_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n;θ)} φ(x^n, y) ]
L(θ) = (λ/2) ‖θ‖² + Σ_{n=1}^N [ −⟨θ, φ(x^n, y^n)⟩ + log Z(x^n; θ) ]
Computing the gradient naively: O(K^M · N · D). Line search naively: O(K^M · N · D) per evaluation of L.
- N: number of samples (10s to 1,000,000s)
- D: dimension of the feature space
- M: number of output variables
- K: number of possible labels of each output variable (2 to 1000s)

32 Solving the Training Optimization Problem Numerically
For a graphical model with factors F, the features decompose (stacked version):
φ(x, y) = ( φ_F(x, y_F) )_{F∈F}
E_{y∼p(y|x;θ)} φ(x, y) = E_{y∼p(y|x;θ)} ( φ_F(x, y_F) )_{F∈F} = ( E_{y_F∼p(y_F|x;θ)} φ_F(x, y_F) )_{F∈F}
E_{y_F∼p(y_F|x;θ)} φ_F(x, y_F) = Σ_{y_F∈Y_F} p(y_F|x; θ) φ_F(x, y_F)   (only K^{|F|} terms; the p(y_F|x; θ) are the factor marginals)
The factor marginals μ_F = p(y_F|x; θ) are much smaller than the complete joint distribution p(y|x; θ); compute them by probabilistic inference.

33 Solving the Training Optimization Problem Numerically
For a graphical model with factors F, the features decompose (summed version):
φ(x, y) = Σ_{F∈F} φ_F(x, y_F)
E_{y∼p(y|x;θ)} φ(x, y) = Σ_{F∈F} E_{y∼p(y|x;θ)} φ_F(x, y_F) = Σ_{F∈F} E_{y_F∼p(y_F|x;θ)} φ_F(x, y_F)
Again, we need only the factor marginals μ_F = p(y_F|x; θ).
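Given factor marginals μ_F(y_F) = p(y_F | x; θ) from some inference routine (not shown here), the expectation needed for the gradient reduces to a small sum per factor. A sketch for the summed feature convention, with illustrative inputs:

```python
import numpy as np
from itertools import product

def expected_features(factors, x, labels):
    """E_{y ~ p(y|x;theta)}[phi(x, y)] = sum_F sum_{y_F} mu_F(y_F) * phi_F(x, y_F).
    `factors` is a list of (phi_F, mu_F, arity) tuples; mu_F maps a factor
    configuration y_F to its marginal probability (assumed to be given)."""
    total = None
    for phi_F, mu_F, arity in factors:
        for y_F in product(labels, repeat=arity):
            term = mu_F(y_F) * phi_F(x, y_F)
            total = term if total is None else total + term
    return total

# toy example: one pairwise factor over two binary variables with uniform marginals
phi_F = lambda x, y_F: np.array([float(y_F[0] != y_F[1])])
mu_F = lambda y_F: 0.25
print(expected_features([(phi_F, mu_F, 2)], x=None, labels=(0, 1)))   # [0.5]
```

Each inner sum has only K^{|F|} terms, which is exactly why the per-factor decomposition makes the gradient tractable.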

34-35 Solving the Training Optimization Problem Numerically
∇_θ L(θ) = λθ − Σ_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n;θ)} φ(x^n, y) ]
Computing the gradient, if marginal inference is possible, drops from O(K^M · N · D) to O(M · K^{|F_max|} · N · D), with
L(θ) = (λ/2) ‖θ‖² + Σ_{n=1}^N [ −⟨θ, φ(x^n, y^n)⟩ + log Σ_{y∈Y} e^{⟨θ, φ(x^n, y)⟩} ]
- N: number of samples (10s to 1,000,000s)
- D: dimension of the feature space
- M: number of output variables
- K: number of possible labels of each output variable

36 Solving the Training Optimization Problem Numerically
What if the training set D is too large (e.g. millions of examples)?
Stochastic Gradient Descent (SGD): minimize L(θ), but without ever computing L(θ) or ∇L(θ) exactly.
In each gradient descent step:
- pick a random subset D' ⊂ D, often just 1-3 elements
- compute the approximate gradient
  ∇̃L(θ) = λθ − (|D| / |D'|) Σ_{(x^n,y^n)∈D'} [ φ(x^n, y^n) − E_{y∼p(y|x^n;θ)} φ(x^n, y) ]
- update the parameters in the negative direction of the approximate gradient
- a line search would still be too slow; use a fixed step-size rule η_t instead
(more: Bottou & Bousquet, "The Tradeoffs of Large Scale Learning", NIPS 2008)

37-42 Stochastic Gradient Descent (SGD)
Important property of SGD: v = ∇̃L(θ) is random; call its distribution q.
v is an unbiased estimate of the true gradient: E_{v∼q}[v] = ∇L(θ), so on average the parameter updates point in the right direction.
[Figures: plots of the negative log-likelihood (σ² = 0.10) illustrating noisy SGD update directions]

43 Stochastic Gradient Descent (SGD)
SGD converges to argmin_θ L(θ) if η_t is chosen appropriately: Σ_t η_t = ∞ and Σ_t η_t² < ∞, e.g. η_t ∝ 1/t.
SGD needs more iterations, but each one is much faster.
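A minimal SGD loop with the η_t ∝ 1/t rule. It is written against an abstract per-example gradient contribution, so the expectation inside could come from enumeration on a toy model or from marginal inference in practice; all names here are illustrative:

```python
import numpy as np

def sgd(per_example_grad, data, dim, lam=0.01, eta0=1.0, steps=1000, batch_size=2, seed=0):
    """SGD with decaying step size eta_t = eta0 / t (so sum_t eta_t = inf, sum_t eta_t^2 < inf).
    `per_example_grad(theta, x, y)` should return
    -( phi(x, y) - E_{y' ~ p(y'|x;theta)}[phi(x, y')] ), the unregularized contribution of one example."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim)
    N = len(data)
    for t in range(1, steps + 1):
        batch = [data[i] for i in rng.integers(0, N, size=batch_size)]
        # approximate gradient: regularizer plus the rescaled mini-batch sum
        g = lam * theta + (N / batch_size) * sum(
            per_example_grad(theta, x, y) for x, y in batch)
        theta -= (eta0 / t) * g
    return theta
```

With batch_size = len(data) and a fixed step size this reduces to ordinary batch gradient descent.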

44 Solving the Training Optimization Problem Numerically
∇_θ L(θ) = λθ − Σ_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n;θ)} φ(x^n, y) ]
Computing the gradient (if marginal inference is possible): O(M · K^{|F_max|} · N · D), with
L(θ) = (λ/2) ‖θ‖² + Σ_{n=1}^N [ −⟨θ, φ(x^n, y^n)⟩ + log Σ_{y∈Y} e^{⟨θ, φ(x^n, y)⟩} ]
- N: number of samples
- D: dimension of the feature space: φ_{i,j}: 1 to 10s, φ_i: 10s to 10,000s
- M: number of output variables
- K: number of possible labels of each output variable

45 Solving the Training Optimization Problem Numerically
Typical feature functions in image segmentation:
- φ_i(y_i, x) ∈ R^1000: local image features, e.g. bag-of-words; ⟨θ_i, φ_i(y_i, x)⟩ acts as a local classifier (like logistic regression)
- φ_{ij}(y_i, y_j) = ⟦y_i = y_j⟧ ∈ R^1 for (i, j) ∈ E: test for same label; ⟨θ_{ij}, φ_{ij}(y_i, y_j)⟩ is a penalizer for label changes (if θ_{ij} > 0)
Combined: argmax_y p(y|x; θ) is a smoothed version of the local cues.

46 Solving the Training Optimization Problem Numerically
Typical feature functions in pose estimation:
- φ_i(y_i, x) ∈ R^1000: local image representation, e.g. HoG; ⟨θ_i, φ_i(y_i, x)⟩ is a local confidence map
- φ_{i,j}(y_i, y_j) = good_fit(y_i, y_j) ∈ R^1: test for geometric fit; ⟨θ_{ij}, φ_{ij}(y_i, y_j)⟩ is a penalizer for unrealistic poses
Together: argmax_y p(y|x) is a sanitized version of the local cues.
[Figures: original image, local confidence, local + geometry]

47-48 Solving the Training Optimization Problem Numerically
Idea: split learning of the unary potentials into two parts: the local classifiers and their importance.
Two-Stage Training:
- pre-train local classifiers f^{y_i}(x) ≈ log p(y_i|x)
- use φ_i(y_i, x) := f^{y_i}(x) ∈ R^K (low-dimensional)
- keep φ_{ij}(y_i, y_j) as before
- perform CRF learning with φ_i and φ_{ij}
Advantages: a lower-dimensional feature space during inference (faster); f^{y_i}(x) can be any classifier, e.g. a deep network, a random forest, ...
Disadvantage: if the local classifiers are bad, CRF training cannot fix that.
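A sketch of the two-stage idea: pre-train any local classifier on per-site features, then use its per-class log-probabilities as the low-dimensional unary features φ_i(y_i, x). Scikit-learn logistic regression is used here as one possible choice of classifier, the data is random toy data, and the subsequent CRF learning step is not shown:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stage 1: pre-train a local classifier f^{y_i}(x) ~ log p(y_i | x)
rng = np.random.default_rng(0)
X_local = rng.normal(size=(500, 20))           # 500 sites, 20 raw features each
y_local = (X_local[:, 0] > 0).astype(int)      # toy binary labels
clf = LogisticRegression(max_iter=1000).fit(X_local, y_local)

# Stage 2: use the classifier outputs as low-dimensional unary features.
# One reading of phi_i(y_i, x) := f^{y_i}(x) in R^K: a K-dim vector holding the
# log-probability of label y_i in position y_i (K = number of classes).
def phi_unary(y_i, x_i):
    logp = clf.predict_log_proba(x_i.reshape(1, -1))[0]
    feat = np.zeros_like(logp)
    feat[y_i] = logp[y_i]
    return feat

print(phi_unary(1, X_local[0]))   # only this K-dim phi_i and phi_ij enter CRF training
```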

49 Solving the Training Optimization Problem Numerically
∇_θ L(θ) = λθ − Σ_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n;θ)} φ(x^n, y) ]
Computing the gradient: O(K^M · N · D), with
L(θ) = (λ/2) ‖θ‖² + Σ_{n=1}^N [ −⟨θ, φ(x^n, y^n)⟩ + log Σ_{y∈Y} e^{⟨θ, φ(x^n, y)⟩} ]
(N: number of samples; D: dimension of the feature space; M: number of output variables; K: number of possible labels of each output variable)
What if exact marginal inference is not possible?

50 Training with Approximate Inference
Problem: exact marginal inference might be computationally too costly.
Idea: just use approximate marginals, e.g. from loopy BP.
Possibility 1: everything works fine.
Possibility 2: nothing works anymore: the approximate gradient might not be a descent direction (the objective does not decrease along its negative); then gradient descent might terminate too early (line search chooses η = 0), not converge at all, or converge to a bad solution, ...
In general, there is no way to tell which of the possibilities occurs.

51-52 Training with an Approximate Likelihood
Question: how to train if (approximate) maximum likelihood learning does not work?
Idea: optimize a simpler quantity instead of L.
Pseudolikelihood [Besag, 1987]: pretend p(y) ≈ p_PL(y), where
p_PL(y) = Π_{i∈V} p(y_i | y_{V∖{i}}) = Π_{i∈V} p(y_i | y_{N(i)})

53-54 Training with an Approximate Likelihood: Pseudolikelihood (PL)
p(y|x; θ) ≈ p_PL(y|x; θ) = Π_{i∈V} p(y_i | y_{N(i)}, x; θ)
For training data {(x^1, y^1), ..., (x^N, y^N)}:
L_PL(θ) = −Σ_{n=1}^N log p_PL(y^n | x^n; θ) = −Σ_{n=1}^N Σ_{i∈V} log p(y_i^n | y_{N(i)}^n, x^n; θ)
        = Σ_{n=1}^N Σ_{i∈V} [ −⟨θ, φ(y^n, x^n)⟩ + log Σ_{z∈Y_i} e^{⟨θ, φ(y_1^n, ..., y_{i−1}^n, z, y_{i+1}^n, ..., y_{|V|}^n, x^n)⟩} ]
The partition functions sum only over one variable at a time, so they are tractable.
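The pseudolikelihood objective is straightforward to write down with a generic joint feature map φ(y, x): for each variable i, all other variables are clamped to their observed values and the local partition function sums only over the |Y_i| values of y_i. A minimal sketch (the feature map and data are illustrative):

```python
import numpy as np

def neg_pseudo_log_likelihood(theta, x, y_obs, phi, labels, lam=0.0):
    """L_PL contribution of one example:
    sum_i [ -<theta, phi(y, x)> + log sum_z exp<theta, phi(y_1,...,y_{i-1}, z, y_{i+1},..., x)> ]."""
    L = 0.5 * lam * np.dot(theta, theta)
    y_obs = list(y_obs)
    for i in range(len(y_obs)):
        scores = []
        for z in labels:                       # sum over one variable only -> tractable
            y_mod = list(y_obs)
            y_mod[i] = z
            scores.append(np.dot(theta, phi(y_mod, x)))
        L += -np.dot(theta, phi(y_obs, x)) + np.logaddexp.reduce(np.array(scores))
    return L

# toy 3-variable chain: one "label count" feature and one label-change count
phi = lambda y, x: np.array([float(sum(y)),
                             sum(float(y[k] != y[k + 1]) for k in range(len(y) - 1))])
print(neg_pseudo_log_likelihood(np.zeros(2), x=None, y_obs=[0, 1, 1], phi=phi, labels=(0, 1)))
```

Factors that do not involve y_i appear in both the numerator and the local partition sum, so each conditional p(y_i | y_{N(i)}, x; θ) is computed correctly even with the full joint feature map.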

55 Training with an Approximate Likelihood: Piecewise Training (PW)
Idea: optimize a simpler quantity instead of L.
Piecewise training [Sutton, McCallum, 2005]: pretend
p(y|x) ≈ Π_{F∈F} p_F(y_F|x), with p_F(y_F|x) = (1/Z_F(x; θ)) e^{⟨θ_F, φ_F(y_F, x)⟩}

56-58 Training with an Approximate Likelihood: Piecewise Training (PW)
p(y|x) ≈ Π_{F∈F} p_F(y_F|x; θ_F), with p_F(y_F|x) ∝ e^{⟨θ_F, φ_F(y_F, x)⟩}
For training data {(x^1, y^1), ..., (x^N, y^N)}:
L_PW(θ) = −Σ_{n=1}^N log p_PW(y^n | x^n; θ) = −Σ_{n=1}^N Σ_{F∈F} log p_F(y_F^n | x^n)
        = Σ_{F∈F} Σ_{n=1}^N [ −⟨θ_F, φ_F(y_F^n, x^n)⟩ + log Σ_{ȳ_F∈Y_F} e^{⟨θ_F, φ_F(ȳ_F, x^n)⟩} ]
The partition functions sum only over the variables of one factor at a time, which is usually tractable.
The optimization decomposes into a sum over the θ_F, so it is easy to parallelize.
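Piecewise training optimizes an independent log-linear objective per factor, so each θ_F can be fit separately (and in parallel). A minimal sketch of the per-factor objective, with an illustrative pairwise factor:

```python
import numpy as np
from itertools import product

def neg_piecewise_log_likelihood(theta_F, phi_F, data_F, labels, arity):
    """Per-factor objective:
    sum_n [ -<theta_F, phi_F(y_F^n, x^n)> + log sum_{y_F} exp<theta_F, phi_F(y_F, x^n)> ]."""
    L = 0.0
    for x, y_F_obs in data_F:
        scores = np.array([np.dot(theta_F, phi_F(y_F, x))
                           for y_F in product(labels, repeat=arity)])
        L += -np.dot(theta_F, phi_F(y_F_obs, x)) + np.logaddexp.reduce(scores)
    return L

# toy pairwise factor with a single label-change feature
phi_F = lambda y_F, x: np.array([float(y_F[0] != y_F[1])])
data_F = [(None, (0, 0)), (None, (0, 1))]
print(neg_piecewise_log_likelihood(np.array([0.0]), phi_F, data_F, labels=(0, 1), arity=2))
```

The full L_PW(θ) is just the sum of such terms over all factors, which is why the optimization decomposes over the θ_F.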

59 Solving the Training Optimization Problem Numerically
CRF training is based on gradient-descent optimization. The faster we can do it, the better (more realistic) models we can use:
∇_θ L(θ) = λθ − Σ_{n=1}^N [ φ(x^n, y^n) − Σ_{y∈Y} p(y|x^n; θ) φ(x^n, y) ] ∈ R^D
A lot of research goes into accelerating CRF training:
- problem: Y too large; solutions: exploit structure ((loopy) belief propagation), fast sampling (contrastive divergence), surrogate L (pseudo-likelihood, piecewise training)
- problem: N too large; solution: mini-batches (stochastic gradient descent)
- problem: D too large; solution: pretrained φ_unary (two-stage training)
