Learning with Structured Inputs and Outputs


1 Learning with Structured Inputs and Outputs Christoph H. Lampert IST Austria (Institute of Science and Technology Austria), Vienna ENS/INRIA Summer School, Paris, July 2013 Slides: 1 / 10

2 Schedule Monday: Introduction to Graphical Models. 9:00-9:45: Conditional Random Fields. 9:45-10:30: Structured Support Vector Machines. Slides available on my home page: 2 / 10

3 Extended version of this lecture in book form (180 pages): Foundations and Trends in Computer Graphics and Vision, now publishers. Available as PDF on 3 / 10

4 Standard Regression/Classification: f : X → R. Structured Output Learning: f : X → Y. 4 / 10

5 Standard Regression/Classification: f : X → R. Inputs x ∈ X can be any kind of objects; the output y is a real number. Structured Output Learning: f : X → Y. Inputs x ∈ X can be any kind of objects; outputs y ∈ Y are complex (structured) objects. 5 / 10

6 What is structured data? Ad hoc definition: data that consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together. Examples: text, molecules / chemical structures, documents/hypertext, images. 6 / 10

7 What is structured output prediction? Ad hoc definition: predicting structured outputs from input data (in contrast to predicting just a single number, like in classification or regression) Natural Language Processing: Automatic Translation (output: sentences) Sentence Parsing (output: parse trees) Bioinformatics: Secondary Structure Prediction (output: bipartite graphs) Enzyme Function Prediction (output: path in a tree) Speech Processing: Automatic Transcription (output: sentences) Text-to-Speech (output: audio signal) Robotics: Planning (output: sequence of actions) This tutorial: Applications and Examples from Computer Vision 7 / 10

8 Reminder: Graphical Model for Pose Estimation [figure: factor graph with input X, body-part variables Y_top, Y_head, Y_torso, Y_larm, Y_rarm, Y_lhnd, Y_rhnd, Y_lleg, Y_rleg, Y_lfoot, Y_rfoot, and factors such as F^(1)_top, F^(2)_top,head, ...] Joint probability distribution of all body parts: p(y|x) = (1/Z(x)) exp(−∑_{F∈ℱ} E_F(y_F; x)). The exponent ("energy") decomposes into small but interacting factors. 8 / 10

9 Reminder: Graphical Model for Image Segmentation Probability distribution over all foreground/background segmentations: p(y|x) = (1/Z(x)) exp(−∑_{F∈ℱ} E_F(y_F; x)). The exponent ("energy") decomposes into small but interacting factors. 9 / 10

10 Reminder: Inference/Prediction Monday: Probabilistic Inference: compute marginal probabilities p(y_F|x) for any factor F, in particular p(y_i|x) for all i ∈ V. Monday: MAP Prediction: predict f : X → Y by solving y* = argmax_{y∈Y} p(y|x) = argmin_{y∈Y} E(y, x). Today: Parameter Learning: learn the potentials/energy terms from training data. 10 / 10

11 Part 1: Conditional Random Fields

12 Supervised Learning Problem Given training examples (x^1, y^1), ..., (x^N, y^N) ∈ X × Y. How to make predictions g : X → Y? Approach 1) Discriminative Probabilistic Learning: 1) Use the training data to obtain an estimate p(y|x). 2) Use f(x) = argmin_{ȳ∈Y} ∑_{y∈Y} p(y|x) Δ(y, ȳ) to make predictions. Approach 2) Loss-minimizing Parameter Estimation: 1) Use the training data to learn an energy function E(x, y). 2) Use f(x) := argmin_{y∈Y} E(x, y) to make predictions. 2 / 29

13 Conditional Random Field Learning Goal: learn a posterior distribution p(y|x) = (1/Z(x)) exp(−∑_{F∈ℱ} E_F(y_F; x)) with ℱ = {all factors}: all unary, pairwise, potentially higher order, ... Parameterize each E_F(y_F; x) = −⟨w_F, φ_F(x, y_F)⟩: fixed feature functions (φ_F(x, y_F))_{F∈ℱ} =: φ(x, y), weight vectors (w_F)_{F∈ℱ} =: w. Result: a log-linear model with parameter vector w, p(y|x; w) = (1/Z(x; w)) exp(⟨w, φ(x, y)⟩), with Z(x; w) = ∑_{ȳ∈Y} exp(⟨w, φ(x, ȳ)⟩). New goal: find the best parameter vector w ∈ R^D. 3 / 29

14 Maximum Likelihood Parameter Estimation Idea 1: Maximize the likelihood of the outputs y^1, ..., y^N for the inputs x^1, ..., x^N: w* = argmax_{w∈R^D} p(y^1, ..., y^N | x^1, ..., x^N, w) = (i.i.d.) argmax_{w∈R^D} ∏_{n=1}^N p(y^n | x^n, w) = (log(·)) argmin_{w∈R^D} −∑_{n=1}^N log p(y^n | x^n, w), the negative conditional log-likelihood (of D). 4 / 29

15 MAP Estimation of w Idea 2: Treat w as a random variable; maximize the posterior p(w|D). 5 / 29

16 MAP Estimation of w Idea 2: Treat w as a random variable; maximize the posterior p(w|D): p(w|D) = (Bayes) p(x^1, y^1, ..., x^N, y^N | w) p(w) / p(D) = (i.i.d.) (p(w)/p(D)) ∏_{n=1}^N p(y^n|x^n, w) p(x^n). p(w): prior belief on w (cannot be estimated from data). w* = argmax_{w∈R^D} p(w|D) = argmin_{w∈R^D} [−log p(w|D)] = argmin_{w∈R^D} [−log p(w) − ∑_{n=1}^N log p(y^n|x^n, w) − ∑_{n=1}^N log p(x^n) (indep. of w)] = argmin_{w∈R^D} [−log p(w) − ∑_{n=1}^N log p(y^n|x^n, w)]. 5 / 29

17 w* = argmin_{w∈R^D} [−log p(w) − ∑_{n=1}^N log p(y^n|x^n, w)]. Choices for p(w): p(w) ∝ const. (uniform; in R^D not really a distribution): w* = argmin_{w∈R^D} [−∑_{n=1}^N log p(y^n|x^n, w) + const.], the negative conditional log-likelihood. p(w) := const. · e^{−(λ/2)‖w‖²} (Gaussian): w* = argmin_{w∈R^D} [(λ/2)‖w‖² − ∑_{n=1}^N log p(y^n|x^n, w) + const.], the regularized negative conditional log-likelihood. 6 / 29

18 Probabilistic Models for Structured Prediction - Summary Negative (Regularized) Conditional Log-Likelihood (of D): L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} − ⟨w, φ(x^n, y^n)⟩ ] (λ → 0 makes it unregularized). Probabilistic parameter estimation or training means solving w* = argmin_{w∈R^D} L(w). Same optimization problem as for multi-class logistic regression. 7 / 29
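For concreteness, here is a minimal Python/NumPy sketch (not from the slides) of this objective for a toy model whose label set Y is small enough to enumerate; `feat`, `labels`, and `lam` are assumed names for the feature function φ, the label set Y, and the regularization constant λ.

    import numpy as np

    def neg_cond_log_likelihood(w, data, feat, labels, lam):
        """L(w) = lam/2 ||w||^2 + sum_n [ log sum_y exp(<w, phi(x^n,y)>) - <w, phi(x^n,y^n)> ].
        data: list of (x, y) pairs; feat(x, y) -> np.ndarray of shape (D,);
        labels: all possible outputs y (assumed enumerable, i.e. a toy setting)."""
        L = 0.5 * lam * np.dot(w, w)
        for x, y in data:
            scores = np.array([np.dot(w, feat(x, yy)) for yy in labels])
            L += np.logaddexp.reduce(scores) - np.dot(w, feat(x, y))  # log Z(x;w) - <w, phi(x,y)>
        return L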

19 Negative Conditional Log-Likelihood (Toy Example) [figure: plots of the negative conditional log-likelihood for a toy example]

20 Steepest Descent Minimization minimize L(w), input: tolerance ε > 0
1: w_cur ← 0
2: repeat
3:   v ← ∇_w L(w_cur)
4:   η ← argmin_{η∈R} L(w_cur − ηv)
5:   w_cur ← w_cur − ηv
6: until ‖v‖ < ε
output: w_cur
Alternatives: L-BFGS (second-order descent without explicit Hessian), Conjugate Gradient. We always need (at least) the gradient of L. 9 / 29
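A Python sketch of this descent loop, assuming `L` and `grad_L` are callables for the objective and its gradient; the exact line search of step 4 is replaced by a simple halving rule, which is an assumption of this sketch, not part of the slide.

    import numpy as np

    def steepest_descent(L, grad_L, D, eps=1e-6, max_iter=1000):
        w = np.zeros(D)                      # step 1: w_cur <- 0
        for _ in range(max_iter):
            v = grad_L(w)                    # step 3: v <- grad L(w_cur)
            if np.linalg.norm(v) < eps:      # step 6: stop when ||v|| < eps
                break
            eta = 1.0                        # step 4: crude line search by halving
            while L(w - eta * v) >= L(w) and eta > 1e-12:
                eta *= 0.5
            w = w - eta * v                  # step 5: w_cur <- w_cur - eta*v
        return w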

21 L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} − ⟨w, φ(x^n, y^n)⟩ ]
∇_w L(w) = λw − ∑_{n=1}^N [ φ(x^n, y^n) − ∑_{y∈Y} (e^{⟨w, φ(x^n, y)⟩} / ∑_{ȳ∈Y} e^{⟨w, φ(x^n, ȳ)⟩}) φ(x^n, y) ]
         = λw − ∑_{n=1}^N [ φ(x^n, y^n) − ∑_{y∈Y} p(y|x^n, w) φ(x^n, y) ]
         = λw − ∑_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]
∇²_w L(w) = λ Id_{D×D} + ∑_{n=1}^N Cov_{y∼p(y|x^n,w)}[ φ(x^n, y) ] 10 / 29
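The corresponding gradient in code, again by brute-force enumeration of Y (viable only for tiny toy problems); the names mirror the earlier sketch.

    import numpy as np

    def grad_neg_cond_log_likelihood(w, data, feat, labels, lam):
        """grad L(w) = lam*w - sum_n [ phi(x^n,y^n) - E_{y~p(y|x^n,w)} phi(x^n,y) ]."""
        g = lam * w
        for x, y in data:
            feats = np.array([feat(x, yy) for yy in labels])    # |Y| x D matrix
            scores = feats @ w
            p = np.exp(scores - np.logaddexp.reduce(scores))    # p(y|x,w) for every y
            g -= feat(x, y) - p @ feats                         # observed minus expected features
        return g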

22 L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} − ⟨w, φ(x^n, y^n)⟩ ] is continuous (not discrete) and C^∞-differentiable on all of R^D. [figure: slice through the objective value, w_x ∈ [−3, 5], w_y = 0] 11 / 29

23 ∇_w L(w) = λw − ∑_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]. For λ → 0: E_{y∼p(y|x^n,w)} φ(x^n, y) = φ(x^n, y^n) ⇒ ∇_w L(w) = 0, a critical point of L (local minimum/maximum/saddle point). Interpretation: we want the model distribution to match the empirical one: E_{y∼p(y|x,w)} φ(x, y) should equal φ(x, y^obs). E.g. Image Segmentation: φ_unary: correct amount of foreground vs. background; φ_pairwise: correct amount of fg/bg transitions → smoothness. 12 / 29

24 ∇²_w L(w) = λ Id_{D×D} + ∑_{n=1}^N Cov_{y∼p(y|x^n,w)}[ φ(x^n, y) ] is a positive definite Hessian matrix ⇒ L(w) is convex ⇒ ∇_w L(w) = 0 implies a global minimum. [figure: slice through the objective value, w_x ∈ [−3, 5], w_y = 0] 13 / 29

25 Milestone I: Probabilistic Training (Conditional Random Fields) p(y|x, w) is log-linear in w ∈ R^D. Training: minimize the negative conditional log-likelihood L(w). L(w) is differentiable and convex; gradient descent will find the global optimum with ∇_w L(w) = 0. Same structure as multi-class logistic regression. 14 / 29

26 Milestone I: Probabilistic Training (Conditional Random Fields) p(y|x, w) is log-linear in w ∈ R^D. Training: minimize the negative conditional log-likelihood L(w). L(w) is differentiable and convex; gradient descent will find the global optimum with ∇_w L(w) = 0. Same structure as multi-class logistic regression. For logistic regression: this is where the textbook ends. We're done. For conditional random fields: we're not in safe waters, yet! 14 / 29

27 Solving the Training Optimization Problem Numerically Task: compute v = ∇_w L(w_cur), evaluate L(w_cur + ηv): L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} − ⟨w, φ(x^n, y^n)⟩ ], ∇_w L(w) = λw − ∑_{n=1}^N [ φ(x^n, y^n) − ∑_{y∈Y} p(y|x^n, w) φ(x^n, y) ]. Problem: Y typically is very (exponentially) large: binary image segmentation: |Y| = 2^{#pixels}; ranking N images: |Y| = N!, e.g. N = 1000: |Y| ≈ 10^{2568}. We must use the structure in Y, or we're lost. 15 / 29

28 Solving the Training Optimization Problem Numerically ∇_w L(w) = λw − ∑_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]. Computing the Gradient (naive): O(K^M N D). L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ log Z(x^n, w) − ⟨w, φ(x^n, y^n)⟩ ]. Line Search (naive): O(K^M N D) per evaluation of L. N: number of samples; D: dimension of the feature space; M: number of output nodes; K: number of possible labels of each output node. 16 / 29

29 Solving the Training Optimization Problem Numerically ∇_w L(w) = λw − ∑_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]. Computing the Gradient (naive): O(K^M N D). L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ log Z(x^n, w) − ⟨w, φ(x^n, y^n)⟩ ]. Line Search (naive): O(K^M N D) per evaluation of L. N: number of samples; D: dimension of the feature space; M: number of output nodes: 100s to 1,000,000s; K: number of possible labels of each output node: 2 to 1000s. 16 / 29

30 Solving the Training Optimization Problem Numerically In a graphical model with factors ℱ, the features decompose: φ(x, y) = ( φ_F(x, y_F) )_{F∈ℱ}, so E_{y∼p(y|x,w)} φ(x, y) = ( E_{y∼p(y|x,w)} φ_F(x, y_F) )_{F∈ℱ} = ( E_{y_F∼p(y_F|x,w)} φ_F(x, y_F) )_{F∈ℱ}, with E_{y_F∼p(y_F|x,w)} φ_F(x, y_F) = ∑_{y_F∈Y_F} p(y_F|x, w) φ_F(x, y_F). The factor marginals μ_F = p(y_F|x, w) have only K^{|F|} terms each, much smaller than the complete joint distribution p(y|x, w), and can be computed/approximated, e.g., with (loopy) belief propagation. 17 / 29
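A small sketch of how the expected feature vector is assembled from factor marginals; the data layout (dicts keyed by factor and by labeling y_F) is an assumption, and the marginals themselves would come from (loopy) belief propagation or another inference routine.

    import numpy as np

    def expected_features_from_marginals(marginals, factor_feats):
        """marginals[F]: dict y_F -> mu_F(y_F) = p(y_F|x,w);
        factor_feats[F]: dict y_F -> phi_F(x, y_F) as np.ndarray.
        Returns the stacked vector ( E_{y_F} phi_F(x, y_F) )_F."""
        blocks = []
        for F in marginals:
            mu_F, phi_F = marginals[F], factor_feats[F]
            # sum_{y_F} p(y_F|x,w) * phi_F(x, y_F)
            blocks.append(sum(mu_F[yF] * phi_F[yF] for yF in mu_F))
        return np.concatenate(blocks)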

31 Solving the Training Optimization Problem Numerically ∇_w L(w) = λw − ∑_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]. Computing the Gradient: O(K^M N D) → O(M K^{|F_max|} N D). L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} − ⟨w, φ(x^n, y^n)⟩ ]. Line Search: O(K^M N D) → O(M K^{|F_max|} N D) per evaluation of L. N: number of samples; D: dimension of the feature space; M: number of output nodes; K: number of possible labels of each output node. 18 / 29

32 Solving the Training Optimization Problem Numerically ∇_w L(w) = λw − ∑_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]. Computing the Gradient: O(K^M N D) → O(M K^{|F_max|} N D). L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} − ⟨w, φ(x^n, y^n)⟩ ]. Line Search: O(K^M N D) → O(M K^{|F_max|} N D) per evaluation of L. N: number of samples: 10s to 1,000,000s; D: dimension of the feature space; M: number of output nodes; K: number of possible labels of each output node. 18 / 29

33 Solving the Training Optimization Problem Numerically What if the training set D is too large (e.g. millions of examples)? Stochastic Gradient Descent (SGD): minimize L(w), but without ever computing L(w) or ∇L(w) exactly. In each gradient descent step: pick a random subset D' ⊂ D, often just 1-3 elements! Follow the approximate gradient ∇̃L(w) = λw − (|D|/|D'|) ∑_{(x^n,y^n)∈D'} [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]. more: see L. Bottou, O. Bousquet: The Tradeoffs of Large Scale Learning, NIPS. also: 19 / 29

34 Solving the Training Optimization Problem Numerically What if the training set D is too large (e.g. millions of examples)? Stochastic Gradient Descent (SGD): minimize L(w), but without ever computing L(w) or ∇L(w) exactly. In each gradient descent step: pick a random subset D' ⊂ D, often just 1-3 elements! Follow the approximate gradient ∇̃L(w) = λw − (|D|/|D'|) ∑_{(x^n,y^n)∈D'} [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]. Avoid line search by using a fixed stepsize rule η (a new parameter). more: see L. Bottou, O. Bousquet: The Tradeoffs of Large Scale Learning, NIPS. also: 19 / 29

35 Solving the Training Optimization Problem Numerically What if the training set D is too large (e.g. millions of examples)? Stochastic Gradient Descent (SGD): minimize L(w), but without ever computing L(w) or ∇L(w) exactly. In each gradient descent step: pick a random subset D' ⊂ D, often just 1-3 elements! Follow the approximate gradient ∇̃L(w) = λw − (|D|/|D'|) ∑_{(x^n,y^n)∈D'} [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]. Avoid line search by using a fixed stepsize rule η (a new parameter). SGD converges to argmin_w L(w)! (if η is chosen right) more: see L. Bottou, O. Bousquet: The Tradeoffs of Large Scale Learning, NIPS. also: 19 / 29

36 Solving the Training Optimization Problem Numerically What if the training set D is too large (e.g. millions of examples)? Stochastic Gradient Descent (SGD): minimize L(w), but without ever computing L(w) or ∇L(w) exactly. In each gradient descent step: pick a random subset D' ⊂ D, often just 1-3 elements! Follow the approximate gradient ∇̃L(w) = λw − (|D|/|D'|) ∑_{(x^n,y^n)∈D'} [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]. Avoid line search by using a fixed stepsize rule η (a new parameter). SGD converges to argmin_w L(w)! (if η is chosen right) SGD needs more iterations, but each one is much faster. more: see L. Bottou, O. Bousquet: The Tradeoffs of Large Scale Learning, NIPS. also: 19 / 29
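One SGD step in code, following the approximate gradient above; the enumeration of Y is again only for a toy setting, and the fixed stepsize `eta` is the new parameter mentioned on the slide.

    import numpy as np

    def sgd_step(w, batch, feat, labels, lam, eta, N):
        """batch = D' (often 1-3 examples).  Approximate gradient:
        lam*w - (|D|/|D'|) * sum_{(x,y) in D'} [ phi(x,y) - E_{y'~p(y'|x,w)} phi(x,y') ]."""
        g = lam * w
        for x, y in batch:
            feats = np.array([feat(x, yy) for yy in labels])
            scores = feats @ w
            p = np.exp(scores - np.logaddexp.reduce(scores))
            g -= (N / len(batch)) * (feat(x, y) - p @ feats)
        return w - eta * g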

37 Solving the Training Optimization Problem Numerically ∇_w L(w) = λw − ∑_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]. Computing the Gradient: O(K^M N D) → O(M K² N D) (if BP is possible). L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} − ⟨w, φ(x^n, y^n)⟩ ]. Line Search: O(K^M N D) → O(M K² N D) per evaluation of L. N: number of samples; D: dimension of the feature space: φ_{i,j}: 1-10s, φ_i: 100s to 10,000s; M: number of output nodes; K: number of possible labels of each output node. 20 / 29

38 Solving the Training Optimization Problem Numerically Typical feature functions in image segmentation: φ_i(y_i, x) ∈ R^{1000}: local image features, e.g. bag-of-words; ⟨w_i, φ_i(y_i, x)⟩: local classifier (like logistic regression). φ_{i,j}(y_i, y_j) = ⟦y_i = y_j⟧ ∈ R^1: test for same label; ⟨w_{ij}, φ_{ij}(y_i, y_j)⟩: penalizer for label changes (if w_{ij} > 0). Combined: argmax_y p(y|x) is a smoothed version of the local cues. [figure: original image, local confidence, local + smoothness] 21 / 29

39 Solving the Training Optimization Problem Numerically Typical feature functions in pose estimation: φ_i(y_i, x) ∈ R^{1000}: local image representation, e.g. HoG; ⟨w_i, φ_i(y_i, x)⟩: local confidence map. φ_{i,j}(y_i, y_j) = good_fit(y_i, y_j) ∈ R^1: test for geometric fit; ⟨w_{ij}, φ_{ij}(y_i, y_j)⟩: penalizer for unrealistic poses. Together: argmax_y p(y|x) is a sanitized version of the local cues. [figure: original image, local confidence, local + geometry] [V. Ferrari, M. Marin-Jimenez, A. Zisserman: Progressive Search Space Reduction for Human Pose Estimation, CVPR 2008.] 22 / 29

40 Solving the Training Optimization Problem Numerically Idea: split the learning of the unary potentials into two parts: local classifiers, and their importance. Two-Stage Training: pre-train f_i^y(x) ≈ log p(y_i|x); use φ_i(y_i, x) := f_i^{y_i}(x) ∈ R^K (low-dimensional); keep φ_{ij}(y_i, y_j) as before; perform CRF learning with φ_i and φ_{ij}. 23 / 29

41 Solving the Training Optimization Problem Numerically Idea: split the learning of the unary potentials into two parts: local classifiers, and their importance. Two-Stage Training: pre-train f_i^y(x) ≈ log p(y_i|x); use φ_i(y_i, x) := f_i^{y_i}(x) ∈ R^K (low-dimensional); keep φ_{ij}(y_i, y_j) as before; perform CRF learning with φ_i and φ_{ij}. Advantage: lower-dimensional feature space during inference → faster; f_i^y(x) can be any classifier, e.g. non-linear SVMs, deep networks, ... Disadvantage: if the local classifiers are bad, CRF training cannot fix that. 23 / 29

42 Solving the Training Optimization Problem Numerically CRF training is based on gradient-descent optimization. The faster we can do it, the better (more realistic) models we can use: ∇_w L(w) = λw − ∑_{n=1}^N [ φ(x^n, y^n) − ∑_{y∈Y} p(y|x^n, w) φ(x^n, y) ] ∈ R^D. A lot of research on accelerating CRF training (problem → solution → method(s)):
Y too large → exploit structure → (loopy) belief propagation; smart sampling → contrastive divergence; use an approximate L → e.g. pseudo-likelihood
N too large → mini-batches → stochastic gradient descent
D too large → trained φ_unary → two-stage training
24 / 29

43 CRFs with Latent Variables So far, training was fully supervised; all variables were observed. In real life, some variables can be unobserved even during training: missing labels in the training data; latent variables, e.g. part location; latent variables, e.g. part occlusion; latent variables, e.g. viewpoint. 25 / 29

44 CRFs with Latent Variables Three types of variables in the graphical model: x ∈ X always observed (input), y ∈ Y observed only in training (output), z ∈ Z never observed (latent). Example: x: image, y: part positions, z ∈ {0, 1}: flag front-view or side-view. images: [Felzenszwalb et al., Object Detection with Discriminatively Trained Part Based Models, T-PAMI, 2010] 26 / 29

45 CRFs with Latent Variables Marginalization over Latent Variables Construct the conditional likelihood as usual: p(y, z|x, w) = (1/Z(x, w)) exp(⟨w, φ(x, y, z)⟩). Derive p(y|x, w) by marginalizing over z: p(y|x, w) = ∑_{z∈Z} p(y, z|x, w) = (1/Z(x, w)) ∑_{z∈Z} exp(⟨w, φ(x, y, z)⟩). 27 / 29

46 Negative regularized conditional log-likelihood: L(w) = (λ/2)‖w‖² − ∑_{n=1}^N log p(y^n|x^n, w) = (λ/2)‖w‖² − ∑_{n=1}^N log ∑_{z∈Z} p(y^n, z|x^n, w) = (λ/2)‖w‖² + ∑_{n=1}^N [ log ∑_{z∈Z} ∑_{y∈Y} exp(⟨w, φ(x^n, y, z)⟩) − log ∑_{z∈Z} exp(⟨w, φ(x^n, y^n, z)⟩) ]. L is not convex in w → local minima are possible. How to train CRFs with latent variables is active research. 28 / 29
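A sketch of this latent-variable objective with the marginalization done by explicit log-sum-exp; `latents` is an assumed name enumerating Z, and enumerating Y and Z is of course only feasible in a toy setting.

    import numpy as np

    def latent_neg_cond_log_likelihood(w, data, feat, labels, latents, lam):
        """L(w) = lam/2||w||^2 + sum_n [ log sum_{y,z} e^{<w,phi(x,y,z)>}
                                         - log sum_z e^{<w,phi(x,y^n,z)>} ]."""
        L = 0.5 * lam * np.dot(w, w)
        for x, y in data:
            num = np.array([np.dot(w, feat(x, y, z)) for z in latents])
            den = np.array([np.dot(w, feat(x, yy, z)) for yy in labels for z in latents])
            L += np.logaddexp.reduce(den) - np.logaddexp.reduce(num)
        return L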

47 Summary CRF Learning Given: a training set {(x^1, y^1), ..., (x^N, y^N)} ⊂ X × Y; feature functions φ : X × Y → R^D that decompose over factors, φ_F : X × Y_F → R^{d_F} for F ∈ ℱ. The overall model is log-linear (in the parameter w): p(y|x; w) ∝ e^{⟨w, φ(x, y)⟩}. CRF training requires minimizing the negative conditional log-likelihood: w* = argmin_w (λ/2)‖w‖² + ∑_{n=1}^N [ log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} − ⟨w, φ(x^n, y^n)⟩ ]. It is a convex optimization problem → (stochastic) gradient descent works. Training needs repeated runs of probabilistic inference. Latent variables are possible, but make training non-convex. 29 / 29

48 Part 2: Structured Support Vector Machines

49 Supervised Learning Problem Training examples (x^1, y^1), ..., (x^N, y^N) ∈ X × Y; loss function Δ : Y × Y → R. How to make predictions g : X → Y? Approach 2) Loss-minimizing Parameter Estimation: 1) Use the training data to learn an energy function E(x, y). 2) Use f(x) := argmin_{y∈Y} E(x, y) to make predictions. Slight variation (for historic reasons): 1) Learn a compatibility function g(x, y) (think: g = −E). 2) Use f(x) := argmax_{y∈Y} g(x, y) to make predictions. 2 / 1

50 Loss-Minimizing Parameter Learning D = {(x^1, y^1), ..., (x^N, y^N)} i.i.d. training set; φ : X × Y → R^D a feature function; Δ : Y × Y → R a loss function. Find a weight vector w* that minimizes the expected loss E_{(x,y)} Δ(y, f(x)) for f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩. 3 / 1

51 Loss-Minimizing Parameter Learning D = {(x^1, y^1), ..., (x^N, y^N)} i.i.d. training set; φ : X × Y → R^D a feature function; Δ : Y × Y → R a loss function. Find a weight vector w* that minimizes the expected loss E_{(x,y)} Δ(y, f(x)) for f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩. Advantage: we directly optimize for the quantity of interest, the expected loss; no expensive-to-compute partition function Z will show up. Disadvantage: we need to know the loss function already at training time; we can't use probabilistic reasoning to find w*. 3 / 1

52 Reminder: Regularized Risk Minimization Task: min_{w∈R^D} E_{(x,y)} Δ(y, f(x)) for f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩. Two major problems: the data distribution is unknown → we can't compute E; f : X → Y has output in a discrete space → f is piecewise constant w.r.t. w → Δ(y, f(x)) is discontinuous, piecewise constant w.r.t. w → we can't apply gradient-based optimization. 4 / 1

53 Reminder: Regularized Risk Minimization Task: min_{w∈R^D} E_{(x,y)} Δ(y, f(x)) for f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩. Problem 1: the data distribution is unknown. Solution: replace E_{(x,y)∼d(x,y)}(·) with the empirical estimate (1/N) ∑_{(x^n,y^n)}(·). To avoid overfitting: add a regularizer, e.g. (λ/2)‖w‖². New task: min_{w∈R^D} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N Δ(y^n, f(x^n)). 5 / 1

54 Reminder: Regularized Risk Minimization Task: for f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩, min_{w∈R^D} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N Δ(y^n, f(x^n)). Problem: Δ(y^n, f(x^n)) = Δ(y^n, argmax_y ⟨w, φ(x^n, y)⟩) is discontinuous w.r.t. w. Solution: replace Δ(y, y′) with a well-behaved surrogate ℓ(x, y, w); typically ℓ is an upper bound to Δ, continuous and convex w.r.t. w. New task: min_{w∈R^D} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ℓ(x^n, y^n, w). 6 / 1

55 Reminder: Regularized Risk Minimization min_{w∈R^D} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ℓ(x^n, y^n, w): Regularization + Loss on training data. 7 / 1

56 Reminder: Regularized Risk Minimization min_{w∈R^D} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ℓ(x^n, y^n, w): Regularization + Loss on training data. Hinge loss: maximum margin training: ℓ(x^n, y^n, w) := max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]. 7 / 1

57 Reminder: Regularized Risk Minimization min_{w∈R^D} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ℓ(x^n, y^n, w): Regularization + Loss on training data. Hinge loss: maximum margin training: ℓ(x^n, y^n, w) := max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]. ℓ is a maximum over linear functions → continuous, convex. ℓ is an upper bound to Δ: small ℓ ⇒ small Δ. 7 / 1

58 Reminder: Regularized Risk Minimization min_{w∈R^D} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ℓ(x^n, y^n, w): Regularization + Loss on training data. Hinge loss: maximum margin training: ℓ(x^n, y^n, w) := max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]. Alternative: logistic loss: probabilistic training: ℓ(x^n, y^n, w) := log ∑_{y∈Y} exp(⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩). Differentiable, convex, but not an upper bound to Δ(y, y′). 7 / 1
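The hinge surrogate is easy to evaluate explicitly; this sketch does so by brute force over Y, with `delta` an assumed name for the task loss Δ.

    import numpy as np

    def structured_hinge_loss(w, x, y_true, feat, labels, delta):
        """l(x, y_true, w) = max_y [ Delta(y_true,y) + <w,phi(x,y)> - <w,phi(x,y_true)> ]."""
        score_true = np.dot(w, feat(x, y_true))
        return max(delta(y_true, yy) + np.dot(w, feat(x, yy)) - score_true for yy in labels)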

59 Structured Output Support Vector Machine: min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]. Conditional Random Field: min_w (λ/2)‖w‖² + ∑_{n=1}^N log ∑_{y∈Y} exp(⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩), where the summand equals −⟨w, φ(x^n, y^n)⟩ + log ∑_y exp(⟨w, φ(x^n, y)⟩), the negative conditional log-likelihood. CRFs and SSVMs have more in common than usually assumed: log ∑_y exp(·) can be interpreted as a soft-max; but: the CRF doesn't take the loss function into account at training time. 8 / 1

60 Example: Multiclass SVM Y = {1, 2, ..., K}, Δ(y, y′) = 1 for y ≠ y′, 0 otherwise. φ(x, y) = ( ⟦y = 1⟧ φ(x), ⟦y = 2⟧ φ(x), ..., ⟦y = K⟧ φ(x) ). Solve: min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ], where the term in brackets is 0 for y = y^n and 1 + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ for y ≠ y^n. Classification: f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩. Crammer-Singer Multiclass SVM [K. Crammer, Y. Singer: On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines, JMLR, 2001] 9 / 1

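A sketch of this multiclass joint feature map and the resulting prediction rule (labels are indexed 0, ..., K-1 here instead of 1, ..., K).

    import numpy as np

    def multiclass_joint_feature(phi_x, y, K):
        """phi(x,y) = ( [y=0] phi(x), ..., [y=K-1] phi(x) ): phi(x) placed in block y."""
        D = len(phi_x)
        out = np.zeros(K * D)
        out[y * D:(y + 1) * D] = phi_x
        return out

    def multiclass_predict(w, phi_x, K):
        """f(x) = argmax_y <w, phi(x,y)>, i.e. the block of w with the highest score."""
        D = len(phi_x)
        return int(np.argmax([np.dot(w[y * D:(y + 1) * D], phi_x) for y in range(K)]))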

62 Example: Hierarchical Multiclass SVM Hierarchical Multiclass Loss: Δ(y, y′) := (1/2) (distance in tree), e.g. Δ(cat, cat) = 0, Δ(cat, dog) = 1, Δ(cat, bus) = 2, etc. min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]. E.g. if y^n = cat, the margins should satisfy: ⟨w, φ(x^n, cat)⟩ − ⟨w, φ(x^n, dog)⟩ ≥ 1, ⟨w, φ(x^n, cat)⟩ − ⟨w, φ(x^n, car)⟩ ≥ 2, ⟨w, φ(x^n, cat)⟩ − ⟨w, φ(x^n, bus)⟩ ≥ 2, ... [L. Cai, T. Hofmann: Hierarchical Document Categorization with Support Vector Machines, ACM CIKM, 2004] [A. Binder, K.-R. Müller, M. Kawanabe: On taxonomies for multi-class image categorization, IJCV, 2011] 10 / 1

63 Example: Hierarchical Multiclass SVM Hierarchical Multiclass Loss: Δ(y, y′) := (1/2) (distance in tree), e.g. Δ(cat, cat) = 0, Δ(cat, dog) = 1, Δ(cat, bus) = 2, etc. min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]. E.g. if y^n = cat, the margins should satisfy: ⟨w, φ(x^n, cat)⟩ − ⟨w, φ(x^n, dog)⟩ ≥ 1, ⟨w, φ(x^n, cat)⟩ − ⟨w, φ(x^n, car)⟩ ≥ 2, ⟨w, φ(x^n, cat)⟩ − ⟨w, φ(x^n, bus)⟩ ≥ 2, ... Labels that cause more loss are pushed further away → lower chance of a high-loss mistake at test time. [L. Cai, T. Hofmann: Hierarchical Document Categorization with Support Vector Machines, ACM CIKM, 2004] [A. Binder, K.-R. Müller, M. Kawanabe: On taxonomies for multi-class image categorization, IJCV, 2011] 10 / 1

64 Solving S-SVM Training Numerically We can solve S-SVM training like CRF training: min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]: continuous, unconstrained, convex, but non-differentiable → we can't use gradient descent directly. We'll have to use subgradients. 11 / 1

65 Solving S-SVM Training Numerically Subgradient Method Definition: Let f : R^D → R be a convex, not necessarily differentiable, function. A vector v ∈ R^D is called a subgradient of f at w_0 if f(w) ≥ f(w_0) + ⟨v, w − w_0⟩ for all w. [figure: f(w) and the linear lower bound f(w_0) + ⟨v, w − w_0⟩ touching it at w_0] 12 / 1


68 Solving S-SVM Training Numerically Subgradient Method Definition: Let f : R^D → R be a convex, not necessarily differentiable, function. A vector v ∈ R^D is called a subgradient of f at w_0 if f(w) ≥ f(w_0) + ⟨v, w − w_0⟩ for all w. For differentiable f, the gradient v = ∇f(w_0) is the only subgradient. [figure: f(w) with its tangent f(w_0) + ⟨v, w − w_0⟩ at w_0] 12 / 1

69 Solving S-SVM Training Numerically Subgradient Method The subgradient method works basically like gradient descent: Subgradient Method Minimization minimize F(w), require: tolerance ε > 0, stepsizes η_t
w_cur ← 0
repeat
  v ← a subgradient of F at w_cur
  w_cur ← w_cur − η_t v
until F changed less than ε
return w_cur
Converges to the global minimum, but rather inefficient if F is non-differentiable. [Shor, Minimization methods for non-differentiable functions, Springer, 1985.] 13 / 1

70-77 Solving S-SVM Training Numerically Subgradient Method Computing a subgradient: min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ℓ^n(w), with ℓ^n(w) = max_y ℓ^n_y(w) and ℓ^n_y(w) := Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩. [figures: the lines ℓ^n_y(w) for individual y ∈ Y and their upper envelope ℓ^n(w)] For each y ∈ Y, ℓ^n_y(w) is a linear function of w. The max over the finite set Y is piece-wise linear. Subgradient of ℓ^n at w_0: find a maximal (active) y, use v = ∇ℓ^n_y(w_0). 14 / 1

78 Solving S-SVM Training Numerically Subgradient Method Subgradient Method S-SVM Training
input: training pairs {(x^1, y^1), ..., (x^N, y^N)} ⊂ X × Y, feature map φ(x, y), loss function Δ(y, y′), regularizer λ, number of iterations T, stepsizes η_t for t = 1, ..., T
1: w ← 0
2: for t = 1, ..., T do
3:   for n = 1, ..., N do
4:     ŷ ← argmax_{y∈Y} Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩
5:     v^n ← φ(x^n, ŷ) − φ(x^n, y^n)
6:   end for
7:   w ← w − η_t (λw + (1/N) ∑_n v^n)
8: end for
output: prediction function f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.
Observation: each update of w needs N argmax-predictions (one per example). 15 / 1
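A runnable Python sketch of this procedure for problems where Y can be enumerated; the decaying stepsize eta_t = eta0/t is an assumed rule, not part of the slide.

    import numpy as np

    def ssvm_subgradient_train(data, feat, labels, delta, lam, D, T=100, eta0=1.0):
        w = np.zeros(D)
        N = len(data)
        for t in range(1, T + 1):
            v = np.zeros(D)
            for x, y in data:
                # loss-augmented prediction; the constant -<w, phi(x,y^n)> does not change the argmax
                y_hat = max(labels, key=lambda yy: delta(y, yy) + np.dot(w, feat(x, yy)))
                v += feat(x, y_hat) - feat(x, y)
            w = w - (eta0 / t) * (lam * w + v / N)
        return w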

79 Solving S-SVM Training Numerically Subgradient Method Same trick as for CRFs: stochastic updates: Stochastic Subgradient Method S-SVM Training
input: training pairs {(x^1, y^1), ..., (x^N, y^N)} ⊂ X × Y, feature map φ(x, y), loss function Δ(y, y′), regularizer λ, number of iterations T, stepsizes η_t for t = 1, ..., T
1: w ← 0
2: for t = 1, ..., T do
3:   (x^n, y^n) ← randomly chosen training example pair
4:   ŷ ← argmax_{y∈Y} Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩
5:   w ← w − η_t (λw + (1/N) [φ(x^n, ŷ) − φ(x^n, y^n)])
6: end for
output: prediction function f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.
Observation: each update of w needs only 1 argmax-prediction (but we'll need many iterations until convergence). 16 / 1

80-86 Example: Image Segmentation X = {images}, Y = {binary segmentation masks}. Training example(s): (x^n, y^n) = [image and its ground-truth mask]. Δ(y, ȳ) = ∑_p ⟦y_p ≠ ȳ_p⟧ (Hamming loss). Subgradient training, iteration by iteration (signs of the components of φ(y^n) − φ(ŷ) per color region):
t = 1: ŷ = [current prediction]; φ(y^n) − φ(ŷ): black +, white +, green −, blue −, gray −
t = 2: ŷ = [current prediction]; φ(y^n) − φ(ŷ): black +, white +, green =, blue =, gray −
t = 3: ŷ = [current prediction]; φ(y^n) − φ(ŷ): black =, white =, green −, blue −, gray −
t = 4: ŷ = [current prediction]; φ(y^n) − φ(ŷ): black =, white =, green −, blue =, gray =
t = 5: ŷ = [current prediction]; φ(y^n) − φ(ŷ): black =, white =, green =, blue =, gray =
t = 6, ...: no more changes.
Images: [Carreira, Li, Sminchisescu, Object Recognition by Sequential Figure-Ground Ranking, IJCV 2010] 17 / 1

87 Solving S-SVM Training Numerically Structured Support Vector Machine: min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]. The subgradient method converges slowly. Can we do better? 18 / 1

88 Solving S-SVM Training Numerically Structured Support Vector Machine: min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]. The subgradient method converges slowly. Can we do better? Remember from the SVM: we can use inequalities and slack variables to encode the loss. 18 / 1

89 Solving S-SVM Training Numerically Structured SVM (equivalent formulation): Idea: slack variables. min_{w,ξ} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ξ^n subject to, for n = 1, ..., N: max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ] ≤ ξ^n. Note: ξ^n ≥ 0 is automatic, because the left hand side is non-negative. Differentiable objective, convex, N non-linear constraints. 19 / 1

90 Solving S-SVM Training Numerically Structured SVM (also equivalent formulation): Idea: expand the max-constraint into individual cases. min_{w,ξ} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ξ^n subject to, for n = 1, ..., N: Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ≤ ξ^n, for all y ∈ Y. Differentiable objective, convex, N·|Y| linear constraints. 20 / 1

91 Solving S-SVM Training Numerically Solve an S-SVM like a linear SVM: min_{w∈R^D, ξ∈R^N} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ξ^n subject to, for n = 1, ..., N: ⟨w, φ(x^n, y^n)⟩ − ⟨w, φ(x^n, y)⟩ ≥ Δ(y^n, y) − ξ^n, for all y ∈ Y. Introduce feature vectors δφ(x^n, y^n, y) := φ(x^n, y^n) − φ(x^n, y). 21 / 1

92 Solving S-SVM Training Numerically Solve min_{w∈R^D, ξ∈R^N} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ξ^n subject to, for n = 1, ..., N and for all y ∈ Y: ⟨w, δφ(x^n, y^n, y)⟩ ≥ Δ(y^n, y) − ξ^n. Same structure as an ordinary SVM! quadratic objective, linear constraints. 22 / 1

93 Solving S-SVM Training Numerically Solve min_{w∈R^D, ξ∈R^N} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ξ^n subject to, for n = 1, ..., N and for all y ∈ Y: ⟨w, δφ(x^n, y^n, y)⟩ ≥ Δ(y^n, y) − ξ^n. Same structure as an ordinary SVM! quadratic objective, linear constraints. Question: Can we use an ordinary SVM/QP solver? 22 / 1

94 Solving S-SVM Training Numerically Solve min_{w∈R^D, ξ∈R^N} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ξ^n subject to, for n = 1, ..., N and for all y ∈ Y: ⟨w, δφ(x^n, y^n, y)⟩ ≥ Δ(y^n, y) − ξ^n. Same structure as an ordinary SVM! quadratic objective, linear constraints. Question: Can we use an ordinary SVM/QP solver? Answer: Almost! We could, if there weren't N·|Y| constraints. E.g. 100 binary images: astronomically many constraints. 22 / 1

95 Solving S-SVM Training Numerically Working Set Solution: working set training. It's enough if we enforce the active constraints; the others will be fulfilled automatically. We don't know which ones are active for the optimal solution, but it's likely to be only a small number (this can of course be formalized). Keep a set of potentially active constraints and update it iteratively: 23 / 1

96 Solving S-SVM Training Numerically Working Set Solution: working set training. It's enough if we enforce the active constraints; the others will be fulfilled automatically. We don't know which ones are active for the optimal solution, but it's likely to be only a small number (this can of course be formalized). Keep a set of potentially active constraints and update it iteratively: start with the working set S = ∅ (no constraints); repeat until convergence: solve the S-SVM training problem with the constraints from S; check if the solution violates any constraint of the full constraint set; if no: we found the optimal solution, terminate; if yes: add the most violated constraints to S, iterate. 23 / 1

97 Solving S-SVM Training Numerically Working Set Solution: working set training. It's enough if we enforce the active constraints; the others will be fulfilled automatically. We don't know which ones are active for the optimal solution, but it's likely to be only a small number (this can of course be formalized). Keep a set of potentially active constraints and update it iteratively: start with the working set S = ∅ (no constraints); repeat until convergence: solve the S-SVM training problem with the constraints from S; check if the solution violates any constraint of the full constraint set; if no: we found the optimal solution, terminate; if yes: add the most violated constraints to S, iterate. Good practical performance and theoretical guarantees: polynomial time convergence to within ε of the global optimum. 23 / 1

98 Working Set S-SVM Training
input: training pairs {(x^1, y^1), ..., (x^N, y^N)} ⊂ X × Y, feature map φ(x, y), loss function Δ(y, y′), regularizer λ
1: w ← 0, S ← ∅
2: repeat
3:   (w, ξ) ← solution to the QP only with the constraints from S
4:   for n = 1, ..., N do
5:     ŷ ← argmax_{y∈Y} Δ(y^n, y) + ⟨w, φ(x^n, y)⟩
6:     if ŷ ≠ y^n then
7:       S ← S ∪ {(x^n, ŷ)}
8:     end if
9:   end for
10: until S doesn't change anymore
output: prediction function f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.
Observation: each update of w needs N argmax-predictions (one per example), but we solve globally for the next w, not by local steps. 24 / 1
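A sketch of working-set training; `solve_qp(S)` is an assumed helper that returns (w, xi) for the QP restricted to the constraints in S (any off-the-shelf QP solver could fill that role), and constraints are stored as (example index, label) pairs so they are hashable.

    import numpy as np

    def working_set_train(data, feat, labels, delta, D, solve_qp, max_rounds=50):
        S = set()
        w = np.zeros(D)
        for _ in range(max_rounds):
            w, xi = solve_qp(S)
            changed = False
            for n, (x, y) in enumerate(data):
                # most violated constraint for example n
                y_hat = max(labels, key=lambda yy: delta(y, yy) + np.dot(w, feat(x, yy)))
                if y_hat != y and (n, y_hat) not in S:
                    S.add((n, y_hat))
                    changed = True
            if not changed:      # S did not change anymore
                break
        return w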

99 Example: Object Localization X = {images}, Y = {object bounding boxes} ⊂ R^4. Training examples: [images with ground-truth bounding boxes]. Goal: f : X → Y. Loss function: area overlap, Δ(y, y′) = 1 − area(y ∩ y′) / area(y ∪ y′). [Blaschko, Lampert: Learning to Localize Objects with Structured Output Regression, ECCV 2008] 25 / 1
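The area-overlap loss is simple to compute for axis-aligned boxes; this sketch assumes boxes are given as (left, top, right, bottom) tuples, which is an assumed encoding.

    def area_overlap_loss(y, y_bar):
        """Delta(y, y_bar) = 1 - area(y ∩ y_bar) / area(y ∪ y_bar)."""
        l, t = max(y[0], y_bar[0]), max(y[1], y_bar[1])
        r, b = min(y[2], y_bar[2]), min(y[3], y_bar[3])
        inter = max(0.0, r - l) * max(0.0, b - t)
        area = lambda bb: (bb[2] - bb[0]) * (bb[3] - bb[1])
        union = area(y) + area(y_bar) - inter
        return 1.0 - inter / union if union > 0 else 0.0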

100 Example: Object Localization Structured SVM: φ(x, y) := bag-of-words histogram of region y in image x. min_{w∈R^D, ξ∈R^N} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ξ^n subject to, for n = 1, ..., N: ⟨w, φ(x^n, y^n)⟩ − ⟨w, φ(x^n, y)⟩ ≥ Δ(y^n, y) − ξ^n, for all y ∈ Y. Interpretation: for every image, the correct bounding box y^n should have a higher score than any wrong bounding box; less overlap between the boxes → bigger required difference in score. 26 / 1

101 Example: Object Localization Working set training, Step 1: w = 0. For every example: ŷ ← argmax_{y∈Y} Δ(y^n, y) + ⟨w, φ(x^n, y)⟩, where the second term is 0, so ŷ has maximal Δ-loss, i.e. minimal overlap with y^n. If ŷ ≠ y^n: add the constraint ⟨w, φ(x^n, y^n)⟩ − ⟨w, φ(x^n, ŷ)⟩ ≥ 1 − ξ^n. Note: similar to binary SVM training for object detection: positive examples: ground truth bounding boxes; negative examples: random boxes from the image background. 27 / 1

102 Example: Object Localization Working set training, Later Steps: For every example: ŷ ← argmax_{y∈Y} Δ(y^n, y) + ⟨w, φ(x^n, y)⟩, a bias towards wrong regions plus the object detection score. If ŷ = y^n: do nothing; else: add the constraint ⟨w, φ(x^n, y^n)⟩ − ⟨w, φ(x^n, ŷ)⟩ ≥ Δ(y^n, ŷ) − ξ^n, which enforces ŷ to have a lower score after re-training. Note: similar to hard negative mining for object detection: perform detection on the training image; if the detected region is far from the ground truth, add it as a negative example. Difference: the S-SVM handles regions that overlap with the ground truth. 28 / 1

103 Kernelized S-SVM We can also kernelize the S-SVM optimization: max_{α∈R_+^{N·|Y|}} ∑_{n=1,...,N} ∑_{y∈Y} α_{ny} Δ(y^n, y) − (1/2) ∑_{n,n̄=1,...,N} ∑_{y,ȳ∈Y} α_{ny} α_{n̄ȳ} K_{n n̄ y ȳ}, subject to, for n = 1, ..., N: ∑_{y∈Y} α_{ny} ≤ 2/(λN), where K_{n n̄ y ȳ} = ⟨δφ(x^n, y^n, y), δφ(x^{n̄}, y^{n̄}, ȳ)⟩. N·|Y| many variables: train with a working set of the α_{ny}. Kernelized prediction function: f(x) = argmax_{y∈Y} ∑_{n,y′} α_{ny′} k((x^n, y′), (x, y)). Not very popular in Computer Vision (quickly becomes inefficient). 29 / 1
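A sketch of the kernelized prediction rule; the layout of the dual variables (alpha[n] as a sparse dict over labels) and the joint kernel k((x, y), (x', y')) are assumptions of this sketch.

    def kernelized_predict(x, alpha, train_data, labels, joint_kernel):
        """f(x) = argmax_y sum_{n,y'} alpha_{n,y'} * k((x^n, y'), (x, y))."""
        def score(y):
            s = 0.0
            for n, (xn, _) in enumerate(train_data):
                for yp, a in alpha[n].items():   # only the non-zero (working set) entries
                    s += a * joint_kernel((xn, yp), (x, y))
            return s
        return max(labels, key=score)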

104 SSVMs with Latent Variables Latent variables are also possible in S-SVMs: x ∈ X always observed, y ∈ Y observed only in training, z ∈ Z never observed (latent). Decision function: f(x) = argmax_{y∈Y} max_{z∈Z} ⟨w, φ(x, y, z)⟩. 30 / 1

105 SSVMs with Latent Variables Latent variables are also possible in S-SVMs: x ∈ X always observed, y ∈ Y observed only in training, z ∈ Z never observed (latent). Decision function: f(x) = argmax_{y∈Y} max_{z∈Z} ⟨w, φ(x, y, z)⟩. Maximum Margin Training with Maximization over Latent Variables: solve min_{w,ξ} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} ℓ^n_w(y), with ℓ^n_w(y) = Δ(y^n, y) + max_{z∈Z} ⟨w, φ(x^n, y, z)⟩ − max_{z∈Z} ⟨w, φ(x^n, y^n, z)⟩. Problem: not convex → can have local minima. [Yu, Joachims, Learning Structural SVMs with Latent Variables, 2009]; similar: [Felzenszwalb et al., A Discriminatively Trained, Multiscale, Deformable Part Model, 2008], but Y = {±1}. 30 / 1

106 Summary S-SVM Learning Given: training set {(x^1, y^1), ..., (x^N, y^N)} ⊂ X × Y, loss function Δ : Y × Y → R. Parameterize f(x) := argmax_y ⟨w, φ(x, y)⟩. Task: find w that minimizes the expected loss on future data, E_{(x,y)} Δ(y, f(x)). 31 / 1

107 Summary S-SVM Learning Given: training set {(x^1, y^1), ..., (x^N, y^N)} ⊂ X × Y, loss function Δ : Y × Y → R. Parameterize f(x) := argmax_y ⟨w, φ(x, y)⟩. Task: find w that minimizes the expected loss on future data, E_{(x,y)} Δ(y, f(x)). The S-SVM solution is derived from regularized risk minimization: enforce the correct output to be better than all others by a margin: ⟨w, φ(x^n, y^n)⟩ ≥ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ for all y ∈ Y. Convex optimization problem, but non-differentiable; many equivalent formulations → different training algorithms; training needs many argmax predictions, but no probabilistic inference. Latent variables are possible, but optimization becomes non-convex. 31 / 1

108 Summary S-SVM Learning Structured Learning is full of Open Research Questions How to train faster? CRFs need many runs of probabilistic inference, S-SVMs need many runs of argmax-prediction. How to reduce the necessary amount of training data? semi-supervised learning? transfer learning? How can we better understand different loss functions? how important is it to optimize the right loss? Can we understand structured learning with approximate inference? often computing ∇L(w) or argmax_y ⟨w, φ(x, y)⟩ exactly is infeasible; can we guarantee good results even with approximate inference? More and new applications! 32 / 1

109 Ad: Positions at IST Austria, Vienna IST Austria Graduate School: enter with MSc or BSc, 1(2) + 3 yr PhD program. Computer Vision/Machine Learning (me, Vladimir Kolmogorov), Computer Graphics (C. Wojtan), Comp. Topology (H. Edelsbrunner), Game Theory (K. Chatterjee), Software Verification (T. Henzinger), Cryptography (K. Pietrzak), Comp. Neuroscience (G. Tkacik), Random Matrix Theory (L. Erdös), Statistics (C. Uhler), and more... Fully funded positions. Postdoc positions in my group: see chl More info: Internships: send me an email! 33 / 1

110 Additional Material 34 / 1

111 Solving S-SVM Training Numerically One-Slack One-Slack Formulation of the S-SVM (equivalent to the ordinary S-SVM formulation by ξ = (1/N) ∑_n ξ^n): min_{w∈R^D, ξ∈R_+} (λ/2)‖w‖² + ξ subject to, for all (ŷ^1, ..., ŷ^N) ∈ Y × ... × Y: ∑_{n=1}^N [ Δ(y^n, ŷ^n) + ⟨w, φ(x^n, ŷ^n)⟩ − ⟨w, φ(x^n, y^n)⟩ ] ≤ N ξ. 35 / 1

112 Solving S-SVM Training Numerically One-Slack One-Slack Formulation of the S-SVM (equivalent to the ordinary S-SVM formulation by ξ = (1/N) ∑_n ξ^n): min_{w∈R^D, ξ∈R_+} (λ/2)‖w‖² + ξ subject to, for all (ŷ^1, ..., ŷ^N) ∈ Y × ... × Y: ∑_{n=1}^N [ Δ(y^n, ŷ^n) + ⟨w, φ(x^n, ŷ^n)⟩ − ⟨w, φ(x^n, y^n)⟩ ] ≤ N ξ. |Y|^N linear constraints, convex, differentiable objective. We blew up the constraint set even further: 100 binary images: vastly more constraints than before. 35 / 1

113 Solving S-SVM Training Numerically One-Slack Working Set One-Slack S-SVM Training
input: training pairs {(x^1, y^1), ..., (x^N, y^N)} ⊂ X × Y, feature map φ(x, y), loss function Δ(y, y′), regularizer λ
1: S ← ∅
2: repeat
3:   (w, ξ) ← solution to the QP only with the constraints from S
4:   for n = 1, ..., N do
5:     ŷ^n ← argmax_{y∈Y} Δ(y^n, y) + ⟨w, φ(x^n, y)⟩
6:   end for
7:   S ← S ∪ { ((x^1, ..., x^N), (ŷ^1, ..., ŷ^N)) }
8: until S doesn't change anymore
output: prediction function f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.
Often faster convergence: we add one strong constraint per iteration instead of N weak ones. 36 / 1

Probabilistic Graphical Models & Applications

Probabilistic Graphical Models & Applications Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with

More information

Part 4: Conditional Random Fields

Part 4: Conditional Random Fields Part 4: Conditional Random Fields Sebastian Nowozin and Christoph H. Lampert Colorado Springs, 25th June 2011 1 / 39 Problem (Probabilistic Learning) Let d(y x) be the (unknown) true conditional distribution.

More information

Learning with Structured Inputs and Outputs

Learning with Structured Inputs and Outputs Learning with Structured Inputs and Outputs Christoph Lampert Institute of Science and Technology Austria Vienna (Klosterneuburg), Austria INRIA Summer School 2010 Learning with Structured Inputs and Outputs

More information

Introduction To Graphical Models

Introduction To Graphical Models Peter Gehler Introduction to Graphical Models Introduction To Graphical Models Peter V. Gehler Max Planck Institute for Intelligent Systems, Tübingen, Germany ENS/INRIA Summer School, Paris, July 2013

More information

Algorithms for Predicting Structured Data

Algorithms for Predicting Structured Data 1 / 70 Algorithms for Predicting Structured Data Thomas Gärtner / Shankar Vembu Fraunhofer IAIS / UIUC ECML PKDD 2010 Structured Prediction 2 / 70 Predicting multiple outputs with complex internal structure

More information

Lecture 9: PGM Learning

Lecture 9: PGM Learning 13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and

More information

ML4NLP Multiclass Classification

ML4NLP Multiclass Classification ML4NLP Multiclass Classification CS 590NLP Dan Goldwasser Purdue University dgoldwas@purdue.edu Social NLP Last week we discussed the speed-dates paper. Interesting perspective on NLP problems- Can we

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Kernelized Perceptron Support Vector Machines

Kernelized Perceptron Support Vector Machines Kernelized Perceptron Support Vector Machines Emily Fox University of Washington February 13, 2017 What is the perceptron optimizing? 1 The perceptron algorithm [Rosenblatt 58, 62] Classification setting:

More information

Algorithmic Stability and Generalization Christoph Lampert

Algorithmic Stability and Generalization Christoph Lampert Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010

More information

CS60021: Scalable Data Mining. Large Scale Machine Learning

CS60021: Scalable Data Mining. Large Scale Machine Learning J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance

More information

Support Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Support Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington Support Vector Machines CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 A Linearly Separable Problem Consider the binary classification

More information

Introduction to Logistic Regression and Support Vector Machine

Introduction to Logistic Regression and Support Vector Machine Introduction to Logistic Regression and Support Vector Machine guest lecturer: Ming-Wei Chang CS 446 Fall, 2009 () / 25 Fall, 2009 / 25 Before we start () 2 / 25 Fall, 2009 2 / 25 Before we start Feel

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

Support vector machines Lecture 4

Support vector machines Lecture 4 Support vector machines Lecture 4 David Sontag New York University Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin Q: What does the Perceptron mistake bound tell us? Theorem: The

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

1 Machine Learning Concepts (16 points)

1 Machine Learning Concepts (16 points) CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Machine Learning for Structured Prediction

Machine Learning for Structured Prediction Machine Learning for Structured Prediction Grzegorz Chrupa la National Centre for Language Technology School of Computing Dublin City University NCLT Seminar Grzegorz Chrupa la (DCU) Machine Learning for

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Learning with Structured Inputs and Outputs

Learning with Structured Inputs and Outputs Learning with Structured Inputs and Outputs Christoph H. Lampert IST Austria (Institute of Science and Technology Austria), Vienna Microsoft Machine Learning and Intelligence School 2015 July 29-August

More information

Gaussian and Linear Discriminant Analysis; Multiclass Classification

Gaussian and Linear Discriminant Analysis; Multiclass Classification Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015

More information

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation

More information

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima. http://goo.gl/xilnmn Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT

More information

Logistic Regression. Machine Learning Fall 2018

Logistic Regression. Machine Learning Fall 2018 Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers

More information

SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels

SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels Karl Stratos June 21, 2018 1 / 33 Tangent: Some Loose Ends in Logistic Regression Polynomial feature expansion in logistic

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima. http://goo.gl/jv7vj9 Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT

More information

Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data

Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data DD2424 March 23, 2017 Binary classification problem given labelled training data Have labelled training examples? Given

More information

Lecture 3: Multiclass Classification

Lecture 3: Multiclass Classification Lecture 3: Multiclass Classification Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Some slides are adapted from Vivek Skirmar and Dan Roth CS6501 Lecture 3 1 Announcement v Please enroll in

More information

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier

More information

Parameter learning in CRF s

Parameter learning in CRF s Parameter learning in CRF s June 01, 2009 Structured output learning We ish to learn a discriminant (or compatability) function: F : X Y R (1) here X is the space of inputs and Y is the space of outputs.

More information

Machine Learning. Classification, Discriminative learning. Marc Toussaint University of Stuttgart Summer 2015

Machine Learning. Classification, Discriminative learning. Marc Toussaint University of Stuttgart Summer 2015 Machine Learning Classification, Discriminative learning Structured output, structured input, discriminative function, joint input-output features, Likelihood Maximization, Logistic regression, binary

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015

Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015 Sequence Modelling with Features: Linear-Chain Conditional Random Fields COMP-599 Oct 6, 2015 Announcement A2 is out. Due Oct 20 at 1pm. 2 Outline Hidden Markov models: shortcomings Generative vs. discriminative

More information

Linear classifiers: Overfitting and regularization

Linear classifiers: Overfitting and regularization Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Warm up: risk prediction with logistic regression

Warm up: risk prediction with logistic regression Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T

More information

Support Vector Machine

Support Vector Machine Andrea Passerini passerini@disi.unitn.it Machine Learning Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

More information

Algorithms for NLP. Classification II. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley

Algorithms for NLP. Classification II. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Algorithms for NLP Classification II Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Minimize Training Error? A loss function declares how costly each mistake is E.g. 0 loss for correct label,

More information

Regression with Numerical Optimization. Logistic

Regression with Numerical Optimization. Logistic CSG220 Machine Learning Fall 2008 Regression with Numerical Optimization. Logistic regression Regression with Numerical Optimization. Logistic regression based on a document by Andrew Ng October 3, 204

More information

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly

More information

Logistic Regression. COMP 527 Danushka Bollegala

Logistic Regression. COMP 527 Danushka Bollegala Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will

More information

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018 From Binary to Multiclass Classification CS 6961: Structured Prediction Spring 2018 1 So far: Binary Classification We have seen linear models Learning algorithms Perceptron SVM Logistic Regression Prediction

More information

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so.

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so. CS 89 Spring 07 Introduction to Machine Learning Midterm Please do not open the exam before you are instructed to do so. The exam is closed book, closed notes except your one-page cheat sheet. Electronic

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information

Support Vector Machines

Support Vector Machines Two SVM tutorials linked in class website (please, read both): High-level presentation with applications (Hearst 1998) Detailed tutorial (Burges 1998) Support Vector Machines Machine Learning 10701/15781

More information

Announcements - Homework

Announcements - Homework Announcements - Homework Homework 1 is graded, please collect at end of lecture Homework 2 due today Homework 3 out soon (watch email) Ques 1 midterm review HW1 score distribution 40 HW1 total score 35

More information

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES Supervised Learning Linear vs non linear classifiers In K-NN we saw an example of a non-linear classifier: the decision boundary

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Max-margin learning of GM Eric Xing Lecture 28, Apr 28, 2014 b r a c e Reading: 1 Classical Predictive Models Input and output space: Predictive

More information

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2

More information

CSC 411 Lecture 17: Support Vector Machine

CSC 411 Lecture 17: Support Vector Machine CSC 411 Lecture 17: Support Vector Machine Ethan Fetaya, James Lucas and Emad Andrews University of Toronto CSC411 Lec17 1 / 1 Today Max-margin classification SVM Hard SVM Duality Soft SVM CSC411 Lec17

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu Logistic Regression Review 10-601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data

More information

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Perceptrons Definition Perceptron learning rule Convergence Margin & max margin classifiers (Linear) support vector machines Formulation

More information

Intelligent Systems:

Intelligent Systems: Intelligent Systems: Undirected Graphical models (Factor Graphs) (2 lectures) Carsten Rother 15/01/2015 Intelligent Systems: Probabilistic Inference in DGM and UGM Roadmap for next two lectures Definition

More information

Logistic Regression. William Cohen

Logistic Regression. William Cohen Logistic Regression William Cohen 1 Outline Quick review classi5ication, naïve Bayes, perceptrons new result for naïve Bayes Learning as optimization Logistic regression via gradient ascent Over5itting

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Logistic Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2014 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released

More information

CS-E4830 Kernel Methods in Machine Learning

CS-E4830 Kernel Methods in Machine Learning CS-E4830 Kernel Methods in Machine Learning Lecture 5: Multi-class and preference learning Juho Rousu 11. October, 2017 Juho Rousu 11. October, 2017 1 / 37 Agenda from now on: This week s theme: going

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Machine Learning A Geometric Approach

Machine Learning A Geometric Approach Machine Learning A Geometric Approach CIML book Chap 7.7 Linear Classification: Support Vector Machines (SVM) Professor Liang Huang some slides from Alex Smola (CMU) Linear Separator Ham Spam From Perceptron

More information

Reading Group on Deep Learning Session 1

Reading Group on Deep Learning Session 1 Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should

More information

CS395T: Structured Models for NLP Lecture 2: Machine Learning Review

CS395T: Structured Models for NLP Lecture 2: Machine Learning Review CS395T: Structured Models for NLP Lecture 2: Machine Learning Review Greg Durrett Some slides adapted from Vivek Srikumar, University of Utah Administrivia Course enrollment Lecture slides posted on website

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2016 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

ECE521 Lecture7. Logistic Regression

ECE521 Lecture7. Logistic Regression ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard

More information

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University Lecture 18: Kernels Risk and Loss Support Vector Regression Aykut Erdem December 2016 Hacettepe University Administrative We will have a make-up lecture on next Saturday December 24, 2016 Presentations

More information

ML (cont.): SUPPORT VECTOR MACHINES

ML (cont.): SUPPORT VECTOR MACHINES ML (cont.): SUPPORT VECTOR MACHINES CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 40 Support Vector Machines (SVMs) The No-Math Version

More information

(Kernels +) Support Vector Machines

(Kernels +) Support Vector Machines (Kernels +) Support Vector Machines Machine Learning Torsten Möller Reading Chapter 5 of Machine Learning An Algorithmic Perspective by Marsland Chapter 6+7 of Pattern Recognition and Machine Learning

More information

Towards Lifelong Machine Learning Multi-Task and Lifelong Learning with Unlabeled Tasks Christoph Lampert

Towards Lifelong Machine Learning Multi-Task and Lifelong Learning with Unlabeled Tasks Christoph Lampert Towards Lifelong Machine Learning Multi-Task and Lifelong Learning with Unlabeled Tasks Christoph Lampert HSE Computer Science Colloquium September 6, 2016 IST Austria (Institute of Science and Technology

More information

SUPPORT VECTOR MACHINE

SUPPORT VECTOR MACHINE SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition

More information

Natural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley

Natural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley Natural Language Processing Classification Classification I Dan Klein UC Berkeley Classification Automatically make a decision about inputs Example: document category Example: image of digit digit Example:

More information

Linear discriminant functions

Linear discriminant functions Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative

More information

Linear classifiers: Logistic regression

Linear classifiers: Logistic regression Linear classifiers: Logistic regression STAT/CSE 416: Machine Learning Emily Fox University of Washington April 19, 2018 How confident is your prediction? The sushi & everything else were awesome! The

More information

CS Machine Learning Qualifying Exam

CS Machine Learning Qualifying Exam CS Machine Learning Qualifying Exam Georgia Institute of Technology March 30, 2017 The exam is divided into four areas: Core, Statistical Methods and Models, Learning Theory, and Decision Processes. There

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Linear Classifiers. Blaine Nelson, Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Linear Classifiers. Blaine Nelson, Tobias Scheffer Universität Potsdam Institut für Informatik Lehrstuhl Linear Classifiers Blaine Nelson, Tobias Scheffer Contents Classification Problem Bayesian Classifier Decision Linear Classifiers, MAP Models Logistic

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

More information

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014. Clustering K-means Machine Learning CSE546 Carlos Guestrin University of Washington November 4, 2014 1 Clustering images Set of Images [Goldberger et al.] 2 1 K-means Randomly initialize k centers µ (0)

More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods Support Vector Machines and Kernel Methods Geoff Gordon ggordon@cs.cmu.edu July 10, 2003 Overview Why do people care about SVMs? Classification problems SVMs often produce good results over a wide range

More information