Learning with Structured Inputs and Outputs

Learning with Structured Inputs and Outputs Christoph H. Lampert IST Austria (Institute of Science and Technology Austria), Vienna ENS/INRIA Summer School, Paris, July 2013 Slides: http://www.ist.ac.at/~chl/ 1 / 10

Schedule
Monday: Introduction to Graphical Models
9:00-9:45: Conditional Random Fields
9:45-10:30: Structured Support Vector Machines
Slides available on my home page: http://www.ist.ac.at/~chl 2 / 10

Extended version of the lecture in book form (180 pages): Foundations and Trends in Computer Graphics and Vision, now publishers, http://www.nowpublishers.com/ Available as PDF on http://pub.ist.ac.at/~chl/ 3 / 10

Standard Regression/Classification: f : X → R. Structured Output Learning: f : X → Y. 4 / 10

Standard Regression/Classification: f : X → R. Inputs x ∈ X can be any kind of objects; the output y is a real number. Structured Output Learning: f : X → Y. Inputs x ∈ X can be any kind of objects; outputs y ∈ Y are complex (structured) objects. 5 / 10

What is structured data? Ad hoc definition: data that consists of several parts, where not only the parts themselves contain information, but also the way in which the parts belong together. Examples: text, molecules / chemical structures, documents / hypertext, images. 6 / 10

What is structured output prediction? Ad hoc definition: predicting structured outputs from input data (in contrast to predicting just a single number, as in classification or regression).
Natural Language Processing: Automatic Translation (output: sentences), Sentence Parsing (output: parse trees)
Bioinformatics: Secondary Structure Prediction (output: bipartite graphs), Enzyme Function Prediction (output: path in a tree)
Speech Processing: Automatic Transcription (output: sentences), Text-to-Speech (output: audio signal)
Robotics: Planning (output: sequence of actions)
This tutorial: Applications and Examples from Computer Vision 7 / 10

Reminder: Graphical Model for Pose Estimation
[Figure: factor graph over the body-part variables Y_top, Y_head, Y_torso, Y_larm, Y_rarm, Y_lhnd, Y_rhnd, Y_lleg, Y_rleg, Y_lfoot, Y_rfoot, connected to the image X through factors such as F^(1)_top and F^(2)_top,head]
Joint probability distribution of all body parts: p(y|x) = (1/Z(x)) exp(−∑_{F∈F} E_F(y_F; x)). The exponent ("energy") decomposes into small but interacting factors. 8 / 10

Reminder: Graphical Model for Image Segmentation
Probability distribution over all foreground/background segmentations: p(y|x) = (1/Z(x)) exp(−∑_{F∈F} E_F(y_F; x)). The exponent ("energy") decomposes into small but interacting factors. 9 / 10

Reminder: Inference/Prediction
Monday: Probabilistic Inference. Compute marginal probabilities p(y_F|x) for any factor F, in particular p(y_i|x) for all i ∈ V.
Monday: MAP Prediction. Predict f : X → Y by solving y* = argmax_{y∈Y} p(y|x) = argmin_{y∈Y} E(y, x).
Today: Parameter Learning. Learn the potentials/energy terms from training data. 10 / 10

Part 1: Conditional Random Fields

Supervised Learning Problem
Given training examples (x^1, y^1), ..., (x^N, y^N) ∈ X × Y, how to make predictions g : X → Y?
Approach 1) Discriminative Probabilistic Learning
1) Use the training data to obtain an estimate of p(y|x). 2) Use f(x) = argmin_{ȳ∈Y} ∑_{y∈Y} p(y|x) Δ(y, ȳ) to make predictions.
Approach 2) Loss-minimizing Parameter Estimation
1) Use the training data to learn an energy function E(x, y). 2) Use f(x) := argmin_{y∈Y} E(x, y) to make predictions. 2 / 29

Conditional Random Field Learning
Goal: learn a posterior distribution p(y|x) = (1/Z(x)) exp(−∑_{F∈F} E_F(y_F; x)) with F = {all factors}: all unary, pairwise, potentially higher order, ...
Parameterize each E_F(y_F; x) = ⟨w_F, φ_F(x, y_F)⟩, with fixed feature functions φ(x, y) = (φ_1(x, y_1), ..., φ_|F|(x, y_|F|)) and weight vectors w = (w_1, ..., w_|F|).
Result: a log-linear model with parameter vector w:
p(y|x; w) = (1/Z(x; w)) exp(⟨w, φ(x, y)⟩)   with Z(x; w) = ∑_{ȳ∈Y} exp(⟨w, φ(x, ȳ)⟩)
New goal: find the best parameter vector w ∈ R^D. 3 / 29

Maximum Likelihood Parameter Estimation
Idea 1: maximize the likelihood of the outputs y^1, ..., y^N for the inputs x^1, ..., x^N:
w* = argmax_{w∈R^D} p(y^1, ..., y^N | x^1, ..., x^N, w)
   (i.i.d.) = argmax_{w∈R^D} ∏_{n=1}^N p(y^n | x^n, w)
   (take −log) = argmin_{w∈R^D} −∑_{n=1}^N log p(y^n | x^n, w)   ... the negative conditional log-likelihood (of D)
4 / 29

MAP Estimation of w
Idea 2: treat w as a random variable and maximize the posterior p(w|D):
p(w|D) (Bayes) = p(x^1, y^1, ..., x^N, y^N | w) p(w) / p(D)   (i.i.d.) = p(w) ∏_{n=1}^N p(y^n | x^n, w) / p(y^n | x^n)
p(w): prior belief on w (cannot be estimated from data).
w* = argmax_{w∈R^D} p(w|D) = argmin_{w∈R^D} [ −log p(w|D) ]
   = argmin_{w∈R^D} [ −log p(w) − ∑_{n=1}^N log p(y^n | x^n, w) + ∑_{n=1}^N log p(y^n | x^n) ]   (the last sum is indep. of w)
   = argmin_{w∈R^D} [ −log p(w) − ∑_{n=1}^N log p(y^n | x^n, w) ]
5 / 29

w* = argmin_{w∈R^D} [ −log p(w) − ∑_{n=1}^N log p(y^n | x^n, w) ]
Choices for p(w):
p(w) :≡ const. (uniform; on R^D not really a distribution):
w* = argmin_{w∈R^D} [ −∑_{n=1}^N log p(y^n | x^n, w) + const. ]   ... the negative conditional log-likelihood
p(w) := const. · e^{−(λ/2)‖w‖²} (Gaussian):
w* = argmin_{w∈R^D} [ (λ/2)‖w‖² − ∑_{n=1}^N log p(y^n | x^n, w) + const. ]   ... the regularized negative conditional log-likelihood
6 / 29

Probabilistic Models for Structured Prediction - Summary
Negative (regularized) conditional log-likelihood (of D):
L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ −⟨w, φ(x^n, y^n)⟩ + log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} ]   (λ → 0 makes it unregularized)
Probabilistic parameter estimation or training means solving w* = argmin_{w∈R^D} L(w).
Same optimization problem as for multi-class logistic regression. 7 / 29
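
To make the objective concrete, here is a minimal NumPy sketch (not from the slides; the toy multiclass setup and all names are illustrative) that evaluates L(w) by brute-force enumeration of Y, using a joint feature map φ(x, y) that stacks x into the block belonging to label y:

```python
import numpy as np

def joint_feature(x, y, K):
    """phi(x, y): copy x into the block of label y, zeros elsewhere."""
    phi = np.zeros(K * x.size)
    phi[y * x.size:(y + 1) * x.size] = x
    return phi

def neg_cond_log_likelihood(w, X, Y, K, lam):
    """L(w) = lam/2 ||w||^2 + sum_n [ -<w,phi(x^n,y^n)> + log sum_y e^{<w,phi(x^n,y)>} ]."""
    L = 0.5 * lam * w.dot(w)
    for x, y in zip(X, Y):
        scores = np.array([w.dot(joint_feature(x, k, K)) for k in range(K)])
        log_Z = scores.max() + np.log(np.exp(scores - scores.max()).sum())  # stable log-sum-exp
        L += -scores[y] + log_Z
    return L

# toy data: 4 examples, 3 features, K = 2 labels
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
Y = np.array([0, 1, 1, 0])
print(neg_cond_log_likelihood(np.zeros(2 * 3), X, Y, K=2, lam=0.1))  # 4*log(2) at w = 0
```

For a real structured model, the explicit loop over Y is exactly what becomes infeasible; later slides discuss how to exploit the factor structure instead.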

Negative Conditional Log-Likelihood (Toy Example)
[Figure: contour plots of the negative conditional log-likelihood over a two-dimensional parameter space, unregularized and with increasing regularization] 8 / 29

Steepest Descent Minimization: minimize L(w)
input: tolerance ε > 0
1: w_cur ← 0
2: repeat
3:   v ← ∇_w L(w_cur)
4:   η ← argmin_{η∈R} L(w_cur − ηv)
5:   w_cur ← w_cur − ηv
6: until ‖v‖ < ε
output: w_cur
Alternatives: L-BFGS (second-order descent without explicit Hessian), Conjugate Gradient.
We always need (at least) the gradient of L. 9 / 29

L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ −⟨w, φ(x^n, y^n)⟩ + log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} ]
∇_w L(w) = λw + ∑_{n=1}^N [ −φ(x^n, y^n) + ∑_{y∈Y} ( e^{⟨w, φ(x^n, y)⟩} / ∑_{ȳ∈Y} e^{⟨w, φ(x^n, ȳ)⟩} ) φ(x^n, y) ]
          = λw + ∑_{n=1}^N [ −φ(x^n, y^n) + ∑_{y∈Y} p(y|x^n, w) φ(x^n, y) ]
          = λw + ∑_{n=1}^N [ −φ(x^n, y^n) + E_{y∼p(y|x^n,w)} φ(x^n, y) ]
∇²_w L(w) = λ Id_{D×D} + ∑_{n=1}^N Cov_{y∼p(y|x^n,w)} { φ(x^n, y) }
10 / 29
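
A minimal sketch of this gradient for the same toy multiclass setup as above (illustrative names, not from the slides), computing the expected feature vector by brute-force enumeration of Y and taking a few plain gradient-descent steps with a fixed step size instead of an exact line search:

```python
import numpy as np

def joint_feature(x, y, K):
    phi = np.zeros(K * x.size)
    phi[y * x.size:(y + 1) * x.size] = x
    return phi

def grad_neg_cond_log_likelihood(w, X, Y, K, lam):
    """lam*w + sum_n [ -phi(x^n,y^n) + E_{y~p(y|x^n,w)} phi(x^n,y) ]."""
    g = lam * w
    for x, y in zip(X, Y):
        feats = np.stack([joint_feature(x, k, K) for k in range(K)])
        scores = feats @ w
        p = np.exp(scores - scores.max())
        p /= p.sum()                        # p(y | x^n, w) for every y
        g = g - feats[y] + p @ feats        # empirical minus expected feature vector
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
Y = np.array([0, 1, 1, 0])
w = np.zeros(2 * 3)
for _ in range(100):                        # fixed step size in place of a line search
    w -= 0.1 * grad_neg_cond_log_likelihood(w, X, Y, K=2, lam=0.1)
print(w)
```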

L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ −⟨w, φ(x^n, y^n)⟩ + log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} ]
L is continuous (not discrete) and C^∞-differentiable on all of R^D.
[Figure: slice through the objective value for w_x ∈ [−3, 5], w_y = 0] 11 / 29

∇_w L(w) = λw + ∑_{n=1}^N [ −φ(x^n, y^n) + E_{y∼p(y|x^n,w)} φ(x^n, y) ]
For λ → 0: E_{y∼p(y|x^n,w)} φ(x^n, y) = φ(x^n, y^n) ⟺ ∇_w L(w) = 0, i.e. a critical point of L (local minimum/maximum/saddle point).
Interpretation: we want the model distribution to match the empirical one: E_{y∼p(y|x,w)} φ(x, y) =! φ(x, y_obs).
E.g. image segmentation: φ_unary: correct amount of foreground vs. background; φ_pairwise: correct amount of fg/bg transitions (smoothness). 12 / 29

∇²_w L(w) = λ Id_{D×D} + ∑_{n=1}^N Cov_{y∼p(y|x^n,w)} { φ(x^n, y) }: a positive definite Hessian matrix.
⟹ L(w) is convex, so ∇_w L(w) = 0 implies a global minimum.
[Figure: slice through the objective value for w_x ∈ [−3, 5], w_y = 0] 13 / 29

Milestone I: Probabilistic Training (Conditional Random Fields)
p(y|x, w) is log-linear in w ∈ R^D.
Training: minimize the negative conditional log-likelihood L(w).
L(w) is differentiable and convex; gradient descent will find the global optimum with ∇_w L(w) = 0.
Same structure as multi-class logistic regression.
For logistic regression: this is where the textbook ends. We're done.
For conditional random fields: we're not in safe waters, yet! 14 / 29

Solving the Training Optimization Problem Numerically
Task: compute v = ∇_w L(w_cur) and evaluate L(w_cur + ηv):
L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ −⟨w, φ(x^n, y^n)⟩ + log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} ]
∇_w L(w) = λw + ∑_{n=1}^N [ −φ(x^n, y^n) + ∑_{y∈Y} p(y|x^n, w) φ(x^n, y) ]
Problem: Y typically is very (exponentially) large:
binary image segmentation: |Y| = 2^{640×480} ≈ 10^{92475}; ranking N images: |Y| = N!, e.g. N = 1000: |Y| ≈ 10^{2568}.
We must use the structure in Y, or we're lost. 15 / 29

Solving the Training Optimization Problem Numerically
∇_w L(w) = λw + ∑_{n=1}^N [ −φ(x^n, y^n) + E_{y∼p(y|x^n,w)} φ(x^n, y) ]
Computing the gradient (naive): O(K^M N D)
L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ −⟨w, φ(x^n, y^n)⟩ + log Z(x^n, w) ]
Line search (naive): O(K^M N D) per evaluation of L
N: number of samples; D: dimension of feature space; M: number of output nodes (100s to 1,000,000s); K: number of possible labels of each output node (2 to 1000s). 16 / 29

Solving the Training Optimization Problem Numerically
In a graphical model with factors F, the features decompose: φ(x, y) = ( φ_F(x, y_F) )_{F∈F}
E_{y∼p(y|x,w)} φ(x, y) = ( E_{y∼p(y|x,w)} φ_F(x, y_F) )_{F∈F} = ( E_{y_F∼p(y_F|x,w)} φ_F(x, y_F) )_{F∈F}
E_{y_F∼p(y_F|x,w)} φ_F(x, y_F) = ∑_{y_F∈Y_F} p(y_F|x, w) φ_F(x, y_F),   with factor marginals µ_F = p(y_F|x, w).
These K^{|F|} terms are much smaller than the complete joint distribution p(y|x, w) and can be computed/approximated, e.g., with (loopy) belief propagation. 17 / 29
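
For chain-structured models the required marginals can be computed exactly by message passing. Below is a minimal, self-contained sketch (not from the slides; the score arrays `unary` and `pairwise` and the toy numbers are assumptions) of log-space forward-backward that returns the node marginals p(y_i | x, w) for a chain with M nodes and K labels:

```python
import numpy as np

def chain_node_marginals(unary, pairwise):
    """Node marginals of a chain CRF via forward-backward (log-space).

    unary:    (M, K) log-potentials, unary[i, k] = <w_i, phi_i(k, x)>
    pairwise: (M-1, K, K) log-potentials between neighbouring nodes
    """
    def logsumexp(a, axis):
        m = a.max(axis=axis, keepdims=True)
        return (m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))).squeeze(axis)

    M, K = unary.shape
    alpha = np.zeros((M, K))   # forward log-messages
    beta = np.zeros((M, K))    # backward log-messages
    for i in range(1, M):
        alpha[i] = logsumexp(alpha[i - 1, :, None] + unary[i - 1, :, None]
                             + pairwise[i - 1], axis=0)
    for i in range(M - 2, -1, -1):
        beta[i] = logsumexp(beta[i + 1, None, :] + unary[i + 1, None, :]
                            + pairwise[i], axis=1)
    log_marg = unary + alpha + beta
    log_Z = logsumexp(log_marg[0], axis=0)         # identical at every node
    return np.exp(log_marg - log_Z)                # each row sums to 1

# tiny example: 4 nodes, 2 labels, pairwise potentials that favour label agreement
rng = np.random.default_rng(1)
unary = rng.normal(size=(4, 2))
pairwise = np.tile(np.array([[1.0, 0.0], [0.0, 1.0]]), (3, 1, 1))
print(chain_node_marginals(unary, pairwise))
```

The cost is O(M K²) per example instead of O(K^M), which is exactly the gap discussed on the following slides.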

Solving the Training Optimization Problem Numerically
∇_w L(w) = λw + ∑_{n=1}^N [ −φ(x^n, y^n) + E_{y∼p(y|x^n,w)} φ(x^n, y) ]
Computing the gradient: O(K^M N D) → O(M K^{F_max} N D)
L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ −⟨w, φ(x^n, y^n)⟩ + log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} ]
Line search: O(K^M N D) → O(M K^{F_max} N D) per evaluation of L
N: number of samples (10s to 1,000,000s); D: dimension of feature space; M: number of output nodes; K: number of possible labels of each output node. 18 / 29

Solving the Training Optimization Problem Numerically
What if the training set D is too large (e.g. millions of examples)?
Stochastic Gradient Descent (SGD)
Minimize L(w), but without ever computing L(w) or ∇L(w) exactly.
In each gradient descent step: pick a random subset D′ ⊂ D, often just 1-3 elements, and follow the approximate gradient
∇L(w) ≈ λw + (|D| / |D′|) ∑_{(x^n,y^n)∈D′} [ −φ(x^n, y^n) + E_{y∼p(y|x^n,w)} φ(x^n, y) ]
Avoid line search by using a fixed stepsize rule η (a new parameter).
SGD converges to argmin_w L(w)! (if η is chosen right)
SGD needs more iterations, but each one is much faster.
more: see L. Bottou, O. Bousquet: The Tradeoffs of Large Scale Learning, NIPS 2008. also: http://leon.bottou.org/research/largescale 19 / 29
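
A minimal SGD sketch for this objective (same toy multiclass setup as in the earlier snippets; the stepsize choice and all names are illustrative assumptions): each update looks at a single randomly chosen example and uses a fixed step size instead of a line search.

```python
import numpy as np

def joint_feature(x, y, K):
    phi = np.zeros(K * x.size)
    phi[y * x.size:(y + 1) * x.size] = x
    return phi

def sgd_crf_train(X, Y, K, lam=0.1, eta=0.05, steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    w = np.zeros(K * X.shape[1])
    for _ in range(steps):
        n = rng.integers(N)                                # random subset D' of size 1
        feats = np.stack([joint_feature(X[n], k, K) for k in range(K)])
        scores = feats @ w
        p = np.exp(scores - scores.max())
        p /= p.sum()
        grad = lam * w + N * (-feats[Y[n]] + p @ feats)    # approximate gradient
        w -= (eta / N) * grad                              # fixed stepsize, no line search
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
Y = (X[:, 0] > 0).astype(int)
print(sgd_crf_train(X, Y, K=2))
```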

Solving the Training Optimization Problem Numerically
∇_w L(w) = λw + ∑_{n=1}^N [ −φ(x^n, y^n) + E_{y∼p(y|x^n,w)} φ(x^n, y) ]
Computing the gradient: O(K^M N D) → O(M K² N D) (if BP is possible)
L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ −⟨w, φ(x^n, y^n)⟩ + log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} ]
Line search: O(K^M N D) → O(M K² N D) per evaluation of L
N: number of samples; D: dimension of feature space (φ_{i,j}: 1-10s, φ_i: 100s to 10000s); M: number of output nodes; K: number of possible labels of each output node. 20 / 29

Solving the Training Optimization Problem Numerically
Typical feature functions in image segmentation:
φ_i(y_i, x) ∈ R^{1000}: local image features, e.g. bag-of-words → ⟨w_i, φ_i(y_i, x)⟩: local classifier (like logistic regression)
φ_{i,j}(y_i, y_j) = [y_i = y_j] ∈ R^1: test for same label → ⟨w_{ij}, φ_{ij}(y_i, y_j)⟩: penalizer for label changes (if w_{ij} > 0)
Combined: argmax_y p(y|x) is a smoothed version of the local cues.
[Figure: original image, local confidence, local + smoothness] 21 / 29

Solving the Training Optimization Problem Numerically
Typical feature functions in pose estimation:
φ_i(y_i, x) ∈ R^{1000}: local image representation, e.g. HoG → ⟨w_i, φ_i(y_i, x)⟩: local confidence map
φ_{i,j}(y_i, y_j) = good_fit(y_i, y_j) ∈ R^1: test for geometric fit → ⟨w_{ij}, φ_{ij}(y_i, y_j)⟩: penalizer for unrealistic poses
Together: argmax_y p(y|x) is a sanitized version of the local cues.
[Figure: original image, local confidence, local + geometry]
[V. Ferrari, M. Marin-Jimenez, A. Zisserman: Progressive Search Space Reduction for Human Pose Estimation, CVPR 2008.] 22 / 29

Solving the Training Optimization Problem Numerically
Idea: split the learning of the unary potentials into two parts: the local classifiers, and their importance.
Two-Stage Training
pre-train f^i_y(x) ≙ log p(y_i|x)
use φ_i(y_i, x) := f^i_y(x) ∈ R^K (low-dimensional)
keep φ_{ij}(y_i, y_j) as before
perform CRF learning with φ_i and φ_{ij}
Advantage: a lower-dimensional feature space during inference → faster; f^i_y(x) can be any classifier, e.g. non-linear SVMs, deep networks, ...
Disadvantage: if the local classifiers are bad, CRF training cannot fix that. 23 / 29
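
A sketch of the two-stage idea on toy data (everything here, including the softmax local classifier, is illustrative and not from the slides): first train an independent per-node classifier on local features, then use its log-probabilities as the K-dimensional unary feature φ_i(y_i, x), so that CRF training only has to weight it against the pairwise terms.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Stage 1: pre-train a local classifier p(y_i | x) by gradient descent on the log-loss
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                      # local features of 200 nodes
Y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)
K = 2
W_local = np.zeros((5, K))
for _ in range(300):
    P = softmax(X @ W_local)
    G = X.T @ (P - np.eye(K)[Y]) / len(X)          # gradient of the local log-loss
    W_local -= 0.5 * G

# Stage 2: the unary CRF feature is the (low-dimensional) local log-probability vector
phi_unary = np.log(softmax(X @ W_local))           # shape (200, K), one row per node
# CRF training then only learns weights for phi_unary and for the pairwise
# label-change feature phi_ij(y_i, y_j) = [y_i != y_j].
print(phi_unary[:3])
```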

Solving the Training Optimization Problem Numerically
CRF training is based on gradient-descent optimization. The faster we can do it, the better (more realistic) models we can use:
∇_w L(w) = λw + ∑_{n=1}^N [ −φ(x^n, y^n) + ∑_{y∈Y} p(y|x^n, w) φ(x^n, y) ] ∈ R^D
A lot of research on accelerating CRF training:
problem       | solution            | method(s)
Y too large   | exploit structure   | (loopy) belief propagation
              | smart sampling      | contrastive divergence
              | use approximate L   | e.g. pseudo-likelihood
N too large   | mini-batches        | stochastic gradient descent
D too large   | trained φ_unary     | two-stage training
24 / 29

CRFs with Latent Variables
So far, training was fully supervised: all variables were observed. In real life, some variables can be unobserved even during training: missing labels in the training data, or latent variables, e.g. part location, part occlusion, viewpoint. 25 / 29

CRFs with Latent Variables
Three types of variables in the graphical model: x ∈ X always observed (input), y ∈ Y observed only in training (output), z ∈ Z never observed (latent).
Example: x: image; y: part positions; z ∈ {0, 1}: flag front-view or side-view.
images: [Felzenszwalb et al., Object Detection with Discriminatively Trained Part Based Models, T-PAMI, 2010] 26 / 29

CRFs with Latent Variables
Marginalization over Latent Variables
Construct the conditional likelihood as usual: p(y, z|x, w) = (1/Z(x, w)) exp(⟨w, φ(x, y, z)⟩)
Derive p(y|x, w) by marginalizing over z: p(y|x, w) = ∑_{z∈Z} p(y, z|x, w) = (1/Z(x, w)) ∑_{z∈Z} exp(⟨w, φ(x, y, z)⟩) 27 / 29

Negative regularized conditional log-likelihood:
L(w) = (λ/2)‖w‖² − ∑_{n=1}^N log p(y^n | x^n, w)
     = (λ/2)‖w‖² − ∑_{n=1}^N log ∑_{z∈Z} p(y^n, z | x^n, w)
     = (λ/2)‖w‖² − ∑_{n=1}^N log ∑_{z∈Z} exp(⟨w, φ(x^n, y^n, z)⟩) + ∑_{n=1}^N log ∑_{z∈Z} ∑_{y∈Y} exp(⟨w, φ(x^n, y, z)⟩)
L is not convex in w → local minima possible.
How to train CRFs with latent variables is active research. 28 / 29

Summary CRF Learning
Given: a training set {(x^1, y^1), ..., (x^N, y^N)} ⊂ X × Y and feature functions φ : X × Y → R^D that decompose over factors, φ_F : X × Y_F → R^d for F ∈ F.
The overall model is log-linear (in the parameter w): p(y|x; w) ∝ e^{⟨w, φ(x, y)⟩}
CRF training requires minimizing the negative conditional log-likelihood:
w* = argmin_w (λ/2)‖w‖² + ∑_{n=1}^N [ −⟨w, φ(x^n, y^n)⟩ + log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} ]
This is a convex optimization problem → (stochastic) gradient descent works.
Training needs repeated runs of probabilistic inference.
Latent variables are possible, but make training non-convex. 29 / 29

Part 2: Structured Support Vector Machines

Supervised Learning Problem
Training examples (x^1, y^1), ..., (x^N, y^N) ∈ X × Y; loss function Δ : Y × Y → R. How to make predictions g : X → Y?
Approach 2) Loss-minimizing Parameter Estimation
1) Use the training data to learn an energy function E(x, y). 2) Use f(x) := argmin_{y∈Y} E(x, y) to make predictions.
Slight variation (for historic reasons):
1) Learn a compatibility function g(x, y) (think: g = −E). 2) Use f(x) := argmax_{y∈Y} g(x, y) to make predictions. 2 / 1

Loss-Minimizing Parameter Learning
D = {(x^1, y^1), ..., (x^N, y^N)}: i.i.d. training set; φ : X × Y → R^D: a feature function; Δ : Y × Y → R: a loss function.
Find a weight vector w* that minimizes the expected loss E_{(x,y)} Δ(y, f(x)) for f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.
Advantage: We directly optimize for the quantity of interest: the expected loss. No expensive-to-compute partition function Z will show up.
Disadvantage: We need to know the loss function already at training time. We can't use probabilistic reasoning to find w*. 3 / 1

Reminder: Regularized Risk Minimization
Task: min_{w∈R^D} E_{(x,y)} Δ(y, f(x)) for f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.
Two major problems:
the data distribution is unknown → we can't compute E;
f : X → Y has output in a discrete space → f is piecewise constant w.r.t. w → Δ(y, f(x)) is discontinuous, piecewise constant w.r.t. w → we can't apply gradient-based optimization. 4 / 1

Reminder: Regularized Risk Minimization
Task: min_{w∈R^D} E_{(x,y)} Δ(y, f(x)) for f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.
Problem 1: the data distribution is unknown.
Solution: replace E_{(x,y)∼d(x,y)}( · ) by the empirical estimate (1/N) ∑_{(x^n,y^n)} ( · ). To avoid overfitting, add a regularizer, e.g. (λ/2)‖w‖².
New task: min_{w∈R^D} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N Δ(y^n, f(x^n)). 5 / 1

Reminder: Regularized Risk Minimization
Task: min_{w∈R^D} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N Δ(y^n, f(x^n)) for f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.
Problem: Δ(y^n, f(x^n)) = Δ(y^n, argmax_y ⟨w, φ(x^n, y)⟩) is discontinuous w.r.t. w.
Solution: replace Δ(y, y′) by a well-behaved surrogate l(x, y, w). Typically: l is an upper bound to Δ, continuous and convex w.r.t. w.
New task: min_{w∈R^D} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N l(x^n, y^n, w). 6 / 1

Reminder: Regularized Risk Minimization
min_{w∈R^D} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N l(x^n, y^n, w)   (regularization + loss on the training data)
Hinge loss: maximum margin training
l(x^n, y^n, w) := max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]
l is a maximum over linear functions → continuous, convex.
l is an upper bound to Δ: "small l ⇒ small Δ". 7 / 1

Reminder: Regularized Risk Minimization
min_{w∈R^D} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N l(x^n, y^n, w)   (regularization + loss on the training data)
Hinge loss: maximum margin training
l(x^n, y^n, w) := max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]
Alternative: logistic loss: probabilistic training
l(x^n, y^n, w) := log ∑_{y∈Y} exp( ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ )
Differentiable, convex, not an upper bound to Δ(y, y′). 7 / 1
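
A minimal sketch of the hinge (max-margin) surrogate for a small label set, with the max over Y done by brute-force enumeration (the toy feature map and 0/1 loss are illustrative assumptions, not from the slides):

```python
import numpy as np

def structured_hinge_loss(w, phi, x, y_true, Y, delta):
    """l(x, y, w) = max_y [ delta(y_true, y) + <w,phi(x,y)> - <w,phi(x,y_true)> ]."""
    score_true = w @ phi(x, y_true)
    return max(delta(y_true, y) + w @ phi(x, y) - score_true for y in Y)

# toy multiclass setup: phi(x, y) puts x into the block of label y
K, d = 3, 4
def phi(x, y):
    out = np.zeros(K * d)
    out[y * d:(y + 1) * d] = x
    return out

delta = lambda y, yp: 0.0 if y == yp else 1.0      # 0/1 loss
rng = np.random.default_rng(0)
w, x = rng.normal(size=K * d), rng.normal(size=d)
print(structured_hinge_loss(w, phi, x, y_true=1, Y=range(K), delta=delta))
```

Since y = y_true is included in the max and contributes 0, the surrogate is never negative, and whenever it is small, the true loss Δ of the current prediction is small as well.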

Structured Output Support Vector Machine:
min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]
Conditional Random Field:
min_w (λ/2)‖w‖² + ∑_{n=1}^N log ∑_{y∈Y} exp( ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ )   (= −⟨w, φ(x^n, y^n)⟩ + log ∑_{y∈Y} exp(⟨w, φ(x^n, y)⟩) = the negative conditional log-likelihood)
CRFs and SSVMs have more in common than usually assumed:
log ∑_y exp(·) can be interpreted as a soft-max,
but: the CRF doesn't take the loss function into account at training time. 8 / 1

Example: Multiclass SVM
Y = {1, 2, ..., K},   Δ(y, y′) = 1 for y ≠ y′ and 0 otherwise.
φ(x, y) = ( [y = 1] φ(x), [y = 2] φ(x), ..., [y = K] φ(x) )
Solve: min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]
where the term inside the max equals 0 for y = y^n and 1 + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ for y ≠ y^n.
Classification: f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.
Crammer-Singer Multiclass SVM [K. Crammer, Y. Singer: On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines, JMLR, 2001] 9 / 1
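
With this stacked feature map, ⟨w, φ(x, y)⟩ is just the y-th row of a score matrix, so the per-example Crammer-Singer loss can be written compactly (a small illustrative sketch, not from the slides):

```python
import numpy as np

def crammer_singer_loss(W, x, y):
    """max_y [ [y != y_true] + <w_y, x> - <w_{y_true}, x> ], with W of shape (K, d)."""
    scores = W @ x                    # scores[k] = <w, phi(x, k)>
    margins = 1.0 + scores - scores[y]
    margins[y] = 0.0                  # the y = y_true term contributes 0
    return margins.max()

rng = np.random.default_rng(0)
W, x = rng.normal(size=(3, 4)), rng.normal(size=4)
print(crammer_singer_loss(W, x, y=1))
```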

Example: Hierarchical Multiclass SVM
Hierarchical multiclass loss: Δ(y, y′) := (1/2) (distance in the tree), e.g. Δ(cat, cat) = 0, Δ(cat, dog) = 1, Δ(cat, bus) = 2, etc.
min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]
e.g. if y^n = cat:
⟨w, φ(x^n, cat)⟩ − ⟨w, φ(x^n, dog)⟩ ≥ 1
⟨w, φ(x^n, cat)⟩ − ⟨w, φ(x^n, car)⟩ ≥ 2
⟨w, φ(x^n, cat)⟩ − ⟨w, φ(x^n, bus)⟩ ≥ 2
Labels that cause more loss are pushed further away → lower chance of a high-loss mistake at test time.
[L. Cai, T. Hofmann: Hierarchical Document Categorization with Support Vector Machines, ACM CIKM, 2004]
[A. Binder, K.-R. Müller, M. Kawanabe: On taxonomies for multi-class image categorization, IJCV, 2011] 10 / 1

Solving S-SVM Training Numerically
We can solve S-SVM training like CRF training:
min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]
continuous, unconstrained, convex, but non-differentiable → we can't use gradient descent directly; we'll have to use subgradients. 11 / 1

Solving S-SVM Training Numerically - Subgradient Method
Definition: Let f : R^D → R be a convex, not necessarily differentiable, function. A vector v ∈ R^D is called a subgradient of f at w_0 if f(w) ≥ f(w_0) + ⟨v, w − w_0⟩ for all w.
[Figure: a convex function f(w) and linear lower bounds f(w_0) + ⟨v, w − w_0⟩ that touch it at w_0, drawn for different subgradients v]
For differentiable f, the gradient v = ∇f(w_0) is the only subgradient.
[Figure: at a differentiable point the tangent is the unique such linear lower bound] 12 / 1

Solving S-SVM Training Numerically - Subgradient Method
The subgradient method works basically like gradient descent:
Subgradient Method Minimization: minimize F(w)
require: tolerance ε > 0, stepsizes η_t
w_cur ← 0
repeat
  v ← sub_w F(w_cur)   (any subgradient of F at w_cur)
  w_cur ← w_cur − η_t v
until F changed less than ε
return w_cur
Converges to the global minimum, but rather inefficient if F is non-differentiable.
[Shor, Minimization methods for non-differentiable functions, Springer, 1985.] 13 / 1

Solving S-SVM Training Numerically - Subgradient Method
Computing a subgradient:
min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N l^n(w)   with l^n(w) = max_y l^n_y(w) and l^n_y(w) := Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩
For each y ∈ Y, l^n_y(w) is a linear function of w; the max over the finite set Y is piecewise linear.
[Figure: several linear functions l^n_y(w) and their piecewise-linear maximum l^n(w)]
Subgradient of l^n at w_0: find the maximal (active) y, use v = ∇l^n_y(w_0). 14 / 1

Solving S-SVM Training Numerically - Subgradient Method
Subgradient Method S-SVM Training
input: training pairs {(x^1, y^1), ..., (x^N, y^N)} ⊂ X × Y, feature map φ(x, y), loss function Δ(y, y′), regularizer λ, number of iterations T, stepsizes η_t for t = 1, ..., T
1: w ← 0
2: for t = 1, ..., T do
3:   for n = 1, ..., N do
4:     ŷ ← argmax_{y∈Y} Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩
5:     v^n ← φ(x^n, ŷ) − φ(x^n, y^n)
6:   end for
7:   w ← w − η_t ( λw + (1/N) ∑_n v^n )
8: end for
output: prediction function f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.
Obs: each update of w needs N argmax-predictions (one per example). 15 / 1
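
A direct, self-contained sketch of this procedure for a small label set, with the loss-augmented argmax done by brute-force enumeration (the toy data, feature map, and the 1/(λt) stepsize rule are illustrative assumptions, not from the slides):

```python
import numpy as np

def subgradient_ssvm(X, Y_labels, phi, delta, Y_set, lam=0.1, T=100):
    w = np.zeros(phi(X[0], Y_labels[0]).size)
    N = len(X)
    for t in range(1, T + 1):
        eta = 1.0 / (lam * t)                      # one common stepsize rule
        V = np.zeros_like(w)
        for x, y in zip(X, Y_labels):
            # loss-augmented prediction: argmax_y delta(y^n,y) + <w, phi(x^n,y)>
            y_hat = max(Y_set, key=lambda yy: delta(y, yy) + w @ phi(x, yy))
            V += phi(x, y_hat) - phi(x, y)
        w -= eta * (lam * w + V / N)               # subgradient step
    return w

# toy multiclass instance of the structured setup
K, d = 3, 4
def phi(x, y):
    out = np.zeros(K * d)
    out[y * d:(y + 1) * d] = x
    return out

delta = lambda y, yp: float(y != yp)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, d))
Y_labels = rng.integers(K, size=30)
print(subgradient_ssvm(X, Y_labels, phi, delta, Y_set=range(K))[:d])
```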

Solving S-SVM Training Numerically - Subgradient Method
Same trick as for CRFs: stochastic updates.
Stochastic Subgradient Method S-SVM Training
input: training pairs {(x^1, y^1), ..., (x^N, y^N)} ⊂ X × Y, feature map φ(x, y), loss function Δ(y, y′), regularizer λ, number of iterations T, stepsizes η_t for t = 1, ..., T
1: w ← 0
2: for t = 1, ..., T do
3:   (x^n, y^n) ← a randomly chosen training example pair
4:   ŷ ← argmax_{y∈Y} Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩
5:   w ← w − η_t ( λw + φ(x^n, ŷ) − φ(x^n, y^n) )
6: end for
output: prediction function f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.
Observation: each update of w needs only 1 argmax-prediction (but we'll need many iterations until convergence). 16 / 1

Example: Image Segmentation
X: images, Y = {binary segmentation masks}.
Training example(s): (x^n, y^n) = [image with its ground-truth mask]
Δ(y, ȳ) = ∑_p [y_p ≠ ȳ_p]   (Hamming loss)
t = 1: ŷ = [mask]; φ(y^n) − φ(ŷ): black +, white +, green −, blue −, gray −
t = 2: ŷ = [mask]; φ(y^n) − φ(ŷ): black +, white +, green =, blue =, gray −
t = 3: ŷ = [mask]; φ(y^n) − φ(ŷ): black =, white =, green −, blue −, gray −
t = 4: ŷ = [mask]; φ(y^n) − φ(ŷ): black =, white =, green −, blue =, gray =
t = 5: ŷ = [mask]; φ(y^n) − φ(ŷ): black =, white =, green =, blue =, gray =
t = 6, ...: no more changes.
Images: [Carreira, Li, Sminchisescu, Object Recognition by Sequential Figure-Ground Ranking, IJCV 2010] 17 / 1

Solving S-SVM Training Numerically
Structured Support Vector Machine:
min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]
The subgradient method converges slowly. Can we do better?
Remember from the SVM: we can use inequalities and slack variables to encode the loss. 18 / 1

Solving S-SVM Training Numerically
Structured SVM (equivalent formulation). Idea: slack variables.
min_{w,ξ} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ξ^n
subject to, for n = 1, ..., N:
max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ] ≤ ξ^n
Note: ξ^n ≥ 0 is automatic, because the left-hand side is non-negative.
Differentiable objective, convex, N non-linear constraints. 19 / 1

Solving S-SVM Training Numerically
Structured SVM (also an equivalent formulation). Idea: expand the max-constraint into individual cases.
min_{w,ξ} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ξ^n
subject to, for n = 1, ..., N:
Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ≤ ξ^n   for all y ∈ Y
Differentiable objective, convex, N·|Y| linear constraints. 20 / 1

Solving S-SVM Training Numerically
Solve an S-SVM like a linear SVM:
min_{w∈R^D, ξ∈R^N} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ξ^n
subject to, for n = 1, ..., N, for all y ∈ Y:
⟨w, φ(x^n, y^n)⟩ − ⟨w, φ(x^n, y)⟩ ≥ Δ(y^n, y) − ξ^n
Introduce difference feature vectors δφ(x^n, y^n, y) := φ(x^n, y^n) − φ(x^n, y). 21 / 1

Solving S-SVM Training Numerically
Solve
min_{w∈R^D, ξ∈R^N} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ξ^n
subject to, for n = 1, ..., N, for all y ∈ Y:
⟨w, δφ(x^n, y^n, y)⟩ ≥ Δ(y^n, y) − ξ^n.
Same structure as an ordinary SVM! Quadratic objective, linear constraints.
Question: Can we use an ordinary SVM/QP solver?
Answer: Almost! We could, if there weren't N·|Y| constraints. E.g. 100 binary 16×16 images: 10^79 constraints. 22 / 1

Solving S-SVM Training Numerically - Working Set
Solution: working set training
It's enough if we enforce the active constraints; the others will be fulfilled automatically. We don't know which ones are active for the optimal solution, but it's likely to be only a small number (this can of course be formalized).
Keep a set of potentially active constraints and update it iteratively:
Start with the working set S = ∅ (no constraints). Repeat until convergence:
Solve the S-SVM training problem with the constraints from S.
Check whether the solution violates any constraint of the full set.
If no: we found the optimal solution, terminate.
If yes: add the most violated constraints to S, iterate.
Good practical performance and theoretical guarantees: polynomial-time convergence ε-close to the global optimum. 23 / 1

Working Set S-SVM Training
input: training pairs {(x^1, y^1), ..., (x^N, y^N)} ⊂ X × Y, feature map φ(x, y), loss function Δ(y, y′), regularizer λ
1: w ← 0, S ← ∅
2: repeat
3:   (w, ξ) ← solution of the QP with only the constraints from S
4:   for n = 1, ..., N do
5:     ŷ ← argmax_{y∈Y} Δ(y^n, y) + ⟨w, φ(x^n, y)⟩
6:     if ŷ ≠ y^n then
7:       S ← S ∪ {(x^n, ŷ)}
8:     end if
9:   end for
10: until S doesn't change anymore
output: prediction function f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.
Obs: each update of w needs N argmax-predictions (one per example), but we solve globally for the next w, not by local steps. 24 / 1
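
A compact sketch of the working-set loop (illustrative, not from the slides). The per-example working sets S_n collect violated labels; the restricted problem, which the slides hand to a QP solver, is here solved approximately by the subgradient method over only the constraints in S, purely to keep the example dependency-free:

```python
import numpy as np

def solve_restricted(w, X, Y_labels, S, phi, delta, lam, iters=200):
    """Approximately solve the S-SVM using only the constraints in S (stand-in for a QP solver)."""
    N = len(X)
    for t in range(1, iters + 1):
        grad = lam * w
        for n, (x, y) in enumerate(zip(X, Y_labels)):
            cand = [y] + list(S[n])          # y^n encodes the implicit xi^n >= 0 constraint
            y_hat = max(cand, key=lambda yy: delta(y, yy) + w @ phi(x, yy))
            grad = grad + (phi(x, y_hat) - phi(x, y)) / N
        w = w - (1.0 / (lam * t)) * grad
    return w

def working_set_ssvm(X, Y_labels, Y_set, phi, delta, lam=0.1, max_rounds=20):
    w = np.zeros(phi(X[0], Y_labels[0]).size)
    S = [set() for _ in X]
    for _ in range(max_rounds):
        w = solve_restricted(w, X, Y_labels, S, phi, delta, lam)
        changed = False
        for n, (x, y) in enumerate(zip(X, Y_labels)):
            y_hat = max(Y_set, key=lambda yy: delta(y, yy) + w @ phi(x, yy))
            if y_hat != y and y_hat not in S[n]:   # most violated label for this example
                S[n].add(y_hat)
                changed = True
        if not changed:                            # no new constraints: done
            break
    return w

K, d = 3, 4
def phi(x, y):
    out = np.zeros(K * d)
    out[y * d:(y + 1) * d] = x
    return out

delta = lambda y, yp: float(y != yp)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, d))
Y_labels = rng.integers(K, size=30)
print(working_set_ssvm(X, Y_labels, range(K), phi, delta)[:d])
```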

Example: Object Localization
X: images, Y = {object bounding boxes} ⊂ R^4.
Training examples: [images with ground-truth bounding boxes]
Goal: f : X → Y
Loss function: area overlap, Δ(y, y′) = 1 − area(y ∩ y′) / area(y ∪ y′)
[Blaschko, Lampert: Learning to Localize Objects with Structured Output Regression, ECCV 2008] 25 / 1
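
This loss is one minus the intersection-over-union of the two boxes. A small sketch (boxes represented as (left, top, right, bottom), a convention chosen here just for illustration):

```python
def area(box):
    l, t, r, b = box
    return max(0.0, r - l) * max(0.0, b - t)

def localization_loss(y, y_prime):
    """Delta(y, y') = 1 - area(y ∩ y') / area(y ∪ y')."""
    inter = (max(y[0], y_prime[0]), max(y[1], y_prime[1]),
             min(y[2], y_prime[2]), min(y[3], y_prime[3]))
    inter_area = area(inter)
    union_area = area(y) + area(y_prime) - inter_area
    return 1.0 - inter_area / union_area

print(localization_loss((0, 0, 10, 10), (5, 0, 15, 10)))  # 1 - 50/150 ≈ 0.667
```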

Example: Object Localization
Structured SVM: φ(x, y) := bag-of-words histogram of region y in image x
min_{w∈R^D, ξ∈R^N} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ξ^n
subject to, for n = 1, ..., N, for all y ∈ Y:
⟨w, φ(x^n, y^n)⟩ − ⟨w, φ(x^n, y)⟩ ≥ Δ(y^n, y) − ξ^n
Interpretation: for every image, the correct bounding box y^n should have a higher score than any wrong bounding box. Less overlap between the boxes → bigger required difference in score. 26 / 1

Example: Object Localization - Working set training
Step 1: w = 0. For every example: ŷ ← argmax_{y∈Y} Δ(y^n, y) + ⟨w, φ(x^n, y)⟩, where the second term is 0, so ŷ has maximal Δ-loss, i.e. minimal overlap with y^n.
ŷ ≠ y^n ⟹ add the constraint ⟨w, φ(x^n, y^n)⟩ − ⟨w, φ(x^n, ŷ)⟩ ≥ 1 − ξ^n.
Note: similar to binary SVM training for object detection: positive examples: ground-truth bounding boxes; negative examples: random boxes from the image background. 27 / 1

Example: Object Localization - Working set training
Later steps: for every example: ŷ ← argmax_{y∈Y} Δ(y^n, y) + ⟨w, φ(x^n, y)⟩, where the first term is a bias towards wrong regions and the second is the object detection score.
If ŷ = y^n: do nothing; else: add the constraint ⟨w, φ(x^n, y^n)⟩ − ⟨w, φ(x^n, ŷ)⟩ ≥ Δ(y^n, ŷ) − ξ^n, which enforces ŷ to have a lower score after re-training.
Note: similar to hard negative mining for object detection: perform detection on a training image; if the detected region is far from the ground truth, add it as a negative example.
Difference: the S-SVM handles regions that overlap with the ground truth. 28 / 1

Kernelized S-SVM
We can also kernelize the S-SVM optimization:
max_{α∈R_+^{N·|Y|}} ∑_{n=1}^N ∑_{y∈Y} α_{ny} Δ(y^n, y) − (1/2) ∑_{n,n̄=1}^N ∑_{y,ȳ∈Y} α_{ny} α_{n̄ȳ} K_{ny n̄ȳ}
subject to, for n = 1, ..., N: ∑_{y∈Y} α_{ny} ≤ 2/(λN).
N·|Y| many variables: train with a working set of the α_{ny}.
Kernelized prediction function: f(x) = argmax_{y∈Y} ∑_{n,y′} α_{ny′} k( (x^n, y′), (x, y) )
Not very popular in Computer Vision (quickly becomes inefficient). 29 / 1

SSVMs with Latent Variables
Latent variables are also possible in S-SVMs: x ∈ X always observed, y ∈ Y observed only in training, z ∈ Z never observed (latent).
Decision function: f(x) = argmax_{y∈Y} max_{z∈Z} ⟨w, φ(x, y, z)⟩
Maximum Margin Training with Maximization over Latent Variables
Solve: min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} l^n_w(y)
with l^n_w(y) = Δ(y^n, y) + max_{z∈Z} ⟨w, φ(x^n, y, z)⟩ − max_{z∈Z} ⟨w, φ(x^n, y^n, z)⟩
Problem: not convex → can have local minima.
[Yu, Joachims, Learning Structural SVMs with Latent Variables, 2009]
similar: [Felzenszwalb et al., A Discriminatively Trained, Multiscale, Deformable Part Model, 2008], but Y = {±1} 30 / 1

Summary S-SVM Learning
Given: a training set {(x^1, y^1), ..., (x^N, y^N)} ⊂ X × Y and a loss function Δ : Y × Y → R.
Parameterize f(x) := argmax_y ⟨w, φ(x, y)⟩.
Task: find w that minimizes the expected loss on future data, E_{(x,y)} Δ(y, f(x)).
The S-SVM solution is derived from regularized risk minimization: enforce the correct output to be better than all others by a margin: ⟨w, φ(x^n, y^n)⟩ ≥ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ for all y ∈ Y.
Convex optimization problem, but non-differentiable; many equivalent formulations → different training algorithms.
Training needs many argmax predictions, but no probabilistic inference.
Latent variables are possible, but optimization becomes non-convex. 31 / 1

Summary S-SVM Learning
Structured Learning is full of Open Research Questions
How to train faster? CRFs need many runs of probabilistic inference, SSVMs need many runs of argmax-prediction.
How to reduce the necessary amount of training data? Semi-supervised learning? Transfer learning?
How can we better understand different loss functions? How important is it to optimize the right loss?
Can we understand structured learning with approximate inference? Often, computing L(w) or argmax_y ⟨w, φ(x, y)⟩ exactly is infeasible. Can we guarantee good results even with approximate inference?
More and new applications! 32 / 1

Ad: Positions at IST Austria, Vienna
IST Austria Graduate School: enter with MSc or BSc, 1(2) + 3 yr PhD program.
Computer Vision/Machine Learning (me, Vladimir Kolmogorov), Computer Graphics (C. Wojtan), Comp. Topology (H. Edelsbrunner), Game Theory (K. Chatterjee), Software Verification (T. Henzinger), Cryptography (K. Pietrzak), Comp. Neuroscience (G. Tkacik), Random Matrix Theory (L. Erdös), Statistics (C. Uhler), and more... Fully funded positions.
Postdoc positions in my group: see http://www.ist.ac.at/~chl
More info: www.ist.ac.at. Internships: send me an email! 33 / 1

Additional Material 34 / 1

Solving S-SVM Training Numerically - One-Slack
One-Slack Formulation of the S-SVM (equivalent to the ordinary S-SVM formulation via ξ = (1/N) ∑_n ξ^n):
min_{w∈R^D, ξ∈R_+} (λ/2)‖w‖² + ξ
subject to, for all (ŷ^1, ..., ŷ^N) ∈ Y × ··· × Y:
∑_{n=1}^N [ Δ(y^n, ŷ^n) + ⟨w, φ(x^n, ŷ^n)⟩ − ⟨w, φ(x^n, y^n)⟩ ] ≤ Nξ
|Y|^N linear constraints, convex, differentiable objective.
We blew up the constraint set even further: 100 binary 16×16 images: 10^177 constraints (instead of 10^79). 35 / 1

Solving S-SVM Training Numerically - One-Slack
Working Set One-Slack S-SVM Training
input: training pairs {(x^1, y^1), ..., (x^N, y^N)} ⊂ X × Y, feature map φ(x, y), loss function Δ(y, y′), regularizer λ
1: S ← ∅
2: repeat
3:   (w, ξ) ← solution of the QP with only the constraints from S
4:   for n = 1, ..., N do
5:     ŷ^n ← argmax_{y∈Y} Δ(y^n, y) + ⟨w, φ(x^n, y)⟩
6:   end for
7:   S ← S ∪ { ( (x^1, ..., x^N), (ŷ^1, ..., ŷ^N) ) }
8: until S doesn't change anymore
output: prediction function f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.
Often faster convergence: we add one strong constraint per iteration instead of N weak ones. 36 / 1