Learning with Structured Inputs and Outputs


1 Learning with Structured Inputs and Outputs Christoph H. Lampert IST Austria (Institute of Science and Technology Austria), Vienna ENS/INRIA Summer School, Paris, July 2013 Slides: 1 / 10

2 Schedule Monday: Introduction to Graphical Models. 9:00-9:45: Conditional Random Fields. 9:45-10:30: Structured Support Vector Machines. Slides available on my home page: 2 / 10

3 Extended version of this lecture in book form (180 pages): Foundations and Trends in Computer Graphics and Vision, now publishers. Available as PDF on 3 / 10

4 Standard Regression/Classification: f : X → R. Structured Output Learning: f : X → Y. 4 / 10

5 Standard Regression/Classification: f : X → R. Inputs x ∈ X can be any kind of objects; the output y is a real number. Structured Output Learning: f : X → Y. Inputs x ∈ X can be any kind of objects; outputs y ∈ Y are complex (structured) objects. 5 / 10

6 What is structured data? Ad hoc definition: data that consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together. Examples: text, molecules / chemical structures, documents/hypertext, images. 6 / 10

7 What is structured output prediction? Ad hoc definition: predicting structured outputs from input data (in contrast to predicting just a single number, like in classification or regression) Natural Language Processing: Automatic Translation (output: sentences) Sentence Parsing (output: parse trees) Bioinformatics: Secondary Structure Prediction (output: bipartite graphs) Enzyme Function Prediction (output: path in a tree) Speech Processing: Automatic Transcription (output: sentences) Text-to-Speech (output: audio signal) Robotics: Planning (output: sequence of actions) This tutorial: Applications and Examples from Computer Vision 7 / 10

8 Reminder: Graphical Model for Pose Estimation [figure: factor graph with input X, body-part variables Y_top, Y_head, Y_torso, Y_larm, Y_rarm, Y_lhnd, Y_rhnd, Y_lleg, Y_rleg, Y_lfoot, Y_rfoot, and factors such as F^(1)_top, F^(2)_top,head, ...] Joint probability distribution of all body parts: p(y|x) = (1/Z(x)) exp(−∑_{F∈ℱ} E_F(y_F; x)). The exponent ("energy") decomposes into small but interacting factors. 8 / 10

9 Reminder: Graphical Model for Image Segmentation Probability distribution over all foreground/background segmentations: p(y|x) = (1/Z(x)) exp(−∑_{F∈ℱ} E_F(y_F; x)). The exponent ("energy") decomposes into small but interacting factors. 9 / 10

10 Reminder: Inference/Prediction Monday: Probabilistic Inference: compute marginal probabilities p(y_F|x) for any factor F, in particular p(y_i|x) for all i ∈ V. Monday: MAP Prediction: predict f : X → Y by solving y* = argmax_{y∈Y} p(y|x) = argmin_{y∈Y} E(y, x). Today: Parameter Learning: learn the potentials/energy terms from training data. 10 / 10

11 Part 1: Conditional Random Fields

12 Supervised Learning Problem Given training examples (x^1, y^1), ..., (x^N, y^N) ∈ X × Y. How to make predictions g : X → Y? Approach 1) Discriminative Probabilistic Learning: 1) Use the training data to obtain an estimate p(y|x). 2) Use f(x) = argmin_{ȳ∈Y} ∑_{y∈Y} p(y|x) Δ(y, ȳ) to make predictions. Approach 2) Loss-minimizing Parameter Estimation: 1) Use the training data to learn an energy function E(x, y). 2) Use f(x) := argmin_{y∈Y} E(x, y) to make predictions. 2 / 29

13 Conditional Random Field Learning Goal: learn a posterior distribution p(y|x) = (1/Z(x)) exp(−∑_{F∈ℱ} E_F(y_F; x)) with ℱ = {all factors}: all unary, pairwise, potentially higher order, ... Parameterize each E_F(y_F; x) = −⟨w_F, φ_F(x, y_F)⟩: fixed feature functions (φ_F(x, y_F))_{F∈ℱ} =: φ(x, y), weight vectors (w_F)_{F∈ℱ} =: w. Result: a log-linear model with parameter vector w, p(y|x; w) = (1/Z(x; w)) exp(⟨w, φ(x, y)⟩), with Z(x; w) = ∑_{ȳ∈Y} exp(⟨w, φ(x, ȳ)⟩). New goal: find the best parameter vector w ∈ R^D. 3 / 29

14 Maximum Likelihood Parameter Estimation Idea 1: Maximize the likelihood of the outputs y^1, ..., y^N for the inputs x^1, ..., x^N: w* = argmax_{w∈R^D} p(y^1, ..., y^N | x^1, ..., x^N, w) = (i.i.d.) argmax_{w∈R^D} ∏_{n=1}^N p(y^n | x^n, w) = (log(·)) argmin_{w∈R^D} −∑_{n=1}^N log p(y^n | x^n, w), the negative conditional log-likelihood (of D). 4 / 29

15 MAP Estimation of w Idea 2: Treat w as a random variable; maximize the posterior p(w|D). 5 / 29

16 MAP Estimation of w Idea 2: Treat w as a random variable; maximize the posterior p(w|D): p(w|D) = (Bayes) p(x^1, y^1, ..., x^N, y^N | w) p(w) / p(D) = (i.i.d.) (p(w)/p(D)) ∏_{n=1}^N p(y^n|x^n, w) p(x^n). p(w): prior belief on w (cannot be estimated from data). w* = argmax_{w∈R^D} p(w|D) = argmin_{w∈R^D} [−log p(w|D)] = argmin_{w∈R^D} [−log p(w) − ∑_{n=1}^N log p(y^n|x^n, w) − ∑_{n=1}^N log p(x^n) (indep. of w)] = argmin_{w∈R^D} [−log p(w) − ∑_{n=1}^N log p(y^n|x^n, w)]. 5 / 29

17 w* = argmin_{w∈R^D} [−log p(w) − ∑_{n=1}^N log p(y^n|x^n, w)]. Choices for p(w): p(w) ∝ const. (uniform; in R^D not really a distribution): w* = argmin_{w∈R^D} [−∑_{n=1}^N log p(y^n|x^n, w) + const.], the negative conditional log-likelihood. p(w) := const. · e^{−(λ/2)‖w‖²} (Gaussian): w* = argmin_{w∈R^D} [(λ/2)‖w‖² − ∑_{n=1}^N log p(y^n|x^n, w) + const.], the regularized negative conditional log-likelihood. 6 / 29

18 Probabilistic Models for Structured Prediction - Summary Negative (Regularized) Conditional Log-Likelihood (of D): L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} − ⟨w, φ(x^n, y^n)⟩ ] (λ → 0 makes it unregularized). Probabilistic parameter estimation or training means solving w* = argmin_{w∈R^D} L(w). Same optimization problem as for multi-class logistic regression. 7 / 29
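For concreteness, here is a minimal Python/NumPy sketch (not from the slides) of this objective for a toy model whose label set Y is small enough to enumerate; `feat`, `labels`, and `lam` are assumed names for the feature function φ, the label set Y, and the regularization constant λ.

    import numpy as np

    def neg_cond_log_likelihood(w, data, feat, labels, lam):
        """L(w) = lam/2 ||w||^2 + sum_n [ log sum_y exp(<w, phi(x^n,y)>) - <w, phi(x^n,y^n)> ].
        data: list of (x, y) pairs; feat(x, y) -> np.ndarray of shape (D,);
        labels: all possible outputs y (assumed enumerable, i.e. a toy setting)."""
        L = 0.5 * lam * np.dot(w, w)
        for x, y in data:
            scores = np.array([np.dot(w, feat(x, yy)) for yy in labels])
            L += np.logaddexp.reduce(scores) - np.dot(w, feat(x, y))  # log Z(x;w) - <w, phi(x,y)>
        return L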

19 Negative Conditional Log-Likelihood (Toy Example) [figure: plots of the negative conditional log-likelihood for a toy example]

20 Steepest Descent Minimization minimize L(w), input: tolerance ε > 0
1: w_cur ← 0
2: repeat
3:   v ← ∇_w L(w_cur)
4:   η ← argmin_{η∈R} L(w_cur − ηv)
5:   w_cur ← w_cur − ηv
6: until ‖v‖ < ε
output: w_cur
Alternatives: L-BFGS (second-order descent without explicit Hessian), Conjugate Gradient. We always need (at least) the gradient of L. 9 / 29
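A Python sketch of this descent loop, assuming `L` and `grad_L` are callables for the objective and its gradient; the exact line search of step 4 is replaced by a simple halving rule, which is an assumption of this sketch, not part of the slide.

    import numpy as np

    def steepest_descent(L, grad_L, D, eps=1e-6, max_iter=1000):
        w = np.zeros(D)                      # step 1: w_cur <- 0
        for _ in range(max_iter):
            v = grad_L(w)                    # step 3: v <- grad L(w_cur)
            if np.linalg.norm(v) < eps:      # step 6: stop when ||v|| < eps
                break
            eta = 1.0                        # step 4: crude line search by halving
            while L(w - eta * v) >= L(w) and eta > 1e-12:
                eta *= 0.5
            w = w - eta * v                  # step 5: w_cur <- w_cur - eta*v
        return w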

21 L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} − ⟨w, φ(x^n, y^n)⟩ ]
∇_w L(w) = λw − ∑_{n=1}^N [ φ(x^n, y^n) − ∑_{y∈Y} (e^{⟨w, φ(x^n, y)⟩} / ∑_{ȳ∈Y} e^{⟨w, φ(x^n, ȳ)⟩}) φ(x^n, y) ]
         = λw − ∑_{n=1}^N [ φ(x^n, y^n) − ∑_{y∈Y} p(y|x^n, w) φ(x^n, y) ]
         = λw − ∑_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]
∇²_w L(w) = λ Id_{D×D} + ∑_{n=1}^N Cov_{y∼p(y|x^n,w)}[ φ(x^n, y) ] 10 / 29
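The corresponding gradient in code, again by brute-force enumeration of Y (viable only for tiny toy problems); the names mirror the earlier sketch.

    import numpy as np

    def grad_neg_cond_log_likelihood(w, data, feat, labels, lam):
        """grad L(w) = lam*w - sum_n [ phi(x^n,y^n) - E_{y~p(y|x^n,w)} phi(x^n,y) ]."""
        g = lam * w
        for x, y in data:
            feats = np.array([feat(x, yy) for yy in labels])    # |Y| x D matrix
            scores = feats @ w
            p = np.exp(scores - np.logaddexp.reduce(scores))    # p(y|x,w) for every y
            g -= feat(x, y) - p @ feats                         # observed minus expected features
        return g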

22 L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} − ⟨w, φ(x^n, y^n)⟩ ] is continuous (not discrete) and C^∞-differentiable on all of R^D. [figure: slice through the objective value, w_x ∈ [−3, 5], w_y = 0] 11 / 29

23 ∇_w L(w) = λw − ∑_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]. For λ → 0: E_{y∼p(y|x^n,w)} φ(x^n, y) = φ(x^n, y^n) ⇒ ∇_w L(w) = 0, a critical point of L (local minimum/maximum/saddle point). Interpretation: we want the model distribution to match the empirical one: E_{y∼p(y|x,w)} φ(x, y) should equal φ(x, y^obs). E.g. Image Segmentation: φ_unary: correct amount of foreground vs. background; φ_pairwise: correct amount of fg/bg transitions → smoothness. 12 / 29

24 ∇²_w L(w) = λ Id_{D×D} + ∑_{n=1}^N Cov_{y∼p(y|x^n,w)}[ φ(x^n, y) ] is a positive definite Hessian matrix ⇒ L(w) is convex ⇒ ∇_w L(w) = 0 implies a global minimum. [figure: slice through the objective value, w_x ∈ [−3, 5], w_y = 0] 13 / 29

25 Milestone I: Probabilistic Training (Conditional Random Fields) p(y|x, w) is log-linear in w ∈ R^D. Training: minimize the negative conditional log-likelihood L(w). L(w) is differentiable and convex; gradient descent will find the global optimum with ∇_w L(w) = 0. Same structure as multi-class logistic regression. 14 / 29

26 Milestone I: Probabilistic Training (Conditional Random Fields) p(y|x, w) is log-linear in w ∈ R^D. Training: minimize the negative conditional log-likelihood L(w). L(w) is differentiable and convex; gradient descent will find the global optimum with ∇_w L(w) = 0. Same structure as multi-class logistic regression. For logistic regression: this is where the textbook ends. We're done. For conditional random fields: we're not in safe waters, yet! 14 / 29

27 Solving the Training Optimization Problem Numerically Task: compute v = ∇_w L(w_cur), evaluate L(w_cur + ηv): L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} − ⟨w, φ(x^n, y^n)⟩ ], ∇_w L(w) = λw − ∑_{n=1}^N [ φ(x^n, y^n) − ∑_{y∈Y} p(y|x^n, w) φ(x^n, y) ]. Problem: Y typically is very (exponentially) large: binary image segmentation: |Y| = 2^{#pixels}; ranking N images: |Y| = N!, e.g. N = 1000: |Y| ≈ 10^{2568}. We must use the structure in Y, or we're lost. 15 / 29

28 Solving the Training Optimization Problem Numerically ∇_w L(w) = λw − ∑_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]. Computing the Gradient (naive): O(K^M N D). L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ log Z(x^n, w) − ⟨w, φ(x^n, y^n)⟩ ]. Line Search (naive): O(K^M N D) per evaluation of L. N: number of samples; D: dimension of the feature space; M: number of output nodes; K: number of possible labels of each output node. 16 / 29

29 Solving the Training Optimization Problem Numerically ∇_w L(w) = λw − ∑_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]. Computing the Gradient (naive): O(K^M N D). L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ log Z(x^n, w) − ⟨w, φ(x^n, y^n)⟩ ]. Line Search (naive): O(K^M N D) per evaluation of L. N: number of samples; D: dimension of the feature space; M: number of output nodes: 100s to 1,000,000s; K: number of possible labels of each output node: 2 to 1000s. 16 / 29

30 Solving the Training Optimization Problem Numerically In a graphical model with factors ℱ, the features decompose: φ(x, y) = ( φ_F(x, y_F) )_{F∈ℱ}, so E_{y∼p(y|x,w)} φ(x, y) = ( E_{y∼p(y|x,w)} φ_F(x, y_F) )_{F∈ℱ} = ( E_{y_F∼p(y_F|x,w)} φ_F(x, y_F) )_{F∈ℱ}, with E_{y_F∼p(y_F|x,w)} φ_F(x, y_F) = ∑_{y_F∈Y_F} p(y_F|x, w) φ_F(x, y_F). The factor marginals μ_F = p(y_F|x, w) have only K^{|F|} terms each, much smaller than the complete joint distribution p(y|x, w), and can be computed/approximated, e.g., with (loopy) belief propagation. 17 / 29
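A small sketch of how the expected feature vector is assembled from factor marginals; the data layout (dicts keyed by factor and by labeling y_F) is an assumption, and the marginals themselves would come from (loopy) belief propagation or another inference routine.

    import numpy as np

    def expected_features_from_marginals(marginals, factor_feats):
        """marginals[F]: dict y_F -> mu_F(y_F) = p(y_F|x,w);
        factor_feats[F]: dict y_F -> phi_F(x, y_F) as np.ndarray.
        Returns the stacked vector ( E_{y_F} phi_F(x, y_F) )_F."""
        blocks = []
        for F in marginals:
            mu_F, phi_F = marginals[F], factor_feats[F]
            # sum_{y_F} p(y_F|x,w) * phi_F(x, y_F)
            blocks.append(sum(mu_F[yF] * phi_F[yF] for yF in mu_F))
        return np.concatenate(blocks)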

31 Solving the Training Optimization Problem Numerically ∇_w L(w) = λw − ∑_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]. Computing the Gradient: O(K^M N D) → O(M K^{|F_max|} N D). L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} − ⟨w, φ(x^n, y^n)⟩ ]. Line Search: O(K^M N D) → O(M K^{|F_max|} N D) per evaluation of L. N: number of samples; D: dimension of the feature space; M: number of output nodes; K: number of possible labels of each output node. 18 / 29

32 Solving the Training Optimization Problem Numerically ∇_w L(w) = λw − ∑_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]. Computing the Gradient: O(K^M N D) → O(M K^{|F_max|} N D). L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} − ⟨w, φ(x^n, y^n)⟩ ]. Line Search: O(K^M N D) → O(M K^{|F_max|} N D) per evaluation of L. N: number of samples: 10s to 1,000,000s; D: dimension of the feature space; M: number of output nodes; K: number of possible labels of each output node. 18 / 29

33 Solving the Training Optimization Problem Numerically What if the training set D is too large (e.g. millions of examples)? Stochastic Gradient Descent (SGD): minimize L(w), but without ever computing L(w) or ∇L(w) exactly. In each gradient descent step: pick a random subset D' ⊂ D, often just 1-3 elements! Follow the approximate gradient ∇̃L(w) = λw − (|D|/|D'|) ∑_{(x^n,y^n)∈D'} [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]. more: see L. Bottou, O. Bousquet: The Tradeoffs of Large Scale Learning, NIPS. also: 19 / 29

34 Solving the Training Optimization Problem Numerically What if the training set D is too large (e.g. millions of examples)? Stochastic Gradient Descent (SGD): minimize L(w), but without ever computing L(w) or ∇L(w) exactly. In each gradient descent step: pick a random subset D' ⊂ D, often just 1-3 elements! Follow the approximate gradient ∇̃L(w) = λw − (|D|/|D'|) ∑_{(x^n,y^n)∈D'} [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]. Avoid line search by using a fixed stepsize rule η (a new parameter). more: see L. Bottou, O. Bousquet: The Tradeoffs of Large Scale Learning, NIPS. also: 19 / 29

35 Solving the Training Optimization Problem Numerically What if the training set D is too large (e.g. millions of examples)? Stochastic Gradient Descent (SGD): minimize L(w), but without ever computing L(w) or ∇L(w) exactly. In each gradient descent step: pick a random subset D' ⊂ D, often just 1-3 elements! Follow the approximate gradient ∇̃L(w) = λw − (|D|/|D'|) ∑_{(x^n,y^n)∈D'} [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]. Avoid line search by using a fixed stepsize rule η (a new parameter). SGD converges to argmin_w L(w)! (if η is chosen right) more: see L. Bottou, O. Bousquet: The Tradeoffs of Large Scale Learning, NIPS. also: 19 / 29

36 Solving the Training Optimization Problem Numerically What if the training set D is too large (e.g. millions of examples)? Stochastic Gradient Descent (SGD): minimize L(w), but without ever computing L(w) or ∇L(w) exactly. In each gradient descent step: pick a random subset D' ⊂ D, often just 1-3 elements! Follow the approximate gradient ∇̃L(w) = λw − (|D|/|D'|) ∑_{(x^n,y^n)∈D'} [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]. Avoid line search by using a fixed stepsize rule η (a new parameter). SGD converges to argmin_w L(w)! (if η is chosen right) SGD needs more iterations, but each one is much faster. more: see L. Bottou, O. Bousquet: The Tradeoffs of Large Scale Learning, NIPS. also: 19 / 29
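One SGD step in code, following the approximate gradient above; the enumeration of Y is again only for a toy setting, and the fixed stepsize `eta` is the new parameter mentioned on the slide.

    import numpy as np

    def sgd_step(w, batch, feat, labels, lam, eta, N):
        """batch = D' (often 1-3 examples).  Approximate gradient:
        lam*w - (|D|/|D'|) * sum_{(x,y) in D'} [ phi(x,y) - E_{y'~p(y'|x,w)} phi(x,y') ]."""
        g = lam * w
        for x, y in batch:
            feats = np.array([feat(x, yy) for yy in labels])
            scores = feats @ w
            p = np.exp(scores - np.logaddexp.reduce(scores))
            g -= (N / len(batch)) * (feat(x, y) - p @ feats)
        return w - eta * g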

37 Solving the Training Optimization Problem Numerically ∇_w L(w) = λw − ∑_{n=1}^N [ φ(x^n, y^n) − E_{y∼p(y|x^n,w)} φ(x^n, y) ]. Computing the Gradient: O(K^M N D) → O(M K² N D) (if BP is possible). L(w) = (λ/2)‖w‖² + ∑_{n=1}^N [ log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} − ⟨w, φ(x^n, y^n)⟩ ]. Line Search: O(K^M N D) → O(M K² N D) per evaluation of L. N: number of samples; D: dimension of the feature space: φ_{i,j}: 1-10s, φ_i: 100s to 10,000s; M: number of output nodes; K: number of possible labels of each output node. 20 / 29

38 Solving the Training Optimization Problem Numerically Typical feature functions in image segmentation: φ_i(y_i, x) ∈ R^{1000}: local image features, e.g. bag-of-words; ⟨w_i, φ_i(y_i, x)⟩: local classifier (like logistic regression). φ_{i,j}(y_i, y_j) = ⟦y_i = y_j⟧ ∈ R^1: test for same label; ⟨w_{ij}, φ_{ij}(y_i, y_j)⟩: penalizer for label changes (if w_{ij} > 0). Combined: argmax_y p(y|x) is a smoothed version of the local cues. [figure: original image, local confidence, local + smoothness] 21 / 29

39 Solving the Training Optimization Problem Numerically Typical feature functions in pose estimation: φ_i(y_i, x) ∈ R^{1000}: local image representation, e.g. HoG; ⟨w_i, φ_i(y_i, x)⟩: local confidence map. φ_{i,j}(y_i, y_j) = good_fit(y_i, y_j) ∈ R^1: test for geometric fit; ⟨w_{ij}, φ_{ij}(y_i, y_j)⟩: penalizer for unrealistic poses. Together: argmax_y p(y|x) is a sanitized version of the local cues. [figure: original image, local confidence, local + geometry] [V. Ferrari, M. Marin-Jimenez, A. Zisserman: Progressive Search Space Reduction for Human Pose Estimation, CVPR 2008.] 22 / 29

40 Solving the Training Optimization Problem Numerically Idea: split the learning of the unary potentials into two parts: local classifiers, and their importance. Two-Stage Training: pre-train f_i^y(x) ≈ log p(y_i|x); use φ_i(y_i, x) := f_i^{y_i}(x) ∈ R^K (low-dimensional); keep φ_{ij}(y_i, y_j) as before; perform CRF learning with φ_i and φ_{ij}. 23 / 29

41 Solving the Training Optimization Problem Numerically Idea: split the learning of the unary potentials into two parts: local classifiers, and their importance. Two-Stage Training: pre-train f_i^y(x) ≈ log p(y_i|x); use φ_i(y_i, x) := f_i^{y_i}(x) ∈ R^K (low-dimensional); keep φ_{ij}(y_i, y_j) as before; perform CRF learning with φ_i and φ_{ij}. Advantage: lower-dimensional feature space during inference → faster; f_i^y(x) can be any classifier, e.g. non-linear SVMs, deep networks, ... Disadvantage: if the local classifiers are bad, CRF training cannot fix that. 23 / 29

42 Solving the Training Optimization Problem Numerically CRF training is based on gradient-descent optimization. The faster we can do it, the better (more realistic) models we can use: ∇_w L(w) = λw − ∑_{n=1}^N [ φ(x^n, y^n) − ∑_{y∈Y} p(y|x^n, w) φ(x^n, y) ] ∈ R^D. A lot of research on accelerating CRF training (problem → solution → method(s)):
Y too large → exploit structure → (loopy) belief propagation; smart sampling → contrastive divergence; use an approximate L → e.g. pseudo-likelihood
N too large → mini-batches → stochastic gradient descent
D too large → trained φ_unary → two-stage training
24 / 29

43 CRFs with Latent Variables So far, training was fully supervised; all variables were observed. In real life, some variables can be unobserved even during training: missing labels in the training data; latent variables, e.g. part location; latent variables, e.g. part occlusion; latent variables, e.g. viewpoint. 25 / 29

44 CRFs with Latent Variables Three types of variables in the graphical model: x ∈ X always observed (input), y ∈ Y observed only in training (output), z ∈ Z never observed (latent). Example: x: image, y: part positions, z ∈ {0, 1}: flag front-view or side-view. images: [Felzenszwalb et al., Object Detection with Discriminatively Trained Part Based Models, T-PAMI, 2010] 26 / 29

45 CRFs with Latent Variables Marginalization over Latent Variables Construct the conditional likelihood as usual: p(y, z|x, w) = (1/Z(x, w)) exp(⟨w, φ(x, y, z)⟩). Derive p(y|x, w) by marginalizing over z: p(y|x, w) = ∑_{z∈Z} p(y, z|x, w) = (1/Z(x, w)) ∑_{z∈Z} exp(⟨w, φ(x, y, z)⟩). 27 / 29

46 Negative regularized conditional log-likelihood: L(w) = (λ/2)‖w‖² − ∑_{n=1}^N log p(y^n|x^n, w) = (λ/2)‖w‖² − ∑_{n=1}^N log ∑_{z∈Z} p(y^n, z|x^n, w) = (λ/2)‖w‖² + ∑_{n=1}^N [ log ∑_{z∈Z} ∑_{y∈Y} exp(⟨w, φ(x^n, y, z)⟩) − log ∑_{z∈Z} exp(⟨w, φ(x^n, y^n, z)⟩) ]. L is not convex in w → local minima are possible. How to train CRFs with latent variables is active research. 28 / 29
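A sketch of this latent-variable objective with the marginalization done by explicit log-sum-exp; `latents` is an assumed name enumerating Z, and enumerating Y and Z is of course only feasible in a toy setting.

    import numpy as np

    def latent_neg_cond_log_likelihood(w, data, feat, labels, latents, lam):
        """L(w) = lam/2||w||^2 + sum_n [ log sum_{y,z} e^{<w,phi(x,y,z)>}
                                         - log sum_z e^{<w,phi(x,y^n,z)>} ]."""
        L = 0.5 * lam * np.dot(w, w)
        for x, y in data:
            num = np.array([np.dot(w, feat(x, y, z)) for z in latents])
            den = np.array([np.dot(w, feat(x, yy, z)) for yy in labels for z in latents])
            L += np.logaddexp.reduce(den) - np.logaddexp.reduce(num)
        return L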

47 Summary CRF Learning Given: a training set {(x^1, y^1), ..., (x^N, y^N)} ⊂ X × Y; feature functions φ : X × Y → R^D that decompose over factors, φ_F : X × Y_F → R^{d_F} for F ∈ ℱ. The overall model is log-linear (in the parameter w): p(y|x; w) ∝ e^{⟨w, φ(x, y)⟩}. CRF training requires minimizing the negative conditional log-likelihood: w* = argmin_w (λ/2)‖w‖² + ∑_{n=1}^N [ log ∑_{y∈Y} e^{⟨w, φ(x^n, y)⟩} − ⟨w, φ(x^n, y^n)⟩ ]. It is a convex optimization problem → (stochastic) gradient descent works. Training needs repeated runs of probabilistic inference. Latent variables are possible, but make training non-convex. 29 / 29

48 Part 2: Structured Support Vector Machines

49 Supervised Learning Problem Training examples (x^1, y^1), ..., (x^N, y^N) ∈ X × Y; loss function Δ : Y × Y → R. How to make predictions g : X → Y? Approach 2) Loss-minimizing Parameter Estimation: 1) Use the training data to learn an energy function E(x, y). 2) Use f(x) := argmin_{y∈Y} E(x, y) to make predictions. Slight variation (for historic reasons): 1) Learn a compatibility function g(x, y) (think: g = −E). 2) Use f(x) := argmax_{y∈Y} g(x, y) to make predictions. 2 / 1

50 Loss-Minimizing Parameter Learning D = {(x^1, y^1), ..., (x^N, y^N)} i.i.d. training set; φ : X × Y → R^D a feature function; Δ : Y × Y → R a loss function. Find a weight vector w* that minimizes the expected loss E_{(x,y)} Δ(y, f(x)) for f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩. 3 / 1

51 Loss-Minimizing Parameter Learning D = {(x^1, y^1), ..., (x^N, y^N)} i.i.d. training set; φ : X × Y → R^D a feature function; Δ : Y × Y → R a loss function. Find a weight vector w* that minimizes the expected loss E_{(x,y)} Δ(y, f(x)) for f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩. Advantage: we directly optimize for the quantity of interest, the expected loss; no expensive-to-compute partition function Z will show up. Disadvantage: we need to know the loss function already at training time; we can't use probabilistic reasoning to find w*. 3 / 1

52 Reminder: Regularized Risk Minimization Task: min_{w∈R^D} E_{(x,y)} Δ(y, f(x)) for f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩. Two major problems: the data distribution is unknown → we can't compute E; f : X → Y has output in a discrete space → f is piecewise constant w.r.t. w → Δ(y, f(x)) is discontinuous, piecewise constant w.r.t. w → we can't apply gradient-based optimization. 4 / 1

53 Reminder: Regularized Risk Minimization Task: min_{w∈R^D} E_{(x,y)} Δ(y, f(x)) for f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩. Problem 1: the data distribution is unknown. Solution: replace E_{(x,y)∼d(x,y)}(·) with the empirical estimate (1/N) ∑_{(x^n,y^n)}(·). To avoid overfitting: add a regularizer, e.g. (λ/2)‖w‖². New task: min_{w∈R^D} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N Δ(y^n, f(x^n)). 5 / 1

54 Reminder: Regularized Risk Minimization Task: for f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩, min_{w∈R^D} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N Δ(y^n, f(x^n)). Problem: Δ(y^n, f(x^n)) = Δ(y^n, argmax_y ⟨w, φ(x^n, y)⟩) is discontinuous w.r.t. w. Solution: replace Δ(y, y′) with a well-behaved surrogate ℓ(x, y, w); typically ℓ is an upper bound to Δ, continuous and convex w.r.t. w. New task: min_{w∈R^D} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ℓ(x^n, y^n, w). 6 / 1

55 Reminder: Regularized Risk Minimization min_{w∈R^D} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ℓ(x^n, y^n, w): Regularization + Loss on training data. 7 / 1

56 Reminder: Regularized Risk Minimization min_{w∈R^D} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ℓ(x^n, y^n, w): Regularization + Loss on training data. Hinge loss: maximum margin training: ℓ(x^n, y^n, w) := max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]. 7 / 1

57 Reminder: Regularized Risk Minimization min_{w∈R^D} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ℓ(x^n, y^n, w): Regularization + Loss on training data. Hinge loss: maximum margin training: ℓ(x^n, y^n, w) := max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]. ℓ is a maximum over linear functions → continuous, convex. ℓ is an upper bound to Δ: small ℓ ⇒ small Δ. 7 / 1

58 Reminder: Regularized Risk Minimization min_{w∈R^D} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ℓ(x^n, y^n, w): Regularization + Loss on training data. Hinge loss: maximum margin training: ℓ(x^n, y^n, w) := max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]. Alternative: logistic loss: probabilistic training: ℓ(x^n, y^n, w) := log ∑_{y∈Y} exp(⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩). Differentiable, convex, but not an upper bound to Δ(y, y′). 7 / 1
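The hinge surrogate is easy to evaluate explicitly; this sketch does so by brute force over Y, with `delta` an assumed name for the task loss Δ.

    import numpy as np

    def structured_hinge_loss(w, x, y_true, feat, labels, delta):
        """l(x, y_true, w) = max_y [ Delta(y_true,y) + <w,phi(x,y)> - <w,phi(x,y_true)> ]."""
        score_true = np.dot(w, feat(x, y_true))
        return max(delta(y_true, yy) + np.dot(w, feat(x, yy)) - score_true for yy in labels)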

59 Structured Output Support Vector Machine: min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]. Conditional Random Field: min_w (λ/2)‖w‖² + ∑_{n=1}^N log ∑_{y∈Y} exp(⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩), where the summand equals −⟨w, φ(x^n, y^n)⟩ + log ∑_y exp(⟨w, φ(x^n, y)⟩), the negative conditional log-likelihood. CRFs and SSVMs have more in common than usually assumed: log ∑_y exp(·) can be interpreted as a soft-max; but: the CRF doesn't take the loss function into account at training time. 8 / 1

60 Example: Multiclass SVM Y = {1, 2, ..., K}, Δ(y, y′) = 1 for y ≠ y′, 0 otherwise. φ(x, y) = ( ⟦y = 1⟧ φ(x), ⟦y = 2⟧ φ(x), ..., ⟦y = K⟧ φ(x) ). Solve: min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ], where the term in brackets is 0 for y = y^n and 1 + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ for y ≠ y^n. Classification: f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩. Crammer-Singer Multiclass SVM [K. Crammer, Y. Singer: On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines, JMLR, 2001] 9 / 1

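A sketch of this multiclass joint feature map and the resulting prediction rule (labels are indexed 0, ..., K-1 here instead of 1, ..., K).

    import numpy as np

    def multiclass_joint_feature(phi_x, y, K):
        """phi(x,y) = ( [y=0] phi(x), ..., [y=K-1] phi(x) ): phi(x) placed in block y."""
        D = len(phi_x)
        out = np.zeros(K * D)
        out[y * D:(y + 1) * D] = phi_x
        return out

    def multiclass_predict(w, phi_x, K):
        """f(x) = argmax_y <w, phi(x,y)>, i.e. the block of w with the highest score."""
        D = len(phi_x)
        return int(np.argmax([np.dot(w[y * D:(y + 1) * D], phi_x) for y in range(K)]))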

62 Example: Hierarchical Multiclass SVM Hierarchical Multiclass Loss: Δ(y, y′) := (1/2) (distance in tree), e.g. Δ(cat, cat) = 0, Δ(cat, dog) = 1, Δ(cat, bus) = 2, etc. min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]. E.g. if y^n = cat, the margins should satisfy: ⟨w, φ(x^n, cat)⟩ − ⟨w, φ(x^n, dog)⟩ ≥ 1, ⟨w, φ(x^n, cat)⟩ − ⟨w, φ(x^n, car)⟩ ≥ 2, ⟨w, φ(x^n, cat)⟩ − ⟨w, φ(x^n, bus)⟩ ≥ 2, ... [L. Cai, T. Hofmann: Hierarchical Document Categorization with Support Vector Machines, ACM CIKM, 2004] [A. Binder, K.-R. Müller, M. Kawanabe: On taxonomies for multi-class image categorization, IJCV, 2011] 10 / 1

63 Example: Hierarchical Multiclass SVM Hierarchical Multiclass Loss: Δ(y, y′) := (1/2) (distance in tree), e.g. Δ(cat, cat) = 0, Δ(cat, dog) = 1, Δ(cat, bus) = 2, etc. min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]. E.g. if y^n = cat, the margins should satisfy: ⟨w, φ(x^n, cat)⟩ − ⟨w, φ(x^n, dog)⟩ ≥ 1, ⟨w, φ(x^n, cat)⟩ − ⟨w, φ(x^n, car)⟩ ≥ 2, ⟨w, φ(x^n, cat)⟩ − ⟨w, φ(x^n, bus)⟩ ≥ 2, ... Labels that cause more loss are pushed further away → lower chance of a high-loss mistake at test time. [L. Cai, T. Hofmann: Hierarchical Document Categorization with Support Vector Machines, ACM CIKM, 2004] [A. Binder, K.-R. Müller, M. Kawanabe: On taxonomies for multi-class image categorization, IJCV, 2011] 10 / 1

64 Solving S-SVM Training Numerically We can solve S-SVM training like CRF training: min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]: continuous, unconstrained, convex, but non-differentiable → we can't use gradient descent directly. We'll have to use subgradients. 11 / 1

65 Solving S-SVM Training Numerically Subgradient Method Definition: Let f : R^D → R be a convex, not necessarily differentiable, function. A vector v ∈ R^D is called a subgradient of f at w_0 if f(w) ≥ f(w_0) + ⟨v, w − w_0⟩ for all w. [figure: f(w) and the linear lower bound f(w_0) + ⟨v, w − w_0⟩ touching it at w_0] 12 / 1


68 Solving S-SVM Training Numerically Subgradient Method Definition: Let f : R^D → R be a convex, not necessarily differentiable, function. A vector v ∈ R^D is called a subgradient of f at w_0 if f(w) ≥ f(w_0) + ⟨v, w − w_0⟩ for all w. For differentiable f, the gradient v = ∇f(w_0) is the only subgradient. [figure: f(w) with its tangent f(w_0) + ⟨v, w − w_0⟩ at w_0] 12 / 1

69 Solving S-SVM Training Numerically Subgradient Method The subgradient method works basically like gradient descent: Subgradient Method Minimization minimize F(w), require: tolerance ε > 0, stepsizes η_t
w_cur ← 0
repeat
  v ← a subgradient of F at w_cur
  w_cur ← w_cur − η_t v
until F changed less than ε
return w_cur
Converges to the global minimum, but rather inefficient if F is non-differentiable. [Shor, Minimization methods for non-differentiable functions, Springer, 1985.] 13 / 1

70-77 Solving S-SVM Training Numerically Subgradient Method Computing a subgradient: min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ℓ^n(w), with ℓ^n(w) = max_y ℓ^n_y(w) and ℓ^n_y(w) := Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩. [figures: the lines ℓ^n_y(w) for individual y ∈ Y and their upper envelope ℓ^n(w)] For each y ∈ Y, ℓ^n_y(w) is a linear function of w. The max over the finite set Y is piece-wise linear. Subgradient of ℓ^n at w_0: find a maximal (active) y, use v = ∇ℓ^n_y(w_0). 14 / 1

78 Solving S-SVM Training Numerically Subgradient Method Subgradient Method S-SVM Training
input: training pairs {(x^1, y^1), ..., (x^N, y^N)} ⊂ X × Y, feature map φ(x, y), loss function Δ(y, y′), regularizer λ, number of iterations T, stepsizes η_t for t = 1, ..., T
1: w ← 0
2: for t = 1, ..., T do
3:   for n = 1, ..., N do
4:     ŷ ← argmax_{y∈Y} Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩
5:     v^n ← φ(x^n, ŷ) − φ(x^n, y^n)
6:   end for
7:   w ← w − η_t (λw + (1/N) ∑_n v^n)
8: end for
output: prediction function f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.
Observation: each update of w needs N argmax-predictions (one per example). 15 / 1
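A runnable Python sketch of this procedure for problems where Y can be enumerated; the decaying stepsize eta_t = eta0/t is an assumed rule, not part of the slide.

    import numpy as np

    def ssvm_subgradient_train(data, feat, labels, delta, lam, D, T=100, eta0=1.0):
        w = np.zeros(D)
        N = len(data)
        for t in range(1, T + 1):
            v = np.zeros(D)
            for x, y in data:
                # loss-augmented prediction; the constant -<w, phi(x,y^n)> does not change the argmax
                y_hat = max(labels, key=lambda yy: delta(y, yy) + np.dot(w, feat(x, yy)))
                v += feat(x, y_hat) - feat(x, y)
            w = w - (eta0 / t) * (lam * w + v / N)
        return w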

79 Solving S-SVM Training Numerically Subgradient Method Same trick as for CRFs: stochastic updates: Stochastic Subgradient Method S-SVM Training
input: training pairs {(x^1, y^1), ..., (x^N, y^N)} ⊂ X × Y, feature map φ(x, y), loss function Δ(y, y′), regularizer λ, number of iterations T, stepsizes η_t for t = 1, ..., T
1: w ← 0
2: for t = 1, ..., T do
3:   (x^n, y^n) ← randomly chosen training example pair
4:   ŷ ← argmax_{y∈Y} Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩
5:   w ← w − η_t (λw + (1/N) [φ(x^n, ŷ) − φ(x^n, y^n)])
6: end for
output: prediction function f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.
Observation: each update of w needs only 1 argmax-prediction (but we'll need many iterations until convergence). 16 / 1

80-86 Example: Image Segmentation X = {images}, Y = {binary segmentation masks}. Training example(s): (x^n, y^n) = [image and its ground-truth mask]. Δ(y, ȳ) = ∑_p ⟦y_p ≠ ȳ_p⟧ (Hamming loss). Subgradient training, iteration by iteration (signs of the components of φ(y^n) − φ(ŷ) per color region):
t = 1: ŷ = [current prediction]; φ(y^n) − φ(ŷ): black +, white +, green −, blue −, gray −
t = 2: ŷ = [current prediction]; φ(y^n) − φ(ŷ): black +, white +, green =, blue =, gray −
t = 3: ŷ = [current prediction]; φ(y^n) − φ(ŷ): black =, white =, green −, blue −, gray −
t = 4: ŷ = [current prediction]; φ(y^n) − φ(ŷ): black =, white =, green −, blue =, gray =
t = 5: ŷ = [current prediction]; φ(y^n) − φ(ŷ): black =, white =, green =, blue =, gray =
t = 6, ...: no more changes.
Images: [Carreira, Li, Sminchisescu, Object Recognition by Sequential Figure-Ground Ranking, IJCV 2010] 17 / 1

87 Solving S-SVM Training Numerically Structured Support Vector Machine: min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]. The subgradient method converges slowly. Can we do better? 18 / 1

88 Solving S-SVM Training Numerically Structured Support Vector Machine: min_w (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ]. The subgradient method converges slowly. Can we do better? Remember from the SVM: we can use inequalities and slack variables to encode the loss. 18 / 1

89 Solving S-SVM Training Numerically Structured SVM (equivalent formulation): Idea: slack variables. min_{w,ξ} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ξ^n subject to, for n = 1, ..., N: max_{y∈Y} [ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ] ≤ ξ^n. Note: ξ^n ≥ 0 is automatic, because the left hand side is non-negative. Differentiable objective, convex, N non-linear constraints. 19 / 1

90 Solving S-SVM Training Numerically Structured SVM (also equivalent formulation): Idea: expand the max-constraint into individual cases. min_{w,ξ} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ξ^n subject to, for n = 1, ..., N: Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ − ⟨w, φ(x^n, y^n)⟩ ≤ ξ^n, for all y ∈ Y. Differentiable objective, convex, N·|Y| linear constraints. 20 / 1

91 Solving S-SVM Training Numerically Solve an S-SVM like a linear SVM: min_{w∈R^D, ξ∈R^N} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ξ^n subject to, for n = 1, ..., N: ⟨w, φ(x^n, y^n)⟩ − ⟨w, φ(x^n, y)⟩ ≥ Δ(y^n, y) − ξ^n, for all y ∈ Y. Introduce feature vectors δφ(x^n, y^n, y) := φ(x^n, y^n) − φ(x^n, y). 21 / 1

92 Solving S-SVM Training Numerically Solve min_{w∈R^D, ξ∈R^N} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ξ^n subject to, for n = 1, ..., N and for all y ∈ Y: ⟨w, δφ(x^n, y^n, y)⟩ ≥ Δ(y^n, y) − ξ^n. Same structure as an ordinary SVM! quadratic objective, linear constraints. 22 / 1

93 Solving S-SVM Training Numerically Solve min_{w∈R^D, ξ∈R^N} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ξ^n subject to, for n = 1, ..., N and for all y ∈ Y: ⟨w, δφ(x^n, y^n, y)⟩ ≥ Δ(y^n, y) − ξ^n. Same structure as an ordinary SVM! quadratic objective, linear constraints. Question: Can we use an ordinary SVM/QP solver? 22 / 1

94 Solving S-SVM Training Numerically Solve min_{w∈R^D, ξ∈R^N} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ξ^n subject to, for n = 1, ..., N and for all y ∈ Y: ⟨w, δφ(x^n, y^n, y)⟩ ≥ Δ(y^n, y) − ξ^n. Same structure as an ordinary SVM! quadratic objective, linear constraints. Question: Can we use an ordinary SVM/QP solver? Answer: Almost! We could, if there weren't N·|Y| constraints. E.g. 100 binary images: astronomically many constraints. 22 / 1

95 Solving S-SVM Training Numerically Working Set Solution: working set training. It's enough if we enforce the active constraints; the others will be fulfilled automatically. We don't know which ones are active for the optimal solution, but it's likely to be only a small number (this can of course be formalized). Keep a set of potentially active constraints and update it iteratively: 23 / 1

96 Solving S-SVM Training Numerically Working Set Solution: working set training. It's enough if we enforce the active constraints; the others will be fulfilled automatically. We don't know which ones are active for the optimal solution, but it's likely to be only a small number (this can of course be formalized). Keep a set of potentially active constraints and update it iteratively: start with the working set S = ∅ (no constraints); repeat until convergence: solve the S-SVM training problem with the constraints from S; check if the solution violates any constraint of the full constraint set; if no: we found the optimal solution, terminate; if yes: add the most violated constraints to S, iterate. 23 / 1

97 Solving S-SVM Training Numerically Working Set Solution: working set training. It's enough if we enforce the active constraints; the others will be fulfilled automatically. We don't know which ones are active for the optimal solution, but it's likely to be only a small number (this can of course be formalized). Keep a set of potentially active constraints and update it iteratively: start with the working set S = ∅ (no constraints); repeat until convergence: solve the S-SVM training problem with the constraints from S; check if the solution violates any constraint of the full constraint set; if no: we found the optimal solution, terminate; if yes: add the most violated constraints to S, iterate. Good practical performance and theoretical guarantees: polynomial time convergence to within ε of the global optimum. 23 / 1

98 Working Set S-SVM Training
input: training pairs {(x^1, y^1), ..., (x^N, y^N)} ⊂ X × Y, feature map φ(x, y), loss function Δ(y, y′), regularizer λ
1: w ← 0, S ← ∅
2: repeat
3:   (w, ξ) ← solution to the QP only with the constraints from S
4:   for n = 1, ..., N do
5:     ŷ ← argmax_{y∈Y} Δ(y^n, y) + ⟨w, φ(x^n, y)⟩
6:     if ŷ ≠ y^n then
7:       S ← S ∪ {(x^n, ŷ)}
8:     end if
9:   end for
10: until S doesn't change anymore
output: prediction function f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.
Observation: each update of w needs N argmax-predictions (one per example), but we solve globally for the next w, not by local steps. 24 / 1
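A sketch of working-set training; `solve_qp(S)` is an assumed helper that returns (w, xi) for the QP restricted to the constraints in S (any off-the-shelf QP solver could fill that role), and constraints are stored as (example index, label) pairs so they are hashable.

    import numpy as np

    def working_set_train(data, feat, labels, delta, D, solve_qp, max_rounds=50):
        S = set()
        w = np.zeros(D)
        for _ in range(max_rounds):
            w, xi = solve_qp(S)
            changed = False
            for n, (x, y) in enumerate(data):
                # most violated constraint for example n
                y_hat = max(labels, key=lambda yy: delta(y, yy) + np.dot(w, feat(x, yy)))
                if y_hat != y and (n, y_hat) not in S:
                    S.add((n, y_hat))
                    changed = True
            if not changed:      # S did not change anymore
                break
        return w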

99 Example: Object Localization X = {images}, Y = {object bounding boxes} ⊂ R^4. Training examples: [images with ground-truth bounding boxes]. Goal: f : X → Y. Loss function: area overlap, Δ(y, y′) = 1 − area(y ∩ y′) / area(y ∪ y′). [Blaschko, Lampert: Learning to Localize Objects with Structured Output Regression, ECCV 2008] 25 / 1
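The area-overlap loss is simple to compute for axis-aligned boxes; this sketch assumes boxes are given as (left, top, right, bottom) tuples, which is an assumed encoding.

    def area_overlap_loss(y, y_bar):
        """Delta(y, y_bar) = 1 - area(y ∩ y_bar) / area(y ∪ y_bar)."""
        l, t = max(y[0], y_bar[0]), max(y[1], y_bar[1])
        r, b = min(y[2], y_bar[2]), min(y[3], y_bar[3])
        inter = max(0.0, r - l) * max(0.0, b - t)
        area = lambda bb: (bb[2] - bb[0]) * (bb[3] - bb[1])
        union = area(y) + area(y_bar) - inter
        return 1.0 - inter / union if union > 0 else 0.0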

100 Example: Object Localization Structured SVM: φ(x, y) := bag-of-words histogram of region y in image x. min_{w∈R^D, ξ∈R^N} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N ξ^n subject to, for n = 1, ..., N: ⟨w, φ(x^n, y^n)⟩ − ⟨w, φ(x^n, y)⟩ ≥ Δ(y^n, y) − ξ^n, for all y ∈ Y. Interpretation: for every image, the correct bounding box y^n should have a higher score than any wrong bounding box; less overlap between the boxes → bigger required difference in score. 26 / 1

101 Example: Object Localization Working set training, Step 1: w = 0. For every example: ŷ ← argmax_{y∈Y} Δ(y^n, y) + ⟨w, φ(x^n, y)⟩, where the second term is 0, so ŷ has maximal Δ-loss, i.e. minimal overlap with y^n. If ŷ ≠ y^n: add the constraint ⟨w, φ(x^n, y^n)⟩ − ⟨w, φ(x^n, ŷ)⟩ ≥ 1 − ξ^n. Note: similar to binary SVM training for object detection: positive examples: ground truth bounding boxes; negative examples: random boxes from the image background. 27 / 1

102 Example: Object Localization Working set training, Later Steps: For every example: ŷ ← argmax_{y∈Y} Δ(y^n, y) + ⟨w, φ(x^n, y)⟩, a bias towards wrong regions plus the object detection score. If ŷ = y^n: do nothing; else: add the constraint ⟨w, φ(x^n, y^n)⟩ − ⟨w, φ(x^n, ŷ)⟩ ≥ Δ(y^n, ŷ) − ξ^n, which enforces ŷ to have a lower score after re-training. Note: similar to hard negative mining for object detection: perform detection on the training image; if the detected region is far from the ground truth, add it as a negative example. Difference: the S-SVM handles regions that overlap with the ground truth. 28 / 1

103 Kernelized S-SVM We can also kernelize the S-SVM optimization: max_{α∈R_+^{N·|Y|}} ∑_{n=1,...,N} ∑_{y∈Y} α_{ny} Δ(y^n, y) − (1/2) ∑_{n,n̄=1,...,N} ∑_{y,ȳ∈Y} α_{ny} α_{n̄ȳ} K_{n n̄ y ȳ}, subject to, for n = 1, ..., N: ∑_{y∈Y} α_{ny} ≤ 2/(λN), where K_{n n̄ y ȳ} = ⟨δφ(x^n, y^n, y), δφ(x^{n̄}, y^{n̄}, ȳ)⟩. N·|Y| many variables: train with a working set of the α_{ny}. Kernelized prediction function: f(x) = argmax_{y∈Y} ∑_{n,y′} α_{ny′} k((x^n, y′), (x, y)). Not very popular in Computer Vision (quickly becomes inefficient). 29 / 1
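A sketch of the kernelized prediction rule; the layout of the dual variables (alpha[n] as a sparse dict over labels) and the joint kernel k((x, y), (x', y')) are assumptions of this sketch.

    def kernelized_predict(x, alpha, train_data, labels, joint_kernel):
        """f(x) = argmax_y sum_{n,y'} alpha_{n,y'} * k((x^n, y'), (x, y))."""
        def score(y):
            s = 0.0
            for n, (xn, _) in enumerate(train_data):
                for yp, a in alpha[n].items():   # only the non-zero (working set) entries
                    s += a * joint_kernel((xn, yp), (x, y))
            return s
        return max(labels, key=score)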

104 SSVMs with Latent Variables Latent variables are also possible in S-SVMs: x ∈ X always observed, y ∈ Y observed only in training, z ∈ Z never observed (latent). Decision function: f(x) = argmax_{y∈Y} max_{z∈Z} ⟨w, φ(x, y, z)⟩. 30 / 1

105 SSVMs with Latent Variables Latent variables are also possible in S-SVMs: x ∈ X always observed, y ∈ Y observed only in training, z ∈ Z never observed (latent). Decision function: f(x) = argmax_{y∈Y} max_{z∈Z} ⟨w, φ(x, y, z)⟩. Maximum Margin Training with Maximization over Latent Variables: solve min_{w,ξ} (λ/2)‖w‖² + (1/N) ∑_{n=1}^N max_{y∈Y} ℓ^n_w(y), with ℓ^n_w(y) = Δ(y^n, y) + max_{z∈Z} ⟨w, φ(x^n, y, z)⟩ − max_{z∈Z} ⟨w, φ(x^n, y^n, z)⟩. Problem: not convex → can have local minima. [Yu, Joachims, Learning Structural SVMs with Latent Variables, 2009]; similar: [Felzenszwalb et al., A Discriminatively Trained, Multiscale, Deformable Part Model, 2008], but Y = {±1}. 30 / 1

106 Summary S-SVM Learning Given: training set {(x^1, y^1), ..., (x^N, y^N)} ⊂ X × Y, loss function Δ : Y × Y → R. Parameterize f(x) := argmax_y ⟨w, φ(x, y)⟩. Task: find w that minimizes the expected loss on future data, E_{(x,y)} Δ(y, f(x)). 31 / 1

107 Summary S-SVM Learning Given: training set {(x^1, y^1), ..., (x^N, y^N)} ⊂ X × Y, loss function Δ : Y × Y → R. Parameterize f(x) := argmax_y ⟨w, φ(x, y)⟩. Task: find w that minimizes the expected loss on future data, E_{(x,y)} Δ(y, f(x)). The S-SVM solution is derived from regularized risk minimization: enforce the correct output to be better than all others by a margin: ⟨w, φ(x^n, y^n)⟩ ≥ Δ(y^n, y) + ⟨w, φ(x^n, y)⟩ for all y ∈ Y. Convex optimization problem, but non-differentiable; many equivalent formulations → different training algorithms; training needs many argmax predictions, but no probabilistic inference. Latent variables are possible, but optimization becomes non-convex. 31 / 1

108 Summary S-SVM Learning Structured Learning is full of Open Research Questions How to train faster? CRFs need many runs of probabilistic inference, S-SVMs need many runs of argmax-prediction. How to reduce the necessary amount of training data? semi-supervised learning? transfer learning? How can we better understand different loss functions? how important is it to optimize the right loss? Can we understand structured learning with approximate inference? often computing ∇L(w) or argmax_y ⟨w, φ(x, y)⟩ exactly is infeasible; can we guarantee good results even with approximate inference? More and new applications! 32 / 1

109 Ad: Positions at IST Austria, Vienna IST Austria Graduate School: enter with MSc or BSc, 1(2) + 3 yr PhD program. Computer Vision/Machine Learning (me, Vladimir Kolmogorov), Computer Graphics (C. Wojtan), Comp. Topology (H. Edelsbrunner), Game Theory (K. Chatterjee), Software Verification (T. Henzinger), Cryptography (K. Pietrzak), Comp. Neuroscience (G. Tkacik), Random Matrix Theory (L. Erdös), Statistics (C. Uhler), and more... Fully funded positions. Postdoc positions in my group: see chl More info: Internships: send me an email! 33 / 1

110 Additional Material 34 / 1

111 Solving S-SVM Training Numerically One-Slack One-Slack Formulation of the S-SVM (equivalent to the ordinary S-SVM formulation by ξ = (1/N) ∑_n ξ^n): min_{w∈R^D, ξ∈R_+} (λ/2)‖w‖² + ξ subject to, for all (ŷ^1, ..., ŷ^N) ∈ Y × ... × Y: ∑_{n=1}^N [ Δ(y^n, ŷ^n) + ⟨w, φ(x^n, ŷ^n)⟩ − ⟨w, φ(x^n, y^n)⟩ ] ≤ N ξ. 35 / 1

112 Solving S-SVM Training Numerically One-Slack One-Slack Formulation of the S-SVM (equivalent to the ordinary S-SVM formulation by ξ = (1/N) ∑_n ξ^n): min_{w∈R^D, ξ∈R_+} (λ/2)‖w‖² + ξ subject to, for all (ŷ^1, ..., ŷ^N) ∈ Y × ... × Y: ∑_{n=1}^N [ Δ(y^n, ŷ^n) + ⟨w, φ(x^n, ŷ^n)⟩ − ⟨w, φ(x^n, y^n)⟩ ] ≤ N ξ. |Y|^N linear constraints, convex, differentiable objective. We blew up the constraint set even further: 100 binary images: vastly more constraints than before. 35 / 1

113 Solving S-SVM Training Numerically One-Slack Working Set One-Slack S-SVM Training
input: training pairs {(x^1, y^1), ..., (x^N, y^N)} ⊂ X × Y, feature map φ(x, y), loss function Δ(y, y′), regularizer λ
1: S ← ∅
2: repeat
3:   (w, ξ) ← solution to the QP only with the constraints from S
4:   for n = 1, ..., N do
5:     ŷ^n ← argmax_{y∈Y} Δ(y^n, y) + ⟨w, φ(x^n, y)⟩
6:   end for
7:   S ← S ∪ { ((x^1, ..., x^N), (ŷ^1, ..., ŷ^N)) }
8: until S doesn't change anymore
output: prediction function f(x) = argmax_{y∈Y} ⟨w, φ(x, y)⟩.
Often faster convergence: we add one strong constraint per iteration instead of N weak ones. 36 / 1

Probabilistic Graphical Models & Applications

Probabilistic Graphical Models & Applications Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with

More information

Part 4: Conditional Random Fields

Part 4: Conditional Random Fields Part 4: Conditional Random Fields Sebastian Nowozin and Christoph H. Lampert Colorado Springs, 25th June 2011 1 / 39 Problem (Probabilistic Learning) Let d(y x) be the (unknown) true conditional distribution.

More information

Learning with Structured Inputs and Outputs

Learning with Structured Inputs and Outputs Learning with Structured Inputs and Outputs Christoph Lampert Institute of Science and Technology Austria Vienna (Klosterneuburg), Austria INRIA Summer School 2010 Learning with Structured Inputs and Outputs

More information

Introduction To Graphical Models

Introduction To Graphical Models Peter Gehler Introduction to Graphical Models Introduction To Graphical Models Peter V. Gehler Max Planck Institute for Intelligent Systems, Tübingen, Germany ENS/INRIA Summer School, Paris, July 2013

More information

Algorithms for Predicting Structured Data

Algorithms for Predicting Structured Data 1 / 70 Algorithms for Predicting Structured Data Thomas Gärtner / Shankar Vembu Fraunhofer IAIS / UIUC ECML PKDD 2010 Structured Prediction 2 / 70 Predicting multiple outputs with complex internal structure

More information

Lecture 9: PGM Learning

Lecture 9: PGM Learning 13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and

More information

ML4NLP Multiclass Classification

ML4NLP Multiclass Classification ML4NLP Multiclass Classification CS 590NLP Dan Goldwasser Purdue University dgoldwas@purdue.edu Social NLP Last week we discussed the speed-dates paper. Interesting perspective on NLP problems- Can we

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Kernelized Perceptron Support Vector Machines

Kernelized Perceptron Support Vector Machines Kernelized Perceptron Support Vector Machines Emily Fox University of Washington February 13, 2017 What is the perceptron optimizing? 1 The perceptron algorithm [Rosenblatt 58, 62] Classification setting:

More information

Algorithmic Stability and Generalization Christoph Lampert

Algorithmic Stability and Generalization Christoph Lampert Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010

More information

CS60021: Scalable Data Mining. Large Scale Machine Learning

CS60021: Scalable Data Mining. Large Scale Machine Learning J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance

More information

Support Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Support Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington Support Vector Machines CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 A Linearly Separable Problem Consider the binary classification

More information

Introduction to Logistic Regression and Support Vector Machine

Introduction to Logistic Regression and Support Vector Machine Introduction to Logistic Regression and Support Vector Machine guest lecturer: Ming-Wei Chang CS 446 Fall, 2009 () / 25 Fall, 2009 / 25 Before we start () 2 / 25 Fall, 2009 2 / 25 Before we start Feel

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

Support vector machines Lecture 4

Support vector machines Lecture 4 Support vector machines Lecture 4 David Sontag New York University Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin Q: What does the Perceptron mistake bound tell us? Theorem: The

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

1 Machine Learning Concepts (16 points)

1 Machine Learning Concepts (16 points) CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Machine Learning for Structured Prediction

Machine Learning for Structured Prediction Machine Learning for Structured Prediction Grzegorz Chrupa la National Centre for Language Technology School of Computing Dublin City University NCLT Seminar Grzegorz Chrupa la (DCU) Machine Learning for

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information

Learning with Structured Inputs and Outputs

Learning with Structured Inputs and Outputs Learning with Structured Inputs and Outputs Christoph H. Lampert IST Austria (Institute of Science and Technology Austria), Vienna Microsoft Machine Learning and Intelligence School 2015 July 29-August

More information

Gaussian and Linear Discriminant Analysis; Multiclass Classification

Gaussian and Linear Discriminant Analysis; Multiclass Classification Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015

More information

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation

More information

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima. http://goo.gl/xilnmn Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT

More information

Logistic Regression. Machine Learning Fall 2018

Logistic Regression. Machine Learning Fall 2018 Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers

More information

SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels

SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels Karl Stratos June 21, 2018 1 / 33 Tangent: Some Loose Ends in Logistic Regression Polynomial feature expansion in logistic

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima. http://goo.gl/jv7vj9 Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT

More information

Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data

Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data DD2424 March 23, 2017 Binary classification problem given labelled training data Have labelled training examples? Given

More information

Lecture 3: Multiclass Classification

Lecture 3: Multiclass Classification Lecture 3: Multiclass Classification Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Some slides are adapted from Vivek Skirmar and Dan Roth CS6501 Lecture 3 1 Announcement v Please enroll in

More information

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier

More information

Parameter learning in CRF s

Parameter learning in CRF s Parameter learning in CRF s June 01, 2009 Structured output learning We ish to learn a discriminant (or compatability) function: F : X Y R (1) here X is the space of inputs and Y is the space of outputs.

More information

Machine Learning. Classification, Discriminative learning. Marc Toussaint University of Stuttgart Summer 2015

Machine Learning. Classification, Discriminative learning. Marc Toussaint University of Stuttgart Summer 2015 Machine Learning Classification, Discriminative learning Structured output, structured input, discriminative function, joint input-output features, Likelihood Maximization, Logistic regression, binary

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015

Sequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015 Sequence Modelling with Features: Linear-Chain Conditional Random Fields COMP-599 Oct 6, 2015 Announcement A2 is out. Due Oct 20 at 1pm. 2 Outline Hidden Markov models: shortcomings Generative vs. discriminative

More information

Linear classifiers: Overfitting and regularization

Linear classifiers: Overfitting and regularization Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Warm up: risk prediction with logistic regression

Warm up: risk prediction with logistic regression Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T

More information

Support Vector Machine

Support Vector Machine Andrea Passerini passerini@disi.unitn.it Machine Learning Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

More information

Algorithms for NLP. Classification II. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley

Algorithms for NLP. Classification II. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Algorithms for NLP Classification II Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Minimize Training Error? A loss function declares how costly each mistake is E.g. 0 loss for correct label,

More information

Regression with Numerical Optimization. Logistic

Regression with Numerical Optimization. Logistic CSG220 Machine Learning Fall 2008 Regression with Numerical Optimization. Logistic regression Regression with Numerical Optimization. Logistic regression based on a document by Andrew Ng October 3, 204

More information

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly

More information

Logistic Regression. COMP 527 Danushka Bollegala

Logistic Regression. COMP 527 Danushka Bollegala Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will

More information

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018

From Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018 From Binary to Multiclass Classification CS 6961: Structured Prediction Spring 2018 1 So far: Binary Classification We have seen linear models Learning algorithms Perceptron SVM Logistic Regression Prediction

More information

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so.

Midterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so. CS 89 Spring 07 Introduction to Machine Learning Midterm Please do not open the exam before you are instructed to do so. The exam is closed book, closed notes except your one-page cheat sheet. Electronic

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information

Support Vector Machines

Support Vector Machines Two SVM tutorials linked in class website (please, read both): High-level presentation with applications (Hearst 1998) Detailed tutorial (Burges 1998) Support Vector Machines Machine Learning 10701/15781

More information

Announcements - Homework

Announcements - Homework Announcements - Homework Homework 1 is graded, please collect at end of lecture Homework 2 due today Homework 3 out soon (watch email) Ques 1 midterm review HW1 score distribution 40 HW1 total score 35

More information

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES Supervised Learning Linear vs non linear classifiers In K-NN we saw an example of a non-linear classifier: the decision boundary

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Max-margin learning of GM Eric Xing Lecture 28, Apr 28, 2014 b r a c e Reading: 1 Classical Predictive Models Input and output space: Predictive

More information

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2

More information

CSC 411 Lecture 17: Support Vector Machine

CSC 411 Lecture 17: Support Vector Machine CSC 411 Lecture 17: Support Vector Machine Ethan Fetaya, James Lucas and Emad Andrews University of Toronto CSC411 Lec17 1 / 1 Today Max-margin classification SVM Hard SVM Duality Soft SVM CSC411 Lec17

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu

Logistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu Logistic Regression Review 10-601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data

More information

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Perceptrons Definition Perceptron learning rule Convergence Margin & max margin classifiers (Linear) support vector machines Formulation

More information

Intelligent Systems:

Intelligent Systems: Intelligent Systems: Undirected Graphical models (Factor Graphs) (2 lectures) Carsten Rother 15/01/2015 Intelligent Systems: Probabilistic Inference in DGM and UGM Roadmap for next two lectures Definition

More information

Logistic Regression. William Cohen

Logistic Regression. William Cohen Logistic Regression William Cohen 1 Outline Quick review classi5ication, naïve Bayes, perceptrons new result for naïve Bayes Learning as optimization Logistic regression via gradient ascent Over5itting

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Logistic Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574

More information

Undirected Graphical Models

Undirected Graphical Models Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional

More information

Neural Networks and Deep Learning

Neural Networks and Deep Learning Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2014 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released

More information

CS-E4830 Kernel Methods in Machine Learning

CS-E4830 Kernel Methods in Machine Learning CS-E4830 Kernel Methods in Machine Learning Lecture 5: Multi-class and preference learning Juho Rousu 11. October, 2017 Juho Rousu 11. October, 2017 1 / 37 Agenda from now on: This week s theme: going

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Machine Learning A Geometric Approach

Machine Learning A Geometric Approach Machine Learning A Geometric Approach CIML book Chap 7.7 Linear Classification: Support Vector Machines (SVM) Professor Liang Huang some slides from Alex Smola (CMU) Linear Separator Ham Spam From Perceptron

More information

Reading Group on Deep Learning Session 1

Reading Group on Deep Learning Session 1 Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

More information

Brief Introduction of Machine Learning Techniques for Content Analysis

Brief Introduction of Machine Learning Techniques for Content Analysis 1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview

More information

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning

Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should

More information

CS395T: Structured Models for NLP Lecture 2: Machine Learning Review

CS395T: Structured Models for NLP Lecture 2: Machine Learning Review CS395T: Structured Models for NLP Lecture 2: Machine Learning Review Greg Durrett Some slides adapted from Vivek Srikumar, University of Utah Administrivia Course enrollment Lecture slides posted on website

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2016 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

PAC-learning, VC Dimension and Margin-based Bounds

PAC-learning, VC Dimension and Margin-based Bounds More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based

More information

ECE521 Lecture7. Logistic Regression

ECE521 Lecture7. Logistic Regression ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard

More information

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University Lecture 18: Kernels Risk and Loss Support Vector Regression Aykut Erdem December 2016 Hacettepe University Administrative We will have a make-up lecture on next Saturday December 24, 2016 Presentations

More information

ML (cont.): SUPPORT VECTOR MACHINES

ML (cont.): SUPPORT VECTOR MACHINES ML (cont.): SUPPORT VECTOR MACHINES CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 40 Support Vector Machines (SVMs) The No-Math Version

More information

(Kernels +) Support Vector Machines

(Kernels +) Support Vector Machines (Kernels +) Support Vector Machines Machine Learning Torsten Möller Reading Chapter 5 of Machine Learning An Algorithmic Perspective by Marsland Chapter 6+7 of Pattern Recognition and Machine Learning

More information

Towards Lifelong Machine Learning Multi-Task and Lifelong Learning with Unlabeled Tasks Christoph Lampert

Towards Lifelong Machine Learning Multi-Task and Lifelong Learning with Unlabeled Tasks Christoph Lampert Towards Lifelong Machine Learning Multi-Task and Lifelong Learning with Unlabeled Tasks Christoph Lampert HSE Computer Science Colloquium September 6, 2016 IST Austria (Institute of Science and Technology

More information

SUPPORT VECTOR MACHINE

SUPPORT VECTOR MACHINE SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition

More information

Natural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley

Natural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley Natural Language Processing Classification Classification I Dan Klein UC Berkeley Classification Automatically make a decision about inputs Example: document category Example: image of digit digit Example:

More information

Linear discriminant functions

Linear discriminant functions Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative

More information

Linear classifiers: Logistic regression

Linear classifiers: Logistic regression Linear classifiers: Logistic regression STAT/CSE 416: Machine Learning Emily Fox University of Washington April 19, 2018 How confident is your prediction? The sushi & everything else were awesome! The

More information

CS Machine Learning Qualifying Exam

CS Machine Learning Qualifying Exam CS Machine Learning Qualifying Exam Georgia Institute of Technology March 30, 2017 The exam is divided into four areas: Core, Statistical Methods and Models, Learning Theory, and Decision Processes. There

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Linear Classifiers. Blaine Nelson, Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Linear Classifiers. Blaine Nelson, Tobias Scheffer Universität Potsdam Institut für Informatik Lehrstuhl Linear Classifiers Blaine Nelson, Tobias Scheffer Contents Classification Problem Bayesian Classifier Decision Linear Classifiers, MAP Models Logistic

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

More information

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.

Clustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014. Clustering K-means Machine Learning CSE546 Carlos Guestrin University of Washington November 4, 2014 1 Clustering images Set of Images [Goldberger et al.] 2 1 K-means Randomly initialize k centers µ (0)

More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods Support Vector Machines and Kernel Methods Geoff Gordon ggordon@cs.cmu.edu July 10, 2003 Overview Why do people care about SVMs? Classification problems SVMs often produce good results over a wide range

More information