Learning with Structured Inputs and Outputs
|
|
- Emory Bruno Preston
- 5 years ago
- Views:
Transcription
1 Learning with Structured Inputs and Outputs Christoph H. Lampert IST Austria (Institute of Science and Technology Austria), Vienna ENS/INRIA Summer School, Paris, July 2013 Slides: 1 / 10
2 Schedule Monday Introduction to Graphical Models 9:00-9:45 Conditional Random Fields 9:45-10:30 Structured Support Vector Machines Slides available on my home page: 2 / 10
3 Extended version lecture in book form (180 pages) Foundations and Trends in Computer Graphics and Vision now publisher Available as PDF on 3 / 10
4 Standard Regression/Classification: f : X R. Structured Output Learning: f : X Y. 4 / 10
5 Standard Regression/Classification: f : X R. inputs X can be any kind of objects output y is a real number Structured Output Learning: f : X Y. inputs X can be any kind of objects outputs y Y are complex (structured) objects 5 / 10
6 What is structured data? Ad hoc definition: data that consists of several parts, and not only the parts themselves contain information, but also the way in which the parts belong together. Text Molecules / Chemical Structures Documents/HyperText Images 6 / 10
7 What is structured output prediction? Ad hoc definition: predicting structured outputs from input data (in contrast to predicting just a single number, like in classification or regression) Natural Language Processing: Automatic Translation (output: sentences) Sentence Parsing (output: parse trees) Bioinformatics: Secondary Structure Prediction (output: bipartite graphs) Enzyme Function Prediction (output: path in a tree) Speech Processing: Automatic Transcription (output: sentences) Text-to-Speech (output: audio signal) Robotics: Planning (output: sequence of actions) This tutorial: Applications and Examples from Computer Vision 7 / 10
8 Reminder: Graphical Model for Pose Estimation X F (1) top Y top Y head F (2) top,head... Y rhnd Yrarm Y torso Y larm Y rleg Y lleg Y lhnd Y rfoot Y lfoot Joint probability distribution of all body parts p(y x) = 1 Z(x) exp( E F (y F ; x)). F F Exponent ( energy ) decomposes into small but interacting factors. 8 / 10
9 Reminder: Graphical Model for Image Segmentation Probability distribution over all foreground/background segmentations p(y x) = 1 Z(x) exp( E F (y F ; x)). F F Exponent ( energy ) decomposes into small but interacting factors. 9 / 10
10 Reminder: Inference/Prediction Monday: Probabilistic Inference Compute marginal probabilities p(y F x) for any factor F, in particular, p(y i x) for all i V. Monday: MAP Prediction Predict f : X Y by solving y = argmax y Y p(y x) = argmin y Y E(y, x) Today: Parameter Learning Learn learn potentials/energy terms from training data. 10 / 10
11 Part 1: Conditional Random Fields
12 Supervised Learning Problem Given training examples (x 1, y 1 ),..., (x N, y N ) X Y How to make predictions g : X Y? Approach 1) Discrimitive Probabilistic Learning 1) Use training data to obtain an estimate p(y x). 2) Use f(x) = argminȳ Y y p(y x) (y, ȳ) to make predictions. Approach 2) Loss-minimizing Parameter Estimation 1) Use training data to learn an energy function E(x, y) 2) Use f(x) := argmin y Y E(x, y) to make predictions. 2 / 29
13 Conditional Random Field Learning Goal: learn a posterior distribution p(y x) = 1 Z(x) exp( E F (y F ; x) ). F F with F = { all factors }: all unary, pairwise, potentially higher order,... parameterize each E F (y F ; x) = w F, φ F (x, y F ). fixed feature functions ( φ 1 (x 1, y),..., φ F (x F, y) ) : φ(x, y) weight vectors ( w 1,..., w F ) : w Result: log-linear model with parameter vector w 1 p(y x; w) = exp( w, φ(y, x) ). Z(x; w) with Z(x; w) = ȳ Y exp( w, φ(ȳ, x) ) New goal: find best parameter vector w R D. 3 / 29
14 Maximum Likelihood Parameter Estimation Idea 1: Maximize likelihood of outputs y 1,..., y N for inputs x 1,..., x N w = argmax w R D p(y 1,..., y N x 1,..., x N, w) i.i.d. = argmax w R D log( ) = argmin w R D N p(y n x n, w) N log p(y n x n, w) }{{} negative conditional log-likelihood (of D) 4 / 29
15 MAP Estimation of w Idea 2: Treat w as random variable; maximize posterior p(w D) 5 / 29
16 MAP Estimation of w Idea 2: Treat w as random variable; maximize posterior p(w D) p(w D) Bayes = p(x1, y 1,..., x N, y N w)p(w) i.i.d. = p(w) p(d) p(w): prior belief on w (cannot be estimated from data). N p(y n x n, w) p(y n x n ) w = argmax w R D = argmin w R D [ = argmin w R D [ ] p(w D) = argmin log p(w D) w R D [ N log p(w) log p(w) N ] log p(y n x n, w) + log p(y n x n ) }{{} indep. of w ] log p(y n x n, w) 5 / 29
17 w [ = argmin log p(w) w R D N log p(y n x n, w) ] Choices for p(w): p(w) : const. w [ = argmin w R D (uniform; in R D not really a distribution) N log p(y n x n, w) }{{} negative conditional log-likelihood + const. ] p(w) := const. e λ 2 w 2 (Gaussian) w [ λ N = argmin w R D 2 w 2 + log p(y n x n, w) + const. ] }{{} regularized negative conditional log-likelihood 6 / 29
18 Probabilistic Models for Structured Prediction - Summary Negative (Regularized) Conditional Log-Likelihood (of D) L(w) = λ 2 w 2 + N [ w, φ(x n, y n ) + log e w,φ(xn,y) ] (λ 0 makes it unregularized) y Y Probabilistic parameter estimation or training means solving w = argmin L(w). w R D Same optimization problem as for multi-class logistic regression. 7 / 29
19 Negative Conditional Log-Likelihood (Toy Example) / 29
20 Steepest Descent Minimization minimize L(w) input tolerance ɛ > 0 1: w cur 0 2: repeat 3: v w L(w cur ) 4: η argmin η R L(w cur ηv) 5: w cur w cur ηv 6: until v < ɛ output w cur Alternatives: L-BFGS (second-order descent without explicit Hessian) Conjugate Gradient We always need (at least) the gradient of L. 9 / 29
21 L(w) = λ 2 w 2 + w L(w) = λw + = λw + = λw + N [ w, φ(x n, y n ) + log e w,φ(xn,y) ] y Y N [ φ(x n, y n y Y ) e w,φ(xn,y) φ(x n, y) ] ȳ Y e w,φ(xn,ȳ) N [ φ(x n, y n ) p(y x n, w)φ(x n, y) ] y Y N [ φ(x n, y n ) E y p(y x n,w)φ(x n, y) ] L(w) = λid D D + N E y p(y x n,w) { φ(x n, y)φ(x n, y) } 10 / 29
22 L(w) = λ N [ 2 w 2 + w, φ(x n, y n ) + log e w,φ(xn,y) ] y Y continuous (not discrete), C -differentiable on all R D slice through objective value (w x [ 3, 5], w y = 0) 11 / 29
23 N [ w L(w) = λw + φ(x n, y n ) E y p(y x n,w)φ(x n, y) ] For λ 0: E y p(y x n,w)φ(x n, y) = φ(x n, y n ) w L(w) = 0, criticial point of L (local minimum/maximum/saddle point). Interpretation: We want the model distribution to match the empirical one: E y p(y x,w) φ(x, y)! = φ(x, y obs ) E.g. Image Segmentation φ unary : correct amount of foreground vs. background φ pairwise : correct amount of fg/bg transitions smoothness 12 / 29
24 L(w) = λid D D + N E y p(y x n,w) { φ(x n, y)φ(x n, y) } positive definite Hessian matrix L(w) is convex w L(w) = 0 implies global minimum slice through objective value (w x [ 3, 5], w y = 0) 13 / 29
25 Milestone I: Probabilistic Training (Conditional Random Fields) p(y x, w) log-linear in w R D. Training: minimize negative conditional log-likelihood, L(w) L(w) is differentiable and convex, gradient descent will find global optimum with w L(w) = 0 Same structure as multi-class logistic regression. 14 / 29
26 Milestone I: Probabilistic Training (Conditional Random Fields) p(y x, w) log-linear in w R D. Training: minimize negative conditional log-likelihood, L(w) L(w) is differentiable and convex, gradient descent will find global optimum with w L(w) = 0 Same structure as multi-class logistic regression. For logistic regression: this is where the textbook ends. We re done. For conditional random fields: we re not in safe waters, yet! 14 / 29
27 Solving the Training Optimization Problem Numerically Task: Compute v = w L(w cur ), evaluate L(w cur + ηv): L(w) = λ N [ 2 w 2 + w, φ(x n, y n ) + log e w,φ(xn,y) ] y Y w L(w) = λ N 2 w + [ φ(x n, y n ) p(y x n, w)φ(x n, y) ] y Y Problem: Y typically is very (exponentially) large: binary image segmentation: Y = ranking N images: Y = N!, e.g. N = 1000: Y We must use the structure in Y, or we re lost. 15 / 29
28 Solving the Training Optimization Problem Numerically N [ w L(w) = λw + φ(x n, y n ) E y p(y x n,w)φ(x n, y) ] Computing the Gradient (naive): O(K M ND) L(w) = λ N [ 2 w 2 + w, φ(x n, y n ) + log Z(x n, w) ] Line Search (naive): O(K M ND) per evaluation of L N: number of samples D: dimension of feature space M: number of output nodes K: number of possible labels of each output nodes 16 / 29
29 Solving the Training Optimization Problem Numerically N [ w L(w) = λw + φ(x n, y n ) E y p(y x n,w)φ(x n, y) ] Computing the Gradient (naive): O(K M ND) L(w) = λ N [ 2 w 2 + w, φ(x n, y n ) + log Z(x n, w) ] Line Search (naive): O(K M ND) per evaluation of L N: number of samples D: dimension of feature space M: number of output nodes 100s to 1,000,000s K: number of possible labels of each output nodes 2 to 1000s 16 / 29
30 Solving the Training Optimization Problem Numerically In a graphical model with factors F, the features decompose: φ(x, y) = ( ) φ F (x, y F ) F F ( ) E y p(y x,w) φ(x, y) = E y p(y x,w) φ F (x, y F ) F F = ( E yf p(y F x,w)φ F (x, y F ) ) F F E yf p(y F x,w)φ F (x, y F ) = Factor marginals µ F = p(y F x, w) p(y F x, w) φ }{{} F (x, y F ) y F Y F }{{} factor marginals K F terms are much smaller than complete joint distribution p(y x, w), can be computed/approximated, e.g., with (loopy) belief propagation. 17 / 29
31 Solving the Training Optimization Problem Numerically w L(w) = λw + N [ φ(x n, y n ) E y p(y x n,w)φ(x n, y) ] Computing the Gradient: O(K M nd), O(MK Fmax ND): L(w) = λ 2 w 2 + N [ w, φ(x n, y n ) + log e w,φ(xn,y) ] y Y Line Search: O(K M nd), O(MK Fmax ND) per evaluation of L N: number of samples D: dimension of feature space M: number of output nodes K: number of possible labels of each output nodes 18 / 29
32 Solving the Training Optimization Problem Numerically w L(w) = λw + N [ φ(x n, y n ) E y p(y x n,w)φ(x n, y) ] Computing the Gradient: O(K M nd), O(MK Fmax ND): L(w) = λ 2 w 2 + N [ w, φ(x n, y n ) + log e w,φ(xn,y) ] y Y Line Search: O(K M nd), O(MK Fmax ND) per evaluation of L N: number of samples 10s to 1,000,000s D: dimension of feature space M: number of output nodes K: number of possible labels of each output nodes 18 / 29
33 Solving the Training Optimization Problem Numerically What, if the training set D is too large (e.g. millions of examples)? Stochastic Gradient Descent (SGD) Minimize L(w), but without ever computing L(w) or L(w) exactly In each gradient descent step: Pick random subset D D, often just 1 3 elements! Follow approximate gradient L(w) = λw + D [ φ(x n, y n ) E y p(y xn,w)φ(x n, y) ] D (x n,y n ) D more: see L. Bottou, O. Bousquet: The Tradeoffs of Large Scale Learning, NIPS also: 19 / 29
34 Solving the Training Optimization Problem Numerically What, if the training set D is too large (e.g. millions of examples)? Stochastic Gradient Descent (SGD) Minimize L(w), but without ever computing L(w) or L(w) exactly In each gradient descent step: Pick random subset D D, often just 1 3 elements! Follow approximate gradient L(w) = λw + D [ φ(x n, y n ) E y p(y xn,w)φ(x n, y) ] D (x n,y n ) D Avoid line search by using fixed stepsize rule η (new parameter) more: see L. Bottou, O. Bousquet: The Tradeoffs of Large Scale Learning, NIPS also: 19 / 29
35 Solving the Training Optimization Problem Numerically What, if the training set D is too large (e.g. millions of examples)? Stochastic Gradient Descent (SGD) Minimize L(w), but without ever computing L(w) or L(w) exactly In each gradient descent step: Pick random subset D D, often just 1 3 elements! Follow approximate gradient L(w) = λw + D [ φ(x n, y n ) E y p(y xn,w)φ(x n, y) ] D (x n,y n ) D Avoid line search by using fixed stepsize rule η (new parameter) SGD converges to argmin w L(w)! (if η chosen right) more: see L. Bottou, O. Bousquet: The Tradeoffs of Large Scale Learning, NIPS also: 19 / 29
36 Solving the Training Optimization Problem Numerically What, if the training set D is too large (e.g. millions of examples)? Stochastic Gradient Descent (SGD) Minimize L(w), but without ever computing L(w) or L(w) exactly In each gradient descent step: Pick random subset D D, often just 1 3 elements! Follow approximate gradient L(w) = λw + D [ φ(x n, y n ) E y p(y xn,w)φ(x n, y) ] D (x n,y n ) D Avoid line search by using fixed stepsize rule η (new parameter) SGD converges to argmin w L(w)! (if η chosen right) SGD needs more iterations, but each one is much faster more: see L. Bottou, O. Bousquet: The Tradeoffs of Large Scale Learning, NIPS also: 19 / 29
37 Solving the Training Optimization Problem Numerically w L(w) = λw + N [ φ(x n, y n ) E y p(y x n,w)φ(x n, y) ] Computing the Gradient: O(K M nd), O(MK 2 ND) (if BP is possible): L(w) = λ 2 w 2 + N [ w, φ(x n, y n ) + log e w,φ(xn,y) ] y Y Line Search: O(K M nd), O(MK 2 ND) per evaluation of L N: number of samples D: dimension of feature space: φ i,j 1 10s, φ i : 100s to 10000s M: number of output nodes K: number of possible labels of each output nodes 20 / 29
38 Solving the Training Optimization Problem Numerically Typical feature functions in image segmentation: φ i (y i, x) R 1000 : local image features, e.g. bag-of-words w i, φ i (y i, x) : local classifier (like logistic-regression) φ i,j (y i, y j ) = y i = y j R 1 : test for same label w ij, φ ij (y i, y j ) : penalizer for label changes (if w ij > 0) combined: argmax y p(y x) is smoothed version of local cues original local confidence local + smoothness 21 / 29
39 Solving the Training Optimization Problem Numerically Typical feature functions in pose estimation: φ i (y i, x) R 1000 : local image representation, e.g. HoG w i, φ i (y i, x) : local confidence map φ i,j (y i, y j ) = good fit(y i, y j ) R 1 : test for geometric fit w ij, φ ij (y i, y j ) : penalizer for unrealistic poses together: argmax y p(y x) is sanitized version of local cues original local confidence local + geometry [V. Ferrari, M. Marin-Jimenez, A. Zisserman: Progressive Search Space Reduction for Human Pose Estimation, CVPR 2008.] 22 / 29
40 Solving the Training Optimization Problem Numerically Idea: split learning of unary potentials into two parts: local classifiers, their importance. Two-Stage Training pre-train f y i (x) ˆ= log p(y i x) use φ i (y i, x) := f y i (x) RK (low-dimensional) keep φ ij (y i, y j ) are before perform CRF learning with φ i and φ ij 23 / 29
41 Solving the Training Optimization Problem Numerically Idea: split learning of unary potentials into two parts: local classifiers, their importance. Two-Stage Training pre-train f y i (x) ˆ= log p(y i x) use φ i (y i, x) := f y i (x) RK (low-dimensional) keep φ ij (y i, y j ) are before perform CRF learning with φ i and φ ij Advantage: lower dimensional feature space during inference faster f y i Disadvantage: (x) can be any classifiers, e.g. non-linear SVMs, deep network,... if local classifiers are bad, CRF training cannot fix that. 23 / 29
42 Solving the Training Optimization Problem Numerically CRF training methods is based on gradient-descent optimization. The faster we can do it, the better (more realistic) models we can use: w L(w) = λw N [ φ(x n, y n ) y Y p(y x n, w) φ(x n, y) ] R D A lot of research on accelerating CRF training: problem solution method(s) Y too large exploit structure (loopy) belief propagation smart sampling contrastive divergence use approximate L e.g. pseudo-likelihood N too large mini-batches stochastic gradient descent D too large trained φ unary two-stage training 24 / 29
43 CRFs with Latent Variables So far, training was fully supervised, all variables were observed. In real life, some variables can be unobserved even during training. missing labels in training data latent variables, e.g. part location latent variables, e.g. part occlusion latent variables, e.g. viewpoint 25 / 29
44 CRFs with Latent Variables Three types of variables in graphical model: x X always observed (input), y Y observed only in training (output), z Z never observed (latent). Example: x : image y : part positions z {0, 1} : flag front-view or side-view images: [Felzenszwalb et al., Object Detection with Discriminatively Trained Part Based Models, T-PAMI, 2010] 26 / 29
45 CRFs with Latent Variables Marginalization over Latent Variables Construct conditional likelihood as usual: p(y, z x, w) = 1 exp( w, φ(x, y, z) ) Z(x, w) Derive p(y x, w) by marginalizing over z: p(y x, w) = z Z p(y, z x, w) = 1 Z(x, w) exp( w, φ(x, y, z) ) z Z 27 / 29
46 Negative regularized conditional log-likelihood: L(w) = λ 2 w 2 + = λ 2 w 2 + = λ 2 w 2 + N log p(y n x n, w) N log p(y n, z x n, w) z Z N log exp( w, φ(x n, y n, z) ) z Z N log z Z y Y exp( w, φ(x n, y, z) ) L is not convex in w local minima possible How to train CRFs with latent variables is active research. 28 / 29
47 Summary CRF Learning Given: training set {(x 1, y 1 ),..., (x N, y N )} X Y feature functions φ : X Y R D that decomposes over factors, φ F : X Y F R d for F F Overall model is log-linear (in parameter w) p(y x; w) e w,φ(x,y) CRF training requires minimizing negative conditional log-likelihood: w = argmin w λ 2 w 2 + N [ w, φ(x n, y n ) log e w,φ(xn,y) ] y Y convex optimization problem (stochastic) gradient descent works training needs repeated runs of probabilistic inference latent variables are possible, but make training non-convex 29 / 29
48 Part 2: Structured Support Vector Machines
49 Supervised Learning Problem Training examples (x 1, y 1 ),..., (x N, y N ) X Y Loss function : Y Y R. How to make predictions g : X Y? Approach 2) Loss-minimizing Parameter Estimation 1) Use training data to learn an energy function E(x, y) 2) Use f(x) := argmin y Y E(x, y) to make predictions. Slight variation (for historic reasons): 1) Learn a compatibility function g(x, y) (think: g = E ) 2) Use f(x) := argmax y Y g(x, y) to make predictions. 2 / 1
50 Loss-Minimizing Parameter Learning D = {(x 1, y 1 ),..., (x N, y N )} i.i.d. training set φ : X Y R D be a feature function. : Y Y R be a loss function. Find a weight vector w that minimizes the expected loss E (x,y) (y, f(x)) for f(x) = argmax y Y w, φ(x, y). 3 / 1
51 Loss-Minimizing Parameter Learning D = {(x 1, y 1 ),..., (x N, y N )} i.i.d. training set φ : X Y R D be a feature function. : Y Y R be a loss function. Find a weight vector w that minimizes the expected loss E (x,y) (y, f(x)) for f(x) = argmax y Y w, φ(x, y). Advantage: We directly optimize for the quantity of interest: expected loss. No expensive-to-compute partition function Z will show up. Disadvantage: We need to know the loss function already at training time. We can t use probabilistic reasoning to find w. 3 / 1
52 Reminder: Regularized Risk Minimization Task: for f(x) = argmax y Y w, φ(x, y) min w R D E (x,y) (y, f(x)) Two major problems: data distribution is unknown we can t compute E f : X Y has output in a discrete space f is piecewise constant w.r.t. w ( y, f(x)) is discontinuous, piecewise constant w.r.t w we can t apply gradient-based optimization 4 / 1
53 Reminder: Regularized Risk Minimization Task: for f(x) = argmax y Y w, φ(x, y) min w R D E (x,y) (y, f(x)) Problem 1: data distribution is unknown Solution: Replace E (x,y) d(x,y) ( ) with empirical estimate 1 N To avoid overfitting: add a regularizer, e.g. λ 2 w 2. ( ) (x n,y n ) New task: min w R D λ 2 w N N ( y n, f(x n ) ). 5 / 1
54 Reminder: Regularized Risk Minimization Task: for f(x) = argmax y Y w, φ(x, y) min w R D λ 2 w N N ( y n, f(x n ) ). Problem: ( y n, f(x n ) ) = ( y, argmax y w, φ(x, y) ) discontinuous w.r.t. w. Solution: Replace (y, y ) with well behaved l(x, y, w) Typically: l upper bound to, continuous and convex w.r.t. w. New task: min w R D λ 2 w N N l(x n, y n, w)) 6 / 1
55 Reminder: Regularized Risk Minimization min w R D λ 2 w N N l(x n, y n, w)) Regularization + Loss on training data 7 / 1
56 Reminder: Regularized Risk Minimization min w R D λ 2 w N Hinge loss: maximum margin training N l(x n, y n, w)) Regularization + Loss on training data l(x n, y n [, w) := max (y n, y) + w, φ(x n, y) w, φ(x n, y n ) ] y Y 7 / 1
57 Reminder: Regularized Risk Minimization min w R D λ 2 w N Hinge loss: maximum margin training N l(x n, y n, w)) Regularization + Loss on training data l(x n, y n [, w) := max (y n, y) + w, φ(x n, y) w, φ(x n, y n ) ] y Y l is maximum over linear functions continuous, convex. l is an upper bound to : small l small 7 / 1
58 Reminder: Regularized Risk Minimization min w R D λ 2 w N Hinge loss: maximum margin training N l(x n, y n, w)) Regularization + Loss on training data l(x n, y n [, w) := max (y n, y) + w, φ(x n, y) w, φ(x n, y n ) ] y Y Alternative: Logistic loss: probabilistic training l(x n, y n, w) := log y Y exp ( w, φ(x n, y) w, φ(x n, y n ) ) Differentiable, convex, not an upper bound to (y, y ). 7 / 1
59 Structured Output Support Vector Machine min w λ 2 w N N max y Y [ ] (y n, y) + w, φ(x n, y) w, φ(x n, y n ) Conditional Random Field min w λ 2 w 2 + N log exp ( w, φ(x n, y) w, φ(x n, y n ) ) y Y }{{} = w,φ(x n,y n ) +exp( w,φ(x n,y) ) = cond.log.likelihood CRFs and SSVMs have more in common than usually assumed. log y exp( ) can be interpreted as a soft-max but: CRF doesn t take loss function into account at training time 8 / 1
60 Example: Multiclass SVM { Y = {1, 2,..., K}, (y, y 1 for y y ) = 0 otherwise. ( ) φ(x, y) = y = 1 φ(x), y = 2 φ(x),..., y = K φ(x) Solve: min w λ 2 w N N max y Y [ ] (y n, y) + w, φ(x n, y) w, φ(x n, y n ) } {{ } { = 0 for y = y n 1 + w, φ(x n, y) w, φ(x n, y n ) for y y n Classification: f(x) = argmax y Y w, φ(x, y). Crammer-Singer Multiclass SVM [K. Crammer, Y. Singer: On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines, JMLR, 2001] 9 / 1
61 Example: Multiclass SVM { Y = {1, 2,..., K}, (y, y 1 for y y ) = 0 otherwise. ( ) φ(x, y) = y = 1 φ(x), y = 2 φ(x),..., y = K φ(x) Solve: min w λ 2 w N N max y Y [ ] (y n, y) + w, φ(x n, y) w, φ(x n, y n ) } {{ } { = 0 for y = y n 1 + w, φ(x n, y) w, φ(x n, y n ) for y y n Classification: f(x) = argmax y Y w, φ(x, y). Crammer-Singer Multiclass SVM [K. Crammer, Y. Singer: On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines, JMLR, 2001] 9 / 1
62 Example: Hierarchical Multiclass SVM Hierarchical Multiclass Loss: (y, y ) := 1 (distance in tree) 2 (cat, cat) = 0, (cat, dog) = 1, (cat, bus) = 2, etc. min w λ 2 w N N max y Y [ ] (y n, y) + w, φ(x n, y) w, φ(x n, y n ) } {{ } e.g. if y n = cat, w, φ(x n, cat) w, φ(x n, dog)! 1 w, φ(x n, cat) w, φ(x n, car)! 2 w, φ(x n, cat) w, φ(x n, bus)! 2. [L. Cai, T. Hofmann: Hierarchical Document Categorization with Support Vector Machines, ACM CIKM, 2004] [A. Binder, K.-R. Müller, M. Kawanabe: On taxonomies for multi-class image categorization, IJCV, 2011] 10 / 1
63 Example: Hierarchical Multiclass SVM Hierarchical Multiclass Loss: (y, y ) := 1 (distance in tree) 2 (cat, cat) = 0, (cat, dog) = 1, (cat, bus) = 2, etc. min w λ 2 w N N max y Y [ ] (y n, y) + w, φ(x n, y) w, φ(x n, y n ) } {{ } e.g. if y n = cat, w, φ(x n, cat) w, φ(x n, dog)! 1 w, φ(x n, cat) w, φ(x n, car)! 2 w, φ(x n, cat) w, φ(x n, bus)! 2. labels that cause more loss are pushed further away lower chance of high-loss mistake at test time [L. Cai, T. Hofmann: Hierarchical Document Categorization with Support Vector Machines, ACM CIKM, 2004] [A. Binder, K.-R. Müller, M. Kawanabe: On taxonomies for multi-class image categorization, IJCV, 2011] 10 / 1
64 Solving S-SVM Training Numerically We can solve SSVM training like CRF training: min w λ 2 w N N [ ] max y Y (yn, y) + w, φ(x n, y) w, φ(x n, y n ) continuous unconstrained convex non-differentiable we can t use gradient descent directly. we ll have to use subgradients 11 / 1
65 Solving S-SVM Training Numerically Subgradient Method Definition Let f : R D R be a convex, not necessarily differentiable, function. A vector v R D is called a subgradient of f at w 0, if f(w) f(w 0 ) + v, w w 0 for all w. f(w) v f(w0)+ v,w-w 0 f(w0) w0 w 12 / 1
66 Solving S-SVM Training Numerically Subgradient Method Definition Let f : R D R be a convex, not necessarily differentiable, function. A vector v R D is called a subgradient of f at w 0, if f(w) f(w 0 ) + v, w w 0 for all w. f(w) v f(w0)+ v,w-w 0 f(w0) w0 w 12 / 1
67 Solving S-SVM Training Numerically Subgradient Method Definition Let f : R D R be a convex, not necessarily differentiable, function. A vector v R D is called a subgradient of f at w 0, if f(w) f(w 0 ) + v, w w 0 for all w. f(w) v f(w0) f(w0)+ v,w-w 0 w w0 12 / 1
68 Solving S-SVM Training Numerically Subgradient Method Definition Let f : R D R be a convex, not necessarily differentiable, function. A vector v R D is called a subgradient of f at w 0, if f(w) f(w 0 ) + v, w w 0 for all w. f(w) v f(w0) f(w0)+ v,w-w 0 w w0 For differentiable f, the gradient v = f(w 0 ) is the only subgradient. f(w) v f(w0)+ v,w-w 0 f(w0) w0 w 12 / 1
69 Solving S-SVM Training Numerically Subgradient Method Subgradient method works basically like gradient descent: Subgradient Method Minimization minimize F (w) require: tolerance ɛ > 0, stepsizes η t w cur 0 repeat v sub w F (w cur ) wcur w cur η t v until F changed less than ɛ return w cur Converges to global minimum, but rather inefficient if F non-differentiable. [Shor, Minimization methods for non-differentiable functions, Springer, 1985.] 13 / 1
70 Solving S-SVM Training Numerically Subgradient Method Computing a subgradient: min w λ 2 w N N l n (w) with l n (w) = max y l n y (w), and l n y (w) := (y n, y) + w, φ(x n, y) w, φ(x n, y n ) 14 / 1
71 Solving S-SVM Training Numerically Subgradient Method Computing a subgradient: min w λ 2 w N N l n (w) with l n (w) = max y l n y (w), and l n y (w) := (y n, y) + w, φ(x n, y) w, φ(x n, y n ) l(w) y w For each y Y, l n y (w) is a linear function of w. 14 / 1
72 Solving S-SVM Training Numerically Subgradient Method Computing a subgradient: min w λ 2 w N N l n (w) with l n (w) = max y l n y (w), and l n y (w) := (y n, y) + w, φ(x n, y) w, φ(x n, y n ) l(w) y' w For each y Y, l n y (w) is a linear function of w. 14 / 1
73 Solving S-SVM Training Numerically Subgradient Method Computing a subgradient: min w λ 2 w N N l n (w) with l n (w) = max y l n y (w), and l n y (w) := (y n, y) + w, φ(x n, y) w, φ(x n, y n ) l(w) w For each y Y, l n y (w) is a linear function of w. 14 / 1
74 Solving S-SVM Training Numerically Subgradient Method Computing a subgradient: min w λ 2 w N N l n (w) with l n (w) = max y l n y (w), and l n y (w) := (y n, y) + w, φ(x n, y) w, φ(x n, y n ) l(w) w max over finite Y: piece-wise linear 14 / 1
75 Solving S-SVM Training Numerically Subgradient Method Computing a subgradient: min w λ 2 w N N l n (w) with l n (w) = max y l n y (w), and l n y (w) := (y n, y) + w, φ(x n, y) w, φ(x n, y n ) l(w) w w0 Subgradient of l n at w 0 : 14 / 1
76 Solving S-SVM Training Numerically Subgradient Method Computing a subgradient: min w λ 2 w N N l n (w) with l n (w) = max y l n y (w), and l n y (w) := (y n, y) + w, φ(x n, y) w, φ(x n, y n ) l(w) w w0 Subgradient of l n at w 0 : find maximal (active) y. 14 / 1
77 Solving S-SVM Training Numerically Subgradient Method Computing a subgradient: min w λ 2 w N N l n (w) with l n (w) = max y l n y (w), and l n y (w) := (y n, y) + w, φ(x n, y) w, φ(x n, y n ) l(w) v Subgradient of l n at w 0 : find maximal (active) y, use v = l n y (w 0 ). w0 w 14 / 1
78 Solving S-SVM Training Numerically Subgradient Method Subgradient Method S-SVM Training input training pairs {(x 1, y 1 ),..., (x n, y n )} X Y, input feature map φ(x, y), loss function (y, y ), regularizer λ, input number of iterations T, stepsizes η t for t = 1,..., T 1: w 0 2: for t=1,...,t do 3: for i=1,...,n do 4: ŷ argmax y Y (y n, y) + w, φ(x n, y) w, φ(x n, y n ) 5: v n φ(x n, ŷ) φ(x n, y n ) 6: end for 7: w w η t (λw 1 N n vn ) 8: end for output prediction function f(x) = argmax y Y w, φ(x, y). Obs: each update of w needs N argmax-prediction (one per example). 15 / 1
79 Solving S-SVM Training Numerically Subgradient Method Same trick as for CRFs: stochastic updates: Stochastic Subgradient Method S-SVM Training input training pairs {(x 1, y 1 ),..., (x n, y n )} X Y, input feature map φ(x, y), loss function (y, y ), regularizer λ, input number of iterations T, stepsizes η t for t = 1,..., T 1: w 0 2: for t=1,...,t do 3: (x n, y n ) randomly chosen training example pair 4: ŷ argmax y Y (y n, y) + w, φ(x n, y) w, φ(x n, y n ) 5: w w η t (λw 1 N [φ(xn, ŷ) φ(x n, y n )]) 6: end for output prediction function f(x) = argmax y Y w, φ(x, y). Observation: each update of w needs only 1 argmax-prediction (but we ll need many iterations until convergence) 16 / 1
80 Example: Image Segmenatation X images, Y = { binary segmentation ( masks }. Training example(s): (x n, y n ) =, ) (y, ȳ) = p y p ȳ p (Hamming loss) Images: [Carreira, Li, Sminchisescu, Object Recognition by Sequential Figure-Ground Ranking, IJCV 2010] 17 / 1
81 Example: Image Segmenatation X images, Y = { binary segmentation ( masks }. Training example(s): (x n, y n ) =, ) (y, ȳ) = p y p ȳ p (Hamming loss) t = 1: ŷ = φ(y n ) φ(ŷ): black +, white +, green, blue, gray Images: [Carreira, Li, Sminchisescu, Object Recognition by Sequential Figure-Ground Ranking, IJCV 2010] 17 / 1
82 Example: Image Segmenatation X images, Y = { binary segmentation ( masks }. Training example(s): (x n, y n ) =, ) (y, ȳ) = p y p ȳ p (Hamming loss) t = 1: ŷ = t = 2: ŷ = φ(y n ) φ(ŷ): black +, white +, green, blue, gray φ(y n ) φ(ŷ): black +, white +, green =, blue =, gray Images: [Carreira, Li, Sminchisescu, Object Recognition by Sequential Figure-Ground Ranking, IJCV 2010] 17 / 1
83 Example: Image Segmenatation X images, Y = { binary segmentation ( masks }. Training example(s): (x n, y n ) =, ) (y, ȳ) = p y p ȳ p (Hamming loss) t = 1: ŷ = t = 2: ŷ = t = 3: ŷ = φ(y n ) φ(ŷ): black +, white +, green, blue, gray φ(y n ) φ(ŷ): black +, white +, green =, blue =, gray φ(y n ) φ(ŷ): black =, white =, green, blue, gray Images: [Carreira, Li, Sminchisescu, Object Recognition by Sequential Figure-Ground Ranking, IJCV 2010] 17 / 1
84 Example: Image Segmenatation X images, Y = { binary segmentation ( masks }. Training example(s): (x n, y n ) =, ) (y, ȳ) = p y p ȳ p (Hamming loss) t = 1: ŷ = t = 2: ŷ = t = 3: ŷ = φ(y n ) φ(ŷ): black +, white +, green, blue, gray φ(y n ) φ(ŷ): black +, white +, green =, blue =, gray φ(y n ) φ(ŷ): black =, white =, green, blue, gray t = 4: ŷ = φ(y n ) φ(ŷ): black =, white =, green, blue =, gray = Images: [Carreira, Li, Sminchisescu, Object Recognition by Sequential Figure-Ground Ranking, IJCV 2010] 17 / 1
85 Example: Image Segmenatation X images, Y = { binary segmentation ( masks }. Training example(s): (x n, y n ) =, ) (y, ȳ) = p y p ȳ p (Hamming loss) t = 1: ŷ = t = 2: ŷ = t = 3: ŷ = φ(y n ) φ(ŷ): black +, white +, green, blue, gray φ(y n ) φ(ŷ): black +, white +, green =, blue =, gray φ(y n ) φ(ŷ): black =, white =, green, blue, gray t = 4: ŷ = φ(y n ) φ(ŷ): black =, white =, green, blue =, gray = t = 5: ŷ = φ(y n ) φ(ŷ): black =, white =, green =, blue =, gray = Images: [Carreira, Li, Sminchisescu, Object Recognition by Sequential Figure-Ground Ranking, IJCV 2010] 17 / 1
86 Example: Image Segmenatation X images, Y = { binary segmentation ( masks }. Training example(s): (x n, y n ) =, ) (y, ȳ) = p y p ȳ p (Hamming loss) t = 1: ŷ = t = 2: ŷ = t = 3: ŷ = φ(y n ) φ(ŷ): black +, white +, green, blue, gray φ(y n ) φ(ŷ): black +, white +, green =, blue =, gray φ(y n ) φ(ŷ): black =, white =, green, blue, gray t = 4: ŷ = φ(y n ) φ(ŷ): black =, white =, green, blue =, gray = t = 5: ŷ = φ(y n ) φ(ŷ): black =, white =, green =, blue =, gray = t = 6,... : no more changes. Images: [Carreira, Li, Sminchisescu, Object Recognition by Sequential Figure-Ground Ranking, IJCV 2010] 17 / 1
87 Solving S-SVM Training Numerically Structured Support Vector Machine: min w λ 2 w N N max [ (y n, y) + w, φ(x n, y) w, φ(x n, y n ) )] y Y Subgradient method converges slowly. Can we do better? 18 / 1
88 Solving S-SVM Training Numerically Structured Support Vector Machine: min w λ 2 w N N max [ (y n, y) + w, φ(x n, y) w, φ(x n, y n ) )] y Y Subgradient method converges slowly. Can we do better? Remember from SVM: We can use inequalities and slack variables to encode the loss. 18 / 1
89 Solving S-SVM Training Numerically Structured SVM (equivalent formulation): Idea: slack variables min w,ξ λ 2 w N N ξ n subject to, for n = 1,..., N, max y Y [ ] (y n, y) + w, φ(x n, y) w, φ(x n, y n ) ξ n Note: ξ n 0 automatic, because left hand side is non-negative. Differentiable objective, convex, N non-linear contraints, 19 / 1
90 Solving S-SVM Training Numerically Structured SVM (also equivalent formulation): Idea: expand max-constraint into individual cases min w,ξ λ 2 w N N ξ n subject to, for n = 1,..., N, (y n, y) + w, φ(x n, y) w, φ(x n, y n ) ξ n, for all y Y Differentiable objective, convex, N Y linear constraints 20 / 1
91 Solving S-SVM Training Numerically Solve an S-SVM like a linear SVM: min w R D,ξ R n λ 2 w N N ξ n subject to, for i = 1,... n, w, φ(x n, y n ) w, φ(x n, y) (y n, y) ξ n, for all y Y. Introduce feature vectors δφ(x n, y n, y) := φ(x n, y n ) φ(x n, y). 21 / 1
92 Solving S-SVM Training Numerically Solve λ min w R D,ξ R n 2 w N + subject to, for i = 1,... n, for all y Y, N ξ n Same structure as an ordinary SVM! quadratic objective linear constraints w, δφ(x n, y n, y) (y n, y) ξ n. 22 / 1
93 Solving S-SVM Training Numerically Solve λ min w R D,ξ R n 2 w N + subject to, for i = 1,... n, for all y Y, N ξ n Same structure as an ordinary SVM! quadratic objective linear constraints w, δφ(x n, y n, y) (y n, y) ξ n. Question: Can we use an ordinary SVM/QP solver? 22 / 1
94 Solving S-SVM Training Numerically Solve λ min w R D,ξ R n 2 w N + subject to, for i = 1,... n, for all y Y, N ξ n Same structure as an ordinary SVM! quadratic objective linear constraints w, δφ(x n, y n, y) (y n, y) ξ n. Question: Can we use an ordinary SVM/QP solver? Answer: Almost! We could, if there weren t N Y constraints. E.g. 100 binary images: constraints 22 / 1
95 Solving S-SVM Training Numerically Working Set Solution: working set training It s enough if we enforce the active constraints. The others will be fulfilled automatically. We don t know which ones are active for the optimal solution. But it s likely to be only a small number can of course be formalized. Keep a set of potentially active constraints and update it iteratively: 23 / 1
96 Solving S-SVM Training Numerically Working Set Solution: working set training It s enough if we enforce the active constraints. The others will be fulfilled automatically. We don t know which ones are active for the optimal solution. But it s likely to be only a small number can of course be formalized. Keep a set of potentially active constraints and update it iteratively: Solving S-SVM Training Numerically Working Set Start with working set S = (no contraints) Repeat until convergence: Solve S-SVM training problem with constraints from S Check, if solution violates any of the full constraint set if no: we found the optimal solution, terminate. if yes: add most violated constraints to S, iterate. 23 / 1
97 Solving S-SVM Training Numerically Working Set Solution: working set training It s enough if we enforce the active constraints. The others will be fulfilled automatically. We don t know which ones are active for the optimal solution. But it s likely to be only a small number can of course be formalized. Keep a set of potentially active constraints and update it iteratively: Solving S-SVM Training Numerically Working Set Start with working set S = (no contraints) Repeat until convergence: Solve S-SVM training problem with constraints from S Check, if solution violates any of the full constraint set if no: we found the optimal solution, terminate. if yes: add most violated constraints to S, iterate. Good practical performance and theoretic guarantees: polynomial time convergence ɛ-close to the global optimum 23 / 1
98 Working Set S-SVM Training input training pairs {(x 1, y 1 ),..., (x n, y n )} X Y, input feature map φ(x, y), loss function (y, y ), regularizer λ 1: w 0, S 2: repeat 3: (w, ξ) solution to QP only with constraints from S 4: for i=1,...,n do 5: ŷ argmax y Y (y n, y) + w, φ(x n, y) 6: if ŷ y n then 7: S S {(x n, ŷ)} 8: end if 9: end for 10: until S doesn t change anymore. output prediction function f(x) = argmax y Y w, φ(x, y). Obs: each update of w needs N argmax-predictions (one per example), but we solve globally for next w, not by local steps. 24 / 1
99 Example: Object Localization X images, Y = { object bounding box } R 4. Training examples: Goal: f : X Y Loss function: area overlap (y, y ) = 1 area(y y ) area(y y ) [Blaschko, Lampert: Learning to Localize Objects with Structured Output Regression, ECCV 2008] 25 / 1
100 Example: Object Localization Structured SVM: φ(x, y) := bag-of-words histogram of region y in image x min w R D,ξ R n λ 2 w N N ξ n subject to, for i = 1,... n, w, φ(x n, y n ) w, φ(x n, y) (y n, y) ξ n, for all y Y. Interpretation: For every image, the correct bounding box, y n, should have a higher score than any wrong bounding box. Less overlap between the boxes bigger difference in score 26 / 1
101 Example: Object Localization Working set training Step 1: w 0. For every example: ŷ argmax y Y (y n, y) + w, φ(x n, y) }{{} =0 maximal -loss minimal overlap with y n ŷ y n = add constraint w, φ(x n, y n ) w, φ(x n, ŷ) 1 ξ n Note: similar to binary SVM training for object detection: positive examples: ground truth bounding boxes negative examples: random boxes from image background 27 / 1
102 Example: Object Localization Working set training Later Steps: For every example: ŷ argmax y Y (y n, y) }{{} + w, φ(x n, y) }{{} bias towards wrong regions object detection score if ŷ = y n : do nothing, else: add constraint w, φ(x n, y n ) w, φ(x n, ŷ) (y n, ŷ) ξ n enforces ŷ to have lower score after re-training. Note: similar to hard negative mining for object detection: perform detection on training image if detected region is far from ground truth, add as negative example Difference: S-SVM handles regions that overlap with ground truth. 28 / 1
103 Kernelized S-SVM We can also kernelize S-SVM optimization: max α R N Y +,...,N y Y subject to, for n = 1,..., N, y Y α ny (y n, y) 1 2 α ny 2 λn. α ny α nȳ K n nyȳ y,ȳ Y n,,...,n N Y many variables: train with working set of α iy. Kernelized prediction function: f(x) = argmax y Y n,y α ny k( (x n, y ), (x, y) ) Not very popular in Computer Vision (quickly becomes inefficient) 29 / 1
104 SSVMs with Latent Variables Latent variables also possible in S-SVMs x X always observed, y Y observed only in training, z Z never observed (latent). Decision function: f(x) = argmax y Y max z Z w, φ(x, y, z) 30 / 1
105 SSVMs with Latent Variables Latent variables also possible in S-SVMs x X always observed, y Y observed only in training, z Z never observed (latent). Decision function: f(x) = argmax y Y max z Z w, φ(x, y, z) Maximum Margin Training with Maximization over Latent Variables Solve: with min w,ξ λ 2 w N N max y Y ln w(y) l n w(y) = (y n, y) + max z Z w, φ(xn, y, z) max z Z Problem: not convex can have local minima w, φ(xn, y n, z) [Yu, Joachims, Learning Structural SVMs with Latent Variables, 2009] similar: [Felzenszwalb et al., A Discriminatively Trained, Multiscale, Deformable Part Model, 2008], but Y = {±1} 30 / 1
106 Summary S-SVM Learning Given: training set {(x 1, y 1 ),..., (x n, y n )} X Y loss function : Y Y R. parameterize f(x) := argmax y w, φ(x, y) Task: find w that minimizes expected loss on future data, E (x,y) (y, f(x)) 31 / 1
107 Summary S-SVM Learning Given: training set {(x 1, y 1 ),..., (x n, y n )} X Y loss function : Y Y R. parameterize f(x) := argmax y w, φ(x, y) Task: find w that minimizes expected loss on future data, E (x,y) (y, f(x)) S-SVM solution derived from regularized risk minimization: enforce correct output to be better than all others by a margin : w, φ(x n, y n ) (y n, y) + w, φ(x n, y) for all y Y. convex optimization problem, but non-differentiable many equivalent formulations different training algorithms training needs many argmax predictions, but no probabilistic inference Latent variable possible, but optimization becomes non-convex. 31 / 1
108 Summary S-SVM Learning Structured Learning is full of Open Research Questions How to train faster? CRFs need many runs of probablistic inference, SSVMs need many runs of argmax-predictions. How to reduce the necessary amount of training data? semi-supervised learning? transfer learning? How can we better understand different loss function? how important is it to optimize the right loss? Can we understand structured learning with approximate inference? often computing L(w) or argmaxy w, φ(x, y) exactly is infeasible. can we guarantee good results even with approximate inference? More and new applications! 32 / 1
109 Ad: Positions at IST Austria, Vienna IST Austria Graduate School enter with MSc or BSc 1(2) + 3 yr PhD program Computer Vision/Machine Learning (me, Vladimir Kolmogorov) Computer Graphics (C. Wojtan) Comp. Topology (H. Edelsbrunner) Game Theory (K. Chatterjee) Software Verification (T. Henzinger) Cryptography (K. Pietrzak) Comp. Neuroscience (G. Tkacik) Random Matrix Theory (L. Erdös) Statistics (C. Uhler), and more... fully funded positions Postdoc Positions in my Group see chl More info: Internships: send me an ! 33 / 1
110 Additional Material 34 / 1
111 Solving S-SVM Training Numerically One-Slack One-Slack Formulation of S-SVM: (equivalent to ordinary S-SVM formulation by ξ = 1 N n ξn ) min w R D,ξ R + λ 2 w 2 + ξ subject to, for all (ŷ 1,..., ŷ N ) Y Y, N [ (y n, ŷ N ) + w, φ(x n, ŷ n ) w, φ(x n, y n ) ] Nξ, 35 / 1
112 Solving S-SVM Training Numerically One-Slack One-Slack Formulation of S-SVM: (equivalent to ordinary S-SVM formulation by ξ = 1 N n ξn ) min w R D,ξ R + λ 2 w 2 + ξ subject to, for all (ŷ 1,..., ŷ N ) Y Y, N [ (y n, ŷ N ) + w, φ(x n, ŷ n ) w, φ(x n, y n ) ] Nξ, Y N linear constraints, convex, differentiable objective. We blew up the constraint set even further: 100 binary images: constraints (instead of ). 35 / 1
113 Solving S-SVM Training Numerically One-Slack Working Set One-Slack S-SVM Training input training pairs {(x 1, y 1 ),..., (x n, y n )} X Y, input feature map φ(x, y), loss function (y, y ), regularizer λ 1: S 2: repeat 3: (w, ξ) solution to QP only with constraints from S 4: for i=1,...,n do 5: ŷ n argmax y Y (y n, y) + w, φ(x n, y) 6: end for 7: S S { ( (x 1,..., x n ), (ŷ 1,..., ŷ n ) ) } 8: until S doesn t change anymore. output prediction function f(x) = argmax y Y w, φ(x, y). Often faster convergence: We add one strong constraint per iteration instead of n weak ones. 36 / 1
Probabilistic Graphical Models & Applications
Probabilistic Graphical Models & Applications Learning of Graphical Models Bjoern Andres and Bernt Schiele Max Planck Institute for Informatics The slides of today s lecture are authored by and shown with
More informationPart 4: Conditional Random Fields
Part 4: Conditional Random Fields Sebastian Nowozin and Christoph H. Lampert Colorado Springs, 25th June 2011 1 / 39 Problem (Probabilistic Learning) Let d(y x) be the (unknown) true conditional distribution.
More informationLearning with Structured Inputs and Outputs
Learning with Structured Inputs and Outputs Christoph Lampert Institute of Science and Technology Austria Vienna (Klosterneuburg), Austria INRIA Summer School 2010 Learning with Structured Inputs and Outputs
More informationIntroduction To Graphical Models
Peter Gehler Introduction to Graphical Models Introduction To Graphical Models Peter V. Gehler Max Planck Institute for Intelligent Systems, Tübingen, Germany ENS/INRIA Summer School, Paris, July 2013
More informationAlgorithms for Predicting Structured Data
1 / 70 Algorithms for Predicting Structured Data Thomas Gärtner / Shankar Vembu Fraunhofer IAIS / UIUC ECML PKDD 2010 Structured Prediction 2 / 70 Predicting multiple outputs with complex internal structure
More informationLecture 9: PGM Learning
13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and
More informationML4NLP Multiclass Classification
ML4NLP Multiclass Classification CS 590NLP Dan Goldwasser Purdue University dgoldwas@purdue.edu Social NLP Last week we discussed the speed-dates paper. Interesting perspective on NLP problems- Can we
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationKernelized Perceptron Support Vector Machines
Kernelized Perceptron Support Vector Machines Emily Fox University of Washington February 13, 2017 What is the perceptron optimizing? 1 The perceptron algorithm [Rosenblatt 58, 62] Classification setting:
More informationAlgorithmic Stability and Generalization Christoph Lampert
Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationLogistic Regression: Online, Lazy, Kernelized, Sequential, etc.
Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010
More informationCS60021: Scalable Data Mining. Large Scale Machine Learning
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance
More informationSupport Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Support Vector Machines CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 A Linearly Separable Problem Consider the binary classification
More informationIntroduction to Logistic Regression and Support Vector Machine
Introduction to Logistic Regression and Support Vector Machine guest lecturer: Ming-Wei Chang CS 446 Fall, 2009 () / 25 Fall, 2009 / 25 Before we start () 2 / 25 Fall, 2009 2 / 25 Before we start Feel
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationSupport vector machines Lecture 4
Support vector machines Lecture 4 David Sontag New York University Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin Q: What does the Perceptron mistake bound tell us? Theorem: The
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More information1 Machine Learning Concepts (16 points)
CSCI 567 Fall 2018 Midterm Exam DO NOT OPEN EXAM UNTIL INSTRUCTED TO DO SO PLEASE TURN OFF ALL CELL PHONES Problem 1 2 3 4 5 6 Total Max 16 10 16 42 24 12 120 Points Please read the following instructions
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationMachine Learning for Structured Prediction
Machine Learning for Structured Prediction Grzegorz Chrupa la National Centre for Language Technology School of Computing Dublin City University NCLT Seminar Grzegorz Chrupa la (DCU) Machine Learning for
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationLearning with Structured Inputs and Outputs
Learning with Structured Inputs and Outputs Christoph H. Lampert IST Austria (Institute of Science and Technology Austria), Vienna Microsoft Machine Learning and Intelligence School 2015 July 29-August
More informationGaussian and Linear Discriminant Analysis; Multiclass Classification
Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015
More information> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel
Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation
More informationStatistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.
http://goo.gl/xilnmn Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationMachine Learning for NLP
Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers
More informationSVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels
SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels Karl Stratos June 21, 2018 1 / 33 Tangent: Some Loose Ends in Logistic Regression Polynomial feature expansion in logistic
More informationMachine Learning Lecture 7
Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant
More informationStatistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.
http://goo.gl/jv7vj9 Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT
More informationLecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data
Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data DD2424 March 23, 2017 Binary classification problem given labelled training data Have labelled training examples? Given
More informationLecture 3: Multiclass Classification
Lecture 3: Multiclass Classification Kai-Wei Chang CS @ University of Virginia kw@kwchang.net Some slides are adapted from Vivek Skirmar and Dan Roth CS6501 Lecture 3 1 Announcement v Please enroll in
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationParameter learning in CRF s
Parameter learning in CRF s June 01, 2009 Structured output learning We ish to learn a discriminant (or compatability) function: F : X Y R (1) here X is the space of inputs and Y is the space of outputs.
More informationMachine Learning. Classification, Discriminative learning. Marc Toussaint University of Stuttgart Summer 2015
Machine Learning Classification, Discriminative learning Structured output, structured input, discriminative function, joint input-output features, Likelihood Maximization, Logistic regression, binary
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationCPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017
CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class
More informationLinear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)
Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training
More informationSequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015
Sequence Modelling with Features: Linear-Chain Conditional Random Fields COMP-599 Oct 6, 2015 Announcement A2 is out. Due Oct 20 at 1pm. 2 Outline Hidden Markov models: shortcomings Generative vs. discriminative
More informationLinear classifiers: Overfitting and regularization
Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationWarm up: risk prediction with logistic regression
Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T
More informationSupport Vector Machine
Andrea Passerini passerini@disi.unitn.it Machine Learning Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)
More informationAlgorithms for NLP. Classification II. Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley
Algorithms for NLP Classification II Taylor Berg-Kirkpatrick CMU Slides: Dan Klein UC Berkeley Minimize Training Error? A loss function declares how costly each mistake is E.g. 0 loss for correct label,
More informationRegression with Numerical Optimization. Logistic
CSG220 Machine Learning Fall 2008 Regression with Numerical Optimization. Logistic regression Regression with Numerical Optimization. Logistic regression based on a document by Andrew Ng October 3, 204
More informationProbabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov
Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov Why PGMs? PGMs can model joint probabilities of many events. many techniques commonly
More informationLogistic Regression. COMP 527 Danushka Bollegala
Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will
More informationFrom Binary to Multiclass Classification. CS 6961: Structured Prediction Spring 2018
From Binary to Multiclass Classification CS 6961: Structured Prediction Spring 2018 1 So far: Binary Classification We have seen linear models Learning algorithms Perceptron SVM Logistic Regression Prediction
More informationMidterm. Introduction to Machine Learning. CS 189 Spring Please do not open the exam before you are instructed to do so.
CS 89 Spring 07 Introduction to Machine Learning Midterm Please do not open the exam before you are instructed to do so. The exam is closed book, closed notes except your one-page cheat sheet. Electronic
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
More informationSupport Vector Machines
Two SVM tutorials linked in class website (please, read both): High-level presentation with applications (Hearst 1998) Detailed tutorial (Burges 1998) Support Vector Machines Machine Learning 10701/15781
More informationAnnouncements - Homework
Announcements - Homework Homework 1 is graded, please collect at end of lecture Homework 2 due today Homework 3 out soon (watch email) Ques 1 midterm review HW1 score distribution 40 HW1 total score 35
More informationLINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning
LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES Supervised Learning Linear vs non linear classifiers In K-NN we saw an example of a non-linear classifier: the decision boundary
More informationProbabilistic Graphical Models
School of Computer Science Probabilistic Graphical Models Max-margin learning of GM Eric Xing Lecture 28, Apr 28, 2014 b r a c e Reading: 1 Classical Predictive Models Input and output space: Predictive
More informationSupport Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2
More informationCSC 411 Lecture 17: Support Vector Machine
CSC 411 Lecture 17: Support Vector Machine Ethan Fetaya, James Lucas and Emad Andrews University of Toronto CSC411 Lec17 1 / 1 Today Max-margin classification SVM Hard SVM Duality Soft SVM CSC411 Lec17
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table
More informationLogistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu
Logistic Regression Review 10-601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data
More informationLecture 9: Large Margin Classifiers. Linear Support Vector Machines
Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Perceptrons Definition Perceptron learning rule Convergence Margin & max margin classifiers (Linear) support vector machines Formulation
More informationIntelligent Systems:
Intelligent Systems: Undirected Graphical models (Factor Graphs) (2 lectures) Carsten Rother 15/01/2015 Intelligent Systems: Probabilistic Inference in DGM and UGM Roadmap for next two lectures Definition
More informationLogistic Regression. William Cohen
Logistic Regression William Cohen 1 Outline Quick review classi5ication, naïve Bayes, perceptrons new result for naïve Bayes Learning as optimization Logistic regression via gradient ascent Over5itting
More informationIntroduction to Machine Learning
Introduction to Machine Learning Logistic Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574
More informationUndirected Graphical Models
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional
More informationNeural Networks and Deep Learning
Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost
More informationSupport Vector Machine (SVM) and Kernel Methods
Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2014 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationCPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released
More informationCS-E4830 Kernel Methods in Machine Learning
CS-E4830 Kernel Methods in Machine Learning Lecture 5: Multi-class and preference learning Juho Rousu 11. October, 2017 Juho Rousu 11. October, 2017 1 / 37 Agenda from now on: This week s theme: going
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationNeural Network Training
Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification
More informationMachine Learning A Geometric Approach
Machine Learning A Geometric Approach CIML book Chap 7.7 Linear Classification: Support Vector Machines (SVM) Professor Liang Huang some slides from Alex Smola (CMU) Linear Separator Ham Spam From Perceptron
More informationReading Group on Deep Learning Session 1
Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular
More informationPAC-learning, VC Dimension and Margin-based Bounds
More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More informationBrief Introduction of Machine Learning Techniques for Content Analysis
1 Brief Introduction of Machine Learning Techniques for Content Analysis Wei-Ta Chu 2008/11/20 Outline 2 Overview Gaussian Mixture Model (GMM) Hidden Markov Model (HMM) Support Vector Machine (SVM) Overview
More informationApprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning
Apprentissage, réseaux de neurones et modèles graphiques (RCP209) Neural Networks and Deep Learning Nicolas Thome Prenom.Nom@cnam.fr http://cedric.cnam.fr/vertigo/cours/ml2/ Département Informatique Conservatoire
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Lecture 11 CRFs, Exponential Family CS/CNS/EE 155 Andreas Krause Announcements Homework 2 due today Project milestones due next Monday (Nov 9) About half the work should
More informationCS395T: Structured Models for NLP Lecture 2: Machine Learning Review
CS395T: Structured Models for NLP Lecture 2: Machine Learning Review Greg Durrett Some slides adapted from Vivek Srikumar, University of Utah Administrivia Course enrollment Lecture slides posted on website
More informationCSCI-567: Machine Learning (Spring 2019)
CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March
More informationSupport Vector Machine (SVM) and Kernel Methods
Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2016 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationPAC-learning, VC Dimension and Margin-based Bounds
More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based
More informationECE521 Lecture7. Logistic Regression
ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard
More informationLecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University
Lecture 18: Kernels Risk and Loss Support Vector Regression Aykut Erdem December 2016 Hacettepe University Administrative We will have a make-up lecture on next Saturday December 24, 2016 Presentations
More informationML (cont.): SUPPORT VECTOR MACHINES
ML (cont.): SUPPORT VECTOR MACHINES CS540 Bryan R Gibson University of Wisconsin-Madison Slides adapted from those used by Prof. Jerry Zhu, CS540-1 1 / 40 Support Vector Machines (SVMs) The No-Math Version
More information(Kernels +) Support Vector Machines
(Kernels +) Support Vector Machines Machine Learning Torsten Möller Reading Chapter 5 of Machine Learning An Algorithmic Perspective by Marsland Chapter 6+7 of Pattern Recognition and Machine Learning
More informationTowards Lifelong Machine Learning Multi-Task and Lifelong Learning with Unlabeled Tasks Christoph Lampert
Towards Lifelong Machine Learning Multi-Task and Lifelong Learning with Unlabeled Tasks Christoph Lampert HSE Computer Science Colloquium September 6, 2016 IST Austria (Institute of Science and Technology
More informationSUPPORT VECTOR MACHINE
SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition
More informationNatural Language Processing. Classification. Features. Some Definitions. Classification. Feature Vectors. Classification I. Dan Klein UC Berkeley
Natural Language Processing Classification Classification I Dan Klein UC Berkeley Classification Automatically make a decision about inputs Example: document category Example: image of digit digit Example:
More informationLinear discriminant functions
Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative
More informationLinear classifiers: Logistic regression
Linear classifiers: Logistic regression STAT/CSE 416: Machine Learning Emily Fox University of Washington April 19, 2018 How confident is your prediction? The sushi & everything else were awesome! The
More informationCS Machine Learning Qualifying Exam
CS Machine Learning Qualifying Exam Georgia Institute of Technology March 30, 2017 The exam is divided into four areas: Core, Statistical Methods and Models, Learning Theory, and Decision Processes. There
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Linear Classifiers. Blaine Nelson, Tobias Scheffer
Universität Potsdam Institut für Informatik Lehrstuhl Linear Classifiers Blaine Nelson, Tobias Scheffer Contents Classification Problem Bayesian Classifier Decision Linear Classifiers, MAP Models Logistic
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationSequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them
HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated
More informationClustering K-means. Clustering images. Machine Learning CSE546 Carlos Guestrin University of Washington. November 4, 2014.
Clustering K-means Machine Learning CSE546 Carlos Guestrin University of Washington November 4, 2014 1 Clustering images Set of Images [Goldberger et al.] 2 1 K-means Randomly initialize k centers µ (0)
More informationSupport Vector Machines and Kernel Methods
Support Vector Machines and Kernel Methods Geoff Gordon ggordon@cs.cmu.edu July 10, 2003 Overview Why do people care about SVMs? Classification problems SVMs often produce good results over a wide range
More information