Machine Learning Lecture 6

Size: px

Start display at page:

Download "Machine Learning Lecture 6"

Paulina Thomas
6 years ago
Views:

Talk Announcement Machine Learning Lecture 6 Nonlinear SVMs 06.05.009 Bastian Leibe RWTH Aachen http://www.umic.rwth-aachen.de/multimedia leibe@umic.rwth-aachen.de Many slides adapted from B.

, 14:00h, 6317 3D Gesichtsanimation mit einem Morphable Model Das 3D Morphable Model einer Objektklasse ist ein Vektorraum von Formen und Texturen, der von einer Menge von Beispieldaten

Dabei ist die Punkt-zu-Punkt Korrespondenz der Beispieldaten von entscheidender Bedeutung, also die Tatsache, dass ein Vertex des Modells in allen Beispielen den gleichen Punkt beschreibt.

Durch automatische Anpassung des Modells an Gesichter in Bildern können ausserdem Fotos und Gemälde animiert werden.

Announcements Course Outline Exercise available on LP Risk VC dimension Linear classifiers Fisher s Linear Discriminant SVMs Submit your results until next Tuesday evening.

Machines Boosting, Decision Trees Generative Models (5 weeks) Bayesian Networks Markov Random Fields Regression Problems ( weeks) Gaussian Processes 4 Recap: Generalization and Overfitting

1 Talk Announcement Machine Learning Lecture 6 Nonlinear SVMs Bastian Leibe RWTH Aachen leibe@umic.rwth-aachen.de Many slides adapted from B. Schiele Volker Blanz (University of Siegen) , 14:00h, D Gesichtsanimation mit einem Morphable Model Das 3D Morphable Model einer Objektklasse ist ein Vektorraum von Formen und Texturen, der von einer Menge von Beispieldaten aufgespannt wird. Dabei ist die Punkt-zu-Punkt Korrespondenz der Beispieldaten von entscheidender Bedeutung, also die Tatsache, dass ein Vertex des Modells in allen Beispielen den gleichen Punkt beschreibt. Angewandt auf 3D Scans von Gesichtsausdrücken kann damit eine automatische, lernbasierte Gesichtsanimation realisiert werden. Durch automatische Anpassung des Modells an Gesichter in Bildern können ausserdem Fotos und Gemälde animiert werden. Der Vortrag präsentiert die Grundlagen von Morphable Models, die Animation von Mund- und Augenbewegungen sowie die Simulation des Wachstums von Kindergesichtern. Announcements Course Outline Exercise available on LP Risk VC dimension Linear classifiers Fisher s Linear Discriminant SVMs Submit your results until next Tuesday evening. 3 Fundamentals ( weeks) Bayes Decision Theory Probability Density Estimation Discriminative Approaches (5 weeks) Linear Discriminant Functions Statistical Learning Theory Support Vector Machines Boosting, Decision Trees Generative Models (5 weeks) Bayesian Networks Markov Random Fields Regression Problems ( weeks) Gaussian Processes 4 Recap: Generalization and Overfitting Recap: Risk Goal: predict class labels of new observations Train classification model on limited training set. The further we optimize the model parameters, the more the training error will decrease. However, at some point the test error will go up again. Overfitting to the training set! test error training error 5 Image source: B. Schiele Empirical risk Measured on the training/validation set R emp (α)= 1 N L(y i, f(x i ; α)) i=1 Actual risk (= Expected risk) Expectation of the error on all data. R(α)= L(y i, f(x; α))dp X,Y (x, y) P X,Y (x, y) is the probability distribution of (x,y). It is fixed, but typically unknown. In general, we can t compute the actual risk directly! Slide adapted from Bernt Schiele 6 1

2 Recap: Statistical Learning Theory Recap: VC Dimension Idea Compute an upper bound on the actual risk based on the empirical risk where R(α) R emp (α)+ǫ(n, p, h) N: number of training examples p * : probability that the bound is correct h: capacity of the learning machine ( VC-dimension ) Slide adapted from Bernt Schiele 7 Vapnik-Chervonenkis dimension Measure for the capacity of a learning machine. Formal definition: If a given set of l points can be labeled in all possible l ways, and for each labeling, a member of the set {f(α)} can be found which correctly assigns those labels, we say that the set of points is shattered by the set of functions. The VC dimension for the set of functions {f(α)} is defined as the maximum number of training points that can be shattered by {f(α)}. 8 Recap: Upper Bound on the Risk Recap: Structural Risk Minimization Important result (Vapnik 1979, 1995) With probability (1-η), the following bound holds R(α) R emp (α)+ This bound is independent of P X,Y (x, y)! If we know h (the VC dimension), we can easily compute the risk bound R(α) R emp (α)+ǫ(n, p, h) Slide adapted from Bernt Schiele h(log(n/h)+1) log(η/4) N VC confidence 9 How can we implement Structural Risk Minimization? R(α) R emp (α)+ǫ(n, p, h) Classic approach Keep ǫ(n, p, h) constant and minimize R emp (α). ǫ(n, p, h) can be kept constant by controlling the model parameters. Support Vector Machines (SVMs) Keep R emp (α) constant and minimize ǫ(n, p, h). In fact: R emp (α)=0 for separable data. Control ǫ(n, p, h) by adapting the VC dimension (controlling the capacity of the classifier). Slide credit: Bernt Schiele 10 Topics of This Lecture Recap: Support Vector Machine (SVM) Linear Support Vector Machines (Recap) Lagrangian (primal) formulation Dual formulation Discussion Linearly non-separable case Soft-margin classification Updated formulation Nonlinear Support Vector Machines Nonlinear basis functions The Kernel trick Mercer s condition Popular kernels Applications 11 Basic idea The SVM tries to find a classifier which maximizes the margin between pos. and neg. data points. Up to now: consider linear classifiers w T x+b=0 Formulation as a convex optimization problem Find the hyperplane satisfying 1 arg min w,b w under the constraints t n (w T x n + b) 1 n based on training data points x n and target values t n { 1,1}. Margin 1

3 Recap: SVM Primal Formulation Recap: SVM Solution Lagrangian primal form L p = 1 { w a n tn (w T x n + b) 1 } = 1 w a n {t n y(x n ) 1} The solution of L p needs to fulfill the KKT conditions Necessary and sufficient conditions KKT: a n 0 λ 0 t n y(x n ) 1 0 f(x) 0 a n {t n y(x n ) 1} = 0 λf(x) = 0 13 Solution for the hyperplane Computed as a linear combination of the training examples w= a n t n x n Sparse solution: a n 0 only for some points, the support vectors Only the SVs actually influence the decision boundary! Compute b by averaging over all support vectors: b= 1 N S n S ( t n m S a m t m x T m x n ) 14 Recap: SVM Support Vectors Recap: SVM Dual Formulation The training points for which a n > 0 are called support vectors. Graphical interpretation: The support vectors are the points on the margin. They define the margin and thus the hyperplane. All other data points can be discarded! 15 Slide adapted from Bernt Schiele Image source: C. Burges, 1998 Maximize L d (a)= a n 1 a n a m t n t m (x T m x n) Comparison L d is equivalent to the primal form L p, but only depends on a n. L p scales with O(D 3 ). m=1 under the conditions a n 0 n a n t n = 0 L d scales with O(N 3 ) in practice between O(N) and O(N ). Slide adapted from Bernt Schiele 16 Topics of This Lecture So Far Linear Support Vector Machines (Recap) Lagrangian (primal) formulation Dual formulation Discussion Linearly non-separable case Soft-margin classification Updated formulation Nonlinear Support Vector Machines Nonlinear basis functions The Kernel trick Mercer s condition Popular kernels Applications 17 Only looked at linearly separable case Current problem formulation has no solution if the data are not linearly separable! Need to introduce some tolerance to outlier data points. Margin? 18 3

4 SVM Non-Separable Data SVM Soft-Margin Classification Non-separable data I.e. the following inequalities cannot be satisfied for all data points Instead use w T x n + b +1 for t n =+1 w T x n + b 1 for t n = 1 w T x n + b +1 ξ n for t n =+1 w T x n + b 1+ξ n for t n = 1 with slack variables ξ n 0 n 19 Slack variables One slack variable ξ n 0 for each training data point. Interpretation ξ n = 0 for points that are on the correct side of the margin. ξ n = t n y(x n ) for all other points. We do not have to set the slack variables ourselves! They are jointly optimized together with w. ξ 4 ξ 3 ξ ξ 1 w Point on decision boundary: ξ n = 1 Misclassified point: ξ n > 1 How that? 0 SVM Non-Separable Data SVM New Primal Formulation Separable data Minimize 1 w Non-separable data1 Minimize w +C Tradeoff parameter! ξ n 1 New SVM Primal: Optimize L p = 1 w +C ξ n a n (t n y(x n ) 1+ξ n ) µ n ξ n KKT conditions a n 0 t n y(x n ) 1+ξ n 0 a n (t n y(x n ) 1+ξ n ) = 0 Constraint t n y(x n ) 1 ξ n µ n 0 ξ n 0 µ n ξ n = 0 Constraint ξ n 0 KKT: λ 0 f(x) 0 λf(x) = 0 SVM New Dual Formulation SVM New Solution New SVM Dual: Maximize L d (a)= a n 1 a n a m t n t m (x T m x n) m=1 under the conditions 0 a n C a n t n = 0 This is again a quadratic programming problem Solve as before (more on that later) Slide adapted from Bernt Schiele This is all that changed! 3 Solution for the hyperplane Computed as a linear combination of the training examples w= a n t n x n Again sparse solution: a n = 0 for points outside the margin. The slack points with ξ n > 0 are now also support vectors! Compute b by averaging over all N M points with 0 < a n < C: ( b= 1 t n ) a m t m x T m N x n M n M m M 4 4

5 Interpretation of Support Vectors Topics of This Lecture Those are the hard examples! We can visualize them, e.g. for face detection 5 Image source: E. Osuna, F. Girosi, 1997 Linear Support Vector Machines (Recap) Lagrangian (primal) formulation Dual formulation Discussion Linearly non-separable case Soft-margin classification Updated formulation Nonlinear Support Vector Machines Nonlinear basis functions The Kernel trick Mercer s condition Popular kernels Applications 6 So Far Nonlinear SVM Only looked at linearly separable case Current problem formulation has no solution if the data are not linearly separable! Need to introduce some tolerance to outlier data points. Slack variables. Only looked at linear decision boundaries This is not sufficient for many applications. Want to generalize the ideas to non-linear boundaries. ξ 4 ξ 3 ξ ξ 1 w 7 Image source: B. Schoelkopf, A. Smola, 00 Linear SVMs Datasets that are linearly separable with some noise work well: But what are we going to do if the dataset is just too hard? How about mapping data to a higher-dimensional space: Slide credit: Raymond Mooney 0 x 0 x 0 x x 8 Another Example Non-separable by a hyperplane in D Another Example Separable by a surface in 3D Slide credit: Bill Freeman 9 Slide credit: Bill Freeman 30 5

6 Nonlinear SVM Feature Spaces Nonlinear SVM General idea: The original input space can be mapped to some higher-dimensional feature space where the training set is separable: Slide credit: Raymond Mooney Φ: x φ(x) 31 General idea Nonlinear transformation φ of the data points x n : x R D φ:r D H Hyperplane in higher-dim. space H (linear classifier in H) w T φ(x)+b=0 Nonlinear classifier in R D. Slide credit: Bernt Schiele 3 What Could This Look Like? Problem with High-dim. Basis Functions Example: Mapping to polynomial space, x, y R : x 1 φ(x)= x1 x x Motivation: Easier to separate data in higher-dimensional space. But wait isn t there a big problem? How should we evaluate the decision function? 33 Image source: C. Burges, 1998 Problem In order to apply the SVM, we need to evaluate the function Using the hyperplane, which is itself defined as What happens if we try this for a million-dimensional feature space φ(x)? Oh-oh y(x)=w T φ(x)+b w= a n t n φ(x n ) 34 Solution: The Kernel Trick Back to Our Previous Example Important observation φ(x) only appears in the form of dot products φ(x) T φ(y): y(x) = w T φ(x)+b = a n t n φ(x n ) T φ(x)+b Define a so-called kernel function k(x,y) = φ(x) T φ(y). Now, in place of the dot product, use the kernel instead: y(x) = a n t n k(x n,x)+b The kernel function implicitly maps the data to the higherdimensional space (without having to compute φ(x) explicitly)! 35 nd degree polynomial kernel: x φ(x) T 1 y1 φ(y)= x1 x y1 y x y = x 1 y 1 +x 1x y 1 y + x y =(x T y) =: k(x,y) Whenever we evaluate the kernel function k(x,y) = (x T y), we implicitly compute the dot product in the higher-dimensional feature space. 36 Image source: C. Burges,

7 SVMs with Kernels Which Functions are Valid Kernels? Using kernels Applying the kernel trick is easy. Just replace every dot product by a kernel function x T y k(x,y) and we re done. Instead of the raw input space, we re now working in a higherdimensional (potentially infinite dimensional!) space, where the data is more easily separable. Wait does this always work? The kernel needs to define an implicit mapping to a higher-dimensional feature space φ(x). When is this the case? Sounds like magic 37 Mercer s theorem (modernized version): Every positive definite symmetric function is a kernel. Positive definite symmetric functions correspond to a positive definite symmetric Gram matrix: K = Slide credit: Raymond Mooney k(x 1,x 1 ) k(x 1,x ) k(x 1,x 3 ) k(x 1,x n ) k(x,x 1 ) k(x,x ) k(x,x 3 ) k(x,x n ) k(x n,x 1 ) k(x n,x ) k(x n,x 3 ) k(x n,x n ) 38 Kernels Fulfilling Mercer s Condition Nonlinear SVM Dual Formulation Polynomial kernel k(x,y)=(x T y+1) p Radial Basis Function kernel } k(x,y)=exp { (x y) σ Hyperbolic tangent kernel k(x,y)=tanh(κx T y+ δ) (and many, many more ) Slide credit: Bernt Schiele e.g. Gaussian e.g. Sigmoid 39 SVM Dual: Maximize L d (a)= a n 1 a n a m t n t m k(x m,x n ) m=1 under the conditions 0 a n C a n t n = 0 Classify new data points using y(x) = a n t n k(x n,x)+b 40 VC Dimension for Polynomial Kernel VC Dimension for Gaussian RBF Kernel Polynomial kernel of degree p: Dimensionality of H: Example: k(x,y)=(x T y) p ( ) D+ p 1 p D = 16 16=56 p = 4 dim(h) = The hyperplane in H then has VC-dimension dim(h)+1 41 Radial Basis Function: } k(x,y)=exp { (x y) σ In this case, H is infinite dimensional! exp(x)=1+ x 1! +x! +...+xn n! +... Since only the kernel function is used by the SVM, this is no problem. The hyperplane in H then has VC-dimension dim(h)+1= 4 7

VC Dimension for Gaussian RBF Kernel Example: RBF Kernels Intuitively If we make the radius of the RBF kernel sufficiently small, then each data point can be associated with its own kernel.

Smola, 00 But but but Theoretical Justification for Maximum Margins Don t we risk overfitting with those enormously highdimensional feature spaces?

8 VC Dimension for Gaussian RBF Kernel Example: RBF Kernels Intuitively If we make the radius of the RBF kernel sufficiently small, then each data point can be associated with its own kernel. However, this also means that we can get finite VC-dimension if we set a lower limit to the RBF radius. 43 Decision boundary on toy problem RBF Kernel width (σ) 44 Image source: B. Schoelkopf, A. Smola, 00 But but but Theoretical Justification for Maximum Margins Don t we risk overfitting with those enormously highdimensional feature spaces? No matter what the basis functions are, there are really only up to N parameters: a 1, a,, a N and most of them are usually set to zero by the maximum margin criterion. The data effectively lives in a low-dimensional subspace of H. What about the VC dimension? I thought low VC-dim was good (in the sense of the risk bound)? Yes, but the maximum margin classifier magically solves this. Reason (Vapnik): by maximizing the margin, we can reduce the VC-dimension. Empirically, SVMs have very good generalization performance. 45 Vapnik has proved the following: The class of optimal linear separators has VC dimension h bounded from above as D h min, 1 m0 + ρ where ρ is the margin, D is the diameter of the smallest sphere that can enclose all of the training examples, and m 0 is the dimensionality. Intuitively, this implies that regardless of dimensionality m 0 we can minimize the VC dimension by maximizing the margin ρ. Thus, complexity of the classifier is kept small regardless of dimensionality. Slide credit: Raymond Mooney 46 SVM Demo Summary: SVMs Applet from libsvm ( 47 Properties Empirically, SVMs work very, very well. SVMs are currently among the best performers for a number of classification tasks ranging from text to genomic data. SVMs can be applied to complex data types beyond feature vectors (e.g. graphs, sequences, relational data) by designing kernel functions for such data. SVM techniques have been applied to a variety of other tasks e.g. SV Regression, One-class SVMs, The kernel trick has been used for a wide variety of applications. It can be applied wherever dot products are in use e.g. Kernel PCA, kernel FLD, Good overview, software, and tutorials available on

9 Summary: SVMs Topics of This Lecture Limitations How to select the right kernel? Still something of a black art How to select the kernel parameters? (Massive) cross-validation. Usually, several parameters are optimized together in a grid search. Solving the quadratic programming problem Standard QP solvers do not perform too well on SVM task. Dedicated methods have been developed for this, e.g. SMO. Speed of evaluation Evaluating y(x) scales linearly in the number of SVs. Too expensive if we have a large number of support vectors. There are techniques to reduce the effective SV set. Training for very large datasets (millions of data points) Still problematic 49 Linear Support Vector Machines (Recap) Lagrangian (primal) formulation Dual formulation Discussion Linearly non-separable case Soft-margin classification Updated formulation Nonlinear Support Vector Machines Nonlinear basis functions The Kernel trick Mercer s condition Popular kernels Applications 50 Example Application: Text Classification Example Application: Text Classification Problem: Classify a document in a number of categories Results: Representation: Bag-of-words approach Histogram of word counts (on learned dictionary) Very high-dimensional feature space (~ dimensions) Few irrelevant features This was one of the first applications of SVMs T. Joachims (1997)? 51 5 Example Application: Text Classification Example Application: OCR This is also how you could implement a simple spam filter Incoming Dictionary Word activations SVM Mailbox Trash 53 Handwritten digit recognition US Postal Service Database Standard benchmark task for many learning algorithms 54 9

Example Application: OCR Results Almost no overfitting with higher-degree kernels. Example Application: Pedestrian Detection Sliding-window approach 55 E.g. histogram representation (HOG) Map each grid cell in the input window to a histogram of gradient orientations.

Classifier [Dalal & Triggs, CVPR 005] Example Application: Pedestrian Detection Many Other Applications Lots of other applications in all fields of technology OCR N. Dalal, B.

prediction Design on decision feedback equalizers (DFE) in telephony (Detailed references in Schoelkopf & Smola, 00, pp.

org/) Command-line based interface Source code available (in C) Interfaces to Python, MATLAB, Perl, Java, DLL, libsvm (http://www.csie.ntu.edu.

NET, Both include fast training and evaluation algorithms, support for multi-class SVMs, automated training and cross-validation, Easy to apply to your own problems!

10 Example Application: OCR Results Almost no overfitting with higher-degree kernels. Example Application: Pedestrian Detection Sliding-window approach 55 E.g. histogram representation (HOG) Map each grid cell in the input window to a histogram of gradient orientations. Train a linear SVM using training set of pedestrian vs. non-pedestrian windows. Obj./non-obj. Classifier [Dalal & Triggs, CVPR 005] Example Application: Pedestrian Detection Many Other Applications Lots of other applications in all fields of technology OCR N. Dalal, B. Triggs, Histograms of Oriented Gradients for Human Detection, CVPR Text classification Computer vision High-energy physics Monitoring of household appliances Protein secondary structure prediction Design on decision feedback equalizers (DFE) in telephony (Detailed references in Schoelkopf & Smola, 00, pp. 1) 58 You Can Try It At Home References and Further Reading Lots of SVM software available, e.g. svmlight ( Command-line based interface Source code available (in C) Interfaces to Python, MATLAB, Perl, Java, DLL, libsvm ( Library for inclusion with own code C++ and Java sources Interfaces to Python, R, MATLAB, Perl, Ruby, Weka, C+.NET, Both include fast training and evaluation algorithms, support for multi-class SVMs, automated training and cross-validation, Easy to apply to your own problems! 59 More information on SVMs can be found in Chapter 7.1 of Bishop s book. You can also look at Schölkopf & Smola (some chapters available online). Christopher M. Bishop Pattern Recognition and Machine Learning Springer, 006 B. Schölkopf, A. Smola Learning with Kernels MIT Press, 00 A more in-depth introduction to SVMs is available in the following tutorial: C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, Vol. (), pp

Machine Learning Lecture 7

Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant