Introduction to Machine Learning

DATA11002 Introduction to Machine Learning
Lecturer: Teemu Roos
TAs: Ville Hyvönen and Janne Leppä-aho
Department of Computer Science, University of Helsinki
(based in part on material by Patrik Hoyer and Jyrki Kivinen)
November 2nd – December 15th, 2017

Support Vector Machines

Outline
- A refresher on linear models
- Feature transformations
- Linear classifiers (case: Perceptron)
- Maximum margin classifiers
- Surrogate loss functions (max margin vs. logistic regression)
- SVM and the kernel trick

Linear models
A refresher on linear models (see linear regression, Lecture 3):
- We consider features $x = (x_1, \ldots, x_p) \in \mathbb{R}^p$ throughout this chapter.
- A function $f : \mathbb{R}^p \to \mathbb{R}$ is linear if for some $\beta \in \mathbb{R}^p$ it can be written as
  $$f(x) = \beta^\top x = \sum_{j=1}^{p} \beta_j x_j$$
- By including a constant feature $x_1 \equiv 1$, we can express models with an intercept term using the same formula.
- $\beta$ is often called the coefficient or weight vector.

Multivariate linear regression
- We assume the matrix $X \in \mathbb{R}^{n \times p}$ has the $n$ instances $x_i$ as its rows, and $y \in \mathbb{R}^n$ contains the corresponding labels $y_i$.
- In the standard linear regression case, we write $y = X\beta + \epsilon$, where the residual $\epsilon_i = y_i - \beta^\top x_i$ indicates the error of $f(x)$ on data point $(x_i, y_i)$.
- Least squares: find the $\beta$ which minimises the sum of squared residuals
  $$\sum_{i=1}^{n} \epsilon_i^2 = \|\epsilon\|_2^2$$
- Closed-form solution (assuming $n \geq p$): $\hat{\beta} = (X^\top X)^{-1} X^\top y$
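
The closed-form solution is a one-liner in R. A minimal sketch on simulated data (the data and all variable names below are illustrative, not from the lecture):

# Illustrative simulated data: n = 100 instances, first column of X is the constant feature x_1 = 1
set.seed(1)
n <- 100
X <- cbind(1, matrix(rnorm(n * 2), n, 2))
beta_true <- c(0.5, 2, -1)
y <- drop(X %*% beta_true) + rnorm(n, sd = 0.1)

# Closed-form least squares: beta_hat = (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Same fit via lm(); "- 1" drops lm's own intercept, since X already contains the constant column
coef(lm(y ~ X - 1))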

Further topics in linear regression: Feature transformations
- Earlier (Lecture 3), we already discussed non-linear transformations, e.g., a degree-5 polynomial of $x \in \mathbb{R}$:
  $$f(x_i) = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \beta_4 x_i^4 + \beta_5 x_i^5$$
- Likewise, we mentioned the possibility to include interactions via cross-terms:
  $$f(x_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_{12} x_{i1} x_{i2}$$
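
In R, such transformed features can be written directly into the model formula. A sketch on made-up data (the data frame dat and its coefficients are purely illustrative):

# Illustrative data
set.seed(2)
dat <- data.frame(x1 = runif(50), x2 = runif(50))
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x1^2 + dat$x1 * dat$x2 + rnorm(50, sd = 0.1)

# Degree-5 polynomial of x1 (raw = TRUE keeps the plain powers x1, x1^2, ..., x1^5)
fit_poly <- lm(y ~ poly(x1, 5, raw = TRUE), data = dat)

# Interaction via a cross-term: the formula expands to x1 + x2 + x1:x2
fit_int <- lm(y ~ x1 * x2, data = dat)
coef(fit_int)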

Further topics in linear regression: Dummy variables
- What if we have qualitative/categorical (instead of continuous) features, like gender, job title, pixel color, etc.?
- Binary features with two levels can be included as they are: $x_i \in \{0, 1\}$.
- The coefficient can be interpreted as the difference between instances with $x_i = 0$ and $x_i = 1$: e.g., the average increase in salary.
- When there are more than two levels, it doesn't usually make sense to assume linearity
  $$f((x_1, x_2, 1)) - f((x_1, x_2, 0)) = f((x_1, x_2, 2)) - f((x_1, x_2, 1)),$$
  especially when the encoding is arbitrary: red = 0, green = 1, blue = 2.

Further topics in linear regression: Dummy variables (2)
- For more than two levels, introduce dummy (or indicator) variables:
  $$x_{i1} = \begin{cases} 1 & \text{if the } i\text{th person is a student} \\ 0 & \text{otherwise} \end{cases}$$
  $$x_{i2} = \begin{cases} 1 & \text{if the } i\text{th person is an actor/actress} \\ 0 & \text{otherwise} \end{cases}$$
  $$x_{i3} = \begin{cases} 1 & \text{if the } i\text{th person is a data scientist} \\ 0 & \text{otherwise} \end{cases}$$
- One level is usually left without a dummy variable, since otherwise the model is over-parametrized: adding a constant $\alpha$ to the coefficients of all dummy variables of $X_i$ and subtracting $\alpha$ from the intercept has net effect zero.
- Read Sec. 3.3.1 (Qualitative Predictors) of the textbook.
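
For completeness, R builds such dummy variables automatically when a categorical feature is stored as a factor, leaving one reference level without an indicator. A small illustrative sketch (the job titles are just the examples from the slide):

# Illustrative: a factor with k levels is encoded as k - 1 dummy variables plus the intercept
job <- factor(c("student", "actor", "data scientist", "student", "actor"))
model.matrix(~ job)   # one 0/1 indicator per non-reference level ("actor" is the reference here)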

Linear classification via regression
- As we have seen, minimising squared error in linear regression has a nice closed-form solution (if inverting a $p \times p$ matrix is feasible).
- How about using the linear predictor $f(x) = \beta^\top x$ for classification with a binary class label $y \in \{-1, +1\}$ through
  $$\hat{y} = \operatorname{sign}(f(x)) = \begin{cases} +1, & \text{if } \beta^\top x \geq 0 \\ -1, & \text{if } \beta^\top x < 0 \end{cases}$$
- Given a training set $(x_1, y_1), \ldots, (x_n, y_n)$, it is computationally intractable to find the coefficient vector $\beta$ that minimises the 0-1 loss
  $$\sum_{i=1}^{n} I_{[y_i (\beta^\top x_i) < 0]}$$

Linear classification via regression (2)
- One approach is to replace the 0-1 loss $I_{[y_i (\beta^\top x_i) < 0]}$ with a surrogate loss function: something similar but easier to optimise.
- In particular, we could replace $I_{[y_i (\beta^\top x_i) < 0]}$ by the squared error $(y_i - \beta^\top x_i)^2$:
  - learn $\beta$ using least squares regression on the binary classification data set (with $y_i \in \{-1, +1\}$)
  - use $\beta$ in the linear classifier $\hat{c}(x) = \operatorname{sign}(\beta^\top x)$
  - advantage: computationally efficient
  - disadvantage: sensitive to outliers (in particular, "too good" predictions $y_i (\beta^\top x_i) \gg 1$ get heavily punished, which is counterintuitive)
- We'll return to this in a while.
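
A minimal R sketch of this least-squares surrogate (the simulated data and names are illustrative):

# Least squares on +/-1 labels, then classify with the sign of the linear predictor
set.seed(3)
n <- 200
X <- cbind(1, matrix(rnorm(n * 2), n, 2))            # illustrative: constant feature + 2 inputs
y <- ifelse(drop(X %*% c(0, 1, -1)) + rnorm(n, sd = 0.5) > 0, +1, -1)

beta <- solve(t(X) %*% X, t(X) %*% y)                # least squares fit to the +/-1 labels
y_hat <- ifelse(drop(X %*% beta) >= 0, +1, -1)       # c_hat(x) = sign(beta' x)
mean(y_hat != y)                                     # training 0-1 loss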

The Perceptron algorithm (briefly)
NB: The perceptron is only mentioned in passing and is not required for the exam. However, the concepts introduced here (linear separability and margin) will be useful in what follows.
- The perceptron algorithm is a simple iterative method which can be used to train a linear classifier.
- If the training data $(x_i, y_i)_{i=1}^{n}$ are linearly separable, i.e., there is some $\beta \in \mathbb{R}^p$ such that $y_i (\beta^\top x_i) > 0$ for all $i$, the algorithm is guaranteed to find such a $\beta$.
- The algorithm (or its variations) can also be run on non-separable data, but then there is no guarantee about the result.

Perceptron algorithm: Main ideas
- The algorithm keeps track of, and updates, a weight vector $\beta$.
- Each input item is shown once per sweep over the training data. If a full sweep is completed without any misclassifications, we are done and return the $\beta$ that classifies all training data correctly.
- Whenever $\hat{y}_i \neq y_i$, we update $\beta$ by adding $y_i x_i$. This turns $\beta$ towards $x_i$ if $y_i = +1$, and away from $x_i$ if $y_i = -1$.
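
The sketch below implements these ideas in R. It is only an illustration of the update rule just described; it assumes a matrix X whose rows are the $x_i$ and a label vector y with values in {-1, +1} (both hypothetical here):

# Minimal perceptron sketch: sweep over the data, adding y_i * x_i whenever x_i is misclassified;
# terminates only if the data are linearly separable (otherwise stops after max_sweeps)
perceptron <- function(X, y, max_sweeps = 1000) {
  beta <- rep(0, ncol(X))
  for (sweep in 1:max_sweeps) {
    mistakes <- 0
    for (i in seq_len(nrow(X))) {
      if (y[i] * sum(beta * X[i, ]) <= 0) {   # misclassified (or exactly on the boundary)
        beta <- beta + y[i] * X[i, ]          # turn beta towards x_i (y_i = +1) or away from it (y_i = -1)
        mistakes <- mistakes + 1
      }
    }
    if (mistakes == 0) return(beta)           # a full sweep without errors: done
  }
  beta                                        # no separating vector found within max_sweeps
}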

Perceptron algorithm: Illustration
[Figure: a 2D data set with training examples of class +1 and class -1, and the current weight vector $\beta$, shown over a sequence of updates]
- Current state of $\beta$.
- Red point classified correctly: no change to $\beta$.
- Green point classified correctly: no change to $\beta$.
- Green point misclassified: $\beta$ will change as follows...
- Adding $y_i x_i$ to the current weight vector $\beta$ gives the new weight vector.
- Note that the length of $\beta$ is irrelevant for classification.

Margin
- Given a data set $(x_i, y_i)_{i=1}^{n}$ and $M > 0$, we say that a coefficient vector $\beta$ separates the data with margin $M$ if for all $i$ we have
  $$y_i (\beta^\top x_i) \geq M \|\beta\|_2$$
- Explanation:
  - $\beta^\top x_i / \|\beta\|_2$ is the scalar projection of $x_i$ onto the vector $\beta$
  - $y_i (\beta^\top x_i) \geq 0$ means we predict the correct class
  - $|\beta^\top x_i| / \|\beta\|_2$ is the Euclidean distance between the point $x_i$ and the decision boundary, i.e., the hyperplane $\beta^\top x = 0$
[Figure: a separating hyperplane defined by $\beta$ with margin $M$ on both sides]
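
The definition translates into a small helper; a sketch (the function name and arguments are illustrative), returning the largest $M$ with which $\beta$ separates the data (negative if some point is misclassified):

# Margin of a coefficient vector beta on data (X, y): min_i y_i * (beta' x_i) / ||beta||_2
margin <- function(beta, X, y) {
  min(y * drop(X %*% beta)) / sqrt(sum(beta^2))
}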

Max margin classifier and SVM: Terminology
- Maximal margin classifier (Sec. 9.1.3): find the $\beta$ that classifies all instances correctly and maximizes the margin $M$.
- Its generalizations:
  - Support vector classifier (Sec. 9.2): maximize a soft margin $M$, allowing some points to violate the margin (and even be misclassified), controlled by a tuning parameter $C$:
    $$y_i (\beta^\top x_i) \geq M(1 - \epsilon_i), \qquad \epsilon_i \geq 0, \quad \sum_{i=1}^{n} \epsilon_i \leq C, \qquad \text{subject to } \|\beta\|_2 = 1$$
  - Support vector machine (SVM; Sec. 9.3): non-linear version of the support vector classifier.

[Fig. 9.7 in (James et al., 2013): a support vector classifier fit using four different values of the tuning parameter C]

Observations on max margin classifiers
- Consider the linearly separable case, $\epsilon_i \equiv 0$.
- The maximal margin touches a set of training data points $x_i$, which are called support vectors.

Observations on max margin classifiers (2)
- Given a set of support vectors, the coefficients defining the hyperplane can be written as
  $$\hat{\beta} = \sum_{i=1}^{n} c_i y_i x_i,$$
  with some $c_i \geq 0$, where $c_i > 0$ only if the $i$th data point touches the margin.
- In other words, the classifier is defined by a few data points.
- A similar property holds for the soft margin: the more the $i$th point violates the margin, the larger $c_i$, and for points that do not violate the margin, $c_i = 0$.

Observations on max margin classifiers (3)
- The optimization problem for both the hard and the soft margin can be solved efficiently using the Lagrange method.
- The details are beyond our scope (but interesting!).
- A key property is that the solution depends on the data only through the inner products $\langle x_i, x_j \rangle = x_i^\top x_j$ (and the values $y_i$).
- This follows from the expression of the coefficient vector $\hat{\beta}$ as a linear combination of the support vectors: given a new (test) data point $x$, we can classify it based on the sign of
  $$\hat{f}(x) = \hat{\beta}^\top x = \left( \sum_{i=1}^{n} c_i y_i x_i \right)^{\top} x = \sum_{i=1}^{n} c_i y_i \langle x_i, x \rangle$$
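
As a sketch, this prediction function looks as follows in R; the coefficients c are assumed to come from a solver (not shown here), and all names are illustrative:

# f_hat(x) = sum_i c_i * y_i * <x_i, x>; only support vectors (c_i > 0) contribute
f_hat <- function(c, y, X, x_new) {
  sum(c * y * drop(X %*% x_new))   # X %*% x_new computes all inner products <x_i, x_new> at once
}
# Classify a new point with sign(f_hat(c, y, X, x_new))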

Relation to other linear classifiers
- The soft-margin minimization problem of the support vector classifier can be rewritten as an unconstrained problem:
  $$\min_{\beta} \left\{ \sum_{i=1}^{n} \max[0, 1 - y_i(\beta^\top x_i)] + \lambda \|\beta\|_2^2 \right\}$$
- Compare this to penalized logistic regression:
  $$\min_{\beta} \left\{ \sum_{i=1}^{n} \ln\left(1 + \exp(-y_i(\beta^\top x_i))\right) + \lambda \|\beta\|_2^2 \right\}$$
  or ridge regression:
  $$\min_{\beta} \left\{ \sum_{i=1}^{n} (y_i - \beta^\top x_i)^2 + \lambda \|\beta\|_2^2 \right\}$$
- These are all examples of common surrogate loss functions.
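
To make the comparison concrete, the three surrogate losses can be written as functions of the margin value $m = y_i(\beta^\top x_i)$; for $y_i \in \{-1, +1\}$ the squared error $(y_i - \beta^\top x_i)^2$ equals $(1 - m)^2$. A small illustrative sketch for plotting them side by side:

# Surrogate losses as functions of the margin value m = y_i * (beta' x_i)  (illustrative sketch)
hinge    <- function(m) pmax(0, 1 - m)     # support vector classifier
logistic <- function(m) log(1 + exp(-m))   # penalized logistic regression
squared  <- function(m) (1 - m)^2          # least squares on +/-1 labels

curve(hinge(x), from = -3, to = 3, xlab = "m", ylab = "loss")
curve(logistic(x), add = TRUE, lty = 2)
curve(squared(x), add = TRUE, lty = 3)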

Relation to other linear classifiers (2)
- Compare the hinge loss ("SVM loss") $\max[0, 1 - y_i(\beta^\top x)]$ (black) and the logistic loss $\ln(1 + \exp(-y_i(\beta^\top x)))$ (green).
[Fig. 9.12 in (James et al., 2013)]

Kernel trick
- Since the data only appear through the inner products $\langle x_i, x_j \rangle$, we can use the following kernel trick.
- Imagine that we want to introduce non-linearity by mapping the original data into a higher-dimensional representation:
  - remember the polynomial example: $x_i \mapsto (1, x_i, x_i^2, x_i^3, \ldots)$
  - interaction terms are another example: $(x_i, x_j) \mapsto (x_i, x_j, x_i x_j)$
- Denote this mapping by $\Phi : \mathbb{R}^p \to \mathbb{R}^q$, $q > p$.
- Define the kernel function as $K(x_i, x) = \langle \Phi(x_i), \Phi(x) \rangle$.
- The trick is to evaluate $K(x_i, x)$ without actually computing the mappings $\Phi(x_i)$ and $\Phi(x)$.
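
A tiny numerical check of this idea, using (as an illustration of my own choosing, not from the slides) the explicit degree-2 map $\Phi(u) = (u_1^2, u_2^2, \sqrt{2}\, u_1 u_2)$ on $\mathbb{R}^2$, whose inner products can be computed directly as $K(u, v) = \langle u, v \rangle^2$:

# Kernel trick check (illustrative): inner product in the mapped space vs. the kernel on the original points
Phi <- function(u) c(u[1]^2, u[2]^2, sqrt(2) * u[1] * u[2])
u <- c(1, 2); v <- c(3, -1)

sum(Phi(u) * Phi(v))   # inner product after explicitly computing Phi(u) and Phi(v)
sum(u * v)^2           # the same value, without ever forming the mapped vectors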

Kernels
- Popular kernels:
  - linear kernel: $K(x_i, x) = \langle x_i, x \rangle$
  - polynomial kernel: $K(x_i, x) = (\langle x_i, x \rangle + 1)^d$
  - (Gaussian) radial basis function: $K(x_i, x) = \exp(-\gamma \|x_i - x\|_2^2)$
- For example, the radial basis function (RBF) kernel corresponds to a feature mapping of infinite dimension!
- The same kernel trick can be applied to any learning algorithm that can be expressed in terms of inner products between the data points $x$:
  - perceptron
  - linear (ridge) regression
  - Gaussian process regression
  - principal component analysis (PCA)
  - ...
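
For reference, a sketch of the three listed kernels as R functions (gamma and d denote the usual hyperparameters; default values here are arbitrary):

# The kernels above, written as functions of two feature vectors u and v (illustrative sketch)
k_linear <- function(u, v)            sum(u * v)
k_poly   <- function(u, v, d = 3)     (sum(u * v) + 1)^d
k_rbf    <- function(u, v, gamma = 1) exp(-gamma * sum((u - v)^2))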

SVM: Example
From (James et al., 2013). Three SVM results on the same data, from left to right: linear kernel, polynomial kernel ($d = 3$), RBF.

library(e1071)
fit = svm(y ~ ., data = d, kernel = "radial", gamma = 1, cost = 1)
plot(fit, d)

SVMs: Properties
- The use of the hinge loss (soft margin) as a surrogate for the 0-1 loss leads to the support vector classifier.
- With a suitable choice of kernel, the SVM can be applied in various different situations: string kernels for text, structured outputs, ...
- The computation of the pairwise kernel values $K(x_i, x_j)$ may become intractable for large samples, but fast techniques are available.
- SVM is one of the overall best out-of-the-box classifiers.
- Since the kernel trick allows complex, non-linear decision boundaries, regularization is absolutely crucial: the tuning parameter $C$ is typically chosen by cross-validation.
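
For the last point, a hedged sketch of selecting the svm() tuning parameters (cost and gamma) by cross-validation with e1071's tune() helper; the data frame d with a factor label y is assumed to exist as in the earlier example, and the parameter grid is arbitrary:

# Sketch: cross-validation (10-fold by default) over a grid of tuning parameters with e1071::tune()
library(e1071)
cv <- tune(svm, y ~ ., data = d, kernel = "radial",
           ranges = list(cost = c(0.1, 1, 10, 100), gamma = c(0.5, 1, 2)))
summary(cv)                # cross-validation error for each parameter combination
best_fit <- cv$best.model  # model refit with the best parameters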