Lecture 7: Kernels for Classification and Regression

Lecture 7: Kernels for Classification and Regression
CS 194-10, Fall 2011
Laurent El Ghaoui, EECS Department, UC Berkeley
September 15, 2011

Outline

A linear regression problem
Linear auto-regressive model for time-series: $y_t$ is a linear function of $y_{t-1}, y_{t-2}$:
$$y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2}, \quad t = 1, \ldots, T.$$
This writes $y_t = w^T x_t$, with $x_t$ the feature vectors
$$x_t := (1, y_{t-1}, y_{t-2}), \quad t = 1, \ldots, T.$$
Model fitting via least-squares:
$$\min_w \ \|X^T w - y\|_2^2.$$
Prediction rule: $\hat{y}_{T+1} = w_1 + w_2 y_T + w_3 y_{T-1} = w^T x_{T+1}$.
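A minimal numpy sketch of this auto-regressive least-squares fit (not from the lecture): the series y is synthetic, and the rows-as-samples convention is used, so the design matrix below plays the role of $X^T$ in the slide's notation.

```python
import numpy as np

# Synthetic time series (illustrative only).
rng = np.random.default_rng(0)
y = np.cumsum(rng.standard_normal(100))

# Feature vectors x_t = (1, y_{t-1}, y_{t-2}), stacked as rows.
X = np.column_stack([np.ones(len(y) - 2), y[1:-1], y[:-2]])
targets = y[2:]

# Model fitting via least-squares.
w, *_ = np.linalg.lstsq(X, targets, rcond=None)

# Prediction rule: y_hat_{T+1} = w^T x_{T+1}.
x_next = np.array([1.0, y[-1], y[-2]])
y_next = w @ x_next
```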

Nonlinear regression
Nonlinear auto-regressive model for time-series: $y_t$ is a quadratic function of $y_{t-1}, y_{t-2}$:
$$y_t = w_1 + w_2 y_{t-1} + w_3 y_{t-2} + w_4 y_{t-1}^2 + w_5 y_{t-1} y_{t-2} + w_6 y_{t-2}^2.$$
This writes $y_t = w^T \phi(x_t)$, with $\phi(x_t)$ the augmented feature vectors
$$\phi(x_t) := \left(1,\ y_{t-1},\ y_{t-2},\ y_{t-1}^2,\ y_{t-1} y_{t-2},\ y_{t-2}^2\right).$$
Everything is the same as before, with $x$ replaced by $\phi(x)$.
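A small extension of the previous sketch to the quadratic model; it assumes the variables X, targets, and y from the linear sketch are still in scope (an assumption for illustration).

```python
import numpy as np

def phi(x):
    """Quadratic map phi(x_t) = (1, y_{t-1}, y_{t-2}, y_{t-1}^2, y_{t-1} y_{t-2}, y_{t-2}^2)."""
    _, a, b = x  # x = (1, y_{t-1}, y_{t-2})
    return np.array([1.0, a, b, a * a, a * b, b * b])

# Same least-squares fit as before, with x replaced by phi(x).
Phi = np.array([phi(x) for x in X])
w_q, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
y_next_q = w_q @ phi(np.array([1.0, y[-1], y[-2]]))
```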

Nonlinear classification
Non-linear (e.g., quadratic) decision boundary:
$$w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_1 x_2 + w_5 x_2^2 + b = 0.$$
This writes $w^T \phi(x) + b = 0$, with $\phi(x) := (x_1, x_2, x_1^2, x_1 x_2, x_2^2)$.

Challenges
In principle, it seems we can always augment the dimension of the feature space enough to make the data linearly separable. (See the video at http://www.youtube.com/watch?v=3licbrzprza) How do we do it in a computationally efficient manner?

Outline

Linear least-squares
$$\min_w \ \|X^T w - y\|_2^2 + \lambda \|w\|_2^2$$
where
$X = [x_1, \ldots, x_m]$ is the $n \times m$ matrix of data points,
$y \in \mathbb{R}^m$ is the response vector, $w$ contains the regression coefficients,
$\lambda \geq 0$ is a regularization parameter.
Prediction rule: $y = w^T x$, where $x \in \mathbb{R}^n$ is a new data point.

Support vector machine (SVM)
$$\min_{w, b} \ \sum_{i=1}^m \left(1 - y_i (w^T x_i + b)\right)_+ + \lambda \|w\|_2^2$$
where
$X = [x_1, \ldots, x_m]$ is the $n \times m$ matrix of data points in $\mathbb{R}^n$,
$y \in \{-1, 1\}^m$ is the label vector,
$w, b$ contain the classifier coefficients,
$\lambda \geq 0$ is a regularization parameter.
In the sequel, we'll ignore the bias term $b$ (for simplicity only).
Classification rule: $y = \operatorname{sign}(w^T x + b)$, where $x \in \mathbb{R}^n$ is a new data point.
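A minimal sketch (not the course's solver) of minimizing the hinge loss plus ridge penalty by subgradient descent; the hyperparameters lam, lr, and epochs are illustrative, and the bias term is dropped as in the slides.

```python
import numpy as np

def svm_train(X, y, lam=0.1, lr=1e-3, epochs=200):
    """Minimize sum_i (1 - y_i w^T x_i)_+ + lam * ||w||_2^2 by subgradient descent.

    X: (m, n) array of data points as rows; y: labels in {-1, +1}.
    """
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        margins = y * (X @ w)                 # y_i * w^T x_i
        active = margins < 1                  # points with nonzero hinge loss
        grad = -(y[active, None] * X[active]).sum(axis=0) + 2 * lam * w
        w -= lr * grad
    return w

# Classification rule on a new point x: np.sign(w @ x)
```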

Generic form of the problem
Many classification problems can be written as
$$\min_w \ L(X^T w, y) + \lambda \|w\|_2^2$$
where
$X = [x_1, \ldots, x_m]$ is the $n \times m$ matrix of data points,
$y \in \mathbb{R}^m$ contains a response vector (or labels),
$w$ contains the classifier coefficients,
$L$ is a loss function that depends on the problem considered,
$\lambda \geq 0$ is a regularization parameter.
The prediction/classification rule depends only on $w^T x$, where $x \in \mathbb{R}^n$ is a new data point.

Loss functions
Squared loss (for linear least-squares regression): $L(z, y) = \|z - y\|_2^2$.
Hinge loss (for SVMs): $L(z, y) = \sum_{i=1}^m \max(0, 1 - y_i z_i)$.
Logistic loss (for logistic regression): $L(z, y) = \sum_{i=1}^m \log(1 + e^{-y_i z_i})$.
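For concreteness, the three losses written as plain numpy functions (a sketch; z and y are 1-D arrays of length m):

```python
import numpy as np

def squared_loss(z, y):
    return np.sum((z - y) ** 2)                   # ||z - y||_2^2

def hinge_loss(z, y):
    return np.sum(np.maximum(0.0, 1 - y * z))     # sum_i max(0, 1 - y_i z_i)

def logistic_loss(z, y):
    return np.sum(np.log1p(np.exp(-y * z)))       # sum_i log(1 + exp(-y_i z_i))
```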

Outline

Key result
For the generic problem
$$\min_w \ L(X^T w) + \lambda \|w\|_2^2,$$
the optimal $w$ lies in the span of the data points $x_1, \ldots, x_m$:
$$w = X v$$
for some vector $v \in \mathbb{R}^m$.

Proof
Any $w \in \mathbb{R}^n$ can be written as the sum of two orthogonal vectors:
$$w = X v + r,$$
where $X^T r = 0$ (that is, $r$ lies in the nullspace $\mathcal{N}(X^T)$, the orthogonal complement of the range $\mathcal{R}(X)$). Then $X^T w = X^T X v$, so the loss term does not depend on $r$, while $\|w\|_2^2 = \|X v\|_2^2 + \|r\|_2^2$ by orthogonality; setting $r = 0$ can only decrease the objective. Hence an optimal $w$ can be taken of the form $w = X v$.

[Figure: illustration of the fundamental theorem of linear algebra, $\mathbb{R}^m = \mathcal{R}(A) \oplus \mathcal{N}(A^T)$, shown in $\mathbb{R}^3$ for the case $X = A = (a_1, a_2)$.]

Consequence of key result
For the generic problem
$$\min_w \ L(X^T w) + \lambda \|w\|_2^2,$$
the optimal $w$ can be written as $w = X v$ for some vector $v \in \mathbb{R}^m$. Hence the training problem depends only on $K := X^T X$:
$$\min_v \ L(K v) + \lambda v^T K v.$$

Kernel matrix
The training problem depends only on the kernel matrix
$$K = X^T X, \qquad K_{ij} = x_i^T x_j.$$
$K$ contains the scalar products between all pairs of data points. The prediction/classification rule depends on the scalar products between a new point $x$ and the data points $x_1, \ldots, x_m$:
$$w^T x = v^T X^T x = v^T k, \qquad k := X^T x = (x^T x_1, \ldots, x^T x_m).$$
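For the squared loss, the kernelized problem $\min_v \|K v - y\|_2^2 + \lambda v^T K v$ has a closed-form minimizer $v = (K + \lambda I)^{-1} y$; this specialization (kernel ridge regression) is standard but not spelled out in the slides. A minimal sketch with the plain linear kernel, data points stored as rows:

```python
import numpy as np

def linear_kernel_matrix(X):
    """K_ij = x_i^T x_j for data points stored as rows of X (shape (m, n))."""
    return X @ X.T

def kernel_ridge_fit(K, y, lam):
    """Solve min_v ||K v - y||_2^2 + lam * v^T K v; v = (K + lam I)^{-1} y is a minimizer."""
    m = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(m), y)

def kernel_predict(v, X_train, x_new):
    """Prediction w^T x = v^T k, with k = (x^T x_1, ..., x^T x_m)."""
    k = X_train @ x_new
    return v @ k
```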

Computational advantages
Once $K$ is formed (each entry takes $O(n)$ to compute), the training problem has only $m$ variables. When $n \gg m$, this leads to a dramatic reduction in problem size.

How about the nonlinear case?
In the nonlinear case, we simply replace the feature vectors $x_i$ by augmented feature vectors $\phi(x_i)$, with $\phi$ a non-linear mapping. Example: in classification with a quadratic decision boundary, we use $\phi(x) := (x_1, x_2, x_1^2, x_1 x_2, x_2^2)$. This leads to the modified kernel matrix
$$K_{ij} = \phi(x_i)^T \phi(x_j), \quad 1 \leq i, j \leq m.$$

The kernel function
The kernel function associated with the mapping $\phi$ is
$$k(x, z) = \phi(x)^T \phi(z).$$
It provides information about the metric in the feature space, e.g.
$$\|\phi(x) - \phi(z)\|_2^2 = k(x, x) - 2 k(x, z) + k(z, z).$$
The computational effort involved in solving the training problem and in making predictions depends only on our ability to quickly evaluate such scalar products. We can't choose $k$ arbitrarily; it has to satisfy the above for some $\phi$.
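A quick numerical check of the distance identity above, using the explicit quadratic map from the classification example so that $k(x, z) = \phi(x)^T \phi(z)$ holds by construction (the test points are arbitrary):

```python
import numpy as np

def phi(x):
    """Quadratic feature map (x_1, x_2, x_1^2, x_1 x_2, x_2^2)."""
    return np.array([x[0], x[1], x[0] ** 2, x[0] * x[1], x[1] ** 2])

def k(x, z):
    return phi(x) @ phi(z)

x, z = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
lhs = np.sum((phi(x) - phi(z)) ** 2)          # ||phi(x) - phi(z)||_2^2
rhs = k(x, x) - 2 * k(x, z) + k(z, z)
assert np.isclose(lhs, rhs)
```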

Outline

Quadratic kernels
Classification with quadratic boundaries involves feature vectors
$$\phi(x) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2).$$
Fact: given two vectors $x, z \in \mathbb{R}^2$, we have
$$\phi(x)^T \phi(z) = (1 + x^T z)^2$$
(up to a $\sqrt{2}$ scaling of some components of $\phi$; see the check below).
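A numerical check of this fact, using the $\sqrt{2}$-scaled version of the map (test points are arbitrary):

```python
import numpy as np

def phi(x):
    """Scaled quadratic map so that phi(x)^T phi(z) = (1 + x^T z)^2 exactly."""
    s = np.sqrt(2.0)
    return np.array([1.0, s * x[0], s * x[1], x[0] ** 2, s * x[0] * x[1], x[1] ** 2])

x, z = np.array([0.3, -1.2]), np.array([2.0, 0.7])
assert np.isclose(phi(x) @ phi(z), (1 + x @ z) ** 2)
```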

More generally, when $\phi(x)$ is the vector formed with all the products between the components of $x \in \mathbb{R}^n$ up to degree $d$, then for any two vectors $x, z \in \mathbb{R}^n$,
$$\phi(x)^T \phi(z) = (1 + x^T z)^d.$$
The computational effort of evaluating the right-hand side grows linearly in $n$. This is a dramatic speedup over the brute-force approach (form $\phi(x)$ and $\phi(z)$, then evaluate $\phi(x)^T \phi(z)$), whose computational effort grows as $n^d$.

Gaussian kernel
The Gaussian kernel function is
$$k(x, z) = \exp\left(-\frac{\|x - z\|_2^2}{2 \sigma^2}\right),$$
where $\sigma > 0$ is a scale parameter. It allows one to ignore points that are too far apart, and corresponds to a non-linear mapping $\phi$ to an infinite-dimensional feature space. There is a large variety (a zoo?) of other kernels, some adapted to the structure of the data (text, images, etc.).
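A minimal sketch of forming the Gaussian kernel matrix, assuming the data points are stored as rows of X:

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """K_ij = exp(-||x_i - x_j||_2^2 / (2 sigma^2)) for rows x_i of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2 * sigma ** 2))
```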

In practice
Kernels need to be chosen by the user. The choice is not always obvious; Gaussian or polynomial kernels are popular. Control over-fitting via cross-validation (with respect to, say, the scale parameter of the Gaussian kernel, or the degree of the polynomial kernel). Kernel methods are not well adapted to $\ell_1$-norm regularization.
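One common way to carry out this cross-validation in practice is with scikit-learn's grid search (not part of the lecture; the parameter grid below is illustrative, and gamma corresponds to $1/(2\sigma^2)$ in the notation above):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Grid over the Gaussian-kernel scale (gamma) and the regularization strength C.
param_grid = {"gamma": [0.01, 0.1, 1.0], "C": [0.1, 1.0, 10.0]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
# search.fit(X_train, y_train)   # X_train, y_train: your labeled data (not defined here)
```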