Support Vector Machines

Support Vector Machines

Hypothesis space: variable size; deterministic; continuous parameters. Learning algorithm: linear and quadratic programming; eager; batch.

SVMs combine three important ideas:
- Apply optimization algorithms from Operations Research (linear programming and quadratic programming).
- Implicit feature transformation using kernels.
- Control of overfitting by maximizing the margin.

White Lie Warning

This first introduction to SVMs describes a special case with simplifying assumptions. We will revisit SVMs later in the quarter and remove the assumptions. The material you are about to see does not describe real SVMs.

Linear Programming

The linear programming problem is the following: find w to minimize c · w subject to w · a_i = b_i for i = 1, …, m and w_j ≥ 0 for j = 1, …, n. There are fast algorithms for solving linear programs, including the simplex algorithm and Karmarkar's algorithm.
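
For concreteness, here is a small sketch (not from the original slides) of feeding this standard form to an off-the-shelf LP solver, scipy.optimize.linprog; the coefficients c, a_i, and b_i below are arbitrary illustration values.

```python
# Minimal sketch: solve "minimize c.w  s.t.  A_eq w = b_eq,  w >= 0"
# with SciPy's LP solver. The numbers are arbitrary illustration data.
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0, 0.0])           # objective coefficients
A_eq = np.array([[1.0, 1.0, 1.0],        # one row per equality constraint a_i
                 [2.0, 0.5, 1.0]])
b_eq = np.array([1.0, 1.5])

# bounds=(0, None) enforces w_j >= 0 for every variable
result = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
print(result.x, result.fun)
```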

Formulating LTU Learning as Linear Programming

Encode classes as {+1, -1}. LTU: h(x) = +1 if w · x ≥ 0, and -1 otherwise. An example (x_i, y_i) is classified correctly by h if y_i (w · x_i) > 0. Basic idea: the constraints of the linear programming problem will be of the form y_i (w · x_i) > 0. We need to introduce two more steps to convert this into the standard format for a linear program.

Converting to Standard LP Form

Step 1: Convert to equality constraints by using slack variables. Introduce one slack variable s_i for each training example x_i and require that s_i ≥ 0: y_i (w · x_i) - s_i = 0.

Step 2: Make all variables positive by subtracting pairs. Replace each w_j by a difference of two variables, w_j = u_j - v_j, where u_j, v_j ≥ 0: y_i ((u - v) · x_i) - s_i = 0.

Linear program: find s_i, u_j, v_j; minimize (no objective function); subject to y_i (Σ_j (u_j - v_j) x_ij) - s_i = 0, with s_i ≥ 0, u_j ≥ 0, v_j ≥ 0. The linear program will have a solution iff the points are linearly separable.
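
A sketch of this idea in code, with two deliberate deviations from the slide: the strict "> 0" constraint is replaced by "≥ 1" so the solver has a usable feasible region (any separating w can be rescaled to satisfy it), and the slack variables are left to the solver since linprog accepts inequality constraints directly. The toy data are invented.

```python
# Sketch (not the slides' exact program): find a separating LTU with an LP.
# Uses y_i (w . x_i) >= 1 instead of the strict "> 0"; bias is handled by
# appending a constant-1 feature. Data below are arbitrary illustration values.
import numpy as np
from scipy.optimize import linprog

X = np.array([[0.0, 2.5], [1.0, 3.0], [2.0, 1.0], [3.0, 2.5]])
y = np.array([+1, +1, -1, -1])

Xa = np.hstack([X, np.ones((len(X), 1))])        # append bias feature
n, d = Xa.shape

# Variables: [u_1..u_d, v_1..v_d], all >= 0, with w = u - v.
# Constraint y_i (w . x_i) >= 1  <=>  -y_i x_i . (u - v) <= -1
A_ub = np.hstack([-y[:, None] * Xa, y[:, None] * Xa])
b_ub = -np.ones(n)
c = np.zeros(2 * d)                               # feasibility only, no objective

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
if res.success:
    w = res.x[:d] - res.x[d:]
    print("separator w (last entry is the bias):", w)
else:
    print("LP infeasible: data not linearly separable")
```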

Example

30 random data points labeled according to the line x_2 = 1 + x_1. The pink line is the true classifier; the blue line is the linear programming fit.

What Happens with Non-Separable Data?

Bad news: the linear program is infeasible.

Higher-Dimensional Spaces

Theorem: for any data set, there exists a mapping Φ to a higher-dimensional space such that the data is linearly separable: Φ(x) = (φ_1(x), φ_2(x), …, φ_D(x)).

Example: map to quadratic space. With x = (x_1, x_2) (just two features), Φ(x) = (x_1², √2 x_1 x_2, x_2², √2 x_1, √2 x_2, 1). Compute a linear separator in this space.
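
As a small sketch (not from the slides), the two-feature quadratic map above can be written as an explicit function; the name phi is invented here.

```python
# Explicit quadratic feature map for x = (x1, x2), as on the slide.
import numpy as np

def phi(x):
    """Map a 2-D point into the 6-D quadratic feature space."""
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([x1**2, s * x1 * x2, x2**2, s * x1, s * x2, 1.0])

print(phi(np.array([1.0, 2.0])))
```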

Drawback of this approach

The number of features increases rapidly. This makes the linear program much slower to solve.

Kernel Trick

A dot product between two higher-dimensional mappings can sometimes be implemented by a kernel function: K(x_i, x_j) = Φ(x_i) · Φ(x_j).

Example: quadratic kernel.
K(x_i, x_j) = (x_i · x_j + 1)²
            = (x_i1 x_j1 + x_i2 x_j2 + 1)²
            = x_i1² x_j1² + 2 x_i1 x_i2 x_j1 x_j2 + x_i2² x_j2² + 2 x_i1 x_j1 + 2 x_i2 x_j2 + 1
            = (x_i1², √2 x_i1 x_i2, x_i2², √2 x_i1, √2 x_i2, 1) · (x_j1², √2 x_j1 x_j2, x_j2², √2 x_j1, √2 x_j2, 1)
            = Φ(x_i) · Φ(x_j)
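
This identity is easy to check numerically; the sketch below (self-contained, with an invented phi helper) compares the kernel value with the explicit-feature dot product on random points.

```python
# Numerical check that K(x, z) = (x . z + 1)^2 matches phi(x) . phi(z).
import numpy as np

def phi(x):
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([x1**2, s * x1 * x2, x2**2, s * x1, s * x2, 1.0])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(np.isclose((x @ z + 1.0) ** 2, phi(x) @ phi(z)))   # True
```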

Idea

Reformulate the LTU linear program so that it only involves dot products between pairs of training examples. Then we can use kernels to compute these dot products. The running time of the algorithm will not depend on the number of dimensions D of the high-dimensional space.

Reformulating the LTU Linear Program

Claim: in the online Perceptron, w can be written as w = Σ_j α_j y_j x_j.

Proof: each weight update has the form w_t := w_{t-1} + η g_{i,t}, where g_{i,t} = error_{it} y_i x_i (error_{it} = 1 if x_i is misclassified in iteration t; 0 otherwise). Hence
w_t = w_0 + Σ_t Σ_i η error_{it} y_i x_i.
But w_0 = (0, 0, …, 0), so
w_t = Σ_t Σ_i η error_{it} y_i x_i = Σ_i (Σ_t η error_{it}) y_i x_i = Σ_i α_i y_i x_i.
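
A quick sketch verifying the claim: run the online Perceptron on invented toy data, accumulate α_i as η times the number of iterations in which x_i was misclassified, and check that the learned weight vector equals Σ_i α_i y_i x_i.

```python
# Sketch: the Perceptron weight vector is a weighted sum of the examples.
import numpy as np

X = np.array([[0.0, 2.5], [1.0, 3.0], [2.0, 1.0], [3.0, 2.5]])
y = np.array([+1, +1, -1, -1])
eta = 0.5

w = np.zeros(X.shape[1])
alpha = np.zeros(len(X))
for _ in range(100):                      # passes over the data
    for i in range(len(X)):
        if y[i] * (w @ X[i]) <= 0:        # misclassified (or on the boundary)
            w += eta * y[i] * X[i]
            alpha[i] += eta

print(np.allclose(w, (alpha * y) @ X))    # True: w = sum_i alpha_i y_i x_i
```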

Rewriting the Linear Separator Using Dot Products

w · x_i = Σ_j α_j y_j x_j · x_i = Σ_j α_j y_j (x_j · x_i)

Change of variables: instead of optimizing w, optimize {α_j}. Rewrite the constraint y_i (w · x_i) > 0 as y_i Σ_j α_j y_j (x_j · x_i) > 0, or Σ_j α_j y_j y_i (x_j · x_i) > 0.

The Linear Program Becomes

Find {α_j}; minimize (no objective function); subject to Σ_j α_j y_j y_i (x_j · x_i) > 0 and α_j ≥ 0.

Notes: the weight α_j tells us how important example x_j is. If α_j is non-zero, then x_j is called a support vector. To classify a new data point x, we take its dot product with the support vectors: is Σ_j α_j y_j (x_j · x) > 0?

Kernel Version of the Linear Program

Find {α_j}; minimize (no objective function); subject to Σ_j α_j y_j y_i K(x_j, x_i) > 0 and α_j ≥ 0. Classify a new x according to whether Σ_j α_j y_j K(x_j, x) > 0.
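
A sketch of this kernelized feasibility program, again with "≥ 1" standing in for the slide's strict "> 0" so an LP solver can handle it; the quadratic kernel, the toy data, and the classify helper are all invented for illustration.

```python
# Sketch of the kernelized LP: find alpha_j >= 0 with
# sum_j alpha_j y_j y_i K(x_j, x_i) >= 1 for every i.
import numpy as np
from scipy.optimize import linprog

def K(a, b):
    """Quadratic kernel (a . b + 1)^2."""
    return (a @ b + 1.0) ** 2

X = np.array([[0.0, 0.1], [0.2, -0.1], [1.5, 1.4], [-1.4, 1.6]])
y = np.array([+1, +1, -1, -1])            # inner points vs. outer points
n = len(X)

G = np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])  # Gram matrix
A_ub = -(y[:, None] * y[None, :] * G)     # -y_i y_j K(x_i, x_j) alpha_j <= -1
b_ub = -np.ones(n)

res = linprog(np.zeros(n), A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
alpha = res.x

def classify(x_new):
    score = sum(alpha[j] * y[j] * K(X[j], x_new) for j in range(n))
    return +1 if score > 0 else -1

print(alpha)                              # non-zero entries = support vectors
print([classify(x) for x in X])           # recovers the training labels
```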

Two support vectors (blue) with α_1 = 0.205 and α_2 = 0.338, equivalent to the line x_2 = 0.0974 + 1.341 x_1.

Solving the Non-Separable Case with a Cubic Polynomial Kernel

Kernels

Dot product: K(x_i, x_j) = x_i · x_j
Polynomial of degree d: K(x_i, x_j) = (x_i · x_j + 1)^d
Gaussian with scale σ: K(x_i, x_j) = exp(-||x_i - x_j||² / σ²)

Polynomials often give strange boundaries. Gaussians generally work well.
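Written as plain functions, these three kernels look like the sketch below (function names and the example points are invented).

```python
# The three kernels from the slide, as plain functions.
import numpy as np

def dot_kernel(xi, xj):
    return xi @ xj

def poly_kernel(xi, xj, d=3):
    return (xi @ xj + 1.0) ** d

def gaussian_kernel(xi, xj, sigma=2.0):
    return np.exp(-np.sum((xi - xj) ** 2) / sigma ** 2)

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(dot_kernel(a, b), poly_kernel(a, b), gaussian_kernel(a, b))
```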

Gaussian kernel with σ² = 4. The Gaussian kernel is equivalent to an infinite-dimensional feature space!

Evaluation of SVMs

[Table: SVMs compared against the Perceptron, logistic regression, LDA, decision trees, neural nets, and nearest neighbor on these criteria: mixed data, missing values, outliers, monotone transformations, scalability, irrelevant inputs, linear combinations, interpretability, and accuracy; several entries are rated "somewhat".]
* = dot product kernel with absolute value penalty
** = dot product kernel

Support Vector Machines: Summary

Advantages of SVMs:
- Variable-sized hypothesis space.
- Polynomial-time exact optimization rather than approximate methods (unlike decision trees and neural networks).
- Kernels allow very flexible hypotheses.

Disadvantages of SVMs:
- Must choose the kernel and kernel parameters (e.g., Gaussian σ).
- Very large problems are computationally intractable: quadratic in the number of examples; problems with more than 20,000 examples are very difficult to solve exactly.
- Batch algorithm.

SVMs Unify LTUs and Nearest Neighbor

With a Gaussian kernel:
- Compute the distance to a set of support-vector "nearest neighbors".
- Transform through a Gaussian (nearer neighbors get bigger votes).
- Take the weighted sum of those distances for each class.
- Classify to the class with the most votes.
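
A minimal sketch of that view, assuming alpha, y, the support vectors, and sigma come from an already-trained Gaussian-kernel classifier; the values and the predict helper below are placeholders for illustration only.

```python
# Sketch: Gaussian-kernel SVM prediction seen as weighted nearest-neighbor voting.
import numpy as np

support_vectors = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
alpha = np.array([0.4, 0.7, 0.5])        # importance weights (placeholders)
y = np.array([+1, -1, +1])               # labels of the support vectors
sigma = 1.0

def predict(x):
    # Each support vector votes with weight alpha_j, attenuated by a Gaussian
    # of its distance to x; the sign of the total vote is the predicted class.
    dists_sq = np.sum((support_vectors - x) ** 2, axis=1)
    votes = alpha * y * np.exp(-dists_sq / sigma ** 2)
    return +1 if votes.sum() > 0 else -1

print(predict(np.array([0.1, 0.0])), predict(np.array([1.0, 0.9])))
```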