MATH 829: Introduction to Data Mining and Analysis Support vector machines


MATH 829: Introduction to Data Mining and Analysis
Support vector machines
Dominique Guillot
Department of Mathematical Sciences, University of Delaware
March 11, 2016

Hyperplanes
Recall: A hyperplane $H$ in $V = \mathbb{R}^n$ is a subspace of $V$ of dimension $n-1$ (i.e., a subspace of codimension 1). Each hyperplane is determined by a nonzero vector $\beta \in \mathbb{R}^n$ via
$$H = \{x \in \mathbb{R}^n : \beta^T x = 0\} = \mathrm{span}(\beta)^\perp.$$
An affine hyperplane $H$ in $\mathbb{R}^n$ is a subset of the form
$$H = \{x \in \mathbb{R}^n : \beta_0 + \beta^T x = 0\},$$
where $\beta_0 \in \mathbb{R}$ and $\beta \in \mathbb{R}^n$.
We often use the term "hyperplane" for affine hyperplane.

Hyperplanes (cont.)
Let $H = \{x \in \mathbb{R}^n : \beta_0 + \beta^T x = 0\}$.
Note that for $x_0, x_1 \in H$, $\beta^T(x_0 - x_1) = 0$. Thus $\beta$ is perpendicular to $H$.
It follows that for $x \in \mathbb{R}^n$,
$$d(x, H) = \frac{\beta^T}{\|\beta\|}(x - x_0) = \frac{\beta_0 + \beta^T x}{\|\beta\|},$$
where $x_0$ is any point of $H$ (a signed distance; take the absolute value for the usual distance).
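A quick numerical check of this formula (a minimal sketch; the vector beta, intercept beta0, and points X below are made-up example values, not from the lecture):

```python
import numpy as np

# Hypothetical hyperplane in R^2: beta0 + beta^T x = 0
beta = np.array([3.0, 4.0])   # normal vector (not unit length)
beta0 = -5.0                  # intercept

X = np.array([[1.0, 0.5],     # a few example points
              [3.0, -1.0],
              [0.0, 0.0]])

# Signed distance d(x, H) = (beta0 + beta^T x) / ||beta||
signed_dist = (beta0 + X @ beta) / np.linalg.norm(beta)
print(signed_dist)            # the sign tells which side of H each point lies on
```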

Separating hyperplane
Suppose we have binary data with labels $\{+1, -1\}$. We want to separate the data using an (affine) hyperplane.
[Figure: ESL, Figure 4.14; orange line = least-squares fit.]
Classify using $G(x) = \mathrm{sgn}(x^T \beta + \beta_0)$.
A separating hyperplane may not be unique.
A separating hyperplane may not exist (i.e., the data may not be separable).
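A small illustration of this classification rule (a sketch only; beta, beta0, and the query points are hypothetical values chosen for the example):

```python
import numpy as np

def classify(X, beta, beta0):
    """Classify rows of X with G(x) = sgn(beta^T x + beta0), returning +1 or -1."""
    scores = X @ beta + beta0
    return np.where(scores >= 0, 1, -1)

# Hypothetical separating hyperplane and query points
beta, beta0 = np.array([1.0, -2.0]), 0.5
X_new = np.array([[2.0, 0.0], [-1.0, 1.0]])
print(classify(X_new, beta, beta0))   # array of +1 / -1 labels
```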

Margins
Uniqueness problem: when the data is separable, choose the hyperplane that maximizes the margin (the "no man's land" between the classes).
Data: $(y_i, x_i) \in \{+1, -1\} \times \mathbb{R}^p$ $(i = 1, \ldots, n)$.
Suppose $\beta_0 + \beta^T x = 0$ is a separating hyperplane with $\|\beta\| = 1$. Note that:
$y_i(x_i^T \beta + \beta_0) > 0$: correct classification;
$y_i(x_i^T \beta + \beta_0) < 0$: incorrect classification.
Also, $y_i(x_i^T \beta + \beta_0)$ equals the distance between $x_i$ and the hyperplane (since $\|\beta\| = 1$).

Margins (cont.)
Thus, if the data is separable, we can solve
$$\max_{\beta_0,\, \beta \in \mathbb{R}^p,\, \|\beta\| = 1} M \quad \text{subject to } y_i(x_i^T \beta + \beta_0) \geq M \quad (i = 1, \ldots, n).$$
We will transform the problem into a standard form used in convex optimization.
We can remove $\|\beta\| = 1$ by replacing the constraint by $\frac{1}{\|\beta\|} y_i(x_i^T \beta + \beta_0) \geq M$, or equivalently, $y_i(x_i^T \beta + \beta_0) \geq M \|\beta\|$.
We can always rescale $(\beta, \beta_0)$ so that $\|\beta\| = 1/M$. Our problem is therefore equivalent to
$$\min_{\beta_0,\, \beta \in \mathbb{R}^p} \frac{1}{2}\|\beta\|^2 \quad \text{subject to } y_i(x_i^T \beta + \beta_0) \geq 1 \quad (i = 1, \ldots, n).$$
We now recognize the problem as a convex optimization problem with a quadratic objective and linear inequality constraints.
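A minimal sketch of solving this quadratic program numerically, assuming the cvxpy package and some synthetic separable data (the data-generating code is illustrative, not from the lecture):

```python
import numpy as np
import cvxpy as cp

# Synthetic separable data in R^2 (illustrative only)
rng = np.random.default_rng(0)
n, p = 40, 2
X = np.vstack([rng.normal(-2, 0.5, (n // 2, p)),
               rng.normal(+2, 0.5, (n // 2, p))])
y = np.concatenate([-np.ones(n // 2), np.ones(n // 2)])

# Hard-margin SVM: minimize (1/2)||beta||^2  s.t.  y_i (x_i^T beta + beta0) >= 1
beta = cp.Variable(p)
beta0 = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(beta))
constraints = [cp.multiply(y, X @ beta + beta0) >= 1]
cp.Problem(objective, constraints).solve()

print("beta =", beta.value, "beta0 =", beta0.value)
print("margin M = 1/||beta|| =", 1 / np.linalg.norm(beta.value))
```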

Support vector machines
The previous problem works well when the data is separable. What happens if there is no way to find a margin?
We allow some points to be on the wrong side of the margin, but keep control of the error.
We replace $y_i(x_i^T \beta + \beta_0) \geq M$ by
$$y_i(x_i^T \beta + \beta_0) \geq M(1 - \xi_i), \quad \xi_i \geq 0,$$
and add the constraint $\sum_{i=1}^n \xi_i \leq C$ for some fixed constant $C > 0$.
The problem becomes:
$$\max_{\beta_0,\, \beta \in \mathbb{R}^p,\, \|\beta\| = 1} M \quad \text{subject to } y_i(x_i^T \beta + \beta_0) \geq M(1 - \xi_i), \quad \xi_i \geq 0, \quad \sum_{i=1}^n \xi_i \leq C.$$

Support vector machines (cont.)
As before, we can transform the problem into its normal form:
$$\min_{\beta_0,\, \beta} \frac{1}{2}\|\beta\|^2 \quad \text{subject to } y_i(x_i^T \beta + \beta_0) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \sum_{i=1}^n \xi_i \leq C.$$
The problem can be solved using standard optimization packages.
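One way to hand this to a standard package, continuing the cvxpy sketch above (the data and the slack budget C are hypothetical; this follows the budget form of the constraint used on the slide rather than the more common penalty form):

```python
import numpy as np
import cvxpy as cp

# Synthetic, overlapping (non-separable) data in R^2 -- illustrative only
rng = np.random.default_rng(1)
n, p = 60, 2
X = np.vstack([rng.normal(-1, 1.0, (n // 2, p)),
               rng.normal(+1, 1.0, (n // 2, p))])
y = np.concatenate([-np.ones(n // 2), np.ones(n // 2)])
C = 10.0                                  # slack budget (hypothetical choice)

beta, beta0, xi = cp.Variable(p), cp.Variable(), cp.Variable(n)
objective = cp.Minimize(0.5 * cp.sum_squares(beta))
constraints = [cp.multiply(y, X @ beta + beta0) >= 1 - xi,
               xi >= 0,
               cp.sum(xi) <= C]
cp.Problem(objective, constraints).solve()

# xi_i > 0 flags points inside the margin or on the wrong side of it
print("beta =", beta.value, "beta0 =", beta0.value)
print("margin violations:", int(np.sum(xi.value > 1e-6)))
```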

Multiple classes of data
The SVM is a binary classifier. How can we classify data with $K > 2$ classes?
One versus all (or one versus the rest): Fit the model to separate each class against the remaining classes. Label a new point $x$ according to the model for which $x^T \beta + \beta_0$ is the largest. Need to fit the model $K$ times.
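A sketch of one-versus-the-rest using scikit-learn (assuming scikit-learn is available; LinearSVC and the toy data below are illustrative choices, not part of the lecture):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

# Toy 3-class data in R^2 (illustrative only)
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.6, (30, 2)) for c in ([-2, 0], [2, 0], [0, 3])])
y = np.repeat([0, 1, 2], 30)

# Fits K binary SVMs (each class vs. the rest); prediction takes the largest score
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)
print(ovr.predict([[0.0, 2.5], [-2.0, 0.2]]))
```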

Multiple classes of data (cont.)
One versus one:
1. Train a classifier for each possible pair of classes. Note: there are $\binom{K}{2} = K(K-1)/2$ such pairs.
2. Classify a new point according to a majority vote: count the number of times the new point is assigned to a given class, and pick the class with the largest count.
Need to fit the model $\binom{K}{2}$ times (computationally intensive).
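A corresponding one-versus-one sketch (again assuming scikit-learn, whose OneVsOneClassifier makes the pairwise scheme explicit; the data is the same hypothetical 3-class example as above):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsOneClassifier

# Toy 3-class data in R^2 (illustrative only), as in the previous sketch
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.6, (30, 2)) for c in ([-2, 0], [2, 0], [0, 3])])
y = np.repeat([0, 1, 2], 30)

# Fits K(K-1)/2 = 3 pairwise SVMs; prediction is by majority vote over the pairs
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)
print(len(ovo.estimators_))          # number of pairwise classifiers: 3
print(ovo.predict([[0.0, 2.5]]))
```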