UVA CS / Introduction to Machine Learning and Data Mining. Lecture 9: Classification with Support Vector Machine (cont.)


UVA CS 4501-001 / 6501-007 Introduction to Machine Learning and Data Mining
Lecture 9: Classification with Support Vector Machine (cont.)
Yanjun Qi / Jane
University of Virginia, Department of Computer Science
9/25/14

Where are we? Five major sections of this course:
- Regression (supervised)
- Classification (supervised)
- Unsupervised models
- Learning theory
- Graphical models

Today:
- Review of Classification
- Support Vector Machine (SVM)
  - Large Margin Linear Classifier
  - Define Margin (M) in terms of model parameters
  - Optimization to learn model parameters (w, b)
  - Non linearly separable case
  - Optimization with dual form

e.g., a SUPERVISED linear binary classifier: f(x, w, b) = sign(w^T x + b). If w^T x + b > 0, predict Class = +1; if w^T x + b < 0, predict Class = -1. (In the figure, points denote +1 points, -1 points, and future points.) Courtesy slide from Prof. Andrew Moore's tutorial.
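To make the prediction rule concrete, here is a minimal sketch (not from the lecture) of the decision function; the weights, bias, and points are made-up values used only for illustration.

```python
# A minimal sketch (not the lecture's code) of the supervised linear binary
# classifier above: f(x, w, b) = sign(w^T x + b). The weight vector and data
# below are made-up values just to illustrate the prediction rule.
import numpy as np

def predict(X, w, b):
    """Return +1 or -1 for each row of X using sign(w^T x + b)."""
    return np.where(X @ w + b > 0, 1, -1)

X = np.array([[2.0, 1.0], [-1.0, -3.0]])   # two example points
w = np.array([1.0, 0.5])                    # hypothetical learned weights
b = -0.5                                    # hypothetical learned bias
print(predict(X, w, b))                     # e.g. [ 1 -1 ]
```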

Types of classifiers. We can divide the large variety of classification approaches into roughly three major types:
1. Discriminative: directly estimate a decision rule/boundary, e.g., support vector machine, decision tree.
2. Generative: build a generative statistical model, e.g., Bayesian networks.
3. Instance-based classifiers: use the observations directly (no model), e.g., K nearest neighbors.

A study comparing classifiers: 11 binary classification problems / 8 metrics.

A dataset for binary classification. Output as binary class: only two possibilities.
- Data / points / instances / examples / samples / records: the rows.
- Features / attributes / dimensions / independent variables / covariates / predictors / regressors: the columns, except the last.
- Target / outcome / response / label / dependent variable: the special column to be predicted (the last column).

History of SVM. SVM is inspired by statistical learning theory [3]. SVM was first introduced in 1992 [1]. SVM became popular because of its success in handwritten digit recognition: a 1.1% test error rate for SVM, the same as the error rate of a carefully constructed neural network, LeNet 4 (see Section 5.11 in [2] or the discussion in [3] for details). SVM is now regarded as an important example of kernel methods, arguably the hottest area in machine learning ten years ago.

[1] B.E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144-152, Pittsburgh, 1992.
[2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, pp. 77-82, 1994.
[3] V. Vapnik. The Nature of Statistical Learning Theory. 2nd edition, Springer, 1999.
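As a concrete illustration of this layout (not from the lecture), the sketch below builds a tiny made-up data matrix whose rows are samples, whose columns except the last are features, and whose last column is the binary label.

```python
# A small made-up example of the dataset layout described above: rows are
# samples, all columns except the last are features, the last column is the
# binary target. Values are arbitrary and only for illustration.
import numpy as np

data = np.array([
    # feature_1, feature_2, label
    [ 0.5,  1.2, +1],
    [-0.3,  0.8, +1],
    [ 1.1, -2.0, -1],
    [-0.7, -1.5, -1],
])

X = data[:, :-1]   # features: all columns except the last
y = data[:, -1]    # target/label: the last column
print(X.shape, y.shape)   # (4, 2) (4,)
```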

Today:
- Review of Classification
- Support Vector Machine (SVM)
  - Large Margin Linear Classifier
  - Define Margin (M) in terms of model parameters
  - Optimization to learn model parameters (w, b)
  - Non linearly separable case
  - Optimization with dual form

Linear classifiers: f(x, w, b) = sign(w^T x + b). (Figure: points denote +1 and -1.) How would you classify this data?

Linear classifiers: f(x, w, b) = sign(w^T x + b). How would you classify this data? (The slide repeats the same scatter of +1 and -1 points with a different candidate separating line each time.)

Linear classifiers: f(x, w, b) = sign(w^T x + b). How would you classify this data? Any of these would be fine... but which is best?

Classifier margin. For f(x, w, b) = sign(w^T x + b), define the margin of a linear classifier as the width by which the boundary could be increased before hitting a datapoint.

Maximum margin. The maximum margin linear classifier is the linear classifier with the, um, maximum margin. This is the simplest kind of SVM (called an LSVM): the Linear SVM.
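A minimal sketch (not from the lecture) of this idea: for a given separating (w, b), one common way to express the margin is the smallest distance |w^T x + b| / ||w|| over the datapoints, and the maximum margin classifier is the (w, b) that makes this as large as possible. All numbers below are made up.

```python
# A minimal sketch (an assumption of this note, not code from the lecture).
# For a separating hyperplane w^T x + b = 0, the distance from a point x to the
# boundary is |w^T x + b| / ||w||; the margin of the classifier is the smallest
# such distance over the training points.
import numpy as np

def margin(X, y, w, b):
    """Geometric margin of the separator (w, b) on points X with labels y in {-1, +1}."""
    distances = y * (X @ w + b) / np.linalg.norm(w)   # signed distances; all > 0 if separated
    return distances.min()

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(margin(X, y, np.array([1.0, 1.0]), 0.0))   # smallest distance to the boundary
```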

Maximum margin. Support vectors are those datapoints that the margin pushes up against. The maximum margin linear classifier is the linear classifier with the maximum margin; this is the simplest kind of SVM (called an LSVM): the Linear SVM.

Max margin classifiers. Instead of fitting all points, focus on boundary points. Learn a boundary that leads to the largest margin from both sets of points: from all the possible boundary lines, this leads to the largest margin on both sides.

Max margin classifiers. Instead of fitting all points, focus on boundary points; learn a boundary that leads to the largest margin from points on both sides. Why? It is intuitive and makes sense, it has some theoretical support, and it works well in practice.

This approach is also known as linear support vector machines (SVMs): the boundary points are the vectors supporting the boundary.

Max-margin & decision boundary. The decision boundary should be as far away from the data of both classes as possible.

Specifying a max margin classifier. Three parallel planes: the class +1 plane w^T x + b = +1, the boundary w^T x + b = 0, and the class -1 plane w^T x + b = -1. Predict class +1 on one side and class -1 on the other:
- Classify as +1 if w^T x + b >= 1
- Classify as -1 if w^T x + b <= -1
- Undefined if -1 < w^T x + b < 1
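A minimal sketch (not the lecture's code) of this three-zone rule; returning 0 for the undefined strip between the two planes is an arbitrary convention of this sketch, and the example values are made up.

```python
# A minimal sketch of the max-margin classification rule above: +1 outside the
# +1 plane, -1 outside the -1 plane, and "undefined" (returned here as 0)
# strictly between the two planes.
import numpy as np

def max_margin_rule(X, w, b):
    scores = X @ w + b
    labels = np.zeros(len(scores), dtype=int)   # 0 marks the undefined zone
    labels[scores >= 1] = 1
    labels[scores <= -1] = -1
    return labels

w, b = np.array([1.0, 1.0]), 0.0                 # hypothetical parameters
X = np.array([[2.0, 1.0], [0.1, 0.2], [-1.0, -2.0]])
print(max_margin_rule(X, w, b))                  # e.g. [ 1  0 -1 ]
```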

Specifying a max margin classifier (continued). Is the linear separation assumption realistic? We will deal with this shortly, but let's assume it for now. As before: classify as +1 if w^T x + b >= 1, classify as -1 if w^T x + b <= -1, undefined if -1 < w^T x + b < 1.

Today:
- Review of Classification
- Support Vector Machine (SVM)
  - Large Margin Linear Classifier
  - Define Margin (M) in terms of model parameters
  - Optimization to learn model parameters (w, b)
  - Non linearly separable case
  - Optimization with dual form

Maximizing the margin. Let's define the width of the margin by M. How can we encode our goal of maximizing M in terms of our parameters (w and b)? Let's start with a few observations.

Maximizing the margin: observation 1. The vector w is orthogonal to the +1 plane. Why? Let u and v be two points on the +1 plane; then w^T u + b = 1 and w^T v + b = 1, so for the vector defined by u and v we have w^T (u - v) = 0. Corollary: the vector w is orthogonal to the -1 plane.

Observation 1 → review of vector subtraction and the vector product (figure-only review slides).

Observation 1 → review of the vector product, continued (figure-only slides).

Observation 1 → review of the vector product, orthogonality, and the vector norm (figure-only slides).

Observation 1 → review: vector product, orthogonality, and norm. For two vectors x and y, x^T y is called the (inner) vector product. x and y are called orthogonal if x^T y = 0. The square root of the product of a vector with itself, sqrt(x^T x), is called the 2-norm, written ||x||_2 or simply ||x||.

Maximizing the margin: observation 1. The vector w is orthogonal to the +1 plane.
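A small numeric illustration (not from the lecture) of these definitions: the inner product x^T y, the orthogonality test x^T y = 0, and the 2-norm ||x|| = sqrt(x^T x). The vectors are arbitrary examples.

```python
# Numeric illustration of the inner product, orthogonality, and the 2-norm.
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([-4.0, 3.0])

print(x @ y)                     # inner product: 0.0, so x and y are orthogonal
print(np.sqrt(x @ x))            # 2-norm via the definition: 5.0
print(np.linalg.norm(x))         # same value from the library routine
```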

Maximizing the margin: observation 1 (recap). The vector w is orthogonal to the +1 plane: for two points u and v on that plane, w^T (u - v) = 0. Corollary: w is orthogonal to the -1 plane as well.

Maximizing the margin: observation 2. Observation 1: the vector w is orthogonal to the +1 and -1 planes. Observation 2: if x+ is a point on the +1 plane and x- is the closest point to x+ on the -1 plane, then x+ = λw + x-. Since w is orthogonal to both planes, we need to travel some distance along w to get from x+ to x-.

Putting it together. We know: w^T x+ + b = +1, w^T x- + b = -1, x+ = λw + x-, and ||x+ - x-|| = M. We can now define M in terms of w and b. Substituting the third equation into the first:
w^T x+ + b = +1
w^T (λw + x-) + b = +1
w^T x- + b + λ w^T w = +1
-1 + λ w^T w = +1
λ = 2 / (w^T w)

Therefore:
M = ||x+ - x-|| = ||λw|| = λ ||w|| = λ sqrt(w^T w) = (2 / (w^T w)) sqrt(w^T w) = 2 / sqrt(w^T w)
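A quick numeric sanity check (not from the lecture, with an arbitrary w and b): take a point on the -1 plane, step along w by λ = 2 / (w^T w) to reach the +1 plane, and confirm the step length equals 2 / sqrt(w^T w).

```python
# Numeric check of the derivation above with made-up parameters.
import numpy as np

w = np.array([1.0, 2.0])                     # arbitrary example weights
b = -0.5
x_minus = np.array([0.0, (-1 - b) / w[1]])   # a point with w^T x + b = -1
lam = 2.0 / (w @ w)
x_plus = x_minus + lam * w                   # should land on the +1 plane

print(w @ x_plus + b)                        # +1 (up to rounding)
print(np.linalg.norm(x_plus - x_minus))      # margin width M
print(2.0 / np.sqrt(w @ w))                  # matches 2 / sqrt(w^T w)
```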

Finding the optimal parameters. With M = 2 / sqrt(w^T w), we can now search for the optimal parameters by finding a solution that (1) correctly classifies all points and (2) maximizes the margin (or equivalently minimizes w^T w). Several optimization methods can be used: gradient descent, simulated annealing, EM, etc.

Today:
- Review of Classification
- Support Vector Machine (SVM)
  - Large Margin Linear Classifier
  - Define Margin (M) in terms of model parameters
  - Optimization to learn model parameters (w, b)
  - Non linearly separable case
  - Optimization with dual form

Optimization step, i.e., learning the optimal parameters for the SVM. With M = 2 / sqrt(w^T w): minimize (w^T w)/2 subject to the following constraints:
- for all x in class +1: w^T x + b >= 1
- for all x in class -1: w^T x + b <= -1
A total of n constraints if we have n input samples.

Optimization review: ingredients. Objective function, variables, constraints. Find values of the variables that minimize or maximize the objective function while satisfying the constraints.

Optimization with quadratic programming (QP). Quadratic programming solves optimization problems of the following form:

minimize over u: (u^T R u)/2 + d^T u + c   (u^T R u is the quadratic term)
subject to n inequality constraints:
  a_11 u_1 + a_12 u_2 + ... <= b_1
  ...
  a_n1 u_1 + a_n2 u_2 + ... <= b_n
and k equality constraints:
  a_{n+1,1} u_1 + a_{n+1,2} u_2 + ... = b_{n+1}
  ...
  a_{n+k,1} u_1 + a_{n+k,2} u_2 + ... = b_{n+k}

When a problem can be specified as a QP problem we can use solvers that are better than gradient descent or simulated annealing.

SVM as a QP problem. The SVM problem, minimize (w^T w)/2 subject to w^T x + b >= 1 for all x in class +1 and w^T x + b <= -1 for all x in class -1 (a total of n constraints for n input samples), fits this form with R as the identity matrix, d as the zero vector, and c as 0.
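To make this mapping concrete, here is a minimal sketch (not the lecture's code) that writes the hard-margin primal as a generic QP and hands it to an off-the-shelf QP solver. It assumes the cvxopt package is available; the tiny ridge on the bias entry of R is only a numerical convenience of this sketch, not part of the formulation above.

```python
# A minimal sketch: the hard-margin linear SVM primal written as the generic QP
#   min_u (1/2) u^T R u + d^T u   s.t.   G u <= h
# and solved with cvxopt (assumed installed). u stacks [w_1, ..., w_p, b].
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

def hard_margin_svm(X, y):
    """X: (n, p) float data matrix, y: (n,) labels in {-1, +1} (floats)."""
    n, p = X.shape
    R = np.zeros((p + 1, p + 1))
    R[:p, :p] = np.eye(p)      # identity on w, zero on b ...
    R[p, p] = 1e-8             # ... plus a tiny ridge so the solver's factorization is happy
    d = np.zeros(p + 1)
    # y_i (w^T x_i + b) >= 1   <=>   -y_i [x_i, 1] u <= -1   (one row per sample)
    G = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    h = -np.ones(n)
    sol = solvers.qp(matrix(R), matrix(d), matrix(G), matrix(h))
    u = np.array(sol['x']).ravel()
    return u[:p], u[p]         # w, b

# Toy usage with linearly separable made-up points.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = hard_margin_svm(X, y)
print(w, b, 2.0 / np.sqrt(w @ w))   # learned parameters and margin width
```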

Today q Review of ClassificaKon q Support Vector Machine (SVM) ü Large Margin Linear Classifier ü Define Margin (M) in terms of model parameter ü OpKmizaKon to learn model parameters (w, b) ü Non linearly separable case ü OpKmizaKon with dual form 9/25/14 45 Non linearly separable case So far we assumed that a linear plane can perfectly separate the points But this is not usally the case - noise, outliers Hard to solve (two minimization problems) How can we convert this to a QP problem? - Minimize training errors? min w T w min #errors - Penalize training errors: min w T w+c*(#errors) Hard to encode in a QP problem 9/25/14 46 23

Non linearly separable case. Instead of minimizing the number of misclassified points we can minimize the distance between these points and their correct plane (the +1 plane or the -1 plane). The new optimization problem is:

min over w: (w^T w)/2 + C * sum_{i=1..n} ε_i
subject to the following inequality constraints:
- for all x_i in class +1: w^T x_i + b >= 1 - ε_i
- for all x_i in class -1: w^T x_i + b <= -1 + ε_i

Wait. Are we missing something?

Final optimization for the non linearly separable case:

min over w: (w^T w)/2 + C * sum_{i=1..n} ε_i
subject to the following inequality constraints:
- for all x_i in class +1: w^T x_i + b >= 1 - ε_i
- for all x_i in class -1: w^T x_i + b <= -1 + ε_i
(the class constraints: a total of n constraints)
- for all i: ε_i >= 0
(another n constraints)
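In practice one rarely writes this QP by hand. As an illustration (not part of the lecture), a library soft-margin linear SVM exposes exactly the trade-off parameter C from the objective above; this sketch assumes scikit-learn is installed and the data points are made up.

```python
# Soft-margin linear SVM in practice: the library's C parameter plays the role
# of the C weighting the slack penalties sum_i eps_i in the objective above.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0], [0.5, 0.5]])
y = np.array([1, 1, -1, -1, -1])        # last point sits close to the "wrong" side

clf = SVC(kernel="linear", C=1.0)       # smaller C tolerates more slack, larger C less
clf.fit(X, y)
print(clf.coef_, clf.intercept_)        # learned w and b
print(clf.support_vectors_)             # the support vectors
```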

Where we are. Two optimization problems, for the separable and non-separable cases.

Separable case:
min over w: (w^T w)/2
- for all x in class +1: w^T x + b >= 1
- for all x in class -1: w^T x + b <= -1

Non-separable case:
min over w: (w^T w)/2 + C * sum_{i=1..n} ε_i
- for all x_i in class +1: w^T x_i + b >= 1 - ε_i
- for all x_i in class -1: w^T x_i + b <= -1 + ε_i
- for all i: ε_i >= 0

Today:
- Review of Classification
- Support Vector Machine (SVM)
  - Large Margin Linear Classifier
  - Define Margin (M) in terms of model parameters
  - Optimization to learn model parameters (w, b)
  - Non linearly separable case
  - Optimization with dual form

Where we are. Two optimization problems, for the separable case (min (w^T w)/2 with the hard constraints above) and the non-separable case (min (w^T w)/2 + C * sum_i ε_i with the slack constraints above). Instead of solving these QPs directly we will solve a dual formulation of the SVM optimization problem. The main reason for switching to this type of representation is that it allows us to use a neat trick that will make our lives easier (and the run time faster).

Optimization review: constrained optimization with Lagrange. With equality constraints, i.e., optimize f(x) subject to g_i(x), the method of Lagrange multipliers converts this to a higher-dimensional problem: minimize f(x) + sum_{i=1..k} λ_i g_i(x) with respect to (x_1, ..., x_n; λ_1, ..., λ_k).

An alternative (dual) representation of the SVM QP. We will start with the linearly separable case. Instead of encoding the correct classification rule as a constraint, we will use Lagrange multipliers to encode it as part of our minimization problem. Why? The two class constraints (w^T x + b >= 1 for class +1, w^T x + b <= -1 for class -1) can be written compactly as a single family of constraints: min (w^T w)/2 subject to (w^T x_i + b) y_i >= 1.

Recall that Lagrange multipliers can be applied to turn the following problem:
  min_x x^2   s.t. x >= b
into
  min_x max_α x^2 - α(x - b)   s.t. α >= 0.
(In the figure: for x >= b the allowed minimum coincides with the constrained solution, while the unconstrained global minimum of x^2 sits at x = 0.)
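A tiny numeric check (not from the lecture) of this toy problem: for a given b > 0 the constrained minimizer of x^2 over x >= b is x = b, and a coarse grid search over the min-max form recovers roughly the same answer; the grid ranges and resolution are arbitrary choices of this sketch.

```python
# Numeric check of the toy Lagrangian example:
#   min_x x^2 s.t. x >= b   versus   min_x max_{alpha >= 0} x^2 - alpha*(x - b).
import numpy as np

b = 1.5
xs = np.linspace(-3, 3, 601)
alphas = np.linspace(0, 10, 501)

# For each x, the inner maximum over alpha >= 0 of x^2 - alpha*(x - b):
# it equals x^2 when x >= b and blows up (penalizes) when x < b.
inner_max = np.array([np.max(x**2 - alphas * (x - b)) for x in xs])
x_minmax = xs[np.argmin(inner_max)]

print(x_minmax)          # close to b = 1.5, the constrained minimizer
print(max(b, 0.0) ** 2)  # constrained optimum value b^2 when b > 0
```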

References. Big thanks to Prof. Ziv Bar-Joseph @ CMU for allowing me to reuse some of his slides. Prof. Andrew Moore @ CMU's slides. Elements of Statistical Learning, by Hastie, Tibshirani and Friedman.