CS6375: Machine Learning Gautam Kunapuli. Support Vector Machines


Example: Text Categorization. Develop a model to classify news stories into categories (e.g., sports, politics) based on their content. Use the bag-of-words representation for this data set: the text of each document in the corpus is represented as a bag (multiset) of its words. This representation captures the multiplicity/frequency of words and terms, but does not capture semantics, sentiment, grammar, or even word order. It yields a vector space model, where each document is simply represented as a vector of word statistics such as counts; more advanced statistics such as term-frequency/inverse document frequency (tf-idf) can also be used. [Figure: a corpus of documents, where each document is a row vector of word counts labeled with its category (sports, politics, ...).]
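As an illustrative sketch (not from the original slides), the bag-of-words and tf-idf representations can be computed with scikit-learn; the toy corpus and labels below are made up for the example.

```python
# Sketch: bag-of-words and tf-idf document vectors with scikit-learn.
# The toy corpus and labels are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the team won the final match",         # sports
    "parliament passed the new budget",     # politics
    "the senator debated the budget bill",  # politics
]
labels = ["sports", "politics", "politics"]

# Bag-of-words: each document becomes a vector of raw word counts.
counts = CountVectorizer().fit_transform(corpus)
print(counts.toarray())

# tf-idf: counts re-weighted by inverse document frequency.
tfidf = TfidfVectorizer().fit_transform(corpus)
print(tfidf.toarray().round(2))
```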

SVM for Linearly Separable Data. Problem: find a linear classifier f(x) = w·x + b such that sign(f(x)) = +1 when x is a positive example and sign(f(x)) = −1 when x is a negative example. The data set is linearly separable, that is, separable by a linear classifier (hyperplane). There exist many different such classifiers! Which one is the best? Prefer hyperplanes that achieve maximum separation between the two classes; the separation achieved by a classifier is called the margin of the classifier. Inductive bias: select the classifier with the largest margin. Linear separability is a simplifying assumption we make in order to derive a maximum-margin model; it assumes that there is no noise in the data set, and hence, the resulting model does not require a loss function. This is not a realistic assumption for real-world data sets.

Maximizing the Margin of a Classifier. The distance of a point x to the hyperplane w·x + α = 0 is |w·x + α| / ‖w‖, and likewise the distance of x to the hyperplane w·x + β = 0 is |w·x + β| / ‖w‖. Problem: find a linear classifier f(x) = w·x + b with the largest margin such that sign(f(x)) = +1 when x is a positive example and sign(f(x)) = −1 when x is a negative example. Let the margin be defined by two (parallel) hyperplanes w·x + α = 0 and w·x + β = 0. The margin (γ) of the classifier is the distance between the two hyperplanes that form the boundaries of the separation. Without loss of generality, we can set the two hyperplanes to w·x + b = +1 and w·x + b = −1 (why? rescaling w and b does not change the classifier), and the margin is γ = 2/‖w‖.

Maximizing the Margin of a Classifier. Problem: find a linear classifier f(x) = w·x + b with the largest margin such that sign(f(x)) = +1 when x is a positive example and sign(f(x)) = −1 when x is a negative example. Problem formulation: given a linearly separable data set (xᵢ, yᵢ), learn a linear classifier such that all the training examples with yᵢ = +1 lie on or above the margin, that is, w·xᵢ + b ≥ +1; all the training examples with yᵢ = −1 lie on or below the margin, that is, w·xᵢ + b ≤ −1; and the margin 2/‖w‖ is maximized. Note that maximizing 2/‖w‖ is equivalent to minimizing ‖w‖/2, which in turn is equivalent to minimizing ‖w‖²/2 = ½ w·w.
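As a worked restatement of this chain of equivalences (standard maximum-margin algebra, written out for clarity):

```latex
% Margin maximization rewritten as a minimization (standard derivation).
\max_{\mathbf{w},\, b} \; \frac{2}{\lVert \mathbf{w} \rVert}
\;\Longleftrightarrow\;
\min_{\mathbf{w},\, b} \; \frac{\lVert \mathbf{w} \rVert}{2}
\;\Longleftrightarrow\;
\min_{\mathbf{w},\, b} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
= \min_{\mathbf{w},\, b} \; \frac{1}{2}\,\mathbf{w}^{\top}\mathbf{w},
\qquad \text{subject to } y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1 \;\; \forall i.
```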

Hard-Margin Support Vector Machine. Problem: find a linear classifier f(x) = w·x + b with the largest margin such that sign(f(x)) = +1 when x is a positive example and sign(f(x)) = −1 when x is a negative example. Constrained optimization problem: minimize ½ ‖w‖² over (w, b) subject to yᵢ(w·xᵢ + b) ≥ 1 for all training examples. This model is called the hard-margin SVM as it is rigid and does not allow any flexibility for misclassifications by the model; it is only feasible when the data set is linearly separable. It is a convex, quadratic minimization problem called the primal problem, and is guaranteed to have a global minimum. The problem is no longer unconstrained; we need additional optimization tools to ensure feasibility of solutions (that is, that solutions satisfy the constraints). Further properties of the formulation can be studied by deriving the Lagrangian and the dual problem.

Hard-Margin SVM: Primal Problem. Primal problem for the hard-margin SVM: minimize ½ ‖w‖² over (w, b) subject to yᵢ(w·xᵢ + b) ≥ 1. For each constraint, which corresponds to a training example (xᵢ, yᵢ), we introduce a new variable called a dual variable or Lagrange multiplier, αᵢ ≥ 0. (The Lagrange multipliers will give us a mechanism to ensure feasibility, that is, ensure that the optimal solution (w and b) indeed achieves linear separation of the two classes.) The Lagrangian function for the hard-margin SVM is a function of the primal variables (w and b) and the dual variables (αᵢ); the Lagrangian function converts a constrained optimization problem into an unconstrained optimization problem. If we can find a minimization of the Lagrangian function with αᵢ ≥ 0, the resulting solution will also be the solution to our original constrained problem. For each training example (xᵢ, yᵢ), the following must hold at the optimal solution: yᵢ(w·xᵢ + b) ≥ 1 (primal feasibility), αᵢ ≥ 0 (dual feasibility), and αᵢ [yᵢ(w·xᵢ + b) − 1] = 0 (complementarity).
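For reference, the Lagrangian and the optimality (KKT) conditions in standard notation (a standard restatement, not taken verbatim from the slide):

```latex
% Lagrangian of the hard-margin SVM and its KKT conditions.
L(\mathbf{w}, b, \boldsymbol{\alpha})
  = \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
  - \sum_{i=1}^{n} \alpha_i \left[ y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) - 1 \right]

% At the optimum, for every training example (x_i, y_i):
y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1
    \quad \text{(primal feasibility)}
\alpha_i \ge 0
    \quad \text{(dual feasibility)}
\alpha_i \left[ y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) - 1 \right] = 0
    \quad \text{(complementarity)}
```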

Hard-Margin SVM: First-Order Conditions. Differentiate the Lagrangian with respect to the primal variables (w and b) and set the derivatives to zero: the condition on w shows that the classifier is a linear combination of the training examples (w = Σᵢ αᵢ yᵢ xᵢ), and the condition on b gives Σᵢ αᵢ yᵢ = 0. These are the first-order optimality conditions. We can now eliminate the primal variables by substituting the optimality conditions into the Lagrangian.
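Written out, these first-order conditions are (standard notation):

```latex
% Setting the gradients of the Lagrangian to zero.
\frac{\partial L}{\partial \mathbf{w}}
  = \mathbf{w} - \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i = \mathbf{0}
  \;\;\Longrightarrow\;\;
  \mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i

\frac{\partial L}{\partial b}
  = -\sum_{i=1}^{n} \alpha_i y_i = 0
  \;\;\Longrightarrow\;\;
  \sum_{i=1}^{n} \alpha_i y_i = 0
```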

Hard-Margin SVM: Dual Problem. The dual problem depends only on the dual variables (αᵢ) and the inner products (xᵢ·xⱼ) between each pair of training examples. Why bother with the dual? Both the primal and dual problems are convex (quadratic) optimization problems, and there is no duality gap (that is, the primal and dual optimal values are exactly the same). The dual has fewer (and simpler) constraints, so it is easier to solve. The dual solution is sparse, so it is easier to represent.
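Substituting the first-order conditions back into the Lagrangian yields the standard dual form (written out for completeness):

```latex
% Dual of the hard-margin SVM.
\max_{\boldsymbol{\alpha}} \;
  \sum_{i=1}^{n} \alpha_i
  - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
      \alpha_i \alpha_j \, y_i y_j \, (\mathbf{x}_i^{\top}\mathbf{x}_j)
\qquad \text{subject to} \quad
  \sum_{i=1}^{n} \alpha_i y_i = 0, \qquad \alpha_i \ge 0 \;\; \forall i.
```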

Hard-Margin SVM: Support Vectors. Recall that for each training example (xᵢ, yᵢ), the following must hold for the final classifier (optimal solution): yᵢ(w·xᵢ + b) ≥ 1 (primal feasibility), αᵢ ≥ 0 (dual feasibility), and αᵢ [yᵢ(w·xᵢ + b) − 1] = 0 (complementarity). Case 1a: αᵢ = 0 and yᵢ(w·xᵢ + b) > 1 (the training example is not on the margin). Case 1b: αᵢ = 0 and yᵢ(w·xᵢ + b) > 1 (the training example is not on the margin; the other class in the figure). Case 2: αᵢ > 0 and yᵢ(w·xᵢ + b) = 1 (the training example is on the margin). Case 3: αᵢ = 0 and yᵢ(w·xᵢ + b) = 1 (degenerate case; not shown in figure).

Hard-Margin SVM: Support Vectors. Recall that for each training example (xᵢ, yᵢ), primal feasibility, dual feasibility, and complementarity must hold. Recall also the first-order condition that showed that the classifier is a linear combination of the training data: w = Σᵢ αᵢ yᵢ xᵢ. The training examples with αᵢ > 0 (non-zero) are called the support vectors because they support the classifier. All other training examples have αᵢ = 0, and this makes the solution sparse: typically only a small number of training examples will have αᵢ > 0 and a large number will have αᵢ = 0. The optimal classifier is a sparse linear combination of the training examples, that is, the classifier depends only on the support vectors. This means that if we removed all training examples except the support vectors, the solution would remain unchanged.
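A minimal sketch of this sparsity in practice, using scikit-learn's SVC with a very large C to approximate the hard-margin SVM on made-up separable data (the data set and parameter values are illustrative assumptions):

```python
# Sketch: inspect the support vectors of an (approximately) hard-margin SVM.
# The toy data and the large C value are assumptions for illustration.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2, 2], scale=0.5, size=(50, 2))    # positive class
X_neg = rng.normal(loc=[-2, -2], scale=0.5, size=(50, 2))  # negative class
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 50 + [-1] * 50)

# A very large C approximates the hard-margin SVM (essentially no slack allowed).
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("number of training examples:", len(X))
print("number of support vectors:  ", len(clf.support_vectors_))
print("dual coefficients (alpha_i * y_i):", clf.dual_coef_)
```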

Soft-Margin SVM: Loss Function. So far, we assumed that the data is linearly separable, which is not valid in real-world applications. Problem: find a linear classifier f(x) = w·x + b with the largest margin such that sign(f(x)) = +1 when x is a positive example, sign(f(x)) = −1 when x is a negative example, and misclassifications are minimized. Correctly classified points must not be penalized; misclassified points inside the margin, or on the wrong side of the margin, must be penalized. Measure the misclassification error for each training example and penalize each violation by its size, using the hinge loss (contrast with the loss function of the Perceptron).
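A small sketch of the hinge loss computation (NumPy; the margin values below are made up for the example):

```python
# Sketch: hinge loss max(0, 1 - y * f(x)) for a few made-up margin values.
import numpy as np

# y_i * f(x_i) for four hypothetical points: safely correct, on the margin,
# inside the margin, and on the wrong side of the hyperplane.
margins = np.array([2.5, 1.0, 0.4, -1.2])

hinge = np.maximum(0.0, 1.0 - margins)
print(hinge)  # [0.  0.  0.6 2.2] -> only margin violations are penalized
```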

Soft-Margin SVM: Formulation. Problem: find a linear classifier f(x) = w·x + b with the largest margin such that sign(f(x)) = +1 when x is a positive example, sign(f(x)) = −1 when x is a negative example, and misclassifications are minimized. Maximize the margin by minimizing ½ ‖w‖² (contrast with the regularization function of ridge regression), and penalize each misclassification by the size of the violation, using the hinge loss (contrast with the loss function of the Perceptron): misclassified points inside the margin, or on the wrong side of the margin, are penalized. The regularization parameter (C > 0), sometimes also denoted λ, trades off between margin maximization and loss minimization.

Soft-Margin SVM: Primal Problem. Problem: find a linear classifier f(x) = w·x + b with the largest margin such that sign(f(x)) = +1 when x is a positive example, sign(f(x)) = −1 when x is a negative example, and misclassifications are minimized; misclassified points inside the margin, or on the wrong side of the margin, must be penalized. This model is called the soft-margin SVM as it softens the classification constraints with slack variables (ξᵢ) and allows flexibility for misclassifications by the model.
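In standard notation, the soft-margin primal reads (a standard restatement, not copied verbatim from the slide):

```latex
% Soft-margin SVM primal with slack variables xi_i and regularization parameter C.
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;
  \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \quad
  y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 \;\; \forall i.
```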

Soft-Margin SVM: Dual Problem. The only difference between the soft-margin and hard-margin SVM dual problems is that the Lagrange multipliers (the training example weights) are upper-bounded by the regularization parameter: 0 ≤ αᵢ ≤ C, whereas the hard-margin SVM dual only requires αᵢ ≥ 0.

Soft-Margin SVM: Dual Problem. The regularization constant C is set by the user; this parameter trades off between the regularization term (bias) and the loss term (variance). The dual solution depends only on the inner products of the training data; this is an important property that allows us to extend linear SVMs to learn nonlinear classification functions without explicit transformation.

Nonlinear SVM Classifiers. Solution Approach 1 (Explicit Transformation): transform the data to a higher-dimensional feature space; train a linear SVM classifier in the higher-dimensional space; transform the learned high-dimensional linear classifier back to the original space to obtain a nonlinear classifier.

Nonlinear SVM Classifiers. Solution Approach 1 (Explicit Transformation): transform the data to a higher-dimensional feature space; train a linear SVM classifier in the high-dimensional space; transform the high-dimensional linear classifier back to the original space to obtain a nonlinear classifier. The drawback: as the number of training features grows, the size of the transformation grows very fast, so explicit transformations can become very expensive. A sketch of this approach appears below.
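A minimal sketch of the explicit-transformation approach with scikit-learn, assuming a degree-2 polynomial feature map and made-up nonlinearly separable data:

```python
# Sketch: explicit feature transformation followed by a linear SVM.
# The dataset and the degree-2 polynomial map are assumptions for illustration.
from sklearn.datasets import make_circles
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Map x = (x1, x2) to (1, x1, x2, x1^2, x1*x2, x2^2), then train a linear SVM.
model = make_pipeline(PolynomialFeatures(degree=2),
                      LinearSVC(C=1.0, max_iter=10000))
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```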

The Kernel Trick. When learning a linear SVM, recall that the dual solution depends only on the inner products of the training data, so in the higher-dimensional space we only need to compute inner products φ(x)·φ(z). A function κ(x, z) = φ(x)·φ(z) that computes these directly is an example of a kernel function. The kernel function relates the inner products in the original and transformed spaces; with a kernel, we can avoid explicit transformation.
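A small numerical sketch of this identity for the homogeneous quadratic kernel κ(x, z) = (x·z)², whose feature map for 2-D inputs is φ(x) = (x1², √2·x1·x2, x2²); the kernel choice is an illustrative assumption, not necessarily the one shown in the slide's figure:

```python
# Sketch: the quadratic kernel (x.z)^2 equals the inner product of explicit
# degree-2 feature maps phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) for 2-D inputs.
import numpy as np

def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = np.dot(x, z) ** 2       # kernel evaluated in the original space
rhs = np.dot(phi(x), phi(z))  # inner product in the transformed space
print(lhs, rhs)               # both equal 1.0
```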

Kernel SVMs. Solution Approach 2 (Kernel SVMs): instead of the inner products xᵢ·xⱼ, compute the kernel function κ(xᵢ, xⱼ) = φ(xᵢ)·φ(xⱼ) between all pairs of training examples, then use the formulation and algorithm of the linear SVM directly, simply replacing the inner-product (Gram) matrix with the kernel matrix.
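A minimal sketch of replacing the inner-product matrix with a kernel matrix, using scikit-learn's precomputed-kernel interface and an RBF kernel (the data set and kernel parameter are illustrative assumptions):

```python
# Sketch: train an SVM on a precomputed kernel (Gram) matrix.
# The dataset and the RBF bandwidth gamma are assumptions for illustration.
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

K = rbf_kernel(X, X, gamma=1.0)             # n x n kernel matrix
clf = SVC(kernel="precomputed", C=1.0).fit(K, y)

# Predicting on new points requires their kernel values against the training set.
K_test = rbf_kernel(X[:5], X, gamma=1.0)
print(clf.predict(K_test), y[:5])
```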

Examples of Kernel Functions. [Figure: examples of decision boundaries produced by different kernel functions.]

Some Popular Kernels. Polynomial kernel: κ(x, z) = (x·z + c)^d. Gaussian (RBF) kernel: κ(x, z) = exp(−‖x − z‖² / 2σ²).
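A brief sketch of evaluating these two kernels with scikit-learn's pairwise utilities (the parameter values are arbitrary illustrative choices):

```python
# Sketch: evaluate polynomial and Gaussian (RBF) kernels on a small matrix.
# The degree, gamma, and coef0 values are arbitrary illustrative choices.
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])

K_poly = polynomial_kernel(X, degree=3, gamma=1.0, coef0=1.0)  # (x.z + 1)^3
K_rbf = rbf_kernel(X, gamma=0.5)                               # exp(-0.5 * ||x - z||^2)
print(K_poly.round(2))
print(K_rbf.round(2))
```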

Overfitting with Kernels. Observe that the Gaussian kernel can be written as exp(−‖x − z‖²/2σ²) = exp(−(‖x‖² − 2 x·z + ‖z‖²)/2σ²) = exp(−‖x‖²/2σ²) · exp(−‖z‖²/2σ²) · exp(x·z/σ²). Using the Taylor expansion for exp(x·z/σ²), we see that Gaussian kernels can represent polynomials of every degree, which means they can overfit, especially when feature spaces are large. Margin maximization (regularization) helps learn robust models; selection of the kernel parameter (σ) is also critical.

Regularization and Overfitting. The regularization parameter, C, is chosen a priori and defines the relative trade-off between the norm (bias/complexity) and the loss (error/variance): we want to find classifiers that minimize (regularization + C · loss). Regularization introduces an inductive bias over solutions, controls the complexity of the solution, and imposes a smoothness restriction on solutions. As C increases, the effect of the regularization decreases and the SVM tends to overfit the data.

The Effect of C on Classification. [Figure: decision boundaries for C = 0.001, 0.01, 0.1, 1, 10, 100; small values of C give low complexity and high training error, large values of C give high complexity and low training error.]
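A hedged sketch of how such a sweep over C might be run in practice with cross-validation (the data set, kernel, and grid of C values are illustrative assumptions):

```python
# Sketch: sweep the regularization parameter C with cross-validation.
# The dataset, kernel choice, and grid of C values are illustrative assumptions.
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

grid = GridSearchCV(
    SVC(kernel="rbf", gamma=1.0),
    param_grid={"C": [0.001, 0.01, 0.1, 1, 10, 100]},
    cv=5,
)
grid.fit(X, y)
print("best C:", grid.best_params_["C"])
print("cross-validated accuracy:", round(grid.best_score_, 3))
```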

SVM Modeling Choices.

SVM Implementations Over The Years. Earliest solution approaches: quadratic programming solvers (CPLEX, LOQO, Matlab QP, SeDuMi). Decomposition methods: SVM chunking (Osuna et al., 1997); SVMlight (Joachims, 1999). Sequential Minimal Optimization (Platt, 1999); implementation: LIBSVM (Chang et al., 2000). Interior point methods (Munson and Ferris, 2006); successive over-relaxation (Mangasarian, 2004). Coordinate descent algorithms (Keerthi et al., 2009); bundle methods (Teo et al., 2010). Present: scikit-learn's stochastic gradient descent can learn linear SVMs; it also has a dedicated SVM package that can handle binary and multi-class classification, regression, one-class classification, and kernels.
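A brief sketch of the two scikit-learn routes mentioned above (the model settings are illustrative):

```python
# Sketch: two ways to train SVMs in scikit-learn.
# SGDClassifier with hinge loss fits a linear SVM by stochastic gradient descent;
# SVC (built on LIBSVM) supports kernels. Settings here are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

linear_svm = SGDClassifier(loss="hinge", alpha=1e-4, random_state=0).fit(X, y)
kernel_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("linear SVM (SGD) accuracy:", round(linear_svm.score(X, y), 3))
print("kernel SVM (SVC) accuracy:", round(kernel_svm.score(X, y), 3))
```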
