Kernel Methods and SVMs Extension

The purpose of this document is to review material covered in Machine Learning 1 Supervised Learning regarding support vector machines (SVMs). This document also provides a general overview of some extensions to what was described in the course, including non-binary classification and support vector regression.

We will introduce the concept of SVMs using the simplest case for application. Consider a scenario where we have data that must be classified into two different groups. If the data are linearly separable, or in other words, can be separated completely into their groups by a dividing hyperplane, then our goal is to find the equation of the hyperplane that best divides the groups.

To be more formal with the problem description, we label the classes for each of the data points x_i as being 1 or -1, i.e. y_i ∈ {-1, 1}. Our hyperplane function has the equation w^T x + b and is defined such that

w^T x_i + b ≥ 1 for all points that have a class y_i = 1, and
w^T x_i + b ≤ -1 for points with a class y_i = -1.

In our training data, we should have no points in between the hyperplanes w^T x + b = 1 and w^T x + b = -1, a region called the margin. The dividing plane is the function w^T x + b = 0, and we classify new points by their sign: ŷ = sgn(w^T x + b).
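To make the decision rule concrete, here is a minimal sketch in Python with NumPy (the document itself prescribes no language); the weight vector and bias below are hypothetical values chosen for illustration, not fitted parameters.

```python
import numpy as np

# Hypothetical learned parameters for a 2-D example (not from the text).
w = np.array([2.0, -1.0])   # normal vector of the dividing hyperplane
b = -0.5                    # bias/offset term

def classify(X, w, b):
    """Label points by which side of the hyperplane w^T x + b = 0 they fall on."""
    scores = X @ w + b              # signed score; sign gives the class
    return np.where(scores >= 0, 1, -1)

X_new = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
print(classify(X_new, w, b))        # [ 1 -1 ]
```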

(Figure: the separable data with the dividing hyperplane w^T x + b = 0 and its margin hyperplanes. Note that w does not look perpendicular due to the difference in x- and y-axis scaling.)

There are many choices of our parameter vector w that allow us to separate the data, but some are clearly better than others. Ideally, we want to select parameters for the hyperplane that maximize the size of the margin. Consider two points that lie on opposite margins, x_+ and x_-, that are as close as possible to one another. In this case, the vector connecting these two points will be perpendicular to the hyperplanes defining the margin. These points satisfy

w^T x_+ + b = 1 and w^T x_- + b = -1,

and subtracting the two equations generates w^T (x_+ - x_-) = 2. Since the vectors w and (x_+ - x_-) are parallel, ||w|| · ||x_+ - x_-|| = 2, where ||v|| is the magnitude/length of a vector v. Dividing by ||w|| on both sides gives the distance between the hyperplanes, ||x_+ - x_-||, which is equal to 2 / ||w||.

From here, we observe that maximizing the size of the margin is equivalent to finding the minimum ||w|| that maintains the relationship y_i (w^T x_i + b) ≥ 1 for all points in the training data. (Recall that w^T x_i + b ≥ 1 when y_i = 1 and w^T x_i + b ≤ -1 when y_i = -1.) We approach solving the problem by noting that minimizing ||w|| is equivalent to minimizing ½ ||w||², converting the problem into a quadratic programming optimization problem.
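As a quick numerical check of the margin geometry (a sketch with made-up values of w and b, not taken from the text), the distance between the hyperplanes w^T x + b = ±1 comes out to 2 / ||w||:

```python
import numpy as np

# Hypothetical separator for illustration (not derived from a dataset).
w = np.array([3.0, 4.0])    # ||w|| = 5
b = 1.0

margin_width = 2.0 / np.linalg.norm(w)   # analytic width between the two margin hyperplanes

# Numerical check: start on the margin w^T x + b = -1 and step to the other
# margin along the unit normal w / ||w||; the step length equals the width.
x_minus = np.array([0.0, -0.5])                 # satisfies w^T x + b = -1
x_plus = x_minus + margin_width * w / np.linalg.norm(w)
print(margin_width, w @ x_plus + b)             # 0.4 and 1.0 (lands on the +1 margin)
```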

The Lagrange multipliers α_i transform our optimization problem into one of maximizing the output of

W(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j x_i^T x_j

while satisfying the constraints that all α_i ≥ 0 and Σ_i α_i y_i = 0. To provide some context for interpreting this, think of the multipliers α_i as weights on the data points. From the constraint Σ_i α_i y_i = 0, the sum of the weights on the points categorized as y = 1 should be equal to the sum on those categorized as y = -1. As for W(α), the second term keeps the summed weights in the first term from getting too large. The second term takes into account the categories of each pair of points (y_i y_j = 1 if they are in the same class, -1 if they differ) and a measure of similarity (evoked by x_i^T x_j).

When we obtain the optimal Lagrange multipliers, it turns out that most of the weights α_i are equal to zero. The points that have non-zero weight are the only points that contribute to the calculation of w, and all in fact fall on the margin, satisfying y_i (w^T x_i + b) = 1. These points are the support vectors for the model. We obtain the parameter values for our dividing hyperplane from

w = Σ_i α_i y_i x_i and b = y_i − w^T x_i

for some point x_i that lies on the margin.
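As a small worked illustration (a two-point dataset chosen by hand, not an example from the course), the primal parameters can be recovered from the optimal multipliers via w = Σ_i α_i y_i x_i and b = y_i − w^T x_i:

```python
import numpy as np

# Tiny hand-picked example: one point per class, both of which become support vectors.
X = np.array([[1.0, 1.0],
              [-1.0, -1.0]])
y = np.array([1.0, -1.0])

# For this symmetric pair the dual optimum can be found by hand:
# alpha_1 = alpha_2 = 1/4 satisfies sum_i alpha_i y_i = 0 and maximizes W(alpha) = 2a - 4a^2.
alpha = np.array([0.25, 0.25])

# Recover the primal parameters from the support vectors.
w = (alpha * y) @ X                      # w = sum_i alpha_i y_i x_i
b = y[0] - w @ X[0]                      # b = y_i - w^T x_i for a margin point

# Every support vector satisfies y_i (w^T x_i + b) = 1.
print(w, b, y * (X @ w + b))             # [0.5 0.5] 0.0 [1. 1.]
```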

The above describes the general process for computing SVMs for linearly separable data, but real-life datasets do not normally allow themselves to be divided so easily. Here, we discuss two ways to deal with non-linearly separable datasets and move beyond hard margin SVMs. If we have data that is mostly linearly separable, we can consider using soft margin SVMs, relaxing the criterion that all points are correctly classified. If we have data that is separable in a nonlinear fashion, we can consider using kernel functions to capture a nonlinear dividing curve between classes. Typically, we choose both a kernel function and a value of the soft margin parameter to perform classification tasks.

In a soft margin SVM, we do not require the data to be completely linearly separable and allow for some points to be classified incorrectly. We provide for each point a non-negative slack variable ξ_i that illustrates to what degree each point is misclassified:

y_i (w^T x_i + b) ≥ 1 − ξ_i.

If a point is classified correctly on its side of the margin, then ξ_i = 0. If a point gets placed within the margin or in the wrong class's region, then ξ_i takes on a positive value proportional to the point's distance from its desired marginal hyperplane. Our optimization problem now has to balance the size of the errors we make: our goal is to minimize

½ ||w||² + C Σ_i ξ_i,

where C is a regularization parameter that tells us the weight we want to put on misclassification errors. With smaller values of C, we punish errors less, thus increasing the size of the margin. Larger values of C result in narrower margins; in the limit as C tends towards infinity, any misclassification error is punished to such an extent that we effectively recover our original hard margin SVM.

When we convert the optimization problem into the form of maximizing the output of

W(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j x_i^T x_j,

the constraint that Σ_i α_i y_i = 0 remains the same, while the other constraint now has an upper bound: 0 ≤ α_i ≤ C. With the soft margin SVM, our support vectors (points that have weight α_i > 0) include not just points on the marginal hyperplanes, but also those points that are within the margin or are misclassified.
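To see the effect of C in practice, here is a hedged sketch using scikit-learn's SVC with a linear kernel (the library choice and the synthetic data are assumptions, not part of the original material); smaller C tolerates more violations and widens the margin, while large C approaches the hard margin behaviour.

```python
import numpy as np
from sklearn.svm import SVC   # scikit-learn is an assumption; the text names no library

rng = np.random.default_rng(0)
# Two noisy, mostly separable clusters (synthetic data for illustration only).
X = np.vstack([rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(50, 2)),
               rng.normal(loc=[2.0, 0.0], scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    # Smaller C punishes errors less, giving a wider margin (2 / ||w||) and
    # typically more support vectors; large C approaches the hard margin SVM.
    print(f"C={C:>6}: margin width = {2 / np.linalg.norm(w):.3f}, "
          f"#support vectors = {len(clf.support_vectors_)}")
```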

For data that is separable, but not linearly, we can use a kernel function to capture a nonlinear dividing curve. The kernel function should capture some aspect of similarity in our data; it also represents domain knowledge regarding the structure of the data. In general, we can write the function we want to maximize as

W(α) = Σ_i α_i − ½ Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j).

In our original, linear SVM, our kernel function was k(x_i, x_j) = x_i^T x_j and suggested a dividing hyperplane. The kernel function k(x_i, x_j) = (x_i^T x_j)² generates a dividing hypersphere, while k(x_i, x_j) = (x_i^T x_j + c)^d is the general form for polynomial kernels. With kernel functions, we can project the data into a transformed space where a dividing hyperplane can be found, which, when plotted in the original feature space, ends up being a nonlinear dividing curve. In order to compute the class of a new instance x, we now use the sign of the output

Σ_i α_i y_i k(x_i, x) + b.

Since most of the weights are equal to zero, this is still a fairly quick computation compared to the linear case.

It is important to note that the natural task for SVMs lies in binary classification. For classification tasks involving more than two groups, a common strategy is to use multiple binary classifiers to decide on a single best class for new instances. For example, we may create one classifier for each class in a one-versus-all fashion and then, for new points, classify them based on the classifier function that produces the largest value.
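The kernelized decision rule can be written directly as a sum over the support vectors. The following sketch uses a polynomial kernel and placeholder support vectors, weights, and bias (none of which come from the text) to show the computation sgn(Σ_i α_i y_i k(x_i, x) + b):

```python
import numpy as np

def poly_kernel(a, b, c=1.0, d=2):
    """Polynomial kernel k(a, b) = (a^T b + c)^d."""
    return (a @ b + c) ** d

def predict(x, support_X, support_y, support_alpha, b, kernel):
    """sgn( sum_i alpha_i y_i k(x_i, x) + b ), summed over support vectors only."""
    score = sum(a_i * y_i * kernel(x_i, x)
                for x_i, y_i, a_i in zip(support_X, support_y, support_alpha))
    return int(np.sign(score + b))

# Hypothetical support vectors, weights, and bias (placeholders, not fitted values).
support_X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
support_y = np.array([1, 1, -1])
support_alpha = np.array([0.3, 0.3, 0.6])
b = -0.2

print(predict(np.array([0.5, 0.5]), support_X, support_y, support_alpha, b, poly_kernel))  # 1
```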

Alternatively, we can set up classifiers for all pairwise comparisons and select the class that wins the most pairwise matchups for new points.

(Figure depicts the pairwise matchups approach. Gray lines indicate where a binary classifier has no effect. Note the central area where no class has dominance.)

We can also extend SVMs to regression tasks, or support vector regression (SVR). As with SVMs, we project the data in an SVR task using a kernel function so that they can be fit by a hyperplane. Instead of dividing the data into classes, however, the hyperplane now provides an estimate of the data's output value. In addition, the margin and error are treated differently. A parameter ε is specified such that small deviations from the regression hyperplane do not contribute to error costs, i.e. when we attempt to minimize ½ ||w||² + C Σ_i ξ_i, we set ξ_i = 0 when a point lies within the margin. Non-zero slack variable values are instead the (linear) distance beyond the ε region that a point lies. Compare this to the quadratic error function found in standard linear regression tasks, where all deviations from the estimate count against the function's fit and errors are penalized by the quadratic difference from the estimate.
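The contrast between the two error treatments can be made concrete with a short sketch (the ε value and residuals below are arbitrary illustrations): deviations inside the ε-tube cost nothing under the SVR-style loss, while the squared loss charges for every deviation.

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.5):
    """SVR-style loss: zero inside the eps-tube, linear beyond it."""
    return np.maximum(np.abs(y_true - y_pred) - eps, 0.0)

def squared_loss(y_true, y_pred):
    """Ordinary least-squares loss: every deviation is penalized quadratically."""
    return (y_true - y_pred) ** 2

# Residuals of a few hypothetical predictions around the regression hyperplane.
residuals = np.array([0.1, 0.4, 0.6, 2.0])
y_true = np.zeros_like(residuals)
print(eps_insensitive_loss(y_true, residuals))  # [0.  0.  0.1 1.5]  -- small errors are free
print(squared_loss(y_true, residuals))          # [0.01 0.16 0.36 4.]  -- all errors count
```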

Essentially, however, SVR operates in much the same way as SVM does. For each point in the training data, we instead have two slack variables, ξ_i and ξ_i*, one for positive deviations and one for negative deviations from the regression hyperplane. This results in two Lagrangian multipliers associated with each point, 0 ≤ α_i, α_i* ≤ C, and a respecified constraint on the weight values, Σ_i (α_i − α_i*) = 0. When solved, the regression function takes the form

f(x) = Σ_i (α_i − α_i*) k(x_i, x) + b.

As before, most of the weights take a value of zero, and for points with non-zero weights, at most one of α_i, α_i* will be non-zero.
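As a closing sketch (again leaning on scikit-learn, which is an assumption rather than part of the original notes), the fitted SVR prediction can be reproduced by hand from the stored coefficients, which play the role of (α_i − α_i*) in the regression function above:

```python
import numpy as np
from sklearn.svm import SVR   # scikit-learn is an assumption; the text names no library
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(40, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)   # noisy synthetic target

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1, gamma=0.5).fit(X, y)

# Reconstruct predictions by hand: f(x) = sum_i (alpha_i - alpha_i*) k(x_i, x) + b,
# where sklearn stores (alpha_i - alpha_i*) for the support vectors in dual_coef_.
K = rbf_kernel(svr.support_vectors_, X, gamma=0.5)
manual = svr.dual_coef_ @ K + svr.intercept_
print(np.allclose(manual.ravel(), svr.predict(X)))       # True
```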