6.867 Machine learning, lecture 8 (Jaakkola)

Lecture topics: support vector machine and kernels; kernel optimization and selection.

Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare (http://ocw.mit.edu/), Massachusetts Institute of Technology. Downloaded on [DD Month YYYY].

Support vector machine revisited

Our task here is first to turn the support vector machine into its dual form, in which the examples appear only through inner products. To this end, assume we have mapped the examples into feature vectors φ(x) of dimension d and that the resulting training set (φ(x_1), y_1), ..., (φ(x_n), y_n) is linearly separable. Finding the maximum margin linear separator in the feature space now corresponds to solving

\[
\min_{\theta, \theta_0} \ \|\theta\|^2 / 2 \quad \text{subject to} \quad y_t(\theta^T \phi(x_t) + \theta_0) \ge 1, \quad t = 1, \ldots, n \tag{1}
\]

We will discuss later on how slack variables affect the resulting kernel (dual) form; they merely complicate the derivation without changing the procedure.

Optimization problems of the above type (convex objective, linear constraints) can be turned into their dual form by means of Lagrange multipliers. Specifically, we introduce a non-negative scalar parameter α_t for each inequality constraint and cast the estimation problem in terms of θ, θ_0, and α = {α_1, ..., α_n}:

\[
J(\theta, \theta_0; \alpha) = \|\theta\|^2 / 2 - \sum_{t=1}^{n} \alpha_t \left[ y_t(\theta^T \phi(x_t) + \theta_0) - 1 \right] \tag{2}
\]

The original minimization problem for θ and θ_0 is recovered by maximizing J(θ, θ_0; α) with respect to α. In other words,

\[
J(\theta, \theta_0) = \max_{\alpha \ge 0} J(\theta, \theta_0; \alpha) \tag{3}
\]

where α ≥ 0 means that all the components α_t are non-negative. Let's first try to see that J(θ, θ_0) really is equivalent to the original problem. Suppose we set θ and θ_0 such that at least one of the constraints, say the one corresponding to (x_i, y_i), is violated. In that case

\[
-\alpha_i \left[ y_i(\theta^T \phi(x_i) + \theta_0) - 1 \right] > 0 \tag{4}
\]

for any α_i > 0. We can then set α_i = ∞ to obtain J(θ, θ_0) = ∞. You can think of the Lagrange multipliers as playing an adversarial role that enforces the margin constraints.
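
As a quick sanity check of Eq. (2) and of the adversarial role just described, here is a minimal numeric sketch (assuming NumPy and a tiny made-up dataset; none of the names or numbers below come from the lecture): it evaluates J(θ, θ_0; α) and shows the value growing without bound as the adversary raises α_i on a violated constraint.

```python
import numpy as np

def lagrangian(theta, theta0, alpha, Phi, y):
    """J(theta, theta0; alpha) from Eq. (2):
    ||theta||^2 / 2 - sum_t alpha_t [ y_t (theta^T phi(x_t) + theta0) - 1 ]."""
    margins = y * (Phi @ theta + theta0)           # y_t (theta^T phi(x_t) + theta0)
    return 0.5 * theta @ theta - alpha @ (margins - 1.0)

# Made-up toy data: phi(x) is simply x here (identity feature map).
Phi = np.array([[2.0, 2.0], [-2.0, -1.0], [0.2, 0.1]])
y = np.array([+1.0, -1.0, -1.0])
theta, theta0 = np.array([1.0, 0.0]), 0.0          # violates the constraint on the third point

for a in [0.0, 1.0, 10.0, 100.0]:
    alpha = np.array([0.0, 0.0, a])                # raise alpha only on the violated constraint
    print(a, lagrangian(theta, theta0, alpha, Phi, y))   # J increases without bound
```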

More formally,

\[
J(\theta, \theta_0) = \begin{cases} \|\theta\|^2 / 2 & \text{if } y_t(\theta^T \phi(x_t) + \theta_0) \ge 1, \ t = 1, \ldots, n \\ \infty & \text{otherwise} \end{cases} \tag{5}
\]

The minimizing θ and θ_0 are therefore those that satisfy the constraints. On the basis of a general set of criteria governing optimality when dealing with Lagrange multipliers, criteria known as Slater's conditions, we can actually switch the maximization over α and the minimization over {θ, θ_0} and get the same answer:

\[
\min_{\theta, \theta_0} \max_{\alpha \ge 0} J(\theta, \theta_0; \alpha) = \max_{\alpha \ge 0} \min_{\theta, \theta_0} J(\theta, \theta_0; \alpha) \tag{6}
\]

The left hand side, equivalent to minimizing Eq. (5), is known as the primal form, while the right hand side is the dual form. Let's solve the right hand side by first obtaining θ and θ_0 as a function of the Lagrange multipliers (and the data). To this end,

\[
\frac{d}{d\theta_0} J(\theta, \theta_0; \alpha) = -\sum_{t=1}^{n} \alpha_t y_t = 0 \tag{7}
\]

\[
\frac{d}{d\theta} J(\theta, \theta_0; \alpha) = \theta - \sum_{t=1}^{n} \alpha_t y_t \phi(x_t) = 0 \tag{8}
\]

So, again, the solution for θ lies in the span of the feature vectors corresponding to the training examples. Substituting this form of the solution for θ back into the objective, and taking into account the constraint corresponding to the optimal θ_0, we get

\[
J(\alpha) = \min_{\theta, \theta_0} J(\theta, \theta_0; \alpha) \tag{9}
\]

\[
= \begin{cases} \sum_{t=1}^{n} \alpha_t - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, [\phi(x_i)^T \phi(x_j)] & \text{if } \sum_{t=1}^{n} \alpha_t y_t = 0 \\ -\infty & \text{otherwise} \end{cases} \tag{10}
\]

The dual form of the solution is therefore obtained by maximizing

\[
\sum_{t=1}^{n} \alpha_t - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, [\phi(x_i)^T \phi(x_j)] \tag{11}
\]

subject to

\[
\alpha_t \ge 0, \quad \sum_{t=1}^{n} \alpha_t y_t = 0 \tag{12}
\]

This is the dual or kernel form of the support vector machine, and it is also a quadratic optimization problem. The constraints are simpler, however.
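
For concreteness, the dual in Eqs. (11)-(12) can be handed to any off-the-shelf quadratic programming solver. Below is a minimal sketch under my own assumptions (the choice of the cvxopt QP solver and the function name are not from the lecture); it takes the Gram matrix K of Eq. (13) below and the labels y, and returns the optimizing multipliers α̂.

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_hard_margin(K, y):
    """Solve the hard-margin dual of Eqs. (11)-(12):
       max  sum_t alpha_t - 1/2 sum_{i,j} alpha_i alpha_j y_i y_j K_ij
       s.t. alpha_t >= 0 and sum_t alpha_t y_t = 0.
    K: (n, n) Gram matrix; y: (n,) labels in {-1, +1}."""
    n = len(y)
    P = matrix(np.outer(y, y) * K)               # we minimize the negated objective
    q = matrix(-np.ones(n))
    G = matrix(-np.eye(n))                       # -alpha_t <= 0, i.e. alpha_t >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float))   # equality constraint: y^T alpha = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])                    # alpha_hat
```

Note that this hard-margin version assumes the Gram matrix comes from a linearly separable training set; the box-constrained variant of Eq. (22), sketched further below, stays bounded even when it does not.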

Moreover, the dimension of the input vectors does not appear explicitly as part of the optimization problem. It is formulated solely on the basis of the Gram matrix

\[
K = \begin{bmatrix} \phi(x_1)^T \phi(x_1) & \cdots & \phi(x_1)^T \phi(x_n) \\ \vdots & \ddots & \vdots \\ \phi(x_n)^T \phi(x_1) & \cdots & \phi(x_n)^T \phi(x_n) \end{bmatrix} \tag{13}
\]

We have already seen that the maximum margin hyperplane can be constructed on the basis of only a subset of the training examples. This should hold in terms of the feature vectors as well. How will it be manifested in the α̂_t's? Many of them will be exactly zero as a result of the optimization. In fact, they are non-zero only for examples (feature vectors) that are support vectors. Once we have solved for the α̂_t, we can classify any new example according to the discriminant function

\[
\hat{y}(x) = \hat{\theta}^T \phi(x) + \hat{\theta}_0 \tag{14}
\]

\[
= \sum_{t=1}^{n} \hat{\alpha}_t y_t \, [\phi(x_t)^T \phi(x)] + \hat{\theta}_0 \tag{15}
\]

\[
= \sum_{t \in SV} \hat{\alpha}_t y_t \, [\phi(x_t)^T \phi(x)] + \hat{\theta}_0 \tag{16}
\]

where SV is the set of support vectors, i.e., the examples with non-zero values of α̂_t. We don't know which examples (feature vectors) become support vectors until we have solved the optimization problem. Moreover, the identity of the support vectors will depend on the feature mapping or the kernel function.

But what is θ̂_0? It appeared to drop out of the optimization problem. We can set θ̂_0 after solving for the α̂_t by looking at the support vectors. Indeed, for all i ∈ SV we should have

\[
y_i(\hat{\theta}^T \phi(x_i) + \hat{\theta}_0) = y_i \sum_{t \in SV} \hat{\alpha}_t y_t \, [\phi(x_t)^T \phi(x_i)] + y_i \hat{\theta}_0 = 1 \tag{17}
\]

from which we can easily solve for θ̂_0. In principle, selecting any single support vector would suffice, but since we typically solve the quadratic program over the α_t's only up to some resolution, these constraints may not be satisfied exactly. It is therefore advisable to set θ̂_0 to the median of the values implied by the individual support vectors.
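
The classification rule in Eqs. (15)-(17) translates directly into code. The sketch below (an illustration under my own naming, not course-supplied code) recovers θ̂_0 as the median over the relevant support vectors, exactly as recommended above, and then evaluates the discriminant on new points. The optional C argument anticipates the soft-margin case discussed next, where points with α̂_t = C must be excluded when solving for θ̂_0.

```python
import numpy as np

def fit_bias_and_predict(K, y, alpha, K_test, C=None, tol=1e-6):
    """Recover theta_0 via Eq. (17) and evaluate the discriminant of Eq. (15).

    K:      (n, n) training Gram matrix, K[i, t] = phi(x_i)^T phi(x_t)
    K_test: (m, n) kernel values between new points and the training points
    C:      soft-margin constant, if any, so that margin violators with
            alpha_t = C are excluded when solving for theta_0."""
    on_margin = alpha > tol
    if C is not None:
        on_margin &= alpha < C - tol
    # Eq. (17) with y_i in {-1, +1}: theta_0 = y_i - sum_t alpha_t y_t K(x_t, x_i);
    # take the median over the support vectors, as suggested in the text.
    theta0 = np.median(y[on_margin] - (K @ (alpha * y))[on_margin])
    scores = K_test @ (alpha * y) + theta0       # discriminant of Eq. (15)
    return np.sign(scores), theta0
```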

What is the geometric margin we attain with some kernel function K(x, x') = φ(x)^T φ(x')? It is still 1/‖θ̂‖, which in kernel form reads

\[
\hat{\gamma}_{\text{geom}} = \left( \sum_{i=1}^{n} \sum_{j=1}^{n} \hat{\alpha}_i \hat{\alpha}_j y_i y_j K(x_i, x_j) \right)^{-1/2} \tag{18}
\]

Would it make sense to compare the geometric margins we attain with different kernels? We could perhaps use the margin as a criterion for selecting the best kernel function. Unfortunately this won't work without some care. For example, if we multiply all the feature vectors by 2, then the resulting geometric margin will also be twice as large (we have just expanded the space; the relations between the points remain the same). It is necessary to perform some normalization before any comparison makes sense.

We have so far assumed that the examples in their feature representations are linearly separable. We would also like to have the kernel form of the relaxed support vector machine formulation

\[
\min \ \|\theta\|^2 / 2 + C \sum_{t=1}^{n} \xi_t \tag{19}
\]

\[
\text{subject to} \quad y_t(\theta^T \phi(x_t) + \theta_0) \ge 1 - \xi_t, \quad \xi_t \ge 0, \quad t = 1, \ldots, n \tag{20}
\]

The resulting dual form is very similar to the one we derived above. In fact, the only difference is that the Lagrange multipliers α_t are now also bounded from above by C (the same C as in the primal formulation above). Intuitively, the Lagrange multipliers α_t serve to enforce the classification constraints and adopt larger values for constraints that are harder to satisfy. Without any upper limit, they would simply reach ∞ for any constraint that cannot be satisfied. The limit C specifies the point at which we stop trying to satisfy such constraints. More formally, the dual form is to maximize

\[
\sum_{t=1}^{n} \alpha_t - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, [\phi(x_i)^T \phi(x_j)] \tag{21}
\]

subject to

\[
0 \le \alpha_t \le C, \quad \sum_{t=1}^{n} \alpha_t y_t = 0 \tag{22}
\]

The resulting discriminant function has the same form as before except that the α̂_t values can be different. What about θ̂_0? To solve for θ̂_0 we need to identify classification constraints that are satisfied with equality. These are no longer simply the ones for which α̂_t > 0, but those corresponding to 0 < α̂_t < C. In other words, we have to exclude points that violate the margin constraints; these are the ones for which α̂_t = C.
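
Continuing the earlier sketches (same caveats: cvxopt and the function names are my assumptions, not the course's), the geometric margin of Eq. (18) and the box-constrained dual of Eqs. (21)-(22) require only small changes:

```python
import numpy as np
from cvxopt import matrix, solvers

def geometric_margin(K, y, alpha):
    """Kernel form of the geometric margin, Eq. (18): 1 / ||theta_hat||."""
    v = alpha * y
    return 1.0 / np.sqrt(v @ K @ v)

def svm_dual_soft_margin(K, y, C):
    """Same dual as the hard-margin sketch except 0 <= alpha_t <= C (Eq. 22)."""
    n = len(y)
    P = matrix(np.outer(y, y) * K)
    q = matrix(-np.ones(n))
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))      # alpha_t >= 0 and alpha_t <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol['x'])
```

The only difference from the hard-margin sketch is the extra block of rows in G and h implementing the upper bound α_t ≤ C.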

Kernel optimization

Whether we are interested in (linear) classification or regression, we face the problem of selecting an appropriate kernel function. A step in this direction might be to tailor a particular kernel a bit better to the available data. We could, for example, introduce additional parameters into the kernel and optimize those parameters so as to improve performance. These parameters could be as simple as the β parameter in the radial basis kernel or weights on each dimension of the input vectors, or as flexible as the coefficients of the best convex combination of basic (fixed) kernels. Key to such an approach is the measure we would optimize. Ideally, this measure would be the generalization error, but we obviously have to settle for a surrogate. The surrogate measure could be cross-validation or an alternative criterion related to the generalization error (e.g., the margin).

Kernel selection

We can also explicitly select among possible kernels and cast the problem as a model selection problem. By choosing a kernel we specify the feature vectors on the basis of which linear predictions are made. Each model (class) refers to a set of linear functions (classifiers) based on the chosen feature representation. (In statistics, a model is a family/set of distributions or, as here, a family/set of linear separators.) In many cases the models are nested in the sense that the more complex model contains the simpler one. We will continue from this point in the next lecture.
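
As one concrete instance of the surrogate-measure idea above, here is a cross-validation sketch for choosing the radial basis parameter β. It is an illustration under my own assumptions (the kernel form K(x, x') = exp(-β‖x - x'‖²), the helper names, and the procedure itself are not prescribed by the lecture), and it reuses svm_dual_soft_margin and fit_bias_and_predict from the earlier sketches.

```python
import numpy as np

def rbf_gram(X1, X2, beta):
    """Radial basis kernel K(x, x') = exp(-beta * ||x - x'||^2)."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-beta * sq_dists)

def select_beta(X, y, betas, C, n_folds=5, seed=0):
    """Pick beta by k-fold cross-validation error (the surrogate measure above)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    cv_error = []
    for beta in betas:
        fold_errors = []
        for held_out in folds:
            train = np.setdiff1d(np.arange(len(y)), held_out)
            K_tr = rbf_gram(X[train], X[train], beta)
            alpha = svm_dual_soft_margin(K_tr, y[train], C)                    # earlier sketch
            K_te = rbf_gram(X[held_out], X[train], beta)
            y_hat, _ = fit_bias_and_predict(K_tr, y[train], alpha, K_te, C=C)  # earlier sketch
            fold_errors.append(np.mean(y_hat != y[held_out]))
        cv_error.append(np.mean(fold_errors))
    return betas[int(np.argmin(cv_error))]
```

The same loop structure extends to the other parameterizations mentioned above, for example per-dimension weights inside the kernel or the mixing coefficients of a convex combination of fixed kernels.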