Lecture 10 Support Vector Machines II

Lecture 10 Support Vector Machines II
22 February 2016
Taylor B. Arnold
Yale Statistics
STAT 365/665 1/28

Notes: Problem 3 is posted and due this upcoming Friday. There was an early bug in the fake-test data; fixed as of 2016-02-20. 2/28

Today: Optimization theory behind support vector machines; more examples. 3/28

Recall that we settled on the following definition for the support vector machine:

$$\max_{\|\beta\|_2 = 1} M, \quad \text{s.t.} \quad y_i(x_i^t \beta + \beta_0) \geq M - \xi_i, \quad i = 1, \ldots, n, \quad \xi_i \geq 0, \quad \sum_i \xi_i \leq \text{Constant}.$$

This defines a margin around the linear decision plane of width $M$ and tries to minimize the errors ($\xi_i$) for points that are on the wrong side of the margin. 4/28

We then re-parameterized this by setting the margin to 1 but allowing the size of $\beta$ to grow:

$$\min \frac{1}{2}\|\beta\|_2^2, \quad \text{s.t.} \quad y_i(x_i^t \beta + \beta_0) \geq 1 - \xi_i, \quad i = 1, \ldots, n, \quad \xi_i \geq 0, \quad \sum_i \xi_i \leq \text{Constant}.$$

This defines a margin around the linear decision plane of width $1/\|\beta\|$ and tries to minimize the errors ($\xi_i$) for points that are on the wrong side of the margin. 5/28

6/28

Notice that we can rewrite

$$\min \frac{1}{2}\|\beta\|_2^2, \quad \text{s.t.} \quad y_i(x_i^t \beta + \beta_0) \geq 1 - \xi_i, \quad i = 1, \ldots, n, \quad \xi_i \geq 0, \quad \sum_i \xi_i \leq \text{Constant}$$

with a constant $C > 0$, which depends only on the Constant in the original formulation, as:

$$\min \frac{1}{2}\|\beta\|_2^2 + C \sum_i \xi_i, \quad \text{s.t.} \quad y_i(x_i^t \beta + \beta_0) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, n.$$

This follows by noticing that, for whatever total slack $\sum_i \xi_i$ it incurs, the second form finds a $\beta$ that minimizes $\|\beta\|_2^2$, so varying $C$ traces out the same solutions as varying the Constant. 7/28
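
As a concrete aside (my own sketch, not from the lecture), the penalized form can be evaluated directly: for a fixed $(\beta, \beta_0)$ the smallest feasible slack is $\xi_i = \max(0, 1 - y_i(x_i^t\beta + \beta_0))$, so the objective is easy to compute in numpy.

```python
import numpy as np

def svm_objective(beta, beta0, X, y, C):
    """Penalized SVM objective (1/2)||beta||^2 + C * sum_i xi_i, where the
    slack xi_i = max(0, 1 - y_i (x_i' beta + beta0)) is the smallest value
    satisfying the constraints for the given coefficients."""
    margins = y * (X @ beta + beta0)
    xi = np.maximum(0.0, 1.0 - margins)
    return 0.5 * beta @ beta + C * xi.sum()

# toy usage with made-up data
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)
print(svm_objective(np.array([1.0, 1.0]), 0.0, X, y, C=1.0))
```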

The Lagrangian

Given a constrained optimization problem:

$$\min f(x), \quad \text{s.t.} \quad g_j(x) = 0, \quad j = 1, \ldots, K,$$

we can define the primal Lagrangian function as:

$$L_P = f(x) - \sum_{j=1}^{K} \lambda_j g_j(x).$$

What does this function look like? 8/28

The Lagrangian Dual

The Lagrangian dual function is then given as the infimum of $L_P$, as a function of the $\lambda_j$, over values of $x$:

$$L_D(\lambda) = \inf_x L_P(x, \lambda) = \inf_x \left\{ f(x) - \sum_{j=1}^{K} \lambda_j g_j(x) \right\}.$$

And the dual problem is to find the maximum of the dual function over all choices of $\lambda$:

$$\lambda^\star = \arg\max_\lambda L_D(\lambda).$$

The optimal value of the primal problem, $x^\star$, can be reconstructed by working backwards:

$$x^\star = \arg\min_x L_P(x, \lambda^\star).$$

9/28
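
As a quick worked example (mine, not from the slides), take $\min_x x^2$ subject to $x - 1 = 0$. Then

$$L_P(x, \lambda) = x^2 - \lambda(x - 1), \qquad \frac{\partial L_P}{\partial x} = 2x - \lambda = 0 \;\Rightarrow\; x = \lambda/2,$$

$$L_D(\lambda) = \frac{\lambda^2}{4} - \lambda\left(\frac{\lambda}{2} - 1\right) = \lambda - \frac{\lambda^2}{4}, \qquad \lambda^\star = \arg\max_\lambda L_D(\lambda) = 2, \qquad x^\star = \lambda^\star/2 = 1.$$

The dual maximum $L_D(\lambda^\star) = 1$ matches the primal minimum $f(x^\star) = 1$.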

For a better understanding of the dual problem we can visualize the Lagrangian solution as a saddle point.¹

¹ www.convexoptimization.com 10/28

It turns out that this is a very good framework for working with support vector machines. We can define the Lagrangian function as:

$$L_P = \frac{1}{2}\|\beta\|_2^2 + C \sum_i \xi_i - \sum_i \alpha_i \left[ y_i(x_i^t \beta + \beta_0) - (1 - \xi_i) \right] - \sum_i \mu_i \xi_i,$$

where the $\alpha_i$ and $\mu_i$ are the Lagrangian multipliers. 11/28

Aside: Technically, the theory of Lagrangian multipliers only applies when the constraints on the solution are equality constraints rather than inequality constraints. The larger theory needed for the general case uses the Karush-Kuhn-Tucker (KKT) conditions. These add additional constraints on top of those presented here. Following the Elements of Statistical Learning, we will not worry about those details here, as they are more an annoyance than an interesting conceptual difference. 12/28

To construct the dual function, we need to take partial derivatives with respect to the primal variables: $\beta$, $\beta_0$, and $\xi_i$. If we plug these into the primal problem, we get the dual function. 13/28

For $\beta_j$:

$$\frac{\partial L_P}{\partial \beta_j} = \frac{\partial}{\partial \beta_j} \left\{ \frac{1}{2}\|\beta\|_2^2 + C\sum_i \xi_i - \sum_i \alpha_i\left[ y_i(x_i^t\beta + \beta_0) - (1 - \xi_i)\right] - \sum_i \mu_i \xi_i \right\} = \frac{\partial}{\partial \beta_j}\left\{ \frac{1}{2}\|\beta\|_2^2 - \sum_i \alpha_i y_i (x_i^t \beta) \right\} = \beta_j - \sum_i \alpha_i y_i x_{i,j}$$

Setting this equal to zero, and writing the equation simultaneously for all $\beta_j$, we get:

$$\beta = \sum_i \alpha_i y_i x_i.$$

14/28
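
This identity is easy to check numerically. Below is a minimal sketch (my own, not the lecture's code, and assuming scikit-learn is installed) that fits a linear SVM and rebuilds $\beta$ as a weighted sum of the support vectors; scikit-learn stores the products $\alpha_i y_i$ in its dual_coef_ attribute.

```python
import numpy as np
from sklearn.svm import SVC

# toy two-class data, made up for illustration
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (25, 2)), rng.normal(1, 1, (25, 2))])
y = np.repeat([-1.0, 1.0], 25)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds alpha_i * y_i for the support vectors only
beta_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(beta_from_dual, clf.coef_))  # True: beta = sum_i alpha_i y_i x_i
print(clf.support_)                            # indices of the support vectors
```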

This necessary condition for the solution of the support vector machine is of independent interest. It says that $\beta$ can be written as a linear combination of the data points $x_i$. Any $x_i$ such that $\alpha_i$ is non-zero is called a support vector. 15/28

For β 0, the dervatve s gven as: { L P = 1 β 0 β 0 2 β 2 2 + C ξ { = } α y β 0 β 0 α [ y (x t β + β 0 ) (1 ξ ) ] µ ξ } = α y Whch when set to zero gves: 0 = α y Ths explans why the term β 0 s often called the bas of the support vector machne. 16/28

Finally, the derivative with respect to $\xi_i$ is given as:

$$\frac{\partial L_P}{\partial \xi_i} = \frac{\partial}{\partial \xi_i}\left\{ \frac{1}{2}\|\beta\|_2^2 + C\sum_i \xi_i - \sum_i \alpha_i\left[y_i(x_i^t\beta + \beta_0) - (1-\xi_i)\right] - \sum_i \mu_i \xi_i \right\} = \frac{\partial}{\partial \xi_i}\left\{ C\sum_i \xi_i + \sum_i \alpha_i(1 - \xi_i) - \sum_i \mu_i \xi_i \right\} = C - \alpha_i - \mu_i$$

Setting this equal to zero we see that:

$$\alpha_i = C - \mu_i.$$

17/28

We now want to plug this into the Lagrangian primal function. We first use the fact that $\alpha_i = C - \mu_i$ to unite the trailing terms with respect to $\alpha_i$:

$$\begin{aligned}
L_D &= \frac{1}{2}\|\beta\|_2^2 + C\sum_i \xi_i - \sum_i \alpha_i\left[y_i(x_i^t\beta + \beta_0) - (1-\xi_i)\right] - \sum_i \mu_i \xi_i \\
    &= \frac{1}{2}\|\beta\|_2^2 + \sum_i (C - \mu_i)\xi_i - \sum_i \alpha_i y_i x_i^t\beta - \sum_i \alpha_i y_i \beta_0 + \sum_i \alpha_i(1 - \xi_i) \\
    &= \frac{1}{2}\|\beta\|_2^2 + \sum_i \alpha_i \xi_i - \sum_i \alpha_i y_i x_i^t\beta - \sum_i \alpha_i y_i \beta_0 + \sum_i \alpha_i(1 - \xi_i) \\
    &= \frac{1}{2}\|\beta\|_2^2 + \sum_i \alpha_i - \sum_i \alpha_i y_i x_i^t\beta - \sum_i \alpha_i y_i \beta_0
\end{aligned}$$

And the last term drops out, since $\sum_i \alpha_i y_i = 0$:

$$L_D = \frac{1}{2}\|\beta\|_2^2 + \sum_i \alpha_i - \sum_i \alpha_i y_i x_i^t\beta - \beta_0 \sum_i \alpha_i y_i = \frac{1}{2}\|\beta\|_2^2 + \sum_i \alpha_i - \sum_i \alpha_i y_i x_i^t\beta$$

18/28

Now, notice that since $\beta = \sum_i \alpha_i y_i x_i$, we have that:

$$\|\beta\|_2^2 = \sum_j \beta_j^2 = \Big(\sum_i \alpha_i y_i x_i\Big)^t \Big(\sum_j \alpha_j y_j x_j\Big) = \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^t x_j$$

Also see that we can rewrite the last term in our dual function as:

$$\sum_i \alpha_i y_i x_i^t \beta = \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^t x_j = \|\beta\|_2^2.$$

19/28

Finally, the dual function can be written as:

$$L_D = \frac{1}{2}\|\beta\|_2^2 + \sum_i \alpha_i - \sum_i \alpha_i y_i x_i^t\beta = \sum_i \alpha_i - \frac{1}{2}\|\beta\|_2^2 = \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j x_i^t x_j,$$

which we want to maximize under the constraints (the first is from the KKT conditions, the second from the partial derivative of the bias):

$$0 \leq \alpha_i \leq C, \qquad \sum_i \alpha_i y_i = 0.$$

20/28
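
As a tiny worked example (mine, not from the slides), take two points $x_1 = (1, 0)$ with $y_1 = +1$ and $x_2 = (-1, 0)$ with $y_2 = -1$. Then $y_1 y_2 x_1^t x_2 = 1$, and the constraint $\sum_i \alpha_i y_i = 0$ forces $\alpha_1 = \alpha_2 = \alpha$, so

$$L_D = 2\alpha - \frac{1}{2}\left(\alpha^2 + \alpha^2 + 2\alpha^2\right) = 2\alpha - 2\alpha^2,$$

which (for $C \geq 1/2$) is maximized at $\alpha = 1/2$. Plugging back in, $\beta = \sum_i \alpha_i y_i x_i = (1, 0)$, giving a margin of width $1/\|\beta\| = 1$ on each side of the separating plane $\{x : x^t\beta + \beta_0 = 0\}$, exactly as the geometry of these two points suggests.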

To what extent can we make some sense of this equation? First notice the duality of the problem if we flip the $\pm 1$ labeling of the classes $y_i$:

$$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j x_i^t x_j$$

The function only depends on the sign of $y_i y_j$. Also note that it only depends on the data points that serve as support vectors (those with $\alpha_i \neq 0$):

$$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j x_i^t x_j.$$

21/28

Perhaps most importantly though, notice that only the outer product $XX^t$ affects the final results:

$$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j x_i^t x_j,$$

as $x_i^t x_j$ is the $(i, j)$-th element of $XX^t$. This is a measurement of how similar $x_i$ and $x_j$ are to one another (if scaled to both have length one, it is the cosine of the angle between them).

So, we can see that:

1. There is a penalty for including two similar $x_i$'s with the same class label
2. There is a benefit for including two similar $x_i$'s with different class labels

Both of which actually make sense for a classification algorithm. 22/28

Taking a step back now, how do logistic regression and support vector machines compare?

1. Both separate the plane into two half-spaces which attempt to split the classes as well as possible
2. However, logistic regression is (primarily) concerned with the correlation matrix $X^t X$ between the variables, and support vector machines only care about the similarity matrix $XX^t$ between observations

23/28

The Kernel Trick

In the case of logistic and linear regression, I have shown how basis expansion can be used to add non-linear effects into a linear model. One observation that makes support vector machines attractive is that it is possible to mimic basis expansion without ever having to actually project into a higher dimensional space. 24/28

The Kernel Trick, cont.

Assume that we have a mapping $h$ of samples $x_i$ into a higher dimensional space. We can re-write the dual function using inner product notation:

$$L_D = \sum_i \alpha_i - \frac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j \langle h(x_i), h(x_j) \rangle.$$

It quickly becomes apparent that we only need a fast way of calculating inner products in the space of $h$, which may not require actually determining and calculating $h$ itself. 25/28
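
To see this concretely, here is a small numpy check (my own illustration, not from the lecture) for the degree-2 polynomial kernel in two dimensions, whose explicit feature map is $h(x) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1 x_2)$: the kernel value computed in the original space matches the inner product computed in the expanded space.

```python
import numpy as np

def h(x):
    """Explicit feature map whose inner product equals (1 + <x, x'>)^2 in 2-D."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(1)
x, xp = rng.normal(size=2), rng.normal(size=2)

lhs = h(x) @ h(xp)            # inner product in the 6-dimensional space
rhs = (1 + x @ xp) ** 2       # the kernel, computed in the original 2-D space
print(np.allclose(lhs, rhs))  # True
```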

The Kernel Trick, cont.

The projected inner product $\langle h(x_i), h(x_j) \rangle$ is usually written directly as $K(x_i, x_j)$ for a function $K$ called the kernel. Popular choices include (see the sketch below):

1. Linear: $K(x_i, x_j) = \langle x_i, x_j \rangle$
2. Polynomial: $K(x_i, x_j) = (1 + \langle x_i, x_j \rangle)^d$
3. Radial: $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$
4. Sigmoid: $K(x_i, x_j) = \tanh(\kappa_1 \langle x_i, x_j \rangle + \kappa_2)$

Notice that these all require approximately the same effort to calculate as the linear kernel. 26/28
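
A minimal numpy sketch of these four kernels (my own illustration; the default values for $d$, $\gamma$, $\kappa_1$, $\kappa_2$ are arbitrary, not values from the lecture), each returning the full matrix of pairwise kernel evaluations:

```python
import numpy as np

def linear_kernel(X, Z):
    return X @ Z.T

def polynomial_kernel(X, Z, d=2):
    return (1 + X @ Z.T) ** d

def radial_kernel(X, Z, gamma=1.0):
    # squared Euclidean distance between every row of X and every row of Z
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def sigmoid_kernel(X, Z, kappa1=1.0, kappa2=0.0):
    return np.tanh(kappa1 * (X @ Z.T) + kappa2)

# each call produces the n-by-n matrix of kernel evaluations K(x_i, x_j)
X = np.random.default_rng(2).normal(size=(5, 3))
for k in (linear_kernel, polynomial_kernel, radial_kernel, sigmoid_kernel):
    print(k.__name__, k(X, X).shape)
```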

Finishing the optimization

Now, we can re-write the optimization problem as:

$$\max_\alpha \; 1^t \alpha - \frac{1}{2} \alpha^t K \alpha, \quad \text{s.t.} \quad 0 \leq \alpha \leq C,$$

for a suitable matrix $K$ (built from the kernel values and the class labels), called the kernel matrix. This is a quadratic program with box constraints, and can be solved fairly efficiently by general purpose solvers. 27/28
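
Below is a minimal sketch of this final step (my own, not the lecture's code, assuming scipy is available): it hands the box-constrained quadratic program to a general-purpose solver, also including the constraint $\sum_i \alpha_i y_i = 0$ from the earlier slide, and then recovers $\beta$ from the solution.

```python
import numpy as np
from scipy.optimize import minimize

def fit_svm_dual(X, y, C=1.0, kernel=lambda X, Z: X @ Z.T):
    """Maximize 1'a - (1/2) a' K a over 0 <= a_i <= C and sum_i a_i y_i = 0,
    where K_ij = y_i y_j k(x_i, x_j)."""
    n = len(y)
    K = (y[:, None] * y[None, :]) * kernel(X, X)
    obj = lambda a: 0.5 * a @ K @ a - a.sum()   # minimize the negative dual
    grad = lambda a: K @ a - np.ones(n)
    res = minimize(obj, np.zeros(n), jac=grad, method="SLSQP",
                   bounds=[(0.0, C)] * n,
                   constraints={"type": "eq", "fun": lambda a: a @ y})
    return res.x

# toy usage with made-up, well-separated data
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 1, (15, 2)), rng.normal(2, 1, (15, 2))])
y = np.repeat([-1.0, 1.0], 15)
alpha = fit_svm_dual(X, y)
beta = (alpha * y) @ X      # beta = sum_i alpha_i y_i x_i
print(beta, np.isclose(alpha @ y, 0.0))
```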

More information

I try to provide additional references for all of my lectures on the class website. For today's material (and Wednesday's) I would like to make a particular point to mention two references:

Elements of Statistical Learning, Sections 12.1-12.4
Convex Optimization, S. Boyd, Chapter 5 (5.5 in particular)

These contain many more details than I have time to cover, and assume a deeper background in statistics / convex calculus. 28/28