Support Vector Machines for Regression

COMP-566 Rohan Shah (1) Support Vector Machines for Regression

Provided with $n$ training data points $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \subset \mathbb{R}^s \times \mathbb{R}$, we seek a function $f$ for a fixed $\epsilon > 0$ such that:

$$|f(x_i) - y_i| \le \epsilon \quad (1)$$

Let us consider the space of linear functions for now and take $f(x_i) = w \cdot x_i + b$ with $w \in \mathbb{R}^s$ and $b \in \mathbb{R}$. In order to avoid generalisation errors (over-fitting of the regression surface) we require the function $f$ to be as flat as possible, so we want to minimise $\|w\|$. We define a convex quadratic optimisation problem:

$$\min \ \tfrac{1}{2}\|w\|^2 \quad \text{subject to:} \quad y_i - w \cdot x_i - b \le \epsilon, \qquad w \cdot x_i + b - y_i \le \epsilon \quad (2)$$
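
A minimal sketch of the hard-constraint primal (2), assuming CVXPY as the solver (the slides do not prescribe one) and a small synthetic data set; this problem may be infeasible if no flat linear $f$ fits every target within $\epsilon$, which is exactly the issue the next slide addresses.

```python
# Hard-constraint epsilon-SVR primal (2), sketched with CVXPY on toy data.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))            # n = 30 points in R^s with s = 2
y = X @ np.array([1.5, -0.7]) + 0.3     # noiseless linear targets
eps = 0.1

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))       # (1/2)||w||^2
constraints = [y - X @ w - b <= eps,                   # y_i - w.x_i - b <= eps
               X @ w + b - y <= eps]                   # w.x_i + b - y_i <= eps
prob = cp.Problem(objective, constraints)
prob.solve()
print(prob.status, w.value, b.value)
```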

COMP-566 Rohan Shah (2) Feasibility

We assumed the existence of a function $f$ satisfying the constraints. It is possible that for a given $\epsilon$ no function satisfying the constraint $|f(x_i) - y_i| \le \epsilon$ exists. So we introduce slack variables $\psi_i \ge 0$ and $\phi_i \ge 0$ and re-write the optimisation as:

$$\text{minimise} \quad \tfrac{1}{2}\|w\|^2 + \zeta \sum_{i=1}^{n} (\psi_i + \phi_i)$$
$$\text{subject to:} \quad y_i - w \cdot x_i - b \le \epsilon + \psi_i, \qquad w \cdot x_i + b - y_i \le \epsilon + \phi_i, \qquad \psi_i \ge 0, \ \phi_i \ge 0, \quad i = 1, \ldots, n \quad (3)$$

The constant $\zeta$ controls the trade-off between how much deviation greater than $\epsilon$ is permitted and the generalisation, in this case the flatness, of the regression function.
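
A sketch of the soft-margin primal (3) in the same CVXPY style; the synthetic X, y and the values of $\epsilon$ and $\zeta$ are my own choices, not from the slides.

```python
# Soft-margin epsilon-SVR primal (3) with slack variables psi, phi.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.5, -0.7]) + 0.3 + 0.05 * rng.normal(size=30)
eps, zeta = 0.1, 10.0

n, s = X.shape
w, b = cp.Variable(s), cp.Variable()
psi = cp.Variable(n, nonneg=True)       # slack for points above the epsilon-tube
phi = cp.Variable(n, nonneg=True)       # slack for points below the epsilon-tube
objective = cp.Minimize(0.5 * cp.sum_squares(w) + zeta * cp.sum(psi + phi))
constraints = [y - X @ w - b <= eps + psi,
               X @ w + b - y <= eps + phi]
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```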

COMP-566 Rohan Shah (3)

We define the Lagrangian: the objective function plus a linear combination of the constraints:

$$L(w, b, \psi, \phi; \alpha, \beta, \eta, \eta^*) = \tfrac{1}{2}\|w\|^2 + \zeta \sum_{i=1}^{n} (\psi_i + \phi_i) - \sum_{i=1}^{n} \alpha_i (\epsilon + \psi_i - y_i + \langle w, x_i \rangle + b) - \sum_{i=1}^{n} \beta_i (\epsilon + \phi_i + y_i - \langle w, x_i \rangle - b) - \sum_{i=1}^{n} (\eta_i \psi_i + \eta_i^* \phi_i) \quad (4)$$

where $\alpha_i$, $\beta_i$, $\eta_i$ and $\eta_i^*$ are non-negative dual variables or Lagrange multipliers.
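
As a purely illustrative aid (not part of the slides), a small numpy helper that evaluates (4) for given primal values and given multipliers; all arguments are hypothetical arrays supplied by the caller.

```python
# Evaluate the Lagrangian (4) for supplied primal and dual variables.
import numpy as np

def lagrangian(w, b, psi, phi, alpha, beta, eta, eta_star, X, y, eps, zeta):
    margins = X @ w + b - y                            # <w, x_i> + b - y_i
    return (0.5 * w @ w
            + zeta * np.sum(psi + phi)
            - np.sum(alpha * (eps + psi + margins))    # alpha_i (eps + psi_i - y_i + <w,x_i> + b)
            - np.sum(beta * (eps + phi - margins))     # beta_i  (eps + phi_i + y_i - <w,x_i> - b)
            - np.sum(eta * psi + eta_star * phi))
```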

COMP-566 Rohan Shah (4) Lagrangian Dual Problem

The dual of the Lagrangian is defined as:

$$L_{dual}(\alpha, \beta, \eta, \eta^*) = \inf_{w, b, \psi, \phi} L(w, b, \psi, \phi; \alpha, \beta, \eta, \eta^*) \quad (5)$$

The Lagrangian Dual Problem is then:

$$\max \ L_{dual}(\alpha, \beta, \eta, \eta^*) \quad \text{subject to:} \quad \alpha \ge 0, \ \beta \ge 0, \ \eta \ge 0, \ \eta^* \ge 0 \quad (6)$$

COMP-566 Rohan Shah (5) Lagrangian Weak Duality Theorem

The primal objective function $f$ and its dual $L_{dual}$ satisfy:

$$f(w, b, \psi, \phi) \ge L_{dual}(\alpha, \beta, \eta, \eta^*) \quad (7)$$

Proof: for any primal-feasible $(w, b, \psi, \phi)$ and dual-feasible $(\alpha, \beta, \eta, \eta^*)$,

$$L_{dual}(\alpha, \beta, \eta, \eta^*) = \inf_{w, b, \psi, \phi} L(w, b, \psi, \phi; \alpha, \beta, \eta, \eta^*) \le L(w, b, \psi, \phi; \alpha, \beta, \eta, \eta^*) \le f(w, b, \psi, \phi) \quad (8)$$

The last inequality holds because, at a primal-feasible point, every term subtracted from the objective in (4) is a non-negative multiplier times a non-negative quantity.

COMP-566 Rohan Shah (6) Lagrangian Strong Duality Theorem

1. A real-valued function $f$ is said to be convex if for all $w, u \in \mathbb{R}^s$ and any $\theta \in (0, 1)$ we have: $f(\theta w + (1 - \theta) u) \le \theta f(w) + (1 - \theta) f(u)$.
2. Every strictly convex constrained optimisation problem has a unique solution.
3. An affine function can be expressed in the form $f(w) = Aw + b$.
4. Affine functions are convex.
5. A convex optimisation problem has a convex objective function and affine constraints.

Strong Duality Theorem: For a convex optimisation problem the duality gap is zero at primal optimality.

COMP-566 Rohan Shah (7) Karush-Kuhn-Tucker Complementary Conditions

Let $(\alpha_o, \beta_o, \eta_o, \eta^*_o)$ and $(w_o, b_o, \psi_o, \phi_o)$ be optimal solutions of the dual and primal respectively; then $L_{dual}(\alpha_o, \beta_o, \eta_o, \eta^*_o) = f(w_o, b_o, \psi_o, \phi_o)$. From (4) and (8) we then derive the KKT conditions:

$$\alpha_i (\epsilon + \psi_i - y_i + \langle w, x_i \rangle + b) = 0$$
$$\beta_i (\epsilon + \phi_i + y_i - \langle w, x_i \rangle - b) = 0$$
$$\eta_i \psi_i = 0, \qquad \eta_i^* \phi_i = 0, \qquad i = 1, \ldots, n \quad (9)$$

A constraint $c_t(w)$ is said to be tight or active if the solution $w_o$ satisfies $c_t(w_o) = 0$, and is otherwise said to be inactive.
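
An illustrative helper (my own addition, not from the slides) that measures how well the complementary-slackness conditions (9) hold for already-computed primal values $(w, b, \psi, \phi)$ and dual values $(\alpha, \beta, \eta, \eta^*)$, all passed in as hypothetical arrays.

```python
# Numerical check of the KKT complementary conditions (9).
import numpy as np

def kkt_violation(w, b, psi, phi, alpha, beta, eta, eta_star, X, y, eps):
    """Largest absolute violation of the complementary-slackness products."""
    margins = X @ w + b - y                   # <w, x_i> + b - y_i
    v1 = alpha * (eps + psi + margins)        # alpha_i (eps + psi_i - y_i + <w,x_i> + b)
    v2 = beta * (eps + phi - margins)         # beta_i  (eps + phi_i + y_i - <w,x_i> - b)
    v3 = eta * psi
    v4 = eta_star * phi
    return np.max(np.abs(np.concatenate([v1, v2, v3, v4])))
```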

COMP-566 Rohan Shah (8) Lagrangian Saddlepoint Equivalence Theorem

When we have an optimal primal solution (so the duality gap is zero), it is a saddle point of the Lagrangian function of the primal problem, and so we have:

$$\frac{\partial L}{\partial b} = \sum_{i=1}^{n} (\beta_i - \alpha_i) = 0 \quad (10)$$
$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^{n} (\alpha_i - \beta_i) x_i = 0 \quad (11)$$
$$\frac{\partial L}{\partial \phi_i} = \zeta - \beta_i - \eta_i^* = 0 \quad (12)$$
$$\frac{\partial L}{\partial \psi_i} = \zeta - \alpha_i - \eta_i = 0 \quad (13)$$

Equation (11) can be written as $w = \sum_{i=1}^{n} (\alpha_i - \beta_i) x_i$, and so

$$f(x_a) = w \cdot x_a + b = \sum_{i=1}^{n} (\alpha_i - \beta_i) \langle x_i, x_a \rangle + b$$
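
A tiny illustration of (11): given hypothetical dual values alpha, beta and the training inputs X, recover $w$ and evaluate $f$ at a new point $x_a$ through dot products with the training points.

```python
# Recover w from the dual variables (equation 11) and predict at x_a.
import numpy as np

def recover_w(alpha, beta, X):
    return (alpha - beta) @ X                 # w = sum_i (alpha_i - beta_i) x_i

def predict(x_a, alpha, beta, X, b):
    return (alpha - beta) @ (X @ x_a) + b     # sum_i (alpha_i - beta_i) <x_i, x_a> + b
```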

COMP-566 Rohan Shah (9) Dual Formulation

To remove the dependence on the primal variables we explicitly compute (5) by differentiating $L(w, b, \psi, \phi; \alpha, \beta, \eta, \eta^*)$ with respect to the primal variables, which leaves us with (10) to (13), which we substitute into (4):

$$\text{maximise} \quad -\tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \beta_i)(\alpha_j - \beta_j) \langle x_i, x_j \rangle - \epsilon \sum_{i=1}^{n} (\alpha_i + \beta_i) + \sum_{i=1}^{n} y_i (\alpha_i - \beta_i)$$
$$\text{subject to:} \quad \sum_{i=1}^{n} (\alpha_i - \beta_i) = 0, \qquad \alpha_i, \beta_i \in [0, \zeta] \quad (14)$$

So the dual of our quadratic program (3) is another quadratic program but with simpler constraints.
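
A hedged sketch of the dual QP (14) in CVXPY for the linear kernel. The quadratic term $(\alpha-\beta)^T X X^T (\alpha-\beta)$ is written as $\|X^T(\alpha-\beta)\|^2$ so the solver recognises it as convex; the rule used to recover $b$ (picking a point with $0 < \alpha_i < \zeta$) is a simple heuristic, not something the slides spell out.

```python
# Dual QP (14) for the linear kernel, plus recovery of w and b.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.5, -0.7]) + 0.3 + 0.05 * rng.normal(size=30)
eps, zeta = 0.1, 10.0
n = len(y)

alpha = cp.Variable(n, nonneg=True)
beta = cp.Variable(n, nonneg=True)
d = alpha - beta
objective = cp.Maximize(-0.5 * cp.sum_squares(X.T @ d)    # -(1/2) d^T (X X^T) d
                        - eps * cp.sum(alpha + beta) + y @ d)
constraints = [cp.sum(d) == 0, alpha <= zeta, beta <= zeta]
cp.Problem(objective, constraints).solve()

w = (alpha.value - beta.value) @ X                         # equation (11)
# Heuristic: pick a point whose alpha_i is most interior to (0, zeta);
# there psi_i = 0 and the constraint is active, so b = y_i - <w, x_i> - eps.
i = int(np.argmax(np.minimum(alpha.value, zeta - alpha.value)))
b = y[i] - X[i] @ w - eps
print(w, b)
```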

COMP-566 Rohan Shah (10) Support Vectors

The Karush-Kuhn-Tucker complementary conditions state that only the active constraints will have non-zero dual variables:

$$\epsilon + \psi_i - y_i + f(x_i) > 0 \ \Rightarrow \ \alpha_i = 0$$
$$\epsilon + \psi_i - y_i + f(x_i) = 0 \ \Rightarrow \ \alpha_i \ge 0$$
$$\epsilon + \phi_i + y_i - f(x_i) > 0 \ \Rightarrow \ \beta_i = 0$$
$$\epsilon + \phi_i + y_i - f(x_i) = 0 \ \Rightarrow \ \beta_i \ge 0 \qquad i = 1, \ldots, n \quad (15)$$

The $x_i$ with non-zero $\alpha_i$ or $\beta_i$ are the support vectors; if we were to train the SVM on only these $x_i$, ignoring all the examples for which $\alpha_i = 0$ and $\beta_i = 0$, we would still induce the same regression surface.
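
Illustrative only: identifying the support vectors from a numerical dual solution such as the one above means thresholding small dual values; the tolerance 1e-6 is an arbitrary numerical choice, not something the slides specify.

```python
# Indices of support vectors: points with alpha_i != 0 or beta_i != 0.
import numpy as np

def support_vector_indices(alpha, beta, tol=1e-6):
    return np.where((alpha > tol) | (beta > tol))[0]
```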

COMP-566 Rohan Shah (11) Sparse Support Vector Expansion

We see that the vector $w$ can be written as a linear combination of the input training data points $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\} \subset \mathbb{R}^s \times \mathbb{R}$, and so:

$$f(x_a) = w \cdot x_a + b = \sum_{i=1}^{n} (\alpha_i - \beta_i) \langle x_i, x_a \rangle + b \quad (16)$$

Let $\Lambda \subseteq \{1, 2, \ldots, n\}$ be the set of indices $i$ for which $\alpha_i \ne 0$ or $\beta_i \ne 0$. Then we can rewrite our regression function as:

$$f(x_a) = \sum_{i \in \Lambda} (\alpha_i - \beta_i) \langle x_i, x_a \rangle + b \quad (17)$$

We have a sparse expansion of $f(x_a)$.
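
A small sketch of the sparse expansion (17): prediction touches only the support vectors. It assumes alpha, beta, X and b come from an earlier dual solve, and reuses the same arbitrary tolerance as above.

```python
# Predict with the sparse expansion (17), using only the support vectors.
import numpy as np

def sparse_predict(x_a, alpha, beta, X, b, tol=1e-6):
    sv = np.where((alpha > tol) | (beta > tol))[0]    # the index set Lambda
    coef = alpha[sv] - beta[sv]
    return coef @ (X[sv] @ x_a) + b                   # sum over Lambda only
```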

COMP-566 Rohan Shah (12) Non-linear SVM Regression: $\Phi : \mathbb{R}^s \to F$

Our machine is linear, but our data might be non-linear. We apply a mapping function $\Phi$ to our input data, essentially projecting it into a higher-dimensional feature space $F$:

$$f(x_i) = w \cdot \Phi(x_i) + b \quad (18)$$

We then apply our linear machinery to find a linear regression in this new feature space. The corresponding regression function in the input space, $x \mapsto w \cdot \Phi(x) + b$, is non-linear in $x$.

COMP-566 Rohan Shah (13) Example: Quadratic Map

$$(u_1, u_2) \mapsto \phi(u_1, u_2) = (u_1^2, u_2^2, \sqrt{2}\, u_1 u_2) \quad (19)$$

[Figure: input space, circular data in $\mathbb{R}^2$.]
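
A hedged illustration of the quadratic map (19) on my own toy version of the circular data: two concentric circles that cannot be separated by a line in $\mathbb{R}^2$ become separable after mapping, because the squared radius $u_1^2 + u_2^2$ is a linear function of the feature coordinates.

```python
# Quadratic feature map (19) applied to two concentric circles.
import numpy as np

def phi(u):
    u1, u2 = u[..., 0], u[..., 1]
    return np.stack([u1**2, u2**2, np.sqrt(2) * u1 * u2], axis=-1)

theta = np.linspace(0, 2 * np.pi, 50)
inner = np.stack([np.cos(theta), np.sin(theta)], axis=-1)        # radius 1
outer = 2.0 * np.stack([np.cos(theta), np.sin(theta)], axis=-1)  # radius 2

# In feature space the first two coordinates sum to the squared radius,
# so the hyperplane z1 + z2 = 2.5 separates the two circles.
print(phi(inner)[:, :2].sum(axis=1).max(),   # 1.0
      phi(outer)[:, :2].sum(axis=1).min())   # 4.0
```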

COMP-566 Rohan Shah (14) Input data mapped into Feature Space

[Figure: the mapped data; it is linearly separable in the feature space.]

COMP-566 Rohan Shah (15) Computational Feasibility

What computations are we performing in the feature space? Since we are projecting only our input data $x_i$, the dual form (14) changes only slightly: instead of computing the dot product between training examples in the input space, $x_i \cdot x_j$, we compute a dot product in the feature space $F$:

$$\Phi(x_i) \cdot \Phi(x_j) \quad (20)$$

If the dimension of $F$ is large then our problem might be computationally infeasible. We need to avoid computing in the feature space, and hence avoid projecting data points into the feature space, but we still have to compute the dot product of the features efficiently.

COMP-566 Rohan Shah (16) Kernel Functions

What is a kernel? A function $K$ such that for all $x, z$ in some input space:

$$K(x, z) = \phi(x) \cdot \phi(z) \quad (21)$$

where $\phi$ is a mapping from the input space to a feature space.

Mercer's Theorem characterises what constitutes a valid kernel and how valid kernels can be built.

So the dimension of the feature space does not affect the computational complexity!
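
A quick numerical check of (21), my own illustration using the quadratic map of slide (13): the kernel $K(x, z) = (x \cdot z)^2$ evaluated in the input space equals the explicit feature-space dot product $\phi(x) \cdot \phi(z)$, so the features never have to be formed.

```python
# Verify K(x, z) = (x . z)^2 equals phi(x) . phi(z) for the quadratic map.
import numpy as np

def phi(u):
    return np.array([u[0]**2, u[1]**2, np.sqrt(2) * u[0] * u[1]])

rng = np.random.default_rng(1)
x, z = rng.normal(size=2), rng.normal(size=2)
lhs = (x @ z) ** 2            # kernel evaluated in the input space
rhs = phi(x) @ phi(z)         # explicit dot product in the feature space
print(np.isclose(lhs, rhs))   # True
```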

COMP-566 Rohan Shah (17) Inverse Multiquadric Kernel

$$k(x, y) = \frac{1}{\sqrt{\|x - y\|^2 + c^2}} \quad (22)$$
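
A minimal implementation of the inverse multiquadric kernel (22); the vectorised Gram-matrix helper is my own convenience for the next slide's optimisation.

```python
# Inverse multiquadric kernel (22) for single points and for sets of points.
import numpy as np

def inverse_multiquadric(x, y, c=1.0):
    return 1.0 / np.sqrt(np.sum((x - y) ** 2) + c ** 2)

def gram_matrix(X, Z, c=1.0):
    sq_dists = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return 1.0 / np.sqrt(sq_dists + c ** 2)
```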

COMP-566 Rohan Shah (18) Quadratic Optimization using a Kernel Function

$$\text{maximise} \quad -\tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (\alpha_i - \beta_i)(\alpha_j - \beta_j) K(x_i, x_j) - \epsilon \sum_{i=1}^{n} (\alpha_i + \beta_i) + \sum_{i=1}^{n} y_i (\alpha_i - \beta_i)$$
$$\text{subject to:} \quad \sum_{i=1}^{n} (\alpha_i - \beta_i) = 0, \qquad \alpha_i, \beta_i \in [0, \zeta] \quad (23)$$

So using the dual representation helps us in two ways:

1. we can operate in high-dimensional spaces with the help of kernel functions, which replace the dot product in the dual;
2. it lets us make use of optimisation algorithms designed for the dual form.
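
A hedged end-to-end sketch of the kernelised dual (23) in CVXPY, using the inverse multiquadric kernel from slide (17) on a synthetic non-linear target. The small ridge added before the Cholesky factorisation, and the rule for recovering $b$, are my own numerical choices and are not from the slides.

```python
# Kernelised dual (23): solve for alpha, beta and predict via the kernel expansion.
import cvxpy as cp
import numpy as np

def gram_matrix(X, Z, c=1.0):
    sq = np.sum((X[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return 1.0 / np.sqrt(sq + c ** 2)                 # inverse multiquadric kernel (22)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=40)      # a non-linear target
eps, zeta = 0.05, 10.0
n = len(y)

K = gram_matrix(X, X)
L = np.linalg.cholesky(K + 1e-8 * np.eye(n))          # K ~= L L^T, so d^T K d = ||L^T d||^2

alpha = cp.Variable(n, nonneg=True)
beta = cp.Variable(n, nonneg=True)
d = alpha - beta
objective = cp.Maximize(-0.5 * cp.sum_squares(L.T @ d)
                        - eps * cp.sum(alpha + beta) + y @ d)
constraints = [cp.sum(d) == 0, alpha <= zeta, beta <= zeta]
cp.Problem(objective, constraints).solve()

coef = alpha.value - beta.value
i = int(np.argmax(np.minimum(alpha.value, zeta - alpha.value)))  # a point with 0 < alpha_i < zeta
b = y[i] - coef @ K[:, i] - eps                                  # from the active constraint there
f = lambda x_new: coef @ gram_matrix(X, x_new) + b               # kernel expansion prediction
print(f(np.array([[0.5]])), np.sin(0.5))
```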