Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina


Indirect learning: loss optimization Indirect learning does not estimate the prediction rule f(x) directly, since most loss functions do not have explicit optimizers. Instead, it aims to directly minimize an empirical approximation of the expected loss function. Most often, it minimizes the empirical risk (empirical risk minimization) Σ_{i=1}^n L(Y_i, f(X_i)). For example, least squares estimation: Σ_{i=1}^n (Y_i − f(X_i))². Classification problem: Σ_{i=1}^n I(Y_i ≠ f(X_i)).
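As a concrete illustration (a minimal sketch, not from the slides; the data arrays and function names are hypothetical), each empirical risk above is just a sum of a pointwise loss over the sample:

```python
import numpy as np

def empirical_risk(Y, fX, loss):
    """Empirical risk: sum of a pointwise loss over the sample."""
    return np.sum(loss(Y, fX))

squared_loss = lambda y, f: (y - f) ** 2            # least squares
zero_one_loss = lambda y, f: (y != np.sign(f))      # misclassification indicator

# hypothetical example with labels in {-1, 1} and predictions f(X_i)
Y = np.array([1, -1, 1, 1])
fX = np.array([0.8, 0.3, -0.2, 1.5])
print(empirical_risk(Y, fX, squared_loss), empirical_risk(Y, fX, zero_one_loss))
```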

Potential challenges What is a good approximation to the expected loss function? The empirical risk is most commonly used, but there are alternatives. What class of candidate functions f should be optimized over? How do we avoid overfitting? Will computation be feasible (finding a global minimizer; computational complexity)?

Least squares estimation The empirical risk is Σ_{i=1}^n (Y_i − f(X_i))². f(x) can come from a class of linear functions; from a sieve space of basis functions (splines, wavelets, radial bases); or be fully nonparametric (kernel estimation). Overfitting can be addressed using regularization: variable selection for linear models; penalized splines or shrinkage for sieve approximations; cross-validation for tuning parameter selection. Computation is a convex optimization problem; coordinate descent is useful for large p.
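For instance, penalized least squares with cross-validated tuning can be carried out as below (a hedged sketch using scikit-learn; the simulated X, Y and the choice of the lasso penalty are assumptions for illustration, not part of the slides):

```python
import numpy as np
from sklearn.linear_model import LassoCV

# hypothetical data: n = 100 subjects, p = 20 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
Y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

# lasso (variable selection + shrinkage); the penalty is chosen by 5-fold cross-validation
fit = LassoCV(cv=5).fit(X, Y)
print(fit.alpha_, np.sum(fit.coef_ != 0))
```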

Support Vector Machines Consider a binary classification problem and use the labels {−1, 1} for the two classes. We start from a simple classification rule that is a linear function of the feature variables X. The idea of the SVM is to identify a hyperplane in the feature space that separates the classes as much as possible.

SVM illustration

Mathematical formulation of SVM The goal is to find a hyperplane β_0 + x^T β = 0 such that Y_i(β_0 + X_i^T β) > 0 for all i = 1, ..., n. Furthermore, we wish to maximize the margin, denoted M. That is, we solve max_{‖β‖=1, β_0} M subject to Y_i(β_0 + X_i^T β) ≥ M, i = 1, ..., n.

Equivalent optimization It is equivalent to min_{β_0, β} (1/2)‖β‖² subject to Y_i(β_0 + X_i^T β) ≥ 1, i = 1, ..., n. There are two difficulties in practice: the classes may not be separable, so no solution exists; or the classes may be separable but the separation is nonlinear.
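A short derivation of this equivalence (a sketch filling in the standard argument, which the slide does not spell out):

```latex
% With \|\beta\| = 1, the constraint Y_i(\beta_0 + X_i^T\beta) \ge M says every point lies at
% signed distance at least M from the hyperplane. Rescale (\beta_0,\beta) by 1/M and drop the
% unit-norm restriction:
\[
  Y_i(\beta_0 + X_i^T\beta) \ge M,\ \|\beta\| = 1
  \quad\Longleftrightarrow\quad
  Y_i(\tilde\beta_0 + X_i^T\tilde\beta) \ge 1,
  \qquad \tilde\beta = \beta/M,\ \tilde\beta_0 = \beta_0/M .
\]
\[
  \text{Then } M = 1/\|\tilde\beta\|, \text{ so maximizing } M \text{ is the same as minimizing }
  \tfrac12\|\tilde\beta\|^2 .
\]
```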

Extension to imperfectly separable data For imperfect separation, Y_i(β_0 + X_i^T β) may not be positive, i.e., the prediction is wrong. We should allow such misclassifications but impose a penalty for wrong predictions. This can be done by introducing a slack variable ξ_i ≥ 0 for each subject, i = 1, ..., n; ξ_i describes how far the observation falls on the wrong side of its margin. However, we should restrict the total penalty so that it is not too large.

SVM optimization The optimization is max_{‖β‖=1, β_0} M subject to Y_i(β_0 + X_i^T β) ≥ M(1 − ξ_i), ξ_i ≥ 0, i = 1, ..., n, and Σ_{i=1}^n ξ_i ≤ a pre-specified constant. Equivalently, min_{β_0, β, ξ} (1/2)‖β‖² + C Σ_{i=1}^n ξ_i subject to Y_i(β_0 + X_i^T β) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., n, where C is a given constant (called the cost parameter). This is a convex minimization problem with linear constraints.
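The primal problem can be handed to a generic convex solver. Below is a minimal sketch using cvxpy (the data X (n x p), the labels Y with entries in {-1, 1}, and the value of C are assumptions for illustration):

```python
import cvxpy as cp
import numpy as np

def soft_margin_svm_primal(X, Y, C=1.0):
    """Solve min (1/2)||beta||^2 + C*sum(xi) s.t. Y_i(beta0 + X_i^T beta) >= 1 - xi_i, xi_i >= 0."""
    n, p = X.shape
    beta = cp.Variable(p)
    beta0 = cp.Variable()
    xi = cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(beta) + C * cp.sum(xi))
    constraints = [cp.multiply(Y, X @ beta + beta0) >= 1 - xi, xi >= 0]
    cp.Problem(objective, constraints).solve()
    return beta0.value, beta.value, xi.value
```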

Solve the SVM problem using duality The Lagrange (primal) function is L_P = (1/2)‖β‖² + C Σ_{i=1}^n ξ_i − Σ_{i=1}^n α_i [Y_i(β_0 + X_i^T β) − (1 − ξ_i)] − Σ_{i=1}^n µ_i ξ_i, where α_i ≥ 0 and µ_i ≥ 0 are the Lagrange multipliers. Differentiating with respect to β_0, β, and ξ_i and setting the derivatives to zero gives β = Σ_{i=1}^n α_i Y_i X_i, 0 = Σ_{i=1}^n α_i Y_i, α_i = C − µ_i, i = 1, ..., n.

Dual problem After plugging β into the primal function and using these equations, the dual objective function is L_D = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j Y_i Y_j X_i^T X_j. The dual problem becomes max_α L_D subject to 0 ≤ α_i ≤ C, i = 1, ..., n, and Σ_{i=1}^n α_i Y_i = 0. Furthermore, the KKT conditions give α_i [Y_i(X_i^T β + β_0) − (1 − ξ_i)] = 0, µ_i ξ_i = 0, and Y_i(X_i^T β + β_0) − (1 − ξ_i) ≥ 0.
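A corresponding sketch of the dual quadratic program (again with cvxpy under the same hypothetical data conventions; a small diagonal jitter is added so the quadratic form is numerically positive semidefinite):

```python
import cvxpy as cp
import numpy as np

def soft_margin_svm_dual(X, Y, C=1.0):
    """Solve max sum(alpha) - (1/2) alpha^T Q alpha, Q_ij = Y_i Y_j X_i^T X_j,
    subject to 0 <= alpha_i <= C and sum_i alpha_i Y_i = 0."""
    n = X.shape[0]
    Q = np.outer(Y, Y) * (X @ X.T) + 1e-8 * np.eye(n)   # jitter for numerical PSD
    alpha = cp.Variable(n)
    objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q))
    constraints = [alpha >= 0, alpha <= C, Y @ alpha == 0]
    cp.Problem(objective, constraints).solve()
    return alpha.value
```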

KKT conditions

On SVM optimization Solving the dual problem is a simple convex quadratic programming problem (there are many solvers available). Since β = Σ_{i=1}^n α_i Y_i X_i, the hyperplane is determined by the observations with α_i ≠ 0, called support vectors. Among the support vectors, some lie on the margin edge (ξ_i = 0) and the remainder have α_i = C. Any support vector with ξ_i = 0 can be used to solve for β_0 (often taken to be the average when there are multiple). Sometimes β_0 is obtained by directly minimizing the primal function.
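To see these quantities on a fitted model, one can inspect a linear SVM from scikit-learn (a hedged illustration; the attribute names are scikit-learn's, while the simulated data are hypothetical):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
Y = np.array([-1] * 30 + [1] * 30)

fit = SVC(kernel="linear", C=1.0).fit(X, Y)
# support vectors are the observations with alpha_i > 0
print(fit.support_)                 # indices of the support vectors
print(fit.dual_coef_)               # alpha_i * Y_i for the support vectors
print(fit.coef_, fit.intercept_)    # beta = sum_i alpha_i Y_i X_i and beta_0
```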

Illustrative example

Going beyond the linear SVM The most commonly used nonlinear prediction rule restricts f to an RKHS, H_K (the kernel trick). Recall that an RKHS is given by a kernel function K(x, y), where K has the eigen-expansion K(x, y) = Σ_{k=1}^∞ γ_k φ_k(x) φ_k(y) and √γ_k φ_k is the normalized basis function for {H_K, ⟨·,·⟩_{H_K}}. We can represent f(x) using these basis functions: f(x) = β_0 + Σ_{k=1}^∞ β_k √γ_k φ_k(x).

Dual problem with the kernel trick Following the same derivation as for the linear SVM (replace X_i by the feature vector (√γ_1 φ_1(X_i), √γ_2 φ_2(X_i), ...)^T), the dual objective function becomes Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j Y_i Y_j (Σ_{k=1}^∞ γ_k φ_k(X_i) φ_k(X_j)) = Σ_{i=1}^n α_i − (1/2) Σ_{i=1}^n Σ_{j=1}^n α_i α_j Y_i Y_j K(X_i, X_j). The prediction function becomes f(x) = β_0 + Σ_{i=1}^n α_i Y_i Σ_{k=1}^∞ γ_k φ_k(X_i) φ_k(x) = β_0 + Σ_{i=1}^n α_i Y_i K(X_i, x).
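So only the Gram matrix K(X_i, X_j) is ever needed. A minimal sketch making that point (the Gaussian kernel, the simulated data, and scikit-learn's "precomputed" kernel option are assumptions for illustration):

```python
import numpy as np
from sklearn.svm import SVC

def gaussian_kernel(A, B, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 2))
Y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1, 1, -1)   # nonlinear boundary

K_train = gaussian_kernel(X, X)
fit = SVC(kernel="precomputed", C=1.0).fit(K_train, Y)

X_new = rng.normal(size=(5, 2))
K_new = gaussian_kernel(X_new, X)   # kernel between new points and training points
print(fit.predict(K_new))           # prediction uses only K, never the basis functions
```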

Advantages of the kernel trick Our conclusions are (a) restricting f to H_K leads to a nonlinear prediction function that depends on the kernel function; (b) solving the dual problem for the prediction function only requires knowing the kernel function K(x, y) (not necessarily the basis functions); (c) the optimization in the dual problem depends on the number of observations (n) but not on the dimensionality of the X_i's.

Choice of the kernel function Polynomial kernel: K(x, x′) = (1 + x^T x′)^d. Radial basis (Gaussian) kernel: K(x, x′) = exp{−γ ‖x − x′‖²}. Neural network kernel: K(x, x′) = tanh(k_1 x^T x′ + k_2).
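These kernels are straightforward to write down directly (a sketch; the parameter names d, gamma, k1, k2 follow the slide, and the default values are arbitrary):

```python
import numpy as np

def polynomial_kernel(x, xp, d=2):
    return (1.0 + x @ xp) ** d

def gaussian_kernel(x, xp, gamma=1.0):
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def neural_network_kernel(x, xp, k1=1.0, k2=0.0):
    # sigmoid ("neural network") kernel; not positive definite for all (k1, k2)
    return np.tanh(k1 * (x @ xp) + k2)
```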

Revisit SVM example

Loss formulation for SVM Revisit the linear SVM formulation: we minimize ‖β‖ subject to the separation constraints Y_i(β_0 + X_i^T β) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., n, with Σ_{i=1}^n ξ_i bounded by a constant. We need to understand exactly what empirical loss this optimization minimizes, because by doing so we can characterize how the SVM can minimize the classification loss (Fisher consistency), and we can study the stochastic variability of the SVM classifier (convergence rates and risk bounds).

Loss formulation, continued Equivalently, we minimize (for a given constant C) (1/2)‖β‖² + C Σ_{i=1}^n ξ_i subject to ξ_i ≥ [1 − Y_i(β_0 + X_i^T β)]_+, where (1 − z)_+ = max(1 − z, 0). Hence, the SVM is equivalent to minimizing the penalized loss Σ_{i=1}^n [1 − Y_i(β_0 + X_i^T β)]_+ + (λ/2)‖β‖². For the nonlinear SVM, the objective is Σ_{i=1}^n [1 − Y_i f(X_i)]_+ + (λ/2)‖f‖²_{H_K}. We call L(y, f) = [1 − yf]_+ the hinge loss function.
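Viewed this way, a linear SVM can be fit by (sub)gradient descent on the penalized hinge loss. A minimal sketch under assumed conventions (X is n x p, Y has entries in {-1, 1}; the step size and iteration count are hypothetical tuning choices):

```python
import numpy as np

def hinge_svm_gd(X, Y, lam=0.1, lr=0.01, n_iter=2000):
    """Minimize sum_i [1 - Y_i(b0 + X_i^T beta)]_+ + (lam/2)*||beta||^2 by subgradient descent."""
    n, p = X.shape
    beta, beta0 = np.zeros(p), 0.0
    for _ in range(n_iter):
        margins = Y * (X @ beta + beta0)
        active = margins < 1                  # points violating the margin contribute a subgradient
        grad_beta = -(Y[active][:, None] * X[active]).sum(axis=0) + lam * beta
        grad_beta0 = -Y[active].sum()
        beta -= lr * grad_beta
        beta0 -= lr * grad_beta0
    return beta0, beta
```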

Plot of the hinge loss

Fisher consistency of SVM Fisher consistency: suppose f minimizes E[(1 − Yf(X))_+]. Then sign(f(x)) is the Bayes rule for the classification problem. Proof: note that E[(1 − Yf(X))_+ | X = x] = (1 − f(x))_+ P(Y = 1 | X = x) + (1 + f(x))_+ P(Y = −1 | X = x). As a function of f(x), this is piecewise linear with three pieces: decreasing on (−∞, −1], linear on (−1, 1] with slope P(Y = −1 | X = x) − P(Y = 1 | X = x), and increasing on [1, ∞). The minimum is therefore attained at f(x) = −1 if P(Y = 1 | X = x) < P(Y = −1 | X = x) and at f(x) = 1 otherwise, so sign(f(x)) agrees with the Bayes rule. We conclude Fisher consistency.

Extensions of the hinge loss The hinge loss is one special case of the so-called large-margin losses of the form φ(yf) for some convex function φ. Additional examples include the binomial deviance: log(1 + e^{−yf}); the squared loss: (1 − yf)²; the squared hinge loss: [(1 − yf)_+]²; and the AdaBoost (exponential) loss: exp{−yf}. A sufficient condition for Fisher consistency is that φ is differentiable at 0 with φ′(0) < 0.
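For comparison, these margin losses can be evaluated directly as functions of the margin m = yf (a sketch; the function names are my own labels):

```python
import numpy as np

def hinge(m):
    return np.maximum(1 - m, 0)

def binomial_deviance(m):
    return np.log1p(np.exp(-m))

def squared(m):
    return (1 - m) ** 2

def squared_hinge(m):
    return np.maximum(1 - m, 0) ** 2

def adaboost(m):
    return np.exp(-m)

m = np.linspace(-2, 2, 5)
for phi in (hinge, binomial_deviance, squared, squared_hinge, adaboost):
    print(phi.__name__, phi(m))
```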

SVM for regression The extension of the SVM to a continuous outcome Y is based on modifying the SVM loss. Consider the prediction f(X) for a subject with features X whose true outcome is Y. The inaccuracy of the prediction can be characterized by the so-called ɛ-insensitive loss: L(Y, f(X)) = max(|Y − f(X)| − ɛ, 0). The loss is zero if the prediction error is within ɛ.

ɛ-insensitive loss

Optimization problem in SVM for regression The objective for a linear prediction function is min_{β_0, β, ξ, ξ*} ‖β‖²/2 + C Σ_{i=1}^n (ξ_i + ξ_i*) subject to −ξ_i* − ɛ ≤ Y_i − (β_0 + X_i^T β) ≤ ɛ + ξ_i, ξ_i ≥ 0, ξ_i* ≥ 0, i = 1, ..., n. The dual problem is min_{α, α*} ɛ Σ_{i=1}^n (α_i + α_i*) − Σ_{i=1}^n Y_i(α_i − α_i*) + (1/2) Σ_{i=1}^n Σ_{j=1}^n (α_i − α_i*)(α_j − α_j*) X_i^T X_j subject to 0 ≤ α_i, α_i* ≤ C, Σ_{i=1}^n (α_i − α_i*) = 0, α_i α_i* = 0. The prediction function is β_0 + x^T β with β = Σ_{i=1}^n (α_i − α_i*) X_i.
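In practice this regression SVM is available off the shelf; a short scikit-learn sketch (the simulated data and the parameter values C and epsilon are hypothetical):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(100, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

# epsilon-insensitive loss with tube width epsilon and cost parameter C
fit = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, Y)
print(fit.predict(X[:5]))
```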