Lecture 7: Support Vector Machines

References
Lecture 7: Support Vector Machines. Isabelle Guyon, guyoni@inf.ethz.ch
"A training algorithm for optimal margin classifiers", Boser-Guyon-Vapnik, COLT 1992, http://www.clopinet.com/isabelle/papers/colt92.ps.z
Book chapters 1 and 2.
Software: LibSVM, http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Perceptron Learning Rule / Optimum Margin Perceptron
Perceptron learning rule: w_j ← w_j + y_i x_ij if example x_i is misclassified.
Optimum margin perceptron: w_j ← w_j + y_i x_ij if x_i is the least well classified example.
In both cases the weight vector is a linear combination of the examples: w_j = Σ_i α_i y_i x_ij.
The perceptron rule converges to one of the solutions classifying all the examples correctly; the optimum margin rule converges to the optimum margin perceptron (Mezard-Krauth, 1988). QP formulation (Vapnik, 1962).
[Figure: single-unit perceptron with inputs x_j, weights w_j, summation Σ, output y.]
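To make the two update rules concrete, here is a minimal NumPy sketch (not from the lecture materials): the classical perceptron updates on any misclassified example, while the optimum-margin (minover-style) variant always updates on the least well classified one. The toy data, epoch counts and stopping criterion are illustrative assumptions.

    import numpy as np

    def perceptron(X, y, epochs=100):
        """Classical perceptron: update on any misclassified example."""
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            updated = False
            for xi, yi in zip(X, y):
                if yi * (w @ xi) <= 0:       # misclassified (or on the boundary)
                    w += yi * xi             # w_j <- w_j + y_i x_ij
                    updated = True
            if not updated:                  # every example classified correctly
                break
        return w

    def minover(X, y, epochs=1000):
        """Optimum-margin perceptron (minover-style): always update on the
        least well classified example, i.e. the one with the smallest margin."""
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            margins = y * (X @ w)
            i = np.argmin(margins)           # least well classified example
            w += y[i] * X[i]                 # same additive rule, different choice of i
        return w / np.linalg.norm(w)         # direction approaches the max-margin separator

    rng = np.random.default_rng(0)
    # Toy linearly separable data (illustrative only).
    X = np.vstack([rng.normal(+2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
    y = np.array([+1] * 50 + [-1] * 50)
    print("perceptron w:", perceptron(X, y))
    print("optimum-margin w (unit norm):", minover(X, y))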

Optimum Margin Solution
Unique solution, which depends only on the support vectors (SV): w_j = Σ_{i∈SV} α_i y_i x_ij.
SVs are the examples closest to the decision boundary.
Bound on the leave-one-out error: LOO ≤ n_SV / m.
Most stable solution, good from an MDL point of view.
But: sensitive to outliers, and works only in the linearly separable case.

Negative Margin
Multiple negative-margin solutions, which all give the same number of errors.

Soft Margin
Examples within the margin area incur a penalty and become non-marginal support vectors.
Unique solution again (Cortes-Vapnik, 1995): w_j = Σ_{i∈SV} α_i y_i x_ij.

Non-Linear Perceptron (Rosenblatt, 1957)
Map the input x = [x_1, ..., x_n] to features Φ(x) = (φ_1(x), φ_2(x), ..., φ_N(x)), then apply a linear unit with weights w_1, ..., w_N and bias b:
f(x) = w·Φ(x) + b
[Figure: inputs x_1 ... x_n, feature units φ_1(x) ... φ_N(x), weights w_1 ... w_N, bias b, summation Σ, output f(x).]
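As an illustration of the non-linear perceptron idea, the sketch below maps the inputs through an explicit feature map Φ(x) (a degree-2 polynomial map, chosen here as an assumption) and then fits a linear decision function f(x) = w·Φ(x) + b in feature space; the use of scikit-learn and the toy data are assumptions, not part of the lecture.

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.svm import LinearSVC

    # Non-linear perceptron idea: map x -> Phi(x) explicitly, then learn a
    # linear decision function f(x) = w . Phi(x) + b in feature space.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)  # not linearly separable in input space

    phi = PolynomialFeatures(degree=2, include_bias=False)  # assumed feature map phi_1..phi_N
    Z = phi.fit_transform(X)                                # Phi(x) = (x1, x2, x1^2, x1*x2, x2^2)

    clf = LinearSVC(C=1.0).fit(Z, y)                        # linear separator in feature space
    print("training accuracy:", clf.score(Z, y))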

Kernel Trick
f(x) = w·Φ(x), with the dual form w = Σ_i α_i y_i Φ(x_i), gives
f(x) = Σ_i α_i y_i k(x_i, x), where k(x_i, x) = Φ(x_i)·Φ(x).
Potential functions: Aizerman et al., 1964. Dual forms: Aizerman-Braverman-Rozonoer, 1964.

Kernel Method
f(x) = Σ_i α_i k(x_i, x) + b
k(·,·) is a similarity measure or kernel.
[Figure: kernel machine with input x, kernel units k(x_1, x), k(x_2, x), ..., k(x_m, x), coefficients α_1, ..., α_m, bias b, summation Σ, output f(x).]

Some Kernels (reminder)
A kernel is a dot product in some feature space: k(s, t) = Φ(s)·Φ(t).
Examples:
k(s, t) = s·t : linear kernel
k(s, t) = exp(-γ ||s-t||²) : Gaussian kernel
k(s, t) = exp(-γ ||s-t||) : exponential kernel
k(s, t) = (1 + s·t)^q : polynomial kernel
k(s, t) = (1 + s·t)^q exp(-γ ||s-t||²) : hybrid kernel

Support Vector Classifier (Boser-Guyon-Vapnik, 1992)
Decision function on x = [x_1, x_2]: f(x) = Σ_{j∈SV} α_j y_j k(x, x_j), with decision regions f(x) < 0, f(x) = 0 (boundary), f(x) > 0.
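A small sketch of the kernels listed above and of the dual-form decision function f(x) = Σ_i α_i y_i k(x_i, x) + b. The support vectors, coefficients α_i and bias b below are placeholder values that would normally come from training; the function and variable names are illustrative.

    import numpy as np

    # Kernels from the slide: each k(s, t) is a dot product in some feature space.
    def linear_kernel(s, t):
        return s @ t

    def gaussian_kernel(s, t, gamma=1.0):
        return np.exp(-gamma * np.sum((s - t) ** 2))

    def exponential_kernel(s, t, gamma=1.0):
        return np.exp(-gamma * np.linalg.norm(s - t))

    def polynomial_kernel(s, t, q=2):
        return (1.0 + s @ t) ** q

    def hybrid_kernel(s, t, q=2, gamma=1.0):
        return (1.0 + s @ t) ** q * np.exp(-gamma * np.sum((s - t) ** 2))

    def decision_function(x, X_sv, y_sv, alpha, b, kernel=gaussian_kernel):
        """Dual form of the classifier: f(x) = sum_i alpha_i y_i k(x_i, x) + b."""
        return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y_sv, X_sv)) + b

    # Placeholder support vectors and coefficients (normally obtained by training).
    X_sv = np.array([[1.0, 1.0], [-1.0, -1.0]])
    y_sv = np.array([+1, -1])
    alpha = np.array([0.7, 0.7])
    b = 0.0
    print(decision_function(np.array([0.5, 0.8]), X_sv, y_sv, alpha, b))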

Margin and ||w||
Maximizing the margin is equivalent to minimizing ||w||.

Quadratic Programming
Hard margin: min ||w||², subject to y_j (w·x_j + b) ≥ 1 for all examples.
Soft margin: min ||w||² + C Σ_j ξ_j^β, subject to y_j (w·x_j + b) ≥ 1 - ξ_j for all examples, ξ_j ≥ 0, β = 1, 2.

Dual Formulation
Non-linear case: x → Φ(x).
Primal: min ||w||² + C Σ_j ξ_j^β, ξ_j ≥ 0, β = 1 or β = 2, subject to y_j (w·Φ(x_j) + b) ≥ 1 - ξ_j for all examples.
Dual: max_α -(1/2) α^T K α + 1^T α, with K = [y_i y_j k(x_i, x_j)] + (1/C) δ_ij, subject to α^T y = 0, 0 ≤ α_i ≤ C.

Ridge SVC
Soft margin: min ||w||² + C Σ_j ξ_j^β, ξ_j ≥ 0, β = 1, 2, subject to y_j (w·Φ(x_j) + b) ≥ 1 - ξ_j for all examples.
1 - y_j f(x_j) < 0: ξ_j = 0, no penalty, not an SV.
1 - y_j f(x_j) = 0: ξ_j = 0, no penalty, marginal SV.
1 - y_j f(x_j) = ξ_j > 0: penalty ξ_j, non-marginal SV.
Loss: L(x_j) = max(0, 1 - y_j f(x_j))^β. Risk: Σ_j L(x_j).
min (1/C) ||w||² + Σ_j L(x_j): regularized risk.
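The regularized-risk form at the end of this slide can be evaluated directly. The sketch below computes (1/C)||w||² + Σ_j max(0, 1 - y_j f(x_j))^β for a linear f(x) = w·x + b on placeholder data; the data, weights and C value are assumptions used only to exercise the formula.

    import numpy as np

    def regularized_risk(w, b, X, y, C=1.0, beta=1):
        """(1/C)||w||^2 + sum_j L(x_j), with L(x_j) = max(0, 1 - y_j f(x_j))^beta."""
        f = X @ w + b                          # f(x_j) = w . x_j + b
        slack = np.maximum(0.0, 1.0 - y * f)   # xi_j = max(0, 1 - y_j f(x_j))
        return (1.0 / C) * np.dot(w, w) + np.sum(slack ** beta)

    # Placeholder data and weights, purely to exercise the formula.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    y = np.sign(rng.normal(size=20))
    w, b = rng.normal(size=3), 0.0
    print("hinge loss (beta=1):        ", regularized_risk(w, b, X, y, beta=1))
    print("squared hinge loss (beta=2):", regularized_risk(w, b, X, y, beta=2))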

Ridge Regression (reminder)
Sum of squares: R = Σ_i (f(x_i) - y_i)²; in the classification case (y_i = ±1) this equals Σ_i (1 - y_i f(x_i))².
Add a regularizer: R = Σ_i (1 - y_i f(x_i))² + λ ||w||².
Compare with SVC: R = Σ_j max(0, 1 - y_j f(x_j))^β + λ ||w||².

Structural Risk Minimization (Vapnik, 1984)
Nested subsets of models of increasing complexity/capacity: S_1 ⊂ S_2 ⊂ ... ⊂ S_N.
Example, with ||w||²: S_k = { w : ||w||² < A_k }, A_1 < A_2 < ... < A_k.
Minimization under constraint: min R_emp[f] s.t. ||w||² < A_k (capacity).
Lagrangian: R_reg[f] = R_emp[f] + λ ||w||².
Radius-margin bound: LOO_cv ≤ 4 r² ||w||² (Vapnik-Chapelle, 2000).

Loss Functions
z = y f(x); the decision boundary is at z = 0 and the margin at z = 1; z < 0 means misclassified, z > 0 well classified.
Adaboost loss: e^{-z}
Logistic loss: log(1 + e^{-z})
Perceptron loss: max(0, -z)
SVC loss, β = 1 (hinge): max(0, 1 - z)
SVC loss, β = 2: max(0, 1 - z)²
Square loss: (1 - z)²
0/1 loss
[Figure: the losses above plotted as functions of z = y f(x).]

Regularizers
||w||₂² = Σ_i w_i²: 2-norm regularization (ridge regression, original SVM)
||w||₁ = Σ_i |w_i|: 1-norm regularization (lasso, Tibshirani 1996; 1-norm SVM 1965)
||w||₀ = number of non-zero weights: 0-norm regularization (Weston et al., 2003)
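To compare the losses listed above numerically, here is a small sketch that evaluates each of them as a function of z = y f(x); the grid of z values is arbitrary.

    import numpy as np

    z = np.linspace(-2.0, 2.0, 9)                     # z = y f(x); z < 0 means misclassified

    losses = {
        "0/1 loss":         (z <= 0).astype(float),
        "perceptron loss":  np.maximum(0.0, -z),
        "SVC loss, beta=1": np.maximum(0.0, 1.0 - z),
        "SVC loss, beta=2": np.maximum(0.0, 1.0 - z) ** 2,
        "square loss":      (1.0 - z) ** 2,
        "logistic loss":    np.log(1.0 + np.exp(-z)),
        "Adaboost loss":    np.exp(-z),
    }
    for name, values in losses.items():
        print(f"{name:18s}", np.round(values, 2))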

Regression SVM
Epsilon-insensitive loss: deviations with |y_i - f(x_i)| ≤ ε incur no penalty; only larger errors are penalized.

Unsupervised learning
SVMs for:
density estimation: fit F(x) (Vapnik, 1998)
finding the support of a distribution (one-class SVM) (Schoelkopf et al., 1999)
novelty detection (Schoelkopf et al., 1999)
clustering (Ben-Hur, 2001)

Summary
For statistical model inference, two ingredients are needed:
A loss function: defines the residual error, i.e. what is not explained by the model; it characterizes the data uncertainty or noise.
A regularizer: defines our prior knowledge and biases the solution; it characterizes our uncertainty about the model.
We usually bet on simpler solutions (Ockham's razor).

Exercise Class
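A sketch of the epsilon-insensitive loss and of a Gaussian-kernel regression SVM fitted with scikit-learn's SVR; the toy data and the C and ε values are assumptions for illustration.

    import numpy as np
    from sklearn.svm import SVR

    def eps_insensitive_loss(y_true, y_pred, eps=0.1):
        """No penalty while |y_i - f(x_i)| <= eps, linear penalty beyond that."""
        return np.maximum(0.0, np.abs(y_true - y_pred) - eps)

    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
    y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

    svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)  # Gaussian-kernel regression SVM
    print("mean eps-insensitive loss:", eps_insensitive_loss(y, svr.predict(X)).mean())
    print("number of support vectors:", len(svr.support_))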

Homework 7 (Filters: see Chapter 3)
1) Download the software for homework 7.
2) Inspiring yourself from the examples, write a new feature ranking filter object (a generic sketch follows below). Choose one from Chapter 3 or invent your own.
3) Provide the pvalue and FDR (using a tabulated distribution or the probe method).
4) Email a zip file with your object and a plot of the FDR to guyoni@inf.ethz.ch with subject "Homework7", no later than: Tuesday, December 13th.

The DEXTER task (text filtering)
Size: 0.9 MB, type: sparse integer; 20000 features, 300 training / 300 validation / 2000 test examples.
Best challenge entries: BER ≈ 3.3-3.9%, AUC ≈ 0.97-0.99, Frac_feat ≈ .5%, Frac_probe ≈ 5%.

Baseline model for Dexter:
my_classif=svc({'coef=', 'degree=', 'gamma=', 'shrinkage=.5'});
my_model=chain({s2n('f_max=3'), normalize, my_classif})
[Table: baseline Dexter results: fraction of features and probes selected, train/validation/test BER and AUC (Valid AUC ≈ 0.982, Test AUC ≈ 0.988).]
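For the flavour of such a filter, here is a generic signal-to-noise (s2n) feature ranking sketch in Python. It is not the CLOP object the homework refers to; the scoring formula is the usual s2n criterion, and the toy data are an assumption.

    import numpy as np

    def s2n_ranking(X, y):
        """Signal-to-noise criterion per feature:
        |mu_plus - mu_minus| / (sigma_plus + sigma_minus); higher scores rank first."""
        Xp, Xm = X[y == +1], X[y == -1]
        score = np.abs(Xp.mean(0) - Xm.mean(0)) / (Xp.std(0) + Xm.std(0) + 1e-12)
        ranking = np.argsort(-score)              # best features first
        return score, ranking

    rng = np.random.default_rng(0)
    y = np.array([+1] * 50 + [-1] * 50)
    X = rng.normal(size=(100, 20))
    X[y == +1, 0] += 2.0                          # make feature 0 informative
    score, ranking = s2n_ranking(X, y)
    print("top 5 features:", ranking[:5])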

Evaluation of pval and FDR
Ttest object: computes pval analytically; FDR ≈ pval · n/n_sc.
Probe object: takes any feature ranking object as an argument (e.g. s2n, relief, Ttest); pval ≈ n_sp/n_p; FDR ≈ pval · n/n_sc.
[Figure: analytic (red) vs. probe (blue) estimates of the fraction of features falsely found significant, as a function of the number of features found significant (rank), on Arcene, Gisette, Dexter, Madelon, and Dorothea.]
[Figure: Relief vs. Ttest: fraction of features falsely found significant vs. rank (number of features found significant); dashed blue: Ttest with probes.]
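A minimal sketch of the probe method described above: random probes are ranked together with the real features, pval is estimated as the fraction of probes ranked above (n_sp/n_p) and FDR as pval · n/n_sc. The probe construction (shuffled copies of real features) and the s2n score used for ranking are assumptions.

    import numpy as np

    def probe_pval_fdr(X, y, score_fn, n_probes=1000, rng=None):
        """Probe method: rank real features together with random probes.
        At each rank: pval ~ n_sp/n_p (fraction of probes ranked above) and
        FDR ~ pval * n / n_sc (n = #real features, n_sc = #real features selected)."""
        rng = rng or np.random.default_rng(0)
        n = X.shape[1]
        # Probes: shuffled copies of real feature columns (null features).
        probes = rng.permuted(X[:, rng.integers(0, n, n_probes)], axis=0)
        scores = score_fn(np.hstack([X, probes]), y)
        is_probe = np.arange(n + n_probes) >= n
        order = np.argsort(-scores)               # best-ranked first
        sp = np.cumsum(is_probe[order])           # n_sp at each rank
        sc = np.cumsum(~is_probe[order])          # n_sc at each rank
        pval = sp / n_probes
        fdr = np.minimum(1.0, pval * n / np.maximum(sc, 1))
        keep = ~is_probe[order]                   # report values at the real features' ranks
        return pval[keep], fdr[keep]

    def s2n_score(X, y):
        Xp, Xm = X[y == +1], X[y == -1]
        return np.abs(Xp.mean(0) - Xm.mean(0)) / (Xp.std(0) + Xm.std(0) + 1e-12)

    rng = np.random.default_rng(0)
    y = np.array([+1] * 50 + [-1] * 50)
    X = rng.normal(size=(100, 20))
    X[y == +1, :3] += 1.5                         # three informative features
    pval, fdr = probe_pval_fdr(X, y, s2n_score, rng=rng)
    print("pval of top-ranked real features:", np.round(pval[:5], 3))
    print("FDR of top-ranked real features: ", np.round(fdr[:5], 3))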