Basis Expansion and Nonlinear SVM. Kai Yu


Linear Classifiers
$f(x) = w^\top x + b$, $z(x) = \mathrm{sign}(f(x))$
Linear classifiers help us learn more general cases, e.g., nonlinear models.

Nonlinear Classifiers via Basis Expansion
$f(x) = w^\top h(x) + b$, $z(x) = \mathrm{sign}(f(x))$
Nonlinear basis functions: $h(x) = [h_1(x), h_2(x), \dots, h_m(x)]$
The linear model $f(x) = w^\top x + b$ is the special case where $h(x) = x$.
This viewpoint explains a lot of classification models, including SVMs. A minimal sketch follows.
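To make the idea concrete, here is a minimal sketch (my illustration, not from the slides): fix a basis expansion $h(x)$, then fit an ordinary linear classifier on the expanded features. The toy data, the particular basis functions, and the use of scikit-learn's LinearSVC are all assumptions made for the example.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy data: the label depends on the radius, so no linear separator
# exists in the original coordinates.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 > 0.5, 1, -1)

def h(X):
    """Fixed basis expansion h(x) = [x1, x2, x1^2, x2^2, x1*x2]."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

# A linear classifier f(x) = w^T h(x) + b on the expanded features
# is a nonlinear classifier in the original space.
clf = LinearSVC(C=1.0).fit(h(X), y)
print("training accuracy:", clf.score(h(X), y))
```

Because the radius appears explicitly through $x_1^2$ and $x_2^2$, the expanded data is linearly separable even though the original data is not.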

Outline
- Representation theorem
- Kernel trick
- Understanding regularization
- Nonlinear logistic regression
- General basis expansion functions
- Summary

Review: the QP for linear SVMs
Working through the Lagrangian duality of the SVM quadratic program, we obtain the dual
$L_D = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{i'=1}^{N} \alpha_i \alpha_{i'} y_i y_{i'} x_i^\top x_{i'}$
The solution has the form $w = \sum_i \alpha_i y_i x_i$. In other words, the solution $w$ lies in $\mathrm{span}(x_1, x_2, \dots, x_N)$.

A more general result
RKHS representation theorem (Kimeldorf & Wahba, 1971). In its simplest form: if $L(w^\top x, y)$ is convex w.r.t. $w$, the solution of
$\min_w \sum_i L(w^\top x_i, y_i) + \lambda \|w\|^2$
has the form $w = \sum_i \alpha_i x_i$.
Proof sketch: see below. Note: the conclusion is general, not only for SVMs.
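The slide's proof sketch did not survive transcription; the standard orthogonal-decomposition argument, reconstructed here, is the usual one:

```latex
% Split w into its component in the span of the data and an
% orthogonal remainder; the remainder can only hurt.
\begin{align*}
  w &= w_\parallel + w_\perp, \qquad
      w_\parallel \in \operatorname{span}\{x_1,\dots,x_N\}, \quad
      w_\perp^\top x_i = 0 \;\; \forall i \\
  w^\top x_i &= w_\parallel^\top x_i
      \quad\Rightarrow\quad \textstyle\sum_i L(w^\top x_i, y_i)
      \text{ is unchanged by } w_\perp \\
  \|w\|^2 &= \|w_\parallel\|^2 + \|w_\perp\|^2
      \quad\Rightarrow\quad \text{any minimizer has } w_\perp = 0 \\
  \text{hence } w &= w_\parallel = \textstyle\sum_i \alpha_i x_i .
\end{align*}
```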

For general basis expansion functions
The solution of
$\min_w \sum_i L(w^\top h(x_i), y_i) + \lambda \|w\|^2$
has the form $w = \sum_i \alpha_i h(x_i)$.

Kernel
Define the Mercer kernel as $k(x_i, x_j) = h(x_i)^\top h(x_j)$.

Kernel trick
Applying the representation theorem $w = \sum_i \alpha_i h(x_i)$, we have
$f(x) = \sum_i \alpha_i k(x_i, x)$
$\|w\|^2 = \sum_{i,j=1}^{N} \alpha_i \alpha_j k(x_i, x_j) = \alpha^\top K \alpha$
so the training problem becomes
$\min_\alpha \sum_i L\Big(\sum_j \alpha_j k(x_j, x_i),\, y_i\Big) + \lambda\, \alpha^\top K \alpha$
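As a concrete instance (my illustration, not the slides'): with squared loss, this kernelized objective has a closed-form minimizer, so the whole train/predict loop can be written using kernel evaluations only, never touching $h(x)$. The RBF bandwidth, the toy data, and the regularization weight below are arbitrary choices.

```python
import numpy as np

def rbf_kernel(A, B, c=1.0):
    """k(x, x') = exp(-||x - x'||^2 / c), the radial basis kernel."""
    sq = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
    return np.exp(-sq / c)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sign(np.sin(X[:, 0]))          # nonlinear target

# With squared loss, min_a ||K a - y||^2 + lam * a^T K a has the
# closed-form solution a = (K + lam I)^{-1} y.
lam = 0.1
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

# Predict via f(x) = sum_i alpha_i k(x_i, x): kernel values only.
X_test = np.linspace(-3, 3, 7)[:, None]
f_test = rbf_kernel(X_test, X) @ alpha
print(np.sign(f_test))
```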

Primal and Kernel formulations
Primal: $\min_w \sum_i L(w^\top h(x_i), y_i) + \lambda \|w\|^2$, where $k(x_i, x_j) = h(x_i)^\top h(x_j)$
Kernel: $\min_\alpha \sum_i L\big(\sum_j \alpha_j k(x_j, x_i), y_i\big) + \lambda\, \alpha^\top K \alpha$
Given a kernel, we don't even need $h(x)$! Really?

Popular kernels
$k(x, x')$ must be a symmetric, positive (semi-)definite function. Examples:
- $d$th-degree polynomial: $K(x, x') = (1 + \langle x, x' \rangle)^d$
- Radial basis: $K(x, x') = \exp(-\|x - x'\|^2 / c)$
Example, the 2nd-degree polynomial kernel in two dimensions:
$K(x, x') = (1 + \langle x, x' \rangle)^2 = (1 + x_1 x_1' + x_2 x_2')^2 = 1 + 2 x_1 x_1' + 2 x_2 x_2' + (x_1 x_1')^2 + (x_2 x_2')^2 + 2 x_1 x_1' x_2 x_2'$
With $h_1(x) = 1$, $h_2(x) = \sqrt{2}\, x_1$, $h_3(x) = \sqrt{2}\, x_2$, $h_4(x) = x_1^2$, $h_5(x) = x_2^2$, and $h_6(x) = \sqrt{2}\, x_1 x_2$, we get $K(x, x') = h(x)^\top h(x')$.
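A quick numeric check of this identity (my example; any pair of points works):

```python
import numpy as np

def h(x):
    """Explicit feature map for the 2nd-degree polynomial kernel in 2-D."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
xp = np.array([2.0, 0.5])

lhs = (1 + x @ xp)**2        # kernel evaluation: O(d) work
rhs = h(x) @ h(xp)           # explicit feature map: O(m) work, m > d
print(lhs, rhs)              # both print the same value
```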

Non-linear feature mapping
Datasets that are linearly separable are easy. But what if the dataset is just too hard, i.e., not linearly separable in the original space? Then map the data to a higher-dimensional space, e.g., $x \mapsto (x, x^2)$.
[Figure: 1-D data that no threshold on $x$ can separate becomes linearly separable after mapping each point to $(x, x^2)$.]
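A tiny numeric version of that picture (my example, not the slides'):

```python
import numpy as np

# 1-D data: negatives in the middle, positives on both sides,
# so no single threshold on x separates the classes.
x = np.array([-3., -2., -1., 0., 1., 2., 3.])
y = np.array([ 1.,  1., -1., -1., -1., 1., 1.])

# Map to (x, x^2): now the horizontal line x2 = 2.5 separates them.
H = np.column_stack([x, x**2])
print(H[y == 1, 1])   # squared values of positives: all >= 4
print(H[y == -1, 1])  # squared values of negatives: all <= 1
```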

Nonlinear feature mapping
General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable: $h: x \mapsto h(x)$.

Various equivalent formulations
Parametric form: $\min_w \sum_i L(w^\top h(x_i), y_i) + \lambda \|w\|^2$
Dual form: $\min_\alpha \sum_i L\big(\sum_j \alpha_j k(x_j, x_i), y_i\big) + \lambda\, \alpha^\top K \alpha$
Nonparametric form: $\min_f \sum_i L(f(x_i), y_i) + \lambda \|f\|^2_{\mathcal{H}_k}$
The regularization term $\lambda \|f\|^2_{\mathcal{H}_k}$ tells us what kind of $f(x)$ is preferred.

Regularization induced by kernel (or basis functions)
Eigen expansion: $K(x, y) = \sum_i \gamma_i \phi_i(x) \phi_i(y)$ and $f(x) = \sum_i c_i \phi_i(x)$.
A desired kernel is a smoothing operator: smoother eigenfunctions $\phi_i$ tend to have larger eigenvalues $\gamma_i$.
$\|f\|^2_{\mathcal{H}_K} \stackrel{\text{def}}{=} \sum_i c_i^2 / \gamma_i$
What does this mean?

Understanding regularization
Pushing down the regularization term $\|f\|^2_{\mathcal{H}_K} = \sum_i c_i^2 / \gamma_i$ penalizes the minor components $\phi_i(x)$, those with smaller $\gamma_i$, more heavily, so principal components are preferred in $f(x)$. And since a desired kernel is a smoothing operator, i.e., its principal components are smoother functions, the regularization encourages $f(x)$ to be smooth.
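The eigenvalue/smoothness link can be checked numerically (a sketch under arbitrary assumptions: a 1-D grid, an RBF kernel, and bandwidth 0.1; eigenvectors of the Gram matrix stand in for the eigenfunctions):

```python
import numpy as np

# RBF Gram matrix on a 1-D grid; its eigenvectors approximate the
# kernel's eigenfunctions on [0, 1].
x = np.linspace(0, 1, 200)
K = np.exp(-(x[:, None] - x[None, :])**2 / 0.1)

gamma, phi = np.linalg.eigh(K)          # eigh returns ascending eigenvalues
gamma, phi = gamma[::-1], phi[:, ::-1]  # sort descending

# "Roughness" of each eigenvector: mean absolute first difference.
# Eigenvectors with large gamma_i vary slowly; small gamma_i oscillate.
roughness = np.abs(np.diff(phi, axis=0)).mean(axis=0)
print("top-3 eigenvalues:  ", gamma[:3].round(2))
print("roughness of top-3: ", roughness[:3].round(4))
print("roughness of 50-52: ", roughness[50:53].round(4))
```

The leading eigenvectors come out smooth and the trailing ones oscillatory, which is exactly why penalizing $c_i^2 / \gamma_i$ favors smooth functions.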

Understanding regularization
$\|f\|^2_{\mathcal{H}_K} \stackrel{\text{def}}{=} \sum_i c_i^2 / \gamma_i$
Which kernel to use? Which features (for a linear model)? Which $h(x)$? Which functional norm $\|f\|^2_{\mathcal{H}_k}$? All of these point to one thing: what kind of functions are preferred a priori.

Nonlinear Logistic Regression
Everything discussed so far, including the representation theorem, the kernel trick, and regularization, is not limited to SVMs. It is all applicable to logistic regression. The only difference is the loss function.

Nonlinear Logistic Regression
[Figure: the binomial log-likelihood (logistic) loss and the support vector (hinge) loss, plotted against the margin $y f(x)$.]
Parametric form: $\min_w \sum_i \ln\big(1 + e^{-y_i w^\top h(x_i)}\big) + \lambda \|w\|^2$
Nonparametric form: $\min_f \sum_i \ln\big(1 + e^{-y_i f(x_i)}\big) + \lambda \|f\|^2_{\mathcal{H}_k}$
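Swapping the hinge loss for the logistic loss in the kernelized objective gives kernel logistic regression. A minimal sketch (my illustration: plain gradient descent on $\alpha$, with an arbitrary XOR-like dataset, RBF bandwidth, learning rate, and regularization weight):

```python
import numpy as np

def rbf(A, B, c=1.0):
    """k(x, x') = exp(-||x - x'||^2 / c)."""
    return np.exp(-((A[:, None, :] - B[None, :, :])**2).sum(-1) / c)

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)  # XOR-like, not linearly separable

K = rbf(X, X)
lam, lr = 0.01, 0.01
alpha = np.zeros(len(X))

# Gradient descent on  sum_i ln(1 + exp(-y_i f_i)) + lam * alpha^T K alpha,
# where f = K alpha by the representation theorem.
for _ in range(1000):
    f = K @ alpha
    # dLoss/df_i = -y_i * sigma(-y_i f_i), written in an overflow-safe form.
    g = -y * 0.5 * (1.0 - np.tanh(y * f / 2.0))
    alpha -= lr * (K @ g + 2.0 * lam * K @ alpha)

print("training accuracy:", np.mean(np.sign(K @ alpha) == y))
```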

Logistic Regression vs. SVM
Both can be linear or nonlinear, parametric or nonparametric; the main difference is the loss. They are very similar in performance. Logistic regression outputs probabilities, which is useful for scoring confidence, and it is easier to extend to multiple classes. Ten years ago, one was old and the other was new. Now, both are old.

Many known classification models follow a similar structure
Neural networks, RBF networks, learning vector quantization (LVQ), and boosting all learn $w$ and $h(x)$ together. SVMs, linear classifiers, and logistic regression fit the same structure, but with a fixed $h(x)$.

Develop your own stuff!
By deciding:
- Which loss function? Hinge, least squares, ...
- What form of $h(x)$? RBF, logistic, tree, ...
- Infinite-dimensional $h(x)$ or finite?
- Learning $h(x)$ or not?
- How to optimize? QP, L-BFGS, functional gradient, ...
you can obtain various classification algorithms.

Parametric vs. nonparametric models
If $h(x)$ is finite-dimensional, we have a parametric model $f(x) = w^\top h(x)$; the training complexity is $O(N m^3)$.
If $h(x)$ is nonlinear and infinite-dimensional, we have to use the kernel trick. This is a nonparametric model, and the training complexity is around $O(N^3)$.
Nonparametric models, including kernel SVMs, Gaussian processes, Dirichlet processes, etc., are mathematically elegant but nontrivial for large-scale computation.

Summary
- The representation theorem and kernels.
- Regularization prefers the principal eigenfunctions of the kernel (induced by the basis functions).
- Basis expansion: a general framework for classification models, e.g., nonlinear logistic regression, SVMs, ...