KERNEL MODELS AND SUPPORT VECTOR MACHINES

Kazushi Ikeda
Nara Institute of Science and Technology, Ikoma, Nara, Japan

Keywords: kernel, margin, statistical learning theory

Contents
1. Introduction
2. Kernel Function and Feature Space
3. Representer Theorem
4. Example
5. Pre-Image Problem
6. Properties of Kernel Methods
7. Statistical Learning Theory
8. Support Vector Machines
9. Variations of SVMs
10. Conclusions
Glossary
Bibliography
Biographical Sketches

Summary

Kernel models are a method for introducing nonlinear preprocessing into linear models. A kernel function determines a feature space called a reproducing kernel Hilbert space. The solution of an optimization problem with kernel models is expressed as a linear combination of the given examples thanks to the representer theorem. This means that no explicit feature vectors are required, and hence kernel models are applicable even to sequences or graphs. The first half of this chapter introduces the reader to the fundamentals of kernel models, such as the kernel trick and the representer theorem. The most successful application of kernel models is support vector machines (SVMs). SVMs are derived from statistical learning theory, which does not assume any statistical model beforehand and leads to margin maximization for higher generalization ability. Although an SVM is constructed so that only an upper bound on the generalization error in the worst case is minimized, it performs better than conventional methods on average. Another advantage of SVMs is convexity: an SVM has a unique solution, unlike other hierarchical learning models. The idea of margin maximization has been applied to regression, density estimation and other areas. The second half of this chapter introduces the reader to the derivation of SVMs, their properties, and several variations.

1. Introduction

The simplest model of a neuron is the Perceptron, which calculates the weighted sum of its input signals and outputs +1 when the sum exceeds a threshold and -1 otherwise (Minsky and Papert, 1969).

The model attracted much attention as a binary classifier, or dichotomy, with an adaptive algorithm for a given dataset, called the Perceptron learning rule, because it has good properties such as the convergence theorem, which assures a finite number of iterations for a linearly separable dataset, and the cycling theorem, which proves that the solutions are periodic otherwise.

One way to introduce nonlinear preprocessing to the Perceptron is the multi-layer Perceptron model (MLP), where the threshold function is replaced with a sigmoid function such as the hyperbolic tangent and the preprocessor is autonomously acquired through learning. This model is simple and works very well, but it suffers from many locally optimal solutions and plateaus in a gradient learning method when the cost function is given as the sum of squared errors.

An alternative to the MLP is a kernel model. The preprocessor of a kernel model must be fixed a priori, but it has a large variety of choices. In fact, we need only the inner product of preprocessed data, without describing the data in an explicit form. This enables the feature space to have even infinitely many dimensions, or to treat graphs directly, which leads to a wide range of applications. Nowadays, kernel models are applied not only to classifiers but also to regression problems and other linear signal processing fields.

From the mathematical viewpoint, a kernel model is based on the positive (or nonnegative) definiteness of its kernel. Positive definite kernels appeared in the literature early in the twentieth century, and their general theory was established as the reproducing kernel Hilbert space in the 1950s. However, the idea of kernel models became popular only after it was introduced as a nonlinearity for support vector machines, or SVMs, in the 1990s. An SVM is a linear dichotomy like the Perceptron, but it maximizes the margin, that is, the minimum distance of the examples from the discriminative boundary. The idea of margin maximization is based on Vapnik's statistical learning theory, which minimizes the structural risk in the framework of probably approximately correct (PAC) learning, introduced by Valiant in 1984.

One of the good properties of SVMs is structural risk minimization, or SRM. The prediction error of SVMs is theoretically bounded even in the worst case, and in practice it is often much lower than the bound or than that of MLPs. Another advantage is the uniqueness of the solution: an SVM results in a convex quadratic programming problem, which has a unique solution. Moreover, recent developments in computer science and engineering, such as interior-point methods, make the solvable problem size larger and larger. Thanks to such superiority, SVMs have attracted much attention, and their theoretical properties as well as their variants have been well studied.

2. Kernel Function and Feature Space

A kernel method maps an input vector $x \in X$ to the corresponding feature vector $f(x)$ in a high-dimensional feature space $F$ and processes it linearly, as in the Perceptron, linear regression or principal component analysis (PCA). Let us consider Perceptron learning here as a simple example.

Given the $t$-th example $(x(t), y(t))$, where $x(t) \in \{x_k\}_{k=1}^n$ is the $t$-th input vector and $y(t)$ is the true label for that input vector, the Perceptron learning rule in the feature space $F$ is formulated as

$$w(t) = \begin{cases} w(t-1) + y(t)\, f(x(t)) & \text{if } \operatorname{sgn}\!\left( w(t-1)^\top f(x(t)) \right) \neq y(t), \\ w(t-1) & \text{otherwise}, \end{cases} \qquad (1)$$

where $w(t)$ is the weight vector of the Perceptron at iteration $t$ and $\top$ denotes the transpose. Here, we assume that the weight vector can be expressed as a weighted sum of the inputs, that is,

$$w(t) = \sum_{k=1}^n \alpha_k(t)\, f(x_k). \qquad (2)$$

Then, the algorithm (1) is simply rewritten as

$$\alpha_k(t) = \begin{cases} \alpha_k(t-1) + y(t) & \text{if } y(t) \neq \operatorname{sgn}\!\left( w(t-1)^\top f(x(t)) \right) \text{ and } x(t) = x_k, \\ \alpha_k(t-1) & \text{otherwise}, \end{cases} \qquad (3)$$

for all $k = 1, \dots, n$. This is called the kernel Perceptron. Here, the inner product of $w(t)$ and $f(x)$ for an arbitrary $x \in X$ is calculated as

$$w(t)^\top f(x) = \sum_{k=1}^n \alpha_k(t)\, f(x_k)^\top f(x) = \sum_{k=1}^n \alpha_k(t)\, K(x_k, x), \qquad (4)$$

where $K(x_k, x) = f(x_k)^\top f(x)$ is called a kernel function. Hence, although we do not know the feature vector $f(x)$ itself, we can calculate the inner product. This allows the feature space to be very high-dimensional, or even infinite-dimensional, without any increase in computational complexity; this is called the kernel trick.

One example of a kernel function is the $p$-th order polynomial kernel $K(x, x') = (x^\top x' + 1)^p$. When $x \in \mathbb{R}^2$ and $p = 2$, a possible feature vector is

$$f(x) = \left( x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2,\ \sqrt{2}\, x_1,\ \sqrt{2}\, x_2,\ 1 \right), \qquad (5)$$

and hence $w(t)^\top f(x)$ can express any polynomial function of order two or less.
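To make Eqs. (1)-(5) concrete, here is a minimal Python sketch of the kernel Perceptron (ours, not part of the original chapter). The function names and the XOR dataset are our own choices; XOR is linearly inseparable in the input space but separable in the feature space of the second-order polynomial kernel (5), so the convergence theorem applies in $F$.

```python
import numpy as np

def poly_kernel(x1, x2, p=2):
    """p-th order polynomial kernel K(x, x') = (x^T x' + 1)^p."""
    return (np.dot(x1, x2) + 1.0) ** p

def kernel_perceptron(X, y, kernel=poly_kernel, epochs=20):
    """Kernel Perceptron: one coefficient alpha_k per example, Eq. (3)."""
    n = len(X)
    alpha = np.zeros(n)
    for _ in range(epochs):
        errors = 0
        for t in range(n):
            # w(t-1)^T f(x(t)) computed via the kernel trick, Eq. (4)
            s = sum(alpha[k] * kernel(X[k], X[t]) for k in range(n))
            if np.sign(s) != y[t]:   # misclassified: update its coefficient
                alpha[t] += y[t]
                errors += 1
        if errors == 0:              # converged (the dataset is separable in F)
            break
    return alpha

def predict(alpha, X, x_new, kernel=poly_kernel):
    """Classify x_new by the sign of sum_k alpha_k K(x_k, x_new)."""
    return np.sign(sum(a * kernel(xk, x_new) for a, xk in zip(alpha, X)))

# XOR: linearly inseparable in input space, separable under the kernel (5)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0, -1.0])
alpha = kernel_perceptron(X, y)
print([predict(alpha, X, x) for x in X])   # [-1.0, 1.0, 1.0, -1.0]
```

Note that neither training nor prediction ever forms the feature vector $f(x)$ explicitly; only kernel evaluations are needed.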

Another popular kernel function is the Gaussian kernel $K(x, x') = \exp\!\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)$, whose feature vector is essentially the Fourier transform, as shown below. Suppose that the feature of $x$ indexed by $\omega \in \mathbb{R}$ is $\exp(i \omega x)$, where $\omega$ and $x$ are one-dimensional for simplicity. Then, the inner product of $\exp(i \omega x)$ and $\exp(i \omega x')$ is calculated as

$$\begin{aligned} K(x, x') &= \frac{1}{\sqrt{2\pi}} \int \exp(i\omega x)\, \overline{\exp(i\omega x')}\, \exp\!\left( -\frac{\omega^2}{2} \right) d\omega \\ &= \frac{1}{\sqrt{2\pi}} \int \exp\!\left( -\frac{(\omega - i(x - x'))^2}{2} \right) \exp\!\left( -\frac{(x - x')^2}{2} \right) d\omega \\ &= \exp\!\left( -\frac{(x - x')^2}{2} \right), \end{aligned} \qquad (6)$$

where $\frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{\omega^2}{2} \right)$ is a measure on the space of $\omega$.

An advantage of kernel models is that we need not consider the feature space of a kernel function explicitly. This property makes it possible to apply kernel models to spaces of statistical models, sequences, and even graphs. In fact, Mercer's theorem assures the existence of an inner product space $F$ for any positive semi-definite (or non-negative definite) kernel function $K(\cdot,\cdot)$, where a symmetric continuous function $K(\cdot,\cdot)$ is said to be positive semi-definite if and only if it satisfies

$$\sum_{k=1}^n \sum_{m=1}^n c_k c_m K(x_k, x_m) \geq 0 \qquad (7)$$

for any vectors $\{x_k\}_{k=1}^n$, any real numbers $\{c_k\}_{k=1}^n$ and any positive integer $n$. The Gaussian kernel, for example, has a feature space thanks to this property, although the space is difficult to describe explicitly.

From the practical viewpoint, however, proving that a kernel function is positive semi-definite is more difficult than constructing a feature space explicitly. Hence, popular kernel functions are simple ones, such as the polynomial kernel and the Gaussian kernel, or their combinations. In fact, when two functions $K_1$ and $K_2$ are positive semi-definite, so are the following functions:

$$c_1 K_1 + c_2 K_2 \ \ \text{for } c_1, c_2 \geq 0, \qquad K_1 K_2, \qquad \exp(K_1), \qquad g(x)\, K_1(x, x')\, g(x'). \qquad (8)$$
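Condition (7), restricted to a finite sample, says that the Gram matrix $K_{km} = K(x_k, x_m)$ must be positive semi-definite, that is, all of its eigenvalues must be non-negative. The following sketch (our illustration, not from the original text; the random data and the choice $g(x) = \sin(x_1)$ are arbitrary) checks this numerically for the four combinations listed in (8):

```python
import numpy as np

def gauss(x1, x2, sigma=1.0):
    """Gaussian kernel K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x1 - x2) ** 2) / (2.0 * sigma ** 2))

def poly(x1, x2, p=3):
    """Polynomial kernel K(x, x') = (x^T x' + 1)^p."""
    return (np.dot(x1, x2) + 1.0) ** p

def gram(kernel, X):
    """Gram matrix of the sample; Eq. (7) requires it to be PSD."""
    return np.array([[kernel(xk, xm) for xm in X] for xk in X])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
K1, K2 = gram(gauss, X), gram(poly, X)
g = np.sin(X[:, 0])                       # an arbitrary function g(x)

combinations = {                           # the four cases of Eq. (8)
    "c1*K1 + c2*K2": 0.5 * K1 + 2.0 * K2,
    "K1 * K2":       K1 * K2,              # elementwise product
    "exp(K1)":       np.exp(K1),           # elementwise exponential
    "g K1 g":        np.outer(g, g) * K1,
}
for name, K in combinations.items():
    lam_min = np.linalg.eigvalsh(K).min()
    print(f"{name}: min eigenvalue = {lam_min:.2e}")  # >= 0 up to rounding
```

A non-negative spectrum on a sample is only a necessary condition, but it is a cheap and useful sanity check when composing kernels in practice.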

The polynomial kernel as well as the Gaussian kernel can be proven positive semi-definite using (8). What kernel should we use? The answer strongly depends on the application under consideration, but an alternative approach is kernel learning, that is, the automatic selection of an optimal kernel from data. One typical example is to optimize the coefficients $\{c_i\}_{i=1}^n$ of a weighted sum of kernels, $\sum_{i=1}^n c_i K_i$, so that the performance of the learning machine becomes best.

3. Representer Theorem

In the previous section, we assumed that the weight vector could be expressed as a weighted sum of the inputs, that is,

$$w(x) = \sum_{k=1}^n \alpha_k K(x, x_k). \qquad (9)$$

A general theory justifying this assumption is the representer theorem, which says that the optimization problem below has a solution $w$ of the form (9):

$$\min_{w \in H,\ \{c_i\} \subset \mathbb{R}} \ \sum_{k=1}^n L\!\left( x_k, y_k, w(x_k) + \sum_i c_i h_i(x_k) \right) + \Psi\!\left( \|w\|_H \right), \qquad (10)$$

where $\{(x_k, y_k)\}_{k=1}^n$ is a dataset, $K$ is a kernel function of $x$, $H$ is the reproducing kernel Hilbert space of $K$, $\{h_i\}$ is a set of basis functions of $x$, and $\Psi(\cdot)$ is a monotonically increasing function (a regularization term). Intuitively speaking, any component of $w$ in the complement space $H_0^\perp$ of $H_0 = \operatorname{span}\!\left( \{K(\cdot, x_k)\}_{k=1}^n \right)$ does not change the cost $L$ but only increases the regularization term $\Psi$.

One simple example of $L$ is

$$\min_{w \in H} \ \sum_{k=1}^n \left( y_k - w(x_k) \right)^2 + \lambda \|w\|_H^2, \qquad (11)$$

and another example is the support vector machine, shown later. Thanks to the kernel trick, the computational complexity of kernel methods does not depend on the dimension of the feature space but only on the number of examples.
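For the regularized least-squares problem (11), the representer theorem can be made concrete: substituting $w(x) = \sum_k \alpha_k K(x, x_k)$ turns (11) into a finite-dimensional quadratic problem whose minimizer solves $(K + \lambda I)\alpha = y$, where $K$ is the Gram matrix. The sketch below (our illustration; the noisy-sine data and the function names are hypothetical) implements this, a method commonly known as kernel ridge regression:

```python
import numpy as np

def gauss(x1, x2, sigma=0.5):
    """One-dimensional Gaussian kernel."""
    return np.exp(-(x1 - x2) ** 2 / (2.0 * sigma ** 2))

def fit_kernel_ridge(x, y, lam=0.1, kernel=gauss):
    """Solve (11): by the representer theorem w(x) = sum_k alpha_k K(x, x_k),
    and the optimal coefficients satisfy (K + lam I) alpha = y."""
    K = np.array([[kernel(xk, xm) for xm in x] for xk in x])
    return np.linalg.solve(K + lam * np.eye(len(x)), y)

def predict(alpha, x_train, x_new, kernel=gauss):
    """Evaluate w(x_new) = sum_k alpha_k K(x_k, x_new)."""
    return sum(a * kernel(xk, x_new) for a, xk in zip(alpha, x_train))

# Noisy sine data (hypothetical); note that fitting involves only the
# n x n Gram matrix, never an explicit feature vector
rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0 * np.pi, 30)
y = np.sin(x) + 0.1 * rng.normal(size=x.shape)
alpha = fit_kernel_ridge(x, y)
print(predict(alpha, x, np.pi / 2.0))   # close to sin(pi/2) = 1
```

As stated above, the cost of the fit is governed by the number of examples $n$ (an $n \times n$ linear system), not by the dimension of the feature space, which is infinite for the Gaussian kernel.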

Glossary

Multi-layer Perceptron (MLP): A Perceptron with hidden layers in which the threshold function is replaced with a sigmoid function. It can approximate any continuous function with arbitrary precision if it has enough hidden units.
PAC learning: Probably Approximately Correct learning. It considers the condition under which the probability that the error of a learning algorithm exceeds a specific value is bounded by a specific value.
Perceptron: A linear dichotomy. It outputs the sign of the inner product between a weight vector and an input vector.
Principal Component Analysis (PCA): A dimensionality reduction method. It keeps the space spanned by the eigenvectors with large eigenvalues (principal components) and removes the components induced by noise.
Representer Theorem: A theorem that assures that the solution of an optimization problem is described as a weighted sum of the data. This is the theoretical basis of kernel models.
Support Vector Machine (SVM): The linear dichotomy that maximizes the margin to the given examples. Since such a dichotomy does not exist for a linearly inseparable dataset, there are variations that cope with this difficulty.
Support Vector Regression (SVR): The linear regression that employs the ε-insensitive error function. It is robust against outliers thanks to the L1-norm regularizer and has a small number of support vectors thanks to the insensitivity.

Bibliography

Minsky M.L., Papert S.A. (1969). Perceptrons, MIT Press, Cambridge, MA. [This is a comprehensive textbook on Perceptrons].
Rumelhart D.E., McClelland J.L. (1986). Parallel Distributed Processing, MIT Press, Cambridge, MA. [This presents a theory of multi-layer Perceptrons and their applications].
Fukumizu K. (2010). Introduction to Kernel Methods, Asakura-Shoten, Tokyo. [This presents the theoretical background of kernel methods for beginners. Note that it is written in Japanese].
Watanabe S. (2001). Algebraic Analysis for Nonidentifiable Learning Machines. Neural Computation, 13(4), 899-933. [This is an epoch-making paper introducing algebraic geometry to machine learning theory].
Vapnik V.N. (1995). The Nature of Statistical Learning Theory, Springer, Heidelberg. [This is the first book introducing support vector machines].
Valiant L.G. (1984). A Theory of the Learnable. Communications of the ACM, 27(11), 1134-1142. [This paper presents the framework of PAC learning].
Kwok J.T., Tsang I.W. (2004). The Pre-Image Problem in Kernel Methods, IEEE Transactions on Neural Networks, 15(6), 1517-1525. [This gives a concrete method to cope with the pre-image problem of kernel methods].
Schoelkopf B., Smola A.J., Williamson R.C., Bartlett P.L. (2000). New Support Vector Algorithms, Neural Computation, 12, 1207-1245. [This paper introduces the framework of the ν-SVMs].
Bennett K.P., Bredensteiner E.J. (2000). Duality and Geometry in SVM Classifiers, Proc. ICML, 57-64. [This is the first paper to show another view of the ν-SVMs, namely the convex hulls of the data].
Ikeda K. (2004). An Asymptotic Statistical Theory of Polynomial Kernel Methods. Neural Computation, 16(8), 1705-1719. [This paper resolves the contradiction among theoretical results on SVMs].

Ikeda K., Aonishi T. (2005). An Asymptotic Statistical Analysis of Support Vector Machines with Soft Margins, Neural Computation, 18, 251-259. [This paper gives a theoretical analysis of the effect of soft margins].
Schoelkopf B., Platt J.C., Shawe-Taylor J., Smola A.J. (2001). Estimating the Support of a High-Dimensional Distribution, Neural Computation, 13, 1443-1471. [This paper applies the idea of support vector machines to density estimation by formulating it as a one-class SVM].

Biographical Sketch

Kazushi Ikeda received the B.E., M.E. and Ph.D. degrees in Mathematical Engineering and Information Physics from the University of Tokyo, Tokyo, Japan, in 1989, 1991 and 1994, respectively. He then joined the Department of Electrical and Computer Engineering, Kanazawa University, Kanazawa, Japan, as an Assistant Professor, and moved to the Department of Systems Science, Kyoto University, Kyoto, Japan, as an Associate Professor, in 1998. Since 2008, he has been a Professor in the Graduate School of Information Science, Nara Institute of Science and Technology, Nara, Japan.

His research interests include machine learning theory, such as support vector machines and information geometry, and its applications to adaptive systems and brain informatics. He is currently the Editor-in-Chief of the Journal of the Japanese Neural Network Society (JNNS), an Action Editor of Neural Networks (Elsevier), and an Associate Editor of IEEE Transactions on Neural Networks and Learning Systems and IEICE Transactions on Information and Systems. He has served as a member of the Board of Governors of JNNS.