CSE 252C: Computer Vision III


Lecturer: Serge Belongie
Scribe: Catherine Wah

LECTURE 15
Kernel Machines

15.1. Kernels

We will study two methods based on a special kind of function k(x, y) called a kernel: Kernel PCA (to help build our intuition) and Support Vector Machines, which are a powerful discriminative learning method.

What is a kernel? In this context, it is a function that satisfies the following condition:

(15.1)    \sum_{i,j} k(x_i, x_j) c_i c_j \geq 0

for any finite set of feature vectors x_i and any real vector c. We say the kernel is positive semi-definite, or that it satisfies Mercer's condition.

Department of Computer Science and Engineering, University of California, San Diego. December 31,
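Condition (15.1) can be checked numerically for a given kernel and data set. The sketch below is only an illustration under stated assumptions: it uses a Gaussian kernel with an arbitrary width sigma, random 2D points, and helper names (gaussian_kernel, gram_matrix) that are not part of the notes. It evaluates the quadratic form c^T K c for one random c and also checks the equivalent statement that the Gram matrix K has no negative eigenvalues.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

def gram_matrix(X, kernel):
    """Gram matrix K with K[i, j] = kernel(X[i], X[j])."""
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))          # 20 arbitrary feature vectors in R^2
K = gram_matrix(X, gaussian_kernel)

# Condition (15.1): sum_ij k(x_i, x_j) c_i c_j = c^T K c >= 0 for any c.
c = rng.normal(size=20)
quad_form = c @ K @ c

# Equivalent check: every eigenvalue of the symmetric matrix K is >= 0
# (up to floating-point round-off).
min_eig = np.linalg.eigvalsh(K).min()
```

A single random c does not prove positive semi-definiteness; the eigenvalue check is the stronger test for this particular set of points.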

15.2. Kernel Trick

What is interesting about this, and what forms the basis of the so-called kernel trick (Aizerman et al., 1964), is that kernels that satisfy this condition are equivalent to a simple dot product in some higher dimensional space:

(15.2)    k(x, y) = \phi(x) \cdot \phi(y),

where φ(·) maps a vector from input space to feature space (usually higher, or infinite, dimensional). In some cases we specify φ(·); in other cases, it is implicit for a choice of k(·, ·).

The kernel trick refers to the idea of swapping in k(x, y) for x · y in algorithms that call for a dot product between two vectors, for example, the perceptron and SVMs (supervised) and PCA (unsupervised). In some cases, it may not be obvious that the algorithms, at their heart, come down to dot products, but these and several others have been formulated in such a way as to exhibit this characteristic.

So what does this buy us? In the supervised case, classes that aren't linearly separable in input space might be linearly separable in feature space (Figure 1). Consider the mapping

    \phi : (x_1, x_2)^T \mapsto (x_1, x_2, x_1^2 + x_2^2)^T,

where x ∈ R^2 and φ(x) ∈ R^3. In this way, by adding a new dimension, it becomes possible to separate the clump from the annulus.

Figure 1. The clump and annulus problem. These classes are linearly inseparable in the input space. Note: a quadratic surface would do the trick here, but linear algorithms are simpler and better understood.

This hints to us that high dimensionality can be good (greater classification power)! Your instinct might be telling you that high dimensionality can also get you into trouble, conceptually or computationally; this instinct is correct.
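The clump and annulus example can be reproduced in a few lines. The following is a minimal sketch under assumptions of my own: the radii of the two classes, the separating plane at x_1^2 + x_2^2 = 2.25, and the names phi and k are illustrative choices, not values from the lecture. It shows that a plane separates the classes after the mapping, and that the kernel k(x, y) = x · y + ||x||^2 ||y||^2 induced by this mapping gives the same number as the explicit dot product in feature space.

```python
import numpy as np

def phi(X):
    """Explicit map (x1, x2) -> (x1, x2, x1^2 + x2^2), applied row-wise."""
    X = np.asarray(X, dtype=float)
    return np.column_stack([X[:, 0], X[:, 1], (X ** 2).sum(axis=1)])

def k(x, y):
    """Kernel induced by phi: k(x, y) = x.y + ||x||^2 ||y||^2."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return x @ y + (x @ x) * (y @ y)

rng = np.random.default_rng(0)
n = 100
# Clump: radius in [0, 1). Annulus: radius in [2, 3). (Assumed toy values.)
r = np.concatenate([rng.uniform(0.0, 1.0, n), rng.uniform(2.0, 3.0, n)])
theta = rng.uniform(0.0, 2.0 * np.pi, 2 * n)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
labels = np.concatenate([-np.ones(n), np.ones(n)])   # -1 clump, +1 annulus

# In feature space the plane "third coordinate = 2.25" separates the classes:
# w . phi(x) - b = 0 with w = (0, 0, 1) and b = 2.25.
w, b = np.array([0.0, 0.0, 1.0]), 2.25
predictions = np.sign(phi(X) @ w - b)

# Kernel trick: k(x, y) equals the dot product phi(x) . phi(y).
x, y = X[0], X[150]
lhs = k(x, y)
rhs = phi(x[None, :])[0] @ phi(y[None, :])[0]
```

Here the feature map is cheap enough to compute explicitly; the point of the kernel form is that it never needs the third coordinate to be materialized.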

This idea is the subject of Vapnik-Chervonenkis (VC) theory, a critical component of statistical learning theory, which addresses questions such as how best to perform classification given nothing but labelled example data (with no prior knowledge). We won't study VC theory in this class, but it provides us with answers to questions such as: under what conditions are high dimensional representations good?

Supposing we restrict ourselves to cases in which the high dimensional mapping is good, there is still the problem of computation time. For example, a simple polynomial mapping on MNIST digits can easily explode the dimensionality into the billions. This is where the kernel trick comes in. Except for these toy examples, in practice, one never makes explicit use of φ(·), or even necessarily needs to know what it is. We only need to know that it exists, which Mercer's condition answers for us.

This was foreshadowed in the second homework, where we considered the Mahalanobis distance between two vectors:

(15.3)    (x_i - x_j)^T \Sigma^{-1} (x_i - x_j) = (y_i - y_j)^T (y_i - y_j) = \|y_i - y_j\|^2

where, if Σ^{-1} is positive semi-definite, we can write Σ^{-1} = QQ^T and set y = Q^T x. In this case, Q represents a simple case of a feature mapping; the squared Mahalanobis distance is equivalent to a dot product (of y_i − y_j with itself) in a different space.

15.3. Kernel PCA

Now let's look at an application of the kernel trick: Kernel PCA (Schölkopf et al., 1998). This is an unsupervised example, used for dimensionality reduction when the data doesn't live on a linear manifold. We use the kernel trick to perform classic PCA in high dimensional space, then map it back to input space, where the axes will appear curved (Figure 2).

Before we kernelize PCA, let's review how classic PCA works (assuming centered data: \sum_{i=1}^{N} x_i = 0). PCA finds the eigenvectors of the covariance matrix C:

(15.4)    C = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T.

By construction, C is positive semidefinite. Its eigenvectors, λv = Cv, are used to capture the principal axes of variation.
We can use C in the following way:
(a) Find the eigenvectors and eigenvalues, sorted in decreasing order by eigenvalue, possibly truncating.
(b) Project test data onto the eigenvectors.
(c) Use the projections for classification, denoising, compression, etc.
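Steps (a) to (c) can be sketched with NumPy as follows. This is an illustrative sketch, not code from the course: the function names pca and project, the synthetic data stretched along the direction (1, 1), and the choice of a single retained component are all assumptions made for the example.

```python
import numpy as np

def pca(X, num_components):
    """Classic PCA on the rows of X. Returns (mean, eigenvalues, eigenvectors)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                            # center the data
    C = Xc.T @ Xc / len(Xc)                  # covariance matrix, as in (15.4)
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:num_components]   # (a) sort and truncate
    return mean, eigvals[order], eigvecs[:, order]

def project(X, mean, eigvecs):
    """(b) Project (test) data onto the retained eigenvectors."""
    return (np.asarray(X, dtype=float) - mean) @ eigvecs

rng = np.random.default_rng(0)
# Synthetic 2D data stretched along the direction (1, 1), plus small noise.
t = rng.normal(size=200)
X_train = np.column_stack([t, t]) + 0.1 * rng.normal(size=(200, 2))

mean, eigvals, eigvecs = pca(X_train, num_components=1)
coeffs = project(X_train, mean, eigvecs)     # (c) 1D codes, e.g. for compression
reconstruction = coeffs @ eigvecs.T + mean   # map the codes back to input space
```

Keeping one component here amounts to compression: each 2D point is stored as a single coefficient, and the reconstruction discards only the small off-axis noise.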

Figure 2. The basic idea of kernel PCA with 2D inputs. In some high-dimensional feature space F (bottom right), we are performing linear PCA, just like a PCA in input space (top). Since F is nonlinearly related to the input space (via Φ), the contour lines of constant projections onto the principal eigenvector become nonlinear in input space (Schölkopf et al., 1998).

We'd like to be able to do this for nonlinear data; the solution is to kernelize. In order to do so, we start by expressing PCA in terms of dot products:

(15.5)    Cv = \frac{1}{N} \sum_i x_i x_i^T v = \lambda v,

therefore

(15.6)    v = \frac{1}{N\lambda} \sum_j x_j x_j^T v = \frac{1}{N\lambda} \sum_j (x_j \cdot v) x_j.

(Note here that (x_j · v) is just a scalar.) Consequently, all solutions v with λ ≠ 0 lie in the span of {x_i}, that is, v = \sum_i α_i x_i, where the coefficients α_i = (x_i · v)/(Nλ) are scaled dot products.
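The claim that every eigenvector with nonzero eigenvalue lies in the span of the data can be verified directly from (15.6). The sketch below assumes random centered 3D data (an arbitrary choice for illustration) and rebuilds the leading eigenvector from the coefficients α_i = (x_i · v)/(Nλ).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
X = rng.normal(size=(N, 3))
X -= X.mean(axis=0)                     # centered data; the rows are x_i

C = X.T @ X / N                         # covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)
lam, v = eigvals[-1], eigvecs[:, -1]    # largest eigenpair, lam > 0

alpha = (X @ v) / (N * lam)             # alpha_i = (x_i . v) / (N lam)
v_from_span = X.T @ alpha               # sum_i alpha_i x_i
```

The two vectors v and v_from_span agree up to round-off, which is what allows the eigenproblem to be restated in terms of the coefficients α alone.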

Now suppose we pick some mapping φ(·). Assume we center the data (a homework problem), and as before, we write

(15.7)    C = \frac{1}{N} \sum_i \phi(x_i) \phi(x_i)^T,

which is possibly high dimensional, and we diagonalize it as λv = Cv. Like before, v lies in the span of the φ(x_i)'s. Now we dot both sides by φ(x_j):

(15.8)    \lambda (\phi(x_j) \cdot v) = \phi(x_j) \cdot Cv \quad \text{for all } j,

and note that v = \sum_i α_i φ(x_i). The motivation for this is that we want to avoid explicit use of φ(·); we only want to use it via k(·, ·). Therefore,

(15.9)    \lambda \sum_i \alpha_i \left( \phi(x_j) \cdot \phi(x_i) \right) = \frac{1}{N} \sum_i \alpha_i \left( \phi(x_j) \cdot \sum_l \phi(x_l) \left( \phi(x_l) \cdot \phi(x_i) \right) \right),

or

(15.10)    N \lambda K \alpha = K^2 \alpha

where K_{ij} = φ(x_i) · φ(x_j) and α is the vector of the α_i's. Note that when K is full rank (invertible), it follows that

(15.11)    N \lambda \alpha = K \alpha.

This is our eigenvector problem, now on K instead of the covariance matrix. So in the end, kernel PCA just requires us to diagonalize K instead of C, which is a different (and bigger) covariance matrix defined on vectors in a high dimensional space. In practice, the data needs to be centered first, which can be done via a simple operation on K; see Homework 4.

How do we use these principal components? Consider some test point x (e.g., any point in R^2). To project it onto the eigenvectors, we compute:

(15.12)    v^k \cdot \phi(x) = \sum_i \alpha_i^k k(x_i, x)

for the kth kernel principal component coefficient.

We demonstrate this for a 2D toy example with 3 Gaussian clumps, using a Gaussian kernel (Figure 3). Recall that regular PCA would just give two orthogonal axes in R^2. Surface brightness shows the value of the kth eigenvector projection; for linear PCA you would just have 2 ramps. Note the similarity to spectral clustering; NCut used D^{-1/2} W D^{-1/2} instead of centering in feature space. The projection allows extrapolation to the full plane; this is the same thing as the Nyström extension, which is used in spectral clustering. Note that in this example, each cluster is first found, then a local x-y coordinate system is extracted for it.
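The whole procedure, from the Gram matrix to the projection (15.12), fits in a short NumPy sketch. Several pieces here are assumptions rather than content of the notes: the cluster centers, the kernel width sigma = 0.5, the function names, and in particular the centering formula and the unit-norm scaling of α, which follow the standard kernel PCA recipe (the notes defer centering to Homework 4).

```python
import numpy as np

def rbf_gram(A, B, sigma):
    """Gaussian kernel matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def kernel_pca_fit(X, sigma, num_components):
    """Diagonalize the centered Gram matrix, as in (15.11)."""
    N = len(X)
    K = rbf_gram(X, X, sigma)
    ones = np.full((N, N), 1.0 / N)
    # Centering in feature space (assumed standard formula).
    Kc = K - ones @ K - K @ ones + ones @ K @ ones
    eigvals, eigvecs = np.linalg.eigh(Kc)        # eigenvalues equal N * lambda
    order = np.argsort(eigvals)[::-1][:num_components]
    eigvals, alphas = eigvals[order], eigvecs[:, order]
    # Scale alpha^k so that v^k has unit norm: (alpha^k)^T Kc alpha^k = 1.
    # This assumes the retained eigenvalues are strictly positive.
    alphas = alphas / np.sqrt(eigvals)
    return {"X": X, "K": K, "alphas": alphas, "eigvals": eigvals, "sigma": sigma}

def kernel_pca_project(model, X_test):
    """Projection (15.12): v^k . phi(x) = sum_i alpha_i^k k(x_i, x), centered."""
    X, K, sigma = model["X"], model["K"], model["sigma"]
    K_test = rbf_gram(X_test, X, sigma)          # shape (M, N)
    # Center the test kernel values consistently with the training centering.
    K_test_c = (K_test - K_test.mean(axis=1, keepdims=True)
                - K.mean(axis=0)[None, :] + K.mean())
    return K_test_c @ model["alphas"]

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])   # three toy clumps
X = np.vstack([c + 0.2 * rng.normal(size=(30, 2)) for c in centers])

model = kernel_pca_fit(X, sigma=0.5, num_components=2)
train_proj = kernel_pca_project(model, X)                        # shape (90, 2)
grid_proj = kernel_pca_project(model, np.array([[1.0, 1.0]]))    # any point in R^2
```

Evaluating kernel_pca_project on a dense grid of points in the plane would give surfaces of the kind described for the toy example, since the projection is defined for any x and not only for the training points.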

Figure 3. Toy example with three data clusters; the first eight nonlinear principal components are extracted with a radial basis function (i.e., Gaussian) for the kernel. Note that the first two principal components (top left) nicely separate the three clusters (Schölkopf et al., 1998).

For denoising, we use a fixed point iteration to find the preimage; this is related to mean-shift.

15.4. Support Vector Machines

Now let's look at a discriminative example: the support vector machine (SVM). Now we have labeled data (positive and negative):

(15.13)    \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}, \quad y_i \in \{+1, -1\}

In the linear case, we want to find a separating hyperplane with maximum margin, i.e., we want a hyperplane with maximal distance to the nearest data point (Figure 4). The linear classifier (decision boundary) has the form:

(15.14)    w \cdot x - b = 0,

where w is the normal vector for the hyperplane and b is the offset. SVM learning requires the solution of a quadratic programming problem (beyond the scope of this class) that returns the data points that prop up the hyperplanes that define the margin, parallel to the separating hyperplane (the support vectors). This can be generalized to the soft-margin case, when perfect separation is not possible.

The extension to cases requiring a nonlinear separation boundary requires kernelizing; the linear separating boundary looks curved in input space. When testing which side of the decision boundary a point is on, we use a kernel in place of the dot product, which we can think of as the distance (or similarity) to exemplars right on either side of the margin; e.g., for face gender classification, these are highly hermaphroditic faces ±ε.
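As a rough illustration of the linear classifier in (15.14), the sketch below trains a soft-margin linear SVM. It deliberately swaps the quadratic programming solver mentioned above for plain subgradient descent on the hinge loss, which is simpler but only approximates the maximum-margin solution. The function name, the regularization weight reg, the step size, and the two Gaussian blobs are all assumptions made for this example.

```python
import numpy as np

def train_linear_svm(X, y, reg=0.01, lr=0.1, epochs=500):
    """Soft-margin linear SVM by subgradient descent on the hinge loss.

    Minimizes reg/2 * ||w||^2 + mean(max(0, 1 - y_i (w . x_i - b))).
    """
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w - b)
        active = margins < 1.0                  # points inside or beyond the margin
        grad_w = reg * w - (y[active, None] * X[active]).sum(axis=0) / N
        grad_b = y[active].sum() / N
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(0)
# Two well-separated blobs (assumed toy data), labeled -1 and +1.
X = np.vstack([rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(40, 2)),
               rng.normal(loc=[2.0, 2.0], scale=0.5, size=(40, 2))])
y = np.concatenate([-np.ones(40), np.ones(40)])

w, b = train_linear_svm(X, y)
predictions = np.sign(X @ w - b)                # which side of w . x - b = 0
accuracy = np.mean(predictions == y)
```

A kernelized version would replace every dot product with training points by k(x_i, x), in the same way as the kernel PCA projection, so that the decision depends on the test point only through kernel evaluations.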

Figure 4. Maximum-margin hyperplane and margins for an SVM trained with samples from two classes. Samples on the margin are called the support vectors. (Source: Wikipedia, "Support vector machine".)

To summarize, SVMs are a highly general method with excellent results. The usual questions of what kernel to use, what value of σ, etc. can be addressed with experience and cross validation.
