Chapter 5 The Kernel Method

Before we can mine data, it is important to first find a suitable data representation that facilitates data analysis. For example, for complex data such as text, sequences, images, and so on, we must typically extract or construct a set of attributes or features, so that we can represent the data instances as multivariate vectors. That is, given a data instance x (e.g., a sequence), we need to find a mapping $\phi$, so that $\phi(x)$ is the vector representation of x. Even when the input data is a numeric data matrix, if we wish to discover non-linear relationships among the attributes, then a non-linear mapping $\phi$ may be used, so that $\phi(x)$ represents a vector in the corresponding high-dimensional space comprising non-linear attributes. We use the term input space to refer to the data space for the input data x, and feature space to refer to the space of the mapped vectors $\phi(x)$. Thus, given a set of data objects or instances $x_i$, and given a mapping function $\phi$, we can transform them into feature vectors $\phi(x_i)$, which then allows us to analyze complex data instances via numeric analysis methods.

Example 5.1 (Sequence-based Features): Consider a dataset of DNA sequences over the alphabet $\Sigma = \{A, C, G, T\}$. One simple feature space is to represent each sequence in terms of the probability distribution over symbols in $\Sigma$. That is, given a sequence x with length $|x| = m$, the mapping into feature space is given as
$$\phi(x) = \{P(A),\, P(C),\, P(G),\, P(T)\}$$
where $P(s) = n_s / m$ for $s \in \Sigma$, and $n_s$ is the number of times symbol s appears in sequence x. Here the input space is the set of sequences $\Sigma^*$, and the feature space is $\mathbb{R}^4$. For example, if $x = ACAGCAGTA$, with $m = |x| = 9$, since A occurs four times, C and G occur twice, and T occurs once, we have
$$\phi(x) = (4/9,\; 2/9,\; 2/9,\; 1/9) = (0.44,\; 0.22,\; 0.22,\; 0.11)$$

Likewise, for $y = AGCAAGCGAG$, we have
$$\phi(y) = (4/10,\; 2/10,\; 4/10,\; 0) = (0.4,\; 0.2,\; 0.4,\; 0)$$
The mapping $\phi$ now allows one to compute statistics over the data sample, and make inferences about the population. For example, we may compute the mean symbol composition. We can also define the distance between any two sequences, for example,
$$\delta(x, y) = \|\phi(x) - \phi(y)\| = \sqrt{(0.44 - 0.4)^2 + (0.22 - 0.2)^2 + (0.22 - 0.4)^2 + (0.11 - 0)^2} = 0.22$$
We can obtain larger feature spaces by considering, for example, the probability distribution over all substrings or words of size up to k over the alphabet $\Sigma$, and so on.

Example 5.2 (Non-linear Features): As an example of a non-linear mapping, consider the mapping $\phi$ that takes as input a vector $x = (x_1, x_2)^T \in \mathbb{R}^2$ and maps it to a quadratic feature space via the non-linear mapping
$$\phi(x) = (x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2)^T \in \mathbb{R}^3$$
For example, the point $x = (5.9, 3)^T$ is mapped to the vector
$$\phi(x) = (5.9^2,\; 3^2,\; \sqrt{2} \cdot 5.9 \cdot 3)^T = (34.81,\; 9,\; 25.03)^T$$
The main benefit of this transformation is that we may apply well-known linear analysis methods in the feature space. However, because the features are non-linear combinations of the original attributes, this allows us to mine non-linear patterns and relationships.
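The two feature maps above are simple enough to compute directly. The following is a minimal sketch in Python/NumPy (assumed available; the helper names composition_map and quadratic_map are illustrative choices, not from the text) that reproduces the numbers of Examples 5.1 and 5.2.

```python
import numpy as np

def composition_map(seq, alphabet="ACGT"):
    """Map a sequence to its symbol-composition vector (P(A), P(C), P(G), P(T))."""
    return np.array([seq.count(s) for s in alphabet]) / len(seq)

def quadratic_map(x):
    """Map (x1, x2) to the quadratic feature space (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

phi_x = composition_map("ACAGCAGTA")     # [0.444 0.222 0.222 0.111]
phi_y = composition_map("AGCAAGCGAG")    # [0.4 0.2 0.4 0. ]
print(np.linalg.norm(phi_x - phi_y))     # distance between the sequences, about 0.22
print(quadratic_map(np.array([5.9, 3.0])))  # [34.81  9.   25.03]
```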

Whereas mapping into feature space allows one to analyze the data via algebraic and probabilistic modeling, the resulting feature space is usually very high-dimensional; it may even be infinite-dimensional (as we shall see below). Thus, transforming all the input points into feature space can be very expensive, or even impossible. Because the dimensionality is high, we also run into the curse of dimensionality highlighted in Chapter 6.

Kernel methods avoid explicitly transforming each point x in the input space into the mapped point $\phi(x)$ in the feature space. Instead, the input objects are represented via their $n \times n$ pair-wise similarity values. The similarity function, called the kernel, is chosen so that it represents a dot product in some high-dimensional feature space, yet it can be computed without directly constructing $\phi(x)$.

Let $\mathcal{I}$ denote the input space, which can comprise any arbitrary set of objects, and let $D = \{x_i\}_{i=1}^n \subset \mathcal{I}$ be a dataset comprising n objects in the input space. We can represent the pair-wise similarity values between points in D via the $n \times n$ kernel matrix, also called the Gram matrix, defined as
$$K = \begin{pmatrix} K(x_1, x_1) & K(x_1, x_2) & \cdots & K(x_1, x_n) \\ K(x_2, x_1) & K(x_2, x_2) & \cdots & K(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ K(x_n, x_1) & K(x_n, x_2) & \cdots & K(x_n, x_n) \end{pmatrix}$$
where $K: \mathcal{I} \times \mathcal{I} \to \mathbb{R}$ is a kernel function on any two points in input space. However, we require that K corresponds to a dot product in some feature space. That is, for any $x_i, x_j \in \mathcal{I}$, we have
$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j) \tag{5.2}$$
where $\phi: \mathcal{I} \to \mathcal{F}$ is a mapping from the input space $\mathcal{I}$ to the feature space $\mathcal{F}$. Intuitively, this means that we should be able to compute the value of the dot product using the original input representation x, without having recourse to the mapping $\phi(x)$. Obviously, not just any arbitrary function can be used as a kernel; a valid kernel function must satisfy certain conditions so that (5.2) remains valid, as discussed later.

Figure 5.1: (a) Example points. (b) Linear kernel matrix.

Example 5.3 (Linear Kernel): Consider the identity mapping, $\phi(x) = x$. This naturally leads to the linear kernel, which is simply the dot product between two input vectors, and thus satisfies (5.2):
$$\phi(x)^T \phi(y) = x^T y = K(x, y)$$

For example, consider the first five points from the two-dimensional Iris dataset, shown in Figure 5.1a:
$$x_1 = \begin{pmatrix} 5.9 \\ 3 \end{pmatrix} \quad x_2 = \begin{pmatrix} 6.9 \\ 3.1 \end{pmatrix} \quad x_3 = \begin{pmatrix} 6.6 \\ 2.9 \end{pmatrix} \quad x_4 = \begin{pmatrix} 4.6 \\ 3.2 \end{pmatrix} \quad x_5 = \begin{pmatrix} 6 \\ 2.2 \end{pmatrix}$$
The kernel matrix for the linear kernel is shown in Figure 5.1b. For example,
$$K(x_1, x_2) = x_1^T x_2 = 5.9 \cdot 6.9 + 3 \cdot 3.1 = 50.01$$

Quadratic Kernel: Consider the quadratic mapping $\phi: \mathbb{R}^2 \to \mathbb{R}^3$ from Example 5.2, which maps $x = (x_1, x_2)^T$ as follows:
$$\phi(x) = (x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2)^T$$
The dot product between the mappings of two input points $x, y \in \mathbb{R}^2$ is given as
$$\phi(x)^T \phi(y) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 y_1 x_2 y_2$$
We can rearrange the above to obtain the (homogeneous) quadratic kernel function as follows:
$$\phi(x)^T \phi(y) = x_1^2 y_1^2 + x_2^2 y_2^2 + 2 x_1 y_1 x_2 y_2 = (x_1 y_1 + x_2 y_2)^2 = (x^T y)^2 = K(x, y)$$
We can thus see that the dot product in feature space can be computed by evaluating the kernel in input space, without explicitly mapping the points into feature space. For example, we have
$$\phi(x_1) = (5.9^2,\; 3^2,\; \sqrt{2} \cdot 5.9 \cdot 3)^T = (34.81,\; 9,\; 25.03)^T$$
$$\phi(x_2) = (6.9^2,\; 3.1^2,\; \sqrt{2} \cdot 6.9 \cdot 3.1)^T = (47.61,\; 9.61,\; 30.25)^T$$
$$\phi(x_1)^T \phi(x_2) = 34.81 \cdot 47.61 + 9 \cdot 9.61 + 25.03 \cdot 30.25 \simeq 2501$$
We can verify that the homogeneous quadratic kernel gives the same value:
$$K(x_1, x_2) = (x_1^T x_2)^2 = 50.01^2 \simeq 2501$$
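The kernel trick in Example 5.3 can be checked numerically. The following NumPy sketch (an illustration under the assumption that the five points above are used, not code from the text) verifies that the quadratic kernel computed in input space equals the dot product of the explicitly mapped points.

```python
import numpy as np

X = np.array([[5.9, 3.0], [6.9, 3.1], [6.6, 2.9], [4.6, 3.2], [6.0, 2.2]])

K_lin = X @ X.T               # linear kernel matrix, K(xi, xj) = xi^T xj
print(K_lin[0, 1])            # 50.01

# explicit quadratic feature map phi(x) = (x1^2, x2^2, sqrt(2) x1 x2)
phi = np.column_stack([X[:, 0]**2, X[:, 1]**2, np.sqrt(2) * X[:, 0] * X[:, 1]])
K_quad_feature = phi @ phi.T  # dot products in feature space
K_quad_kernel = K_lin**2      # homogeneous quadratic kernel computed in input space
print(np.allclose(K_quad_feature, K_quad_kernel))  # True
```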

We shall see that many data mining methods can be kernelized, that is, instead of mapping the input points into feature space, the data can be represented via the $n \times n$ kernel matrix K, and all relevant analysis can be performed over K. This is usually done via the so-called kernel trick: show that the analysis task requires only dot products $\phi(x_i)^T \phi(x_j)$ in feature space, which can be replaced by the corresponding kernel $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ that can be computed efficiently in input space. Once the kernel matrix has been computed, we no longer even need the input points $x_i$, since all operations involving only dot products in the feature space can be performed over the $n \times n$ kernel matrix K. An immediate consequence is that when the input data is the typical $n \times d$ numeric matrix D and we employ the linear kernel, the results obtained by analyzing K are equivalent to those obtained by analyzing D (as long as only dot products are involved in the analysis). Of course, kernel methods allow much more flexibility, since we can just as easily perform non-linear analysis by employing non-linear kernels, or we may analyze (non-numeric) complex objects without explicitly constructing the mapping $\phi(x)$.

Example 5.4: Consider the five points from Example 5.3, along with the linear kernel matrix shown in Figure 5.1. The mean of the five points in feature space is simply the mean in input space, since $\phi$ is the identity function for the linear kernel:
$$\mu_\phi = \frac{1}{5} \sum_{i=1}^{5} \phi(x_i) = \frac{1}{5} \sum_{i=1}^{5} x_i = (6.00,\; 2.88)^T$$
Now consider the squared magnitude of the mean in feature space:
$$\|\mu_\phi\|^2 = \mu_\phi^T \mu_\phi = 6.0^2 + 2.88^2 = 44.29$$
Because this involves only a dot product in feature space, the squared magnitude can be computed directly from K. As we shall see later (see (5.15)), the squared norm of the mean vector in feature space is equivalent to the average value of the kernel matrix K. For the kernel matrix in Figure 5.1b we have
$$\frac{1}{5^2} \sum_{i=1}^{5} \sum_{j=1}^{5} K(x_i, x_j) = 44.29$$
which matches the $\|\mu_\phi\|^2$ value computed above. This example illustrates that operations involving dot products in feature space can be cast as operations over the kernel matrix K.
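A short sketch of the identity used in Example 5.4, assuming NumPy and the same five points: for the linear kernel, the squared norm of the mean equals the average entry of K.

```python
import numpy as np

X = np.array([[5.9, 3.0], [6.9, 3.1], [6.6, 2.9], [4.6, 3.2], [6.0, 2.2]])
K = X @ X.T            # linear kernel matrix

mu = X.mean(axis=0)    # mean in input space (= mean in feature space here)
print(mu @ mu)         # 44.2944, squared norm of the mean computed directly
print(K.mean())        # 44.2944, average of the kernel matrix -- the same value
```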

Kernel methods offer a radically different view of the data. Instead of thinking of the data as vectors in input or feature space, we consider only the kernel values between pairs of points. The kernel matrix can also be considered as the weighted adjacency matrix for the complete graph over the n input points, and consequently there is a strong connection between kernels and graph analysis, in particular algebraic graph theory.

5.1 Kernel Matrix

Let $\mathcal{I}$ denote the input space, which can be any arbitrary set of data objects, and let $D = \{x_1, x_2, \ldots, x_n\} \subset \mathcal{I}$ denote a subset of n objects in the input space. Let $\phi: \mathcal{I} \to \mathcal{F}$ be a mapping from the input space into the feature space $\mathcal{F}$, which is endowed with a dot product and norm. Let $K: \mathcal{I} \times \mathcal{I} \to \mathbb{R}$ be a function that maps pairs of input objects to their dot product value in feature space, that is, $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$, and let K be the $n \times n$ kernel matrix corresponding to the subset D.

The function K is called a positive semi-definite kernel if and only if it is symmetric,
$$K(x_i, x_j) = K(x_j, x_i)$$
and the corresponding kernel matrix K for any subset $D \subset \mathcal{I}$ is positive semi-definite, that is,
$$a^T K a \geq 0, \text{ for all vectors } a \in \mathbb{R}^n$$
which implies that
$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j K(x_i, x_j) \geq 0, \text{ for all } a_i \in \mathbb{R},\; i \in [1, n] \tag{5.3}$$
Other common names for a positive semi-definite kernel include Mercer kernel and reproducing kernel.

We first verify that if $K(x_i, x_j)$ represents the dot product $\phi(x_i)^T \phi(x_j)$ in some feature space, then K is a positive semi-definite kernel. Consider any dataset D, and let $K = \{K(x_i, x_j)\}$ be the corresponding kernel matrix. First, K is symmetric since the dot product is symmetric, which also implies that the kernel matrix K is symmetric. Second, K is positive semi-definite because
$$a^T K a = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j K(x_i, x_j) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j\, \phi(x_i)^T \phi(x_j) = \Big(\sum_{i=1}^{n} a_i \phi(x_i)\Big)^T \Big(\sum_{j=1}^{n} a_j \phi(x_j)\Big) = \Big\|\sum_{i=1}^{n} a_i \phi(x_i)\Big\|^2 \geq 0$$
Thus, K is a positive semi-definite kernel.
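Condition (5.3) is equivalent to requiring that the symmetric kernel matrix have no negative eigenvalues, which is easy to check numerically. The following is a hedged sketch (NumPy assumed; the tolerance handling is an illustrative choice), using the linear kernel matrix of the running example.

```python
import numpy as np

X = np.array([[5.9, 3.0], [6.9, 3.1], [6.6, 2.9], [4.6, 3.2], [6.0, 2.2]])
K = X @ X.T                           # linear kernel (Gram) matrix

eigvals = np.linalg.eigvalsh(K)       # eigenvalues of the symmetric matrix K
print(np.all(eigvals >= -1e-10))      # True: K is positive semi-definite

a = np.random.randn(5)                # any vector a gives a non-negative quadratic form
print(a @ K @ a >= -1e-10)            # True, as required by (5.3)
```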

We now show that if we are given a positive semi-definite kernel $K: \mathcal{I} \times \mathcal{I} \to \mathbb{R}$, then it corresponds to a dot product in some feature space $\mathcal{F}$.

5.1.1 Reproducing Kernel Map

For any $x \in \mathcal{I}$ in the input space, define the mapping $\phi(x)$ as follows:
$$x \to \phi(x) = \phi_x(\cdot) = K(x, \cdot)$$
where $\cdot$ stands for any argument in $\mathcal{I}$. What we have done is map each point x to a function $\phi_x: \mathcal{I} \to \mathbb{R}$, so that for any particular argument $x_i$, we have
$$\phi_x(x_i) = K(x, x_i)$$
In other words, if we evaluate the function $\phi_x$ at the point $x_i$, then the result is simply the kernel value between x and $x_i$. Thus, each object in input space gets mapped to a feature point that is in fact a function. Such a space is also called a functional space, since it is a space of functions, but algebraically it is just a vector space where each point happens to be a function.

Let $\mathcal{F}$ be the set of all functions or points that can be obtained as a linear combination of any subset of feature points, defined as
$$\mathcal{F} = \Big\{ f = f(\cdot) = \sum_{i=1}^{m} \alpha_i K(x_i, \cdot) \;\Big|\; m \text{ is any natural number, and } \alpha_i \in \mathbb{R} \Big\}$$
In other words, $\mathcal{F}$ includes all possible linear combinations (for any value of m) of feature points. We use the dual notation $f = f(\cdot)$ to emphasize the fact that each point f in the feature space is in fact a function $f(\cdot)$. Note that the mapped point $\phi(x) = \phi_x(\cdot) = K(x, \cdot)$ is itself a point in $\mathcal{F}$.

Let $f, g \in \mathcal{F}$ be any two points in the feature space:
$$f = f(\cdot) = \sum_{i=1}^{m_a} \alpha_i K(x_i, \cdot) \qquad g = g(\cdot) = \sum_{j=1}^{m_b} \beta_j K(x_j, \cdot)$$
Define the dot product between the two points as
$$f^T g = f(\cdot)^T g(\cdot) = \sum_{i=1}^{m_a} \sum_{j=1}^{m_b} \alpha_i \beta_j K(x_i, x_j)$$
We can verify that the dot product is bilinear, that is, linear in both arguments, since
$$f^T g = \sum_{i=1}^{m_a} \sum_{j=1}^{m_b} \alpha_i \beta_j K(x_i, x_j) = \sum_{i=1}^{m_a} \alpha_i\, g(x_i) = \sum_{j=1}^{m_b} \beta_j\, f(x_j)$$

The fact that K is positive semi-definite implies that
$$\|f\|^2 = f^T f = \sum_{i=1}^{m_a} \sum_{j=1}^{m_a} \alpha_i \alpha_j K(x_i, x_j) \geq 0$$
Thus, the space $\mathcal{F}$ is a normed inner product space, that is, it is endowed with a symmetric bilinear dot product and a norm. In fact, the functional space $\mathcal{F}$ is a Hilbert space, defined as a normed inner product space that is complete and separable. The space $\mathcal{F}$ has the so-called reproducing property, namely that the dot product of any function $f = f(\cdot)$ with any feature point $\phi(x) = \phi_x(\cdot)$ gives back the value of f at x, since
$$f^T \phi(x) = f(\cdot)^T \phi_x(\cdot) = \sum_{i=1}^{m_a} \alpha_i K(x_i, x) = f(x)$$
Thus, the dot product with any mapped point reproduces the function. For this reason, the space $\mathcal{F}$ is also called a reproducing kernel Hilbert space.

All we have to do now is show that $K(x_i, x_j)$ corresponds to a dot product in the feature space $\mathcal{F}$. This is indeed the case, since for any two mapped points
$$\phi(x_i) = \phi_{x_i}(\cdot) = K(x_i, \cdot) \qquad \phi(x_j) = \phi_{x_j}(\cdot) = K(x_j, \cdot)$$
their dot product in feature space is given as
$$\phi(x_i)^T \phi(x_j) = \phi_{x_i}(\cdot)^T \phi_{x_j}(\cdot) = K(x_i, x_j)$$

Empirical Kernel Map: The reproducing kernel map $\phi$ maps the input space into a potentially infinite-dimensional feature space. However, given a dataset $D = \{x_i\}_{i=1}^n$, we can obtain a finite-dimensional mapping by evaluating the kernel only on points in D. That is, define the map $\phi$ as follows:
$$\phi(x) = \big(K(x_1, x),\, K(x_2, x),\, \ldots,\, K(x_n, x)\big)^T \in \mathbb{R}^n$$
which maps each point $x \in \mathcal{I}$ to the n-dimensional vector comprising the kernel values of x with each of the objects $x_i \in D$. The dot product in this feature space is then given as
$$\phi(x_i)^T \phi(x_j) = \sum_{k=1}^{n} K(x_k, x_i)\, K(x_k, x_j) = K_i^T K_j \tag{5.4}$$
where $K_i$ denotes the i-th row of K, which is also the same as the i-th column of K, since K is symmetric. However, for $\phi$ to be a valid map, we require that

$\phi(x_i)^T \phi(x_j) = K(x_i, x_j)$, which is clearly not satisfied by (5.4). One solution is to replace $K_i^T K_j$ in (5.4) with $K_i^T A K_j$ for some positive semi-definite matrix A such that
$$K_i^T A K_j = K(x_i, x_j)$$
If we can find such an A, it would imply that over all pairs of mapped points we have
$$\big\{K_i^T A K_j\big\}_{i,j=1}^n = \big\{K(x_i, x_j)\big\}_{i,j=1}^n, \text{ that is, } K A K = K$$
This immediately suggests that we take $A = K^{-1}$, the (pseudo) inverse of the kernel matrix K. The desired map $\phi$, called the empirical kernel map, is then defined as
$$\phi(x) = K^{-1/2}\, \big(K(x_1, x),\, K(x_2, x),\, \ldots,\, K(x_n, x)\big)^T \in \mathbb{R}^n$$
so that the dot product yields
$$\phi(x_i)^T \phi(x_j) = \big(K^{-1/2} K_i\big)^T \big(K^{-1/2} K_j\big) = K_i^T \big(K^{-1/2} K^{-1/2}\big) K_j = K_i^T K^{-1} K_j$$
Over all pairs of mapped points, we have
$$\big\{K_i^T K^{-1} K_j\big\}_{i,j=1}^n = K\, K^{-1} K = K$$
as desired.

5.1.2 Mercer Kernel Map

Data-specific Kernel Map: The Mercer kernel map is best understood starting from the kernel matrix for the dataset D in input space. Since K is a symmetric positive semi-definite matrix, it has real and non-negative eigenvalues, and it can be decomposed as follows:
$$K = U \Lambda U^T \tag{5.5}$$
where U is the orthonormal matrix of eigenvectors $u_i = (u_{i1}, u_{i2}, \ldots, u_{in})^T \in \mathbb{R}^n$ (for $i = 1, \ldots, n$), and $\Lambda$ is the diagonal matrix of eigenvalues, with both arranged in non-increasing order of the eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0$:
$$U = \begin{pmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & & | \end{pmatrix} \qquad \Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}$$

The kernel matrix K can therefore be rewritten as the spectral sum
$$K = \lambda_1 u_1 u_1^T + \lambda_2 u_2 u_2^T + \cdots + \lambda_n u_n u_n^T$$
In particular, the kernel function between $x_i$ and $x_j$ is given as
$$K(x_i, x_j) = \lambda_1 u_{1i} u_{1j} + \lambda_2 u_{2i} u_{2j} + \cdots + \lambda_n u_{ni} u_{nj} = \sum_{k=1}^{n} \lambda_k u_{ki} u_{kj} \tag{5.6}$$
where $u_{ki}$ denotes the i-th component of eigenvector $u_k$. It follows that if we define the Mercer map $\phi$ as
$$\phi(x_i) = \big(\sqrt{\lambda_1}\, u_{1i},\; \sqrt{\lambda_2}\, u_{2i},\; \ldots,\; \sqrt{\lambda_n}\, u_{ni}\big)^T \tag{5.7}$$
then $K(x_i, x_j)$ is indeed the dot product in feature space between the mapped points $\phi(x_i)$ and $\phi(x_j)$, since
$$\phi(x_i)^T \phi(x_j) = \big(\sqrt{\lambda_1}\, u_{1i}, \ldots, \sqrt{\lambda_n}\, u_{ni}\big)\, \big(\sqrt{\lambda_1}\, u_{1j}, \ldots, \sqrt{\lambda_n}\, u_{nj}\big)^T = \lambda_1 u_{1i} u_{1j} + \cdots + \lambda_n u_{ni} u_{nj} = K(x_i, x_j)$$
Noting that $U_i = (u_{1i}, u_{2i}, \ldots, u_{ni})^T$ is the i-th row of U, we can rewrite the Mercer map $\phi$ as
$$\phi(x_i) = \sqrt{\Lambda}\; U_i \tag{5.8}$$
Thus, the kernel value is simply the dot product between scaled rows of U:
$$\phi(x_i)^T \phi(x_j) = \big(\sqrt{\Lambda}\, U_i\big)^T \big(\sqrt{\Lambda}\, U_j\big) = U_i^T \Lambda\, U_j \tag{5.9}$$
The Mercer map, defined equivalently in (5.7) and (5.8), is obviously restricted to the input dataset D, and is therefore called the data-specific Mercer kernel map. It defines a data-specific feature space of dimensionality at most n, comprising the eigenvectors of K.

Example 5.5: Let the input dataset comprise the five points shown in Figure 5.1a, and let the corresponding kernel matrix be as shown in Figure 5.1b. Computing the eigen-decomposition of K, we obtain $\lambda_1 = 223.95$, $\lambda_2 = 1.29$, and $\lambda_3 = \lambda_4 = \lambda_5 = 0$. The effective dimensionality of the feature space is 2, comprising the eigenvectors $u_1$ and $u_2$. Thus, the matrix U is given as follows:
$$U = \begin{pmatrix} u_1 & u_2 \end{pmatrix} = \begin{pmatrix} 0.442 & 0.163 \\ 0.505 & -0.135 \\ 0.482 & -0.180 \\ 0.369 & 0.813 \\ 0.425 & -0.512 \end{pmatrix}$$

and we have
$$\sqrt{\Lambda} = \begin{pmatrix} \sqrt{223.95} & 0 \\ 0 & \sqrt{1.29} \end{pmatrix} = \begin{pmatrix} 14.965 & 0 \\ 0 & 1.135 \end{pmatrix}$$
The kernel map is specified via (5.8). For example, for $x_1 = (5.9, 3)^T$ and $x_2 = (6.9, 3.1)^T$ we have
$$\phi(x_1) = \sqrt{\Lambda}\, U_1 = \begin{pmatrix} 6.616 \\ 0.185 \end{pmatrix} \qquad \phi(x_2) = \sqrt{\Lambda}\, U_2 = \begin{pmatrix} 7.563 \\ -0.153 \end{pmatrix}$$
Their dot product is given as
$$\phi(x_1)^T \phi(x_2) = 6.616 \cdot 7.563 - 0.185 \cdot 0.153 = 50.04 - 0.03 = 50.01$$
which matches the kernel value $K(x_1, x_2)$ in Figure 5.1b.

Mercer Kernel Map: For continuous spaces, analogous to the discrete case in (5.6), the kernel value between any two points can be written as the infinite spectral decomposition
$$K(x_i, x_j) = \sum_{k=1}^{\infty} \lambda_k\, u_k(x_i)\, u_k(x_j)$$
where $\{\lambda_1, \lambda_2, \ldots\}$ is the infinite set of eigenvalues, and $\{u_1(\cdot), u_2(\cdot), \ldots\}$ is the corresponding set of orthogonal and normalized eigenfunctions, that is, each function $u_i(\cdot)$ is a solution to the integral equation
$$\int K(x, y)\, u_i(y)\, dy = \lambda_i\, u_i(x)$$
and K is a positive semi-definite kernel, that is, for all functions $a(\cdot)$ with finite square integral ($\int a(x)^2\, dx < \infty$),
$$\int\!\!\int K(x_1, x_2)\, a(x_1)\, a(x_2)\, dx_1\, dx_2 \geq 0$$
Contrast this with the discrete kernel in (5.3).
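For the data-specific Mercer map of Section 5.1.2 the construction is a few lines of NumPy. The sketch below (assumed implementation, not from the text; the clipping of tiny negative eigenvalues is an illustrative numerical safeguard) reproduces Example 5.5.

```python
import numpy as np

X = np.array([[5.9, 3.0], [6.9, 3.1], [6.6, 2.9], [4.6, 3.2], [6.0, 2.2]])
K = X @ X.T

lam, U = np.linalg.eigh(K)           # eigenvalues (ascending) and eigenvectors of K
lam = np.clip(lam, 0, None)          # guard against tiny negative round-off values

Phi = U * np.sqrt(lam)               # row i is phi(x_i) = sqrt(Lambda) U_i, as in (5.8)
print(np.allclose(Phi @ Phi.T, K))   # True: phi(x_i)^T phi(x_j) = K(x_i, x_j)
print(Phi[0] @ Phi[1])               # 50.01, matching K(x1, x2)
```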

Analogous to the data-specific Mercer map (5.7), the general Mercer kernel map is given as
$$\phi(x_i) = \big(\sqrt{\lambda_1}\, u_1(x_i),\; \sqrt{\lambda_2}\, u_2(x_i),\; \ldots\big)^T$$
with the kernel value being equivalent to the dot product between two mapped points:
$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$

5.2 Vector Kernels

Kernels that map an (input) vector space into another (feature) vector space are called vector kernels. For multivariate input data, the input vector space will be the d-dimensional real space $\mathbb{R}^d$. Let D consist of n input points $x_i \in \mathbb{R}^d$, for $i = 1, 2, \ldots, n$. Commonly used (non-linear) kernel functions over vector data include the polynomial and Gaussian kernels, described next.

Polynomial Kernel: Polynomial kernels are of two types: homogeneous or inhomogeneous. Let $x, y \in \mathbb{R}^d$. The homogeneous polynomial kernel is defined as
$$K_q(x, y) = \phi(x)^T \phi(y) = (x^T y)^q \tag{5.10}$$
where q is the degree of the polynomial. This kernel corresponds to a feature space spanned by all products of exactly q attributes or dimensions. The most typical cases are the linear (with $q = 1$) and quadratic (with $q = 2$) kernels, given as
$$K_1(x, y) = x^T y \qquad K_2(x, y) = (x^T y)^2$$
The inhomogeneous polynomial kernel is defined as
$$K_q(x, y) = \phi(x)^T \phi(y) = (c + x^T y)^q \tag{5.11}$$
where q is the degree of the polynomial, and $c \geq 0$ is some constant. When $c = 0$ we obtain the homogeneous kernel. When $c > 0$, this kernel corresponds to the feature space spanned by all products of at most q attributes. This can be seen from the binomial expansion
$$K_q(x, y) = (c + x^T y)^q = \sum_{k=0}^{q} \binom{q}{k}\, c^{q-k}\, \big(x^T y\big)^k$$

For example, for the typical value of $c = 1$, the inhomogeneous kernel is a weighted sum of the homogeneous polynomial kernels for all powers up to q, that is,
$$(1 + x^T y)^q = 1 + q\, x^T y + \binom{q}{2} \big(x^T y\big)^2 + \cdots + q\, \big(x^T y\big)^{q-1} + \big(x^T y\big)^q$$

Example 5.6: Consider the points $x_1$ and $x_2$ in Figure 5.1:
$$x_1 = \begin{pmatrix} 5.9 \\ 3 \end{pmatrix} \qquad x_2 = \begin{pmatrix} 6.9 \\ 3.1 \end{pmatrix}$$
The homogeneous quadratic kernel is given as
$$K(x_1, x_2) = (x_1^T x_2)^2 = 50.01^2 = 2501.00$$
The inhomogeneous quadratic kernel is given as
$$K(x_1, x_2) = (1 + x_1^T x_2)^2 = 51.01^2 = 2602.02$$
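Both kernels in Example 5.6 are one-liners. The following is a minimal sketch (NumPy assumed; poly_kernel is an illustrative helper name, not a library routine).

```python
import numpy as np

def poly_kernel(x, y, q=2, c=0.0):
    """Homogeneous (c = 0) or inhomogeneous (c > 0) polynomial kernel (c + x^T y)^q."""
    return (c + x @ y)**q

x1 = np.array([5.9, 3.0])
x2 = np.array([6.9, 3.1])
print(poly_kernel(x1, x2, q=2, c=0.0))  # about 2501   (homogeneous quadratic)
print(poly_kernel(x1, x2, q=2, c=1.0))  # about 2602   (inhomogeneous quadratic)
```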

For the polynomial kernel it is possible to derive the mapping $\phi$ from the input to the feature space. Let $n_0, n_1, \ldots, n_d$ denote non-negative integers such that $\sum_{i=0}^{d} n_i = q$. Further, let $n = (n_0, n_1, \ldots, n_d)$, and let $|n| = \sum_{i=0}^{d} n_i$. Also, let $\binom{q}{n}$ denote the multinomial coefficient
$$\binom{q}{n} = \binom{q}{n_0, n_1, \ldots, n_d} = \frac{q!}{n_0!\, n_1! \cdots n_d!}$$
The multinomial expansion of the inhomogeneous kernel is then given as
$$K_q(x, y) = (c + x^T y)^q = \Big(c + \sum_{k=1}^{d} x_k y_k\Big)^q = (c + x_1 y_1 + \cdots + x_d y_d)^q$$
$$= \sum_{|n| = q} \binom{q}{n}\, c^{n_0}\, (x_1 y_1)^{n_1} (x_2 y_2)^{n_2} \cdots (x_d y_d)^{n_d}$$
$$= \sum_{|n| = q} \binom{q}{n}\, c^{n_0}\, \big(x_1^{n_1} x_2^{n_2} \cdots x_d^{n_d}\big)\, \big(y_1^{n_1} y_2^{n_2} \cdots y_d^{n_d}\big)$$
$$= \sum_{|n| = q} \Big(\sqrt{a_n} \prod_{k=1}^{d} x_k^{n_k}\Big) \Big(\sqrt{a_n} \prod_{k=1}^{d} y_k^{n_k}\Big) = \phi(x)^T \phi(y)$$
where $a_n = \binom{q}{n}\, c^{n_0}$, and the summation is over all $n = (n_0, n_1, \ldots, n_d)$ such that $|n| = n_0 + n_1 + \cdots + n_d = q$. Using the notation $x^n = \prod_{k=1}^{d} x_k^{n_k}$, the mapping $\phi: \mathbb{R}^d \to \mathbb{R}^m$ is given as the vector
$$\phi(x) = \big(\ldots, \sqrt{a_n}\, x^n, \ldots\big)^T = \Big(\ldots, \sqrt{\binom{q}{n} c^{n_0}} \prod_{k=1}^{d} x_k^{n_k}, \ldots\Big)^T$$
where the variable $n = (n_0, \ldots, n_d)$ ranges over all possible assignments such that $|n| = q$. It can be shown that the dimensionality of the feature space is given as
$$m = \binom{d + q}{q}$$

Example 5.7 (Quadratic Polynomial Kernel): Let $x, y \in \mathbb{R}^2$ and let $c = 1$. The inhomogeneous quadratic polynomial kernel is given as
$$K(x, y) = (1 + x^T y)^2 = (1 + x_1 y_1 + x_2 y_2)^2$$
The set of all assignments $n = (n_0, n_1, n_2)$ such that $|n| = q = 2$, and the corresponding terms in the multinomial expansion, are shown below.

assignment n = (n0, n1, n2)   coefficient a_n = (q choose n) c^{n0}   variables prod_k (x_k y_k)^{n_k}
(1, 1, 0)                     2                                       x1 y1
(1, 0, 1)                     2                                       x2 y2
(0, 1, 1)                     2                                       x1 y1 x2 y2
(2, 0, 0)                     1                                       1
(0, 2, 0)                     1                                       (x1 y1)^2
(0, 0, 2)                     1                                       (x2 y2)^2

Thus, the kernel can be written as
$$K(x, y) = 1 + 2 x_1 y_1 + 2 x_2 y_2 + 2 x_1 y_1 x_2 y_2 + x_1^2 y_1^2 + x_2^2 y_2^2$$
$$= \big(1, \sqrt{2} x_1, \sqrt{2} x_2, \sqrt{2} x_1 x_2, x_1^2, x_2^2\big)\, \big(1, \sqrt{2} y_1, \sqrt{2} y_2, \sqrt{2} y_1 y_2, y_1^2, y_2^2\big)^T = \phi(x)^T \phi(y)$$
When the input space is $\mathbb{R}^2$, the dimensionality of the feature space is given as
$$m = \binom{d + q}{q} = \binom{2 + 2}{2} = 6$$

In this case, the inhomogeneous quadratic kernel with $c = 1$ corresponds to the mapping $\phi: \mathbb{R}^2 \to \mathbb{R}^6$, given as
$$\phi(x) = \big(1,\; \sqrt{2} x_1,\; \sqrt{2} x_2,\; \sqrt{2} x_1 x_2,\; x_1^2,\; x_2^2\big)^T$$
For example, for $x_1 = (5.9, 3)^T$ and $x_2 = (6.9, 3.1)^T$, we have
$$\phi(x_1) = \big(1,\; \sqrt{2} \cdot 5.9,\; \sqrt{2} \cdot 3,\; \sqrt{2} \cdot 5.9 \cdot 3,\; 5.9^2,\; 3^2\big)^T = (1,\; 8.34,\; 4.24,\; 25.03,\; 34.81,\; 9)^T$$
$$\phi(x_2) = \big(1,\; \sqrt{2} \cdot 6.9,\; \sqrt{2} \cdot 3.1,\; \sqrt{2} \cdot 6.9 \cdot 3.1,\; 6.9^2,\; 3.1^2\big)^T = (1,\; 9.76,\; 4.38,\; 30.25,\; 47.61,\; 9.61)^T$$
Thus, the inhomogeneous kernel value is
$$\phi(x_1)^T \phi(x_2) = 1 + 81.40 + 18.57 + 757.16 + 1657.30 + 86.49 = 2601.92$$
On the other hand, when the input space is $\mathbb{R}^2$, the homogeneous quadratic kernel corresponds to the mapping $\phi: \mathbb{R}^2 \to \mathbb{R}^3$, defined as
$$\phi(x) = \big(\sqrt{2} x_1 x_2,\; x_1^2,\; x_2^2\big)^T$$
since only the degree-2 terms are considered. For example, for $x_1$ and $x_2$, we have
$$\phi(x_1) = \big(\sqrt{2} \cdot 5.9 \cdot 3,\; 5.9^2,\; 3^2\big)^T = (25.03,\; 34.81,\; 9)^T$$
$$\phi(x_2) = \big(\sqrt{2} \cdot 6.9 \cdot 3.1,\; 6.9^2,\; 3.1^2\big)^T = (30.25,\; 47.61,\; 9.61)^T$$
and thus
$$K(x_1, x_2) = \phi(x_1)^T \phi(x_2) = 757.16 + 1657.30 + 86.49 = 2500.95$$
These values essentially match those shown in Example 5.6, up to four significant digits.

Gaussian Kernel: The Gaussian kernel, also called the Radial Basis Function (RBF) kernel, is defined as
$$K(x, y) = \exp\Big\{-\frac{\|x - y\|^2}{2 \sigma^2}\Big\} \tag{5.12}$$
where $\sigma > 0$ is the spread parameter that plays the same role as the standard deviation in a normal density function. Note that $K(x, x) = 1$, and further that the kernel value is inversely related to the distance between the two points x and y.
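A small sketch of the Gaussian kernel (5.12) over the five running-example points; the helper gaussian_kernel_matrix is an illustrative choice under the assumption that NumPy is available.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """Gaussian (RBF) kernel matrix: K[i, j] = exp(-||xi - xj||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma**2))

X = np.array([[5.9, 3.0], [6.9, 3.1], [6.6, 2.9], [4.6, 3.2], [6.0, 2.2]])
K = gaussian_kernel_matrix(X, sigma=1.0)
print(K[0, 1])       # about 0.60, as computed in Example 5.8 below
print(np.diag(K))    # all ones, since K(x, x) = 1
```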

Example 5.8: Consider again the points $x_1$ and $x_2$ in Figure 5.1:
$$x_1 = \begin{pmatrix} 5.9 \\ 3 \end{pmatrix} \qquad x_2 = \begin{pmatrix} 6.9 \\ 3.1 \end{pmatrix}$$
The squared distance between them is given as
$$\|x_1 - x_2\|^2 = \|(-1, -0.1)^T\|^2 = 1^2 + 0.1^2 = 1.01$$
With $\sigma = 1$, the Gaussian kernel is
$$K(x_1, x_2) = \exp\Big\{-\frac{1.01}{2}\Big\} = \exp\{-0.505\} \simeq 0.60$$
It is interesting to note that the feature space for a Gaussian kernel has infinite dimensionality. To see this, note that the exponential function can be written as the infinite expansion
$$\exp\{a\} = \sum_{n=0}^{\infty} \frac{a^n}{n!} = 1 + a + \frac{1}{2!} a^2 + \frac{1}{3!} a^3 + \cdots$$
Further, setting $\gamma = \frac{1}{2\sigma^2}$, and noting that $\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2\, x^T y$, we can rewrite the Gaussian kernel as follows:
$$K(x, y) = \exp\big\{-\gamma \|x - y\|^2\big\} = \exp\big\{-\gamma \|x\|^2\big\} \cdot \exp\big\{-\gamma \|y\|^2\big\} \cdot \exp\big\{2 \gamma\, x^T y\big\}$$
In particular, the last term is given as the infinite expansion
$$\exp\big\{2 \gamma\, x^T y\big\} = \sum_{q=0}^{\infty} \frac{(2\gamma)^q}{q!} \big(x^T y\big)^q = 1 + (2\gamma)\, x^T y + \frac{(2\gamma)^2}{2!} \big(x^T y\big)^2 + \cdots$$

Using the multinomial expansion of $(x^T y)^q$, we can write the Gaussian kernel as
$$K(x, y) = \exp\big\{-\gamma \|x\|^2\big\}\, \exp\big\{-\gamma \|y\|^2\big\} \sum_{q=0}^{\infty} \frac{(2\gamma)^q}{q!} \sum_{|n| = q} \binom{q}{n} \prod_{k=1}^{d} (x_k y_k)^{n_k}$$
$$= \sum_{q=0}^{\infty} \sum_{|n| = q} \Big(\sqrt{a_{q,n}}\, \exp\big\{-\gamma \|x\|^2\big\} \prod_{k=1}^{d} x_k^{n_k}\Big) \Big(\sqrt{a_{q,n}}\, \exp\big\{-\gamma \|y\|^2\big\} \prod_{k=1}^{d} y_k^{n_k}\Big) = \phi(x)^T \phi(y)$$
where $a_{q,n} = \frac{(2\gamma)^q}{q!} \binom{q}{n}$, and $n = (n_1, n_2, \ldots, n_d)$, with $|n| = n_1 + n_2 + \cdots + n_d = q$. The mapping into feature space corresponds to the function $\phi: \mathbb{R}^d \to \mathbb{R}^\infty$
$$\phi(x) = \Big(\ldots, \sqrt{\frac{(2\gamma)^q}{q!} \binom{q}{n}}\, \exp\big\{-\gamma \|x\|^2\big\} \prod_{k=1}^{d} x_k^{n_k}, \ldots\Big)^T$$
with the dimensions ranging over all degrees $q = 0, \ldots, \infty$, and with the variable $n = (n_1, \ldots, n_d)$ ranging over all possible assignments such that $|n| = q$ for each value of q. Because $\phi$ maps the input space into an infinite-dimensional feature space, we obviously cannot explicitly compute $\phi(x)$; nevertheless, computing the Gaussian kernel $K(x, y)$ is straightforward.

5.3 Basic Kernel Operations in Feature Space

Let us look at some of the basic data analysis tasks that can be performed solely via kernels, without instantiating $\phi(x)$.

Norm of a Point: We can compute the norm of a point $\phi(x)$ in feature space as follows:
$$\|\phi(x)\|^2 = \phi(x)^T \phi(x) = K(x, x)$$
which implies that $\|\phi(x)\| = \sqrt{K(x, x)}$.

Distance between Points: The distance between two points $\phi(x_i)$ and $\phi(x_j)$ can be computed as follows:
$$\|\phi(x_i) - \phi(x_j)\|^2 = \|\phi(x_i)\|^2 + \|\phi(x_j)\|^2 - 2\, \phi(x_i)^T \phi(x_j) = K(x_i, x_i) + K(x_j, x_j) - 2 K(x_i, x_j) \tag{5.13}$$

which implies that
$$\delta\big(\phi(x_i), \phi(x_j)\big) = \|\phi(x_i) - \phi(x_j)\| = \sqrt{K(x_i, x_i) + K(x_j, x_j) - 2 K(x_i, x_j)}$$
Rearranging (5.13), we can see that the kernel value can be considered as a measure of the similarity between two points, since
$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j) = \frac{1}{2}\Big(\|\phi(x_i)\|^2 + \|\phi(x_j)\|^2 - \|\phi(x_i) - \phi(x_j)\|^2\Big) \tag{5.14}$$
Thus, the larger the distance $\|\phi(x_i) - \phi(x_j)\|$ between the two points in feature space, the smaller the kernel value, that is, the smaller the similarity.

Example 5.9: Consider the two points $x_1$ and $x_2$:
$$x_1 = \begin{pmatrix} 5.9 \\ 3 \end{pmatrix} \qquad x_2 = \begin{pmatrix} 6.9 \\ 3.1 \end{pmatrix}$$
Assuming the homogeneous quadratic kernel, the norm of $\phi(x_1)$ can be computed as
$$\|\phi(x_1)\|^2 = K(x_1, x_1) = (x_1^T x_1)^2 = 43.81^2 = 1919.32$$
which implies that $\|\phi(x_1)\| = 43.81$. The distance between $\phi(x_1)$ and $\phi(x_2)$ is
$$\delta\big(\phi(x_1), \phi(x_2)\big) = \sqrt{K(x_1, x_1) + K(x_2, x_2) - 2 K(x_1, x_2)} = \sqrt{1919.32 + 3274.13 - 2 \cdot 2501} = \sqrt{191.45} = 13.84$$
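The computations of Example 5.9 use only kernel evaluations. A minimal sketch (NumPy assumed; k2 is an illustrative name for the homogeneous quadratic kernel):

```python
import numpy as np

def k2(x, y):
    """Homogeneous quadratic kernel (x^T y)^2."""
    return (x @ y)**2

x1 = np.array([5.9, 3.0])
x2 = np.array([6.9, 3.1])

norm_phi_x1 = np.sqrt(k2(x1, x1))                          # ||phi(x1)|| = 43.81
dist = np.sqrt(k2(x1, x1) + k2(x2, x2) - 2 * k2(x1, x2))   # feature-space distance
print(norm_phi_x1, dist)                                   # 43.81  about 13.84
```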

Mean in Feature Space: The mean of the points in feature space is given as
$$\mu_\phi = \frac{1}{n} \sum_{i=1}^{n} \phi(x_i)$$
Since we do not, in general, have access to $\phi(x_i)$, we cannot explicitly compute the mean point in feature space. Nevertheless, we can compute the squared norm of the mean as follows:
$$\|\mu_\phi\|^2 = \mu_\phi^T \mu_\phi = \Big(\frac{1}{n} \sum_{i=1}^{n} \phi(x_i)\Big)^T \Big(\frac{1}{n} \sum_{j=1}^{n} \phi(x_j)\Big) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \phi(x_i)^T \phi(x_j) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K(x_i, x_j) \tag{5.15}$$
The above derivation implies that the squared norm of the mean in feature space is simply the average of the values in the kernel matrix K.

Example 5.10: Consider the five points from Example 5.3, also shown in Figure 5.1. Example 5.4 showed the norm of the mean for the linear kernel. Let us now consider the Gaussian kernel with $\sigma = 1$. The Gaussian kernel matrix is given as
$$K = \begin{pmatrix} 1.00 & 0.60 & 0.78 & 0.42 & 0.72 \\ 0.60 & 1.00 & 0.94 & 0.07 & 0.44 \\ 0.78 & 0.94 & 1.00 & 0.13 & 0.65 \\ 0.42 & 0.07 & 0.13 & 1.00 & 0.23 \\ 0.72 & 0.44 & 0.65 & 0.23 & 1.00 \end{pmatrix}$$
The squared norm of the mean in feature space is therefore
$$\|\mu_\phi\|^2 = \frac{1}{25} \sum_{i=1}^{5} \sum_{j=1}^{5} K(x_i, x_j) = 0.599$$
which implies that $\|\mu_\phi\| = \sqrt{0.599} = 0.774$.

Total Variance in Feature Space: Let us first derive a formula for the squared distance of a point from the mean in feature space:
$$\|\phi(x_i) - \mu_\phi\|^2 = \|\phi(x_i)\|^2 - 2\, \phi(x_i)^T \mu_\phi + \|\mu_\phi\|^2 = K(x_i, x_i) - \frac{2}{n} \sum_{j=1}^{n} K(x_i, x_j) + \frac{1}{n^2} \sum_{a=1}^{n} \sum_{b=1}^{n} K(x_a, x_b)$$

The total variance (1.9) in feature space is obtained by taking the average squared deviation of points from the mean in feature space:
$$\sigma_\phi^2 = \frac{1}{n} \sum_{i=1}^{n} \|\phi(x_i) - \mu_\phi\|^2 = \frac{1}{n} \sum_{i=1}^{n} \Big(K(x_i, x_i) - \frac{2}{n} \sum_{j=1}^{n} K(x_i, x_j) + \frac{1}{n^2} \sum_{a=1}^{n} \sum_{b=1}^{n} K(x_a, x_b)\Big)$$
$$= \frac{1}{n} \sum_{i=1}^{n} K(x_i, x_i) - \frac{2}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K(x_i, x_j) + \frac{n}{n^3} \sum_{a=1}^{n} \sum_{b=1}^{n} K(x_a, x_b)$$
$$= \frac{1}{n} \sum_{i=1}^{n} K(x_i, x_i) - \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} K(x_i, x_j) \tag{5.16}$$
In other words, the total variance in feature space is given as the difference between the average of the diagonal entries and the average of the entire kernel matrix K.

Example 5.11: Continuing Example 5.10, the total variance in feature space for the five points, under the Gaussian kernel, is given as
$$\sigma_\phi^2 = \Big(\frac{1}{5} \sum_{i=1}^{5} K(x_i, x_i)\Big) - \|\mu_\phi\|^2 = \frac{5}{5} - 0.599 = 0.401$$
The squared distance of $\phi(x_1)$ from the mean $\mu_\phi$ in feature space is given as
$$\|\phi(x_1) - \mu_\phi\|^2 = K(x_1, x_1) - \frac{2}{5} \sum_{j=1}^{5} K(x_1, x_j) + \|\mu_\phi\|^2 = 1 - \frac{2}{5}(1 + 0.60 + 0.78 + 0.42 + 0.72) + 0.599 = 0.189$$
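Equation (5.16) is one line of code once the kernel matrix is available. The sketch below (assumed NumPy implementation, mirroring Example 5.11 with the Gaussian kernel and sigma = 1) computes the total variance from K alone.

```python
import numpy as np

X = np.array([[5.9, 3.0], [6.9, 3.1], [6.6, 2.9], [4.6, 3.2], [6.0, 2.2]])
sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-sq_dists / 2)                       # Gaussian kernel matrix, sigma = 1

total_var = np.mean(np.diag(K)) - np.mean(K)    # (1/n) sum K(xi,xi) - (1/n^2) sum K(xi,xj)
print(np.mean(K), total_var)                    # about 0.60 and 0.40
```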

Centering in Feature Space: We can center each point in feature space by subtracting the mean from it, as follows:
$$\hat{\phi}(x_i) = \phi(x_i) - \mu_\phi$$
Since we do not have an explicit representation of $\phi(x_i)$ or $\mu_\phi$, we cannot explicitly center the points. However, we can still compute the centered kernel matrix, that is, the kernel matrix over the centered points. The centered kernel matrix is given as
$$\hat{K} = \big\{\hat{K}(x_i, x_j)\big\}_{i,j=1}^n$$
where each cell corresponds to the kernel between centered points, that is,
$$\hat{K}(x_i, x_j) = \hat{\phi}(x_i)^T \hat{\phi}(x_j) = \big(\phi(x_i) - \mu_\phi\big)^T \big(\phi(x_j) - \mu_\phi\big)$$
$$= \phi(x_i)^T \phi(x_j) - \phi(x_i)^T \mu_\phi - \phi(x_j)^T \mu_\phi + \mu_\phi^T \mu_\phi$$
$$= K(x_i, x_j) - \frac{1}{n} \sum_{k=1}^{n} \phi(x_i)^T \phi(x_k) - \frac{1}{n} \sum_{k=1}^{n} \phi(x_j)^T \phi(x_k) + \|\mu_\phi\|^2$$
$$= K(x_i, x_j) - \frac{1}{n} \sum_{k=1}^{n} K(x_i, x_k) - \frac{1}{n} \sum_{k=1}^{n} K(x_j, x_k) + \frac{1}{n^2} \sum_{a=1}^{n} \sum_{b=1}^{n} K(x_a, x_b)$$
In other words, we can compute the centered kernel matrix using only the kernel function. Over all pairs of points, we can write all $n^2$ entries in compact matrix notation as follows:
$$\hat{K} = K - \frac{1}{n} \mathbf{1}_{n \times n} K - \frac{1}{n} K \mathbf{1}_{n \times n} + \frac{1}{n^2} \mathbf{1}_{n \times n} K \mathbf{1}_{n \times n} = \Big(I - \frac{1}{n} \mathbf{1}_{n \times n}\Big) K \Big(I - \frac{1}{n} \mathbf{1}_{n \times n}\Big) \tag{5.17}$$
where $\mathbf{1}_{n \times n}$ is the $n \times n$ matrix all of whose entries equal one.

Example 5.12: Consider the first five points from the two-dimensional Iris dataset shown in Figure 5.1a:
$$x_1 = \begin{pmatrix} 5.9 \\ 3 \end{pmatrix} \quad x_2 = \begin{pmatrix} 6.9 \\ 3.1 \end{pmatrix} \quad x_3 = \begin{pmatrix} 6.6 \\ 2.9 \end{pmatrix} \quad x_4 = \begin{pmatrix} 4.6 \\ 3.2 \end{pmatrix} \quad x_5 = \begin{pmatrix} 6 \\ 2.2 \end{pmatrix}$$
Consider the linear kernel matrix shown in Figure 5.1b. We can center it by first computing the centering matrix
$$I - \frac{1}{5} \mathbf{1}_{5 \times 5} = \begin{pmatrix} 0.8 & -0.2 & -0.2 & -0.2 & -0.2 \\ -0.2 & 0.8 & -0.2 & -0.2 & -0.2 \\ -0.2 & -0.2 & 0.8 & -0.2 & -0.2 \\ -0.2 & -0.2 & -0.2 & 0.8 & -0.2 \\ -0.2 & -0.2 & -0.2 & -0.2 & 0.8 \end{pmatrix}$$

The centered kernel matrix (5.17) is then given as
$$\hat{K} = \Big(I - \frac{1}{5} \mathbf{1}_{5 \times 5}\Big)\, K\, \Big(I - \frac{1}{5} \mathbf{1}_{5 \times 5}\Big)$$
To verify that $\hat{K}$ is the same as the kernel matrix for the centered points, let us first center the points by subtracting the mean $\mu = (6.0, 2.88)^T$. The centered points in feature space are given as
$$z_1 = \begin{pmatrix} -0.1 \\ 0.12 \end{pmatrix} \quad z_2 = \begin{pmatrix} 0.9 \\ 0.22 \end{pmatrix} \quad z_3 = \begin{pmatrix} 0.6 \\ 0.02 \end{pmatrix} \quad z_4 = \begin{pmatrix} -1.4 \\ 0.32 \end{pmatrix} \quad z_5 = \begin{pmatrix} 0 \\ -0.68 \end{pmatrix}$$
For example, the kernel between $\phi(z_1)$ and $\phi(z_2)$ is
$$\phi(z_1)^T \phi(z_2) = z_1^T z_2 = -0.1 \cdot 0.9 + 0.12 \cdot 0.22 = -0.064$$
which matches $\hat{K}(x_1, x_2)$, as expected. The other entries can be verified in a similar manner. Thus, the kernel matrix obtained by centering the data and then computing the kernel is the same as that obtained via (5.17).
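The verification in Example 5.12 can be automated. A minimal NumPy sketch (assumed, not from the text) applies (5.17) and compares against the kernel of the explicitly centered points.

```python
import numpy as np

X = np.array([[5.9, 3.0], [6.9, 3.1], [6.6, 2.9], [4.6, 3.2], [6.0, 2.2]])
n = len(X)
K = X @ X.T                              # linear kernel matrix

C = np.eye(n) - np.ones((n, n)) / n      # centering matrix I - (1/n) 1_{nxn}
K_hat = C @ K @ C                        # centered kernel matrix via (5.17)

Z = X - X.mean(axis=0)                   # explicitly centered points
print(np.allclose(K_hat, Z @ Z.T))       # True
print(K_hat[0, 1])                       # about -0.064, the kernel between z1 and z2
```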

Normalizing in Feature Space: A common form of normalization is to ensure that points in feature space have unit length, by replacing $\phi(x_i)$ with the corresponding unit vector $\phi_n(x_i) = \frac{\phi(x_i)}{\|\phi(x_i)\|}$. The dot product in feature space then corresponds to the cosine of the angle between the two mapped points, since
$$\phi_n(x_i)^T \phi_n(x_j) = \frac{\phi(x_i)^T \phi(x_j)}{\|\phi(x_i)\|\, \|\phi(x_j)\|} = \cos\theta$$
If the mapped points are both centered and normalized, then the dot product corresponds to the correlation between the two points in feature space.

The normalized kernel matrix, $K_n$, can be computed using only the kernel function K, since
$$K_n(x_i, x_j) = \frac{\phi(x_i)^T \phi(x_j)}{\|\phi(x_i)\|\, \|\phi(x_j)\|} = \frac{K(x_i, x_j)}{\sqrt{K(x_i, x_i)\, K(x_j, x_j)}}$$
$K_n$ has all diagonal elements equal to one. Let W denote the diagonal matrix comprising the diagonal elements of K:
$$W = \mathrm{diag}(K) = \begin{pmatrix} K(x_1, x_1) & 0 & \cdots & 0 \\ 0 & K(x_2, x_2) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & K(x_n, x_n) \end{pmatrix}$$
The normalized kernel matrix can then be expressed compactly as
$$K_n = W^{-1/2}\, K\, W^{-1/2} \tag{5.18}$$
where $W^{-1/2}$ is the diagonal matrix whose i-th diagonal entry is $\frac{1}{\sqrt{K(x_i, x_i)}}$, with all other elements being zero.

Example 5.13: Consider the five points and the linear kernel matrix shown in Figure 5.1. Here W is the diagonal matrix with entries $K(x_1, x_1) = 43.81$, $K(x_2, x_2) = 57.22$, $K(x_3, x_3) = 51.97$, $K(x_4, x_4) = 31.40$, and $K(x_5, x_5) = 40.84$, and the normalized kernel matrix is given as $K_n = W^{-1/2} K W^{-1/2}$.

The same kernel would have been obtained if we had first normalized the feature vectors to have unit length and then taken the dot products. For example, with the linear kernel, the normalized point $\phi_n(x_1)$ is given as
$$\phi_n(x_1) = \frac{\phi(x_1)}{\|\phi(x_1)\|} = \frac{x_1}{\|x_1\|} = \frac{1}{\sqrt{43.81}} \begin{pmatrix} 5.9 \\ 3 \end{pmatrix} = \begin{pmatrix} 0.891 \\ 0.453 \end{pmatrix}$$
Likewise, we have $\phi_n(x_2) = \frac{1}{\sqrt{57.22}}\, (6.9, 3.1)^T = (0.912, 0.410)^T$. Their dot product is
$$\phi_n(x_1)^T \phi_n(x_2) = 0.891 \cdot 0.912 + 0.453 \cdot 0.410 = 0.998$$
which matches $K_n(x_1, x_2)$. If we start with the centered kernel matrix $\hat{K}$ from Example 5.12, and then normalize it, we obtain the normalized and centered kernel matrix $\hat{K}_n$. As noted earlier, the kernel value $\hat{K}_n(x_i, x_j)$ denotes the correlation between $x_i$ and $x_j$ in feature space, that is, it is the cosine of the angle between the centered points $\phi(x_i)$ and $\phi(x_j)$.

5.4 Kernels for Complex Objects

We conclude this chapter with some examples of kernels defined for complex data like strings and graphs. The use of kernels for dimensionality reduction, classification, and clustering will be considered in later chapters.

5.4.1 Spectrum Kernel for Strings

Consider text or sequence data defined over an alphabet $\Sigma$. The l-spectrum feature map is the mapping $\phi: \Sigma^* \to \mathbb{R}^{|\Sigma|^l}$ from the set of strings over $\Sigma$ to the $|\Sigma|^l$-dimensional space representing the number of occurrences of all possible substrings of length l, defined as
$$\phi(x) = \big(\ldots, \#(\alpha), \ldots\big)^T, \quad \alpha \in \Sigma^l$$
where $\#(\alpha)$ is the number of occurrences of the length-l string $\alpha$ in x.

The (full) spectrum map is an extension of the l-spectrum map, obtained by considering all lengths from $l = 0$ to $l = \infty$, leading to an infinite-dimensional feature map $\phi: \Sigma^* \to \mathbb{R}^\infty$:
$$\phi(x) = \big(\ldots, \#(\alpha), \ldots\big)^T, \quad \alpha \in \Sigma^*$$

where $\#(\alpha)$ is the number of occurrences of the string $\alpha$ in x.

The (l-)spectrum kernel between two strings $x_i, x_j \in \Sigma^*$ is simply the dot product between their (l-)spectrum maps:
$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$
A naive computation of the l-spectrum kernel takes $O(|\Sigma|^l)$ time. However, for a given string x of length n, the vast majority of the length-l strings have an occurrence count of zero, and can be ignored. The l-spectrum map can be effectively computed in $O(n)$ time for a string of length n (assuming $n \gg l$), since there can be at most $n - l + 1$ substrings of length l, and the l-spectrum kernel can thus be computed in $O(n + m)$ time for any two strings of length n and m, respectively.

The feature map for the (full) spectrum kernel is infinite-dimensional, but once again, for a given string x of length n, the vast majority of the strings will have an occurrence count of zero. A straightforward implementation of the spectrum map for a string x of length n can be computed in $O(n^2)$ time, since x can have at most $\sum_{l=1}^{n} (n - l + 1) = n(n+1)/2$ distinct non-empty substrings. The spectrum kernel can then be computed in $O(n^2 + m^2)$ time for any two strings of length n and m, respectively. However, a much more efficient computation is enabled via suffix trees, with a total time of $O(n + m)$.

Example 5.14: Consider sequences over the DNA alphabet $\Sigma = \{A, C, G, T\}$. Let $x_1 = ACAGCAGTA$ and let $x_2 = AGCAAGCGAG$. For $l = 3$, the feature space has dimensionality $|\Sigma|^l = 4^3 = 64$. Nevertheless, we do not have to map the input points into the full feature space; we can compute the reduced 3-spectrum mapping by counting the number of occurrences of only those length-3 substrings that occur in each input sequence, as follows:
$$\phi(x_1) = (ACA : 1,\; AGC : 1,\; AGT : 1,\; CAG : 2,\; GCA : 1,\; GTA : 1)$$
$$\phi(x_2) = (AAG : 1,\; AGC : 2,\; CAA : 1,\; CGA : 1,\; GAG : 1,\; GCA : 1,\; GCG : 1)$$
where $\alpha : \#(\alpha)$ denotes that substring $\alpha$ has $\#(\alpha)$ occurrences in $x_i$. We can then compute the dot product by considering only the common substrings, as follows:
$$K(x_1, x_2) = 1 \cdot 2 + 1 \cdot 1 = 3$$
The first term in the dot product is due to the substring AGC, and the second due to GCA, which are the only common length-3 substrings between $x_1$ and $x_2$.

The full spectrum kernel can be computed by considering the occurrences of all common substrings. For $x_1$ and $x_2$, the common substrings and their occurrence counts are given as

alpha:          A   C   G   AG   CA   GC   AGC   GCA   AGCA
#(alpha) in x1: 4   2   2   2    2    1    1     1     1
#(alpha) in x2: 4   2   4   3    1    2    2     1     1

Thus, the full spectrum kernel value is given as
$$K(x_1, x_2) = 4 \cdot 4 + 2 \cdot 2 + 2 \cdot 4 + 2 \cdot 3 + 2 \cdot 1 + 1 \cdot 2 + 1 \cdot 2 + 1 \cdot 1 + 1 \cdot 1 = 42$$
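The counting in Example 5.14 is easy to replicate. The sketch below is an assumed implementation (standard library only; spectrum_counts and spectrum_kernel are illustrative names) of the l-spectrum kernel and of the full spectrum kernel obtained by summing over all lengths.

```python
from collections import Counter

def spectrum_counts(s, l):
    """Counts of all length-l substrings of s."""
    return Counter(s[i:i + l] for i in range(len(s) - l + 1))

def spectrum_kernel(x, y, l):
    cx, cy = spectrum_counts(x, l), spectrum_counts(y, l)
    return sum(cx[a] * cy[a] for a in cx if a in cy)

x1, x2 = "ACAGCAGTA", "AGCAAGCGAG"
print(spectrum_kernel(x1, x2, 3))                                  # 3 (from AGC and GCA)
print(sum(spectrum_kernel(x1, x2, l) for l in range(1, len(x1))))  # 42, the full spectrum
```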

5.4.2 Diffusion Kernels on Graph Nodes

Let S be some symmetric similarity matrix between the nodes of a graph $G = (V, E)$. For instance, S can be the (weighted) adjacency matrix A (4.2), or the negated Laplacian matrix $-L = A - \Delta$ (17.7), where $\Delta$ is the degree matrix for the undirected graph G.

Consider the similarity between any two nodes obtained by summing the product of the similarities over paths of length 2:
$$S^{(2)}(x_i, x_j) = \sum_{a=1}^{n} S(x_i, x_a)\, S(x_a, x_j) = S_i^T S_j$$
where $S_i = \big(S(x_i, x_1),\, S(x_i, x_2),\, \ldots,\, S(x_i, x_n)\big)^T$ denotes the vector representing the i-th row of S (and, since S is symmetric, it also denotes the i-th column of S). Over all pairs of nodes, the length-2 path similarity matrix $S^{(2)}$ is thus given as the square of the base similarity matrix S:
$$S^{(2)} = S \cdot S = S^2$$
In general, if we sum up the product of the base similarities over all l-length paths between two nodes, we obtain the l-length similarity matrix $S^{(l)}$, which can be shown to be simply the l-th power of S, that is,
$$S^{(l)} = S^l$$

Power Kernels: Even path lengths lead to positive semi-definite kernels, but odd path lengths are not guaranteed to do so, unless the base matrix S is itself positive semi-definite. In particular, $K = S^2$ is a valid kernel. To see this, assume that the i-th row of S denotes the feature map for $x_i$, that is, $\phi(x_i) = S_i$. The kernel value between any two points is then a dot product in feature space:
$$K(x_i, x_j) = S^{(2)}(x_i, x_j) = S_i^T S_j = \phi(x_i)^T \phi(x_j)$$
For a general path length l, let $K = S^l$. Consider the eigen-decomposition of S:
$$S = U \Lambda U^T = \sum_{i=1}^{n} u_i\, \lambda_i\, u_i^T$$

where U is the orthogonal matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues of S:
$$U = \begin{pmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & & | \end{pmatrix} \qquad \Lambda = \begin{pmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{pmatrix}$$
The eigen-decomposition of K can be obtained as follows:
$$K = S^l = \big(U \Lambda U^T\big)^l = U \big(\Lambda^l\big) U^T$$
where we used the fact that the eigenvectors of S and $S^l$ are identical, and further that the eigenvalues of $S^l$ are given as $(\lambda_i)^l$ (for all $i = 1, \ldots, n$), where $\lambda_i$ is an eigenvalue of S. For $K = S^l$ to be a positive semi-definite matrix, all its eigenvalues must be non-negative, which is guaranteed for all even path lengths. Since $(\lambda_i)^l$ will be negative if l is odd and $\lambda_i$ is negative, odd path lengths lead to a positive semi-definite kernel only if S is positive semi-definite.

Exponential Diffusion Kernel: Instead of fixing the path length a priori, we can obtain a new kernel between nodes of a graph by considering paths of all possible lengths, but damping the contribution of longer paths, which leads to the exponential diffusion kernel, defined as
$$K = \sum_{l=0}^{\infty} \frac{1}{l!} \beta^l S^l = I + \beta S + \frac{1}{2!} \beta^2 S^2 + \frac{1}{3!} \beta^3 S^3 + \cdots = \exp\{\beta S\} \tag{5.19}$$
where $\beta \in \mathbb{R}$ is a damping factor, and $\exp\{\beta S\}$ is the matrix exponential. Substituting $S = U \Lambda U^T = \sum_{i=1}^{n} \lambda_i u_i u_i^T$ in (5.19), and utilizing the fact that $U U^T = \sum_{i=1}^{n} u_i u_i^T = I$, we have
$$K = I + \beta S + \frac{1}{2!} \beta^2 S^2 + \cdots$$
$$= \Big(\sum_{i=1}^{n} u_i u_i^T\Big) + \Big(\sum_{i=1}^{n} u_i\, \beta \lambda_i\, u_i^T\Big) + \Big(\sum_{i=1}^{n} u_i\, \frac{1}{2!} \beta^2 \lambda_i^2\, u_i^T\Big) + \cdots$$
$$= \sum_{i=1}^{n} u_i \Big(1 + \beta \lambda_i + \frac{1}{2!} \beta^2 \lambda_i^2 + \cdots\Big) u_i^T = \sum_{i=1}^{n} u_i\, e^{\beta \lambda_i}\, u_i^T$$

which can be written compactly as
$$K = U \begin{pmatrix} e^{\beta \lambda_1} & 0 & \cdots & 0 \\ 0 & e^{\beta \lambda_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & e^{\beta \lambda_n} \end{pmatrix} U^T \tag{5.20}$$
Thus, the eigenvectors of K are the same as those of S, whereas its eigenvalues are given as $e^{\beta \lambda_i}$, where $\lambda_i$ is an eigenvalue of S. K is symmetric, since S is symmetric, and its eigenvalues are real and non-negative, since the exponential of a real number is non-negative. K is thus a positive semi-definite kernel matrix. The complexity of computing the diffusion kernel is $O(n^3)$, corresponding to the complexity of computing the eigen-decomposition.

Von Neumann Diffusion Kernel: A related kernel based on powers of S is the von Neumann diffusion kernel, defined as
$$K = \sum_{l=0}^{\infty} \beta^l S^l \tag{5.21}$$
Expanding (5.21), we have
$$K = I + \beta S + \beta^2 S^2 + \beta^3 S^3 + \cdots = I + \beta S \big(I + \beta S + \beta^2 S^2 + \cdots\big) = I + \beta S K$$
Rearranging the terms above, we obtain a closed form expression for the von Neumann kernel:
$$K - \beta S K = I \;\implies\; (I - \beta S) K = I \;\implies\; K = (I - \beta S)^{-1}$$
Plugging in the eigen-decomposition $S = U \Lambda U^T$, and rewriting $I = U U^T$, we have
$$K = \big(U U^T - U (\beta \Lambda) U^T\big)^{-1} = \big(U (I - \beta \Lambda) U^T\big)^{-1} = U (I - \beta \Lambda)^{-1} U^T$$
where $(I - \beta \Lambda)^{-1}$ is the diagonal matrix whose i-th diagonal entry is $(1 - \beta \lambda_i)^{-1}$. The eigenvectors of K and S are identical, but the eigenvalues of K are given as $1/(1 - \beta \lambda_i)$.

For K to be a positive semi-definite kernel, all its eigenvalues should be non-negative, which in turn implies that
$$(1 - \beta \lambda_i)^{-1} \geq 0 \;\implies\; 1 - \beta \lambda_i \geq 0 \;\implies\; \beta \leq 1/\lambda_i$$
Furthermore, the inverse matrix $(I - \beta \Lambda)^{-1}$ exists only if
$$\det(I - \beta \Lambda) = \prod_{i=1}^{n} (1 - \beta \lambda_i) \neq 0$$
which implies that $\beta \neq 1/\lambda_i$ for all i. Thus, for K to be a valid kernel, we require that $\beta < 1/\lambda_i$ for all $i = 1, \ldots, n$. The von Neumann kernel is thus guaranteed to be positive semi-definite if $\beta < 1/\rho(S)$, where $\rho(S) = \max_i \{|\lambda_i|\}$, the largest eigenvalue in absolute value, is called the spectral radius of S.

Figure 5.2: Graph diffusion kernel: an example graph over the vertices $v_1, \ldots, v_5$.

Example 5.15: Consider the graph in Figure 5.2. Its adjacency matrix A and degree matrix $\Delta$ are read directly off the graph: $A(i, j) = 1$ if $(v_i, v_j)$ is an edge, and $\Delta$ is the diagonal matrix of vertex degrees. The negated Laplacian matrix for the graph is therefore
$$S = -L = A - \Delta$$
Computing the eigen-decomposition of S, we obtain its eigenvalues $\lambda_1 = 0 \geq \lambda_2 \geq \cdots \geq \lambda_5$,

along with the corresponding orthonormal eigenvectors $u_1, u_2, \ldots, u_5$, which form the columns of U. Assuming $\beta = 0.2$, the exponential diffusion kernel matrix is given as
$$K = \exp\{0.2\, S\} = U \begin{pmatrix} e^{0.2 \lambda_1} & 0 & \cdots & 0 \\ 0 & e^{0.2 \lambda_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & e^{0.2 \lambda_5} \end{pmatrix} U^T$$
For the von Neumann diffusion kernel, we first compute $(I - 0.2\, \Lambda)^{-1}$, the diagonal matrix whose i-th diagonal entry is $(1 - 0.2\, \lambda_i)^{-1}$; for instance, the second diagonal entry is $(1 - 0.2\, \lambda_2)^{-1}$. The von Neumann kernel is then given as
$$K = U\, (I - 0.2\, \Lambda)^{-1}\, U^T$$
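To make the two diffusion kernels concrete, the following is a hedged sketch using NumPy and SciPy (both assumed available). The 4-node graph here is an arbitrary illustration chosen for the example; it is not the graph of Figure 5.2.

```python
import numpy as np
from scipy.linalg import expm

A = np.array([[0, 1, 1, 0],                 # adjacency matrix of a small example graph
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
S = A - np.diag(A.sum(axis=1))              # negated Laplacian S = A - Delta

beta = 0.2                                  # damping factor, chosen below 1/rho(S)
K_exp = expm(beta * S)                      # exponential diffusion kernel, exp(beta S)
K_von = np.linalg.inv(np.eye(len(S)) - beta * S)   # von Neumann kernel, (I - beta S)^{-1}

# both matrices are symmetric with non-negative eigenvalues, hence valid PSD kernels
print(np.all(np.linalg.eigvalsh(K_exp) >= -1e-10),
      np.all(np.linalg.eigvalsh(K_von) >= -1e-10))
```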


More information

Quadratic forms. Here. Thus symmetric matrices are diagonalizable, and the diagonalization can be performed by means of an orthogonal matrix.

Quadratic forms. Here. Thus symmetric matrices are diagonalizable, and the diagonalization can be performed by means of an orthogonal matrix. Quadratic forms 1. Symmetric matrices An n n matrix (a ij ) n ij=1 with entries on R is called symmetric if A T, that is, if a ij = a ji for all 1 i, j n. We denote by S n (R) the set of all n n symmetric

More information

Mathematical foundations - linear algebra

Mathematical foundations - linear algebra Mathematical foundations - linear algebra Andrea Passerini passerini@disi.unitn.it Machine Learning Vector space Definition (over reals) A set X is called a vector space over IR if addition and scalar

More information

Kernel Methods. Machine Learning A W VO

Kernel Methods. Machine Learning A W VO Kernel Methods Machine Learning A 708.063 07W VO Outline 1. Dual representation 2. The kernel concept 3. Properties of kernels 4. Examples of kernel machines Kernel PCA Support vector regression (Relevance

More information

Fundamentals of Linear Algebra. Marcel B. Finan Arkansas Tech University c All Rights Reserved

Fundamentals of Linear Algebra. Marcel B. Finan Arkansas Tech University c All Rights Reserved Fundamentals of Linear Algebra Marcel B. Finan Arkansas Tech University c All Rights Reserved 2 PREFACE Linear algebra has evolved as a branch of mathematics with wide range of applications to the natural

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 11, 2009 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Kernel B Splines and Interpolation

Kernel B Splines and Interpolation Kernel B Splines and Interpolation M. Bozzini, L. Lenarduzzi and R. Schaback February 6, 5 Abstract This paper applies divided differences to conditionally positive definite kernels in order to generate

More information

MATH 220: INNER PRODUCT SPACES, SYMMETRIC OPERATORS, ORTHOGONALITY

MATH 220: INNER PRODUCT SPACES, SYMMETRIC OPERATORS, ORTHOGONALITY MATH 22: INNER PRODUCT SPACES, SYMMETRIC OPERATORS, ORTHOGONALITY When discussing separation of variables, we noted that at the last step we need to express the inhomogeneous initial or boundary data as

More information

RKHS, Mercer s theorem, Unbounded domains, Frames and Wavelets Class 22, 2004 Tomaso Poggio and Sayan Mukherjee

RKHS, Mercer s theorem, Unbounded domains, Frames and Wavelets Class 22, 2004 Tomaso Poggio and Sayan Mukherjee RKHS, Mercer s theorem, Unbounded domains, Frames and Wavelets 9.520 Class 22, 2004 Tomaso Poggio and Sayan Mukherjee About this class Goal To introduce an alternate perspective of RKHS via integral operators

More information

SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS

SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS G. RAMESH Contents Introduction 1 1. Bounded Operators 1 1.3. Examples 3 2. Compact Operators 5 2.1. Properties 6 3. The Spectral Theorem 9 3.3. Self-adjoint

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

LINEAR ALGEBRA BOOT CAMP WEEK 4: THE SPECTRAL THEOREM

LINEAR ALGEBRA BOOT CAMP WEEK 4: THE SPECTRAL THEOREM LINEAR ALGEBRA BOOT CAMP WEEK 4: THE SPECTRAL THEOREM Unless otherwise stated, all vector spaces in this worksheet are finite dimensional and the scalar field F is R or C. Definition 1. A linear operator

More information

Properties of Matrices and Operations on Matrices

Properties of Matrices and Operations on Matrices Properties of Matrices and Operations on Matrices A common data structure for statistical analysis is a rectangular array or matris. Rows represent individual observational units, or just observations,

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Kernel Methods Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574 1 / 21

More information

Your first day at work MATH 806 (Fall 2015)

Your first day at work MATH 806 (Fall 2015) Your first day at work MATH 806 (Fall 2015) 1. Let X be a set (with no particular algebraic structure). A function d : X X R is called a metric on X (and then X is called a metric space) when d satisfies

More information

10-701/ Recitation : Kernels

10-701/ Recitation : Kernels 10-701/15-781 Recitation : Kernels Manojit Nandi February 27, 2014 Outline Mathematical Theory Banach Space and Hilbert Spaces Kernels Commonly Used Kernels Kernel Theory One Weird Kernel Trick Representer

More information

Data Mining and Analysis: Fundamental Concepts and Algorithms

Data Mining and Analysis: Fundamental Concepts and Algorithms : Fundamental Concepts and Algorithms dataminingbook.info Mohammed J. Zaki 1 Wagner Meira Jr. 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY, USA 2 Department of Computer

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces Reproducing Kernel Hilbert Spaces Lorenzo Rosasco 9.520 Class 03 February 9, 2011 About this class Goal To introduce a particularly useful family of hypothesis spaces called Reproducing Kernel Hilbert

More information

Linear Algebra: Matrix Eigenvalue Problems

Linear Algebra: Matrix Eigenvalue Problems CHAPTER8 Linear Algebra: Matrix Eigenvalue Problems Chapter 8 p1 A matrix eigenvalue problem considers the vector equation (1) Ax = λx. 8.0 Linear Algebra: Matrix Eigenvalue Problems Here A is a given

More information

The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space

The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space The Laplacian PDF Distance: A Cost Function for Clustering in a Kernel Feature Space Robert Jenssen, Deniz Erdogmus 2, Jose Principe 2, Torbjørn Eltoft Department of Physics, University of Tromsø, Norway

More information

Reproducing Kernel Hilbert Spaces

Reproducing Kernel Hilbert Spaces 9.520: Statistical Learning Theory and Applications February 10th, 2010 Reproducing Kernel Hilbert Spaces Lecturer: Lorenzo Rosasco Scribe: Greg Durrett 1 Introduction In the previous two lectures, we

More information

CS 143 Linear Algebra Review

CS 143 Linear Algebra Review CS 143 Linear Algebra Review Stefan Roth September 29, 2003 Introductory Remarks This review does not aim at mathematical rigor very much, but instead at ease of understanding and conciseness. Please see

More information

Kernel Principal Component Analysis

Kernel Principal Component Analysis Kernel Principal Component Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr

More information

Lecture 6 Sept Data Visualization STAT 442 / 890, CM 462

Lecture 6 Sept Data Visualization STAT 442 / 890, CM 462 Lecture 6 Sept. 25-2006 Data Visualization STAT 442 / 890, CM 462 Lecture: Ali Ghodsi 1 Dual PCA It turns out that the singular value decomposition also allows us to formulate the principle components

More information

Kernel PCA: keep walking... in informative directions. Johan Van Horebeek, Victor Muñiz, Rogelio Ramos CIMAT, Guanajuato, GTO

Kernel PCA: keep walking... in informative directions. Johan Van Horebeek, Victor Muñiz, Rogelio Ramos CIMAT, Guanajuato, GTO Kernel PCA: keep walking... in informative directions Johan Van Horebeek, Victor Muñiz, Rogelio Ramos CIMAT, Guanajuato, GTO Kernel PCA: keep walking... in informative directions Johan Van Horebeek, Victor

More information

Linear Algebra March 16, 2019

Linear Algebra March 16, 2019 Linear Algebra March 16, 2019 2 Contents 0.1 Notation................................ 4 1 Systems of linear equations, and matrices 5 1.1 Systems of linear equations..................... 5 1.2 Augmented

More information

Vector Spaces. Vector space, ν, over the field of complex numbers, C, is a set of elements a, b,..., satisfying the following axioms.

Vector Spaces. Vector space, ν, over the field of complex numbers, C, is a set of elements a, b,..., satisfying the following axioms. Vector Spaces Vector space, ν, over the field of complex numbers, C, is a set of elements a, b,..., satisfying the following axioms. For each two vectors a, b ν there exists a summation procedure: a +

More information

Kernel Methods in Machine Learning

Kernel Methods in Machine Learning Kernel Methods in Machine Learning Autumn 2015 Lecture 1: Introduction Juho Rousu ICS-E4030 Kernel Methods in Machine Learning 9. September, 2015 uho Rousu (ICS-E4030 Kernel Methods in Machine Learning)

More information

The spectra of super line multigraphs

The spectra of super line multigraphs The spectra of super line multigraphs Jay Bagga Department of Computer Science Ball State University Muncie, IN jbagga@bsuedu Robert B Ellis Department of Applied Mathematics Illinois Institute of Technology

More information

Each new feature uses a pair of the original features. Problem: Mapping usually leads to the number of features blow up!

Each new feature uses a pair of the original features. Problem: Mapping usually leads to the number of features blow up! Feature Mapping Consider the following mapping φ for an example x = {x 1,...,x D } φ : x {x1,x 2 2,...,x 2 D,,x 2 1 x 2,x 1 x 2,...,x 1 x D,...,x D 1 x D } It s an example of a quadratic mapping Each new

More information

Lecture 10: Support Vector Machine and Large Margin Classifier

Lecture 10: Support Vector Machine and Large Margin Classifier Lecture 10: Support Vector Machine and Large Margin Classifier Applied Multivariate Analysis Math 570, Fall 2014 Xingye Qiao Department of Mathematical Sciences Binghamton University E-mail: qiao@math.binghamton.edu

More information

Systems of Algebraic Equations and Systems of Differential Equations

Systems of Algebraic Equations and Systems of Differential Equations Systems of Algebraic Equations and Systems of Differential Equations Topics: 2 by 2 systems of linear equations Matrix expression; Ax = b Solving 2 by 2 homogeneous systems Functions defined on matrices

More information

Principal Component Analysis

Principal Component Analysis CSci 5525: Machine Learning Dec 3, 2008 The Main Idea Given a dataset X = {x 1,..., x N } The Main Idea Given a dataset X = {x 1,..., x N } Find a low-dimensional linear projection The Main Idea Given

More information

Connection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis

Connection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis Connection of Local Linear Embedding, ISOMAP, and Kernel Principal Component Analysis Alvina Goh Vision Reading Group 13 October 2005 Connection of Local Linear Embedding, ISOMAP, and Kernel Principal

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

SPECTRAL CLUSTERING AND KERNEL PRINCIPAL COMPONENT ANALYSIS ARE PURSUING GOOD PROJECTIONS

SPECTRAL CLUSTERING AND KERNEL PRINCIPAL COMPONENT ANALYSIS ARE PURSUING GOOD PROJECTIONS SPECTRAL CLUSTERING AND KERNEL PRINCIPAL COMPONENT ANALYSIS ARE PURSUING GOOD PROJECTIONS VIKAS CHANDRAKANT RAYKAR DECEMBER 5, 24 Abstract. We interpret spectral clustering algorithms in the light of unsupervised

More information

PCA, Kernel PCA, ICA

PCA, Kernel PCA, ICA PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per

More information

1 Distributions (due January 22, 2009)

1 Distributions (due January 22, 2009) Distributions (due January 22, 29). The distribution derivative of the locally integrable function ln( x ) is the principal value distribution /x. We know that, φ = lim φ(x) dx. x ɛ x Show that x, φ =

More information

LINEAR ALGEBRA QUESTION BANK

LINEAR ALGEBRA QUESTION BANK LINEAR ALGEBRA QUESTION BANK () ( points total) Circle True or False: TRUE / FALSE: If A is any n n matrix, and I n is the n n identity matrix, then I n A = AI n = A. TRUE / FALSE: If A, B are n n matrices,

More information

STATISTICAL LEARNING SYSTEMS

STATISTICAL LEARNING SYSTEMS STATISTICAL LEARNING SYSTEMS LECTURE 8: UNSUPERVISED LEARNING: FINDING STRUCTURE IN DATA Institute of Computer Science, Polish Academy of Sciences Ph. D. Program 2013/2014 Principal Component Analysis

More information