CR2: Statistical Learning & Applications
Examples of Kernels and Unsupervised Learning
Lecturer: Julien Mairal
Scribes: Rémi De Joannis de Verclos & Karthik Srikanta

1 Kernel Inventory

Linear kernel

The linear kernel is defined as K(x, y) = x^T y, where K : R^p x R^p -> R. Let us show that the corresponding reproducing kernel Hilbert space is H = {f_w : y -> w^T y | w in R^p}, where f_w is the function mapping y to w^T y, with the inner product <f_{w_1}, f_{w_2}>_H = w_1^T w_2. We check the following conditions:

(1) for all x in R^p, the function K_x : y -> K(x, y) belongs to H;

(2) for all f_w in H and all x in R^p,
    f_w(x) = <f_w, K_x>_H = w^T x;

(3) K is positive semidefinite, i.e., for every m and every (x_1, ..., x_m) in (R^p)^m, the matrix defined by [K]_{i,j} = K(x_i, x_j) satisfies alpha^T K alpha >= 0 for all alpha in R^m.

Polynomial kernel of degree 2

The polynomial kernel is given by K(x, y) = (x^T y)^2. It is positive semidefinite since

sum_{i,j} alpha_i alpha_j K(x_i, x_j) = sum_{i,j} alpha_i alpha_j (x_i^T x_j)^2
  = sum_{i,j} alpha_i alpha_j Trace(x_i x_i^T x_j x_j^T)
  = sum_{i,j} alpha_i alpha_j <x_i x_i^T, x_j x_j^T>_F
  = < sum_i alpha_i x_i x_i^T, sum_j alpha_j x_j x_j^T >_F
  = Trace(Z^T Z)  where Z = sum_i alpha_i x_i x_i^T
  = ||Z||_F^2 >= 0.
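As a quick numerical sanity check of this argument (this snippet is not part of the original notes; it assumes numpy is available), one can build the Gram matrix of the degree-2 polynomial kernel on random points and verify that its eigenvalues are non-negative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 5))          # 10 points in R^5

# Gram matrix of the degree-2 polynomial kernel K(x, y) = (x^T y)^2
K = (X @ X.T) ** 2

# A positive semidefinite matrix has only non-negative eigenvalues
# (up to numerical error).
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())
assert eigvals.min() >= -1e-8
```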
It is important here to define the Frobenius norm: ||Z||_F = sqrt( sum_{i,j} Z_{i,j}^2 ). Also note that <Z_1, Z_2>_F = Trace(Z_1^T Z_2).

We would now like to find H such that conditions (1) and (2) above are satisfied. To satisfy (1), we need f_x : y -> K(x, y) to belong to H for all x in R^p, and

f_x(y) = (x^T y)^2 = <x x^T, y y^T>_F = y^T (x x^T) y.

Because H is a Hilbert space, it must also contain, for all beta in R^n and (x_1, ..., x_n) in (R^p)^n, the function y -> y^T ( sum_{i=1}^n beta_i x_i x_i^T ) y. A good candidate for H is thus

H = { y -> y^T A y | A symmetric, A in R^{p x p} }   with   <f_A, f_B>_H = Trace(A B).

This is easy to check: f_A(x) = x^T A x = <A, x x^T>_F and K_x = f_{x x^T}, which implies f_A(x) = <f_A, K_x>_H.

Gaussian kernel

The Gaussian kernel is given by K(x, y) = exp( -||x - y||^2 / omega^2 ) = k(x - y). The corresponding RKHS is more involved. We will admit that if we define

<f, g>_H = integral of ( \hat{f}(w) \overline{\hat{g}(w)} ) / ( (2 pi)^p \hat{k}(w) ) dw,

then H = { f | f is integrable, continuous and <f, f>_H < +infinity }.

The min kernel

Let K(x, y) = min(x, y) for x, y in [0, 1]. Let us show that K is positive semidefinite: for all alpha in R^n and (x_1, ..., x_n) in [0, 1]^n we have

sum_{i,j} alpha_i alpha_j min(x_i, x_j) = sum_{i,j} alpha_i alpha_j integral_0^1 1_{t <= x_i} 1_{t <= x_j} dt
  = integral_0^1 ( sum_{i=1}^n alpha_i 1_{t <= x_i} ) ( sum_{j=1}^n alpha_j 1_{t <= x_j} ) dt
  = integral_0^1 Z(t)^2 dt >= 0,   where Z(t) = sum_{i=1}^n alpha_i 1_{t <= x_i}.

It is thus appealing to believe that H = { f | f in L^2[0, 1] } with <f, g> = integral_0^1 f(t) g(t) dt. However, this is a mistake, since <f, f> = 0 does not imply that f = 0. In fact,

H = { f : [0, 1] -> R | f continuous, differentiable almost everywhere, and f(0) = 0 }.

Now we take <f, g>_H = integral_0^1 f'(t) g'(t) dt, so that <f, f>_H = 0 implies f' = 0, which implies f = 0. We can now check the remaining conditions. First, K_x : y -> min(x, y) belongs to H.
Second, for all f in H and x in [0, 1],

<f, K_x>_H = integral_0^1 f'(t) 1_{t <= x} dt = integral_0^x f'(t) dt = f(x) - f(0) = f(x),

so the reproducing property holds.

[Figure 1: Graph of K_x. The function y -> min(x, y) increases linearly with slope 1 up to y = x and is constant equal to x afterwards.]

Histogram kernel

The histogram kernel is given by K(h_1, h_2) = sum_{j=1}^k min(h_1(j), h_2(j)), where h_1, h_2 in [0, 1]^k. It is notably used in computer vision for comparing histograms of visual words.

Spectrum kernel

We now motivate the spectrum kernel through biology. The human genome, which carries the protein-coding genes, consists of 3,000,000,000 bases drawn from the building blocks A, T, C, G. Consider a sequence u in A^k of size k (where in this case A = {A, T, C, G}). Let phi_u(x) be the number of occurrences of u in x. We define the spectrum kernel as

K(x, x') = sum_{u in A^k} phi_u(x) phi_u(x') = <phi(x), phi(x')>,   where phi(x) in R^{|A|^k}.

We are interested in computing K(x, x') efficiently. We have

K(x, x') = sum_{i=1}^{|x|-k+1} sum_{j=1}^{|x'|-k+1} 1_{ x[i, i+k-1] = x'[j, j+k-1] }.

The computation can be done in O((|x| + |x'|) k) by using a trie. We demonstrate this with an example. Let x = ACGTTTACGA and x' = AGTTTACG.

[Figure: the tries built from the length-3 substrings of x and x', with the number of occurrences of each substring stored at the leaves.]
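Before the trie-based procedure of Algorithm 1 below, here is a naive sketch (not from the original notes; plain Python, explicit k-mer counting) that computes the same quantity and already runs in time linear in the sequence lengths for a fixed k:

```python
from collections import Counter

def spectrum_kernel(x: str, xp: str, k: int) -> int:
    """K(x, x') = sum over all k-mers u of phi_u(x) * phi_u(x')."""
    # Count every length-k substring (k-mer) of each sequence.
    counts_x = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    counts_xp = Counter(xp[j:j + k] for j in range(len(xp) - k + 1))
    # Only k-mers present in both sequences contribute to the sum.
    return sum(c * counts_xp[u] for u, c in counts_x.items() if u in counts_xp)

print(spectrum_kernel("ACGTTTACGA", "AGTTTACG", k=3))
```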
Algorithm 1 ProcessNode(v)
Input: trie representations of the sequences x and x'.
Output: K(x, x').
1: if v is a leaf then
2:   K(x, x') <- K(x, x') + i(v_x) * i(v_{x'}), where i(v_x) and i(v_{x'}) are the occurrence counts stored at the leaf v in the tries of x and x'
3: else
4:   for all children common to v in the tries of x and x' do
5:     ProcessNode(child)
6:   end for
7: end if
8: The recursion is started at the root: ProcessNode(r).

Exercises
1. Show that if K_1 and K_2 are positive semidefinite then
   (a) K_1 + K_2 is positive semidefinite;
   (b) alpha K_1 + beta K_2 is positive semidefinite, where alpha, beta > 0.
2. Show that if (K_n)_{n>0} -> K pointwise, each K_n being positive semidefinite, then K is positive semidefinite.
3. Show that K(x, y) = min(x, y) is positive semidefinite for all x, y in [0, +infinity[.

Walk kernel

We say that G' = (V', E') is a subgraph of G = (V, E) if V' is a subset of V, E' is a subset of E, and the labels of V' are those of the corresponding vertices of G. Let phi_U(G) be the number of times U appears as a subgraph of G. The subgraph kernel is defined as

K(G, G') = sum_U phi_U(G) phi_U(G').

Unfortunately, computing this kernel is NP-hard. We therefore turn to the walk kernel, and first recall some definitions. A walk is an alternating sequence of vertices and connecting edges; less formally, a walk is any route through a graph from vertex to vertex along edges. A walk can end on the same vertex on which it began or on a different vertex, and it can travel over any edge and any vertex any number of times. A path is a walk that does not include any vertex twice, except that its first vertex may be the same as its last. Now let

W_k(G) = { walks in G of size k },
S_k = { sequences of labels of size k } = A^k.

For s in S_k, phi_s(G) = sum_{w in W_k(G)} 1_{labels(w) = s}. This gives us

K(G_1, G_2) = sum_{s in S_k} phi_s(G_1) phi_s(G_2).

Computing the walk kernel seems hard at first sight, but it can be done efficiently by using the product graph, defined as follows. Given graphs G_1 = (V_1, E_1) and G_2 = (V_2, E_2), we define

G_1 x G_2 = ( { (v_1, v_2) in V_1 x V_2 with the same labels },
              { ((v_1, v_2), (v_1', v_2')) such that (v_1, v_1') in E_1 and (v_2, v_2') in E_2 } ).

There is a bijection between walks in G_1 x G_2 and pairs of walks in G_1 and G_2 with the same labels. More formally,
K(G_1, G_2) = sum_{w_1 in W_k(G_1)} sum_{w_2 in W_k(G_2)} 1_{ l(w_1) = l(w_2) }
            = sum_{w in W_k(G_1 x G_2)} 1
            = |W_k(G_1 x G_2)|.

[Figure 2: Example of graph product. Two labelled graphs G_1 and G_2 and their product graph G_1 x G_2, whose vertices are the pairs of identically labelled vertices of G_1 and G_2.]

Exercise
1. Let A be the adjacency matrix of G, A in R^{|V| x |V|}. Prove that the number of walks of size k starting at i and ending at j is [A^k]_{i,j}.
2. Show that K(G_1, G_2) can be computed in polynomial time.
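The exercise suggests the efficient implementation: build the product graph, take powers of its adjacency matrix, and sum the entries. A minimal sketch (not from the notes; it assumes numpy and counts pairs of label-matching walks with exactly k edges):

```python
import numpy as np

def product_graph_adjacency(A1, labels1, A2, labels2):
    """Adjacency matrix of G1 x G2: vertices are pairs of identically
    labelled vertices, edges are pairs of edges."""
    pairs = [(i, j) for i in range(len(labels1)) for j in range(len(labels2))
             if labels1[i] == labels2[j]]
    A = np.zeros((len(pairs), len(pairs)))
    for a, (i, j) in enumerate(pairs):
        for b, (k, l) in enumerate(pairs):
            A[a, b] = A1[i, k] * A2[j, l]   # edge present in both graphs
    return A

def walk_kernel(A1, labels1, A2, labels2, k):
    """Number of pairs of label-matching walks with k edges:
    the sum of the entries of the k-th power of the product-graph adjacency."""
    A = product_graph_adjacency(A1, labels1, A2, labels2)
    return np.linalg.matrix_power(A, k).sum()

# Toy example with two small labelled graphs.
A1 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])
A2 = np.array([[0, 1], [1, 0]])
print(walk_kernel(A1, ["a", "b", "c"], A2, ["a", "b"], k=2))
```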
2 Unsupervised Learning

Unsupervised learning consists of discovering the underlying structure of data without labels. It is useful for many tasks, such as removing noise from the data (preprocessing), interpreting and visualizing the data, compression, etc.

Principal component analysis

We have a centered data set X = [x_1, ..., x_n] in R^{p x n}. The projection of x onto the direction w in R^p is (w^T x / ||w||_2^2) w, and the empirical variance captured by w is

Var(w) = (1/n) sum_{i=1}^n (w^T x_i)^2 / ||w||_2^2.

We provide below the PCA (Principal Component Analysis) algorithm:

Algorithm 2 PCA
1: Suppose we have (w_1, ..., w_{i-1}).
2: Construct w_i in argmax_{ ||w||_2 = 1, w orthogonal to w_1, ..., w_{i-1} } Var(w).

Lemma: w_1, ..., w_i are the successive eigenvectors of X X^T, ordered by decreasing eigenvalue.

Proof: We have

w_1 = argmax_{||w||_2 = 1} (1/n) sum_{i=1}^n w^T x_i x_i^T w = argmax_{||w||_2 = 1} w^T X X^T w.

X X^T is symmetric, so it can be diagonalized as X X^T = U S U^T, where U is orthogonal and S is diagonal. Then

w_1 = argmax_{||w||_2 = 1} w^T U S U^T w = argmax_{||w||_2 = 1} (U^T w)^T S (U^T w) = U ( argmax_{||z||_2 = 1} z^T S z ),   where z = U^T w.

For w_i, i > 1, we have

w_i = argmax_{ ||w||_2 = 1, w orthogonal to u_1, ..., u_{i-1} } w^T U S U^T w
    = argmax_{ ||w||_2 = 1, w orthogonal to u_1, ..., u_{i-1} } (U^T w)^T S (U^T w)
    = U ( argmax_{ ||z||_2 = 1, z orthogonal to e_1, ..., e_{i-1} } z^T S z ).

As a consequence, PCA can be computed with the SVD.

Theorem (Eckart-Young theorem): PCA (SVD) provides the best low-rank approximation:

min_{ X' in R^{p x n}, rank(X') <= k } ||X - X'||^2,

where ||.|| denotes the Frobenius norm.

Proof: Any matrix X' of rank k can be written X' = sum_{i=1}^k s_i' u_i' v_i'^T = U_k S_k V_k^T, where U_k in R^{p x k} with U_k^T U_k = I, V_k in R^{n x k} with V_k^T V_k = I, and S_k is a k x k diagonal matrix. Thus,

min_{ X' in R^{p x n}, rank(X') <= k } ||X - X'||^2 = min_{ U^T U = I, V^T V = I, S diagonal, rank(U S V^T) <= k } ||X - U S V^T||^2.
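As an aside before the lemma used to finish this proof, here is a small numerical illustration (not from the notes; it assumes numpy) of the statement above that PCA can be computed with the SVD: the left singular vectors of the centered data matrix coincide, up to sign, with the eigenvectors of X X^T.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))        # p = 5 features, n = 100 points
X = X - X.mean(axis=1, keepdims=True)    # center the data

# Eigenvectors of X X^T, reordered by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(X @ X.T)
order = np.argsort(eigvals)[::-1]
eigvecs = eigvecs[:, order]

# Left singular vectors of X (sorted by decreasing singular value)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Same directions up to sign: |<u_i, w_i>| should be 1 for every component
print(np.abs(np.sum(U * eigvecs, axis=0)))
```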
Lemma: Let U in R^{p x m} and V in R^{n x m}, where m = min(n, p). If U^T U = I and V^T V = I, then the matrices Z_{ij} = U_i V_j^T (with U_i and V_j the i-th and j-th columns of U and V) form an orthonormal family of R^{p x n} for the Frobenius inner product.

Proof: It suffices to check the scalar products:

<Z_{ij}, Z_{kl}> = Trace(Z_{ij}^T Z_{kl}) = Trace(V_j U_i^T U_k V_l^T) = (U_i^T U_k)(V_l^T V_j) = 1_{ i = k and j = l },

since U and V have orthonormal columns.

We have X = sum_{i=1}^m s_i Z_{ii}, and thanks to the lemma we can write X' in terms of the Z_{ij}: X' = sum_{i,j} s'_{ij} Z_{ij}. Under the rank constraint, ||X - X'||^2 is minimal iff sum_{i,j} (s_{ij} - s'_{ij})^2 is minimal, i.e., iff

sum_{i=1}^m (s_i - s'_{ii})^2 + sum_{i != j} (s'_{ij})^2

is minimal (writing s_{ii} = s_i and s_{ij} = 0 for i != j). The first term is minimized by taking s'_{ii} = s_i for the k largest values of s_i, in accordance with the rank constraint, and the second term should be set to 0.

Kernel PCA

If we have a feature map phi : x -> phi(x), the variance captured by a function f in H is

Var(f) = (1/n) sum_{i=1}^n f(x_i)^2 / ||f||_H^2,

assuming the phi(x_i) are centered in the feature space. A few questions remain:

1. What does it mean to have the phi(x_i) centered?
2. How do we solve
   (1)   argmax_{ f in H, f orthogonal to f_1, ..., f_{i-1} } Var(f)?

For question 1: having the phi(x_i) centered means implicitly replacing phi(x_i) by phi(x_i) - m, where m = (1/n) sum_{i=1}^n phi(x_i) is the average of the phi(x_i). If our kernel is K(x_i, x_j) = <phi(x_i), phi(x_j)>_H, the new kernel becomes

K_c(x_i, x_j) = <phi(x_i) - m, phi(x_j) - m>
  = <phi(x_i), phi(x_j)> - (1/n) sum_{l=1}^n <phi(x_l), phi(x_i)> - (1/n) sum_{l=1}^n <phi(x_l), phi(x_j)> + (1/n^2) sum_{l,k} <phi(x_l), phi(x_k)>.
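A small numerical sketch of this centering (not in the original notes; it assumes numpy and applies the element-wise formula above, the equivalent matrix expression being given just below):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
K = (X @ X.T + 1.0) ** 2            # some kernel matrix, here a polynomial one
n = K.shape[0]

# Element-wise centering:
# K_c[i, j] = K[i, j] - mean_l K[l, i] - mean_l K[l, j] + mean_{l, k} K[l, k]
col_mean = K.mean(axis=0)
Kc = K - col_mean[:, None] - col_mean[None, :] + K.mean()

# Sanity check: the centered Gram matrix has zero row and column sums.
print(np.allclose(Kc.sum(axis=0), 0.0))
```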
In terms of matrices, this is written K_c = (I - 1/n) K (I - 1/n), where 1 denotes the n x n matrix with all coefficients equal to 1.

For question 2: by the representer theorem, any solution of (1) has the form f : x -> sum_{i=1}^n alpha_i K(x_i, x). What we need are the alpha's. We have ||f||_H^2 = alpha^T K alpha, f(x_i) = [K alpha]_i, and <f, f_l>_H = 0 iff alpha^T K alpha_l = 0. We therefore want to find

max_{alpha in R^n} (1/n) ||K alpha||_2^2 / (alpha^T K alpha)   s.t.   alpha^T K alpha_l = 0 for l < i,

i.e.,

max_alpha (alpha^T K^2 alpha) / (alpha^T K alpha)   with   alpha^T K alpha_l = 0 for l < i.

K is symmetric, so K = U S^2 U^T, where U is orthogonal and S is diagonal. We want

max_alpha (alpha^T U S^4 U^T alpha) / (alpha^T U S^2 U^T alpha) = max_{beta in R^n} (beta^T S^2 beta) / ||beta||_2^2,   where beta = S U^T alpha and beta_l = S U^T alpha_l.

If the eigenvalues in S are ordered decreasingly, the successive solutions are beta = e_l, from which alpha is recovered through beta = S U^T alpha.

Recipe:
1. Center the kernel matrix.
2. Compute the eigenvalue decomposition K_c = U S^2 U^T.
3. alpha_l = s_l^{-1} U_l, the l-th eigenvector rescaled by the inverse of the corresponding singular value.
4. f_l(x_i) = [K_c alpha_l]_i, the coordinate of the point x_i on the l-th kernel principal direction.

K-means clustering

Let X = [x_1, ..., x_n] in R^{p x n} be data points. We would like to form clusters around centroids C = [C_1, ..., C_k] in R^{p x k}. Our goals are to:
1. learn C;
2. assign each data point to a centroid C_j.
[Figure: data points in the plane (axes x and y).]

A popular way to do this is the K-means algorithm:

Algorithm 3 K-means
1: for i = 1, ..., n do
2:   assign l_i in argmin_{j = 1, ..., k} ||x_i - C_j||_2^2
3: end for
4: for j = 1, ..., k do
5:   re-estimate the centroid C_j <- (1/n_j) sum_{i : l_i = j} x_i, where n_j is the number of points assigned to cluster j
6: end for
The two steps are alternated until the assignments stabilize.

The algorithm can be interpreted as follows: we are trying to find C (and labels l) solving

min_{C, l} sum_{i=1}^n ||x_i - C_{l_i}||_2^2.

Given some fixed labels, setting the gradient with respect to C_j to zero,

grad_{C_j} f = 2 sum_{i : l_i = j} (C_j - x_i) = 0,

recovers exactly the centroid update above. Note that K-means is not an optimal algorithm: min_{C, l} sum_{i=1}^n ||x_i - C_{l_i}||_2^2 is a non-convex optimization problem with several local minima.

Kernel K-means

Kernel K-means is defined by the problem

min_{C, l} sum_{i=1}^n ||phi(x_i) - C_{l_i}||_H^2,

where the centroids now live in the feature space H. Given some labels, the re-estimation step is

C_j <- (1/n_j) sum_{i : l_i = j} phi(x_i).
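Going back to the plain K-means iterations of Algorithm 3, a minimal sketch might look as follows (not from the notes; it assumes numpy, and the kernelized assignment step is derived next):

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """X: (p, n) data matrix as in the notes. Returns centroids C (p, k) and labels."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    C = X[:, rng.choice(n, size=k, replace=False)]      # initialize with random points
    for _ in range(n_iter):
        # Assignment step: nearest centroid in squared Euclidean distance
        d2 = ((X[:, :, None] - C[:, None, :]) ** 2).sum(axis=0)   # shape (n, k)
        labels = d2.argmin(axis=1)
        # Re-estimation step: each centroid is the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                C[:, j] = X[:, labels == j].mean(axis=1)
    return C, labels

X = np.hstack([np.random.randn(2, 50), np.random.randn(2, 50) + 5.0])
C, labels = kmeans(X, k=2)
```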
The assignment step can also be expressed with kernel evaluations only:

l_i in argmin_j || phi(x_i) - (1/n_j) sum_{l in S_j} phi(x_l) ||_H^2
    = argmin_j ( ||phi(x_i)||^2 + ||C_j||^2 - (2/n_j) sum_{l in S_j} K(x_i, x_l) ),

where S_j = {i : l_i = j}; the term ||C_j||^2 likewise expands into kernel evaluations.

Mixture of Gaussians

The Gaussian distribution is given by

N(x | mu, Sigma) = ( 1 / ( (2 pi)^{p/2} |Sigma|^{1/2} ) ) exp( -(1/2) (x - mu)^T Sigma^{-1} (x - mu) ).

Assume that the data is generated by the following procedure:
1. Random class assignment C: P(C = C_j) = pi_j, where sum_{j=1}^k pi_j = 1 and pi_j >= 0.
2. Given the class C_j, draw x ~ N(x | mu_j, Sigma_j).

The parameters are pi_1, ..., pi_k and (mu_1, Sigma_1), ..., (mu_k, Sigma_k). If we observe x_1, ..., x_n, can we infer the parameters theta = (pi, mu, Sigma)? The maximum likelihood estimator is

theta_hat = argmax_theta P_theta(x_1, ..., x_n) = argmax_theta prod_{i=1}^n P_theta(x_i).

One algorithm to get an approximate solution is called EM, for "Expectation-Maximization", which iteratively increases the likelihood.

Algorithm 4 Expectation-Maximization
1: Define L(theta) = log P_theta(x_1, ..., x_n) = sum_{i=1}^n log P_theta(x_i).
2: repeat
3:   E-step: find q given theta fixed (auxiliary soft assignments).
4:   M-step: maximize some function l(theta, q) with q fixed.
5: until convergence

Explanations: at the E-step,

q_{ij} = pi_j N(x_i | mu_j, Sigma_j) / sum_{l=1}^k pi_l N(x_i | mu_l, Sigma_l).

Notice that sum_{j=1}^k q_{ij} = 1 for every i. This can be interpreted as a soft assignment of every data point to the classes.

At the M-step we have:

pi_j <- (1/n) sum_{i=1}^n q_{ij},
mu_j <- sum_{i=1}^n q_{ij} x_i / sum_{i=1}^n q_{ij},
Sigma_j <- sum_{i=1}^n q_{ij} (x_i - mu_j)(x_i - mu_j)^T / sum_{i=1}^n q_{ij}.
The key ingredient here is the use of Jensen's inequality, but we will not have the time to present the details in this course.
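To make the E- and M-steps above concrete, here is a minimal EM sketch for a mixture of Gaussians (not from the notes; it assumes numpy and scipy, uses full covariances, and omits safeguards against degenerate clusters):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    """X: (n, p) data. Returns mixture weights pi, means mu, covariances sigma."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    pi = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, size=k, replace=False)]           # (k, p)
    sigma = np.stack([np.eye(p) for _ in range(k)])        # (k, p, p)
    for _ in range(n_iter):
        # E-step: q_ij = pi_j N(x_i | mu_j, Sigma_j) / sum_l pi_l N(x_i | mu_l, Sigma_l)
        q = np.stack([pi[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
                      for j in range(k)], axis=1)          # (n, k)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters with the soft assignments
        nj = q.sum(axis=0)                                 # (k,)
        pi = nj / n
        mu = (q.T @ X) / nj[:, None]
        for j in range(k):
            d = X - mu[j]
            sigma[j] = (q[:, j, None] * d).T @ d / nj[j]
    return pi, mu, sigma

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4.0])
pi, mu, sigma = em_gmm(X, k=2)
```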