Kernel Methods. Bernhard Schölkopf, Max-Planck-Institut für biologische Kybernetik. Cambridge, 2009.

1 Kernel Methods Bernhard Schölkopf Max-Planck-Institut für biologische Kybernetik

2 Roadmap
- Similarity, kernels, feature spaces
- Positive definite kernels and their RKHS
- Kernel means, representer theorem
- Support Vector Machines

3 Learning and Similarity: Some Informal Thoughts
- input/output sets $X$, $Y$
- training set $(x_1, y_1), \ldots, (x_m, y_m) \in X \times Y$
- generalization: given a previously unseen $x \in X$, find a suitable $y \in Y$
- $(x, y)$ should be "similar" to $(x_1, y_1), \ldots, (x_m, y_m)$
- how to measure similarity?
  for outputs: loss function (e.g., for $Y = \{\pm 1\}$, the zero-one loss)
  for inputs: kernel

4 Similarity of Inputs
- symmetric function $k: X \times X \to \mathbb{R}$, $(x, x') \mapsto k(x, x')$
- for example, if $X = \mathbb{R}^N$: the canonical dot product $k(x, x') = \sum_{i=1}^N [x]_i [x']_i$
- if $X$ is not a dot product space: assume that $k$ has a representation as a dot product in a linear space $H$, i.e., there exists a map $\Phi: X \to H$ such that $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$.
- in that case, we can think of the patterns as $\Phi(x), \Phi(x')$, and carry out geometric algorithms in the dot product space ("feature space") $H$.

5 An Example of a Kernel Algorithm
Idea: classify points $x := \Phi(x)$ in feature space according to which of the two class means is closer.
$$c_+ := \frac{1}{m_+} \sum_{y_i = +1} \Phi(x_i), \qquad c_- := \frac{1}{m_-} \sum_{y_i = -1} \Phi(x_i)$$
[Figure: the two class means $c_+$ and $c_-$, their midpoint $c$, the vector $w = c_+ - c_-$, and a test point $x$ with $x - c$.]
Compute the sign of the dot product between $w := c_+ - c_-$ and $x - c$, where $c := (c_+ + c_-)/2$ is the midpoint between the class means.

6 An Example of a Kernel Algorithm, ctd. [32]
$$f(x) = \operatorname{sgn}\Big( \frac{1}{m_+} \sum_{\{i: y_i = +1\}} \langle \Phi(x), \Phi(x_i) \rangle - \frac{1}{m_-} \sum_{\{i: y_i = -1\}} \langle \Phi(x), \Phi(x_i) \rangle + b \Big)$$
$$\phantom{f(x)} = \operatorname{sgn}\Big( \frac{1}{m_+} \sum_{\{i: y_i = +1\}} k(x, x_i) - \frac{1}{m_-} \sum_{\{i: y_i = -1\}} k(x, x_i) + b \Big),$$
where
$$b = \frac{1}{2} \Big( \frac{1}{m_-^2} \sum_{\{(i,j): y_i = y_j = -1\}} k(x_i, x_j) - \frac{1}{m_+^2} \sum_{\{(i,j): y_i = y_j = +1\}} k(x_i, x_j) \Big).$$
- provides a geometric interpretation of Parzen windows
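
A minimal NumPy sketch of this mean-difference classifier (my own illustration, not the authors' code; the Gaussian kernel and the toy data are arbitrary choices):

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), for all pairs of rows
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def mean_classifier(X_train, y_train, X_test, sigma=1.0):
    """Classify by comparing distances to the two class means in feature space."""
    pos, neg = X_train[y_train == +1], X_train[y_train == -1]
    K_pos = gaussian_kernel(X_test, pos, sigma)   # k(x, x_i) with y_i = +1
    K_neg = gaussian_kernel(X_test, neg, sigma)   # k(x, x_i) with y_i = -1
    b = 0.5 * (gaussian_kernel(neg, neg, sigma).mean()
               - gaussian_kernel(pos, pos, sigma).mean())
    return np.sign(K_pos.mean(axis=1) - K_neg.mean(axis=1) + b)

# toy usage
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)
print(mean_classifier(X, y, X)[:5])
```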

7 An Example of a Kernel Algorithm, ctd.
Demo.
Exercise: derive the Parzen windows classifier by computing the distance criterion directly.

8 Statistical Learning Theory
1. started by Vapnik and Chervonenkis in the Sixties
2. model: we observe data generated by an unknown stochastic regularity
3. learning = extraction of the regularity from the data
4. the analysis of the learning problem leads to notions of capacity of the function classes that a learning machine can implement
5. support vector machines use a particular type of function class: classifiers with large margins in a feature space induced by a kernel [39, 40]

9 Kernels and Feature Spaces
Preprocess the data with
$$\Phi: X \to H, \quad x \mapsto \Phi(x),$$
where $H$ is a dot product space, and learn the mapping from $\Phi(x)$ to $y$ [6].
- usually, $\dim(X) \ll \dim(H)$
- Curse of Dimensionality?
- crucial issue: capacity, not dimensionality

10 Example: All Degree 2 Monomials
$$\Phi: \mathbb{R}^2 \to \mathbb{R}^3, \quad (x_1, x_2) \mapsto (z_1, z_2, z_3) := (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2)$$
[Figure: data in the $(x_1, x_2)$ input space and its image in the $(z_1, z_2, z_3)$ feature space.]

11 General Product Feature Space
How about patterns $x \in \mathbb{R}^N$ and product features of order $d$? Here, $\dim(H)$ grows like $N^d$.
E.g. $N = 16 \times 16$ pixels and $d = 5$: dimension $\approx 10^{10}$.

12 The Kernel Trick, $N = d = 2$
$$\langle \Phi(x), \Phi(x') \rangle = (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2)\, (x_1'^2,\ \sqrt{2}\, x_1' x_2',\ x_2'^2)^\top = \langle x, x' \rangle^2 =: k(x, x')$$
The dot product in $H$ can be computed in $\mathbb{R}^2$.

13 The Kernel Trick, II
More generally: for $x, x' \in \mathbb{R}^N$ and $d \in \mathbb{N}$,
$$\langle x, x' \rangle^d = \Big( \sum_{j=1}^N x_j\, x_j' \Big)^d = \sum_{j_1, \ldots, j_d = 1}^N x_{j_1} \cdots x_{j_d}\; x'_{j_1} \cdots x'_{j_d} = \langle \Phi(x), \Phi(x') \rangle,$$
where $\Phi$ maps into the space spanned by all ordered products of $d$ input directions.
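
A quick numerical sanity check of this identity (my own illustration, not from the slides): compare the explicit ordered-monomial feature map of degree $d$ with the kernel evaluation.

```python
import itertools
import numpy as np

def monomial_features(x, d):
    # all ordered products x_{j1} * ... * x_{jd}, with j_k in {1, ..., N}
    return np.array([np.prod(x[list(idx)])
                     for idx in itertools.product(range(len(x)), repeat=d)])

rng = np.random.default_rng(0)
x, xp, d = rng.normal(size=4), rng.normal(size=4), 3
lhs = np.dot(x, xp) ** d                                         # kernel in R^N
rhs = np.dot(monomial_features(x, d), monomial_features(xp, d))  # dot product in H
print(np.isclose(lhs, rhs))  # True
```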

14 Mercer's Theorem
If $k$ is a continuous kernel of a positive definite integral operator on $L_2(X)$ (where $X$ is some compact space), i.e.
$$\int_{X \times X} k(x, x') f(x) f(x')\, dx\, dx' \ge 0,$$
it can be expanded as
$$k(x, x') = \sum_{i=1}^{\infty} \lambda_i\, \psi_i(x)\, \psi_i(x')$$
using eigenfunctions $\psi_i$ and eigenvalues $\lambda_i \ge 0$ [26].

15 The Mercer Feature Map
In that case
$$\Phi(x) := \big( \sqrt{\lambda_1}\, \psi_1(x),\ \sqrt{\lambda_2}\, \psi_2(x),\ \ldots \big)^\top$$
satisfies $\langle \Phi(x), \Phi(x') \rangle = k(x, x')$.
Proof:
$$\langle \Phi(x), \Phi(x') \rangle = \Big\langle \big( \sqrt{\lambda_1}\, \psi_1(x),\ \sqrt{\lambda_2}\, \psi_2(x),\ \ldots \big),\ \big( \sqrt{\lambda_1}\, \psi_1(x'),\ \sqrt{\lambda_2}\, \psi_2(x'),\ \ldots \big) \Big\rangle = \sum_{i=1}^{\infty} \lambda_i\, \psi_i(x)\, \psi_i(x') = k(x, x').$$

16 The Kernel Trick Summary
- any algorithm that only depends on dot products can benefit from the kernel trick
- this way, we can apply linear methods to vectorial as well as non-vectorial data
- think of the kernel as a nonlinear similarity measure
- examples of common kernels:
  Polynomial: $k(x, x') = (\langle x, x' \rangle + c)^d$
  Sigmoid: $k(x, x') = \tanh(\kappa \langle x, x' \rangle + \Theta)$
  Gaussian: $k(x, x') = \exp\big( -\|x - x'\|^2 / (2\sigma^2) \big)$
- kernels are also known as covariance functions [44, 41, 45, 25]
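
A minimal NumPy sketch of these three kernels, vectorized over two sets of points (my own illustration; parameter names are my own choices):

```python
import numpy as np

def polynomial_kernel(X, Y, c=1.0, d=3):
    # k(x, x') = (<x, x'> + c)^d
    return (X @ Y.T + c) ** d

def sigmoid_kernel(X, Y, kappa=1.0, theta=0.0):
    # k(x, x') = tanh(kappa <x, x'> + theta); not pd for all parameter choices
    return np.tanh(kappa * (X @ Y.T) + theta)

def gaussian_kernel(X, Y, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 1.0]])
print(polynomial_kernel(X, X), gaussian_kernel(X, X), sep="\n")
```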

17 Positive Definite Kernels
It can be shown that the admissible class of kernels coincides with the one of positive definite (pd) kernels: kernels which are symmetric (i.e., $k(x, x') = k(x', x)$), and which for any set of training points $x_1, \ldots, x_m \in X$ and any $a_1, \ldots, a_m \in \mathbb{R}$ satisfy
$$\sum_{i,j} a_i a_j K_{ij} \ge 0, \quad \text{where } K_{ij} := k(x_i, x_j).$$
$K$ is called the Gram matrix or kernel matrix.
If for pairwise distinct points, $\sum_{i,j} a_i a_j K_{ij} = 0 \implies a = 0$, we call $k$ strictly positive definite.
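
A small empirical check of this condition (my own illustration, not from the slides): the Gram matrix of a Gaussian kernel on random points has nonnegative eigenvalues, up to numerical error.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# Gram matrix of the Gaussian kernel, K_ij = k(x_i, x_j)
sq = (X ** 2).sum(1)[:, None] + (X ** 2).sum(1)[None, :] - 2 * X @ X.T
K = np.exp(-sq / 2.0)

eigvals = np.linalg.eigvalsh(K)          # K is symmetric
print(eigvals.min() >= -1e-10)           # True: a^T K a >= 0 for all a
```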

18 Elementary Properties of PD Kernels
- Kernels from Feature Maps. If $\Phi$ maps $X$ into a dot product space $H$, then $\langle \Phi(x), \Phi(x') \rangle$ is a pd kernel on $X \times X$.
- Positivity on the Diagonal. $k(x, x) \ge 0$ for all $x \in X$.
- Cauchy-Schwarz Inequality. $k(x, x')^2 \le k(x, x)\, k(x', x')$ (Hint: compute the determinant of the Gram matrix).
- Vanishing Diagonals. $k(x, x) = 0$ for all $x \in X$ $\implies$ $k(x, x') = 0$ for all $x, x' \in X$.

19 The Feature Space for PD Kernels [5, 2, 29]
Define a feature map
$$\Phi: X \to \mathbb{R}^X, \quad x \mapsto k(\cdot, x).$$
[Figure: for the Gaussian kernel, $\Phi(x)$ and $\Phi(x')$ are Gaussian bumps centred at $x$ and $x'$.]
Next steps:
- turn $\Phi(X)$ into a linear space
- endow it with a dot product satisfying $\langle \Phi(x), \Phi(x') \rangle = k(x, x')$, i.e., $\langle k(\cdot, x), k(\cdot, x') \rangle = k(x, x')$
- complete the space to get a reproducing kernel Hilbert space

20 Turn it Into a Linear Space
Form linear combinations
$$f(\cdot) = \sum_{i=1}^m \alpha_i\, k(\cdot, x_i), \qquad g(\cdot) = \sum_{j=1}^{m'} \beta_j\, k(\cdot, x'_j)$$
($m, m' \in \mathbb{N}$, $\alpha_i, \beta_j \in \mathbb{R}$, $x_i, x'_j \in X$).

21 Endow it With a Dot Product
$$\langle f, g \rangle := \sum_{i=1}^m \sum_{j=1}^{m'} \alpha_i \beta_j\, k(x_i, x'_j) = \sum_{i=1}^m \alpha_i\, g(x_i) = \sum_{j=1}^{m'} \beta_j\, f(x'_j)$$
- This is well-defined, symmetric, and bilinear (more later).
- So far, it also works for non-pd kernels.
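
A small numerical illustration of this dot product (my own sketch; the Gaussian kernel is chosen for concreteness): for $f = \sum_i \alpha_i k(\cdot, x_i)$ and $g = \sum_j \beta_j k(\cdot, x'_j)$, the three expressions above agree.

```python
import numpy as np

def k(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(1)
X1, alpha = rng.normal(size=(4, 2)), rng.normal(size=4)   # expansion points of f
X2, beta  = rng.normal(size=(6, 2)), rng.normal(size=6)   # expansion points of g

f = lambda t: sum(a * k(xi, t) for a, xi in zip(alpha, X1))
g = lambda t: sum(b * k(xj, t) for b, xj in zip(beta, X2))

dot_fg = sum(a * b * k(xi, xj) for a, xi in zip(alpha, X1)
                               for b, xj in zip(beta, X2))
print(np.isclose(dot_fg, sum(a * g(xi) for a, xi in zip(alpha, X1))))  # True
print(np.isclose(dot_fg, sum(b * f(xj) for b, xj in zip(beta, X2))))   # True
```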

22 The Reproducing Kernel Property
Two special cases:
- Assume $f(\cdot) = k(\cdot, x)$. In this case, we have $\langle k(\cdot, x), g \rangle = g(x)$.
- If moreover $g(\cdot) = k(\cdot, x')$, we have $\langle k(\cdot, x), k(\cdot, x') \rangle = k(x, x')$.
$k$ is called a reproducing kernel (up to here, we have not used positive definiteness).

23 Endow it With a Dot Product, II
It can be shown that $\langle \cdot, \cdot \rangle$ is a p.d. kernel on the set of functions $\{ f(\cdot) = \sum_{i=1}^m \alpha_i k(\cdot, x_i) \mid \alpha_i \in \mathbb{R},\, x_i \in X \}$:
$$\sum_{ij} \gamma_i \gamma_j \langle f_i, f_j \rangle = \Big\langle \sum_i \gamma_i f_i,\ \sum_j \gamma_j f_j \Big\rangle =: \langle f, f \rangle = \Big\langle \sum_i \alpha_i k(\cdot, x_i),\ \sum_i \alpha_i k(\cdot, x_i) \Big\rangle = \sum_{ij} \alpha_i \alpha_j\, k(x_i, x_j) \ge 0.$$
Furthermore, it is strictly positive definite:
$$|f(x)|^2 = |\langle f, k(\cdot, x) \rangle|^2 \le \langle f, f \rangle\, \langle k(\cdot, x), k(\cdot, x) \rangle = \langle f, f \rangle\, k(x, x),$$
hence $\langle f, f \rangle = 0$ implies $f = 0$.
Complete the space in the corresponding norm to get a Hilbert space $H_k$.

24 Explicit Construction of the RKHS Map for Mercer Kernels
Recall that the dot product has to satisfy $\langle k(x, \cdot), k(x', \cdot) \rangle = k(x, x')$.
For a Mercer kernel
$$k(x, x') = \sum_{j=1}^{N_F} \lambda_j\, \psi_j(x)\, \psi_j(x')$$
(with $\lambda_i > 0$ for all $i$, $N_F \in \mathbb{N} \cup \{\infty\}$, and $\langle \psi_i, \psi_j \rangle_{L_2(X)} = \delta_{ij}$), this can be achieved by choosing $\langle \cdot, \cdot \rangle$ such that $\langle \psi_i, \psi_j \rangle = \delta_{ij} / \lambda_i$.

25 ctd.
To see this, compute
$$\langle k(x, \cdot), k(x', \cdot) \rangle = \Big\langle \sum_i \lambda_i \psi_i(x)\, \psi_i,\ \sum_j \lambda_j \psi_j(x')\, \psi_j \Big\rangle = \sum_{i,j} \lambda_i \lambda_j\, \psi_i(x)\, \psi_j(x')\, \langle \psi_i, \psi_j \rangle$$
$$= \sum_{i,j} \lambda_i \lambda_j\, \psi_i(x)\, \psi_j(x')\, \delta_{ij} / \lambda_i = \sum_i \lambda_i\, \psi_i(x)\, \psi_i(x') = k(x, x').$$

26 Deriving the Kernel from the RKHS
An RKHS is a Hilbert space $H$ of functions $f$ where all point evaluation functionals
$$p_x: H \to \mathbb{R}, \quad f \mapsto p_x(f) = f(x)$$
exist and are continuous. Continuity means that whenever $f$ and $f'$ are close in $H$, then $f(x)$ and $f'(x)$ are close in $\mathbb{R}$. This can be thought of as a topological prerequisite for generalization ability.
By the Riesz representation theorem, there exists an element of $H$, call it $r_x$, such that $\langle r_x, f \rangle = f(x)$; in particular, $\langle r_x, r_{x'} \rangle = r_{x'}(x)$.
Define $k(x, x') := r_x(x') = r_{x'}(x)$. (cf. Canu & Mary, 2002)

27 The Empirical Kernel Map
Recall the feature map $\Phi: X \to \mathbb{R}^X$, $x \mapsto k(\cdot, x)$.
- each point is represented by its similarity to all other points
- how about representing it by its similarity to a sample of points?
Consider
$$\Phi_m: X \to \mathbb{R}^m, \quad x \mapsto k(\cdot, x)\big|_{(x_1, \ldots, x_m)} = \big( k(x_1, x), \ldots, k(x_m, x) \big).$$

28 ctd.
- $\Phi_m(x_1), \ldots, \Phi_m(x_m)$ contain all necessary information about $\Phi(x_1), \ldots, \Phi(x_m)$
- the Gram matrix $G_{ij} := \langle \Phi_m(x_i), \Phi_m(x_j) \rangle$ satisfies $G = K^2$, where $K_{ij} = k(x_i, x_j)$
- modify $\Phi_m$ to
$$\Phi_m^w: X \to \mathbb{R}^m, \quad x \mapsto K^{-\frac{1}{2}} \big( k(x_1, x), \ldots, k(x_m, x) \big)$$
- this whitened map ("kernel PCA map") satisfies $\langle \Phi_m^w(x_i), \Phi_m^w(x_j) \rangle = k(x_i, x_j)$ for all $i, j = 1, \ldots, m$.
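
A NumPy sketch of the empirical kernel map and its whitened version (my own illustration; it assumes $K$ is nonsingular so that $K^{-1/2}$ exists):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 5))

sq = (X ** 2).sum(1)[:, None] + (X ** 2).sum(1)[None, :] - 2 * X @ X.T
K = np.exp(-sq / 2.0)                     # Gaussian-kernel Gram matrix

# empirical kernel map: Phi_m(x_i) is the i-th column of K
G = K @ K                                 # <Phi_m(x_i), Phi_m(x_j)> = (K^2)_ij

# whitened map: Phi_m^w(x) = K^{-1/2} (k(x_1, x), ..., k(x_m, x))
w, V = np.linalg.eigh(K)
w = np.clip(w, 1e-12, None)               # guard against roundoff
K_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
Phi_w = K_inv_sqrt @ K                    # columns are Phi_m^w(x_i)
print(np.allclose(Phi_w.T @ Phi_w, K))    # inner products reproduce K
```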

29 Some Properties of Kernels [32, 34]
If $k_1, k_2, \ldots$ are pd kernels, then so are
- $\alpha k_1$, provided $\alpha \ge 0$
- $k_1 + k_2$
- $k_1 \cdot k_2$
- $k(x, x') := \lim_{n \to \infty} k_n(x, x')$, provided it exists
- $k(A, B) := \sum_{x \in A,\, x' \in B} k_1(x, x')$, where $A, B$ are finite subsets of $X$ (using the feature map $\Phi(A) := \sum_{x \in A} \Phi(x)$)
Further operations to construct kernels from kernels: tensor products, direct sums, convolutions [19].

30 Properties of Kernel Matrices, I [30]
Suppose we are given distinct training patterns $x_1, \ldots, x_m$, and a positive definite $m \times m$ matrix $K$.
$K$ can be diagonalized as $K = S D S^\top$, with an orthogonal matrix $S$ and a diagonal matrix $D$ with nonnegative entries. Then
$$K_{ij} = (S D S^\top)_{ij} = \langle S_i, D S_j \rangle = \langle \sqrt{D}\, S_i, \sqrt{D}\, S_j \rangle,$$
where the $S_i$ are the rows of $S$.
We have thus constructed a map $\Phi$ into an $m$-dimensional feature space $H$ such that $K_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle$.
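
A short NumPy check of this construction (my own illustration, using a Gaussian-kernel Gram matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
sq = (X ** 2).sum(1)[:, None] + (X ** 2).sum(1)[None, :] - 2 * X @ X.T
K = np.exp(-sq / 2.0)                     # a pd Gram matrix

D, S = np.linalg.eigh(K)                  # K = S diag(D) S^T, S orthogonal
Phi = S * np.sqrt(np.clip(D, 0, None))    # row i is Phi(x_i) = sqrt(D) S_i
print(np.allclose(Phi @ Phi.T, K))        # K_ij = <Phi(x_i), Phi(x_j)>
```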

31 Properties, II: Functional Calculus [33]
- $K$ symmetric $m \times m$ matrix with spectrum $\sigma(K)$
- $f$ a continuous function on $\sigma(K)$
- then there is a symmetric matrix $f(K)$ with eigenvalues in $f(\sigma(K))$
- compute $f(K)$ via a Taylor series, or via the eigenvalue decomposition of $K$: if $K = S^\top D S$ ($D$ diagonal and $S$ unitary), then $f(K) = S^\top f(D)\, S$, where $f(D)$ is defined elementwise on the diagonal
- can treat functions of symmetric matrices like functions on $\mathbb{R}$:
  $(\alpha f + g)(K) = \alpha f(K) + g(K)$
  $(f \cdot g)(K) = f(K)\, g(K) = g(K)\, f(K)$
  $\|f\|_{\infty, \sigma(K)} = \|f(K)\|$
  $\sigma(f(K)) = f(\sigma(K))$
- (the $C^*$-algebra generated by $K$ is isomorphic to the set of continuous functions on $\sigma(K)$)

32 An Example of a Kernel Algorithm, Revisited
[Figure: the two samples, their RKHS means $\mu(X)$ and $\mu(Y)$, and the difference direction $w$.]
- $\mathcal{X}$ compact subset of a separable metric space, $m, n \in \mathbb{N}$
- Positive class $X := \{x_1, \ldots, x_m\} \subset \mathcal{X}$
- Negative class $Y := \{y_1, \ldots, y_n\} \subset \mathcal{X}$
- RKHS means $\mu(X) = \frac{1}{m} \sum_{i=1}^m k(x_i, \cdot)$, $\mu(Y) = \frac{1}{n} \sum_{i=1}^n k(y_i, \cdot)$
We get a problem if $\mu(X) = \mu(Y)$!

33 When Do the Means Coincide?
- $k(x, x') = \langle x, x' \rangle$: the means coincide
- $k(x, x') = (\langle x, x' \rangle + 1)^d$: all empirical moments up to order $d$ coincide
- $k$ strictly pd: $X = Y$
The mean "remembers" each point that contributed to it.

34 Proposition 1
Assume $X$, $Y$ are defined as above, $k$ is strictly pd, and for all $i, j$, $x_i \neq x_j$ and $y_i \neq y_j$. If for some $\alpha_i, \beta_j \in \mathbb{R} \setminus \{0\}$ we have
$$\sum_{i=1}^m \alpha_i\, k(x_i, \cdot) = \sum_{j=1}^n \beta_j\, k(y_j, \cdot), \qquad (1)$$
then $X = Y$.

35 Exercise: generalize to the case of nonsingular kernels (i.e., kernels leading to nonsingular Gram matrices for pairwise distinct points).
Proof (by contradiction). W.l.o.g., assume that $x_1 \notin Y$. Subtract $\sum_{j=1}^n \beta_j k(y_j, \cdot)$ from (1), and make it a sum over pairwise distinct points, to get
$$0 = \sum_i \gamma_i\, k(z_i, \cdot),$$
where $z_1 = x_1$, $\gamma_1 = \alpha_1 \neq 0$, and $z_2, \ldots \in X \cup Y \setminus \{x_1\}$, $\gamma_2, \ldots \in \mathbb{R}$.
Take the RKHS dot product with $\sum_j \gamma_j k(z_j, \cdot)$ to get
$$0 = \sum_{ij} \gamma_i \gamma_j\, k(z_i, z_j),$$
with $\gamma \neq 0$, hence $k$ cannot be strictly pd.

36 Generalization
We will prove a more general statement, without assuming positive definiteness.
Definition 2. We call a kernel $k: X^2 \to \mathbb{R}$ nonsingular if for any $n \in \mathbb{N}$ and pairwise distinct $x_1, \ldots, x_n \in X$, the Gram matrix $(k(x_i, x_j))_{ij}$ is nonsingular.
Note that strictly positive definite kernels are nonsingular: if the matrix $K$ is singular, then there exists a $\beta \neq 0$ such that $K\beta = 0$, hence $\beta^\top K \beta = 0$, hence $k$ is not strictly positive definite.
Proposition 3. Assume $X$, $Y$ are defined as above, $k$ is nonsingular, and for all $i, j$, $x_i \neq x_j$ and $y_i \neq y_j$. If for some $\alpha_i, \beta_j \in \mathbb{R} \setminus \{0\}$ we have
$$\sum_{i=1}^m \alpha_i\, k(x_i, \cdot) = \sum_{j=1}^n \beta_j\, k(y_j, \cdot), \qquad (2)$$
then $X = Y$.
Proof (by contradiction). W.l.o.g., assume that $x_1 \notin Y$. Subtract $\sum_{j=1}^n \beta_j k(y_j, \cdot)$ from (2), and make it a sum over pairwise distinct points, to get
$$0 = \sum_i \gamma_i\, k(z_i, \cdot),$$
where $z_1 = x_1$, $\gamma_1 = \alpha_1 \neq 0$, and $z_2, \ldots \in X \cup Y \setminus \{x_1\}$, $\gamma_2, \ldots \in \mathbb{R}$.
Similar to the pd case, $k$ induces a linear space with a bilinear form satisfying the reproducing kernel property. Take the bilinear form between $\sum_j \lambda_j k(z_j, \cdot)$ and the above, to get
$$0 = \sum_{ij} \lambda_j \gamma_i\, k(z_j, z_i) = \lambda^\top K \gamma,$$
where $\lambda$ is arbitrary. Hence $K\gamma = 0$. However, $\gamma \neq 0$, hence $K$ is singular. Since the $z_i$ are pairwise distinct, $k$ cannot be nonsingular.

37 The Mean Map
The mean map
$$\mu: X = (x_1, \ldots, x_m) \mapsto \frac{1}{m} \sum_{i=1}^m k(x_i, \cdot)$$
satisfies
$$\langle \mu(X), f \rangle = \frac{1}{m} \sum_{i=1}^m \langle k(x_i, \cdot), f \rangle = \frac{1}{m} \sum_{i=1}^m f(x_i)$$
and
$$\|\mu(X) - \mu(Y)\| = \sup_{\|f\| \le 1} \langle \mu(X) - \mu(Y), f \rangle = \sup_{\|f\| \le 1} \Big( \frac{1}{m} \sum_{i=1}^m f(x_i) - \frac{1}{n} \sum_{i=1}^n f(y_i) \Big).$$
Note: distance in the RKHS = solution of a high-dimensional optimization problem.

38 Witness Function
Take
$$f = \frac{\mu(X) - \mu(Y)}{\|\mu(X) - \mu(Y)\|}, \quad \text{thus} \quad f(x) = \frac{\langle \mu(X) - \mu(Y),\, k(x, \cdot) \rangle}{\|\mu(X) - \mu(Y)\|}.$$
[Figure: witness $f$ for Gauss and Laplace data; probability densities and $f$ plotted over $X$.]
This function is in the RKHS of a Gaussian kernel, but not in the RKHS of the linear kernel.

39 The Mean Map for Measures
Let $p, q$ be Borel probability measures with $\mathbb{E}_{x,x' \sim p}[k(x, x')],\ \mathbb{E}_{x,x' \sim q}[k(x, x')] < \infty$ ($\|k(x, \cdot)\| \le M < \infty$ is sufficient).
Define
$$\mu: p \mapsto \mathbb{E}_{x \sim p}[k(x, \cdot)].$$
Note
$$\langle \mu(p), f \rangle = \mathbb{E}_{x \sim p}[f(x)]$$
and
$$\|\mu(p) - \mu(q)\| = \sup_{\|f\| \le 1} \big| \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x \sim q}[f(x)] \big|.$$
Recall that in the finite sample case, for strictly p.d. kernels, $\mu$ was injective. How about now?

40 Theorem 4 [13, 10]
$$p = q \iff \sup_{f \in C(\mathcal{X})} \big| \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x \sim q}[f(x)] \big| = 0,$$
where $C(\mathcal{X})$ is the space of continuous bounded functions on $\mathcal{X}$.
Replace $C(\mathcal{X})$ by the unit ball in an RKHS that is dense in $C(\mathcal{X})$: "universal" kernel [38], e.g., Gaussian.
Theorem 5 [16]. If $k$ is universal, then
$$p = q \iff \|\mu(p) - \mu(q)\| = 0.$$

41 $\mu$ is invertible on its image $M = \{\mu(p) \mid p \text{ is a probability distribution}\}$ (the "marginal polytope", [42]).
This is a generalization of the moment generating function of a random variable $x$ with distribution $p$:
$$M_p(\cdot) = \mathbb{E}_{x \sim p}\big[ e^{\langle x, \cdot \rangle} \big].$$

42 Uniform Convergence Bounds
Let $X$ be an i.i.d. $m$-sample from $p$. The discrepancy
$$\|\mu(p) - \mu(X)\| = \sup_{\|f\| \le 1} \Big( \mathbb{E}_{x \sim p}[f(x)] - \frac{1}{m} \sum_{i=1}^m f(x_i) \Big)$$
can be bounded using uniform convergence methods [37].

43 Application 1: Two-Sample Problem [16]
$X$, $Y$ i.i.d. $m$-samples from $p$, $q$, respectively.
$$\|\mu(p) - \mu(q)\|^2 = \mathbb{E}_{x,x' \sim p}[k(x, x')] - 2\, \mathbb{E}_{x \sim p,\, y \sim q}[k(x, y)] + \mathbb{E}_{y,y' \sim q}[k(y, y')] = \mathbb{E}_{x,x' \sim p,\, y,y' \sim q}\big[ h\big((x, y), (x', y')\big) \big]$$
with
$$h\big((x, y), (x', y')\big) := k(x, x') - k(x, y') - k(y, x') + k(y, y').$$
Define
$$D(p, q)^2 := \mathbb{E}_{x,x' \sim p,\, y,y' \sim q}\, h\big((x, y), (x', y')\big), \qquad \hat{D}(X, Y)^2 := \frac{1}{m(m-1)} \sum_{i \neq j} h\big((x_i, y_i), (x_j, y_j)\big).$$
$\hat{D}(X, Y)^2$ is an unbiased estimator of $D(p, q)^2$. It is easy to compute, and works on structured data.
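
A NumPy sketch of the unbiased estimator $\hat{D}(X, Y)^2$ (my own implementation of the formula above; the Gaussian kernel and the toy distributions are arbitrary choices):

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of D(p, q)^2 = ||mu(p) - mu(q)||^2 from equal-size samples."""
    m = len(X)
    Kxx, Kyy, Kxy = (gaussian_gram(X, X, sigma),
                     gaussian_gram(Y, Y, sigma),
                     gaussian_gram(X, Y, sigma))
    # h((x_i, y_i), (x_j, y_j)) summed over i != j
    H = Kxx + Kyy - Kxy - Kxy.T
    return (H.sum() - np.trace(H)) / (m * (m - 1))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))
Y = rng.normal(0.5, 1.0, size=(200, 1))
X2 = rng.normal(0.0, 1.0, size=(200, 1))        # second sample from p
print(mmd2_unbiased(X, X2), mmd2_unbiased(X, Y))  # near 0 vs. visibly larger
```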

44 Theorem 6. Assume $k$ is bounded. Then $\hat{D}(X, Y)^2$ converges to $D(p, q)^2$ in probability with rate $O(m^{-\frac{1}{2}})$.
This could be used as a basis for a test, but uniform convergence bounds are often loose.
Theorem 7. We assume $\mathbb{E}(h^2) < \infty$. When $p \neq q$, then $\sqrt{m}\, \big( \hat{D}(X, Y)^2 - D(p, q)^2 \big)$ converges in distribution to a zero mean Gaussian with variance
$$\sigma_u^2 = 4 \Big( \mathbb{E}_z \big[ (\mathbb{E}_{z'} h(z, z'))^2 \big] - \big[ \mathbb{E}_{z,z'} h(z, z') \big]^2 \Big).$$
When $p = q$, then $m \big( \hat{D}(X, Y)^2 - D(p, q)^2 \big) = m\, \hat{D}(X, Y)^2$ converges in distribution to
$$\sum_{l=1}^{\infty} \lambda_l \big[ q_l^2 - 2 \big], \qquad (3)$$
where $q_l \sim \mathcal{N}(0, 2)$ i.i.d., the $\lambda_i$ are the solutions to the eigenvalue equation
$$\int_{\mathcal{X}} \tilde{k}(x, x')\, \psi_i(x)\, dp(x) = \lambda_i\, \psi_i(x'),$$
and $\tilde{k}(x_i, x_j) := k(x_i, x_j) - \mathbb{E}_x k(x_i, x) - \mathbb{E}_x k(x, x_j) + \mathbb{E}_{x,x'} k(x, x')$ is the centred RKHS kernel.

45 Application 2: Dependence Measures
Assume that $(x, y)$ are drawn from $p_{xy}$, with marginals $p_x, p_y$. We want to know whether $p_{xy}$ factorizes.
- [3, 14]: kernel generalized variance
- [17, 18]: kernel constrained covariance, HSIC
Main idea [22, 28]: $x$ and $y$ are independent if and only if, for all bounded continuous functions $f, g$, we have $\mathrm{Cov}(f(x), g(y)) = 0$.

46 Let $k$ be a kernel on $\mathcal{X} \times \mathcal{Y}$. Define
$$\mu(p_{xy}) := \mathbb{E}_{(x,y) \sim p_{xy}}\big[ k((x, y), \cdot) \big], \qquad \mu(p_x \times p_y) := \mathbb{E}_{x \sim p_x,\, y \sim p_y}\big[ k((x, y), \cdot) \big].$$
Use $\Delta := \|\mu(p_{xy}) - \mu(p_x \times p_y)\|$ as a measure of dependence.
For $k((x, y), (x', y')) = k_x(x, x')\, k_y(y, y')$: $\Delta^2$ equals the Hilbert-Schmidt norm of the covariance operator between the two RKHSs (HSIC), with empirical estimate $m^{-2}\, \mathrm{tr}(H K_x H K_y)$, where $H = I - \frac{1}{m} \mathbf{1}\mathbf{1}^\top$ [17, 37].
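
A NumPy sketch of the empirical HSIC estimate $m^{-2}\,\mathrm{tr}(H K_x H K_y)$ (my own implementation; Gaussian kernels with unit width and the toy data are arbitrary choices):

```python
import numpy as np

def gaussian_gram(A, sigma=1.0):
    sq = (A ** 2).sum(1)[:, None] + (A ** 2).sum(1)[None, :] - 2 * A @ A.T
    return np.exp(-sq / (2 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    """Empirical HSIC: m^{-2} tr(H K_x H K_y), with H = I - (1/m) 1 1^T."""
    m = len(X)
    H = np.eye(m) - np.ones((m, m)) / m
    Kx, Ky = gaussian_gram(X, sigma), gaussian_gram(Y, sigma)
    return np.trace(H @ Kx @ H @ Ky) / m ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1))
print(hsic(x, rng.normal(size=(300, 1))))                 # independent: close to 0
print(hsic(x, x ** 2 + 0.1 * rng.normal(size=(300, 1))))  # dependent: larger
```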

47 Witness Function of the Equivalent Optimisation Problem
[Figure: dependence witness and sample, plotted over $X$ and $Y$.]
Application: learning causal structures (Sun, Janzing, Schölkopf, Fukumizu, ICML 2007; Fukumizu, Gretton, Sun, Schölkopf, NIPS 2007)

48 Application 3: Covariate Shift Correction and Local Learning
- training set $X = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ drawn from $p$,
- test set $X' = \{(x'_1, y'_1), \ldots, (x'_n, y'_n)\}$ drawn from $p' \neq p$.
- Assume $p_{y|x} = p'_{y|x}$.
- [35]: reweight training set

49 Minimize
$$\Big\| \sum_{i=1}^m \beta_i\, k(x_i, \cdot) - \mu(X') \Big\|^2 + \lambda \|\beta\|_2^2 \quad \text{subject to } \beta_i \ge 0,\ \sum_i \beta_i = 1.$$
Equivalent QP: minimize
$$\frac{1}{2}\, \beta^\top (K + \lambda I)\, \beta - \beta^\top l \quad \text{subject to } \beta_i \ge 0 \text{ and } \sum_i \beta_i = 1,$$
where $K_{ij} := k(x_i, x_j)$ and $l_i = \langle k(x_i, \cdot), \mu(X') \rangle$.
Experiments show that in underspecified situations (e.g., large kernel widths), this helps [21].
$X' = \{x'\}$ leads to a local sample weighting scheme.
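
A rough sketch of this reweighting (kernel mean matching) using SciPy's SLSQP solver; this is my own illustrative implementation of the QP above, not the authors' code, and in practice a dedicated QP solver would be used.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_gram(A, B, sigma=1.0):
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

def kmm_weights(X_train, X_test, lam=1e-3, sigma=1.0):
    """Minimize 0.5 b^T (K + lam I) b - b^T l  s.t.  b >= 0, sum(b) = 1."""
    m = len(X_train)
    K = gaussian_gram(X_train, X_train, sigma)
    l = gaussian_gram(X_train, X_test, sigma).mean(axis=1)  # l_i = <k(x_i,.), mu(X')>
    obj = lambda b: 0.5 * b @ (K + lam * np.eye(m)) @ b - b @ l
    res = minimize(obj, np.full(m, 1.0 / m), method="SLSQP",
                   bounds=[(0, None)] * m,
                   constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1}])
    return res.x

rng = np.random.default_rng(0)
beta = kmm_weights(rng.normal(0, 1, (50, 1)), rng.normal(1, 1, (30, 1)))
print(beta.sum(), beta.min() >= -1e-8)
```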

50 Application 4: Measure Estimation and Dataset Squashing [9, 4, 1, 37]
Given a sample $X$, minimize $\|\mu(X) - \mu(p)\|^2$ over a convex combination of measures $p_i$, i.e., $p = \sum_i \alpha_i p_i$ with $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$.
This can be written as a convex QP with objective function
$$\|\mu(X) - \mu(p)\|^2 = \alpha^\top Q \alpha + \mathbf{1}_m^\top K \mathbf{1}_m - 2\, \alpha^\top L \mathbf{1}_m,$$
where
$$L_{ij} := \mathbb{E}_{x \sim p_i}\big[ k(x, x_j) \big], \quad Q_{ij} := \mathbb{E}_{x \sim p_i,\, x' \sim p_j}\big[ k(x, x') \big], \quad K_{ij} = k(x_i, x_j), \quad \mathbf{1}_m := (1/m, \ldots, 1/m)^\top \in \mathbb{R}^m.$$

51 In practice, use
$$\alpha^\top [Q + \lambda I]\, \alpha - 2\, \alpha^\top L \mathbf{1}_m.$$
Some cases where $Q$ and $L$ can be computed in closed form [37]:
- Gaussian $p_i$ and $k$ (cf. [4, 43])
- $X$ training set, Dirac measures $p_i = \delta_{x_i}$: dataset squashing [11]
- $X$ test set, Dirac measures $p_i = \delta_{y_i}$ centred on the training points $Y$: covariate shift correction [20]

52 The Representer Theorem
Theorem 8. Given: a p.d. kernel $k$ on $X \times X$, a training set $(x_1, y_1), \ldots, (x_m, y_m) \in X \times \mathbb{R}$, a strictly monotonically increasing real-valued function $\Omega$ on $[0, \infty[$, and an arbitrary cost function $c: (X \times \mathbb{R}^2)^m \to \mathbb{R} \cup \{\infty\}$.
Any $f \in H$ minimizing the regularized risk functional
$$c\big( (x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m)) \big) + \Omega(\|f\|) \qquad (4)$$
admits a representation of the form $f(\cdot) = \sum_{i=1}^m \alpha_i\, k(x_i, \cdot)$.
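
As an illustration of such an expansion (not part of the slides): kernel ridge regression uses the squared loss with $\Omega(\|f\|) = \lambda \|f\|^2$, and the minimizer is $f(\cdot) = \sum_i \alpha_i k(x_i, \cdot)$ with $\alpha = (K + \lambda m I)^{-1} y$. A minimal NumPy sketch:

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

def fit_kernel_ridge(X, y, lam=0.1, sigma=1.0):
    # minimize (1/m) sum_i (y_i - f(x_i))^2 + lam ||f||^2 over f in the RKHS;
    # by the representer theorem f(.) = sum_i alpha_i k(x_i, .), and
    # alpha = (K + lam * m * I)^{-1} y
    m = len(X)
    K = gaussian_gram(X, X, sigma)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

def predict(alpha, X_train, X_new, sigma=1.0):
    return gaussian_gram(X_new, X_train, sigma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)
alpha = fit_kernel_ridge(X, y, lam=0.01)
print(predict(alpha, X, np.array([[0.0], [1.5]])))  # roughly sin(0), sin(1.5)
```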

53 Remarks
- significance: many learning algorithms have solutions that can be expressed as expansions in terms of the training examples
- original form, with mean squared loss
$$c\big( (x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m)) \big) = \frac{1}{m} \sum_{i=1}^m (y_i - f(x_i))^2$$
and $\Omega(\|f\|) = \lambda \|f\|^2$ ($\lambda > 0$): [24]
- generalization to non-quadratic cost functions: [8]
- present form: [32]

54 Proof
Decompose $f \in H$ into a part in the span of the $k(x_i, \cdot)$ and an orthogonal one:
$$f = \sum_i \alpha_i\, k(x_i, \cdot) + f_\perp, \quad \text{where for all } j,\ \langle f_\perp, k(x_j, \cdot) \rangle = 0.$$
Application of $f$ to an arbitrary training point $x_j$ yields
$$f(x_j) = \langle f, k(x_j, \cdot) \rangle = \Big\langle \sum_i \alpha_i\, k(x_i, \cdot) + f_\perp,\ k(x_j, \cdot) \Big\rangle = \sum_i \alpha_i\, \langle k(x_i, \cdot), k(x_j, \cdot) \rangle,$$
independent of $f_\perp$.

55 Proof: Second Part of (4)
Since $f_\perp$ is orthogonal to $\sum_i \alpha_i k(x_i, \cdot)$, and $\Omega$ is strictly monotonic, we get
$$\Omega(\|f\|) = \Omega\Big( \Big\| \sum_i \alpha_i k(x_i, \cdot) + f_\perp \Big\| \Big) = \Omega\Big( \sqrt{ \Big\| \sum_i \alpha_i k(x_i, \cdot) \Big\|^2 + \|f_\perp\|^2 } \Big) \ge \Omega\Big( \Big\| \sum_i \alpha_i k(x_i, \cdot) \Big\| \Big), \qquad (5)$$
with equality occurring if and only if $f_\perp = 0$.
Hence, any minimizer must have $f_\perp = 0$. Consequently, any solution takes the form $f = \sum_i \alpha_i k(x_i, \cdot)$.

56 Application: Support Vector Classification
Here, $y_i \in \{\pm 1\}$. Use
$$c\big( (x_i, y_i, f(x_i))_i \big) = \frac{1}{\lambda} \sum_i \max\big( 0,\ 1 - y_i f(x_i) \big)$$
and the regularizer $\Omega(\|f\|) = \|f\|^2$.
$\lambda \to 0$ leads to the hard margin SVM.
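
A rough sketch (my own, not from the slides) of minimizing this regularized hinge loss directly over the coefficients $\alpha$ of $f(\cdot) = \sum_j \alpha_j k(x_j, \cdot)$ by subgradient descent; in practice one would solve the dual QP instead.

```python
import numpy as np

def gaussian_gram(A, sigma=1.0):
    sq = (A ** 2).sum(1)[:, None] + (A ** 2).sum(1)[None, :] - 2 * A @ A.T
    return np.exp(-sq / (2 * sigma ** 2))

def train_kernel_svm(X, y, lam=0.1, sigma=1.0, steps=1000, lr=1e-3):
    """Subgradient descent on (1/lam) sum_i max(0, 1 - y_i f(x_i)) + ||f||^2,
    with f(.) = sum_j alpha_j k(x_j, .), so that ||f||^2 = alpha^T K alpha."""
    m = len(X)
    K = gaussian_gram(X, sigma)
    alpha = np.zeros(m)
    for _ in range(steps):
        margins = y * (K @ alpha)
        active = margins < 1                     # points violating the margin
        grad = 2 * K @ alpha - (1 / lam) * K[:, active] @ y[active]
        alpha -= lr * grad                       # fixed step size: enough for a toy problem
    return alpha

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(+1, 1, (30, 2))])
y = np.array([-1] * 30 + [+1] * 30)
alpha = train_kernel_svm(X, y)
print((np.sign(gaussian_gram(X) @ alpha) == y).mean())   # training accuracy
```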

57 Further Applications
Bayesian MAP Estimates. Identify (4) with the negative log posterior (cf. Kimeldorf & Wahba, 1970, Poggio & Girosi, 1990), i.e.
- $\exp(-c((x_i, y_i, f(x_i))_i))$: likelihood of the data
- $\exp(-\Omega(\|f\|))$: prior over the set of functions; e.g., $\Omega(\|f\|) = \lambda \|f\|^2$ corresponds to a Gaussian process prior [45] with covariance function $k$
- minimizer of (4) = MAP estimate
Kernel PCA (see below) can be shown to correspond to the case of
$$c\big( (x_i, y_i, f(x_i))_{i=1,\ldots,m} \big) = \begin{cases} 0 & \text{if } \frac{1}{m} \sum_i \big( f(x_i) - \frac{1}{m} \sum_j f(x_j) \big)^2 = 1 \\ \infty & \text{otherwise,} \end{cases}$$
with the regularizer $\Omega(\|f\|) = g(\|f\|)$ for $g$ an arbitrary strictly monotonically increasing function.

58 The Pre-Image Problem
Due to the representer theorem, the solution of kernel algorithms usually corresponds to a single vector in $H$,
$$w = \sum_{i=1}^m \alpha_i\, \Phi(x_i).$$
However, there is usually no $x \in X$ such that $\Phi(x) = w$; i.e., $\Phi(X)$ is not closed under linear combinations, but forms a nonlinear manifold (cf. [7, 31]).

59 Conclusion (so far)
- the kernel corresponds to a similarity measure for the data, or a (linear) representation of the data, or a hypothesis space for learning
- kernels allow the formulation of a multitude of geometrical algorithms (Parzen windows, 2-sample tests, SVMs, kernel PCA, ...)

60 References
[1] Y. Altun and A.J. Smola. Unifying divergence minimization and statistical inference via convex duality. In H.U. Simon and G. Lugosi, editors, Proc. Annual Conf. Computational Learning Theory, LNCS. Springer.
[2] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68.
[3] F. R. Bach and M. I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1-48.
[4] N. Balakrishnan and D. Schonfeld. A maximum entropy kernel density estimator with applications to function interpolation and texture segmentation. In SPIE Proceedings of Electronic Imaging: Science and Technology, Conference on Computational Imaging IV, San Jose, CA.
[5] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York.
[6] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, Pittsburgh, PA, July. ACM Press.
[7] C. J. C. Burges. Geometry and invariance in kernel based methods. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA.
[8] D. Cox and F. O'Sullivan. Asymptotic analysis of penalized likelihood and related estimators. Annals of Statistics, 18.
[9] M. Dudík, S. Phillips, and R.E. Schapire. Performance guarantees for regularized maximum entropy density estimation. In Proc. Annual Conf. Computational Learning Theory. Springer Verlag.
[10] R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK.
[11] W. DuMouchel, C. Volinsky, C. Cortes, D. Pregibon, and T. Johnson. Squashing flat files flatter. In International Conference on Knowledge Discovery and Data Mining (KDD), 1999.

61 [12] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, Cambridge, MA.
[13] R. Fortet and E. Mourier. Convergence de la répartition empirique vers la répartition théorique. Ann. Scient. École Norm. Sup., 70.
[14] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res., 5:73-99.
[15] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6).
[16] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19. The MIT Press, Cambridge, MA.
[17] A. Gretton, O. Bousquet, A.J. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. U. Simon, and E. Tomita, editors, Proceedings Algorithmic Learning Theory, pages 63-77, Berlin, Germany. Springer-Verlag.
[18] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Schölkopf. Kernel methods for measuring independence. J. Mach. Learn. Res., 6.
[19] D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, University of California at Santa Cruz.
[20] J. Huang, A. Smola, A. Gretton, K. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA.
[21] J. Huang, A.J. Smola, A. Gretton, K. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19. The MIT Press, Cambridge, MA.
[22] J. Jacod and P. Protter. Probability Essentials. Springer, New York.
[23] G. S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41, 1970.

62 [24] G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33:82-95.
[25] D. J. C. MacKay. Introduction to Gaussian processes. In C. M. Bishop, editor, Neural Networks and Machine Learning. Springer-Verlag, Berlin.
[26] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A 209.
[27] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9), September.
[28] A. Rényi. On measures of dependence. Acta Math. Acad. Sci. Hungar., 10.
[29] S. Saitoh. Theory of Reproducing Kernels and its Applications. Longman Scientific & Technical, Harlow, England.
[30] B. Schölkopf. Support Vector Learning. R. Oldenbourg Verlag, München. Doktorarbeit, TU Berlin.
[31] B. Schölkopf, S. Mika, C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. Smola. Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5).
[32] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA.
[33] B. Schölkopf, J. Weston, E. Eskin, C. Leslie, and W. S. Noble. A kernel approach for learning from almost orthogonal patterns. In T. Elomaa, H. Mannila, and H. Toivonen, editors, 13th European Conference on Machine Learning (ECML 2002) and 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2002), Helsinki, volume 2430/2431 of Lecture Notes in Computer Science, Berlin. Springer.
[34] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK.
[35] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90.
[36] A. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11.
[37] A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proc. Intl. Conf. Algorithmic Learning Theory, volume 4754 of LNAI. Springer, 2007.

63 [38] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res., 2:67-93.
[39] V. Vapnik. The Nature of Statistical Learning Theory. Springer, NY.
[40] V. Vapnik. Statistical Learning Theory. Wiley, NY.
[41] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia.
[42] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical Report 649, UC Berkeley, Department of Statistics, September.
[43] C. Walder, K. Kim, and B. Schölkopf. Sparse multiscale Gaussian process regression. Technical Report 162, Max-Planck-Institut für biologische Kybernetik.
[44] H. L. Weinert. Reproducing Kernel Hilbert Spaces. Hutchinson Ross, Stroudsburg, PA.
[45] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan, editor, Learning and Inference in Graphical Models. Kluwer, 1998.

64 Regularization Interpretation of Kernel Machines
The norm in $H$ can be interpreted as a regularization term (Girosi 1998, Smola et al., 1998, Evgeniou et al., 2000): if $P$ is a regularization operator (mapping into a dot product space $D$) such that $k$ is the Green's function of $P^* P$, then
$$\|w\| = \|P f\|,$$
where
$$w = \sum_{i=1}^m \alpha_i\, \Phi(x_i) \quad \text{and} \quad f(x) = \sum_i \alpha_i\, k(x_i, x).$$
Example: for the Gaussian kernel, $P$ is a linear combination of differential operators.

65
$$\|w\|^2 = \sum_{i,j} \alpha_i \alpha_j\, k(x_i, x_j) = \sum_{i,j} \alpha_i \alpha_j\, \langle k(x_i, \cdot), \delta_{x_j}(\cdot) \rangle = \sum_{i,j} \alpha_i \alpha_j\, \langle k(x_i, \cdot), (P^* P k)(x_j, \cdot) \rangle$$
$$= \sum_{i,j} \alpha_i \alpha_j\, \langle (P k)(x_i, \cdot), (P k)(x_j, \cdot) \rangle_D = \Big\langle P \sum_i \alpha_i k(x_i, \cdot),\ P \sum_j \alpha_j k(x_j, \cdot) \Big\rangle_D = \|P f\|_D^2,$$
using $f(x) = \sum_i \alpha_i\, k(x_i, x)$.
