Kernel Methods. Bernhard Schölkopf, Max-Planck-Institut für biologische Kybernetik. Cambridge, 2009.

1 Kernel Methods Bernhard Schölkopf Max-Planck-Institut für biologische Kybernetik

2 Roadmap
- Similarity, kernels, feature spaces
- Positive definite kernels and their RKHS
- Kernel means, representer theorem
- Support Vector Machines

3 Learning and Similarity: Some Informal Thoughts
- input/output sets $X$, $Y$
- training set $(x_1, y_1), \ldots, (x_m, y_m) \in X \times Y$
- generalization: given a previously unseen $x \in X$, find a suitable $y \in Y$
- $(x, y)$ should be "similar" to $(x_1, y_1), \ldots, (x_m, y_m)$
- how to measure similarity?
  for outputs: loss function (e.g., for $Y = \{\pm 1\}$, the zero-one loss)
  for inputs: kernel

4 Similarity of Inputs
- symmetric function $k: X \times X \to \mathbb{R}$, $(x, x') \mapsto k(x, x')$
- for example, if $X = \mathbb{R}^N$: the canonical dot product $k(x, x') = \sum_{i=1}^N [x]_i [x']_i$
- if $X$ is not a dot product space: assume that $k$ has a representation as a dot product in a linear space $H$, i.e., there exists a map $\Phi: X \to H$ such that $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$.
- in that case, we can think of the patterns as $\Phi(x), \Phi(x')$, and carry out geometric algorithms in the dot product space ("feature space") $H$.

5 An Example of a Kernel Algorithm
Idea: classify points $x := \Phi(x)$ in feature space according to which of the two class means is closer.
$$c_+ := \frac{1}{m_+} \sum_{y_i = +1} \Phi(x_i), \qquad c_- := \frac{1}{m_-} \sum_{y_i = -1} \Phi(x_i)$$
[Figure: the two class means $c_+$ and $c_-$, their midpoint $c$, the vector $w = c_+ - c_-$, and a test point $x$ with $x - c$.]
Compute the sign of the dot product between $w := c_+ - c_-$ and $x - c$, where $c := (c_+ + c_-)/2$ is the midpoint between the class means.

6 An Example of a Kernel Algorithm, ctd. [32]
$$f(x) = \operatorname{sgn}\Big( \frac{1}{m_+} \sum_{\{i: y_i = +1\}} \langle \Phi(x), \Phi(x_i) \rangle - \frac{1}{m_-} \sum_{\{i: y_i = -1\}} \langle \Phi(x), \Phi(x_i) \rangle + b \Big)$$
$$\phantom{f(x)} = \operatorname{sgn}\Big( \frac{1}{m_+} \sum_{\{i: y_i = +1\}} k(x, x_i) - \frac{1}{m_-} \sum_{\{i: y_i = -1\}} k(x, x_i) + b \Big),$$
where
$$b = \frac{1}{2} \Big( \frac{1}{m_-^2} \sum_{\{(i,j): y_i = y_j = -1\}} k(x_i, x_j) - \frac{1}{m_+^2} \sum_{\{(i,j): y_i = y_j = +1\}} k(x_i, x_j) \Big).$$
- provides a geometric interpretation of Parzen windows
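
A minimal NumPy sketch of this mean-difference classifier (my own illustration, not the authors' code; the Gaussian kernel and the toy data are arbitrary choices):

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), for all pairs of rows
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def mean_classifier(X_train, y_train, X_test, sigma=1.0):
    """Classify by comparing distances to the two class means in feature space."""
    pos, neg = X_train[y_train == +1], X_train[y_train == -1]
    K_pos = gaussian_kernel(X_test, pos, sigma)   # k(x, x_i) with y_i = +1
    K_neg = gaussian_kernel(X_test, neg, sigma)   # k(x, x_i) with y_i = -1
    b = 0.5 * (gaussian_kernel(neg, neg, sigma).mean()
               - gaussian_kernel(pos, pos, sigma).mean())
    return np.sign(K_pos.mean(axis=1) - K_neg.mean(axis=1) + b)

# toy usage
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)
print(mean_classifier(X, y, X)[:5])
```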

7 An Example of a Kernel Algorithm, ctd.
Demo.
Exercise: derive the Parzen windows classifier by computing the distance criterion directly.

8 Statistical Learning Theory
1. started by Vapnik and Chervonenkis in the Sixties
2. model: we observe data generated by an unknown stochastic regularity
3. learning = extraction of the regularity from the data
4. the analysis of the learning problem leads to notions of capacity of the function classes that a learning machine can implement
5. support vector machines use a particular type of function class: classifiers with large margins in a feature space induced by a kernel [39, 40]

9 Kernels and Feature Spaces
Preprocess the data with
$$\Phi: X \to H, \quad x \mapsto \Phi(x),$$
where $H$ is a dot product space, and learn the mapping from $\Phi(x)$ to $y$ [6].
- usually, $\dim(X) \ll \dim(H)$
- Curse of Dimensionality?
- crucial issue: capacity, not dimensionality

10 Example: All Degree 2 Monomials
$$\Phi: \mathbb{R}^2 \to \mathbb{R}^3, \quad (x_1, x_2) \mapsto (z_1, z_2, z_3) := (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2)$$
[Figure: data in the $(x_1, x_2)$ input space and its image in the $(z_1, z_2, z_3)$ feature space.]

11 General Product Feature Space
How about patterns $x \in \mathbb{R}^N$ and product features of order $d$? Here, $\dim(H)$ grows like $N^d$.
E.g. $N = 16 \times 16$ pixels and $d = 5$: dimension $\approx 10^{10}$.

12 The Kernel Trick, $N = d = 2$
$$\langle \Phi(x), \Phi(x') \rangle = (x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2)\, (x_1'^2,\ \sqrt{2}\, x_1' x_2',\ x_2'^2)^\top = \langle x, x' \rangle^2 =: k(x, x')$$
The dot product in $H$ can be computed in $\mathbb{R}^2$.

13 The Kernel Trick, II
More generally: for $x, x' \in \mathbb{R}^N$ and $d \in \mathbb{N}$,
$$\langle x, x' \rangle^d = \Big( \sum_{j=1}^N x_j\, x_j' \Big)^d = \sum_{j_1, \ldots, j_d = 1}^N x_{j_1} \cdots x_{j_d}\; x'_{j_1} \cdots x'_{j_d} = \langle \Phi(x), \Phi(x') \rangle,$$
where $\Phi$ maps into the space spanned by all ordered products of $d$ input directions.
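
A quick numerical sanity check of this identity (my own illustration, not from the slides): compare the explicit ordered-monomial feature map of degree $d$ with the kernel evaluation.

```python
import itertools
import numpy as np

def monomial_features(x, d):
    # all ordered products x_{j1} * ... * x_{jd}, with j_k in {1, ..., N}
    return np.array([np.prod(x[list(idx)])
                     for idx in itertools.product(range(len(x)), repeat=d)])

rng = np.random.default_rng(0)
x, xp, d = rng.normal(size=4), rng.normal(size=4), 3
lhs = np.dot(x, xp) ** d                                         # kernel in R^N
rhs = np.dot(monomial_features(x, d), monomial_features(xp, d))  # dot product in H
print(np.isclose(lhs, rhs))  # True
```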

14 Mercer's Theorem
If $k$ is a continuous kernel of a positive definite integral operator on $L_2(X)$ (where $X$ is some compact space), i.e.
$$\int_{X \times X} k(x, x') f(x) f(x')\, dx\, dx' \ge 0,$$
it can be expanded as
$$k(x, x') = \sum_{i=1}^{\infty} \lambda_i\, \psi_i(x)\, \psi_i(x')$$
using eigenfunctions $\psi_i$ and eigenvalues $\lambda_i \ge 0$ [26].

15 The Mercer Feature Map
In that case
$$\Phi(x) := \big( \sqrt{\lambda_1}\, \psi_1(x),\ \sqrt{\lambda_2}\, \psi_2(x),\ \ldots \big)^\top$$
satisfies $\langle \Phi(x), \Phi(x') \rangle = k(x, x')$.
Proof:
$$\langle \Phi(x), \Phi(x') \rangle = \Big\langle \big( \sqrt{\lambda_1}\, \psi_1(x),\ \sqrt{\lambda_2}\, \psi_2(x),\ \ldots \big),\ \big( \sqrt{\lambda_1}\, \psi_1(x'),\ \sqrt{\lambda_2}\, \psi_2(x'),\ \ldots \big) \Big\rangle = \sum_{i=1}^{\infty} \lambda_i\, \psi_i(x)\, \psi_i(x') = k(x, x').$$

16 The Kernel Trick Summary
- any algorithm that only depends on dot products can benefit from the kernel trick
- this way, we can apply linear methods to vectorial as well as non-vectorial data
- think of the kernel as a nonlinear similarity measure
- examples of common kernels:
  Polynomial: $k(x, x') = (\langle x, x' \rangle + c)^d$
  Sigmoid: $k(x, x') = \tanh(\kappa \langle x, x' \rangle + \Theta)$
  Gaussian: $k(x, x') = \exp\big( -\|x - x'\|^2 / (2\sigma^2) \big)$
- kernels are also known as covariance functions [44, 41, 45, 25]
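
A minimal NumPy sketch of these three kernels, vectorized over two sets of points (my own illustration; parameter names are my own choices):

```python
import numpy as np

def polynomial_kernel(X, Y, c=1.0, d=3):
    # k(x, x') = (<x, x'> + c)^d
    return (X @ Y.T + c) ** d

def sigmoid_kernel(X, Y, kappa=1.0, theta=0.0):
    # k(x, x') = tanh(kappa <x, x'> + theta); not pd for all parameter choices
    return np.tanh(kappa * (X @ Y.T) + theta)

def gaussian_kernel(X, Y, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 1.0]])
print(polynomial_kernel(X, X), gaussian_kernel(X, X), sep="\n")
```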

17 Positive Definite Kernels
It can be shown that the admissible class of kernels coincides with the one of positive definite (pd) kernels: kernels which are symmetric (i.e., $k(x, x') = k(x', x)$), and which for any set of training points $x_1, \ldots, x_m \in X$ and any $a_1, \ldots, a_m \in \mathbb{R}$ satisfy
$$\sum_{i,j} a_i a_j K_{ij} \ge 0, \quad \text{where } K_{ij} := k(x_i, x_j).$$
$K$ is called the Gram matrix or kernel matrix.
If for pairwise distinct points, $\sum_{i,j} a_i a_j K_{ij} = 0 \implies a = 0$, we call $k$ strictly positive definite.
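
A small empirical check of this condition (my own illustration, not from the slides): the Gram matrix of a Gaussian kernel on random points has nonnegative eigenvalues, up to numerical error.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# Gram matrix of the Gaussian kernel, K_ij = k(x_i, x_j)
sq = (X ** 2).sum(1)[:, None] + (X ** 2).sum(1)[None, :] - 2 * X @ X.T
K = np.exp(-sq / 2.0)

eigvals = np.linalg.eigvalsh(K)          # K is symmetric
print(eigvals.min() >= -1e-10)           # True: a^T K a >= 0 for all a
```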

18 Elementary Properties of PD Kernels
- Kernels from Feature Maps. If $\Phi$ maps $X$ into a dot product space $H$, then $\langle \Phi(x), \Phi(x') \rangle$ is a pd kernel on $X \times X$.
- Positivity on the Diagonal. $k(x, x) \ge 0$ for all $x \in X$.
- Cauchy-Schwarz Inequality. $k(x, x')^2 \le k(x, x)\, k(x', x')$ (Hint: compute the determinant of the Gram matrix).
- Vanishing Diagonals. $k(x, x) = 0$ for all $x \in X$ $\implies$ $k(x, x') = 0$ for all $x, x' \in X$.

19 The Feature Space for PD Kernels [5, 2, 29]
Define a feature map
$$\Phi: X \to \mathbb{R}^X, \quad x \mapsto k(\cdot, x).$$
[Figure: for the Gaussian kernel, $\Phi(x)$ and $\Phi(x')$ are Gaussian bumps centred at $x$ and $x'$.]
Next steps:
- turn $\Phi(X)$ into a linear space
- endow it with a dot product satisfying $\langle \Phi(x), \Phi(x') \rangle = k(x, x')$, i.e., $\langle k(\cdot, x), k(\cdot, x') \rangle = k(x, x')$
- complete the space to get a reproducing kernel Hilbert space

20 Turn it Into a Linear Space
Form linear combinations
$$f(\cdot) = \sum_{i=1}^m \alpha_i\, k(\cdot, x_i), \qquad g(\cdot) = \sum_{j=1}^{m'} \beta_j\, k(\cdot, x'_j)$$
($m, m' \in \mathbb{N}$, $\alpha_i, \beta_j \in \mathbb{R}$, $x_i, x'_j \in X$).

21 Endow it With a Dot Product
$$\langle f, g \rangle := \sum_{i=1}^m \sum_{j=1}^{m'} \alpha_i \beta_j\, k(x_i, x'_j) = \sum_{i=1}^m \alpha_i\, g(x_i) = \sum_{j=1}^{m'} \beta_j\, f(x'_j)$$
- This is well-defined, symmetric, and bilinear (more later).
- So far, it also works for non-pd kernels.
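
A small numerical illustration of this dot product (my own sketch; the Gaussian kernel is chosen for concreteness): for $f = \sum_i \alpha_i k(\cdot, x_i)$ and $g = \sum_j \beta_j k(\cdot, x'_j)$, the three expressions above agree.

```python
import numpy as np

def k(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(1)
X1, alpha = rng.normal(size=(4, 2)), rng.normal(size=4)   # expansion points of f
X2, beta  = rng.normal(size=(6, 2)), rng.normal(size=6)   # expansion points of g

f = lambda t: sum(a * k(xi, t) for a, xi in zip(alpha, X1))
g = lambda t: sum(b * k(xj, t) for b, xj in zip(beta, X2))

dot_fg = sum(a * b * k(xi, xj) for a, xi in zip(alpha, X1)
                               for b, xj in zip(beta, X2))
print(np.isclose(dot_fg, sum(a * g(xi) for a, xi in zip(alpha, X1))))  # True
print(np.isclose(dot_fg, sum(b * f(xj) for b, xj in zip(beta, X2))))   # True
```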

22 The Reproducing Kernel Property
Two special cases:
- Assume $f(\cdot) = k(\cdot, x)$. In this case, we have $\langle k(\cdot, x), g \rangle = g(x)$.
- If moreover $g(\cdot) = k(\cdot, x')$, we have $\langle k(\cdot, x), k(\cdot, x') \rangle = k(x, x')$.
$k$ is called a reproducing kernel (up to here, we have not used positive definiteness).

23 Endow it With a Dot Product, II
It can be shown that $\langle \cdot, \cdot \rangle$ is a p.d. kernel on the set of functions $\{ f(\cdot) = \sum_{i=1}^m \alpha_i k(\cdot, x_i) \mid \alpha_i \in \mathbb{R},\, x_i \in X \}$:
$$\sum_{ij} \gamma_i \gamma_j \langle f_i, f_j \rangle = \Big\langle \sum_i \gamma_i f_i,\ \sum_j \gamma_j f_j \Big\rangle =: \langle f, f \rangle = \Big\langle \sum_i \alpha_i k(\cdot, x_i),\ \sum_i \alpha_i k(\cdot, x_i) \Big\rangle = \sum_{ij} \alpha_i \alpha_j\, k(x_i, x_j) \ge 0.$$
Furthermore, it is strictly positive definite:
$$|f(x)|^2 = |\langle f, k(\cdot, x) \rangle|^2 \le \langle f, f \rangle\, \langle k(\cdot, x), k(\cdot, x) \rangle = \langle f, f \rangle\, k(x, x),$$
hence $\langle f, f \rangle = 0$ implies $f = 0$.
Complete the space in the corresponding norm to get a Hilbert space $H_k$.

24 Explicit Construction of the RKHS Map for Mercer Kernels
Recall that the dot product has to satisfy $\langle k(x, \cdot), k(x', \cdot) \rangle = k(x, x')$.
For a Mercer kernel
$$k(x, x') = \sum_{j=1}^{N_F} \lambda_j\, \psi_j(x)\, \psi_j(x')$$
(with $\lambda_i > 0$ for all $i$, $N_F \in \mathbb{N} \cup \{\infty\}$, and $\langle \psi_i, \psi_j \rangle_{L_2(X)} = \delta_{ij}$), this can be achieved by choosing $\langle \cdot, \cdot \rangle$ such that $\langle \psi_i, \psi_j \rangle = \delta_{ij} / \lambda_i$.

25 ctd.
To see this, compute
$$\langle k(x, \cdot), k(x', \cdot) \rangle = \Big\langle \sum_i \lambda_i \psi_i(x)\, \psi_i,\ \sum_j \lambda_j \psi_j(x')\, \psi_j \Big\rangle = \sum_{i,j} \lambda_i \lambda_j\, \psi_i(x)\, \psi_j(x')\, \langle \psi_i, \psi_j \rangle$$
$$= \sum_{i,j} \lambda_i \lambda_j\, \psi_i(x)\, \psi_j(x')\, \delta_{ij} / \lambda_i = \sum_i \lambda_i\, \psi_i(x)\, \psi_i(x') = k(x, x').$$

26 Deriving the Kernel from the RKHS
An RKHS is a Hilbert space $H$ of functions $f$ where all point evaluation functionals
$$p_x: H \to \mathbb{R}, \quad f \mapsto p_x(f) = f(x)$$
exist and are continuous. Continuity means that whenever $f$ and $f'$ are close in $H$, then $f(x)$ and $f'(x)$ are close in $\mathbb{R}$. This can be thought of as a topological prerequisite for generalization ability.
By the Riesz representation theorem, there exists an element of $H$, call it $r_x$, such that $\langle r_x, f \rangle = f(x)$; in particular, $\langle r_x, r_{x'} \rangle = r_{x'}(x)$.
Define $k(x, x') := r_x(x') = r_{x'}(x)$. (cf. Canu & Mary, 2002)

27 The Empirical Kernel Map
Recall the feature map $\Phi: X \to \mathbb{R}^X$, $x \mapsto k(\cdot, x)$.
- each point is represented by its similarity to all other points
- how about representing it by its similarity to a sample of points?
Consider
$$\Phi_m: X \to \mathbb{R}^m, \quad x \mapsto k(\cdot, x)\big|_{(x_1, \ldots, x_m)} = \big( k(x_1, x), \ldots, k(x_m, x) \big).$$

28 ctd.
- $\Phi_m(x_1), \ldots, \Phi_m(x_m)$ contain all necessary information about $\Phi(x_1), \ldots, \Phi(x_m)$
- the Gram matrix $G_{ij} := \langle \Phi_m(x_i), \Phi_m(x_j) \rangle$ satisfies $G = K^2$, where $K_{ij} = k(x_i, x_j)$
- modify $\Phi_m$ to
$$\Phi_m^w: X \to \mathbb{R}^m, \quad x \mapsto K^{-\frac{1}{2}} \big( k(x_1, x), \ldots, k(x_m, x) \big)$$
- this whitened map ("kernel PCA map") satisfies $\langle \Phi_m^w(x_i), \Phi_m^w(x_j) \rangle = k(x_i, x_j)$ for all $i, j = 1, \ldots, m$.
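
A NumPy sketch of the empirical kernel map and its whitened version (my own illustration; it assumes $K$ is nonsingular so that $K^{-1/2}$ exists):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 5))

sq = (X ** 2).sum(1)[:, None] + (X ** 2).sum(1)[None, :] - 2 * X @ X.T
K = np.exp(-sq / 2.0)                     # Gaussian-kernel Gram matrix

# empirical kernel map: Phi_m(x_i) is the i-th column of K
G = K @ K                                 # <Phi_m(x_i), Phi_m(x_j)> = (K^2)_ij

# whitened map: Phi_m^w(x) = K^{-1/2} (k(x_1, x), ..., k(x_m, x))
w, V = np.linalg.eigh(K)
w = np.clip(w, 1e-12, None)               # guard against roundoff
K_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
Phi_w = K_inv_sqrt @ K                    # columns are Phi_m^w(x_i)
print(np.allclose(Phi_w.T @ Phi_w, K))    # inner products reproduce K
```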

29 Some Properties of Kernels [32, 34]
If $k_1, k_2, \ldots$ are pd kernels, then so are
- $\alpha k_1$, provided $\alpha \ge 0$
- $k_1 + k_2$
- $k_1 \cdot k_2$
- $k(x, x') := \lim_{n \to \infty} k_n(x, x')$, provided it exists
- $k(A, B) := \sum_{x \in A,\, x' \in B} k_1(x, x')$, where $A, B$ are finite subsets of $X$ (using the feature map $\Phi(A) := \sum_{x \in A} \Phi(x)$)
Further operations to construct kernels from kernels: tensor products, direct sums, convolutions [19].

30 Properties of Kernel Matrices, I [30]
Suppose we are given distinct training patterns $x_1, \ldots, x_m$, and a positive definite $m \times m$ matrix $K$.
$K$ can be diagonalized as $K = S D S^\top$, with an orthogonal matrix $S$ and a diagonal matrix $D$ with nonnegative entries. Then
$$K_{ij} = (S D S^\top)_{ij} = \langle S_i, D S_j \rangle = \langle \sqrt{D}\, S_i, \sqrt{D}\, S_j \rangle,$$
where the $S_i$ are the rows of $S$.
We have thus constructed a map $\Phi$ into an $m$-dimensional feature space $H$ such that $K_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle$.
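
A short NumPy check of this construction (my own illustration, using a Gaussian-kernel Gram matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
sq = (X ** 2).sum(1)[:, None] + (X ** 2).sum(1)[None, :] - 2 * X @ X.T
K = np.exp(-sq / 2.0)                     # a pd Gram matrix

D, S = np.linalg.eigh(K)                  # K = S diag(D) S^T, S orthogonal
Phi = S * np.sqrt(np.clip(D, 0, None))    # row i is Phi(x_i) = sqrt(D) S_i
print(np.allclose(Phi @ Phi.T, K))        # K_ij = <Phi(x_i), Phi(x_j)>
```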

31 Properties, II: Functional Calculus [33]
- $K$ symmetric $m \times m$ matrix with spectrum $\sigma(K)$
- $f$ a continuous function on $\sigma(K)$
- then there is a symmetric matrix $f(K)$ with eigenvalues in $f(\sigma(K))$
- compute $f(K)$ via a Taylor series, or via the eigenvalue decomposition of $K$: if $K = S^\top D S$ ($D$ diagonal and $S$ unitary), then $f(K) = S^\top f(D)\, S$, where $f(D)$ is defined elementwise on the diagonal
- can treat functions of symmetric matrices like functions on $\mathbb{R}$:
  $(\alpha f + g)(K) = \alpha f(K) + g(K)$
  $(f \cdot g)(K) = f(K)\, g(K) = g(K)\, f(K)$
  $\|f\|_{\infty, \sigma(K)} = \|f(K)\|$
  $\sigma(f(K)) = f(\sigma(K))$
- (the $C^*$-algebra generated by $K$ is isomorphic to the set of continuous functions on $\sigma(K)$)

32 An Example of a Kernel Algorithm, Revisited
[Figure: the two samples, their RKHS means $\mu(X)$ and $\mu(Y)$, and the difference direction $w$.]
- $\mathcal{X}$ compact subset of a separable metric space, $m, n \in \mathbb{N}$
- Positive class $X := \{x_1, \ldots, x_m\} \subset \mathcal{X}$
- Negative class $Y := \{y_1, \ldots, y_n\} \subset \mathcal{X}$
- RKHS means $\mu(X) = \frac{1}{m} \sum_{i=1}^m k(x_i, \cdot)$, $\mu(Y) = \frac{1}{n} \sum_{i=1}^n k(y_i, \cdot)$
We get a problem if $\mu(X) = \mu(Y)$!

33 When Do the Means Coincide?
- $k(x, x') = \langle x, x' \rangle$: the means coincide
- $k(x, x') = (\langle x, x' \rangle + 1)^d$: all empirical moments up to order $d$ coincide
- $k$ strictly pd: $X = Y$
The mean "remembers" each point that contributed to it.

34 Proposition 1
Assume $X$, $Y$ are defined as above, $k$ is strictly pd, and for all $i, j$, $x_i \neq x_j$ and $y_i \neq y_j$. If for some $\alpha_i, \beta_j \in \mathbb{R} \setminus \{0\}$ we have
$$\sum_{i=1}^m \alpha_i\, k(x_i, \cdot) = \sum_{j=1}^n \beta_j\, k(y_j, \cdot), \qquad (1)$$
then $X = Y$.

35 Exercise: generalize to the case of nonsingular kernels (i.e., kernels leading to nonsingular Gram matrices for pairwise distinct points).
Proof (by contradiction). W.l.o.g., assume that $x_1 \notin Y$. Subtract $\sum_{j=1}^n \beta_j k(y_j, \cdot)$ from (1), and make it a sum over pairwise distinct points, to get
$$0 = \sum_i \gamma_i\, k(z_i, \cdot),$$
where $z_1 = x_1$, $\gamma_1 = \alpha_1 \neq 0$, and $z_2, \ldots \in X \cup Y \setminus \{x_1\}$, $\gamma_2, \ldots \in \mathbb{R}$.
Take the RKHS dot product with $\sum_j \gamma_j k(z_j, \cdot)$ to get
$$0 = \sum_{ij} \gamma_i \gamma_j\, k(z_i, z_j),$$
with $\gamma \neq 0$, hence $k$ cannot be strictly pd.

36 Generalization
We will prove a more general statement, without assuming positive definiteness.
Definition 2. We call a kernel $k: X^2 \to \mathbb{R}$ nonsingular if for any $n \in \mathbb{N}$ and pairwise distinct $x_1, \ldots, x_n \in X$, the Gram matrix $(k(x_i, x_j))_{ij}$ is nonsingular.
Note that strictly positive definite kernels are nonsingular: if the matrix $K$ is singular, then there exists a $\beta \neq 0$ such that $K\beta = 0$, hence $\beta^\top K \beta = 0$, hence $k$ is not strictly positive definite.
Proposition 3. Assume $X$, $Y$ are defined as above, $k$ is nonsingular, and for all $i, j$, $x_i \neq x_j$ and $y_i \neq y_j$. If for some $\alpha_i, \beta_j \in \mathbb{R} \setminus \{0\}$ we have
$$\sum_{i=1}^m \alpha_i\, k(x_i, \cdot) = \sum_{j=1}^n \beta_j\, k(y_j, \cdot), \qquad (2)$$
then $X = Y$.
Proof (by contradiction). W.l.o.g., assume that $x_1 \notin Y$. Subtract $\sum_{j=1}^n \beta_j k(y_j, \cdot)$ from (2), and make it a sum over pairwise distinct points, to get
$$0 = \sum_i \gamma_i\, k(z_i, \cdot),$$
where $z_1 = x_1$, $\gamma_1 = \alpha_1 \neq 0$, and $z_2, \ldots \in X \cup Y \setminus \{x_1\}$, $\gamma_2, \ldots \in \mathbb{R}$.
Similar to the pd case, $k$ induces a linear space with a bilinear form satisfying the reproducing kernel property. Take the bilinear form between $\sum_j \lambda_j k(z_j, \cdot)$ and the above, to get
$$0 = \sum_{ij} \lambda_j \gamma_i\, k(z_j, z_i) = \lambda^\top K \gamma,$$
where $\lambda$ is arbitrary. Hence $K\gamma = 0$. However, $\gamma \neq 0$, hence $K$ is singular. Since the $z_i$ are pairwise distinct, $k$ cannot be nonsingular.

37 The Mean Map
The mean map
$$\mu: X = (x_1, \ldots, x_m) \mapsto \frac{1}{m} \sum_{i=1}^m k(x_i, \cdot)$$
satisfies
$$\langle \mu(X), f \rangle = \frac{1}{m} \sum_{i=1}^m \langle k(x_i, \cdot), f \rangle = \frac{1}{m} \sum_{i=1}^m f(x_i)$$
and
$$\|\mu(X) - \mu(Y)\| = \sup_{\|f\| \le 1} \langle \mu(X) - \mu(Y), f \rangle = \sup_{\|f\| \le 1} \Big( \frac{1}{m} \sum_{i=1}^m f(x_i) - \frac{1}{n} \sum_{i=1}^n f(y_i) \Big).$$
Note: distance in the RKHS = solution of a high-dimensional optimization problem.

38 Witness Function
Take
$$f = \frac{\mu(X) - \mu(Y)}{\|\mu(X) - \mu(Y)\|}, \quad \text{thus} \quad f(x) = \frac{\langle \mu(X) - \mu(Y),\, k(x, \cdot) \rangle}{\|\mu(X) - \mu(Y)\|}.$$
[Figure: witness $f$ for Gauss and Laplace data; probability densities and $f$ plotted over $X$.]
This function is in the RKHS of a Gaussian kernel, but not in the RKHS of the linear kernel.

39 The Mean Map for Measures
Let $p, q$ be Borel probability measures with $\mathbb{E}_{x,x' \sim p}[k(x, x')],\ \mathbb{E}_{x,x' \sim q}[k(x, x')] < \infty$ ($\|k(x, \cdot)\| \le M < \infty$ is sufficient).
Define
$$\mu: p \mapsto \mathbb{E}_{x \sim p}[k(x, \cdot)].$$
Note
$$\langle \mu(p), f \rangle = \mathbb{E}_{x \sim p}[f(x)]$$
and
$$\|\mu(p) - \mu(q)\| = \sup_{\|f\| \le 1} \big| \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x \sim q}[f(x)] \big|.$$
Recall that in the finite sample case, for strictly p.d. kernels, $\mu$ was injective. How about now?

40 Theorem 4 [13, 10]
$$p = q \iff \sup_{f \in C(\mathcal{X})} \big| \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x \sim q}[f(x)] \big| = 0,$$
where $C(\mathcal{X})$ is the space of continuous bounded functions on $\mathcal{X}$.
Replace $C(\mathcal{X})$ by the unit ball in an RKHS that is dense in $C(\mathcal{X})$: "universal" kernel [38], e.g., Gaussian.
Theorem 5 [16]. If $k$ is universal, then
$$p = q \iff \|\mu(p) - \mu(q)\| = 0.$$

41 $\mu$ is invertible on its image $M = \{\mu(p) \mid p \text{ is a probability distribution}\}$ (the "marginal polytope", [42]).
This is a generalization of the moment generating function of a random variable $x$ with distribution $p$:
$$M_p(\cdot) = \mathbb{E}_{x \sim p}\big[ e^{\langle x, \cdot \rangle} \big].$$

42 Uniform Convergence Bounds
Let $X$ be an i.i.d. $m$-sample from $p$. The discrepancy
$$\|\mu(p) - \mu(X)\| = \sup_{\|f\| \le 1} \Big( \mathbb{E}_{x \sim p}[f(x)] - \frac{1}{m} \sum_{i=1}^m f(x_i) \Big)$$
can be bounded using uniform convergence methods [37].

43 Application 1: Two-Sample Problem [16]
$X$, $Y$ i.i.d. $m$-samples from $p$, $q$, respectively.
$$\|\mu(p) - \mu(q)\|^2 = \mathbb{E}_{x,x' \sim p}[k(x, x')] - 2\, \mathbb{E}_{x \sim p,\, y \sim q}[k(x, y)] + \mathbb{E}_{y,y' \sim q}[k(y, y')] = \mathbb{E}_{x,x' \sim p,\, y,y' \sim q}\big[ h\big((x, y), (x', y')\big) \big]$$
with
$$h\big((x, y), (x', y')\big) := k(x, x') - k(x, y') - k(y, x') + k(y, y').$$
Define
$$D(p, q)^2 := \mathbb{E}_{x,x' \sim p,\, y,y' \sim q}\, h\big((x, y), (x', y')\big), \qquad \hat{D}(X, Y)^2 := \frac{1}{m(m-1)} \sum_{i \neq j} h\big((x_i, y_i), (x_j, y_j)\big).$$
$\hat{D}(X, Y)^2$ is an unbiased estimator of $D(p, q)^2$. It is easy to compute, and works on structured data.
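
A NumPy sketch of the unbiased estimator $\hat{D}(X, Y)^2$ (my own implementation of the formula above; the Gaussian kernel and the toy distributions are arbitrary choices):

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of D(p, q)^2 = ||mu(p) - mu(q)||^2 from equal-size samples."""
    m = len(X)
    Kxx, Kyy, Kxy = (gaussian_gram(X, X, sigma),
                     gaussian_gram(Y, Y, sigma),
                     gaussian_gram(X, Y, sigma))
    # h((x_i, y_i), (x_j, y_j)) summed over i != j
    H = Kxx + Kyy - Kxy - Kxy.T
    return (H.sum() - np.trace(H)) / (m * (m - 1))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))
Y = rng.normal(0.5, 1.0, size=(200, 1))
X2 = rng.normal(0.0, 1.0, size=(200, 1))        # second sample from p
print(mmd2_unbiased(X, X2), mmd2_unbiased(X, Y))  # near 0 vs. visibly larger
```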

44 Theorem 6. Assume $k$ is bounded. Then $\hat{D}(X, Y)^2$ converges to $D(p, q)^2$ in probability with rate $O(m^{-\frac{1}{2}})$.
This could be used as a basis for a test, but uniform convergence bounds are often loose.
Theorem 7. We assume $\mathbb{E}(h^2) < \infty$. When $p \neq q$, then $\sqrt{m}\, \big( \hat{D}(X, Y)^2 - D(p, q)^2 \big)$ converges in distribution to a zero mean Gaussian with variance
$$\sigma_u^2 = 4 \Big( \mathbb{E}_z \big[ (\mathbb{E}_{z'} h(z, z'))^2 \big] - \big[ \mathbb{E}_{z,z'} h(z, z') \big]^2 \Big).$$
When $p = q$, then $m \big( \hat{D}(X, Y)^2 - D(p, q)^2 \big) = m\, \hat{D}(X, Y)^2$ converges in distribution to
$$\sum_{l=1}^{\infty} \lambda_l \big[ q_l^2 - 2 \big], \qquad (3)$$
where $q_l \sim \mathcal{N}(0, 2)$ i.i.d., the $\lambda_i$ are the solutions to the eigenvalue equation
$$\int_{\mathcal{X}} \tilde{k}(x, x')\, \psi_i(x)\, dp(x) = \lambda_i\, \psi_i(x'),$$
and $\tilde{k}(x_i, x_j) := k(x_i, x_j) - \mathbb{E}_x k(x_i, x) - \mathbb{E}_x k(x, x_j) + \mathbb{E}_{x,x'} k(x, x')$ is the centred RKHS kernel.

45 Application 2: Dependence Measures
Assume that $(x, y)$ are drawn from $p_{xy}$, with marginals $p_x, p_y$. We want to know whether $p_{xy}$ factorizes.
- [3, 14]: kernel generalized variance
- [17, 18]: kernel constrained covariance, HSIC
Main idea [22, 28]: $x$ and $y$ are independent if and only if, for all bounded continuous functions $f, g$, we have $\mathrm{Cov}(f(x), g(y)) = 0$.

46 Let $k$ be a kernel on $\mathcal{X} \times \mathcal{Y}$. Define
$$\mu(p_{xy}) := \mathbb{E}_{(x,y) \sim p_{xy}}\big[ k((x, y), \cdot) \big], \qquad \mu(p_x \times p_y) := \mathbb{E}_{x \sim p_x,\, y \sim p_y}\big[ k((x, y), \cdot) \big].$$
Use $\Delta := \|\mu(p_{xy}) - \mu(p_x \times p_y)\|$ as a measure of dependence.
For $k((x, y), (x', y')) = k_x(x, x')\, k_y(y, y')$: $\Delta^2$ equals the Hilbert-Schmidt norm of the covariance operator between the two RKHSs (HSIC), with empirical estimate $m^{-2}\, \mathrm{tr}(H K_x H K_y)$, where $H = I - \frac{1}{m} \mathbf{1}\mathbf{1}^\top$ [17, 37].
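
A NumPy sketch of the empirical HSIC estimate $m^{-2}\,\mathrm{tr}(H K_x H K_y)$ (my own implementation; Gaussian kernels with unit width and the toy data are arbitrary choices):

```python
import numpy as np

def gaussian_gram(A, sigma=1.0):
    sq = (A ** 2).sum(1)[:, None] + (A ** 2).sum(1)[None, :] - 2 * A @ A.T
    return np.exp(-sq / (2 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    """Empirical HSIC: m^{-2} tr(H K_x H K_y), with H = I - (1/m) 1 1^T."""
    m = len(X)
    H = np.eye(m) - np.ones((m, m)) / m
    Kx, Ky = gaussian_gram(X, sigma), gaussian_gram(Y, sigma)
    return np.trace(H @ Kx @ H @ Ky) / m ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1))
print(hsic(x, rng.normal(size=(300, 1))))                 # independent: close to 0
print(hsic(x, x ** 2 + 0.1 * rng.normal(size=(300, 1))))  # dependent: larger
```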

47 Witness Function of the Equivalent Optimisation Problem
[Figure: dependence witness and sample, plotted over $X$ and $Y$.]
Application: learning causal structures (Sun, Janzing, Schölkopf, Fukumizu, ICML 2007; Fukumizu, Gretton, Sun, Schölkopf, NIPS 2007)

48 Application 3: Covariate Shift Correction and Local Learning
- training set $X = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ drawn from $p$,
- test set $X' = \{(x'_1, y'_1), \ldots, (x'_n, y'_n)\}$ drawn from $p' \neq p$.
- Assume $p_{y|x} = p'_{y|x}$.
- [35]: reweight training set

49 Minimize
$$\Big\| \sum_{i=1}^m \beta_i\, k(x_i, \cdot) - \mu(X') \Big\|^2 + \lambda \|\beta\|_2^2 \quad \text{subject to } \beta_i \ge 0,\ \sum_i \beta_i = 1.$$
Equivalent QP: minimize
$$\frac{1}{2}\, \beta^\top (K + \lambda I)\, \beta - \beta^\top l \quad \text{subject to } \beta_i \ge 0 \text{ and } \sum_i \beta_i = 1,$$
where $K_{ij} := k(x_i, x_j)$ and $l_i = \langle k(x_i, \cdot), \mu(X') \rangle$.
Experiments show that in underspecified situations (e.g., large kernel widths), this helps [21].
$X' = \{x'\}$ leads to a local sample weighting scheme.
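
A rough sketch of this reweighting (kernel mean matching) using SciPy's SLSQP solver; this is my own illustrative implementation of the QP above, not the authors' code, and in practice a dedicated QP solver would be used.

```python
import numpy as np
from scipy.optimize import minimize

def gaussian_gram(A, B, sigma=1.0):
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

def kmm_weights(X_train, X_test, lam=1e-3, sigma=1.0):
    """Minimize 0.5 b^T (K + lam I) b - b^T l  s.t.  b >= 0, sum(b) = 1."""
    m = len(X_train)
    K = gaussian_gram(X_train, X_train, sigma)
    l = gaussian_gram(X_train, X_test, sigma).mean(axis=1)  # l_i = <k(x_i,.), mu(X')>
    obj = lambda b: 0.5 * b @ (K + lam * np.eye(m)) @ b - b @ l
    res = minimize(obj, np.full(m, 1.0 / m), method="SLSQP",
                   bounds=[(0, None)] * m,
                   constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1}])
    return res.x

rng = np.random.default_rng(0)
beta = kmm_weights(rng.normal(0, 1, (50, 1)), rng.normal(1, 1, (30, 1)))
print(beta.sum(), beta.min() >= -1e-8)
```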

50 Application 4: Measure Estimation and Dataset Squashing [9, 4, 1, 37]
Given a sample $X$, minimize $\|\mu(X) - \mu(p)\|^2$ over a convex combination of measures $p_i$, i.e., $p = \sum_i \alpha_i p_i$ with $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$.
This can be written as a convex QP with objective function
$$\|\mu(X) - \mu(p)\|^2 = \alpha^\top Q \alpha + \mathbf{1}_m^\top K \mathbf{1}_m - 2\, \alpha^\top L \mathbf{1}_m,$$
where
$$L_{ij} := \mathbb{E}_{x \sim p_i}\big[ k(x, x_j) \big], \quad Q_{ij} := \mathbb{E}_{x \sim p_i,\, x' \sim p_j}\big[ k(x, x') \big], \quad K_{ij} = k(x_i, x_j), \quad \mathbf{1}_m := (1/m, \ldots, 1/m)^\top \in \mathbb{R}^m.$$

51 In practice, use
$$\alpha^\top [Q + \lambda I]\, \alpha - 2\, \alpha^\top L \mathbf{1}_m.$$
Some cases where $Q$ and $L$ can be computed in closed form [37]:
- Gaussian $p_i$ and $k$ (cf. [4, 43])
- $X$ training set, Dirac measures $p_i = \delta_{x_i}$: dataset squashing [11]
- $X$ test set, Dirac measures $p_i = \delta_{y_i}$ centred on the training points $Y$: covariate shift correction [20]

52 The Representer Theorem
Theorem 8. Given: a p.d. kernel $k$ on $X \times X$, a training set $(x_1, y_1), \ldots, (x_m, y_m) \in X \times \mathbb{R}$, a strictly monotonically increasing real-valued function $\Omega$ on $[0, \infty[$, and an arbitrary cost function $c: (X \times \mathbb{R}^2)^m \to \mathbb{R} \cup \{\infty\}$.
Any $f \in H$ minimizing the regularized risk functional
$$c\big( (x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m)) \big) + \Omega(\|f\|) \qquad (4)$$
admits a representation of the form $f(\cdot) = \sum_{i=1}^m \alpha_i\, k(x_i, \cdot)$.
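
As an illustration of such an expansion (not part of the slides): kernel ridge regression uses the squared loss with $\Omega(\|f\|) = \lambda \|f\|^2$, and the minimizer is $f(\cdot) = \sum_i \alpha_i k(x_i, \cdot)$ with $\alpha = (K + \lambda m I)^{-1} y$. A minimal NumPy sketch:

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

def fit_kernel_ridge(X, y, lam=0.1, sigma=1.0):
    # minimize (1/m) sum_i (y_i - f(x_i))^2 + lam ||f||^2 over f in the RKHS;
    # by the representer theorem f(.) = sum_i alpha_i k(x_i, .), and
    # alpha = (K + lam * m * I)^{-1} y
    m = len(X)
    K = gaussian_gram(X, X, sigma)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

def predict(alpha, X_train, X_new, sigma=1.0):
    return gaussian_gram(X_new, X_train, sigma) @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)
alpha = fit_kernel_ridge(X, y, lam=0.01)
print(predict(alpha, X, np.array([[0.0], [1.5]])))  # roughly sin(0), sin(1.5)
```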

53 Remarks
- significance: many learning algorithms have solutions that can be expressed as expansions in terms of the training examples
- original form, with mean squared loss
$$c\big( (x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m)) \big) = \frac{1}{m} \sum_{i=1}^m (y_i - f(x_i))^2$$
and $\Omega(\|f\|) = \lambda \|f\|^2$ ($\lambda > 0$): [24]
- generalization to non-quadratic cost functions: [8]
- present form: [32]

54 Proof
Decompose $f \in H$ into a part in the span of the $k(x_i, \cdot)$ and an orthogonal one:
$$f = \sum_i \alpha_i\, k(x_i, \cdot) + f_\perp, \quad \text{where for all } j,\ \langle f_\perp, k(x_j, \cdot) \rangle = 0.$$
Application of $f$ to an arbitrary training point $x_j$ yields
$$f(x_j) = \langle f, k(x_j, \cdot) \rangle = \Big\langle \sum_i \alpha_i\, k(x_i, \cdot) + f_\perp,\ k(x_j, \cdot) \Big\rangle = \sum_i \alpha_i\, \langle k(x_i, \cdot), k(x_j, \cdot) \rangle,$$
independent of $f_\perp$.

55 Proof: Second Part of (4)
Since $f_\perp$ is orthogonal to $\sum_i \alpha_i k(x_i, \cdot)$, and $\Omega$ is strictly monotonic, we get
$$\Omega(\|f\|) = \Omega\Big( \Big\| \sum_i \alpha_i k(x_i, \cdot) + f_\perp \Big\| \Big) = \Omega\Big( \sqrt{ \Big\| \sum_i \alpha_i k(x_i, \cdot) \Big\|^2 + \|f_\perp\|^2 } \Big) \ge \Omega\Big( \Big\| \sum_i \alpha_i k(x_i, \cdot) \Big\| \Big), \qquad (5)$$
with equality occurring if and only if $f_\perp = 0$.
Hence, any minimizer must have $f_\perp = 0$. Consequently, any solution takes the form $f = \sum_i \alpha_i k(x_i, \cdot)$.

56 Application: Support Vector Classification
Here, $y_i \in \{\pm 1\}$. Use
$$c\big( (x_i, y_i, f(x_i))_i \big) = \frac{1}{\lambda} \sum_i \max\big( 0,\ 1 - y_i f(x_i) \big)$$
and the regularizer $\Omega(\|f\|) = \|f\|^2$.
$\lambda \to 0$ leads to the hard margin SVM.
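
A rough sketch (my own, not from the slides) of minimizing this regularized hinge loss directly over the coefficients $\alpha$ of $f(\cdot) = \sum_j \alpha_j k(x_j, \cdot)$ by subgradient descent; in practice one would solve the dual QP instead.

```python
import numpy as np

def gaussian_gram(A, sigma=1.0):
    sq = (A ** 2).sum(1)[:, None] + (A ** 2).sum(1)[None, :] - 2 * A @ A.T
    return np.exp(-sq / (2 * sigma ** 2))

def train_kernel_svm(X, y, lam=0.1, sigma=1.0, steps=1000, lr=1e-3):
    """Subgradient descent on (1/lam) sum_i max(0, 1 - y_i f(x_i)) + ||f||^2,
    with f(.) = sum_j alpha_j k(x_j, .), so that ||f||^2 = alpha^T K alpha."""
    m = len(X)
    K = gaussian_gram(X, sigma)
    alpha = np.zeros(m)
    for _ in range(steps):
        margins = y * (K @ alpha)
        active = margins < 1                     # points violating the margin
        grad = 2 * K @ alpha - (1 / lam) * K[:, active] @ y[active]
        alpha -= lr * grad                       # fixed step size: enough for a toy problem
    return alpha

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(+1, 1, (30, 2))])
y = np.array([-1] * 30 + [+1] * 30)
alpha = train_kernel_svm(X, y)
print((np.sign(gaussian_gram(X) @ alpha) == y).mean())   # training accuracy
```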

57 Further Applications
Bayesian MAP Estimates. Identify (4) with the negative log posterior (cf. Kimeldorf & Wahba, 1970, Poggio & Girosi, 1990), i.e.
- $\exp(-c((x_i, y_i, f(x_i))_i))$: likelihood of the data
- $\exp(-\Omega(\|f\|))$: prior over the set of functions; e.g., $\Omega(\|f\|) = \lambda \|f\|^2$ corresponds to a Gaussian process prior [45] with covariance function $k$
- minimizer of (4) = MAP estimate
Kernel PCA (see below) can be shown to correspond to the case of
$$c\big( (x_i, y_i, f(x_i))_{i=1,\ldots,m} \big) = \begin{cases} 0 & \text{if } \frac{1}{m} \sum_i \big( f(x_i) - \frac{1}{m} \sum_j f(x_j) \big)^2 = 1 \\ \infty & \text{otherwise,} \end{cases}$$
with the regularizer $\Omega(\|f\|) = g(\|f\|)$ for $g$ an arbitrary strictly monotonically increasing function.

58 The Pre-Image Problem
Due to the representer theorem, the solution of kernel algorithms usually corresponds to a single vector in $H$,
$$w = \sum_{i=1}^m \alpha_i\, \Phi(x_i).$$
However, there is usually no $x \in X$ such that $\Phi(x) = w$; i.e., $\Phi(X)$ is not closed under linear combinations, but forms a nonlinear manifold (cf. [7, 31]).

59 Conclusion (so far)
- the kernel corresponds to a similarity measure for the data, or a (linear) representation of the data, or a hypothesis space for learning
- kernels allow the formulation of a multitude of geometrical algorithms (Parzen windows, 2-sample tests, SVMs, kernel PCA, ...)

60 References
[1] Y. Altun and A.J. Smola. Unifying divergence minimization and statistical inference via convex duality. In H.U. Simon and G. Lugosi, editors, Proc. Annual Conf. Computational Learning Theory, LNCS. Springer.
[2] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68.
[3] F. R. Bach and M. I. Jordan. Kernel independent component analysis. J. Mach. Learn. Res., 3:1-48.
[4] N. Balakrishnan and D. Schonfeld. A maximum entropy kernel density estimator with applications to function interpolation and texture segmentation. In SPIE Proceedings of Electronic Imaging: Science and Technology, Conference on Computational Imaging IV, San Jose, CA.
[5] C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer-Verlag, New York.
[6] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, Pittsburgh, PA, July. ACM Press.
[7] C. J. C. Burges. Geometry and invariance in kernel based methods. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge, MA.
[8] D. Cox and F. O'Sullivan. Asymptotic analysis of penalized likelihood and related estimators. Annals of Statistics, 18.
[9] M. Dudík, S. Phillips, and R.E. Schapire. Performance guarantees for regularized maximum entropy density estimation. In Proc. Annual Conf. Computational Learning Theory. Springer Verlag.
[10] R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK.
[11] W. DuMouchel, C. Volinsky, C. Cortes, D. Pregibon, and T. Johnson. Squashing flat files flatter. In International Conference on Knowledge Discovery and Data Mining (KDD), 1999.

61 [12] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. In A. J. Smola, P. L. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers. MIT Press, Cambridge, MA.
[13] R. Fortet and E. Mourier. Convergence de la répartition empirique vers la répartition théorique. Ann. Scient. École Norm. Sup., 70.
[14] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. J. Mach. Learn. Res., 5:73-99.
[15] F. Girosi. An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6).
[16] A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19. The MIT Press, Cambridge, MA.
[17] A. Gretton, O. Bousquet, A.J. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. U. Simon, and E. Tomita, editors, Proceedings Algorithmic Learning Theory, pages 63-77, Berlin, Germany. Springer-Verlag.
[18] A. Gretton, R. Herbrich, A. Smola, O. Bousquet, and B. Schölkopf. Kernel methods for measuring independence. J. Mach. Learn. Res., 6.
[19] D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, University of California at Santa Cruz.
[20] J. Huang, A. Smola, A. Gretton, K. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems 19. MIT Press, Cambridge, MA.
[21] J. Huang, A.J. Smola, A. Gretton, K. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. In B. Schölkopf, J. Platt, and T. Hofmann, editors, Advances in Neural Information Processing Systems 19. The MIT Press, Cambridge, MA.
[22] J. Jacod and P. Protter. Probability Essentials. Springer, New York.
[23] G. S. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41, 1970.

62 [24] G. S. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. J. Math. Anal. Applic., 33:82-95.
[25] D. J. C. MacKay. Introduction to Gaussian processes. In C. M. Bishop, editor, Neural Networks and Machine Learning. Springer-Verlag, Berlin.
[26] J. Mercer. Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. Roy. Soc. London, A 209.
[27] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9), September.
[28] A. Rényi. On measures of dependence. Acta Math. Acad. Sci. Hungar., 10.
[29] S. Saitoh. Theory of Reproducing Kernels and its Applications. Longman Scientific & Technical, Harlow, England.
[30] B. Schölkopf. Support Vector Learning. R. Oldenbourg Verlag, München. Doktorarbeit, TU Berlin.
[31] B. Schölkopf, S. Mika, C. Burges, P. Knirsch, K.-R. Müller, G. Rätsch, and A. Smola. Input space vs. feature space in kernel-based methods. IEEE Transactions on Neural Networks, 10(5).
[32] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA.
[33] B. Schölkopf, J. Weston, E. Eskin, C. Leslie, and W. S. Noble. A kernel approach for learning from almost orthogonal patterns. In T. Elomaa, H. Mannila, and H. Toivonen, editors, 13th European Conference on Machine Learning (ECML 2002) and 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2002), Helsinki, volume 2430/2431 of Lecture Notes in Computer Science, Berlin. Springer.
[34] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK.
[35] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference, 90.
[36] A. Smola, B. Schölkopf, and K.-R. Müller. The connection between regularization operators and support vector kernels. Neural Networks, 11.
[37] A. J. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Proc. Intl. Conf. Algorithmic Learning Theory, volume 4754 of LNAI. Springer, 2007.

63 [38] I. Steinwart. On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res., 2:67-93.
[39] V. Vapnik. The Nature of Statistical Learning Theory. Springer, NY.
[40] V. Vapnik. Statistical Learning Theory. Wiley, NY.
[41] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia.
[42] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Technical Report 649, UC Berkeley, Department of Statistics, September.
[43] C. Walder, K. Kim, and B. Schölkopf. Sparse multiscale Gaussian process regression. Technical Report 162, Max-Planck-Institut für biologische Kybernetik.
[44] H. L. Weinert. Reproducing Kernel Hilbert Spaces. Hutchinson Ross, Stroudsburg, PA.
[45] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan, editor, Learning and Inference in Graphical Models. Kluwer, 1998.

64 Regularization Interpretation of Kernel Machines
The norm in $H$ can be interpreted as a regularization term (Girosi 1998, Smola et al., 1998, Evgeniou et al., 2000): if $P$ is a regularization operator (mapping into a dot product space $D$) such that $k$ is the Green's function of $P^* P$, then
$$\|w\| = \|P f\|,$$
where
$$w = \sum_{i=1}^m \alpha_i\, \Phi(x_i) \quad \text{and} \quad f(x) = \sum_i \alpha_i\, k(x_i, x).$$
Example: for the Gaussian kernel, $P$ is a linear combination of differential operators.

65
$$\|w\|^2 = \sum_{i,j} \alpha_i \alpha_j\, k(x_i, x_j) = \sum_{i,j} \alpha_i \alpha_j\, \langle k(x_i, \cdot), \delta_{x_j}(\cdot) \rangle = \sum_{i,j} \alpha_i \alpha_j\, \langle k(x_i, \cdot), (P^* P k)(x_j, \cdot) \rangle$$
$$= \sum_{i,j} \alpha_i \alpha_j\, \langle (P k)(x_i, \cdot), (P k)(x_j, \cdot) \rangle_D = \Big\langle P \sum_i \alpha_i k(x_i, \cdot),\ P \sum_j \alpha_j k(x_j, \cdot) \Big\rangle_D = \|P f\|_D^2,$$
using $f(x) = \sum_i \alpha_i\, k(x_i, x)$.
