Kernel Methods
Lecture 4: Maximum Mean Discrepancy
Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton

Alexander J. Smola
Statistical Machine Learning Program, Canberra, ACT 0200, Australia
Machine Learning Summer School, Taiwan 2006

Course Overview
1 Estimation in exponential families: Maximum likelihood and priors; Clifford-Hammersley decomposition
2 Applications: Conditional distributions and kernels; Classification, regression, conditional random fields
3 Inference and convex duality: Maximum entropy inference; Approximate moment matching
4 Maximum mean discrepancy: Means in feature space; Covariate shift correction
5 Hilbert-Schmidt independence criterion: Covariance in feature space; ICA, feature selection

Outline
1 Two Sample Problem: Direct solution; Kolmogorov-Smirnov test; Reproducing kernel Hilbert spaces; Test statistics
2 Data Integration: Problem definition; Examples
3 Attribute Matching: Basic problem; Linear assignment problem
4 Sample Bias Correction: Sample reweighting; Quadratic program and consistency; Experiments

Two Sample Problem
Setting
Given $X := \{x_1, \ldots, x_m\} \sim p$ and $Y := \{y_1, \ldots, y_n\} \sim q$, test whether $p = q$.
Applications
Cross-platform compatibility of microarrays: need to know whether the distributions are the same.
Database schema matching: need to know which coordinates match.
Sample bias correction: need to know how to reweight the data.
Feature selection: need the features which make the distributions most different.
Parameter estimation: reduce a two-sample test to a one-sample test.

Indirect Solution
Plug-in estimators
Estimate $\hat p$ and $\hat q$ and compute $D(\hat p, \hat q)$.
Parzen windows and $L_2$ distance
$$\hat p(x) = \frac{1}{m} \sum_{i=1}^m \kappa(x_i, x) \quad \text{and} \quad \hat q(y) = \frac{1}{n} \sum_{i=1}^n \kappa(y_i, y)$$
Computing the squared $L_2$ distance between $\hat p$ and $\hat q$ yields
$$\|\hat p - \hat q\|_2^2 = \frac{1}{m^2} \sum_{i,j=1}^{m} k(x_i, x_j) - \frac{2}{mn} \sum_{i,j=1}^{m,n} k(x_i, y_j) + \frac{1}{n^2} \sum_{i,j=1}^{n} k(y_i, y_j),$$
where $k(x, x') = \int \kappa(x, t)\, \kappa(x', t)\, dt$.
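
The expression above can be evaluated directly from the two samples once the convolved kernel $k$ is known. As a minimal sketch (not from the slides), assume a normalized Gaussian Parzen window $\kappa$ with bandwidth $\sigma$ in one dimension; the convolution $k(x, x') = \int \kappa(x,t)\kappa(x',t)\,dt$ is then a Gaussian with variance $2\sigma^2$:

```python
import numpy as np

def parzen_l2_distance_sq(x, y, sigma=1.0):
    """Squared L2 distance between 1-D Parzen window estimates of p and q.

    Assumes a normalized Gaussian window kappa with bandwidth sigma, so the
    convolved kernel k(x, x') is a Gaussian density with variance 2*sigma**2.
    """
    def k(a, b):
        d = a[:, None] - b[None, :]
        var = 2.0 * sigma**2
        return np.exp(-d**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

    m, n = len(x), len(y)
    return (k(x, x).sum() / m**2
            - 2.0 * k(x, y).sum() / (m * n)
            + k(y, y).sum() / n**2)

x = np.random.randn(200)          # sample from p
y = np.random.randn(300) + 0.5    # sample from q
print(parzen_l2_distance_sq(x, y, sigma=0.5))
```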

Problems
Curse of dimensionality when estimating $\hat p$ and $\hat q$.
Statistical analysis is awkward (multi-stage procedure).
What to do on strings, images, structured data?
The quantity is biased: even for $p = q$ its expected value does not vanish.

Direct Solution
Key Idea
Avoid the density estimator; use means in function space directly.
Maximum Mean Discrepancy (Fortet and Mourier, 1953)
$$D(p, q, F) := \sup_{f \in F} \; E_p[f(x)] - E_q[f(y)]$$
Theorem (via Dudley, 1984)
$D(p, q, F) = 0$ iff $p = q$, when $F = C_0(X)$ is the space of continuous, bounded functions on $X$.
Theorem (via Steinwart, 2001; Smola et al., 2006)
$D(p, q, F) = 0$ iff $p = q$, when $F = \{f : \|f\|_H \le 1\}$ is the unit ball of a reproducing kernel Hilbert space $H$, provided that $H$ is universal.

Direct Solution
Proof.
If $p = q$ it is clear that $D(p, q, F) = 0$ for any $F$.
If $p \neq q$ there exists some $f^* \in C_0(X)$ such that $E_p[f^*] - E_q[f^*] = \epsilon > 0$.
Since $H$ is universal, we can find some $f \in H$ with $\|f - f^*\|_\infty \le \epsilon/2$.
Rescale $f$ to fit into the unit ball.
Goals
An empirical estimate of $D(p, q, F)$.
Convergence guarantees.

Kolmogorov-Smirnov Statistic
Function Class
Real-valued functions in one dimension, $X = \mathbb{R}$; $F$ is the set of all functions with total variation less than 1.
Key: $F$ is the absolute convex hull of $\xi_{(-\infty, t]}(x)$ for $t \in \mathbb{R}$.
Optimization Problem
$$\sup_{f \in F} E_p[f(x)] - E_q[f(y)] = \sup_{t \in \mathbb{R}} \left| E_p[\xi_{(-\infty,t]}(x)] - E_q[\xi_{(-\infty,t]}(y)] \right| = \|F_p - F_q\|_\infty$$
Estimation
Use empirical estimates of $F_p$ and $F_q$.
Use Glivenko-Cantelli to obtain the statistic.
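
In practice $\|F_p - F_q\|_\infty$ is simply the largest gap between the two empirical CDFs. A minimal sketch (mine, not from the slides) computing it directly, with scipy's ks_2samp as a cross-check:

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: sup_t |F_x(t) - F_y(t)|."""
    grid = np.sort(np.concatenate([x, y]))   # the sup is attained at data points
    F_x = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    F_y = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return np.abs(F_x - F_y).max()

x = np.random.randn(500)
y = np.random.laplace(size=500) / np.sqrt(2.0)   # Laplace scaled to unit variance
print(ks_statistic(x, y), ks_2samp(x, y).statistic)
```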

Hilbert Space Setting
Function Class
Reproducing kernel Hilbert space $H$ with kernel $k$; evaluation functionals $f(x) = \langle k(x, \cdot), f \rangle$.
Computing means via linearity
$$E_p[f(x)] = E_p[\langle k(x, \cdot), f \rangle] = \langle \underbrace{E_p[k(x, \cdot)]}_{=: \mu_p}, f \rangle$$
$$\frac{1}{m} \sum_{i=1}^m f(x_i) = \frac{1}{m} \sum_{i=1}^m \langle k(x_i, \cdot), f \rangle = \langle \underbrace{\tfrac{1}{m} \textstyle\sum_{i=1}^m k(x_i, \cdot)}_{=: \mu_X}, f \rangle$$
So population and empirical means are computed via $\langle \mu_p, f \rangle$ and $\langle \mu_X, f \rangle$.

Hilbert Space Setting
Optimization Problem
$$\sup_{\|f\| \le 1} E_p[f(x)] - E_q[f(y)] = \sup_{\|f\| \le 1} \langle \mu_p - \mu_q, f \rangle = \|\mu_p - \mu_q\|_H$$
Kernels
$$\|\mu_p - \mu_q\|_H^2 = \langle \mu_p - \mu_q, \mu_p - \mu_q \rangle = E_{p,p}\langle k(x, \cdot), k(x', \cdot) \rangle - 2 E_{p,q}\langle k(x, \cdot), k(y, \cdot) \rangle + E_{q,q}\langle k(y, \cdot), k(y', \cdot) \rangle$$
$$= E_{p,p}\, k(x, x') - 2 E_{p,q}\, k(x, y) + E_{q,q}\, k(y, y')$$

Maximum Mean Discrepancy Statistic
Goal: estimate
$$D^2(p, q, F) = E_{p,p}\, k(x, x') - 2 E_{p,q}\, k(x, y) + E_{q,q}\, k(y, y').$$
U-statistic: the empirical estimate (for $m = n$)
$$D^2(X, Y, F) := \frac{1}{m(m-1)} \sum_{i \neq j} \underbrace{\big[ k(x_i, x_j) - k(x_i, y_j) - k(y_i, x_j) + k(y_i, y_j) \big]}_{=: h((x_i, y_i), (x_j, y_j))}$$
Theorem
$D^2(X, Y, F)$ is an unbiased estimator of $D^2(p, q, F)$.
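
A minimal numpy sketch of the unbiased statistic $D^2(X, Y, F)$ above, using a Gaussian RBF kernel (my choice of kernel, not prescribed by the slide), for equal sample sizes $m = n$:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * d2)

def mmd2_unbiased(x, y, gamma=1.0):
    """Unbiased estimate of MMD^2 via the U-statistic (requires len(x) == len(y))."""
    m = len(x)
    assert len(y) == m
    h = (rbf_kernel(x, x, gamma) - rbf_kernel(x, y, gamma)
         - rbf_kernel(y, x, gamma) + rbf_kernel(y, y, gamma))
    np.fill_diagonal(h, 0.0)                 # U-statistic: sum over i != j only
    return h.sum() / (m * (m - 1))

x = np.random.randn(200, 5)            # sample from p
y = np.random.randn(200, 5) + 0.2      # sample from q (shifted mean)
print(mmd2_unbiased(x, y, gamma=0.5))
```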

Distinguishing Normal and Laplace (figure: not transcribed)

Uniform Convergence Bound
Theorem (Hoeffding, 1963)
For the kernel of a U-statistic $\kappa(x, x')$ with $|\kappa(x, x')| \le r$ we have
$$\Pr\left\{ \left| E_p[\kappa(x, x')] - \frac{1}{m(m-1)} \sum_{i \neq j} \kappa(x_i, x_j) \right| > \epsilon \right\} \le 2 \exp\!\left( -\frac{m \epsilon^2}{r^2} \right)$$
Corollary (MMD convergence)
$$\Pr\left\{ \left| D^2(X, Y, F) - D^2(p, q, F) \right| > \epsilon \right\} \le 2 \exp\!\left( -\frac{m \epsilon^2}{r^2} \right)$$
Consequences
We have $O(1/\sqrt{m})$ uniform convergence, hence the estimator is consistent.
We can use this as a test: solve the inequality for a given confidence level $\delta$.
The bounds can be very loose.
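
Solving $2\exp(-m\epsilon^2/r^2) = \delta$ for $\epsilon$ gives the distribution-free acceptance threshold; a tiny sketch (mine), for a U-statistic kernel bounded by $r$:

```python
import numpy as np

def hoeffding_threshold(m, r, delta):
    """Threshold eps solving 2*exp(-m*eps^2 / r^2) = delta: reject p = q if the
    MMD^2 statistic exceeds eps. Distribution-free, but often loose."""
    return r * np.sqrt(np.log(2.0 / delta) / m)

# r = 4 is a (conservative) bound on |h| when the base kernel satisfies |k| <= 1
print(hoeffding_threshold(m=200, r=4.0, delta=0.05))
```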

Asymptotic Bound
Idea
Use the asymptotic normality of the U-statistic and estimate its variance $\sigma^2$.
Theorem (Hoeffding, 1948)
$D^2(X, Y, F)$ is asymptotically normal with variance $4\sigma^2/m$, where
$$\sigma^2 = E_{x,y}\!\left[ \big( E_{x',y'}[\, h((x, y), (x', y')) \,] \big)^2 \right] - \big( E_{x,y,x',y'}[\, h((x, y), (x', y')) \,] \big)^2.$$
Test
Estimate $\sigma^2$ from the data.
Reject the hypothesis $p = q$ if $D^2(X, Y, F) > 2\alpha\sigma/\sqrt{m}$, where $\alpha$ is the confidence threshold computed via
$$\int_\alpha^\infty (2\pi)^{-1/2} \exp(-x^2/2)\, dx = \delta.$$
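
A sketch (mine, following the slide's recipe) of the asymptotic test: estimate $\sigma^2$ by the empirical variance of $g_i \approx E_{x',y'}\,h((x_i, y_i), (x', y'))$ and compare the statistic with $2\,z_\delta\,\hat\sigma/\sqrt{m}$, where $z_\delta$ is the upper $\delta$ quantile of the standard normal:

```python
import numpy as np
from scipy.stats import norm

def asymptotic_test(h, delta=0.05):
    """Asymptotic MMD test from the matrix h[i, j] = h((x_i, y_i), (x_j, y_j)).

    h is the m x m U-statistic kernel matrix (as in the earlier sketch);
    the diagonal is ignored.  Returns (statistic, threshold, reject)."""
    m = h.shape[0]
    off = ~np.eye(m, dtype=bool)
    mmd2 = h[off].sum() / (m * (m - 1))              # the U-statistic itself
    g = (h.sum(axis=1) - np.diag(h)) / (m - 1)       # conditional expectation estimates
    sigma = np.sqrt(np.var(g))
    threshold = 2.0 * norm.isf(delta) * sigma / np.sqrt(m)
    return mmd2, threshold, mmd2 > threshold

# tiny demo: build h for two Gaussian samples with an RBF kernel
x, y = np.random.randn(200, 3), np.random.randn(200, 3) + 0.3
k = lambda a, b: np.exp(-0.5 * ((a[:, None, :] - b[None, :, :])**2).sum(-1))
h = k(x, x) - k(x, y) - k(y, x) + k(y, y)
print(asymptotic_test(h))
```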

Part 2: Data Integration

Application: Data Integration
Goal
Data comes from various sources; check whether we can combine it.
Comparison
MMD using the uniform convergence bound
MMD using the asymptotic expansion
t-test
Friedman-Rafsky Wolf test
Friedman-Rafsky Smirnov test
Hall-Tajvidi test
Important Detail
Our test only needs a double for-loop to implement (see the sketch below); the other tests require spanning trees, matrix inversion, etc.
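
To illustrate the "double for loop" remark, a deliberately naive sketch (mine) of the unbiased statistic written with explicit loops; the vectorized version given earlier is equivalent:

```python
import math

def mmd2_loops(x, y, k=lambda a, b: math.exp(-sum((ai - bi)**2 for ai, bi in zip(a, b)))):
    """Unbiased MMD^2 with explicit loops; x, y are equal-length lists of tuples."""
    m, total = len(x), 0.0
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            total += k(x[i], x[j]) - k(x[i], y[j]) - k(y[i], x[j]) + k(y[i], y[j])
    return total / (m * (m - 1))

print(mmd2_loops([(0.0,), (1.0,)], [(0.5,), (1.5,)]))
```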

Toy Example: Normal Distributions (figure: not transcribed)

Microarray cross-platform comparability
Cross-platform comparability tests at the microarray level for cDNA and oligonucleotide platforms. For each test (MMD, t-test, Friedman-Rafsky Wolf, Friedman-Rafsky Smirnov) the table reports how often H0: p = q was accepted or rejected when the platforms were the same and when they were different (counts not preserved in this transcription).
Repetitions: 100; sample size (each): 25; dimension of sample vectors: 2116.

Cancer diagnosis
Comparing samples from normal and prostate tumor tissues; H0 is the hypothesis that p = q. For each test (MMD, t-test, Friedman-Rafsky Wolf, Friedman-Rafsky Smirnov) the table reports how often H0 was accepted or rejected for samples of the same and of different health status (counts not preserved in this transcription).
Repetitions: 100; sample size (each): 25; dimension of sample vectors: 12,600.

Tumor subtype tests
Comparing samples from different and identical tumor subtypes of lymphoma; H0 is the hypothesis that p = q. For each test (MMD, t-test, Friedman-Rafsky Wolf, Friedman-Rafsky Smirnov) the table reports how often H0 was accepted or rejected for identical and for different subtypes (counts not preserved in this transcription).
Repetitions: 100; sample size (each): 25; dimension of sample vectors: 2,118.

Part 3: Attribute Matching

Application: Attribute Matching
Goal
Given two datasets, find the corresponding attributes, using only the distributions of the random variables. This occurs when matching schemas between databases.
Examples
Match different sets of dates.
Match names.
Can we merge the databases at all?
Approach
Use MMD to measure the distance between the distributions over different coordinates.

Application: Attribute Matching (figure: not transcribed)

Linear Assignment Problem
Goal
Find a good assignment for all pairs of coordinates $(i, j)$:
$$\text{maximize}_\pi \; \sum_{i=1}^m C_{i\pi(i)}, \quad \text{where } C_{ij} = D(X_i, X_j, F),$$
optimizing over the space of all permutation matrices $\pi$.
Linear Programming Relaxation
$$\text{maximize}_\pi \; \operatorname{tr} C\pi \quad \text{subject to} \quad \sum_i \pi_{ij} = 1, \; \sum_j \pi_{ij} = 1, \; \pi_{ij} \ge 0, \; \pi_{ij} \in \{0, 1\}.$$
The integrality constraint $\pi_{ij} \in \{0, 1\}$ can be dropped, since the remaining constraint matrix is totally unimodular.
Hungarian Marriage algorithm (Kuhn, Munkres, 1953): solves the problem in cubic time.
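
A sketch (mine) of the attribute-matching step, assuming we pair attributes by minimizing the total pairwise MMD (i.e., treating the matrix of MMD values as a cost matrix); scipy's linear_sum_assignment implements a Hungarian-style algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_attributes(mmd_matrix):
    """Pair attribute i of dataset 1 with attribute j of dataset 2.

    mmd_matrix[i, j] holds the (squared) MMD between the sample of attribute i
    and the sample of attribute j; small values mean 'same distribution', so we
    solve the assignment problem that minimizes the total discrepancy."""
    rows, cols = linear_sum_assignment(mmd_matrix)
    return list(zip(rows, cols))

# toy cost matrix: attributes 0<->1 and 1<->0 have matching distributions
C = np.array([[0.9, 0.1, 0.8],
              [0.1, 0.9, 0.7],
              [0.8, 0.7, 0.1]])
print(match_attributes(C))   # [(0, 1), (1, 0), (2, 2)]
```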

Schema Matching with Linear Assignment
Key Idea
Use $C_{ij} = D(X_i, X_j, F)$ as the compatibility criterion.
Results
Datasets: BIO (univariate), CNUM (univariate), FOREST (univariate), FOREST10D (multivariate), ENZYME (structured), PROTEINS (structured). The table reports, per dataset, the dimension d, sample size m, number of repetitions, and the percentage of correctly matched attributes (numbers not preserved in this transcription).

Part 4: Sample Bias Correction

Sample Bias Correction
The Problem
Training data $X, Y$ is drawn iid from $\Pr(x, y)$.
Test data $X', Y'$ is drawn iid from $\Pr'(x', y')$.
Simplifying assumption: only the marginals $\Pr(x)$ and $\Pr'(x')$ differ; the conditional distributions $\Pr(y \mid x)$ are the same.
Applications
In medical diagnosis (e.g. cancer detection from microarrays) we usually have very different training and test sets.
Active learning
Experimental design
Brain-computer interfaces (drifting distributions)
Adapting to new users

Importance Sampling
Swapping Distributions
Assume that we are allowed to draw from p but want expectations under q:
$$E_q[f(x)] = \int \underbrace{\frac{q(x)}{p(x)}}_{\beta(x)} f(x)\, dp(x) = E_p\!\left[ \frac{q(x)}{p(x)} f(x) \right]$$
Reweighted Risk
Minimize the reweighted empirical risk plus a regularizer, as in SVMs, regression, or GP classification:
$$\frac{1}{m} \sum_{i=1}^m \beta(x_i)\, l(x_i, y_i, \theta) + \lambda \Omega[\theta]$$
Problem: we need to know $\beta(x)$.
Problem: we are ignoring the fact that we know the test set.
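
Once weights $\beta_i$ are available (e.g. from the quadratic program below), plugging them into a standard learner is straightforward; a sketch (mine), using scikit-learn's per-sample weights as one concrete way to minimize the reweighted empirical risk:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# training set drawn from p, test set drawn from q (shifted inputs)
X_train = rng.normal(0.0, 1.0, size=(300, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = rng.normal(0.7, 1.0, size=(300, 2))

# beta would come from kernel mean matching; uniform weights shown as a placeholder
beta = np.ones(len(X_train))

# an SVM trained on the weighted empirical risk (1/m) * sum_i beta_i * loss_i
clf = SVC(kernel="rbf").fit(X_train, y_train, sample_weight=beta)
print(clf.predict(X_test[:5]))
```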

Reweighting Means
Theorem
The optimization problem
$$\text{minimize}_{\beta(x) \ge 0} \; \big\| E_q[k(x, \cdot)] - E_p[\beta(x)\, k(x, \cdot)] \big\|^2 \quad \text{subject to} \quad E_p[\beta(x)] = 1$$
is convex and its unique solution is $\beta(x) = q(x)/p(x)$.
Proof.
1 The problem is obviously convex: convex objective and linear constraints. Moreover, it is bounded from below by 0.
2 Hence it has a unique minimum.
3 The guess $\beta(x) = q(x)/p(x)$ achieves the minimum value.

Empirical Version
Re-weight the empirical mean in feature space,
$$\mu[X, \beta] := \frac{1}{m} \sum_{i=1}^m \beta_i k(x_i, \cdot),$$
such that it is close to the mean on the test set,
$$\mu[X'] := \frac{1}{m'} \sum_{i=1}^{m'} k(x'_i, \cdot).$$
Ensure that $\beta_i$ is a proper reweighting.

Optimization Problem
Quadratic Program
$$\text{minimize}_\beta \; \left\| \frac{1}{m} \sum_{i=1}^m \beta_i k(x_i, \cdot) - \frac{1}{m'} \sum_{i=1}^{m'} k(x'_i, \cdot) \right\|^2 \quad \text{subject to} \quad 0 \le \beta_i \le B \text{ and } \sum_{i=1}^m \beta_i = m.$$
Upper bound B on $\beta_i$ for regularization.
Summation constraint for a proper reweighted distribution.
Standard solvers are available.
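
Expanding the norm gives a quadratic objective $\frac{1}{m^2}\beta^\top K \beta - \frac{2}{m m'} \beta^\top \kappa$ (up to a constant), with $K_{ij} = k(x_i, x_j)$ and $\kappa_i = \sum_j k(x_i, x'_j)$. A sketch (mine) of this kernel-mean-matching program using scipy's SLSQP solver; any QP solver would do:

```python
import numpy as np
from scipy.optimize import minimize

def kmm_weights(X_train, X_test, gamma=1.0, B=10.0):
    """Kernel mean matching: match the reweighted training mean in feature
    space to the test mean, subject to 0 <= beta <= B and sum(beta) = m."""
    def rbf(a, b):
        d2 = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
        return np.exp(-gamma * d2)

    m, mp = len(X_train), len(X_test)
    K = rbf(X_train, X_train)
    kappa = rbf(X_train, X_test).sum(axis=1)

    def objective(beta):
        return beta @ K @ beta / m**2 - 2.0 * (kappa @ beta) / (m * mp)

    def grad(beta):
        return 2.0 * K @ beta / m**2 - 2.0 * kappa / (m * mp)

    res = minimize(objective, x0=np.ones(m), jac=grad, method="SLSQP",
                   bounds=[(0.0, B)] * m,
                   constraints=[{"type": "eq", "fun": lambda b: b.sum() - m}])
    return res.x

X_train = np.random.randn(100, 2)
X_test = np.random.randn(80, 2) + 0.5
beta = kmm_weights(X_train, X_test)
print(beta.mean(), beta.min(), beta.max())
```

The resulting weights feed directly into the reweighted risk shown earlier; the effective sample size of the reweighted set is $m^2/\|\beta\|^2$, as stated on the consistency slide below.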

Consistency
Theorem
The reweighted set of observations behaves like one drawn from q, with effective sample size $m^2 / \|\beta\|^2$.
Proof.
The bias of the estimator is proportional to the square root of the value of the objective function.
1 Show that for smooth functions the expected loss is close. This only requires that both feature-map means are close.
2 Show that the expected loss is close to the empirical loss (in a transduction style, i.e. conditioned on X and X').

Regression Toy Example (figure: not transcribed)

Regression Toy Example (figure: not transcribed)

Breast Cancer - Bias on Features (figure: not transcribed)

Breast Cancer - Bias on Labels (figure: not transcribed)

More Experiments (figure: not transcribed)

Summary
1 Two Sample Problem: Direct solution; Kolmogorov-Smirnov test; Reproducing kernel Hilbert spaces; Test statistics
2 Data Integration: Problem definition; Examples
3 Attribute Matching: Basic problem; Linear assignment problem
4 Sample Bias Correction: Sample reweighting; Quadratic program and consistency; Experiments
