Maximum Mean Discrepancy

Alexander J. Smola, Statistical Machine Learning Program, Canberra, ACT 0200, Australia
Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton
ICONIP 2006, Hong Kong, October 3

Outline
1. Two Sample Problem: Direct Solution; Kolmogorov-Smirnov Test; Reproducing Kernel Hilbert Spaces; Test Statistics
2. Data Integration: Problem Definition; Examples
3. Attribute Matching: Basic Problem; Linear Assignment Problem
4. Sample Bias Correction: Sample Reweighting; Quadratic Program and Consistency; Experiments

Two Sample Problem
Setting: given X := {x_1, ..., x_m} drawn from p and Y := {y_1, ..., y_n} drawn from q, test whether p = q.
Applications:
- Cross-platform compatibility of microarrays: need to know whether the distributions are the same.
- Database schema matching: need to know which coordinates match.
- Sample bias correction: need to know how to reweight data.
- Feature selection: need features which make the distributions most different.
- Parameter estimation: reduce a two-sample test to a one-sample test.

Indirect Solution
Algorithm: estimate p̂ and q̂ from the observations (e.g. Parzen windows, mixture of Gaussians, ...), then compute a distance between p̂ and q̂ (e.g. Kullback-Leibler divergence, L2 distance, L1 distance, ...).
Pro:
- Lots of existing density estimator code available.
- Lots of distance measures available.
- Theoretical analysis for convergence of density estimators.
Con:
- Curse of dimensionality for density estimation.
- Theoretical analysis is quite tricky (multi-stage procedure).
- What to do when density estimation is difficult?
- Bias: even for p = q the estimated distance is typically nonzero.

Example: Parzen Windows
Density estimates: $\hat p(x) = \frac{1}{m}\sum_{i=1}^m \kappa(x_i, x)$ and $\hat q(y) = \frac{1}{n}\sum_{i=1}^n \kappa(y_i, y)$, where $\kappa(x, \cdot)$ is a nonnegative function which integrates to 1.
Squared distance: we use $D(X, Y) := \int (\hat p(x) - \hat q(x))^2 \, dx$. This yields
$\|\hat p - \hat q\|_2^2 = \frac{1}{m^2}\sum_{i,j=1}^m k(x_i, x_j) - \frac{2}{mn}\sum_{i=1}^m \sum_{j=1}^n k(x_i, y_j) + \frac{1}{n^2}\sum_{i,j=1}^n k(y_i, y_j)$
where $k(x, x') = \int \kappa(x, t)\, \kappa(x', t)\, dt$.
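
Not from the slides: a minimal sketch of this squared L2 distance, assuming Gaussian Parzen windows of bandwidth sigma, for which the induced kernel $k(x, x') = \int \kappa(x, t)\kappa(x', t)\,dt$ is again a Gaussian with variance $2\sigma^2$. The helper names (induced_kernel, parzen_l2_distance) are mine.

```python
import numpy as np

def induced_kernel(A, B, sigma):
    """k(x, x') = integral of kappa(x, t) * kappa(x', t) dt for Gaussian
    windows of bandwidth sigma: a Gaussian in x - x' with variance 2*sigma^2."""
    d = A.shape[1]
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return (4 * np.pi * sigma ** 2) ** (-d / 2) * np.exp(-sq / (4 * sigma ** 2))

def parzen_l2_distance(X, Y, sigma=1.0):
    """Squared L2 distance between the Parzen-window estimates of p and q."""
    m, n = len(X), len(Y)
    Kxx = induced_kernel(X, X, sigma)
    Kyy = induced_kernel(Y, Y, sigma)
    Kxy = induced_kernel(X, Y, sigma)
    return Kxx.sum() / m**2 - 2 * Kxy.sum() / (m * n) + Kyy.sum() / n**2

# toy check: two Gaussian samples with shifted means
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))
Y = rng.normal(0.5, 1.0, size=(200, 2))
print(parzen_l2_distance(X, Y, sigma=0.5))
```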

A Simple Example [figure slide]

Do We Need Density Estimation?
Food for thought: no density estimation is needed if the problem is really simple. In the previous example it suffices to check the means.
General principle: we can find a simple linear witness which shows that the distributions are different.

Direct Solution
Key idea: avoid the density estimator and use means directly.
Maximum Mean Discrepancy (Fortet and Mourier, 1953):
$D(p, q, F) := \sup_{f \in F} E_p[f(x)] - E_q[f(y)]$
Theorem (via Dudley, 1984): D(p, q, F) = 0 iff p = q, when F = C_0(X) is the space of continuous, bounded functions on X.
Theorem (via Steinwart, 2001; Smola et al., 2006): D(p, q, F) = 0 iff p = q, when $F = \{f : \|f\|_H \leq 1\}$ is the unit ball in a Reproducing Kernel Hilbert Space H, provided that H is universal.

Direct Solution
Proof. If p = q, it is clear that D(p, q, F) = 0 for any F. If p ≠ q, there exists some $f \in C_0(X)$ such that $E_p[f] - E_q[f] = \epsilon > 0$. Since H is universal, we can find some $\hat f \in H$ with $\|f - \hat f\|_\infty \leq \epsilon/2$. Rescale $\hat f$ to fit into the unit ball.
Goals: an empirical estimate of D(p, q, F), and convergence guarantees.

Functions of Bounded Variation
Function class: functions with $\int |\partial_x f(x)|\, dx \leq C$.
[Figure: Heaviside step functions]

Kolmogorov-Smirnov Statistic
Function class: real-valued functions in one dimension, X = R; F is the set of all functions with total variation at most 1. Key: F is the absolute convex hull of the step functions $\xi_{(-\infty, t]}(x)$ for $t \in R$.
Optimization problem:
$\sup_{f \in F} E_p[f(x)] - E_q[f(y)] = \sup_{t \in R} \left| E_p[\xi_{(-\infty, t]}(x)] - E_q[\xi_{(-\infty, t]}(y)] \right| = \|F_p - F_q\|_\infty$
Estimation: use empirical estimates of the CDFs F_p and F_q, and use Glivenko-Cantelli to obtain the statistic.
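
Not from the slides: a small sketch of the empirical two-sample statistic, evaluating both empirical CDFs on the pooled sample (scipy.stats.ks_2samp computes the same quantity together with a p-value).

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic sup_t |F_x(t) - F_y(t)|,
    evaluated at the pooled sample points where the empirical CDFs jump."""
    x, y = np.sort(x), np.sort(y)
    grid = np.concatenate([x, y])
    Fx = np.searchsorted(x, grid, side="right") / len(x)
    Fy = np.searchsorted(y, grid, side="right") / len(y)
    return np.abs(Fx - Fy).max()

rng = np.random.default_rng(0)
print(ks_statistic(rng.normal(size=500), rng.laplace(size=500)))
```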

Feature Spaces
Feature map: φ(x) maps X into some feature space F. Example: quadratic features φ(x) = (1, x, x²).
Function class: linear functions $f(x) = \langle \phi(x), w \rangle$ with $\|w\| \leq 1$. Example: quadratic functions f(x) in the above case.
Kernels: no need to compute φ(x) explicitly; use an algorithm which only needs $\langle \phi(x), \phi(x') \rangle =: k(x, x')$.
Reproducing Kernel Hilbert Space notation:
- Reproducing property: $\langle f, k(x, \cdot) \rangle = f(x)$
- Equivalence between φ(x) and k(x, ·) via $\langle k(x, \cdot), k(x', \cdot) \rangle = k(x, x')$

Hilbert Space Setting
Function class: a Reproducing Kernel Hilbert Space H with kernel k; evaluation functionals $f(x) = \langle k(x, \cdot), f \rangle$.
Computing means via linearity:
$E_p[f(x)] = E_p[\langle k(x, \cdot), f \rangle] = \langle \underbrace{E_p[k(x, \cdot)]}_{:= \mu_p}, f \rangle$
$\frac{1}{m}\sum_{i=1}^m f(x_i) = \frac{1}{m}\sum_{i=1}^m \langle k(x_i, \cdot), f \rangle = \langle \underbrace{\tfrac{1}{m}\sum_{i=1}^m k(x_i, \cdot)}_{:= \mu_X}, f \rangle$
That is, means are computed via $\langle \mu_p, f \rangle$ and $\langle \mu_X, f \rangle$.

Hilbert Space Setting
Optimization problem:
$\sup_{\|f\| \leq 1} E_p[f(x)] - E_q[f(y)] = \sup_{\|f\| \leq 1} \langle \mu_p - \mu_q, f \rangle = \|\mu_p - \mu_q\|_H$
Kernels:
$\|\mu_p - \mu_q\|_H^2 = \langle \mu_p - \mu_q, \mu_p - \mu_q \rangle = E_{p,p}\langle k(x, \cdot), k(x', \cdot) \rangle - 2 E_{p,q}\langle k(x, \cdot), k(y, \cdot) \rangle + E_{q,q}\langle k(y, \cdot), k(y', \cdot) \rangle = E_{p,p} k(x, x') - 2 E_{p,q} k(x, y) + E_{q,q} k(y, y')$
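
Not from the slides: a sketch of the plug-in estimate of $\|\mu_p - \mu_q\|_H^2$, replacing each expectation by a sample average. A Gaussian RBF kernel is assumed (a universal kernel); rbf_kernel and mmd2_biased are my names, and the later sketches reuse rbf_kernel.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian RBF kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd2_biased(X, Y, gamma=1.0):
    """Plug-in (biased) estimate of E k(x,x') - 2 E k(x,y) + E k(y,y')."""
    Kxx = rbf_kernel(X, X, gamma)
    Kyy = rbf_kernel(Y, Y, gamma)
    Kxy = rbf_kernel(X, Y, gamma)
    return Kxx.mean() - 2 * Kxy.mean() + Kyy.mean()

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(300, 5))
Y = rng.normal(0.2, 1.0, size=(300, 5))
print(mmd2_biased(X, Y, gamma=0.5))
```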

Witness Functions
Optimization problem: the expression $\langle \mu_p - \mu_q, f \rangle$ is maximized over the unit ball by (a rescaled copy of) $f = \mu_p - \mu_q$.
Empirical equivalent: the expectations µ_p and µ_q are hard to compute, so we replace them by empirical means. This yields
$f(z) = \frac{1}{m}\sum_{i=1}^m k(x_i, z) - \frac{1}{n}\sum_{i=1}^n k(y_i, z)$
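
Not from the slides: a sketch that evaluates this empirical witness on a set of query points, reusing rbf_kernel from the sketch above; plotting it on a grid reproduces pictures like the Normal-vs-Laplace figure on the next slide.

```python
import numpy as np

def witness(X, Y, Z, gamma=1.0):
    """Empirical witness f(z) = mean_i k(x_i, z) - mean_j k(y_j, z),
    evaluated at the query points Z (rbf_kernel from the earlier sketch)."""
    Kxz = rbf_kernel(X, Z, gamma)   # shape (m, len(Z))
    Kyz = rbf_kernel(Y, Z, gamma)   # shape (n, len(Z))
    return Kxz.mean(axis=0) - Kyz.mean(axis=0)

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 1))
Y = rng.laplace(size=(500, 1))
Z = np.linspace(-4, 4, 81)[:, None]
print(witness(X, Y, Z, gamma=0.5).round(3))
```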

Distinguishing Normal and Laplace [figure slide]

Maximum Mean Discrepancy Statistic
Goal: estimate $D(p, q, F) = E_{p,p} k(x, x') - 2 E_{p,q} k(x, y) + E_{q,q} k(y, y')$ (here D denotes the squared discrepancy).
U-statistic: empirical estimate (for m = n, with observations paired as $(x_i, y_i)$)
$D(X, Y, F) = \frac{1}{m(m-1)} \sum_{i \neq j} \underbrace{k(x_i, x_j) - k(x_i, y_j) - k(y_i, x_j) + k(y_i, y_j)}_{=: h((x_i, y_i), (x_j, y_j))}$
Theorem: D(X, Y, F) is an unbiased estimator of D(p, q, F).
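
Not from the slides: a sketch of the unbiased U-statistic; compared with mmd2_biased above, the only change is that the i = j terms are excluded. Reuses rbf_kernel from the earlier sketch.

```python
import numpy as np

def mmd2_unbiased(X, Y, gamma=1.0):
    """Unbiased U-statistic estimate of the squared MMD, assuming m = n."""
    m = len(X)
    assert len(Y) == m, "this form of the U-statistic pairs x_i with y_i"
    Kxx = rbf_kernel(X, X, gamma)
    Kyy = rbf_kernel(Y, Y, gamma)
    Kxy = rbf_kernel(X, Y, gamma)
    H = Kxx + Kyy - Kxy - Kxy.T   # h((x_i, y_i), (x_j, y_j))
    np.fill_diagonal(H, 0.0)      # U-statistic: sum over i != j only
    return H.sum() / (m * (m - 1))
```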

Uniform Convergence Bound
Theorem (Hoeffding, 1963): for the kernel of a U-statistic $\kappa(x, x')$ with $|\kappa(x, x')| \leq r$ we have
$\Pr\left\{ \left| E_p[\kappa(x, x')] - \frac{1}{m(m-1)} \sum_{i \neq j} \kappa(x_i, x_j) \right| > \epsilon \right\} \leq 2 \exp\left( -\frac{m \epsilon^2}{r^2} \right)$
Corollary (MMD convergence):
$\Pr\{ |D(X, Y, F) - D(p, q, F)| > \epsilon \} \leq 2 \exp\left( -\frac{m \epsilon^2}{r^2} \right)$
Consequences:
- We have $O(1/\sqrt{m})$ uniform convergence, hence the estimator is consistent.
- We can use this as a test: solve the inequality for a given confidence level δ.
- The bounds can be very loose.
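
Not from the slides: solving $2\exp(-m\epsilon^2/r^2) = \delta$ for ε gives the acceptance threshold. The bound r = 4 assumed below holds for any kernel with $0 \leq k \leq 1$ (such as the RBF kernel), since the U-statistic kernel h is a sum of four such terms.

```python
import numpy as np

def hoeffding_threshold(m, delta=0.05, r=4.0):
    """Threshold eps with 2 * exp(-m * eps^2 / r^2) = delta; reject p = q
    at level delta if the U-statistic exceeds it."""
    return r * np.sqrt(np.log(2.0 / delta) / m)

# usage: reject H0 if mmd2_unbiased(X, Y) > hoeffding_threshold(len(X))
```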

Asymptotic Bound
Idea: use the asymptotic normality of the U-statistic and estimate its variance σ².
Theorem (Hoeffding, 1948): D(X, Y, F) is asymptotically normal with variance $\frac{4\sigma^2}{m}$, where
$\sigma^2 = E_{x,y}\left[ \left( E_{x',y'}[ h((x, y), (x', y')) ] \right)^2 \right] - \left( E_{x,y,x',y'}[ h((x, y), (x', y')) ] \right)^2$
Test: estimate σ² from the data; reject the hypothesis that p = q if $D(X, Y, F) > 2\alpha\sigma/\sqrt{m}$, where α is the confidence threshold, computed via $\int_\alpha^\infty (2\pi)^{-1/2} \exp(-x^2/2)\, dx = \delta$.
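
Not from the slides: a rough sketch of the asymptotic test. The variance σ² is estimated by plugging row averages of the h-matrix into the expression above; this is only meant to illustrate the shape of the test, not a tuned implementation. Reuses rbf_kernel from the earlier sketch.

```python
import numpy as np
from scipy.stats import norm

def asymptotic_mmd_test(X, Y, gamma=1.0, delta=0.05):
    """Estimate sigma^2 of the U-statistic kernel h and reject p = q when
    the statistic exceeds 2 * alpha * sigma / sqrt(m)."""
    m = len(X)
    Kxx = rbf_kernel(X, X, gamma)
    Kyy = rbf_kernel(Y, Y, gamma)
    Kxy = rbf_kernel(X, Y, gamma)
    H = Kxx + Kyy - Kxy - Kxy.T
    np.fill_diagonal(H, 0.0)
    row_means = H.sum(axis=1) / (m - 1)       # estimates E_{x',y'} h(., .)
    stat = H.sum() / (m * (m - 1))            # the U-statistic itself
    sigma2 = max(np.mean(row_means ** 2) - stat ** 2, 0.0)
    alpha = norm.ppf(1.0 - delta)             # one-sided normal quantile
    threshold = 2.0 * alpha * np.sqrt(sigma2) / np.sqrt(m)
    return stat, threshold, stat > threshold
```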

Application: Data Integration
Goal: data comes from various sources; check whether we can combine it.
Comparison:
- MMD using the uniform convergence bound
- MMD using the asymptotic expansion
- t-test
- Friedman-Rafsky Wolf test
- Friedman-Rafsky Smirnov test
- Hall-Tajvidi test
Important detail: our test only needs a double for-loop to implement; the other tests require spanning trees, matrix inversion, etc.

Toy Example: Normal Distributions [figure slide]

Microarray Cross-Platform Comparability
[Table: counts of H_0 accepted/rejected for MMD, t-test, Friedman-Rafsky Wolf and Friedman-Rafsky Smirnov tests, on same vs. different platforms; the numerical entries did not survive transcription.]
Cross-platform comparability tests at the microarray level for cDNA and oligonucleotide platforms. Repetitions: 100; sample size (each): 25; dimension of sample vectors: 2116.

Cancer Diagnosis
[Table: counts of H_0 accepted/rejected for MMD, t-test, Friedman-Rafsky Wolf and Friedman-Rafsky Smirnov tests, on same vs. different health status; the numerical entries did not survive transcription.]
Comparing samples from normal and prostate tumor tissues; H_0 is the hypothesis that p = q. Repetitions: 100; sample size (each): 25; dimension of sample vectors: 12,600.

Tumor Subtype Tests
[Table: counts of H_0 accepted/rejected for MMD, t-test, Friedman-Rafsky Wolf and Friedman-Rafsky Smirnov tests, on same vs. different tumor subtypes; the numerical entries did not survive transcription.]
Comparing samples from different and identical tumor subtypes of lymphoma; H_0 is the hypothesis that p = q. Repetitions: 100; sample size (each): 25; dimension of sample vectors: 2,118.

Application: Attribute Matching
Goal: given two datasets, find corresponding attributes, using only the distributions of the random variables. This occurs when matching schemas between databases.
Examples:
- Match different sets of dates
- Match names
- Can we merge the databases at all?
Approach: use MMD to measure the distance between the distributions over different coordinates.

Application: Attribute Matching [figure slide]

Linear Assignment Problem
Goal: find a good assignment for all pairs of coordinates (i, j):
$\text{maximize}_\pi \sum_{i=1}^m C_{i\pi(i)}$ where $C_{ij} = D(X_i, X_j, F)$,
optimizing over the space of all permutation matrices π.
Linear programming relaxation:
$\text{maximize}_\pi\ \text{tr}\, C\pi$ subject to $\sum_i \pi_{ij} = 1$, $\sum_j \pi_{ij} = 1$, $\pi_{ij} \geq 0$, $\pi_{ij} \in \{0, 1\}$
The integrality constraint can be dropped, as the constraint matrix is totally unimodular.
Hungarian marriage algorithm (Kuhn, Munkres, 1953): solves the problem in cubic time.
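
Not from the slides: a sketch using SciPy's Hungarian-algorithm solver on a pairwise discrepancy matrix. linear_sum_assignment minimizes the summed cost; for the maximization form on the slide, pass maximize=True (or negate C).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_attributes(C):
    """Return the permutation pi minimizing sum_i C[i, pi(i)] for a matrix
    of pairwise discrepancies C[i, j] = D(X_i, X_j, F)."""
    rows, cols = linear_sum_assignment(C)
    return cols

# toy example with a hypothetical 3x3 discrepancy matrix
C = np.array([[0.1, 0.9, 0.8],
              [0.7, 0.2, 0.9],
              [0.8, 0.9, 0.1]])
print(match_attributes(C))   # -> [0 1 2]
```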

Schema Matching with Linear Assignment
Key idea: use $D(X_i, X_j, F) = C_{ij}$ as the compatibility criterion.
Results:
[Table: datasets BIO, CNUM, FOREST (univariate), FOREST10D (multivariate), ENZYME, PROTEINS (structured), with columns for data type, d, m, repetitions and % correct; the numerical entries did not survive transcription.]

Sample Bias Correction
The problem: training data X, Y is drawn iid from Pr(x, y); test data X', Y' is drawn iid from Pr'(x', y'). Simplifying assumption: only Pr(x) and Pr'(x') differ; the conditional distributions Pr(y | x) are the same.
Applications:
- Medical diagnosis (e.g. cancer detection from microarrays), where we usually have very different training and test sets
- Active learning
- Experimental design
- Brain-computer interfaces (drifting distributions)
- Adapting to new users

Importance Sampling
Swapping distributions: assume that we are allowed to draw from p but want to draw from q:
$E_q[f(x)] = \int \underbrace{\frac{q(x)}{p(x)}}_{\beta(x)} f(x)\, dp(x) = E_p\left[ \frac{q(x)}{p(x)} f(x) \right]$
Reweighted risk: minimize the reweighted empirical risk plus a regularizer, as in SVM, regression, or GP classification:
$\frac{1}{m}\sum_{i=1}^m \beta(x_i)\, l(x_i, y_i, \theta) + \lambda\, \Omega[\theta]$
Problem: we need to know β(x).
Problem: we are ignoring the fact that we know the test set.
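
Not from the slides: a minimal sketch of the reweighted-risk idea using a squared loss with an L2 regularizer (rather than the SVM/GP settings mentioned on the slide), where the reweighted regularized risk has a closed-form minimizer.

```python
import numpy as np

def weighted_ridge(X, y, beta, lam=1.0):
    """Minimize (1/m) sum_i beta_i * (y_i - <x_i, theta>)^2 + lam * ||theta||^2.
    Setting the gradient to zero gives a reweighted normal equation."""
    m, d = X.shape
    A = X.T @ (beta[:, None] * X) / m + lam * np.eye(d)
    b = X.T @ (beta * y) / m
    return np.linalg.solve(A, b)
```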

Reweighting Means
Theorem: the optimization problem
$\text{minimize}_{\beta(x) \geq 0}\ \left\| E_q[k(x, \cdot)] - E_p[\beta(x) k(x, \cdot)] \right\|^2$ subject to $E_p[\beta(x)] = 1$
is convex and its unique solution is β(x) = q(x)/p(x).
Proof.
1. The problem is obviously convex: a convex objective with linear constraints. Moreover, it is bounded from below by 0.
2. Hence it has a unique minimum.
3. The guess β(x) = q(x)/p(x) achieves the minimum value.

Empirical Version
Re-weight the empirical mean in feature space,
$\mu[X, \beta] := \frac{1}{m}\sum_{i=1}^m \beta_i k(x_i, \cdot)$,
such that it is close to the mean on the test set,
$\mu[X'] := \frac{1}{m'}\sum_{i=1}^{m'} k(x'_i, \cdot)$.
Ensure that β_i is a proper reweighting.

Optimization Problem
Quadratic program:
$\text{minimize}_\beta\ \left\| \frac{1}{m}\sum_{i=1}^m \beta_i k(x_i, \cdot) - \frac{1}{m'}\sum_{i=1}^{m'} k(x'_i, \cdot) \right\|^2$ subject to $0 \leq \beta_i \leq B$ and $\sum_{i=1}^m \beta_i = m$
- Upper bound on β_i for regularization
- Summation constraint for a proper reweighted distribution
- Standard solvers available
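
Not from the slides: a sketch of this kernel-mean-matching program. Expanding the squared RKHS norm gives a quadratic in β with matrix $K_{ij} = k(x_i, x_j)$ and linear term $\kappa_i = \sum_j k(x_i, x'_j)$; the constant term is dropped. A general-purpose SLSQP solver stands in for a dedicated QP solver, which would be preferable for large m. Reuses rbf_kernel from the earlier sketch.

```python
import numpy as np
from scipy.optimize import minimize

def kernel_mean_matching(X, Xtest, gamma=1.0, B=10.0):
    """Weights beta minimizing
    || (1/m) sum_i beta_i k(x_i,.) - (1/m') sum_j k(x'_j,.) ||^2
    subject to 0 <= beta_i <= B and sum_i beta_i = m."""
    m, mt = len(X), len(Xtest)
    K = rbf_kernel(X, X, gamma)                      # (m, m)
    kappa = rbf_kernel(X, Xtest, gamma).sum(axis=1)  # sum_j k(x_i, x'_j)

    def objective(beta):
        return beta @ K @ beta / m**2 - 2.0 * beta @ kappa / (m * mt)

    res = minimize(objective, np.ones(m), method="SLSQP",
                   bounds=[(0.0, B)] * m,
                   constraints=[{"type": "eq", "fun": lambda b: b.sum() - m}])
    return res.x
```

With the returned weights, the effective sample size from the consistency slide below reads off as len(X)**2 / (beta**2).sum(), and the weights can be passed to a reweighted learner such as weighted_ridge above.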

Consistency
Theorem: the reweighted set of observations will behave like one drawn from q, with effective sample size $m^2 / \|\beta\|_2^2$.
Proof sketch. The bias of the estimator is proportional to the square root of the value of the objective function.
1. Show that for smooth functions the expected loss is close; this only requires that both feature-map means are close.
2. Show that the expected loss is close to the empirical loss (in a transduction style, i.e. conditioned on X and X').

Regression Toy Example [two figure slides]

Breast Cancer: Bias on Features [figure slide]

Breast Cancer: Bias on Labels [figure slide]

More Experiments [figure slide]

Summary
1. Two Sample Problem: Direct Solution; Kolmogorov-Smirnov Test; Reproducing Kernel Hilbert Spaces; Test Statistics
2. Data Integration: Problem Definition; Examples
3. Attribute Matching: Basic Problem; Linear Assignment Problem
4. Sample Bias Correction: Sample Reweighting; Quadratic Program and Consistency; Experiments
