Kernel Methods. Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton
1 Kernel Methods Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Machine Learning Summer School, Taiwan 2006 Alexander J. Smola: Kernel Methods 1 / 39
2 Course Overview
1. Estimation in exponential families: Maximum Likelihood and Priors; Clifford-Hammersley decomposition
2. Applications: Conditional distributions and kernels; Classification, Regression, Conditional Random Fields
3. Inference and convex duality: Maximum entropy inference; Approximate moment matching
4. Maximum mean discrepancy: Means in feature space; Covariate shift correction
5. Hilbert-Schmidt independence criterion: Covariance in feature space; ICA, Feature selection
5 Outline
1. Two Sample Problem: Direct Solution; Kolmogorov-Smirnov Test; Reproducing Kernel Hilbert Spaces; Test Statistics
2. Data Integration: Problem Definition; Examples
3. Attribute Matching: Basic Problem; Linear Assignment Problem
4. Sample Bias Correction: Sample Reweighting; Quadratic Program and Consistency; Experiments
6 Two Sample Problem
Setting: given $X := \{x_1, \ldots, x_m\} \sim p$ and $Y := \{y_1, \ldots, y_n\} \sim q$, test whether $p = q$.
Applications:
- Cross-platform compatibility of microarrays: need to know whether the distributions are the same.
- Database schema matching: need to know which coordinates match.
- Sample bias correction: need to know how to reweight the data.
- Feature selection: need the features which make the distributions most different.
- Parameter estimation: reduce a two-sample test to a one-sample test.
8 Indirect Solution
Plug-in estimators: estimate $\hat p$ and $\hat q$ and compute $D(\hat p, \hat q)$.
Parzen windows and $L_2$ distance:
$$\hat p(x) = \frac{1}{m}\sum_{i=1}^m \kappa(x_i, x) \quad\text{and}\quad \hat q(y) = \frac{1}{n}\sum_{i=1}^n \kappa(y_i, y)$$
Computing the squared $L_2$ distance between $\hat p$ and $\hat q$ yields
$$\|\hat p - \hat q\|_2^2 = \frac{1}{m^2}\sum_{i,j=1}^m k(x_i, x_j) - \frac{2}{mn}\sum_{i,j=1}^{m,n} k(x_i, y_j) + \frac{1}{n^2}\sum_{i,j=1}^n k(y_i, y_j)$$
where $k(x, x') = \int \kappa(x, t)\,\kappa(x', t)\,dt$.
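The three kernel sums above are easy to evaluate directly. A minimal sketch, assuming one-dimensional data and Gaussian Parzen windows of width $\sigma$ (so that the convolved kernel $k$ is again a Gaussian, of width $\sigma\sqrt{2}$); the function names are mine:

```python
import numpy as np

def conv_kernel(a, b, sigma=1.0):
    # k(x, x') = ∫ κ(x, t) κ(x', t) dt for Gaussian Parzen windows of width σ:
    # the convolution of two Gaussians is a Gaussian of width σ√2
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-d2 / (4 * sigma**2)) / (2 * sigma * np.sqrt(np.pi))

def l2_distance_sq(X, Y, sigma=1.0):
    # ||p̂ - q̂||₂² via the three kernel sums from the slide
    m, n = len(X), len(Y)
    return (conv_kernel(X, X, sigma).sum() / m**2
            - 2 * conv_kernel(X, Y, sigma).sum() / (m * n)
            + conv_kernel(Y, Y, sigma).sum() / n**2)
```

Note that for $X = Y$ the three sums cancel exactly, which is precisely the bias issue raised on the next slide: for finite samples from $p = q$ the statistic does not cancel.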
10 Problems
- Curse of dimensionality when estimating $\hat p$ and $\hat q$.
- Statistical analysis is awkward (multi-stage procedure).
- What to do on strings, images, structured data?
- The quantity is biased: even for $p = q$ its expected value does not vanish.
11 Direct Solution
Key idea: avoid the density estimator, use means directly.
Maximum Mean Discrepancy (Fortet and Mourier, 1953):
$$D(p, q, \mathcal F) := \sup_{f \in \mathcal F}\; E_p[f(x)] - E_q[f(y)]$$
Theorem (via Dudley, 1984): $D(p, q, \mathcal F) = 0$ iff $p = q$, when $\mathcal F = C^0(\mathcal X)$ is the space of continuous, bounded functions on $\mathcal X$.
Theorem (via Steinwart, 2001; Smola et al., 2006): $D(p, q, \mathcal F) = 0$ iff $p = q$, when $\mathcal F = \{f : \|f\|_{\mathcal H} \le 1\}$ is the unit ball in a reproducing kernel Hilbert space $\mathcal H$, provided that $\mathcal H$ is universal.
14 Direct Solution
Proof. If $p = q$ it is clear that $D(p, q, \mathcal F) = 0$ for any $\mathcal F$. If $p \neq q$ there exists some $f \in C^0(\mathcal X)$ such that $E_p[f] - E_q[f] = \epsilon > 0$. Since $\mathcal H$ is universal, we can find some $f^* \in \mathcal H$ with $\|f - f^*\|_\infty \le \epsilon/2$; rescale $f^*$ to fit into the unit ball.
Goals: an empirical estimate of $D(p, q, \mathcal F)$; convergence guarantees.
16 Kolmogorov-Smirnov Statistic
Function class: real-valued functions in one dimension, $\mathcal X = \mathbb R$; $\mathcal F$ contains all functions with total variation at most 1. Key: $\mathcal F$ is the absolute convex hull of the step functions $\xi_{(-\infty, t]}(x)$ for $t \in \mathbb R$.
Optimization problem:
$$\sup_{f \in \mathcal F}\; E_p[f(x)] - E_q[f(y)] = \sup_{t \in \mathbb R}\; E_p[\xi_{(-\infty,t]}(x)] - E_q[\xi_{(-\infty,t]}(y)] = \|F_p - F_q\|_\infty$$
Estimation: use empirical estimates of $F_p$ and $F_q$; use Glivenko-Cantelli to obtain the statistic.
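Since the empirical CDFs only change at sample points, the supremum can be evaluated exactly on the pooled sample. A short sketch (function name is mine):

```python
import numpy as np

def ks_statistic(x, y):
    # sup_t |F_m(t) - G_n(t)|: the empirical CDFs are step functions,
    # so checking the pooled sample points suffices
    xs, ys = np.sort(x), np.sort(y)
    grid = np.concatenate([xs, ys])
    F = np.searchsorted(xs, grid, side="right") / len(xs)
    G = np.searchsorted(ys, grid, side="right") / len(ys)
    return np.abs(F - G).max()
```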
17 Hilbert Space Setting
Function class: reproducing kernel Hilbert space $\mathcal H$ with kernel $k$; evaluation functionals $f(x) = \langle k(x, \cdot), f\rangle$.
Computing means via linearity:
$$E_p[f(x)] = E_p[\langle k(x,\cdot), f\rangle] = \langle \underbrace{E_p[k(x,\cdot)]}_{:=\mu_p}, f\rangle \quad\text{and}\quad \frac{1}{m}\sum_{i=1}^m f(x_i) = \langle \underbrace{\frac{1}{m}\sum_{i=1}^m k(x_i,\cdot)}_{:=\mu_X}, f\rangle$$
Thus means are computed via $\langle \mu_p, f\rangle$ and $\langle \mu_X, f\rangle$.
18 Hilbert Space Setting
Optimization problem:
$$\sup_{\|f\|\le 1}\; E_p[f(x)] - E_q[f(y)] = \sup_{\|f\|\le 1}\; \langle \mu_p - \mu_q, f\rangle = \|\mu_p - \mu_q\|_{\mathcal H}$$
Kernels:
$$\|\mu_p - \mu_q\|_{\mathcal H}^2 = \langle \mu_p - \mu_q, \mu_p - \mu_q\rangle = E_{p,p}\langle k(x,\cdot), k(x',\cdot)\rangle - 2E_{p,q}\langle k(x,\cdot), k(y,\cdot)\rangle + E_{q,q}\langle k(y,\cdot), k(y',\cdot)\rangle = E_{p,p}\,k(x, x') - 2E_{p,q}\,k(x, y) + E_{q,q}\,k(y, y')$$
20 Maximum Mean Discrepancy Statistic
Goal: estimate $D(p, q, \mathcal F)^2 = E_{p,p}\,k(x, x') - 2E_{p,q}\,k(x, y) + E_{q,q}\,k(y, y')$.
U-statistic (empirical estimate):
$$D(X, Y, \mathcal F)^2 = \frac{1}{m(m-1)}\sum_{i \neq j} \underbrace{k(x_i, x_j) - k(x_i, y_j) - k(y_i, x_j) + k(y_i, y_j)}_{=: h((x_i, y_i), (x_j, y_j))}$$
Theorem: $D(X, Y, \mathcal F)^2$ is an unbiased estimator of $D(p, q, \mathcal F)^2$.
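For $m = n$ the U-statistic is just a double sum over the combined kernel $h$ with the diagonal excluded. A minimal sketch with a Gaussian RBF kernel (the bandwidth choice is an arbitrary assumption here):

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    # Gaussian RBF kernel matrix between row-sample matrices A and B
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

def mmd_u(X, Y, sigma=1.0):
    # unbiased U-statistic estimate of D(p, q, F)^2; assumes m == n
    m = X.shape[0]
    H = rbf(X, X, sigma) - rbf(X, Y, sigma) - rbf(Y, X, sigma) + rbf(Y, Y, sigma)
    np.fill_diagonal(H, 0.0)   # U-statistic: sum over i != j only
    return H.sum() / (m * (m - 1))
```

This matches the "double for loop" implementation claim made later: no density estimation, no matrix inversion, just kernel sums.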
21 Distinguishing Normal and Laplace [figure]
22 Uniform Convergence Bound
Theorem (Hoeffding, 1963): for the kernel $\kappa$ of a U-statistic with $|\kappa(x, x')| \le r$ we have
$$\Pr\Big\{\Big|E_p[\kappa(x, x')] - \frac{1}{m(m-1)}\sum_{i\neq j}\kappa(x_i, x_j)\Big| > \epsilon\Big\} \le 2\exp\!\big(-m\epsilon^2/r^2\big)$$
Corollary (MMD convergence):
$$\Pr\{|D(X, Y, \mathcal F) - D(p, q, \mathcal F)| > \epsilon\} \le 2\exp\!\big(-m\epsilon^2/r^2\big)$$
Consequences:
- We have $O(1/\sqrt{m})$ uniform convergence, hence the estimator is consistent.
- We can use this as a test: solve the inequality for a given confidence level $\delta$.
- The bounds can be very loose.
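Solving $2\exp(-m\epsilon^2/r^2) = \delta$ for $\epsilon$ gives an explicit acceptance threshold for the test. A sketch, with $r$ a bound on the U-statistic kernel (function name mine):

```python
import math

def hoeffding_threshold(m, r=1.0, delta=0.05):
    # invert 2 exp(-m eps^2 / r^2) = delta: accept p = q at level delta
    # whenever the empirical MMD deviates from its target by less than eps
    return r * math.sqrt(math.log(2.0 / delta) / m)
```

The threshold shrinks as $1/\sqrt{m}$, matching the convergence rate above.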
25 Asymptotic Bound
Idea: use asymptotic normality of the U-statistic and estimate its variance $\sigma^2$.
Theorem (Hoeffding, 1948): $D(X, Y, \mathcal F)$ is asymptotically normal with variance $4\sigma^2/m$, where
$$\sigma^2 = E_{x,y}\Big[\big(E_{x',y'}[h((x,y),(x',y'))]\big)^2\Big] - \Big(E_{x,y,x',y'}[h((x,y),(x',y'))]\Big)^2$$
Test: estimate $\sigma^2$ from the data; reject the hypothesis that $p = q$ if $D(X, Y, \mathcal F) > 2\alpha\sigma/\sqrt{m}$, where $\alpha$ is the confidence threshold, computed via $\int_\alpha^\infty (2\pi)^{-1/2}\exp(-x^2/2)\,dx = \delta$.
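The threshold $\alpha$ is the upper $\delta$-quantile of the standard normal. A sketch that inverts the survival function by bisection (scipy.stats.norm.isf(delta) would do the same; function names are mine):

```python
import math

def normal_sf(x):
    # P(Z > x) for Z ~ N(0, 1)
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def upper_quantile(delta, lo=-10.0, hi=10.0):
    # bisection: find alpha with P(Z > alpha) = delta
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if normal_sf(mid) > delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mmd_test_threshold(sigma, m, delta=0.05):
    # reject p = q when the MMD estimate exceeds 2 * alpha * sigma / sqrt(m)
    return 2.0 * upper_quantile(delta) * sigma / math.sqrt(m)
```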
28 Outline
1. Two Sample Problem: Direct Solution; Kolmogorov-Smirnov Test; Reproducing Kernel Hilbert Spaces; Test Statistics
2. Data Integration: Problem Definition; Examples
3. Attribute Matching: Basic Problem; Linear Assignment Problem
4. Sample Bias Correction: Sample Reweighting; Quadratic Program and Consistency; Experiments
29 Application: Data Integration
Goal: data comes from various sources; check whether we can combine it.
Comparison against other tests:
- MMD using the uniform convergence bound
- MMD using the asymptotic expansion
- t-test
- Friedman-Rafsky Wolf test
- Friedman-Rafsky Smirnov test
- Hall-Tajvidi test
Important detail: our test needs only a double for-loop to implement; the other tests require spanning trees, matrix inversion, etc.
30 Toy Example: Normal Distributions [figure]
31 Microarray cross-platform comparability
[Table: rows = H0 (Same) accepted / rejected and H0 (Different) accepted / rejected; columns = MMD, t-test, FR Wolf, FR Smirnov; numeric entries lost in transcription.]
Cross-platform comparability tests on the microarray level for cDNA and oligonucleotide platforms. Repetitions: 100; sample size (each): 25; dimension of sample vectors: 2116.
32 Cancer diagnosis
[Table: rows = H0 (Same) accepted / rejected and H0 (Different) accepted / rejected; columns = MMD, t-test, FR Wolf, FR Smirnov; numeric entries lost in transcription.]
Comparing samples from normal and prostate tumor tissues; H0 is the hypothesis that p = q. Repetitions: 100; sample size (each): 25; dimension of sample vectors: 12,600.
33 Tumor subtype tests
[Table: rows = H0 (Same) accepted / rejected and H0 (Different) accepted / rejected; columns = MMD, t-test, FR Wolf, FR Smirnov; numeric entries lost in transcription.]
Comparing samples from different and identical tumor subtypes of lymphoma; H0 is the hypothesis that p = q. Repetitions: 100; sample size (each): 25; dimension of sample vectors: 2,118.
34 Outline
1. Two Sample Problem: Direct Solution; Kolmogorov-Smirnov Test; Reproducing Kernel Hilbert Spaces; Test Statistics
2. Data Integration: Problem Definition; Examples
3. Attribute Matching: Basic Problem; Linear Assignment Problem
4. Sample Bias Correction: Sample Reweighting; Quadratic Program and Consistency; Experiments
35 Application: Attribute Matching
Goal: given two datasets, find the corresponding attributes, using only the distributions over the random variables. This occurs when matching schemas between databases.
Examples: match different sets of dates; match names; decide whether we can merge the databases at all.
Approach: use MMD to measure the distance between the distributions over different coordinates.
38 Application: Attribute Matching [figure]
39 Linear Assignment Problem
Goal: find a good assignment for all pairs of coordinates $(i, j)$:
$$\text{maximize}_\pi\; \sum_{i=1}^m C_{i\pi(i)} \quad\text{where}\quad C_{ij} = D(X_i, X_j, \mathcal F),$$
optimizing over the space of all permutation matrices $\pi$.
Linear programming relaxation:
$$\text{maximize}_\pi\; \operatorname{tr} C^\top \pi \quad\text{subject to}\quad \sum_i \pi_{ij} = 1,\; \sum_j \pi_{ij} = 1,\; \pi_{ij} \ge 0,\; \pi_{ij} \in \{0, 1\}$$
The integrality constraint can be dropped, since the constraint matrix is totally unimodular.
Hungarian Marriage (Kuhn, Munkres, 1953): solves the problem in cubic time.
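For tiny $m$ the assignment problem can be solved by brute force over permutations, which makes the objective above concrete; in practice one would use the Hungarian method (e.g. scipy.optimize.linear_sum_assignment) for its cubic runtime. A sketch (function name mine):

```python
import itertools

def best_assignment(C):
    # maximize sum_i C[i][pi(i)] over all permutations pi
    # (exhaustive O(m!) search, for illustration only; use the
    # Hungarian algorithm for real problems)
    m = len(C)
    return max(itertools.permutations(range(m)),
               key=lambda p: sum(C[i][p[i]] for i in range(m)))
```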
42 Schema Matching with Linear Assignment
Key idea: use $C_{ij} = D(X_i, X_j, \mathcal F)$ as the compatibility criterion.
Results [numeric entries lost in transcription]: datasets BIO (univariate), CNUM (univariate), FOREST (univariate), FOREST10D (multivariate), ENZYME (structured), PROTEINS (structured); the table reports data type, dimension d, sample size m, repetitions, and % correctly matched attributes.
43 Outline
1. Two Sample Problem: Direct Solution; Kolmogorov-Smirnov Test; Reproducing Kernel Hilbert Spaces; Test Statistics
2. Data Integration: Problem Definition; Examples
3. Attribute Matching: Basic Problem; Linear Assignment Problem
4. Sample Bias Correction: Sample Reweighting; Quadratic Program and Consistency; Experiments
44 Sample Bias Correction
The problem: training data $X, Y$ is drawn iid from $\Pr(x, y)$; test data $X', Y'$ is drawn iid from $\Pr'(x', y')$. Simplifying assumption: only the marginals $\Pr(x)$ and $\Pr'(x')$ differ; the conditional distributions $\Pr(y|x)$ are the same.
Applications:
- Medical diagnosis (e.g. cancer detection from microarrays), where training and test sets usually differ substantially
- Active learning
- Experimental design
- Brain-computer interfaces (drifting distributions)
- Adapting to new users
46 Importance Sampling
Swapping distributions: assume that we are allowed to draw from $p$ but want expectations under $q$:
$$E_q[f(x)] = \int \underbrace{\frac{q(x)}{p(x)}}_{\beta(x)} f(x)\,dp(x) = E_p[\beta(x)\,f(x)]$$
Reweighted risk: minimize the reweighted empirical risk plus a regularizer, as in SVMs, regression, or GP classification:
$$\frac{1}{m}\sum_{i=1}^m \beta(x_i)\, l(x_i, y_i, \theta) + \lambda\,\Omega[\theta]$$
Problem: we need to know $\beta(x)$. Problem: we are ignoring the fact that we know the test set.
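A quick numerical check of the identity $E_q[f] = E_p[\beta f]$: with the toy choice $p = N(0,1)$ and $q = N(1,1)$, the density ratio works out to $\beta(x) = \exp(x - 1/2)$, and reweighting samples from $p$ recovers the mean of $q$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 200_000)   # samples from p = N(0, 1)
beta = np.exp(x - 0.5)              # q(x) / p(x) for q = N(1, 1)
est = np.mean(beta * x)             # importance-sampling estimate of E_q[x] = 1
```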
50 Reweighting Means
Theorem: the optimization problem
$$\text{minimize}_{\beta(x) \ge 0}\; \|E_q[k(x, \cdot)] - E_p[\beta(x)\,k(x, \cdot)]\|^2 \quad\text{subject to}\quad E_p[\beta(x)] = 1$$
is convex and its unique solution is $\beta(x) = q(x)/p(x)$.
Proof.
1. The problem is convex: convex objective and linear constraints. Moreover, it is bounded from below by 0.
2. Hence it has a unique minimum.
3. The guess $\beta(x) = q(x)/p(x)$ achieves the minimum value 0.
52 Empirical Version
Reweight the empirical mean in feature space,
$$\mu[X, \beta] := \frac{1}{m}\sum_{i=1}^m \beta_i\, k(x_i, \cdot),$$
such that it is close to the mean on the test set,
$$\mu[X'] := \frac{1}{m'}\sum_{i=1}^{m'} k(x'_i, \cdot).$$
Ensure that the $\beta_i$ form a proper reweighting.
53 Optimization Problem
Quadratic program:
$$\text{minimize}_\beta\; \Big\|\frac{1}{m}\sum_{i=1}^m \beta_i\, k(x_i, \cdot) - \frac{1}{m'}\sum_{i=1}^{m'} k(x'_i, \cdot)\Big\|^2 \quad\text{subject to}\quad 0 \le \beta_i \le B \;\text{and}\; \sum_{i=1}^m \beta_i = m$$
- Upper bound on $\beta_i$ for regularization
- Summation constraint keeps the reweighted distribution normalized
- Standard solvers available
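Expanding the squared norm turns this into a QP in $\beta$ with Gram matrix $K$ and linear term $\kappa_i = (m/m')\sum_j k(x_i, x'_j)$. A minimal sketch that drops the upper bound $B$ and solves by projected gradient on the scaled simplex $\{\beta \ge 0, \sum_i \beta_i = m\}$ (function names, step counts, and the solver choice are mine; a proper QP solver is preferable in practice):

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    # Gaussian RBF kernel matrix between row-sample matrices A and B
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

def project(v, z):
    # Euclidean projection onto {b >= 0, sum(b) = z}
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - z)[0][-1]
    return np.maximum(v - (css[rho] - z) / (rho + 1.0), 0.0)

def kmm_weights(Xtr, Xte, sigma=1.0, steps=2000):
    # minimize 0.5 b'Kb - kappa'b over {b >= 0, sum(b) = m}
    m, mp = len(Xtr), len(Xte)
    K = rbf(Xtr, Xtr, sigma)
    kappa = (m / mp) * rbf(Xtr, Xte, sigma).sum(axis=1)
    beta = np.ones(m)        # feasible starting point
    lr = 1.0 / m             # step <= 1/L since ||K||_2 <= m for an RBF kernel
    for _ in range(steps):
        beta = project(beta - lr * (K @ beta - kappa), m)
    return beta
```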
54 Consistency
Theorem: the reweighted set of observations behaves like one drawn from $q$, with effective sample size $m^2/\|\beta\|^2$.
Proof sketch. The bias of the estimator is proportional to the square root of the value of the objective function.
1. Show that for smooth functions the expected loss is close; this only requires that the two feature-map means are close.
2. Show that the expected loss is close to the empirical loss (in a transduction style, i.e. conditioned on $X$ and $X'$).
56 Regression Toy Example [figure]
57 Regression Toy Example [figure]
58 Breast Cancer - Bias on features [figure]
59 Breast Cancer - Bias on labels [figure]
60 More Experiments [figure]
61 Summary
1. Two Sample Problem: Direct Solution; Kolmogorov-Smirnov Test; Reproducing Kernel Hilbert Spaces; Test Statistics
2. Data Integration: Problem Definition; Examples
3. Attribute Matching: Basic Problem; Linear Assignment Problem
4. Sample Bias Correction: Sample Reweighting; Quadratic Program and Consistency; Experiments
More informationMachine Learning And Applications: Supervised Learning-SVM
Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine
More informationStatistical Learning Theory and the C-Loss cost function
Statistical Learning Theory and the C-Loss cost function Jose Principe, Ph.D. Distinguished Professor ECE, BME Computational NeuroEngineering Laboratory and principe@cnel.ufl.edu Statistical Learning Theory
More informationData Uncertainty, MCML and Sampling Density
Data Uncertainty, MCML and Sampling Density Graham Byrnes International Agency for Research on Cancer 27 October 2015 Outline... Correlated Measurement Error Maximal Marginal Likelihood Monte Carlo Maximum
More informationReferences for online kernel methods
References for online kernel methods W. Liu, J. Principe, S. Haykin Kernel Adaptive Filtering: A Comprehensive Introduction. Wiley, 2010. W. Liu, P. Pokharel, J. Principe. The kernel least mean square
More informationStatistical learning theory, Support vector machines, and Bioinformatics
1 Statistical learning theory, Support vector machines, and Bioinformatics Jean-Philippe.Vert@mines.org Ecole des Mines de Paris Computational Biology group ENS Paris, november 25, 2003. 2 Overview 1.
More informationL5 Support Vector Classification
L5 Support Vector Classification Support Vector Machine Problem definition Geometrical picture Optimization problem Optimization Problem Hard margin Convexity Dual problem Soft margin problem Alexander
More informationMachine Learning. Lecture 9: Learning Theory. Feng Li.
Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell
More informationThe Learning Problem and Regularization
9.520 Class 02 February 2011 Computational Learning Statistical Learning Theory Learning is viewed as a generalization/inference problem from usually small sets of high dimensional, noisy data. Learning
More information1 Review of The Learning Setting
COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #8 Scribe: Changyan Wang February 28, 208 Review of The Learning Setting Last class, we moved beyond the PAC model: in the PAC model we
More informationSTATS 200: Introduction to Statistical Inference. Lecture 29: Course review
STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout
More informationPATTERN RECOGNITION AND MACHINE LEARNING
PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality
More informationMax Margin-Classifier
Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization
More informationGaussian Processes (10/16/13)
STA561: Probabilistic machine learning Gaussian Processes (10/16/13) Lecturer: Barbara Engelhardt Scribes: Changwei Hu, Di Jin, Mengdi Wang 1 Introduction In supervised learning, we observe some inputs
More informationA Bahadur Representation of the Linear Support Vector Machine
A Bahadur Representation of the Linear Support Vector Machine Yoonkyung Lee Department of Statistics The Ohio State University October 7, 2008 Data Mining and Statistical Learning Study Group Outline Support
More informationLecture 3: Statistical Decision Theory (Part II)
Lecture 3: Statistical Decision Theory (Part II) Hao Helen Zhang Hao Helen Zhang Lecture 3: Statistical Decision Theory (Part II) 1 / 27 Outline of This Note Part I: Statistics Decision Theory (Classical
More informationNon-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines
Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2018 CS 551, Fall
More informationLECTURE NOTE #3 PROF. ALAN YUILLE
LECTURE NOTE #3 PROF. ALAN YUILLE 1. Three Topics (1) Precision and Recall Curves. Receiver Operating Characteristic Curves (ROC). What to do if we do not fix the loss function? (2) The Curse of Dimensionality.
More informationClass Prior Estimation from Positive and Unlabeled Data
IEICE Transactions on Information and Systems, vol.e97-d, no.5, pp.1358 1362, 2014. 1 Class Prior Estimation from Positive and Unlabeled Data Marthinus Christoffel du Plessis Tokyo Institute of Technology,
More informationSupport Vector Machines for Classification: A Statistical Portrait
Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem
More informationKernel Methods. Outline
Kernel Methods Quang Nguyen University of Pittsburgh CS 3750, Fall 2011 Outline Motivation Examples Kernels Definitions Kernel trick Basic properties Mercer condition Constructing feature space Hilbert
More informationMachine Learning Basics Lecture 7: Multiclass Classification. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 7: Multiclass Classification Princeton University COS 495 Instructor: Yingyu Liang Example: image classification indoor Indoor outdoor Example: image classification (multiclass)
More informationQualifying Exam in Machine Learning
Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts
More informationBig Hypothesis Testing with Kernel Embeddings
Big Hypothesis Testing with Kernel Embeddings Dino Sejdinovic Department of Statistics University of Oxford 9 January 2015 UCL Workshop on the Theory of Big Data D. Sejdinovic (Statistics, Oxford) Big
More informationA Linear-Time Kernel Goodness-of-Fit Test
A Linear-Time Kernel Goodness-of-Fit Test Wittawat Jitkrittum 1; Wenkai Xu 1 Zoltán Szabó 2 NIPS 2017 Best paper! Kenji Fukumizu 3 Arthur Gretton 1 wittawatj@gmail.com 1 Gatsby Unit, University College
More informationIndirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina
Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection
More informationLecture 9: PGM Learning
13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and
More informationSVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels
SVMs: Non-Separable Data, Convex Surrogate Loss, Multi-Class Classification, Kernels Karl Stratos June 21, 2018 1 / 33 Tangent: Some Loose Ends in Logistic Regression Polynomial feature expansion in logistic
More information1 A Lower Bound on Sample Complexity
COS 511: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #7 Scribe: Chee Wei Tan February 25, 2008 1 A Lower Bound on Sample Complexity In the last lecture, we stopped at the lower bound on
More informationLinear Regression. Machine Learning CSE546 Kevin Jamieson University of Washington. Oct 5, Kevin Jamieson 1
Linear Regression Machine Learning CSE546 Kevin Jamieson University of Washington Oct 5, 2017 1 The regression problem Given past sales data on zillow.com, predict: y = House sale price from x = {# sq.
More informationAnnouncements. Proposals graded
Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:
More informationApproximation Theoretical Questions for SVMs
Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually
More informationStochastic optimization in Hilbert spaces
Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert
More informationIntroduction to Support Vector Machines
Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear
More informationAn Adaptive Test of Independence with Analytic Kernel Embeddings
An Adaptive Test of Independence with Analytic Kernel Embeddings Wittawat Jitkrittum Gatsby Unit, University College London wittawat@gatsby.ucl.ac.uk Probabilistic Graphical Model Workshop 2017 Institute
More informationMachine Learning: Chenhao Tan University of Colorado Boulder LECTURE 9
Machine Learning: Chenhao Tan University of Colorado Boulder LECTURE 9 Slides adapted from Jordan Boyd-Graber Machine Learning: Chenhao Tan Boulder 1 of 39 Recap Supervised learning Previously: KNN, naïve
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem
More informationKernels A Machine Learning Overview
Kernels A Machine Learning Overview S.V.N. Vishy Vishwanathan vishy@axiom.anu.edu.au National ICT of Australia and Australian National University Thanks to Alex Smola, Stéphane Canu, Mike Jordan and Peter
More informationKernel Methods and Support Vector Machines
Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete
More informationLecture 5: GPs and Streaming regression
Lecture 5: GPs and Streaming regression Gaussian Processes Information gain Confidence intervals COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 1 Recall: Non-parametric regression Input space X
More informationSupport Vector Machines
Support Vector Machines Sridhar Mahadevan mahadeva@cs.umass.edu University of Massachusetts Sridhar Mahadevan: CMPSCI 689 p. 1/32 Margin Classifiers margin b = 0 Sridhar Mahadevan: CMPSCI 689 p.
More informationLearning with Rejection
Learning with Rejection Corinna Cortes 1, Giulia DeSalvo 2, and Mehryar Mohri 2,1 1 Google Research, 111 8th Avenue, New York, NY 2 Courant Institute of Mathematical Sciences, 251 Mercer Street, New York,
More informationAn Introduction to Statistical Machine Learning - Theoretical Aspects -
An Introduction to Statistical Machine Learning - Theoretical Aspects - Samy Bengio bengio@idiap.ch Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) CP 592, rue du Simplon 4 1920 Martigny,
More information6.1 Variational representation of f-divergences
ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 6: Variational representation, HCR and CR lower bounds Lecturer: Yihong Wu Scribe: Georgios Rovatsos, Feb 11, 2016
More informationStatistical Methods for SVM
Statistical Methods for SVM Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot,
More informationNearest Neighbor. Machine Learning CSE546 Kevin Jamieson University of Washington. October 26, Kevin Jamieson 2
Nearest Neighbor Machine Learning CSE546 Kevin Jamieson University of Washington October 26, 2017 2017 Kevin Jamieson 2 Some data, Bayes Classifier Training data: True label: +1 True label: -1 Optimal
More informationSupport Vector Machines, Kernel SVM
Support Vector Machines, Kernel SVM Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, 2017 1 / 40 Outline 1 Administration 2 Review of last lecture 3 SVM
More informationIntroduction to Machine Learning Midterm, Tues April 8
Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend
More informationKernel methods, kernel SVM and ridge regression
Kernel methods, kernel SVM and ridge regression Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Collaborative Filtering 2 Collaborative Filtering R: rating matrix; U: user factor;
More informationLearning features to compare distributions
Learning features to compare distributions Arthur Gretton Gatsby Computational Neuroscience Unit, University College London NIPS 2016 Workshop on Adversarial Learning, Barcelona Spain 1/28 Goal of this
More informationlearning bounds for importance weighting Tamas Madarasz & Michael Rabadi April 15, 2015
learning bounds for importance weighting Tamas Madarasz & Michael Rabadi April 15, 2015 Introduction Often, training distribution does not match testing distribution Want to utilize information about test
More informationPolyhedral Computation. Linear Classifiers & the SVM
Polyhedral Computation Linear Classifiers & the SVM mcuturi@i.kyoto-u.ac.jp Nov 26 2010 1 Statistical Inference Statistical: useful to study random systems... Mutations, environmental changes etc. life
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationSupport Vector Machines: Maximum Margin Classifiers
Support Vector Machines: Maximum Margin Classifiers Machine Learning and Pattern Recognition: September 16, 2008 Piotr Mirowski Based on slides by Sumit Chopra and Fu-Jie Huang 1 Outline What is behind
More informationarxiv: v2 [stat.ml] 3 Sep 2015
Nonparametric Independence Testing for Small Sample Sizes arxiv:46.9v [stat.ml] 3 Sep 5 Aaditya Ramdas Dept. of Statistics and Machine Learning Dept. Carnegie Mellon University aramdas@cs.cmu.edu September
More informationCIS 520: Machine Learning Oct 09, Kernel Methods
CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed
More informationHilbert Space Embedding of Probability Measures
Lecture 2 Hilbert Space Embedding of Probability Measures Bharath K. Sriperumbudur Department of Statistics, Pennsylvania State University Machine Learning Summer School Tübingen, 2017 Recap of Lecture
More informationStephen Scott.
1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training
More informationIntroduction to Machine Learning (67577) Lecture 3
Introduction to Machine Learning (67577) Lecture 3 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem General Learning Model and Bias-Complexity tradeoff Shai Shalev-Shwartz
More informationClustering. Professor Ameet Talwalkar. Professor Ameet Talwalkar CS260 Machine Learning Algorithms March 8, / 26
Clustering Professor Ameet Talwalkar Professor Ameet Talwalkar CS26 Machine Learning Algorithms March 8, 217 1 / 26 Outline 1 Administration 2 Review of last lecture 3 Clustering Professor Ameet Talwalkar
More information