Maximum Mean Discrepancy

Maximum Mean Discrepancy
Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton

Alexander J. Smola
Statistical Machine Learning Program, Canberra, ACT 0200, Australia
Alex.Smola@nicta.com.au

ICONIP 2006, Hong Kong, October 3

Outline
1. Two Sample Problem: Direct Solution; Kolmogorov-Smirnov Test; Reproducing Kernel Hilbert Spaces; Test Statistics
2. Data Integration: Problem Definition; Examples
3. Attribute Matching: Basic Problem; Linear Assignment Problem
4. Sample Bias Correction: Sample Reweighting; Quadratic Program and Consistency; Experiments

Two Sample Problem

Setting
Given $X := \{x_1, \ldots, x_m\} \sim p$ and $Y := \{y_1, \ldots, y_n\} \sim q$, test whether $p = q$.

Applications
Cross-platform compatibility of microarrays: need to know whether the distributions are the same.
Database schema matching: need to know which coordinates match.
Sample bias correction: need to know how to reweight the data.
Feature selection: need the features which make the distributions most different.
Parameter estimation: reduce a two-sample test to a one-sample test.

Indirect Solution

Algorithm
Estimate $\hat{p}$ and $\hat{q}$ from the observations, e.g. Parzen windows, mixtures of Gaussians, ...
Compute a distance between $\hat{p}$ and $\hat{q}$, e.g. Kullback-Leibler divergence, $L_2$ distance, $L_1$ distance, ...

Pro
Lots of existing density estimation code available.
Lots of distance measures available.
Theoretical analysis for convergence of density estimators.

Con
Curse of dimensionality for density estimation.
Theoretical analysis is quite tricky (multi-stage procedure).
What to do when density estimation is difficult?
Bias: even for $p = q$ the estimated distance is typically nonzero.

Example: Parzen Windows

$\hat{p}(x) = \frac{1}{m} \sum_{i=1}^{m} \kappa(x_i, x)$ and $\hat{q}(y) = \frac{1}{n} \sum_{i=1}^{n} \kappa(y_i, y)$,
where $\kappa(x, \cdot)$ is a nonnegative function which integrates to 1.

Squared Distance
We use $D(X, Y) := \int (\hat{p}(x) - \hat{q}(x))^2 \, dx$. This yields
$\|\hat{p} - \hat{q}\|_2^2 = \frac{1}{m^2} \sum_{i,j=1}^{m} k(x_i, x_j) - \frac{2}{mn} \sum_{i,j=1}^{m,n} k(x_i, y_j) + \frac{1}{n^2} \sum_{i,j=1}^{n} k(y_i, y_j),$
where $k(x, x') = \int \kappa(x, t)\, \kappa(x', t)\, dt$.
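The squared distance reduces to pairwise kernel sums, so it can be computed without ever forming the density estimates. A minimal NumPy sketch, assuming a Gaussian kernel $k$ with an arbitrarily chosen bandwidth (both are illustrative choices, not part of the slide):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gram matrix with entries k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def parzen_sq_distance(X, Y, sigma=1.0):
    """||p_hat - q_hat||_2^2 expressed purely through the kernel k."""
    m, n = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return Kxx.sum() / m**2 - 2.0 * Kxy.sum() / (m * n) + Kyy.sum() / n**2

# Toy usage: two samples from Gaussians with shifted means.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))
Y = rng.normal(0.5, 1.0, size=(200, 1))
print(parzen_sq_distance(X, Y))
```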

A Simple Example (figure)

Do we need density estimation?

Food for thought
No density estimation is needed if the problem is really simple; in the previous example it suffices to check the means.

General Principle
We can find a simple linear witness function which shows that the distributions are different.

Direct Solution

Key Idea
Avoid the density estimator; use means directly.

Maximum Mean Discrepancy (Fortet and Mourier, 1953)
$D(p, q, \mathcal{F}) := \sup_{f \in \mathcal{F}} E_p[f(x)] - E_q[f(y)]$

Theorem (via Dudley, 1984)
$D(p, q, \mathcal{F}) = 0$ iff $p = q$, when $\mathcal{F} = C^0(X)$ is the space of continuous, bounded functions on $X$.

Theorem (via Steinwart, 2001; Smola et al., 2006)
$D(p, q, \mathcal{F}) = 0$ iff $p = q$, when $\mathcal{F} = \{f : \|f\|_{\mathcal{H}} \le 1\}$ is the unit ball in a Reproducing Kernel Hilbert Space $\mathcal{H}$, provided that $\mathcal{H}$ is universal.

Direct Solution

Proof.
If $p = q$ it is clear that $D(p, q, \mathcal{F}) = 0$ for any $\mathcal{F}$.
If $p \neq q$ there exists some $f \in C^0(X)$ such that $E_p[f] - E_q[f] = \epsilon > 0$.
Since $\mathcal{H}$ is universal, we can find some $f^* \in \mathcal{H}$ with $\|f^* - f\|_\infty \le \epsilon/2$, and rescale $f^*$ to fit into the unit ball.

Goals
An empirical estimate for $D(p, q, \mathcal{F})$.
Convergence guarantees.

Functions of Bounded Variation

Function class: $\int |\partial_x f(x)| \, dx \le C$.
(Figure: Heaviside step functions.)

Kolmogorov-Smirnov Statistic

Function Class
Real-valued in one dimension, $X = \mathbb{R}$.
$\mathcal{F}$ contains all functions with total variation less than 1.
Key: $\mathcal{F}$ is the absolute convex hull of the step functions $\xi_{(-\infty, t]}(x)$ for $t \in \mathbb{R}$.

Optimization Problem
$\sup_{f \in \mathcal{F}} E_p[f(x)] - E_q[f(y)] = \sup_{t \in \mathbb{R}} \left| E_p[\xi_{(-\infty, t]}(x)] - E_q[\xi_{(-\infty, t]}(y)] \right| = \|F_p - F_q\|_\infty$

Estimation
Use empirical estimates of $F_p$ and $F_q$.
Use Glivenko-Cantelli to obtain the statistic.
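For one-dimensional samples the statistic is simply the largest gap between the two empirical CDFs. A short sketch (the helper name ks_statistic is mine, not from the slide):

```python
import numpy as np

def ks_statistic(x, y):
    """sup_t |F_X(t) - F_Y(t)| for two one-dimensional samples."""
    x, y = np.sort(x), np.sort(y)
    t = np.concatenate([x, y])                        # candidate points for the sup
    Fx = np.searchsorted(x, t, side="right") / len(x)
    Fy = np.searchsorted(y, t, side="right") / len(y)
    return np.abs(Fx - Fy).max()

rng = np.random.default_rng(0)
print(ks_statistic(rng.normal(size=500), rng.laplace(size=500)))
```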

Feature Spaces

Feature Map
$\phi(x)$ maps $X$ to some feature space $\mathcal{F}$. Example: quadratic features $\phi(x) = (1, x, x^2)$.

Function Class
Linear functions $f(x) = \langle \phi(x), w \rangle$ where $\|w\| \le 1$. Example: quadratic functions $f(x)$ in the above case.

Kernels
No need to compute $\phi(x)$ explicitly. Use an algorithm which only computes $\langle \phi(x), \phi(x') \rangle =: k(x, x')$.

Reproducing Kernel Hilbert Space Notation
Reproducing property: $\langle f, k(x, \cdot) \rangle = f(x)$.
Equivalence between $\phi(x)$ and $k(x, \cdot)$ via $\langle k(x, \cdot), k(x', \cdot) \rangle = k(x, x')$.

Hilbert Space Setting

Function Class
Reproducing Kernel Hilbert Space $\mathcal{H}$ with kernel $k$; evaluation functionals $f(x) = \langle k(x, \cdot), f \rangle$.

Computing means via linearity
$E_p[f(x)] = E_p[\langle k(x, \cdot), f \rangle] = \langle \underbrace{E_p[k(x, \cdot)]}_{:= \mu_p}, f \rangle$
$\frac{1}{m} \sum_{i=1}^{m} f(x_i) = \frac{1}{m} \sum_{i=1}^{m} \langle k(x_i, \cdot), f \rangle = \langle \underbrace{\tfrac{1}{m} \sum_{i=1}^{m} k(x_i, \cdot)}_{:= \mu_X}, f \rangle$
So expectations and empirical means are computed via $\langle \mu_p, f \rangle$ and $\langle \mu_X, f \rangle$.

Hilbert Space Setting

Optimization Problem
$\sup_{\|f\| \le 1} E_p[f(x)] - E_q[f(y)] = \sup_{\|f\| \le 1} \langle \mu_p - \mu_q, f \rangle = \|\mu_p - \mu_q\|_{\mathcal{H}}$

Kernels
$\|\mu_p - \mu_q\|_{\mathcal{H}}^2 = \langle \mu_p - \mu_q, \mu_p - \mu_q \rangle$
$= E_{p,p} \langle k(x, \cdot), k(x', \cdot) \rangle - 2 E_{p,q} \langle k(x, \cdot), k(y, \cdot) \rangle + E_{q,q} \langle k(y, \cdot), k(y', \cdot) \rangle$
$= E_{p,p} k(x, x') - 2 E_{p,q} k(x, y) + E_{q,q} k(y, y')$

Witness Functions

Optimization Problem
The expression $\langle \mu_p - \mu_q, f \rangle$ is maximized for $f = \mu_p - \mu_q$ (rescaled to the unit ball).

Empirical Equivalent
The expectations $\mu_p$ and $\mu_q$ are hard to compute, so we replace them by empirical means. This yields
$f(z) = \frac{1}{m} \sum_{i=1}^{m} k(x_i, z) - \frac{1}{n} \sum_{i=1}^{n} k(y_i, z)$
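A small sketch of evaluating this empirical witness on a grid, e.g. to see where a normal and a Laplace sample differ (as in the figure below); the Gaussian kernel and its bandwidth are assumed choices:

```python
import numpy as np

def witness(z, X, Y, sigma=1.0):
    """f(z) = mean_i k(x_i, z) - mean_j k(y_j, z) with a Gaussian kernel."""
    k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma**2))
    return k(z, X).mean(axis=1) - k(z, Y).mean(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=1000)               # sample from p (normal)
Y = rng.laplace(size=1000)              # sample from q (Laplace)
z = np.linspace(-4.0, 4.0, 9)
print(np.round(witness(z, X, Y), 3))    # positive where p puts more mass than q
```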

Distinguishing Normal and Laplace (figure)

Maximum Mean Discrepancy Statistic

Goal: estimate the squared discrepancy
$D^2(p, q, \mathcal{F}) = E_{p,p} k(x, x') - 2 E_{p,q} k(x, y) + E_{q,q} k(y, y')$

U-Statistic: empirical estimate (for $m = n$)
$D^2(X, Y, \mathcal{F}) = \frac{1}{m(m-1)} \sum_{i \neq j} \underbrace{k(x_i, x_j) - k(x_i, y_j) - k(y_i, x_j) + k(y_i, y_j)}_{=: h((x_i, y_i), (x_j, y_j))}$

Theorem
$D^2(X, Y, \mathcal{F})$ is an unbiased estimator of $D^2(p, q, \mathcal{F})$.
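A minimal sketch of this unbiased estimator, assuming equal sample sizes $m = n$ and a Gaussian kernel with an assumed bandwidth:

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """U-statistic (1 / m(m-1)) * sum_{i != j} h((x_i, y_i), (x_j, y_j))."""
    m = len(X)
    assert len(Y) == m, "this simple estimator pairs samples, so it needs m == n"
    Kxx = gaussian_gram(X, X, sigma)
    Kyy = gaussian_gram(Y, Y, sigma)
    Kxy = gaussian_gram(X, Y, sigma)
    H = Kxx + Kyy - Kxy - Kxy.T          # h evaluated on every pair (i, j)
    np.fill_diagonal(H, 0.0)             # the U-statistic drops the i == j terms
    return H.sum() / (m * (m - 1))

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(300, 2))
Y = rng.normal(0.2, 1.0, size=(300, 2))
print(mmd2_unbiased(X, Y))
```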

Uniform Convergence Bound

Theorem (Hoeffding, 1963)
For the kernel of a U-statistic $\kappa(x, x')$ with $|\kappa(x, x')| \le r$ we have
$\Pr\left\{ \left| E_p[\kappa(x, x')] - \frac{1}{m(m-1)} \sum_{i \neq j} \kappa(x_i, x_j) \right| > \epsilon \right\} \le 2 \exp\left( -\frac{m \epsilon^2}{r^2} \right)$

Corollary (MMD Convergence)
$\Pr\left\{ \left| D^2(X, Y, \mathcal{F}) - D^2(p, q, \mathcal{F}) \right| > \epsilon \right\} \le 2 \exp\left( -\frac{m \epsilon^2}{r^2} \right)$

Consequences
We have $O(1/\sqrt{m})$ uniform convergence, hence the estimator is consistent.
We can use this as a test: solve the inequality for a given confidence level $\delta$.
The bounds can be very loose.
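A hedged sketch of the resulting distribution-free test: solve $2\exp(-m\epsilon^2/r^2) = \delta$ for $\epsilon$ and reject $p = q$ when the estimate exceeds that threshold. The bound $r$ on the U-statistic kernel is an assumption added here; for a kernel with $0 \le k \le K$ one may take $r = 2K$ ($r = 2$ for the Gaussian kernel).

```python
import numpy as np

def hoeffding_threshold(m, delta=0.05, r=2.0):
    """Smallest eps with 2 exp(-m eps^2 / r^2) <= delta."""
    return np.sqrt(r**2 * np.log(2.0 / delta) / m)

def mmd_test_hoeffding(mmd2_value, m, delta=0.05, r=2.0):
    """Reject the null hypothesis p = q if True."""
    return mmd2_value > hoeffding_threshold(m, delta, r)

print(hoeffding_threshold(m=1000))   # roughly 0.12 for delta = 0.05, r = 2
```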

Asymptotic Bound

Idea
Use the asymptotic normality of the U-statistic and estimate its variance $\sigma^2$.

Theorem (Hoeffding, 1948)
$D^2(X, Y, \mathcal{F})$ is asymptotically normal with variance $\frac{4\sigma^2}{m}$, where
$\sigma^2 = E_{x,y}\!\left[ \left( E_{x',y'}[h((x, y), (x', y'))] \right)^2 \right] - \left( E_{x,y,x',y'}[h((x, y), (x', y'))] \right)^2$

Test
Estimate $\sigma^2$ from data. Reject the hypothesis that $p = q$ if $D^2(X, Y, \mathcal{F}) > 2\alpha\sigma/\sqrt{m}$, where $\alpha$ is the confidence threshold, computed via $\int_{\alpha}^{\infty} (2\pi)^{-1/2} \exp(-x^2/2) \, dx = \delta$.
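A sketch of the asymptotic test: estimate $\sigma^2$ by plugging empirical means of $h$ into the formula above, then compare the MMD$^2$ estimate against $2\alpha\sigma/\sqrt{m}$. The Gaussian kernel (via scikit-learn's rbf_kernel) and this particular plug-in variance estimate are assumptions of the sketch, not prescriptions from the slide.

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics.pairwise import rbf_kernel

def asymptotic_mmd_test(X, Y, delta=0.05, gamma=0.5):
    m = len(X)
    H = rbf_kernel(X, X, gamma=gamma) + rbf_kernel(Y, Y, gamma=gamma)
    Kxy = rbf_kernel(X, Y, gamma=gamma)
    H -= Kxy + Kxy.T                     # h((x_i, y_i), (x_j, y_j)) for all pairs
    np.fill_diagonal(H, 0.0)
    mmd2 = H.sum() / (m * (m - 1))       # unbiased U-statistic
    row_means = H.sum(axis=1) / (m - 1)  # plug-in estimates of E'[h | (x_i, y_i)]
    sigma2 = max(np.mean(row_means**2) - mmd2**2, 0.0)
    alpha = norm.ppf(1.0 - delta)        # one-sided standard-normal quantile
    threshold = 2.0 * alpha * np.sqrt(sigma2) / np.sqrt(m)
    return mmd2, threshold, mmd2 > threshold

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(250, 2))
Y = rng.normal(0.3, 1.0, size=(250, 2))
print(asymptotic_mmd_test(X, Y))
```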

Application: Data Integration

Goal
Data comes from various sources; check whether we can combine it.

Comparison
MMD using the uniform convergence bound
MMD using the asymptotic expansion
t-test
Friedman-Rafsky Wolf test
Friedman-Rafsky Smirnov test
Hall-Tajvidi test

Important Detail
Our test only needs a double for loop to implement. Other tests require spanning trees, matrix inversion, etc.

Toy Example: Normal Distributions (figure)

Microarray cross-platform comparability

Platforms    H0         MMD   t-test   FR Wolf   FR Smirnov
Same         accepted   100   100      93        95
Same         rejected     0     0       7         5
Different    accepted     0    95       0        29
Different    rejected   100     5     100        71

Cross-platform comparability tests on the microarray level for cDNA and oligonucleotide platforms.
Repetitions: 100; sample size (each): 25; dimension of sample vectors: 2116.

Cancer diagnosis

Health status   H0         MMD   t-test   FR Wolf   FR Smirnov
Same            accepted   100   100      97        98
Same            rejected     0     0       3         2
Different       accepted     0   100       0        38
Different       rejected   100     0     100        62

Comparing samples from normal and prostate tumor tissues; H0 is the hypothesis that p = q.
Repetitions: 100; sample size (each): 25; dimension of sample vectors: 12,600.

Tumor subtype tests

Subtype     H0         MMD   t-test   FR Wolf   FR Smirnov
Same        accepted   100   100      95        96
Same        rejected     0     0       5         4
Different   accepted     0   100       0        22
Different   rejected   100     0     100        78

Comparing samples from different and identical tumor subtypes of lymphoma; H0 is the hypothesis that p = q.
Repetitions: 100; sample size (each): 25; dimension of sample vectors: 2,118.

Application: Attribute Matching

Goal
Given two datasets, find corresponding attributes using only the distributions over the random variables. This occurs when matching schemas between databases.

Examples
Match different sets of dates.
Match names.
Can we merge the databases at all?

Approach
Use MMD to measure the distance between the distributions over different coordinates.

Application: Attribute Matching (figure)

Linear Assignment Problem

Goal
Find a good assignment for all pairs of coordinates $(i, j)$:
$\text{maximize}_{\pi} \sum_{i=1}^{m} C_{i\pi(i)}$ where $C_{ij} = D(X_i, X_j, \mathcal{F})$,
optimizing over the space of all permutation matrices $\pi$.

Linear Programming Relaxation
$\text{maximize}_{\pi} \sum_{i,j} C_{ij}\pi_{ij}$ subject to $\sum_i \pi_{ij} = 1$, $\sum_j \pi_{ij} = 1$, $\pi_{ij} \ge 0$, and $\pi_{ij} \in \{0, 1\}$.
The integrality constraint can be dropped, as the constraint matrix is totally unimodular.

Hungarian Marriage (Kuhn, Munkres, 1953)
Solves the problem in cubic time.
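In practice the assignment step can be handed to an off-the-shelf Hungarian-method solver. A small sketch using SciPy; the 3 x 3 compatibility matrix C below is a hypothetical toy example, not data from the talk:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_attributes(C):
    """Return pi as an array with pi[i] = j maximizing sum_i C[i, pi(i)]."""
    row_ind, col_ind = linear_sum_assignment(-C)   # SciPy minimizes, so negate C
    return col_ind[np.argsort(row_ind)]

# Hypothetical compatibility matrix C[i, j] = D(X_i, X_j, F).
C = np.array([[0.9, 0.1, 0.2],
              [0.2, 0.8, 0.1],
              [0.1, 0.3, 0.7]])
print(match_attributes(C))   # prints [0 1 2]
```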

Schema Matching with Linear Assignment

Key Idea
Use $D(X_i, X_j, \mathcal{F}) = C_{ij}$ as the compatibility criterion.

Results
Dataset     Data     d    m     rept.   % correct
BIO         uni      6    377   100      92.0
CNUM        uni      14   386   100      99.8
FOREST      uni      10   538   100     100.0
FOREST10D   multi    2    1000  100     100.0
ENZYME      struct   6    50    50      100.0
PROTEINS    struct   2    200   50      100.0

Sample Bias Correction

The Problem
Training data $X, Y$ is drawn iid from $\Pr(x, y)$.
Test data $X', Y'$ is drawn iid from $\Pr'(x', y')$.
Simplifying assumption: only the marginals $\Pr(x)$ and $\Pr'(x')$ differ; the conditional distributions $\Pr(y \mid x)$ are the same.

Applications
In medical diagnosis (e.g. cancer detection from microarrays) we usually have very different training and test sets.
Active learning
Experimental design
Brain computer interfaces (drifting distributions)
Adapting to new users

Importance Sampling

Swapping Distributions
Assume that we are allowed to draw from $p$ but want expectations under $q$:
$E_q[f(x)] = \int \underbrace{\frac{q(x)}{p(x)}}_{\beta(x)} f(x) \, dp(x) = E_p\!\left[ \frac{q(x)}{p(x)} f(x) \right]$

Reweighted Risk
Minimize the reweighted empirical risk plus a regularizer, as in SVMs, regression, or GP classification:
$\frac{1}{m} \sum_{i=1}^{m} \beta(x_i)\, l(x_i, y_i, \theta) + \lambda \Omega[\theta]$

Problem: we need to know $\beta(x)$.
Problem: we are ignoring the fact that we know the test set.
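As an illustration of the reweighted risk, a minimal sketch of a weighted hinge loss with an L2 regularizer; the weights beta would come from the mean-matching procedure on the following slides, and all names here are mine:

```python
import numpy as np

def reweighted_hinge_risk(theta, X, y, beta, lam=0.1):
    """(1/m) * sum_i beta_i * max(0, 1 - y_i <theta, x_i>) + lam * ||theta||^2."""
    margins = y * (X @ theta)
    losses = np.maximum(0.0, 1.0 - margins)
    return np.mean(beta * losses) + lam * np.dot(theta, theta)

# With beta_i = 1 this reduces to the usual (unweighted) regularized hinge risk.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))
print(reweighted_hinge_risk(np.zeros(3), X, y, beta=np.ones(100)))   # equals 1.0
```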

Reweighting Means

Theorem
The optimization problem
$\text{minimize}_{\beta(x) \ge 0} \ \left\| E_q[k(x, \cdot)] - E_p[\beta(x) k(x, \cdot)] \right\|^2$ subject to $E_p[\beta(x)] = 1$
is convex and its unique solution is $\beta(x) = q(x)/p(x)$.

Proof.
1. The problem is obviously convex: convex objective and linear constraints. Moreover, it is bounded from below by 0.
2. Hence it has a unique minimum.
3. The guess $\beta(x) = q(x)/p(x)$ achieves the minimum value.

Empirical Version

Re-weight the empirical mean in feature space,
$\mu[X, \beta] := \frac{1}{m} \sum_{i=1}^{m} \beta_i \, k(x_i, \cdot),$
such that it is close to the mean on the test set,
$\mu[X'] := \frac{1}{m'} \sum_{i=1}^{m'} k(x'_i, \cdot).$
Ensure that $\beta_i$ is a proper reweighting.

Optimization Problem

Quadratic Program
$\text{minimize}_{\beta} \ \left\| \frac{1}{m} \sum_{i=1}^{m} \beta_i k(x_i, \cdot) - \frac{1}{m'} \sum_{i=1}^{m'} k(x'_i, \cdot) \right\|^2$
subject to $0 \le \beta_i \le B$ and $\sum_{i=1}^{m} \beta_i = m$.

Upper bound on $\beta_i$ for regularization.
Summation constraint for a proper reweighted distribution.
Standard QP solvers are available.
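A hedged sketch of this quadratic program (kernel mean matching), solved here with SciPy's general SLSQP routine rather than a dedicated QP solver; the Gaussian kernel, its bandwidth gamma, and the cap B are assumed choices:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics.pairwise import rbf_kernel

def kernel_mean_matching(X_tr, X_te, gamma=0.5, B=10.0):
    m, m_te = len(X_tr), len(X_te)
    K = rbf_kernel(X_tr, X_tr, gamma=gamma)                      # m x m Gram matrix
    kappa = (m / m_te) * rbf_kernel(X_tr, X_te, gamma=gamma).sum(axis=1)
    # Expanding the squared RKHS norm gives, up to a constant and a 1/m^2 factor,
    #   beta' K beta - 2 kappa' beta.
    fun = lambda b: b @ K @ b - 2.0 * kappa @ b
    jac = lambda b: 2.0 * (K @ b - kappa)
    cons = [{"type": "eq", "fun": lambda b: b.sum() - m}]        # sum_i beta_i = m
    res = minimize(fun, x0=np.ones(m), jac=jac, method="SLSQP",
                   bounds=[(0.0, B)] * m, constraints=cons)
    return res.x

# The returned beta_i then serve as the weights in the reweighted risk above.
rng = np.random.default_rng(0)
X_tr = rng.normal(0.0, 1.0, size=(80, 1))
X_te = rng.normal(0.5, 1.0, size=(60, 1))
print(kernel_mean_matching(X_tr, X_te)[:5])
```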

Consistency

Theorem
The reweighted set of observations behaves like one drawn from $q$, with effective sample size $m^2 / \|\beta\|_2^2$.

Proof.
The bias of the estimator is proportional to the square root of the value of the objective function.
1. Show that for smooth functions the expected loss is close; this only requires that both feature-map means are close.
2. Show that the expected loss is close to the empirical loss (in a transduction style, i.e. conditioned on $X$ and $X'$).

Regression Toy Example (figures)

Breast Cancer - Bias on features (figure)

Breast Cancer - Bias on labels (figure)

More Experiments (figure)

Summary
1. Two Sample Problem: Direct Solution; Kolmogorov-Smirnov Test; Reproducing Kernel Hilbert Spaces; Test Statistics
2. Data Integration: Problem Definition; Examples
3. Attribute Matching: Basic Problem; Linear Assignment Problem
4. Sample Bias Correction: Sample Reweighting; Quadratic Program and Consistency; Experiments