Forefront of the Two Sample Problem


Forefront of the Two Sample Problem: From classical to state-of-the-art methods. Yuchi Matsuoka

What is the Two Sample Problem? Given $X_1, \ldots, X_m \sim P$ i.i.d. and $Y_1, \ldots, Y_n \sim Q$ i.i.d., the two sample problem asks: is $P = Q$? Examples: Is there a difference in blood glucose levels between two groups? Is there a difference in test scores between two schools?

Parametric Approach: assume a distributional form and compare empirical moments. Two-sample t-tests (1st-order moment). F-tests (2nd-order moment). If the parametric assumptions are not satisfied, these tests cannot work well.

Ex: Welch's t-test and the F-test. Clearly, they cannot take 3rd- or higher-order moments into account.
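The limitation above can be seen numerically. A minimal sketch (ours, not from the slides): two samples with matched mean and variance but different skewness, on which a test of first moments has essentially no power.

```python
# Two samples with the same mean and variance but different skewness.
# Welch's t-test compares only first moments, so it cannot detect this.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m = n = 500
x = rng.normal(0.0, 1.0, m)            # N(0, 1): symmetric
y = rng.exponential(1.0, n) - 1.0      # Exp(1) - 1: mean 0, variance 1, skewed

t_stat, t_p = stats.ttest_ind(x, y, equal_var=False)  # Welch's t-test
print(f"Welch t-test p-value: {t_p:.3f}")  # typically large: the means agree
```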

Nonparametric Approach (Kolmogorov–Smirnov test): for one-dimensional variables, compare the empirical distribution functions $F_m$ and $G_n$. The KS test statistic is $D_{m,n} = \sup_t |F_m(t) - G_n(t)|$.
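A short sketch of the KS test on the same kind of alternative as before (equal mean and variance, different shape); the statistic $\sup_t |F_m(t) - G_n(t)|$ sees the whole distribution, not just a few moments:

```python
# KS two-sample test: sensitive to any difference in distribution shape.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 500)
y = rng.exponential(1.0, 500) - 1.0    # same mean/variance, different shape

d, p = stats.ks_2samp(x, y)            # D = sup_t |F_m(t) - G_n(t)| and p-value
print(f"KS statistic D = {d:.3f}, p-value = {p:.2e}")
```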

Nonparametric Approach (Mann–Whitney U test): define the kernel $h(x, y) = \mathbf{1}\{x < y\}$; the corresponding U-statistic is $U = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} \mathbf{1}\{X_i < Y_j\}$. The Mann–Whitney U statistic is the number of pairs with $X_i < Y_j$ over all $mn$ possible combinations.
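As a sketch, the U statistic can be computed directly as a count over all $mn$ pairs and checked against SciPy. One caveat we are assuming away: depending on the SciPy version and its convention, the reported statistic may correspond to pairs with $x_i > y_j$ rather than $x_i < y_j$, i.e. it may equal $mn - U$.

```python
# Mann-Whitney U computed from first principles as a pair count.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 40)
y = rng.normal(0.5, 1.0, 60)

u_manual = np.sum(x[:, None] > y[None, :])       # pairs with x_i > y_j
u_scipy = stats.mannwhitneyu(x, y).statistic     # SciPy's U statistic for x
print(u_manual, u_scipy)
```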

Asymptotic properties (for more general kernels, see Van der Vaart (2000)). Consider a symmetric kernel $h(x, y)$ and the U-statistic $U_n = \binom{n}{2}^{-1} \sum_{i<j} h(X_i, X_j)$ with $\theta = E[h(X_1, X_2)]$. Assume $E[h(X_1, X_2)^2] < \infty$. Then $\sqrt{n}\,(U_n - \theta) \xrightarrow{d} N(0, 4\zeta_1)$, where $\zeta_1 = \mathrm{Var}\big(E[h(X_1, X_2) \mid X_1]\big)$.

Extension to kernel methods. Idea: RKHS embedding of distributions.

Relationship with the usual kernel method. Feature map: $\Phi(x) = k(\cdot, x)$, with the reproducing property $\langle f, k(\cdot, x)\rangle_{\mathcal{H}} = f(x)$. Measure map (kernel mean embedding): $\mu_P = E_{X \sim P}[k(\cdot, X)]$, with the reproducing property $\langle f, \mu_P\rangle_{\mathcal{H}} = E_{X \sim P}[f(X)]$. We write $\mathcal{H}$ for the RKHS.
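The measure map above can be sketched numerically with an empirical embedding $\hat{\mu}_P = \frac{1}{m}\sum_i k(\cdot, X_i)$ (the kernel, bandwidth, and data here are our choices for illustration): by the reproducing property, $\langle k(\cdot, z), \hat{\mu}_P\rangle = \frac{1}{m}\sum_i k(z, X_i)$, and $\langle \hat{\mu}_P, \hat{\mu}_Q\rangle = \frac{1}{mn}\sum_{i,j} k(X_i, Y_j)$, all without an explicit feature space.

```python
# Empirical kernel mean embeddings evaluated via the reproducing property.
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel matrix between rows of a and rows of b."""
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))           # sample from P
y = rng.normal(loc=0.5, size=(80, 2))   # sample from Q
z = np.array([[0.3, -0.7]])             # evaluation point

mu_at_z = gauss_kernel(z, x).mean()     # <k(., z), mu_P> = mean_i k(z, x_i)
inner_pq = gauss_kernel(x, y).mean()    # <mu_P, mu_Q> = (1/mn) sum_ij k(x_i, y_j)
print(mu_at_z, inner_pq)
```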

Why kernels? Characteristic kernels retain all the moment information. This can be checked easily: consider the Taylor expansion of a characteristic kernel, e.g. the Gaussian kernel, and take expectations under any distribution. They can also be defined for any kind of data: of course multi-dimensional data, but furthermore structured data such as strings.

Applications of Kernel Methods. Feature map perspective: SVM, Kernel PCA, Kernel CCA, Kernel FDA, Kernel Ridge Regression, SVR, etc. Measure map perspective: Kernel Two Sample Test, HSIC, Kernel Dimensionality Reduction, Kernel Bayes' Rule, Kernel Monte Carlo Filter, Kernel Spectral Algorithm for HMMs, Support Measure Machines, etc.

Measure the distance between distributions in the RKHS. MMD: maximum mean discrepancy. To conduct the test, we define the distance between distributions as the MMD: $\mathrm{MMD}(P, Q) = \|\mu_P - \mu_Q\|_{\mathcal{H}}$. If $P = Q$, the MMD becomes 0 (and for a characteristic kernel, only then).

Empirical estimator of MMD. Replacing the kernel means with their empirical estimators gives $\widehat{\mathrm{MMD}}^2 = \frac{1}{m^2}\sum_{i,j} k(X_i, X_j) - \frac{2}{mn}\sum_{i,j} k(X_i, Y_j) + \frac{1}{n^2}\sum_{i,j} k(Y_i, Y_j)$. We use its unbiased version: $\mathrm{MMD}_u^2 = \frac{1}{m(m-1)}\sum_{i \neq j} k(X_i, X_j) + \frac{1}{n(n-1)}\sum_{i \neq j} k(Y_i, Y_j) - \frac{2}{mn}\sum_{i,j} k(X_i, Y_j)$.
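A minimal sketch of the unbiased estimator with a Gaussian kernel (the kernel choice and bandwidth are ours): the diagonal terms are dropped from the within-sample averages, which is exactly what makes the estimator unbiased.

```python
# Unbiased estimator MMD_u^2 with a Gaussian RBF kernel.
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_unbiased(x, y, sigma=1.0):
    m, n = len(x), len(y)
    kxx, kyy = gauss_kernel(x, x, sigma), gauss_kernel(y, y, sigma)
    kxy = gauss_kernel(x, y, sigma)
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))   # i != j only
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * kxy.mean()

rng = np.random.default_rng(0)
p = rng.normal(0.0, 1.0, (200, 1))
q_same = rng.normal(0.0, 1.0, (200, 1))     # same distribution as p
q_diff = rng.normal(1.0, 1.0, (200, 1))     # shifted mean
print(mmd2_unbiased(p, q_same), mmd2_unbiased(p, q_diff))
```

For equal distributions the estimate hovers around 0; for the shifted pair it is clearly positive.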

Relationship with U-statistics: $\mathrm{MMD}_u^2$ is a two-sample U-statistic with kernel $h((x, x'), (y, y')) = k(x, x') + k(y, y') - k(x, y') - k(x', y)$. We can get its asymptotic null distribution by applying the theory of U-statistics!

Difficulty: recall that the theorem on asymptotic normality of U-statistics involves the quantity $\zeta_1 = \mathrm{Var}\big(E[h \mid (X_1, Y_1)]\big)$. In this case, under the null hypothesis, it becomes 0: the U-statistic is degenerate. This implies that we have to multiply by something growing faster than $\sqrt{N}$ so that the asymptotic distribution does not degenerate.

Key Theorem: let the total sample size be $N = m + n$ with $m = n$, and assume $E[k(X, X')^2] < \infty$. Then, under the null hypothesis, $m\,\mathrm{MMD}_u^2 \xrightarrow{d} \sum_{i=1}^{\infty} \lambda_i (z_i^2 - 2)$, where the $z_i \sim N(0, 2)$ are i.i.d. (Proof: see Fukumizu (2010).)

What are the $\lambda_i$? They are the non-zero eigenvalues of the integral operator on $L^2(P)$ with the centered kernel $\tilde{k}$, i.e. the non-negative real values satisfying $\int \tilde{k}(x, x')\,\psi_i(x)\,dP(x) = \lambda_i\,\psi_i(x')$.

Proposed method to get a critical value: we have to find the $\lambda_i$. They can be estimated consistently by the eigenvalues of the centered Gram matrix $\tilde{K} = HKH$, with $H = I - \frac{1}{N}\mathbf{1}\mathbf{1}^{\top}$, suitably normalized by the sample size.
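The whole recipe can be sketched in a few lines, in the spirit of Gretton et al. (2009) (the bandwidth, sample, and exact normalization of the eigenvalues are our choices; scaling conventions vary): estimate the $\lambda_i$ from the centered Gram matrix of the pooled sample, simulate $\sum_i \lambda_i (z_i^2 - 2)$ with $z_i \sim N(0, 2)$, and read off the $(1 - \alpha)$ quantile as the critical value for the rescaled $\mathrm{MMD}_u^2$ statistic.

```python
# Spectral estimate of the null distribution of the rescaled MMD_u^2.
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    d2 = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
pooled = rng.normal(size=(100, 1))          # pooled sample (under H0, all from P)
N = len(pooled)

K = gauss_kernel(pooled, pooled)
H = np.eye(N) - np.ones((N, N)) / N         # centering matrix
lam = np.linalg.eigvalsh(H @ K @ H) / N     # eigenvalue estimates lambda_i
lam = lam[lam > 1e-10]                      # keep the non-zero ones

# Simulate sum_i lambda_i * (z_i^2 - 2), z_i ~ N(0, 2) i.i.d.
z2 = rng.normal(0.0, np.sqrt(2.0), size=(5000, len(lam))) ** 2
null_samples = (z2 - 2.0) @ lam
crit = np.quantile(null_samples, 0.95)      # critical value at level 0.05
print(f"estimated 95% critical value: {crit:.3f}")
```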

Power analysis (synthetic data): number of times the null hypothesis is rejected in 100 trials. The kernel two sample test is superior to the other tests in terms of power.

Significance analysis (synthetic data): Type I error rate over 5000 trials. All tests attain the nominal level.

Multidimensional case: iris data (setosa and virginica). Labeled 5-dimensional data ($m = 50$, $n = 50$, $N = 100$). Is there a difference between these feature distributions? For setosa vs. setosa, the null hypothesis should not be rejected. Histogram: estimated null distribution. Red line: critical value. Dashed line: test statistic.

High-dimensional case: MNIST data (hand-written digits). Labeled (1–9) data with 784 features; each feature is a binarized pixel (0 or 1). Is it possible to distinguish these digits based on their distributions? Can the kernel two sample test overcome the curse of dimensionality?

Digit 1 vs. digit 2: compare the group of 1s and the group of 2s, each with about 4000 samples. Also divide the 1s into two groups and compare 1 vs. 1. Histogram: estimated null distribution. Red line: critical value. Dashed line: test statistic.

References
A. Gretton, et al. A fast, consistent kernel two-sample test. Advances in Neural Information Processing Systems (2009).
A. Gretton, et al. A kernel two-sample test. Journal of Machine Learning Research 13: 723-773 (2012).
K. Fukumizu. Introduction to Kernel Methods: Data Analysis with Positive Definite Kernels (in Japanese). Asakura Shoten (2010).
K. Muandet, et al. Kernel Mean Embedding of Distributions: A Review and Beyond. arXiv:1605.09522 (2016).
A. W. van der Vaart. Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics (2000).