Least-Squares Independence Test

IEICE Transactions on Information and Systems, vol.E94-D, no.6, pp.1333-1336, 2011.

Masashi Sugiyama
Tokyo Institute of Technology, Japan
PRESTO, Japan Science and Technology Agency, Tokyo, Japan
sugi@cs.titech.ac.jp
http://sugiyama-www.cs.titech.ac.jp/~sugi/

Taiji Suzuki
The University of Tokyo, Tokyo, Japan
s-taiji@stat.t.u-tokyo.ac.jp

Abstract

Identifying the statistical independence of random variables is one of the important tasks in statistical data analysis. In this paper, we propose a novel non-parametric independence test based on a least-squares density-ratio estimator. Our method, called the least-squares independence test (LSIT), is distribution-free, and thus it is more flexible than parametric approaches. Furthermore, it is equipped with a model selection procedure based on cross-validation. This is a significant advantage over existing non-parametric approaches, which often require manual parameter tuning. The usefulness of the proposed method is shown through numerical experiments.

Keywords: independence test, density-ratio estimation, unconstrained least-squares importance fitting, squared-loss mutual information.

1 Introduction

Identifying the statistical independence of random variables is one of the fundamental tasks in statistical data analysis. Independence tests can be used for various purposes such as feature selection [9] and causal inference [10].

A traditional independence measure is the Pearson correlation coefficient, which can be used for detecting linear dependency. Thus, it is useful for Gaussian data, although the Gaussian assumption is rarely fulfilled in practice. Recently, kernel-based independence measures have been studied in order to overcome the weakness of the Pearson correlation coefficient.

The Hilbert-Schmidt independence criterion (HSIC) [4] utilizes cross-covariance operators on universal reproducing kernel Hilbert spaces (RKHSs) [7], which are an infinite-dimensional generalization of covariance matrices. HSIC allows efficient detection of non-linear dependency by making use of the reproducing property of RKHSs [1]. However, HSIC has a critical weakness: its performance depends on the choice of RKHS, and there is no theoretically justified way to determine the RKHS properly. In practice, using the Gaussian RKHS with width set to the median distance between samples is a popular heuristic [4].

Another popular independence criterion is mutual information [2]. In this paper, we consider a squared-loss variant of mutual information and use its estimator, least-squares mutual information (LSMI) [9], for independence testing. LSMI is distribution-free like HSIC, but it is equipped with a natural model selection procedure based on cross-validation, which is an advantage over HSIC. Through experiments, we show the usefulness of the LSMI-based independence test, called the least-squares independence test (LSIT).

2 Least-Squares Independence Test

In this section, we propose a novel non-parametric independence test.

2.1 Formulation

Let x (∈ X ⊂ R^{d_x}) be an input feature and y (∈ Y ⊂ R^{d_y}) be an output feature, which follow a joint probability distribution with density p(x, y). Suppose we are given a set of i.i.d. paired samples {(x_i, y_i)}_{i=1}^n. Our goal is to test whether x and y are statistically independent or not, based on the samples {(x_i, y_i)}_{i=1}^n.

For the independence test, we employ the squared-loss mutual information (SMI) defined as follows:

  SMI := (1/2) ∫∫ ( p(x, y) / (p(x)p(y)) − 1 )² p(x) p(y) dx dy,   (1)

where p(x) and p(y) are the marginal densities of x and y, respectively. SMI is the Pearson divergence [6] from the joint density p(x, y) to the product of marginals p(x)p(y), and SMI is zero if and only if x and y are statistically independent. Hence, SMI can be used for detecting the statistical independence of random variables.

SMI includes the unknown probability densities p(x, y), p(x), and p(y), and thus it cannot be directly computed. A naive approach is to estimate the densities p(x, y), p(x), and p(y), and plug the estimated densities into Eq.(1). However, since density estimation is known to be a hard task and division by estimated densities can magnify the estimation error, we consider estimating the following density-ratio function directly:

  r(x, y) := p(x, y) / (p(x)p(y)).   (2)
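
For intuition, Eq.(1) can be evaluated exactly in the discrete case, where the integrals become finite sums. A minimal NumPy sketch:

```python
import numpy as np

def smi_discrete(pxy):
    """Squared-loss mutual information, Eq.(1), for a discrete joint pmf.

    pxy: 2-D array with pxy[i, j] = p(x = i, y = j), entries summing to one.
    """
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    ratio = pxy / (px * py)               # r(x, y) = p(x, y) / (p(x) p(y)), Eq.(2)
    return 0.5 * np.sum((ratio - 1.0) ** 2 * px * py)

# Independent joint, p(x, y) = p(x) p(y): SMI = 0
p_indep = np.outer([0.3, 0.7], [0.6, 0.4])
# Dependent joint, mass concentrated on the diagonal: SMI > 0
p_dep = np.array([[0.45, 0.05],
                  [0.05, 0.45]])

print(smi_discrete(p_indep))  # 0.0
print(smi_discrete(p_dep))    # positive
```

The sketch confirms that SMI vanishes exactly when the joint density factorizes into the product of its marginals.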

Once an estimator r̂(x, y) of the density ratio r(x, y) is obtained, SMI can be approximated using the samples as follows:

  ŜMI := (1/n) Σ_{i=1}^n r̂(x_i, y_i) − (1/(2n²)) Σ_{i,j=1}^n r̂(x_i, y_j)² − 1/2.   (3)

2.2 Least-Squares Mutual Information

Here we explain the least-squares mutual information (LSMI) method [9], which directly learns r(x, y) from the data samples without going through density estimation of p(x, y), p(x), and p(y).

Let us approximate the density ratio (2) using the following model:

  r_α(x, y) := Σ_{l=1}^b α_l ψ_l(x, y) = α^T ψ(x, y),

where ψ(x, y): R^{d_x} × R^{d_y} → R^b is a non-negative basis function vector, α (∈ R^b) is a parameter vector, and ^T denotes the transpose. We determine the parameter α so that the following squared error J_0 is minimized:

  J_0(α) := (1/2) ∫∫ (r_α(x, y) − r(x, y))² p(x) p(y) dx dy
          = (1/2) ∫∫ r_α(x, y)² p(x) p(y) dx dy − ∫∫ r_α(x, y) p(x, y) dx dy + Const.

Let us denote the first two terms by J. Since J contains expectations over the unknown densities p(x, y), p(x), and p(y), we approximate the expectations by empirical averages. By including an ℓ₂-regularizer, the LSMI optimization problem is formulated as follows:

  α̂ := argmin_{α ∈ R^b} [ (1/2) α^T Ĥ α − α^T ĥ + (λ/2) α^T α ],   (4)

where λ (≥ 0) is the regularization parameter that controls the strength of regularization, and

  Ĥ := (1/n²) Σ_{i,j=1}^n ψ(x_i, y_j) ψ(x_i, y_j)^T,   ĥ := (1/n) Σ_{i=1}^n ψ(x_i, y_i).

The solution α̂ can be obtained analytically as

  α̂ = (Ĥ + λ I_b)^{-1} ĥ,   (5)

where I_b is the b-dimensional identity matrix. Finally, the density-ratio estimator r̂(x, y) is given by

  r̂(x, y) := α̂^T ψ(x, y).

Once the density-ratio estimator r̂(x, y) is obtained, SMI can be approximated by Eq.(3). Thanks to the analytic-form expression, LSMI is computationally very efficient. Furthermore, the above least-squares density-ratio estimator was shown to possess the optimal non-parametric convergence rate and optimal numerical stability [8, 5].
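
Given the basis-function values on the paired samples {(x_i, y_i)} and on all pairs (x_i, y_j), Eqs.(3)-(5) reduce to a few lines of linear algebra. A minimal NumPy sketch (the function name and interface are illustrative):

```python
import numpy as np

def lsmi_from_basis(Psi_joint, Psi_prod, lam):
    """LSMI with a fixed basis, Eqs.(3)-(5).

    Psi_joint: (n, b) matrix with rows psi(x_i, y_i)            (paired samples)
    Psi_prod:  (n*n, b) matrix with rows psi(x_i, y_j), all i,j (product of marginals)
    lam:       regularization parameter lambda
    Returns (SMI_hat, alpha_hat).
    """
    b = Psi_joint.shape[1]
    H_hat = Psi_prod.T @ Psi_prod / Psi_prod.shape[0]            # (1/n^2) sum psi psi^T
    h_hat = Psi_joint.mean(axis=0)                               # (1/n) sum psi(x_i, y_i)
    alpha_hat = np.linalg.solve(H_hat + lam * np.eye(b), h_hat)  # Eq.(5)

    r_joint = Psi_joint @ alpha_hat        # r_hat(x_i, y_i)
    r_prod = Psi_prod @ alpha_hat          # r_hat(x_i, y_j)
    smi_hat = r_joint.mean() - 0.5 * np.mean(r_prod ** 2) - 0.5  # Eq.(3)
    return smi_hat, alpha_hat
```

Separating the basis construction from the estimation step makes it easy to plug in different kernels and to reuse the solver inside cross-validation.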

2.3 Basis Function Choice and Model Selection

Given that both {x_i}_{i=1}^n and {y_i}_{i=1}^n are normalized so that every element of x and y has unit variance, we use the following basis functions:

  ψ_l(x, y) = exp( −‖x − x_l‖² / (2σ²) ) exp( −‖y − y_l‖² / (2σ²) )   for regression,
  ψ_l(x, y) = exp( −‖x − x_l‖² / (2σ²) ) δ(y = y_l)                   for classification,

where δ(c) = 1 if the condition c is true and δ(c) = 0 otherwise. We may also include a constant basis function ϕ(x, y) = 1 in addition to the above kernel basis functions.

The practical performance of LSMI depends on the choice of the kernel width σ and the regularization parameter λ. Model selection for LSMI is possible based on cross-validation with respect to the criterion J. More specifically, the sample set Z = {(x_i, y_i)}_{i=1}^n is divided into M disjoint subsets {Z_m}_{m=1}^M. Then an LSMI solution r̂_m(x, y) is obtained using Z\Z_m (i.e., all samples except Z_m), and its J-score for the hold-out samples Z_m is computed as

  Ĵ_m^{CV} := (1/(2|Z_m|²)) Σ_{(x,y)∈Z_m} Σ_{(x̃,ỹ)∈Z_m} r̂_m(x, ỹ)² − (1/|Z_m|) Σ_{(x,y)∈Z_m} r̂_m(x, y),

where |Z| denotes the number of elements in the set Z. This procedure is repeated for m = 1, ..., M, and the average score

  Ĵ^{CV} := (1/M) Σ_{m=1}^M Ĵ_m^{CV}

is computed. Finally, the model (the kernel width σ and the regularization parameter λ in the current setup) that minimizes Ĵ^{CV} is chosen as the most suitable one.
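
In code, model selection amounts to evaluating the hold-out J-score above for each candidate pair (σ, λ) and picking the minimizer. A sketch for the regression-type Gaussian basis, reusing lsmi_from_basis from the previous sketch (all names and defaults are illustrative):

```python
import numpy as np
from itertools import product

def gauss_basis(X, Y, Xc, Yc, sigma):
    """psi_l(x, y) = exp(-||x - x_l||^2 / (2 sigma^2)) exp(-||y - y_l||^2 / (2 sigma^2))."""
    dx = ((X[:, None, :] - Xc[None, :, :]) ** 2).sum(axis=2)
    dy = ((Y[:, None, :] - Yc[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-(dx + dy) / (2 * sigma ** 2))

def holdout_j_score(alpha, X_te, Y_te, Xc, Yc, sigma):
    """Hold-out J-score of Section 2.3 on one test fold."""
    n_te = len(X_te)
    ii, jj = np.meshgrid(np.arange(n_te), np.arange(n_te), indexing="ij")
    r_prod = gauss_basis(X_te[ii.ravel()], Y_te[jj.ravel()], Xc, Yc, sigma) @ alpha
    r_joint = gauss_basis(X_te, Y_te, Xc, Yc, sigma) @ alpha
    return 0.5 * np.mean(r_prod ** 2) - np.mean(r_joint)

def select_model(X, Y, sigmas, lams, M=5, rng=np.random.default_rng(0)):
    """Choose (sigma, lambda) minimizing the M-fold cross-validation J-score."""
    n = len(X)
    fold = rng.permutation(n) % M
    best = (np.inf, None)
    for sigma, lam in product(sigmas, lams):
        scores = []
        for m in range(M):
            tr, te = fold != m, fold == m
            Xc, Yc = X[tr], Y[tr]                 # kernel centers = training points (assumption)
            n_tr = tr.sum()
            ii, jj = np.meshgrid(np.arange(n_tr), np.arange(n_tr), indexing="ij")
            Psi_joint = gauss_basis(X[tr], Y[tr], Xc, Yc, sigma)
            Psi_prod = gauss_basis(X[tr][ii.ravel()], Y[tr][jj.ravel()], Xc, Yc, sigma)
            _, alpha = lsmi_from_basis(Psi_joint, Psi_prod, lam)
            scores.append(holdout_j_score(alpha, X[te], Y[te], Xc, Yc, sigma))
        if np.mean(scores) < best[0]:
            best = (np.mean(scores), (sigma, lam))
    return best[1]
```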

2.4 Permutation Test

Our independence test procedure is based on the permutation test [3]. We first run LSMI using the original dataset Z = {(x_i, y_i)}_{i=1}^n, and obtain an SMI estimate ŜMI(Z). Next, we randomly permute {y_i}_{i=1}^n and form a shuffled dataset Z̃ = {(x_i, y_{τ(i)})}_{i=1}^n, where τ(·) is a randomly chosen permutation function. Then we run LSMI again using the randomly shuffled dataset Z̃, and obtain an SMI estimate ŜMI(Z̃). Note that the random permutation eliminates the dependency between x and y (if it exists), so ŜMI(Z̃) would take a value close to zero. This random permutation procedure is repeated many times, and the distribution of ŜMI(Z̃) under the null hypothesis (i.e., x and y are independent) is constructed. Finally, the p-value is approximated by evaluating the relative ranking of ŜMI(Z) in the distribution of ŜMI(Z̃). We refer to this procedure as the least-squares independence test (LSIT).

A MATLAB® implementation of LSIT is available from http://sugiyama-www.cs.titech.ac.jp/~sugi/software/lsit/.
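
The permutation procedure itself is independent of how ŜMI is computed. A minimal sketch, where estimate_smi stands for any routine returning an SMI estimate such as Eq.(3) with cross-validated (σ, λ):

```python
import numpy as np

def permutation_pvalue(X, Y, estimate_smi, n_perm=1000, rng=np.random.default_rng(0)):
    """Approximate the p-value of the independence test by random permutations.

    estimate_smi(X, Y) -> float is assumed to return an SMI estimate such as Eq.(3).
    """
    smi_obs = estimate_smi(X, Y)                  # SMI on the original pairing Z
    smi_null = np.empty(n_perm)
    for k in range(n_perm):
        Y_perm = Y[rng.permutation(len(Y))]       # break the x-y pairing
        smi_null[k] = estimate_smi(X, Y_perm)     # SMI on the shuffled dataset
    # relative ranking of the observed statistic in the null distribution
    # (the +1 terms give the usual conservative permutation p-value)
    return (1 + np.sum(smi_null >= smi_obs)) / (1 + n_perm)
```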

3 Experiments

In this section, we report experimental results.

3.1 Numerical Illustration

First, we illustrate how the proposed LSIT method behaves using the following toy datasets with one-dimensional input x and one-dimensional output y:

(A) Regression: x ∼ U(a, b), where U(a, b) denotes the uniform distribution on (a, b); y is drawn from a uniform distribution independently of x (Independent), or y ∼ N(sin(x/π), σ²) (Dependent), where N(µ, σ²) denotes the normal distribution with mean µ and variance σ².

(B) Classification: y ∼ B(0.5), where B(p) denotes the binomial distribution on {−1, +1} with the probability of +1 being p; x follows a two-component Gaussian mixture of the form 0.5 N(µ₁, σ²) + 0.5 N(µ₂, σ²) independently of y (Independent), or a Gaussian distribution whose mean depends on y (Dependent).

Examples of realized samples are plotted in Figure 1 for a fixed sample size n, where {x_i}_{i=1}^n and {y_i}_{i=1}^n are normalized to have unit variance. Figure 2 depicts the distributions of ŜMI(Z̃) and the value of ŜMI(Z). The graphs show that reasonable p-values were obtained for all four cases. Figure 3 depicts the p-values and the frequency of accepting the null hypothesis (i.e., x and y are independent) as functions of the sample size n. The graphs show that LSIT works reasonably well.

Figure 1: Toy datasets. (a) Regression/Independent; (b) Regression/Dependent; (c) Classification/Independent; (d) Classification/Dependent.

3.2 Performance Comparison

Here we compare the performance of LSMI and HSIC under the common permutation-test framework. HSIC is the state-of-the-art measure of statistical independence utilizing Gaussian kernels [4]. The performance of HSIC depends on the choice of the Gaussian width, and to the best of our knowledge, there is no theoretically justified method to determine the kernel width. Here we use the standard heuristic of setting the Gaussian width to the median distance between samples, which was also adopted in the original paper [4].

Figure 2: Distributions of ŜMI(Z̃) (randomly shuffled samples) for the toy datasets, with the value of ŜMI(Z) (original samples) marked. (a) Regression/Independent; (b) Regression/Dependent; (c) Classification/Independent; (d) Classification/Dependent.

Figure 3: Experimental results for the toy datasets. Left: mean and standard deviation of p-values across runs, as functions of the sample size n. Right: frequency of accepting the null hypothesis across runs under the significance level 0.05.

We generate data samples by

  (x, y)^T = ( cos θ  −sin θ ; sin θ  cos θ ) (x', y')^T,

where x' and y' each independently follow a two-component Gaussian mixture of the form 0.5 N(µ₁, σ²) + 0.5 N(µ₂, σ²). Thus, (x, y) is a rotation of (x', y') by angle θ. We conduct experiments for θ = 0 (i.e., x and y are independent) and θ = π/8 (i.e., x and y are dependent). Data samples {x_i}_{i=1}^n and {y_i}_{i=1}^n are normalized to have unit variance (see Figure 4).

The results plotted in Figure 5 show that the proposed LSIT has a type-I error (rejecting correct null hypotheses) comparable to HSIC, with a far smaller type-II error (accepting incorrect null hypotheses).

Next, we vary the Gaussian width of HSIC and evaluate how the performance changes. In Figure 5, HSIC(c) denotes HSIC with the Gaussian width multiplied by c. The results show that, although the type-I error does not really change with respect to c, the type-II error is heavily affected by the choice of the Gaussian kernel width. For this dataset, the median heuristic does not work well, and c = 1/2 works the best. However, the optimally-tuned HSIC is still outperformed by the proposed LSIT, which is automatically tuned based on cross-validation and does not involve manual parameter tuning.
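
For concreteness, a small sketch of this data-generation scheme (the mixture means and variances below are illustrative placeholders, not the values used in the experiments):

```python
import numpy as np

def rotation_dataset(n, theta, rng=np.random.default_rng(0)):
    """Rotate independent bimodal marginals by angle theta (Section 3.2).

    The mixture means and variances here are illustrative placeholders.
    """
    comp = rng.integers(0, 2, size=(n, 2))            # mixture component per coordinate
    means = np.where(comp == 0, -1.0, 1.0)            # assumed component means -1 and +1
    xy = means + rng.normal(scale=0.5, size=(n, 2))   # assumed component std 0.5
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    xy = xy @ R.T                                     # (x, y) = R (x', y')
    xy /= xy.std(axis=0)                              # normalize to unit variance
    return xy[:, :1], xy[:, 1:]                       # X, Y as (n, 1) arrays

# theta = 0 gives independent x and y; theta = pi / 8 makes them dependent.
X0, Y0 = rotation_dataset(200, 0.0)
X1, Y1 = rotation_dataset(200, np.pi / 8)
```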

Finally, we use the MNIST handwritten digit dataset for further performance evaluation. Each digit image (representing an integer in {0, 1, 2, ..., 9}) consists of 784 (= 28 × 28) pixels, each of which takes an integer value between 0 and 255 representing its gray-scale intensity level. Here, we label the data as "small" for digits 0, 1, 2, 3, and 4, and "large" for digits 5, 6, 7, 8, and 9. We randomly choose a fixed number of samples and randomly shuffle the labels of a (1 − η)-fraction of them. Thus, increasing η from 0 to 1 corresponds to increasing the dependence between digit patterns and labels. The results are plotted in Figure 6, showing that the proposed LSIT has a slightly larger type-I error than HSIC (i.e., a lower acceptance rate when η = 0), but the type-II error of LSIT is slightly smaller than that of HSIC (i.e., a lower acceptance rate when η > 0). For this dataset, the optimally-tuned HSIC (c = 1/2) slightly outperforms the automatically-tuned LSIT.

Figure 4: Rotation datasets. x and y are independent if θ = 0, and they are dependent when θ = π/8.

Figure 5: Experimental results for the rotation datasets. The frequency of accepting the null hypothesis across runs under the significance level 0.05 is depicted for LSIT, HSIC, and HSIC(c) with several values of c, as a function of the sample size n.

Figure 6: Experimental results for the MNIST datasets. The frequency of accepting the null hypothesis across runs under the significance level 0.05 is depicted as a function of η.

4 Discussions and Conclusions

We proposed a novel non-parametric method of independence testing based on an estimator of a squared-loss variant of mutual information called least-squares mutual information [9]. The proposed method, which we call the least-squares independence test (LSIT), can overcome the limitation of the state-of-the-art method, the Hilbert-Schmidt independence criterion (HSIC) [4], which is not equipped with a model selection procedure for the Gaussian kernel width. Through experiments, we confirmed that the proposed LSIT compares favorably with the HSIC-based independence test.

Acknowledgment

MS was supported by SCAT, AOARD, and the JST PRESTO program. TS was supported by MEXT Grant-in-Aid for Young Scientists (B) 789.

References

[1] N. Aronszajn, Theory of reproducing kernels, Transactions of the American Mathematical Society, vol.68, pp.337-404, 1950.

[2] T.M. Cover and J.A. Thomas, Elements of Information Theory, 2nd ed., John Wiley & Sons, Inc., 2006.

[3] B. Efron and R.J. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall, 1993.

[4] A. Gretton, K. Fukumizu, C.H. Teo, L. Song, B. Schölkopf, and A. Smola, A kernel statistical test of independence, in Advances in Neural Information Processing Systems, pp.585-592, 2008.

[5] T. Kanamori, T. Suzuki, and M. Sugiyama, Condition number analysis of kernel-based density ratio estimation, tech. rep., arXiv, 2009.

[6] K. Pearson, On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling, Philosophical Magazine, vol.50, pp.157-175, 1900.

[7] I. Steinwart, On the influence of the kernel on the consistency of support vector machines, Journal of Machine Learning Research, vol.2, pp.67-93, 2001.

[8] T. Suzuki and M. Sugiyama, Sufficient dimension reduction via squared-loss mutual information estimation, International Conference on Artificial Intelligence and Statistics, 2010.

[9] T. Suzuki, M. Sugiyama, T. Kanamori, and J. Sese, Mutual information estimation reveals global associations between stimuli and biological processes, BMC Bioinformatics, vol.10, no.1, p.S52, 2009.

[10] M. Yamada and M. Sugiyama, Dependence minimizing regression with model selection for non-linear causal inference under non-Gaussian noise, AAAI Conference on Artificial Intelligence, pp.643-648, 2010.