Least-Squares Independence Test

IEICE Transactions on Information and Systems, vol.E94-D, no.6, pp.1333-1336, 2011.

Masashi Sugiyama
Tokyo Institute of Technology, Japan
PRESTO, Japan Science and Technology Agency, Tokyo, Japan
sugi@cs.titech.ac.jp
http://sugiyama-www.cs.titech.ac.jp/~sugi/

Taiji Suzuki
The University of Tokyo, Tokyo, Japan
s-taiji@stat.t.u-tokyo.ac.jp

Abstract

Identifying the statistical independence of random variables is one of the important tasks in statistical data analysis. In this paper, we propose a novel non-parametric independence test based on a least-squares density-ratio estimator. Our method, called the least-squares independence test (LSIT), is distribution-free, and thus it is more flexible than parametric approaches. Furthermore, it is equipped with a model selection procedure based on cross-validation. This is a significant advantage over existing non-parametric approaches, which often require manual parameter tuning. The usefulness of the proposed method is shown through numerical experiments.

Keywords: independence test, density-ratio estimation, unconstrained least-squares importance fitting, squared-loss mutual information.

1 Introduction

Identifying the statistical independence of random variables is one of the fundamental tasks in statistical data analysis. Independence tests can be used for various purposes such as feature selection [9] and causal inference [10].

A traditional independence measure is the Pearson correlation coefficient, which can be used for detecting linear dependency. Thus, it is useful for Gaussian data, although the Gaussian assumption is rarely fulfilled in practice. Recently, kernel-based independence measures have been studied in order to overcome the weakness of the Pearson correlation coefficient.

The Hilbert-Schmidt independence criterion (HSIC) [4] utilizes cross-covariance operators on universal reproducing kernel Hilbert spaces (RKHSs) [7], which are an infinite-dimensional generalization of covariance matrices. HSIC allows efficient detection of non-linear dependency by making use of the reproducing property of RKHSs [1]. However, HSIC has a critical weakness: its performance depends on the choice of RKHS, and there is no theoretically justified way to determine the RKHS properly. In practice, using the Gaussian RKHS with width set to the median distance between samples is a popular heuristic [4].

Another popular independence criterion is mutual information [2]. In this paper, we consider a squared-loss variant of mutual information and use its estimator, least-squares mutual information (LSMI) [9], for independence testing. LSMI is distribution-free like HSIC, but it is equipped with a natural model selection procedure based on cross-validation, which is an advantage over HSIC. Through experiments, we show the usefulness of the LSMI-based independence test, called the least-squares independence test (LSIT).

2 Least-Squares Independence Test

In this section, we propose a novel non-parametric independence test.

2.1 Formulation

Let x (∈ X ⊂ R^{d_x}) be an input feature and y (∈ Y ⊂ R^{d_y}) be an output feature, which follow a joint probability distribution with density p(x, y). Suppose we are given a set of i.i.d. paired samples {(x_i, y_i)}_{i=1}^n. Our goal is to test whether x and y are statistically independent or not, based on the samples {(x_i, y_i)}_{i=1}^n.

For the independence test, we employ the squared-loss mutual information (SMI) defined as follows:

  SMI := (1/2) ∫∫ ( p(x, y) / (p(x)p(y)) − 1 )² p(x) p(y) dx dy,   (1)

where p(x) and p(y) are the marginal densities of x and y, respectively. SMI is the Pearson divergence [6] from the joint density p(x, y) to the product of marginals p(x)p(y), and SMI is zero if and only if x and y are statistically independent. Hence, SMI can be used for detecting the statistical independence of random variables.

SMI includes the unknown probability densities p(x, y), p(x), and p(y), and thus it cannot be directly computed. A naive approach is to estimate the densities p(x, y), p(x), and p(y), and plug the estimated densities into Eq.(1). However, since density estimation is known to be a hard task and division by estimated densities can magnify the estimation error, we consider estimating the following density-ratio function directly:

  r(x, y) := p(x, y) / (p(x)p(y)).   (2)
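
For intuition, Eq.(1) can be evaluated exactly in the discrete case, where the integrals become finite sums. A minimal NumPy sketch:

```python
import numpy as np

def smi_discrete(pxy):
    """Squared-loss mutual information, Eq.(1), for a discrete joint pmf.

    pxy: 2-D array with pxy[i, j] = p(x = i, y = j), entries summing to one.
    """
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    ratio = pxy / (px * py)               # r(x, y) = p(x, y) / (p(x) p(y)), Eq.(2)
    return 0.5 * np.sum((ratio - 1.0) ** 2 * px * py)

# Independent joint, p(x, y) = p(x) p(y): SMI = 0
p_indep = np.outer([0.3, 0.7], [0.6, 0.4])
# Dependent joint, mass concentrated on the diagonal: SMI > 0
p_dep = np.array([[0.45, 0.05],
                  [0.05, 0.45]])

print(smi_discrete(p_indep))  # 0.0
print(smi_discrete(p_dep))    # positive
```

The sketch confirms that SMI vanishes exactly when the joint density factorizes into the product of its marginals.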

Once an estimator r̂(x, y) of the density ratio r(x, y) is obtained, SMI can be approximated using the samples as follows:

  ŜMI := (1/n) Σ_{i=1}^n r̂(x_i, y_i) − (1/(2n²)) Σ_{i,j=1}^n r̂(x_i, y_j)² − 1/2.   (3)

2.2 Least-Squares Mutual Information

Here we explain the least-squares mutual information (LSMI) method [9], which directly learns r(x, y) from the data samples without going through density estimation of p(x, y), p(x), and p(y).

Let us approximate the density ratio (2) using the following model:

  r_α(x, y) := Σ_{l=1}^b α_l ψ_l(x, y) = α^T ψ(x, y),

where ψ(x, y): R^{d_x} × R^{d_y} → R^b is a non-negative basis function vector, α (∈ R^b) is a parameter vector, and ^T denotes the transpose. We determine the parameter α so that the following squared error J_0 is minimized:

  J_0(α) := (1/2) ∫∫ (r_α(x, y) − r(x, y))² p(x) p(y) dx dy
          = (1/2) ∫∫ r_α(x, y)² p(x) p(y) dx dy − ∫∫ r_α(x, y) p(x, y) dx dy + Const.

Let us denote the first two terms by J. Since J contains expectations over the unknown densities p(x, y), p(x), and p(y), we approximate the expectations by empirical averages. By including an ℓ₂-regularizer, the LSMI optimization problem is formulated as follows:

  α̂ := argmin_{α ∈ R^b} [ (1/2) α^T Ĥ α − α^T ĥ + (λ/2) α^T α ],   (4)

where λ (≥ 0) is the regularization parameter that controls the strength of regularization, and

  Ĥ := (1/n²) Σ_{i,j=1}^n ψ(x_i, y_j) ψ(x_i, y_j)^T,   ĥ := (1/n) Σ_{i=1}^n ψ(x_i, y_i).

The solution α̂ can be obtained analytically as

  α̂ = (Ĥ + λ I_b)^{-1} ĥ,   (5)

where I_b is the b-dimensional identity matrix. Finally, the density-ratio estimator r̂(x, y) is given by

  r̂(x, y) := α̂^T ψ(x, y).

Once the density-ratio estimator r̂(x, y) is obtained, SMI can be approximated by Eq.(3). Thanks to the analytic-form expression, LSMI is computationally very efficient. Furthermore, the above least-squares density-ratio estimator was shown to possess the optimal non-parametric convergence rate and optimal numerical stability [8, 5].
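
Given the basis-function values on the paired samples {(x_i, y_i)} and on all pairs (x_i, y_j), Eqs.(3)-(5) reduce to a few lines of linear algebra. A minimal NumPy sketch (the function name and interface are illustrative):

```python
import numpy as np

def lsmi_from_basis(Psi_joint, Psi_prod, lam):
    """LSMI with a fixed basis, Eqs.(3)-(5).

    Psi_joint: (n, b) matrix with rows psi(x_i, y_i)            (paired samples)
    Psi_prod:  (n*n, b) matrix with rows psi(x_i, y_j), all i,j (product of marginals)
    lam:       regularization parameter lambda
    Returns (SMI_hat, alpha_hat).
    """
    b = Psi_joint.shape[1]
    H_hat = Psi_prod.T @ Psi_prod / Psi_prod.shape[0]            # (1/n^2) sum psi psi^T
    h_hat = Psi_joint.mean(axis=0)                               # (1/n) sum psi(x_i, y_i)
    alpha_hat = np.linalg.solve(H_hat + lam * np.eye(b), h_hat)  # Eq.(5)

    r_joint = Psi_joint @ alpha_hat        # r_hat(x_i, y_i)
    r_prod = Psi_prod @ alpha_hat          # r_hat(x_i, y_j)
    smi_hat = r_joint.mean() - 0.5 * np.mean(r_prod ** 2) - 0.5  # Eq.(3)
    return smi_hat, alpha_hat
```

Separating the basis construction from the estimation step makes it easy to plug in different kernels and to reuse the solver inside cross-validation.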

2.3 Basis Function Choice and Model Selection

Given that both {x_i}_{i=1}^n and {y_i}_{i=1}^n are normalized so that every element of x and y has unit variance, we use the following basis functions:

  ψ_l(x, y) = exp( −‖x − x_l‖² / (2σ²) ) exp( −‖y − y_l‖² / (2σ²) )   for regression,
  ψ_l(x, y) = exp( −‖x − x_l‖² / (2σ²) ) δ(y = y_l)                   for classification,

where δ(c) = 1 if the condition c is true and δ(c) = 0 otherwise. We may also include a constant basis function ϕ(x, y) = 1 in addition to the above kernel basis functions.

The practical performance of LSMI depends on the choice of the kernel width σ and the regularization parameter λ. Model selection for LSMI is possible based on cross-validation with respect to the criterion J. More specifically, the sample set Z = {(x_i, y_i)}_{i=1}^n is divided into M disjoint subsets {Z_m}_{m=1}^M. Then an LSMI solution r̂_m(x, y) is obtained using Z\Z_m (i.e., all samples except Z_m), and its J-score for the hold-out samples Z_m is computed as

  Ĵ_m^{CV} := (1/(2|Z_m|²)) Σ_{(x,y)∈Z_m} Σ_{(x̃,ỹ)∈Z_m} r̂_m(x, ỹ)² − (1/|Z_m|) Σ_{(x,y)∈Z_m} r̂_m(x, y),

where |Z| denotes the number of elements in the set Z. This procedure is repeated for m = 1, ..., M, and the average score

  Ĵ^{CV} := (1/M) Σ_{m=1}^M Ĵ_m^{CV}

is computed. Finally, the model (the kernel width σ and the regularization parameter λ in the current setup) that minimizes Ĵ^{CV} is chosen as the most suitable one.
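
In code, model selection amounts to evaluating the hold-out J-score above for each candidate pair (σ, λ) and picking the minimizer. A sketch for the regression-type Gaussian basis, reusing lsmi_from_basis from the previous sketch (all names and defaults are illustrative):

```python
import numpy as np
from itertools import product

def gauss_basis(X, Y, Xc, Yc, sigma):
    """psi_l(x, y) = exp(-||x - x_l||^2 / (2 sigma^2)) exp(-||y - y_l||^2 / (2 sigma^2))."""
    dx = ((X[:, None, :] - Xc[None, :, :]) ** 2).sum(axis=2)
    dy = ((Y[:, None, :] - Yc[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-(dx + dy) / (2 * sigma ** 2))

def holdout_j_score(alpha, X_te, Y_te, Xc, Yc, sigma):
    """Hold-out J-score of Section 2.3 on one test fold."""
    n_te = len(X_te)
    ii, jj = np.meshgrid(np.arange(n_te), np.arange(n_te), indexing="ij")
    r_prod = gauss_basis(X_te[ii.ravel()], Y_te[jj.ravel()], Xc, Yc, sigma) @ alpha
    r_joint = gauss_basis(X_te, Y_te, Xc, Yc, sigma) @ alpha
    return 0.5 * np.mean(r_prod ** 2) - np.mean(r_joint)

def select_model(X, Y, sigmas, lams, M=5, rng=np.random.default_rng(0)):
    """Choose (sigma, lambda) minimizing the M-fold cross-validation J-score."""
    n = len(X)
    fold = rng.permutation(n) % M
    best = (np.inf, None)
    for sigma, lam in product(sigmas, lams):
        scores = []
        for m in range(M):
            tr, te = fold != m, fold == m
            Xc, Yc = X[tr], Y[tr]                 # kernel centers = training points (assumption)
            n_tr = tr.sum()
            ii, jj = np.meshgrid(np.arange(n_tr), np.arange(n_tr), indexing="ij")
            Psi_joint = gauss_basis(X[tr], Y[tr], Xc, Yc, sigma)
            Psi_prod = gauss_basis(X[tr][ii.ravel()], Y[tr][jj.ravel()], Xc, Yc, sigma)
            _, alpha = lsmi_from_basis(Psi_joint, Psi_prod, lam)
            scores.append(holdout_j_score(alpha, X[te], Y[te], Xc, Yc, sigma))
        if np.mean(scores) < best[0]:
            best = (np.mean(scores), (sigma, lam))
    return best[1]
```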

2.4 Permutation Test

Our independence test procedure is based on the permutation test [3]. We first run LSMI using the original dataset Z = {(x_i, y_i)}_{i=1}^n, and obtain an SMI estimate ŜMI(Z). Next, we randomly permute {y_i}_{i=1}^n and form a shuffled dataset Z̃ = {(x_i, y_{τ(i)})}_{i=1}^n, where τ(·) is a randomly chosen permutation function. Then we run LSMI again using the randomly shuffled dataset Z̃, and obtain an SMI estimate ŜMI(Z̃). Note that the random permutation eliminates the dependency between x and y (if it exists), so ŜMI(Z̃) would take a value close to zero. This random permutation procedure is repeated many times, and the distribution of ŜMI(Z̃) under the null hypothesis (i.e., x and y are independent) is constructed. Finally, the p-value is approximated by evaluating the relative ranking of ŜMI(Z) in the distribution of ŜMI(Z̃). We refer to this procedure as the least-squares independence test (LSIT).

A MATLAB® implementation of LSIT is available from http://sugiyama-www.cs.titech.ac.jp/~sugi/software/lsit/.
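
The permutation procedure itself is independent of how ŜMI is computed. A minimal sketch, where estimate_smi stands for any routine returning an SMI estimate such as Eq.(3) with cross-validated (σ, λ):

```python
import numpy as np

def permutation_pvalue(X, Y, estimate_smi, n_perm=1000, rng=np.random.default_rng(0)):
    """Approximate the p-value of the independence test by random permutations.

    estimate_smi(X, Y) -> float is assumed to return an SMI estimate such as Eq.(3).
    """
    smi_obs = estimate_smi(X, Y)                  # SMI on the original pairing Z
    smi_null = np.empty(n_perm)
    for k in range(n_perm):
        Y_perm = Y[rng.permutation(len(Y))]       # break the x-y pairing
        smi_null[k] = estimate_smi(X, Y_perm)     # SMI on the shuffled dataset
    # relative ranking of the observed statistic in the null distribution
    # (the +1 terms give the usual conservative permutation p-value)
    return (1 + np.sum(smi_null >= smi_obs)) / (1 + n_perm)
```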

3 Experiments

In this section, we report experimental results.

3.1 Numerical Illustration

First, we illustrate how the proposed LSIT method behaves using the following toy datasets with one-dimensional input x and one-dimensional output y:

(A) Regression: x ∼ U(a, b), where U(a, b) denotes the uniform distribution on (a, b); y is drawn from a uniform distribution independently of x (Independent), or y ∼ N(sin(x/π), σ²) (Dependent), where N(µ, σ²) denotes the normal distribution with mean µ and variance σ².

(B) Classification: y ∼ B(0.5), where B(p) denotes the binomial distribution on {−1, +1} with the probability of +1 being p; x follows a two-component Gaussian mixture of the form 0.5 N(µ₁, σ²) + 0.5 N(µ₂, σ²) independently of y (Independent), or a Gaussian distribution whose mean depends on y (Dependent).

Examples of realized samples are plotted in Figure 1 for a fixed sample size n, where {x_i}_{i=1}^n and {y_i}_{i=1}^n are normalized to have unit variance. Figure 2 depicts the distributions of ŜMI(Z̃) and the value of ŜMI(Z). The graphs show that reasonable p-values were obtained for all four cases. Figure 3 depicts the p-values and the frequency of accepting the null hypothesis (i.e., x and y are independent) as functions of the sample size n. The graphs show that LSIT works reasonably well.

Figure 1: Toy datasets. (a) Regression/Independent; (b) Regression/Dependent; (c) Classification/Independent; (d) Classification/Dependent.

3.2 Performance Comparison

Here we compare the performance of LSMI and HSIC under the common permutation-test framework. HSIC is the state-of-the-art measure of statistical independence utilizing Gaussian kernels [4]. The performance of HSIC depends on the choice of the Gaussian width, and to the best of our knowledge, there is no theoretically justified method to determine the kernel width. Here we use the standard heuristic of setting the Gaussian width to the median distance between samples, which was also adopted in the original paper [4].

Figure 2: Distributions of ŜMI(Z̃) (randomly shuffled samples) for the toy datasets, with the value of ŜMI(Z) (original samples) marked. (a) Regression/Independent; (b) Regression/Dependent; (c) Classification/Independent; (d) Classification/Dependent.

Figure 3: Experimental results for the toy datasets. Left: mean and standard deviation of p-values across runs, as functions of the sample size n. Right: frequency of accepting the null hypothesis across runs under the significance level 0.05.

We generate data samples by

  (x, y)^T = ( cos θ  −sin θ ; sin θ  cos θ ) (x', y')^T,

where x' and y' each independently follow a two-component Gaussian mixture of the form 0.5 N(µ₁, σ²) + 0.5 N(µ₂, σ²). Thus, (x, y) is a rotation of (x', y') by angle θ. We conduct experiments for θ = 0 (i.e., x and y are independent) and θ = π/8 (i.e., x and y are dependent). Data samples {x_i}_{i=1}^n and {y_i}_{i=1}^n are normalized to have unit variance (see Figure 4).

The results plotted in Figure 5 show that the proposed LSIT has a type-I error (rejecting correct null hypotheses) comparable to HSIC, with a far smaller type-II error (accepting incorrect null hypotheses).

Next, we vary the Gaussian width of HSIC and evaluate how the performance changes. In Figure 5, HSIC(c) denotes HSIC with the Gaussian width multiplied by c. The results show that, although the type-I error does not really change with respect to c, the type-II error is heavily affected by the choice of the Gaussian kernel width. For this dataset, the median heuristic does not work well, and c = 1/2 works the best. However, the optimally-tuned HSIC is still outperformed by the proposed LSIT, which is automatically tuned based on cross-validation and does not involve manual parameter tuning.
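
For concreteness, a small sketch of this data-generation scheme (the mixture means and variances below are illustrative placeholders, not the values used in the experiments):

```python
import numpy as np

def rotation_dataset(n, theta, rng=np.random.default_rng(0)):
    """Rotate independent bimodal marginals by angle theta (Section 3.2).

    The mixture means and variances here are illustrative placeholders.
    """
    comp = rng.integers(0, 2, size=(n, 2))            # mixture component per coordinate
    means = np.where(comp == 0, -1.0, 1.0)            # assumed component means -1 and +1
    xy = means + rng.normal(scale=0.5, size=(n, 2))   # assumed component std 0.5
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    xy = xy @ R.T                                     # (x, y) = R (x', y')
    xy /= xy.std(axis=0)                              # normalize to unit variance
    return xy[:, :1], xy[:, 1:]                       # X, Y as (n, 1) arrays

# theta = 0 gives independent x and y; theta = pi / 8 makes them dependent.
X0, Y0 = rotation_dataset(200, 0.0)
X1, Y1 = rotation_dataset(200, np.pi / 8)
```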

Finally, we use the MNIST handwritten digit dataset for further performance evaluation. Each digit image (representing an integer in {0, 1, 2, ..., 9}) consists of 784 (= 28 × 28) pixels, each of which takes an integer value between 0 and 255 representing its gray-scale intensity level. Here, we label the data as "small" for digits 0, 1, 2, 3, and 4, and "large" for digits 5, 6, 7, 8, and 9. We randomly choose a fixed number of samples and randomly shuffle the labels of a (1 − η)-fraction of them. Thus, increasing η from 0 to 1 corresponds to increasing the dependence between digit patterns and labels. The results are plotted in Figure 6, showing that the proposed LSIT has a slightly larger type-I error than HSIC (i.e., a lower acceptance rate when η = 0), but the type-II error of LSIT is slightly smaller than that of HSIC (i.e., a lower acceptance rate when η > 0). For this dataset, the optimally-tuned HSIC (c = 1/2) slightly outperforms the automatically-tuned LSIT.

Figure 4: Rotation datasets. x and y are independent if θ = 0, and they are dependent when θ = π/8.

Figure 5: Experimental results for the rotation datasets. The frequency of accepting the null hypothesis across runs under the significance level 0.05 is depicted for LSIT, HSIC, and HSIC(c) with several values of c, as a function of the sample size n.

Figure 6: Experimental results for the MNIST datasets. The frequency of accepting the null hypothesis across runs under the significance level 0.05 is depicted as a function of η.

4 Discussions and Conclusions

We proposed a novel non-parametric method of independence testing based on an estimator of a squared-loss variant of mutual information called least-squares mutual information [9]. The proposed method, which we call the least-squares independence test (LSIT), can overcome the limitation of the state-of-the-art method, the Hilbert-Schmidt independence criterion (HSIC) [4], which is not equipped with a model selection procedure for the Gaussian kernel width. Through experiments, we confirmed that the proposed LSIT compares favorably with the HSIC-based independence test.

Acknowledgment

MS was supported by SCAT, AOARD, and the JST PRESTO program. TS was supported by MEXT Grant-in-Aid for Young Scientists (B) 789.

References

[1] N. Aronszajn, Theory of reproducing kernels, Transactions of the American Mathematical Society, vol.68, pp.337-404, 1950.

[2] T.M. Cover and J.A. Thomas, Elements of Information Theory, 2nd ed., John Wiley & Sons, Inc., 2006.

[3] B. Efron and R.J. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall, 1993.

[4] A. Gretton, K. Fukumizu, C.H. Teo, L. Song, B. Schölkopf, and A. Smola, A kernel statistical test of independence, in Advances in Neural Information Processing Systems, pp.585-592, 2008.

[5] T. Kanamori, T. Suzuki, and M. Sugiyama, Condition number analysis of kernel-based density ratio estimation, tech. rep., arXiv, 2009.

[6] K. Pearson, On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling, Philosophical Magazine, vol.50, pp.157-175, 1900.

[7] I. Steinwart, On the influence of the kernel on the consistency of support vector machines, Journal of Machine Learning Research, vol.2, pp.67-93, 2001.

[8] T. Suzuki and M. Sugiyama, Sufficient dimension reduction via squared-loss mutual information estimation, International Conference on Artificial Intelligence and Statistics, 2010.

[9] T. Suzuki, M. Sugiyama, T. Kanamori, and J. Sese, Mutual information estimation reveals global associations between stimuli and biological processes, BMC Bioinformatics, vol.10, no.1, p.S52, 2009.

[10] M. Yamada and M. Sugiyama, Dependence minimizing regression with model selection for non-linear causal inference under non-Gaussian noise, AAAI Conference on Artificial Intelligence, pp.643-648, 2010.