On a Nonparametric Notion of Residual and its Applications

Bodhisattva Sen and Gábor Székely
Columbia University and National Science Foundation
September 16, 2014
(arXiv v1 [stat.ME], 12 Sep 2014. Supported by NSF Grant DMS.)

Abstract

Let (X, Z) be a continuous random vector in R × R^d, d ≥ 1. In this paper, we define the notion of a nonparametric residual of X on Z that is always independent of the predictor Z. We study its properties and show that the proposed notion of residual matches with the usual residual (error) in a multivariate normal regression model. Given a random vector (X, Y, Z) in R × R × R^d, we use this notion of residual to show that the conditional independence between X and Y, given Z, is equivalent to the mutual independence of the residuals (of X on Z and Y on Z) and Z. This result is used to develop a test for conditional independence.

1 Introduction

Let (X, Z) be a random vector in R × R^d = R^{d+1}, d ≥ 1. We assume that (X, Z) has a joint density on R^{d+1}. If we want to predict X using Z we usually formulate the following regression problem:

    X = m(Z) + ε,    (1.1)

where m(z) = E(X | Z = z) is the conditional mean of X given Z = z, and ε := X − m(Z) is the residual (although ε is usually called the error, and its estimate the residual, for this paper we feel that the term residual is more appropriate). Typically we further assume that the residual ε is independent of Z. However,

intuitively, we are just trying to break the information in (X, Z) into two parts: a part that contains all relevant information about X, and the residual (the left over) which does not have anything to do with the relationship between X and Z. In this paper we address the following question: given any random vector (X, Z), how do we define the notion of a residual of X on Z that matches this intuition? Thus, formally, we want to find a function ϕ : R^{d+1} → R such that the residual ϕ(X, Z) satisfies the following two conditions:

(C.1) the residual ϕ(X, Z) is independent of the predictor Z, i.e.,

    ϕ(X, Z) ⊥ Z, and    (1.2)

(C.2) the information content of (X, Z) is the same as that of (ϕ(X, Z), Z), i.e.,

    σ(X, Z) = σ(ϕ(X, Z), Z),    (1.3)

where σ(X, Z) denotes the σ-field generated by X and Z. We can also express (1.3) as: there exists a measurable function h : R^{d+1} → R such that

    X = h(Z, ϕ(X, Z));    (1.4)

see, e.g., Theorem 20.1 of Billingsley (1995).

In this paper we propose a notion of a residual that satisfies (slightly stronger forms of) the above two conditions, under any joint distribution of X and Z; see Section 2 for the details. We investigate the properties of this notion of residual in Section 2, and show that it indeed reduces to the usual residual (error) in a multivariate normal regression model.

Suppose now that (X, Y, Z) has a joint density on R × R × R^d = R^{d+2}. The assumption of conditional independence means that X is independent of Y given Z, i.e., X ⊥ Y | Z. Conditional independence is an important concept in modeling causal relations and in graphical models. However, there are very few easily implementable statistical tests for checking the assumption of conditional independence: Fukumizu et al. (2007) propose a measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces; Zhang et al.
(2012) propose another kernel-based conditional independence test; Székely and Rizzo (2014) investigate a method that is easy to compute and can capture non-linear dependencies but does not completely characterize conditional independence; see also Györfi and Walk (2012) and the references therein.

In Section 3 we use the notion of residual defined in Section 2 to show that the conditional independence between X and Y, given Z, is equivalent to the mutual independence of the residuals (of X on Z and Y on Z) and Z. This reduction immediately allows the practitioner to use any of the numerous statistical methods available for testing mutual independence of random variables/vectors to test the hypothesis of conditional independence. In particular, in Section 3.1, we use this result to propose a test for conditional independence using the energy statistic (see Székely and Rizzo (2005) and Rizzo and Székely (2010)). The resulting testing procedure is tuning-parameter free (an advantage over existing methods), once the residuals have been estimated. The use of the energy statistic to test the mutual independence of two or more random variables (or vectors) is also new, and is outlined in Section 3.1. The estimation of the proposed residual from data is briefly described in Section 3.2. We end with a brief discussion in Section 4, where we point to some open research problems and outline an idea, using the proposed residuals, to define (and test) a nonparametric notion of partial correlation.

2 A nonparametric notion of residual

Conditions (C.1)-(C.2) do not necessarily lead to a unique choice for ϕ. To find a meaningful and unique function ϕ that satisfies conditions (C.1)-(C.2) we impose the following natural restriction on ϕ. We assume that

(C.3) x ↦ ϕ(x, z) is strictly increasing on its support, for every fixed z ∈ R^d.

Note that condition (C.3) is a slight strengthening of condition (C.2). Suppose that a function ϕ satisfies conditions (C.1) and (C.3). Then any strictly monotone transformation of ϕ(·, z) would again satisfy (C.1) and (C.3). Thus, conditions (C.1) and (C.3) do not uniquely specify ϕ. To handle this identifiability issue, we replace condition (C.1) with (C.4), described below.
First observe that, by condition (C.1), the conditional distribution of the random variable ϕ(X, Z) given Z = z does not depend on z. We assume that

(C.4) ϕ(X, Z) | Z = z is uniformly distributed on (0, 1), for all z ∈ R^d.

Condition (C.4) is again quite natural: we usually assume that the residual has a fixed distribution, e.g., in regression we assume that the (standardized) residual is normally distributed with zero mean and unit variance. Note that condition (C.4) is slightly stronger than (C.1) and will help us uniquely identify ϕ. The following result shows that, indeed, under conditions (C.3)-(C.4), a unique ϕ exists, and gives its form.

Lemma 2.1. Let F_{X|Z}(· | z) denote the conditional distribution function of X | Z = z. Under conditions (C.3) and (C.4), we have a unique choice of ϕ(x, z), given by

    ϕ(x, z) = F_{X|Z}(x | z).    (2.1)

Also, h(z, u) can be taken as

    h(z, u) = F_{X|Z}^{-1}(u | z).    (2.2)

Proof. Fix z in the support of Z and let u ∈ (0, 1). Write ϕ_z(x) = ϕ(x, z). By condition (C.4), we have P[ϕ(X, Z) ≤ u | Z = z] = u. On the other hand, by (C.3),

    P[ϕ(X, Z) ≤ u | Z = z] = P[X ≤ ϕ_z^{-1}(u) | Z = z] = F_{X|Z}(ϕ_z^{-1}(u) | z).

Thus, we have F_{X|Z}(ϕ_z^{-1}(u) | z) = u, for all u ∈ (0, 1), which is equivalent to ϕ_z(x) = F_{X|Z}(x | z). Let h be as defined in (2.2). Then

    h(z, ϕ(x, z)) = F_{X|Z}^{-1}(ϕ(x, z) | z) = F_{X|Z}^{-1}(F_{X|Z}(x | z) | z) = x,

as required.

Thus, from the above lemma, we conclude that in the nonparametric setup, if we want a notion of residual satisfying conditions (C.3)-(C.4), then the residual has to be F_{X|Z}(X | Z).

Remark 2.2. Let us first consider the example when (X, Z) follows a multivariate Gaussian distribution, i.e.,

    (X, Z)^T ∼ N_{d+1}( (µ_1, µ_2)^T, Σ ),  where  Σ := [ σ_11  σ_12^T ; σ_12  Σ_22 ],

µ_1 ∈ R, µ_2 ∈ R^d, and Σ is a (d+1) × (d+1) positive definite matrix with σ_11 > 0, σ_12 ∈ R^{d×1} and Σ_22 ∈ R^{d×d}. Then the conditional distribution of X given Z = z is N(µ_1 + σ_12^T Σ_22^{-1}(z − µ_2), σ_11 − σ_12^T Σ_22^{-1} σ_12). Therefore, we have the following representation in the form of (1.1):

    X = µ_1 + σ_12^T Σ_22^{-1}(Z − µ_2) + ( X − µ_1 − σ_12^T Σ_22^{-1}(Z − µ_2) ),

where the usual residual is X − µ_1 − σ_12^T Σ_22^{-1}(Z − µ_2), which is known to be independent of Z. In this case, using Lemma 2.1, we get

    ϕ(X, Z) = Φ( (X − µ_1 − σ_12^T Σ_22^{-1}(Z − µ_2)) / (σ_11 − σ_12^T Σ_22^{-1} σ_12)^{1/2} ),

where Φ(·) is the distribution function of the standard normal distribution. Thus ϕ(X, Z) is just a fixed strictly increasing transformation of the usual residual, and the two notions of residual essentially coincide.

Remark 2.3. The above notion of residual does not extend so easily to the case of discrete random variables. Conditions (C.1) and (C.2) are equivalent to the fact that σ(X, Z) factorizes into two independent sub-σ-fields as σ(X, Z) = σ(ϕ(X, Z)) ⊗ σ(Z). This may not always be possible, as can be seen from the following simple example. Let (X, Z) take values in {0, 1}^2 such that P[X = i, Z = j] > 0 for all i, j ∈ {0, 1}. Then it can be shown that such a factorization exists if and only if X and Z are independent, in which case ϕ(X, Z) = X.

Remark 2.4. Lemma 2.1 also gives a way to generate X using Z and the residual. We can first generate Z, following its marginal distribution, and an independent Uniform(0, 1) random variable U, which will act as the residual. Then (1.4), where h is defined in (2.2), shows that we can generate X = F_{X|Z}^{-1}(U | Z).

In practice, we need to estimate the residual F_{X|Z}(X | Z) from observed data, which can be done both parametrically and nonparametrically. If we have a parametric model for F_{X|Z}(· | ·), we can estimate the parameters using, e.g., maximum likelihood. If we do not want to assume any structure on F_{X|Z}(· | ·), we can use any nonparametric smoothing method, e.g., standard kernel methods, for estimation; see Bergsma (2011) for such an implementation. We will discuss the estimation of the residuals in more detail in Section 3.2.

3 Testing Conditional Independence

Suppose now that (X, Y, Z) has a joint density on R × R × R^d = R^{d+2}. We want to test the hypothesis of conditional independence between X and Y given Z, i.e., X ⊥ Y | Z.
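As a numerical sanity check of Lemma 2.1 in the Gaussian setting of Remark 2.2, the following sketch (parameter values are illustrative; Φ is evaluated with the standard library's math.erf) verifies that the residual ϕ(X, Z) is approximately Uniform(0, 1) and uncorrelated with Z.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
n = 100_000

# Illustrative parameters for a bivariate normal (X, Z), i.e., d = 1.
mu1, mu2 = 1.0, -0.5
s11, s12, s22 = 2.0, 0.8, 1.0          # Var(X), Cov(X, Z), Var(Z)
cov = np.array([[s11, s12], [s12, s22]])
X, Z = rng.multivariate_normal([mu1, mu2], cov, size=n).T

# Nonparametric residual of Lemma 2.1: phi(X, Z) = F_{X|Z}(X | Z), which here
# equals Phi applied to the standardized usual (linear-regression) residual.
resid = X - mu1 - (s12 / s22) * (Z - mu2)
sd = sqrt(s11 - s12 ** 2 / s22)
Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))
phi = Phi(resid / sd)

# Mean ~ 1/2 and variance ~ 1/12 (Uniform(0,1)), correlation with Z ~ 0.
print(phi.mean(), phi.var(), np.corrcoef(phi, Z)[0, 1])
```

The same check run with any other positive definite Σ behaves identically, since ϕ(X, Z) is a fixed increasing transform of the usual residual.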
Recently this problem has received some attention in the statistical literature; see, e.g., Su and White (2007), Huang (2010), Song (2009). Although there is a large body of work on testing mutual independence of two random

vectors, testing conditional independence has received relatively less attention, albeit being a central notion in modeling causal relations and in graphical modeling (see Lauritzen (1996), Pearl (2000)). The conditional independence assumption also has important implications in insurance markets and, more generally, in economic theory (see Chiappori and Salanié (2000)) and in the literature on program evaluation (see Heckman et al. (1997)), among other fields.

In this section we state a simple result that reduces testing the conditional independence hypothesis H_0 : X ⊥ Y | Z to a problem of testing mutual independence between three random variables/vectors that involve our notion of residual. We also briefly describe a procedure to test the mutual independence of the three random variables/vectors. We start with the statement of the crucial lemma.

Lemma 3.1. Suppose that (X, Y, Z) has a continuous joint density on R^{d+2}. Then, X ⊥ Y | Z if and only if F_{X|Z}(X | Z), F_{Y|Z}(Y | Z) and Z are mutually independent.

Proof. Let us make the change of variables

    (X, Y, Z) ↦ (U, V, Z) := (F_{X|Z}(X | Z), F_{Y|Z}(Y | Z), Z).

The joint density of (U, V, Z) can be expressed as

    f_{(U,V,Z)}(u, v, z) = f(x, y, z) / ( f_{X|Z=z}(x) f_{Y|Z=z}(y) ) = f_{(X,Y)|Z=z}(x, y) f_Z(z) / ( f_{X|Z=z}(x) f_{Y|Z=z}(y) ),    (3.1)

where x = F_{X|Z=z}^{-1}(u) and y = F_{Y|Z=z}^{-1}(v). Note that as the Jacobian matrix is upper-triangular, its determinant is the product of the diagonal entries of the matrix, namely, f_{X|Z=z}(x), f_{Y|Z=z}(y) and 1. If X ⊥ Y | Z, then f_{(U,V,Z)}(u, v, z) reduces to just f_Z(z), for u, v ∈ (0, 1), from the definition of conditional independence, which shows that U, V, Z are independent (note that it is easy to show that U, V are marginally Uniform(0, 1)). Conversely, if U, V, Z are independent, then f_{(U,V,Z)}(u, v, z) = f_Z(z) for u, v ∈ (0, 1), which from (3.1) easily shows that X ⊥ Y | Z.

Remark 3.2.
Bergsma (2011) developed a test for conditional independence by testing the independence of F_{X|Z}(X | Z) and F_{Y|Z}(Y | Z). However, as the following example illustrates, the independence of F_{X|Z}(X | Z) and F_{Y|Z}(Y | Z) is not enough to guarantee that X ⊥ Y | Z. Let W_1, W_2, W_3 be i.i.d. Uniform(0, 1)

random variables. Let X = W_1 + W_3, Y = W_2 and Z = (W_1 + W_2) mod 1, where mod stands for the modulo (sometimes called modulus) operation that finds the remainder of the division of W_1 + W_2 by 1. Clearly, the random vector (X, Y, Z) has a continuous joint density. Note that Z is independent of W_i, for i = 1, 2. Hence, X, Y and Z are pairwise independent. Thus, F_{X|Z}(X | Z) = F_X(X) and F_{Y|Z}(Y | Z) = F_Y(Y), where F_X and F_Y are the marginal distribution functions of X and Y, respectively. From the independence of X and Y, F_X(X) and F_Y(Y) are independent. On the other hand, the value of W_1 is clearly determined by Y and Z, i.e., W_1 = Z − Y if Y ≤ Z, and W_1 = Z − Y + 1 if Y > Z. Consequently, X and Y are not conditionally independent given Z. To see this, note that since E[X | Y, Z] = W_1 + E[W_3] = W_1 + 1/2, for every z ∈ (0, 1),

    E[X | Y, Z = z] = z − Y + 1/2,  if Y ≤ z;  and  z − Y + 3/2,  if Y > z,

which obviously depends on Y.

Remark 3.3. We can extend the above result to the case when X and Y are random vectors in R^p and R^q, respectively. In that case we define the conditional multivariate distribution transform F_{X|Z} by successively conditioning on the coordinate random variables, i.e., if X = (X_1, X_2) then we can define F_{X|Z} as (F_{X_2|X_1,Z}, F_{X_1|Z}). With this definition, Lemma 3.1 still holds.

To use Lemma 3.1 to test the conditional independence between X and Y given Z, we need to first estimate the residuals F_{X|Z}(X | Z) and F_{Y|Z}(Y | Z) from observed data, which can be done by any nonparametric smoothing method, e.g., standard kernel methods (see Section 3.2). Then, any procedure for testing the mutual independence of F_{X|Z}(X | Z), F_{Y|Z}(Y | Z) and Z can be used. In this paper we advocate the use of the energy statistic (see Rizzo and Székely (2010)), described briefly in the next subsection, to test the mutual independence of the three random variables/vectors.
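The counterexample of Remark 3.2 is easy to verify by simulation. The sketch below (sample size and slab width are arbitrary choices) confirms that X, Y, Z are pairwise (nearly) uncorrelated, while conditionally on Z being near 1/2 the mean of X moves with Y.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
W1, W2, W3 = rng.uniform(size=(3, n))

X = W1 + W3
Y = W2
Z = (W1 + W2) % 1.0

# Pairwise independence: all pairwise correlations are near zero.
print(np.corrcoef(X, Y)[0, 1], np.corrcoef(X, Z)[0, 1], np.corrcoef(Y, Z)[0, 1])

# Conditional dependence: restrict to a thin slab Z ~ 1/2, where
# E[X | Y, Z = z] = z - Y + 1/2 for Y <= z, so the conditional mean
# of X drops as Y grows.
slab = np.abs(Z - 0.5) < 0.05
low = X[slab & (Y < 0.2)].mean()                 # small Y
mid = X[slab & (Y > 0.3) & (Y < 0.45)].mean()    # larger Y (still <= z)
print(low - mid)   # clearly positive: X depends on Y given Z
```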
3.1 Testing mutual independence of three or more random vectors

Testing independence of two random variables (or vectors) has received much recent attention in the statistical literature; see, e.g., Székely et al. (2007), Gretton et al. (2005), and the references therein. To test the mutual independence of the above three random variables we use the methodology of Rizzo and Székely (2010)

(see also Székely et al. (2007), Székely and Rizzo (2009)) developed in the context of testing the equality of the distributions of two or more random variables (or vectors). In the following we briefly describe our procedure in the general setup.

Suppose that we have r ≥ 3 random variables (or vectors) T_1, T_2, ..., T_r. We write T := (T_1, T_2, ..., T_r) and introduce T_ind := (T_1*, T_2*, ..., T_r*), where T_j* is an i.i.d. copy of T_j, j = 1, 2, ..., r, but in T_ind the coordinates T_1*, T_2*, ..., T_r* are independent. To test the mutual independence of T_1, T_2, ..., T_r, all we need to test now is whether T and T_ind are identically distributed. We can test this by considering the following energy statistic (see Székely and Rizzo (2005) and Rizzo and Székely (2010)):

    Λ(T) = 2 E‖T − T_ind‖ − E‖T − T′‖ − E‖T_ind − T′_ind‖,    (3.2)

where the ′ denotes an i.i.d. copy, and ‖·‖ denotes the Euclidean norm. Note that Λ(T) is always nonnegative, and equals 0 if and only if T and T_ind are identically distributed, i.e., if and only if T_1, T_2, ..., T_r are mutually independent (see Corollary 1 of Székely and Rizzo (2005)). We use the sample version of (3.2) as our test statistic. In the following we briefly describe our procedure; a complete analysis of our procedure is beyond the scope of the paper and will be a topic for future research.

Under the null hypothesis of conditional independence, we have

    T := (F_{X|Z}(X | Z), F_{Y|Z}(Y | Z), Z) =_d (U_1, U_2, Z) =: T_ind,    (3.3)

where U_1, U_2 are i.i.d. Uniform(0, 1), independent of Z. We observe i.i.d. data {(X_i, Y_i, Z_i) : i = 1, ..., n} from the joint distribution of (X, Y, Z) and we are interested in testing the distributional equality (3.3). Suppose first that the conditional distribution functions F_{X|Z}(· | ·) and F_{Y|Z}(· | ·) are known. Then we can estimate P_T, the distribution of T, by its empirical counterpart

    P̂_T := (1/n) Σ_{i=1}^n δ_{(F_{X|Z}(X_i | Z_i), F_{Y|Z}(Y_i | Z_i), Z_i)},

where δ_x denotes the Dirac measure at x.
On the other hand, we can estimate P_{T_ind} by

    P̂_{T_ind} := (1/n^3) Σ_{i,j,k=1}^n δ_{(F_{X|Z}(X_i | Z_i), F_{Y|Z}(Y_j | Z_j), Z_k)}.

Hence, we can estimate E‖T − T_ind‖ by

    V_1 := ∫∫ ‖t − t′‖ dP̂_T(t) dP̂_{T_ind}(t′)
         = (1/n^4) Σ_{i,j,k,l=1}^n ‖(F_{X|Z}(X_i | Z_i), F_{Y|Z}(Y_i | Z_i), Z_i) − (F_{X|Z}(X_j | Z_j), F_{Y|Z}(Y_k | Z_k), Z_l)‖.
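These multiple sums are plain averages over the atoms of P̂_T and P̂_{T_ind}, so the statistic V = 2V_1 − V_2 − V_3 (with V_2 and V_3 the analogous within-sample averages) can be sketched directly. In the sketch below the residuals are treated as known and simply simulated, and n is kept very small because the sums range over up to n^6 index tuples.

```python
import numpy as np

def mean_pairwise_dist(A, B):
    """Average Euclidean distance between the rows of A and the rows of B."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return D.mean()

def energy_mutual_independence(u, v, z):
    """Sample version V = 2*V1 - V2 - V3 of the energy statistic (3.2)
    for mutual independence of the three coordinates (u_i, v_i, z_i)."""
    n = len(u)
    T = np.column_stack([u, v, z])                      # atoms of P_T
    i, j, k = np.meshgrid(np.arange(n), np.arange(n), np.arange(n),
                          indexing="ij")
    Tind = np.column_stack([u[i.ravel()], v[j.ravel()], z[k.ravel()]])
    V1 = mean_pairwise_dist(T, Tind)
    V2 = mean_pairwise_dist(T, T)
    V3 = mean_pairwise_dist(Tind, Tind)
    return 2.0 * V1 - V2 - V3

rng = np.random.default_rng(2)
n = 12                                # n^3 = 1728 atoms for P_Tind
z = rng.normal(size=n)
u, v = rng.uniform(size=(2, n))       # "residuals", known/simulated here

# Nonnegative; converges to Lambda(T) = 0 here as n grows, since the
# three coordinates are mutually independent by construction.
print(energy_mutual_independence(u, v, z))
```

A production implementation would avoid the brute-force n^4 and n^6 sums; the sketch only illustrates the definitions.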

Similarly, we can estimate E‖T − T′‖ and E‖T_ind − T′_ind‖ by

    V_2 := ∫∫ ‖t − t′‖ dP̂_T(t) dP̂_T(t′)  and  V_3 := ∫∫ ‖t − t′‖ dP̂_{T_ind}(t) dP̂_{T_ind}(t′),

respectively. Thus our test statistic, the sample version of (3.2), reduces to

    V := 2 V_1 − V_2 − V_3.

As F_{X|Z} and F_{Y|Z} are unknown, we can replace them by their estimates F̂_{X|Z} and F̂_{Y|Z}, respectively. We can plug these estimates into P̂_T and P̂_{T_ind} above to obtain our final test statistic. To obtain the critical value of the proposed test, we can simply use a permutation test: consider the permuted data

    {(F̂_{X|Z}(X_{π_1(i)} | Z_{π_1(i)}), F̂_{Y|Z}(Y_{π_2(i)} | Z_{π_2(i)}), Z_i) : i = 1, ..., n},

where π_1 and π_2 are random permutations of {1, ..., n}, and compute the test statistic V using this permuted data set. Repeating this numerous times, with different π_1 and π_2, would give the null distribution of our test statistic. Note that this method does not involve recomputing the transformations F̂_{X|Z}(· | ·) and F̂_{Y|Z}(· | ·), as they remain fixed.

3.2 Nonparametric estimation of the residuals

The nonparametric estimation of the conditional distribution functions involves smoothing. In the following we briefly describe the standard approach to estimating the conditional distribution functions using the kernel smoothing method (see also Lee et al. (2006), Yu and Jones (1998), Hall et al. (1999)). For notational simplicity, we restrict to the case d = 1, i.e., Z is a real-valued random variable. Given an i.i.d. sample {(X_i, Z_i) : i = 1, ..., n} from f_{X,Z}, the joint density of (X, Z), we can use the following kernel density estimator of f_{X,Z}:

    f̂_n(x, z) = (1/(n h_{1,n} h_{2,n})) Σ_{i=1}^n k((x − X_i)/h_{1,n}) k((z − Z_i)/h_{2,n}),

where k is a symmetric probability density on R (e.g., the standard normal density function), and h_{i,n}, i = 1, 2, are the smoothing bandwidths. It can be shown that if n h_{1,n} h_{2,n} → ∞ and max{h_{1,n}, h_{2,n}} → 0 as n → ∞, then f̂_n(x, z) →_P f_{X,Z}(x, z).
In fact, the theoretical properties of the kernel density estimator are very well studied; see, e.g., Fan and Gijbels (1996) and Einmahl and Mason (2005) and the references therein. For convenience of notation, we will write h_{i,n} as h_i, i = 1, 2. The conditional density of X given Z can then be estimated by

    f̂_{X|Z}(x | z) = f̂_n(x, z) / f̂_Z(z)
                   = [ (1/(n h_1 h_2)) Σ_{i=1}^n k((x − X_i)/h_1) k((z − Z_i)/h_2) ] / [ (1/(n h_2)) Σ_{i=1}^n k((z − Z_i)/h_2) ].
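A minimal implementation of this kernel scheme, going directly to the conditional distribution function it induces (the Gaussian kernel, the fixed bandwidths, and the toy model X | Z = z ∼ N(z, 1) below are illustrative choices, not data-driven ones):

```python
import numpy as np
from math import erf, sqrt

def cond_cdf_kernel(x, z, X, Z, h1, h2):
    """Kernel estimate of F_{X|Z}(x | z) = sum_i w_i(z) * K((x - X_i)/h1),
    with Gaussian kernel k and its distribution function K = Phi."""
    Kvals = 0.5 * (1.0 + np.vectorize(erf)((x - X) / (h1 * sqrt(2.0))))
    kz = np.exp(-0.5 * ((z - Z) / h2) ** 2)   # unnormalized Gaussian kernel in z
    w = kz / kz.sum()                         # weights w_i(z), summing to one
    return float(np.sum(w * Kvals))

# Toy check: X | Z = z ~ N(z, 1), so F_{X|Z}(0 | 0) should be about 1/2.
rng = np.random.default_rng(3)
n = 5_000
Z = rng.uniform(-1, 1, size=n)
X = Z + rng.normal(size=n)
print(cond_cdf_kernel(0.0, 0.0, X, Z, h1=0.2, h2=0.2))  # roughly 0.5
```

In practice the bandwidths h_1, h_2 would be chosen by cross-validation or a plug-in rule rather than fixed by hand.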

Thus the conditional distribution function of X given Z can be estimated as

    F̂_{X|Z}(x | z) = ∫_{−∞}^x f̂_{X|Z}(t | z) dt
                   = [ (1/(n h_2)) Σ_{i=1}^n K((x − X_i)/h_1) k((z − Z_i)/h_2) ] / [ (1/(n h_2)) Σ_{i=1}^n k((z − Z_i)/h_2) ]
                   = Σ_{i=1}^n w_i(z) K((x − X_i)/h_1),

where K is the distribution function corresponding to k (i.e., K(u) = ∫_{−∞}^u k(v) dv) and

    w_i(z) = k((z − Z_i)/h_2) / Σ_{j=1}^n k((z − Z_j)/h_2)

are weights that sum to one for every z.

4 Discussion

Given a random vector (X, Z) in R × R^d = R^{d+1}, we have defined the notion of a nonparametric residual of X on Z as F_{X|Z}(X | Z), which is always independent of the predictor Z. We have studied some of its properties and showed that it indeed reduces to the usual residual in a multivariate normal regression model. However, the estimation of F_{X|Z}(· | ·) requires nonparametric smoothing techniques, and hence suffers from the curse of dimensionality. One natural way of mitigating this curse of dimensionality could be to use dimension reduction techniques in estimating the residual F_{X|Z}(X | Z).

Suppose now that (X, Y, Z) has a joint density on R × R × R^d = R^{d+2}. We have used this notion of residual to show that the conditional independence between X and Y, given Z, is equivalent to the mutual independence of the residuals F_{X|Z}(X | Z) and F_{Y|Z}(Y | Z) and the predictor Z. We have used this result to propose a test for conditional independence, based on the energy statistic. The asymptotic theory of the proposed test statistic is, however, unknown, and will be a topic for future research.

We can also use these residuals to come up with a nonparametric notion of partial correlation. The partial correlation of X and Y measures the degree of association between X and Y, removing the effect of Z. In the nonparametric setting, this reduces to measuring the dependence between the residuals F_{X|Z}(X | Z) and F_{Y|Z}(Y | Z). We can use distance covariance (Székely et al. (2007)), or any other measure of dependence, for this purpose.
We can also test for zero partial correlation by testing for the independence of the residuals F_{X|Z}(X | Z) and F_{Y|Z}(Y | Z).

Acknowledgements: The first author would like to thank Arnab Sen for many helpful discussions, and for his help in writing parts of the paper. He would also like to thank Probal Chaudhuri for motivating the problem. The research of both authors is supported by the National Science Foundation.
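The distance-covariance route mentioned above is straightforward to sketch: the sample distance covariance of Székely et al. (2007) is the mean of the entrywise product of double-centered pairwise-distance matrices. The sketch below treats the residuals as known (they are simply simulated), with one independent and one artificially dependent pair for contrast.

```python
import numpy as np

def dcov2(a, b):
    """Squared sample distance covariance (Székely et al., 2007)
    of two one-dimensional samples a and b of equal length."""
    A = np.abs(a[:, None] - a[None, :])
    B = np.abs(b[:, None] - b[None, :])
    A = A - A.mean(axis=0) - A.mean(axis=1)[:, None] + A.mean()
    B = B - B.mean(axis=0) - B.mean(axis=1)[:, None] + B.mean()
    return (A * B).mean()

rng = np.random.default_rng(4)
n = 500
u = rng.uniform(size=n)            # residual F_{X|Z}(X|Z), simulated here
v_ind = rng.uniform(size=n)        # an independent residual F_{Y|Z}(Y|Z)
v_dep = (u + 0.1 * rng.normal(size=n)) % 1.0   # a strongly dependent residual

print(dcov2(u, v_ind), dcov2(u, v_dep))   # the second is much larger
```

A permutation test on dcov2 of the two estimated residuals would then give the test for zero nonparametric partial correlation suggested above.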

References

Bergsma, W. P. (2011). Nonparametric testing of conditional independence by means of the partial copula.

Billingsley, P. (1995). Probability and Measure. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, New York, third edition.

Chiappori, P. and Salanié, B. (2000). Testing for asymmetric information in insurance markets. Journal of Political Economy, 108(1).

Einmahl, U. and Mason, D. M. (2005). Uniform in bandwidth consistency of kernel-type function estimators. Ann. Statist., 33(3).

Fan, J. and Gijbels, I. (1996). Local Polynomial Modelling and Its Applications, volume 66 of Monographs on Statistics and Applied Probability. Chapman & Hall, London.

Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. (2007). Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems.

Gretton, A., Bousquet, O., Smola, A., and Schölkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In Proceedings of the Conference on Algorithmic Learning Theory (ALT).

Györfi, L. and Walk, H. (2012). Strongly consistent nonparametric tests of conditional independence. Statist. Probab. Lett., 82(6).

Hall, P., Wolff, R. C. L., and Yao, Q. (1999). Methods for estimating a conditional distribution function. J. Amer. Statist. Assoc., 94(445).

Heckman, J., Ichimura, H., and Todd, P. (1997). Matching as an econometric evaluation estimator: Evidence from evaluating a job training programme. The Review of Economic Studies, 64(4):605.

Huang, T.-M. (2010). Testing conditional independence using maximal nonlinear conditional correlation. Ann. Statist., 38(4).

Lauritzen, S. L. (1996). Graphical Models, volume 17 of Oxford Statistical Science Series. The Clarendon Press, Oxford University Press.

Lee, Y. K., Lee, E. R., and Park, B. U. (2006). Conditional quantile estimation by local logistic regression. J. Nonparametr. Stat., 18(4-6).

Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge.

Rizzo, M. L. and Székely, G. J. (2010). DISCO analysis: a nonparametric extension of analysis of variance. Ann. Appl. Stat., 4(2).

Song, K. (2009). Testing conditional independence via Rosenblatt transforms. Ann. Statist., 37(6B).

Su, L. and White, H. (2007). A consistent characteristic function-based test for conditional independence. J. Econometrics, 141(2).

Székely, G. J. and Rizzo, M. L. (2005). A new test for multivariate normality. J. Multivariate Anal., 93(1).

Székely, G. J. and Rizzo, M. L. (2009). Brownian distance covariance. Ann. Appl. Stat., 3(4).

Székely, G. J. and Rizzo, M. L. (2014). Partial distance correlation with methods for dissimilarities. To appear in Ann. Statist.

Székely, G. J., Rizzo, M. L., and Bakirov, N. K. (2007). Measuring and testing dependence by correlation of distances. Ann. Statist., 35(6).

Yu, K. and Jones, M. C. (1998). Local linear quantile regression. J. Amer. Statist. Assoc., 93(441).

Zhang, K., Peters, J., Janzing, D., and Schölkopf, B. (2012). Kernel-based conditional independence test and application in causal discovery. arXiv preprint.


More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

3 Comparison with Other Dummy Variable Methods

3 Comparison with Other Dummy Variable Methods Stats 300C: Theory of Statistics Spring 2018 Lecture 11 April 25, 2018 Prof. Emmanuel Candès Scribe: Emmanuel Candès, Michael Celentano, Zijun Gao, Shuangning Li 1 Outline Agenda: Knockoffs 1. Introduction

More information

AFT Models and Empirical Likelihood

AFT Models and Empirical Likelihood AFT Models and Empirical Likelihood Mai Zhou Department of Statistics, University of Kentucky Collaborators: Gang Li (UCLA); A. Bathke; M. Kim (Kentucky) Accelerated Failure Time (AFT) models: Y = log(t

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

PARSIMONIOUS MULTIVARIATE COPULA MODEL FOR DENSITY ESTIMATION. Alireza Bayestehtashk and Izhak Shafran

PARSIMONIOUS MULTIVARIATE COPULA MODEL FOR DENSITY ESTIMATION. Alireza Bayestehtashk and Izhak Shafran PARSIMONIOUS MULTIVARIATE COPULA MODEL FOR DENSITY ESTIMATION Alireza Bayestehtashk and Izhak Shafran Center for Spoken Language Understanding, Oregon Health & Science University, Portland, Oregon, USA

More information

Bivariate Paired Numerical Data

Bivariate Paired Numerical Data Bivariate Paired Numerical Data Pearson s correlation, Spearman s ρ and Kendall s τ, tests of independence University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html

More information

Minimum Hellinger Distance Estimation in a. Semiparametric Mixture Model

Minimum Hellinger Distance Estimation in a. Semiparametric Mixture Model Minimum Hellinger Distance Estimation in a Semiparametric Mixture Model Sijia Xiang 1, Weixin Yao 1, and Jingjing Wu 2 1 Department of Statistics, Kansas State University, Manhattan, Kansas, USA 66506-0802.

More information

Thomas J. Fisher. Research Statement. Preliminary Results

Thomas J. Fisher. Research Statement. Preliminary Results Thomas J. Fisher Research Statement Preliminary Results Many applications of modern statistics involve a large number of measurements and can be considered in a linear algebra framework. In many of these

More information

Estimation of a Two-component Mixture Model

Estimation of a Two-component Mixture Model Estimation of a Two-component Mixture Model Bodhisattva Sen 1,2 University of Cambridge, Cambridge, UK Columbia University, New York, USA Indian Statistical Institute, Kolkata, India 6 August, 2012 1 Joint

More information

Causal Inference Lecture Notes: Causal Inference with Repeated Measures in Observational Studies

Causal Inference Lecture Notes: Causal Inference with Repeated Measures in Observational Studies Causal Inference Lecture Notes: Causal Inference with Repeated Measures in Observational Studies Kosuke Imai Department of Politics Princeton University November 13, 2013 So far, we have essentially assumed

More information

Kernel Methods. Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton

Kernel Methods. Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Kernel Methods Lecture 4: Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Alexander J. Smola Statistical Machine Learning Program Canberra,

More information

Local Polynomial Modelling and Its Applications

Local Polynomial Modelling and Its Applications Local Polynomial Modelling and Its Applications J. Fan Department of Statistics University of North Carolina Chapel Hill, USA and I. Gijbels Institute of Statistics Catholic University oflouvain Louvain-la-Neuve,

More information

Bickel Rosenblatt test

Bickel Rosenblatt test University of Latvia 28.05.2011. A classical Let X 1,..., X n be i.i.d. random variables with a continuous probability density function f. Consider a simple hypothesis H 0 : f = f 0 with a significance

More information

Nonparametric Modal Regression

Nonparametric Modal Regression Nonparametric Modal Regression Summary In this article, we propose a new nonparametric modal regression model, which aims to estimate the mode of the conditional density of Y given predictors X. The nonparametric

More information

Big Hypothesis Testing with Kernel Embeddings

Big Hypothesis Testing with Kernel Embeddings Big Hypothesis Testing with Kernel Embeddings Dino Sejdinovic Department of Statistics University of Oxford 9 January 2015 UCL Workshop on the Theory of Big Data D. Sejdinovic (Statistics, Oxford) Big

More information

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract

WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract Journal of Data Science,17(1). P. 145-160,2019 DOI:10.6339/JDS.201901_17(1).0007 WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION Wei Xiong *, Maozai Tian 2 1 School of Statistics, University of

More information

Hilbert Space Representations of Probability Distributions

Hilbert Space Representations of Probability Distributions Hilbert Space Representations of Probability Distributions Arthur Gretton joint work with Karsten Borgwardt, Kenji Fukumizu, Malte Rasch, Bernhard Schölkopf, Alex Smola, Le Song, Choon Hui Teo Max Planck

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

Testing against a linear regression model using ideas from shape-restricted estimation

Testing against a linear regression model using ideas from shape-restricted estimation Testing against a linear regression model using ideas from shape-restricted estimation arxiv:1311.6849v2 [stat.me] 27 Jun 2014 Bodhisattva Sen and Mary Meyer Columbia University, New York and Colorado

More information

Department of Economics, Vanderbilt University While it is known that pseudo-out-of-sample methods are not optimal for

Department of Economics, Vanderbilt University While it is known that pseudo-out-of-sample methods are not optimal for Comment Atsushi Inoue Department of Economics, Vanderbilt University (atsushi.inoue@vanderbilt.edu) While it is known that pseudo-out-of-sample methods are not optimal for comparing models, they are nevertheless

More information

Nonparametric Covariate Adjustment for Receiver Operating Characteristic Curves

Nonparametric Covariate Adjustment for Receiver Operating Characteristic Curves Nonparametric Covariate Adjustment for Receiver Operating Characteristic Curves Radu Craiu Department of Statistics University of Toronto joint with: Ben Reiser (Haifa) and Fang Yao (Toronto) IMS - Pacific

More information

Density Estimation (II)

Density Estimation (II) Density Estimation (II) Yesterday Overview & Issues Histogram Kernel estimators Ideogram Today Further development of optimization Estimating variance and bias Adaptive kernels Multivariate kernel estimation

More information

Flexible Estimation of Treatment Effect Parameters

Flexible Estimation of Treatment Effect Parameters Flexible Estimation of Treatment Effect Parameters Thomas MaCurdy a and Xiaohong Chen b and Han Hong c Introduction Many empirical studies of program evaluations are complicated by the presence of both

More information

Density estimation Nonparametric conditional mean estimation Semiparametric conditional mean estimation. Nonparametrics. Gabriel Montes-Rojas

Density estimation Nonparametric conditional mean estimation Semiparametric conditional mean estimation. Nonparametrics. Gabriel Montes-Rojas 0 0 5 Motivation: Regression discontinuity (Angrist&Pischke) Outcome.5 1 1.5 A. Linear E[Y 0i X i] 0.2.4.6.8 1 X Outcome.5 1 1.5 B. Nonlinear E[Y 0i X i] i 0.2.4.6.8 1 X utcome.5 1 1.5 C. Nonlinearity

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Multivariate Gaussians Mark Schmidt University of British Columbia Winter 2019 Last Time: Multivariate Gaussian http://personal.kenyon.edu/hartlaub/mellonproject/bivariate2.html

More information

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky

A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky A COMPARISON OF POISSON AND BINOMIAL EMPIRICAL LIKELIHOOD Mai Zhou and Hui Fang University of Kentucky Empirical likelihood with right censored data were studied by Thomas and Grunkmier (1975), Li (1995),

More information

Generated Covariates in Nonparametric Estimation: A Short Review.

Generated Covariates in Nonparametric Estimation: A Short Review. Generated Covariates in Nonparametric Estimation: A Short Review. Enno Mammen, Christoph Rothe, and Melanie Schienle Abstract In many applications, covariates are not observed but have to be estimated

More information

Statistical Inference

Statistical Inference Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park

More information

Package dhsic. R topics documented: July 27, 2017

Package dhsic. R topics documented: July 27, 2017 Package dhsic July 27, 2017 Type Package Title Independence Testing via Hilbert Schmidt Independence Criterion Version 2.0 Date 2017-07-24 Author Niklas Pfister and Jonas Peters Maintainer Niklas Pfister

More information

A consistent test of independence based on a sign covariance related to Kendall s tau

A consistent test of independence based on a sign covariance related to Kendall s tau Contents 1 Introduction.......................................... 1 2 Definition of τ and statement of its properties...................... 3 3 Comparison to other tests..................................

More information

CS 7140: Advanced Machine Learning

CS 7140: Advanced Machine Learning Instructor CS 714: Advanced Machine Learning Lecture 3: Gaussian Processes (17 Jan, 218) Jan-Willem van de Meent (j.vandemeent@northeastern.edu) Scribes Mo Han (han.m@husky.neu.edu) Guillem Reus Muns (reusmuns.g@husky.neu.edu)

More information

Gibbs Sampling in Endogenous Variables Models

Gibbs Sampling in Endogenous Variables Models Gibbs Sampling in Endogenous Variables Models Econ 690 Purdue University Outline 1 Motivation 2 Identification Issues 3 Posterior Simulation #1 4 Posterior Simulation #2 Motivation In this lecture we take

More information

Supplemental Appendix to "Alternative Assumptions to Identify LATE in Fuzzy Regression Discontinuity Designs"

Supplemental Appendix to Alternative Assumptions to Identify LATE in Fuzzy Regression Discontinuity Designs Supplemental Appendix to "Alternative Assumptions to Identify LATE in Fuzzy Regression Discontinuity Designs" Yingying Dong University of California Irvine February 2018 Abstract This document provides

More information

Improving linear quantile regression for

Improving linear quantile regression for Improving linear quantile regression for replicated data arxiv:1901.0369v1 [stat.ap] 16 Jan 2019 Kaushik Jana 1 and Debasis Sengupta 2 1 Imperial College London, UK 2 Indian Statistical Institute, Kolkata,

More information

Pointwise convergence rates and central limit theorems for kernel density estimators in linear processes

Pointwise convergence rates and central limit theorems for kernel density estimators in linear processes Pointwise convergence rates and central limit theorems for kernel density estimators in linear processes Anton Schick Binghamton University Wolfgang Wefelmeyer Universität zu Köln Abstract Convergence

More information

Reliability of Coherent Systems with Dependent Component Lifetimes

Reliability of Coherent Systems with Dependent Component Lifetimes Reliability of Coherent Systems with Dependent Component Lifetimes M. Burkschat Abstract In reliability theory, coherent systems represent a classical framework for describing the structure of technical

More information

PRECISE ASYMPTOTIC IN THE LAW OF THE ITERATED LOGARITHM AND COMPLETE CONVERGENCE FOR U-STATISTICS

PRECISE ASYMPTOTIC IN THE LAW OF THE ITERATED LOGARITHM AND COMPLETE CONVERGENCE FOR U-STATISTICS PRECISE ASYMPTOTIC IN THE LAW OF THE ITERATED LOGARITHM AND COMPLETE CONVERGENCE FOR U-STATISTICS REZA HASHEMI and MOLUD ABDOLAHI Department of Statistics, Faculty of Science, Razi University, 67149, Kermanshah,

More information

Local linear multiple regression with variable. bandwidth in the presence of heteroscedasticity

Local linear multiple regression with variable. bandwidth in the presence of heteroscedasticity Local linear multiple regression with variable bandwidth in the presence of heteroscedasticity Azhong Ye 1 Rob J Hyndman 2 Zinai Li 3 23 January 2006 Abstract: We present local linear estimator with variable

More information

Kernel Measures of Conditional Dependence

Kernel Measures of Conditional Dependence Kernel Measures of Conditional Dependence Kenji Fukumizu Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku Tokyo 6-8569 Japan fukumizu@ism.ac.jp Arthur Gretton Max-Planck Institute for

More information

Chain Plot: A Tool for Exploiting Bivariate Temporal Structures

Chain Plot: A Tool for Exploiting Bivariate Temporal Structures Chain Plot: A Tool for Exploiting Bivariate Temporal Structures C.C. Taylor Dept. of Statistics, University of Leeds, Leeds LS2 9JT, UK A. Zempléni Dept. of Probability Theory & Statistics, Eötvös Loránd

More information

Joint Gaussian Graphical Model Review Series I

Joint Gaussian Graphical Model Review Series I Joint Gaussian Graphical Model Review Series I Probability Foundations Beilun Wang Advisor: Yanjun Qi 1 Department of Computer Science, University of Virginia http://jointggm.org/ June 23rd, 2017 Beilun

More information

Financial Econometrics

Financial Econometrics Financial Econometrics Nonlinear time series analysis Gerald P. Dwyer Trinity College, Dublin January 2016 Outline 1 Nonlinearity Does nonlinearity matter? Nonlinear models Tests for nonlinearity Forecasting

More information

Understanding Ding s Apparent Paradox

Understanding Ding s Apparent Paradox Submitted to Statistical Science Understanding Ding s Apparent Paradox Peter M. Aronow and Molly R. Offer-Westort Yale University 1. INTRODUCTION We are grateful for the opportunity to comment on A Paradox

More information

Smooth functions and local extreme values

Smooth functions and local extreme values Smooth functions and local extreme values A. Kovac 1 Department of Mathematics University of Bristol Abstract Given a sample of n observations y 1,..., y n at time points t 1,..., t n we consider the problem

More information

INFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION. 1. Introduction

INFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION. 1. Introduction INFERENCE APPROACHES FOR INSTRUMENTAL VARIABLE QUANTILE REGRESSION VICTOR CHERNOZHUKOV CHRISTIAN HANSEN MICHAEL JANSSON Abstract. We consider asymptotic and finite-sample confidence bounds in instrumental

More information

On Testing Independence and Goodness-of-fit in Linear Models

On Testing Independence and Goodness-of-fit in Linear Models On Testing Independence and Goodness-of-fit in Linear Models arxiv:1302.5831v2 [stat.me] 5 May 2014 Arnab Sen and Bodhisattva Sen University of Minnesota and Columbia University Abstract We consider a

More information

Nonparametric Regression

Nonparametric Regression Nonparametric Regression Econ 674 Purdue University April 8, 2009 Justin L. Tobias (Purdue) Nonparametric Regression April 8, 2009 1 / 31 Consider the univariate nonparametric regression model: where y

More information

On robust and efficient estimation of the center of. Symmetry.

On robust and efficient estimation of the center of. Symmetry. On robust and efficient estimation of the center of symmetry Howard D. Bondell Department of Statistics, North Carolina State University Raleigh, NC 27695-8203, U.S.A (email: bondell@stat.ncsu.edu) Abstract

More information

Identification and Estimation of Partially Linear Censored Regression Models with Unknown Heteroscedasticity

Identification and Estimation of Partially Linear Censored Regression Models with Unknown Heteroscedasticity Identification and Estimation of Partially Linear Censored Regression Models with Unknown Heteroscedasticity Zhengyu Zhang School of Economics Shanghai University of Finance and Economics zy.zhang@mail.shufe.edu.cn

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

Maximum Mean Discrepancy

Maximum Mean Discrepancy Maximum Mean Discrepancy Thanks to Karsten Borgwardt, Malte Rasch, Bernhard Schölkopf, Jiayuan Huang, Arthur Gretton Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia

More information

Statistical Convergence of Kernel CCA

Statistical Convergence of Kernel CCA Statistical Convergence of Kernel CCA Kenji Fukumizu Institute of Statistical Mathematics Tokyo 106-8569 Japan fukumizu@ism.ac.jp Francis R. Bach Centre de Morphologie Mathematique Ecole des Mines de Paris,

More information

arxiv: v2 [stat.ml] 3 Sep 2015

arxiv: v2 [stat.ml] 3 Sep 2015 Nonparametric Independence Testing for Small Sample Sizes arxiv:46.9v [stat.ml] 3 Sep 5 Aaditya Ramdas Dept. of Statistics and Machine Learning Dept. Carnegie Mellon University aramdas@cs.cmu.edu September

More information

A Fully Nonparametric Modeling Approach to. BNP Binary Regression

A Fully Nonparametric Modeling Approach to. BNP Binary Regression A Fully Nonparametric Modeling Approach to Binary Regression Maria Department of Applied Mathematics and Statistics University of California, Santa Cruz SBIES, April 27-28, 2012 Outline 1 2 3 Simulation

More information

arxiv: v1 [math.st] 24 Aug 2017

arxiv: v1 [math.st] 24 Aug 2017 Submitted to the Annals of Statistics MULTIVARIATE DEPENDENCY MEASURE BASED ON COPULA AND GAUSSIAN KERNEL arxiv:1708.07485v1 math.st] 24 Aug 2017 By AngshumanRoy and Alok Goswami and C. A. Murthy We propose

More information

Density estimators for the convolution of discrete and continuous random variables

Density estimators for the convolution of discrete and continuous random variables Density estimators for the convolution of discrete and continuous random variables Ursula U Müller Texas A&M University Anton Schick Binghamton University Wolfgang Wefelmeyer Universität zu Köln Abstract

More information

Smooth simultaneous confidence bands for cumulative distribution functions

Smooth simultaneous confidence bands for cumulative distribution functions Journal of Nonparametric Statistics, 2013 Vol. 25, No. 2, 395 407, http://dx.doi.org/10.1080/10485252.2012.759219 Smooth simultaneous confidence bands for cumulative distribution functions Jiangyan Wang

More information

Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation

Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation COMPSTAT 2010 Revised version; August 13, 2010 Michael G.B. Blum 1 Laboratoire TIMC-IMAG, CNRS, UJF Grenoble

More information

Causal Effect Identification in Alternative Acyclic Directed Mixed Graphs

Causal Effect Identification in Alternative Acyclic Directed Mixed Graphs Proceedings of Machine Learning Research vol 73:21-32, 2017 AMBN 2017 Causal Effect Identification in Alternative Acyclic Directed Mixed Graphs Jose M. Peña Linköping University Linköping (Sweden) jose.m.pena@liu.se

More information

Computation Of Asymptotic Distribution. For Semiparametric GMM Estimators. Hidehiko Ichimura. Graduate School of Public Policy

Computation Of Asymptotic Distribution. For Semiparametric GMM Estimators. Hidehiko Ichimura. Graduate School of Public Policy Computation Of Asymptotic Distribution For Semiparametric GMM Estimators Hidehiko Ichimura Graduate School of Public Policy and Graduate School of Economics University of Tokyo A Conference in honor of

More information

Hypothesis Testing with Kernel Embeddings on Interdependent Data

Hypothesis Testing with Kernel Embeddings on Interdependent Data Hypothesis Testing with Kernel Embeddings on Interdependent Data Dino Sejdinovic Department of Statistics University of Oxford joint work with Kacper Chwialkowski and Arthur Gretton (Gatsby Unit, UCL)

More information

On prediction and density estimation Peter McCullagh University of Chicago December 2004

On prediction and density estimation Peter McCullagh University of Chicago December 2004 On prediction and density estimation Peter McCullagh University of Chicago December 2004 Summary Having observed the initial segment of a random sequence, subsequent values may be predicted by calculating

More information

arxiv: v1 [cs.lg] 22 Jun 2009

arxiv: v1 [cs.lg] 22 Jun 2009 Bayesian two-sample tests arxiv:0906.4032v1 [cs.lg] 22 Jun 2009 Karsten M. Borgwardt 1 and Zoubin Ghahramani 2 1 Max-Planck-Institutes Tübingen, 2 University of Cambridge June 22, 2009 Abstract In this

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Optimal Kernel Shapes for Local Linear Regression

Optimal Kernel Shapes for Local Linear Regression Optimal Kernel Shapes for Local Linear Regression Dirk Ormoneit Trevor Hastie Department of Statistics Stanford University Stanford, CA 94305-4065 ormoneit@stat.stanjord.edu Abstract Local linear regression

More information

Time Series and Forecasting Lecture 4 NonLinear Time Series

Time Series and Forecasting Lecture 4 NonLinear Time Series Time Series and Forecasting Lecture 4 NonLinear Time Series Bruce E. Hansen Summer School in Economics and Econometrics University of Crete July 23-27, 2012 Bruce Hansen (University of Wisconsin) Foundations

More information

Unsupervised Kernel Dimension Reduction Supplemental Material

Unsupervised Kernel Dimension Reduction Supplemental Material Unsupervised Kernel Dimension Reduction Supplemental Material Meihong Wang Dept. of Computer Science U. of Southern California Los Angeles, CA meihongw@usc.edu Fei Sha Dept. of Computer Science U. of Southern

More information

An Introduction to Multivariate Statistical Analysis

An Introduction to Multivariate Statistical Analysis An Introduction to Multivariate Statistical Analysis Third Edition T. W. ANDERSON Stanford University Department of Statistics Stanford, CA WILEY- INTERSCIENCE A JOHN WILEY & SONS, INC., PUBLICATION Contents

More information

Business Statistics. Lecture 10: Correlation and Linear Regression

Business Statistics. Lecture 10: Correlation and Linear Regression Business Statistics Lecture 10: Correlation and Linear Regression Scatterplot A scatterplot shows the relationship between two quantitative variables measured on the same individuals. It displays the Form

More information

ECE 4400:693 - Information Theory

ECE 4400:693 - Information Theory ECE 4400:693 - Information Theory Dr. Nghi Tran Lecture 8: Differential Entropy Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 1 / 43 Outline 1 Review: Entropy of discrete RVs 2 Differential

More information

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model.

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model. Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model By Michael Levine Purdue University Technical Report #14-03 Department of

More information

A Brief Introduction to Copulas

A Brief Introduction to Copulas A Brief Introduction to Copulas Speaker: Hua, Lei February 24, 2009 Department of Statistics University of British Columbia Outline Introduction Definition Properties Archimedean Copulas Constructing Copulas

More information