Sufficient dimension reduction via distance covariance


Sufficient dimension reduction via distance covariance
Wenhui Sheng and Xiangrong Yin, University of Georgia
July 17, 2013

Outline
1. Sufficient dimension reduction
2. The model
3. Distance covariance
4. Methodology
5. Simulation studies
6. Determining d
7. Real data analysis
8. Summary

Sufficient dimension reduction (SDR)
Dimension reduction subspace: Let $B$ be a $p \times q$ matrix with $q \le p$. If $Y \perp\!\!\!\perp X \mid B^T X$, then the space $\mathcal{S}(B)$, spanned by the columns of $B$, is a dimension reduction subspace.

Sufficient dimension reduction (SDR)
Central subspace (CS): If the intersection of all dimension reduction subspaces is itself a dimension reduction subspace, it is called the central subspace, denoted by $\mathcal{S}_{Y|X}$. Cook (1998) and Yin, Li and Cook (2008) showed that under mild conditions the CS exists and is unique. Since a dimension reduction subspace is not unique, the primary interest in SDR is to estimate the CS.

The model
We consider the following regression model:
$$Y \perp\!\!\!\perp X \mid \eta^T X, \qquad (2.1)$$
where $Y$ is a scalar response, $X$ is a $p \times 1$ predictor vector and $\eta$ is a $p \times d$ matrix with $d \le p$. Here $d = \dim(\mathcal{S}_{Y|X})$ is the structural dimension. Our goal is to estimate $\mathcal{S}_{Y|X}$ by finding an $\eta \in \mathbb{R}^{p \times d}$ that satisfies (2.1).

Distance covariance
Székely, Rizzo and Bakirov (2007) proposed distance covariance (DCOV) as a new distance measure of dependence between two random vectors. Let $U \in \mathbb{R}^p$ and $V \in \mathbb{R}^q$ have finite first moments. The DCOV between $U$ and $V$ is the nonnegative number $\mathcal{V}(U, V)$ defined by
$$\mathcal{V}^2(U, V) = \int_{\mathbb{R}^{p+q}} \left| f_{U,V}(t, s) - f_U(t) f_V(s) \right|^2 w(t, s) \, dt \, ds,$$

Distance covariance (cont'd)
where $f_U$ and $f_V$ are the characteristic functions of $U$ and $V$ respectively, and $f_{U,V}$ is their joint characteristic function. Here $|f|^2 = f \bar{f}$ for a complex-valued function $f$, with $\bar{f}$ the conjugate of $f$, and $w(t, s)$ is a specially chosen positive weight function; more details on $w(t, s)$ can be found in Székely, Rizzo and Bakirov (2007) and Székely and Rizzo (2009).

Properties of distance covariance
- $U$ and $V$ are independent if and only if $\mathcal{V}(U, V) = 0$.
- DCOV is effective at measuring nonlinear relationships.
- The sample version of $\mathcal{V}(U, V)$ is very simple, which benefits computation.
- Others.
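To illustrate how simple the sample version is, here is a minimal NumPy sketch of the sample squared distance covariance, using the double-centering formula given later in the talk; the function name `dcov2` is ours, not the authors'.

```python
import numpy as np

def dcov2(U, V):
    """Sample squared distance covariance V_n^2(U, V), following
    Szekely, Rizzo and Bakirov (2007). U and V are samples of size n,
    passed as (n,), (n, p) or (n, q) arrays."""
    U = np.asarray(U, dtype=float).reshape(len(U), -1)
    V = np.asarray(V, dtype=float).reshape(len(V), -1)
    a = np.linalg.norm(U[:, None, :] - U[None, :, :], axis=2)  # a_kl = ||U_k - U_l||
    b = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=2)  # b_kl = ||V_k - V_l||
    # Double centering: A_kl = a_kl - abar_k. - abar_.l + abar_..
    A = a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()
    B = b - b.mean(axis=1, keepdims=True) - b.mean(axis=0, keepdims=True) + b.mean()
    return (A * B).mean()  # equals (1/n^2) * sum_{k,l} A_kl * B_kl
```

For two independent samples, e.g. `dcov2(rng.standard_normal(100), rng.standard_normal(100))`, the statistic is close to zero, consistent with the characterization above.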

The method
We consider the squared distance covariance between $Y$ and $\beta^T X$, where $\beta$ is an arbitrary $p \times d_0$ matrix with $d_0 \le p$:
$$\mathcal{V}^2(\beta^T X, Y) = \int_{\mathbb{R}^{d_0+1}} \left| f_{\beta^T X, Y}(t, s) - f_{\beta^T X}(t) f_Y(s) \right|^2 w(t, s) \, dt \, ds.$$
We show that under a mild condition, solving
$$\max_{\beta^T \Sigma_X \beta = I_{d_0},\; 1 \le d_0 \le p} \mathcal{V}^2(\beta^T X, Y) \qquad (4.1)$$
will yield a basis of the central subspace. In (4.1) we use the scale constraint $\beta^T \Sigma_X \beta = I_{d_0}$, which is necessary to make the maximization procedure work.

The method (cont'd)
Proposition 1. Let $\eta$ be a basis of the CS and $\beta$ a $p \times d_0$ matrix with $d_0 \le d$, $\eta^T \Sigma_X \eta = I_d$ and $\beta^T \Sigma_X \beta = I_{d_0}$. Assume $\mathcal{S}(\beta) \subseteq \mathcal{S}(\eta)$. Then $\mathcal{V}^2(\beta^T X, Y) \le \mathcal{V}^2(\eta^T X, Y)$, with equality if and only if $\mathcal{S}(\beta) = \mathcal{S}(\eta)$.

The method (cont'd)
Proposition 2. Let $\eta$ be a basis of the CS and $\beta$ a $p \times d_1$ matrix with $\eta^T \Sigma_X \eta = I_d$ and $\beta^T \Sigma_X \beta = I_{d_1}$; here $d_1$ may be greater than, less than, or equal to $d$. Suppose $P^T_{\eta(\Sigma_X)} X \perp\!\!\!\perp Q^T_{\eta(\Sigma_X)} X$ and $\mathcal{S}(\beta) \neq \mathcal{S}(\eta)$. Then $\mathcal{V}^2(\beta^T X, Y) < \mathcal{V}^2(\eta^T X, Y)$.

The method (cont'd)
Independence condition: $P^T_{\eta(\Sigma_X)} X \perp\!\!\!\perp Q^T_{\eta(\Sigma_X)} X$, where $P_{\eta(\Sigma_X)}$ denotes the projection onto $\mathcal{S}(\eta)$ relative to the $\Sigma_X$ inner product and $Q_{\eta(\Sigma_X)} = I_p - P_{\eta(\Sigma_X)}$. The independence condition is satisfied when $X$ is normal. More generally, low-dimensional projections of the predictors are approximately multivariate normal (Diaconis and Freedman 1984; Hall and Li 1993).

Estimating the central subspace when d is specified
Suppose $d$ is known (a permutation test will be proposed later to estimate $d$). The estimator of $\eta$ is
$$\hat{\eta}_n = \arg\max_{\beta^T \hat{\Sigma}_X \beta = I_d} \mathcal{V}^2_n(\beta^T X, Y),$$
where $\mathcal{V}^2_n(\beta^T X, Y)$ is the sample version of $\mathcal{V}^2(\beta^T X, Y)$.

Estimating the central subspace when d is specified (cont'd)
The sample version of $\mathcal{V}^2(\beta^T X, Y)$ is
$$\mathcal{V}^2_n(\beta^T X, Y) = \frac{1}{n^2} \sum_{k,l=1}^n A_{kl}(\beta) B_{kl},$$
where
$$a_{kl}(\beta) = \|\beta^T X_k - \beta^T X_l\|, \quad \bar{a}_{k.}(\beta) = \frac{1}{n} \sum_{l=1}^n a_{kl}(\beta), \quad \bar{a}_{.l}(\beta) = \frac{1}{n} \sum_{k=1}^n a_{kl}(\beta), \quad \bar{a}_{..}(\beta) = \frac{1}{n^2} \sum_{k,l=1}^n a_{kl}(\beta),$$
$$A_{kl}(\beta) = a_{kl}(\beta) - \bar{a}_{k.}(\beta) - \bar{a}_{.l}(\beta) + \bar{a}_{..}(\beta), \qquad k, l = 1, \ldots, n.$$
Similarly, define $b_{kl} = |Y_k - Y_l|$ and $B_{kl} = b_{kl} - \bar{b}_{k.} - \bar{b}_{.l} + \bar{b}_{..}$ for $k, l = 1, \ldots, n$.
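The slides do not show how the constrained maximization is carried out numerically. Below is a rough sketch of one generic approach, not the authors' algorithm: sphere $X$ so the constraint $\beta^T \hat{\Sigma}_X \beta = I_d$ becomes orthonormality, reparameterize the orthonormal matrix through a QR factorization, and hand the nonconvex problem to a general-purpose optimizer with multistart. It reuses the `dcov2` sketch above; the name `estimate_eta` and all tuning choices are our assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.linalg import sqrtm

def estimate_eta(X, Y, d, n_starts=5, seed=0):
    """Naive estimator of a basis of the central subspace for known d:
    maximize dcov2(beta^T X, Y) subject to beta^T Sigma_hat beta = I_d."""
    n, p = X.shape
    Sig_half = np.real(sqrtm(np.cov(X, rowvar=False)))   # Sigma_hat^{1/2}
    Sig_half_inv = np.linalg.inv(Sig_half)
    Z = (X - X.mean(axis=0)) @ Sig_half_inv              # sphered predictors

    def neg_obj(w):
        # QR maps an unconstrained p*d vector to an orthonormal p x d matrix,
        # so gamma^T gamma = I_d and beta = Sigma^{-1/2} gamma meets the constraint.
        gamma, _ = np.linalg.qr(w.reshape(p, d))
        return -dcov2(Z @ gamma, Y)

    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):                            # multistart: nonconvex objective
        res = minimize(neg_obj, rng.standard_normal(p * d), method="Nelder-Mead")
        if best is None or res.fun < best.fun:
            best = res
    gamma, _ = np.linalg.qr(best.x.reshape(p, d))
    return Sig_half_inv @ gamma                          # beta^T Sigma_hat beta = I_d
```

Sphering is convenient because distance covariance is invariant to shifts, so $\mathcal{V}^2_n(\beta^T X, Y)$ can be computed directly from the projected sphered data.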

Asymptotic properties
Proposition 3. Assume $\eta$ is a basis of the central subspace $\mathcal{S}_{Y|X}$ with $\eta^T \Sigma_X \eta = I_d$. Suppose the support of $X$ is compact, $E|Y| < \infty$ and $P^T_{\eta(\Sigma_X)} X \perp\!\!\!\perp Q^T_{\eta(\Sigma_X)} X$. Let $\hat{\eta}_n = \arg\max_{\beta^T \hat{\Sigma}_X \beta = I_d} \mathcal{V}^2_n(\beta^T X, Y)$. Then $\hat{\eta}_n$ is a consistent estimator of a basis of $\mathcal{S}_{Y|X}$; that is, there exists a rotation matrix $Q$ with $Q^T Q = I_d$ such that $\hat{\eta}_n \stackrel{P}{\longrightarrow} \eta Q$.

Asymptotic properties (cont'd)
Proposition 4. Assume $\eta$ is a basis of the central subspace $\mathcal{S}_{Y|X}$ with $\eta^T \Sigma_X \eta = I_d$. Suppose the support of $X$ is compact, $E|Y| < \infty$ and $P^T_{\eta(\Sigma_X)} X \perp\!\!\!\perp Q^T_{\eta(\Sigma_X)} X$. Let $\hat{\eta}_n = \arg\max_{\beta^T \hat{\Sigma}_X \beta = I_d} \mathcal{V}^2_n(\beta^T X, Y)$. Then, under the regularity conditions given in the supplementary file, there exists a rotation matrix $Q$ with $Q^T Q = I_d$ such that $\sqrt{n}(\hat{\eta}_n - \eta Q) \stackrel{D}{\longrightarrow} N(0, V_{11})$, where $V_{11}$ is the covariance matrix defined in the supplementary file.

Simulation studies
Consider the following two models. Let $\beta_1 = (1, 0, 0, 0, 0, 0)^T$, $\beta_2 = (0, 1, 0, 0, 0, 0)^T$, $\beta_3 = (1, 1, 1, 0, 0, 0)^T$, $\beta_4 = (1, 0, 0, 0, 1, 3)^T$ and $n = 100$. The models are
(A) $Y = (\beta_1^T X)^2 + \beta_2^T X + 0.1\epsilon$
(B) $Y = (\beta_3^T X)^2 + 3 \sin(\beta_4^T X / 4) + 0.2\epsilon$
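For concreteness, a small simulator for both models under the Part (1) design described on the next slide, assuming standard normal errors (the slides do not state the error distribution):

```python
import numpy as np

def simulate(model, n=100, seed=0):
    """Draw (X, Y) from model A or B with p = 6 and X ~ N(0, I_6)
    (the Part (1) design); eps ~ N(0, 1) is our assumption."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, 6))
    eps = rng.standard_normal(n)
    if model == "A":
        b1 = np.array([1., 0, 0, 0, 0, 0])
        b2 = np.array([0., 1, 0, 0, 0, 0])
        Y = (X @ b1) ** 2 + (X @ b2) + 0.1 * eps
    else:
        b3 = np.array([1., 1, 1, 0, 0, 0])
        b4 = np.array([1., 0, 0, 0, 1, 3])
        Y = (X @ b3) ** 2 + 3 * np.sin(X @ b4 / 4) + 0.2 * eps
    return X, Y
```

Both models have two true directions, so the structural dimension is $d = 2$; e.g. `X, Y = simulate("A")` followed by `estimate_eta(X, Y, 2)` recovers a basis of $\mathrm{span}(\beta_1, \beta_2)$ up to rotation.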

Simulation studies (cont'd)
For each model, three kinds of predictors are considered. Part (1): $X \sim N(0, I_6)$; Part (2): $X$ is continuous but nonnormal; Part (3): $X$ is discrete. We compare with SIR (Li 1991), SAVE (Cook and Weisberg 1991), PHD (Li 1992) and LAD (Cook and Forzani 2009).

Model A
Table: comparison under model A, with $(n, p) = (100, 6)$ in all parts; entries are $m$ (SE).

Method   Part (1)          Part (2)          Part (3)
SIR      0.9025 (0.1184)   0.6283 (0.1834)   0.5422 (0.1877)
PHD      0.8288 (0.1604)   0.8568 (0.1481)   0.7382 (0.2554)
SAVE     0.4227 (0.1822)   0.4688 (0.1859)   0.4945 (0.2506)
LAD      0.2952 (0.1047)   0.2869 (0.0936)   *
Dcov     0.2014 (0.0570)   0.1816 (0.0632)   0.0083 (0.0727)

(* LAD does not work sometimes.)

Model B
Table: comparison under model B, with $(n, p) = (100, 6)$ in all parts; entries are $m$ (SE).

Method   Part (1)          Part (2)          Part (3)
SIR      0.8606 (0.1719)   0.8585 (0.1705)   0.9607 (0.0648)
PHD      0.8916 (0.1299)   0.9594 (0.0644)   0.7673 (0.1978)
SAVE     0.5870 (0.2487)   0.6626 (0.2811)   0.7987 (0.1780)
LAD      0.2846 (0.1129)   0.2866 (0.1338)   *
Dcov     0.2271 (0.0903)   0.2205 (0.0839)   0.4200 (0.2593)

(* LAD does not work sometimes.)

Determining d
We want to test conditional independence. Given $\beta = (\beta_1, \beta_2, \ldots, \beta_k)$ with $\mathcal{S}(\beta) \subseteq \mathcal{S}_{Y|X}$ and $\beta_\perp = (\beta_{k+1}, \ldots, \beta_p)$ such that $(\beta, \beta_\perp)$ forms an orthogonal basis of $\mathbb{R}^p$, we want to test whether $Y \perp\!\!\!\perp X \mid \beta^T X$ holds. The permutation test we suggest here tests the independence between $(Y, \beta^T X)$ and $\beta_\perp^T X$. Without the further assumption $\beta^T X \perp\!\!\!\perp \beta_\perp^T X$, we can only obtain an upper bound of $d$.

Determining d (cont'd)
We use the permutation test to determine the dimensionality of the central subspace, $d = \dim(\mathcal{S}_{Y|X})$. The sample size is $n = 200$ with $p = 6$; for each model and each part we use 100 replications.

Table: permutation test

Model   Normal   Nonnormal   Discrete
A       93%      100%        100%
B       100%     87%         62%

Entries are the percentages of replications estimating $d = 2$ and $d = 3$; the true structural dimension is $d = 2$ for both models.
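A sketch of how this permutation scheme could be implemented, under our reading of the slide: for each candidate $k$, estimate the first $k$ directions, build an orthogonal complement, and compare the observed distance covariance between $(Y, \hat\beta^T X)$ and $\hat\beta_\perp^T X$ with its permutation distribution. The function names, the significance level, and the sequential search over $k$ are our assumptions; the code reuses `dcov2` and `estimate_eta` from the earlier sketches.

```python
import numpy as np

def perm_pvalue(Y, X_low, X_rest, n_perm=500, seed=0):
    """Permutation p-value for independence of (Y, X_low) and X_rest via dcov2."""
    rng = np.random.default_rng(seed)
    W = np.column_stack([Y, X_low])                    # the (Y, beta^T X) block
    stat = dcov2(W, X_rest)
    null = np.array([dcov2(W, X_rest[rng.permutation(len(Y))])
                     for _ in range(n_perm)])          # permuting rows breaks dependence
    return (1 + np.sum(null >= stat)) / (1 + n_perm)

def estimate_d(X, Y, alpha=0.05):
    """Smallest k whose test fails to reject, as a sketch of the scheme."""
    p = X.shape[1]
    for k in range(1, p):
        beta = estimate_eta(X, Y, k)                   # first k Dcov directions
        # Euclidean orthogonal complement of S(beta), via a full QR factorization
        Q, _ = np.linalg.qr(beta, mode="complete")
        beta_perp = Q[:, k:]
        if perm_pvalue(Y, X @ beta, X @ beta_perp) > alpha:
            return k
    return p
```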

Bird, plane or car
This data set concerns the identification of sounds made by birds, planes and cars. A two-hour recording was made in the city of Ermont, France, and 5-second snippets of interesting sounds were then manually selected. This resulted in 58 recordings identified as birds, 44 as cars and 67 as planes. Each recording was further processed and ultimately represented by 13 SDMFCCs (Scale-Dependent Mel-Frequency Cepstrum Coefficients).
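A hypothetical sketch of how the figure below could be produced once the 13 SDMFCC features and class labels are in hand. The file name, variable names, and the numeric coding of the class labels are our assumptions (the slides do not describe them), and `estimate_eta` is the earlier sketch.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data layout: feats is a (169, 13) matrix of SDMFCCs and
# labels an array of strings in {"bird", "plane", "car"}.
feats = np.loadtxt("sdmfcc.csv", delimiter=",")        # hypothetical file
labels = np.loadtxt("labels.csv", dtype=str)           # hypothetical file

Y = np.searchsorted(np.unique(labels), labels).astype(float)  # numeric class codes
beta = estimate_eta(feats, Y, d=2)                     # first two Dcov directions
proj = feats @ beta

for cls, color in [("bird", "black"), ("plane", "red"), ("car", "green")]:
    m = labels == cls
    plt.scatter(proj[m, 0], proj[m, 1], c=color, label=cls, s=12)
plt.xlabel("First Dcov direction")
plt.ylabel("Second Dcov direction")
plt.legend()
plt.show()
```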

Bird, plane or car
Figure: plot of the first two Dcov directions for the birds-planes-cars example (birds in black, planes in red, cars in green).

Summary
1. The article extends the methodology of the single-index paper to the multiple-index model, and it uses a permutation test to estimate the dimensionality of the central subspace.
2. The method has much weaker assumptions on the distribution of the predictors, and it works very efficiently with discrete predictors.
3. The article establishes new theoretical properties of $\mathcal{V}^2(\beta^T X, Y)$.

Thank You! Q & A