Sliced Inverse Regression
Ge Zhao (gzz13@psu.edu)
Department of Statistics, The Pennsylvania State University
Outline
- Background of Sliced Inverse Regression (SIR)
- Dimension Reduction
- Definition of SIR
- Inverse Regression Curve
- Algorithm of SIR
- Discussion on SIR
- Consistency and Sparsity of SIR
Background
- Regression analysis is a popular way of studying the relationship between a response variable $y \in \mathbb{R}$ and its explanatory variable $x \in \mathbb{R}^p$.
- In some cases finding a correct parametric model is not easy, which leads to nonparametric approaches.
- As the dimension grows, more and more data are required in the sample.
- We want an ideal model that captures most or all of the interesting features with the fewest dimensions:
$$y = f(\beta_1^\top x, \beta_2^\top x, \ldots, \beta_K^\top x, \varepsilon), \qquad K \ll p.$$
Dimension Reduction
$$y = f(\beta_1^\top x, \beta_2^\top x, \ldots, \beta_K^\top x, \varepsilon), \qquad K \ll p.$$
- f is not identifiable: it can be an arbitrary function on $\mathbb{R}^{K+1}$.
- The $\beta_k$ themselves can be changed, since $\beta^\top x$ is only a projection onto a K-dimensional space; what matters is the spanned space.
- When K is much smaller than p, we may claim to have reduced the dimension once most of the information about y is retained and $\beta$ is estimated efficiently.
- Estimating the projection directions $\beta$ gives us the new, reduced space. We call each $\beta_k$ an effective dimension reduction (e.d.r.) direction.
Dimension Reduction (Continued)
- Ideal statement: $y = f(\beta_1^\top x, \beta_2^\top x, \ldots, \beta_K^\top x, \varepsilon)$, $K \ll p$.
- Alternative statement: the conditional distribution of y given x depends on x only through the K-dimensional variable $(\beta_1^\top x, \ldots, \beta_K^\top x)$, i.e.,
$$y \perp x \mid \beta^\top x.$$
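To make the model concrete, here is a minimal simulation sketch in Python; the link function f, the directions, and all constants are hypothetical illustrations, not from the slides:

```python
# Toy two-index model: y depends on the p-dimensional x only through
# two linear projections, so K = 2 while p = 20.
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 20
x = rng.standard_normal((n, p))
beta1, beta2 = np.eye(p)[0], np.eye(p)[1]   # hypothetical e.d.r. directions
eps = 0.1 * rng.standard_normal(n)
y = (x @ beta1) / (0.5 + (x @ beta2 + 1.5) ** 2) + eps
```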
Intuition of SIR
- A difficulty arises: when the dimension p is larger than n, regressing y against x directly does not make sense.
- It is hard to view the data (coordinates) with traditional methods due to the high dimension.
- Idea: flip y and x!
Definition of SIR
- Consider the inverse model $x = g(y)$, $g: \mathbb{R} \to \mathbb{R}^p$: each coordinate of x is regressed against the one-dimensional y, so the problem is one-dimensional.
- $E(x \mid y)$ is then a curve in p-dimensional space.
- If possible, this curve will hover around a K-dimensional affine subspace.
- We show later the relationship between this K-dimensional affine subspace and the effective dimension reduction space (spanned by the e.d.r. directions).
Definition of SIR (Continued)
- An affine invariant criterion, the squared trace correlation:
$$R^2(b) = \max_{\beta \in B} \frac{(b^\top \Sigma_{xx} \beta)^2}{(b^\top \Sigma_{xx} b)(\beta^\top \Sigma_{xx} \beta)}.$$
- If x is standardized as
$$z = \Sigma_{xx}^{-1/2}\{x - E(x)\},$$
the inverse regression curve falls into a subspace which coincides with the (standardized) e.d.r. space.
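When B is spanned by the columns of a matrix, the maximization over $\beta \in B$ has a standard closed form; the helper below (an illustrative sketch, not from the slides) computes $R^2(b)$ that way:

```python
import numpy as np

def squared_trace_correlation(b, B, sigma):
    """R^2(b): max over beta in span(B) of the squared correlation between
    b'x and beta'x.  Closed form: b'SB (B'SB)^{-1} B'Sb / (b'Sb), S = cov(x)."""
    b, B = np.asarray(b, float), np.asarray(B, float)
    if B.ndim == 1:
        B = B[:, None]                 # treat a single direction as one column
    sb, SB = sigma @ b, sigma @ B
    num = (b @ SB) @ np.linalg.solve(B.T @ SB, B.T @ sb)
    return num / (b @ sb)
```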
Algorithm of SIR
We have a data set $(y_i, x_i)$, $i = 1, 2, \ldots, n$.
1. Standardize x: $\tilde{x}_i = \hat\Sigma_{xx}^{-1/2}(x_i - \bar{x})$;
2. Divide the range of y into H slices, $I_1, \ldots, I_H$, where slice h contains a proportion $\hat{p}_h$ of the observations;
3. Compute the sample mean of the $\tilde{x}_i$ within each slice, denoted $\hat{m}_h$;
4. Conduct a weighted principal component analysis on the $\hat{m}_h$ via the weighted covariance matrix $\hat{V} = \sum_{h=1}^{H} \hat{p}_h \hat{m}_h \hat{m}_h^\top$;
5. Output $\hat\beta_k = \hat\Sigma_{xx}^{-1/2}\hat\eta_k$, $k = 1, \ldots, K$, where the $\hat\eta_k$ are the eigenvectors of $\hat{V}$ associated with its K largest eigenvalues.
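A direct numpy transcription of steps 1 to 5 (a minimal sketch; the function name, the defaults, and the equal-count slicing via sorting are implementation choices, not prescribed by the slides):

```python
import numpy as np

def sir(x, y, H=10, K=2):
    """Sliced Inverse Regression following the five steps above."""
    n, p = x.shape
    # Step 1: standardize x with the inverse square root of Sigma_xx.
    sigma = np.cov(x, rowvar=False)
    w, U = np.linalg.eigh(sigma)
    root_inv = U @ np.diag(w ** -0.5) @ U.T
    z = (x - x.mean(axis=0)) @ root_inv
    # Step 2: H slices of (nearly) equal counts along the order of y.
    slices = np.array_split(np.argsort(y), H)
    # Steps 3-4: slice means and the weighted covariance matrix V.
    V = np.zeros((p, p))
    for idx in slices:
        m_h = z[idx].mean(axis=0)
        V += (len(idx) / n) * np.outer(m_h, m_h)
    # Step 5: map the top-K eigenvectors of V back to the x scale.
    evals, evecs = np.linalg.eigh(V)           # ascending eigenvalues
    eta = evecs[:, ::-1][:, :K]
    return root_inv @ eta, evals[::-1]         # beta_k = Sigma^{-1/2} eta_k

# e.g. beta_hat, evals = sir(x, y, H=10, K=2) on the toy data above
```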
Remarks
- The sample mean is used just for simplicity; other estimates of the inverse curve work as well, such as kernel-based nonparametric regression, nearest neighbors, or smoothing splines. Here we are only interested in the orientation.
- The weighted version of PCA takes care of unequal slice sample sizes.
- In general we need $\lim_{n\to\infty} p/n = 0$ to guarantee consistency; if this is violated, we need more conditions.
- The first K components locate the most important subspace. We will discuss how to choose K later.
- The last step transforms the directions back to the original scale to obtain $\hat\beta_k$.
Further discussion
- There is no need to standardize the $x_i$; one can instead transform the sliced means directly:
$$\hat\Sigma_1 = \sum_{h=1}^{H} \hat{p}_h (\bar{x}_h - \bar{x})(\bar{x}_h - \bar{x})^\top,$$
where $\bar{x}_h$ is the sample mean within slice h. We then solve the eigenvalue problem of $\hat\Sigma_1$ with respect to $\hat\Sigma_{xx}$ instead of performing PCA on $\hat{V}$.
- One can use other methods, such as a robust version, to standardize x; the purpose is to downweight or cut out influential design points.
Further discussion (Continued)
- Slices can have equal width, but we prefer slice boundaries that vary so that the slices have similar sample sizes.
- We hope the range of each slice converges to 0 so that only local points contribute to the estimation; even with a large number of slices, consistency still holds.
- A common choice of slices is
$$I_h = \left(F_y^{-1}\{(h-1)/H\},\; F_y^{-1}\{h/H\}\right],$$
where $F_y$ is the distribution function of y.
- The choice of H may affect the asymptotic variance of $\hat\beta$, but it is not as critical as the bandwidth in a nonparametric model.
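A small sketch of this quantile slicing rule, using np.quantile as the empirical $F_y^{-1}$ (the boundary conventions and function names are details of this sketch):

```python
import numpy as np

def slice_intervals(y, H):
    """Endpoints of I_h = (F_y^{-1}((h-1)/H), F_y^{-1}(h/H)] from the
    empirical quantile function, giving near-equal slice sample sizes."""
    q = np.quantile(y, np.linspace(0.0, 1.0, H + 1))
    return list(zip(q[:-1], q[1:]))

def slice_labels(y, H):
    """Slice index (0..H-1) of each observation under the same rule."""
    q = np.quantile(y, np.linspace(0.0, 1.0, H + 1))
    return np.clip(np.searchsorted(q[1:-1], y, side="left"), 0, H - 1)
```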
Further discussion (Continued)
- The expectation of the squared trace correlation between the $\hat\beta_k^\top x$ and the $\beta_k^\top x$ is given by
$$E\{R^2(\hat{B})\} = 1 - \frac{p-K}{n}\left(1 + \frac{1}{K}\sum_{k=1}^{K} \lambda_k^{-1}\right) + o\!\left(\frac{1}{n}\right).$$
- To be really successful in picking up all K dimensions for reduction, the inverse regression curve cannot be too straight. In other words, the first K eigenvalues of $\hat{V}$ must be significantly different from zero compared with the sampling error.
Further discussion (Continued)
Theorem. If x is normally distributed, then $n(p-K)\bar\lambda_{p-K}$ asymptotically follows a $\chi^2$ distribution with $(p-K)(H-K-1)$ degrees of freedom, where $\bar\lambda_{p-K}$ denotes the average of the smallest $p-K$ eigenvalues of $\hat{V}$.
We can use this result to assess the number of components to retain.
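Given the eigenvalues of $\hat{V}$ from the algorithm above, the theorem yields a sequential test for the number of components; below is a minimal sketch using scipy.stats.chi2 (the function name and interface are illustrative):

```python
import numpy as np
from scipy.stats import chi2

def test_num_components(evals, n, H, K):
    """Test H0: at most K components, via n(p-K) * mean of the smallest
    p-K eigenvalues ~ chi^2 with (p-K)(H-K-1) df (x normal).
    Returns (statistic, p-value)."""
    p = len(evals)
    lam_bar = np.sort(evals)[: p - K].mean()   # smallest p-K eigenvalues
    stat = n * (p - K) * lam_bar
    return stat, chi2.sf(stat, (p - K) * (H - K - 1))

# In practice: increase K from 0 until the p-value stops being small.
```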
Theoretical results: Conditions
1. The conditional distribution of y given x depends on x only through the K-dimensional variable $(\beta_1^\top x, \ldots, \beta_K^\top x)$, i.e., $y \perp x \mid \beta^\top x$;
2. For any $b \in \mathbb{R}^p$, the conditional expectation $E(b^\top x \mid \beta_1^\top x, \ldots, \beta_K^\top x)$ is a linear combination of $\beta_1^\top x, \ldots, \beta_K^\top x$.
Theoretical results: Inverse regression curve
Theorem. The centered inverse regression curve $E(x \mid y) - E(x)$ is contained in the linear subspace spanned by $\Sigma_{xx}\beta_k$, $k = 1, \ldots, K$, where $\Sigma_{xx}$ denotes the covariance matrix of x.
Here $E\{E(x \mid y)\} = E(x)$ by the law of total expectation. The following corollary is straightforward.
Corollary. Assume x has been standardized to z; then the standardized inverse regression curve $E(z \mid y)$ is contained in the linear space generated by the standardized e.d.r. directions $\eta_1, \ldots, \eta_K$.
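A quick empirical check of the theorem on a toy single-index model (all choices here, including the exponential link, are hypothetical): since $\Sigma_{xx} = I$ in this simulation, the slice means approximating $E(x \mid y)$ should align with $\beta$ and have only small components orthogonal to it.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, H = 2000, 10, 10
x = rng.standard_normal((n, p))
beta = np.zeros(p); beta[0] = 1.0
y = np.exp(x @ beta) + 0.1 * rng.standard_normal(n)

# Slice means approximate E(x | y); Sigma_xx = I here, so the theorem
# says they should (nearly) lie on span{beta}.
means = np.stack([x[idx].mean(axis=0)
                  for idx in np.array_split(np.argsort(y), H)])
along = means @ beta                       # component along beta
resid = means - np.outer(along, beta)      # component orthogonal to beta
print(np.abs(along).max(), np.linalg.norm(resid, axis=1).max())
# the along-beta component dominates the orthogonal residual
```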
Theoretical results: Remarks
- By the law of total variance, $\mathrm{cov}(z) = E\{\mathrm{cov}(z \mid y)\} + \mathrm{cov}\{E(z \mid y)\}$, so
$$E\{\mathrm{cov}(z \mid y)\} = \mathrm{cov}(z) - \mathrm{cov}\{E(z \mid y)\} = I - \mathrm{cov}\{E(z \mid y)\}.$$
Hence the two matrices share eigenvectors, and the eigenvector attaining the largest eigenvalue of $E\{\mathrm{cov}(z \mid y)\}$ is the one attaining the smallest eigenvalue of $\mathrm{cov}\{E(z \mid y)\}$.
- The consistency is at the root-n rate. Let $p_h = \Pr\{y \in I_h\}$ and $m_h = E(z \mid y \in I_h)$. We have $\hat{m}_h \to m_h$ at rate $n^{-1/2}$, and $\hat{V} \to \sum_{h=1}^{H} p_h m_h m_h^\top$ at the same rate.
Conditions for more detailed discussion
1. Linearity condition: for any $b \in \mathbb{R}^p$, $E(b^\top x \mid \beta_1^\top x, \ldots, \beta_K^\top x)$ is a linear combination of $\beta_1^\top x, \ldots, \beta_K^\top x$.
2. Coverage condition: the dimension of the space spanned by the central curve is the same as the dimension of the central space.
3. Boundedness condition: there exist positive constants $C_1$ and $C_2$ such that
$$C_1 \le \lambda_{\min}(\Sigma_{xx}) \le \lambda_{\max}(\Sigma_{xx}) \le C_2,$$
where $\lambda_{\min}$ and $\lambda_{\max}$ are the minimum and maximum eigenvalues of $\Sigma_{xx}$, respectively.
Conditions (Continued)
4. The central curve $E(x \mid y)$ has finite fourth moment and is $\kappa$-sliced stable with respect to y.
- Sliced stability is an intrinsic property of $E(x \mid y)$: if we expect the slice estimate $\frac{1}{H}\sum_h \bar{m}_h \bar{m}_h^\top$ of $\mathrm{var}\{m(y)\}$ to be consistent, we must require that the average loss of variance within each slice, $\frac{1}{c}\sum_i m_{h,i} m_{h,i}^\top - \bar{m}_h \bar{m}_h^\top$, decreases as H increases.
Consistency
Theorem. Assume the conditions all hold. Then for sufficiently large H and n, we have
$$\left\|\hat\Lambda_p - \Lambda_p\right\|_2 \le O_p\left(\frac{1}{H^{\kappa-1}} + \frac{H^2 p}{n} + \sqrt{\frac{H^2 p}{n}}\right),$$
where $\Lambda_p = \mathrm{var}\{E(x \mid y)\}$ and $\hat\Lambda_p$ is its estimate.
A direct corollary: if $p/n \to 0$, we may choose $H = \log(n/p)$ so that the right-hand side converges to 0. Hence $\hat\Lambda_p$ is a consistent estimate of $\Lambda_p = \mathrm{var}\{E(x \mid y)\}$.
Consistency (Continued)
Theorem. Assume the conditions all hold, x is sub-Gaussian, and $\lim_{n\to\infty} p/n = 0$. Then
$$\left\|\hat\Sigma_{xx}^{-1}\hat\Lambda_p - \Sigma_{xx}^{-1}\Lambda_p\right\|_2 \to 0$$
with probability converging to 1 as $n \to \infty$, where $\hat\Sigma_{xx} = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top$.
Consistency (Continued)
Theorem. Assume the conditions all hold and $x \sim N(0, I_p)$. For the single index model $y = f(\beta^\top x, \varepsilon)$:
- When $\lim_{n\to\infty} p/n \in (0, \infty)$, $\|\hat\Lambda_p - \Lambda_p\|_2$ is (as a function of p/n) dominated by $\sqrt{p/n} + p/n$ if $H, n \to \infty$;
- Let $\hat\beta$ be the top PCA eigenvector of $\hat\Lambda_p$. If $\lim_{n\to\infty} p/n \neq 0$, then there exists a positive constant $c(p/n) > 0$ such that $\liminf_{n\to\infty} E\angle(\hat\beta, \beta) > c(p/n)$ with probability converging to 1; in other words, SIR cannot be consistent unless $p/n \to 0$.
Conditions for ultrahigh dimension SIR
We now discuss the case when $p \gg n$.
5. Sparsity: $s = |S| \ll p$, where $S = \{i : \beta_j(i) \neq 0 \text{ for some } j,\ 1 \le j \le K\}$ and $|S|$ is the number of elements in S.
6. $\Sigma_{xx} \in \mathcal{U}(\epsilon_0, \alpha, C)$ and $\max_{1 \le i \le p} r_i$ is bounded, where $r_i$ is the number of non-zero elements in the i-th row of $\Sigma_{xx}$, and
$$\mathcal{U}(\epsilon_0, \alpha, C) = \Big\{\Sigma_{xx} : \max_j \sum_{i : |i-j| > l} |\sigma_{i,j}| \le C l^{-\alpha} \text{ for all } l > 0,\ \text{and}\ 0 < \epsilon_0 \le \lambda_{\min}(\Sigma_{xx}) \le \lambda_{\max}(\Sigma_{xx}) \le \frac{1}{\epsilon_0}\Big\}.$$
Conditions for ultrahigh dimension SIR (Continued)
We are still in the case $p \gg n$.
7. There exist positive constants C and $\omega$ such that $\mathrm{var}[E\{x(k) \mid y\}] > C/s^{\omega}$ whenever $E\{x(k) \mid y\}$ is not constant.
8. There exists a constant K such that every coordinate x(k) is sub-Gaussian and upper-exponentially bounded by K.
Now we have the following theorems.
Ultrahigh dimension consistency
Theorem. Assume the conditions hold and let $t = a/s^{\omega}$, where a is a sufficiently small positive constant such that $t < \mathrm{var}\{m(y, k)\}/2$ for any $k \in T$. Then:
1. $\hat{T} \subset T$ holds with probability at least $1 - C_1 \exp\left\{-C_2 \frac{n}{H^2 s^{\omega}} + C_3 \log(H) + \log(p-s)\right\}$;
2. $T \subset \hat{T}$ holds with probability at least $1 - C_4 \exp\left\{-C_5 \frac{n}{H^2 s^{\omega}} + C_6 \log(H) + \log(s)\right\}$;
for some positive constants $C_1, \ldots, C_6$.
Ultrahigh dimension consistency (Continued)
Theorem. Under the same assumptions and the same choice of t as in the previous theorem, let $\hat{T} = \hat{T}(t)$ and $H = \log\{n/(s^{\omega}\log p)\}$. Then
$$\left\|\hat\Lambda_p^{\hat{T},\hat{T}} - \Lambda_p\right\|_2 \to 0, \qquad n \to \infty,$$
with probability converging to 1. As a direct corollary,
$$\left\|\hat\Sigma_{X}^{-1}\hat\Lambda_p^{\hat{T},\hat{T}} - \Sigma_{X}^{-1}\Lambda_p\right\|_2 \to 0, \qquad n \to \infty,$$
with probability converging to 1.
Ultrahigh dimension algorithm
1. Calculate $\mathrm{var}_{H,c}\{x(k)\}$ for $k = 1, \ldots, p$ according to
$$\mathrm{var}_{H,c}\{x(k)\} = \frac{1}{H-1}\sum_{h=1}^{H} \{\bar{x}_h(k) - \bar{x}(k)\}^2;$$
2. Let $\hat{T} = \{k : \mathrm{var}_{H,c}\{x(k)\} > t\}$ for an appropriate threshold t;
3. Let $\hat\Lambda_p^{\hat{T},\hat{T}}$ be the SIR estimator of the conditional covariance matrix for the selected data $(y, x(\hat{T}))$, computed as
$$\hat\Lambda = \frac{1}{H-1}\sum_{h=1}^{H} (\bar{x}_h - \bar{x})(\bar{x}_h - \bar{x})^\top;$$
Ultrahigh dimension algorithm (Continued)
4. Calculate $\hat\eta_i = e(\hat\eta_i^{\hat{T}})$, where $\hat\eta_i^{\hat{T}}$, $1 \le i \le K$, are the top K eigenvectors of $\hat\Lambda^{\hat{T},\hat{T}}$ and $e(\cdot)$ embeds a vector on $\hat{T}$ back into $\mathbb{R}^p$;
5. Calculate $\hat\beta_i = \hat\Sigma_{xx}^{-1}\hat\eta_i$, where $\hat\Sigma_{xx}$ is a consistent estimate of $\Sigma_{xx}$;
6. The central space is estimated by the span of the $\hat\beta_i$'s.
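A loose numpy sketch of steps 1 to 6; the median threshold default and the small ridge term (added so that $\hat\Sigma_{xx}$ is invertible when $p \ge n$) are placeholders of this sketch, not part of the theory above:

```python
import numpy as np

def dt_sir(x, y, H=10, K=1, t=None):
    """Screen coordinates by the sliced variance, run SIR on the kept
    block, and embed the directions back into R^p (steps 1-6 above)."""
    n, p = x.shape
    xbar = x.mean(axis=0)
    slices = np.array_split(np.argsort(y), H)
    xbar_h = np.stack([x[idx].mean(axis=0) for idx in slices])   # H x p
    # Step 1: coordinatewise sliced variances var_{H,c}{x(k)}.
    v = ((xbar_h - xbar) ** 2).sum(axis=0) / (H - 1)
    # Step 2: threshold (theory: t ~ a/s^omega; median rule is a placeholder).
    t = np.median(v) if t is None else t
    T = np.flatnonzero(v > t)
    # Step 3: SIR matrix on the selected block.
    m = xbar_h[:, T] - xbar[T]
    lam = m.T @ m / (H - 1)
    # Step 4: top-K eigenvectors, zero-padded back into R^p.
    _, evecs = np.linalg.eigh(lam)
    eta = np.zeros((p, K))
    eta[T] = evecs[:, ::-1][:, :K]
    # Step 5: beta_i = Sigma_xx^{-1} eta_i, with a naive ridge for p >= n.
    sigma = np.cov(x, rowvar=False) + 1e-3 * np.eye(p)
    return np.linalg.solve(sigma, eta), T
```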
Summary
- Introduced Sliced Inverse Regression (SIR)
- Provided the SIR algorithm
- Discussed the consistency of SIR
- Extended the original SIR to ultrahigh dimension SIR
Thank you!