Simultaneous envelopes for multivariate linear regression


R. Dennis Cook and Xin Zhang

December 2, 2013

Abstract

We introduce envelopes for simultaneously reducing the predictors and the responses in multivariate linear regression, so that the regression then depends only on estimated linear combinations of X and Y. We use a likelihood-based objective function for estimating envelopes and then propose algorithms for estimation of a simultaneous envelope as well as for basic Grassmann manifold optimization. The asymptotic properties of the resulting estimator are studied under normality and extended to general distributions. We also investigate likelihood ratio tests and information criteria for determining the simultaneous envelope dimensions. Simulation studies and real data examples show substantial gains over classical methods such as partial least squares, canonical correlation analysis and reduced-rank regression. This article has supplementary material available online.

Key Words: Canonical correlations; envelope model; Grassmann manifold; partial least squares; principal component analysis; reduced-rank regression; sufficient dimension reduction.

1 Introduction

Multivariate linear regression with predictors $X \in \mathbb{R}^p$ and responses $Y \in \mathbb{R}^r$ is a cornerstone of multivariate statistics. When p and r are not small, it is widely recognized that reducing the dimensionalities of X and Y may often result in improved performance. Perhaps the most popular methods for reducing the number of predictors and responses are principal component analysis (PCA), partial least squares (PLS) regression, canonical correlation analysis (CCA) and reduced-rank regression (RRR).

R. Dennis Cook is Professor, School of Statistics, University of Minnesota, Minneapolis, MN 55455 (E-mail: dennis@stat.umn.edu). Xin Zhang is a Ph.D. student, School of Statistics, University of Minnesota, Minneapolis, MN 55455 (E-mail: zhan0648@umn.edu).

Principal component analysis (Jolliffe, 1986, 2005) is an unsupervised dimension reduction method designed to select orthogonal linear combinations of X and Y with maximal variation. However, it does not use any information about the relationship between X and Y, and thus separate PCA reductions of X and Y could be ineffective for regression. Partial least squares was proposed in the form of iterative algorithms, NIPALS (Wold, 1966) and SIMPLS (de Jong, 1993), for predictor reduction. These methods, which are used extensively in chemometrics, reduce the predictors by iteratively estimating linear combinations of them that have maximal covariance with the response vector. Canonical correlation analysis (Hotelling, 1936; Anderson, 1984) is used to investigate the overall correlations between the two sets of variables X and Y. It can simultaneously reduce X and Y by finding pairs of linear combinations such that the correlations of these pairs are in descending order and components are uncorrelated across different pairs. A probabilistic interpretation of CCA was given by Bach and Jordan (2005) as a latent variable model for two normal random vectors. Reduced-rank regression (Izenman, 1975; Reinsel and Velu, 1998) restricts the rank of the regression coefficient matrix and therefore improves prediction by reducing the number of parameters in the model.

Sufficient dimension reduction methods for multivariate responses are also available for reducing dimensionality. For example, sliced inverse regression (Li, 1991) was extended to multivariate response data by Li et al. (2003). Li, Wen and Zhu (2008) proposed projective resampling to deal with a multivariate regression by using univariate reduction methods. However, such methods are beyond the scope of this work since they are designed to estimate only a subspace as a preliminary step in an analysis. In contrast, we work in the context of the multivariate linear model with a view toward prediction and coefficient estimation.

Informally, multivariate linear regression can involve both material and immaterial variation in the responses and in the predictors. Material variation provides information that is directly relevant to the regression, while the immaterial variation is essentially irrelevant to the regression and serves only to increase estimative variation. Envelopes, which were introduced by Cook, Li and Chiaromonte (2010) for response reduction, use a subspace to envelop the material information and thereby exclude the immaterial variation.

Essentially a form of targeted dimension reduction, this process can lead to substantial efficiency gains when the immaterial variation is large relative to the material variation. Cook, Helland and Su (2013) adapted envelopes to the predictors, and showed that the SIMPLS algorithm for partial least squares regression converges to an envelope in the predictor space. Following Cook et al. (2010), they demonstrated that using a likelihood-based objective function to separate the material and immaterial variation and to provide an estimator of the coefficient matrix produces clear and often substantial estimative and predictive advantages over SIMPLS. However, little is known about using envelopes for joint reduction of the responses and predictors.

The previous developments kindle a hope that we can combine their advantages to produce efficiency gains that are greater than those possible by reducing either the responses or the predictors alone. In this article, we develop likelihood-based envelope methods for simultaneously separating the material and immaterial variation in the responses and in the predictors. We show a potential for synergy in a synchronized reduction, producing an overall reduction in estimative variation surpassing that indicated by the marginal reductions. Finding a likelihood-based envelope can be computationally challenging, and so we propose a novel and fast optimization algorithm.

The rest of this paper is organized as follows. In Section 2, we briefly review the algebraic definition of an envelope as a prelude to our development of simultaneous reduction, which starts in Section 3. We depart a bit from convention and integrate our review of individual response and predictor envelopes into our development of simultaneous envelopes. We also link the simultaneous envelope method with partial least squares and canonical correlation analysis. In Section 4 we introduce a likelihood-based objective function that includes the objective functions used by Cook et al. (2010) and Cook et al. (2013) as special cases. Novel algorithms for estimating a simultaneous envelope are also given in that section. In Section 5, asymptotic properties of the simultaneous envelope estimators are studied under normality and under general distributional assumptions. Encouraging simulation results on prediction and on determining the dimensions of envelopes are given in Section 7. In Section 8, we demonstrate the superiority of the simultaneous envelope estimator compared to classical methods by simulations and by predicting the contents of biscuit dough samples. Proofs and other technical details are included in the Supplement to this article.

The following notation and definitions will be used in our exposition. Let $A \sim B$ denote that A has the same distribution as B, let $A \perp\!\!\!\perp B$ denote that A is independent of B, and let $A \perp\!\!\!\perp B \mid C$ indicate that A is conditionally independent of B given C. Let $\mathbb{R}^{m \times n}$ be the set of all real $m \times n$ matrices and let $\mathbb{S}^{k \times k}$ be the set of all real symmetric $k \times k$ matrices. If $M \in \mathbb{R}^{m \times n}$ then $\mathrm{span}(M) \subseteq \mathbb{R}^m$ is the subspace spanned by the columns of M. We use $\hat{\theta}_u$ to denote the sample estimator of $\theta$ with the parameters u known. We use $P_{A(V)} = A(A^T V A)^{-1} A^T V$ to denote the projection onto $\mathrm{span}(A)$ with the V inner product and $P_A$ to denote the projection onto $\mathrm{span}(A)$ with the identity inner product. Let $Q_{A(V)} = I - P_{A(V)}$. For an $m \times n$ matrix A and a $p \times q$ matrix B, their direct sum is defined as the $(m+p) \times (n+q)$ block diagonal matrix $A \oplus B = \mathrm{diag}(A, B)$. We will also use the $\oplus$ operator for two subspaces: if $\mathcal{S} \subseteq \mathbb{R}^p$ and $\mathcal{R} \subseteq \mathbb{R}^q$ then $\mathcal{S} \oplus \mathcal{R} = \mathrm{span}(S \oplus R)$, where S and R are basis matrices for $\mathcal{S}$ and $\mathcal{R}$. The sum of two subspaces $\mathcal{S}_1$ and $\mathcal{S}_2$ of $\mathbb{R}^m$ is defined as $\mathcal{S}_1 + \mathcal{S}_2 = \{v_1 + v_2 : v_1 \in \mathcal{S}_1, v_2 \in \mathcal{S}_2\}$. We will use the operators $\mathrm{vec}: \mathbb{R}^{a \times b} \to \mathbb{R}^{ab}$, which vectorizes an arbitrary matrix by stacking its columns, and $\mathrm{vech}: \mathbb{R}^{a \times a} \to \mathbb{R}^{a(a+1)/2}$, which vectorizes a symmetric matrix by stacking the elements of each column that lie on or below the diagonal.

The estimation of envelopes will eventually be done by optimizing over Grassmann manifolds. A Grassmann manifold is the collection of all linear subspaces of a vector space of a given dimension. We use $\mathcal{G}_{(k,n)}$ to denote the set of all k-dimensional subspaces of $\mathbb{R}^n$, which is a Grassmann manifold of dimension $k(n-k)$. For more background on Grassmann manifolds and Grassmann optimization, see Edelman, Arias and Smith (1998). There are two currently available packages for Grassmann manifold optimization: the R package GrassmannOptim by Adragni, Cook and Wu (2012) and the MATLAB package sg_min by Ross A. Lippert (http://web.mit.edu/~ripper/www/software/).
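To fix ideas, the projection and vech operators above can be spelled out in a few lines of code. The following is a minimal numpy sketch of the notation only; the function names are ours, not from the packages cited above.

```python
# Minimal numpy sketch of the notation above (function names are ours, not from the cited packages).
import numpy as np

def proj(A, V=None):
    """P_{A(V)} = A (A^T V A)^{-1} A^T V; identity inner product when V is None."""
    if V is None:
        V = np.eye(A.shape[0])
    return A @ np.linalg.solve(A.T @ V @ A, A.T @ V)

def vech(M):
    """Stack, column by column, the elements of a symmetric matrix on or below the diagonal."""
    return np.concatenate([M[j:, j] for j in range(M.shape[0])])

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 2))
V = np.eye(5) + 0.1 * np.ones((5, 5))     # any positive definite inner product
P = proj(A, V)
assert np.allclose(P @ P, P)              # projections are idempotent
assert np.allclose(P @ A, A)              # and leave span(A) fixed
print(vech(np.eye(3)))                    # [1. 0. 0. 1. 0. 1.]
```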

2 Definition of an envelope

In this section we briefly review definitions and a property of reducing subspaces and envelopes.

Definition 1. A subspace $\mathcal{R} \subseteq \mathbb{R}^p$ is said to be a reducing subspace of $M \in \mathbb{R}^{p \times p}$ if and only if $\mathcal{R}$ decomposes M as $M = P_{\mathcal{R}} M P_{\mathcal{R}} + Q_{\mathcal{R}} M Q_{\mathcal{R}}$. If $\mathcal{R}$ is a reducing subspace of M, we say that $\mathcal{R}$ reduces M.

This definition of a reducing subspace is equivalent to the usual definition found in functional analysis (Conway, 1990) and in the literature on invariant subspaces, but the underlying notion of reduction is incompatible with how it is usually understood in statistics. Nevertheless, it is common terminology in those areas and is the basis for the definition of an envelope (Cook et al., 2010), which is central to our developments.

Definition 2. Let $M \in \mathbb{S}^{p \times p}$ and let $\mathcal{B} \subseteq \mathrm{span}(M)$. Then the M-envelope of $\mathcal{B}$, denoted by $\mathcal{E}_M(\mathcal{B})$, is the intersection of all reducing subspaces of M that contain $\mathcal{B}$.

The intersection of two reducing subspaces of M is still a reducing subspace of M. This means that $\mathcal{E}_M(\mathcal{B})$, which is unique by its definition, is the smallest reducing subspace containing $\mathcal{B}$. Also, the M-envelope of $\mathcal{B}$ always exists because of the requirement $\mathcal{B} \subseteq \mathrm{span}(M)$.

The following proposition from Cook et al. (2010) gives a characterization of envelopes.

Proposition 1. If $M \in \mathbb{S}^{p \times p}$ has $q \le p$ eigenspaces, then the M-envelope of $\mathcal{B} \subseteq \mathrm{span}(M)$ can be constructed as $\mathcal{E}_M(\mathcal{B}) = \sum_{i=1}^q P_i \mathcal{B}$, where $P_i$ is the projection onto the i-th eigenspace of M.

From this proposition, we see that the M-envelope of $\mathcal{B}$ is the sum of the eigenspaces of M that are not orthogonal to $\mathcal{B}$; that is, the eigenspaces of M onto which $\mathcal{B}$ projects non-trivially. This implies that the envelope is the span of some subset of the eigenvectors of M. In the regression context, $\mathcal{B}$ is typically the span of a regression coefficient matrix or a matrix of covariances, and M is chosen as a covariance matrix, which is usually positive definite.
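As a concrete illustration of Proposition 1, the following hedged numpy sketch constructs a basis of $\mathcal{E}_M(\mathcal{B})$ by retaining the eigenvectors of M that are not orthogonal to $\mathrm{span}(B)$. For simplicity it assumes the eigenvalues of M are distinct, so that each eigenspace is one-dimensional; the helper name is ours.

```python
# Hedged sketch of Proposition 1 (assumes the eigenvalues of M are distinct).
import numpy as np

def envelope_basis(M, B, tol=1e-10):
    _, evecs = np.linalg.eigh(M)          # eigenvectors of the symmetric matrix M
    keep = [v for v in evecs.T if np.linalg.norm(v @ B) > tol]   # not orthogonal to span(B)
    return np.column_stack(keep)          # orthonormal basis of the sum of retained eigenspaces

# Toy example: the eigenvectors of M are the coordinate axes, and span(B) projects
# non-trivially onto the first and third of them only, so E_M(span(B)) = span(e1, e3).
M = np.diag([5.0, 2.0, 1.0])
B = np.array([[1.0], [0.0], [2.0]])
print(envelope_basis(M, B).shape)         # (3, 2)
```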

3 Simultaneous envelopes

3.1 Definition and structure

The standard multivariate linear model can be written as

$$Y = \mu_Y + \beta^T (X - \mu_X) + \epsilon, \qquad (3.1)$$

where $\mu_Y$ is the mean of Y, $\mu_X$ is the mean of X, $\epsilon$ is the error vector that has mean 0 and variance $\Sigma_{Y|X} > 0$ and is independent of X, and $\beta \in \mathbb{R}^{p \times r}$ is the regression coefficient matrix in which we are primarily interested. We use $\Sigma_X > 0$ and $\Sigma_Y > 0$ to denote the population covariance matrices of X and Y, and use $\Sigma_{XY}$ to denote their cross-covariance matrix. The covariance matrix of the population residual vector $\epsilon$ is then $\Sigma_{Y|X} = \Sigma_Y - \Sigma_{XY}^T \Sigma_X^{-1} \Sigma_{XY}$. Similarly, let $\Sigma_{X|Y} = \Sigma_X - \Sigma_{XY} \Sigma_Y^{-1} \Sigma_{XY}^T$. The sample counterparts of these population covariance matrices are denoted by $S_X$, $S_Y$, $S_{XY}$, $S_{Y|X}$ and $S_{X|Y}$. We also define $\Sigma_{A|B}$ and $S_{A|B}$ in the same way for two arbitrary random vectors A and B.

Envelope methods have the potential to increase efficiency in the estimation of $\beta$ after reducing Y (Cook et al., 2010) and to improve prediction of Y after reducing X (Cook et al., 2013). Our goal is to combine their advantages by simultaneously reducing X and Y to decrease both predictive and estimative variation.

We next give a coordinate representation of simultaneous envelopes. Let $d \le \min(r, p)$ denote the rank of $\beta$ and consider the singular value decomposition $\beta = U D V^T$, where $U \in \mathbb{R}^{p \times d}$ and $V \in \mathbb{R}^{r \times d}$ are semi-orthogonal matrices, $U^T U = I_d = V^T V$, and $D = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ is a diagonal matrix with elements $\lambda_1 \ge \cdots \ge \lambda_d > 0$ being the d singular values of $\beta$. We define the column space (the left eigenspace) and the row space (the right eigenspace) of $\beta$ as $\mathrm{span}(\beta) = \mathrm{span}(U) \equiv \mathcal{L}$ and $\mathrm{span}(\beta^T) = \mathrm{span}(V) \equiv \mathcal{R}$. Then two envelopes can be constructed for simultaneously reducing the predictor space and the response space:

1. X-envelope (left envelope): $\mathcal{E}_{\Sigma_X}(\mathcal{L})$ with $\dim(\mathcal{E}_{\Sigma_X}(\mathcal{L})) = d_X$, $d \le d_X \le p$.

2. Y-envelope (right envelope): $\mathcal{E}_{\Sigma_{Y|X}}(\mathcal{R})$ with $\dim(\mathcal{E}_{\Sigma_{Y|X}}(\mathcal{R})) = d_Y$, $d \le d_Y \le r$.
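The construction above can be mimicked numerically: take the SVD of $\beta$, read off bases for $\mathcal{L} = \mathrm{span}(\beta)$ and $\mathcal{R} = \mathrm{span}(\beta^T)$, and envelop them with $\Sigma_X$ and $\Sigma_{Y|X}$. The hedged sketch below uses illustrative covariance matrices and repeats the hypothetical envelope_basis helper from the sketch in Section 2.

```python
# Hedged sketch: X- and Y-envelopes from the SVD of beta (all inputs illustrative).
import numpy as np

def envelope_basis(M, B, tol=1e-10):      # as before; assumes distinct eigenvalues of M
    _, evecs = np.linalg.eigh(M)
    return np.column_stack([v for v in evecs.T if np.linalg.norm(v @ B) > tol])

p, r = 4, 3
Sigma_X   = np.diag([4.0, 3.0, 2.0, 1.0])            # illustrative covariance of X
Sigma_YgX = np.diag([3.0, 2.0, 1.0])                 # illustrative covariance of Y given X

# A rank-1 beta whose column space lies in the first eigenvector of Sigma_X and whose
# row space lies in the last eigenvector of Sigma_YgX.
beta = np.outer([1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0])

U, D, Vt = np.linalg.svd(beta)
d = int(np.sum(D > 1e-10))
L_span, R_span = U[:, :d], Vt.T[:, :d]               # span(beta) and span(beta^T)

L_env = envelope_basis(Sigma_X,   L_span)            # basis of E_{Sigma_X}(L), d <= d_X <= p
R_env = envelope_basis(Sigma_YgX, R_span)            # basis of E_{Sigma_{Y|X}}(R), d <= d_Y <= r
print(L_env.shape[1], R_env.shape[1])                # 1 1; a generic beta would give p and r
```

The final comment echoes the remark that follows: a $\beta$ generated "at random" projects onto every eigenspace, so no reduction is possible.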

We know from Cook et al. (2010, Proposition 3.1) that $\mathcal{E}_{\Sigma_Y}(\mathcal{R}) = \mathcal{E}_{\Sigma_{Y|X}}(\mathcal{R})$. Consequently, an alternative definition of the Y-envelope is $\mathcal{E}_{\Sigma_Y}(\mathcal{R})$, which then has the same form as the X-envelope after replacing X with Y and $\mathcal{L}$ with $\mathcal{R}$. We use $\mathcal{E}_{\Sigma_{Y|X}}(\mathcal{R})$ as the definition of the Y-envelope to facilitate later parameterizations. If we imagine that the elements of $\beta$ are generated with respect to Lebesgue measure, then it follows from Proposition 1 that no reduction is possible: $\mathcal{E}_{\Sigma_X}(\mathcal{L}) = \mathbb{R}^p$ and $\mathcal{E}_{\Sigma_{Y|X}}(\mathcal{R}) = \mathbb{R}^r$ with probability one. Later in this section we show that proper envelopes imply certain relations between X and Y that may reasonably hold in practice.

From the definitions of the X- and Y-envelopes, if $d = r < p$ then we can reduce X only and $\mathcal{E}_{\Sigma_{Y|X}}(\mathcal{R}) = \mathbb{R}^r$. Similarly, if $d = p < r$ then $\mathcal{E}_{\Sigma_X}(\mathcal{L}) = \mathbb{R}^p$ and reduction is possible only in the response space. Hence, we will assume $d < \min(r, p)$ from now on and discuss the general situation where simultaneous reduction is possible.

Let $L \in \mathbb{R}^{p \times d_X}$ be a semi-orthogonal basis matrix for $\mathcal{E}_{\Sigma_X}(\mathcal{L})$ and let $R \in \mathbb{R}^{r \times d_Y}$ be a semi-orthogonal basis matrix for $\mathcal{E}_{\Sigma_{Y|X}}(\mathcal{R})$. Also, $(L, L_0)$ is an orthogonal basis for $\mathbb{R}^p$ and $(R, R_0)$ is an orthogonal basis for $\mathbb{R}^r$. Then from Definition 1 we can write the covariance matrices as

$$\Sigma_X = L \Omega L^T + L_0 \Omega_0 L_0^T, \qquad (3.2)$$

$$\Sigma_{Y|X} = R \Phi R^T + R_0 \Phi_0 R_0^T. \qquad (3.3)$$

The covariance matrix decomposition in (3.2) indicates that the eigenvectors of $\Sigma_X$ fall in either $\mathcal{E}_{\Sigma_X}(\mathcal{L})$ or $\mathcal{E}_{\Sigma_X}^\perp(\mathcal{L})$, with corresponding eigenvalues being the eigenvalues of $\Omega$ and $\Omega_0$. No relationship is assumed between the eigenvalues of $\Omega$ and $\Omega_0$: the eigenvalues of $\Omega$ could be any subset of the eigenvalues of $\Sigma_X$. Similar results hold for the eigenvalues and eigenvectors of $\Sigma_{Y|X}$, as seen in (3.3).

All of the information that is available about $\beta$ is carried by the reduced variables $R^T Y$ and $L^T X$, which can be seen as follows. Recall the singular value decomposition $\beta = U D V^T$, where $\mathrm{span}(U) = \mathcal{L} \subseteq \mathrm{span}(L)$ and $\mathrm{span}(V) = \mathcal{R} \subseteq \mathrm{span}(R)$ by the definition of the X- and Y-envelopes. Hence, $\beta = U D V^T = (L A) D (B^T R^T)$ for some semi-orthogonal matrices $A \in \mathbb{R}^{d_X \times d}$ and $B \in \mathbb{R}^{d_Y \times d}$. Then $R_0^T \beta^T = 0$, $\beta^T L_0 = 0$ and model (3.1) can be reduced to

$$R^T Y = R^T \mu_Y + \eta^T \{L^T (X - \mu_X)\} + R^T \epsilon, \qquad R_0^T Y = R_0^T \mu_Y + R_0^T \epsilon, \qquad (3.4)$$

where $\eta = A D B^T \in \mathbb{R}^{d_X \times d_Y}$ has rank d. The simultaneous envelope model then becomes

$$Y = R R^T Y + R_0 R_0^T Y = \mu_Y + R \eta^T L^T (X - \mu_X) + \epsilon, \qquad (3.5)$$

with $\Sigma_X$ and $\Sigma_{Y|X}$ given by (3.2) and (3.3). Comparing to (3.1), we see that the regression coefficient matrix is now $\beta = L \eta R^T$, where $\eta$ contains the coordinates of $\beta$ relative to L and R. This implies that the columns and rows of $\beta$ vary only within the Y-envelope and the X-envelope. By letting $L = I_p$ or $R = I_r$, there will be reductions only in the column space or the row space of $\beta$. These are the two special situations studied by Cook et al. (2010) and Cook et al. (2013).

If $L = I_p$, it follows from (3.3) and (3.4) that $\mathrm{cov}(R^T Y, R_0^T Y \mid X) = 0$ and $R_0^T Y \mid X \sim R_0^T Y$, which motivated the construction of response envelopes (Cook et al., 2010). If $\epsilon$ is normally distributed, then this pair of conditions is equivalent to $R_0^T Y \perp\!\!\!\perp R^T Y \mid X$ and $R_0^T Y \mid X \sim R_0^T Y$. If (X, Y) is jointly normal then this pair of conditions is equivalent to $R_0^T Y \perp\!\!\!\perp (R^T Y, X)$. We refer to $R^T Y$ and $R_0^T Y$ as the material and immaterial parts of the responses. This is because $R_0^T Y$ is neither affected by the predictors nor correlated with the complementary part of the responses $R^T Y$, and in this sense makes no contribution to the linear regression.

If $R = I_r$ then, from (3.2) and (3.5), $\mathrm{cov}(Y, L_0^T X \mid L^T X) = 0$ and $\mathrm{cov}(L^T X, L_0^T X) = 0$, which are the two conditions used by Cook et al. (2013, Proposition 2.1) for predictor reduction. If (X, Y) has a joint normal distribution, then this pair of conditions is equivalent to $L_0^T X \perp\!\!\!\perp (L^T X, Y)$. We refer to $L^T X$ and $L_0^T X$ as the material and immaterial parts of the predictors. Similar to the response case, this is because $L_0^T X$ neither affects the response nor is correlated with the rest of the predictors.

The above conditions for L and R are not stated symmetrically because of the nature of regression. Cook et al. (2010) treated X as fixed since $S_X$ is an ancillary statistic for the Y-envelope, while Cook et al. (2013) treated X as random because $S_X$ is not ancillary for X reduction. For simultaneous reductions, we assume that X and Y have a joint distribution throughout this article. The covariance decompositions (3.2) and (3.3) play a critical role in obtaining the above relationships, and they distinguish the envelope reductions from other methods for reducing the column and row dimensions of $\beta$.

The previous relationships follow from the marginal response and predictor envelopes. The following lemma describes additional relationships, between the material part of X (or Y) and the immaterial part of Y (or X), that come with the simultaneous envelopes.

Lemma 1. Assume the simultaneous envelope model (3.5). Then $\mathrm{cov}(R^T Y, L_0^T X) = 0$ and $\mathrm{cov}(L^T X, R_0^T Y) = 0$.

This lemma, which does not require normality of X or Y, is implied by the previous discussion if $L = I_p$ or $R = I_r$. It shows a similarity between simultaneous envelope reduction and canonical correlation analysis: the selected components are uncorrelated with the rest of the components. Additional discussion of the connection between envelopes and canonical correlations is given in Section 3.3.

3.2 A visualized example of the simultaneous envelope

Figure 3.1 (a) and (b) illustrate the working mechanism of the simultaneous envelope reduction for a multivariate regression with two responses $Y = (Y_1, Y_2)^T$ and two predictors $X = (X_1, X_2)^T$. For ease of illustration, we assume that $Y = \beta^T X + \epsilon$ with rank-one regression coefficient matrix $\beta = L R^T$ for some $2 \times 1$ matrices L and R such that $(L, L_0) \in \mathbb{R}^{2 \times 2}$ and $(R, R_0) \in \mathbb{R}^{2 \times 2}$ are orthogonal matrices. The plots then demonstrate the set-up where L and R span the X- and Y-envelopes.

In the first plot, the conditional distribution of $Y \mid X$ is represented by the ellipses, whose axes are the directions of the eigenvectors of $\Sigma_{Y|X}$. The shift from one contour to another is captured by $\beta^T X = R L^T X$, which is in the direction of R and has magnitude proportional to $L^T X$. From Proposition 1, the Y-envelope is the sum of the eigenspaces of $\Sigma_{Y|X}$ that are not orthogonal to R. In this plot, the eigenvector corresponding to the larger eigenvalue of $\Sigma_{Y|X}$ is orthogonal to R and hence represents the immaterial information of the regression. By projecting the data onto $R^T Y$, we will eliminate the immaterial variation in the response. The response envelope reduction in this case is very efficient because R lies in the eigenspace corresponding to the smaller eigenvalue of $\Sigma_{Y|X}$.

The second plot represents the marginal distribution of X. From Proposition 1, reduction by the X-envelope is available when $\mathcal{L}$ is contained in a subset of the eigenspaces of $\Sigma_X$. In this plot, L happens to be the first eigenvector of $\Sigma_X$, which means it spans the X-envelope and $L_0^T X$ is the immaterial information of the regression. It should be pointed out that if we assign Lebesgue measure to this two-dimensional space, the probability of L being one of the two eigenvectors of $\Sigma_X$ is zero. But statistically this event can happen, because it is equivalent to requiring (a) $\mathrm{cov}(Y, L_0^T X \mid L^T X) = 0$ and (b) $\mathrm{cov}(L^T X, L_0^T X) = 0$; see the discussion in Section 3.1. As represented by this plot, the predictor envelope reduction has a great advantage over OLS when L corresponds to the larger eigenvalue of $\Sigma_X$. This can be seen from the first plot, where the magnitude of $L^T X$ is proportional to the strength of the linear relationship.

For a toy data example, we use the meat data analyzed by Cook et al. (2013) for envelope predictor reduction in multivariate linear regression. This dataset consists of spectral measurements from infrared transmittance for 103 meat samples. There are three response variables: percentages of protein, fat and water. The sum of the three percentages is not one because of the other chemical content in the samples. We take $Y_1$ to be protein and $Y_2$ to be the sum of water and fat. We take two spectral measurements, at 910nm and 960nm, for illustration. The estimated X-envelope and Y-envelope are both one-dimensional. The bottom-left plot was constructed by conditioning on high, medium and low values of $\hat{L}^T X$; we can clearly see that the estimated envelope direction $\hat{R}$ matches the major axes of the contours of each sub-sample. So in this example the Y-envelope reduces the dimension but removes little immaterial variation. On the other hand, the bottom-right plot closely resembles the schematic representation shown in the top-right panel. Consequently, the simultaneous envelope method offers a more precise estimate of $\beta$ than the standard method. The standard errors of the OLS coefficient estimates in $\hat\beta_{OLS}$ are 1.2, 12.7, 12.8 and 49.5 times those of the simultaneous envelope estimator.

Figure 3.1: Working mechanism of simultaneous envelope reduction. (a) Schematic representation of response envelope reduction; (b) schematic representation of the envelope in the predictor space; (c) the meat data with the estimated Y-envelope, where the data points are marked differently according to their values of the predictor envelope combination $\hat{L}^T X_i$; (d) the meat data with the estimated X-envelope and the first eigenvector of $S_X$. To help visualization, elliptical contours that cover 90% of the data points were added in (c) and (d).

3.3 Links to PCA, PLS, CCA and RRR

As stated previously, the eigenvalues of $\Omega_0$ could be any subset of the eigenvalues of $\Sigma_X$. When some or all of the largest few eigenvalues of $\Sigma_X$ come from $\Omega_0$, the first few principal components of X will be from the immaterial part of X, which is ineffective for the regression, as we can see from the above relationships. Principal components may be effective only if the larger eigenvalues of $\Sigma_X$ all come from $\Omega$, that is, from the material variation. Similar problems with principal component analysis arise for reducing Y.

As mentioned in the Introduction, the partial least squares method reduces only the predictors, and so it is comparable to the simultaneous envelope method when setting $R = I_r$. Cook et al. (2013) showed that in this case the SIMPLS algorithm (de Jong, 1993) produces a $\sqrt{n}$-consistent estimator of $\mathcal{E}_{\Sigma_X}(\mathcal{L})$, and that a likelihood-based estimator can do much better for prediction than the SIMPLS estimator.

Canonical correlation analysis is widely used for the purpose of simultaneously reducing the predictors and the responses. In the population, it finds canonical pairs of directions $\{a_i, b_i\}$, $i = 1, \ldots, d$, such that the correlations between $a_i^T X$ and $b_i^T Y$ are maximized. The maximization is over the constraints $a_j^T \Sigma_X a_k = 0$, $a_j^T \Sigma_{XY} b_k = 0$ and $b_j^T \Sigma_Y b_k = 0$ for all $j \neq k$, and $a_j^T \Sigma_X a_j = 1$ and $b_j^T \Sigma_Y b_j = 1$ for all j. The solution is then $\{a_i, b_i\} = \{\Sigma_X^{-1/2} e_i, \Sigma_Y^{-1/2} f_i\}$, where $\{e_i, f_i\}$ is the i-th left and right singular vector pair of the correlation matrix $\rho = \Sigma_X^{-1/2} \Sigma_{XY} \Sigma_Y^{-1/2}$.

Lemma 2. Under the simultaneous envelope model (3.5), canonical correlation analysis can find at most d directions in the population, where $d = \mathrm{rank}(\beta) \le \min(d_X, d_Y)$. Moreover, the directions are contained in the simultaneous envelope:

$$\mathrm{span}(a_1, \ldots, a_d) \subseteq \mathcal{E}_{\Sigma_X}(\mathcal{L}), \qquad \mathrm{span}(b_1, \ldots, b_d) \subseteq \mathcal{E}_{\Sigma_{Y|X}}(\mathcal{R}). \qquad (3.6)$$

Hence, canonical correlation analysis may miss some information about the regression by ignoring some material parts of X and/or Y. For example, when r is small, it can find at most r linear combinations of X, which can be insufficient for the regression. Moreover, the most correlated pairs are not, in general, the most predictable pairs for regression. Canonical correlations are often used for data visualization instead of regression, so it may be expected that they can fail in prediction. In the simulation studies of Section 7.1 and Section J of the Supplement, we found that the prediction performance based on canonical correlations varied widely for different covariance structures.
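For comparison with the envelope construction, the population CCA solution quoted above can be computed directly from the covariance matrices. The sketch below is ours (illustrative inputs, our own helper names) and simply implements the $\{\Sigma_X^{-1/2} e_i, \Sigma_Y^{-1/2} f_i\}$ recipe.

```python
# Hedged sketch of the population CCA directions {a_i, b_i} described above.
import numpy as np

def inv_sqrt(S):
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def cca_directions(Sigma_X, Sigma_XY, Sigma_Y):
    Rx, Ry = inv_sqrt(Sigma_X), inv_sqrt(Sigma_Y)
    E, corr, Ft = np.linalg.svd(Rx @ Sigma_XY @ Ry, full_matrices=False)  # rho = E diag(corr) F^T
    return Rx @ E, Ry @ Ft.T, corr          # columns a_i, columns b_i, canonical correlations

# Quick check of the normalization a_i^T Sigma_X a_j = delta_ij on illustrative inputs.
rng = np.random.default_rng(2)
Z = rng.standard_normal((200, 5))
S = np.cov(Z, rowvar=False)
Sigma_X, Sigma_Y, Sigma_XY = S[:3, :3], S[3:, 3:], S[:3, 3:]
A, B, corr = cca_directions(Sigma_X, Sigma_XY, Sigma_Y)
assert np.allclose(A.T @ Sigma_X @ A, np.eye(A.shape[1]), atol=1e-8)
assert np.allclose(B.T @ Sigma_Y @ B, np.eye(B.shape[1]), atol=1e-8)
```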

The maximum likelihood estimator from RRR is obtained as

$$\hat\beta_{RRR} = \arg\min_{\mathrm{rank}(\beta) = d} \sum_{i=1}^n (Y_i - \beta^T X_i)^T S_{Y|X}^{-1} (Y_i - \beta^T X_i), \qquad (3.7)$$

which is the same as the regression coefficient matrix obtained by using the canonical variables (see Reinsel and Velu, 1998, Section 2.4.2). Therefore, for the purposes of estimation and prediction, the maximum likelihood RRR estimator is equivalent to the estimator based on CCA. RRR can also be applied with the identity inner product in (3.7) instead of $S_{Y|X}^{-1}$. For any minimization criterion of RRR, the estimators always have the form $\hat\beta = \hat{A}\hat{B}$, where $\hat{A} \in \mathbb{R}^{p \times d}$ and $\hat{B} \in \mathbb{R}^{d \times r}$ are both $\sqrt{n}$-consistent estimators of their population counterparts A and B. Similar to Lemma 2, we have the following relations by definition:

$$\mathrm{span}(A) = \mathrm{span}(\beta) \subseteq \mathcal{E}_{\Sigma_X}(\mathcal{L}), \qquad \mathrm{span}(B^T) = \mathrm{span}(\beta^T) \subseteq \mathcal{E}_{\Sigma_Y}(\mathcal{R}). \qquad (3.8)$$

Therefore, RRR has the same drawback as CCA: it may lose some material information in the regression. The simulation studies in Section 7 include the RRR estimator based on the identity inner product.

3.4 Potential gain

To gain intuition about the potential advantages of simultaneous envelopes, we next consider the case where $\mathcal{E}_{\Sigma_X}(\mathcal{L})$ and $\mathcal{E}_{\Sigma_Y}(\mathcal{R})$ are known. Estimation of these envelopes in practice will mitigate the findings in this section, but we have nevertheless found them to be useful qualitative indicators of the benefits of simultaneous reduction. The envelope estimator of $\beta$ with known semi-orthogonal basis matrices L and R, denoted by $\hat\beta_{L,R}$, can be written as

$$\hat\beta_{L,R} = L \hat\eta_{L,R} R^T = L (L^T S_X L)^{-1} L^T S_{XY} R R^T = P_{L(S_X)} \hat\beta_{OLS} P_R, \qquad (3.9)$$

where $\hat\beta_{OLS} = S_X^{-1} S_{XY}$ is the ordinary least squares estimator. Clearly, the estimator $\hat\beta_{L,R}$ uses only the material variation in Y and X. It can be obtained by projecting $\hat\beta_{OLS}$ onto the reduced predictor space and the reduced response space, and so does not depend on the particular bases L and R selected. The estimator $\hat\eta_{L,R}$ is the ordinary least squares estimator of the coefficient matrix for the regression of $R^T Y$ on $L^T X$.
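To make (3.9) concrete, the following hedged sketch computes $\hat\beta_{L,R}$ from sample moments when semi-orthogonal bases L and R are treated as known; all inputs are illustrative and the helper name is ours.

```python
# Hedged sketch of the known-basis estimator (3.9): beta_LR = P_{L(S_X)} beta_OLS P_R.
import numpy as np

def beta_envelope_known(X, Y, L, R):
    n = len(X)
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    S_X, S_XY = Xc.T @ Xc / n, Xc.T @ Yc / n
    beta_ols = np.linalg.solve(S_X, S_XY)                     # S_X^{-1} S_XY
    eta = np.linalg.solve(L.T @ S_X @ L, L.T @ S_XY @ R)      # OLS of R^T Y on L^T X
    return L @ eta @ R.T, beta_ols

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 4))
Y = X @ rng.standard_normal((4, 3)) + 0.1 * rng.standard_normal((100, 3))
L = np.linalg.qr(rng.standard_normal((4, 1)))[0]              # arbitrary 1-dimensional bases
R = np.linalg.qr(rng.standard_normal((3, 1)))[0]
beta_lr, beta_ols = beta_envelope_known(X, Y, L, R)
print(np.linalg.matrix_rank(beta_lr))                         # 1: beta_lr lies in span(L) x span(R)
```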

The next proposition shows how the variance of the ordinary least squares estimator with normal predictors can potentially be reduced by using simultaneous envelopes. Let $f_p = n - p - 2$ and $f_x = n - d_X - 2$.

Proposition 2. Assume that $X \sim N_p(\mu_X, \Sigma_X)$, that $n > p + 2$ and that semi-orthogonal basis matrices L, R for the left and right envelopes are known. Then $\mathrm{var}(\mathrm{vec}(\hat\beta_{OLS})) = f_p^{-1} \Sigma_{Y|X} \otimes \Sigma_X^{-1}$ and

$$\mathrm{var}(\mathrm{vec}(\hat\beta_{L,R})) = f_x^{-1} R \Phi R^T \otimes L \Omega^{-1} L^T = f_p f_x^{-1} \mathrm{var}(\mathrm{vec}(\hat\beta_{OLS})) - f_x^{-1} R_0 \Phi_0 R_0^T \otimes L \Omega^{-1} L^T - f_x^{-1} R \Phi R^T \otimes L_0 \Omega_0^{-1} L_0^T - f_x^{-1} R_0 \Phi_0 R_0^T \otimes L_0 \Omega_0^{-1} L_0^T,$$

where $\Omega = L^T \Sigma_X L$, $\Omega_0 = L_0^T \Sigma_X L_0$, $\Phi_0 = R_0^T \Sigma_{Y|X} R_0$ and $\Phi = R^T \Sigma_{Y|X} R$.

This proposition shows that the variation in $\hat\beta_{L,R}$ can be seen in two parts: the first part is the variation in $\hat\beta_{OLS}$ times a constant $f_p f_x^{-1} \le 1$, and the second consists of terms that reduce this value depending on the variances associated with the immaterial information $L_0^T X$ and $R_0^T Y$. If $R = I_r$ then $\Phi = \Sigma_{Y|X}$ and we get the multivariate version of Proposition 2.3 of Cook et al. (2013) for univariate response regression:

$$\mathrm{var}(\mathrm{vec}(\hat\beta_L)) = f_p f_x^{-1} \mathrm{var}(\mathrm{vec}(\hat\beta_{OLS})) - f_x^{-1} \Sigma_{Y|X} \otimes L_0 \Omega_0^{-1} L_0^T. \qquad (3.10)$$

When p is close to n and the X-envelope dimension $d_X$ is small, the constant $f_p f_x^{-1}$ can be small and the gain of $\hat\beta_{L,R}$ over $\hat\beta_{OLS}$ can be substantial. If there is substantial collinearity in the predictors, so that $\Sigma_X$ has some small eigenvalues, and if the corresponding eigenvectors of $\Sigma_X$ fall in $\mathcal{E}_{\Sigma_X}^\perp(\mathcal{L})$, then the variance of $\hat\beta_L$ can be reduced considerably since $\Omega_0^{-1}$ will be large. It is widely known that collinearity in X can increase the variance of $\hat\beta_{OLS}$. However, when the eigenvectors of $\Sigma_X$ corresponding to these small eigenvalues lie in $\mathcal{E}_{\Sigma_X}^\perp(\mathcal{L})$, the variance of $\hat\beta_{L,R}$ is not affected by collinearity. Similarly, if $L = I_p$ then $\Omega = \Sigma_X$ and we get the following new expression for Y reduction:

$$\mathrm{var}(\mathrm{vec}(\hat\beta_R)) = f_p f_x^{-1} \mathrm{var}(\mathrm{vec}(\hat\beta_{OLS})) - f_x^{-1} R_0 \Phi_0 R_0^T \otimes \Sigma_X^{-1}. \qquad (3.11)$$

If the eigenvectors of $\Sigma_{Y|X}$ with larger eigenvalues lie in $\mathcal{E}_{\Sigma_{Y|X}}^\perp(\mathcal{R})$, then the variance of $\hat\beta_R$ may be reduced considerably, since then $\Phi_0$ will be large.

More importantly, the last term of the expansion in Proposition 2 represents a synergy between the X and Y reductions that is not present in the marginal reductions. If the eigenvectors of $\Sigma_{Y|X}$ with large eigenvalues lie in $\mathcal{E}_{\Sigma_{Y|X}}(\mathcal{R})$, or if the eigenvectors of $\Sigma_X$ with small eigenvalues lie in $\mathcal{E}_{\Sigma_X}(\mathcal{L})$, then the variance reductions in either (3.11) or (3.10) could be insignificant. However, the synergy in simultaneous X and Y reductions may still reduce the variance substantially, because one of the factors in the Kronecker product $f_x^{-1} R_0 \Phi_0 R_0^T \otimes L_0 \Omega_0^{-1} L_0^T$ could still be large.

Let $x_N$ denote a new observation from $N(\mu_X, \Sigma_X)$, let $z_N = x_N - \mu_X$ be held fixed and consider $\mathrm{var}(\hat\beta^T z_N) = \mathrm{var}\{(z_N^T \otimes I)\,\mathrm{vec}(\hat\beta^T)\}$, which is the variance of a fitted vector, for $\hat\beta = \hat\beta_{OLS}$ and $\hat\beta = \hat\beta_{L,R}$. It is straightforward from Proposition 2 that

$$f_p \mathrm{var}(\hat\beta_{OLS}^T z_N) = f_x \mathrm{var}(\hat\beta_{L,R}^T z_N) + R_0 \Phi_0 R_0^T (z_N^T L \Omega^{-1} L^T z_N) + R \Phi R^T (z_N^T L_0 \Omega_0^{-1} L_0^T z_N) + R_0 \Phi_0 R_0^T (z_N^T L_0 \Omega_0^{-1} L_0^T z_N).$$

We see from the above equality that only the part of $z_N$ that lies in $\mathcal{E}_{\Sigma_X}(\mathcal{L})$ contributes to the variance in prediction from $\hat\beta_{L,R}$, while the prediction variance from $\hat\beta_{OLS}$ depends on the whole of $z_N$. If, in an extreme case, $z_N \in \mathcal{E}_{\Sigma_X}^\perp(\mathcal{L})$, then $\mathrm{var}(\hat\beta_{L,R}^T z_N) = 0$.

4 Estimating envelopes

Let $C = (X^T, Y^T)^T$ denote the concatenated random vector, which has mean $\mu_C$ and covariance $\Sigma_C$, and let $S_C$ be the sample covariance matrix of C. In order to estimate the parameters of the simultaneous envelope model (3.5), we introduce and probe a likelihood-based objective function that includes the objective functions in Cook et al. (2010) and Cook et al. (2013) as special cases. Variations on the objective function and their corresponding algorithms are also studied. We first give coordinate dependent and coordinate independent representations of $\Sigma_C$ in Section 4.1 to facilitate estimation.

4.1 Structure of $\Sigma_C$

Since we already have the structure of $\Sigma_X$ and $\Sigma_{Y|X}$ in (3.2) and (3.3), and $\Sigma_{XY} = \Sigma_X \beta = L \Omega \eta R^T$, we need only the following expression for $\Sigma_Y$ to complete the necessary ingredients of $\Sigma_C$:

$$\Sigma_Y = \Sigma_{Y|X} + \Sigma_{XY}^T \Sigma_X^{-1} \Sigma_{XY} = R \Phi R^T + R_0 \Phi_0 R_0^T + R \eta^T \Omega L^T (L \Omega L^T + L_0 \Omega_0 L_0^T)^{-1} L \Omega \eta R^T = R(\Phi + \eta^T \Omega \eta) R^T + R_0 \Phi_0 R_0^T.$$

Then we get the coordinate representation of the covariance $\Sigma_C$ as

$$\Sigma_C = \begin{pmatrix} \Sigma_X & \Sigma_{XY} \\ \Sigma_{XY}^T & \Sigma_Y \end{pmatrix} = \begin{pmatrix} L \Omega L^T + L_0 \Omega_0 L_0^T & L \Omega \eta R^T \\ R \eta^T \Omega L^T & R(\Phi + \eta^T \Omega \eta) R^T + R_0 \Phi_0 R_0^T \end{pmatrix}. \qquad (4.1)$$

By noticing from the above expression that $\Sigma_X = P_L \Sigma_X P_L + Q_L \Sigma_X Q_L$, $\Sigma_{XY} = P_L \Sigma_{XY} P_R$ and $\Sigma_Y = P_R \Sigma_Y P_R + Q_R \Sigma_Y Q_R$, we can further obtain the coordinate independent representation

$$\Sigma_C = P_{L \oplus R} \Sigma_C P_{L \oplus R} + Q_{L \oplus R} \Sigma_C Q_{L \oplus R} = P_{L \oplus R} \Sigma_C P_{L \oplus R} + Q_{L \oplus R} \Sigma_D Q_{L \oplus R}, \qquad (4.2)$$

where $\Sigma_D \equiv \Sigma_X \oplus \Sigma_Y$ and $P_{L \oplus R} = P_L \oplus P_R$ is the projection onto the direct sum of the two envelopes, $\mathcal{E}_{\Sigma_X}(\mathcal{L}) \oplus \mathcal{E}_{\Sigma_{Y|X}}(\mathcal{R})$.

So far we have considered $\mathcal{E}_{\Sigma_X}(\mathcal{L})$ and $\mathcal{E}_{\Sigma_{Y|X}}(\mathcal{R})$ as separate subspaces. Motivated by (4.2), the next lemma states that the direct sum of two arbitrary envelopes is itself an envelope. Let $M_1 \in \mathbb{S}^{p_1 \times p_1}$, $M_2 \in \mathbb{S}^{p_2 \times p_2}$, and let $\mathcal{S}_1$ and $\mathcal{S}_2$ be subspaces of $\mathrm{span}(M_1)$ and $\mathrm{span}(M_2)$, as required by Definition 2. Then

Lemma 3. $\mathcal{E}_{M_1}(\mathcal{S}_1) \oplus \mathcal{E}_{M_2}(\mathcal{S}_2) = \mathcal{E}_{M_1 \oplus M_2}(\mathcal{S}_1 \oplus \mathcal{S}_2)$.

From this lemma, we have $\mathcal{E}_{\Sigma_X}(\mathcal{L}) \oplus \mathcal{E}_{\Sigma_{Y|X}}(\mathcal{R}) = \mathcal{E}_{\Sigma_X}(\mathcal{L}) \oplus \mathcal{E}_{\Sigma_Y}(\mathcal{R}) = \mathcal{E}_{\Sigma_X \oplus \Sigma_Y}(\mathcal{L} \oplus \mathcal{R}) = \mathcal{E}_{\Sigma_D}(\mathcal{L} \oplus \mathcal{R})$. We call this the simultaneous envelope for $\beta$.
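The block structure (4.1) and the coordinate-free identity (4.2) are easy to verify numerically. Below is a hedged sketch with arbitrary small dimensions and illustrative parameter values; nothing here comes from the authors' code.

```python
# Hedged numerical check of (4.1)-(4.2): Sigma_C = P Sigma_C P + Q Sigma_D Q,
# with P = P_L (+) P_R the projection onto the direct sum of the two envelopes.
import numpy as np

rng = np.random.default_rng(4)
p, r, dX, dY = 4, 3, 2, 1

QL = np.linalg.qr(rng.standard_normal((p, p)))[0]; L, L0 = QL[:, :dX], QL[:, dX:]
QR = np.linalg.qr(rng.standard_normal((r, r)))[0]; R, R0 = QR[:, :dY], QR[:, dY:]

Omega, Omega0 = np.diag([5.0, 4.0]), np.diag([1.0, 0.5])      # illustrative coordinate parameters
Phi, Phi0 = np.array([[2.0]]), np.diag([8.0, 6.0])
eta = rng.standard_normal((dX, dY))

Sigma_X  = L @ Omega @ L.T + L0 @ Omega0 @ L0.T                                # (3.2)
Sigma_YX = R @ Phi @ R.T + R0 @ Phi0 @ R0.T                                    # (3.3), Sigma_{Y|X}
Sigma_XY = L @ Omega @ eta @ R.T
Sigma_Y  = R @ (Phi + eta.T @ Omega @ eta) @ R.T + R0 @ Phi0 @ R0.T

Sigma_C = np.block([[Sigma_X, Sigma_XY], [Sigma_XY.T, Sigma_Y]])               # (4.1)
Sigma_D = np.block([[Sigma_X, np.zeros((p, r))], [np.zeros((r, p)), Sigma_Y]])

P = np.block([[L @ L.T, np.zeros((p, r))], [np.zeros((r, p)), R @ R.T]])       # P_L (+) P_R
Q = np.eye(p + r) - P
assert np.allclose(Sigma_C, P @ Sigma_C @ P + Q @ Sigma_D @ Q)                 # identity (4.2)
assert np.allclose(Sigma_YX, Sigma_Y - Sigma_XY.T @ np.linalg.solve(Sigma_X, Sigma_XY))
```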

4.2 The estimation criterion and resulting estimators

Assuming multivariate normality for C, the negative log-likelihood minimized over $\mu_C$ leads to the following objective function for estimation of $\Sigma_C$:

$$F(\Sigma_C) = \log|\Sigma_C| + \mathrm{trace}(S_C \Sigma_C^{-1}). \qquad (4.3)$$

We use this as a multi-purpose objective function, which gives the maximum likelihood estimator of $\beta$ under normality of C and a $\sqrt{n}$-consistent simultaneous envelope estimator of $\beta$ when C has finite fourth moments. We also use $F(\cdot)$ as a generic objective function whose definition changes and is implied by its own argument. This should cause no confusion since $F(\cdot)$ will always be written with its arguments.

Substituting the coordinate form of $\Sigma_C$ from (4.1) into $F(\Sigma_C)$ leads to an objective function that can be minimized explicitly over $\Omega$, $\Omega_0$, $\Phi$, $\Phi_0$ and $\eta$ with R and L held fixed. The resulting partially minimized objective function for simultaneous envelopes can be expressed as follows:

$$F(L \oplus R) = \log|(L^T \oplus R^T) S_C (L \oplus R)| + \log|(L^T \oplus R^T) S_D^{-1} (L \oplus R)|, \qquad (4.4)$$

where $S_D = S_X \oplus S_Y$ is the sample version of $\Sigma_D$. Let $|A|_0$ denote the product of all nonzero eigenvalues of a matrix A. Then we have the following coordinate independent representation of (4.4):

$$F(P_{L \oplus R}) = \log|P_{L \oplus R} S_C P_{L \oplus R}|_0 + \log|P_{L \oplus R} S_D^{-1} P_{L \oplus R}|_0. \qquad (4.5)$$

Moreover, if we substitute the coordinate free representation of $\Sigma_C$ from (4.2) into $F(\Sigma_C)$, we immediately get (4.5).

Objective function (4.4) is invariant under orthogonal transformations: for any orthogonal $(d_X + d_Y) \times (d_X + d_Y)$ matrix O, $F(L \oplus R) = F((L \oplus R)O)$. The same result holds if we replace $S_C$ and $S_D$ with their population counterparts. Minimization of $F(L \oplus R)$ is thus a Grassmann optimization problem with a special direct sum structure. Consequently, neither L nor R is identified. However, $\mathcal{E}_{\Sigma_X}(\mathcal{L})$ and $\mathcal{E}_{\Sigma_{Y|X}}(\mathcal{R})$ are identified, and these are all that is needed to estimate $\beta$, $\Sigma_X$ and $\Sigma_{Y|X}$. While the estimators of $\eta$, $\Omega$, $\Omega_0$, $\Phi$ and $\Phi_0$ depend on the particular bases chosen to minimize $F(L \oplus R)$, the estimators of $\beta$, $\Sigma_X$ and $\Sigma_{Y|X}$ are independent of the choice since the estimated projection matrices $P_{\hat{L}}$ and $P_{\hat{R}}$ do not depend on the basis. In short, any values $\hat{L}$ of L and $\hat{R}$ of R that minimize $F(L \oplus R)$ are allowed. Estimators of $L_0$ and $R_0$ are then any semi-orthogonal matrices $\hat{L}_0$ and $\hat{R}_0$ such that $(\hat{L}, \hat{L}_0)$ and $(\hat{R}, \hat{R}_0)$ are orthogonal matrices. The next lemma summarizes the estimators that result from this process.

Lemma 4. Let $\hat{L} \oplus \hat{R} = \arg\min F(L \oplus R)$, where $L \in \mathbb{R}^{p \times d_X}$ and $R \in \mathbb{R}^{r \times d_Y}$ are semi-orthogonal basis matrices. Then the estimators of the remaining parameters are $\hat\Omega = \hat{L}^T S_X \hat{L}$, $\hat\Omega_0 = \hat{L}_0^T S_X \hat{L}_0$, $\hat\eta = (\hat{L}^T S_X \hat{L})^{-1} (\hat{L}^T S_{XY} \hat{R})$, $\hat\Phi_0 = \hat{R}_0^T S_Y \hat{R}_0$ and $\hat\Phi = \hat{R}^T \{S_Y - S_{XY}^T \hat{L} (\hat{L}^T S_X \hat{L})^{-1} \hat{L}^T S_{XY}\} \hat{R}$.

The simultaneous envelope estimator of $\beta$ is then

$$\hat\beta = \hat{L} \hat\eta \hat{R}^T = P_{\hat{L}(S_X)} \hat\beta_{OLS} P_{\hat{R}}. \qquad (4.6)$$

The simultaneous envelope estimator of $\beta$ in (4.6) coincides with the plug-in envelope estimator $\hat\beta_{\hat{L},\hat{R}}$ obtained by regarding the estimated $\hat{L}$ and $\hat{R}$ as the known values of L and R. We next turn to methods for minimizing (4.4).

4.3 Alternating algorithm

If we fix L as an arbitrary orthogonal basis, then the objective function $F(L \oplus R)$ in (4.4) can be re-expressed as an objective function in R:

$$F(R \mid L) = \log|R^T S_{Y | L^T X} R| + \log|R^T S_Y^{-1} R|. \qquad (4.7)$$

Similarly, if we fix R, the objective function $F(L \oplus R)$ reduces to the conditional objective function

$$F(L \mid R) = \log|L^T S_{X | R^T Y} L| + \log|L^T S_X^{-1} L|. \qquad (4.8)$$

We use the following alternating algorithm based on (4.7) and (4.8) to obtain a minimizer of the objective function $F(L \oplus R)$ in (4.4).

1. Initialization. Set the starting value $L^{(0)}$ and get $R^{(0)} = \arg\min_R F(R \mid L^{(0)})$.

2. Alternating. At the k-th stage, obtain $L^{(k)} = \arg\min_L F(L \mid R = R^{(k-1)})$ and then $R^{(k)} = \arg\min_R F(R \mid L = L^{(k)})$.

3. Convergence criterion. Evaluate $\{F(L^{(k-1)} \oplus R^{(k-1)}) - F(L^{(k)} \oplus R^{(k)})\}$ and return to the alternating step if it is bigger than a tolerance; otherwise, stop the iteration and use $L^{(k)} \oplus R^{(k)}$ as the final estimator.

In the initialization step of the algorithm, we could instead set $R^{(0)}$ to some initial value and get $L^{(0)} = \arg\min_L F(L \mid R^{(0)})$. Interchanging the roles of L and R in the alternating step then gives another algorithm, which has the same performance as the alternating algorithm outlined above.

Comparing to the objective function used by Cook et al. (2010), we see that $F(R \mid L)$ is the objective function for estimating the Y-envelope in the regression of Y on the reduced predictors $L^T X$ for fixed L. Similarly, from the objective function used by Cook et al. (2013), we notice that $F(L \mid R)$ is the objective function for estimating the X-envelope in the regression of the reduced responses $R^T Y$ on the predictors X for fixed R.

In our experience, as long as we use good initial values, the alternating algorithm, which monotonically decreases $F(L \oplus R)$, will converge after only a few cycles, typically fewer than four. Root-n consistent starting values are particularly important to mitigate potential problems caused by multiple local minima and to ensure efficient estimation. For instance, under joint normality of X and Y, one Newton-Raphson iteration from any $\sqrt{n}$-consistent estimator will be asymptotically as efficient as the maximum likelihood estimator, even if local minima are present (Small et al., 2000; Lehmann and Casella, 1998, Theorem 4.3).

4.4 One-dimensional optimization

In this section we propose a fast 1D algorithm for $\sqrt{n}$-consistent envelope estimation. The algorithm, which does not itself require starting values, can be used for standalone estimation or to obtain starting values for the alternating algorithm. We state our algorithm in terms of estimating $\mathcal{E}_{\Sigma_X}(\mathcal{L})$ from $f(L) \equiv F(L \mid R = I_r)$; it works similarly for estimating $\mathcal{E}_{\Sigma_{Y|X}}(\mathcal{R})$ and for estimating a general envelope $\mathcal{E}_M(\mathcal{S})$.
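Before turning to the 1D algorithm, the two conditional objectives (4.7) and (4.8) that the alternating algorithm cycles over can be written down explicitly. The hedged sketch below only evaluates them; each arg-min step would be handed to a generic Grassmann optimizer such as the packages cited in Section 1. Function names and data are ours.

```python
# Hedged sketch of the conditional objectives (4.7) and (4.8); the arg-min steps of the
# alternating algorithm would pass these to a Grassmann optimizer (e.g. GrassmannOptim, sg_min).
import numpy as np

def cond_cov(S_AA, S_AB, S_BB):
    """Sample residual covariance S_{A|B} = S_AA - S_AB S_BB^{-1} S_AB^T."""
    return S_AA - S_AB @ np.linalg.solve(S_BB, S_AB.T)

def logdet(M):
    return np.linalg.slogdet(M)[1]

def F_R_given_L(R, L, S_X, S_Y, S_XY):
    """Objective (4.7): log|R^T S_{Y|L^T X} R| + log|R^T S_Y^{-1} R|."""
    S_Y_LX = cond_cov(S_Y, S_XY.T @ L, L.T @ S_X @ L)
    return logdet(R.T @ S_Y_LX @ R) + logdet(R.T @ np.linalg.inv(S_Y) @ R)

def F_L_given_R(L, R, S_X, S_Y, S_XY):
    """Objective (4.8): log|L^T S_{X|R^T Y} L| + log|L^T S_X^{-1} L|."""
    S_X_RY = cond_cov(S_X, S_XY @ R, R.T @ S_Y @ R)
    return logdet(L.T @ S_X_RY @ L) + logdet(L.T @ np.linalg.inv(S_X) @ L)

# Quick evaluation on illustrative data and orthonormal trial bases.
rng = np.random.default_rng(1)
X = rng.standard_normal((150, 4)); Y = rng.standard_normal((150, 3))
Xc, Yc = X - X.mean(0), Y - Y.mean(0)
S_X, S_Y, S_XY = Xc.T @ Xc / 150, Yc.T @ Yc / 150, Xc.T @ Yc / 150
L = np.linalg.qr(rng.standard_normal((4, 2)))[0]
R = np.linalg.qr(rng.standard_normal((3, 1)))[0]
print(F_R_given_L(R, L, S_X, S_Y, S_XY), F_L_given_R(L, R, S_X, S_Y, S_XY))
```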

Optimization over $\mathcal{G}_{(p, d_X)}$, as required in each cycle of the alternating algorithm, can be computationally expensive. The next lemma offers a way to break the $d_X$-dimensional manifold optimization down into fast 1D components.

Lemma 5. Let $M \in \mathbb{S}^{p \times p}$ be a positive definite matrix and let $\mathcal{S} \subseteq \mathbb{R}^p$. Suppose $(B, B_0)$ is an orthogonal basis of $\mathbb{R}^p$, where $B \in \mathbb{R}^{p \times q}$, $B_0 \in \mathbb{R}^{p \times (p-q)}$ and $\mathrm{span}(B) \subseteq \mathcal{E}_M(\mathcal{S})$. Then $v \in \mathcal{E}_{B_0^T M B_0}(B_0^T \mathcal{S})$ implies $B_0 v \in \mathcal{E}_M(\mathcal{S})$.

Suppose we have an estimate B of q directions in the envelope $\mathcal{E}_M(\mathcal{S})$. Then we can estimate the rest of $\mathcal{E}_M(\mathcal{S})$ by focusing our attention on $\mathcal{E}_{B_0^T M B_0}(B_0^T \mathcal{S})$, which is a lower dimensional envelope. This leads to the following 1D algorithm. Let $l_k \in \mathbb{R}^p$, $k = 1, \ldots, d_X$, be the stepwise directions obtained. Let $L_k = (l_1, \ldots, l_k)$, let $(L_k, L_{0k})$ be an orthogonal basis for $\mathbb{R}^p$ and set the initial values $l_0 = L_0 = 0$. Then, for $k = 0, \ldots, d_X - 1$, get

$$v_{k+1} = \arg\min_{v \in \mathbb{R}^{p-k},\, v^T v = 1} f_k(v), \qquad l_{k+1} = L_{0k} v_{k+1}, \qquad (4.9)$$

where $f_k(v) = \log(v^T L_{0k}^T S_{X|Y} L_{0k} v) + \log\{v^T (L_{0k}^T S_X L_{0k})^{-1} v\}$ for $v \in \mathbb{R}^{p-k}$.

This algorithm does not require an initial value search. It starts simply from $l_0 = L_0 = 0$ and builds the estimator sequentially. Suppose we know $\mathrm{span}(L_k) \subseteq \mathcal{E}_{\Sigma_X}(\mathcal{L})$; then we can estimate the remaining part of $\mathcal{E}_{\Sigma_X}(\mathcal{L})$ by substituting $L = (L_k, L_{0k} V)$, where $V \in \mathbb{R}^{(p-k) \times (d_X - k)}$, into $f(L)$, leading to the objective function $\log|V^T L_{0k}^T S_{X|Y} L_{0k} V| + \log|V^T (L_{0k}^T S_X L_{0k})^{-1} V|$, whose one-dimensional version is exactly $f_k(v)$. This describes how the 1D algorithm minimizes the full objective function $f(L)$ sequentially.

Let $\hat{L}_{1D}$ denote the estimator from this algorithm. The following proposition formally states the $\sqrt{n}$-consistency of the 1D algorithm for estimating $\mathcal{E}_{\Sigma_X}(\mathcal{L})$. Similar results hold for estimating $\mathcal{E}_{\Sigma_{Y|X}}(\mathcal{R})$ and for estimating an arbitrary envelope using the 1D algorithm.

Proposition 3. Assume that $S_X$ and $S_{X|Y}$ are $\sqrt{n}$-consistent estimators of $\Sigma_X$ and $\Sigma_{X|Y}$. Let $\hat{L}_{1D}$ denote the estimator obtained from the 1D algorithm. Then $P_{\hat{L}_{1D}}$ is $\sqrt{n}$-consistent for the projection onto $\mathcal{E}_{\Sigma_X}(\mathcal{L})$.
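A minimal sketch of the 1D algorithm (4.9) for the X-envelope is given below. It is not the authors' implementation: a general-purpose optimizer (scipy, with a crude multi-start) stands in for a proper one-dimensional manifold solver, and the helper names are ours. The inputs are the sample covariance $S_X$ and the sample residual covariance $S_{X|Y}$.

```python
# Hedged sketch of the 1D algorithm (4.9) for E_{Sigma_X}(L); a dedicated 1D manifold solver
# would be preferable, here scipy's general-purpose optimizer stands in.
import numpy as np
from scipy.optimize import minimize

def envelope_1d(S_X, S_XgY, d_X):
    """Sequentially build a basis L = (l_1, ..., l_{d_X}) of the estimated X-envelope."""
    p = S_X.shape[0]
    L = np.zeros((p, 0))
    for _ in range(d_X):
        # Columns of L0 span the orthogonal complement of the directions found so far.
        L0 = np.linalg.svd(np.eye(p) - L @ L.T)[0][:, : p - L.shape[1]]
        M = L0.T @ S_XgY @ L0                        # L_0k^T S_{X|Y} L_0k
        Mi = np.linalg.inv(L0.T @ S_X @ L0)          # (L_0k^T S_X L_0k)^{-1}

        def f_k(u):                                  # objective in (4.9), evaluated on u/||u||
            v = u / np.linalg.norm(u)
            return np.log(v @ M @ v) + np.log(v @ Mi @ v)

        starts = np.eye(L0.shape[1])                 # crude multi-start from coordinate directions
        best = min((minimize(f_k, u0) for u0 in starts), key=lambda res: res.fun)
        v = best.x / np.linalg.norm(best.x)
        L = np.column_stack([L, L0 @ v])             # l_{k+1} = L_0k v_{k+1}
    return L

# Illustrative use with simulated data (in practice S_X and S_{X|Y} come from the sample).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Y = X[:, :1] @ rng.standard_normal((1, 3)) + rng.standard_normal((200, 3))
Xc, Yc = X - X.mean(0), Y - Y.mean(0)
S_X, S_Y, S_XY = Xc.T @ Xc / 200, Yc.T @ Yc / 200, Xc.T @ Yc / 200
S_XgY = S_X - S_XY @ np.linalg.solve(S_Y, S_XY.T)
print(envelope_1d(S_X, S_XgY, d_X=1).shape)          # (5, 1)
```

Given $\hat{L}_{1D}$ and the analogous $\hat{R}_{1D}$, the alternating algorithm of Section 4.3 can be started from these $\sqrt{n}$-consistent values, as recommended above.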

We have also considered sequentially minimizing $f(l) = \log(l^T S_{X|Y} l) + \log(l^T S_X^{-1} l)$ subject to $l_i^T l_j = 0$ if $i \neq j$ and $l_i^T l_i = 1$. Although this seems to be a reasonable way of sequentially constructing the envelope, we do not know whether this procedure results in a consistent estimator. Our numerical experience is that the 1D algorithm (4.9) performs much better than such sequential optimization.

The 1D algorithm can be used to get fast $\sqrt{n}$-consistent starting values for the alternating algorithm in Section 4.3 by separately estimating the X-envelope and Y-envelope bases. Since the 1D algorithm turns the optimizations over $d_X$- and $d_Y$-dimensional manifolds into $d_X$ and $d_Y$ sequential optimizations over 1D manifolds, the computational complexity is reduced drastically. For the simulation examples we studied, the computational costs of the 1D algorithm were tens to hundreds of times smaller than the costs of using $d_X$- and $d_Y$-dimensional Grassmann manifold optimizations.

Perhaps more importantly, we have found $\hat{L}_{1D}$ and $\hat{R}_{1D}$ to be practically as efficient as the final estimators obtained by alternating, because the alternating algorithm nearly always converges after only a few iterations and produces only small changes. To emphasize the utility of the 1D estimators, we use them in the simulations and real data examples that follow in later sections. In Section 7.3, we use a simulation example to demonstrate that estimators from the 1D manifold algorithm may have the same behavior as maximum likelihood estimators.

5 Asymptotic properties

The parameters involved in the coordinate representation of the simultaneous envelope model are $\eta$, $\Omega$, $\Omega_0$, $\Phi$, $\Phi_0$, L and R. In Sections 5.1 and 5.2, we focus on the asymptotic properties of the estimable functions $\beta = L \eta R^T$, $\Sigma_X = L \Omega L^T + L_0 \Omega_0 L_0^T$ and $\Sigma_{Y|X} = R \Phi R^T + R_0 \Phi_0 R_0^T$. Specifically, we study the asymptotic covariances of the following parameters $\phi$ and estimable functions h:

$$\phi = \begin{pmatrix} \phi_1 \\ \phi_2 \\ \phi_3 \\ \phi_4 \\ \phi_5 \\ \phi_6 \\ \phi_7 \end{pmatrix} = \begin{pmatrix} \mathrm{vec}(\eta) \\ \mathrm{vec}(L) \\ \mathrm{vec}(R) \\ \mathrm{vech}(\Omega) \\ \mathrm{vech}(\Omega_0) \\ \mathrm{vech}(\Phi) \\ \mathrm{vech}(\Phi_0) \end{pmatrix}, \qquad h = \begin{pmatrix} h_1(\phi) \\ h_2(\phi) \\ h_3(\phi) \end{pmatrix} = \begin{pmatrix} \mathrm{vec}(\beta) \\ \mathrm{vech}(\Sigma_X) \\ \mathrm{vech}(\Sigma_{Y|X}) \end{pmatrix}.$$

Since $h \in \mathbb{R}^{(p+r)(p+r+1)/2}$ and $\phi$ provides $p(p+1)/2 + r(r+1)/2 + d_X d_Y$ free parameters, the simultaneous envelope model reduces the total number of parameters by $(p+r)(p+r+1)/2 - p(p+1)/2 - r(r+1)/2 - d_X d_Y = pr - d_X d_Y$.

5.1 Without the normality assumption

Let $\hat{h}_{full} = (\mathrm{vec}^T(\hat\beta_{OLS}), \mathrm{vech}^T(S_X), \mathrm{vech}^T(S_{Y|X}))^T$ denote the OLS estimator of h under the standard model (3.1), and let $\hat{h}$ denote the estimator from Lemma 4 under the simultaneous envelope model (3.5). The true values of h and $\phi$ are denoted by $h_0$ and $\phi_0$. Define $\Delta \equiv \partial h(\phi)/\partial \phi$ to be the gradient matrix, whose explicit form is given in the Supplement, Section I.1. We use avar to denote an asymptotic covariance matrix: $\mathrm{avar}(\sqrt{n}\,\hat{h}) = A$ is equivalent to $\sqrt{n}(\hat{h} - h_0) \to N(0, A)$ in distribution. We use the expansion matrix (Henderson and Searle, 1979) $E_p \in \mathbb{R}^{p^2 \times p(p+1)/2}$ to relate the vec and vech operations: for any symmetric matrix $M \in \mathbb{S}^{p \times p}$, $\mathrm{vec}(M) = E_p \mathrm{vech}(M)$. Then $\sqrt{n}(\hat{h}_{full} - h_0) \to N(0, \Gamma)$ for some positive definite covariance matrix $\Gamma$.

Since there is a one-to-one relationship between h and $\Sigma_C$, we treat the objective function $F(\Sigma_C)$ in (4.3) as a function of h and $\hat{h}_{full}$ and write it as $F(h, \hat{h}_{full})$. Let $J_h = (1/2)\,\partial^2 F(h, \hat{h}_{full}) / \partial h\, \partial h^T$ evaluated at $\hat{h}_{full} = h = h_0$, which is the Fisher information matrix for h when C is normal,

$$J_h = \begin{pmatrix} \Sigma_{Y|X}^{-1} \otimes \Sigma_X & 0 & 0 \\ 0 & \tfrac{1}{2} E_p^T (\Sigma_X^{-1} \otimes \Sigma_X^{-1}) E_p & 0 \\ 0 & 0 & \tfrac{1}{2} E_r^T (\Sigma_{Y|X}^{-1} \otimes \Sigma_{Y|X}^{-1}) E_r \end{pmatrix}. \qquad (5.1)$$

The following proposition formally states the asymptotic distribution of $\hat{h}$.

Proposition 4. Assume that the data $(x_i, y_i)$ are i.i.d. from a joint distribution with finite fourth moments. Then $\sqrt{n}(\hat{h} - h_0)$ converges in distribution to a normal random variable with mean 0 and covariance matrix $W = \Delta(\Delta^T J_h \Delta)^{\dagger} \Delta^T J_h \Gamma J_h \Delta(\Delta^T J_h \Delta)^{\dagger} \Delta^T$, where $\dagger$ indicates the Moore-Penrose inverse. In particular, $\sqrt{n}(\mathrm{vec}(\hat\beta) - \mathrm{vec}(\beta))$ converges in distribution to a normal random variable with mean 0 and covariance $W_{11}$, the upper-left $pr \times pr$ block of W. Moreover, $\mathrm{avar}(\sqrt{n}\,\hat{h}_{full}) \geq \mathrm{avar}(\sqrt{n}\,\hat{h})$ if $\mathrm{span}(J_h^{1/2} \Delta)$ is a reducing subspace of $J_h^{1/2} \Gamma J_h^{1/2}$.

The $\sqrt{n}$-consistency of the estimator $\hat\beta$ holds essentially because $S_X$, $S_Y$ and $S_{XY}$ are $\sqrt{n}$-consistent and because of the properties of $F(h, \hat{h}_{full})$. The asymptotic covariance matrix $W_{11}$ can be computed straightforwardly, but its accuracy for any fixed sample size may depend on the distribution of C. Fortunately, bootstrap methods can provide a good approximation of $W_{11}$, as discussed in Section 5.3.

5.2 Under the normality assumption

The asymptotic covariance W simplifies when $C \sim N(\mu_C, \Sigma_C)$, because then the objective function $F(h, \hat{h}_{full})$ agrees with the negative log-likelihood function and $\Gamma = J_h^{-1} = \mathrm{avar}(\sqrt{n}\,\hat{h}_{full})$. Moreover, $\{\mathrm{avar}(\sqrt{n}\,\hat{h}_{full}) - \mathrm{avar}(\sqrt{n}\,\hat{h})\}$ will be positive semi-definite because $\mathrm{span}(J_h^{1/2} \Delta)$ is always a reducing subspace of $J_h^{1/2} \Gamma J_h^{1/2} = I$.

Proposition 5. Assume that $C \sim N(\mu_C, \Sigma_C)$. Then $\mathrm{avar}(\sqrt{n}\,\hat{h}) = \Delta(\Delta^T J_h \Delta)^{\dagger} \Delta^T$. Moreover, $\mathrm{avar}(\sqrt{n}\,\hat{h}) \leq \mathrm{avar}(\sqrt{n}\,\hat{h}_{full})$:

$$J_h^{-1} - \Delta(\Delta^T J_h \Delta)^{\dagger} \Delta^T = J_h^{-1/2} Q_{J_h^{1/2} \Delta} J_h^{-1/2} \geq 0.$$

In particular, $\mathrm{avar}(\sqrt{n}\,\mathrm{vec}(\hat\beta)) \leq \mathrm{avar}(\sqrt{n}\,\mathrm{vec}(\hat\beta_{OLS}))$.

This proposition is obtained by direct computation, and is consistent with Cook et al. (2010) and Cook et al. (2013). Also, because the estimator $\hat\beta$ is the MLE under normality, its asymptotic variance is no larger than that of the X-envelope estimator or the Y-envelope estimator. Explicit expressions can be found in the Supplement, Section I. To further interpret this result, we next consider the asymptotic variance of $\mathrm{vec}(\hat\beta)$, which leads to the asymptotic variances for predictions and fitted values.

Proposition 6. Assume that $C \sim N(\mu_C, \Sigma_C)$. Then

$$\mathrm{avar}(\sqrt{n}\,\mathrm{vec}(\hat\beta)) = \mathrm{avar}(\sqrt{n}\,\mathrm{vec}(\hat\beta_{L,R})) + \mathrm{avar}(\sqrt{n}\,\mathrm{vec}(Q_L \hat\beta_{\eta,R})) + \mathrm{avar}(\sqrt{n}\,\mathrm{vec}(\hat\beta_{\eta,L} Q_R)),$$

where we use $\hat\beta_{\eta,R}$, $\hat\beta_{L,R}$ and $\hat\beta_{\eta,L}$ to denote the maximum likelihood estimators when the parameters in their subscripts are known.

Explicit expressions for the asymptotic variances of $\hat\beta_{\eta,R}$, $\hat\beta_{L,R}$ and $\hat\beta_{\eta,L}$ are given in the Supplement, Section I. The first term in the above decomposition, $\mathrm{avar}(\sqrt{n}\,\mathrm{vec}(\hat\beta_{L,R}))$, is the same as that in Proposition 2, which gave the asymptotic variance if we knew bases for the true envelopes. If we set $L = I_p$ in the above decomposition, so that we are pursuing Y reduction only, then the decomposition reduces to that given by Cook et al. (2010, Theorem 6.1). Setting $R = I_r$, which indicates X reduction only, gives the corresponding result of Cook et al. (2013, Proposition 4.4). The projections $Q_L$ and $Q_R$ serve to orthogonalize the random vectors so that the asymptotic variances are additive.

5.3 Residual bootstrap

To illustrate the application of the above bootstrap method, we consider a simple model with $p = r = 3$ and $d_X = d_Y = 1$. We generated L, R and $\eta$ by filling in random numbers from the uniform(0, 1) distribution. Then we orthonormalized L and R and obtained the corresponding $L_0$ and $R_0$. The covariance parameters were $\Omega = 5$, $\Omega_0 = I_2$, $\Phi = 1$ and $\Phi_0 = 10 I_2$. The data vectors $C_i = (X_i^T, Y_i^T)^T$, $i = 1, \ldots, n$, were simulated as $\Sigma_C^{1/2} U_i$, where $U_i$ is a vector of i.i.d. uniform random variables with mean 0 and standard deviation 1. Therefore, $C_i$ follows a distribution with mean 0, covariance $\Sigma_C$ and finite fourth moments. A dataset with n = 100 observations was generated and B = 100 bootstrap datasets were used throughout. We used the bootstrap method to estimate the variances of two estimators: the OLS estimator $\hat\beta_{OLS}$ and the simultaneous envelope estimator $\hat\beta_{1D}$, which was obtained by using the 1D algorithm without alternating.

Table 1 summarizes all $p r = 9$ elements of $\mathrm{vec}(\beta)$, $\mathrm{vec}(\hat\beta_{OLS})$ and $\mathrm{vec}(\hat\beta_{1D})$, together with their asymptotic, bootstrap and actual standard errors. We included the asymptotic standard errors of the elements in $\hat\beta_{OLS}$, which were the square roots of the diagonals in