Indirect High-Dimensional Linear Regression


Matt Galloway
University of Minnesota
December 12. Contact: gall0441@umn.edu.

This study evaluates the performance of a class of indirect regression coefficient estimators designed to perform well in high-dimensional regression settings. We present several simulation studies and compare this class of estimators to other standard methods. This work is heavily based on Cook, Forzani, and Rothman (2013) and Molstad and Rothman (2016).

Keywords: high dimension, big data, abundant, graphical lasso, indirect

Introduction

Consider the classical linear regression model for a univariate response:

    Y = µ_y + β^T (X − µ_x) + ɛ,    (1)

where Y ∈ R, X ∈ R^p, and β ∈ R^p is a vector of unknown regression coefficients. Unlike a controlled experiment in which we take the explanatory variable X to be fixed, in this project we assume X is random. Let X ~ N_p(µ_x, Σ_xx) with Σ_xx > 0, and let ɛ ~ N(0, σ²_{y|x}), so that (X_1, Y_1), ..., (X_n, Y_n) are realizations of the joint multivariate normal distribution of (X, Y):

    (X^T, Y)^T ~ N_{p+1}( (µ_x^T, µ_y)^T, [ Σ_xx  Σ_xy ; Σ_xy^T  σ²_y ] ).

The goal throughout this report is to estimate the unknown regression coefficient vector β = Σ_xx^{-1} Σ_xy. When n > p, it is standard practice to use the ordinary least squares estimator

    ˆβ_OLS = (X^T X)^{-1} X^T Y = ˆΣ_xx^{-1} ˆΣ_xy,    (2)

where X ∈ R^{n×p} has ith row X_i − µ_x, Y ∈ R^n has elements Y_i − µ_y, and ˆΣ_xx and ˆΣ_xy are the sample covariance of X and the sample covariance between X and Y, respectively. This estimator is also the maximum likelihood estimator and holds whether we take X to be fixed or random.

When n ≤ p - the so-called high-dimensional setting - the OLS estimator is no longer identifiable. This is because the matrix X^T X is rank deficient, which makes ˆΣ_xx singular.
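As a quick numerical illustration of this rank-deficiency problem (the following sketch is not part of the original analysis; the dimensions n = 20 and p = 50 are arbitrary choices):

# illustrative sketch (not from the report): X^T X is rank deficient when p > n
set.seed(1)
n = 20
p = 50
X = matrix(rnorm(n * p), nrow = n)
XtX = crossprod(X)       # the p x p matrix X^T X
qr(XtX)$rank             # at most n = 20 < p = 50, so XtX is singular
# solve(XtX) would therefore fail, and the OLS estimator is not identifiable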

To address this issue, shrinkage estimators of β have been proposed that penalize the log-likelihood used to calculate (2) and, in effect, push the eigenvalues of ˆΣ_xx further from zero (making ˆΣ_xx non-singular). A few of the most popular shrinkage estimators are presented below:

    Ridge Penalty:  ˆβ_R = arg min_β { −l(β) + (λ/2) Σ_{j=1}^p β_j² },

    Lasso Penalty:  ˆβ_L = arg min_β { −l(β) + λ Σ_{j=1}^p |β_j| },

where l(β) is the log-likelihood and λ is a tuning parameter. These methods have proved largely successful by making specific assumptions about β. For instance, we might assume sparsity in the coefficient vector. This assumption would likely favor a lasso penalty, because its non-differentiability forces a number of entries in ˆβ to be exactly zero. In other situations we might assume a non-sparse solution, in which case a ridge penalty would be more appropriate. However, these estimators lack the ability to fully exploit the random component in X that is unique to our design. In the following section we explore indirect regression estimators that seek to leverage the joint distribution between the explanatory variables and the response for potential gain.

Indirect Regression Coefficient Estimators

Recall that we assume X ~ N_p(µ_x, Σ_xx) and ɛ ~ N(0, σ²_{y|x}). This assumption, in conjunction with (1), allows us to define the following conditional distributions:

    Forward Regression:  Y | X = x ~ N( µ_y + β^T(x − µ_x), σ²_{y|x} ),

    Inverse Regression:  X | Y = y ~ N_p( µ_x + α^T(y − µ_y), Δ ),

where β = Σ_xx^{-1} Σ_xy, α = σ_y^{-2} Σ_xy^T, σ²_{y|x} = σ²_y − β^T Σ_xx β, and Δ ≡ Var(X | Y) = Σ_xx − Σ_xy Σ_xy^T / σ²_y. We will use the last equivalence to construct our indirect regression coefficient estimator. Using the Woodbury Identity, it is relatively straightforward (details in the appendix) to show that

    Σ_xx^{-1} = Δ^{-1} − Δ^{-1} Σ_xy Σ_xy^T Δ^{-1} / (σ²_y + Σ_xy^T Δ^{-1} Σ_xy).
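A minimal numerical check of this inverse formula (not part of the report; the Δ, Σ_xy, and σ²_y below are arbitrary illustrative choices):

# illustrative check of the Woodbury-based expression for the inverse of Sigma_xx
set.seed(2)
p = 5
A = matrix(rnorm(p * p), p, p)
Delta = crossprod(A) + diag(p)       # an arbitrary positive definite Delta
Sxy = matrix(rnorm(p), ncol = 1)     # an arbitrary Sigma_xy
s2y = 2                              # an arbitrary sigma_y^2
Sxx = Delta + tcrossprod(Sxy)/s2y    # Sigma_xx = Delta + Sigma_xy Sigma_xy^T / sigma_y^2
Dinv = solve(Delta)
lhs = solve(Sxx)
rhs = Dinv - (Dinv %*% Sxy %*% t(Sxy) %*% Dinv)/as.numeric(s2y + t(Sxy) %*% Dinv %*% Sxy)
max(abs(lhs - rhs))                  # should be numerically zero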

We can then plug this formulation into β to show that

    β = Σ_xx^{-1} Σ_xy = Δ^{-1} Σ_xy / (1 + Σ_xy^T Δ^{-1} Σ_xy / σ²_y)

(assuming a positive definite covariance matrix of (X, Y)) [1]. This work is closely related to Molstad & Rothman (2016); however, they extend this class of estimators to the case where the response is multivariate. By exploiting the joint distribution of (X, Y), it is clear that β can be expressed as a function of Δ^{-1}, Σ_xy, and σ²_y - no longer requiring Σ_xx to be invertible. Of course, not all issues are solved by this re-formulation: like ˆΣ_xx in high dimensions, the maximum likelihood estimator of Δ will be singular when n ≤ p:

    ˆΔ_MLE = ˆΣ_{x|y} = (X − Yˆα)^T (X − Yˆα)/n,

where ˆα = (Y^T Y)^{-1} Y^T X is the MLE of the coefficient vector for the regression of X on Y. Similar to the previous methods, this issue will be addressed by introducing shrinkage estimators. However, instead of shrinking β we propose shrinking the precision matrix Δ^{-1}. This serves two purposes. The first is that instead of making assumptions about β we can instead make assumptions about the structure of Δ^{-1}. The second is the expectation that shrinking Δ^{-1} will greatly improve our estimates when p is large relative to n, since a large number of unknowns and a (relatively) low sample size may result in poor estimates.

Shrinkage Estimators

The shrinkage estimators proposed in this project are analogous to the forward regression case in that we use the negative log-likelihood as our loss function. We write Θ ≡ Δ^{-1} for convenience.

    Ridge Penalty:  ˆΘ^IR_λ = arg min_{Θ ∈ S^p_+} { tr(Θ ˆΣ_{x|y}) − log|Θ| + (λ/2) ||Θ||²_F },    (3)

    Lasso Penalty:  ˆΘ^IL_λ = arg min_{Θ ∈ S^p_+} { tr(Θ ˆΣ_{x|y}) − log|Θ| + λ Σ_{i≠j} |θ_ij| }.    (4)

The first estimator (3) uses a ridge-type penalty in which we penalize the squared sum of all the entries in Θ via the Frobenius norm. As in the forward regression case, this estimator has a closed-form solution and thus can be computed with minimal cost.
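The lasso-type problem (4), discussed further below, has no closed form. For illustration only (this example is not part of the report's exposition), it can be solved with the glasso package, which the appendix code also uses (via glassopath); the toy covariance matrix and penalty value here are arbitrary:

# illustrative sketch: a lasso-type penalized precision estimate of the form (4)
library(glasso)
set.seed(3)
S = cov(matrix(rnorm(50 * 10), 50, 10))      # a 10 x 10 sample covariance standing in for the estimate of Sigma_{x|y}
fit = glasso(S, rho = 0.2, penalize.diagonal = FALSE)
Theta_hat = fit$wi                           # estimated precision (inverse covariance) matrix
sum(Theta_hat[upper.tri(Theta_hat)] == 0)    # number of exactly-zero off-diagonal entries (sparsity)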

We show in the appendix that if we decompose ˆΣ_{x|y} = VQV^T using the spectral decomposition, then

    ˆΘ^IR_λ = (1/(2λ)) V [ −Q + (Q² + 4λ I_p)^{1/2} ] V^T   if λ > 0,
    ˆΘ^IR_λ = ˆΣ_{x|y}^{-1}                                 if ˆΣ_{x|y}^{-1} exists and λ = 0.    (5)

The second estimator (4) uses a lasso-type penalty by summing the absolute values of the off-diagonal entries. This penalty encourages sparse solutions for Θ. In our multivariate normal setting, a zero in the precision matrix Θ implies that two elements are conditionally uncorrelated given the other elements - but they may still be marginally correlated. We will use the popular graphical lasso algorithm for the computation (not presented here).

Using these shrinkage estimators, our proposed indirect estimators are the following:

    Indirect Ridge:  ˆβ_IR = ˆΘ^IR_λ ˆΣ_xy / (1 + ˆΣ_xy^T ˆΘ^IR_λ ˆΣ_xy / ˆσ²_y),    (6)

    Indirect Lasso:  ˆβ_IL = ˆΘ^IL_λ ˆΣ_xy / (1 + ˆΣ_xy^T ˆΘ^IL_λ ˆΣ_xy / ˆσ²_y),    (7)

where, unlike Molstad & Rothman (2016), ˆΣ_xy and ˆσ²_y are the sample estimates (using denominator n); in their work, they proposed shrinkage estimators for those quantities as well. We compare the performance of both of these estimators to their forward regression counterparts in the section that follows.

Simulations

We generate n = 100 independent copies of (X^T, Y)^T, where Y ~ N(0, σ²_y) and X | Y = y ~ N_p(α^T y, Δ). Following Molstad & Rothman, the inverse regression coefficient vector α is chosen as α = Z ∘ B, where Z is a vector of standard normal entries, B is a vector of values drawn from a Bernoulli distribution with probability b, and ∘ denotes the element-wise product. In addition, Δ is constructed so that all of its off-diagonal entries equal 0.9 and its diagonal entries equal 1. A few of the parameters have multiple candidate values: p ∈ (10, 25, 80, 120), σ²_y ∈ (0.3, 0.7), and b ∈ (0.3, 0.9). The simulations are constructed so that each scenario is replicated a total of 50 times. Estimators are evaluated with the model error (ME) and mean squared prediction error (MSPE):

    ME(β, ˆβ) = tr{ (ˆβ − β)^T Σ_xx (ˆβ − β) },

    MSPE(ˆβ) = (1/n) ||Y − Xˆβ||².
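Before turning to the results, the following sketch shows how the indirect ridge estimator (6) is assembled from the pieces above. It mirrors the sigma_ridge() and betaI() functions in the appendix code; the penalty value λ = 1 is an arbitrary choice for illustration (in the study, λ is selected by cross-validation):

# illustrative end-to-end sketch of the indirect ridge estimator (6)
set.seed(4)
n = 50
p = 100                                          # p > n, so OLS is unavailable
Y = matrix(rnorm(n), ncol = 1)
alpha = matrix(rnorm(p), nrow = 1)
X = Y %*% alpha + matrix(rnorm(n * p), ncol = p)
Sxy = crossprod(X, Y)/n                          # sample covariance of X and Y (denominator n)
Syy = crossprod(Y)/n                             # sample variance of Y (denominator n)
m = lm.fit(Y, X)                                 # inverse regression of X on Y
Sx.y = crossprod(m$residuals)/n                  # sample Sigma_{x|y}, singular here since p > n
lam = 1
e = eigen(Sx.y, symmetric = TRUE)
evs = (-e$values + sqrt(e$values^2 + 4 * lam))/(2 * lam)
Theta_IR = e$vectors %*% diag(evs) %*% t(e$vectors)     # closed-form ridge estimate (5)
beta_IR = Theta_IR %*% Sxy/as.numeric(1 + t(Sxy) %*% Theta_IR %*% Sxy/as.numeric(Syy))   # estimator (6)
length(beta_IR)                                  # a p-dimensional coefficient estimate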

For each replication, the model error (ME) and mean squared prediction error (MSPE) are recorded for each of the estimators ˆβ_IR, ˆβ_IL, ˆβ_R, and ˆβ_L. The tuning parameters λ were chosen from the set (10^{-4}, 10^{-3.5}, ..., 10^{7.5}, 10^8) using three-fold cross-validation.

[Figure: average MSPE and model error plotted against the dimension p for the estimators IR, IL, R, and L.]
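For concreteness, the tuning-parameter grid and the three-fold split just described can be constructed as follows (this mirrors the appendix code and is shown only to make the grid explicit):

# the tuning-parameter grid and fold assignment (mirrors the appendix code)
lam = 10^seq(-4, 8, by = 0.5)      # 25 candidate values from 1e-04 to 1e+08
n = 100
K = 3
ind = sample(n)                    # random permutation of the n observations
folds = lapply(1:K, function(k) ind[(1 + floor((k - 1) * n/K)):floor(k * n/K)])
sapply(folds, length)              # fold sizes: 33 33 34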

We can see from the simulations that performances under the MSPE metric were all relatively comparable. Each estimator's error increased as the dimension p increased and appeared to plateau once p exceeded 100. This general trend was also true for the model error, except that the forward regression lasso estimator, ˆβ_L, noticeably outperformed the others. The worst estimator when p = 10 was the indirect lasso estimator, but it recovered to be the second best when p = 120 (σ²_y = 0.3, b = 0.3).

[Table 1: Model error by estimator and dimension (mean and sd for IR, IL, R, and L).]

Focusing on the high-dimensional case, we note that the ridge estimators appear to be the clear winners if we are focused on MSPE. Not only was their average error lower than that of the other methods, but their standard errors appear to be smaller as well. Interestingly, the forward regression ridge estimator, ˆβ_R, was the worst in terms of model error.

[Figure: MSPE and model error by estimator (IR, IL, R, L) in the high-dimensional case, p = 120.]

As we vary the sparsity of the inverse regression coefficient vector α and the variance of Y, we can see that the indirect lasso estimator performs best when the sparsity level is low (b = 0.9) and the variance of Y is low (σ²_y = 0.3) - though the difference in performance is minimal when p is large.

[Figure: average MSPE and model error for the indirect lasso estimator plotted against p, faceted by the Bernoulli probability b and by σ²_y.]

[Table 2: Model error for the indirect lasso estimator by b and dimension (mean and sd for b = 0.3 and b = 0.9).]

Discussion

Indirect regression estimators are yet another tool that statisticians can use in problematic environments such as the case where p > n. We illustrated that both of the newly proposed estimators offer results comparable to the more standard methods when the joint distribution of (X^T, Y)^T is known and Δ^{-1} is non-sparse. Future work is needed to explore the case where Δ^{-1} is sparse and/or p ≫ n. Our simulations hint that the indirect estimators, specifically ˆβ_IL, might outperform the forward regression methods when p is an order of magnitude larger than n - however, due to time constraints we were unable to investigate further.

References

[1] Cook, R. Dennis, Liliana Forzani, and Adam J. Rothman. "Prediction in abundant high-dimensional linear regression." Electronic Journal of Statistics (2013).

[2] Molstad, Aaron J., and Adam J. Rothman. "Indirect multivariate response linear regression." Biometrika (2016).

Appendix

Proof of the OLS estimator for β

Consider the log-likelihood of the joint distribution of X and Y:

    log g(X, Y; β) = log{ f(Y | X, β) h(X) } = log f(Y | X, β) + log h(X),

where log f(Y | X, β) can be simplified to the following form:

    log f(Y | X, β) = log Π_{i=1}^n f(Y_i | X_i, β)
                    = log Π_{i=1}^n (2πσ²_{y|x})^{-1/2} exp{ −(1/(2σ²_{y|x})) ( Y_i − µ_y − β^T(X_i − µ_x) )² }
                    = log (2πσ²_{y|x})^{-n/2} exp{ −(1/(2σ²_{y|x})) Σ_{i=1}^n ( Y_i − µ_y − β^T(X_i − µ_x) )² }
                    = −(n/2) log(2πσ²_{y|x}) − (1/(2σ²_{y|x})) Σ_{i=1}^n ( Y_i − µ_y − β^T(X_i − µ_x) )²
                    = const. − (1/(2σ²_{y|x})) Σ_{i=1}^n ( Y_i − µ_y − β^T(X_i − µ_x) )².

Because we are taking the gradient with respect to β and log h(X) does not depend on β, it can be ignored in further computation:

    ∇_β { log g(X, Y; β) } = ∇_β { log f(Y | X, β) }
                           = ∇_β { −(1/(2σ²_{y|x})) Σ_{i=1}^n ( Y_i − µ_y − β^T(X_i − µ_x) )² }
                           = ∇_β { −(1/(2σ²_{y|x})) ||Y − Xβ||² }
                           = ∇_β { −(1/(2σ²_{y|x})) ( Y^T Y − 2β^T X^T Y + β^T X^T X β ) }
                           = (1/σ²_{y|x}) X^T Y − (1/σ²_{y|x}) X^T X β,

where X ∈ R^{n×p} has rows X_i − µ_x and Y ∈ R^n has elements Y_i − µ_y. Setting the gradient equal to zero, it follows that ˆβ_MLE = ˆβ_OLS = (X^T X)^{-1} X^T Y.
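A small numerical confirmation of this result (not part of the report): with column-centered X and centered Y, the closed-form solution matches the slope estimates returned by lm().

# illustrative check that the centered OLS formula matches lm()
set.seed(5)
n = 50
p = 3
X = matrix(rnorm(n * p), n, p)
Y = X %*% c(1, -2, 0.5) + rnorm(n)
Xc = scale(X, center = TRUE, scale = FALSE)          # rows X_i minus the column means
Yc = Y - mean(Y)                                     # elements Y_i minus the mean
beta_ols = solve(crossprod(Xc), crossprod(Xc, Yc))   # (X^T X)^{-1} X^T Y
max(abs(beta_ols - coef(lm(Y ~ X))[-1]))             # should be numerically zero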

Proof of the indirect regression coefficient β

We stated as fact that Σ_xx = Δ + Σ_xy Σ_xy^T / σ²_y. Using the Woodbury Identity, it follows that

    Σ_xx^{-1} = ( Δ + Σ_xy Σ_xy^T / σ²_y )^{-1}
              = Δ^{-1} − Δ^{-1} Σ_xy ( σ²_y + Σ_xy^T Δ^{-1} Σ_xy )^{-1} Σ_xy^T Δ^{-1}
              = Δ^{-1} − Δ^{-1} Σ_xy Σ_xy^T Δ^{-1} / ( σ²_y + Σ_xy^T Δ^{-1} Σ_xy ).

This directly implies that β is of the following form:

    β = Σ_xx^{-1} Σ_xy
      = Δ^{-1} Σ_xy − Δ^{-1} Σ_xy Σ_xy^T Δ^{-1} Σ_xy / ( σ²_y + Σ_xy^T Δ^{-1} Σ_xy )
      = Δ^{-1} Σ_xy ( 1 − Σ_xy^T Δ^{-1} Σ_xy / ( σ²_y + Σ_xy^T Δ^{-1} Σ_xy ) )
      = Δ^{-1} Σ_xy σ²_y / ( σ²_y + Σ_xy^T Δ^{-1} Σ_xy )
      = Δ^{-1} Σ_xy / ( 1 + Σ_xy^T Δ^{-1} Σ_xy / σ²_y ).

Proof of the MLE for Δ^{-1}

Recall that X | Y = y ~ N_p( µ_x + α^T(y − µ_y), Δ ). The log-likelihood can be simplified as follows (using the same notation defined previously):

    l(α, Δ^{-1}) = Σ_{i=1}^n log φ( X_i ; µ_x + α^T(Y_i − µ_y), Δ )
                 = −(np/2) log(2π) + (n/2) log|Δ^{-1}| − (1/2) Σ_{i=1}^n ( X_i − µ_x − α^T(Y_i − µ_y) )^T Δ^{-1} ( X_i − µ_x − α^T(Y_i − µ_y) )
                 = const. + (n/2) log|Δ^{-1}| − (1/2) tr{ Σ_{i=1}^n ( X_i − µ_x − α^T(Y_i − µ_y) )( X_i − µ_x − α^T(Y_i − µ_y) )^T Δ^{-1} }
                 = const. + (n/2) log|Δ^{-1}| − (n/2) tr{ (1/n) (X − Yα)^T (X − Yα) Δ^{-1} }.
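A quick numerical check (not in the report) of the trace identity used in the last two steps above, namely Σ_i z_i^T Δ^{-1} z_i = tr{ (Σ_i z_i z_i^T) Δ^{-1} }; the matrices below are arbitrary illustrative choices:

# illustrative check of the trace identity behind the likelihood simplification
set.seed(6)
n = 100
p = 4
Z = matrix(rnorm(n * p), n, p)              # stands in for the residuals X_i - mu_x - alpha^T (Y_i - mu_y)
A = matrix(rnorm(p * p), p, p)
Dinv = crossprod(A) + diag(p)               # an arbitrary positive definite Delta^{-1}
quad_sum = sum(diag(Z %*% Dinv %*% t(Z)))   # sum of the quadratic forms z_i^T Dinv z_i
S = crossprod(Z)/n                          # (1/n) Z^T Z
quad_sum - n * sum(S * Dinv)                # should be numerically zero; note tr(S Dinv) = sum(S * Dinv)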

Taking the gradient with respect to α,

    ∇_α l(α, Δ^{-1}) = ∇_α { −(n/2) tr( (1/n) (X − Yα)^T (X − Yα) Δ^{-1} ) }
                     = −(1/2) ∇_α tr{ −2 α^T Y^T X Δ^{-1} + α^T Y^T Y α Δ^{-1} }
                     = Y^T X Δ^{-1} − Y^T Y α Δ^{-1}.

Setting the gradient equal to zero, it follows that ˆα_MLE = (Y^T Y)^{-1} Y^T X, which we know is identifiable because Y^T Y is a scalar. Now we take the gradient with respect to Δ^{-1}:

    ∇_{Δ^{-1}} l(α, Δ^{-1}) = ∇_{Δ^{-1}} [ (n/2) log|Δ^{-1}| − (n/2) tr{ ˆΣ_{x|y} Δ^{-1} } ]
                            = (n/2) Δ − (n/2) ˆΣ_{x|y},

where ˆΣ_{x|y} = (X − Yα)^T (X − Yα)/n. This is the residual sample covariance for the regression of X on Y (denominator n). Setting the gradient equal to zero, it follows that ˆΔ^{-1}_MLE = ˆΣ_{x|y}^{-1} (if it exists), where α is replaced with ˆα_MLE = (Y^T Y)^{-1} Y^T X.

Proof of the ridge-penalized Δ^{-1}

Recall

    ˆΔ^{-1}_λ = arg min_{Θ ∈ S^p_+} { tr(Θ ˆΣ_{x|y}) − log|Θ| + (λ/2) ||Θ||²_F }.

Let g be the objective function in the previous equation. Then

    ∇_Θ g(Θ) = ∇_Θ { tr(Θ ˆΣ_{x|y}) − log|Θ| + (λ/2) ||Θ||²_F } = ˆΣ_{x|y} − Θ^{-1} + λΘ.
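A finite-difference check of this gradient formula (not part of the report; the point Θ, the matrix S, and λ below are arbitrary choices):

# illustrative finite-difference check of the gradient S - solve(Theta) + lam * Theta
set.seed(7)
p = 4
A = matrix(rnorm(10 * p), 10, p)
S = crossprod(A)/10                          # stands in for the estimate of Sigma_{x|y}
lam = 0.3
Theta = diag(p) + 0.1                        # symmetric positive definite: 1.1 on the diagonal, 0.1 elsewhere
g = function(Th) sum(Th * S) - determinant(Th, logarithm = TRUE)$modulus[1] + lam/2 * sum(Th^2)
grad = S - solve(Theta) + lam * Theta
# perturb the symmetric pair of entries (1, 2) and (2, 1) by h
h = 1e-6
E = matrix(0, p, p)
E[1, 2] = E[2, 1] = h
(g(Theta + E) - g(Theta - E))/(2 * h)        # central difference; approximately 2 * grad[1, 2]
2 * grad[1, 2]                               # analytic value for comparison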

Setting the gradient equal to zero and using the spectral decomposition ˆΘ = VDV^T, where D is a diagonal matrix whose diagonal elements are the eigenvalues of ˆΘ and V is the matrix with the corresponding eigenvectors as columns, we have

    ˆΣ_{x|y} = ˆΘ^{-1} − λ ˆΘ = (VDV^T)^{-1} − λ VDV^T = V(D^{-1} − λD)V^T.

This structure implies that

    φ_j(ˆΣ_{x|y}) = 1/φ_j(ˆΘ) − λ φ_j(ˆΘ),

where φ_j(·) denotes the jth eigenvalue, so that

    λ φ_j(ˆΘ)² + φ_j(ˆΣ_{x|y}) φ_j(ˆΘ) − 1 = 0,

    φ_j(ˆΘ) = ( −φ_j(ˆΣ_{x|y}) ± ( φ_j(ˆΣ_{x|y})² + 4λ )^{1/2} ) / (2λ),

and we take the positive root so that ˆΘ is positive definite. In summary, if we decompose ˆΣ_{x|y} = VQV^T, then

    ˆΘ_λ = (1/(2λ)) V [ −Q + (Q² + 4λ I_p)^{1/2} ] V^T   if λ > 0,
    ˆΘ_λ = ˆΣ_{x|y}^{-1}                                  if ˆΣ_{x|y}^{-1} exists and λ = 0.

(Proof taken from Adam Rothman's STAT 8931 lecture notes.)
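As a final sanity check (not in the report), the closed form above satisfies the stationarity condition ˆΣ_{x|y} − ˆΘ^{-1} + λˆΘ = 0; the sample covariance and λ below are arbitrary:

# illustrative check that the closed-form solution satisfies S - solve(Theta) + lam * Theta = 0
set.seed(8)
p = 6
A = matrix(rnorm(20 * p), 20, p)
S = crossprod(scale(A, scale = FALSE))/20     # a symmetric nonnegative definite stand-in for the estimate of Sigma_{x|y}
lam = 0.5
e = eigen(S, symmetric = TRUE)
evs = (-e$values + sqrt(e$values^2 + 4 * lam))/(2 * lam)
Theta = e$vectors %*% diag(evs) %*% t(e$vectors)
max(abs(S - solve(Theta) + lam * Theta))      # should be numerically zero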

Code

# All code taken and/or augmented from Adam Rothman's STAT 8931 course

# libraries
library(glmnet)
library(glasso)
library(magrittr)    # provides the %>% pipe used below

# define sigma ridge function
sigma_ridge = function(S, lam) {

    # dimensions
    p = dim(S)[1]

    # gather eigenvalues of S (spectral decomposition)
    e.out = eigen(S, symmetric = TRUE)

    # augment eigenvalues for omega hat
    new.evs = (-e.out$val + sqrt(e.out$val^2 + 4 * lam))/(2 * lam)

    # compute omega hat for lambda > 0 (gradient equation)
    omega = tcrossprod(e.out$vec * rep(new.evs, each = p), e.out$vec)
    return(omega)

}

# define betaI function
betaI = function(delta, Sxy, Syy) {

    # betaI
    delta %*% Sxy/as.numeric(1 + t(Sxy) %*% delta %*% Sxy/as.numeric(Syy))

}

# define betaMP function
betaMP = function(X, Y) {

    # betaMP (Moore-Penrose OLS via the generalized inverse)
    MASS::ginv(t(X) %*% X) %*% (cov(X, Y) * (nrow(X) - 1)/nrow(X))

}

# define betaR function
betaR = function(X, Y, lam) {

    # betaR (forward ridge regression)
    m = glmnet(X, Y, alpha = 0, lambda = lam, intercept = F)
    predict(m, type = "coefficients")[-1, ]

}

# define betaL function
betaL = function(X, Y, lam) {

    # betaL (forward lasso regression)
    m = glmnet(X, Y, lambda = lam, intercept = F)
    predict(m, type = "coefficients")[-1, ]

}

# define model error function
ME = function(beta_hat, beta, Sxx) {

    # ME
    t(beta_hat - beta) %*% Sxx %*% (beta_hat - beta)

}

# define mean squared error function
MSE = function(beta, X.valid, Y.valid) {

    # loss
    mean((X.valid %*% beta - Y.valid)^2)

}

# define CV function
CV = function(X, Y, lam, ind = NULL, K = 5, quiet = TRUE, crit = NULL) {

    # dimensions of data
    n = dim(X)[1]
    p = dim(X)[2]

    # if the user did not specify a permutation of 1,...,n,
    # then randomly permute the sequence:
    if (is.null(ind)) ind = sample(n)

    # allocate the memory for the loss array
    # (rows correspond to values of the tuning parameter,
    # columns to estimators, slices to folds)
    cv.loss = array(0, c(length(lam), 4, K))

    for (k in 1:K) {

        leave.out = ind[(1 + floor((k - 1) * n/K)):floor(k * n/K)]

        # training set
        X.train = X[-leave.out, , drop = FALSE]
        X_bar = apply(X.train, 2, mean)
        X.train = scale(X.train, center = X_bar, scale = FALSE)
        Y.train = Y[-leave.out, , drop = FALSE]
        Y_bar = apply(Y.train, 2, mean)
        Y.train = scale(Y.train, center = Y_bar, scale = FALSE)

        # validation set
        X.valid = X[leave.out, , drop = FALSE]
        X.valid = scale(X.valid, center = X_bar, scale = FALSE)

        Y.valid = Y[leave.out, , drop = FALSE]
        Y.valid = scale(Y.valid, center = Y_bar, scale = FALSE)

        # sample covariances
        Sxx = crossprod(X.train)/nrow(X.train)
        Sxy = crossprod(X.train, Y.train)/nrow(X.train)
        Syy = crossprod(Y.train)/nrow(Y.train)
        m = lm.fit(Y.train, X.train)
        Sx.y = crossprod(m$residuals)/nrow(X.train)
        Sx.y.valid = crossprod(X.valid - Y.valid %*% m$coefficients)/nrow(X.valid)

        # glasso
        out = glassopath(s = Sx.y, rholist = lam, penalize.diagonal = FALSE,
            trace = 0, thr = 0.001, maxit = 3)

        # loop over all lambda values
        for (i in 1:length(lam)) {

            # lambda
            lam. = lam[i]

            # loss for betaIR
            deltaIR = sigma_ridge(Sx.y, lam.)
            lossIR = sum(deltaIR * Sx.y.valid) - determinant(deltaIR, logarithm = TRUE)$modulus[1]
            betaIR = betaI(deltaIR, Sxy, Syy)

            # loss for betaIL
            deltaIL = out$wi[, , i]
            lossIL = sum(deltaIL * Sx.y.valid) - determinant(deltaIL, logarithm = TRUE)$modulus[1]
            betaIL = betaI(deltaIL, Sxy, Syy)

            # loss betaR
            betaR. = betaR(X.train, Y.train, lam.)
            lossR = MSE(betaR., X.valid, Y.valid)

            # loss betaL
            betaL. = betaL(X.train, Y.train, lam.)
            lossL = MSE(betaL., X.valid, Y.valid)

            # if crit is not NULL, use prediction MSE as the
            # criterion for choosing lambda for betaIR and betaIL
            if (!is.null(crit)) {
                lossIR = MSE(betaIR, X.valid, Y.valid)
                lossIL = MSE(betaIL, X.valid, Y.valid)
            }

            # designate loss
            cv.loss[i, , k] = c(lossIR, lossIL, lossR, lossL)

        }

        # if not quiet, then print progress by fold
        if (!quiet) cat("finished fold", k, "\n")

    }

    # accumulate the error over the folds
    cv.err = apply(cv.loss, c(1, 2), sum)

    # find the best tuning parameter values
    best.loc = apply(cv.err, 2, which.min)
    best.lam = lam[best.loc]

    # best betas
    Sxy = crossprod(X, Y)/nrow(X)
    Syy = crossprod(Y)/nrow(Y)
    m = lm.fit(Y, X)
    Sx.y = crossprod(m$residuals)/nrow(X)

    betaIR = sigma_ridge(Sx.y, best.lam[1]) %>% betaI(Sxy, Syy)
    betaIL = out$wi[, , best.loc[2]] %>% betaI(Sxy, Syy)
    betaR. = betaR(X, Y, best.lam[3])
    betaL. = betaL(X, Y, best.lam[4])

    # compute the final estimates at the best tuning parameter values
    beta_hat = matrix(c(betaIR, betaIL, betaR., betaL.), ncol = 4)
    colnames(beta_hat) = c("BIR", "BIL", "BR", "BL")
    return(list(beta_hat = beta_hat, best.lam = best.lam, cv.err = cv.err, lam = lam))

}


## SIMULATION

# initialize values
lam = 10^seq(-4, 8, 0.5)
reps = 50
N = 100
# P = c(10, 25, 80, 120)
P = 120
# Syy = c(0.3, 0.7)
Syy = 0.7
# Bin = c(0.3, 0.9)
Bin = 0.3

# allocate memory
sim = array(0, c(reps, 5, length(N), length(P), 3, length(Syy), length(Bin)),
    dimnames = list(reps = c(1:reps), Beta = c("IR", "IL", "R", "L", "MP"),
        N = c(N), P = c(P), criteria = c("MSE", "ME", "Boundary"),
        Sigmay = c(Syy), Binom = c(Bin)))

# lots of loops
for (n in 1:length(N)) {

    for (p in 1:length(P)) {
        for (s in 1:length(Syy)) {
            for (b in 1:length(Bin)) {
                for (r in 1:reps) {

                    # initialize values: set variance for Y
                    syy = Syy[s]

                    # Y ~ N(0, Syy)
                    Y = matrix(rnorm(N[n], sd = sqrt(syy)), ncol = 1)

                    # set true alpha
                    alpha = matrix(rnorm(P[p]), nrow = 1) * matrix(rbinom(P[p], 1, Bin[b]), nrow = 1)
                    alpha[1, 1] = rnorm(1)

                    # delta has off-diagonal entries equal to 0.9 and diagonal entries equal to 1
                    delta = matrix(NA, nrow = P[p], ncol = P[p])
                    for (j in 1:P[p]) {
                        for (k in 1:P[p]) {
                            delta[j, k] = 0.9 * (j != k) + 1 * (j == k)
                        }
                    }

                    # X | Y ~ N(Y alpha, delta)
                    X = Y %*% alpha + matrix(rnorm(N[n] * P[p]), ncol = P[p]) %*% t(chol(delta))

                    # based on the previous values we can solve for beta and Sxx
                    beta = qr.solve(delta, t(alpha))/as.numeric(1/syy + alpha %*% qr.solve(delta, t(alpha)))
                    Sxx = delta + t(alpha) %*% alpha * syy

                    # run CV to find optimal betas
                    cv = CV(X, Y, lam = lam, K = 3, quiet = T)
                    beta_hat = cv$beta_hat

                    # fill in metrics for each estimator
                    for (i in 1:4) {

                        # MSE and ME criteria
                        sim[r, i, n, p, 1, s, b] = MSE(beta_hat[, i, drop = F], X, Y)
                        sim[r, i, n, p, 2, s, b] = ME(beta_hat[, i, drop = F], beta, Sxx)

                        # boundary of lambda?
                        if (min(lam) %in% cv$best.lam[i]) {
                            cat("Oops! Lambda on boundary. \n")
                            sim[r, i, n, p, 3, s, b] = 1
                        }

                    }

                    sim[r, 5, n, p, 1, s, b] = MSE(betaMP(X, Y), X, Y)
                    sim[r, 5, n, p, 2, s, b] = ME(betaMP(X, Y), beta, Sxx)
                    if (min(lam) %in% cv$best.lam[5]) {
                        cat("Oops! Lambda on boundary. \n")
                        sim[r, 5, n, p, 3, s, b] = 1
                    }

                    cat("finished rep", r, "bin", Bin[b], "sigma", Syy[s], "P", P[p], "N", N[n], "\n")

                }
            }
        }
    }
}

# designate simulation data as a long-format table
data = sim %>% as.data.frame.table(responseName = "Error")
