Robust Sparse Estimation of Multiresponse Regression and Inverse Covariance Matrix via the L2 distance


Aurélie C. Lozano, IBM Watson Research Center, Yorktown Heights, New York
Huijing Jiang, IBM Watson Research Center, Yorktown Heights, New York
Xinwei Deng, Virginia Tech, Blacksburg, Virginia

ABSTRACT

We propose a robust framework to jointly perform two key modeling tasks involving high dimensional data: (i) learning a sparse functional mapping from multiple predictors to multiple responses while taking advantage of the coupling among responses, and (ii) estimating the conditional dependency structure among responses while adjusting for their predictors. Traditional likelihood-based estimators lack resilience with respect to outliers and model misspecification, an issue that is exacerbated when dealing with high dimensional noisy data. In this work, we propose instead to minimize a regularized distance criterion, which is motivated by the minimum distance functionals used in nonparametric methods for their excellent robustness properties. The proposed estimates can be obtained efficiently by leveraging a sequential quadratic programming algorithm. We provide theoretical justification such as estimation consistency for the proposed estimator. Additionally, we shed light on the robustness of our estimator through its linearization, which yields a combination of weighted lasso and graphical lasso, with the sample weights providing an intuitive explanation of the robustness. We demonstrate the merits of our framework through a simulation study and the analysis of real financial and genetics data.

Categories and Subject Descriptors: I.2.6 [Artificial Intelligence]: Learning

Keywords: Robust estimation, high dimensional data, sparse learning, variable selection, multiresponse regression, inverse covariance, L2E

KDD'13, August 11-14, 2013, Chicago, Illinois, USA. Copyright 2013 ACM.

1. INTRODUCTION

We focus on multiresponse regression where both predictor and response spaces may exhibit high dimensions. We propose a robust framework to jointly and synergistically solve two important tasks: (i) learning the sparse functional mapping between inputs and outputs while taking advantage of the coupling among responses, and (ii) estimating the conditional dependency structure among responses while adjusting for the covariates. This is motivated by the crucial need of integrating genomic and transcriptomic datasets in computational biology in order to solve two fundamental problems effectively: identifying the genetic variations in the genome that influence gene expression levels (a.k.a. expression quantitative trait loci, eQTL, mapping), and uncovering gene expression networks. In fact, the accuracy of the first problem can then be improved by leveraging the gene relatedness, and similarly an accurate and faithful estimation of the gene expression networks can be obtained by accounting for the confounding genetic effects on gene expression.

Multiresponse regression [5] generalizes the basic single-response regression to model multiple responses that may significantly correlate with each other.
As opposed to treating each response independently, one can jointly learn multiple regression mappings to improve the estimation and prediction accuracy by exploiting the conditional dependencies among responses. Variable selection in multiresponse regression can be accomplished via penalized approaches including the lasso [31] and the multitask lasso [24]. Sparse estimation of the inverse covariance matrix is an important area of multivariate analysis with broad applications in graphical models. A major focus in this area is that of penalized maximum likelihood formulations [15, 34, 1, 16]. Alternatively, modified Cholesky decompositions based on the likelihood can be used to estimate the sparse inverse covariance [17, 4, 21]. A simpler approach, neighborhood selection [22], estimates sparse graphical models by using the lasso to regress on each variable with the others as predictors.

Combining multiresponse regression and inverse covariance estimation has recently begun to attract more attention in the machine learning community. Rothman et al. [27] proposed multivariate regression with covariance estimation (MRCE) to jointly estimate the sparse regression and inverse covariance matrices. They demonstrated that exploiting the correlation structures can significantly improve the prediction accuracy. The same model was also studied by Lee and Liu [20], who provided some theoretical properties for their developed method. An alternative parameterization was considered by Sohn and Kim [29], which is based on the joint distribution of predictors and responses and yields an l1-penalized conditional Gaussian graphical model (l1-CGGM). Another relevant method is that of covariate adjusted precision matrix estimation (CAPME) [7], a two-stage approach to estimate the conditional dependency structure among response variables by adjusting for covariates.

The first stage is to estimate the regression coefficients via a multivariate extension of the l1 Dantzig selector [8], and the second stage is to estimate the inverse covariance matrix using an l1 error with an l1 penalty.

Robustness is an important aspect often overlooked in the sparse learning literature, while critical when dealing with high dimensional noisy data. Traditional likelihood-based estimators such as MRCE and l1-CGGM lack resilience to outliers and model misspecification. Additionally, to the best of our knowledge, estimates based on the Dantzig selector have not been compared to lasso counterparts in terms of robustness. Thus it is unclear whether CAPME, for instance, can address the robustness issue. There is limited existing work on robust sparse learning methods in high-dimensional modeling. The LAD-lasso [32] performs single response regression using the least absolute deviation combined with an l1 penalty. The tlasso [14] performs inverse covariance estimation using a penalized log-likelihood with the multivariate t distribution. However, neither of these methods can be easily extended to the setting of this paper.

We propose a robust approach to jointly estimate multiresponse regression and the inverse covariance matrix. Our approach is based on a regularized distance criterion motivated by minimum distance estimators. Minimum distance estimators [33] were popularized in nonparametric methods and have exhibited excellent robustness properties [3, 12]. Their use for parametric estimation has been discussed in [28, 2]. In this work, we propose a penalized minimum distance criterion for robust estimation of sparse parametric models in high dimensional settings. Our key contributions are as follows. The objective, denoted REG-ISE, is based on the integrated squared error (ISE) distance between the model and the true distribution, and imposes the sparse model structure by adding sparsity-inducing penalties to the ISE criterion. Theoretical guarantees are provided on the estimation consistency of the proposed REG-ISE estimator. We leverage a sequential quadratic programming algorithm [9] to efficiently solve our objective. We shed light on the robustness of our framework by linearizing our objective. The linearization yields a problem combining weighted versions of l1-penalized regression (lasso) and l1-penalized inverse covariance estimation (glasso), where the weights assigned to the instances are theoretically derived and can be interpreted in terms of outlying degrees. We propose modified cross-validation and hold-out validation methods for the choice of tuning parameters, which are also applicable to other penalized regression methods. The strength of our method is demonstrated via simulation data with and without outliers. Our study also confirms that outliers can severely influence the variable selection accuracy of some existing sparse learning methods. Experiments on real financial and eQTL data further illustrate the merits of the proposed method.

2. MODEL SETUP

Denote the response vector y = (y_1, ..., y_q)' ∈ R^q and the predictor vector x = (x_1, ..., x_p)' ∈ R^p. We consider the multiresponse linear regression model

    y = B'x + ε,  ε ∼ N(0, Σ),   (1)

where B = (b_ij) is a p × q matrix of coefficients whose kth column contains the coefficients associated with the kth response y_k regressed on the predictors x. The q × q covariance matrix Σ describes the covariance structure of the response vector y given the predictors x. Moreover, its inverse Σ^{-1} = (c_ij) represents the partial covariance structure [19] and has been widely used to learn a sparse graphical model under the Gaussian assumption.
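To make the setup concrete, the following minimal sketch (our own illustration, not from the paper; Python/NumPy, using the Case 1 dimensions of Section 4) draws a dataset from model (1) with a sparse B and an AR(1) error covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 100, 60, 20                       # observations, predictors, responses

B = np.zeros((p, q))                        # sparse p x q coefficient matrix
support = rng.random((p, q)) < 0.1          # ~10% nonzero entries (illustrative)
B[support] = rng.normal(size=support.sum())

# AR(1) covariance: Sigma_{ij} = 0.7^{|i-j|}
Sigma = 0.7 ** np.abs(np.subtract.outer(np.arange(q), np.arange(q)))

X = rng.normal(size=(n, p))                              # centered predictors
E = rng.multivariate_normal(np.zeros(q), Sigma, size=n)  # correlated noise
Y = X @ B + E                                            # model (1): y = B'x + eps
```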
Note that Σ in (1) captures the partial covariances among the responses y after adjusting for the effects of the covariates x. For simplicity of notation, we assume the data are centered so that the model (1) does not contain intercepts. Suppose there are n observational vectors x_i = (x_{i1}, ..., x_{ip})', i = 1, ..., n, with corresponding response vectors y_i = (y_{i1}, ..., y_{iq})'. To jointly obtain sparse estimates of the coefficient matrix B and the precision matrix Σ^{-1}, we consider a loss function L_n(B, Σ^{-1}) that measures the goodness-of-fit on the multivariate response. The sparse structures of B and Σ^{-1} are encouraged by using l1 penalties. Specifically, the penalized loss function L_{n,λ}(B, Σ^{-1}) is written as

    L_{n,λ}(B, Σ^{-1}) = L_n(B, Σ^{-1}) + λ_1 ‖Σ^{-1}‖_1 + λ_2 ‖B‖_1,   (2)

where ‖Σ^{-1}‖_1 = ∑_{i≠j} |c_ij| and ‖B‖_1 = ∑_{i,j} |b_ij| are l1 matrix norms. Following the principle of parsimony, we consider l1 penalty functions to seek a most appropriate model that adequately explains the data. With carefully selected tuning parameters λ_1 and λ_2, we can achieve an optimal trade-off between the parsimoniousness and the goodness-of-fit of the model.

The loss L_n(B, Σ^{-1}) is typically derived from a likelihood-based approach. For instance, the MRCE method [27] uses

    L_n(B, Σ^{-1}) = (1/n) trace((Y − XB)'(Y − XB) Σ^{-1}) − log|Σ^{-1}|.

Alternatively, if one ignores the contribution of the inverse covariance matrix (by implicitly assuming that it is the identity matrix), one can consider the traditional squared loss L_n = ‖Y − XB‖_F^2 and a straightforward generalization of the traditional lasso estimator [31] to the multiresponse setting.
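For reference, the likelihood-based objective translates into a few lines of code. The sketch below (ours, hedged) evaluates the penalized loss (2) with the MRCE loss plugged in; `Omega` stands for Σ^{-1}, and `lam1`, `lam2` for λ_1, λ_2:

```python
import numpy as np

def mrce_penalized_loss(B, Omega, X, Y, lam1, lam2):
    """Penalized loss (2) with the MRCE negative log-likelihood loss."""
    n = X.shape[0]
    R = Y - X @ B                               # residual matrix Y - XB
    fit = np.trace(R.T @ R @ Omega) / n         # (1/n) tr((Y-XB)'(Y-XB) Omega)
    _, logdet = np.linalg.slogdet(Omega)        # log|Omega|, numerically stable
    off_diag = np.abs(Omega).sum() - np.abs(np.diag(Omega)).sum()
    return fit - logdet + lam1 * off_diag + lam2 * np.abs(B).sum()
```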

3. A REGULARIZED INTEGRATED SQUARED ERROR ESTIMATOR

We begin this section by showing how a minimum distance criterion yields our proposed estimator for achieving robustness under the model in (1).

3.1 Derivation of the REG-ISE Objective

We first apply the integrated squared error (ISE) criterion to the conditional distribution of the response vector y given the predictors x. It leads to an L2 distance between the true conditional distribution f(y|x) and the parametric distribution f(y|x; B, Σ) as follows:

    L(B, Σ) = ∫ [f(y|x; B, Σ) − f(y|x)]^2 dy
            = ∫ f^2(y|x; B, Σ) dy − 2 ∫ f(y|x; B, Σ) f(y|x) dy + ∫ f^2(y|x) dy
            = ∫ f^2(y|x; B, Σ) dy − 2 E[f(y|x; B, Σ)] + constant,   (3)

where f(y|x; B, Σ) is the probability density function of the multivariate normal N(B'x, Σ) and ∫ f^2(y|x) dy is a constant independent of B and Σ. Note that f(y|x; B, Σ) ≡ f(y − B'x; Σ) because of the conditional distribution assumption. Since the ε_i = y_i − B'x_i are independently and identically distributed, one can consider approximating E[f(y|x; B, Σ)] by the empirical mean (1/n) ∑_{i=1}^n f(y_i|x_i; B, Σ). Similar approximation techniques have also been used for Gaussian mixture density estimation [28]. The resulting empirical loss function of L(B, Σ) can therefore be written as

    L_n(B, Σ) = ∫ f^2(y|x; B, Σ) dy − (2/n) ∑_{i=1}^n f(y_i|x_i; B, Σ),

where ∫ f^2(y|x; B, Σ) dy = 1/(2^q π^{q/2} |Σ|^{1/2}). Note that we assume a parametric family for the model while using a nonparametric ISE criterion to measure goodness of fit. From the perspective of the loss function, ISE is a more robust measure of goodness-of-fit than the likelihood-based loss function. It can match the model with the largest portion of the data because the integration in (3) accounts for the whole range of the squared loss function. Using the ISE criterion as the loss function, the objective function in (2) becomes

    L_{n,λ}(B, Σ^{-1}) = 1/(2^q π^{q/2} |Σ|^{1/2}) − (2/n) ∑_{i=1}^n f(y_i|x_i; B, Σ) + λ_1 ‖Σ^{-1}‖_1 + λ_2 ‖B‖_1.   (4)

However, the minimization of the objective in (4) is challenging. To circumvent this difficulty, we consider minimizing an upper bound of (4) which retains the robustness property. For that purpose, we introduce a lemma which is essential for deriving the proposed objective (8) for estimating the sparse multiresponse regression model in (1).

Lemma 1. For a positive definite matrix Σ^{-1} of dimension q, the relation between its determinant and its l1 norm can be described by the following inequality:

    |Σ|^{-1/2} ≤ (‖Σ^{-1}‖_1 / q)^{q/2}.

The proof of Lemma 1 is provided in the Appendix. Using Lemma 1, we can derive an upper bound for the objective function (4) as follows:

    c ‖Σ^{-1}‖_1^{q/2} − (2/n) ∑_{i=1}^n f(y_i|x_i; B, Σ) + λ_1 ‖Σ^{-1}‖_1 + λ_2 ‖B‖_1,   (5)

where c = 1/(2^q (πq)^{q/2}) is a constant. The above optimization problem amounts to minimizing

    L̃_{n,λ}(B, Σ^{-1}) = L̃_n(B, Σ^{-1}) + λ̃_1 ‖Σ^{-1}‖_1 + λ_2 ‖B‖_1,   (6)

where L̃_n(B, Σ^{-1}) = −(2/n) ∑_{i=1}^n f(y_i|x_i; B, Σ) and λ̃_1 is appropriately chosen. The value of λ̃_1 here should be slightly larger than the value of λ_1 in (4). Moreover, the diagonal elements of Σ^{-1} are also penalized. Note that L̃_{n,λ}(B, Σ^{-1}) is an upper bound of L_{n,λ}(B, Σ^{-1}); however, the difference L̃_{n,λ}(B, Σ^{-1}) − L_{n,λ}(B, Σ^{-1}) is well controlled by the penalty term λ̃_1 ‖Σ^{-1}‖_1 in (6). By properly adjusting the value of λ̃_1, we can make the difference reasonably small. Therefore, by minimizing L̃_{n,λ}(B, Σ^{-1}), we expect to approach the solution of L_{n,λ}(B, Σ^{-1}) and thus still retain the robustness property in the estimators.

Taking the logarithm of L̃_n(B, Σ^{-1}), we obtain the loss

    L_n(B, Σ^{-1}) = −log[(1/n) ∑_{i=1}^n exp(−(1/2)(y_i − B'x_i)' Σ^{-1} (y_i − B'x_i))] − (1/2) log|Σ^{-1}|.   (7)

We note that the logarithm is employed to strike a better balance between the goodness of fit and the sparsity-inducing penalty (similarly, one considers the penalized negative log-likelihood rather than dealing with the likelihood directly). This yields the estimator proposed and studied in this paper, the Regularized Integrated Squared Error (REG-ISE) estimator, which minimizes the following objective function:

    L_{n,λ}(B, Σ^{-1}) = −log[(1/n) ∑_{i=1}^n exp(−(1/2)(y_i − B'x_i)' Σ^{-1} (y_i − B'x_i))] − (1/2) log|Σ^{-1}| + λ_1 ‖Σ^{-1}‖_1 + λ_2 ‖B‖_1.   (8)

For notational convenience, here and in later sections, we use λ_1 for λ̃_1.
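The objective (8) translates directly into code. The sketch below (our own, not the paper's implementation) evaluates it with a log-sum-exp guard for numerical stability and, as in (6), penalizes all entries of Ω = Σ^{-1}:

```python
import numpy as np

def reg_ise_objective(B, Omega, X, Y, lam1, lam2):
    """Evaluate the REG-ISE objective (8)."""
    R = Y - X @ B                                       # residuals y_i - B'x_i
    g = -0.5 * np.einsum('ij,jk,ik->i', R, Omega, R)    # g_i = -1/2 r_i' Omega r_i
    m = g.max()                                         # log-sum-exp stabilizer
    log_mean_exp = m + np.log(np.mean(np.exp(g - m)))   # log((1/n) sum_i exp(g_i))
    _, logdet = np.linalg.slogdet(Omega)                # log|Omega|
    return (-log_mean_exp - 0.5 * logdet
            + lam1 * np.abs(Omega).sum() + lam2 * np.abs(B).sum())
```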
Some intuition about the robustness of the objective can already be gained by considering the ratio between the data and model pdfs, f(y|x)/f(y|x; B, Σ). An outlier in the data may drive this ratio to infinity, in which case the log-likelihood is also infinite. In contrast, the difference f(y|x) − f(y|x; B, Σ) is always bounded. This property makes the L2 distance a favourable choice when dealing with outliers. We note that a similar reasoning holds in the context of density estimation, as pointed out in the recent work of [30] on density-difference estimation.
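This boundedness is easy to check numerically. In the sketch below (a univariate illustration of ours, not from the paper), the negative log-density of a point grows without bound as it drifts away from the model, while its contribution to the empirical ISE term −(2/n) ∑_i f(y_i|x_i; B, Σ) stays within a bounded range:

```python
import numpy as np
from scipy.stats import norm

for y in [0.0, 5.0, 50.0, 500.0]:
    nll = -norm.logpdf(y)        # likelihood-based loss term: unbounded in y
    ise = -2.0 * norm.pdf(y)     # ISE data term: always in [-2*norm.pdf(0), 0]
    print(f"y = {y:6.1f}   neg. log-lik = {nll:12.2f}   ISE term = {ise:9.6f}")
```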

3.2 Optimization

The REG-ISE objective function (8) is non-convex and non-smooth. In order to solve it, one could consider approximating the log-sum-exp term in the objective, combined with alternate optimization over B and Σ^{-1} (as done in MRCE). However, the convergence of alternate optimization can be very slow, as observed in the case of MRCE. Instead, we propose to adopt a sequential quadratic programming algorithm recently developed by Curtis & Overton [9] for non-smooth and non-convex optimization. The basic idea of their algorithm, SLQP-GS, is to combine sequential quadratic approximation with a process of gradient sampling so that the computation of the search direction is effective in non-smooth regions. The only requirement for SLQP-GS to be applicable is that the objective and constraints (if any) be continuously differentiable on open dense subsets, which is satisfied in our case. We also benefit from the convergence guarantees of SLQP-GS, namely that the algorithm is guaranteed to converge to a solution regardless of the initialization with probability one. We employed the Matlab implementation of SLQP-GS provided by the authors, which is available from coral.ie.lehigh.edu/~frankecurtis/software. Due to space constraints, we refer the reader to Curtis & Overton [9] for details on the algorithm, its Matlab implementation and convergence results. We note that the gradient sampling step in SLQP-GS can be efficiently parallelized for fast computation in high dimensional applications. Alternatively, one can perform adaptive sampling of gradients over the course of the optimization process, as described in [10].

3.3 Consistency Results

L2 distance estimators are known to strike the right balance between statistical efficiency and robustness [2, 28]. In this section, we add to this body of evidence by showing that the REG-ISE estimator is root-n consistent in the setting of fixed dimensionality p. Denote by B̄ the true regression coefficient matrix, and by Σ̄ the true covariance matrix. We assume the following conditions:

(C1) (1/n) X'X → A, where A is positive definite.
(C2) There exist √n-consistent estimators of B̄ and Σ̄^{-1}.

Condition (C2) can be replaced by technical regularity conditions as in [13] so as to guarantee the consistency of ordinary maximum likelihood estimators.

Theorem 1. Consider sequences λ_{1,n} and λ_{2,n} of regularization parameters such that λ_{1,n} n^{1/2} → 0 and λ_{2,n} n^{1/2} → 0. Then, under conditions (C1) and (C2), there exists a local minimizer (B̂, Σ̂^{-1}) of REG-ISE such that

    ‖(vec(B̂)', vec(Σ̂^{-1})')' − (vec(B̄)', vec(Σ̄^{-1})')'‖ = O_p(1/√n).

The proof is provided in the Appendix. The above theoretical result holds for the case where the dimensionality p is fixed, while the sample size n is allowed to grow. As future work we plan to extend our results to the case where p is allowed to grow with the sample size. We conjecture that condition (C1) might be replaced by conditions on the sample and population covariance, while (C2) is still achieved by certain penalized maximum likelihood estimators [23]. We note, however, that such an extension is highly non-trivial. In fact, to our knowledge, no theory is yet available for joint estimation in the high dimensional case even for the standard maximum likelihood estimator, the non-convexity being the source of the difficulty. Nevertheless, in view of our superior empirical results, we hope that theoretical results can be obtained by making use of the techniques in [26], which solely deal with inverse covariance estimation, and those in [23], which solely concern regression.
3.4 Insights into Robustness

We provide some insights into the robustness of REG-ISE by considering a first-order approximation of the log-sum-exp term in (8). Define the parameter set β = (B, Σ^{-1}) and denote g_i(β) = −(1/2)(y_i − B'x_i)' Σ^{-1} (y_i − B'x_i). We consider a first-order approximation of −log[(1/n) ∑_i exp(g_i(β))] with respect to β as follows:

    −log[(1/n) ∑_{i=1}^n exp(g_i(β))] ≈ C_0 − ∑_{i=1}^n [exp(g_i(β^0)) / ∑_{i'=1}^n exp(g_{i'}(β^0))] ∇g_i(β^0)'(β − β^0),

where β^0 is an initial estimate and C_0 is some constant independent of β. Using the fact that g_i(β) ≈ g_i(β^0) + ∇g_i(β^0)'(β − β^0), we have

    −log[(1/n) ∑_{i=1}^n exp(g_i(β))] ≈ −∑_{i=1}^n [exp(g_i(β^0)) / ∑_{i'=1}^n exp(g_{i'}(β^0))] g_i(β),

up to some constant independent of β. Therefore, the objective function (8) can be approximated by

    −log|Σ^{-1}| + ∑_{i=1}^n w_i (y_i − B'x_i)' Σ^{-1} (y_i − B'x_i) + λ_1 ‖Σ^{-1}‖_1 + λ_2 ‖B‖_1,   (9)

up to some constant, where

    w_i ≡ w_i(β^0) = exp(g_i(β^0)) / ∑_{i'=1}^n exp(g_{i'}(β^0)).   (10)

By defining S_n ≡ S_n(β^0) = ∑_{i=1}^n w_i (y_i − B'x_i)(y_i − B'x_i)', we can rewrite (9) as

    −log|Σ^{-1}| + trace(Σ^{-1} S_n(β^0)) + λ_1 ‖Σ^{-1}‖_1 + λ_2 ‖B‖_1.   (11)

Note that S_n can be viewed as a weighted sample covariance matrix, where the weights are attached to the observations. One could then envision an approximate iterative procedure where, given initial estimates, the data are first re-weighted by the w_i in (10) and then alternately passed to lasso and l1-penalized inverse covariance solvers (e.g. QUIC, glasso) to provide new estimates, the procedure being repeated until convergence (see details in the Appendix). This intuitively elaborates the robustness property of REG-ISE. Indeed, the weights w_i are proportional to the likelihood functions of the individual data points, i.e., w_i = L(y_i|x_i; β^0) / ∑_{i'=1}^n L(y_{i'}|x_{i'}; β^0). Thus data with high likelihood values are given more weight in the estimation. Conversely, data with low likelihood values, which are more likely to be outliers, contribute less to the estimation. The connection between the likelihood functions and the weights nicely explains the resilience of the proposed estimator to outliers.
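A hedged sketch of this re-weighting scheme follows (our own, using scikit-learn's graphical_lasso and Lasso as the two off-the-shelf solvers). Note that the exact B-step in (9) couples the responses through Σ^{-1}; for simplicity this version drops that coupling and applies only the row weights w_i, via √w_i scaling, to each response:

```python
import numpy as np
from sklearn.covariance import graphical_lasso
from sklearn.linear_model import Lasso

def reweighted_fit(X, Y, lam1, lam2, n_iter=10):
    """Approximate iterative procedure: compute the weights (10), then
    alternate a graphical-lasso step on the weighted covariance (11)
    and a weighted lasso step for B (responses treated separately)."""
    n, q = Y.shape
    B = np.zeros((X.shape[1], q))
    Omega = np.eye(q)
    for _ in range(n_iter):
        R = Y - X @ B
        g = -0.5 * np.einsum('ij,jk,ik->i', R, Omega, R)  # g_i(beta^0)
        w = np.exp(g - g.max())
        w /= w.sum()                                      # weights (10)
        S = (R * w[:, None]).T @ R + 1e-6 * np.eye(q)     # weighted S_n, ridged
        _, Omega = graphical_lasso(S, alpha=lam1)         # inverse covariance step
        sw = np.sqrt(w)[:, None]                          # row weights for lasso
        B = Lasso(alpha=lam2, fit_intercept=False).fit(X * sw, Y * sw).coef_.T
    return B, Omega
```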

3.5 Tuning Parameter Selection

Approaches for choosing tuning parameters include cross-validation (CV) [4], the hold-out validation set method [21], and information criteria such as the Bayesian information criterion (BIC) [34]. Here we propose a modified scheme for the cross-validation method. The common K-fold CV consists in randomly partitioning the data into K folds, and then leaving out one fold of data as validation set while all the other folds are used as training set in each CV iteration. Note that CV assumes that the data are i.i.d., and therefore that the validation set and training set are statistically equivalent. However, such an assumption is no longer valid in the presence of outliers, since the proportions of outliers in the validation data and training data can differ. Consequently, the validation set cannot be used to evaluate the model obtained from the training set. To tackle this issue, we develop a modified cross-validation scheme motivated by the idea of sliced designs [25]. Specifically, we perform K-fold cross-validation for n = mK observations as follows. Based on initial estimates of the model parameters, we first rank the observed data according to the values of their likelihood functions. Then the first K data points are randomly assigned to the K folds, one point per fold. Subsequently, the next K data points are randomly assigned to the K folds. This procedure is repeated m times. In this way, the data in each fold are more likely to have similar distributions. This modified scheme can also be applied to tuning via the hold-out validation set method.
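A sketch of the sliced fold assignment (our own): observations are ranked by an initial log-likelihood, and each consecutive block of K points is spread across the K folds, so that every fold covers the full likelihood range:

```python
import numpy as np

def sliced_folds(loglik, K, rng):
    """Assign n = m*K observations to K folds as in Section 3.5."""
    n = len(loglik)
    order = np.argsort(loglik)                  # rank data by likelihood
    folds = np.empty(n, dtype=int)
    for start in range(0, n, K):                # one block of K points at a time
        block = order[start:start + K]
        folds[block] = rng.permutation(K)[:len(block)]  # one point per fold
    return folds

# usage: folds = sliced_folds(initial_loglik, K=5, rng=np.random.default_rng(0))
```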
4. SIMULATION STUDY

We compare the proposed REG-ISE with MRCE [27], l1-CGGM [29], CAPME [7], and LAD-Lasso [32]. Note that LAD-Lasso can only estimate regression coefficients. In our experiments, the rows of the n × p predictor matrices X are sampled independently from N(0, Σ_x), where (Σ_x)_{i,j} = 0.5^{|i−j|}. We consider the following two cases.

Case 1: The covariance matrix is set to Σ_{i,j} = 0.7^{|i−j|}, which corresponds to an AR(1) model with banded Σ^{-1}. We randomly select 10 percent of the predictors to be irrelevant to all the responses. Then, for each response, we randomly select half of the remaining predictors to be relevant for that response. The corresponding non-zero entries in the regression matrix B are sampled independently from N(0, 1). We consider 60 predictors, 20 responses and 100 observations.

Case 2: Σ^{-1} is the graph Laplacian of a tree with outdegree 4 and edge weights uniformly sampled from [0.3, 1.0]. For each response we randomly select 10 percent of the predictors to be relevant, and sample the corresponding non-zero entries in B independently from N(0, 1). We consider 1000 predictors, 100 responses and 400 observations.

To address the robustness issue, we consider various percentages of outliers contaminating the responses. The uncontaminated data are generated from y ∼ N(B'x, Σ), where B and Σ are specified above. Two outlier scenarios are considered: (i) outliers with respect to the mean, y ∼ N(B'x + C, Σ), where C is a constant vector of 5s, and (ii) outliers with respect to the covariance structure, y ∼ N(B'x, I), where I is the identity matrix.

To measure variable selection accuracy, we use the F1 score defined by F1 = 2PR/(P + R), where P is the precision (the fraction of correctly selected variables among the selected variables) and R is the recall (the fraction of correctly selected variables among the truly relevant variables). To measure the estimation accuracy of B, we report the model error, defined as ME(B̂, B) = tr[(B̂ − B)' Σ_x (B̂ − B)], where B̂ is the estimated regression coefficient matrix. The estimation accuracy for Σ^{-1} is measured by its l2 loss under the Frobenius norm, ‖Σ̂^{-1} − Σ^{-1}‖_F, where Σ̂^{-1} is the estimated inverse covariance matrix. For each of the above settings, we generate 50 simulated datasets.

For all five comparison methods, for each dataset, we use the modified 5-fold cross-validation described in Section 3.5 to tune the parameters λ_1 and λ_2. The choice of initial parameter estimates is important for MRCE and REG-ISE, as their respective objective functions are non-convex. The initial estimates for B are obtained using ridge regression. In addition, for REG-ISE, Σ^{-1} is initialized as the inverse of the sample covariance matrix of the ridge regression residuals, perturbed by a positive diagonal matrix.

Figures 1 and 2 present the results for Case 1. Results for Case 2 are summarized in Table 1. Similar behavior in terms of robustness is observed in both cases. Our proposed REG-ISE estimator clearly outperforms the other methods due to its robustness against outliers with respect to both mean and covariance deviations. The performance of MRCE, l1-CGGM and CAPME seriously degrades once outliers are introduced. Surprisingly, LAD-Lasso does not show much resilience to outliers. Moreover, with clean data, its estimation and variable selection accuracy is inferior to the other methods, as it ignores the dependencies among responses. Interestingly, even when there are no outliers in the data, REG-ISE is competitive, since it is more likely to distinguish true signals from various noise amplitudes.

Figure 1: Average model error ME(B̂, B) (top) and F1 scores (bottom) for B estimated by REG-ISE, MRCE, l1-CGGM, LAD-Lasso, and CAPME on simulated data of Case 1. Outliers in terms of the mean (left) and covariance (right). The x-axis corresponds to the percentage of outliers.
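For reference, the two evaluation metrics used above translate into a few lines of code (our sketch):

```python
import numpy as np

def f1_support(A_hat, A_true, tol=1e-8):
    """F1 = 2PR/(P+R) for the recovered support of B or Sigma^{-1}."""
    sel = np.abs(A_hat) > tol                  # selected entries
    rel = np.abs(A_true) > tol                 # truly relevant entries
    tp = np.sum(sel & rel)
    if tp == 0:
        return 0.0
    P, R = tp / sel.sum(), tp / rel.sum()      # precision and recall
    return 2 * P * R / (P + R)

def model_error(B_hat, B_true, Sigma_x):
    """ME(B_hat, B) = tr[(B_hat - B)' Sigma_x (B_hat - B)]."""
    D = B_hat - B_true
    return float(np.trace(D.T @ Sigma_x @ D))
```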

Table 1: Simulation results for Case 2. Top: model error for B / l2 loss for Σ^{-1}. Bottom: F1 for B / F1 for Σ^{-1}. Results are reported for REG-ISE, MRCE, l1-CGGM, CAPME and LAD-Lasso with no outliers and with mean and covariance outliers at various percentages; the Σ^{-1} entries for LAD-Lasso are NA since it does not estimate the inverse covariance.

Figure 2: Average estimation error ‖Σ̂^{-1} − Σ^{-1}‖_F (top) and F1 scores (bottom) for Σ^{-1} estimated by REG-ISE, MRCE, l1-CGGM, LAD-Lasso, and CAPME on simulated data of Case 1. Outliers in terms of the mean (left) and covariance (right). The x-axis corresponds to the percentage of outliers.

5. APPLICATIONS

In this section, we illustrate the usefulness of the proposed robust methods through two motivating applications and compare the results of our robust estimators with those of existing methods.

5.1 Asset Return Prediction

As a toy example for multivariate time series, we analyze a financial dataset which has been studied in [27] and [34]. This dataset contains weekly log-returns of 9 stocks for the year 2004. Given multivariate time series data of log-returns y_t for weeks t = 1, ..., T, a first-order vector autoregressive model is considered as follows:

    y_t = B'y_{t−1} + ε_t,  ε_t ∼ N(0, Σ),  t = 2, ..., T,

where the response matrix is formed by the observations at week t and the predictor matrix contains the observations at the previous week t−1. Following the analysis in Rothman et al. [27], we used the log-returns of the first 26 weeks as training set, and the log-returns of the remaining 26 weeks as testing set. The tuning parameters were selected using the modified 10-fold cross-validation described in Section 3.5. Table 2 reports the mean squared prediction error (MSPE) of the five comparison methods. Even though all methods are competitive on this dataset, the REG-ISE estimator achieves the smallest prediction error.

Table 2: Prediction accuracy measured by MSPE for various methods on the asset return dataset.

    Method     MSPE
    REG-ISE    0.69
    MRCE       0.71
    l1-CGGM    0.72
    CAPME      0.72
    LAD-Lasso  0.73

Figure 3 presents the graphs induced by the estimates of Σ^{-1} using MRCE and REG-ISE, respectively. Comparing the two graphs, both MRCE and REG-ISE indicate that companies from the same industry are partially correlated, e.g. GE and IBM (technology), Ford and GM (auto industry). AIG (an insurance company) appears to be partially correlated with most of the other companies. However, there are some discrepancies between the two graphs; e.g., GM is found to be partially correlated with IBM by REG-ISE but uncorrelated by MRCE. Overall, the results from REG-ISE have a reasonable financial interpretation.
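Setting up the first-order VAR of Section 5.1 amounts to lagging the return matrix by one week. A small sketch (ours), with MSPE as the evaluation metric:

```python
import numpy as np

def var1_design(returns):
    """Pair week t-1 returns (predictors) with week t returns (responses)."""
    return returns[:-1], returns[1:]

def mspe(Y_hat, Y):
    """Mean squared prediction error."""
    return float(np.mean((Y_hat - Y) ** 2))

# X, Y = var1_design(R)            # R: T x 9 matrix of weekly log-returns
# X_tr, Y_tr, X_te, Y_te = X[:26], Y[:26], X[26:], Y[26:]
```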

Figure 3: Graphs from the estimates of the inverse covariance matrix Σ^{-1} obtained by MRCE (left) and REG-ISE (right).

5.2 eQTL Data Analysis

We analyze the yeast eQTL dataset [6], which contains genotype data for 2,956 SNPs (predictors) and microarray data for 6,216 genes (responses) on 112 segregants (instances). We extracted 1,260 unique SNPs and focused on 125 genes belonging to the cell-cycle pathway provided by the KEGG database [18]. For all methods, the tuning parameters were chosen via the 5-fold modified cross-validation described in Section 3.5. We first evaluated the predictive accuracy of our method and the comparison methods by randomly partitioning the data into training and test sets, using 90 observations for training and the remainder for testing. We computed the MSPE on the testing set. The average MSPEs based on 20 random partitions are presented in Table 3. We can see that overall the predictive performance of REG-ISE is superior to the other methods.
Figure 4 shows the cell-cycle pathways estimated by the proposed REG-ISE, the MRCE method and the CAPME method, along with the benchmark KEGG pathway. MRCE tends to estimate many spurious links. A similar observation holds for l1-CGGM, so its estimated graph is not represented here. CAPME recovers some of the links, but not as accurately as REG-ISE. This can be partly explained by the fact that CAPME does not take into account the covariance structure in its regression stage and does not have any feedback loop. This can result in poor estimation of the regression matrix B, which in turn may negatively impact the estimation of the precision matrix Σ^{-1}. In addition, lack of robustness can also result in inaccurate network reconstruction. Certain discrepancies between the true and estimated graphs may also be caused by inherent limitations of this dataset. For instance, some edges in the cell-cycle pathway may not be observable from gene expression data. Additionally, in this dataset, the perturbation of the cellular systems may not be significant enough to enable accurate inference of some of the links.

Figure 4: Yeast cell cycle network provided in the KEGG database (top left), estimated by REG-ISE (top right), MRCE (bottom left), and CAPME (bottom right).

Using the KEGG pathway as the ground truth, we also computed the F1 scores for the estimates of Σ^{-1}, shown in Table 4. As a sanity check, we analyzed the microarray data without the SNPs as predictors using glasso. The resulting graph was extremely dense, with a low F1 score. This indicates the disadvantage of procedures like glasso, which are unable to adjust for predictors (here, the genetic variants) in inverse covariance matrix estimation.

Table 4: F1 scores of the estimated cell-cycle network (higher values indicate higher accuracy), for REG-ISE, MRCE, l1-CGGM, CAPME, and glasso.

From the reconstructed networks and F1 scores, we conclude that REG-ISE most faithfully estimates the cell-cycle network compared to the other methods, which clearly demonstrates the value of embracing robustness.

6. CONCLUDING REMARKS

In this work, we have developed a robust framework to jointly estimate multiresponse regression and the inverse covariance matrix from high dimensional data. Our formulation is readily applicable to single regression and to sparse inverse covariance estimation by itself, as well as to the parameterization used in [29], which will be an interesting direction for future work. The proposed methodology is valuable for many applications beyond the integration of genomic and transcriptomic data and financial data analysis. Additional interesting future work includes extending the proposed method to directed graph modeling via vector autoregressive models, and extending our theoretical results to the high-dimensional setting.

Table 3: MSPEs under different methods based on 20 random partitions of the eQTL data into training and test sets.
    REG-ISE    MRCE    l1-CGGM    CAPME
    2.36 ±     ±       ±          ±

APPENDIX

A. PROOF OF LEMMA 1

Suppose that the eigenvalues of Σ^{-1} are d_i, i = 1, ..., q. Using the fact that (1/q) ∑_{i=1}^q d_i ≥ (∏_{i=1}^q d_i)^{1/q}, one has

    |Σ^{-1}|^{1/2} ≤ ((1/q) ∑_{i=1}^q d_i)^{q/2}.

For each eigenvalue d_i, we apply Gershgorin's circle theorem to obtain an upper bound. That is,

    |d_i − c_ii| ≤ ∑_{j≠i} |c_ij|  ⟹  d_i ≤ ∑_{j=1}^q |c_ij|,

where the c_ij, i = 1, ..., q, j = 1, ..., q, are the elements of the matrix Σ^{-1}. Therefore we have ∑_{i=1}^q d_i ≤ ∑_i ∑_j |c_ij| = ‖Σ^{-1}‖_1, which leads to

    |Σ|^{-1/2} ≤ (‖Σ^{-1}‖_1 / q)^{q/2}.

This ends the proof of Lemma 1.

B. PROOF SKETCH OF THEOREM 1

Recall our objective function:

    L_{n,λ}(B, Σ^{-1}) = −log[(1/n) ∑_{i=1}^n exp(−(1/2)(y_i − B'x_i)' Σ^{-1} (y_i − B'x_i))] − (1/2) log|Σ^{-1}| + λ_1 ‖Σ^{-1}‖_1 + λ_2 ‖B‖_1.   (12)

Let B̄ and Σ̄^{-1} be the true regression and inverse covariance matrices. We follow the same reasoning as in the proof of Theorem 1 in [13]. The key idea is that it is enough to show that, for any δ > 0, there exists a large constant C such that

    P{ inf_{‖U‖=C} L_{n,λ}(B̄ + n^{-1/2} U_1, Σ̄^{-1} + n^{-1/2} U_2) > L_{n,λ}(B̄, Σ̄^{-1}) } ≥ 1 − δ,

with U = (vec(U_1)', vec(U_2)')'. Define Q(B, Σ^{-1}) as

    Q(B, Σ^{-1}) = −log[(1/n) ∑_{i=1}^n exp(−(1/2)(y_i − B'x_i)' Σ^{-1} (y_i − B'x_i))].

By a reasoning similar to that of Section 3.4, one can show that around the true parameters (B̄, Σ̄^{-1}) the difference Q(B̄ + U_1, Σ̄^{-1} + U_2) − Q(B̄, Σ̄^{-1}) can be lower-bounded by

    ∑_{i=1}^n w_i (y_i − (B̄ + U_1)'x_i)' (Σ̄^{-1} + U_2)(y_i − (B̄ + U_1)'x_i) − ∑_{i=1}^n w_i (y_i − B̄'x_i)' Σ̄^{-1} (y_i − B̄'x_i) + o(1),

where w_i ≡ w_i(β^0) = exp(g_i(β^0)) / ∑_{i'=1}^n exp(g_{i'}(β^0)). Even though Q is globally non-convex, such an approximation is valid, as Q is locally bi-convex in a neighborhood of the true parameters (B̄, Σ̄^{-1}) with asymptotic probability one. Briefly, one can show that the probabilities of the events U_1' ∇²_B Q(B̄ + t U_1) U_1 < 0 and U_2' ∇²_{Σ^{-1}} Q(Σ̄^{-1} + t U_2) U_2 < 0 for t ∈ (0, 1) are small, uniformly over U_1 and U_2, as long as ‖U‖ is small enough. This can be done by brute-force calculation of the Hessians around the true solution, noticing that the expected values of U_1' ∇²_B Q(B̄ + t U_1) U_1 and U_2' ∇²_{Σ^{-1}} Q(Σ̄^{-1} + t U_2) U_2 are both non-negative for ‖U‖ small enough, and then upper-bounding the large deviations from the expected values using the Azuma-Hoeffding inequality.

Let us define V_n(U) = L_{n,λ}(B̄ + U_1, Σ̄^{-1} + U_2) − L_{n,λ}(B̄, Σ̄^{-1}). For notational convenience, denote Ỹ = diag(√w_1, ..., √w_n) Y and X̃ = diag(√w_1, ..., √w_n) X. Noting that |B̄_jk + u_{1j,k}| − |B̄_jk| = |u_{1j,k}| when B̄_jk = 0, and |Σ̄^{-1}_{st} + u_{2s,t}| − |Σ̄^{-1}_{st}| = |u_{2s,t}| when Σ̄^{-1}_{st} = 0, we then have

    V_n(U) ≥ −log|Σ̄^{-1} + U_2| + log|Σ̄^{-1}|
            + trace{(Σ̄^{-1} + U_2)(Ỹ − X̃(B̄ + U_1))'(Ỹ − X̃(B̄ + U_1))} − trace{Σ̄^{-1}(Ỹ − X̃B̄)'(Ỹ − X̃B̄)}
            + λ_1 ∑_{Σ̄^{-1}_{st} ≠ 0} (|Σ̄^{-1}_{st} + u_{2s,t}| − |Σ̄^{-1}_{st}|) + λ_2 ∑_{B̄_jk ≠ 0} (|B̄_jk + u_{1j,k}| − |B̄_jk|).

Following the same derivations as in the proof of Lemma 3 in [20], we can show that for a sufficiently large C, V_n(U) > 0 uniformly on {U : ‖U‖ = C} with probability greater than 1 − δ. This completes the proof.

C. APPROXIMATE ITERATIVE PROCEDURE

For completeness, we provide the details of the approximate iterative procedure discussed in Section 3.4.

Algorithm 1: Approximate Iterative Procedure
Step 1: Given an initial estimate Σ^{-1}_0 and an initial estimate B_0.
Step 2: Compute the w_i based on (10) and obtain S_n = ∑_{i=1}^n w_i (y_i − B_0'x_i)(y_i − B_0'x_i)'.
Step 3: Estimate Σ^{-1} by minimizing (11) given B_0:
    Σ̂^{-1} = argmin_{Σ^{-1}} −log|Σ^{-1}| + trace(Σ^{-1} S_n) + λ_1 ‖Σ^{-1}‖_1.
Step 4: Estimate B by minimizing (9) given Σ^{-1}_0:
    B̂ = argmin_B ∑_{i=1}^n w_i (y_i − B'x_i)' Σ^{-1}_0 (y_i − B'x_i) + λ_2 ‖B‖_1.
Step 5: If ‖Σ̂^{-1} − Σ^{-1}_0‖²_F ≤ δ_1 and ‖B̂ − B_0‖²_F ≤ δ_2, stop. Else set Σ^{-1}_0 = Σ̂^{-1} and B_0 = B̂ and go back to Step 2.
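Lemma 1 can be sanity-checked numerically. The sketch below (ours) draws random positive definite precision matrices and verifies |Σ|^{-1/2} = det(Σ^{-1})^{1/2} ≤ (‖Σ^{-1}‖_1/q)^{q/2}:

```python
import numpy as np

rng = np.random.default_rng(1)
q = 6
for _ in range(1000):
    A = rng.normal(size=(q, q))
    Omega = A @ A.T + q * np.eye(q)                # random PD Sigma^{-1}
    lhs = np.sqrt(np.linalg.det(Omega))            # |Sigma|^{-1/2}
    rhs = (np.abs(Omega).sum() / q) ** (q / 2)     # (||Omega||_1 / q)^{q/2}
    assert lhs <= rhs * (1 + 1e-12)
print("Lemma 1 inequality held on all draws")
```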

References

[1] O. Banerjee, L. El Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation. JMLR, 9:485-516, 2008.
[2] A. Basu, I. R. Harris, N. L. Hjort, and M. C. Jones. Robust and efficient estimation by minimising a density power divergence. Biometrika, 85:549-559, 1998.
[3] R. Beran. Robust location estimates. Annals of Statistics, 5:431-444, 1977.
[4] P. J. Bickel and E. Levina. Regularized estimation of large covariance matrices. Annals of Statistics, 36(1):199-227, 2008.
[5] L. Breiman and J. H. Friedman. Predicting multivariate responses in multiple linear regression. JRSS Series B, 1997.
[6] R. Brem and L. Kruglyak. The landscape of genetic complexity across 5,700 gene expression traits in yeast. Proc. Natl. Acad. Sci. USA, 102(5):1572-1577, 2005.
[7] T. T. Cai, H. Li, W. Liu, and J. Xie. Covariate adjusted precision matrix estimation with an application in genetical genomics. Biometrika, 2013.
[8] E. Candes and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35:2313-2351, 2007.
[9] F. E. Curtis and M. L. Overton. A sequential quadratic programming algorithm for nonconvex, nonsmooth constrained optimization. SIAM Journal on Optimization, 22(2):474-500, 2012.
[10] F. E. Curtis and X. Que. An adaptive gradient sampling algorithm for nonsmooth optimization. Optimization Methods and Software, 2013.
[12] D. L. Donoho and R. C. Liu. The automatic robustness of minimum distance functionals. Annals of Statistics, 16:552-586, 1988.
[13] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. JASA, 96:1348-1360, 2001.
[14] M. Finegold and M. Drton. Robust graphical modeling of gene networks using classical and alternative t-distributions. Annals of Applied Statistics, 5(2A):1057-1080, 2011.
[15] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432-441, 2008.
[16] C. Hsieh, M. Sustik, I. Dhillon, and P. Ravikumar. Sparse inverse covariance matrix estimation using quadratic approximation. In NIPS, 2011.
[17] J. Huang, N. Liu, M. Pourahmadi, and L. Liu. Covariance matrix selection and estimation via penalised normal likelihood. Biometrika, 93(1):85-98, 2006.
[18] M. Kanehisa, S. Goto, M. Furumichi, M. Tanabe, and M. Hirakawa. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Research, 38:D355-D360, 2010.
[19] S. L. Lauritzen. Graphical Models. Oxford: Clarendon Press, 1996.
[20] W. Lee and Y. Liu. Simultaneous multiple response regression and inverse covariance matrix estimation via penalized Gaussian maximum likelihood. J. Multivar. Anal., 2012.
[21] E. Levina, A. J. Rothman, and J. Zhu. Sparse estimation of large covariance matrices via a nested lasso penalty. Annals of Applied Statistics, 2(1), 2008.
[22] N. Meinshausen and P. Buhlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34(3):1436-1462, 2006.
[23] S. Negahban, P. Ravikumar, M. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. CoRR, abs/1010.2731, 2010.
[24] G. Obozinski, B. Taskar, and M. Jordan. Multi-task feature selection. Tech. report, University of California, Berkeley, 2006.
[25] P. Z. G. Qian and C. F. J. Wu. Sliced space-filling designs. Biometrika, 96(4):945-956, 2009.
[26] P. Ravikumar, M. Wainwright, G. Raskutti, and B. Yu. High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electron. J. Statist., 5:935-980, 2011.
[27] A. Rothman, E. Levina, and J. Zhu. Sparse multivariate regression with covariance estimation. JCGS, 19(4):947-962, 2010.
[28] D. Scott. Parametric statistical modeling by minimum integrated square error. Technometrics, 43:274-285, 2001.
[29] K. Sohn and S. Kim. Joint estimation of structured sparsity and output structure in multiple-output regression via inverse-covariance regularization. AISTATS, 2012.
[30] M. Sugiyama, T. Suzuki, T. Kanamori, M. C. du Plessis, S. Liu, and I. Takeuchi. Density-difference estimation. NIPS, 25, 2012.
[31] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.
[32] H. Wang, G. Li, and G. Jiang. Robust regression shrinkage and consistent variable selection through the LAD-lasso. Journal of Business and Economic Statistics, 25:347-355, 2007.
[33] J. Wolfowitz. The minimum distance method. Annals of Mathematical Statistics, 28(1):75-88, 1957.
[34] M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19-35, 2007.


More information

1 Inferential Methods for Correlation and Regression Analysis

1 Inferential Methods for Correlation and Regression Analysis 1 Iferetial Methods for Correlatio ad Regressio Aalysis I the chapter o Correlatio ad Regressio Aalysis tools for describig bivariate cotiuous data were itroduced. The sample Pearso Correlatio Coefficiet

More information

A RANK STATISTIC FOR NON-PARAMETRIC K-SAMPLE AND CHANGE POINT PROBLEMS

A RANK STATISTIC FOR NON-PARAMETRIC K-SAMPLE AND CHANGE POINT PROBLEMS J. Japa Statist. Soc. Vol. 41 No. 1 2011 67 73 A RANK STATISTIC FOR NON-PARAMETRIC K-SAMPLE AND CHANGE POINT PROBLEMS Yoichi Nishiyama* We cosider k-sample ad chage poit problems for idepedet data i a

More information

Convergence of random variables. (telegram style notes) P.J.C. Spreij

Convergence of random variables. (telegram style notes) P.J.C. Spreij Covergece of radom variables (telegram style otes).j.c. Spreij this versio: September 6, 2005 Itroductio As we kow, radom variables are by defiitio measurable fuctios o some uderlyig measurable space

More information

Quantile regression with multilayer perceptrons.

Quantile regression with multilayer perceptrons. Quatile regressio with multilayer perceptros. S.-F. Dimby ad J. Rykiewicz Uiversite Paris 1 - SAMM 90 Rue de Tolbiac, 75013 Paris - Frace Abstract. We cosider oliear quatile regressio ivolvig multilayer

More information

Element sampling: Part 2

Element sampling: Part 2 Chapter 4 Elemet samplig: Part 2 4.1 Itroductio We ow cosider uequal probability samplig desigs which is very popular i practice. I the uequal probability samplig, we ca improve the efficiecy of the resultig

More information

IP Reference guide for integer programming formulations.

IP Reference guide for integer programming formulations. IP Referece guide for iteger programmig formulatios. by James B. Orli for 15.053 ad 15.058 This documet is iteded as a compact (or relatively compact) guide to the formulatio of iteger programs. For more

More information

TEACHER CERTIFICATION STUDY GUIDE

TEACHER CERTIFICATION STUDY GUIDE COMPETENCY 1. ALGEBRA SKILL 1.1 1.1a. ALGEBRAIC STRUCTURES Kow why the real ad complex umbers are each a field, ad that particular rigs are ot fields (e.g., itegers, polyomial rigs, matrix rigs) Algebra

More information

Topic 9: Sampling Distributions of Estimators

Topic 9: Sampling Distributions of Estimators Topic 9: Samplig Distributios of Estimators Course 003, 2016 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be

More information

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering CEE 5 Autum 005 Ucertaity Cocepts for Geotechical Egieerig Basic Termiology Set A set is a collectio of (mutually exclusive) objects or evets. The sample space is the (collectively exhaustive) collectio

More information

A Relationship Between the One-Way MANOVA Test Statistic and the Hotelling Lawley Trace Test Statistic

A Relationship Between the One-Way MANOVA Test Statistic and the Hotelling Lawley Trace Test Statistic http://ijspccseetorg Iteratioal Joural of Statistics ad Probability Vol 7, No 6; 2018 A Relatioship Betwee the Oe-Way MANOVA Test Statistic ad the Hotellig Lawley Trace Test Statistic Hasthika S Rupasighe

More information

On Robust Estimation of High Dimensional Generalized Linear Models

On Robust Estimation of High Dimensional Generalized Linear Models O Robust Estimatio of High Dimesioal Geeralized Liear Models Euho Yag Departmet of Computer Sciece Uiversity of Texas Austi euho@csutexasedu Ambuj Tewari Departmet of Statistics Uiversity of Michiga A

More information

Expectation-Maximization Algorithm.

Expectation-Maximization Algorithm. Expectatio-Maximizatio Algorithm. Petr Pošík Czech Techical Uiversity i Prague Faculty of Electrical Egieerig Dept. of Cyberetics MLE 2 Likelihood.........................................................................................................

More information

Matrix Representation of Data in Experiment

Matrix Representation of Data in Experiment Matrix Represetatio of Data i Experimet Cosider a very simple model for resposes y ij : y ij i ij, i 1,; j 1,,..., (ote that for simplicity we are assumig the two () groups are of equal sample size ) Y

More information

Rates of Convergence by Moduli of Continuity

Rates of Convergence by Moduli of Continuity Rates of Covergece by Moduli of Cotiuity Joh Duchi: Notes for Statistics 300b March, 017 1 Itroductio I this ote, we give a presetatio showig the importace, ad relatioship betwee, the modulis of cotiuity

More information

Clustering. CM226: Machine Learning for Bioinformatics. Fall Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar.

Clustering. CM226: Machine Learning for Bioinformatics. Fall Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar. Clusterig CM226: Machie Learig for Bioiformatics. Fall 216 Sriram Sakararama Ackowledgmets: Fei Sha, Ameet Talwalkar Clusterig 1 / 42 Admiistratio HW 1 due o Moday. Email/post o CCLE if you have questios.

More information

Output Analysis and Run-Length Control

Output Analysis and Run-Length Control IEOR E4703: Mote Carlo Simulatio Columbia Uiversity c 2017 by Marti Haugh Output Aalysis ad Ru-Legth Cotrol I these otes we describe how the Cetral Limit Theorem ca be used to costruct approximate (1 α%

More information

Machine Learning Regression I Hamid R. Rabiee [Slides are based on Bishop Book] Spring

Machine Learning Regression I Hamid R. Rabiee [Slides are based on Bishop Book] Spring Machie Learig Regressio I Hamid R. Rabiee [Slides are based o Bishop Book] Sprig 015 http://ce.sharif.edu/courses/93-94//ce717-1 Liear Regressio Liear regressio: ivolves a respose variable ad a sigle predictor

More information

Read through these prior to coming to the test and follow them when you take your test.

Read through these prior to coming to the test and follow them when you take your test. Math 143 Sprig 2012 Test 2 Iformatio 1 Test 2 will be give i class o Thursday April 5. Material Covered The test is cummulative, but will emphasize the recet material (Chapters 6 8, 10 11, ad Sectios 12.1

More information

Sample Size Estimation in the Proportional Hazards Model for K-sample or Regression Settings Scott S. Emerson, M.D., Ph.D.

Sample Size Estimation in the Proportional Hazards Model for K-sample or Regression Settings Scott S. Emerson, M.D., Ph.D. ample ie Estimatio i the Proportioal Haards Model for K-sample or Regressio ettigs cott. Emerso, M.D., Ph.D. ample ie Formula for a Normally Distributed tatistic uppose a statistic is kow to be ormally

More information

This is an introductory course in Analysis of Variance and Design of Experiments.

This is an introductory course in Analysis of Variance and Design of Experiments. 1 Notes for M 384E, Wedesday, Jauary 21, 2009 (Please ote: I will ot pass out hard-copy class otes i future classes. If there are writte class otes, they will be posted o the web by the ight before class

More information

Axis Aligned Ellipsoid

Axis Aligned Ellipsoid Machie Learig for Data Sciece CS 4786) Lecture 6,7 & 8: Ellipsoidal Clusterig, Gaussia Mixture Models ad Geeral Mixture Models The text i black outlies high level ideas. The text i blue provides simple

More information

High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity

High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity High-dimesioal regressio with oisy ad missig data: Provable guaratees with o-covexity Po-Lig Loh Departmet of Statistics Uiversity of Califoria, Berkeley Berkeley, CA 94720 ploh@berkeley.edu Marti J. Waiwright

More information

Sequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence

Sequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece 1, 1, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet

More information

Regression with an Evaporating Logarithmic Trend

Regression with an Evaporating Logarithmic Trend Regressio with a Evaporatig Logarithmic Tred Peter C. B. Phillips Cowles Foudatio, Yale Uiversity, Uiversity of Aucklad & Uiversity of York ad Yixiao Su Departmet of Ecoomics Yale Uiversity October 5,

More information

Basics of Probability Theory (for Theory of Computation courses)

Basics of Probability Theory (for Theory of Computation courses) Basics of Probability Theory (for Theory of Computatio courses) Oded Goldreich Departmet of Computer Sciece Weizma Istitute of Sciece Rehovot, Israel. oded.goldreich@weizma.ac.il November 24, 2008 Preface.

More information

Lecture 2: Monte Carlo Simulation

Lecture 2: Monte Carlo Simulation STAT/Q SCI 43: Itroductio to Resamplig ethods Sprig 27 Istructor: Ye-Chi Che Lecture 2: ote Carlo Simulatio 2 ote Carlo Itegratio Assume we wat to evaluate the followig itegratio: e x3 dx What ca we do?

More information

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.

More information

Linear Support Vector Machines

Linear Support Vector Machines Liear Support Vector Machies David S. Roseberg The Support Vector Machie For a liear support vector machie (SVM), we use the hypothesis space of affie fuctios F = { f(x) = w T x + b w R d, b R } ad evaluate

More information

b i u x i U a i j u x i u x j

b i u x i U a i j u x i u x j M ath 5 2 7 Fall 2 0 0 9 L ecture 1 9 N ov. 1 6, 2 0 0 9 ) S ecod- Order Elliptic Equatios: Weak S olutios 1. Defiitios. I this ad the followig two lectures we will study the boudary value problem Here

More information

Lecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound

Lecture 27. Capacity of additive Gaussian noise channel and the sphere packing bound Lecture 7 Ageda for the lecture Gaussia chael with average power costraits Capacity of additive Gaussia oise chael ad the sphere packig boud 7. Additive Gaussia oise chael Up to this poit, we have bee

More information

Topic 9: Sampling Distributions of Estimators

Topic 9: Sampling Distributions of Estimators Topic 9: Samplig Distributios of Estimators Course 003, 2018 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be

More information

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample. Statistical Iferece (Chapter 10) Statistical iferece = lear about a populatio based o the iformatio provided by a sample. Populatio: The set of all values of a radom variable X of iterest. Characterized

More information

Slide Set 13 Linear Model with Endogenous Regressors and the GMM estimator

Slide Set 13 Linear Model with Endogenous Regressors and the GMM estimator Slide Set 13 Liear Model with Edogeous Regressors ad the GMM estimator Pietro Coretto pcoretto@uisa.it Ecoometrics Master i Ecoomics ad Fiace (MEF) Uiversità degli Studi di Napoli Federico II Versio: Friday

More information

Machine Learning Brett Bernstein

Machine Learning Brett Bernstein Machie Learig Brett Berstei Week 2 Lecture: Cocept Check Exercises Starred problems are optioal. Excess Risk Decompositio 1. Let X = Y = {1, 2,..., 10}, A = {1,..., 10, 11} ad suppose the data distributio

More information

Stat 139 Homework 7 Solutions, Fall 2015

Stat 139 Homework 7 Solutions, Fall 2015 Stat 139 Homework 7 Solutios, Fall 2015 Problem 1. I class we leared that the classical simple liear regressio model assumes the followig distributio of resposes: Y i = β 0 + β 1 X i + ɛ i, i = 1,...,,

More information

NANYANG TECHNOLOGICAL UNIVERSITY SYLLABUS FOR ENTRANCE EXAMINATION FOR INTERNATIONAL STUDENTS AO-LEVEL MATHEMATICS

NANYANG TECHNOLOGICAL UNIVERSITY SYLLABUS FOR ENTRANCE EXAMINATION FOR INTERNATIONAL STUDENTS AO-LEVEL MATHEMATICS NANYANG TECHNOLOGICAL UNIVERSITY SYLLABUS FOR ENTRANCE EXAMINATION FOR INTERNATIONAL STUDENTS AO-LEVEL MATHEMATICS STRUCTURE OF EXAMINATION PAPER. There will be oe 2-hour paper cosistig of 4 questios.

More information

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A.

Random Walks on Discrete and Continuous Circles. by Jeffrey S. Rosenthal School of Mathematics, University of Minnesota, Minneapolis, MN, U.S.A. Radom Walks o Discrete ad Cotiuous Circles by Jeffrey S. Rosethal School of Mathematics, Uiversity of Miesota, Mieapolis, MN, U.S.A. 55455 (Appeared i Joural of Applied Probability 30 (1993), 780 789.)

More information

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4 MATH 30: Probability ad Statistics 9. Estimatio ad Testig of Parameters Estimatio ad Testig of Parameters We have bee dealig situatios i which we have full kowledge of the distributio of a radom variable.

More information

Kernel density estimator

Kernel density estimator Jauary, 07 NONPARAMETRIC ERNEL DENSITY ESTIMATION I this lecture, we discuss kerel estimatio of probability desity fuctios PDF Noparametric desity estimatio is oe of the cetral problems i statistics I

More information

CS284A: Representations and Algorithms in Molecular Biology

CS284A: Representations and Algorithms in Molecular Biology CS284A: Represetatios ad Algorithms i Molecular Biology Scribe Notes o Lectures 3 & 4: Motif Discovery via Eumeratio & Motif Represetatio Usig Positio Weight Matrix Joshua Gervi Based o presetatios by

More information

Econ 325/327 Notes on Sample Mean, Sample Proportion, Central Limit Theorem, Chi-square Distribution, Student s t distribution 1.

Econ 325/327 Notes on Sample Mean, Sample Proportion, Central Limit Theorem, Chi-square Distribution, Student s t distribution 1. Eco 325/327 Notes o Sample Mea, Sample Proportio, Cetral Limit Theorem, Chi-square Distributio, Studet s t distributio 1 Sample Mea By Hiro Kasahara We cosider a radom sample from a populatio. Defiitio

More information

Double Stage Shrinkage Estimator of Two Parameters. Generalized Exponential Distribution

Double Stage Shrinkage Estimator of Two Parameters. Generalized Exponential Distribution Iteratioal Mathematical Forum, Vol., 3, o. 3, 3-53 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/.9/imf.3.335 Double Stage Shrikage Estimator of Two Parameters Geeralized Expoetial Distributio Alaa M.

More information

4.5 Multiple Imputation

4.5 Multiple Imputation 45 ultiple Imputatio Itroductio Assume a parametric model: y fy x; θ We are iterested i makig iferece about θ I Bayesia approach, we wat to make iferece about θ from fθ x, y = πθfy x, θ πθfy x, θdθ where

More information

10-701/ Machine Learning Mid-term Exam Solution

10-701/ Machine Learning Mid-term Exam Solution 0-70/5-78 Machie Learig Mid-term Exam Solutio Your Name: Your Adrew ID: True or False (Give oe setece explaatio) (20%). (F) For a cotiuous radom variable x ad its probability distributio fuctio p(x), it

More information

1 Hash tables. 1.1 Implementation

1 Hash tables. 1.1 Implementation Lecture 8 Hash Tables, Uiversal Hash Fuctios, Balls ad Bis Scribes: Luke Johsto, Moses Charikar, G. Valiat Date: Oct 18, 2017 Adapted From Virgiia Williams lecture otes 1 Hash tables A hash table is a

More information

Introduction to Machine Learning DIS10

Introduction to Machine Learning DIS10 CS 189 Fall 017 Itroductio to Machie Learig DIS10 1 Fu with Lagrage Multipliers (a) Miimize the fuctio such that f (x,y) = x + y x + y = 3. Solutio: The Lagragia is: L(x,y,λ) = x + y + λ(x + y 3) Takig

More information

Binary classification, Part 1

Binary classification, Part 1 Biary classificatio, Part 1 Maxim Ragisky September 25, 2014 The problem of biary classificatio ca be stated as follows. We have a radom couple Z = (X,Y ), where X R d is called the feature vector ad Y

More information

Correlation Regression

Correlation Regression Correlatio Regressio While correlatio methods measure the stregth of a liear relatioship betwee two variables, we might wish to go a little further: How much does oe variable chage for a give chage i aother

More information