A Dirty Model for Multi-task Learning


Ali Jalali, Sujay Sanghavi, Pradeep Ravikumar, and Chao Ruan (University of Texas at Austin)

Abstract

We consider multi-task learning in the setting of multiple linear regression, where some relevant features could be shared across the tasks. Recent research has studied the use of ℓ1/ℓq norm block-regularizations with q > 1 for such block-sparse structured problems, establishing strong guarantees on recovery even under high-dimensional scaling where the number of features scales with the number of observations. However, these papers also caution that the performance of such block-regularized methods is very dependent on the extent to which the features are shared across tasks. Indeed they show [8] that if the extent of overlap is less than a threshold, or even if parameter values in the shared features are highly uneven, the block ℓ1/ℓq regularization could actually perform worse than simple separate elementwise ℓ1 regularization. Since these caveats depend on the unknown true parameters, we might not know when and which method to apply. Even otherwise, we are far away from a realistic multi-task setting: not only do the sets of relevant features have to be exactly the same across tasks, but their values have to be as well. Here, we ask the question: can we leverage parameter overlap when it exists, but not pay a penalty when it does not? Indeed, this falls under a more general question of whether we can model such dirty data which may not fall into a single neat structural bracket (all block-sparse, or all low-rank, and so on). With the explosion of such dirty high-dimensional data in modern settings, it is vital to develop tools (dirty models) to perform biased statistical estimation tailored to such data. Here, we take a first step, focusing on developing a dirty model for the multiple regression problem. Our method uses a very simple idea: we estimate a superposition of two sets of parameters and regularize them differently. We show, both theoretically and empirically, that our method strictly and noticeably outperforms both ℓ1 and ℓ1/ℓq methods, under high-dimensional scaling and over the entire range of possible overlaps (except at boundary cases, where we match the best method).

1 Introduction: Motivation and Setup

High-dimensional scaling. In fields across science and engineering, we are increasingly faced with problems where the number of variables or features p is larger than the number of observations n. Under such high-dimensional scaling, for any hope of statistically consistent estimation, it becomes vital to leverage any potential structure in the problem, such as sparsity (e.g. in compressed sensing [3] and LASSO [14]), low-rank structure [13, 9], or sparse graphical model structure [12]. It is in such high-dimensional contexts in particular that multi-task learning [4] could be most useful.

Here, multiple tasks share some common structure such as sparsity, and estimating these tasks jointly by leveraging this common structure could be more statistically efficient.

Block-sparse Multiple Regression. A common multiple-task learning setting, and the focus of this paper, is that of multiple regression, where we have r > 1 response variables and a common set of p features or covariates. The r tasks could share certain aspects of their underlying distributions, such as a common variance, but the setting we focus on in this paper is where the response variables have simultaneously sparse structure: the index set of relevant features for each task is sparse, and there is a large overlap of these relevant features across the different regression problems. Such simultaneous sparsity arises in a variety of contexts [15]; indeed, most applications of sparse signal recovery in contexts ranging from graphical model learning, kernel learning, and function estimation have natural extensions to the simultaneous-sparse setting [12, 2, 11]. It is useful to represent the multiple regression parameters via a matrix, where each column corresponds to a task, and each row to a feature. Having simultaneous sparse structure then corresponds to the matrix being largely "block-sparse", where each row is either all zero or mostly non-zero, and the number of non-zero rows is small. A lot of recent research in this setting has focused on ℓ1/ℓq norm regularizations, for q > 1, that encourage the parameter matrix to have such block-sparse structure. Particular examples include results using the ℓ1/ℓ∞ norm [16, 15, 8], and the ℓ1/ℓ2 norm [7, 10].

Dirty Models. Block-regularization is heavy-handed in two ways. By strictly encouraging shared sparsity, it assumes that all relevant features are shared, and hence suffers under settings, arguably more realistic, where each task depends on features specific to itself in addition to the ones that are common. The second concern with such block-sparse regularizers is that the ℓ1/ℓq norms can be shown to encourage the entries in the non-sparse rows to take nearly identical values. Thus we are far away from the original goal of multi-task learning: not only do the sets of relevant features have to be exactly the same, but their values have to be as well. Indeed, recent research into such regularized methods [8, 10] cautions against the use of block-regularization in regimes where the supports and values of the parameters for each task can vary widely. Since the true parameter values are unknown, that is a worrisome caveat.

We thus ask the question: can we learn multiple regression models by leveraging whatever overlap of features there exists, and without requiring the parameter values to be near identical? Indeed, this is an instance of a more general question of whether we can estimate statistical models where the data may not fall cleanly into any one structural bracket (sparse, block-sparse, and so on). With the explosion of dirty high-dimensional data in modern settings, it is vital to investigate estimation of corresponding dirty models, which might require new approaches to biased high-dimensional estimation. In this paper we take a first step, focusing on such dirty models for a specific problem: simultaneously sparse multiple regression. Our approach uses a simple idea: while any one structure might not capture the data, a superposition of structural classes might. Our method thus searches for a parameter matrix that can be decomposed into a row-sparse matrix (corresponding to the overlapping or shared features) and an elementwise sparse matrix (corresponding to the non-shared features).
As we show both theoretically and empirically, with this simple fix we are able to leverage any extent of shared features, while allowing disparities in the supports and values of the parameters, so that we are always better than both the Lasso and block-sparse regularizers (at times remarkably so).

The rest of the paper is organized as follows. In Sec. 2, the basic definitions and setup of the problem are presented. The main results of the paper are discussed in Sec. 3. Experimental results and simulations are demonstrated in Sec. 4.

Notation: For any matrix M, we denote its j-th row as M_j, and its k-th column as M^(k). The set of all non-zero rows (i.e. all rows with at least one non-zero element) is denoted by RowSupp(M), and its support by Supp(M). Also, for any matrix M, let ‖M‖_{1,1} := Σ_{j,k} |M_j^(k)|, i.e. the sum of the absolute values of the elements, and ‖M‖_{1,∞} := Σ_j ‖M_j‖_∞, where ‖M_j‖_∞ := max_k |M_j^(k)|.
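As a small illustration of this notation (not part of the original paper), the following Python sketch computes RowSupp(M) and the two matrix norms defined above using numpy.

```python
import numpy as np

def row_supp(M):
    """RowSupp(M): indices of rows with at least one non-zero entry."""
    return np.flatnonzero(np.any(M != 0, axis=1))

def norm_1_1(M):
    """||M||_{1,1}: sum of the absolute values of all entries."""
    return np.abs(M).sum()

def norm_1_inf(M):
    """||M||_{1,inf}: sum over rows of the largest absolute entry in each row."""
    return np.abs(M).max(axis=1).sum()

M = np.array([[1.0, -2.0, 0.0],
              [0.0,  0.0, 0.0],
              [0.5,  0.0, 3.0]])
print(row_supp(M))                   # [0 2]
print(norm_1_1(M), norm_1_inf(M))    # 6.5 5.0
```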

2 Problem Set-up and Our Method

Multiple regression. We consider the following standard multiple linear regression model:

y^(k) = X^(k) θ*^(k) + w^(k),   k = 1, ..., r,

where y^(k) ∈ R^n is the response for the k-th task, regressed on the design matrix X^(k) ∈ R^{n×p} (possibly different across tasks), while w^(k) ∈ R^n is the noise vector. We assume each w^(k) is drawn independently from N(0, σ² I_{n×n}). The total number of tasks or target variables is r, the number of features is p, while the number of samples we have for each task is n. For notational convenience, we collate these quantities into matrices Y ∈ R^{n×r} for the responses, Θ* ∈ R^{p×r} for the regression parameters, and W ∈ R^{n×r} for the noise.

Dirty Model. In this paper we are interested in estimating the true parameter Θ* from data by leveraging any (unknown) extent of simultaneous-sparsity. In particular, certain rows of Θ* would have many non-zero entries, corresponding to features shared by several tasks ("shared" rows), while certain rows would be elementwise sparse, corresponding to those features which are relevant for some tasks but not all ("non-shared" rows), while certain rows would have all zero entries, corresponding to those features that are not relevant to any task. We are interested in estimators Θ̂ that automatically adapt to different levels of sharedness, and yet enjoy the following guarantees:

Support recovery: We say an estimator Θ̂ successfully recovers the true signed support if sign(Supp(Θ̂)) = sign(Supp(Θ*)). We are interested in deriving sufficient conditions under which the estimator succeeds. We note that this is stronger than merely recovering the row-support of Θ*, which is the union of its supports for the different tasks. In particular, denoting U_k for the support of the k-th column of Θ*, the row-support is U = ∪_k U_k.

Error bounds: We are also interested in providing bounds on the elementwise ℓ∞ norm error of the estimator Θ̂,

‖Θ̂ − Θ*‖_{∞,∞} = max_{j=1,...,p} max_{k=1,...,r} |Θ̂_j^(k) − Θ*_j^(k)|.

2.1 Our Method

Our method explicitly models the dirty block-sparse structure. We estimate a sum of two parameter matrices B and S with different regularizations for each: encouraging block-structured row-sparsity in B and elementwise sparsity in S. The corresponding "clean" models would either just use block-sparse regularizations [8, 10] or just elementwise sparsity regularizations [14, 18], so that either method would perform better in certain suited regimes. Interestingly, as we will see in the main results, by explicitly allowing the estimate to have both a block-sparse and an elementwise sparse component, we are able to outperform both classes of these "clean" models, for all regimes of Θ*.

Algorithm 1 (Dirty Block Sparse). Solve the following convex optimization problem:

(Ŝ, B̂) ∈ argmin_{S,B}  (1/(2n)) Σ_{k=1}^{r} ‖y^(k) − X^(k)(S^(k) + B^(k))‖₂² + λ_s ‖S‖_{1,1} + λ_b ‖B‖_{1,∞}.   (1)

Then output Θ̂ = B̂ + Ŝ.

3 Main Results and Their Consequences

We now provide precise statements of our main results. A number of recent results have shown that the Lasso [14, 18] and ℓ1/ℓ∞ block-regularization [8] methods succeed in recovering signed supports with controlled error bounds under high-dimensional scaling regimes. Our first two theorems extend these results to our dirty model setting. In Theorem 1, we consider the case of deterministic design matrices X^(k), and provide sufficient conditions guaranteeing signed support recovery and elementwise ℓ∞ norm error bounds. In Theorem 2, we specialize this theorem to the case where the rows of the design matrices are random, drawn from a general zero-mean Gaussian distribution: this allows us to provide the scaling of the number of observations n required in order to guarantee signed support recovery and bounded elementwise ℓ∞ norm error.
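Before turning to the analysis, here is a minimal sketch of the convex program (1) above, written with cvxpy (a generic solver choice assumed here; the paper does not prescribe an implementation). The elementwise absolute sum encodes ‖S‖_{1,1} and the sum of row-wise maxima encodes ‖B‖_{1,∞}.

```python
import cvxpy as cp

def dirty_block_sparse(Xs, ys, lam_s, lam_b):
    """Sketch of program (1): returns (S_hat, B_hat, Theta_hat).

    Xs, ys -- lists of per-task design matrices X^(k) (n x p) and responses y^(k) (n,).
    """
    n, p = Xs[0].shape
    r = len(Xs)
    S = cp.Variable((p, r))
    B = cp.Variable((p, r))
    # Squared loss summed over tasks, scaled as in (1).
    loss = sum(cp.sum_squares(ys[k] - Xs[k] @ (S[:, k] + B[:, k])) for k in range(r))
    # ||S||_{1,1} (elementwise) and ||B||_{1,inf} (sum of row maxima).
    penalty = lam_s * cp.sum(cp.abs(S)) + lam_b * cp.sum(cp.max(cp.abs(B), axis=1))
    cp.Problem(cp.Minimize(loss / (2 * n) + penalty)).solve()
    return S.value, B.value, S.value + B.value
```

For the small synthetic problems of Section 4, a generic solver of this form is sufficient; the regularization levels λ_s and λ_b would be chosen by cross-validation as described there.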

Our third result is the most interesting in that it explicitly quantifies the performance gains of our method vis-a-vis Lasso and the ℓ1/ℓ∞ block-regularization method. Since this entailed finding the precise constants underlying the earlier theorems, and a correspondingly more delicate analysis, we follow Negahban and Wainwright [8] and focus on the case where there are two tasks (i.e. r = 2), and where we have standard Gaussian design matrices as in Theorem 2. Further, while each of the two tasks depends on s features, only a fraction α of these are common. It is then interesting to see how the behaviors of the different regularization methods vary with the extent of overlap α.

Comparisons. Negahban and Wainwright [8] show that there is actually a phase transition in the scaling of the probability of successful signed support-recovery with the number of observations. Denote a particular rescaling of the sample size n as

θ_Lasso(n, p, α) = n / (2 s log(p − s)).

Then, as Wainwright [18] shows, when the rescaled number of samples scales as θ_Lasso > 1 + δ for any δ > 0, Lasso succeeds in recovering the signed support of all columns with probability converging to one. But when the sample size scales as θ_Lasso < 1 − δ for any δ > 0, Lasso fails with probability converging to one. For the ℓ1/ℓ∞-regularized multiple linear regression, define a similar rescaled sample size

θ_{1,∞}(n, p, α) = n / (s log(p − (2 − α)s)).

Then, as Negahban and Wainwright [8] show, there is again a transition in the probability of success from near zero to near one, at the rescaled sample size of θ_{1,∞} = (4 − 3α). Thus, for α < 2/3 ("less sharing") Lasso would perform better since its transition is at a smaller sample size, while for α > 2/3 ("more sharing") the ℓ1/ℓ∞ regularized method would perform better. As we show in our third theorem, the phase transition for our method occurs at the rescaled sample size of θ_{1,∞} = (2 − α), which is strictly before either the Lasso or the ℓ1/ℓ∞ regularized method, except at the boundary cases: α = 0, i.e. the case of no sharing, where we match Lasso, and α = 1, i.e. full sharing, where we match ℓ1/ℓ∞. Everywhere else, we strictly outperform both methods. Figure 1 shows the empirical performance of each of the three methods; as can be seen, they agree very well with the theoretical analysis. (Further details are in the experiments, Section 4.)

3.1 Sufficient Conditions for Deterministic Designs

We first consider the case where the design matrices X^(k), for k = 1, ..., r, are deterministic, and start by specifying the assumptions we impose on the model. We note that similar sufficient conditions for the deterministic X^(k) case were imposed in papers analyzing Lasso [18] and block-regularization methods [8, 10].

A0 (Column Normalization): ‖X_j^(k)‖₂ ≤ √(2n) for all j = 1, ..., p, k = 1, ..., r.

Let U_k denote the support of the k-th column of Θ*, and U = ∪_k U_k denote the union of supports across the tasks. Then we require that:

A1 (Incoherence Condition): γ_b := 1 − max_{j ∈ U^c} Σ_{k=1}^{r} ‖⟨X_j^(k), X_{U_k}^(k)⟩ ⟨X_{U_k}^(k), X_{U_k}^(k)⟩^{-1}‖₁ > 0.

We will also find it useful to define γ_s := 1 − max_{1≤k≤r} max_{j ∈ U_k^c} ‖⟨X_j^(k), X_{U_k}^(k)⟩ ⟨X_{U_k}^(k), X_{U_k}^(k)⟩^{-1}‖₁. Note that by the incoherence condition A1, we have γ_s > 0.

A2 (Eigenvalue Condition): C_min := min_{1≤k≤r} λ_min( (1/n) ⟨X_{U_k}^(k), X_{U_k}^(k)⟩ ) > 0.

A3 (Boundedness Condition): D_max := max_{1≤k≤r} ‖( (1/n) ⟨X_{U_k}^(k), X_{U_k}^(k)⟩ )^{-1}‖_∞ < ∞.

Further, we require the regularization penalties to be set as

λ_s > (2(2 − γ_s) σ / γ_s) √(log(pr)/n)   and   λ_b > (2(2 − γ_b) σ / γ_b) √(log(pr)/n).   (2)
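Returning to the comparison of phase transitions above, the following sketch (not from the paper) evaluates the approximate sample sizes at which the three transitions occur for the two-task case, using the rescalings just defined: n ≈ 2 s log(p − s) for Lasso, n ≈ (4 − 3α) s log(p − (2 − α)s) for ℓ1/ℓ∞, and n ≈ (2 − α) s log(p − (2 − α)s) for the dirty model.

```python
import numpy as np

def transition_sample_sizes(p, s, alpha):
    """Approximate phase-transition sample sizes for the two-task comparison."""
    n_lasso = 2.0 * s * np.log(p - s)                                 # Lasso [18]
    n_linf = (4.0 - 3.0 * alpha) * s * np.log(p - (2.0 - alpha) * s)  # l1/l_inf [8]
    n_dirty = (2.0 - alpha) * s * np.log(p - (2.0 - alpha) * s)       # dirty model (Theorem 3)
    return n_lasso, n_linf, n_dirty

for alpha in (0.3, 2.0 / 3.0, 0.8):
    print(alpha, np.round(transition_sample_sizes(p=512, s=51, alpha=alpha)))
```

For every α strictly between 0 and 1 the dirty-model threshold is the smallest of the three, matching the qualitative picture in Figure 1.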

Figure 1: Probability of success in recovering the true signed support using the dirty model, Lasso, and the ℓ1/ℓ∞ regularizer. For a 2-task problem, the probability of success is plotted against the control parameter θ for different values of the feature-overlap fraction α: (a) α = 0.3, (b) α = 2/3, (c) α = 0.8, with p ∈ {128, 256, 512}. In the regimes where Lasso is better than, as good as, and worse than the ℓ1/ℓ∞ regularizer ((a), (b) and (c) respectively), the dirty model outperforms both of the methods, i.e., it requires fewer observations for successful recovery of the true signed support compared to Lasso and the ℓ1/ℓ∞ regularizer. Here s = p/10 always.

Theorem 1. Suppose A0-A3 hold, and that we obtain the estimate Θ̂ from our algorithm with regularization parameters chosen according to (2). Then, with probability at least 1 − c₁ exp(−c₂ n), we are guaranteed that the convex program (1) has a unique optimum and

(a) The estimate Θ̂ has no false inclusions, and has bounded ℓ∞ norm error, so that

Supp(Θ̂) ⊆ Supp(Θ*),   and   ‖Θ̂ − Θ*‖_{∞,∞} ≤ √( (4σ² log(pr)) / (C_min n) ) + λ_s D_max =: b_min.

(b) sign(Supp(Θ̂)) = sign(Supp(Θ*)), provided that min_{(j,k) ∈ Supp(Θ*)} |θ*_j^(k)| > b_min.

Here the positive constants c₁, c₂ depend only on γ_s, γ_b, λ_s, λ_b and σ, but are otherwise independent of n, p, r, the problem dimensions of interest.

Remark: Condition (a) guarantees that the estimate will have no false inclusions; i.e. all included features will be relevant. If, in addition, we require that it have no false exclusions, and that it recover the support exactly, we need to impose the assumption in (b) that the non-zero elements are large enough to be detectable above the noise.

3.2 General Gaussian Designs

Often the design matrices consist of samples from a Gaussian ensemble. Suppose that for each task k = 1, ..., r the design matrix X^(k) ∈ R^{n×p} is such that each row X_i^(k) ∈ R^p is a zero-mean Gaussian random vector with covariance matrix Σ^(k) ∈ R^{p×p}, and is independent of every other row. Let Σ^(k)_{V,U} ∈ R^{|V|×|U|} be the submatrix of Σ^(k) with rows corresponding to V and columns to U. We require these covariance matrices to satisfy the following conditions:

C1 (Incoherence Condition): γ_b := 1 − max_{j ∈ U^c} Σ_{k=1}^{r} ‖Σ^(k)_{j, U_k} (Σ^(k)_{U_k, U_k})^{-1}‖₁ > 0.

C2 (Eigenvalue Condition): C_min := min_{1≤k≤r} λ_min(Σ^(k)_{U_k, U_k}) > 0, so that the minimum eigenvalue is bounded away from zero.

C3 (Boundedness Condition): D_max := max_{1≤k≤r} ‖(Σ^(k)_{U_k, U_k})^{-1}‖_∞ < ∞.

These conditions are analogues of the conditions for deterministic designs; they are now imposed on the covariance matrices of the (randomly generated) rows of the design matrices. Further, defining s := max_k |U_k|, we require the regularization penalties to be set as

λ_s > (4σ √(C_min log(pr))) / (γ_s (√(n C_min) − √(s log(pr))))   and
λ_b > (4σ √(C_min r (r log 2 + log p))) / (γ_b (√(n C_min) − √(s r (r log 2 + log p)))).   (3)

Theorem 2. Suppose assumptions C1-C3 hold, and that the number of samples scales as

n > max{ s log(pr) / (C_min γ_s²),  s r (r log 2 + log p) / (C_min γ_b²) },

up to constant factors. Suppose we obtain the estimate Θ̂ from our algorithm with regularization parameters chosen according to (3). Then, with probability at least 1 − c₁ exp(−c₂ (r log 2 + log p)) − c₃ exp(−c₄ log(rs)) for some positive numbers c₁-c₄, we are guaranteed that the algorithm estimate Θ̂ is unique and satisfies the following conditions:

(a) The estimate Θ̂ has no false inclusions, and has bounded ℓ∞ norm error, so that

Supp(Θ̂) ⊆ Supp(Θ*),   and   ‖Θ̂ − Θ*‖_{∞,∞} ≤ √( (50σ² log(rs)) / (C_min n) ) + λ_s ( √(4s)/C_min + D_max ) =: g_min.

(b) sign(Supp(Θ̂)) = sign(Supp(Θ*)), provided that min_{(j,k) ∈ Supp(Θ*)} |θ*_j^(k)| > g_min.

3.3 Sharp Transition for 2-Task Gaussian Designs

This is one of the most important results of this paper. Here, we perform a more delicate and finer analysis to establish the precise quantitative gains of our method. We focus on the special case where r = 2 and the design matrix has rows generated from the standard Gaussian distribution N(0, I_{p×p}), so that C1-C3 hold with C_min = D_max = 1. As we will see both analytically and experimentally, our method strictly outperforms both Lasso and ℓ1/ℓ∞ block-regularization in all cases, except at the extreme endpoints of no support sharing (where it matches that of Lasso) and full support sharing (where it matches that of ℓ1/ℓ∞). We now present our analytical results; the empirical comparisons are presented next in Section 4. The results will be in terms of a particular rescaling of the sample size n as

θ(n, p, s, α) := n / ((2 − α) s log(p − (2 − α)s)).

We will also require the assumptions that

F1: λ_s > (4σ √((1 − s/n)(log r + log(p − (2 − α)s)))) / (√n − √((2 − α) s (log r + log(p − (2 − α)s)))),

F2: λ_b > (4σ √((1 − s/n) r (r log 2 + log(p − (2 − α)s)))) / (√n − √((2 − α/2) s r (r log 2 + log(p − (2 − α)s)))).

Theorem 3. Consider a 2-task regression problem (n, p, s, α), where the design matrix has rows generated from the standard Gaussian distribution N(0, I_{p×p}).

Suppose that max_{j ∈ B} |Θ*_j^(1) − Θ*_j^(2)| = o(λ_s), where B is the set of rows of Θ* in which both entries are non-zero. Then the estimate Θ̂ of the problem (1) satisfies the following:

(Success) Suppose the regularization coefficients satisfy F1-F2. Further, assume that the number of samples scales as θ(n, p, s, α) > 1. Then, with probability at least 1 − c₁ exp(−c₂ n) for some positive numbers c₁ and c₂, we are guaranteed that Θ̂ satisfies the support-recovery and ℓ∞ error bound conditions (a)-(b) in Theorem 2.

(Failure) If θ(n, p, s, α) < 1, there is no solution (B̂, Ŝ), for any choices of λ_s and λ_b, such that sign(Supp(Θ̂)) = sign(Supp(Θ*)).

We note that we require the gap |Θ*_j^(1) − Θ*_j^(2)| to be small only on rows where both entries are non-zero. As we show in a more general theorem in the appendix, even in the case where the gap is large, the dependence of the sample scaling on the gap is quite weak.

4 Empirical Results

In this section, we investigate the performance of our dirty block sparse estimator on synthetic and real-world data. The synthetic experiments explore the accuracy of Theorem 3, and compare our estimator with LASSO and the ℓ1/ℓ∞ regularizer. We see that Theorem 3 is very accurate indeed. Next, we apply our method to a real-world dataset containing hand-written digits for classification, a multi-task regression dataset with r = 10 tasks; again we compare against LASSO and ℓ1/ℓ∞. On this real-world dataset, we show that the dirty model outperforms both LASSO and ℓ1/ℓ∞ in practice. For each method, the parameters are chosen via cross-validation; see the supplemental material for more details.

4.1 Synthetic Data Simulation

We consider an r = 2-task regression problem as discussed in Theorem 3, for a range of parameters (n, p, s, α). The design matrices X have each entry i.i.d. Gaussian with mean 0 and variance 1. For each fixed set of (n, s, p, α), we generate 100 instances of the problem. In each instance, given p, s, α, the locations of the non-zero entries of the true Θ* are chosen at random; each non-zero entry is then chosen to be i.i.d. Gaussian with mean 0 and variance 1. n samples are then generated from this model. We then attempt to estimate Θ* using three methods: our dirty model, the ℓ1/ℓ∞ regularizer, and LASSO. In each case, and for each instance, the penalty regularizer coefficients are found by cross-validation. After solving the three problems, we compare the signed support of the solution with the true signed support and decide whether or not the program was successful in signed support recovery. We describe this process in more detail in this section.

Performance Analysis: We ran the algorithm for three different values of the overlap ratio α ∈ {0.3, 2/3, 0.8}, with three different numbers of features p ∈ {128, 256, 512}. For any instance of the problem (n, p, s, α), if the recovered matrix Θ̂ has the same signed support as the true Θ*, then we count it as a success, otherwise a failure (even if one element has a different sign, we count it as a failure). As Theorem 3 predicts and Fig. 1 shows, the right scaling for the number of observations is (2 − α) s log(p − (2 − α)s), under which all curves stack on top of each other for each value of α. Also, the number of observations required by the dirty model for true signed support recovery is always less than for both LASSO and the ℓ1/ℓ∞ regularizer. Fig. 1(a) shows the probability of success for the case α = 0.3 (when LASSO is better than the ℓ1/ℓ∞ regularizer) and that the dirty model outperforms both methods. When α = 2/3 (see Fig. 1(b)), LASSO and the ℓ1/ℓ∞ regularizer perform the same; but the dirty model requires almost 33% fewer observations for the same performance. As α grows toward 1, e.g. α = 0.8 as shown in Fig. 1(c), ℓ1/ℓ∞ performs better than LASSO. Still, the dirty model performs better than both methods in this case as well.
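For reference, a minimal sketch of one trial of the simulation just described is given below (assumptions: numpy only, an illustrative noise level σ, and a simple tolerance-based sign check; the actual experiments select λ_s and λ_b by cross-validation and repeat 100 trials per configuration).

```python
import numpy as np

def make_instance(n, p, s, alpha, sigma=0.5, seed=0):
    """One 2-task instance with feature-overlap fraction alpha (cf. Sec. 4.1)."""
    rng = np.random.default_rng(seed)
    n_shared = int(round(alpha * s))
    shared = rng.choice(p, size=n_shared, replace=False)
    rest = np.setdiff1d(np.arange(p), shared)
    extra = rng.choice(rest, size=2 * (s - n_shared), replace=False)
    supports = [np.concatenate([shared, extra[:s - n_shared]]),
                np.concatenate([shared, extra[s - n_shared:]])]
    Theta = np.zeros((p, 2))
    for k in range(2):
        Theta[supports[k], k] = rng.standard_normal(s)       # N(0,1) non-zero entries
    Xs = [rng.standard_normal((n, p)) for _ in range(2)]      # standard Gaussian designs
    ys = [Xs[k] @ Theta[:, k] + sigma * rng.standard_normal(n) for k in range(2)]
    return Xs, ys, Theta

def signed_support_recovered(Theta_hat, Theta, tol=1e-3):
    """Success only if every entry's sign (after thresholding) matches the truth."""
    signs_hat = np.sign(np.where(np.abs(Theta_hat) > tol, Theta_hat, 0.0))
    return bool(np.array_equal(signs_hat, np.sign(Theta)))
```

A trial then consists of calling the dirty-model solver sketched in Section 2 on (Xs, ys) and checking signed_support_recovered on the returned Θ̂; the empirical success probability at a given n is the fraction of successful trials.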

Figure 2: Verification of the result of Theorem 3 on the behavior of the phase transition threshold as the parameter α is changed in a 2-task (n, p, s, α) problem, for the dirty model, LASSO and the ℓ1/ℓ∞ regularizer. The x-axis is the shared support parameter α, with curves for p ∈ {128, 256, 512}; the y-axis is n / (s log(p − (2 − α)s)), where n is the number of samples at which the threshold was observed. Here s = p/10. Our dirty model method shows a gain in sample complexity over the entire range of sharing α. The pre-constant in Theorem 3 is also validated.

                                        Our Model        ℓ1/ℓ∞        LASSO
  Training set size 10:
    Average Classification Error          8.6%            9.9%         10.8%
    Variance of Error                     0.53%           0.64%        0.5%
    Average Row Support Size              B: 65, B + S:
    Average Support Size                  S: 8,  B + S:
  Second training set size:
    Average Classification Error          3.0%            3.5%         4.%
    Variance of Error                     0.56%           0.6%         0.68%
    Average Row Support Size              B:     B + S:
    Average Support Size                  S: 34, B + S:
  Third training set size:
    Average Classification Error           .%             3.%           .8%
    Variance of Error                     0.57%           0.68%        0.85%
    Average Row Support Size              B: 70, B + S:
    Average Support Size                  S: 67, B + S:

Table 1: Handwriting Classification Results for our model, ℓ1/ℓ∞ and LASSO.

Scaling Verification: To verify that the phase transition threshold changes linearly with α as predicted by Theorem 3, we plot the phase transition threshold versus α. For five different values of α ∈ {0.05, 0.3, 2/3, 0.8, 0.95} and three different values of p ∈ {128, 256, 512}, we find the phase transition threshold for the dirty model, LASSO and the ℓ1/ℓ∞ regularizer. We consider the point where the probability of success in recovery of the signed support exceeds 50% as the phase transition threshold, and we find this point by interpolation on the closest two points. Fig. 2 shows that the phase transition threshold for the dirty model is always lower than the phase transition thresholds for LASSO and the ℓ1/ℓ∞ regularizer.

4.2 Handwritten Digits Dataset

We use the handwritten digit dataset [1], containing features of handwritten numerals (0-9) extracted from a collection of Dutch utility maps. This dataset has been used by a number of papers [17, 6] as a reliable dataset for handwritten recognition algorithms. There are thus r = 10 tasks, and each handwritten sample consists of p = 649 features. Table 1 shows the results of our analysis for different sizes of the training set. We measure the classification error for each digit to get the 10-vector of errors. Then, we find the average error and the variance of the error vector to show how the error is distributed over all tasks. We compare our method with the ℓ1/ℓ∞ regularizer method and LASSO. Again, in all methods, the parameters are chosen via cross-validation. For our method we separate out the B and S matrices that our method finds, so as to illustrate how many features it identifies as shared and how many as non-shared. For the other methods we just report the straight row-support and support numbers, since they do not make such a separation.

Acknowledgements

We acknowledge support from NSF grant IIS-084 and the NSF CAREER program.

References

[1] A. Asuncion and D.J. Newman. UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html. University of California, School of Information and Computer Science, Irvine, CA, 2007.
[2] F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179-1225, 2008.
[3] R. Baraniuk. Compressive sensing. IEEE Signal Processing Magazine, 24(4):118-121, 2007.
[4] R. Caruana. Multitask learning. Machine Learning, 28:41-75, 1997.
[5] C. Zhang and J. Huang. Model selection consistency of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36:1567-1594, 2008.
[6] X. He and P. Niyogi. Locality preserving projections. In NIPS, 2003.
[7] K. Lounici, A. B. Tsybakov, M. Pontil, and S. A. van de Geer. Taking advantage of sparsity in multi-task learning. In 22nd Conference On Learning Theory (COLT), 2009.
[8] S. Negahban and M. J. Wainwright. Joint support recovery under high-dimensional scaling: Benefits and perils of ℓ1,∞-regularization. In Advances in Neural Information Processing Systems (NIPS), 2008.
[9] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. In ICML, 2010.
[10] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 2010.
[11] P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society, Series B.
[12] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Annals of Statistics, 2009.
[13] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. In Allerton Conference, Allerton House, Illinois, 2007.
[14] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.
[15] J. A. Tropp, A. C. Gilbert, and M. J. Strauss. Algorithms for simultaneous sparse approximation. Signal Processing, Special issue on Sparse approximations in signal and image processing, 86:572-602, 2006.
[16] B. Turlach, W.N. Venables, and S.J. Wright. Simultaneous variable selection. Technometrics, 47:349-363, 2005.
[17] M. van Breukelen, R.P.W. Duin, D.M.J. Tax, and J.E. den Hartog. Handwritten digit recognition by combined classifiers. Kybernetika, 34(4):381-386, 1998.
[18] M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55:2183-2202, 2009.
