The deterministic Lasso
Sara van de Geer
Seminar für Statistik, ETH Zürich

Abstract

We study high-dimensional generalized linear models and empirical risk minimization using the Lasso. An oracle inequality is presented, under a so-called compatibility condition. Our aim is threefold: to prove a result announced in van de Geer (2007), to provide a simple proof with simple constants, and to separate the stochastic problem from the deterministic one.

KEY WORDS: coherence, Lasso, oracle, sparsity

1. Introduction

We consider the Lasso for high-dimensional generalized linear models, introducing three new elements. Firstly, we complete the proof of an oracle inequality, which was announced in van de Geer (2007). The result is shown to hold under a so-called compatibility condition for certain norms. This condition is a weaker version of Assumption C* in van de Geer (2007), and is related to coherence conditions in e.g. Candes and Tao (2007) and Bunea et al. (2006, 2007). Secondly, we state the result with simple constants, and give a new and rather straightforward proof. Thirdly, in Subsection 1.3, we separate the stochastic part of the problem from the deterministic part. In the subsequent sections, we then concentrate only on the deterministic part (whence the title of this paper). The advantage of this approach is that the behavior of the Lasso under various sampling schemes (e.g., dependent observations) can immediately be deduced once the stochastic problem (a probability inequality for the empirical process) is settled.

Consider data {Z_i}_{i=1}^n, with, for i = 1, ..., n, Z_i in some space Z. Let F̄ := (F̄, ‖·‖) be a normed real vector space and, for each f ∈ F̄, let ρ_f : Z → R be a loss function. We assume that the map f ↦ ρ_f(z) is convex for all z ∈ Z. For example, in a regression setup, one has, for i = 1, ..., n, Z_i = (X_i, Y_i), with covariables X_i ∈ X and response variable Y_i ∈ Y ⊂ R, and f : X → R is a regression function. Examples are quadratic loss

ρ_f(x, y) = (y − f(x))²,

or logistic loss

ρ_f(x, y) = −y f(x) + log(1 + exp[f(x)]),

etc.

1.1 The target

We denote, for a
function ρ : Z → R, the theoretical measure by

P ρ := Σ_{i=1}^n E ρ(Z_i) / n,

and the empirical measure by

P_n ρ := Σ_{i=1}^n ρ(Z_i) / n.

Define the target

f⁰ := arg min_{f ∈ F̄} P ρ_f.

We assume for simplicity that the minimum is attained, and that it is unique. For f ∈ F̄, the excess risk is

E(f) := P(ρ_f − ρ_{f⁰}).

1.2 The Lasso

Consider a given linear subspace F := {f_β : β ∈ R^p} ⊂ F̄, where β ↦ f_β is linear. The collection F will be the model class over which we perform empirical risk minimization. When F is high-dimensional, a complexity penalty can prevent overfitting. The Lasso is

β̂ = arg min_β { P_n ρ_{f_β} + λ‖β‖₁ },   (1)

where

‖β‖₁ := Σ_{j=1}^p |β_j|,

i.e., ‖·‖₁ is the ℓ₁-norm. The parameter λ is a smoothing (or tuning) parameter. We examine the excess risk E(f_β̂) of the Lasso, and show that it behaves as if it knew which variables are relevant for a linear approximation of the target f⁰. Section 2 does this under a so-called compatibility condition. Section 3 examines under what circumstances the compatibility condition is met.

In (1), the penalty weights all coefficients equally. In practice, certain terms (e.g., the constant term) will not be penalized, and a weighted version of the ℓ₁-norm is used, with e.g. more weight assigned to variables with large sample variance. In essence, (possibly random) weights do not alter the main issues in the theory. We therefore consider the non-weighted ℓ₁ penalty to clarify these issues. More details on the case of empirical weights are in Bunea et al. (2006) and van de Geer (2007).
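For the quadratic-loss case, the minimization in (1) is easy to sketch numerically. The following is our illustration, not part of the paper: it minimizes P_n ρ_{f_β} + λ‖β‖₁ with ρ_f(x, y) = (y − f(x))² and f_β(x) = xᵀβ by proximal gradient descent (ISTA); all names, the step-size rule, and the toy data are our own choices.

```python
import numpy as np

def soft_threshold(v, t):
    """Prox operator of t*||.||_1: shrink each coordinate towards 0 by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=5000):
    """Minimize (1/n)*||y - X @ beta||^2 + lam*||beta||_1 (quadratic loss)."""
    n, p = X.shape
    beta = np.zeros(p)
    # Step size = 1 / Lipschitz constant of the gradient, 2*lambda_max(X'X/n).
    step = 1.0 / (2 * np.linalg.eigvalsh(X.T @ X / n).max())
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Sparse ground truth: only the first 3 of p = 50 coefficients are non-zero.
rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:3] = [2.0, -1.5, 1.0]
y = X @ beta0 + 0.1 * rng.standard_normal(n)
beta_hat = lasso_ista(X, y, lam=0.2)
```

With this much signal and little noise, beta_hat puts (slightly shrunken) mass on the three truly active coordinates and leaves the remaining ones at or near zero, illustrating the "behaves as if it knew which variables are relevant" claim.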
1.3 The empirical process

Throughout, the study of the stochastic part of the problem, involving the empirical process, is set aside. It can be handled using empirical process theory. In the current paper, we simply restrict ourselves to a set S where the empirical process behaves well (see (6) in Lemma 2.1). This allows us to proceed with purely deterministic, and relatively straightforward, arguments. More precisely, write, for M > 0,

Z_M := sup_{‖β − β*‖₁ ≤ M} |(P_n − P)(ρ_{f_β} − ρ_{f_{β*}})|.   (2)

Here, β* is some fixed parameter value, which will be specified in Section 2. Our results hold on the set S = {Z_M ≤ ε}, with M and ε appropriately chosen. The ratio ε/M will play a major role: it has to be chosen large enough so that S has large probability. On the other hand, ε/M will form a lower bound for the smoothing parameter λ (see (5) in Lemma 2.1). It is shown in van de Geer (2007) that, for independent observations Z_i, under mild conditions, and with ε/M ≍ √(log(2p)/n), the set S has large probability in the case of Lipschitz loss functions, that is, when f ↦ ρ_f(z) is Lipschitz for all z. For other loss functions (e.g. quadratic loss), similar behavior can be obtained, but only for small enough values of M.

In the result of Lemma 2.1, one may replace P_n by any other random probability measure, say P̂, in the definition of β̂, provided one also adjusts the definition (2) of Z_M. The result for the Lasso goes through on S. One should then ensure that S contains most of the probability mass by taking ε/M appropriately large, and this will then put lower bounds on the smoothing parameter.

1.4 The margin behavior

We will make use of the so-called margin condition (see below), which is assumed to hold in a neighborhood, denoted by F_η, of the target f⁰. Some of the effort goes into proving that the estimator is indeed in this neighborhood. Here, we use arguments that rely on the assumed convexity of f ↦ ρ_f. These arguments also show that the stochastic part of the problem needs only to be studied locally (i.e., (2) for small values of M). The margin condition
requires that, in the neighborhood F_η ⊂ F̄ of f⁰, the excess risk E is bounded from below by a strictly convex function. This is true in many particular cases, for an L_∞ neighborhood F_η = {‖f − f⁰‖_∞ ≤ η} and F̄ a class of functions.

Definition. We say that the margin condition holds with strictly convex function G, if for all f ∈ F_η, we have

E(f) ≥ G(‖f − f⁰‖).

Indeed, in typical cases, the margin condition holds with a quadratic function G, that is,

G(u) = cu², u ≥ 0,

where c is a positive constant.

We also need the notion of the convex conjugate of a function.

Definition. Let G be a strictly convex function on [0, ∞), with G(0) = 0. The convex conjugate H of G is defined as

H(v) = sup_u {uv − G(u)}, v ≥ 0.

Note that when G is quadratic, its convex conjugate H is quadratic as well. The function H will appear in the estimation error term.

2. An oracle inequality

Our goal is now to show that the estimator has oracle properties, namely to show, for a set B ⊂ R^p as large as possible, that

E(f_β̂) ≤ const. × min_{β∈B} { E(f_β) + H(λ√(J_β)) },   (3)

with large probability. Here, H is the convex conjugate of the function G appearing in the margin condition. The right-hand side of (3) is small if the target f⁰ can be well approximated by an f_β with only a few of the coefficients β_j (j = 1, ..., p) non-zero, that is, by a sparse f_β. To arrive at such an inequality, we need a certain amount of compatibility between the ℓ₁-norm and the metric on the vector space F̄.

2.1 The compatibility condition

Let J ⊂ {1, ..., p} be an index set.

Definition. We say that the compatibility condition is met for the set J, with constants φ(J) > 0 and ν(J) > 0, if for all β ∈ R^p that satisfy

Σ_{j∉J} |β_j| ≤ 3 Σ_{j∈J} |β_j|,

it holds that

Σ_{j∈J} |β_j| ≤ √|J| ‖f_β‖/φ(J) + Σ_{j∉J} |β_j| / (1 + ν(J)).   (4)

More details on this condition are in Section 3.

2.2 The result

Let 0 < δ < 1 be fixed but otherwise arbitrary. For β ∈ R^p, set J_β := {j : β_j ≠ 0}, with cardinality J_β := |J_β|. The oracle β* is defined as

β* := arg min_{β∈B} { (1+δ)E(f_β) + 2δ H(2λ√(J_β)/(φ(J_β)δ)) }.

Here, the minimum is over the set B of all β for which the compatibility condition holds for J_β. Let
J* := {j : β*_j ≠ 0}, with cardinality J* := |J*|, and let φ* := φ(J*) and ν* := ν(J*). Set

ε* := (1+δ)E(f_{β*}) + 2δ H(2λ√(J*)/(φ*δ)).
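For concreteness (our worked example, not part of the paper's text): when the margin condition holds with the quadratic function G(u) = cu², the conjugate H, and hence the estimation-error term in ε*, can be computed in closed form:

```latex
% Convex conjugate of a quadratic margin G(u) = c u^2, c > 0:
H(v) = \sup_{u \ge 0}\{uv - cu^2\}
     = \frac{v^2}{4c}, \qquad v \ge 0,
% attained at u = v/(2c); hence the estimation-error term in \varepsilon^* is
2\delta\, H\!\Big(\frac{2\lambda\sqrt{J^*}}{\phi^*\delta}\Big)
     = \frac{2\lambda^{2} J^*}{c\,\delta\,\phi^{*2}} .
```

With λ of order √(log(2p)/n) and c, δ, φ* fixed, this term is of order J* log(2p)/n, which is the behavior described in Subsection 2.3.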
Lemma 2.1. (Lasso under the compatibility condition) Consider a constant λ₀ satisfying the inequality

λ ≥ 8(2+ν*)λ₀/ν*,   (5)

and let M* := ε*/λ₀. Assume the margin condition with strictly convex function G, and that the oracle f_{β*} is in the neighborhood F_η of the target, as well as f_β ∈ F_η for all ‖β − β*‖₁ ≤ M*. Then on the set

S := {Z_{M*} ≤ ε*},   (6)

we have either

(1−δ)E(f_β̂) + ν*λ‖β̂ − β*‖₁/(2+ν*) ≤ 2ε*,

or

E(f_β̂) + λ‖β̂ − β*‖₁ ≤ 4(2+ν*)ε*/ν*.

2.3 Asymptotics in n

In the case of independent observations, the set S has large probability under general conditions, for λ₀ ≍ √(log(2p)/n) (see van de Geer (2007)). Taking λ also of this order, assuming a quadratic margin, and in addition that φ* stays away from zero, the estimation error H(2λ√(J*)/(φ*δ)) behaves like J* log(2p)/n.

2.4 The case ν* = ∞

We observe that often λ₀ can be taken independently of oracle values, whereas (5) requires a lower bound on the smoothing parameter λ which depends on the unknown oracle value ν*. The best possible situation is when ν* is arbitrarily large. Now, the case ν(J) > 2 corresponds, after a reparametrization, to the case ν(J) = ∞. Indeed, if ν(J) > 2 and Σ_{j∉J}|β_j| ≤ 3Σ_{j∈J}|β_j|, then (4) implies

[(ν(J) − 2)/(1 + ν(J))] Σ_{j∈J}|β_j| ≤ √|J| ‖f_β‖/φ(J).

With ν* = ∞, Lemma 2.1 reads:

Corollary 2.1. Let λ₀ ≤ λ/8, and M* := ε*/λ₀. Assume the conditions of Lemma 2.1 with ν* = ∞. Then on the set S given in (6), we have either

(1−δ)E(f_β̂) + λ‖β̂ − β*‖₁ ≤ 2ε*,

or

E(f_β̂) + λ‖β̂ − β*‖₁ ≤ 4ε*.

3. On the compatibility condition

Let ‖f_β‖² := βᵀΣβ, with Σ a p×p matrix. The entries of Σ are denoted by σ_{j,k} (j, k ∈ {1, ..., p}). We first present a simple lemma.

Lemma 3.1. Suppose that Σ has smallest eigenvalue φ² > 0. Then the compatibility condition holds for all index sets J, with φ(J) = φ and ν(J) = ∞.

3.1 The coherence lemma

For a vector v, and for 1 ≤ q < ∞, we shall write ‖v‖_q = (Σ_j |v_j|^q)^{1/q}, with the usual extension ‖v‖_∞ = max_j |v_j| for q = ∞. We now fix an L ∈ {1, ..., p}. Consider a partition of the form

Σ = ( Σ₁,₁  Σ₁,₂ ; Σ₂,₁  Σ₂,₂ ),

where Σ₁,₁ is an L×L matrix, Σ₂,₁ is a (p−L)×L matrix and Σ₁,₂ = Σ₂,₁ᵀ is its transpose, and where Σ₂,₂ is a (p−L)×(p−L) matrix. Then for any b₁ ∈ R^L and b₂ ∈ R^{p−L}, and for bᵀ := (b₁ᵀ, b₂ᵀ), one has

b₁ᵀ Σ₁,₁ b₁ ≤ bᵀΣb + 2‖Σ₂,₁‖_q ‖b₁‖₂
‖b₂‖_r, with 1/q + 1/r = 1, and

‖Σ₂,₁‖_q := sup_{a ∈ R^L, ‖a‖₂ = 1} ‖Σ₂,₁ a‖_q.

Consider now a subset L ⊂ {1, ..., p}. We introduce the |L| × |L| matrix

Σ₁,₁(L) := (σ_{j,k})_{j,k ∈ L},

and the (p − |L|) × |L| matrix

Σ₂,₁(L) := (σ_{j,k})_{j∉L, k∈L}.

We let ρ²(L) be the smallest eigenvalue of Σ₁,₁(L), and take ϑ²_q(L) := ‖Σ₂,₁(L)‖_q. We moreover define

φ(J) := min_{L ⊃ J : |L| = L} ρ(L), and θ_q(J) := max_{L ⊃ J : |L| = L} ϑ_q(L).

Lemma 3.2. (Coherence lemma) Fix J ⊂ {1, ..., p}, with cardinality J := |J| ≤ L. Then we have, for any β ∈ R^p,

Σ_{j∈J} |β_j| ≤ √J ‖f_β‖/φ(J) + Σ_{j∉J} |β_j| / (1 + ν(J)),

with, for 1/q + 1/r = 1,

1 + ν(J) = (L + 1 − J)^{1/q} φ²(J) / (2√J q^{1/r} θ²_q(J)),  for q < ∞,
1 + ν(J) = φ²(J) / (2√J θ²_∞(J)),  for q = ∞.

The coherence lemma uses an auxiliary lemma (Lemma 4.1), which is based on ideas in Candes and Tao (2007).
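The quantities above are straightforward to compute numerically. The sketch below is our illustration (all names and toy values are ours, not the paper's): it verifies Lemma 3.1's bound Σ_{j∈J}|β_j| ≤ √|J|·‖f_β‖/φ, where φ² is the smallest eigenvalue of Σ, on random vectors, and evaluates the q = 2 block norm ‖Σ₂,₁‖₂ as the largest singular value of the off-diagonal block.

```python
import numpy as np

rng = np.random.default_rng(1)
p, J = 8, 3  # toy dimension and index-set size; take J = {0, 1, 2}

# Random positive-definite "Gram" matrix Sigma (smallest eigenvalue >= 0.5).
A = rng.standard_normal((p, p))
Sigma = A.T @ A / p + 0.5 * np.eye(p)

# phi^2 = smallest eigenvalue of Sigma (Lemma 3.1).
phi = np.sqrt(np.linalg.eigvalsh(Sigma).min())

# Check sum_{j in J}|beta_j| <= sqrt(J)*||f_beta||/phi on random vectors,
# where ||f_beta||^2 = beta' Sigma beta.
for _ in range(100):
    beta = rng.standard_normal(p)
    norm_f = np.sqrt(beta @ Sigma @ beta)
    assert np.abs(beta[:J]).sum() <= np.sqrt(J) * norm_f / phi + 1e-9

# ||Sigma_{2,1}||_2 = sup_{||a||_2=1} ||Sigma_{2,1} a||_2 is the largest
# singular value of the lower-left (p-J) x J block.
theta = np.linalg.svd(Sigma[J:, :J], compute_uv=False).max()
```

The inequality check never fails, because Σ_{j∈J}|β_j| ≤ √J‖β‖₂ and ‖β‖₂² ≤ ‖f_β‖²/φ² hold for every β when φ² is the smallest eigenvalue, which is exactly the argument in the proof of Lemma 3.1.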
3.2 Relation with other coherence conditions

In the coherence lemma, the choice of L ≥ J and q is open. They are allowed to depend on J, and for our purposes should be taken in such a way that φ(J) and ν(J) are as large as possible. To gain insight into what the coherence lemma can bring us, let us again consider the partition

Σ = ( Σ₁,₁  Σ₁,₂ ; Σ₂,₁  Σ₂,₂ ).

Note first that ‖Σ₂,₁‖₂² is the largest eigenvalue of the matrix Σ₁,₂Σ₂,₁. The coherence lemma with q = 2 is along the lines of the coherence condition in Candes and Tao (2007). They choose L not larger than 3J. It is easy to see that

‖Σ₂,₁‖_q ≤ ( Σ_{j=L+1}^p ( Σ_{k=1}^L σ²_{j,k} )^{q/2} )^{1/q}.   (7)

Thus,

‖Σ₂,₁‖_q ≤ √L ( Σ_{j=L+1}^p max_{k≤L} |σ_{j,k}|^q )^{1/q}.

For q = ∞, this reads

‖Σ₂,₁‖_∞ ≤ √L max_{L+1≤j≤p} max_{k≤L} |σ_{j,k}|.

Note also that with q = ∞, the choice L = J in the coherence lemma gives the best result. With this choice of L, the coherence lemma with q = ∞ gives a positive value for ν(J), required in the compatibility condition, if

2J max_{j∉J} max_{k∈J} |σ_{j,k}| / φ²(J) < 1.

This is in the spirit of the maximal local coherence condition in Bunea et al. (2007). It also follows from (7) that

‖Σ₂,₁‖_q ≤ ( Σ_{j=L+1}^p ( Σ_{k=1}^L |σ_{j,k}| )^q )^{1/q}.

Invoking this bound with q = ∞ and L = J in the coherence lemma leads to a condition which is similar to the cumulative local coherence condition in Bunea et al. (2007).

4. Proofs

Proof of Lemma 2.1. We write β = β_in + β_out, with

β_{j,in} := β_j 1{j ∈ J*} and β_{j,out} := β_j 1{j ∉ J*}, j = 1, ..., p.

Note thus that β*_in = β* and β*_out ≡ 0. Throughout the proof, we assume we are on the set S. Let

s := M*/(M* + ‖β̂ − β*‖₁), and β̃ := sβ̂ + (1−s)β*.

Write f̃ := f_β̃, Ẽ := E(f̃), and f* := f_{β*}, E* := E(f*). The proof is structured as follows. We first note that ‖β̃ − β*‖₁ ≤ M*, and prove the result with β̂ replaced by β̃. As a byproduct, we obtain that ‖β̂ − β*‖₁ ≤ M*. The latter allows us to repeat the proof with β̃ replaced by β̂.

By the convexity of f ↦ ρ_f, and of ‖·‖₁,

P_n ρ_f̃ + λ‖β̃‖₁ ≤ s[P_n ρ_{f_β̂} + λ‖β̂‖₁] + (1−s)[P_n ρ_{f*} + λ‖β*‖₁] ≤ P_n ρ_{f*} + λ‖β*‖₁.   (8)

Thus,

Ẽ + λ‖β̃‖₁ = −(P_n − P)(ρ_f̃ − ρ_{f*}) + P_n(ρ_f̃ − ρ_{f*}) + E* + λ‖β̃‖₁
≤ −(P_n − P)(ρ_f̃ − ρ_{f*}) + E* + λ‖β*‖₁
≤ Z_{M*} + E* + λ‖β*‖₁.

So on S,

Ẽ + λ‖β̃‖₁ ≤ ε* + E* + λ‖β*‖₁.   (9)

Since ‖β̃‖₁ = ‖β̃_in‖₁ + ‖β̃_out‖₁ and ‖β*‖₁ − ‖β̃_in‖₁ ≤ ‖β̃_in − β*‖₁, we therefore have

Ẽ + λ‖β̃_out‖₁ ≤ ε* + E* + λ‖β̃_in − β*‖₁ ≤ 2ε* + λ‖β̃_in − β*‖₁.   (10)

Case 1. If λ‖β̃_in − β*‖₁ ≤ (2+ν*)ε*/ν*, then (10) implies Ẽ + λ‖β̃_out‖₁ ≤ 2ε* + (2+ν*)ε*/ν*, and hence, since ‖β̃ − β*‖₁ = ‖β̃_in − β*‖₁ + ‖β̃_out‖₁,

Ẽ + λ‖β̃ − β*‖₁ ≤ 2ε* + 2(2+ν*)ε*/ν* ≤ 4(2+ν*)ε*/ν*.

But then

‖β̃ − β*‖₁ ≤ 4(2+ν*)ε*/(ν*λ) = 4(2+ν*)λ₀M*/(ν*λ) ≤ M*/2,

since λ ≥ 8(2+ν*)λ₀/ν*. This implies ‖β̂ − β*‖₁ ≤ M*.

Case 2. If λ‖β̃_in − β*‖₁ ≥ (2+ν*)ε*/ν*, then 2ε* ≤ 2λ‖β̃_in − β*‖₁, and we get from (10) that

λ‖β̃_out‖₁ ≤ 2ε* + λ‖β̃_in − β*‖₁ ≤ 3λ‖β̃_in − β*‖₁.

This means that we can apply the compatibility condition to β̃ − β*. Recall that inequality (9) implies

Ẽ + λ‖β̃_out‖₁ ≤ ε* + E* + λ‖β̃_in − β*‖₁.

We have

‖β̃_out‖₁ = 2‖β̃_out‖₁/(2+ν*) + ν*‖β̃_out‖₁/(2+ν*), and ‖β̃_out‖₁ = ‖β̃ − β*‖₁ − ‖β̃_in − β*‖₁.

So

Ẽ + 2λ‖β̃_out‖₁/(2+ν*) + ν*λ‖β̃ − β*‖₁/(2+ν*) ≤ ε* + E* + (1 + ν*/(2+ν*))λ‖β̃_in − β*‖₁.

With the compatibility condition on J*, we find

‖β̃_in − β*‖₁ ≤ √J* ‖f̃ − f*‖/φ* + ‖β̃_out‖₁/(1+ν*).

Thus

Ẽ + 2λ‖β̃_out‖₁/(2+ν*) + ν*λ‖β̃ − β*‖₁/(2+ν*) ≤ ε* + E* + (1 + ν*/(2+ν*))√J* λ‖f̃ − f*‖/φ* + 2λ‖β̃_out‖₁/(2+ν*),

since (1 + ν*/(2+ν*))/(1+ν*) = 2/(2+ν*). The term with ‖β̃_out‖₁ cancels out. Moreover, we may use the bound 1 + ν*/(2+ν*) ≤ 2. This allows us to conclude that

Ẽ + ν*λ‖β̃ − β*‖₁/(2+ν*) ≤ ε* + E* + 2λ√J* ‖f̃ − f*‖/φ*.

Now, because we assumed f_β ∈ F_η for all ‖β − β*‖₁ ≤ M* (so that f̃ ∈ F_η), and since also f* ∈ F_η, we can invoke the margin condition, together with ‖f̃ − f*‖ ≤ ‖f̃ − f⁰‖ + ‖f* − f⁰‖ and the conjugate inequality uv ≤ G(u) + H(v), to arrive at

2λ√J* ‖f̃ − f*‖/φ* ≤ 2δH(2λ√J*/(φ*δ)) + δẼ + δE*.

It follows that

Ẽ + ν*λ‖β̃ − β*‖₁/(2+ν*) ≤ ε* + (1+δ)E* + 2δH(2λ√J*/(φ*δ)) + δẼ = ε* + ε* + δẼ = 2ε* + δẼ,

or

(1−δ)Ẽ + ν*λ‖β̃ − β*‖₁/(2+ν*) ≤ 2ε*.   (12)

This yields

‖β̃ − β*‖₁ ≤ 2(2+ν*)ε*/(ν*λ) = 2(2+ν*)λ₀M*/(ν*λ) ≤ M*/4,

where the last inequality is ensured by the assumption λ ≥ 8(2+ν*)λ₀/ν*. But ‖β̃ − β*‖₁ ≤ M*/4 implies ‖β̂ − β*‖₁ ≤ M*/3 ≤ M*.

So in both Case 1 and Case 2, we arrive at ‖β̂ − β*‖₁ ≤ M*. Now, repeat the above arguments with β̃ replaced by β̂. Let Ê := E(f_β̂). Then either, from Case 1 with this replacement,

Ê + λ‖β̂ − β*‖₁ ≤ 4(2+ν*)ε*/ν*,

or we are in Case 2 and can apply the compatibility assumption. Then we arrive at (12) with the replacement:

(1−δ)Ê + ν*λ‖β̂ − β*‖₁/(2+ν*) ≤ 2ε*. ∎

Proof of Lemma 3.1. Let β ∈ R^p be arbitrary. It is clear that ‖β‖₂² ≤ ‖f_β‖²/φ². Hence,

Σ_{j∈J} |β_j| ≤ √(|J| Σ_{j∈J} β_j²) ≤ √|J| ‖β‖₂ ≤ √|J| ‖f_β‖/φ. ∎

Proof of Lemma 3.2. Let β = β_in + β_out, with β_{j,in} := β_j 1{j ∈ J} and β_{j,out} := β_j 1{j ∉ J}, j = 1, ..., p. Let L ⊃ J, with L\J the set of indices of the L − J largest |β_j| with j ∉ J. Define b_{j,1} := β_j 1{j ∈ L} and b_{j,2} := β_j 1{j ∉ L}. We have

b₁ᵀ Σ₁,₁(L) b₁ ≤ bᵀΣb + 2θ²_q(J) ‖b₁‖₂ ‖b₂‖_r.

By the auxiliary lemma below, for q < ∞,

‖b₂‖_r ≤ q^{1/r} ‖β_out‖₁ / (L + 1 − J)^{1/q}.   (13)

This yields

b₁ᵀ Σ₁,₁(L) b₁ ≤ bᵀΣb + 2θ²_q(J) q^{1/r} ‖b₁‖₂ ‖β_out‖₁ / (L + 1 − J)^{1/q}.

Since f_b = f_β, we have bᵀΣb = ‖f_β‖². Hence

φ²(J)‖b₁‖₂² ≤ b₁ᵀ Σ₁,₁(L) b₁ ≤ ‖f_β‖² + 2θ²_q(J) q^{1/r} ‖b₁‖₂ ‖β_out‖₁ / (L + 1 − J)^{1/q}.

It follows that

‖b₁‖₂² ≤ ‖f_β‖²/φ²(J) + 2(θ²_q(J)/φ²(J)) q^{1/r} ‖b₁‖₂ ‖β_out‖₁ / (L + 1 − J)^{1/q}.

This implies

‖b₁‖₂ ≤ ‖f_β‖/φ(J) + 2(θ²_q(J)/φ²(J)) q^{1/r} ‖β_out‖₁ / (L + 1 − J)^{1/q}.

We thus arrive at

Σ_{j∈J} |β_j| = ‖β_in‖₁ ≤ √J ‖b₁‖₂ ≤ √J ‖f_β‖/φ(J) + 2√J q^{1/r} (θ²_q(J)/φ²(J)) ‖β_out‖₁ / (L + 1 − J)^{1/q}.

The result with q = ∞ follows in the same manner, using, instead of (13), the trivial bound ‖b₂‖₁ ≤ ‖β_out‖₁. ∎

Lemma 4.1. (Auxiliary lemma) Let b₁ ≥ b₂ ≥ ⋯ ≥ 0. For 1 < r < ∞, 1/q + 1/r = 1, and L ∈ {0, 1, ...}, we have

( Σ_{j=L+1}^∞ b_j^r )^{1/r} ≤ q^{1/r} ‖b‖₁ / (L+1)^{1/q}.

Proof. We have

Σ_{j=L+1}^∞ j^{−r} ≤ (L+1)^{−r} + ∫_{L+1}^∞ x^{−r} dx = (L+1)^{1−r} [1/(L+1) + 1/(r−1)] ≤ q(L+1)^{1−r}.

Moreover, for all j, jb_j ≤ Σ_{l=1}^j b_l ≤ ‖b‖₁, so that b_j ≤ ‖b‖₁/j. Hence

Σ_{j=L+1}^∞ b_j^r ≤ ‖b‖₁^r Σ_{j=L+1}^∞ j^{−r} ≤ q‖b‖₁^r/(L+1)^{r−1},

and the result follows by taking r-th roots, since (r−1)/r = 1/q. ∎

REFERENCES

Bunea, F., Tsybakov, A.B. and Wegkamp, M.H. (2006). Sparsity oracle inequalities for the Lasso. Submitted.

Bunea, F., Tsybakov, A.B. and Wegkamp, M.H. (2007). Sparse density estimation and aggregation with ℓ₁ penalties. COLT 2007.

Candes, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. To appear in Ann. Statist.

van de Geer, S.A. (2007). High-dimensional generalized linear models and the Lasso. To appear in Ann. Statist.
More informationDISCUSSION OF A SIGNIFICANCE TEST FOR THE LASSO. By Peter Bühlmann, Lukas Meier and Sara van de Geer ETH Zürich
Submitted to the Annals of Statistics DISCUSSION OF A SIGNIFICANCE TEST FOR THE LASSO By Peter Bühlmann, Lukas Meier and Sara van de Geer ETH Zürich We congratulate Richard Lockhart, Jonathan Taylor, Ryan
More informationDoes Compressed Sensing have applications in Robust Statistics?
Does Compressed Sensing have applications in Robust Statistics? Salvador Flores December 1, 2014 Abstract The connections between robust linear regression and sparse reconstruction are brought to light.
More informationFriedrich symmetric systems
viii CHAPTER 8 Friedrich symmetric systems In this chapter, we describe a theory due to Friedrich [13] for positive symmetric systems, which gives the existence and uniqueness of weak solutions of boundary
More informationVariable Selection for Highly Correlated Predictors
Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu arxiv:1709.04840v1 [stat.me] 14 Sep 2017 Abstract Penalty-based variable selection methods are powerful in selecting relevant covariates
More informationAnalysis of Greedy Algorithms
Analysis of Greedy Algorithms Jiahui Shen Florida State University Oct.26th Outline Introduction Regularity condition Analysis on orthogonal matching pursuit Analysis on forward-backward greedy algorithm
More informationLasso-type recovery of sparse representations for high-dimensional data
Lasso-type recovery of sparse representations for high-dimensional data Nicolai Meinshausen and Bin Yu Department of Statistics, UC Berkeley December 5, 2006 Abstract The Lasso (Tibshirani, 1996) is an
More informationLecture 14: Variable Selection - Beyond LASSO
Fall, 2017 Extension of LASSO To achieve oracle properties, L q penalty with 0 < q < 1, SCAD penalty (Fan and Li 2001; Zhang et al. 2007). Adaptive LASSO (Zou 2006; Zhang and Lu 2007; Wang et al. 2007)
More informationNONLINEAR LEAST-SQUARES ESTIMATION 1. INTRODUCTION
NONLINEAR LEAST-SQUARES ESTIMATION DAVID POLLARD AND PETER RADCHENKO ABSTRACT. The paper uses empirical process techniques to study the asymptotics of the least-squares estimator for the fitting of a nonlinear
More informationThis manuscript is for review purposes only.
1 2 3 4 5 6 7 8 9 10 11 12 THE USE OF QUADRATIC REGULARIZATION WITH A CUBIC DESCENT CONDITION FOR UNCONSTRAINED OPTIMIZATION E. G. BIRGIN AND J. M. MARTíNEZ Abstract. Cubic-regularization and trust-region
More informationSmoothly Clipped Absolute Deviation (SCAD) for Correlated Variables
Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables LIB-MA, FSSM Cadi Ayyad University (Morocco) COMPSTAT 2010 Paris, August 22-27, 2010 Motivations Fan and Li (2001), Zou and Li (2008)
More informationMinimax Rates of Estimation for High-Dimensional Linear Regression Over -Balls
6976 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 10, OCTOBER 2011 Minimax Rates of Estimation for High-Dimensional Linear Regression Over -Balls Garvesh Raskutti, Martin J. Wainwright, Senior
More informationA Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models
A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los
More informationSOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu
SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu LITIS - EA 48 - INSA/Universite de Rouen Avenue de l Université - 768 Saint-Etienne du Rouvray
More informationNONLINEAR FREDHOLM ALTERNATIVE FOR THE p-laplacian UNDER NONHOMOGENEOUS NEUMANN BOUNDARY CONDITION
Electronic Journal of Differential Equations, Vol. 2016 (2016), No. 210, pp. 1 7. ISSN: 1072-6691. URL: http://ejde.math.txstate.edu or http://ejde.math.unt.edu NONLINEAR FREDHOLM ALTERNATIVE FOR THE p-laplacian
More informationStatistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation
Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider
More informationConvex relaxation for Combinatorial Penalties
Convex relaxation for Combinatorial Penalties Guillaume Obozinski Equipe Imagine Laboratoire d Informatique Gaspard Monge Ecole des Ponts - ParisTech Joint work with Francis Bach Fête Parisienne in Computation,
More informationASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS
ASYMPTOTIC PROPERTIES OF BRIDGE ESTIMATORS IN SPARSE HIGH-DIMENSIONAL REGRESSION MODELS Jian Huang 1, Joel L. Horowitz 2, and Shuangge Ma 3 1 Department of Statistics and Actuarial Science, University
More informationBoosting Methods: Why They Can Be Useful for High-Dimensional Data
New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,
More informationCENTRAL LIMIT THEOREMS AND MULTIPLIER BOOTSTRAP WHEN p IS MUCH LARGER THAN n
CENTRAL LIMIT THEOREMS AND MULTIPLIER BOOTSTRAP WHEN p IS MUCH LARGER THAN n VICTOR CHERNOZHUKOV, DENIS CHETVERIKOV, AND KENGO KATO Abstract. We derive a central limit theorem for the maximum of a sum
More informationLasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices
Article Lasso Maximum Likelihood Estimation of Parametric Models with Singular Information Matrices Fei Jin 1,2 and Lung-fei Lee 3, * 1 School of Economics, Shanghai University of Finance and Economics,
More informationHow Correlations Influence Lasso Prediction
How Correlations Influence Lasso Prediction Mohamed Hebiri and Johannes C. Lederer arxiv:1204.1605v2 [math.st] 9 Jul 2012 Université Paris-Est Marne-la-Vallée, 5, boulevard Descartes, Champs sur Marne,
More informationINDUSTRIAL MATHEMATICS INSTITUTE. B.S. Kashin and V.N. Temlyakov. IMI Preprint Series. Department of Mathematics University of South Carolina
INDUSTRIAL MATHEMATICS INSTITUTE 2007:08 A remark on compressed sensing B.S. Kashin and V.N. Temlyakov IMI Preprint Series Department of Mathematics University of South Carolina A remark on compressed
More informationMetric Spaces and Topology
Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies
More informationSPECTRAL GAP FOR ZERO-RANGE DYNAMICS. By C. Landim, S. Sethuraman and S. Varadhan 1 IMPA and CNRS, Courant Institute and Courant Institute
The Annals of Probability 996, Vol. 24, No. 4, 87 902 SPECTRAL GAP FOR ZERO-RANGE DYNAMICS By C. Landim, S. Sethuraman and S. Varadhan IMPA and CNRS, Courant Institute and Courant Institute We give a lower
More informationAdaptive estimation of the copula correlation matrix for semiparametric elliptical copulas
Adaptive estimation of the copula correlation matrix for semiparametric elliptical copulas Department of Mathematics Department of Statistical Science Cornell University London, January 7, 2016 Joint work
More informationRegularized Estimation of High Dimensional Covariance Matrices. Peter Bickel. January, 2008
Regularized Estimation of High Dimensional Covariance Matrices Peter Bickel Cambridge January, 2008 With Thanks to E. Levina (Joint collaboration, slides) I. M. Johnstone (Slides) Choongsoon Bae (Slides)
More informationPre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models
Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider variable
More informationFast Rates for Estimation Error and Oracle Inequalities for Model Selection
Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Peter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@cs.berkeley.edu
More informationA Unified Theory of Confidence Regions and Testing for High Dimensional Estimating Equations
A Unified Theory of Confidence Regions and Testing for High Dimensional Estimating Equations arxiv:1510.08986v2 [math.st] 23 Jun 2016 Matey Neykov Yang Ning Jun S. Liu Han Liu Abstract We propose a new
More informationPCA with random noise. Van Ha Vu. Department of Mathematics Yale University
PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical
More informationExtended Bayesian Information Criteria for Gaussian Graphical Models
Extended Bayesian Information Criteria for Gaussian Graphical Models Rina Foygel University of Chicago rina@uchicago.edu Mathias Drton University of Chicago drton@uchicago.edu Abstract Gaussian graphical
More informationMinimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model.
Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model By Michael Levine Purdue University Technical Report #14-03 Department of
More informationHigh-dimensional Joint Sparsity Random Effects Model for Multi-task Learning
High-dimensional Joint Sparsity Random Effects Model for Multi-task Learning Krishnakumar Balasubramanian Georgia Institute of Technology krishnakumar3@gatech.edu Kai Yu Baidu Inc. yukai@baidu.com Tong
More informationOn Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:
A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition
More information