Statistics for high-dimensional data


1 Statistics for high-dimensional data Vincent Rivoirard Université Paris-Dauphine Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 1 / 105

2 Introduction High-dimensional data It has now become part of folklore to claim that the 21st century will be the century of data. Our society invests more and more in the collection and processing of data of all kinds (the big data phenomenon). Data now have a strong impact on almost every branch of human activity, including science, medicine, business and the humanities. In traditional statistics, we assumed we had many observations and a few well-chosen variables. In modern science, we collect more observations, but we also collect radically larger numbers of variables, consisting of thousands up to millions of features voraciously recorded on objects or individuals. Such data are said to be high-dimensional. Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 2 / 105

3 Introduction Examples of high-dimensional data Consumer preference data: websites gather information about the browsing and shopping behavior of consumers. For example, recommendation systems collect consumers' preferences on various products, together with some personal data (age, sex, location,...), and predict which products could be of interest for a given consumer. Traffic jams: many cities (for instance Boston) have developed programs to improve traffic, based on big data collection (crowdsourced data) and their analysis. Biotech data: recent technologies enable the acquisition of high-dimensional data on single individuals. For example, DNA microarrays measure the transcription level of thousands of genes simultaneously. Images and videos: large images or videos are continuously collected all around the world. Each image is made of thousands up to millions of pixels. Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 3 / 105

4 Introduction Characterization and problems of high-dimensional data The previous examples show that we are in the era of massive automatic data collection. In these examples, the number of variables or parameters p is much larger than the number of observations n. Being able to collect a large amount of information on each individual seems to be good news. Unfortunately, the mathematical and statistical reality clashes with this optimistic statement: separating the signal from the noise is a very hard task for high-dimensional data, in full generality impossible. Extracting the relevant information is more than challenging, like finding a needle in a haystack. This phenomenon is often called the curse of dimensionality, a terminology introduced by Richard Bellman. Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 4 / 105

5 Introduction Curse of dimensionality The volume V_p(r) of a p-dimensional ball of radius r for the Euclidean distance satisfies
V_p(r) ~ (2πer²/p)^(p/2) (pπ)^(−1/2) as p → +∞.
So, if (X^(i))_{i=1,...,n} are i.i.d. with uniform distribution on the hypercube [−0.5, 0.5]^p, then
P(∃ i ∈ {1,...,n} : X^(i) ∈ B_p(0, r)) ≤ n P(X^(1) ∈ B_p(0, r)) ≤ n V_p(r).
So if n = o(V_p(r)^(−1)), then the last probability goes to 0.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 5 / 105
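To make this concrete, here is a minimal R sketch (not from the slides; sample sizes, radius and number of replications are illustrative). It estimates by Monte Carlo the probability that at least one of n = 100 uniform points on [−0.5, 0.5]^p falls in the Euclidean ball B_p(0, 0.5); the probability collapses as p grows.
set.seed(1)
n <- 100
for (p in c(1, 2, 5, 10, 20)) {
  hits <- replicate(500, {
    X <- matrix(runif(n * p, min = -0.5, max = 0.5), nrow = n)  # n i.i.d. uniform points on [-0.5, 0.5]^p
    any(sqrt(rowSums(X^2)) <= 0.5)                              # is some X^(i) in B_p(0, 0.5)?
  })
  cat("p =", p, " estimated P(some X^(i) in B_p(0, 0.5)):", mean(hits), "\n")
}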

6 Introduction Curse of dimensionality Example: classical regression problem, i.e. estimation of the conditional expectation of a random variable. Data consist of n i.i.d. observations (Y_i, X^(i))_{i=1,...,n} with the same distribution as (Y, X) ∈ R × R^p. We wish to estimate the function m where E[Y | X] = m(X). We consider the Nadaraya-Watson estimate:
m̂(x) = Σ_{i=1}^n K_h(x − X^(i)) Y_i / Σ_{i=1}^n K_h(x − X^(i)), x ∈ R^p,
where K_h(x) = (1 / Π_{j=1}^p h_j) K(x_1/h_1, ..., x_p/h_p), h = (h_j)_{j=1,...,p},
and K is a kernel (with at least one vanishing moment), e.g.
K(x) = Π_{j=1}^p 1_{[−0.5, 0.5]}(x_j), K(x) = e^{−‖x‖²/2} / (2π)^{p/2}, K(x) = 1_{B_p(0,r)}(x) / Vol(B_p(0, r)).
We have to determine the tuning parameter h, which allows us to select the observations Y_i associated with the neighbors of x among the X^(i)'s.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 6 / 105
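As an illustration (a sketch, not part of the slides), the Nadaraya-Watson estimate with a Gaussian product kernel can be coded in a few lines of R; the function name nw_estimate and the bandwidth vector h are illustrative choices.
nw_estimate <- function(x, X, Y, h) {
  # x: evaluation point in R^p, X: n x p design matrix, Y: responses, h: bandwidth vector of length p
  W <- exp(-0.5 * colSums(((t(X) - x) / h)^2))  # Gaussian kernel weights K_h(x - X^(i)), up to a constant
  sum(W * Y) / sum(W)                           # weighted average of the Y_i's
}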

7 Introduction Curse of dimensionality We have to determine the tuning parameter h which allows us to select the observations Y_i associated with the neighbors of x among the X^(i)'s. Two problems: 1 We have no neighbor in high dimensions. 2 All the points are at a similar distance from one another. Illustration: assume that the coordinates of X are i.i.d. and x = (x_1, x_1, ..., x_1). Then
E[‖x − X‖_1] = E[Σ_{j=1}^p |x_j − X_j|] = p E[|x_1 − X_1|],
sd(‖x − X‖_1) = √Var(‖x − X‖_1) = √(p Var(|x_1 − X_1|)),
so the spread of the distances is negligible compared to their typical size when p is large. Hence any estimator based on a local averaging will fail.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 7 / 105

8 Introduction Curse of dimensionality [Figure: histograms of the l1-distance between x = (0.5, ..., 0.5) and n = 100 uniform random points on [0, 1]^p, for p = 1, 10, 100, 1000, ..., 10^5.] Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 8 / 105

9 Introduction Curse of dimensionality [Figure: histograms of the l2-distance between x = (0.5, ..., 0.5) ∈ R^p and n = 100 uniform random points on [0, 1]^p, for p = 1, 10, 100, 1000, ..., 10^5.] Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 9 / 105

10 Introduction Curse of dimensionality [Figure: histograms of the l∞-distance between x = (0.5, ..., 0.5) ∈ R^p and n = 100 uniform random points on [0, 1]^p, for p = 1, 10, 100, 1000, ..., 10^5.] Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 10 / 105
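Such histograms can be reproduced with a few lines of R (an illustrative sketch, not from the slides; shown for the l2-distance, while rowSums(abs(X - 0.5)) and apply(abs(X - 0.5), 1, max) give the l1 and l∞ versions):
set.seed(1)
n <- 100
par(mfrow = c(2, 3))
for (p in c(1, 10, 100, 1000, 1e4, 1e5)) {
  X <- matrix(runif(n * p), nrow = n)       # n uniform points on [0, 1]^p
  d <- sqrt(rowSums((X - 0.5)^2))           # l2-distance to x = (0.5, ..., 0.5)
  hist(d, main = paste("Histogram with p =", p), xlab = "d")
}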

11 Introduction Curse of dimensionality - Other problems Other strange phenomena in high dimensions: the multivariate standard Gaussian density is very flat, sup_{x ∈ R^p} f(x) = (2π)^{−p/2}; the diagonal of the hypercube [0, 1]^p is almost orthogonal to its edges. Other problems: accumulation of small fluctuations in many directions can produce a large global fluctuation; an accumulation of rare events may not be rare; computational complexity. Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 11 / 105

12 Introduction Circumventing the curse of dimensionality At first sight, the high-dimensionality of the data seems to be good news, but as explained previously, it is a major issue for extracting information. In light of the few examples described above, the situation may appear hopeless. Fortunately, high-dimensional data are not uniformly spread in R^p (for instance, pixel intensities of an image are not purely random and images have geometrical structures). Data are concentrated around low-dimensional structures (many variables have a negligible or even a null impact), but this low-dimensional structure is most of the time unknown. The goal of high-dimensional statistics is to identify these structures and to provide statistical procedures with a low computational complexity. Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 12 / 105

13 Introduction Take-home message Whereas classical statistics provides a very rich theory for analyzing data with a small number p of parameters and a large number n of observations, in many fields current data have different characteristics: a huge number p of parameters; a sample size n which is of the same order as p, or sometimes much smaller than p. The classical asymptotic analysis with p fixed and n going to +∞ does not make sense anymore. We must change our point of view. We face the curse of dimensionality. Fortunately, the useful information usually concentrates around low-dimensional structures (that have to be identified), which allows us to circumvent the curse of dimensionality. Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 13 / 105

14 Introduction References
- Bühlmann, P. and van de Geer, S. Statistics for high-dimensional data. Methods, theory and applications. Springer Series in Statistics. Springer, Heidelberg.
- Donoho, D. High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality. American Math. Society Math Challenges of the 21st Century.
- Giraud, C. Introduction to high-dimensional statistics. Monographs on Statistics and Applied Probability, 139. CRC Press, Boca Raton, FL.
- Härdle, W., Kerkyacharian, G., Picard, D. and Tsybakov, A. Wavelets, approximation, and statistical applications. Lecture Notes in Statistics, 129. Springer-Verlag, New York.
- Hastie, T., Tibshirani, R. and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
- Hastie, T., Tibshirani, R. and Wainwright, M. Statistical learning with sparsity. The lasso and generalizations. Monographs on Statistics and Applied Probability, 143. CRC Press, Boca Raton, FL.
- Tibshirani, R. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58, no. 1, 1996.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 14 / 105

15 Introduction Overview of the course The goal of this course is to present modern statistical tools and some of their theoretical properties for estimation in the high-dimensional setting, including:
1 Wavelets and thresholding rules.
2 Penalized estimators: model selection procedures, Ridge and Lasso estimates.
3 Generalizations and variations of the Lasso estimate: Group-Lasso, Fused-Lasso, elastic net and Dantzig selectors. Links with Bayesian rules.
4 Statistical properties of Lasso estimators: study in the classical regression model. Extensions to the generalized linear model.
I shall concentrate on simple settings in order to avoid unessential technical details.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 15 / 105

16 Introduction What is (unfortunately) not mentioned and notations Due to lack of time or skill, I won't speak about some important themes (fortunately, some of them will be dealt with by Franck or Tristan):
1 Optimization aspects
2 Matrix completion
3 Testing approaches
4 Graphical models
5 Multivariate methods (sparse PCA, etc.)
6 Classification methods
Notations: n: size of observations. p: dimension of the involved unknown quantity. ‖·‖_q: l_q-norm in R^p; for short, if there is no ambiguity, ‖·‖ = ‖·‖_2. For any vector β, ‖β‖_0 = card{j : β_j ≠ 0}.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 16 / 105


20 Introduction
1 Introduction
2 Model selection
3 From Ridge estimate to Lasso estimate
4 Generalized linear models and related models
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 17 / 105

21 Model selection Chapter 1: Model selection Contents of the chapter:
1 Linear regression setting
2 Sparsity and oracle approach
3 Model selection procedures
4 Take-home message
5 References
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 18 / 105

22 Model selection Linear regression setting Consider the linear regression model Y = Xβ + ɛ, with:
Y = (Y_i)_{i=1,...,n} a vector of observations (response variable),
X = (X_ij)_{i=1,...,n, j=1,...,p} a known n × p matrix,
β = (β_j)_{j=1,...,p} an unknown vector,
ɛ = (ɛ_i)_{i=1,...,n} the vector of errors.
It is assumed that E[ɛ] = 0, Var(ɛ) = σ² I_n and σ² is known. The columns of X, denoted X_j, are the explanatory variables or predictors.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 19 / 105

23 Model selection Linear regression setting The regression model can be rewritten as Y = Σ_{j=1}^p β_j X_j + ɛ. Several problems can be investigated: the estimation problem (estimate β); the prediction problem (estimate Xβ); the selection problem (determine the non-zero coordinates of β). Why linear regression? It models various concrete situations. It is simple to use from the mathematical point of view. It allows us to introduce and present new methodologies. Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 20 / 105

24 Model selection Classical estimation We naturally estimate β by considering the ordinary least squares estimate defined by
ˆβ_ols := argmin_{β ∈ R^p} ‖Y − Xβ‖².
Proposition. If rank(X) = p, then ˆβ_ols = (X^T X)^{−1} X^T Y and
E[ˆβ_ols] = β, Var(ˆβ_ols) = σ² (X^T X)^{−1}.
Furthermore,
E[‖ˆβ_ols − β‖²] = σ² Tr((X^T X)^{−1}).
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 21 / 105
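A quick numerical check of these formulas in R (a sketch with simulated data; all names and dimensions are illustrative):
set.seed(1)
n <- 50; p <- 5; sigma <- 1
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1, 0, 0, 3)
Y <- X %*% beta + sigma * rnorm(n)
beta_ols <- solve(crossprod(X), crossprod(X, Y))   # (X^T X)^{-1} X^T Y
sigma^2 * sum(diag(solve(crossprod(X))))           # theoretical risk sigma^2 Tr((X^T X)^{-1})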

25 Model selection Classical estimation Lemma. We have, for any matrix A with n columns:
E[‖Aɛ‖²] = σ² Tr(A A^T).
Some remarks: rank(X) = p implies p ≤ n. If the predictors are orthonormal,
E[‖ˆβ_ols − β‖²] = p σ²,
which may be large in high dimensions. Up to now, structural assumptions are very mild. In the sequel, we shall first consider sparsity assumptions.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 22 / 105

26 Model selection Sparsity Loosely speaking, a sparse statistical model is a model in which only a relatively small number of parameters play an important role. In the regression model
Y = Σ_{j=1}^p β_j X_j + ɛ,
we assume that m*, the support of β, is small, with
m* = { j ∈ {1,...,p} : β_j ≠ 0 }.
Note that m* is unknown. In general, ˆβ_ols is not sparse. Model selection is a natural approach to select a good estimator in this setting. We describe and study this methodology in the oracle approach.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 23 / 105

27 Model selection Oracle approach We now consider the prediction risk and set f = Xβ ∈ R^n, the unknown vector of interest. So, we have:
Y = f + ɛ.   (2.1)
If m* were known, a natural estimate of f would be
ˆf_{m*} = Π_{S*} Y,
with Π_{S*} : R^n → R^n the projection matrix on S* and S* = span(X_j : j ∈ m*). Note that if ɛ ~ N(0, σ² I_n), then ˆf_{m*} is the maximum likelihood estimate in the model (2.1) under the constraint that the estimate of f belongs to S*. Of course m* is unknown and ˆf_{m*} cannot be used.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 24 / 105

28 Model selection Oracle approach For any model m ⊂ {1,...,p}, we set ˆf_m = Π_{S_m} Y, with Π_{S_m} : R^n → R^n the projection matrix on S_m and S_m = span(X_j : j ∈ m). With a slight abuse, we also call S_m a model. Given M, a collection of models, we wish to select ˆm ∈ M such that the risk of ˆf_ˆm is as small as possible. We introduce the oracle model m_0 as
m_0 := argmin_{m ∈ M} E[‖ˆf_m − f‖²].
ˆf_{m_0} is called the (pseudo) oracle estimate. More precisely, we wish to select ˆm ∈ M such that
E[‖ˆf_ˆm − f‖²] ≈ E[‖ˆf_{m_0} − f‖²].
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 25 / 105

29 Model selection Oracle approach Oracle model:
m_0 := argmin_{m ∈ M} E[‖ˆf_m − f‖²].
Some remarks: m* may be different from m_0. We may have f ∉ S_{m_0} and even f ∉ ∪_{m ∈ M} S_m. The oracle model m_0 is not random but depends on β. So, it cannot be used in practice.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 26 / 105

30 Model selection Model selection procedure Our approach is based on the minimization of R(ˆf_m) over M, with
R(ˆf_m) := E[‖ˆf_m − f‖²].
The following lemma, based on the simple bias-variance decomposition, gives an explicit expression of R(ˆf_m). We denote d_m := dim(S_m).
Lemma. We have:
R(ˆf_m) = ‖(I_n − Π_{S_m}) f‖² + σ² d_m.
The first term is a bias term which decreases when m increases, whereas the second term (a variance term) increases when m increases. The oracle model m_0 is the model which achieves the best trade-off between these two terms.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 27 / 105

31 Model selection Mallows C_p Mallows' recipe: since we wish to minimize m ↦ R(ˆf_m), it is natural to choose ˆm as the minimizer of an estimate of R(ˆf_m). We denote the latter ˆR_m; it will be based on ‖ˆf_m − Y‖² (replacing f with Y and removing the expectation). The following lemma gives the last ingredient of the recipe.
Lemma. We have:
E[‖ˆf_m − Y‖²] = R(ˆf_m) − σ² (2 d_m − n).
Using the lemma, an unbiased estimate of R(ˆf_m) is given by ‖ˆf_m − Y‖² + σ² (2 d_m − n). It leads to the model selection procedure based on the minimization of Mallows' criterion defined by:
C_p(m) := ‖ˆf_m − Y‖² + 2 σ² d_m.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 28 / 105

32 Model selection Mallows C_p
Definition. Mallows' estimate of f is ˆf := ˆf_ˆm with
ˆm = argmin_{m ∈ M} C_p(m), C_p(m) := ‖ˆf_m − Y‖² + 2 σ² d_m.
Assumptions are very mild; in particular the Mallows criterion is distribution-free. It is a very popular criterion. Being only based on unbiased estimation, this approach does not take into account fluctuations of C_p(m) around its expectation. The larger M, the larger the probability to have min_{m ∈ M} C_p(m) far from min_{m ∈ M} R(ˆf_m) + σ² n. In particular, we may have, for some m ∈ M,
C_p(m) much smaller than R(ˆf_m) + σ² n and C_p(m) < C_p(m_0), while R(ˆf_m) > R(ˆf_{m_0}).
The last situation occurs when we have a large number of models for each dimension. Then ˆm is much larger than m_0, leading to overfitting. This is the main drawback of Mallows' C_p.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 29 / 105
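A minimal R sketch of Mallows' C_p (simulated data, σ² assumed known as in the slides, and the nested collection M = {{1,...,J}, J = 1,...,p} for simplicity):
set.seed(1)
n <- 100; p <- 10; sigma <- 1
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, 2, 1, rep(0, p - 3))
Y <- X %*% beta + sigma * rnorm(n)
Cp <- sapply(1:p, function(J) {
  XJ <- X[, 1:J, drop = FALSE]
  fhat <- XJ %*% solve(crossprod(XJ), crossprod(XJ, Y))  # projection of Y on S_m
  sum((Y - fhat)^2) + 2 * sigma^2 * J                    # C_p(m) = ||fhat_m - Y||^2 + 2 sigma^2 d_m
})
which.min(Cp)   # selected model size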

33 Model selection Other popular criteria When the distribution of the observations is known, we can consider the AIC and BIC criteria, which are based on the likelihood. For any model m ∈ M, we set L(m) as the maximum of the log-likelihood on S_m. We still consider
ˆm := argmin_{m ∈ M} C(m),
with, for the Akaike Information Criterion (AIC),
C(m) = −L(m) + d_m,
and for the Bayesian Information Criterion (BIC),
C(m) = −L(m) + (log n / 2) d_m.
In the Gaussian setting, AIC and Mallows' C_p are equivalent. The use of BIC tends to prevent overfitting (larger penalty).
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 30 / 105

34 Model selection Penalization for Gaussian regression We assume ɛ ~ N(0, σ² I_n). Mallows' approach shows that for l2 estimation, a criterion of the form
C(m) = ‖ˆf_m − Y‖² + σ² pen(m)
is suitable, with pen, called the penalty, generalizing Mallows' choice pen(m) = 2 d_m. We now investigate good choices of penalties. The penalty has to depend on M. Recall our benchmark: the oracle risk R(ˆf_{m_0}) with
m_0 := argmin_{m ∈ M} R(ˆf_m), R(ˆf_m) := E[‖ˆf_m − f‖²].
We wish R(ˆf) ≈ R(ˆf_{m_0}). We have R(ˆf_m) = ‖(I_n − Π_{S_m}) f‖² + σ² d_m.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 31 / 105

35 Model selection Penalty Since for any m ∈ M, C(ˆm) ≤ C(m), we have:
‖f − ˆf_ˆm‖² + 2⟨ɛ, f − ˆf_ˆm⟩ + σ² pen(ˆm) ≤ ‖f − ˆf_m‖² + 2⟨ɛ, f − ˆf_m⟩ + σ² pen(m).
Taking expectations, since pen(m) is deterministic,
R(ˆf) ≤ R(ˆf_m) [I] + 2 E[⟨ɛ, f − ˆf_m⟩] [II] + σ² pen(m) [III] + E[2⟨ɛ, ˆf − f⟩ − σ² pen(ˆm)] [IV].
Each term can be analyzed:
I is ok.
II := E[⟨ɛ, f − ˆf_m⟩] = E[⟨ɛ, f − Π_{S_m} Y⟩] = −E[‖Π_{S_m} ɛ‖²] = −σ² d_m ≤ 0.
The function pen(·) has to be large enough so that IV is negligible, but small enough to have III := σ² pen(m) of the order of R(ˆf_m) at most. Then,
R(ˆf) ≲ inf_{m ∈ M} R(ˆf_m) + negligible term.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 32 / 105

36 Model selection Analysis of the fourth term For any 0 < δ < 1, with S̄_ˆm = span(S_ˆm, f),
2⟨ɛ, ˆf − f⟩ = 2⟨Π_{S̄_ˆm} ɛ, ˆf − f⟩ ≤ δ^{−1} ‖Π_{S̄_ˆm} ɛ‖² + δ ‖ˆf − f‖².
And, with χ²(m) := ‖Π_{S̄_m}(σ^{−1} ɛ)‖²,
IV := E[2⟨ɛ, ˆf − f⟩ − σ² pen(ˆm)]
≤ δ^{−1} σ² E[χ²(ˆm) − δ pen(ˆm)] + δ R(ˆf)
≤ δ^{−1} σ² E[max_{m ∈ M} {χ²(m) − δ pen(m)}] + δ R(ˆf)
≤ δ^{−1} σ² Σ_{m ∈ M} E[(χ²(m) − δ pen(m))_+] + δ R(ˆf).
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 33 / 105

37 Model selection Penalty
Definition. To the collection of models M, we associate weights (π_m)_{m ∈ M} such that 0 < π_m ≤ 1 and Σ_{m ∈ M} π_m = 1. Then, for any constant K > 1, we set
pen(m) := K (√d_m + √(2 log(π_m^{−1})))².   (2.2)
If K > 1, taking e.g. δ = K^{−1}, concentration inequalities lead to
IV ≤ C(K) σ² + K^{−1} R(ˆf),
III := σ² pen(m) ≤ 2K σ² d_m + 4K σ² log(π_m^{−1}) ≤ 2K R(ˆf_m) + 4K σ² log(π_m^{−1}).
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 34 / 105

38 Model selection Theoretical result
Theorem (Birgé and Massart). We consider the linear regression model Y = f + ɛ and assume that ɛ ~ N(0, σ² I_n), with σ² known. Given K > 1, we define the penalty function as in (2.2) and estimate f with ˆf = ˆf_ˆm such that
ˆm := argmin_{m ∈ M} { ‖ˆf_m − Y‖² + σ² pen(m) }.
Then, there exists C_K > 0, only depending on K, such that
E[‖ˆf − f‖²] ≤ C_K min_{m ∈ M} { E[‖ˆf_m − f‖²] + σ² log(π_m^{−1}) + σ² }.
If log(π_m^{−1}) ≤ α d_m, then ˆf achieves the same risk as the oracle. Mallows' C_p will be suitable if there exists K > 1 such that K(√d_m + √(2 log(π_m^{−1})))² ≤ 2 d_m.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 35 / 105

39 Model selection First illustration with the full collection of models We first consider the case where M = P({1,...,p}). We can take
π_m = (e − 1)/(1 − e^{−p}) · d_m! (p − d_m)! / p! · e^{−d_m}.
Then
log(π_m^{−1}) ≲ d_m log(p / d_m) ≤ d_m log(p),
and
E[‖ˆf − f‖²] ≲ log(p) · min_{m ∈ M} E[‖ˆf_m − f‖²].
The log(p) term is unavoidable. We can prove that by taking K < 1, we select a very big model, leading to overfitting. This is the reason why Mallows' C_p is not suitable in this case.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 36 / 105

40 Model selection Second illustration with a poor collection of models We now consider the case where M = {{1,...,J}, 1 ≤ J ≤ p}. We can take, for any constant α > 0,
π_m = (e^α − 1)/(1 − e^{−αp}) · e^{−α d_m}.
Then
log(π_m^{−1}) ≤ α d_m + const.
Since
E[‖ˆf_m − f‖²] = ‖(I_n − Π_{S_m}) f‖² + σ² d_m,
we have
E[‖ˆf − f‖²] ≲ min_{m ∈ M} E[‖ˆf_m − f‖²].
Under convenient choices of α and K > 1, we have pen(m) ≤ 2 d_m. Therefore, Mallows' C_p is suitable in this case. The choice K < 1 leads to overfitting.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 37 / 105

41 Model selection Pros and cons of model selection Under a convenient choice of penalty (based on concentration inequalities), the model selection methodology is able to select the best predictors to explain a response variable by only using data. The model selection methodology (due to Birgé and Massart) has been presented in the Gaussian linear regression setting, but it can be extended to other settings: density estimation, Markov models, counting processes, segmentation, classification, etc. It is based on the minimization of a penalized l2-criterion over a collection of models. Note that if M = P({1,...,p}), then card(M) = 2^p. When p is large, this approach is intractable due to a prohibitive computational complexity (2^20 > 10^6). Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 38 / 105

42 Model selection The orthogonal case Assume that the matrix X is orthogonal: X^T X = I_p. We have d_m := dim(S_m) = card(m). Consider a penalty proportional to d_m: pen(m) = 2K d_m log(p). Then, since
ˆf_m = Π_{S_m} Y = Σ_{j ∈ m} ˆβ_j X_j, ˆβ_j := X_j^T Y,
we obtain:
ˆm := argmin_{m ∈ M} { ‖ˆf_m − Y‖² + σ² pen(m) }
   = argmin_{m ∈ M} { −Σ_{j ∈ m} ˆβ_j² + 2K σ² card(m) log(p) }
   = argmin_{m ∈ M} { Σ_{j ∈ m} (2K σ² log(p) − ˆβ_j²) }.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 39 / 105

43 Model selection The orthogonal case and M = P({1,...,p}) In this case, we have:
ˆm = { j ∈ {1,...,p} : |ˆβ_j| > σ √(2K log(p)) }
and
ˆf = ˆf_HT,K := Σ_{j=1}^p ˆβ_j 1_{|ˆβ_j| > σ√(2K log(p))} X_j.
Model selection corresponds to hard thresholding and implementation is easy. Assume that f = 0. Consider Mallows' C_p, BIC and hard thresholding alternatively. The first two are overfitting procedures: if p → +∞,
1 with pen(m) = 2 d_m, E[card(ˆm_Mallows)] ≈ 0.16 p,
2 if n = p and pen(m) = log(n) d_m, E[card(ˆm_BIC)] ≈ √(2p / (π log(p))),
3 if K > 1, P(ˆf_HT,K ≠ 0) = o(1).
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 40 / 105
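In R, the corresponding hard-thresholding rule is a one-liner (a sketch assuming an orthonormal design so that ˆβ_j = X_j^T Y; the default K = 1.1 is an illustrative choice):
hard_threshold <- function(beta_hat, sigma, p, K = 1.1) {
  thresh <- sigma * sqrt(2 * K * log(p))       # threshold sigma * sqrt(2 K log p)
  beta_hat * (abs(beta_hat) > thresh)          # keep only the large coefficients
}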

44 Model selection Take-home message This chapter presents, in the Gaussian linear setting, the model selection methodology, which consists in minimizing an l0-penalized criterion. Such procedures are very popular in the moderately large dimension setting and can be extended to many statistical models. Using concentration inequalities, penalties can be designed to obtain adaptive and optimal procedures in the oracle setting and to outperform classical procedures such as AIC, BIC and Mallows' C_p. When p is large and the model collection is rich, this approach may be intractable due to a prohibitive computational complexity. Alternatives have to be developed in very high dimensions. Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 41 / 105

45 Model selection References
- Birgé, Lucien and Massart, Pascal. Gaussian model selection. J. Eur. Math. Soc. (JEMS) 3, no. 3, 203-268, 2001.
- Giraud, Christophe. Introduction to high-dimensional statistics. Monographs on Statistics and Applied Probability, 139. CRC Press, Boca Raton, FL.
- Mallows, Colin. Some Comments on C_p. Technometrics, 15, 1973.
- Massart, Pascal. Concentration inequalities and model selection. Springer.
- Verzelen, Nicolas. Minimax risks for sparse regressions: ultra-high dimensional phenomenons. Electron. J. Stat., 6, 38-90, 2012.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 42 / 105

46 From Ridge estimate to Lasso estimate Chapter 2: From Ridge estimate to Lasso estimate
1 The Ridge estimate
2 The Bridge estimate
3 The Lasso
(a) General study of the Lasso
(b) The orthogonal case
(c) Tuning the Lasso - Cross-validation - Degrees of freedom
(d) Generalizations of the Lasso - The Dantzig selector - The Adaptive Lasso - The Relaxed Lasso - The Square-root Lasso - The Elastic net - The Fused Lasso - The Group Lasso - The Hierarchical group Lasso - The Bayesian Lasso
(e) Theoretical guarantees - Support recovery - Prediction risk bound
4 Take-home message
5 References
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 43 / 105

47 From Ridge estimate to Lasso estimate Ridge estimates We still consider the linear regression model Y = Xβ + ɛ, with E[ɛ] = 0, Var(ɛ) = σ² I_n and σ² known. If rank(X) = p, then ˆβ_ols = (X^T X)^{−1} X^T Y, which satisfies
E[ˆβ_ols] = β, Var(ˆβ_ols) = σ² (X^T X)^{−1}, E[‖ˆβ_ols − β‖²] = σ² Tr((X^T X)^{−1}).
In high dimensions, the matrix X^T X can be ill-conditioned (i.e. may have small eigenvalues), leading to coordinates of ˆβ_ols with large variance. To overcome this problem while preserving linearity, we modify the OLS estimate and set
ˆβ_λ^ridge = (X^T X + λ I_p)^{−1} X^T Y, λ > 0.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 44 / 105

48 From Ridge estimate to Lasso estimate Ridge estimates Since ˆβ_λ^ridge = (X^T X + λ I_p)^{−1} X^T Y, λ > 0, the tuning parameter λ balances the bias and variance terms:
‖E[ˆβ_λ^ridge] − β‖² = λ² β^T (X^T X + λ I_p)^{−2} β,
E[‖ˆβ_λ^ridge − E[ˆβ_λ^ridge]‖²] = σ² Σ_{j=1}^p μ_j / (μ_j + λ)²,
with (μ_j)_{j=1,...,p} the eigenvalues of X^T X.
Pros and cons: We can consider very high dimensions: p much larger than n. Linearity: easy to compute for many (?) problems. The choice of the regularization parameter λ is sensitive. Automatic selection is not possible.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 45 / 105
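For a given λ, the Ridge estimate is directly computable in R (a minimal sketch; glmnet with alpha = 0, used later in these slides, fits the same kind of estimator over a grid of λ with its own parametrization):
ridge_estimate <- function(X, Y, lambda) {
  p <- ncol(X)
  solve(crossprod(X) + lambda * diag(p), crossprod(X, Y))  # (X^T X + lambda I_p)^{-1} X^T Y
}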

49 From Ridge estimate to Lasso estimate Bridge estimates
Definition. For λ ≥ 0 and γ ≥ 0, we set:
C_{λ,γ}(β) := ‖Y − Xβ‖² + λ ‖β‖_γ^γ,
with
‖β‖_γ^γ = Σ_{j=1}^p |β_j|^γ if γ > 0, and ‖β‖_0^0 = Σ_{j=1}^p 1_{β_j ≠ 0} if γ = 0,
and
ˆβ_{λ,γ} := argmin_{β ∈ R^p} C_{λ,γ}(β).   (3.1)
Three interesting cases (λ > 0):
1 γ = 0: model selection
2 γ = 2: Ridge estimation
3 γ = 1: Lasso estimation
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 46 / 105

50 From Ridge estimate to Lasso estimate Bridge estimates Assume that γ = 0. Then the bridge estimate exists if X is one-to-one. Assume that γ > 0. Then the bridge estimate exists. If 0 < γ < 1, then C_{λ,γ} is not convex and it may be very hard to minimize in high dimensions. Assume that γ = 1. The penalized criterion C_{λ,1} is then convex, and C_{λ,1} has one minimizer if X is one-to-one. Assume that γ > 1. The penalized criterion C_{λ,γ} is then strictly convex and has only one minimizer. Almost surely, all coordinates of the bridge estimate are non-zero. For γ ≥ 1, there is a one-to-one correspondence between the Lagrangian problem
ˆβ_{λ,γ} := argmin_{β ∈ R^p} C_{λ,γ}(β), C_{λ,γ}(β) := ‖Y − Xβ‖² + λ ‖β‖_γ^γ,
and the constrained problem
argmin_{β ∈ R^p : ‖β‖_γ^γ ≤ t} ‖Y − Xβ‖².
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 47 / 105

51 From Ridge estimate to Lasso estimate Bridge estimates [Figure: constraint regions {β : ‖β‖_γ^γ ≤ 1} for different values of γ (including γ = 0.5, 1, 2, 6, ∞). The region is convex if and only if γ ≥ 1.] Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 48 / 105

52 From Ridge estimate to Lasso estimate Graphical illustration for p = 2 We take a fixed 2 × 2 matrix X^T X and t = 1. Note that
‖Y − Xβ‖² = (β − ˆβ_ols)^T X^T X (β − ˆβ_ols) + ‖Y − X ˆβ_ols‖²,
and the constrained problem becomes
argmin_{β ∈ R^p : ‖β‖_γ^γ ≤ t} (β − ˆβ_ols)^T X^T X (β − ˆβ_ols).
We compare the Ridge estimate
ˆβ_λ^ridge := argmin_{β ∈ R^p : ‖β‖² ≤ t} (β − ˆβ_ols)^T X^T X (β − ˆβ_ols)
and the Lasso estimate
ˆβ_λ^lasso := argmin_{β ∈ R^p : ‖β‖_1 ≤ t} (β − ˆβ_ols)^T X^T X (β − ˆβ_ols).
Of course both estimates are close (for the same values of t) but, depending on ˆβ_ols, the Lasso estimate may have null coordinates.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 49 / 105

53 From Ridge estimate to Lasso estimate Graphical illustration for p = 2 [Figure: level sets of the quadratic criterion centered at ˆβ_ols = (X^T X)^{−1} X^T Y, together with the l1 (Lasso) and l2 (Ridge) constraint regions.] Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 50 / 105

54 From Ridge estimate to Lasso estimate Graphical illustration for p = 2 [Figure: the same illustration for another position of ˆβ_ols = (X^T X)^{−1} X^T Y.] Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 51 / 105

55 From Ridge estimate to Lasso estimate Specific study of the case γ = 1 (the Lasso) The Lasso, proposed by Tibshirani (1996), is the bridge estimate with γ = 1:
ˆβ_λ^lasso := argmin_{β ∈ R^p} { ‖Y − Xβ‖² + λ ‖β‖_1 }.
It has two specific properties:
1 It is obtained from the minimization of a convex criterion (so, with low computational cost).
2 It may provide sparse solutions if the tuning parameter λ (resp. t) is large (resp. small) enough, and allows for automatic selection.
Theorem (Characterization of the Lasso). A vector ˆβ ∈ R^p is a global minimizer of C_{λ,1} if and only if ˆβ satisfies the following conditions: for any j,
if ˆβ_j ≠ 0, 2 X_j^T (Y − X ˆβ) = λ sign(ˆβ_j);
if ˆβ_j = 0, |2 X_j^T (Y − X ˆβ)| ≤ λ.
Furthermore, ˆβ is the unique minimizer if X_E is one-to-one, with E := { j : |2 X_j^T (Y − X ˆβ)| = λ }.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 52 / 105

56 From Ridge estimate to Lasso estimate Specific study of the case γ = 1 (the Lasso) Sketch of the proof:
1 For a convex function f, ˆβ is a minimizer of f if and only if 0 ∈ ∂f(ˆβ), with
∂f(ˆβ) := { s ∈ R^p : f(y) ≥ f(ˆβ) + ⟨s, y − ˆβ⟩ for all y }.
2 If f is differentiable at ˆβ, ∂f(ˆβ) = {∇f(ˆβ)}.
3 If f(β) = ‖β‖_1,
∂f(ˆβ) = { g ∈ R^p : ‖g‖_∞ ≤ 1, ⟨g, ˆβ⟩ = ‖ˆβ‖_1 }.
Note that Ŝ := { j : ˆβ_j ≠ 0 } ⊂ E. So if X_Ŝ is one-to-one and, for all j ∉ Ŝ, |2 X_j^T (Y − X ˆβ)| < λ, then we have uniqueness. Indeed, in this case, Ŝ = E. If ˆβ and β̃ are two global minimizers of C_{λ,1}, then X ˆβ = X β̃ and ‖ˆβ‖_1 = ‖β̃‖_1.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 53 / 105

57 From Ridge estimate to Lasso estimate The orthogonal case Assume that the matrix X is orthogonal: X^T X = I_p. Then
ˆβ_λ^lasso := argmin_{β ∈ R^p} { ‖Y − Xβ‖² + λ ‖β‖_1 } = argmin_{β ∈ R^p} Σ_{j=1}^p ( β_j² − 2 (X_j^T Y) β_j + λ |β_j| ).
Orthogonality allows for a coordinatewise study of the minimization problem. Straightforward computations lead to
ˆβ_{λ,j}^lasso = sign(X_j^T Y) ( |X_j^T Y| − λ/2 )_+
             = X_j^T Y − λ/2 if X_j^T Y ≥ λ/2,
             = 0 if |X_j^T Y| ≤ λ/2,
             = X_j^T Y + λ/2 if X_j^T Y ≤ −λ/2.
The LASSO (Least Absolute Shrinkage and Selection Operator) procedure corresponds to a soft thresholding algorithm.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 54 / 105
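The corresponding soft-thresholding operator in R (a one-line sketch for the orthogonal case above, with a_j = X_j^T Y):
soft_threshold <- function(a, lambda) sign(a) * pmax(abs(a) - lambda / 2, 0)
# Example: soft_threshold(c(-3, -0.2, 0.5, 2), lambda = 1) returns -2.5 0.0 0.0 1.5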

58 From Ridge estimate to Lasso estimate The orthogonal case - Comparison We assume that the matrix X is orthogonal: X^T X = I_p. We compare (with a_j := X_j^T Y):
The OLS estimate: ˆβ_j^ols = a_j.
The Ridge estimate (γ = 2): ˆβ_{λ,j}^ridge = (1 + λ)^{−1} a_j.
The Lasso estimate or soft-thresholding rule (γ = 1): ˆβ_{λ,j}^lasso = sign(a_j) (|a_j| − λ/2)_+.
The model selection estimate or hard-thresholding rule (γ = 0): ˆβ_{λ,j}^m.s. = a_j 1_{|a_j| > √λ}.
[Figure: comparison of the 4 estimates for the orthogonal case with λ = 1.]
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 55 / 105

59 From Ridge estimate to Lasso estimate Tuning the Lasso - V-fold cross-validation We write the model Y_i = x_i^T β + ɛ_i, i = 1,...,n, with x_i ∈ R^p and the ɛ_i i.i.d. with E[ɛ_i] = 0, Var(ɛ_i) = σ². For a number V, we split the training pairs into V parts (or "folds"). Commonly, V = 5 or V = 10. V-fold cross-validation considers training on all but the k-th part, and then validating on the k-th part, iterating over k = 1,...,V. When V = n, we call this leave-one-out cross-validation, because we leave out one data point at a time. Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 56 / 105

60 From Ridge estimate to Lasso estimate Tuning the Lasso - V-fold cross-validation
1 Choose V and a discrete set Λ of possible values for λ.
2 Split the training set {1,...,n} into V subsets B_1,...,B_V of roughly the same size.
3 For each value of λ ∈ Λ and for k = 1,...,V, compute the estimate ˆβ_λ^(−k) on the training set ((x_i, Y_i), i ∈ B_l, l ≠ k) and record the total error on the validation set B_k:
e_k(λ) := (1 / card(B_k)) Σ_{i ∈ B_k} ( Y_i − x_i^T ˆβ_λ^(−k) )².
4 Compute the average error over all folds,
CV(λ) := (1/V) Σ_{k=1}^V e_k(λ).
5 Choose the value of the tuning parameter that minimizes this function CV on Λ:
ˆλ := argmin_{λ ∈ Λ} CV(λ).
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 57 / 105
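In practice, this cross-validation scheme for the Lasso is implemented in cv.glmnet (the glmnet package is used later in these slides); a sketch on simulated data, where all sizes are illustrative:
library(glmnet)
set.seed(1)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
Y <- as.vector(X[, 1:3] %*% c(2, -1, 1) + rnorm(n))
cv.out <- cv.glmnet(x = X, y = Y, alpha = 1, nfolds = 10)  # V = 10 folds
cv.out$lambda.min                                          # lambda minimizing the CV error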

61 From Ridge estimate to Lasso estimate Tuning the Lasso - Degrees of freedom We write the model Y_i = x_i^T β + ɛ_i, with ɛ_i i.i.d. N(0, σ²), i = 1,...,n.
Definition (Efron (1986)). The degrees of freedom of a function g : R^n → R^n with coordinates g_i is defined by
df(g) = (1/σ²) Σ_{i=1}^n cov(g_i(Y), Y_i).
The degrees of freedom may be viewed as the true number of independent pieces of information on which an estimate is based. Example with rank(X) = p: we estimate Xβ with g(Y) = X (X^T X)^{−1} X^T Y, and
df(g) = σ^{−2} Σ_{i=1}^n E[ x_i^T (X^T X)^{−1} X^T ɛ · ɛ_i ] = p.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 58 / 105

62 From Ridge estimate to Lasso estimate Tuning the Lasso - Degrees of freedom Efron's degrees of freedom is the main ingredient to generalize Mallows' C_p in high dimensions:
Proposition. Let ˆβ be an estimate of β. If
C_p := ‖Y − X ˆβ‖² − n σ² + 2 σ² df(X ˆβ),
then we have: E[C_p] = E[‖X ˆβ − X β‖²].
Assume that for any λ > 0 we have d̂f, an estimate of df(X ˆβ_λ), where ˆβ_λ is the Lasso estimate associated with λ. Then we can choose λ by minimizing
λ ↦ ‖Y − X ˆβ_λ‖² + 2 σ² d̂f.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 59 / 105

63 From Ridge estimate to Lasso estimate Tuning the Lasso - Degrees of freedom
Theorem (Zou, Hastie and Tibshirani (2007)). Assume rank(X) = p. Then, with Ŝ_λ := { j : ˆβ_{λ,j} ≠ 0 }, we have
E[card(Ŝ_λ)] = df(X ˆβ_λ).
Theorem (Tibshirani and Taylor (2012)). With
E_λ := { j : |2 X_j^T (Y − X ˆβ_λ)| = λ },
we have
E[rank(X_{E_λ})] = df(X ˆβ_λ), E[rank(X_{Ŝ_λ})] = df(X ˆβ_λ).
This gives three possible estimates for df(X ˆβ_λ).
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 60 / 105
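A sketch (simulated data; σ assumed known; glmnet's default intercept kept for simplicity) of tuning the Lasso by the C_p-type criterion of the previous slide, with df(X ˆβ_λ) estimated by the number of non-zero coefficients as in the first theorem above:
library(glmnet)
set.seed(1)
n <- 100; p <- 50; sigma <- 1
X <- matrix(rnorm(n * p), n, p)
Y <- as.vector(X[, 1:3] %*% c(2, -1, 1) + sigma * rnorm(n))
fit <- glmnet(x = X, y = Y, alpha = 1)
rss <- colSums((Y - predict(fit, newx = X))^2)     # ||Y - X beta_hat_lambda||^2 along the path
cp  <- rss - n * sigma^2 + 2 * sigma^2 * fit$df    # fit$df = number of non-zero coefficients
fit$lambda[which.min(cp)]                          # selected lambda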

64 From Ridge estimate to Lasso estimate Illustration on real data Analysis of the famous prostate data, which record the prostate specific antigen, the cancer volume, the prostate weight, the age, the benign prostatic hyperplasia amount, the seminal vesicle invasion, the capsular penetration, the Gleason score and the percentage of Gleason scores 4 or 5, for n = 97 patients.
install.packages("ElemStatLearn")
install.packages("glmnet")
library(glmnet)
data("prostate", package = "ElemStatLearn")
Y = prostate$lpsa
X = as.matrix(prostate[, !(names(prostate) %in% c("lpsa", "train"))])
ridge.out = glmnet(x = X, y = Y, alpha = 0)
plot(ridge.out)
lasso.out = glmnet(x = X, y = Y, alpha = 1)
plot(lasso.out)
These R commands produce a plot of the values of the coordinates of the Ridge and Lasso estimates as λ decreases.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 61 / 105

65 From Ridge estimate to Lasso estimate Illustration on real data [Figure: coefficient paths of the Ridge (left) and Lasso (right) estimates for the eight predictors (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45). The x-axis corresponds to ‖ˆβ_λ‖_1; the left-hand side corresponds to λ = +∞, the right-hand side corresponds to λ = 0.] Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 62 / 105

66 From Ridge estimate to Lasso estimate Generalizations of the Lasso - the Dantzig selector Remember that the Lasso estimate satisfies the constraint
max_{j=1,...,p} |2 X_j^T (Y − X ˆβ_λ^lasso)| ≤ λ.
We then introduce the convex set
D := { β ∈ R^p : max_{j=1,...,p} |2 X_j^T (Y − Xβ)| ≤ λ },
which contains β with high probability if λ is well tuned. Remember also that we investigate sparse vectors, where sparsity is measured by using the l1-norm. Therefore, Candès and Tao (2007) have suggested to use the Dantzig selector
ˆβ_λ^Dantzig := argmin_{β ∈ D} ‖β‖_1.
Note that ‖ˆβ_λ^Dantzig‖_1 ≤ ‖ˆβ_λ^lasso‖_1. Numerical and theoretical performances of the Dantzig and Lasso estimates are very close. In some cases, they may even coincide.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 63 / 105

67 From Ridge estimate to Lasso estimate Generalizations of the Lasso - Adaptive Lasso Due to its soft-thresholding nature, the Lasso estimation of large coefficients may suffer from a large bias. We can overcome this problem by introducing data-driven weights. Zou (2006) proposed an adaptive version of the classical Lasso:
ˆβ_λ^Zou := argmin_{β ∈ R^p} ‖Y − Xβ‖² + λ Σ_{j=1}^p w_j |β_j|, with w_j = 1 / |ˆβ_j^ols|.
The larger |ˆβ_j^ols|, the smaller w_j, which encourages large values for ˆβ_{λ,j}^Zou. Instead of ˆβ_ols, other preliminary estimates can be considered.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 64 / 105
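A hedged sketch of the adaptive Lasso using glmnet's penalty.factor argument for the weights w_j (simulated data; rank(X) = p is assumed so that ˆβ_ols exists):
library(glmnet)
set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
Y <- as.vector(X[, 1:2] %*% c(3, -2) + rnorm(n))
beta_ols <- solve(crossprod(X), crossprod(X, Y))
w <- 1 / abs(as.vector(beta_ols))                        # data-driven weights w_j = 1 / |beta_ols_j|
fit_adapt <- glmnet(x = X, y = Y, alpha = 1, penalty.factor = w)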

68 From Ridge estimate to Lasso estimate Generalizations of the Lasso - Relaxed Lasso Instead of introducing weights, Meinshausen (2007) suggests a two-step procedure:
1 Compute
ˆβ_λ^lasso = argmin_{β ∈ R^p} { ‖Y − Xβ‖² + λ ‖β‖_1 }
and set Ŝ_λ := { j : ˆβ_{λ,j} ≠ 0 }.
2 For δ ∈ [0, 1], compute
ˆβ_{λ,δ}^relaxed := argmin_{β ∈ R^p, supp(β) ⊂ Ŝ_λ} { ‖Y − Xβ‖² + δ λ ‖β‖_1 }.
The value δ = 0 is commonly used. If X is orthogonal,
ˆβ_{λ,δ,j}^relaxed = X_j^T Y − δλ/2 if X_j^T Y ≥ λ/2,
                  = 0 if |X_j^T Y| ≤ λ/2,
                  = X_j^T Y + δλ/2 if X_j^T Y ≤ −λ/2.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 65 / 105
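A sketch of the relaxed Lasso with δ = 0, i.e. an ordinary least squares refit on the support selected by the Lasso (simulated data; the value λ = 0.1 is an arbitrary illustration):
library(glmnet)
set.seed(1)
n <- 100; p <- 30
X <- matrix(rnorm(n * p), n, p)
Y <- as.vector(X[, 1:3] %*% c(2, -1, 1) + rnorm(n))
fit <- glmnet(x = X, y = Y, alpha = 1)
S_hat <- which(as.vector(coef(fit, s = 0.1))[-1] != 0)   # selected support S_hat for lambda = 0.1
beta_relaxed <- rep(0, p)
beta_relaxed[S_hat] <- coef(lm(Y ~ X[, S_hat] - 1))      # refit OLS restricted to S_hat (delta = 0)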

69 From Ridge estimate to Lasso estimate Generalizations of the Lasso - The square-root Lasso A natural property of the Lasso estimate would be to satisfy, for any s > 0,
argmin_{β ∈ R^p} { ‖Y − Xβ‖² + λ ‖β‖_1 } a.e.= argmin_{β ∈ R^p} { ‖sY − sXβ‖² + λ ‖sβ‖_1 }.
If the tuning parameter is chosen independently of σ, the standard deviation of Y, then the Lasso estimate is not scale invariant. The estimate
argmin_{β ∈ R^p} { σ^{−1} ‖Y − Xβ‖² + λ ‖β‖_1 }
is scale invariant but is based on the knowledge of σ. Alternatively, one can consider the square-root Lasso:
argmin_{β ∈ R^p} { ‖Y − Xβ‖ + λ ‖β‖_1 },
which also enjoys nice properties.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 66 / 105

70 From Ridge estimate to Lasso estimate Generalizations of the Lasso - Elastic net In the model Y = Xβ + ɛ, consider
ˆβ_λ^lasso = argmin_{β ∈ R^p} { ‖Y − Xβ‖² + λ ‖β‖_1 }.
If we consider X̃ = [X, X_p] (the design X augmented with a copy of its last column) and if ˆβ_{λ,p}^lasso ≠ 0, then any vector β̃_λ such that
β̃_{λ,j} = ˆβ_{λ,j}^lasso if j ≤ p − 1,
β̃_{λ,j} = α ˆβ_{λ,p}^lasso if j = p,
β̃_{λ,j} = (1 − α) ˆβ_{λ,p}^lasso if j = p + 1,
with α ∈ [0, 1], is a solution of
argmin_{β ∈ R^{p+1}} { ‖Y − X̃β‖² + λ ‖β‖_1 }.
We have an infinite number of solutions.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 67 / 105

71 From Ridge estimate to Lasso estimate Generalizations of the Lasso - Elastic net In practice, predictors are different but they may be strongly correlated. In this case, the Lasso estimate may hide the relevance of one of them, just because it is highly correlated to another one. Coefficients of two correlated predictors should be close. The elastic net procedure proposed by Zou and Hastie (2005) makes a compromise between the Ridge and Lasso penalties: given λ_1 > 0 and λ_2 > 0,
ˆβ_{λ_1,λ_2}^e.n. := argmin_{β ∈ R^p} { ‖Y − Xβ‖² + λ_1 ‖β‖_1 + λ_2 ‖β‖² }.
The criterion is strictly convex, so there is a unique minimizer. If the columns of X are centered and renormalized and if Y is centered, then for j ≠ k such that ˆβ_{λ_1,λ_2,j}^e.n. ˆβ_{λ_1,λ_2,k}^e.n. > 0,
|ˆβ_{λ_1,λ_2,j}^e.n. − ˆβ_{λ_1,λ_2,k}^e.n.| ≤ (‖Y‖_1 / λ_2) √(2 (1 − X_j^T X_k)).
We can improve ˆβ_{λ_1,λ_2}^e.n. and consider (1 + λ_2) ˆβ_{λ_1,λ_2}^e.n.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 68 / 105
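In R, the elastic net is obtained with glmnet for 0 < alpha < 1 (glmnet uses the reparametrized penalty lambda * (alpha * ‖β‖_1 + (1 − alpha)/2 * ‖β‖²) rather than the pair (λ_1, λ_2)); a sketch with two strongly correlated predictors:
library(glmnet)
set.seed(1)
n <- 100
z <- rnorm(n)
X <- cbind(z + 0.01 * rnorm(n), z + 0.01 * rnorm(n), matrix(rnorm(n * 8), n, 8))  # columns 1 and 2 almost identical
Y <- as.vector(3 * z + rnorm(n))
fit_en <- glmnet(x = X, y = Y, alpha = 0.5)   # mixed l1/l2 penalty; alpha = 1 is the Lasso, alpha = 0 the Ridge
coef(fit_en, s = 0.1)[2:3]                    # coefficients of the two correlated predictors (expected to be close)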

72 From Ridge estimate to Lasso estimate Generalizations of the Lasso - Fused Lasso For change point detection, for instance, for which coefficients remain constant over large portions of segments, Tibshirani, Saunders, Rosset, Zhu and Knight (2005) have introduced the fused Lasso: given λ_1 > 0 and λ_2 > 0,
ˆβ_{λ_1,λ_2}^fused := argmin_{β ∈ R^p} ‖Y − Xβ‖² + λ_1 Σ_{j=1}^p |β_j| + λ_2 Σ_{j=2}^p |β_j − β_{j−1}|.
The first penalty is the familiar Lasso penalty, which regularizes the signal. The second penalty encourages neighboring coefficients to be identical. We can generalize the notion of neighbors from a linear ordering to more general neighborhoods, for example adjacent pixels in an image. This leads to a penalty of the form
λ_2 Σ_{j ~ j'} |β_j − β_{j'}|.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 69 / 105

73 From Ridge estimate to Lasso estimate Generalizations of the Lasso - Group Lasso To select a group of variables simultaneously, Yuan and Lin (2006) suggest to use the group-Lasso procedure. For this purpose, we assume we are given K known non-overlapping groups G_1, G_2, ..., G_K and we set, for λ > 0,
ˆβ^group := argmin_{β ∈ R^p} { ‖Y − Xβ‖² + λ Σ_{k=1}^K ‖β_(k)‖ },
where β_(k)j = β_j if j ∈ G_k and 0 otherwise. As for the Lasso, the group-Lasso can be characterized: for all k,
2 X_(k)^T (Y − X ˆβ^group) = λ ˆβ^group_(k) / ‖ˆβ^group_(k)‖ if ˆβ^group_(k) ≠ 0,
‖2 X_(k)^T (Y − X ˆβ^group)‖ ≤ λ if ˆβ^group_(k) = 0.
The procedure keeps or discards all the coefficients within a block and can increase estimation accuracy by using information about coefficients of the same block.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 70 / 105

74 From Ridge estimate to Lasso estimate Generalizations of the Lasso - Hierarchical group Lasso We consider 2 predictors X_1 and X_2. Suppose we want X_1 to be included in the model before X_2. This hierarchy can be induced by defining overlapping groups: we take G_1 = {1, 2} and G_2 = {2}. This leads to
ˆβ^overlap = argmin_{β ∈ R^p} { ‖Y − Xβ‖² + λ ( ‖(β_1, β_2)‖ + |β_2| ) }.
[Figure: contour plot of this penalty function in the (β_1, β_2) plane.]
Theorem (Zhao, Rocha and Yu (2009)). We assume we are given K known groups G_1, G_2, ..., G_K. Let I_1 and I_2 ⊂ {1,...,p} be two subsets of indices. We assume:
1 For all 1 ≤ k ≤ K, I_1 ∩ G_k ≠ ∅ implies I_2 ∩ G_k ≠ ∅.
2 There exists k_0 such that I_2 ∩ G_{k_0} ≠ ∅ and I_1 ∩ G_{k_0} = ∅.
Then, almost surely, ˆβ_{I_2} ≠ 0 implies ˆβ_{I_1} ≠ 0.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 71 / 105

75 From Ridge estimate to Lasso estimate Generalizations of the Lasso - The Bayesian Lasso In the Bayesian approach, the parameter is random and we write:
Y | β, σ² ~ N(Xβ, σ² I_n).
Park and Casella (2008) suggest to consider a Laplace distribution for β:
π(β | λ, σ) ∝ Π_{j=1}^p (λ / (2σ)) exp( −(λ/σ) |β_j| ).
Then, the posterior density is proportional to
exp( −(1/(2σ²)) ‖Y − Xβ‖² − (λ/σ) ‖β‖_1 ),
and the posterior mode coincides with the Lasso estimate with smoothing parameter σλ. The posterior distribution provides more than point estimates, since it provides the entire posterior distribution. The procedure is tuned by including priors for σ² and λ. Most Lasso-type procedures have a Bayesian interpretation.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 72 / 105

76 From Ridge estimate to Lasso estimate Geometric constraint areas Exercise: connect each methodology to its associated geometric constraint area [figure not reproduced].
1 Lasso
2 Elastic net
3 Group-Lasso
4 Overlap group-Lasso
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 73 / 105

77 From Ridge estimate to Lasso estimate Theoretical guarantees - Support recovery Question: are the non-zero entries of the Lasso estimate ˆβ_λ^lasso in the same positions as those of the true regression vector β* (with Y = Xβ* + ɛ)? We set
S := { j : β*_j ≠ 0 }, Ŝ_λ := { j : ˆβ_{λ,j}^lasso ≠ 0 }.
We can identify conditions under which Ŝ_λ = S.
Theorem. We assume that for some γ > 0, K > 0 and c_min > 0,
max_{j ∉ S} ‖(X_S^T X_S)^{−1} X_S^T X_j‖_1 ≤ 1 − γ,
max_{j=1,...,p} ‖X_j‖ ≤ K,
and the eigenvalues of X_S^T X_S are at least c_min.
Then, if λ ≥ 8Kσ √(log p) / γ, with probability larger than 1 − p^{−A} (for some constant A > 0),
Ŝ_λ ⊂ S, ‖ˆβ_λ^lasso − β*‖_∞ ≤ λ ( 4σ / √c_min + ‖(X_S^T X_S)^{−1}‖_∞ ).
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 74 / 105

78 From Ridge estimate to Lasso estimate Theoretical guarantees - Prediction bounds
Theorem (Bunea et al. (2007)). Let us consider λ ≥ 3 max_{j=1,...,p} |(X^T ɛ)_j|. For any β ∈ R^p, let
κ(β) := min_{ν ∈ C(β)} ‖Xν‖² / ‖ν‖², C(β) := { ν ∈ R^p : 20 ‖ν‖_{1,Supp(β)} > ‖ν‖_{1,Supp(β)^c} }.
Then,
‖X ˆβ_λ^lasso − X β*‖² ≤ inf_{β ∈ R^p : κ(β) > 0} { 3 ‖Xβ − Xβ*‖² + λ² ‖β‖_0 / κ(β) },
where β* denotes the true regression vector. Deriving λ such that the first bound is satisfied with high probability is easy by using concentration inequalities. The Restricted Eigenvalue condition (κ(β) > 0) expresses the lack of orthogonality of the columns of X. Milder conditions can be used (see Negahban et al. (2012) or Jiang et al. (2017)).
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 75 / 105

79 From Ridge estimate to Lasso estimate Theoretical guarantees - Prediction bounds
Lemma. From the previous theorem, we can deduce estimation bounds in l2 and l1 norms for the estimation of sparse vectors β* (see Jiang et al. (2017)):
‖ˆβ_λ^lasso − β*‖² ≲ λ² ‖β*‖_0, ‖ˆβ_λ^lasso − β*‖_1 ≲ λ ‖β*‖_0.
The proof of the Theorem is based on the following lemma. Let λ ≥ 3 max_{j=1,...,p} |(X^T ɛ)_j| and β ∈ R^p. Then,
λ ‖ˆβ_λ^lasso − β‖_{1,Supp(β)^c} ≤ 3 ‖Xβ − Xβ*‖² + 5λ ‖ˆβ_λ^lasso − β‖_{1,Supp(β)},
‖X ˆβ_λ^lasso − X β*‖² ≤ ‖Xβ − Xβ*‖² + 2λ ‖ˆβ_λ^lasso − β‖_{1,Supp(β)}.
Better constants can be obtained via a more involved proof.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 76 / 105

80 From Ridge estimate to Lasso estimate Take-home message To overcome the prohibitive computational complexity of model selection, convex criteria can be considered, leading in particular to Lasso-type estimates. By doing so, we introduce some bias but reduce the variance of the predicted values. Moreover, we can identify a small number of predictors that have the strongest effects, which makes interpretation easier for the practitioner. By varying the basic Lasso l1-penalty, we can reduce problems encountered by the standard Lasso or incorporate some prior knowledge about the model. In the linear regression setting, these estimates, which can be easily computed, are very popular for high-dimensional statistics. They achieve nice theoretical and numerical properties. Even if some standard recipes can be used to tune the Lasso, its calibration remains an important open problem. Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 77 / 105

81 From Ridge estimate to Lasso estimate References
- Bunea, F., Tsybakov, A. and Wegkamp, M. Sparsity oracle inequalities for the Lasso. Electron. J. Stat. 1, 2007.
- Candès, E. and Tao, T. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6), 2007.
- Chen, S., Donoho, D. and Saunders, M. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20, no. 1, 33-61, 1998.
- Efron, B. How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association: Theory and Methods 81 (394), 1986.
- Jiang, X., Reynaud-Bouret, P., Rivoirard, V., Sansonnet, L. and Willett, R. A data-dependent weighted LASSO under Poisson noise. Submitted.
- Meinshausen, N. Relaxed Lasso. Comput. Statist. Data Anal., 52(1), 2007.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 78 / 105


More information

Comparisons of penalized least squares. methods by simulations

Comparisons of penalized least squares. methods by simulations Comparisons of penalized least squares arxiv:1405.1796v1 [stat.co] 8 May 2014 methods by simulations Ke ZHANG, Fan YIN University of Science and Technology of China, Hefei 230026, China Shifeng XIONG Academy

More information

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu LITIS - EA 48 - INSA/Universite de Rouen Avenue de l Université - 768 Saint-Etienne du Rouvray

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider variable

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

Identify Relative importance of covariates in Bayesian lasso quantile regression via new algorithm in statistical program R

Identify Relative importance of covariates in Bayesian lasso quantile regression via new algorithm in statistical program R Identify Relative importance of covariates in Bayesian lasso quantile regression via new algorithm in statistical program R Fadel Hamid Hadi Alhusseini Department of Statistics and Informatics, University

More information

SCIENCE CHINA Information Sciences. Received December 22, 2008; accepted February 26, 2009; published online May 8, 2010

SCIENCE CHINA Information Sciences. Received December 22, 2008; accepted February 26, 2009; published online May 8, 2010 . RESEARCH PAPERS. SCIENCE CHINA Information Sciences June 2010 Vol. 53 No. 6: 1159 1169 doi: 10.1007/s11432-010-0090-0 L 1/2 regularization XU ZongBen 1, ZHANG Hai 1,2, WANG Yao 1, CHANG XiangYu 1 & LIANG

More information

Introduction to Statistics and R

Introduction to Statistics and R Introduction to Statistics and R Mayo-Illinois Computational Genomics Workshop (2018) Ruoqing Zhu, Ph.D. Department of Statistics, UIUC rqzhu@illinois.edu June 18, 2018 Abstract This document is a supplimentary

More information

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013)

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013) A Survey of L 1 Regression Vidaurre, Bielza and Larranaga (2013) Céline Cunen, 20/10/2014 Outline of article 1.Introduction 2.The Lasso for Linear Regression a) Notation and Main Concepts b) Statistical

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10 COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem

More information

Kernel change-point detection

Kernel change-point detection 1,2 (joint work with Alain Celisse 3 & Zaïd Harchaoui 4 ) 1 Cnrs 2 École Normale Supérieure (Paris), DIENS, Équipe Sierra 3 Université Lille 1 4 INRIA Grenoble Workshop Kernel methods for big data, Lille,

More information

Regression Shrinkage and Selection via the Elastic Net, with Applications to Microarrays

Regression Shrinkage and Selection via the Elastic Net, with Applications to Microarrays Regression Shrinkage and Selection via the Elastic Net, with Applications to Microarrays Hui Zou and Trevor Hastie Department of Statistics, Stanford University December 5, 2003 Abstract We propose the

More information

Statistical Learning with the Lasso, spring The Lasso

Statistical Learning with the Lasso, spring The Lasso Statistical Learning with the Lasso, spring 2017 1 Yeast: understanding basic life functions p=11,904 gene values n number of experiments ~ 10 Blomberg et al. 2003, 2010 The Lasso fmri brain scans function

More information

Chapter 7: Model Assessment and Selection

Chapter 7: Model Assessment and Selection Chapter 7: Model Assessment and Selection DD3364 April 20, 2012 Introduction Regression: Review of our problem Have target variable Y to estimate from a vector of inputs X. A prediction model ˆf(X) has

More information

Lecture 2 Part 1 Optimization

Lecture 2 Part 1 Optimization Lecture 2 Part 1 Optimization (January 16, 2015) Mu Zhu University of Waterloo Need for Optimization E(y x), P(y x) want to go after them first, model some examples last week then, estimate didn t discuss

More information

Model selection theory: a tutorial with applications to learning

Model selection theory: a tutorial with applications to learning Model selection theory: a tutorial with applications to learning Pascal Massart Université Paris-Sud, Orsay ALT 2012, October 29 Asymptotic approach to model selection - Idea of using some penalized empirical

More information

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model

More information

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 9-10 - High-dimensional regression Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Recap from

More information

High-dimensional statistics: Some progress and challenges ahead

High-dimensional statistics: Some progress and challenges ahead High-dimensional statistics: Some progress and challenges ahead Martin Wainwright UC Berkeley Departments of Statistics, and EECS University College, London Master Class: Lecture Joint work with: Alekh

More information

High-dimensional covariance estimation based on Gaussian graphical models

High-dimensional covariance estimation based on Gaussian graphical models High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,

More information

CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning

CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning Professor Erik Sudderth Brown University Computer Science October 4, 2016 Some figures and materials courtesy

More information

Institute of Statistics Mimeo Series No Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR

Institute of Statistics Mimeo Series No Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR DEPARTMENT OF STATISTICS North Carolina State University 2501 Founders Drive, Campus Box 8203 Raleigh, NC 27695-8203 Institute of Statistics Mimeo Series No. 2583 Simultaneous regression shrinkage, variable

More information

Bayesian Grouped Horseshoe Regression with Application to Additive Models

Bayesian Grouped Horseshoe Regression with Application to Additive Models Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu, Daniel F. Schmidt, Enes Makalic, Guoqi Qian, and John L. Hopper Centre for Epidemiology and Biostatistics, Melbourne

More information

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d)

COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d) COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d) Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless

More information

Machine Learning for Economists: Part 4 Shrinkage and Sparsity

Machine Learning for Economists: Part 4 Shrinkage and Sparsity Machine Learning for Economists: Part 4 Shrinkage and Sparsity Michal Andrle International Monetary Fund Washington, D.C., October, 2018 Disclaimer #1: The views expressed herein are those of the authors

More information

Saharon Rosset 1 and Ji Zhu 2

Saharon Rosset 1 and Ji Zhu 2 Aust. N. Z. J. Stat. 46(3), 2004, 505 510 CORRECTED PROOF OF THE RESULT OF A PREDICTION ERROR PROPERTY OF THE LASSO ESTIMATOR AND ITS GENERALIZATION BY HUANG (2003) Saharon Rosset 1 and Ji Zhu 2 IBM T.J.

More information

Statistical Inference

Statistical Inference Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park

More information

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă mmp@stat.washington.edu Reading: Murphy: BIC, AIC 8.4.2 (pp 255), SRM 6.5 (pp 204) Hastie, Tibshirani

More information

Least Absolute Gradient Selector: variable selection via Pseudo-Hard Thresholding

Least Absolute Gradient Selector: variable selection via Pseudo-Hard Thresholding arxiv:204.2353v4 [stat.ml] 9 Oct 202 Least Absolute Gradient Selector: variable selection via Pseudo-Hard Thresholding Kun Yang September 2, 208 Abstract In this paper, we propose a new approach, called

More information

Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space

Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space Jinchi Lv Data Sciences and Operations Department Marshall School of Business University of Southern California http://bcf.usc.edu/

More information

Fast Regularization Paths via Coordinate Descent

Fast Regularization Paths via Coordinate Descent KDD August 2008 Trevor Hastie, Stanford Statistics 1 Fast Regularization Paths via Coordinate Descent Trevor Hastie Stanford University joint work with Jerry Friedman and Rob Tibshirani. KDD August 2008

More information

A significance test for the lasso

A significance test for the lasso 1 Gold medal address, SSC 2013 Joint work with Richard Lockhart (SFU), Jonathan Taylor (Stanford), and Ryan Tibshirani (Carnegie-Mellon Univ.) Reaping the benefits of LARS: A special thanks to Brad Efron,

More information

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson Bayesian variable selection via penalized credible regions Brian Reich, NC State Joint work with Howard Bondell and Ander Wilson Brian Reich, NCSU Penalized credible regions 1 Motivation big p, small n

More information

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

Lecture 14: Variable Selection - Beyond LASSO

Lecture 14: Variable Selection - Beyond LASSO Fall, 2017 Extension of LASSO To achieve oracle properties, L q penalty with 0 < q < 1, SCAD penalty (Fan and Li 2001; Zhang et al. 2007). Adaptive LASSO (Zou 2006; Zhang and Lu 2007; Wang et al. 2007)

More information

arxiv: v1 [math.st] 8 Jan 2008

arxiv: v1 [math.st] 8 Jan 2008 arxiv:0801.1158v1 [math.st] 8 Jan 2008 Hierarchical selection of variables in sparse high-dimensional regression P. J. Bickel Department of Statistics University of California at Berkeley Y. Ritov Department

More information

Bias-free Sparse Regression with Guaranteed Consistency

Bias-free Sparse Regression with Guaranteed Consistency Bias-free Sparse Regression with Guaranteed Consistency Wotao Yin (UCLA Math) joint with: Stanley Osher, Ming Yan (UCLA) Feng Ruan, Jiechao Xiong, Yuan Yao (Peking U) UC Riverside, STATS Department March

More information

COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017

COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017 COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University UNDERDETERMINED LINEAR EQUATIONS We

More information

Dimension Reduction Methods

Dimension Reduction Methods Dimension Reduction Methods And Bayesian Machine Learning Marek Petrik 2/28 Previously in Machine Learning How to choose the right features if we have (too) many options Methods: 1. Subset selection 2.

More information

Shrinkage Methods: Ridge and Lasso

Shrinkage Methods: Ridge and Lasso Shrinkage Methods: Ridge and Lasso Jonathan Hersh 1 Chapman University, Argyros School of Business hersh@chapman.edu February 27, 2019 J.Hersh (Chapman) Ridge & Lasso February 27, 2019 1 / 43 1 Intro and

More information

High-dimensional regression modeling

High-dimensional regression modeling High-dimensional regression modeling David Causeur Department of Statistics and Computer Science Agrocampus Ouest IRMAR CNRS UMR 6625 http://www.agrocampus-ouest.fr/math/causeur/ Course objectives Making

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

Biostatistics Advanced Methods in Biostatistics IV

Biostatistics Advanced Methods in Biostatistics IV Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu Lecture 12 1 / 36 Tip + Paper Tip: As a statistician the results

More information

Regression I: Mean Squared Error and Measuring Quality of Fit

Regression I: Mean Squared Error and Measuring Quality of Fit Regression I: Mean Squared Error and Measuring Quality of Fit -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 The Setup Suppose there is a scientific problem we are interested in solving

More information

Some new ideas for post selection inference and model assessment

Some new ideas for post selection inference and model assessment Some new ideas for post selection inference and model assessment Robert Tibshirani, Stanford WHOA!! 2018 Thanks to Jon Taylor and Ryan Tibshirani for helpful feedback 1 / 23 Two topics 1. How to improve

More information

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences Biostatistics-Lecture 16 Model Selection Ruibin Xi Peking University School of Mathematical Sciences Motivating example1 Interested in factors related to the life expectancy (50 US states,1969-71 ) Per

More information

Linear Methods for Regression. Lijun Zhang

Linear Methods for Regression. Lijun Zhang Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived

More information

The lasso. Patrick Breheny. February 15. The lasso Convex optimization Soft thresholding

The lasso. Patrick Breheny. February 15. The lasso Convex optimization Soft thresholding Patrick Breheny February 15 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/24 Introduction Last week, we introduced penalized regression and discussed ridge regression, in which the penalty

More information

The lasso: some novel algorithms and applications

The lasso: some novel algorithms and applications 1 The lasso: some novel algorithms and applications Robert Tibshirani Stanford University ASA Bay Area chapter meeting Collaborations with Trevor Hastie, Jerome Friedman, Ryan Tibshirani, Daniela Witten,

More information

Lecture 24 May 30, 2018

Lecture 24 May 30, 2018 Stats 3C: Theory of Statistics Spring 28 Lecture 24 May 3, 28 Prof. Emmanuel Candes Scribe: Martin J. Zhang, Jun Yan, Can Wang, and E. Candes Outline Agenda: High-dimensional Statistical Estimation. Lasso

More information

Risk and Noise Estimation in High Dimensional Statistics via State Evolution

Risk and Noise Estimation in High Dimensional Statistics via State Evolution Risk and Noise Estimation in High Dimensional Statistics via State Evolution Mohsen Bayati Stanford University Joint work with Jose Bento, Murat Erdogdu, Marc Lelarge, and Andrea Montanari Statistical

More information

Learning with Sparsity Constraints

Learning with Sparsity Constraints Stanford 2010 Trevor Hastie, Stanford Statistics 1 Learning with Sparsity Constraints Trevor Hastie Stanford University recent joint work with Rahul Mazumder, Jerome Friedman and Rob Tibshirani earlier

More information

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly Robust Variable Selection Methods for Grouped Data by Kristin Lee Seamon Lilly A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables 1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor

More information

Day 4: Shrinkage Estimators

Day 4: Shrinkage Estimators Day 4: Shrinkage Estimators Kenneth Benoit Data Mining and Statistical Learning March 9, 2015 n versus p (aka k) Classical regression framework: n > p. Without this inequality, the OLS coefficients have

More information

Or How to select variables Using Bayesian LASSO

Or How to select variables Using Bayesian LASSO Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection

More information

Model comparison and selection

Model comparison and selection BS2 Statistical Inference, Lectures 9 and 10, Hilary Term 2008 March 2, 2008 Hypothesis testing Consider two alternative models M 1 = {f (x; θ), θ Θ 1 } and M 2 = {f (x; θ), θ Θ 2 } for a sample (X = x)

More information

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables LIB-MA, FSSM Cadi Ayyad University (Morocco) COMPSTAT 2010 Paris, August 22-27, 2010 Motivations Fan and Li (2001), Zou and Li (2008)

More information

Regularization Path Algorithms for Detecting Gene Interactions

Regularization Path Algorithms for Detecting Gene Interactions Regularization Path Algorithms for Detecting Gene Interactions Mee Young Park Trevor Hastie July 16, 2006 Abstract In this study, we consider several regularization path algorithms with grouped variable

More information

Nonconcave Penalized Likelihood with A Diverging Number of Parameters

Nonconcave Penalized Likelihood with A Diverging Number of Parameters Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized

More information

Compressed Sensing in Cancer Biology? (A Work in Progress)

Compressed Sensing in Cancer Biology? (A Work in Progress) Compressed Sensing in Cancer Biology? (A Work in Progress) M. Vidyasagar FRS Cecil & Ida Green Chair The University of Texas at Dallas M.Vidyasagar@utdallas.edu www.utdallas.edu/ m.vidyasagar University

More information