Statistics for high-dimensional data


1 Statistics for high-dimensional data Vincent Rivoirard Université Paris-Dauphine Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 1 / 105

2 Introduction High-dimensional data It has now become part of folklore to claim that the 21st century will be the century of data. Our society invests more and more in the collection and processing of data of all kinds (the big data phenomenon). Data now have a strong impact on almost every branch of human activity, including science, medicine, business and the humanities. In traditional statistics, we assumed we had many observations and a few well-chosen variables. In modern science, we collect more observations, but we also collect radically larger numbers of variables, consisting of thousands up to millions of features voraciously recorded on objects or individuals. Such data are said to be high-dimensional. Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 2 / 105

3 Introduction Examples of high-dimensional data Consumer preference data: websites gather information about the browsing and shopping behavior of consumers. For example, recommendation systems collect consumers' preferences on various products, together with some personal data (age, sex, location,...), and predict which products could be of interest for a given consumer. Traffic jams: many cities (for instance Boston) have developed programs to improve traffic, based on big data collection (crowdsourced data) and their analysis. Biotech data: recent technologies enable the acquisition of high-dimensional data on single individuals. For example, DNA microarrays measure the transcription level of thousands of genes simultaneously. Images and videos: large images or videos are continuously collected all around the world. Each image is made of thousands up to millions of pixels. Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 3 / 105

4 Introduction Characterization and problems of high-dimensional data The previous examples show that we are in the era of massive automatic data collection. In these examples, the number of variables or parameters p is much larger than the number of observations n. Being able to collect a large amount of information on each individual seems to be good news. Unfortunately, the mathematical and statistical reality clashes with this optimistic statement: separating the signal from the noise is a very hard task for high-dimensional data, in full generality impossible. Extracting the relevant information is more than challenging, like finding a needle in a haystack. This phenomenon is often called the curse of dimensionality, a terminology introduced by Richard Bellman. Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 4 / 105

5 Introduction Curse of dimensionality The volume V_p(r) of a p-dimensional ball of radius r for the Euclidean distance satisfies
V_p(r) ~ (2πer²/p)^(p/2) (pπ)^(−1/2) as p → +∞.
So, if (X^(i))_{i=1,...,n} are i.i.d. with uniform distribution on the hypercube [−0.5, 0.5]^p, then
P(∃ i ∈ {1,...,n} : X^(i) ∈ B_p(0, r)) ≤ n P(X^(1) ∈ B_p(0, r)) ≤ n V_p(r).
So if n = o(V_p(r)^(−1)), then the last probability goes to 0.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 5 / 105
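To make this concrete, here is a minimal R sketch (not from the slides; sample sizes, radius and number of replications are illustrative). It estimates by Monte Carlo the probability that at least one of n = 100 uniform points on [−0.5, 0.5]^p falls in the Euclidean ball B_p(0, 0.5); the probability collapses as p grows.
set.seed(1)
n <- 100
for (p in c(1, 2, 5, 10, 20)) {
  hits <- replicate(500, {
    X <- matrix(runif(n * p, min = -0.5, max = 0.5), nrow = n)  # n i.i.d. uniform points on [-0.5, 0.5]^p
    any(sqrt(rowSums(X^2)) <= 0.5)                              # is some X^(i) in B_p(0, 0.5)?
  })
  cat("p =", p, " estimated P(some X^(i) in B_p(0, 0.5)):", mean(hits), "\n")
}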

6 Introduction Curse of dimensionality Example: classical regression problem, i.e. estimation of the conditional expectation of a random variable. Data consist of n i.i.d. observations (Y_i, X^(i))_{i=1,...,n} with the same distribution as (Y, X) ∈ R × R^p. We wish to estimate the function m where E[Y | X] = m(X). We consider the Nadaraya-Watson estimate:
m̂(x) = Σ_{i=1}^n K_h(x − X^(i)) Y_i / Σ_{i=1}^n K_h(x − X^(i)), x ∈ R^p,
where K_h(x) = (1 / Π_{j=1}^p h_j) K(x_1/h_1, ..., x_p/h_p), h = (h_j)_{j=1,...,p},
and K is a kernel (with at least one vanishing moment), e.g.
K(x) = Π_{j=1}^p 1_{[−0.5, 0.5]}(x_j), K(x) = e^{−‖x‖²/2} / (2π)^{p/2}, K(x) = 1_{B_p(0,r)}(x) / Vol(B_p(0, r)).
We have to determine the tuning parameter h, which allows us to select the observations Y_i associated with the neighbors of x among the X^(i)'s.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 6 / 105
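As an illustration (a sketch, not part of the slides), the Nadaraya-Watson estimate with a Gaussian product kernel can be coded in a few lines of R; the function name nw_estimate and the bandwidth vector h are illustrative choices.
nw_estimate <- function(x, X, Y, h) {
  # x: evaluation point in R^p, X: n x p design matrix, Y: responses, h: bandwidth vector of length p
  W <- exp(-0.5 * colSums(((t(X) - x) / h)^2))  # Gaussian kernel weights K_h(x - X^(i)), up to a constant
  sum(W * Y) / sum(W)                           # weighted average of the Y_i's
}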

7 Introduction Curse of dimensionality We have to determine the tuning parameter h which allows us to select the observations Y_i associated with the neighbors of x among the X^(i)'s. Two problems: 1 We have no neighbor in high dimensions. 2 All the points are at a similar distance from one another. Illustration: assume that the coordinates of X are i.i.d. and x = (x_1, x_1, ..., x_1). Then
E[‖x − X‖_1] = E[Σ_{j=1}^p |x_j − X_j|] = p E[|x_1 − X_1|],
sd(‖x − X‖_1) = √Var(‖x − X‖_1) = √(p Var(|x_1 − X_1|)),
so the spread of the distances is negligible compared to their typical size when p is large. Hence any estimator based on a local averaging will fail.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 7 / 105

8 Introduction Curse of dimensionality [Figure: histograms of the l1-distance between x = (0.5, ..., 0.5) and n = 100 uniform random points on [0, 1]^p, for p = 1, 10, 100, 1000, ..., 10^5.] Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 8 / 105

9 Introduction Curse of dimensionality [Figure: histograms of the l2-distance between x = (0.5, ..., 0.5) ∈ R^p and n = 100 uniform random points on [0, 1]^p, for p = 1, 10, 100, 1000, ..., 10^5.] Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 9 / 105

10 Introduction Curse of dimensionality [Figure: histograms of the l∞-distance between x = (0.5, ..., 0.5) ∈ R^p and n = 100 uniform random points on [0, 1]^p, for p = 1, 10, 100, 1000, ..., 10^5.] Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 10 / 105
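Such histograms can be reproduced with a few lines of R (an illustrative sketch, not from the slides; shown for the l2-distance, while rowSums(abs(X - 0.5)) and apply(abs(X - 0.5), 1, max) give the l1 and l∞ versions):
set.seed(1)
n <- 100
par(mfrow = c(2, 3))
for (p in c(1, 10, 100, 1000, 1e4, 1e5)) {
  X <- matrix(runif(n * p), nrow = n)       # n uniform points on [0, 1]^p
  d <- sqrt(rowSums((X - 0.5)^2))           # l2-distance to x = (0.5, ..., 0.5)
  hist(d, main = paste("Histogram with p =", p), xlab = "d")
}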

11 Introduction Curse of dimensionality - Other problems Other strange phenomena in high dimensions: the multivariate standard Gaussian density is very flat, sup_{x ∈ R^p} f(x) = (2π)^{−p/2}; the diagonal of the hypercube [0, 1]^p is almost orthogonal to its edges. Other problems: accumulation of small fluctuations in many directions can produce a large global fluctuation; an accumulation of rare events may not be rare; computational complexity. Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 11 / 105

12 Introduction Circumventing the curse of dimensionality At first sight, the high-dimensionality of the data seems to be good news, but as explained previously, it is a major issue for extracting information. In light of the few examples described above, the situation may appear hopeless. Fortunately, high-dimensional data are not uniformly spread in R^p (for instance, pixel intensities of an image are not purely random and images have geometrical structures). Data are concentrated around low-dimensional structures (many variables have a negligible or even a null impact), but this low-dimensional structure is most of the time unknown. The goal of high-dimensional statistics is to identify these structures and to provide statistical procedures with a low computational complexity. Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 12 / 105

13 Introduction Take-home message Whereas classical statistics provides a very rich theory for analyzing data with a small number p of parameters and a large number n of observations, in many fields current data have different characteristics: a huge number p of parameters; a sample size n which is of the same order as p, or sometimes much smaller than p. The classical asymptotic analysis with p fixed and n going to +∞ does not make sense anymore. We must change our point of view. We face the curse of dimensionality. Fortunately, the useful information usually concentrates around low-dimensional structures (that have to be identified), which allows us to circumvent the curse of dimensionality. Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 13 / 105

14 Introduction References
- Bühlmann, P. and van de Geer, S. Statistics for high-dimensional data. Methods, theory and applications. Springer Series in Statistics. Springer, Heidelberg.
- Donoho, D. High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality. American Math. Society Math Challenges of the 21st Century.
- Giraud, C. Introduction to high-dimensional statistics. Monographs on Statistics and Applied Probability, 139. CRC Press, Boca Raton, FL.
- Härdle, W., Kerkyacharian, G., Picard, D. and Tsybakov, A. Wavelets, approximation, and statistical applications. Lecture Notes in Statistics, 129. Springer-Verlag, New York.
- Hastie, T., Tibshirani, R. and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York.
- Hastie, T., Tibshirani, R. and Wainwright, M. Statistical learning with sparsity. The lasso and generalizations. Monographs on Statistics and Applied Probability, 143. CRC Press, Boca Raton, FL.
- Tibshirani, R. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58, no. 1, 1996.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 14 / 105

15 Introduction Overview of the course The goal of this course is to present modern statistical tools and some of their theoretical properties for estimation in the high-dimensional setting, including:
1 Wavelets and thresholding rules.
2 Penalized estimators: model selection procedures, Ridge and Lasso estimates.
3 Generalizations and variations of the Lasso estimate: Group-Lasso, Fused-Lasso, elastic net and Dantzig selectors. Links with Bayesian rules.
4 Statistical properties of Lasso estimators: study in the classical regression model. Extensions to the generalized linear model.
I shall concentrate on simple settings in order to avoid unessential technical details.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 15 / 105

16 Introduction What is (unfortunately) not mentioned and notations Due to lack of time or skill, I won't speak about some important themes (fortunately, some of them will be dealt with by Franck or Tristan):
1 Optimization aspects
2 Matrix completion
3 Testing approaches
4 Graphical models
5 Multivariate methods (sparse PCA, etc.)
6 Classification methods
Notations: n: size of observations. p: dimension of the involved unknown quantity. ‖·‖_q: l_q-norm in R^p; for short, if there is no ambiguity, ‖·‖ = ‖·‖_2. For any vector β, ‖β‖_0 = card{j : β_j ≠ 0}.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 16 / 105


20 Introduction
1 Introduction
2 Model selection
3 From Ridge estimate to Lasso estimate
4 Generalized linear models and related models
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 17 / 105

21 Model selection Chapter 1: Model selection Contents of the chapter:
1 Linear regression setting
2 Sparsity and oracle approach
3 Model selection procedures
4 Take-home message
5 References
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 18 / 105

22 Model selection Linear regression setting Consider the linear regression model Y = Xβ + ɛ, with:
Y = (Y_i)_{i=1,...,n} a vector of observations (response variable),
X = (X_ij)_{i=1,...,n, j=1,...,p} a known n × p matrix,
β = (β_j)_{j=1,...,p} an unknown vector,
ɛ = (ɛ_i)_{i=1,...,n} the vector of errors.
It is assumed that E[ɛ] = 0, Var(ɛ) = σ² I_n and σ² is known. The columns of X, denoted X_j, are the explanatory variables or predictors.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 19 / 105

23 Model selection Linear regression setting The regression model can be rewritten as Y = Σ_{j=1}^p β_j X_j + ɛ. Several problems can be investigated: the estimation problem (estimate β); the prediction problem (estimate Xβ); the selection problem (determine the non-zero coordinates of β). Why linear regression? It models various concrete situations. It is simple to use from the mathematical point of view. It allows us to introduce and present new methodologies. Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 20 / 105

24 Model selection Classical estimation We naturally estimate β by considering the ordinary least squares estimate defined by
ˆβ_ols := argmin_{β ∈ R^p} ‖Y − Xβ‖².
Proposition. If rank(X) = p, then ˆβ_ols = (X^T X)^{−1} X^T Y and
E[ˆβ_ols] = β, Var(ˆβ_ols) = σ² (X^T X)^{−1}.
Furthermore,
E[‖ˆβ_ols − β‖²] = σ² Tr((X^T X)^{−1}).
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 21 / 105
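A quick numerical check of these formulas in R (a sketch with simulated data; all names and dimensions are illustrative):
set.seed(1)
n <- 50; p <- 5; sigma <- 1
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1, 0, 0, 3)
Y <- X %*% beta + sigma * rnorm(n)
beta_ols <- solve(crossprod(X), crossprod(X, Y))   # (X^T X)^{-1} X^T Y
sigma^2 * sum(diag(solve(crossprod(X))))           # theoretical risk sigma^2 Tr((X^T X)^{-1})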

25 Model selection Classical estimation Lemma. We have, for any matrix A with n columns:
E[‖Aɛ‖²] = σ² Tr(A A^T).
Some remarks: rank(X) = p implies p ≤ n. If the predictors are orthonormal,
E[‖ˆβ_ols − β‖²] = p σ²,
which may be large in high dimensions. Up to now, structural assumptions are very mild. In the sequel, we shall first consider sparsity assumptions.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 22 / 105

26 Model selection Sparsity Loosely speaking, a sparse statistical model is a model in which only a relatively small number of parameters play an important role. In the regression model
Y = Σ_{j=1}^p β_j X_j + ɛ,
we assume that m*, the support of β, is small, with
m* = { j ∈ {1,...,p} : β_j ≠ 0 }.
Note that m* is unknown. In general, ˆβ_ols is not sparse. Model selection is a natural approach to select a good estimator in this setting. We describe and study this methodology in the oracle approach.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 23 / 105

27 Model selection Oracle approach We now consider the prediction risk and set f = Xβ ∈ R^n, the unknown vector of interest. So, we have:
Y = f + ɛ.   (2.1)
If m* were known, a natural estimate of f would be
ˆf_{m*} = Π_{S*} Y,
with Π_{S*} : R^n → R^n the projection matrix on S* and S* = span(X_j : j ∈ m*). Note that if ɛ ~ N(0, σ² I_n), then ˆf_{m*} is the maximum likelihood estimate in the model (2.1) under the constraint that the estimate of f belongs to S*. Of course m* is unknown and ˆf_{m*} cannot be used.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 24 / 105

28 Model selection Oracle approach For any model m ⊂ {1,...,p}, we set ˆf_m = Π_{S_m} Y, with Π_{S_m} : R^n → R^n the projection matrix on S_m and S_m = span(X_j : j ∈ m). With a slight abuse, we also call S_m a model. Given M, a collection of models, we wish to select ˆm ∈ M such that the risk of ˆf_ˆm is as small as possible. We introduce the oracle model m_0 as
m_0 := argmin_{m ∈ M} E[‖ˆf_m − f‖²].
ˆf_{m_0} is called the (pseudo) oracle estimate. More precisely, we wish to select ˆm ∈ M such that
E[‖ˆf_ˆm − f‖²] ≈ E[‖ˆf_{m_0} − f‖²].
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 25 / 105

29 Model selection Oracle approach Oracle model:
m_0 := argmin_{m ∈ M} E[‖ˆf_m − f‖²].
Some remarks: m* may be different from m_0. We may have f ∉ S_{m_0} and even f ∉ ∪_{m ∈ M} S_m. The oracle model m_0 is not random but depends on β. So, it cannot be used in practice.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 26 / 105

30 Model selection Model selection procedure Our approach is based on the minimization of R(ˆf_m) over M, with
R(ˆf_m) := E[‖ˆf_m − f‖²].
The following lemma, based on the simple bias-variance decomposition, gives an explicit expression of R(ˆf_m). We denote d_m := dim(S_m).
Lemma. We have:
R(ˆf_m) = ‖(I_n − Π_{S_m}) f‖² + σ² d_m.
The first term is a bias term which decreases when m increases, whereas the second term (a variance term) increases when m increases. The oracle model m_0 is the model which achieves the best trade-off between these two terms.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 27 / 105

31 Model selection Mallows C_p Mallows' recipe: since we wish to minimize m ↦ R(ˆf_m), it is natural to choose ˆm as the minimizer of an estimate of R(ˆf_m). We denote the latter ˆR_m; it will be based on ‖ˆf_m − Y‖² (replacing f with Y and removing the expectation). The following lemma gives the last ingredient of the recipe.
Lemma. We have:
E[‖ˆf_m − Y‖²] = R(ˆf_m) − σ² (2 d_m − n).
Using the lemma, an unbiased estimate of R(ˆf_m) is given by ‖ˆf_m − Y‖² + σ² (2 d_m − n). It leads to the model selection procedure based on the minimization of Mallows' criterion defined by:
C_p(m) := ‖ˆf_m − Y‖² + 2 σ² d_m.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 28 / 105

32 Model selection Mallows C_p
Definition. Mallows' estimate of f is ˆf := ˆf_ˆm with
ˆm = argmin_{m ∈ M} C_p(m), C_p(m) := ‖ˆf_m − Y‖² + 2 σ² d_m.
Assumptions are very mild; in particular the Mallows criterion is distribution-free. It is a very popular criterion. Being only based on unbiased estimation, this approach does not take into account fluctuations of C_p(m) around its expectation. The larger M, the larger the probability to have min_{m ∈ M} C_p(m) far from min_{m ∈ M} R(ˆf_m) + σ² n. In particular, we may have, for some m ∈ M,
C_p(m) much smaller than R(ˆf_m) + σ² n and C_p(m) < C_p(m_0), while R(ˆf_m) > R(ˆf_{m_0}).
The last situation occurs when we have a large number of models for each dimension. Then ˆm is much larger than m_0, leading to overfitting. This is the main drawback of Mallows' C_p.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 29 / 105
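A minimal R sketch of Mallows' C_p (simulated data, σ² assumed known as in the slides, and the nested collection M = {{1,...,J}, J = 1,...,p} for simplicity):
set.seed(1)
n <- 100; p <- 10; sigma <- 1
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, 2, 1, rep(0, p - 3))
Y <- X %*% beta + sigma * rnorm(n)
Cp <- sapply(1:p, function(J) {
  XJ <- X[, 1:J, drop = FALSE]
  fhat <- XJ %*% solve(crossprod(XJ), crossprod(XJ, Y))  # projection of Y on S_m
  sum((Y - fhat)^2) + 2 * sigma^2 * J                    # C_p(m) = ||fhat_m - Y||^2 + 2 sigma^2 d_m
})
which.min(Cp)   # selected model size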

33 Model selection Other popular criteria When the distribution of the observations is known, we can consider the AIC and BIC criteria, which are based on the likelihood. For any model m ∈ M, we set L(m) as the maximum of the log-likelihood on S_m. We still consider
ˆm := argmin_{m ∈ M} C(m),
with, for the Akaike Information Criterion (AIC),
C(m) = −L(m) + d_m,
and for the Bayesian Information Criterion (BIC),
C(m) = −L(m) + (log n / 2) d_m.
In the Gaussian setting, AIC and Mallows' C_p are equivalent. The use of BIC tends to prevent overfitting (larger penalty).
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 30 / 105

34 Model selection Penalization for Gaussian regression We assume ɛ ~ N(0, σ² I_n). Mallows' approach shows that for l2 estimation, a criterion of the form
C(m) = ‖ˆf_m − Y‖² + σ² pen(m)
is suitable, with pen, called the penalty, generalizing Mallows' choice pen(m) = 2 d_m. We now investigate good choices of penalties. The penalty has to depend on M. Recall our benchmark: the oracle risk R(ˆf_{m_0}) with
m_0 := argmin_{m ∈ M} R(ˆf_m), R(ˆf_m) := E[‖ˆf_m − f‖²].
We wish R(ˆf) ≈ R(ˆf_{m_0}). We have R(ˆf_m) = ‖(I_n − Π_{S_m}) f‖² + σ² d_m.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 31 / 105

35 Model selection Penalty Since for any m ∈ M, C(ˆm) ≤ C(m), we have:
‖f − ˆf_ˆm‖² + 2⟨ɛ, f − ˆf_ˆm⟩ + σ² pen(ˆm) ≤ ‖f − ˆf_m‖² + 2⟨ɛ, f − ˆf_m⟩ + σ² pen(m).
Taking expectations, since pen(m) is deterministic,
R(ˆf) ≤ R(ˆf_m) [I] + 2 E[⟨ɛ, f − ˆf_m⟩] [II] + σ² pen(m) [III] + E[2⟨ɛ, ˆf − f⟩ − σ² pen(ˆm)] [IV].
Each term can be analyzed:
I is ok.
II := E[⟨ɛ, f − ˆf_m⟩] = E[⟨ɛ, f − Π_{S_m} Y⟩] = −E[‖Π_{S_m} ɛ‖²] = −σ² d_m ≤ 0.
The function pen(·) has to be large enough so that IV is negligible, but small enough to have III := σ² pen(m) of the order of R(ˆf_m) at most. Then,
R(ˆf) ≲ inf_{m ∈ M} R(ˆf_m) + negligible term.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 32 / 105

36 Model selection Analysis of the fourth term For any 0 < δ < 1, with S̄_ˆm = span(S_ˆm, f),
2⟨ɛ, ˆf − f⟩ = 2⟨Π_{S̄_ˆm} ɛ, ˆf − f⟩ ≤ δ^{−1} ‖Π_{S̄_ˆm} ɛ‖² + δ ‖ˆf − f‖².
And, with χ²(m) := ‖Π_{S̄_m}(σ^{−1} ɛ)‖²,
IV := E[2⟨ɛ, ˆf − f⟩ − σ² pen(ˆm)]
≤ δ^{−1} σ² E[χ²(ˆm) − δ pen(ˆm)] + δ R(ˆf)
≤ δ^{−1} σ² E[max_{m ∈ M} {χ²(m) − δ pen(m)}] + δ R(ˆf)
≤ δ^{−1} σ² Σ_{m ∈ M} E[(χ²(m) − δ pen(m))_+] + δ R(ˆf).
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 33 / 105

37 Model selection Penalty
Definition. To the collection of models M, we associate weights (π_m)_{m ∈ M} such that 0 < π_m ≤ 1 and Σ_{m ∈ M} π_m = 1. Then, for any constant K > 1, we set
pen(m) := K (√d_m + √(2 log(π_m^{−1})))².   (2.2)
If K > 1, taking e.g. δ = K^{−1}, concentration inequalities lead to
IV ≤ C(K) σ² + K^{−1} R(ˆf),
III := σ² pen(m) ≤ 2K σ² d_m + 4K σ² log(π_m^{−1}) ≤ 2K R(ˆf_m) + 4K σ² log(π_m^{−1}).
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 34 / 105

38 Model selection Theoretical result
Theorem (Birgé and Massart). We consider the linear regression model Y = f + ɛ and assume that ɛ ~ N(0, σ² I_n), with σ² known. Given K > 1, we define the penalty function as in (2.2) and estimate f with ˆf = ˆf_ˆm such that
ˆm := argmin_{m ∈ M} { ‖ˆf_m − Y‖² + σ² pen(m) }.
Then, there exists C_K > 0, only depending on K, such that
E[‖ˆf − f‖²] ≤ C_K min_{m ∈ M} { E[‖ˆf_m − f‖²] + σ² log(π_m^{−1}) + σ² }.
If log(π_m^{−1}) ≤ α d_m, then ˆf achieves the same risk as the oracle. Mallows' C_p will be suitable if there exists K > 1 such that K(√d_m + √(2 log(π_m^{−1})))² ≤ 2 d_m.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 35 / 105

39 Model selection First illustration with the full collection of models We first consider the case where M = P({1,...,p}). We can take
π_m = (e − 1)/(1 − e^{−p}) · d_m! (p − d_m)! / p! · e^{−d_m}.
Then
log(π_m^{−1}) ≲ d_m log(p / d_m) ≤ d_m log(p),
and
E[‖ˆf − f‖²] ≲ log(p) · min_{m ∈ M} E[‖ˆf_m − f‖²].
The log(p) term is unavoidable. We can prove that by taking K < 1, we select a very big model, leading to overfitting. This is the reason why Mallows' C_p is not suitable in this case.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 36 / 105

40 Model selection Second illustration with a poor collection of models We now consider the case where M = {{1,...,J}, 1 ≤ J ≤ p}. We can take, for any constant α > 0,
π_m = (e^α − 1)/(1 − e^{−αp}) · e^{−α d_m}.
Then
log(π_m^{−1}) ≤ α d_m + const.
Since
E[‖ˆf_m − f‖²] = ‖(I_n − Π_{S_m}) f‖² + σ² d_m,
we have
E[‖ˆf − f‖²] ≲ min_{m ∈ M} E[‖ˆf_m − f‖²].
Under convenient choices of α and K > 1, we have pen(m) ≤ 2 d_m. Therefore, Mallows' C_p is suitable in this case. The choice K < 1 leads to overfitting.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 37 / 105

41 Model selection Pros and cons of model selection Under a convenient choice of penalty (based on concentration inequalities), the model selection methodology is able to select the best predictors to explain a response variable by only using data. The model selection methodology (due to Birgé and Massart) has been presented in the Gaussian linear regression setting, but it can be extended to other settings: density estimation, Markov models, counting processes, segmentation, classification, etc. It is based on the minimization of a penalized l2-criterion over a collection of models. Note that if M = P({1,...,p}), then card(M) = 2^p. When p is large, this approach is intractable due to a prohibitive computational complexity (2^20 > 10^6). Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 38 / 105

42 Model selection The orthogonal case Assume that the matrix X is orthogonal: X^T X = I_p. We have d_m := dim(S_m) = card(m). Consider a penalty proportional to d_m: pen(m) = 2K d_m log(p). Then, since
ˆf_m = Π_{S_m} Y = Σ_{j ∈ m} ˆβ_j X_j, ˆβ_j := X_j^T Y,
we obtain:
ˆm := argmin_{m ∈ M} { ‖ˆf_m − Y‖² + σ² pen(m) }
   = argmin_{m ∈ M} { −Σ_{j ∈ m} ˆβ_j² + 2K σ² card(m) log(p) }
   = argmin_{m ∈ M} { Σ_{j ∈ m} (2K σ² log(p) − ˆβ_j²) }.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 39 / 105

43 Model selection The orthogonal case and M = P({1,...,p}) In this case, we have:
ˆm = { j ∈ {1,...,p} : |ˆβ_j| > σ √(2K log(p)) }
and
ˆf = ˆf_HT,K := Σ_{j=1}^p ˆβ_j 1_{|ˆβ_j| > σ√(2K log(p))} X_j.
Model selection corresponds to hard thresholding and implementation is easy. Assume that f = 0. Consider Mallows' C_p, BIC and hard thresholding alternatively. The first two are overfitting procedures: if p → +∞,
1 with pen(m) = 2 d_m, E[card(ˆm_Mallows)] ≈ 0.16 p,
2 if n = p and pen(m) = log(n) d_m, E[card(ˆm_BIC)] ≈ √(2p / (π log(p))),
3 if K > 1, P(ˆf_HT,K ≠ 0) = o(1).
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 40 / 105
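In R, the corresponding hard-thresholding rule is a one-liner (a sketch assuming an orthonormal design so that ˆβ_j = X_j^T Y; the default K = 1.1 is an illustrative choice):
hard_threshold <- function(beta_hat, sigma, p, K = 1.1) {
  thresh <- sigma * sqrt(2 * K * log(p))       # threshold sigma * sqrt(2 K log p)
  beta_hat * (abs(beta_hat) > thresh)          # keep only the large coefficients
}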

44 Model selection Take-home message This chapter presents, in the Gaussian linear setting, the model selection methodology, which consists in minimizing an l0-penalized criterion. Such procedures are very popular in the moderately large dimension setting and can be extended to many statistical models. Using concentration inequalities, penalties can be designed to obtain adaptive and optimal procedures in the oracle setting and to outperform classical procedures such as AIC, BIC and Mallows' C_p. When p is large and the model collection is rich, this approach may be intractable due to a prohibitive computational complexity. Alternatives have to be developed in very high dimensions. Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 41 / 105

45 Model selection References
- Birgé, Lucien and Massart, Pascal. Gaussian model selection. J. Eur. Math. Soc. (JEMS) 3, no. 3, 203-268, 2001.
- Giraud, Christophe. Introduction to high-dimensional statistics. Monographs on Statistics and Applied Probability, 139. CRC Press, Boca Raton, FL.
- Mallows, Colin. Some Comments on C_p. Technometrics, 15, 1973.
- Massart, Pascal. Concentration inequalities and model selection. Springer.
- Verzelen, Nicolas. Minimax risks for sparse regressions: ultra-high dimensional phenomenons. Electron. J. Stat., 6, 38-90, 2012.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 42 / 105

46 From Ridge estimate to Lasso estimate Chapter 2: From Ridge estimate to Lasso estimate
1 The Ridge estimate
2 The Bridge estimate
3 The Lasso
(a) General study of the Lasso
(b) The orthogonal case
(c) Tuning the Lasso - Cross-validation - Degrees of freedom
(d) Generalizations of the Lasso - The Dantzig selector - The Adaptive Lasso - The Relaxed Lasso - The Square-root Lasso - The Elastic net - The Fused Lasso - The Group Lasso - The Hierarchical group Lasso - The Bayesian Lasso
(e) Theoretical guarantees - Support recovery - Prediction risk bound
4 Take-home message
5 References
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 43 / 105

47 From Ridge estimate to Lasso estimate Ridge estimates We still consider the linear regression model Y = Xβ + ɛ, with E[ɛ] = 0, Var(ɛ) = σ² I_n and σ² known. If rank(X) = p, then ˆβ_ols = (X^T X)^{−1} X^T Y, which satisfies
E[ˆβ_ols] = β, Var(ˆβ_ols) = σ² (X^T X)^{−1}, E[‖ˆβ_ols − β‖²] = σ² Tr((X^T X)^{−1}).
In high dimensions, the matrix X^T X can be ill-conditioned (i.e. may have small eigenvalues), leading to coordinates of ˆβ_ols with large variance. To overcome this problem while preserving linearity, we modify the OLS estimate and set
ˆβ_λ^ridge = (X^T X + λ I_p)^{−1} X^T Y, λ > 0.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 44 / 105

48 From Ridge estimate to Lasso estimate Ridge estimates Since ˆβ_λ^ridge = (X^T X + λ I_p)^{−1} X^T Y, λ > 0, the tuning parameter λ balances the bias and variance terms:
‖E[ˆβ_λ^ridge] − β‖² = λ² β^T (X^T X + λ I_p)^{−2} β,
E[‖ˆβ_λ^ridge − E[ˆβ_λ^ridge]‖²] = σ² Σ_{j=1}^p μ_j / (μ_j + λ)²,
with (μ_j)_{j=1,...,p} the eigenvalues of X^T X.
Pros and cons: We can consider very high dimensions: p much larger than n. Linearity: easy to compute for many (?) problems. The choice of the regularization parameter λ is sensitive. Automatic selection is not possible.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 45 / 105
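For a given λ, the Ridge estimate is directly computable in R (a minimal sketch; glmnet with alpha = 0, used later in these slides, fits the same kind of estimator over a grid of λ with its own parametrization):
ridge_estimate <- function(X, Y, lambda) {
  p <- ncol(X)
  solve(crossprod(X) + lambda * diag(p), crossprod(X, Y))  # (X^T X + lambda I_p)^{-1} X^T Y
}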

49 From Ridge estimate to Lasso estimate Bridge estimates
Definition. For λ ≥ 0 and γ ≥ 0, we set:
C_{λ,γ}(β) := ‖Y − Xβ‖² + λ ‖β‖_γ^γ,
with
‖β‖_γ^γ = Σ_{j=1}^p |β_j|^γ if γ > 0, and ‖β‖_0^0 = Σ_{j=1}^p 1_{β_j ≠ 0} if γ = 0,
and
ˆβ_{λ,γ} := argmin_{β ∈ R^p} C_{λ,γ}(β).   (3.1)
Three interesting cases (λ > 0):
1 γ = 0: model selection
2 γ = 2: Ridge estimation
3 γ = 1: Lasso estimation
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 46 / 105

50 From Ridge estimate to Lasso estimate Bridge estimates Assume that γ = 0. Then the bridge estimate exists if X is one-to-one. Assume that γ > 0. Then the bridge estimate exists. If 0 < γ < 1, then C_{λ,γ} is not convex and it may be very hard to minimize in high dimensions. Assume that γ = 1. The penalized criterion C_{λ,1} is then convex, and C_{λ,1} has one minimizer if X is one-to-one. Assume that γ > 1. The penalized criterion C_{λ,γ} is then strictly convex and has only one minimizer. Almost surely, all coordinates of the bridge estimate are non-zero. For γ ≥ 1, there is a one-to-one correspondence between the Lagrangian problem
ˆβ_{λ,γ} := argmin_{β ∈ R^p} C_{λ,γ}(β), C_{λ,γ}(β) := ‖Y − Xβ‖² + λ ‖β‖_γ^γ,
and the constrained problem
argmin_{β ∈ R^p : ‖β‖_γ^γ ≤ t} ‖Y − Xβ‖².
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 47 / 105

51 From Ridge estimate to Lasso estimate Bridge estimates [Figure: constraint regions {β : ‖β‖_γ^γ ≤ 1} for different values of γ (including γ = 0.5, 1, 2, 6, ∞). The region is convex if and only if γ ≥ 1.] Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 48 / 105

52 From Ridge estimate to Lasso estimate Graphical illustration for p = 2 We take a fixed 2 × 2 matrix X^T X and t = 1. Note that
‖Y − Xβ‖² = (β − ˆβ_ols)^T X^T X (β − ˆβ_ols) + ‖Y − X ˆβ_ols‖²,
and the constrained problem becomes
argmin_{β ∈ R^p : ‖β‖_γ^γ ≤ t} (β − ˆβ_ols)^T X^T X (β − ˆβ_ols).
We compare the Ridge estimate
ˆβ_λ^ridge := argmin_{β ∈ R^p : ‖β‖² ≤ t} (β − ˆβ_ols)^T X^T X (β − ˆβ_ols)
and the Lasso estimate
ˆβ_λ^lasso := argmin_{β ∈ R^p : ‖β‖_1 ≤ t} (β − ˆβ_ols)^T X^T X (β − ˆβ_ols).
Of course both estimates are close (for the same values of t) but, depending on ˆβ_ols, the Lasso estimate may have null coordinates.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 49 / 105

53 From Ridge estimate to Lasso estimate Graphical illustration for p = 2 [Figure: level sets of the quadratic criterion centered at ˆβ_ols = (X^T X)^{−1} X^T Y, together with the l1 (Lasso) and l2 (Ridge) constraint regions.] Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 50 / 105

54 From Ridge estimate to Lasso estimate Graphical illustration for p = 2 [Figure: the same illustration for another position of ˆβ_ols = (X^T X)^{−1} X^T Y.] Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 51 / 105

55 From Ridge estimate to Lasso estimate Specific study of the case γ = 1 (the Lasso) The Lasso, proposed by Tibshirani (1996), is the bridge estimate with γ = 1:
ˆβ_λ^lasso := argmin_{β ∈ R^p} { ‖Y − Xβ‖² + λ ‖β‖_1 }.
It has two specific properties:
1 It is obtained from the minimization of a convex criterion (so, with low computational cost).
2 It may provide sparse solutions if the tuning parameter λ (resp. t) is large (resp. small) enough, and allows for automatic selection.
Theorem (Characterization of the Lasso). A vector ˆβ ∈ R^p is a global minimizer of C_{λ,1} if and only if ˆβ satisfies the following conditions: for any j,
if ˆβ_j ≠ 0, 2 X_j^T (Y − X ˆβ) = λ sign(ˆβ_j);
if ˆβ_j = 0, |2 X_j^T (Y − X ˆβ)| ≤ λ.
Furthermore, ˆβ is the unique minimizer if X_E is one-to-one, with E := { j : |2 X_j^T (Y − X ˆβ)| = λ }.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 52 / 105

56 From Ridge estimate to Lasso estimate Specific study of the case γ = 1 (the Lasso) Sketch of the proof:
1 For a convex function f, ˆβ is a minimizer of f if and only if 0 ∈ ∂f(ˆβ), with
∂f(ˆβ) := { s ∈ R^p : f(y) ≥ f(ˆβ) + ⟨s, y − ˆβ⟩ for all y }.
2 If f is differentiable at ˆβ, ∂f(ˆβ) = {∇f(ˆβ)}.
3 If f(β) = ‖β‖_1,
∂f(ˆβ) = { g ∈ R^p : ‖g‖_∞ ≤ 1, ⟨g, ˆβ⟩ = ‖ˆβ‖_1 }.
Note that Ŝ := { j : ˆβ_j ≠ 0 } ⊂ E. So if X_Ŝ is one-to-one and, for all j ∉ Ŝ, |2 X_j^T (Y − X ˆβ)| < λ, then we have uniqueness. Indeed, in this case, Ŝ = E. If ˆβ and β̃ are two global minimizers of C_{λ,1}, then X ˆβ = X β̃ and ‖ˆβ‖_1 = ‖β̃‖_1.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 53 / 105

57 From Ridge estimate to Lasso estimate The orthogonal case Assume that the matrix X is orthogonal: X^T X = I_p. Then
ˆβ_λ^lasso := argmin_{β ∈ R^p} { ‖Y − Xβ‖² + λ ‖β‖_1 } = argmin_{β ∈ R^p} Σ_{j=1}^p ( β_j² − 2 (X_j^T Y) β_j + λ |β_j| ).
Orthogonality allows for a coordinatewise study of the minimization problem. Straightforward computations lead to
ˆβ_{λ,j}^lasso = sign(X_j^T Y) ( |X_j^T Y| − λ/2 )_+
             = X_j^T Y − λ/2 if X_j^T Y ≥ λ/2,
             = 0 if |X_j^T Y| ≤ λ/2,
             = X_j^T Y + λ/2 if X_j^T Y ≤ −λ/2.
The LASSO (Least Absolute Shrinkage and Selection Operator) procedure corresponds to a soft thresholding algorithm.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 54 / 105
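The corresponding soft-thresholding operator in R (a one-line sketch for the orthogonal case above, with a_j = X_j^T Y):
soft_threshold <- function(a, lambda) sign(a) * pmax(abs(a) - lambda / 2, 0)
# Example: soft_threshold(c(-3, -0.2, 0.5, 2), lambda = 1) returns -2.5 0.0 0.0 1.5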

58 From Ridge estimate to Lasso estimate The orthogonal case - Comparison We assume that the matrix X is orthogonal: X^T X = I_p. We compare (with a_j := X_j^T Y):
The OLS estimate: ˆβ_j^ols = a_j.
The Ridge estimate (γ = 2): ˆβ_{λ,j}^ridge = (1 + λ)^{−1} a_j.
The Lasso estimate or soft-thresholding rule (γ = 1): ˆβ_{λ,j}^lasso = sign(a_j) (|a_j| − λ/2)_+.
The model selection estimate or hard-thresholding rule (γ = 0): ˆβ_{λ,j}^m.s. = a_j 1_{|a_j| > √λ}.
[Figure: comparison of the 4 estimates for the orthogonal case with λ = 1.]
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 55 / 105

59 From Ridge estimate to Lasso estimate Tuning the Lasso - V-fold cross-validation We write the model Y_i = x_i^T β + ɛ_i, i = 1,...,n, with x_i ∈ R^p and the ɛ_i i.i.d. with E[ɛ_i] = 0, Var(ɛ_i) = σ². For a number V, we split the training pairs into V parts (or "folds"). Commonly, V = 5 or V = 10. V-fold cross-validation considers training on all but the k-th part, and then validating on the k-th part, iterating over k = 1,...,V. When V = n, we call this leave-one-out cross-validation, because we leave out one data point at a time. Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 56 / 105

60 From Ridge estimate to Lasso estimate Tuning the Lasso - V-fold cross-validation
1 Choose V and a discrete set Λ of possible values for λ.
2 Split the training set {1,...,n} into V subsets B_1,...,B_V of roughly the same size.
3 For each value of λ ∈ Λ and for k = 1,...,V, compute the estimate ˆβ_λ^(−k) on the training set ((x_i, Y_i), i ∈ B_l, l ≠ k) and record the total error on the validation set B_k:
e_k(λ) := (1 / card(B_k)) Σ_{i ∈ B_k} ( Y_i − x_i^T ˆβ_λ^(−k) )².
4 Compute the average error over all folds,
CV(λ) := (1/V) Σ_{k=1}^V e_k(λ).
5 Choose the value of the tuning parameter that minimizes this function CV on Λ:
ˆλ := argmin_{λ ∈ Λ} CV(λ).
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 57 / 105
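In practice, this cross-validation scheme for the Lasso is implemented in cv.glmnet (the glmnet package is used later in these slides); a sketch on simulated data, where all sizes are illustrative:
library(glmnet)
set.seed(1)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
Y <- as.vector(X[, 1:3] %*% c(2, -1, 1) + rnorm(n))
cv.out <- cv.glmnet(x = X, y = Y, alpha = 1, nfolds = 10)  # V = 10 folds
cv.out$lambda.min                                          # lambda minimizing the CV error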

61 From Ridge estimate to Lasso estimate Tuning the Lasso - Degrees of freedom We write the model Y_i = x_i^T β + ɛ_i, with ɛ_i i.i.d. N(0, σ²), i = 1,...,n.
Definition (Efron (1986)). The degrees of freedom of a function g : R^n → R^n with coordinates g_i is defined by
df(g) = (1/σ²) Σ_{i=1}^n cov(g_i(Y), Y_i).
The degrees of freedom may be viewed as the true number of independent pieces of information on which an estimate is based. Example with rank(X) = p: we estimate Xβ with g(Y) = X (X^T X)^{−1} X^T Y, and
df(g) = σ^{−2} Σ_{i=1}^n E[ x_i^T (X^T X)^{−1} X^T ɛ · ɛ_i ] = p.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 58 / 105

62 From Ridge estimate to Lasso estimate Tuning the Lasso - Degrees of freedom Efron's degrees of freedom is the main ingredient to generalize Mallows' C_p in high dimensions:
Proposition. Let ˆβ be an estimate of β. If
C_p := ‖Y − X ˆβ‖² − n σ² + 2 σ² df(X ˆβ),
then we have: E[C_p] = E[‖X ˆβ − X β‖²].
Assume that for any λ > 0 we have d̂f, an estimate of df(X ˆβ_λ), where ˆβ_λ is the Lasso estimate associated with λ. Then we can choose λ by minimizing
λ ↦ ‖Y − X ˆβ_λ‖² + 2 σ² d̂f.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 59 / 105

63 From Ridge estimate to Lasso estimate Tuning the Lasso - Degrees of freedom
Theorem (Zou, Hastie and Tibshirani (2007)). Assume rank(X) = p. Then, with Ŝ_λ := { j : ˆβ_{λ,j} ≠ 0 }, we have
E[card(Ŝ_λ)] = df(X ˆβ_λ).
Theorem (Tibshirani and Taylor (2012)). With
E_λ := { j : |2 X_j^T (Y − X ˆβ_λ)| = λ },
we have
E[rank(X_{E_λ})] = df(X ˆβ_λ), E[rank(X_{Ŝ_λ})] = df(X ˆβ_λ).
This gives three possible estimates for df(X ˆβ_λ).
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 60 / 105
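A sketch (simulated data; σ assumed known; glmnet's default intercept kept for simplicity) of tuning the Lasso by the C_p-type criterion of the previous slide, with df(X ˆβ_λ) estimated by the number of non-zero coefficients as in the first theorem above:
library(glmnet)
set.seed(1)
n <- 100; p <- 50; sigma <- 1
X <- matrix(rnorm(n * p), n, p)
Y <- as.vector(X[, 1:3] %*% c(2, -1, 1) + sigma * rnorm(n))
fit <- glmnet(x = X, y = Y, alpha = 1)
rss <- colSums((Y - predict(fit, newx = X))^2)     # ||Y - X beta_hat_lambda||^2 along the path
cp  <- rss - n * sigma^2 + 2 * sigma^2 * fit$df    # fit$df = number of non-zero coefficients
fit$lambda[which.min(cp)]                          # selected lambda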

64 From Ridge estimate to Lasso estimate Illustration on real data Analysis of the famous prostate data, which record the prostate specific antigen, the cancer volume, the prostate weight, the age, the benign prostatic hyperplasia amount, the seminal vesicle invasion, the capsular penetration, the Gleason score and the percentage of Gleason scores 4 or 5, for n = 97 patients.
install.packages("ElemStatLearn")
install.packages("glmnet")
library(glmnet)
data("prostate", package = "ElemStatLearn")
Y = prostate$lpsa
X = as.matrix(prostate[, !(names(prostate) %in% c("lpsa", "train"))])
ridge.out = glmnet(x = X, y = Y, alpha = 0)
plot(ridge.out)
lasso.out = glmnet(x = X, y = Y, alpha = 1)
plot(lasso.out)
These R commands produce a plot of the values of the coordinates of the Ridge and Lasso estimates as λ decreases.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 61 / 105

65 From Ridge estimate to Lasso estimate Illustration on real data [Figure: coefficient paths of the Ridge (left) and Lasso (right) estimates for the eight predictors (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45). The x-axis corresponds to ‖ˆβ_λ‖_1; the left-hand side corresponds to λ = +∞, the right-hand side corresponds to λ = 0.] Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 62 / 105

66 From Ridge estimate to Lasso estimate Generalizations of the Lasso - the Dantzig selector Remember that the Lasso estimate satisfies the constraint
max_{j=1,...,p} |2 X_j^T (Y − X ˆβ_λ^lasso)| ≤ λ.
We then introduce the convex set
D := { β ∈ R^p : max_{j=1,...,p} |2 X_j^T (Y − Xβ)| ≤ λ },
which contains β with high probability if λ is well tuned. Remember also that we investigate sparse vectors, where sparsity is measured by using the l1-norm. Therefore, Candès and Tao (2007) have suggested to use the Dantzig selector
ˆβ_λ^Dantzig := argmin_{β ∈ D} ‖β‖_1.
Note that ‖ˆβ_λ^Dantzig‖_1 ≤ ‖ˆβ_λ^lasso‖_1. Numerical and theoretical performances of the Dantzig and Lasso estimates are very close. In some cases, they may even coincide.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 63 / 105

67 From Ridge estimate to Lasso estimate Generalizations of the Lasso - Adaptive Lasso Due to its soft-thresholding nature, the Lasso estimation of large coefficients may suffer from a large bias. We can overcome this problem by introducing data-driven weights. Zou (2006) proposed an adaptive version of the classical Lasso:
ˆβ_λ^Zou := argmin_{β ∈ R^p} ‖Y − Xβ‖² + λ Σ_{j=1}^p w_j |β_j|, with w_j = 1 / |ˆβ_j^ols|.
The larger |ˆβ_j^ols|, the smaller w_j, which encourages large values for ˆβ_{λ,j}^Zou. Instead of ˆβ_ols, other preliminary estimates can be considered.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 64 / 105
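A hedged sketch of the adaptive Lasso using glmnet's penalty.factor argument for the weights w_j (simulated data; rank(X) = p is assumed so that ˆβ_ols exists):
library(glmnet)
set.seed(1)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
Y <- as.vector(X[, 1:2] %*% c(3, -2) + rnorm(n))
beta_ols <- solve(crossprod(X), crossprod(X, Y))
w <- 1 / abs(as.vector(beta_ols))                        # data-driven weights w_j = 1 / |beta_ols_j|
fit_adapt <- glmnet(x = X, y = Y, alpha = 1, penalty.factor = w)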

68 From Ridge estimate to Lasso estimate Generalizations of the Lasso - Relaxed Lasso Instead of introducing weights, Meinshausen (2007) suggests a two-step procedure:
1 Compute
ˆβ_λ^lasso = argmin_{β ∈ R^p} { ‖Y − Xβ‖² + λ ‖β‖_1 }
and set Ŝ_λ := { j : ˆβ_{λ,j} ≠ 0 }.
2 For δ ∈ [0, 1], compute
ˆβ_{λ,δ}^relaxed := argmin_{β ∈ R^p, supp(β) ⊂ Ŝ_λ} { ‖Y − Xβ‖² + δ λ ‖β‖_1 }.
The value δ = 0 is commonly used. If X is orthogonal,
ˆβ_{λ,δ,j}^relaxed = X_j^T Y − δλ/2 if X_j^T Y ≥ λ/2,
                  = 0 if |X_j^T Y| ≤ λ/2,
                  = X_j^T Y + δλ/2 if X_j^T Y ≤ −λ/2.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 65 / 105
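A sketch of the relaxed Lasso with δ = 0, i.e. an ordinary least squares refit on the support selected by the Lasso (simulated data; the value λ = 0.1 is an arbitrary illustration):
library(glmnet)
set.seed(1)
n <- 100; p <- 30
X <- matrix(rnorm(n * p), n, p)
Y <- as.vector(X[, 1:3] %*% c(2, -1, 1) + rnorm(n))
fit <- glmnet(x = X, y = Y, alpha = 1)
S_hat <- which(as.vector(coef(fit, s = 0.1))[-1] != 0)   # selected support S_hat for lambda = 0.1
beta_relaxed <- rep(0, p)
beta_relaxed[S_hat] <- coef(lm(Y ~ X[, S_hat] - 1))      # refit OLS restricted to S_hat (delta = 0)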

69 From Ridge estimate to Lasso estimate Generalizations of the Lasso - The square-root Lasso A natural property of the Lasso estimate would be to satisfy, for any s > 0,
argmin_{β ∈ R^p} { ‖Y − Xβ‖² + λ ‖β‖_1 } a.e.= argmin_{β ∈ R^p} { ‖sY − sXβ‖² + λ ‖sβ‖_1 }.
If the tuning parameter is chosen independently of σ, the standard deviation of Y, then the Lasso estimate is not scale invariant. The estimate
argmin_{β ∈ R^p} { σ^{−1} ‖Y − Xβ‖² + λ ‖β‖_1 }
is scale invariant but is based on the knowledge of σ. Alternatively, one can consider the square-root Lasso:
argmin_{β ∈ R^p} { ‖Y − Xβ‖ + λ ‖β‖_1 },
which also enjoys nice properties.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 66 / 105

70 From Ridge estimate to Lasso estimate Generalizations of the Lasso - Elastic net In the model Y = Xβ + ɛ, consider
ˆβ_λ^lasso = argmin_{β ∈ R^p} { ‖Y − Xβ‖² + λ ‖β‖_1 }.
If we consider X̃ = [X, X_p] (the design X augmented with a copy of its last column) and if ˆβ_{λ,p}^lasso ≠ 0, then any vector β̃_λ such that
β̃_{λ,j} = ˆβ_{λ,j}^lasso if j ≤ p − 1,
β̃_{λ,j} = α ˆβ_{λ,p}^lasso if j = p,
β̃_{λ,j} = (1 − α) ˆβ_{λ,p}^lasso if j = p + 1,
with α ∈ [0, 1], is a solution of
argmin_{β ∈ R^{p+1}} { ‖Y − X̃β‖² + λ ‖β‖_1 }.
We have an infinite number of solutions.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 67 / 105

71 From Ridge estimate to Lasso estimate Generalizations of the Lasso - Elastic net In practice, predictors are different but they may be strongly correlated. In this case, the Lasso estimate may hide the relevance of one of them, just because it is highly correlated to another one. Coefficients of two correlated predictors should be close. The elastic net procedure proposed by Zou and Hastie (2005) makes a compromise between the Ridge and Lasso penalties: given λ_1 > 0 and λ_2 > 0,
ˆβ_{λ_1,λ_2}^e.n. := argmin_{β ∈ R^p} { ‖Y − Xβ‖² + λ_1 ‖β‖_1 + λ_2 ‖β‖² }.
The criterion is strictly convex, so there is a unique minimizer. If the columns of X are centered and renormalized and if Y is centered, then for j ≠ k such that ˆβ_{λ_1,λ_2,j}^e.n. ˆβ_{λ_1,λ_2,k}^e.n. > 0,
|ˆβ_{λ_1,λ_2,j}^e.n. − ˆβ_{λ_1,λ_2,k}^e.n.| ≤ (‖Y‖_1 / λ_2) √(2 (1 − X_j^T X_k)).
We can improve ˆβ_{λ_1,λ_2}^e.n. and consider (1 + λ_2) ˆβ_{λ_1,λ_2}^e.n.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 68 / 105
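In R, the elastic net is obtained with glmnet for 0 < alpha < 1 (glmnet uses the reparametrized penalty lambda * (alpha * ‖β‖_1 + (1 − alpha)/2 * ‖β‖²) rather than the pair (λ_1, λ_2)); a sketch with two strongly correlated predictors:
library(glmnet)
set.seed(1)
n <- 100
z <- rnorm(n)
X <- cbind(z + 0.01 * rnorm(n), z + 0.01 * rnorm(n), matrix(rnorm(n * 8), n, 8))  # columns 1 and 2 almost identical
Y <- as.vector(3 * z + rnorm(n))
fit_en <- glmnet(x = X, y = Y, alpha = 0.5)   # mixed l1/l2 penalty; alpha = 1 is the Lasso, alpha = 0 the Ridge
coef(fit_en, s = 0.1)[2:3]                    # coefficients of the two correlated predictors (expected to be close)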

72 From Ridge estimate to Lasso estimate Generalizations of the Lasso - Fused Lasso For change point detection, for instance, for which coefficients remain constant over large portions of segments, Tibshirani, Saunders, Rosset, Zhu and Knight (2005) have introduced the fused Lasso: given λ_1 > 0 and λ_2 > 0,
ˆβ_{λ_1,λ_2}^fused := argmin_{β ∈ R^p} ‖Y − Xβ‖² + λ_1 Σ_{j=1}^p |β_j| + λ_2 Σ_{j=2}^p |β_j − β_{j−1}|.
The first penalty is the familiar Lasso penalty, which regularizes the signal. The second penalty encourages neighboring coefficients to be identical. We can generalize the notion of neighbors from a linear ordering to more general neighborhoods, for example adjacent pixels in an image. This leads to a penalty of the form
λ_2 Σ_{j ~ j'} |β_j − β_{j'}|.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 69 / 105

73 From Ridge estimate to Lasso estimate Generalizations of the Lasso - Group Lasso To select a group of variables simultaneously, Yuan and Lin (2006) suggest to use the group-Lasso procedure. For this purpose, we assume we are given K known non-overlapping groups G_1, G_2, ..., G_K and we set, for λ > 0,
ˆβ^group := argmin_{β ∈ R^p} { ‖Y − Xβ‖² + λ Σ_{k=1}^K ‖β_(k)‖ },
where β_(k)j = β_j if j ∈ G_k and 0 otherwise. As for the Lasso, the group-Lasso can be characterized: for all k,
2 X_(k)^T (Y − X ˆβ^group) = λ ˆβ^group_(k) / ‖ˆβ^group_(k)‖ if ˆβ^group_(k) ≠ 0,
‖2 X_(k)^T (Y − X ˆβ^group)‖ ≤ λ if ˆβ^group_(k) = 0.
The procedure keeps or discards all the coefficients within a block and can increase estimation accuracy by using information about coefficients of the same block.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 70 / 105

74 From Ridge estimate to Lasso estimate Generalizations of the Lasso - Hierarchical group Lasso We consider 2 predictors X_1 and X_2. Suppose we want X_1 to be included in the model before X_2. This hierarchy can be induced by defining overlapping groups: we take G_1 = {1, 2} and G_2 = {2}. This leads to
ˆβ^overlap = argmin_{β ∈ R^p} { ‖Y − Xβ‖² + λ ( ‖(β_1, β_2)‖ + |β_2| ) }.
[Figure: contour plot of this penalty function in the (β_1, β_2) plane.]
Theorem (Zhao, Rocha and Yu (2009)). We assume we are given K known groups G_1, G_2, ..., G_K. Let I_1 and I_2 ⊂ {1,...,p} be two subsets of indices. We assume:
1 For all 1 ≤ k ≤ K, I_1 ∩ G_k ≠ ∅ implies I_2 ∩ G_k ≠ ∅.
2 There exists k_0 such that I_2 ∩ G_{k_0} ≠ ∅ and I_1 ∩ G_{k_0} = ∅.
Then, almost surely, ˆβ_{I_2} ≠ 0 implies ˆβ_{I_1} ≠ 0.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 71 / 105

75 From Ridge estimate to Lasso estimate Generalizations of the Lasso - The Bayesian Lasso In the Bayesian approach, the parameter is random and we write:
Y | β, σ² ~ N(Xβ, σ² I_n).
Park and Casella (2008) suggest to consider a Laplace distribution for β:
π(β | λ, σ) ∝ Π_{j=1}^p (λ / (2σ)) exp( −(λ/σ) |β_j| ).
Then, the posterior density is proportional to
exp( −(1/(2σ²)) ‖Y − Xβ‖² − (λ/σ) ‖β‖_1 ),
and the posterior mode coincides with the Lasso estimate with smoothing parameter σλ. The posterior distribution provides more than point estimates, since it provides the entire posterior distribution. The procedure is tuned by including priors for σ² and λ. Most Lasso-type procedures have a Bayesian interpretation.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 72 / 105

76 From Ridge estimate to Lasso estimate Geometric constraint areas Exercise: connect each methodology to its associated geometric constraint area [figure not reproduced].
1 Lasso
2 Elastic net
3 Group-Lasso
4 Overlap group-Lasso
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 73 / 105

77 From Ridge estimate to Lasso estimate Theoretical guarantees - Support recovery Question: are the non-zero entries of the Lasso estimate ˆβ_λ^lasso in the same positions as those of the true regression vector β* (with Y = Xβ* + ɛ)? We set
S := { j : β*_j ≠ 0 }, Ŝ_λ := { j : ˆβ_{λ,j}^lasso ≠ 0 }.
We can identify conditions under which Ŝ_λ = S.
Theorem. We assume that for some γ > 0, K > 0 and c_min > 0,
max_{j ∉ S} ‖(X_S^T X_S)^{−1} X_S^T X_j‖_1 ≤ 1 − γ,
max_{j=1,...,p} ‖X_j‖ ≤ K,
and the eigenvalues of X_S^T X_S are at least c_min.
Then, if λ ≥ 8Kσ √(log p) / γ, with probability larger than 1 − p^{−A} (for some constant A > 0),
Ŝ_λ ⊂ S, ‖ˆβ_λ^lasso − β*‖_∞ ≤ λ ( 4σ / √c_min + ‖(X_S^T X_S)^{−1}‖_∞ ).
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 74 / 105

78 From Ridge estimate to Lasso estimate Theoretical guarantees - Prediction bounds
Theorem (Bunea et al. (2007)). Let us consider λ ≥ 3 max_{j=1,...,p} |(X^T ɛ)_j|. For any β ∈ R^p, let
κ(β) := min_{ν ∈ C(β)} ‖Xν‖² / ‖ν‖², C(β) := { ν ∈ R^p : 20 ‖ν‖_{1,Supp(β)} > ‖ν‖_{1,Supp(β)^c} }.
Then,
‖X ˆβ_λ^lasso − X β*‖² ≤ inf_{β ∈ R^p : κ(β) > 0} { 3 ‖Xβ − Xβ*‖² + λ² ‖β‖_0 / κ(β) },
where β* denotes the true regression vector. Deriving λ such that the first bound is satisfied with high probability is easy by using concentration inequalities. The Restricted Eigenvalue condition (κ(β) > 0) expresses the lack of orthogonality of the columns of X. Milder conditions can be used (see Negahban et al. (2012) or Jiang et al. (2017)).
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 75 / 105

79 From Ridge estimate to Lasso estimate Theoretical guarantees - Prediction bounds
Lemma. From the previous theorem, we can deduce estimation bounds in l2 and l1 norms for the estimation of sparse vectors β* (see Jiang et al. (2017)):
‖ˆβ_λ^lasso − β*‖² ≲ λ² ‖β*‖_0, ‖ˆβ_λ^lasso − β*‖_1 ≲ λ ‖β*‖_0.
The proof of the Theorem is based on the following lemma. Let λ ≥ 3 max_{j=1,...,p} |(X^T ɛ)_j| and β ∈ R^p. Then,
λ ‖ˆβ_λ^lasso − β‖_{1,Supp(β)^c} ≤ 3 ‖Xβ − Xβ*‖² + 5λ ‖ˆβ_λ^lasso − β‖_{1,Supp(β)},
‖X ˆβ_λ^lasso − X β*‖² ≤ ‖Xβ − Xβ*‖² + 2λ ‖ˆβ_λ^lasso − β‖_{1,Supp(β)}.
Better constants can be obtained via a more involved proof.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 76 / 105

80 From Ridge estimate to Lasso estimate Take-home message To overcome the prohibitive computational complexity of model selection, convex criteria can be considered, leading in particular to Lasso-type estimates. By doing so, we introduce some bias but reduce the variance of the predicted values. Moreover, we can identify a small number of predictors that have the strongest effects, which makes interpretation easier for the practitioner. By varying the basic Lasso l1-penalty, we can reduce problems encountered by the standard Lasso or incorporate some prior knowledge about the model. In the linear regression setting, these estimates, which can be easily computed, are very popular for high-dimensional statistics. They achieve nice theoretical and numerical properties. Even if some standard recipes can be used to tune the Lasso, its calibration remains an important open problem. Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 77 / 105

81 From Ridge estimate to Lasso estimate References
- Bunea, F., Tsybakov, A. and Wegkamp, M. Sparsity oracle inequalities for the Lasso. Electron. J. Stat. 1, 2007.
- Candès, E. and Tao, T. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6), 2007.
- Chen, S., Donoho, D. and Saunders, M. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20, no. 1, 33-61, 1998.
- Efron, B. How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association: Theory and Methods 81 (394), 1986.
- Jiang, X., Reynaud-Bouret, P., Rivoirard, V., Sansonnet, L. and Willett, R. A data-dependent weighted LASSO under Poisson noise. Submitted.
- Meinshausen, N. Relaxed Lasso. Comput. Statist. Data Anal., 52(1), 2007.
Vincent Rivoirard (Université Paris-Dauphine) Statistics for high-dimensional data 78 / 105


More information

Comparisons of penalized least squares. methods by simulations

Comparisons of penalized least squares. methods by simulations Comparisons of penalized least squares arxiv:1405.1796v1 [stat.co] 8 May 2014 methods by simulations Ke ZHANG, Fan YIN University of Science and Technology of China, Hefei 230026, China Shifeng XIONG Academy

More information

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu LITIS - EA 48 - INSA/Universite de Rouen Avenue de l Université - 768 Saint-Etienne du Rouvray

More information

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University

Chap 1. Overview of Statistical Learning (HTF, , 2.9) Yongdai Kim Seoul National University Chap 1. Overview of Statistical Learning (HTF, 2.1-2.6, 2.9) Yongdai Kim Seoul National University 0. Learning vs Statistical learning Learning procedure Construct a claim by observing data or using logics

More information

The lasso, persistence, and cross-validation

The lasso, persistence, and cross-validation The lasso, persistence, and cross-validation Daniel J. McDonald Department of Statistics Indiana University http://www.stat.cmu.edu/ danielmc Joint work with: Darren Homrighausen Colorado State University

More information

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models

Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Pre-Selection in Cluster Lasso Methods for Correlated Variable Selection in High-Dimensional Linear Models Niharika Gauraha and Swapan Parui Indian Statistical Institute Abstract. We consider variable

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

Identify Relative importance of covariates in Bayesian lasso quantile regression via new algorithm in statistical program R

Identify Relative importance of covariates in Bayesian lasso quantile regression via new algorithm in statistical program R Identify Relative importance of covariates in Bayesian lasso quantile regression via new algorithm in statistical program R Fadel Hamid Hadi Alhusseini Department of Statistics and Informatics, University

More information

SCIENCE CHINA Information Sciences. Received December 22, 2008; accepted February 26, 2009; published online May 8, 2010

SCIENCE CHINA Information Sciences. Received December 22, 2008; accepted February 26, 2009; published online May 8, 2010 . RESEARCH PAPERS. SCIENCE CHINA Information Sciences June 2010 Vol. 53 No. 6: 1159 1169 doi: 10.1007/s11432-010-0090-0 L 1/2 regularization XU ZongBen 1, ZHANG Hai 1,2, WANG Yao 1, CHANG XiangYu 1 & LIANG

More information

Introduction to Statistics and R

Introduction to Statistics and R Introduction to Statistics and R Mayo-Illinois Computational Genomics Workshop (2018) Ruoqing Zhu, Ph.D. Department of Statistics, UIUC rqzhu@illinois.edu June 18, 2018 Abstract This document is a supplimentary

More information

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013)

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013) A Survey of L 1 Regression Vidaurre, Bielza and Larranaga (2013) Céline Cunen, 20/10/2014 Outline of article 1.Introduction 2.The Lasso for Linear Regression a) Notation and Main Concepts b) Statistical

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10 COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem

More information

Kernel change-point detection

Kernel change-point detection 1,2 (joint work with Alain Celisse 3 & Zaïd Harchaoui 4 ) 1 Cnrs 2 École Normale Supérieure (Paris), DIENS, Équipe Sierra 3 Université Lille 1 4 INRIA Grenoble Workshop Kernel methods for big data, Lille,

More information

Regression Shrinkage and Selection via the Elastic Net, with Applications to Microarrays

Regression Shrinkage and Selection via the Elastic Net, with Applications to Microarrays Regression Shrinkage and Selection via the Elastic Net, with Applications to Microarrays Hui Zou and Trevor Hastie Department of Statistics, Stanford University December 5, 2003 Abstract We propose the

More information

Statistical Learning with the Lasso, spring The Lasso

Statistical Learning with the Lasso, spring The Lasso Statistical Learning with the Lasso, spring 2017 1 Yeast: understanding basic life functions p=11,904 gene values n number of experiments ~ 10 Blomberg et al. 2003, 2010 The Lasso fmri brain scans function

More information

Chapter 7: Model Assessment and Selection

Chapter 7: Model Assessment and Selection Chapter 7: Model Assessment and Selection DD3364 April 20, 2012 Introduction Regression: Review of our problem Have target variable Y to estimate from a vector of inputs X. A prediction model ˆf(X) has

More information

Lecture 2 Part 1 Optimization

Lecture 2 Part 1 Optimization Lecture 2 Part 1 Optimization (January 16, 2015) Mu Zhu University of Waterloo Need for Optimization E(y x), P(y x) want to go after them first, model some examples last week then, estimate didn t discuss

More information

Model selection theory: a tutorial with applications to learning

Model selection theory: a tutorial with applications to learning Model selection theory: a tutorial with applications to learning Pascal Massart Université Paris-Sud, Orsay ALT 2012, October 29 Asymptotic approach to model selection - Idea of using some penalized empirical

More information

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection

TECHNICAL REPORT NO. 1091r. A Note on the Lasso and Related Procedures in Model Selection DEPARTMENT OF STATISTICS University of Wisconsin 1210 West Dayton St. Madison, WI 53706 TECHNICAL REPORT NO. 1091r April 2004, Revised December 2004 A Note on the Lasso and Related Procedures in Model

More information

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 9-10 - High-dimensional regression Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Recap from

More information

High-dimensional statistics: Some progress and challenges ahead

High-dimensional statistics: Some progress and challenges ahead High-dimensional statistics: Some progress and challenges ahead Martin Wainwright UC Berkeley Departments of Statistics, and EECS University College, London Master Class: Lecture Joint work with: Alekh

More information

High-dimensional covariance estimation based on Gaussian graphical models

High-dimensional covariance estimation based on Gaussian graphical models High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,

More information

CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning

CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning Professor Erik Sudderth Brown University Computer Science October 4, 2016 Some figures and materials courtesy

More information

Institute of Statistics Mimeo Series No Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR

Institute of Statistics Mimeo Series No Simultaneous regression shrinkage, variable selection and clustering of predictors with OSCAR DEPARTMENT OF STATISTICS North Carolina State University 2501 Founders Drive, Campus Box 8203 Raleigh, NC 27695-8203 Institute of Statistics Mimeo Series No. 2583 Simultaneous regression shrinkage, variable

More information

Bayesian Grouped Horseshoe Regression with Application to Additive Models

Bayesian Grouped Horseshoe Regression with Application to Additive Models Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu, Daniel F. Schmidt, Enes Makalic, Guoqi Qian, and John L. Hopper Centre for Epidemiology and Biostatistics, Melbourne

More information

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population

More information

Linear regression methods

Linear regression methods Linear regression methods Most of our intuition about statistical methods stem from linear regression. For observations i = 1,..., n, the model is Y i = p X ij β j + ε i, j=1 where Y i is the response

More information

COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d)

COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d) COMP 551 Applied Machine Learning Lecture 3: Linear regression (cont d) Instructor: Herke van Hoof (herke.vanhoof@mail.mcgill.ca) Slides mostly by: Class web page: www.cs.mcgill.ca/~hvanho2/comp551 Unless

More information

Machine Learning for Economists: Part 4 Shrinkage and Sparsity

Machine Learning for Economists: Part 4 Shrinkage and Sparsity Machine Learning for Economists: Part 4 Shrinkage and Sparsity Michal Andrle International Monetary Fund Washington, D.C., October, 2018 Disclaimer #1: The views expressed herein are those of the authors

More information

Saharon Rosset 1 and Ji Zhu 2

Saharon Rosset 1 and Ji Zhu 2 Aust. N. Z. J. Stat. 46(3), 2004, 505 510 CORRECTED PROOF OF THE RESULT OF A PREDICTION ERROR PROPERTY OF THE LASSO ESTIMATOR AND ITS GENERALIZATION BY HUANG (2003) Saharon Rosset 1 and Ji Zhu 2 IBM T.J.

More information

Statistical Inference

Statistical Inference Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park

More information

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă mmp@stat.washington.edu Reading: Murphy: BIC, AIC 8.4.2 (pp 255), SRM 6.5 (pp 204) Hastie, Tibshirani

More information

Least Absolute Gradient Selector: variable selection via Pseudo-Hard Thresholding

Least Absolute Gradient Selector: variable selection via Pseudo-Hard Thresholding arxiv:204.2353v4 [stat.ml] 9 Oct 202 Least Absolute Gradient Selector: variable selection via Pseudo-Hard Thresholding Kun Yang September 2, 208 Abstract In this paper, we propose a new approach, called

More information

Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space

Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space Asymptotic Equivalence of Regularization Methods in Thresholded Parameter Space Jinchi Lv Data Sciences and Operations Department Marshall School of Business University of Southern California http://bcf.usc.edu/

More information

Fast Regularization Paths via Coordinate Descent

Fast Regularization Paths via Coordinate Descent KDD August 2008 Trevor Hastie, Stanford Statistics 1 Fast Regularization Paths via Coordinate Descent Trevor Hastie Stanford University joint work with Jerry Friedman and Rob Tibshirani. KDD August 2008

More information

A significance test for the lasso

A significance test for the lasso 1 Gold medal address, SSC 2013 Joint work with Richard Lockhart (SFU), Jonathan Taylor (Stanford), and Ryan Tibshirani (Carnegie-Mellon Univ.) Reaping the benefits of LARS: A special thanks to Brad Efron,

More information

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson Bayesian variable selection via penalized credible regions Brian Reich, NC State Joint work with Howard Bondell and Ander Wilson Brian Reich, NCSU Penalized credible regions 1 Motivation big p, small n

More information

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise

Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise Orthogonal Matching Pursuit for Sparse Signal Recovery With Noise The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published

More information

Lecture 14: Variable Selection - Beyond LASSO

Lecture 14: Variable Selection - Beyond LASSO Fall, 2017 Extension of LASSO To achieve oracle properties, L q penalty with 0 < q < 1, SCAD penalty (Fan and Li 2001; Zhang et al. 2007). Adaptive LASSO (Zou 2006; Zhang and Lu 2007; Wang et al. 2007)

More information

arxiv: v1 [math.st] 8 Jan 2008

arxiv: v1 [math.st] 8 Jan 2008 arxiv:0801.1158v1 [math.st] 8 Jan 2008 Hierarchical selection of variables in sparse high-dimensional regression P. J. Bickel Department of Statistics University of California at Berkeley Y. Ritov Department

More information

Bias-free Sparse Regression with Guaranteed Consistency

Bias-free Sparse Regression with Guaranteed Consistency Bias-free Sparse Regression with Guaranteed Consistency Wotao Yin (UCLA Math) joint with: Stanley Osher, Ming Yan (UCLA) Feng Ruan, Jiechao Xiong, Yuan Yao (Peking U) UC Riverside, STATS Department March

More information

COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017

COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017 COMS 4721: Machine Learning for Data Science Lecture 6, 2/2/2017 Prof. John Paisley Department of Electrical Engineering & Data Science Institute Columbia University UNDERDETERMINED LINEAR EQUATIONS We

More information

Dimension Reduction Methods

Dimension Reduction Methods Dimension Reduction Methods And Bayesian Machine Learning Marek Petrik 2/28 Previously in Machine Learning How to choose the right features if we have (too) many options Methods: 1. Subset selection 2.

More information

Shrinkage Methods: Ridge and Lasso

Shrinkage Methods: Ridge and Lasso Shrinkage Methods: Ridge and Lasso Jonathan Hersh 1 Chapman University, Argyros School of Business hersh@chapman.edu February 27, 2019 J.Hersh (Chapman) Ridge & Lasso February 27, 2019 1 / 43 1 Intro and

More information

High-dimensional regression modeling

High-dimensional regression modeling High-dimensional regression modeling David Causeur Department of Statistics and Computer Science Agrocampus Ouest IRMAR CNRS UMR 6625 http://www.agrocampus-ouest.fr/math/causeur/ Course objectives Making

More information

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28

Sparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28 Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:

More information

Biostatistics Advanced Methods in Biostatistics IV

Biostatistics Advanced Methods in Biostatistics IV Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu Lecture 12 1 / 36 Tip + Paper Tip: As a statistician the results

More information

Regression I: Mean Squared Error and Measuring Quality of Fit

Regression I: Mean Squared Error and Measuring Quality of Fit Regression I: Mean Squared Error and Measuring Quality of Fit -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 The Setup Suppose there is a scientific problem we are interested in solving

More information

Some new ideas for post selection inference and model assessment

Some new ideas for post selection inference and model assessment Some new ideas for post selection inference and model assessment Robert Tibshirani, Stanford WHOA!! 2018 Thanks to Jon Taylor and Ryan Tibshirani for helpful feedback 1 / 23 Two topics 1. How to improve

More information

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences Biostatistics-Lecture 16 Model Selection Ruibin Xi Peking University School of Mathematical Sciences Motivating example1 Interested in factors related to the life expectancy (50 US states,1969-71 ) Per

More information

Linear Methods for Regression. Lijun Zhang

Linear Methods for Regression. Lijun Zhang Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived

More information

The lasso. Patrick Breheny. February 15. The lasso Convex optimization Soft thresholding

The lasso. Patrick Breheny. February 15. The lasso Convex optimization Soft thresholding Patrick Breheny February 15 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/24 Introduction Last week, we introduced penalized regression and discussed ridge regression, in which the penalty

More information

The lasso: some novel algorithms and applications

The lasso: some novel algorithms and applications 1 The lasso: some novel algorithms and applications Robert Tibshirani Stanford University ASA Bay Area chapter meeting Collaborations with Trevor Hastie, Jerome Friedman, Ryan Tibshirani, Daniela Witten,

More information

Lecture 24 May 30, 2018

Lecture 24 May 30, 2018 Stats 3C: Theory of Statistics Spring 28 Lecture 24 May 3, 28 Prof. Emmanuel Candes Scribe: Martin J. Zhang, Jun Yan, Can Wang, and E. Candes Outline Agenda: High-dimensional Statistical Estimation. Lasso

More information

Risk and Noise Estimation in High Dimensional Statistics via State Evolution

Risk and Noise Estimation in High Dimensional Statistics via State Evolution Risk and Noise Estimation in High Dimensional Statistics via State Evolution Mohsen Bayati Stanford University Joint work with Jose Bento, Murat Erdogdu, Marc Lelarge, and Andrea Montanari Statistical

More information

Learning with Sparsity Constraints

Learning with Sparsity Constraints Stanford 2010 Trevor Hastie, Stanford Statistics 1 Learning with Sparsity Constraints Trevor Hastie Stanford University recent joint work with Rahul Mazumder, Jerome Friedman and Rob Tibshirani earlier

More information

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly

Robust Variable Selection Methods for Grouped Data. Kristin Lee Seamon Lilly Robust Variable Selection Methods for Grouped Data by Kristin Lee Seamon Lilly A dissertation submitted to the Graduate Faculty of Auburn University in partial fulfillment of the requirements for the Degree

More information

High-dimensional Ordinary Least-squares Projection for Screening Variables

High-dimensional Ordinary Least-squares Projection for Screening Variables 1 / 38 High-dimensional Ordinary Least-squares Projection for Screening Variables Chenlei Leng Joint with Xiangyu Wang (Duke) Conference on Nonparametric Statistics for Big Data and Celebration to Honor

More information

Day 4: Shrinkage Estimators

Day 4: Shrinkage Estimators Day 4: Shrinkage Estimators Kenneth Benoit Data Mining and Statistical Learning March 9, 2015 n versus p (aka k) Classical regression framework: n > p. Without this inequality, the OLS coefficients have

More information

Or How to select variables Using Bayesian LASSO

Or How to select variables Using Bayesian LASSO Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection

More information

Model comparison and selection

Model comparison and selection BS2 Statistical Inference, Lectures 9 and 10, Hilary Term 2008 March 2, 2008 Hypothesis testing Consider two alternative models M 1 = {f (x; θ), θ Θ 1 } and M 2 = {f (x; θ), θ Θ 2 } for a sample (X = x)

More information

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables LIB-MA, FSSM Cadi Ayyad University (Morocco) COMPSTAT 2010 Paris, August 22-27, 2010 Motivations Fan and Li (2001), Zou and Li (2008)

More information

Regularization Path Algorithms for Detecting Gene Interactions

Regularization Path Algorithms for Detecting Gene Interactions Regularization Path Algorithms for Detecting Gene Interactions Mee Young Park Trevor Hastie July 16, 2006 Abstract In this study, we consider several regularization path algorithms with grouped variable

More information

Nonconcave Penalized Likelihood with A Diverging Number of Parameters

Nonconcave Penalized Likelihood with A Diverging Number of Parameters Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized

More information

Compressed Sensing in Cancer Biology? (A Work in Progress)

Compressed Sensing in Cancer Biology? (A Work in Progress) Compressed Sensing in Cancer Biology? (A Work in Progress) M. Vidyasagar FRS Cecil & Ida Green Chair The University of Texas at Dallas M.Vidyasagar@utdallas.edu www.utdallas.edu/ m.vidyasagar University

More information