High-dimensional regression modeling


1 High-dimensional regression modeling
David Causeur
Department of Statistics and Computer Science
Agrocampus Ouest
IRMAR CNRS UMR 6625

2 Course objectives
Making prediction rules from high-throughput data:
- Data reduction techniques for regression;
- Regularization methods.
Mathematical vs applied statistics:
- Statistical theory is reduced to its essentials;
- Solving problems by data analysis using R.
By the end, students are expected to be able to:
- Implement methods for high-dimensional prediction;
- Compare procedures based on statistical arguments;
- Assess the prediction performance of a learning algorithm;
- Apply these key insights using statistical software.

3 Outline
1. High-dimensional regression modeling
   1.1. Model selection based on ordinary least squares
   1.2. Assessment of regression rules (cross-validation)
2. Rank-reduced regression
   2.1. Principal Component Regression
   2.2. Partial Least Squares
3. Penalized regression
   3.1. Ridge estimation
   3.2. Lasso selection
Course evaluation: written exam + project.
Bibliography: T. Hastie, R. Tibshirani, J. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer. Free download.

4 Pre-requisites
Regression: assumptions of linear regression modeling? Ordinary least-squares fitting?
Model assessment: $R^2$? AIC?
Testing: t-test? F-test?
Statistical software R: lm(y ~ x, ...)? anova(lm(...))?
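These pre-requisites translate directly into base R calls; the following is a minimal sketch on simulated data (sample size and coefficients invented for illustration):

```r
set.seed(1)
n <- 60
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

fit <- lm(y ~ x)           # ordinary least-squares fit
summary(fit)$r.squared     # R^2
AIC(fit)                   # Akaike Information Criterion
coef(summary(fit))         # t-tests on the coefficients
anova(fit)                 # F-test of the regression
```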

5 Outline
1. Regression modeling: Why regression? Fitting regression models; Aims
2. Variable selection: Model comparison; Model selection
3. Penalized regression: Shrinkage; LASSO regression; Penalized 2-class supervised classification
4. Rank-reduced regression: Regression on latent variables; Partial Least Squares

6 Understanding life mechanisms
F. Galton, R. Fisher, W. Gosset (Student)
Issue in life sciences: understanding phenotypic variations.
In agricultural sciences: understanding yield variations.

7 Regression modeling
Wheat yield ($Y$) modeling. Wheat production profile ($x$): variety, chemical inputs, soil composition, ...
For a given profile $x = (x_1, \ldots, x_p)$ with phenotype $Y$, the regression function $f$ is defined by $E_x(Y) = f(x)$.

8 Range of regression models
$Y$ can take various forms. Among them:
- The reference framework: $Y$ on a continuous scale; $E_x(Y)$ is the mean $Y$ value for the profile $x$.
- The classification framework: $Y$ is a 2-class variable (0-1); $E_x(Y) = P_x(Y = 1)$.
$f$ can also take various forms:
- $f$ known up to some unknown parameters: $f(x; \beta_0, \beta_1) = \beta_0 + \beta_1 x$, $f(x; \beta_0, \beta_1) = \beta_0 x^{\beta_1}$, ...
- $f$ fully unknown: $f(x)$ is regular (continuous, differentiable, ...).

9 Galton's regression model for heredity
From Galton, F. (1886). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute, Vol. 15, pp. 246-263.
[Scatter plot: Galton (1886)'s data — height of child (inches) against height of mid-parent (inches)]

10 Galton's regression model for heredity
From Galton, F. (1886). Regression towards mediocrity in hereditary stature. Journal of the Anthropological Institute, Vol. 15, pp. 246-263.

11 Pig grading in the EU
Lean Meat Percentage (LMP): the commercial value of a pig carcass.
LMP is measured by complete dissection... impossible on the slaughter-line... or indirectly by fat and muscle depths.
Different devices to measure tissue depths: invasive probe, scanning device.

12 Prediction of the LMP
Linear regression model: $E_x(Y) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p$, where
- $Y$ is the LMP of a pig,
- $x = (x_1, \ldots, x_p)$ is the tissue depths profile of this pig,
- $\beta_0$ and $\beta = (\beta_1, \ldots, \beta_p)$ are the regression parameters.
To fit the regression model = to estimate the $\beta$s.

13 Data needed to fit the model
Sample of independent units:

Units | Y    | x_1   | x_2   | ... | x_p
1     | Y_1  | x_11  | x_12  | ... | x_1p
2     | Y_2  | x_21  | x_22  | ... | x_2p
...   | ...  | ...   | ...   | ... | ...
n     | Y_n  | x_n1  | x_n2  | ... | x_np

14 Data needed to fit the model
[Scatter plot of Y against x]

15 Data needed to fit the model Sample of independent units: n = 60

16 The reference fitting method: least-squares
Fitting principle: searching for the closest model from the data.
"Closest" in the least-squares perspective: minimize
$\sum_{i=1}^n \big(\underbrace{Y_i - [\beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}]}_{\varepsilon_i = i\text{th residual}}\big)^2 = \sum_{i=1}^n \varepsilon_i^2$
A very convenient closed-form solution... provided $S_x^{-1}$ exists:
$\hat\beta = S_x^{-1} s_{xy}, \qquad \hat\beta_0 = \bar Y - \hat\beta_1 \bar x_1 - \ldots - \hat\beta_p \bar x_p$
where $S_x$ is the sample $p \times p$ variance matrix of the $x$ profile and $s_{xy}$ is the sample $p$-vector of covariances between $Y$ and the $x$ profile.
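The closed-form solution can be checked numerically against lm(); a sketch on simulated data (the $1/(n-1)$ scaling used by cov() appears in both $S_x$ and $s_{xy}$, so it cancels in $S_x^{-1} s_{xy}$):

```r
set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- drop(1 + X %*% c(0.5, -1, 2) + rnorm(n))

Sx  <- cov(X)                    # sample p x p variance matrix of the x profile
sxy <- cov(X, y)                 # sample covariances between Y and the x profile
beta_hat  <- solve(Sx, sxy)      # requires S_x to be invertible
beta0_hat <- mean(y) - drop(colMeans(X) %*% beta_hat)

cbind(closed_form = c(beta0_hat, beta_hat), lm = coef(lm(y ~ X)))
```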

17 Assessment of the fit
Closeness between observed $Y$ and fitted values $\hat Y = \hat\beta_0 + \hat\beta_1 x_1 + \ldots + \hat\beta_p x_p$, using the residual sum-of-squares
$\mathrm{RSS} = \sum_{i=1}^n (Y_i - \hat Y_i)^2$
or, equivalently, the $R^2$ coefficient $R^2 = \mathrm{Cor}^2(Y, \hat Y)$.
Precision of estimation/prediction, assuming $\mathrm{Var}(\varepsilon) = \sigma^2$:
$\mathrm{Var}(\hat\beta) = \dfrac{\sigma^2}{n} S_x^{-1}$.
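Continuing the sketch above, RSS and $R^2$ come directly from the lm() fit:

```r
fit <- lm(y ~ X)                 # X and y from the closed-form sketch above
RSS <- sum(residuals(fit)^2)     # residual sum of squares
R2  <- cor(y, fitted(fit))^2     # equals summary(fit)$r.squared
c(RSS = RSS, R2 = R2)
```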

18 Impact of correlation on the precision of the fit
Case $p = 1$: $E_x(Y) = \beta_0 + \beta x$,
$\mathrm{Var}(\hat\beta) = \dfrac{\sigma^2}{n S_x^2}$, where $S_x^2 = \dfrac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2$.
Precision depends on the spread of the $x$ profile.
Case $p = 2$: $E_x(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$,
$\mathrm{Var}(\hat\beta) = \dfrac{\sigma^2}{n} \begin{bmatrix} S_1^2 & S_{12} \\ S_{12} & S_2^2 \end{bmatrix}^{-1} = \begin{bmatrix} \mathrm{Var}(\hat\beta_1) & \mathrm{Cov}(\hat\beta_1, \hat\beta_2) \\ \mathrm{Cov}(\hat\beta_1, \hat\beta_2) & \mathrm{Var}(\hat\beta_2) \end{bmatrix}$
Hence $\mathrm{Var}(\hat\beta_1) = \dfrac{\sigma^2}{n} \dfrac{1}{S_1^2 (1 - r_{12}^2)}$ and $\mathrm{Var}(\hat\beta_2) = \dfrac{\sigma^2}{n} \dfrac{1}{S_2^2 (1 - r_{12}^2)}$.
Precision also depends on the correlation (here $r_{12}$) within the $x$ profile.
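A small simulation (a sketch, with invented values $\sigma^2 = 1$ and $r_{12} = 0.95$) illustrates how correlation within the $x$ profile inflates $\mathrm{Var}(\hat\beta_1)$:

```r
set.seed(2)
r12 <- 0.95
n   <- 50
beta1_hat <- replicate(2000, {
  x1 <- rnorm(n)
  x2 <- r12 * x1 + sqrt(1 - r12^2) * rnorm(n)  # Cor(x1, x2) close to r12
  y  <- 1 + x1 + x2 + rnorm(n)                 # sigma^2 = 1
  coef(lm(y ~ x1 + x2))["x1"]
})
var(beta1_hat)            # empirical variance of beta1-hat
1 / (n * (1 - r12^2))     # slide formula with sigma^2 = S1^2 = 1 (approximate)
```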

19 Reduction of the x profile
Model reduction: handling of redundancy in the $x$ profile.
- Search for a complete and minimal subset of predictors;
- Search for a moderate number of latent prediction components;
- Reduce the variance of estimation.
Assessment of prediction rules:
- Fit parsimonious models;
- Test the prediction rule on new data (cross-validation).

20 Outline
1. Regression modeling: Why regression? Fitting regression models; Aims
2. Variable selection: Model comparison; Model selection
3. Penalized regression: Shrinkage; LASSO regression; Penalized 2-class supervised classification
4. Rank-reduced regression: Regression on latent variables; Partial Least Squares

21 Which criterion to compare two models?
In the linear model framework: with $p$ predictors, there are $2^p - 1$ possible models, among which $\binom{p}{k}$ models of size $k$:
$M_k: \; E_x(Y) = \beta_0 + \beta_{i_1} x_{i_1} + \ldots + \beta_{i_k} x_{i_k}$
How to rank these models? Which criterion?
- Fitting performance (RSS, $R^2$, ...);
- Prediction performance (variance of the prediction error).

22 Fitting performance
$R^2$: intuitive and popular criterion. But, for all $j < k$, $R^2(M_k) \geq R^2(M_j)$: $R^2$ mechanically favors larger models.
Alternative fitting performance criteria:
Akaike Information Criterion (AIC): $\mathrm{AIC}(M_k) = n \log\left[\dfrac{\mathrm{RSS}(M_k)}{n}\right] + 2(k + 1)$
Bayesian Information Criterion (BIC): $\mathrm{BIC}(M_k) = n \log\left[\dfrac{\mathrm{RSS}(M_k)}{n}\right] + \log(n)(k + 1)$
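A sketch of these criteria in R, on simulated data where only one of three candidate predictors is active (R's AIC()/BIC() differ from the slide's formulas by an additive constant that does not affect rankings):

```r
set.seed(3)
n  <- 60
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)        # only x1 is truly active

m1 <- lm(y ~ x1)
m3 <- lm(y ~ x1 + x2 + x3)
summary(m3)$r.squared >= summary(m1)$r.squared  # TRUE: R^2 favors the larger model
c(AIC(m1), AIC(m3))                # smaller is better
c(BIC(m1), BIC(m3))                # BIC penalizes model size more heavily
```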

23 F-test
F-test for the comparison of nested models:
$F_{k-j} = \dfrac{[\mathrm{RSS}(M_j) - \mathrm{RSS}(M_k)]/(k - j)}{\mathrm{RSS}(M_k)/(n - k - 1)} = \dfrac{[R^2(M_k) - R^2(M_j)]/(k - j)}{[1 - R^2(M_k)]/(n - k - 1)} \;\overset{H_0}{\sim}\; F_{k-j,\, n-k-1}$
$M_k \succ M_j$ if $F_{k-j} \geq f_\alpha$, i.e. if
$\dfrac{R^2(M_k) - R^2(M_j)}{k - j} \geq f_\alpha \, \dfrac{1 - R^2(M_k)}{n - k - 1}$
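Continuing the sketch above, anova() on two nested lm() fits reports exactly this F statistic:

```r
anova(m1, m3)   # F statistic with (k - j, n - k - 1) degrees of freedom under H0
```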

24 Prediction accuracy
Let $x_0$ be the $x$ profile of an individual with unknown response value $Y_0$.
Prediction error: $Y_0 - \hat Y_0$, with $E_{x_0}(Y_0 - \hat Y_0) = 0$.
Precision: $\sigma_0^2 = \mathrm{Var}_{x_0}(Y_0 - \hat Y_0) = E_{x_0}\big[(Y_0 - \hat Y_0)^2\big]$.

25-31 Prediction accuracy
Estimation of $\sigma_0^2$ using cross-validation.
[Figure sequence: a sample of n individuals is split into k balanced subsamples; in turn, each subsample serves as the testing sample used to calculate prediction errors, while the remaining subsamples form the learning sample used to estimate the model.]

32 Prediction accuracy
Let $x_0$ be the $x$ profile of an individual with unknown response value $Y_0$.
Prediction error: $Y_0 - \hat Y_0$, with $E_{x_0}(Y_0 - \hat Y_0) = 0$. Precision: $\sigma_0^2 = \mathrm{Var}_{x_0}(Y_0 - \hat Y_0) = E_{x_0}\big[(Y_0 - \hat Y_0)^2\big]$.
Estimation of $\sigma_0^2$ using cross-validation: the $n$ prediction errors $Y_i - \hat Y_{(i)}$, where $\hat Y_{(i)}$ is predicted from the model fitted without the fold containing unit $i$, give the Mean Squared Error of Prediction (MSEP):
$\hat\sigma_0^2 = \dfrac{1}{n} \sum_{i=1}^n \big[Y_i - \hat Y_{(i)}\big]^2$
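A minimal k-fold cross-validation loop in base R computing the MSEP (the fold number k = 5 and the data are invented for illustration):

```r
set.seed(4)
n <- 60
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
dat <- data.frame(y, x)

k    <- 5
fold <- sample(rep(1:k, length.out = n))  # k balanced subsamples
pred <- numeric(n)
for (j in 1:k) {
  test <- fold == j
  fit  <- lm(y ~ x, data = dat[!test, ])             # learning sample
  pred[test] <- predict(fit, newdata = dat[test, ])  # testing sample
}
MSEP <- mean((y - pred)^2)   # estimate of sigma0^2
MSEP
```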

33 Search for the best sub-model
If $p$ components in the $x$ profile, $2^p - 1$ possible submodels.
- $p = 11$: 2047 submodels — an exhaustive search is still possible.
- $p = 137$: about $1.7 \times 10^{41}$ submodels — a suboptimal (stepwise) search has to be implemented.
Stepwise selection (at most $p(p+1)/2$ submodels are fitted):
- forward selection is to be privileged when $p$ is large;
- the output is very unstable.
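A forward stepwise search can be run with R's step(); the sketch below uses simulated data with p = 10 candidate predictors, of which two are active:

```r
set.seed(5)
n <- 60; p <- 10
dat <- data.frame(matrix(rnorm(n * p), n, p))   # columns X1, ..., X10
dat$y <- 1 + 2 * dat$X1 - dat$X2 + rnorm(n)

null <- lm(y ~ 1, data = dat)
full <- lm(y ~ ., data = dat)
step(null, scope = formula(full), direction = "forward", k = 2)  # k = 2: AIC
# k = log(n) gives the BIC penalty; direction = "both" allows deletions too
```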

34 Outline
1. Regression modeling: Why regression? Fitting regression models; Aims
2. Variable selection: Model comparison; Model selection
3. Penalized regression: Shrinkage; LASSO regression; Penalized 2-class supervised classification
4. Rank-reduced regression: Regression on latent variables; Partial Least Squares

35 Impact of correlation on OLS
Ordinary least-squares fitting: $\hat\beta$ minimizes $\sum_{i=1}^n \big(Y_i - [\beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}]\big)^2$.
A closed-form solution... provided $S_x^{-1}$ exists: $\hat\beta = S_x^{-1} s_{xy}$.
$\hat\beta$ is unbiased, with variance $\mathrm{Var}_x(\hat\beta) = \dfrac{\sigma^2}{n} S_x^{-1}$.
In high dimension, $\mathrm{Var}_x(\hat\beta_j)$ can be very large...

36 A biased alternative: penalized regression
Constrained least-squares:
$SC(\beta) = \sum_{i=1}^n \big(Y_i - \bar Y - \beta_1 [x_i^{(1)} - \bar x^{(1)}] - \ldots - \beta_p [x_i^{(p)} - \bar x^{(p)}]\big)^2$, with $\sum_{j=1}^p \beta_j^2 \leq \kappa$.

37 A biased alternative: penalized regression
Constrained least-squares (equivalent form): the same $SC(\beta)$, with $\sum_{j=1}^p \beta_j^2 = \kappa$.

38 A biased alternative: penalized regression
Constrained least-squares (Lagrange multiplier):
$SC(\beta; \lambda) = \sum_{i=1}^n \big(Y_i - \bar Y - \beta_1 [x_i^{(1)} - \bar x^{(1)}] - \ldots - \beta_p [x_i^{(p)} - \bar x^{(p)}]\big)^2 + \lambda \sum_{j=1}^p \beta_j^2$

39-43 A biased alternative: penalized regression
[Figure sequence: OLS fit of a regression model for LMP (Lean Meat Percentage), $\hat\beta = 0.863$, followed by ridge fits of the penalized residual sum of squares for increasing values of $\lambda$.]

44-46 Bias-variance trade-off
For a given $\lambda$:
$\hat\beta_\lambda = \Big[S_{xx} + \dfrac{\lambda}{n} I_p\Big]^{-1} s_{xy} = \Big[S_{xx} + \dfrac{\lambda}{n} I_p\Big]^{-1} S_{xx} \hat\beta$ [provided $S_{xx}^{-1}$ exists]
The ridge estimator is biased: $b_\lambda = E_x\big[\hat\beta_\lambda\big] - \beta \neq 0$ [shrinkage estimation]...
... but $\hat\beta_\lambda$ has a smaller variance than $\hat\beta$.
Bias-variance trade-off: $\lambda$ is chosen so that $\mathrm{RMSEP}(\lambda)$ is the smallest possible, which can be achieved by cross-validation.
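In R, ridge estimation with $\lambda$ chosen by cross-validation is commonly done with the glmnet package (one possible choice, assumed installed); a sketch on simulated data:

```r
library(glmnet)
set.seed(6)
n <- 60; p <- 40                          # p comparable to n: OLS is very unstable
X <- matrix(rnorm(n * p), n, p)
y <- 1 + X[, 1] - 2 * X[, 2] + rnorm(n)

cv_ridge <- cv.glmnet(X, y, alpha = 0)    # alpha = 0: l2 (ridge) penalty, 10-fold CV
cv_ridge$lambda.min                       # lambda minimizing the cross-validated MSEP
coef(cv_ridge, s = "lambda.min")          # shrunken coefficients
```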

47 Shrinkage and selection: LASSO regression
Constrained least-squares:
$SC(\beta; \lambda) = \sum_{i=1}^n \big(Y_i - \bar Y - \beta_1 [x_i^{(1)} - \bar x^{(1)}] - \ldots - \beta_p [x_i^{(p)} - \bar x^{(p)}]\big)^2 + \lambda \sum_{j=1}^p |\beta_j|$

48-51 Shrinkage and selection: LASSO regression
[Figure sequence: LASSO fits of a regression model for LMP; penalized residual sum of squares against $\beta$ for increasing values of $\lambda$.]

52 Shrinkage and selection: LASSO regression
Constrained least-squares:
$SC(\beta; \lambda) = \sum_{i=1}^n \big(Y_i - \bar Y - \beta_1 [x_i^{(1)} - \bar x^{(1)}] - \ldots - \beta_p [x_i^{(p)} - \bar x^{(p)}]\big)^2 + \lambda \sum_{j=1}^p |\beta_j|$
The LASSO estimator is also biased: $b_\lambda = E_x\big[\hat\beta_\lambda\big] - \beta \neq 0$ [shrinkage estimation]...
... but $\hat\beta_\lambda$ has a smaller variance than $\hat\beta$...
... and increasing $\lambda$ kills the $\beta$s, setting some of them exactly to zero [selection].
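The LASSO is obtained from the same cv.glmnet() machinery with alpha = 1 (the default); a sketch on simulated data, illustrating the selection effect:

```r
library(glmnet)
set.seed(6)
n <- 60; p <- 40
X <- matrix(rnorm(n * p), n, p)
y <- 1 + X[, 1] - 2 * X[, 2] + rnorm(n)

cv_lasso <- cv.glmnet(X, y, alpha = 1)      # alpha = 1: l1 (LASSO) penalty
coef(cv_lasso, s = "lambda.min")            # many coefficients exactly zero [selection]
plot(cv_lasso$glmnet.fit, xvar = "lambda")  # paths: increasing lambda kills the betas
```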

53 Settings for supervised classification
Let $Y$ be a 2-class variable whose levels are coded $\pm 1$.
Linear supervised classification consists in deriving a prediction rule for $Y$ based on a linear score: if $\hat\beta_0 + \hat\beta_1 x_1 + \ldots + \hat\beta_p x_p \geq 0$, then $\hat Y = +1$.
In Bayes linear classification, the linear score is related to $Y$ by the logistic regression model of $\pi_x = P_x(Y = +1)$:
$\log \dfrac{P_x(Y = +1)}{P_x(Y = -1)} = \mathrm{logit}(\pi_x) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p$.

54 Fitting of a classification rule
Based on the joint observations of $x_i = (x_{i1}, \ldots, x_{ip})$ and $y_i = \pm 1$, $\hat\beta = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p)$ minimizes the deviance:
$D(\beta) = -2 \log L(\beta)$ [$L(\beta)$ is the likelihood]
$= -2 \sum_{i=1}^n \log P_{x_i}(Y_i = y_i) = 2 \sum_{i=1}^n \log\big(1 + \exp\big[-y_i(\beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip})\big]\big)$
Assessment of the fit:
- Residual deviance: $D_r = D(\hat\beta)$;
- AIC: $D_r + 2(p + 1)$;
- BIC: $D_r + \log(n)(p + 1)$.
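This is exactly what glm(..., family = binomial) fits; a sketch on simulated data (note that glm() expects a 0/1 coding of $Y$ rather than $\pm 1$):

```r
set.seed(7)
n <- 100
x <- rnorm(n)
y01 <- rbinom(n, 1, plogis(-0.5 + 2 * x))   # simulated 0/1 responses

fit <- glm(y01 ~ x, family = binomial)
deviance(fit)   # residual deviance D_r = D(beta-hat)
AIC(fit)        # D_r + 2 (p + 1)
BIC(fit)        # D_r + log(n) (p + 1)
```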

55 Assessment of the classification performance
Let $S_1 = \{i : Y_i = +1,\ i = 1, \ldots, n\}$ and $n_1 = \#S_1$.
Misclassification rate: $\dfrac{1}{n} \#\{\hat Y_i \neq Y_i,\ i = 1, \ldots, n\}$.
False positive rate = 1 − specificity: $\mathrm{FPR} = \dfrac{\#\{\hat Y_i = +1,\ i \notin S_1\}}{n - n_1}$.
False negative rate = 1 − sensitivity: $\mathrm{FNR} = \dfrac{\#\{\hat Y_i = -1,\ i \in S_1\}}{n_1}$.
For the classification rules $[\hat\pi_x \geq p \Rightarrow \hat Y = +1]$, the ROC curve displays the increase of sensitivity$(p)$ along FPR$(p)$.
The larger the Area Under the ROC Curve (AUC), the more discriminative the $x$ profile.
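A sketch of these performance measures on a simulated logistic fit, using the pROC package (one common choice, assumed installed) for the ROC curve and AUC:

```r
library(pROC)
set.seed(7)
n <- 100
x <- rnorm(n)
y01 <- rbinom(n, 1, plogis(-0.5 + 2 * x))
fit <- glm(y01 ~ x, family = binomial)

pi_hat <- fitted(fit)                # estimated pi_x = P_x(Y = 1)
y_hat  <- as.numeric(pi_hat >= 0.5)  # classification rule with threshold p = 0.5
mean(y_hat != y01)                   # misclassification rate

roc_obj <- roc(y01, pi_hat)   # sensitivity against 1 - specificity over thresholds
auc(roc_obj)                  # area under the ROC curve
plot(roc_obj)
```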

56 Penalized logistic regression
$\ell_1$-penalized deviance minimization (LASSO): $D(\beta; \lambda) = D(\beta) + \lambda \sum_{j=1}^p |\beta_j|$
or $\ell_2$-penalized deviance minimization (ridge): $D(\beta; \lambda) = D(\beta) + \lambda \sum_{j=1}^p \beta_j^2$
$\lambda$ is chosen by minimization of the BIC or of the cross-validated misclassification rate, or by maximization of the cross-validated AUC.
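Penalized logistic regression is again available through glmnet (assumed installed), with family = "binomial"; a sketch with $\lambda$ selected by the cross-validated AUC:

```r
library(glmnet)
set.seed(8)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
y01 <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))

cv <- cv.glmnet(X, y01, family = "binomial", type.measure = "auc")  # lambda by cv'd AUC
coef(cv, s = "lambda.min")
# type.measure = "class" selects lambda by the cv'd misclassification rate;
# alpha = 0 would give the l2 (ridge) penalty instead of the default l1 (LASSO)
```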

57 Outline
1. Regression modeling: Why regression? Fitting regression models; Aims
2. Variable selection: Model comparison; Model selection
3. Penalized regression: Shrinkage; LASSO regression; Penalized 2-class supervised classification
4. Rank-reduced regression: Regression on latent variables; Partial Least Squares

58-59 Latent predicting variable
If $x^* = (x_1^*, \ldots, x_p^*)$ is the profile of scaled predictors, $x_{ij}^* = \dfrac{x_{ij} - \bar x_j}{s_j}$, then the univariate regression models
$Y = \beta_0 + \beta_j x_j^* + \varepsilon_j$, where $\hat\beta_0 = \bar Y$ and $\hat\beta_j = s_{x_j^* y}$,
can intuitively be compared by their $R_j^2 = \hat\beta_j^2 / s_y^2$.
Latent predicting variable: $t = \alpha_1 x_1^* + \ldots + \alpha_p x_p^*$ such that $s_{ty}^2$ is maximal among linear combinations with $\alpha_1^2 + \ldots + \alpha_p^2 = 1$.

60-62 Dimensionality of a regression model
Extraction of latent variables:
- $\alpha^{(1)}$ such that $s_{t_1,Y}^2$ is maximal;
- $\alpha^{(2)}$ such that $s_{t_1,t_2} = 0$ and $s_{t_2,Y}^2$ is maximal;
- ...
- $\alpha^{(k)}$ such that $s_{t_k,t_j} = 0$ for $j < k$ and $s_{t_k,Y}^2$ is maximal.
[Path diagram: Y regressed on latent components T1, T2, themselves built from X1, ..., X9]
Dimensionality: the number $k$ of latent variables which guarantees the optimal precision of prediction.
- If $k = \min(n, p)$, regression on the latent variables is equivalent to the OLS fit of the full regression model.
- If $k < \min(n, p)$, regression on the latent variables is called Partial Least Squares (PLS); $k$ shall be determined by cross-validation.
- PLS with $k$ components is always better than Principal Component Regression (PCR) with $k$ components.
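PLS and PCR with $k$ chosen by cross-validation are available in the pls package (one common choice, assumed installed); a sketch on simulated data:

```r
library(pls)
set.seed(9)
n <- 60; p <- 30
X <- matrix(rnorm(n * p), n, p)
y <- 1 + rowSums(X[, 1:5]) + rnorm(n)
dat <- data.frame(y = y, X = I(X))   # keep X as a single matrix column

fit_pls <- plsr(y ~ X, ncomp = 10, data = dat, validation = "CV", scale = TRUE)
fit_pcr <- pcr(y ~ X, ncomp = 10, data = dat, validation = "CV", scale = TRUE)
RMSEP(fit_pls)   # cross-validated RMSEP for k = 0, ..., 10 components
RMSEP(fit_pcr)   # PCR typically needs more components than PLS for the same RMSEP
```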
