Linear Regression
Ivo Ugrina
King's College London // University of Zagreb // University of Split
September 29, 2016

Regression as a term
The term "regression" goes back to Francis Galton, who observed the phenomenon of regression toward the mean. [Figure: Galton]

Basic structure
Y = f(X) + ε
ε is a random error term with zero mean, independent of X.

Simple Linear Regression

Model
Y = β_0 + β_1 X + ε,  so that  Y ≈ β_0 + β_1 X
- ε is a random error term with zero mean, independent of X
- β_0 and β_1 are unknown constants representing the intercept and the slope; they are often called coefficients or parameters

Estimating coefficients
- First, we need data: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), with X = (x_1, ..., x_n) and Y = (y_1, ..., y_n)
- Given the data, the goal is to obtain coefficient estimates ˆβ_0 and ˆβ_1 for which the linear model ˆY = ˆβ_0 + ˆβ_1 X fits the data as closely as possible

Closeness
- To fit the data as closely as possible we first need a measure of closeness
- There are many possible measures; the most intuitive one (to me :) ) is the least absolute deviation
    LAD = Σ_{i=1}^n |y_i − ˆβ_0 − ˆβ_1 x_i|
- However, the measure should not only be intuitive, it should also be practical (easy to compute)
- Therefore, the most common approach for the past 200 years has been to minimize the residual sum of squares
    RSS = Σ_{i=1}^n (y_i − ˆβ_0 − ˆβ_1 x_i)²

(Ordinary) Least Squares (OLS)
The least-squares method is usually credited to Carl Friedrich Gauss (1795), but it was first published by Adrien-Marie Legendre.

OLS
The analytic solution to OLS is (easily) obtainable with a bit of calculus and is given by
  ˆβ_1 = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)²
  ˆβ_0 = ȳ − ˆβ_1 x̄
where ȳ = (1/n) Σ_{i=1}^n y_i and x̄ = (1/n) Σ_{i=1}^n x_i.

Randomness of least-squares lines
[Figure placeholder: least-squares lines fitted to repeated samples, illustrating the concept of unbiasedness.]

How close are the ˆβs to the βs?
  SE(ˆβ_0)² = σ² [ 1/n + x̄² / Σ_{i=1}^n (x_i − x̄)² ]
  SE(ˆβ_1)² = σ² / Σ_{i=1}^n (x_i − x̄)²
where σ² = Var(ε). Assumption: the errors are uncorrelated with common variance σ².

Do we know σ²?
- In most real-world problems σ² is unknown
- But it can be estimated from the data
- This estimate is known as the residual standard error and is given by RSE = √(RSS/(n − 2))
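A minimal sketch (added here, not part of the original slides) of these closed-form simple-regression formulas in Python/NumPy; the function name and variables are illustrative.

    import numpy as np

    def simple_ols(x, y):
        """Closed-form simple OLS with standard errors and RSE."""
        n = len(x)
        x_bar, y_bar = x.mean(), y.mean()
        sxx = np.sum((x - x_bar) ** 2)
        b1 = np.sum((x - x_bar) * (y - y_bar)) / sxx
        b0 = y_bar - b1 * x_bar
        rss = np.sum((y - b0 - b1 * x) ** 2)
        rse = np.sqrt(rss / (n - 2))                           # estimate of sigma
        se_b1 = np.sqrt(rse ** 2 / sxx)
        se_b0 = np.sqrt(rse ** 2 * (1 / n + x_bar ** 2 / sxx))
        return b0, b1, se_b0, se_b1, rse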

Confidence intervals
Using the standard errors SE(ˆβ_1) and SE(ˆβ_0) we can derive (1 − α)-confidence intervals.
For β_i:  ˆβ_i ± t_{n−2}(1 − α/2) · SE(ˆβ_i)
Assumption: the errors are Gaussian.

Is there a relationship between X and Y?
The question of a relationship between the input variable and the outcome can be written as a hypothesis test:
  H_0: β_1 = 0  (there is no relationship)
  H_a: β_1 ≠ 0  (there is a relationship)

(Wald) test
If there is no relationship between X and Y, the statistic
  T = (ˆβ_1 − 0) / SE(ˆβ_1)
will have a t-distribution with n − 2 degrees of freedom if we have estimated the variance from the data. Otherwise, if the variance is known, the distribution will be standard normal.
Assumption: the errors are Gaussian.

Accuracy of the fit - RSE
- A measure of the lack of fit of the model
- Gives us an estimate of the average amount by which the response deviates from the true regression line
- If the predictions are very close to the true outcome values, RSE will be small
- If the predictions are very far from one or more true outcomes, RSE will be quite large
- Measured in units of Y, so it is not (always) clear what a good RSE is
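A small illustration (not from the slides) of the t-based test and confidence interval above, using SciPy; the numeric values of b1, se_b1 and n are made up for the example.

    from scipy import stats

    b1, se_b1, n = 0.42, 0.10, 50                    # hypothetical estimates
    t_stat = b1 / se_b1                              # T = (b1 - 0) / SE(b1)
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value
    alpha = 0.05
    half_width = stats.t.ppf(1 - alpha / 2, df=n - 2) * se_b1
    ci = (b1 - half_width, b1 + half_width)          # (1 - alpha) confidence interval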

Accuracy of the fit - R²
- Independent of the scale of Y
- Always between 0 and 1
- Given by
    R² = (TSS − RSS) / TSS = 1 − RSS/TSS
  where TSS = Σ_{i=1}^n (y_i − ȳ)² is the total sum of squares
- Therefore, R² measures the proportion of variability in Y that can be explained with X
- It can be (and usually is) quite challenging to determine what a good R² is; it mostly depends on the field (biologists are sometimes happy with quite small values of R²)

(Multiple) Linear Regression

Why?
- Many real-world problems need more than one variable (predictor) to explain a phenomenon (outcome)
- Should we run a simple linear regression for each variable? If so, how do we predict based on multiple models while still taking all variables into account?
- What if there are synergies between variables (the coffee/sugar example)?

Model
Y = β_0 + β_1 X_1 + ... + β_p X_p + ε
- ε is a random error term with zero mean, independent of the X_i
- β_0 and β_j, j = 1, ..., p, are unknown constants representing the intercept and the slopes; often called coefficients or parameters
- β_j quantifies the association between X_j and the outcome (response)
- β_j can be interpreted as the average effect on Y of a one-unit increase in X_j, holding all other variables fixed
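Referring back to the R² definition above, a small sketch (mine, not the lecturer's) of computing it from observations and fitted values; names are illustrative.

    import numpy as np

    def r_squared(y, y_hat):
        """R^2 = 1 - RSS/TSS: the proportion of variance in y explained by the fit."""
        rss = np.sum((y - y_hat) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        return 1 - rss / tss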

Matrix notation
y = Xβ + ε, where
  X is the n × (p + 1) matrix with rows (1, x_i1, x_i2, ..., x_ip), i = 1, ..., n,
  β = (β_0, β_1, ..., β_p)ᵀ and y = (y_1, y_2, ..., y_n)ᵀ.

Estimating coefficients
- As with simple linear regression, the coefficients are unknown and have to be estimated
- The parameters (coefficients) are estimated using the same principle as in simple linear regression; namely, we want to minimize
    RSS = Σ_{i=1}^n (y_i − ˆβ_0 − ˆβ_1 x_i1 − ˆβ_2 x_i2 − ... − ˆβ_p x_ip)² = ||y − X ˆβ||²_2
- We denote by ˆβ_j the coefficients that minimize RSS

OLS solution
An analytic (unique) solution to OLS for multiple linear regression is also available, provided that the columns of X are linearly independent (which requires n > p), and it has the form (in matrix notation)
  ˆβ = (XᵀX)⁻¹ Xᵀ y

Prediction accuracy vs model interpretability/inference
- Are we interested only in the results (predictions) of the model?
- Can we easily obtain all predictors?
- Is it easy to obtain lots of data?
- How flexible are we computationally?
- How big is the penalty for bad predictions?
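A minimal sketch (added here, not in the slides) of the matrix OLS solution above; in practice a least-squares solver is preferred over forming (XᵀX)⁻¹ explicitly.

    import numpy as np

    def ols_fit(X, y):
        """OLS for multiple linear regression; X is n x p without the intercept column."""
        X1 = np.column_stack([np.ones(len(y)), X])          # prepend the intercept
        beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)   # solves min ||y - X1 b||_2
        return beta_hat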

Is at least one predictor useful in predicting the response?
We can rephrase this question in the language of hypothesis testing:
  H_0: β_1 = β_2 = β_3 = ... = β_p = 0
  H_A: at least one β_j is non-zero

F-statistic
This hypothesis test can be resolved using the F statistic
  F = [(TSS − RSS)/p] / [RSS/(n − p − 1)]
which under the null hypothesis and Gaussian errors is distributed according to an F-distribution.
It can be shown that even if the errors are not normally distributed, F will approximately follow an F-distribution if the sample size is big enough.

How small/large should F or T be?
- The probability of obtaining a value more extreme than F or T (under the null hypothesis) is often called the p-value (in general, p-values have a broader definition)
- How small should a p-value be? A small one indicates that it is unlikely to observe such an association just due to chance (in the absence of any real association between predictors and the response)
- The curse of multiple p-values: a story of multiple Ts and one F

Important variables / features (selection)
- If we have obtained a significant F-statistic, we may wonder which variables exactly are connected with the response
- The task of determining which variables are associated with the response is referred to as variable or feature selection
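A small illustration (not from the slides) of the overall F-test defined above; the helper name and arguments are my own.

    import numpy as np
    from scipy import stats

    def overall_f_test(y, y_hat, p):
        """F-test of H0: beta_1 = ... = beta_p = 0, assuming Gaussian errors."""
        n = len(y)
        rss = np.sum((y - y_hat) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        f = ((tss - rss) / p) / (rss / (n - p - 1))
        p_value = stats.f.sf(f, p, n - p - 1)   # P(F_{p, n-p-1} > f)
        return f, p_value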

A naive approach
- Why not try all possible subsets of variables and see which model gives the best results?
- For p = 2 this might seem plausible, but what if we have, say, p = 20? In that scenario we would have to check (and build) 2^p = 1,048,576 models :(
- Not practical in real life

The best model?
- An additional problem is the definition of the best model
- We have to choose the best model within each group of models with the same number of variables, and then between those groups

R² for MLR
- R² can be defined in the same way for MLR as for SLR, as the fraction of variance explained, and could be used as a measure for choosing the best model
- However, R² always improves when more variables are added to the model
- Additionally, what is a substantial improvement in R²?
- Ideally, we would like to estimate the quality of a model/fit on test (unseen) samples instead of the training samples

RSE for MLR
- RSE can be defined in a similar fashion for MLR as for SLR,
    RSE = √(RSS/(n − p − 1)),
  and could be used as a measure for choosing the best model
- What is a substantial improvement in RSE? How expensive is it to keep a variable in the model?
- Ideally, we would like to estimate the quality of a model/fit on test (unseen) samples instead of the training samples

Adjusted R²
- Helpful in adjusting the training error for the model size
- A very popular approach for selecting among a set of models that contain different numbers of variables
- The adjusted R² statistic is calculated (for a model with d variables) as
    Adjusted R² = 1 − [RSS/(n − d − 1)] / [TSS/(n − 1)]
  where TSS = Σ_{i=1}^n (y_i − ȳ)²
- Unlike the R² statistic, the adjusted R² statistic does take into account (penalizes) the inclusion of unnecessary variables in the model

(Mallow's) C_p
- For a model with d predictors, C_p is defined as
    C_p = (1/n) (RSS + 2 d ˆσ²)
  where ˆσ² is an estimate of the variance of the error ε
- 2 d ˆσ² is a penalty term to adjust for the fact that the training error usually underestimates the real (test) error
- Theoretically it is justified through asymptotic theory (sample size n very large)
- Lower C_p is better

Akaike information criterion (AIC)
- Defined for a large class of models fit by the maximum likelihood approach
- In the case of Gaussian errors, maximum likelihood and least squares give the same estimates
- AIC (for Gaussian errors) is defined by
    AIC = (1/(n ˆσ²)) (RSS + 2 d ˆσ²)
- Proportional to C_p

MLE and OLS (digression)
In the model where ε ~ N(0, σ²), the log-likelihood is given (up to an additive constant) by
  −(n/2) log(σ²) − (1/(2σ²)) Σ_{i=1}^n (y_i − x_iᵀβ)²
Therefore, maximization of the log-likelihood leads to minimization of RSS.
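A sketch (mine) of these model-size-adjusted criteria, following the definitions used on these slides; sigma2_hat is an externally supplied estimate of the error variance.

    import numpy as np

    def model_criteria(y, y_hat, d, sigma2_hat):
        """Adjusted R^2, Mallow's C_p and (Gaussian) AIC for a fit with d predictors."""
        n = len(y)
        rss = np.sum((y - y_hat) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
        cp = (rss + 2 * d * sigma2_hat) / n
        aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)
        return adj_r2, cp, aic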

BIC
- For a model with d predictors, BIC is defined as
    BIC = (1/n) (RSS + log(n) d ˆσ²)
- BIC is derived from a Bayesian point of view
- It places a heavier penalty on models with more variables and therefore results in the selection of smaller models than C_p
- Lower BIC is better

Cross-validation (CV)
- A way to directly estimate the test (real) error
- Makes fewer assumptions about the true underlying model than C_p, AIC, BIC and adjusted R²
- Can be used in a wider range of model selection tasks (e.g. when it is hard to estimate the error variance σ²)
- Due to computational improvements over the last years, CV is slowly becoming the standard approach for selecting from among a number of models under consideration

Forward stepwise (subset) selection
- Start with the null model M_0, which contains no variables/predictors
- For k = 0, 1, ..., p − 1:
    Consider all p − k models that augment the predictors in M_k with one additional predictor.
    Choose the best among these models, called M_{k+1}, where best is defined as the one with the smallest RSS.
- Select a single best model among M_0, M_1, ..., M_p using one of the previously described measures (C_p, AIC, BIC, adjusted R², CV)

Forward selection comments
- Faster than best subset selection
- Not guaranteed to find the best subset (what's been included stays included)
- Can be applied in the high-dimensional setting where n < p, but only up to n variables
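A compact sketch (not from the original slides) of forward stepwise selection by RSS; it returns the whole path so one of the criteria above (or CV) can pick the final model size.

    import numpy as np

    def forward_stepwise(X, y):
        """Greedy forward selection by RSS; returns [(selected columns, RSS), ...]."""
        n, p = X.shape
        selected, remaining, path = [], list(range(p)), []
        for _ in range(p):
            best_rss, best_j = np.inf, None
            for j in remaining:
                X1 = np.column_stack([np.ones(n), X[:, selected + [j]]])
                beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
                rss = np.sum((y - X1 @ beta) ** 2)
                if rss < best_rss:
                    best_rss, best_j = rss, j
            selected.append(best_j)
            remaining.remove(best_j)
            path.append((list(selected), best_rss))
        return path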

Backward stepwise (subset) selection
- Start with the full model M_p, containing all p variables/predictors
- For k = p, p − 1, ..., 1:
    Consider all k models that contain all but one of the predictors in M_k.
    Choose the best among these k models, called M_{k−1}, where best is defined as the one with the smallest RSS.
- Select a single best model among M_0, M_1, ..., M_p using one of the previously described measures (C_p, AIC, BIC, adjusted R², CV)

Backward selection comments
- Faster than best subset selection
- Not guaranteed to find the best subset (what's been excluded stays excluded)
- Cannot be applied in the high-dimensional setting where n < p

Mixed/Hybrid stepwise (subset) selection
- Add variables sequentially (as in forward selection)
- After adding each new variable, try to remove any variables in the model that are no longer providing an improvement in the model fit
- Select a single best model among M_0, M_1, ..., M_p using one of the previously described measures (C_p, AIC, BIC, adjusted R², CV)

Mixed/Hybrid selection comments
- Not as fast as forward/backward selection
- Hard to decide when to move back (remove a variable)

Gauss-Markov theorem
Assumptions:
- E(ε_i) = 0
- Var(ε_i) = σ² < ∞ (homoscedasticity)
- Cov(ε_i, ε_j) = 0 for i ≠ j
Under these assumptions the OLS estimator is BLUE (Best Linear Unbiased Estimator).

Challenges (or problems?)

Qualitative variables
- Variables like Gender, Hospital, ...
- We can use dummy variables in a regression model, e.g.
    x = 1 if the person is female, x = 0 if the person is male
- For qualitative variables with k > 2 levels (values) we can create k − 1 dummy variables
- Sometimes it also makes sense to run separate regressions per level of the qualitative variable (different errors per level?)

Repeated measurements
- A study involving the same patients over a period of time (BMI, heart rate, ...)
- Not suitable for classical linear regression due to dependencies between measurements
- Take a look at Linear Mixed Effects (LME) models

Synergies/Interactions between variables
- One way of extending the linear model is to allow interaction effects to be included as an additional variable, e.g.
    Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_1 X_2 + ε
- If an interaction term is significant, as a rule we should always include the main effects/variables in the model even though they (by themselves) might not be significant

Non-linear relationship
- Sometimes we can justify a linear model by changing our predictors a bit
- For example, going from Y = β_0 + β_1 X_1 to Y = β_0 + β_1 X_1 + β_2 X_1² keeps us within linear regression but gives more flexibility

Correlation between errors
- An important assumption for many results in linear regression models is that the error terms ε_1, ..., ε_n are uncorrelated
- Standard errors for the estimated coefficients (and the subsequent hypothesis tests) are based on the idea of uncorrelated error terms
- Fields like time series analysis are devoted to models with correlated errors
- To mitigate the risk of correlation between error terms, a proper EXPERIMENTAL DESIGN is crucial

Heteroscedasticity
- Another important assumption in linear regression is that the error terms have a constant variance
- Standard errors, confidence intervals and hypothesis tests for linear regression rely upon this assumption
- Sometimes variables can be transformed (a classical example is using the log function) to reduce heteroscedasticity
- Techniques based on weighting also prove useful; in particular, weighted least squares is useful for proportional variances with known proportionality constants w_i. The goal is to minimize
    Σ_{i=1}^n w_i (y_i − ˆβ_0 − ˆβ_1 x_i1 − ˆβ_2 x_i2 − ... − ˆβ_p x_ip)²
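A minimal weighted-least-squares sketch (my addition), using the standard trick of rescaling rows by √w_i; names are illustrative.

    import numpy as np

    def weighted_least_squares(X, y, w):
        """Minimize sum_i w_i (y_i - x_i^T beta)^2 by rescaling each row by sqrt(w_i)."""
        X1 = np.column_stack([np.ones(len(y)), X])
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(X1 * sw[:, None], y * sw, rcond=None)
        return beta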

Outliers
- Hard to define (hard to agree on a definition)
- In the real world outliers are quite common (depending on the definition :) )
- Extremely hard to deal with; if we have no solid proof that they are artificial, it is best to leave them in the model

Leverage points
- Observations with unusual values of x_i
- Can be avoided with proper experimental designs
- In order to quantify an observation's leverage we can use the leverage statistic; for simple linear regression it is defined as
    h_i = 1/n + (x_i − x̄)² / Σ_{j=1}^n (x_j − x̄)²
- A large value indicates high leverage

(Multi)Collinearity
- Collinearity refers to a model where two or more variables are dependent on each other (related)
- For example, height and weight, BMI and weight, ...
- Reduces the accuracy of the estimates of the regression coefficients and therefore increases their standard errors
- Often assessed with the variance inflation factor (VIF)
- There are two classical solutions to the problem:
    Remove one of the correlated variables; rationale: the other already carries the information.
    Combine the correlated variables into one (if it can be justified in the real world).

p ≫ n
- Often we have a study in which we have measured a lot of variables but have only a small number of participants
- Genomic research, web-based data (a small web shop), ...
- OLS does not have a unique solution for p ≥ n
- We need special models that fit this high-dimensional setting
- Popular solutions are based on regularization techniques (LASSO, ridge, elastic net)

Logistic Regression
Ivo Ugrina
King's College London // University of Zagreb // University of Split
September 29, 2016

What and how
- We would like to use ideas from linear regression for discrete (categorical) outputs like Y ∈ {0, 1}
- However, linear regression with quantitative inputs (variables) returns quantitative outputs, i.e. elements of (−∞, +∞)
- What if we could translate the task of getting a discrete output into obtaining probabilities (numerical values) for these outputs?
- We would still have a problem, since probabilities take values from 0 to 1
- How about additionally transforming the probabilities to cover the space from −∞ to +∞?

Simple Logistic Regression

Model
For an outcome Y with only two possible values and one variable/predictor, we define the simple logistic regression model as
  ln[ p(x) / (1 − p(x)) ] = β_0 + β_1 x
where p(x) is the probability of obtaining one of the two possible values of the outcome given x, i.e. P(Y = 1 | x).
The transformation is usually called the logit transformation. Solving for p(x) gives
  p(x) = e^{β_0 + β_1 x} / (1 + e^{β_0 + β_1 x})
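A tiny sketch (not from the slides) of the logit transformation and its inverse, with illustrative names.

    import numpy as np

    def logit(p):
        """Log-odds: ln(p / (1 - p))."""
        return np.log(p / (1 - p))

    def p_of_x(x, b0, b1):
        """Inverse logit: P(Y = 1 | x) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
        eta = b0 + b1 * x
        return np.exp(eta) / (1 + np.exp(eta))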

Likelihood
- Because logistic regression predicts probabilities, rather than just classes, we can fit it using the likelihood
- For each training data point we have a vector of inputs x_i and an observed class y_i; the probability of that class is either p(x_i), for the class encoded as 1 (y_i = 1), or 1 − p(x_i), for the other class encoded as 0 (y_i = 0)
- The likelihood is then
    L(β_0, β_1) = Π_{i=1}^n p(x_i)^{y_i} (1 − p(x_i))^{1 − y_i}

Estimating coefficients
- Coefficients can be estimated by maximizing the likelihood
- No analytical solution is known at the moment; the problem is solved through numerical optimization
- A solution does not always exist, but when it does it is unique

Significance of the coefficients/variables
- After estimating the coefficients, our first look at the fitted model commonly concerns an assessment of the significance of the variables in the model
- This usually involves formulating and testing a statistical hypothesis to determine whether the independent variables in the model are significantly related to the outcome
- If the predicted values with the variable in the model are better, or more accurate in some sense, then we believe that the variable in question is significant

Deviance
- In logistic regression the role of the residual sum of squares is played by the deviance, defined as
    D = −2 ln(likelihood of the fitted model)
- To assess the significance of an independent variable we compare the value of D with and without the independent variable in the equation
- The change in D due to the inclusion of the variable in the model is
    G = D(model without the variable) − D(model with the variable)
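A short sketch (my addition) of the deviance for a fitted vector of probabilities; the difference of two such deviances gives the statistic G described above.

    import numpy as np

    def deviance(y, p_hat):
        """D = -2 * log-likelihood for a binary response y and fitted probabilities p_hat."""
        log_lik = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
        return -2 * log_lik

    # G = deviance(y, p_without_variable) - deviance(y, p_with_variable)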

Hypothesis testing
- Under the hypothesis that β_1 is equal to zero, the statistic G follows a chi-square distribution with 1 degree of freedom
- This is valid if we have enough subjects with both y = 0 and y = 1 and a sufficiently large sample size n
- As a rule of thumb (tested through simulations), what we need in logistic regression is for n_min = min(n_0, n_1) to be about 10 times the number of parameters; some newer papers suggest that 5-9 times might be enough

Wald test
- The assumptions needed are the same as for the deviance-based test
- The Wald test statistic is the ratio of the maximum likelihood estimate of the slope parameter ˆβ_1 to an estimate of its standard error:
    W = ˆβ_1 / SE(ˆβ_1)
- Under the null hypothesis and the sample size assumptions this ratio follows a standard normal distribution
- Believed to be inferior to deviance-based testing

Confidence interval estimation
The confidence interval estimators for the slope and the intercept are most often based on their respective Wald tests and are sometimes referred to as Wald-based confidence intervals.
The endpoints of a 100(1 − α)% confidence interval for the slope coefficient are
  ˆβ_1 ± z_{1−α/2} SE(ˆβ_1)
and for the intercept they are
  ˆβ_0 ± z_{1−α/2} SE(ˆβ_0)
where z_{1−α/2} is the 1 − α/2 quantile of the standard normal distribution.
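An end-to-end illustration (mine, not the lecturer's) of fitting a logistic regression by maximum likelihood and reading off Wald statistics and Wald-based confidence intervals, assuming the statsmodels package is available; the data are simulated.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                                  # toy predictors
    eta = 0.5 + X @ np.array([1.0, -2.0])
    y = (rng.random(200) < 1 / (1 + np.exp(-eta))).astype(int)     # simulated 0/1 response

    res = sm.Logit(y, sm.add_constant(X)).fit(disp=0)   # ML fit via numerical optimization
    wald = res.params / res.bse                          # Wald statistics beta_hat / SE(beta_hat)
    ci = res.conf_int(alpha=0.05)                        # Wald-based 95% confidence intervals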

(Multiple) Logistic Regression

Intro
- As in the case of linear regression, the strength of the multiple logistic regression model is the ability to handle many variables, some of which may be on different measurement scales
- Central to the consideration of the multiple logistic model are estimating the coefficients and testing for their significance

Model
For a collection of p (independent) variables denoted by X_1, ..., X_p (realizations x = (x_1, ..., x_p)), the logit of the multiple logistic regression is given by the equation
  ln[ p(x) / (1 − p(x)) ] = β_0 + β_1 x_1 + β_2 x_2 + ... + β_p x_p

Estimating coefficients
- The method of estimation used in the multivariable case is the same as in the univariable (simple) logistic regression, i.e. maximum likelihood
- As in simple logistic regression, no closed-form/analytical solution is known at the moment
- Solved through numerical optimization

Variances and covariances of estimates
- The method of estimating the variances and covariances of the estimated coefficients follows from the well-developed theory of maximum likelihood estimation
- This theory states that the estimators are obtained from the matrix of second partial derivatives of the log-likelihood function
- Let us denote by I(β) the (p + 1) × (p + 1) matrix containing the negatives of these partial derivatives
- This matrix is called the observed (Fisher) information matrix

Variances and covariances of estimates
- The variances and covariances of the estimated coefficients are obtained from the inverse of this matrix, denoted by Var(β) = I⁻¹(β)
- Estimators of the variances and covariances are obtained by evaluating Var(β) at ˆβ; we denote them by Var(ˆβ)
- Estimated standard errors of the estimated coefficients:
    SE(ˆβ_j) = √(Var(ˆβ_j))
  where Var(ˆβ_j) denotes the j-th diagonal element

Testing the significance of the model
- The deviance-based test for the overall significance of the p coefficients of the independent variables in the model is performed in exactly the same manner as in simple logistic regression
- Under the null hypothesis that the p (slope) coefficients are equal to zero, the distribution of G is chi-square with p degrees of freedom
- Therefore, the test is based on p-values

Confidence interval estimation
The methods used for the confidence interval estimators are essentially the same as for simple logistic regression.
The endpoints of a 100(1 − α)% confidence interval for the slope coefficients are
  ˆβ_j ± z_{1−α/2} SE(ˆβ_j)
and for the intercept they are
  ˆβ_0 ± z_{1−α/2} SE(ˆβ_0)
where z_{1−α/2} is the 1 − α/2 quantile of the standard normal distribution.

Variable selection
- We can take the same approach as with linear regression models, namely forward, backward or mixed selection
- However, instead of basing our selection on the residual sum of squares, the selection should be based on the deviance-based test
- More correctly, since the magnitude of the deviance-based statistic G depends on its degrees of freedom, the procedure should account for possible differences in degrees of freedom between variables (relevant for qualitative variables); therefore, instead of the absolute value of G, the decision should be based on the p-value (e.g. in forward selection by including the variable with the lowest p-value)

Variable selection p-values
- Usually, thresholds for p-values are decided in advance
- Sometimes the thresholds are ignored and the p-value is used only for its comparative advantage relative to the other variables' p-values

Goodness of fit
- Extremely hard to justify :( Everyone has an opinion on this (which might not be that bad)
- Usually based on summary measures like Pearson residuals (leading to Pearson's chi-square test) or deviance residuals
- Confusion tables based on true/false positives/negatives are extremely popular
- Also, an informative and more complete description of classification accuracy (than confusion tables) can be obtained with Receiver Operating Characteristic (ROC) curves
- The take-home message is that the decision on the fit depends on the underlying problem and the intrinsic feel for risk of the data analyst/statistician running the study

Why logit?
- There are other transformations from probabilities to R, like the probit
- Logit has a long tradition in the statistical community
- Log odds are easy to interpret
- Easy to work with (from the mathematical point of view)

Challenges (or problems?)

Qualitative variables
- Variables like Gender, Hospital, ...
- We can use dummy variables in a regression model, e.g.
    x = 1 if the person is female, x = 0 if the person is male
- For qualitative variables with k > 2 levels (values) we can create k − 1 dummy variables

Synergies/Interactions between variables
- One way of extending the logistic regression model is to allow interaction effects to be included as an additional variable, e.g.
    ln[ p(x) / (1 − p(x)) ] = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_1 X_2
- If an interaction term is significant, as a rule we should always include the main effects/variables in the model even though they (by themselves) might not be significant

Non-linear relationship
- Sometimes we can justify linearity in the logistic model by changing our predictors a bit
- For example, going from
    ln[ p(x) / (1 − p(x)) ] = β_0 + β_1 X_1
  to
    ln[ p(x) / (1 − p(x)) ] = β_0 + β_1 X_1 + β_2 X_1²
  leaves us in the linear space but gives more flexibility

Outliers
- Hard to define (hard to agree on a definition)
- In the real world outliers are quite common (depending on the definition :) )
- Extremely hard to deal with; if we have no solid proof that they are artificial, it is best to leave them in the model

p ≫ n
- Often we have a study in which we have measured a lot of variables but have only a small number of participants
- Genomic research, web-based data (a small web shop), ...
- We need special models that fit this high-dimensional setting
- Some adaptations of the ideas behind logistic regression, together with regularization techniques (LASSO, ridge, elastic net), are possible

Regularization and Variable Selection: LASSO, Ridge and Elastic Net
Ivo Ugrina
King's College London // University of Zagreb // University of Split
September 29, 2016

Introduction
- As in the previous lectures, we assume the model Y = f(X) + ε, where ε is a random (noise) variable with mean 0 and variance σ²
- In (classical) linear regression we were interested in finding the best (good enough) estimate ˆf of the function f, where
    ˆf(x) = xᵀβ = β_0 + β_1 x_1 + ... + β_p x_p
- A standard approach was/is to use the least squares estimator
- However, we have seen that the least squares estimator has a number of restrictions; one of the most obvious ones is that it has problems with multicollinearity

Quality of the fit
- Functions that fit our data well are not necessarily good for unseen data (think of polynomial functions with a perfect fit)
- If we were to sample new data (x_i, y_i), then our ˆf, to be a good model, would have to have values ˆf(x_i) close to the real values y_i
- The discrepancy between ˆf(x_i) and y_i is often called the prediction error (PE) and is often measured through the mean squared prediction error
    MSPE = Σ_{i=1}^n E[(y_i − ˆf(x_i))²]

Variance/Bias tradeoff
- Like the mean squared error, the MSPE can be decomposed into positive terms (variance/bias):
    MSPE = Σ_{i=1}^n [ (E(ˆf(x_i)) − f(x_i))² + Var(ˆf(x_i)) ]  (plus the irreducible error coming from ε)
- Introducing a little bias in an estimator of β might lead to a substantial decrease in variance, and thus a decrease in the prediction error
- Therefore, good estimators should on average have small prediction errors

Ridge regression
- By controlling how large the coefficients β may become (called regularization) we can control the variance
- One way to do this is to define a constrained space in which we search for the coefficients

Model
Ridge regression is defined with the following constraint:
  minimize Σ_{i=1}^n (y_i − βᵀx_i)²  subject to  Σ_{j=1}^p β_j² ≤ t
Convention (important!): the columns of X are assumed to be standardized (mean 0, variance 1) and y is assumed to be centered (mean 0).

Penalized notation
The problem can also be expressed as classical least squares with the addition of a penalty term:
  minimize Σ_{i=1}^n (y_i − βᵀx_i)² + λ Σ_{j=1}^p β_j²
- λ ≥ 0 is a tuning parameter that controls the strength; there is a one-to-one correspondence between t and λ
- The problem is (strictly) convex and has a unique solution
- An interesting thing is that this solution might have a smaller prediction error than OLS
- The reason for scaling (standardizing) our variables is that the penalty Σ_{j=1}^p β_j² would be unfair if the variables weren't on the same scale

Analytical solution
An analytical solution can be derived for the ridge estimator, of the form
  ˆβ_ridge = (XᵀX + λI)⁻¹ Xᵀ y
Since we are adding a positive constant to the diagonal of XᵀX, we are, in general, producing an invertible matrix XᵀX + λI even if XᵀX is singular. Historically, this particular aspect of ridge regression was the main motivation behind the adoption of this extension of OLS theory.
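A minimal sketch (added here) of the ridge closed form, assuming X is already standardized and y centered as the slides require; the function name is my own.

    import numpy as np

    def ridge_fit(X, y, lam):
        """Ridge estimator (X^T X + lambda*I)^{-1} X^T y for standardized X, centered y."""
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)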

Connection to OLS
The ridge regression estimator is related to the classical OLS estimator ˆβ_OLS in the following manner:
  ˆβ_ridge = [I + λ(XᵀX)⁻¹]⁻¹ ˆβ_OLS
Moreover, when X is composed of orthonormal variables, so that XᵀX = I_p, it follows that
  ˆβ_ridge = ˆβ_OLS / (1 + λ)

Tuning parameter λ
- The solution of the minimization problem obviously depends on the tuning parameter λ
- Therefore, we have a solution for every λ > 0
- λ is called the shrinkage parameter
- λ controls the size of the coefficients and the amount of regularization
- For λ = 0 we recover the least squares estimate
- As λ increases, the coefficients shrink towards zero
- We need a way to select an optimal λ

Bias and variance of ridge regression
- There is a closed-form solution for the bias and variance terms
- It is not (too) hard (it needs a few hours/days :)) to prove that ridge regression gives biased estimates
- Usually, as λ increases the bias increases, and as λ increases the variance decreases

Helpful with multicollinearity?
- One of the problems of OLS is that it is not really useful in situations with collinearity, due to the large increase in the variance of the estimator
- By biasing the estimates towards zero (penalizing), ridge regression circumvents this problem
- Yet another part where the presenter should wave his hands a bit, and explain something too ;)

Ridge trace
[Figure: ridge trace of the coefficient estimates as λ varies; source: MathWorks]

Choosing λ
The most common approach is K-fold cross-validation:
- Partition the training data T into K separate sets (T_1, T_2, ..., T_K) of (almost) equal size
- For each k = 1, 2, ..., K, fit the model ˆf_k^λ to the training set excluding the k-th fold T_k
- Compute the fitted values for the observations in T_k, based on the training data that excluded this fold
- Compute the cross-validation error (CVE) for the k-th fold:
    CVE_k^λ = (1/|T_k|) Σ_{(x,y)∈T_k} (y − ˆf_k^λ(x))²

Choosing λ (part 2)
The overall cross-validation error is then
  CVE^λ = (1/K) Σ_{k=1}^K CVE_k^λ
Select λ as the one with the minimum CVE^λ.

Assessing the stability
- It is often the case that we are interested in how good ridge regression is for our problem, or how stable the λ coefficients are (under resampling)
- In this case we can employ a nested cross-validation: we first split the data set into folds representing test data, and with the remaining folds we employ cross-validation as previously described
- This approach is sometimes referred to as the training/validation/test approach
- The outer loop is used to assess the performance of the model, and the inner loop is used to select the best model; the model is selected on each outer training set (using the inner CV loop) and its performance is measured on the corresponding outer test set
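A self-contained sketch (mine) of K-fold cross-validation for a single value of λ, using the ridge closed form inline; in practice one would loop this over a grid of λ values and pick the minimizer.

    import numpy as np

    def cv_error_for_lambda(X, y, lam, K=5, seed=0):
        """K-fold cross-validation error of the ridge fit for one lambda."""
        n, p = X.shape
        folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
        errors = []
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(n), test_idx)
            Xt, yt = X[train_idx], y[train_idx]
            beta = np.linalg.solve(Xt.T @ Xt + lam * np.eye(p), Xt.T @ yt)
            errors.append(np.mean((y[test_idx] - X[test_idx] @ beta) ** 2))
        return np.mean(errors)

    # best_lam = min(lambda_grid, key=lambda l: cv_error_for_lambda(X, y, l))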

Variable selection?
- It can be shown that ridge regression doesn't set coefficients exactly to zero unless λ = ∞, in which case they are all zero
- Hence, ridge regression cannot perform variable selection
- Also, even though it performs well in terms of prediction accuracy, it does poorly in terms of offering a clear interpretation
- What do we do with small coefficients?

Bayesian framework
Suppose that the errors are independent and drawn from a normal distribution. If the coefficients β come from a Gaussian distribution with zero mean and standard deviation a function of λ, then it follows that the most likely value for β (called the posterior mode) is given by the ridge regression solution.

Comments
Ridge regression has one obvious disadvantage (relevant where variables/predictors are expensive): it includes all p variables in the final model. For large p this creates problems in model interpretation.

LASSO

Sparsity
- Sparsity implies that only a few variables are important (their coefficients are non-zero) and the rest are not important (their coefficients are zero)
- Can we achieve this by constraining (shrinking) coefficients in our algorithms?
- Oh nooo, the presenter is waving his hands again :)

Model
LASSO: least absolute shrinkage and selection operator.
LASSO regression is defined with the following constraint:
  minimize Σ_{i=1}^n (y_i − βᵀx_i)²  subject to  Σ_{j=1}^p |β_j| ≤ t
Convention (important!): the columns of X are assumed to be standardized (mean 0, variance 1) and y is assumed to be centered (mean 0).

Penalized notation
The problem can also be expressed as classical least squares with the addition of a penalty term:
  minimize Σ_{i=1}^n (y_i − βᵀx_i)² + λ Σ_{j=1}^p |β_j|
- Again, we have a tuning parameter λ ≥ 0 that controls the amount of regularization, and there is a one-to-one correspondence between t and λ
- Unlike ridge regression, no closed-form solution is known for the LASSO

Comparison to ridge
- The tuning parameter λ controls the strength of the penalty, and (as in ridge regression) we get ˆβ_lasso = ˆβ_OLS when λ = 0 and ˆβ_lasso = 0 when λ = ∞
- For λ in between these two extremes we are balancing two ideas: fitting a linear model and shrinking the coefficients
- But the nature of the |β_j| penalty (i.e. the ℓ1 penalty) causes some coefficients to be shrunk to zero exactly
- This makes the LASSO substantially different from ridge regression: it is able to perform variable selection in the linear model
- As λ increases, more coefficients are set to zero, so fewer variables are selected
- In terms of prediction error (or mean squared error), the lasso performs comparably to ridge regression

Tuning parameter λ
- The solution of the minimization problem obviously depends on the tuning parameter λ
- Therefore, we have a solution for every λ > 0
- λ controls the size of the coefficients and the amount of regularization
- We need a way to select an optimal λ

Bias and variance of lasso regression
- Although we can't write explicit formulas for the bias and variance of the lasso estimate, we can see the same general trend
- Usually, as λ increases the bias increases, and as λ increases the variance decreases

Pathwise coordinate descent for LASSO
- One of the algorithms for solving the problem is pathwise coordinate descent for the lasso
- Coordinate descent: optimize one parameter (coordinate) at a time
- For a single standardized predictor the solution is the soft-thresholded estimate
    sign(ˆβ_OLS)(|ˆβ_OLS| − λ)_+
  where ˆβ_OLS is the least squares estimator
- Idea: with multiple predictors, cycle through each predictor in turn; we compute the partial residuals r_i = y_i − Σ_{j≠k} x_ij ˆβ_j and apply soft-thresholding, pretending that our data are (x_ik, r_i)

Pathwise coordinate descent (part 2)
- Start with a large value of λ and slowly decrease it
- Most coordinates that are zero never become non-zero
- Easy to implement, fast and simple
- Can be generalized to similar models (connected with the lasso)
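A sketch (mine, not the lecturer's code) of coordinate descent with soft-thresholding; it assumes standardized columns (x_kᵀx_k = n), a centered y, and the common (1/(2n))·RSS + λ·Σ|β_j| scaling of the objective, which differs from the slide's RSS + λ·Σ|β_j| only by a rescaling of λ.

    import numpy as np

    def soft_threshold(z, lam):
        """sign(z) * max(|z| - lam, 0)."""
        return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

    def lasso_cd(X, y, lam, n_iter=100):
        """Coordinate descent for the lasso (standardized X, centered y)."""
        n, p = X.shape
        beta = np.zeros(p)
        for _ in range(n_iter):
            for k in range(p):
                r = y - X @ beta + X[:, k] * beta[k]   # partial residual excluding x_k
                z = X[:, k] @ r / n                     # univariate OLS coefficient on r
                beta[k] = soft_threshold(z, lam)        # soft-threshold that coefficient
        return beta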

Bayesian framework
Suppose that the errors are independent and drawn from a normal distribution. If the coefficients β come from a double-exponential (Laplace) distribution with mean zero and scale parameter a function of λ, then it follows that the most likely value for β (called the posterior mode) is given by the lasso regression solution.

Comments
- There are no (well developed/tested) tools and theory that allow these methods to be used in full statistical practice: standard errors, p-values and confidence intervals that account for the adaptive nature of the estimation
- It selects somewhat arbitrarily from among groups of correlated variables
- In a dataset with n observations it can select at most n predictors/variables

Group LASSO
- How can we allow predefined groups of covariates to be selected into or out of a model together, so that all the members of a particular group are either included or not included?
- This is useful for categorical variables (coded as dummy variables), since it often doesn't make sense to include only a few levels of the covariate
- It is also quite useful in biological studies: since genes and proteins often lie in known pathways, an investigator may be more interested in which pathways are related to an outcome than in whether particular individual genes are

Group LASSO (part 2)
For the group LASSO we are trying to solve
  minimize ‖ y − Σ_{j=1}^J X_j β_j ‖²_2 + λ Σ_{j=1}^J ‖β_j‖_{K_j}
where ‖β_j‖_{K_j} = √(β_jᵀ K_j β_j).
The design matrix X and the coefficient vector β have been replaced by a collection of design matrices X_j and coefficient vectors β_j, one for each of the J groups. The K_j are positive definite matrices.

Motivation
- What if we could take the best from ridge and LASSO and combine it into just one model?
- We know that the ℓ1 penalty is useful for generating sparse models
- We know that the ℓ2 (ridge) penalty removes the limitation on the number of selected variables and therefore encourages a grouping effect

Elastic Net

Model
Elastic Net regularization is defined as a combination of the LASSO and ridge regressions by
  ˆβ_EN = argmin_β ‖y − Xβ‖² + λ‖β‖_1 + (1 − λ)‖β‖²_2
What is the geometry of the Elastic Net?

Solution
- The quadratic penalty term makes the loss function strictly convex, and it therefore has a unique minimum
- A stage-wise algorithm called LARS-EN efficiently solves the entire elastic net solution path
- The elastic net solution path is piecewise linear
- The presenter waves his hands for the third time in just one lecture!?!? Hope this is the last time (this is a data science school, not instructions in hand signaling)! :p
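A usage example (mine, not from the slides) with scikit-learn's ElasticNet on simulated data; note that scikit-learn parametrizes the penalty as alpha·(l1_ratio·‖β‖_1 + 0.5·(1 − l1_ratio)·‖β‖²_2), which plays the role of the λ / (1 − λ) mix written on the slide.

    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))                          # toy predictors
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

    model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of l1 and l2 penalties
    print(model.coef_)                                      # some coefficients are exactly zero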

Take-home message
This is THE END, my only (OLS) friend, the end.
All models are wrong but some are useful, and only proper experimental design leads to happiness.


STATS216v Introduction to Statistical Learning Stanford University, Summer Midterm Exam (Solutions) Duration: 1 hours Instructions: STATS216v Introduction to Statistical Learning Stanford University, Summer 2017 Remember the university honor code. Midterm Exam (Solutions) Duration: 1 hours Write your name and SUNet ID

More information

Correlation and regression

Correlation and regression 1 Correlation and regression Yongjua Laosiritaworn Introductory on Field Epidemiology 6 July 2015, Thailand Data 2 Illustrative data (Doll, 1955) 3 Scatter plot 4 Doll, 1955 5 6 Correlation coefficient,

More information

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences Biostatistics-Lecture 16 Model Selection Ruibin Xi Peking University School of Mathematical Sciences Motivating example1 Interested in factors related to the life expectancy (50 US states,1969-71 ) Per

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Generalized Linear Models for Non-Normal Data

Generalized Linear Models for Non-Normal Data Generalized Linear Models for Non-Normal Data Today s Class: 3 parts of a generalized model Models for binary outcomes Complications for generalized multivariate or multilevel models SPLH 861: Lecture

More information

Proteomics and Variable Selection

Proteomics and Variable Selection Proteomics and Variable Selection p. 1/55 Proteomics and Variable Selection Alex Lewin With thanks to Paul Kirk for some graphs Department of Epidemiology and Biostatistics, School of Public Health, Imperial

More information

Statistical aspects of prediction models with high-dimensional data

Statistical aspects of prediction models with high-dimensional data Statistical aspects of prediction models with high-dimensional data Anne Laure Boulesteix Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie February 15th, 2017 Typeset by

More information

ISQS 5349 Spring 2013 Final Exam

ISQS 5349 Spring 2013 Final Exam ISQS 5349 Spring 2013 Final Exam Name: General Instructions: Closed books, notes, no electronic devices. Points (out of 200) are in parentheses. Put written answers on separate paper; multiple choices

More information

Lecture Data Science

Lecture Data Science Web Science & Technologies University of Koblenz Landau, Germany Lecture Data Science Regression Analysis JProf. Dr. Last Time How to find parameter of a regression model Normal Equation Gradient Decent

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

A Short Introduction to the Lasso Methodology

A Short Introduction to the Lasso Methodology A Short Introduction to the Lasso Methodology Michael Gutmann sites.google.com/site/michaelgutmann University of Helsinki Aalto University Helsinki Institute for Information Technology March 9, 2016 Michael

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Chapter 6 October 18, 2016 Chapter 6 October 18, 2016 1 / 80 1 Subset selection 2 Shrinkage methods 3 Dimension reduction methods (using derived inputs) 4 High

More information

Python 데이터분석 보충자료. 윤형기

Python 데이터분석 보충자료. 윤형기 Python 데이터분석 보충자료 윤형기 (hky@openwith.net) 단순 / 다중회귀분석 Logistic Regression 회귀분석 REGRESSION Regression 개요 single numeric D.V. (value to be predicted) 과 one or more numeric I.V. (predictors) 간의관계식. "regression"

More information

Math 423/533: The Main Theoretical Topics

Math 423/533: The Main Theoretical Topics Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)

More information

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă mmp@stat.washington.edu Reading: Murphy: BIC, AIC 8.4.2 (pp 255), SRM 6.5 (pp 204) Hastie, Tibshirani

More information

Transformations The bias-variance tradeoff Model selection criteria Remarks. Model selection I. Patrick Breheny. February 17

Transformations The bias-variance tradeoff Model selection criteria Remarks. Model selection I. Patrick Breheny. February 17 Model selection I February 17 Remedial measures Suppose one of your diagnostic plots indicates a problem with the model s fit or assumptions; what options are available to you? Generally speaking, you

More information

CMSC858P Supervised Learning Methods

CMSC858P Supervised Learning Methods CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors

More information

Final Review. Yang Feng. Yang Feng (Columbia University) Final Review 1 / 58

Final Review. Yang Feng.   Yang Feng (Columbia University) Final Review 1 / 58 Final Review Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Final Review 1 / 58 Outline 1 Multiple Linear Regression (Estimation, Inference) 2 Special Topics for Multiple

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 14, 2014 Today s Schedule Course Project Introduction Linear Regression Model Decision Tree 2 Methods

More information

A simulation study of model fitting to high dimensional data using penalized logistic regression

A simulation study of model fitting to high dimensional data using penalized logistic regression A simulation study of model fitting to high dimensional data using penalized logistic regression Ellinor Krona Kandidatuppsats i matematisk statistik Bachelor Thesis in Mathematical Statistics Kandidatuppsats

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 7/8 - High-dimensional modeling part 1 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Classification

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 6: Model complexity scores (v3) Ramesh Johari ramesh.johari@stanford.edu Fall 2015 1 / 34 Estimating prediction error 2 / 34 Estimating prediction error We saw how we can estimate

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study TECHNICAL REPORT # 59 MAY 2013 Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study Sergey Tarima, Peng He, Tao Wang, Aniko Szabo Division of Biostatistics,

More information

Multiple Regression Analysis

Multiple Regression Analysis Chapter 4 Multiple Regression Analysis The simple linear regression covered in Chapter 2 can be generalized to include more than one variable. Multiple regression analysis is an extension of the simple

More information

Lecture 6: Methods for high-dimensional problems

Lecture 6: Methods for high-dimensional problems Lecture 6: Methods for high-dimensional problems Hector Corrada Bravo and Rafael A. Irizarry March, 2010 In this Section we will discuss methods where data lies on high-dimensional spaces. In particular,

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Business Statistics. Tommaso Proietti. Linear Regression. DEF - Università di Roma 'Tor Vergata'

Business Statistics. Tommaso Proietti. Linear Regression. DEF - Università di Roma 'Tor Vergata' Business Statistics Tommaso Proietti DEF - Università di Roma 'Tor Vergata' Linear Regression Specication Let Y be a univariate quantitative response variable. We model Y as follows: Y = f(x) + ε where

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Stability and the elastic net

Stability and the elastic net Stability and the elastic net Patrick Breheny March 28 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/32 Introduction Elastic Net Our last several lectures have concentrated on methods for

More information

Introduction to Statistical modeling: handout for Math 489/583

Introduction to Statistical modeling: handout for Math 489/583 Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect

More information

Logistic Regression: Regression with a Binary Dependent Variable

Logistic Regression: Regression with a Binary Dependent Variable Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression

More information

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77 Linear Regression Chapter 3 September 27, 2016 Chapter 3 September 27, 2016 1 / 77 1 3.1. Simple linear regression 2 3.2 Multiple linear regression 3 3.3. The least squares estimation 4 3.4. The statistical

More information

Tuning Parameter Selection in L1 Regularized Logistic Regression

Tuning Parameter Selection in L1 Regularized Logistic Regression Virginia Commonwealth University VCU Scholars Compass Theses and Dissertations Graduate School 2012 Tuning Parameter Selection in L1 Regularized Logistic Regression Shujing Shi Virginia Commonwealth University

More information

Linear Regression Models P8111

Linear Regression Models P8111 Linear Regression Models P8111 Lecture 25 Jeff Goldsmith April 26, 2016 1 of 37 Today s Lecture Logistic regression / GLMs Model framework Interpretation Estimation 2 of 37 Linear regression Course started

More information

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson Bayesian variable selection via penalized credible regions Brian Reich, NC State Joint work with Howard Bondell and Ander Wilson Brian Reich, NCSU Penalized credible regions 1 Motivation big p, small n

More information

6. Regularized linear regression

6. Regularized linear regression Foundations of Machine Learning École Centrale Paris Fall 2015 6. Regularized linear regression Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe agathe.azencott@mines paristech.fr

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION SEAN GERRISH AND CHONG WANG 1. WAYS OF ORGANIZING MODELS In probabilistic modeling, there are several ways of organizing models:

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Theorems. Least squares regression

Theorems. Least squares regression Theorems In this assignment we are trying to classify AML and ALL samples by use of penalized logistic regression. Before we indulge on the adventure of classification we should first explain the most

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10 COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem

More information

Lecture 5: Clustering, Linear Regression

Lecture 5: Clustering, Linear Regression Lecture 5: Clustering, Linear Regression Reading: Chapter 10, Sections 3.1-3.2 STATS 202: Data mining and analysis October 4, 2017 1 / 22 .0.0 5 5 1.0 7 5 X2 X2 7 1.5 1.0 0.5 3 1 2 Hierarchical clustering

More information

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models

More information

Regression.

Regression. Regression www.biostat.wisc.edu/~dpage/cs760/ Goals for the lecture you should understand the following concepts linear regression RMSE, MAE, and R-square logistic regression convex functions and sets

More information

Sparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda

Sparse regression. Optimization-Based Data Analysis.   Carlos Fernandez-Granda Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic

More information

LECTURE 10: LINEAR MODEL SELECTION PT. 1. October 16, 2017 SDS 293: Machine Learning

LECTURE 10: LINEAR MODEL SELECTION PT. 1. October 16, 2017 SDS 293: Machine Learning LECTURE 10: LINEAR MODEL SELECTION PT. 1 October 16, 2017 SDS 293: Machine Learning Outline Model selection: alternatives to least-squares Subset selection - Best subset - Stepwise selection (forward and

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

ECON The Simple Regression Model

ECON The Simple Regression Model ECON 351 - The Simple Regression Model Maggie Jones 1 / 41 The Simple Regression Model Our starting point will be the simple regression model where we look at the relationship between two variables In

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2015 Announcements TA Monisha s office hour has changed to Thursdays 10-12pm, 462WVH (the same

More information

High-Dimensional Statistical Learning: Introduction

High-Dimensional Statistical Learning: Introduction Classical Statistics Biological Big Data Supervised and Unsupervised Learning High-Dimensional Statistical Learning: Introduction Ali Shojaie University of Washington http://faculty.washington.edu/ashojaie/

More information

Lecture 5: LDA and Logistic Regression

Lecture 5: LDA and Logistic Regression Lecture 5: and Logistic Regression Hao Helen Zhang Hao Helen Zhang Lecture 5: and Logistic Regression 1 / 39 Outline Linear Classification Methods Two Popular Linear Models for Classification Linear Discriminant

More information

Matematické Metody v Ekonometrii 7.

Matematické Metody v Ekonometrii 7. Matematické Metody v Ekonometrii 7. Multicollinearity Blanka Šedivá KMA zimní semestr 2016/2017 Blanka Šedivá (KMA) Matematické Metody v Ekonometrii 7. zimní semestr 2016/2017 1 / 15 One of the assumptions

More information

Model comparison and selection

Model comparison and selection BS2 Statistical Inference, Lectures 9 and 10, Hilary Term 2008 March 2, 2008 Hypothesis testing Consider two alternative models M 1 = {f (x; θ), θ Θ 1 } and M 2 = {f (x; θ), θ Θ 2 } for a sample (X = x)

More information