Linear Regression
Ivo Ugrina
King's College London // University of Zagreb // University of Split
September 29, 2016

Regression as a term
The term "regression" goes back to Francis Galton, who observed the phenomenon of regression toward the mean. [Figure: Galton]

Basic structure
Y = f(X) + ε
ε is a random error term with zero mean, independent of X.

Simple Linear Regression

Model
Y = β_0 + β_1 X + ε,  so that  Y ≈ β_0 + β_1 X
- ε is a random error term with zero mean, independent of X
- β_0 and β_1 are unknown constants representing the intercept and the slope; they are often called coefficients or parameters

Estimating coefficients
- First, we need data: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), with X = (x_1, ..., x_n) and Y = (y_1, ..., y_n)
- Given the data, the goal is to obtain coefficient estimates ˆβ_0 and ˆβ_1 for which the linear model ˆY = ˆβ_0 + ˆβ_1 X fits the data as closely as possible

Closeness
- To fit the data as closely as possible we first need a measure of closeness
- There are many possible measures; the most intuitive one (to me :) ) is the least absolute deviation
    LAD = Σ_{i=1}^n |y_i − ˆβ_0 − ˆβ_1 x_i|
- However, the measure should not only be intuitive, it should also be practical (easy to compute)
- Therefore, the most common approach for the past 200 years has been to minimize the residual sum of squares
    RSS = Σ_{i=1}^n (y_i − ˆβ_0 − ˆβ_1 x_i)²

(Ordinary) Least Squares (OLS)
The least-squares method is usually credited to Carl Friedrich Gauss (1795), but it was first published by Adrien-Marie Legendre.

OLS
The analytic solution to OLS is (easily) obtainable with a bit of calculus and is given by
  ˆβ_1 = Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^n (x_i − x̄)²
  ˆβ_0 = ȳ − ˆβ_1 x̄
where ȳ = (1/n) Σ_{i=1}^n y_i and x̄ = (1/n) Σ_{i=1}^n x_i.

Randomness of least-squares lines
[Figure placeholder: least-squares lines fitted to repeated samples, illustrating the concept of unbiasedness.]

How close are the ˆβs to the βs?
  SE(ˆβ_0)² = σ² [ 1/n + x̄² / Σ_{i=1}^n (x_i − x̄)² ]
  SE(ˆβ_1)² = σ² / Σ_{i=1}^n (x_i − x̄)²
where σ² = Var(ε). Assumption: the errors are uncorrelated with common variance σ².

Do we know σ²?
- In most real-world problems σ² is unknown
- But it can be estimated from the data
- This estimate is known as the residual standard error and is given by RSE = √(RSS/(n − 2))
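A minimal sketch (added here, not part of the original slides) of these closed-form simple-regression formulas in Python/NumPy; the function name and variables are illustrative.

    import numpy as np

    def simple_ols(x, y):
        """Closed-form simple OLS with standard errors and RSE."""
        n = len(x)
        x_bar, y_bar = x.mean(), y.mean()
        sxx = np.sum((x - x_bar) ** 2)
        b1 = np.sum((x - x_bar) * (y - y_bar)) / sxx
        b0 = y_bar - b1 * x_bar
        rss = np.sum((y - b0 - b1 * x) ** 2)
        rse = np.sqrt(rss / (n - 2))                           # estimate of sigma
        se_b1 = np.sqrt(rse ** 2 / sxx)
        se_b0 = np.sqrt(rse ** 2 * (1 / n + x_bar ** 2 / sxx))
        return b0, b1, se_b0, se_b1, rse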

Confidence intervals
Using the standard errors SE(ˆβ_1) and SE(ˆβ_0) we can derive (1 − α)-confidence intervals.
For β_i:  ˆβ_i ± t_{n−2}(1 − α/2) · SE(ˆβ_i)
Assumption: the errors are Gaussian.

Is there a relationship between X and Y?
The question of a relationship between the input variable and the outcome can be written as a hypothesis test:
  H_0: β_1 = 0  (there is no relationship)
  H_a: β_1 ≠ 0  (there is a relationship)

(Wald) test
If there is no relationship between X and Y, the statistic
  T = (ˆβ_1 − 0) / SE(ˆβ_1)
will have a t-distribution with n − 2 degrees of freedom if we have estimated the variance from the data. Otherwise, if the variance is known, the distribution will be standard normal.
Assumption: the errors are Gaussian.

Accuracy of the fit - RSE
- A measure of the lack of fit of the model
- Gives us an estimate of the average amount by which the response deviates from the true regression line
- If the predictions are very close to the true outcome values, RSE will be small
- If the predictions are very far from one or more true outcomes, RSE will be quite large
- Measured in units of Y, so it is not (always) clear what a good RSE is
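A small illustration (not from the slides) of the t-based test and confidence interval above, using SciPy; the numeric values of b1, se_b1 and n are made up for the example.

    from scipy import stats

    b1, se_b1, n = 0.42, 0.10, 50                    # hypothetical estimates
    t_stat = b1 / se_b1                              # T = (b1 - 0) / SE(b1)
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value
    alpha = 0.05
    half_width = stats.t.ppf(1 - alpha / 2, df=n - 2) * se_b1
    ci = (b1 - half_width, b1 + half_width)          # (1 - alpha) confidence interval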

Accuracy of the fit - R²
- Independent of the scale of Y
- Always between 0 and 1
- Given by
    R² = (TSS − RSS) / TSS = 1 − RSS/TSS
  where TSS = Σ_{i=1}^n (y_i − ȳ)² is the total sum of squares
- Therefore, R² measures the proportion of variability in Y that can be explained with X
- It can be (and usually is) quite challenging to determine what a good R² is; it mostly depends on the field (biologists are sometimes happy with quite small values of R²)

(Multiple) Linear Regression

Why?
- Many real-world problems need more than one variable (predictor) to explain a phenomenon (outcome)
- Should we run a simple linear regression for each variable? If so, how do we predict based on multiple models while still taking all variables into account?
- What if there are synergies between variables (the coffee/sugar example)?

Model
Y = β_0 + β_1 X_1 + ... + β_p X_p + ε
- ε is a random error term with zero mean, independent of the X_i
- β_0 and β_j, j = 1, ..., p, are unknown constants representing the intercept and the slopes; often called coefficients or parameters
- β_j quantifies the association between X_j and the outcome (response)
- β_j can be interpreted as the average effect on Y of a one-unit increase in X_j, holding all other variables fixed
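Referring back to the R² definition above, a small sketch (mine, not the lecturer's) of computing it from observations and fitted values; names are illustrative.

    import numpy as np

    def r_squared(y, y_hat):
        """R^2 = 1 - RSS/TSS: the proportion of variance in y explained by the fit."""
        rss = np.sum((y - y_hat) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        return 1 - rss / tss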

Matrix notation
y = Xβ + ε, where
  X is the n × (p + 1) matrix with rows (1, x_i1, x_i2, ..., x_ip), i = 1, ..., n,
  β = (β_0, β_1, ..., β_p)ᵀ and y = (y_1, y_2, ..., y_n)ᵀ.

Estimating coefficients
- As with simple linear regression, the coefficients are unknown and have to be estimated
- The parameters (coefficients) are estimated using the same principle as in simple linear regression; namely, we want to minimize
    RSS = Σ_{i=1}^n (y_i − ˆβ_0 − ˆβ_1 x_i1 − ˆβ_2 x_i2 − ... − ˆβ_p x_ip)² = ||y − X ˆβ||²_2
- We denote by ˆβ_j the coefficients that minimize RSS

OLS solution
An analytic (unique) solution to OLS for multiple linear regression is also available, provided that the columns of X are linearly independent (which requires n > p), and it has the form (in matrix notation)
  ˆβ = (XᵀX)⁻¹ Xᵀ y

Prediction accuracy vs model interpretability/inference
- Are we interested only in the results (predictions) of the model?
- Can we easily obtain all predictors?
- Is it easy to obtain lots of data?
- How flexible are we computationally?
- How big is the penalty for bad predictions?
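A minimal sketch (added here, not in the slides) of the matrix OLS solution above; in practice a least-squares solver is preferred over forming (XᵀX)⁻¹ explicitly.

    import numpy as np

    def ols_fit(X, y):
        """OLS for multiple linear regression; X is n x p without the intercept column."""
        X1 = np.column_stack([np.ones(len(y)), X])          # prepend the intercept
        beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)   # solves min ||y - X1 b||_2
        return beta_hat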

Is at least one predictor useful in predicting the response?
We can rephrase this question in the language of hypothesis testing:
  H_0: β_1 = β_2 = β_3 = ... = β_p = 0
  H_A: at least one β_j is non-zero

F-statistic
This hypothesis test can be resolved using the F statistic
  F = [(TSS − RSS)/p] / [RSS/(n − p − 1)]
which under the null hypothesis and Gaussian errors is distributed according to an F-distribution.
It can be shown that even if the errors are not normally distributed, F will approximately follow an F-distribution if the sample size is big enough.

How small/large should F or T be?
- The probability of obtaining a value more extreme than F or T (under the null hypothesis) is often called the p-value (in general, p-values have a broader definition)
- How small should a p-value be? A small one indicates that it is unlikely to observe such an association just due to chance (in the absence of any real association between predictors and the response)
- The curse of multiple p-values: a story of multiple Ts and one F

Important variables / features (selection)
- If we have obtained a significant F-statistic, we may wonder which variables exactly are connected with the response
- The task of determining which variables are associated with the response is referred to as variable or feature selection
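A small illustration (not from the slides) of the overall F-test defined above; the helper name and arguments are my own.

    import numpy as np
    from scipy import stats

    def overall_f_test(y, y_hat, p):
        """F-test of H0: beta_1 = ... = beta_p = 0, assuming Gaussian errors."""
        n = len(y)
        rss = np.sum((y - y_hat) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        f = ((tss - rss) / p) / (rss / (n - p - 1))
        p_value = stats.f.sf(f, p, n - p - 1)   # P(F_{p, n-p-1} > f)
        return f, p_value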

A naive approach
- Why not try all possible subsets of variables and see which model gives the best results?
- For p = 2 this might seem plausible, but what if we have, say, p = 20? In that scenario we would have to check (and build) 2^p = 1,048,576 models :(
- Not practical in real life

The best model?
- An additional problem is the definition of the best model
- We have to choose the best model within each group of models with the same number of variables, and then between those groups

R² for MLR
- R² can be defined in the same way for MLR as for SLR, as the fraction of variance explained, and could be used as a measure for choosing the best model
- However, R² always improves when more variables are added to the model
- Additionally, what is a substantial improvement in R²?
- Ideally, we would like to estimate the quality of a model/fit on test (unseen) samples instead of the training samples

RSE for MLR
- RSE can be defined in a similar fashion for MLR as for SLR,
    RSE = √(RSS/(n − p − 1)),
  and could be used as a measure for choosing the best model
- What is a substantial improvement in RSE? How expensive is it to keep a variable in the model?
- Ideally, we would like to estimate the quality of a model/fit on test (unseen) samples instead of the training samples

Adjusted R²
- Helpful in adjusting the training error for the model size
- A very popular approach for selecting among a set of models that contain different numbers of variables
- The adjusted R² statistic is calculated (for a model with d variables) as
    Adjusted R² = 1 − [RSS/(n − d − 1)] / [TSS/(n − 1)]
  where TSS = Σ_{i=1}^n (y_i − ȳ)²
- Unlike the R² statistic, the adjusted R² statistic does take into account (penalizes) the inclusion of unnecessary variables in the model

(Mallow's) C_p
- For a model with d predictors, C_p is defined as
    C_p = (1/n) (RSS + 2 d ˆσ²)
  where ˆσ² is an estimate of the variance of the error ε
- 2 d ˆσ² is a penalty term to adjust for the fact that the training error usually underestimates the real (test) error
- Theoretically it is justified through asymptotic theory (sample size n very large)
- Lower C_p is better

Akaike information criterion (AIC)
- Defined for a large class of models fit by the maximum likelihood approach
- In the case of Gaussian errors, maximum likelihood and least squares give the same estimates
- AIC (for Gaussian errors) is defined by
    AIC = (1/(n ˆσ²)) (RSS + 2 d ˆσ²)
- Proportional to C_p

MLE and OLS (digression)
In the model where ε ~ N(0, σ²), the log-likelihood is given (up to an additive constant) by
  −(n/2) log(σ²) − (1/(2σ²)) Σ_{i=1}^n (y_i − x_iᵀβ)²
Therefore, maximization of the log-likelihood leads to minimization of RSS.
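A sketch (mine) of these model-size-adjusted criteria, following the definitions used on these slides; sigma2_hat is an externally supplied estimate of the error variance.

    import numpy as np

    def model_criteria(y, y_hat, d, sigma2_hat):
        """Adjusted R^2, Mallow's C_p and (Gaussian) AIC for a fit with d predictors."""
        n = len(y)
        rss = np.sum((y - y_hat) ** 2)
        tss = np.sum((y - y.mean()) ** 2)
        adj_r2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
        cp = (rss + 2 * d * sigma2_hat) / n
        aic = (rss + 2 * d * sigma2_hat) / (n * sigma2_hat)
        return adj_r2, cp, aic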

BIC
- For a model with d predictors, BIC is defined as
    BIC = (1/n) (RSS + log(n) d ˆσ²)
- BIC is derived from a Bayesian point of view
- It places a heavier penalty on models with more variables and therefore results in the selection of smaller models than C_p
- Lower BIC is better

Cross-validation (CV)
- A way to directly estimate the test (real) error
- Makes fewer assumptions about the true underlying model than C_p, AIC, BIC and adjusted R²
- Can be used in a wider range of model selection tasks (e.g. when it is hard to estimate the error variance σ²)
- Due to computational improvements over the last years, CV is slowly becoming the standard approach for selecting from among a number of models under consideration

Forward stepwise (subset) selection
- Start with the null model M_0, which contains no variables/predictors
- For k = 0, 1, ..., p − 1:
    Consider all p − k models that augment the predictors in M_k with one additional predictor.
    Choose the best among these models, called M_{k+1}, where best is defined as the one with the smallest RSS.
- Select a single best model among M_0, M_1, ..., M_p using one of the previously described measures (C_p, AIC, BIC, adjusted R², CV)

Forward selection comments
- Faster than best subset selection
- Not guaranteed to find the best subset (what's been included stays included)
- Can be applied in the high-dimensional setting where n < p, but only up to n variables
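A compact sketch (not from the original slides) of forward stepwise selection by RSS; it returns the whole path so one of the criteria above (or CV) can pick the final model size.

    import numpy as np

    def forward_stepwise(X, y):
        """Greedy forward selection by RSS; returns [(selected columns, RSS), ...]."""
        n, p = X.shape
        selected, remaining, path = [], list(range(p)), []
        for _ in range(p):
            best_rss, best_j = np.inf, None
            for j in remaining:
                X1 = np.column_stack([np.ones(n), X[:, selected + [j]]])
                beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
                rss = np.sum((y - X1 @ beta) ** 2)
                if rss < best_rss:
                    best_rss, best_j = rss, j
            selected.append(best_j)
            remaining.remove(best_j)
            path.append((list(selected), best_rss))
        return path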

Backward stepwise (subset) selection
- Start with the full model M_p, containing all p variables/predictors
- For k = p, p − 1, ..., 1:
    Consider all k models that contain all but one of the predictors in M_k.
    Choose the best among these k models, called M_{k−1}, where best is defined as the one with the smallest RSS.
- Select a single best model among M_0, M_1, ..., M_p using one of the previously described measures (C_p, AIC, BIC, adjusted R², CV)

Backward selection comments
- Faster than best subset selection
- Not guaranteed to find the best subset (what's been excluded stays excluded)
- Cannot be applied in the high-dimensional setting where n < p

Mixed/Hybrid stepwise (subset) selection
- Add variables sequentially (as in forward selection)
- After adding each new variable, try to remove any variables in the model that are no longer providing an improvement in the model fit
- Select a single best model among M_0, M_1, ..., M_p using one of the previously described measures (C_p, AIC, BIC, adjusted R², CV)

Mixed/Hybrid selection comments
- Not as fast as forward/backward selection
- Hard to decide when to move back (remove a variable)

Gauss-Markov theorem
Assumptions:
- E(ε_i) = 0
- Var(ε_i) = σ² < ∞ (homoscedasticity)
- Cov(ε_i, ε_j) = 0 for i ≠ j
Under these assumptions the OLS estimator is BLUE (Best Linear Unbiased Estimator).

Challenges (or problems?)

Qualitative variables
- Variables like Gender, Hospital, ...
- We can use dummy variables in a regression model, e.g.
    x = 1 if the person is female, x = 0 if the person is male
- For qualitative variables with k > 2 levels (values) we can create k − 1 dummy variables
- Sometimes it also makes sense to run separate regressions per level of the qualitative variable (different errors per level?)

Repeated measurements
- A study involving the same patients over a period of time (BMI, heart rate, ...)
- Not suitable for classical linear regression due to dependencies between measurements
- Take a look at Linear Mixed Effects (LME) models

Synergies/Interactions between variables
- One way of extending the linear model is to allow interaction effects to be included as an additional variable, e.g.
    Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_1 X_2 + ε
- If an interaction term is significant, as a rule we should always include the main effects/variables in the model even though they (by themselves) might not be significant

Non-linear relationship
- Sometimes we can justify a linear model by changing our predictors a bit
- For example, going from Y = β_0 + β_1 X_1 to Y = β_0 + β_1 X_1 + β_2 X_1² keeps us within linear regression but gives more flexibility

Correlation between errors
- An important assumption for many results in linear regression models is that the error terms ε_1, ..., ε_n are uncorrelated
- Standard errors for the estimated coefficients (and the subsequent hypothesis tests) are based on the idea of uncorrelated error terms
- Fields like time series analysis are devoted to models with correlated errors
- To mitigate the risk of correlation between error terms, a proper EXPERIMENTAL DESIGN is crucial

Heteroscedasticity
- Another important assumption in linear regression is that the error terms have a constant variance
- Standard errors, confidence intervals and hypothesis tests for linear regression rely upon this assumption
- Sometimes variables can be transformed (a classical example is using the log function) to reduce heteroscedasticity
- Techniques based on weighting also prove useful; in particular, weighted least squares is useful for proportional variances with known proportionality constants w_i. The goal is to minimize
    Σ_{i=1}^n w_i (y_i − ˆβ_0 − ˆβ_1 x_i1 − ˆβ_2 x_i2 − ... − ˆβ_p x_ip)²
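A minimal weighted-least-squares sketch (my addition), using the standard trick of rescaling rows by √w_i; names are illustrative.

    import numpy as np

    def weighted_least_squares(X, y, w):
        """Minimize sum_i w_i (y_i - x_i^T beta)^2 by rescaling each row by sqrt(w_i)."""
        X1 = np.column_stack([np.ones(len(y)), X])
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(X1 * sw[:, None], y * sw, rcond=None)
        return beta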

Outliers
- Hard to define (hard to agree on a definition)
- In the real world outliers are quite common (depending on the definition :) )
- Extremely hard to deal with; if we have no solid proof that they are artificial, it is best to leave them in the model

Leverage points
- Observations with unusual values of x_i
- Can be avoided with proper experimental designs
- In order to quantify an observation's leverage we can use the leverage statistic; for simple linear regression it is defined as
    h_i = 1/n + (x_i − x̄)² / Σ_{j=1}^n (x_j − x̄)²
- A large value indicates high leverage

(Multi)Collinearity
- Collinearity refers to a model where two or more variables are dependent on each other (related)
- For example, height and weight, BMI and weight, ...
- Reduces the accuracy of the estimates of the regression coefficients and therefore increases their standard errors
- Often assessed with the variance inflation factor (VIF)
- There are two classical solutions to the problem:
    Remove one of the correlated variables; rationale: the other already carries the information.
    Combine the correlated variables into one (if it can be justified in the real world).

p ≫ n
- Often we have a study in which we have measured a lot of variables but have only a small number of participants
- Genomic research, web-based data (a small web shop), ...
- OLS does not have a unique solution for p ≥ n
- We need special models that fit this high-dimensional setting
- Popular solutions are based on regularization techniques (LASSO, ridge, elastic net)

Logistic Regression
Ivo Ugrina
King's College London // University of Zagreb // University of Split
September 29, 2016

What and how
- We would like to use ideas from linear regression for discrete (categorical) outputs like Y ∈ {0, 1}
- However, linear regression with quantitative inputs (variables) returns quantitative outputs, i.e. elements of (−∞, +∞)
- What if we could translate the task of getting a discrete output into obtaining probabilities (numerical values) for these outputs?
- We would still have a problem, since probabilities take values from 0 to 1
- How about additionally transforming the probabilities to cover the space from −∞ to +∞?

Simple Logistic Regression

Model
For an outcome Y with only two possible values and one variable/predictor, we define the simple logistic regression model as
  ln[ p(x) / (1 − p(x)) ] = β_0 + β_1 x
where p(x) is the probability of obtaining one of the two possible values of the outcome given x, i.e. P(Y = 1 | x).
The transformation is usually called the logit transformation. Solving for p(x) gives
  p(x) = e^{β_0 + β_1 x} / (1 + e^{β_0 + β_1 x})
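A tiny sketch (not from the slides) of the logit transformation and its inverse, with illustrative names.

    import numpy as np

    def logit(p):
        """Log-odds: ln(p / (1 - p))."""
        return np.log(p / (1 - p))

    def p_of_x(x, b0, b1):
        """Inverse logit: P(Y = 1 | x) = exp(b0 + b1*x) / (1 + exp(b0 + b1*x))."""
        eta = b0 + b1 * x
        return np.exp(eta) / (1 + np.exp(eta))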

Likelihood
- Because logistic regression predicts probabilities, rather than just classes, we can fit it using the likelihood
- For each training data point we have a vector of inputs x_i and an observed class y_i; the probability of that class is either p(x_i), for the class encoded as 1 (y_i = 1), or 1 − p(x_i), for the other class encoded as 0 (y_i = 0)
- The likelihood is then
    L(β_0, β_1) = Π_{i=1}^n p(x_i)^{y_i} (1 − p(x_i))^{1 − y_i}

Estimating coefficients
- Coefficients can be estimated by maximizing the likelihood
- No analytical solution is known at the moment; the problem is solved through numerical optimization
- A solution does not always exist, but when it does it is unique

Significance of the coefficients/variables
- After estimating the coefficients, our first look at the fitted model commonly concerns an assessment of the significance of the variables in the model
- This usually involves formulating and testing a statistical hypothesis to determine whether the independent variables in the model are significantly related to the outcome
- If the predicted values with the variable in the model are better, or more accurate in some sense, then we believe that the variable in question is significant

Deviance
- In logistic regression the role of the residual sum of squares is played by the deviance, defined as
    D = −2 ln(likelihood of the fitted model)
- To assess the significance of an independent variable we compare the value of D with and without the independent variable in the equation
- The change in D due to the inclusion of the variable in the model is
    G = D(model without the variable) − D(model with the variable)
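A short sketch (my addition) of the deviance for a fitted vector of probabilities; the difference of two such deviances gives the statistic G described above.

    import numpy as np

    def deviance(y, p_hat):
        """D = -2 * log-likelihood for a binary response y and fitted probabilities p_hat."""
        log_lik = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
        return -2 * log_lik

    # G = deviance(y, p_without_variable) - deviance(y, p_with_variable)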

Hypothesis testing
- Under the hypothesis that β_1 is equal to zero, the statistic G follows a chi-square distribution with 1 degree of freedom
- This is valid if we have enough subjects with both y = 0 and y = 1 and a sufficiently large sample size n
- As a rule of thumb (tested through simulations), what we need in logistic regression is for n_min = min(n_0, n_1) to be about 10 times the number of parameters; some newer papers suggest that 5-9 times might be enough

Wald test
- The assumptions needed are the same as for the deviance-based test
- The Wald test statistic is the ratio of the maximum likelihood estimate of the slope parameter ˆβ_1 to an estimate of its standard error:
    W = ˆβ_1 / SE(ˆβ_1)
- Under the null hypothesis and the sample size assumptions this ratio follows a standard normal distribution
- Believed to be inferior to deviance-based testing

Confidence interval estimation
The confidence interval estimators for the slope and the intercept are most often based on their respective Wald tests and are sometimes referred to as Wald-based confidence intervals.
The endpoints of a 100(1 − α)% confidence interval for the slope coefficient are
  ˆβ_1 ± z_{1−α/2} SE(ˆβ_1)
and for the intercept they are
  ˆβ_0 ± z_{1−α/2} SE(ˆβ_0)
where z_{1−α/2} is the 1 − α/2 quantile of the standard normal distribution.
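An end-to-end illustration (mine, not the lecturer's) of fitting a logistic regression by maximum likelihood and reading off Wald statistics and Wald-based confidence intervals, assuming the statsmodels package is available; the data are simulated.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))                                  # toy predictors
    eta = 0.5 + X @ np.array([1.0, -2.0])
    y = (rng.random(200) < 1 / (1 + np.exp(-eta))).astype(int)     # simulated 0/1 response

    res = sm.Logit(y, sm.add_constant(X)).fit(disp=0)   # ML fit via numerical optimization
    wald = res.params / res.bse                          # Wald statistics beta_hat / SE(beta_hat)
    ci = res.conf_int(alpha=0.05)                        # Wald-based 95% confidence intervals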

(Multiple) Logistic Regression

Intro
- As in the case of linear regression, the strength of the multiple logistic regression model is the ability to handle many variables, some of which may be on different measurement scales
- Central to the consideration of the multiple logistic model are estimating the coefficients and testing for their significance

Model
For a collection of p (independent) variables denoted by X_1, ..., X_p (realizations x = (x_1, ..., x_p)), the logit of the multiple logistic regression is given by the equation
  ln[ p(x) / (1 − p(x)) ] = β_0 + β_1 x_1 + β_2 x_2 + ... + β_p x_p

Estimating coefficients
- The method of estimation used in the multivariable case is the same as in the univariable (simple) logistic regression, i.e. maximum likelihood
- As in simple logistic regression, no closed-form/analytical solution is known at the moment
- Solved through numerical optimization

Variances and covariances of estimates
- The method of estimating the variances and covariances of the estimated coefficients follows from the well-developed theory of maximum likelihood estimation
- This theory states that the estimators are obtained from the matrix of second partial derivatives of the log-likelihood function
- Let us denote by I(β) the (p + 1) × (p + 1) matrix containing the negatives of these partial derivatives
- This matrix is called the observed (Fisher) information matrix

Variances and covariances of estimates
- The variances and covariances of the estimated coefficients are obtained from the inverse of this matrix, denoted by Var(β) = I⁻¹(β)
- Estimators of the variances and covariances are obtained by evaluating Var(β) at ˆβ; we denote them by Var(ˆβ)
- Estimated standard errors of the estimated coefficients:
    SE(ˆβ_j) = √(Var(ˆβ_j))
  where Var(ˆβ_j) denotes the j-th diagonal element

Testing the significance of the model
- The deviance-based test for the overall significance of the p coefficients of the independent variables in the model is performed in exactly the same manner as in simple logistic regression
- Under the null hypothesis that the p (slope) coefficients are equal to zero, the distribution of G is chi-square with p degrees of freedom
- Therefore, the test is based on p-values

Confidence interval estimation
The methods used for the confidence interval estimators are essentially the same as for simple logistic regression.
The endpoints of a 100(1 − α)% confidence interval for the slope coefficients are
  ˆβ_j ± z_{1−α/2} SE(ˆβ_j)
and for the intercept they are
  ˆβ_0 ± z_{1−α/2} SE(ˆβ_0)
where z_{1−α/2} is the 1 − α/2 quantile of the standard normal distribution.

Variable selection
- We can take the same approach as with linear regression models, namely forward, backward or mixed selection
- However, instead of basing our selection on the residual sum of squares, the selection should be based on the deviance-based test
- More correctly, since the magnitude of the deviance-based statistic G depends on its degrees of freedom, the procedure should account for possible differences in degrees of freedom between variables (relevant for qualitative variables); therefore, instead of the absolute value of G, the decision should be based on the p-value (e.g. in forward selection by including the variable with the lowest p-value)

Variable selection p-values
- Usually, thresholds for p-values are decided in advance
- Sometimes the thresholds are ignored and the p-value is used only for its comparative advantage relative to the other variables' p-values

Goodness of fit
- Extremely hard to justify :( Everyone has an opinion on this (which might not be that bad)
- Usually based on summary measures like Pearson residuals (leading to Pearson's chi-square test) or deviance residuals
- Confusion tables based on true/false positives/negatives are extremely popular
- Also, an informative and more complete description of classification accuracy (than confusion tables) can be obtained with Receiver Operating Characteristic (ROC) curves
- The take-home message is that the decision on the fit depends on the underlying problem and the intrinsic feel for risk of the data analyst/statistician running the study

Why logit?
- There are other transformations from probabilities to R, like the probit
- Logit has a long tradition in the statistical community
- Log odds are easy to interpret
- Easy to work with (from the mathematical point of view)

Challenges (or problems?)

Qualitative variables
- Variables like Gender, Hospital, ...
- We can use dummy variables in a regression model, e.g.
    x = 1 if the person is female, x = 0 if the person is male
- For qualitative variables with k > 2 levels (values) we can create k − 1 dummy variables

Synergies/Interactions between variables
- One way of extending the logistic regression model is to allow interaction effects to be included as an additional variable, e.g.
    ln[ p(x) / (1 − p(x)) ] = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_1 X_2
- If an interaction term is significant, as a rule we should always include the main effects/variables in the model even though they (by themselves) might not be significant

Non-linear relationship
- Sometimes we can justify linearity in the logistic model by changing our predictors a bit
- For example, going from
    ln[ p(x) / (1 − p(x)) ] = β_0 + β_1 X_1
  to
    ln[ p(x) / (1 − p(x)) ] = β_0 + β_1 X_1 + β_2 X_1²
  leaves us in the linear space but gives more flexibility

Outliers
- Hard to define (hard to agree on a definition)
- In the real world outliers are quite common (depending on the definition :) )
- Extremely hard to deal with; if we have no solid proof that they are artificial, it is best to leave them in the model

p ≫ n
- Often we have a study in which we have measured a lot of variables but have only a small number of participants
- Genomic research, web-based data (a small web shop), ...
- We need special models that fit this high-dimensional setting
- Some adaptations of the ideas behind logistic regression, together with regularization techniques (LASSO, ridge, elastic net), are possible

Regularization and Variable Selection: LASSO, Ridge and Elastic Net
Ivo Ugrina
King's College London // University of Zagreb // University of Split
September 29, 2016

Introduction
- As in the previous lectures, we assume the model Y = f(X) + ε, where ε is a random (noise) variable with mean 0 and variance σ²
- In (classical) linear regression we were interested in finding the best (good enough) estimate ˆf of the function f, where
    ˆf(x) = xᵀβ = β_0 + β_1 x_1 + ... + β_p x_p
- A standard approach was/is to use the least squares estimator
- However, we have seen that the least squares estimator has a number of restrictions; one of the most obvious ones is that it has problems with multicollinearity

Quality of the fit
- Functions that fit our data well are not necessarily good for unseen data (think of polynomial functions with a perfect fit)
- If we were to sample new data (x_i, y_i), then our ˆf, to be a good model, would have to have values ˆf(x_i) close to the real values y_i
- The discrepancy between ˆf(x_i) and y_i is often called the prediction error (PE) and is often measured through the mean squared prediction error
    MSPE = Σ_{i=1}^n E[(y_i − ˆf(x_i))²]

Variance/Bias tradeoff
- Like the mean squared error, the MSPE can be decomposed into positive terms (variance/bias):
    MSPE = Σ_{i=1}^n [ (E(ˆf(x_i)) − f(x_i))² + Var(ˆf(x_i)) ]  (plus the irreducible error coming from ε)
- Introducing a little bias in an estimator of β might lead to a substantial decrease in variance, and thus a decrease in the prediction error
- Therefore, good estimators should on average have small prediction errors

Ridge regression
- By controlling how large the coefficients β may become (called regularization) we can control the variance
- One way to do this is to define a constrained space in which we search for the coefficients

Model
Ridge regression is defined with the following constraint:
  minimize Σ_{i=1}^n (y_i − βᵀx_i)²  subject to  Σ_{j=1}^p β_j² ≤ t
Convention (important!): the columns of X are assumed to be standardized (mean 0, variance 1) and y is assumed to be centered (mean 0).

Penalized notation
The problem can also be expressed as classical least squares with the addition of a penalty term:
  minimize Σ_{i=1}^n (y_i − βᵀx_i)² + λ Σ_{j=1}^p β_j²
- λ ≥ 0 is a tuning parameter that controls the strength; there is a one-to-one correspondence between t and λ
- The problem is (strictly) convex and has a unique solution
- An interesting thing is that this solution might have a smaller prediction error than OLS
- The reason for scaling (standardizing) our variables is that the penalty Σ_{j=1}^p β_j² would be unfair if the variables weren't on the same scale

Analytical solution
An analytical solution can be derived for the ridge estimator, of the form
  ˆβ_ridge = (XᵀX + λI)⁻¹ Xᵀ y
Since we are adding a positive constant to the diagonal of XᵀX, we are, in general, producing an invertible matrix XᵀX + λI even if XᵀX is singular. Historically, this particular aspect of ridge regression was the main motivation behind the adoption of this extension of OLS theory.
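A minimal sketch (added here) of the ridge closed form, assuming X is already standardized and y centered as the slides require; the function name is my own.

    import numpy as np

    def ridge_fit(X, y, lam):
        """Ridge estimator (X^T X + lambda*I)^{-1} X^T y for standardized X, centered y."""
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)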

Connection to OLS
The ridge regression estimator is related to the classical OLS estimator ˆβ_OLS in the following manner:
  ˆβ_ridge = [I + λ(XᵀX)⁻¹]⁻¹ ˆβ_OLS
Moreover, when X is composed of orthonormal variables, so that XᵀX = I_p, it follows that
  ˆβ_ridge = ˆβ_OLS / (1 + λ)

Tuning parameter λ
- The solution of the minimization problem obviously depends on the tuning parameter λ
- Therefore, we have a solution for every λ > 0
- λ is called the shrinkage parameter
- λ controls the size of the coefficients and the amount of regularization
- For λ = 0 we recover the least squares estimate
- As λ increases, the coefficients shrink towards zero
- We need a way to select an optimal λ

Bias and variance of ridge regression
- There is a closed-form solution for the bias and variance terms
- It is not (too) hard (it needs a few hours/days :)) to prove that ridge regression gives biased estimates
- Usually, as λ increases the bias increases, and as λ increases the variance decreases

Helpful with multicollinearity?
- One of the problems of OLS is that it is not really useful in situations with collinearity, due to the large increase in the variance of the estimator
- By biasing the estimates towards zero (penalizing), ridge regression circumvents this problem
- Yet another part where the presenter should wave his hands a bit, and explain something too ;)

Ridge trace
[Figure: ridge trace of the coefficient estimates as λ varies; source: MathWorks]

Choosing λ
The most common approach is K-fold cross-validation:
- Partition the training data T into K separate sets (T_1, T_2, ..., T_K) of (almost) equal size
- For each k = 1, 2, ..., K, fit the model ˆf_k^λ to the training set excluding the k-th fold T_k
- Compute the fitted values for the observations in T_k, based on the training data that excluded this fold
- Compute the cross-validation error (CVE) for the k-th fold:
    CVE_k^λ = (1/|T_k|) Σ_{(x,y)∈T_k} (y − ˆf_k^λ(x))²

Choosing λ (part 2)
The overall cross-validation error is then
  CVE^λ = (1/K) Σ_{k=1}^K CVE_k^λ
Select λ as the one with the minimum CVE^λ.

Assessing the stability
- It is often the case that we are interested in how good ridge regression is for our problem, or how stable the λ coefficients are (under resampling)
- In this case we can employ a nested cross-validation: we first split the data set into folds representing test data, and with the remaining folds we employ cross-validation as previously described
- This approach is sometimes referred to as the training/validation/test approach
- The outer loop is used to assess the performance of the model, and the inner loop is used to select the best model; the model is selected on each outer training set (using the inner CV loop) and its performance is measured on the corresponding outer test set
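A self-contained sketch (mine) of K-fold cross-validation for a single value of λ, using the ridge closed form inline; in practice one would loop this over a grid of λ values and pick the minimizer.

    import numpy as np

    def cv_error_for_lambda(X, y, lam, K=5, seed=0):
        """K-fold cross-validation error of the ridge fit for one lambda."""
        n, p = X.shape
        folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
        errors = []
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(n), test_idx)
            Xt, yt = X[train_idx], y[train_idx]
            beta = np.linalg.solve(Xt.T @ Xt + lam * np.eye(p), Xt.T @ yt)
            errors.append(np.mean((y[test_idx] - X[test_idx] @ beta) ** 2))
        return np.mean(errors)

    # best_lam = min(lambda_grid, key=lambda l: cv_error_for_lambda(X, y, l))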

Variable selection?
- It can be shown that ridge regression doesn't set coefficients exactly to zero unless λ = ∞, in which case they are all zero
- Hence, ridge regression cannot perform variable selection
- Also, even though it performs well in terms of prediction accuracy, it does poorly in terms of offering a clear interpretation
- What do we do with small coefficients?

Bayesian framework
Suppose that the errors are independent and drawn from a normal distribution. If the coefficients β come from a Gaussian distribution with zero mean and standard deviation a function of λ, then it follows that the most likely value for β (called the posterior mode) is given by the ridge regression solution.

Comments
Ridge regression has one obvious disadvantage (relevant where variables/predictors are expensive): it includes all p variables in the final model. For large p this creates problems in model interpretation.

LASSO

Sparsity
- Sparsity implies that only a few variables are important (their coefficients are non-zero) and the rest are not important (their coefficients are zero)
- Can we achieve this by constraining (shrinking) coefficients in our algorithms?
- Oh nooo, the presenter is waving his hands again :)

Model
LASSO: least absolute shrinkage and selection operator.
LASSO regression is defined with the following constraint:
  minimize Σ_{i=1}^n (y_i − βᵀx_i)²  subject to  Σ_{j=1}^p |β_j| ≤ t
Convention (important!): the columns of X are assumed to be standardized (mean 0, variance 1) and y is assumed to be centered (mean 0).

Penalized notation
The problem can also be expressed as classical least squares with the addition of a penalty term:
  minimize Σ_{i=1}^n (y_i − βᵀx_i)² + λ Σ_{j=1}^p |β_j|
- Again, we have a tuning parameter λ ≥ 0 that controls the amount of regularization, and there is a one-to-one correspondence between t and λ
- Unlike ridge regression, no closed-form solution is known for the LASSO

Comparison to ridge
- The tuning parameter λ controls the strength of the penalty, and (as in ridge regression) we get ˆβ_lasso = ˆβ_OLS when λ = 0 and ˆβ_lasso = 0 when λ = ∞
- For λ in between these two extremes we are balancing two ideas: fitting a linear model and shrinking the coefficients
- But the nature of the |β_j| penalty (i.e. the ℓ1 penalty) causes some coefficients to be shrunk to zero exactly
- This makes the LASSO substantially different from ridge regression: it is able to perform variable selection in the linear model
- As λ increases, more coefficients are set to zero, so fewer variables are selected
- In terms of prediction error (or mean squared error), the lasso performs comparably to ridge regression

Tuning parameter λ
- The solution of the minimization problem obviously depends on the tuning parameter λ
- Therefore, we have a solution for every λ > 0
- λ controls the size of the coefficients and the amount of regularization
- We need a way to select an optimal λ

Bias and variance of lasso regression
- Although we can't write explicit formulas for the bias and variance of the lasso estimate, we can see the same general trend
- Usually, as λ increases the bias increases, and as λ increases the variance decreases

Pathwise coordinate descent for LASSO
- One of the algorithms for solving the problem is pathwise coordinate descent for the lasso
- Coordinate descent: optimize one parameter (coordinate) at a time
- For a single standardized predictor the solution is the soft-thresholded estimate
    sign(ˆβ_OLS)(|ˆβ_OLS| − λ)_+
  where ˆβ_OLS is the least squares estimator
- Idea: with multiple predictors, cycle through each predictor in turn; we compute the partial residuals r_i = y_i − Σ_{j≠k} x_ij ˆβ_j and apply soft-thresholding, pretending that our data are (x_ik, r_i)

Pathwise coordinate descent (part 2)
- Start with a large value of λ and slowly decrease it
- Most coordinates that are zero never become non-zero
- Easy to implement, fast and simple
- Can be generalized to similar models (connected with the lasso)
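A sketch (mine, not the lecturer's code) of coordinate descent with soft-thresholding; it assumes standardized columns (x_kᵀx_k = n), a centered y, and the common (1/(2n))·RSS + λ·Σ|β_j| scaling of the objective, which differs from the slide's RSS + λ·Σ|β_j| only by a rescaling of λ.

    import numpy as np

    def soft_threshold(z, lam):
        """sign(z) * max(|z| - lam, 0)."""
        return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

    def lasso_cd(X, y, lam, n_iter=100):
        """Coordinate descent for the lasso (standardized X, centered y)."""
        n, p = X.shape
        beta = np.zeros(p)
        for _ in range(n_iter):
            for k in range(p):
                r = y - X @ beta + X[:, k] * beta[k]   # partial residual excluding x_k
                z = X[:, k] @ r / n                     # univariate OLS coefficient on r
                beta[k] = soft_threshold(z, lam)        # soft-threshold that coefficient
        return beta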

Bayesian framework
Suppose that the errors are independent and drawn from a normal distribution. If the coefficients β come from a double-exponential (Laplace) distribution with mean zero and scale parameter a function of λ, then it follows that the most likely value for β (called the posterior mode) is given by the lasso regression solution.

Comments
- There are no (well developed/tested) tools and theory that allow these methods to be used in full statistical practice: standard errors, p-values and confidence intervals that account for the adaptive nature of the estimation
- It selects somewhat arbitrarily from among groups of correlated variables
- In a dataset with n observations it can select at most n predictors/variables

Group LASSO
- How can we allow predefined groups of covariates to be selected into or out of a model together, so that all the members of a particular group are either included or not included?
- This is useful for categorical variables (coded as dummy variables), since it often doesn't make sense to include only a few levels of the covariate
- It is also quite useful in biological studies: since genes and proteins often lie in known pathways, an investigator may be more interested in which pathways are related to an outcome than in whether particular individual genes are

Group LASSO (part 2)
For the group LASSO we are trying to solve
  minimize ‖ y − Σ_{j=1}^J X_j β_j ‖²_2 + λ Σ_{j=1}^J ‖β_j‖_{K_j}
where ‖β_j‖_{K_j} = √(β_jᵀ K_j β_j).
The design matrix X and the coefficient vector β have been replaced by a collection of design matrices X_j and coefficient vectors β_j, one for each of the J groups. The K_j are positive definite matrices.

Motivation
- What if we could take the best from ridge and LASSO and combine it into just one model?
- We know that the ℓ1 penalty is useful for generating sparse models
- We know that the ℓ2 (ridge) penalty removes the limitation on the number of selected variables and therefore encourages a grouping effect

Elastic Net

Model
Elastic Net regularization is defined as a combination of the LASSO and ridge regressions by
  ˆβ_EN = argmin_β ‖y − Xβ‖² + λ‖β‖_1 + (1 − λ)‖β‖²_2
What is the geometry of the Elastic Net?

Solution
- The quadratic penalty term makes the loss function strictly convex, and it therefore has a unique minimum
- A stage-wise algorithm called LARS-EN efficiently solves the entire elastic net solution path
- The elastic net solution path is piecewise linear
- The presenter waves his hands for the third time in just one lecture!?!? Hope this is the last time (this is a data science school, not instructions in hand signaling)! :p
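A usage example (mine, not from the slides) with scikit-learn's ElasticNet on simulated data; note that scikit-learn parametrizes the penalty as alpha·(l1_ratio·‖β‖_1 + 0.5·(1 − l1_ratio)·‖β‖²_2), which plays the role of the λ / (1 − λ) mix written on the slide.

    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))                          # toy predictors
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

    model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mix of l1 and l2 penalties
    print(model.coef_)                                      # some coefficients are exactly zero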

Take-home message
This is THE END, my only (OLS) friend, the end.
All models are wrong but some are useful, and only proper experimental design leads to happiness.


STATS216v Introduction to Statistical Learning Stanford University, Summer Midterm Exam (Solutions) Duration: 1 hours Instructions: STATS216v Introduction to Statistical Learning Stanford University, Summer 2017 Remember the university honor code. Midterm Exam (Solutions) Duration: 1 hours Write your name and SUNet ID

More information

Correlation and regression

Correlation and regression 1 Correlation and regression Yongjua Laosiritaworn Introductory on Field Epidemiology 6 July 2015, Thailand Data 2 Illustrative data (Doll, 1955) 3 Scatter plot 4 Doll, 1955 5 6 Correlation coefficient,

More information

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences

Biostatistics-Lecture 16 Model Selection. Ruibin Xi Peking University School of Mathematical Sciences Biostatistics-Lecture 16 Model Selection Ruibin Xi Peking University School of Mathematical Sciences Motivating example1 Interested in factors related to the life expectancy (50 US states,1969-71 ) Per

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Generalized Linear Models for Non-Normal Data

Generalized Linear Models for Non-Normal Data Generalized Linear Models for Non-Normal Data Today s Class: 3 parts of a generalized model Models for binary outcomes Complications for generalized multivariate or multilevel models SPLH 861: Lecture

More information

Proteomics and Variable Selection

Proteomics and Variable Selection Proteomics and Variable Selection p. 1/55 Proteomics and Variable Selection Alex Lewin With thanks to Paul Kirk for some graphs Department of Epidemiology and Biostatistics, School of Public Health, Imperial

More information

Statistical aspects of prediction models with high-dimensional data

Statistical aspects of prediction models with high-dimensional data Statistical aspects of prediction models with high-dimensional data Anne Laure Boulesteix Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie February 15th, 2017 Typeset by

More information

ISQS 5349 Spring 2013 Final Exam

ISQS 5349 Spring 2013 Final Exam ISQS 5349 Spring 2013 Final Exam Name: General Instructions: Closed books, notes, no electronic devices. Points (out of 200) are in parentheses. Put written answers on separate paper; multiple choices

More information

Lecture Data Science

Lecture Data Science Web Science & Technologies University of Koblenz Landau, Germany Lecture Data Science Regression Analysis JProf. Dr. Last Time How to find parameter of a regression model Normal Equation Gradient Decent

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

A Short Introduction to the Lasso Methodology

A Short Introduction to the Lasso Methodology A Short Introduction to the Lasso Methodology Michael Gutmann sites.google.com/site/michaelgutmann University of Helsinki Aalto University Helsinki Institute for Information Technology March 9, 2016 Michael

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Chapter 6 October 18, 2016 Chapter 6 October 18, 2016 1 / 80 1 Subset selection 2 Shrinkage methods 3 Dimension reduction methods (using derived inputs) 4 High

More information

Python 데이터분석 보충자료. 윤형기

Python 데이터분석 보충자료. 윤형기 Python 데이터분석 보충자료 윤형기 (hky@openwith.net) 단순 / 다중회귀분석 Logistic Regression 회귀분석 REGRESSION Regression 개요 single numeric D.V. (value to be predicted) 과 one or more numeric I.V. (predictors) 간의관계식. "regression"

More information

Math 423/533: The Main Theoretical Topics

Math 423/533: The Main Theoretical Topics Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)

More information

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă

STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă STAT 535 Lecture 5 November, 2018 Brief overview of Model Selection and Regularization c Marina Meilă mmp@stat.washington.edu Reading: Murphy: BIC, AIC 8.4.2 (pp 255), SRM 6.5 (pp 204) Hastie, Tibshirani

More information

Transformations The bias-variance tradeoff Model selection criteria Remarks. Model selection I. Patrick Breheny. February 17

Transformations The bias-variance tradeoff Model selection criteria Remarks. Model selection I. Patrick Breheny. February 17 Model selection I February 17 Remedial measures Suppose one of your diagnostic plots indicates a problem with the model s fit or assumptions; what options are available to you? Generally speaking, you

More information

CMSC858P Supervised Learning Methods

CMSC858P Supervised Learning Methods CMSC858P Supervised Learning Methods Hector Corrada Bravo March, 2010 Introduction Today we discuss the classification setting in detail. Our setting is that we observe for each subject i a set of p predictors

More information

Final Review. Yang Feng. Yang Feng (Columbia University) Final Review 1 / 58

Final Review. Yang Feng.   Yang Feng (Columbia University) Final Review 1 / 58 Final Review Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Final Review 1 / 58 Outline 1 Multiple Linear Regression (Estimation, Inference) 2 Special Topics for Multiple

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 14, 2014 Today s Schedule Course Project Introduction Linear Regression Model Decision Tree 2 Methods

More information

A simulation study of model fitting to high dimensional data using penalized logistic regression

A simulation study of model fitting to high dimensional data using penalized logistic regression A simulation study of model fitting to high dimensional data using penalized logistic regression Ellinor Krona Kandidatuppsats i matematisk statistik Bachelor Thesis in Mathematical Statistics Kandidatuppsats

More information

MSA220/MVE440 Statistical Learning for Big Data

MSA220/MVE440 Statistical Learning for Big Data MSA220/MVE440 Statistical Learning for Big Data Lecture 7/8 - High-dimensional modeling part 1 Rebecka Jörnsten Mathematical Sciences University of Gothenburg and Chalmers University of Technology Classification

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 6: Model complexity scores (v3) Ramesh Johari ramesh.johari@stanford.edu Fall 2015 1 / 34 Estimating prediction error 2 / 34 Estimating prediction error We saw how we can estimate

More information

Linear Models in Machine Learning

Linear Models in Machine Learning CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,

More information

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study

TECHNICAL REPORT # 59 MAY Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study TECHNICAL REPORT # 59 MAY 2013 Interim sample size recalculation for linear and logistic regression models: a comprehensive Monte-Carlo study Sergey Tarima, Peng He, Tao Wang, Aniko Szabo Division of Biostatistics,

More information

Multiple Regression Analysis

Multiple Regression Analysis Chapter 4 Multiple Regression Analysis The simple linear regression covered in Chapter 2 can be generalized to include more than one variable. Multiple regression analysis is an extension of the simple

More information

Lecture 6: Methods for high-dimensional problems

Lecture 6: Methods for high-dimensional problems Lecture 6: Methods for high-dimensional problems Hector Corrada Bravo and Rafael A. Irizarry March, 2010 In this Section we will discuss methods where data lies on high-dimensional spaces. In particular,

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Business Statistics. Tommaso Proietti. Linear Regression. DEF - Università di Roma 'Tor Vergata'

Business Statistics. Tommaso Proietti. Linear Regression. DEF - Università di Roma 'Tor Vergata' Business Statistics Tommaso Proietti DEF - Università di Roma 'Tor Vergata' Linear Regression Specication Let Y be a univariate quantitative response variable. We model Y as follows: Y = f(x) + ε where

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Stability and the elastic net

Stability and the elastic net Stability and the elastic net Patrick Breheny March 28 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/32 Introduction Elastic Net Our last several lectures have concentrated on methods for

More information

Introduction to Statistical modeling: handout for Math 489/583

Introduction to Statistical modeling: handout for Math 489/583 Introduction to Statistical modeling: handout for Math 489/583 Statistical modeling occurs when we are trying to model some data using statistical tools. From the start, we recognize that no model is perfect

More information

Logistic Regression: Regression with a Binary Dependent Variable

Logistic Regression: Regression with a Binary Dependent Variable Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression

More information

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77

Linear Regression. September 27, Chapter 3. Chapter 3 September 27, / 77 Linear Regression Chapter 3 September 27, 2016 Chapter 3 September 27, 2016 1 / 77 1 3.1. Simple linear regression 2 3.2 Multiple linear regression 3 3.3. The least squares estimation 4 3.4. The statistical

More information

Tuning Parameter Selection in L1 Regularized Logistic Regression

Tuning Parameter Selection in L1 Regularized Logistic Regression Virginia Commonwealth University VCU Scholars Compass Theses and Dissertations Graduate School 2012 Tuning Parameter Selection in L1 Regularized Logistic Regression Shujing Shi Virginia Commonwealth University

More information

Linear Regression Models P8111

Linear Regression Models P8111 Linear Regression Models P8111 Lecture 25 Jeff Goldsmith April 26, 2016 1 of 37 Today s Lecture Logistic regression / GLMs Model framework Interpretation Estimation 2 of 37 Linear regression Course started

More information

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson

Bayesian variable selection via. Penalized credible regions. Brian Reich, NCSU. Joint work with. Howard Bondell and Ander Wilson Bayesian variable selection via penalized credible regions Brian Reich, NC State Joint work with Howard Bondell and Ander Wilson Brian Reich, NCSU Penalized credible regions 1 Motivation big p, small n

More information

6. Regularized linear regression

6. Regularized linear regression Foundations of Machine Learning École Centrale Paris Fall 2015 6. Regularized linear regression Chloé-Agathe Azencot Centre for Computational Biology, Mines ParisTech chloe agathe.azencott@mines paristech.fr

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION SEAN GERRISH AND CHONG WANG 1. WAYS OF ORGANIZING MODELS In probabilistic modeling, there are several ways of organizing models:

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Theorems. Least squares regression

Theorems. Least squares regression Theorems In this assignment we are trying to classify AML and ALL samples by use of penalized logistic regression. Before we indulge on the adventure of classification we should first explain the most

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10 COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem

More information

Lecture 5: Clustering, Linear Regression

Lecture 5: Clustering, Linear Regression Lecture 5: Clustering, Linear Regression Reading: Chapter 10, Sections 3.1-3.2 STATS 202: Data mining and analysis October 4, 2017 1 / 22 .0.0 5 5 1.0 7 5 X2 X2 7 1.5 1.0 0.5 3 1 2 Hierarchical clustering

More information

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures FE661 - Statistical Methods for Financial Engineering 9. Model Selection Jitkomut Songsiri statistical models overview of model selection information criteria goodness-of-fit measures 9-1 Statistical models

More information

Regression.

Regression. Regression www.biostat.wisc.edu/~dpage/cs760/ Goals for the lecture you should understand the following concepts linear regression RMSE, MAE, and R-square logistic regression convex functions and sets

More information

Sparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda

Sparse regression. Optimization-Based Data Analysis.   Carlos Fernandez-Granda Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic

More information

LECTURE 10: LINEAR MODEL SELECTION PT. 1. October 16, 2017 SDS 293: Machine Learning

LECTURE 10: LINEAR MODEL SELECTION PT. 1. October 16, 2017 SDS 293: Machine Learning LECTURE 10: LINEAR MODEL SELECTION PT. 1 October 16, 2017 SDS 293: Machine Learning Outline Model selection: alternatives to least-squares Subset selection - Best subset - Stepwise selection (forward and

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

ECON The Simple Regression Model

ECON The Simple Regression Model ECON 351 - The Simple Regression Model Maggie Jones 1 / 41 The Simple Regression Model Our starting point will be the simple regression model where we look at the relationship between two variables In

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Prediction Instructor: Yizhou Sun yzsun@ccs.neu.edu September 21, 2015 Announcements TA Monisha s office hour has changed to Thursdays 10-12pm, 462WVH (the same

More information

High-Dimensional Statistical Learning: Introduction

High-Dimensional Statistical Learning: Introduction Classical Statistics Biological Big Data Supervised and Unsupervised Learning High-Dimensional Statistical Learning: Introduction Ali Shojaie University of Washington http://faculty.washington.edu/ashojaie/

More information

Lecture 5: LDA and Logistic Regression

Lecture 5: LDA and Logistic Regression Lecture 5: and Logistic Regression Hao Helen Zhang Hao Helen Zhang Lecture 5: and Logistic Regression 1 / 39 Outline Linear Classification Methods Two Popular Linear Models for Classification Linear Discriminant

More information

Matematické Metody v Ekonometrii 7.

Matematické Metody v Ekonometrii 7. Matematické Metody v Ekonometrii 7. Multicollinearity Blanka Šedivá KMA zimní semestr 2016/2017 Blanka Šedivá (KMA) Matematické Metody v Ekonometrii 7. zimní semestr 2016/2017 1 / 15 One of the assumptions

More information

Model comparison and selection

Model comparison and selection BS2 Statistical Inference, Lectures 9 and 10, Hilary Term 2008 March 2, 2008 Hypothesis testing Consider two alternative models M 1 = {f (x; θ), θ Θ 1 } and M 2 = {f (x; θ), θ Θ 2 } for a sample (X = x)

More information