Linear Regression 9/23/17. Simple linear regression. Advertising sales: Variance changes based on # of TVs. Advertising sales: Normal error?



1 Simple linear regression
Linear Regression, Nicole Beckage
Model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, so the fitted value is $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$.
A method to assess and quantify the linear relationship between two (continuous) variables. The slope of the line gives the expected change in y per unit change in x, and its sign matches the sign of the correlation.
Key assumptions:
- The relationship between X and Y is linear.
- Y is normally distributed for each value of X.
- The variance of Y is the same at every value of X (homogeneity of variance, e.g. the error variance $\sigma^2$ is constant across the whole population).
- The observations are independent.
- The noise is distributed randomly.
[Figure: Advertising sales: variance changes based on # of TVs]
[Figure: Advertising sales: normal error?]

2 What's the best fitting line?
Define the best-fit line as the one that minimizes the sum of squared errors between $y_i$ and $\hat{y}_i$ (the ordinary least squares estimate):
$\min_{\hat{\beta}_0, \hat{\beta}_1} \sum_{i=1}^{N} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$
How do we find the solution? Remember calc 1?
Minimization:
- We have a function we want to minimize.
- Minima/maxima have a first derivative equal to zero.
- Second derivative test? This is a convex problem (not proved here), so any local minimum is also the global minimum, and any point where the first derivative equals zero is a minimum.
Minimizing squared error:
- Solving for the intercept: divide by 2, distribute the sum, divide by N.
- Solving for the slope: distribute $x_i$, substitute $\hat{\beta}_0$, distribute the sum.
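Carrying the minimization steps above through gives the standard closed form $\hat{\beta}_1 = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$. A minimal sketch in plain Python follows; the function name `fit_simple_ols` and the data are illustrative assumptions, not part of the slides.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared errors."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: co-variation of x and y over variation of x (the 1/N factors cancel).
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("x has zero variance; the slope is undefined")
    slope = s_xy / s_xx
    # Intercept: follows from setting the derivative w.r.t. the intercept to zero.
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Illustrative data lying exactly on y = 1 + 2x.
b0, b1 = fit_simple_ols([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(b0, b1)
```

Because the sample lies exactly on a line, the fit should recover that line's intercept and slope up to floating-point rounding.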

3 Solving for slope cont.
Simple regression to linear regression:
- We often use a design matrix instead of X.
- In the design matrix, $x_0 = 1$ (a column of ones that carries the intercept).
Is the linear model trivial?
- To fit? Yes. This is possibly the easiest MLE model we will see.
- To interpret? Yes. Sometimes this is a very, very good thing.
- Can we do more with it?
Expanding the power of linear models:
- Standardization
- Transformation of inputs: ln(x), exp(x), etc.
- Linear basis function expansion: $p(y \mid x, \theta) = \mathcal{N}(y \mid w^T \phi(x), \sigma^2)$ with $\phi(x) = [1, x, x^2, \ldots, x^p]$
- Numeric or dummy coding for qualitative variables
- Interactions between variables
Standardization:
- Input variables are assumed to be normally distributed.
- Many times the explanatory variables will have different units or, worse, the same units on different scales. Why could this be problematic?
- Let's remap the data to mean 0 and standard deviation 1.
- Note: we compute the mapping from the training data only. Save the mapping and apply it to all validation/testing data we see later.
$z_i = \frac{x_i - \bar{x}_i}{\sqrt{\mathrm{Var}(x_i)}}$
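The note about saving the mapping can be sketched as two small functions: one that learns the mean and standard deviation from the training split, and one that applies that saved mapping to any later data. The function names and numbers here are hypothetical, and the population variance (dividing by n) is an assumed convention.

```python
import math


def fit_standardizer(train_values):
    """Learn the mean and standard deviation from training data only."""
    n = len(train_values)
    mean = sum(train_values) / n
    var = sum((v - mean) ** 2 for v in train_values) / n  # population variance
    std = math.sqrt(var)
    if std == 0:
        raise ValueError("feature is constant; cannot standardize")
    return mean, std


def standardize(values, mean, std):
    """Apply a previously saved mapping: z = (x - mean) / std."""
    return [(v - mean) / std for v in values]


train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

mean, std = fit_standardizer(train)     # mapping comes from the training split
z_train = standardize(train, mean, std)
z_test = standardize(test, mean, std)   # same mapping reused, not refit
```

Refitting the mapping on the test split would leak information about the test data into preprocessing, which is why the parameters are returned and reused.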


4 Transformations on input
- Some variables are more easily interpreted on other scales.
- We've seen this in the case of probabilities: log and log odds are easier to interpret.
- Other natural examples: human perception tends to be on a logarithmic rather than linear scale (e.g. Weber's law, the Just Noticeable Difference); sound and Fourier transformations.
Basis functions and linearity?
- How can we change the basis without losing linearity? $\phi(x) = [1, x, x^2, \ldots, x^p]$
- The model is linear in the parameters, even though it is nonlinear in x.
- We can make really complicated models that are still linear.
Dummy variables:
- Say we have a class variable coded 1, 2, ..., k.
- A linear model will weight class 2 twice as much as class 1, and class k, k times as much as class 1. This assumption might not be valid.
- Instead we can create a set of k-1 dummy (0/1 indicator) variables.
Interaction terms:
- Create a new feature $x_i = x_j x_k$, or a new feature $x_i = x_j x_k x_l$, etc.
- Allows us to fit effectively separate lines for different groups.
[Figure: student and income in the model but no interaction]
[Figure: interaction of student with income]
Problems with linear models:
- Non-linearity of response-predictor relationships
- Correlation of error terms
- Non-constant variance of error terms
- Outliers
- High leverage points
- Collinearity
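A small sketch of the dummy-variable and interaction ideas above, in plain Python. The helper names `dummy_code` and `add_interaction` are hypothetical, the first category is assumed to be the baseline, and the student/income row is made-up illustrative data.

```python
def dummy_code(labels, categories):
    """Encode labels as k-1 indicator columns; the first category is the baseline."""
    non_baseline = categories[1:]
    return [[1 if label == cat else 0 for cat in non_baseline] for label in labels]


def add_interaction(row, j, k):
    """Return a copy of the row with the product feature x_j * x_k appended."""
    return row + [row[j] * row[k]]


categories = [1, 2, 3]            # class variable coded 1..k with k = 3
labels = [1, 3, 2, 3]
encoded = dummy_code(labels, categories)
# encoded == [[0, 0], [0, 1], [1, 0], [0, 1]]

# Hypothetical row: [is_student (0/1), income]. The product term lets the
# income slope differ between students and non-students.
row = add_interaction([1, 40.0], 0, 1)
# row == [1, 40.0, 40.0]
```

With k-1 indicators, each coefficient is read as an offset from the baseline class rather than as a multiple of an arbitrary class number.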


5 Non-linearity of error terms
Outliers and high leverage points:
- Outliers: extreme values in the output.
- High leverage points: extreme values in the input.
Collinearity:
- If inputs are strongly correlated (or inversely correlated), they are collinear.
- If the inputs are highly collinear, then fitting the best betas can be hard.
- Extreme case: if $x_1 = c x_2$, then $\beta_1 x_1 + \beta_2 x_2 = (c\beta_1 + \beta_2) x_2$, so only the combination $c\beta_1 + \beta_2$ is determined and there are infinitely many solutions.
Other considerations:
- As one includes transformed variables, dummy variables, interaction terms, etc., the model is more and more likely to overfit the data.
- We'll see a few ways of handling this soon. Note that Breiman complains (rightfully) about many of these approaches.
  - Subset selection (if we have time)
  - Regularization
OLS model: $\hat{\beta} = (X^T X)^{-1} X^T Y$
Why use this model?
- It's simple to solve.
- It allows for error decomposition. We can quantify what percent of the variance in the data isn't explained.
Why not use this model?
- Unrealistic assumptions about the world. What are they?
- Collinearity issues.
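The OLS formula and the collinearity problem can be seen together in a teaching sketch that solves the normal equations $(X^T X)\beta = X^T Y$ by Gaussian elimination. Everything here (helper names, data, the tolerance) is an assumption for illustration; numerical libraries typically use more stable factorizations than forming $X^T X$ directly.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    b_t = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in b_t] for row in a]


def solve(a, b, tol=1e-12):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < tol:
            raise ValueError("X^T X is singular (perfectly collinear inputs?)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_ols(design, y):
    """Solve the normal equations (X^T X) beta = X^T y."""
    x_t = transpose(design)
    gram = matmul(x_t, design)
    x_t_y = [sum(xi * yi for xi, yi in zip(col, y)) for col in x_t]
    return solve(gram, x_t_y)


# Design matrix with a leading column of ones for the intercept.
X = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 2.0, 2.0], [1.0, 3.0, 1.0]]
y = [1.0 + 2.0 * r[1] - 1.0 * r[2] for r in X]   # generated with beta = [1, 2, -1]
beta = fit_ols(X, y)

# Perfectly collinear columns (third = 2 * second) make X^T X singular.
X_bad = [[1.0, 1.0, 2.0], [1.0, 2.0, 4.0], [1.0, 3.0, 6.0], [1.0, 4.0, 8.0]]
try:
    fit_ols(X_bad, [1.0, 2.0, 3.0, 4.0])
    collinear_failed = False
except ValueError:
    collinear_failed = True
```

The second call fails on purpose: with an exactly collinear pair of columns there is no unique coefficient vector, which is the extreme case described above.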

6 OLS as a model
- It's usually not the best we can do.
- It may include variables that capture noise in the output variable instead of actual signal.
- One solution would be to perform a t-test on each β to see if it's significant.
- The issue: at a 0.05 significance level, about 1 in 20 truly irrelevant predictors will look significant by chance, so with more than 20 predictors we will likely call a predictor significant when it isn't.
- There are corrections for multiple t-tests (Bonferroni's or Scheffé's correction), but is there something else we can do?
What can we do instead?
- Use only a subset of the predictors. Specifically, use the predictors that are the most useful.
- If we have two variables, we can compare all models: the model with no predictors, the model containing only $X_1$, the model containing only $X_2$, and the model containing both.
Best subset selection:
- $\frac{p(p-1)}{2}$ models contain 2 predictors.
- $2^p$ models in total.
Curse of dimensionality:
- If we have lots of data we can usually make accurate predictions even when we have TONS of predictors.
- But often we have tons of predictors and not enough data.
- Even if our predictors are binary, d of them give $2^d$ possible input combinations.
- Adding one dimension multiplies the size of the space, so it grows exponentially with the number of dimensions.
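The model counts for best subset selection can be checked by brute-force enumeration. This sketch only lists candidate subsets (it does not fit or score them); the predictor names are placeholders.

```python
from itertools import combinations


def all_subsets(predictors):
    """Yield every subset of the predictors, from the empty model to the full one."""
    for size in range(len(predictors) + 1):
        for subset in combinations(predictors, size):
            yield subset


predictors = ["x1", "x2", "x3", "x4"]   # hypothetical predictor names
models = list(all_subsets(predictors))

n_total = len(models)                              # 2**4 = 16 candidate models
n_pairs = sum(1 for m in models if len(m) == 2)    # 4*3/2 = 6 two-predictor models
```

Each extra predictor doubles `n_total`, which is why exhaustive search stops being practical well before the number of predictors gets large.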

7 Other parameter selection methods
- Subset selection as discussed above makes no guarantees about the resulting model. It's also somewhat arbitrary.
- It is better than exploring the whole space of possible models, but can we do better?
- Regularization! Let's introduce bias such that we favor smaller models.
What our model needs to do:
- Usually, we are not just trying to explain observed data. We want to uncover meaningful trends and predict future observations.
- Our questions then are:
  - Is $\hat{\beta}$ a good estimate of β (consistent, minimizes error)?
  - Will $X\hat{\beta}$ fit future observations well (generalizes well)?
Ridge regression (Frequentist):
- If the βs are unconstrained, they can get very large. As they get larger, they are more susceptible to high variance.
- Regularize the coefficients: add a constraint to keep them small.
$\min_{\beta} \sum_{i=1}^{N} (y_i - x_i^T \beta)^2 \quad \text{s.t.} \quad \sum_{j=1}^{p} \beta_j^2 \le t$
- This requires Y to be centered and the Xs to be standardized. Centered: mean 0. Standardized: mean 0, variance 1.
Ridge Regression: L2 penalty
- New loss function (instead of MSE or RSS): the penalized residual sum of squares.
$\mathrm{PRSS}(\beta)_{\text{ridge}} = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$
- This is a convex optimization problem: there's a unique solution.
- The solution is a function of λ.
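For reference, setting the gradient of the penalized residual sum of squares to zero gives the standard ridge closed form. This is the textbook result written in matrix notation (with $I$ the $p \times p$ identity, and $y$ centered so no intercept term appears); it is not transcribed from the slides.

```latex
\begin{aligned}
\mathrm{PRSS}(\beta)_{\text{ridge}} &= (y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta \\
\frac{\partial\, \mathrm{PRSS}}{\partial \beta} &= -2 X^T (y - X\beta) + 2 \lambda \beta = 0 \\
\hat{\beta}^{\text{ridge}} &= (X^T X + \lambda I)^{-1} X^T y
\end{aligned}
```

For $\lambda > 0$ the matrix $X^T X + \lambda I$ is always invertible, which is one reason the penalty also helps with collinear inputs.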


8 λ
- λ is known as the shrinkage parameter.
- λ controls the size the β coefficients can take, i.e. the amount of regularization.
- As $\lambda \to 0$ we obtain $\hat{\beta}^{\text{ridge}} = \hat{\beta}^{\text{OLS}}$.
- As $\lambda \to \infty$ we obtain $\hat{\beta}^{\text{ridge}}_{\lambda=\infty} = 0$ (intercept-only model).
[Figure: Ridge coefficients, with the OLS solution and the intercept-only solution marked]
Why does Ridge Regression help?
- Bias-variance tradeoff: we accept some bias to turn down the variance.
How do we choose λ? (geometric proof)
- We need a systematic and principled way of choosing λ.
- We want to choose the λ that minimizes the PRSS. Usually it's not the OLS solution.
- We want to minimize the size of the βs while minimizing the MSE.
[Figure: the blue ball is the beta contribution, the red is the OLS MSE. Just like in the two-dimensional case, we want the cross-over point.]
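The two limits of λ are easy to see in the one-predictor case, where minimizing $\sum_i (y_i - \beta x_i)^2 + \lambda \beta^2$ on centered data gives $\hat{\beta} = S_{xy} / (S_{xx} + \lambda)$. The sketch below assumes that single-predictor setup; the function name and data are illustrative.

```python
def ridge_slope(xs, ys, lam):
    """Ridge slope for one centered predictor: S_xy / (S_xx + lambda)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / (s_xx + lam)


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # exactly y = 1 + 2x, so the OLS slope is 2

slopes = {lam: ridge_slope(xs, ys, lam) for lam in (0.0, 5.0, 1e9)}
# lambda = 0 reproduces OLS; lambda = 5 halves the slope here (S_xx = 5);
# a huge lambda drives the slope toward 0, the intercept-only model.
```

The slope shrinks smoothly toward zero as λ grows but never reaches it for finite λ.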

9 Choosing lambda in practice
Lasso Regression:
- Ridge regression: keep the size of the βs small.
- Lasso regression: push some of the βs to exactly zero.
Regularization as Optimization:
- Similar to ridge, except with a different penalty (and thus a different interpretation).
- We could show that the lasso estimate is biased unless λ = 0.
- The analytical solution is less clear than for either ridge or OLS, but it is again a function of λ.
- Similar problem to before: we have to choose λ.
[Figure: Performance of lasso, with the OLS solution and the intercept-only solution marked]
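Assuming the method contrasted with ridge on this slide is the lasso, with penalty $\lambda \sum_j |\beta_j|$, the one-predictor case still has a closed form through soft-thresholding: minimizing $\sum_i (y_i - \beta x_i)^2 + \lambda |\beta|$ on centered data gives $\hat{\beta} = \operatorname{soft}(S_{xy}, \lambda/2) / S_{xx}$. The names and data below are illustrative assumptions.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, clipping at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_slope(xs, ys, lam):
    """Minimize sum((y - b*x)^2) + lam*|b| for one centered predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return soft_threshold(s_xy, lam / 2.0) / s_xx


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]           # S_xy = 10, S_xx = 5, OLS slope = 2

b_ols = lasso_slope(xs, ys, 0.0)    # no penalty: the OLS slope
b_mid = lasso_slope(xs, ys, 10.0)   # (10 - 5) / 5 = 1
b_zero = lasso_slope(xs, ys, 20.0)  # threshold reaches S_xy: exactly 0
```

Unlike the ridge slope, which only approaches zero, this estimate becomes exactly zero once λ is large enough, which is what makes the penalty act as a variable selector.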

10 Performance of lasso
How do we choose λ? (geometric proof)
[Figure: the teal square is the beta contribution, the red is the MSE. Just like in the two-dimensional case, we want the cross-over point.]
[Figure: Performance of lasso]
[Figure: Performance: ridge vs lasso]
[Figure: Probability of seeing a given value of β]
