Homework 1: Solutions


Statistics 413, Fall 2017

Data Analysis

Note: All data analysis results are provided by Michael Rodgers.

1. Baseball Data

(a) What are the most important features for predicting players' salary?

(i) Fit and visualize regularization paths for the following methods: Lasso; Elastic Net (at two separate alpha levels); Adaptive Lasso.

[Figure: Lasso regularization paths. x-axis: Log Lambda; legend: Variable.]

[Figure: Elastic Net regularization paths, two panels (Alpha = .75 and Alpha = .25). x-axis: Log Lambda; legend: Variable.]

[Figure: Adaptive Lasso regularization paths. x-axis: Log Lambda; legend: Variable.]

(ii) What are the top predictors selected by each method? Are they different? If so, why?

Lasso's top predictors are 1. **** 2. **C** 3. **** 4. ****
Elastic Net's (alpha = .75) top predictors are 1. **** 2. **C** 3. **s** 4. **CRbi**
Elastic Net's (alpha = .25) top predictors are 1. **** 2. **C** 3. **s** 4. ****
Adaptive Lasso's top predictors are 1. **** 2. **** 3. **** 4. ****

All models pick up a strong signal in the cumulative runs variable, but the variable selection starts to deviate once we allow alpha to differ from 1. In both of those models, cumulative runs, hits, and at-bats are picked up. We also see some kind of hits variable stand out in all four models. The Elastic Net models selected three of the same top four predictors (****, C, s). This suggests that salary is based on cumulative stats rather than noncumulative stats. The top predictors for salary are **** and C according to the regularization paths.
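The regularization paths discussed in (a) can be sketched with a small coordinate-descent solver. This is only an illustration under stated assumptions, not the code behind the figures: the data are synthetic (the baseball data are not reproduced here), the objective uses the common parametrization (1/2n)||y - Xb||^2 + lambda * (alpha * ||b||_1 + (1 - alpha)/2 * ||b||_2^2), and the names `enet_path` and `first_active_index` are hypothetical helpers. Setting alpha = 1 gives the Lasso.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def enet_path(X, y, lambdas, alpha, n_sweeps=100):
    """Coordinate descent over a decreasing lambda grid, with warm starts.

    Assumes each column of X has mean 0 and mean square 1, and y is centered.
    Returns an array of shape (len(lambdas), p), one coefficient vector per lambda.
    """
    n, p = X.shape
    b = np.zeros(p)
    path = []
    for lam in lambdas:
        for _ in range(n_sweeps):
            for j in range(p):
                partial_resid = y - X @ b + X[:, j] * b[j]
                rho = X[:, j] @ partial_resid / n
                b[j] = soft_threshold(rho, lam * alpha) / (1.0 + lam * (1.0 - alpha))
        path.append(b.copy())
    return np.array(path)

def first_active_index(path, tol=1e-10):
    """Index of the first lambda at which each variable is nonzero (len(path) if never)."""
    active = np.abs(path) > tol
    return [int(np.argmax(col)) if col.any() else len(path) for col in active.T]

# Synthetic stand-in for the data: columns 0 and 1 are correlated, the way
# cumulative statistics tend to be; only columns 0 and 2 drive the response.
rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.5 * rng.normal(size=n)
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 3.0 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(size=n)
y = y - y.mean()

paths, first_active = {}, {}
for alpha in (1.0, 0.75, 0.25):
    # Smallest lambda at which every coefficient is zero for this alpha.
    lam_max = np.max(np.abs(X.T @ y)) / (n * alpha)
    lambdas = lam_max * np.logspace(0, -3, 30)
    paths[alpha] = (lambdas, enet_path(X, y, lambdas, alpha))
    first_active[alpha] = first_active_index(paths[alpha][1])
    print(f"alpha={alpha}: first-active index per variable = {first_active[alpha]}")
```

Plotting each column of a path against log(lambda) gives a figure of the same kind as those in (i). Variables that enter the path earliest (smallest first-active index) are the ones a path plot would flag as top predictors.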

(b) Which linear method is best at predicting players' salary?

(i) Compare the average prediction MSE on the test set for the following methods: Least Squares; Ridge Regression; Best Subsets; Lasso; Elastic Net; Adaptive Lasso.

(ii) Visualize the results of your comparisons.

[Table: Average Prediction MSE by Model. Surviving row labels: Ridge, Best Subsets, Lasso, Adaptive Lasso.]

[Table: columns Model, Best Prediction Error, Lambda, Alpha, Norm of beta, Zero Count. Two rows, one of them Best Subsets, have NA for Lambda and Alpha. Surviving row labels: Ridge, Best Subsets, Lasso, Adaptive Lasso.]
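The comparison in (b)(i) averages test-set MSE over repeated train/test splits. A minimal sketch of that procedure for two of the methods follows, assuming synthetic data, a 70/30 split, and 50 repetitions; none of these settings come from the original analysis, and `ridge_fit` and `average_test_mse` are hypothetical helper names. Least Squares is obtained as ridge with lambda = 0.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (X'X + lam*I)^{-1} X'y; lam = 0 gives least squares."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def average_test_mse(X, y, lam, n_splits=50, test_frac=0.3, seed=0):
    """Average test MSE over random train/test splits (same splits for a given seed)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = int(round(test_frac * n))
    mses = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        test, train = perm[:n_test], perm[n_test:]
        # Center with training-set means only, so the test set stays unseen.
        mu_x, mu_y = X[train].mean(axis=0), y[train].mean()
        b = ridge_fit(X[train] - mu_x, y[train] - mu_y, lam)
        pred = (X[test] - mu_x) @ b + mu_y
        mses.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(mses))

# Synthetic data standing in for the baseball data (an assumption).
rng = np.random.default_rng(1)
n, p = 80, 10
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 2.0 * rng.normal(size=n)

results = {lam: average_test_mse(X, y, lam) for lam in (0.0, 1.0, 10.0)}
for lam, mse in results.items():
    print(f"lambda={lam:5.1f}  average test MSE={mse:.3f}")
```

Because the seed fixes the splits, every lambda is scored on the same train/test partitions, which makes the averages directly comparable.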

[Figure: Best Residuals. Standardized Residuals for the Ridge, Lasso, Adaptive Lasso, and Best Subsets models.]

[Figure: Best Fit. Coefficient by Variable (including NewLeagN, LeagN, LeagA) for the Adaptive Lasso, Best Subsets, Lasso, and Ridge models.]

(iii) Reflection.

Which types of methods give the best prediction error?

According to my simulation, the best prediction error was achieved by the Best Subsets model. This model also had the third largest average prediction MSE, which means that we can't trust it to be the most consistent. The model is also sparse (it sets a lot of coefficients equal to zero), which implies that it would be biased against cases that aren't strong in the variables left in.

Why do these methods perform well?

These models (except for Least Squares and Best Subsets) perform well because they have penalties. The penalties let the model minimize the RSS while also controlling a specific type of norm of the beta vector. Since the Least Squares estimate has no bias, its MSE consists entirely of variance. This variance tends to cause overfitting, which leads to inaccurate predictions. When we introduce a penalty, we are adjusting the balance between variance and bias, and the result is a model that is not overfitted and predicts better. This is seen in our results, since Least Squares has the highest MSE and the highest best prediction error. We also see that each model has residuals that are fairly normal. This is what we need, since we assume the errors are iid and normally distributed with mean 0. In the Lasso model, our norm is larger, but that is because our lambda star value is low. This implies that its beta vector should be similar to that of Least Squares.

Do any methods seem to overfit to the training set? If so, why?

This question is answered by taking the norm of each beta vector. We see that Least Squares has the largest norm, which implies that there is overfitting. The Best Subsets model has a large norm too, because it has no penalty term. The model with the lowest norm is the ****. This is because its lambda value is large: the larger lambda is, the smaller the norm will be. We also see this with the Ridge. The Adaptive Lasso has a larger norm because its lambda value is smaller.

Do all the methods choose the same subset of variables?

According to the Best Fit graph, we only see a few sign changes; for the most part, we just see shrinkage. We see that **** and Adaptive Lasso both set the same variable to zero. This is because we used **** coefficients as the weights. What is interesting is that Best Subsets sets 11 variables equal to zero and **** sets 2 variables equal to zero. This must mean that Best Subsets potentially had the right idea but had too much variance; when we add a penalty and search for the best parameter for Adaptive Lasso, it reaches the same conclusion as Best Subsets but minimizes the beta vector's norm. This causes variance to decrease, and we get a better prediction model. We also see that Least Squares and Best Subsets produce the largest coefficients. This is because they have no penalty term.
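The claim that a larger lambda gives a smaller coefficient norm can be checked numerically for ridge regression. The sketch below assumes synthetic data and an arbitrary lambda grid; `ridge_coef` is a hypothetical helper using the closed-form ridge solution, with lambda = 0 corresponding to Least Squares.

```python
import numpy as np

def ridge_coef(X, y, lam):
    """Closed-form ridge solution (X'X + lam*I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Synthetic data (an assumption; not the homework data).
rng = np.random.default_rng(1)
n, p = 100, 8
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]
norms = [float(np.linalg.norm(ridge_coef(X, y, lam))) for lam in lambdas]

for lam, nrm in zip(lambdas, norms):
    print(f"lambda={lam:7.1f}  ||beta||_2={nrm:.4f}")
```

The unpenalized fit (lambda = 0) has the largest norm, and the norm falls steadily as lambda grows, which is the pattern the reflection relies on when it reads overfitting off the norm column.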

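Mathematical Problem

2. Assume that $X$ is orthogonal (i.e. $X^T X = I$). Compute the bias and variance of the least squares, ridge, and lasso estimators. Compare and contrast these.

Throughout, $y = X\beta + \epsilon$ with $E[\epsilon] = 0$ and $E[\epsilon\epsilon^T] = \sigma^2 I_n$.

(a) Least Squares Estimator Bias:

$$E[\hat\beta] = E[(X^T X)^{-1} X^T y] = X^T E[X\beta + \epsilon] = X^T X \beta = \beta,$$

and therefore

$$\mathrm{Bias}[\hat\beta] = E[\hat\beta] - \beta = \beta - \beta = 0.$$

(b) Least Squares Estimator Variance:

$$\begin{aligned}
\mathrm{Cov}(\hat\beta) &= E[(X^T y - E[\hat\beta])(X^T y - E[\hat\beta])^T] \\
&= E[(X^T(X\beta + \epsilon) - \beta)(X^T(X\beta + \epsilon) - \beta)^T] \\
&= E[(\beta + X^T\epsilon - \beta)(\beta + X^T\epsilon - \beta)^T] \\
&= E[X^T \epsilon\epsilon^T X] = X^T E[\epsilon\epsilon^T] X = X^T(\sigma^2 I_n)X = \sigma^2 I_p,
\end{aligned}$$

and therefore $\mathrm{Var}(\hat\beta_j) = \sigma^2$, for $j = 1, \ldots, p$.

(c) Ridge Estimator Bias: For orthogonal inputs, $\hat\beta^r = (I_p + \lambda I_p)^{-1}\hat\beta$, so

$$E[\hat\beta^r] = E[(I_p + \lambda I_p)^{-1} X^T y] = (I_p + \lambda I_p)^{-1} E[X^T y] = (I_p + \lambda I_p)^{-1}\beta,$$

and therefore

$$\mathrm{Bias}(\hat\beta^r) = E[\hat\beta^r] - \beta = -\frac{\lambda}{1 + \lambda}\,\beta.$$

(d) Ridge Estimator Variance:

$$\begin{aligned}
\mathrm{Cov}(\hat\beta^r) &= E[(\hat\beta^r - E[\hat\beta^r])(\hat\beta^r - E[\hat\beta^r])^T] \\
&= E[((I_p + \lambda I_p)^{-1}(X^T y - \beta))((I_p + \lambda I_p)^{-1}(X^T y - \beta))^T] \\
&= E[(I_p + \lambda I_p)^{-1}(X^T y - \beta)(y^T X - \beta^T)(I_p + \lambda I_p)^{-1}] \\
&= (I_p + \lambda I_p)^{-1} E[(X^T y - \beta)(y^T X - \beta^T)] (I_p + \lambda I_p)^{-1} \\
&= (I_p + \lambda I_p)^{-1}\, \sigma^2 I_p\, (I_p + \lambda I_p)^{-1} = \sigma^2 (I_p + \lambda I_p)^{-2},
\end{aligned}$$

and therefore $\mathrm{Var}(\hat\beta^r_j) = \sigma^2/(1 + \lambda)^2$, for $j = 1, \ldots, p$.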
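The least squares and ridge results in (a) through (d) can be checked by simulation. The sketch below is an illustration under assumed values (n = 20, p = 3, sigma = 2, lambda = 1.5, and an arbitrary beta, none of which come from the problem). It builds a design with orthonormal columns from a QR factorization, so that X^T X = I, and compares empirical bias and variance with the formulas.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, sigma, lam = 20, 3, 2.0, 1.5
beta = np.array([3.0, -1.0, 0.5])

# Q has orthonormal columns, so X.T @ X is the p x p identity.
X, _ = np.linalg.qr(rng.normal(size=(n, p)))

n_rep = 100_000
eps = sigma * rng.normal(size=(n_rep, n))
Y = X @ beta + eps            # row i is one simulated response vector
B_ls = Y @ X                  # row i is X.T @ y_i, the least squares estimate
B_ridge = B_ls / (1.0 + lam)  # ridge estimate under an orthonormal design

ls_bias = B_ls.mean(axis=0) - beta
ls_var = B_ls.var(axis=0)
ridge_bias = B_ridge.mean(axis=0) - beta
ridge_var = B_ridge.var(axis=0)

print("LS    bias (theory 0):", ls_bias.round(3))
print("LS    var  (theory sigma^2):", ls_var.round(3), "vs", sigma**2)
print("ridge bias (theory -lam/(1+lam)*beta):", ridge_bias.round(3),
      "vs", (-lam / (1 + lam) * beta).round(3))
print("ridge var  (theory sigma^2/(1+lam)^2):", ridge_var.round(3),
      "vs", round(sigma**2 / (1 + lam) ** 2, 3))
```

The simulation makes the trade-off visible: ridge picks up a bias proportional to beta while cutting the variance by a factor of (1 + lambda)^2.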

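(e) Lasso Estimator Bias: We know that $\epsilon_j \sim N(0, \sigma^2)$ and $\hat\beta_j = \beta_j + \epsilon_j$, where $\beta_j$ is the true parameter and

$$\hat\beta^L_j = \begin{cases}
\beta_j + \epsilon_j - \lambda, & \beta_j + \epsilon_j > \lambda, \\
\beta_j + \epsilon_j + \lambda, & \beta_j + \epsilon_j < -\lambda, \\
0, & |\beta_j + \epsilon_j| \le \lambda.
\end{cases} \tag{1}$$

Now we compute $\mathrm{Bias}(\hat\beta^L_j)$ using the Law of Total Expectation over the three regions in (1):

$$\begin{aligned}
E[\hat\beta^L_j] &= E[\beta_j + \epsilon_j - \lambda \mid \beta_j + \epsilon_j > \lambda]\,\Pr(\beta_j + \epsilon_j > \lambda) \\
&\quad + E[\beta_j + \epsilon_j + \lambda \mid \beta_j + \epsilon_j < -\lambda]\,\Pr(\beta_j + \epsilon_j < -\lambda) \\
&\quad + E[0 \mid |\beta_j + \epsilon_j| \le \lambda]\,\Pr(|\beta_j + \epsilon_j| \le \lambda),
\end{aligned}$$

where

$$\Pr(\beta_j + \epsilon_j > \lambda) = \Pr\!\left(\frac{\epsilon_j}{\sigma} > \frac{\lambda - \beta_j}{\sigma}\right) = \Phi\!\left(-\frac{\lambda - \beta_j}{\sigma}\right)
\quad\text{and}\quad
\Pr(\beta_j + \epsilon_j < -\lambda) = \Phi\!\left(\frac{-\lambda - \beta_j}{\sigma}\right).$$

Conditioning on a tail event shifts the mean of $\epsilon_j$ away from zero. With $\varphi$ the standard normal density, $E[\epsilon_j \mathbf{1}\{\epsilon_j > \lambda - \beta_j\}] = \sigma\varphi\!\left(\frac{\lambda - \beta_j}{\sigma}\right)$ and $E[\epsilon_j \mathbf{1}\{\epsilon_j < -\lambda - \beta_j\}] = -\sigma\varphi\!\left(\frac{\lambda + \beta_j}{\sigma}\right)$, so

$$E[\hat\beta^L_j] = (\beta_j - \lambda)\,\Phi\!\left(-\frac{\lambda - \beta_j}{\sigma}\right) + \sigma\varphi\!\left(\frac{\lambda - \beta_j}{\sigma}\right) + (\beta_j + \lambda)\,\Phi\!\left(\frac{-\lambda - \beta_j}{\sigma}\right) - \sigma\varphi\!\left(\frac{\lambda + \beta_j}{\sigma}\right),$$

and therefore $\mathrm{Bias}[\hat\beta^L_j] = E[\hat\beta^L_j] - \beta_j$ for $j = 1, \ldots, p$.

(f) Lasso Estimator Variance: Write $a_j = (\lambda - \beta_j)/\sigma$ and $b_j = (\lambda + \beta_j)/\sigma$, and use

$$\mathrm{Var}[\hat\beta^L_j] = E[(\hat\beta^L_j)^2] - (E[\hat\beta^L_j])^2.$$

Splitting the second moment over the same regions as in (1), the region $|\beta_j + \epsilon_j| \le \lambda$ contributes zero and

$$\begin{aligned}
E[(\hat\beta^L_j)^2] &= E[(\beta_j + \epsilon_j - \lambda)^2 \mathbf{1}\{\beta_j + \epsilon_j > \lambda\}] + E[(\beta_j + \epsilon_j + \lambda)^2 \mathbf{1}\{\beta_j + \epsilon_j < -\lambda\}] \\
&= \sigma^2\left[(1 + a_j^2)\,\Phi(-a_j) - a_j\varphi(a_j)\right] + \sigma^2\left[(1 + b_j^2)\,\Phi(-b_j) - b_j\varphi(b_j)\right],
\end{aligned}$$

while, from (e),

$$E[\hat\beta^L_j] = \sigma\left[\varphi(a_j) - a_j\Phi(-a_j)\right] - \sigma\left[\varphi(b_j) - b_j\Phi(-b_j)\right].$$

Substituting both expressions into the variance identity gives $\mathrm{Var}[\hat\beta^L_j]$ for $j = 1, \ldots, p$.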
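Under an orthonormal design the lasso estimate is the soft-thresholded least squares estimate, and for a normal estimate its mean and second moment have closed forms involving both the normal CDF and the normal density. The sketch below compares those closed forms with a Monte Carlo estimate. The values beta = 1, lambda = 0.5, sigma = 1 are arbitrary assumptions, and the function names (`lasso_mean`, `lasso_second_moment`) are hypothetical.

```python
import math
import random

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def soft_threshold(z, lam):
    """Lasso solution for one coordinate under an orthonormal design."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_mean(beta, lam, sigma):
    """E[S(Z, lam)] for Z ~ N(beta, sigma^2)."""
    a = (lam - beta) / sigma
    b = (lam + beta) / sigma
    return sigma * (phi(a) - a * Phi(-a)) - sigma * (phi(b) - b * Phi(-b))

def lasso_second_moment(beta, lam, sigma):
    """E[S(Z, lam)^2] for Z ~ N(beta, sigma^2)."""
    a = (lam - beta) / sigma
    b = (lam + beta) / sigma
    return sigma**2 * ((1 + a * a) * Phi(-a) - a * phi(a)
                       + (1 + b * b) * Phi(-b) - b * phi(b))

beta, lam, sigma = 1.0, 0.5, 1.0
random.seed(0)
draws = [soft_threshold(random.gauss(beta, sigma), lam) for _ in range(200_000)]
mc_mean = sum(draws) / len(draws)
mc_var = sum(d * d for d in draws) / len(draws) - mc_mean**2

th_mean = lasso_mean(beta, lam, sigma)
th_var = lasso_second_moment(beta, lam, sigma) - th_mean**2

print(f"mean:     Monte Carlo {mc_mean:.4f}  closed form {th_mean:.4f}")
print(f"bias:     closed form {th_mean - beta:.4f}")
print(f"variance: Monte Carlo {mc_var:.4f}  closed form {th_var:.4f}")
```

Two limiting cases are a quick sanity check on the closed forms: with lambda = 0 the mean reduces to beta (no shrinkage), and with beta = 0 the mean is 0 by symmetry.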
