HW1 Roshena MacPherson Feb 1, 2017


This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Question 1: In this question we will consider some real-life applications of statistical learning.

a) Unsupervised Learning
1) Inferring driving modes (lane changing, lane keeping, merging, etc.) from unlabeled highway data.
2) Threat detection from unlabeled video data.
3) Classifying politicians based on voting records.

b) Regression
1) Fitting a polynomial to the dynamics of a system. The predictors are the initial conditions and the control input; the response is the future state. The application is prediction: we want to be able to predict how the system will react if perturbed.
2) Fitting a function that relates student-reported happiness (the response) to predictors such as the number of hours spent working per day, the number of hours of exercise, etc. The goal is inference because we would like to draw conclusions about correlations between certain behaviors and reported happiness.
3) Fitting a function that relates the number of attempts a baby has made at a certain task (predictor) to how well it performs the task (response). The goal is inference because we would like to draw conclusions about the learning rate of the baby.

c) Classification
1) Identifying classes of drivers such as aggressive, timid, distracted, etc. based on the average number of lane changes per unit time (response), given their age, ethnicity, and car type (predictors). The goal is prediction because we would like to be able to predict what sorts of actions the driver will take in the future based on their classification.
2) Determining the groups of patients that a drug is effective on, based on whether they report feeling worse, the same, or better (response) after being given different amounts of the drug (predictor).
3) Classifying TV shows into genres based on the ratings given to them by different types of viewers (response), given the age, ethnicity, and viewing habits of the viewers (predictors).

Question 2: Explain whether each scenario below is a regression, classification, or unsupervised learning problem, and indicate for each supervised learning scenario whether we are more interested in inference or prediction. Finally, provide n and p.

a) This is a classification problem. We are interested in inference because the school wants to understand how different elements are predictors of admittance. n is 42,000 (the number of students that we have data for) and p is 7.
b) This is an unsupervised learning problem. We are trying to infer subtypes of consumers from unlabeled data. n is 1.5 million (the number of consumers we have data on) and p is 500,000 (the number of products we have data on).
c) This is a regression problem. The outputs (good sell, bad sell, horrible sell, etc.) have a clear ordering, so it makes sense to use regression instead of classification, although classification would also be possible. We are interested in inference, to understand how these different factors affect whether a book will sell well. n is 4,000 (the number of books we have data on) and p is 5.
d) This is a regression problem. We are interested in prediction because we would like to predict by how much global temperatures will rise in the coming years. n is 116 (the number of years we have data for) and p is 6.

Question 3:

a) The advantage of a very flexible approach for regression is that it allows a variety of underlying effects to be modeled. For instance, a simple linear model will almost definitely not capture what's truly going on, whereas a more flexible model with second- and third-order terms can capture higher-order effects. The disadvantage of a very flexible model is that there are many more parameters to fit, so you could easily end up overfitting the model to the training data and capturing its noise rather than the underlying trends.
b) If you know that the noise in your system is very low (you have already characterized and calibrated the signal-to-noise ratio in your sensing setup and know that the noise is very small), then it may be appropriate to use a very flexible model, since most of the variance we see will be due to actual trends. If we are interested in prediction rather than inference (i.e. we don't care that much about the interpretability of the fit) and we have a very large number of samples (i.e. n is very large), then a more flexible model would also be appropriate (for example, building a predictor for which stocks will increase when we have decades of data).
c) If we know that there is a decent amount of noise in our system, then we would prefer a less flexible method, to make sure we aren't fitting higher-order terms to the noise. If the number of data points n is small relative to the number of variables p, then we would probably also want to use a less flexible method.

Question 4:

    library(MASS)
    attach(Boston)

a) There are 506 rows and 14 columns. The rows represent the different suburbs.
The columns represent the different predictors that have been measured.

    nrow(Boston)
    [1] 506
    ncol(Boston)
    [1] 14

b) In the two plots below I have plotted the weighted mean of distances to five Boston employment centers vs the proportion of owner-occupied units built prior to 1940 (plot A), and nitrogen oxides concentration vs the proportion of non-retail business acres per town (plot B).

For A, the R^2 value indicates that about 55% of the variance is explained by our model. Considering where our data came from, it's reasonable to assume that there is a decent amount of noise in the data, so this R^2 value suggests our model is a decent fit. Additionally, the p-value for the slope coefficient is less than 2e-16, so we can reject the null hypothesis of a zero slope.

For B, the R^2 value indicates that about 58% of the variance is explained by our model. Again, this seems like a reasonably good model considering we do expect our data to be pretty noisy. The p-value for the slope coefficient is again <2e-16, so we can be fairly certain that there is a non-zero relationship between the two variables.

    model1 = lm(dis~age)
    model2 = lm(nox~indus)
    summary(model1)

    Call:
    lm(formula = dis ~ age)

    Residuals:
        Min      1Q  Median      3Q     Max

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)                               <2e-16 ***
    age                                       <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: on 504 degrees of freedom
    Multiple R-squared: , Adjusted R-squared:
    F-statistic: on 1 and 504 DF, p-value: < 2.2e-16

    summary(model2)

    Call:
    lm(formula = nox ~ indus)

    Residuals:
        Min      1Q  Median      3Q     Max

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)                               <2e-16 ***
    indus                                     <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: on 504 degrees of freedom
    Multiple R-squared: , Adjusted R-squared:
    F-statistic: on 1 and 504 DF, p-value: < 2.2e-16

    plot(age,dis)
    abline(model1, col = "blue")
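The R^2 values and slope p-values quoted in part b) can also be read directly from the fitted objects rather than from the printed summaries. This is a minimal sketch, assuming the MASS package is installed; the names fit_a, fit_b, r2_a, r2_b, p_a and p_b are illustrative and do not appear in the notebook.

```r
library(MASS)  # provides the Boston data frame

# Same two simple regressions as in part b), refit without attach()
fit_a <- lm(dis ~ age, data = Boston)
fit_b <- lm(nox ~ indus, data = Boston)

# R^2: share of the response variance explained by each model
r2_a <- summary(fit_a)$r.squared
r2_b <- summary(fit_b)$r.squared

# Slope p-value: looked up by row and column name in the coefficient table
p_a <- coef(summary(fit_a))["age", "Pr(>|t|)"]
p_b <- coef(summary(fit_b))["indus", "Pr(>|t|)"]

print(round(c(A = r2_a, B = r2_b), 2))
print(c(A = p_a, B = p_b))
```

Indexing the coefficient table by name rather than by position keeps the lookup correct if more predictors are added later.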

    plot(indus, nox)
    abline(model2, col = "blue")

[Figures: plot A (dis against age) and plot B (nox against indus), each with the fitted line in blue]

c) It appears that all towns with a non-zero per capita crime rate have no residential land zoned for lots over 25,000 sq. ft. So if a town has any residential land zoned for lots over 25,000 sq. ft., it appears very likely that its crime rate is close to zero. It also appears that almost all towns with a crime rate above 5% per capita have a proportion of non-retail business acres of 18; any higher or lower than that and the crime rates do not exceed around 5%. If the town is bounded by the Charles River, the crime rate does not exceed about 15% per capita. For those not bounded by the Charles River, the crime rate spans the whole range. For towns with a nox value of less than about 0.55, the crime rate is essentially 0. Above that there doesn't seem to be any relationship. There is a linear relationship between age and crime rate, with a p-value of about 2.9e-16, though it seems as if a slightly more flexible model might fit the data better. There appears to be an inverse relationship between crime rate and the weighted mean of distances to five Boston employment centers. It appears that if a town has an index of accessibility to radial highways of less than 25, the crime rate is very likely to be very small. Similarly, if a town has a full property tax rate per $10,000 of less than 650, it is very likely that the crime rate is very small. Also, if the pupil-to-teacher ratio is less than 20, the crime rate is likely to be near zero. There appears to be a linear relationship between the percentage of lower-status population and crime rate (p < 2e-16, R^2 = 0.21). There seems to be an inverse relationship between crime rate and the median value of owner-occupied homes in $1000s.

    par(mfrow=c(3,5))
    plot(crim~zn)
    plot(crim~indus)
    plot(crim~chas)
    plot(crim~nox)
    plot(crim~rm)
    plot(crim~age)
    plot(crim~dis)
    plot(crim~rad)
    plot(crim~tax)
    plot(crim~ptratio)
    plot(crim~black)
    plot(crim~lstat)
    plot(crim~medv)

[Figure: scatterplots of crim against zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, and medv]

    plot(zn,crim)
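The thirteen separate plot calls above can be written as a loop, and the visual impressions in part c) can be given a rough numeric companion through pairwise correlations. This is a sketch only, assuming the MASS package is installed; the names predictors and cors are illustrative, and Pearson correlation captures only the linear part of each relationship, so it will understate the threshold-like patterns described in the text.

```r
library(MASS)  # provides the Boston data frame

# Every column except the response
predictors <- setdiff(names(Boston), "crim")

# One scatterplot of crim against each predictor on a 3 x 5 grid
op <- par(mfrow = c(3, 5))
for (v in predictors) {
  plot(Boston[[v]], Boston$crim, xlab = v, ylab = "crim")
}
par(op)  # restore the previous plotting layout

# Pearson correlation of crim with each predictor, largest magnitude first
cors <- sapply(Boston[predictors], function(x) cor(x, Boston$crim))
cors <- cors[order(abs(cors), decreasing = TRUE)]
print(round(cors, 2))
```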

[Figure: crim against zn]

    plot(age,crim)
    model=lm(crim~age)
    summary(model)

    Call:
    lm(formula = crim ~ age)

    Residuals:
        Min      1Q  Median      3Q     Max

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)                                 e-05 ***
    age                                         e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: on 504 degrees of freedom
    Multiple R-squared: , Adjusted R-squared:
    F-statistic: on 1 and 504 DF, p-value: 2.855e-16

    abline(model, col="blue")

[Figure: crim against age with the fitted line in blue]

    plot(dis,crim)
    model = lm(crim~dis)
    summary(model)

    Call:
    lm(formula = crim ~ dis)

    Residuals:
        Min      1Q  Median      3Q     Max

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)                               <2e-16 ***
    dis                                       <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: on 504 degrees of freedom
    Multiple R-squared: , Adjusted R-squared:
    F-statistic: on 1 and 504 DF, p-value: < 2.2e-16

    abline(model,col="blue")

[Figure: crim against dis with the fitted line in blue]

    plot(lstat, crim)
    model = lm(crim~lstat)
    summary(model)

    Call:
    lm(formula = crim ~ lstat)

    Residuals:
        Min      1Q  Median      3Q     Max

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)                                 e-06 ***
    lstat                                    < 2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: on 504 degrees of freedom
    Multiple R-squared: , Adjusted R-squared:
    F-statistic: 132 on 1 and 504 DF, p-value: < 2.2e-16

    abline(model, col="blue")
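Part c) suggests that a slightly more flexible model might describe the relationship between age and crime rate better than a straight line. One way to check this is to compare the linear fit with a quadratic one using a nested-model F-test. The sketch below assumes the MASS package is installed; the choice of a degree-2 polynomial and the names fit_lin, fit_quad, cmp and r2 are assumptions made for illustration, not part of the notebook.

```r
library(MASS)  # provides the Boston data frame

# Straight-line fit and a quadratic fit of crime rate on age
fit_lin  <- lm(crim ~ age, data = Boston)
fit_quad <- lm(crim ~ poly(age, 2), data = Boston)

# Nested-model F-test: does the quadratic term reduce the residual
# sum of squares by more than chance alone would explain?
cmp <- anova(fit_lin, fit_quad)
print(cmp)

# R^2 can only stay the same or increase when a term is added
r2 <- c(linear = summary(fit_lin)$r.squared,
        quadratic = summary(fit_quad)$r.squared)
print(round(r2, 3))
```

A small p-value in the second row of the anova table would support the more flexible model; R^2 alone cannot, because it never decreases when terms are added.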

[Figure: crim against lstat with the fitted line in blue]

d) There are a few towns with a crime rate per capita above 40, which is not the norm; 96% are in the 0-20 range. For zn, about 75% of towns have a proportion of residential land zoned for lots over 25,000 sq. ft of less than ten. For indus, the spread is pretty even, from 0 to around 30. For chas, the distribution is binary, as it's a dummy variable that can only be one or zero. For nox, there are a few outliers. For rm, the distribution looks very much like a Gaussian between 3 and 9 rooms per dwelling. Age is pretty evenly distributed between 0 and 100. There are a few outliers for dis. About 70% of towns have a rad value <= 8; the other 30% have rad values of 24. Again, about 70% of towns have a tax value <= 437, while the other 30% have values >= 666. ptratio is pretty evenly distributed. The vast majority of towns have a black value between 350 and 400; the rest are evenly distributed from 0 to 350. lstat looks like a slightly lopsided Gaussian, ranging from 0 to 40. medv is similar, ranging from 0 to 50.

    par(mfrow=c(3,5))
    hist(crim)
    hist(zn)
    hist(indus)
    hist(chas)
    hist(nox)
    hist(rm)
    hist(age)
    hist(dis)
    hist(rad)
    hist(tax)
    hist(ptratio)
    hist(black)
    hist(lstat)
    hist(medv)
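The proportions read off the histograms in part d) can be checked numerically. This is a rough sketch, assuming the MASS package is installed; all variable names introduced here are illustrative, and the cutoffs are the ones quoted in the text.

```r
library(MASS)  # provides the Boston data frame

# Share of towns with a per capita crime rate at or below 20
share_crim_le20 <- mean(Boston$crim <= 20)

# Share of towns with less than 10% of land zoned for large lots
share_zn_lt10 <- mean(Boston$zn < 10)

# Number of towns at the isolated upper values of rad and tax
n_rad24     <- sum(Boston$rad == 24)
n_tax_ge666 <- sum(Boston$tax >= 666)

print(c(crim_le20 = share_crim_le20, zn_lt10 = share_zn_lt10))
print(c(rad24 = n_rad24, tax_ge666 = n_tax_ge666, towns = nrow(Boston)))

# Quartiles as a numeric companion to the histograms
print(summary(Boston[c("crim", "nox", "dis", "black")]))
```

Comparing the two counts with the total number of towns shows how close the "about 30%" estimates for rad and tax are.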

[Figure: histograms of crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, and medv]

e) 35 towns bound the Charles River.

    sum(chas)
    [1] 35

f) The median of ptratio is given by:

    median(ptratio)
    [1]

g) Town numbers 399 and 406 have the smallest median value of owner-occupied homes (tied at 5.0). The crime rates are both pretty high, at 38 and 68. The tax rates are in the high range, at 666 each. ptratio is also at the upper end of the spectrum for each, at 20.2. I would want to look at the values of these predictors across other towns with different median values of owner-occupied homes before drawing conclusions.

    idx1 = 399
    idx2 = 406
    crim[idx1]
    [1]
    crim[idx2]
    [1]
    tax[idx1]
    [1] 666
    tax[idx2]
    [1] 666
    ptratio[idx1]
    [1] 20.2
    ptratio[idx2]
    [1] 20.2

h) 64 towns have on average more than 7 rooms in each house. 13 towns have on average more than 8 rooms in each house.

    sum(rm>7)
    [1] 64
    sum(rm>8)
    [1] 13
    hist(rm)

[Figure: histogram of rm]

Question 5

a. Split the data set into a training set and a test set of approximately equal size.

    smp_size = nrow(Boston)/2
    seed_num = 25
    set.seed(seed_num)
    train_ind <- sample(seq_len(nrow(Boston)), size = smp_size)
    train <- Boston[train_ind, ]
    test <- Boston[-train_ind, ]

b. Fit a linear model using least squares on the training set, and report the mean training and mean test error obtained.

    model = lm(crim~., train)
    summary(model)

    Call:
    lm(formula = crim ~ ., data = train)

    Residuals:
        Min      1Q  Median      3Q     Max

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)
    zn
    indus
    chas
    nox
    rm
    age
    dis                                              *
    rad                                         e-06 ***
    tax
    ptratio
    black
    lstat
    medv                                             *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: on 239 degrees of freedom
    Multiple R-squared: , Adjusted R-squared:
    F-statistic: on 13 and 239 DF, p-value: < 2.2e-16

    predicted_vals_test = predict(model, test)
    predicted_vals_train = predict(model, train)
    res = predicted_vals_test - test$crim
    mean(model$residuals^2)
    [1]
    mean(res^2)
    [1]

c) It appears from our model that the index of accessibility to radial highways is extremely significant, and that the median value of owner-occupied homes in $1000s and the weighted mean of distances to five Boston employment centres are also significant, but less so. indus and age no longer appear significant, though this may be because they are correlated with dis, rad, or medv. The R-squared value of our model is 0.42, meaning we have explained 42% of the variance in our system. Considering the MSE went from 36 on the training set to 46 on the test set, which is not too big of a jump, I would say our model does a pretty good job of predicting crime rate.

Question 6: The most important predictors in this case are nox, rad, zn, dis, and black. The training misclassification rate is 5% and the test misclassification rate is 14%. This seems to perform much better than the linear regression model when compared to the R^2 values we had.
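The steps behind this answer (binarize crim at its median, fit a logistic regression on the training half, classify at a 0.5 probability threshold) can be sketched without explicit loops. This is not the notebook's code: it assumes the MASS package is installed, applies the training median as the cutoff for both halves (the notebook uses each half's own median), and uses illustrative names such as cutoff, fit, err_train and err_test. The seed value matches the notebook, but the resulting split and error rates may differ across R versions.

```r
library(MASS)  # provides the Boston data frame

# Split into two halves of 253 rows each
set.seed(25)
train_ind <- sample(seq_len(nrow(Boston)), size = nrow(Boston) %/% 2)
train <- Boston[train_ind, ]
test  <- Boston[-train_ind, ]

# Binary response: 1 if crim is at or above the training median, else 0.
# Reusing the training cutoff on the test half is an assumption of this sketch.
cutoff <- median(train$crim)
train$crim <- as.integer(train$crim >= cutoff)
test$crim  <- as.integer(test$crim >= cutoff)

# Logistic regression on all remaining predictors.
# glm may warn that some fitted probabilities are numerically 0 or 1.
fit <- glm(crim ~ ., family = binomial, data = train)

# Classify at a 0.5 probability threshold
pred_train <- as.integer(predict(fit, type = "response") > 0.5)
pred_test  <- as.integer(predict(fit, newdata = test, type = "response") > 0.5)

# Misclassification rate: share of rows where prediction and label differ
err_train <- mean(pred_train != train$crim)
err_test  <- mean(pred_test != test$crim)
print(c(train = err_train, test = err_test))
```

Taking the cutoff from the training half keeps information about the test responses out of the labels the model is evaluated on.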

    Y = train$crim>=median(train$crim)
    Ytest = test$crim>=median(test$crim)
    # convert the logical labels to 1/0
    for (i in 1:nrow(train)){
      if(Y[i]){ Y[i] = 1} else { Y[i] = 0}
    }
    for (i in 1:nrow(test)){
      if(Ytest[i]){ Ytest[i] = 1} else { Ytest[i] = 0}
    }
    test_changed_ = test
    test_changed_$crim = Ytest
    train_changed_ = train
    train_changed_$crim = Y
    model = glm(crim~.,family = binomial, data = train_changed_)

    Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

    summary(model)

    Call:
    glm(formula = crim ~ ., family = binomial, data = train_changed_)

    Deviance Residuals:
        Min      1Q  Median      3Q     Max

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)                                     **
    zn                                              *
    indus
    chas
    nox                                        e-05 ***
    rm
    age
    dis                                             *
    rad                                             **
    tax                                             *
    ptratio
    black                                           *
    lstat
    medv
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

    Null deviance: on 252 degrees of freedom
    Residual deviance: on 239 degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 10

    yhat = model$fitted.values>0.5
    class_rate_train = mean(Y==yhat)
    1-class_rate_train
    [1]
    y_test = predict(model, newdata = test_changed_, type="response")
    y_test = y_test>0.5
    class_rate_test = mean(Ytest==y_test)
    1-class_rate_test
    [1]

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).
