Holiday Assignment PS 531


Prof: Jake Bowers    TA: Paul Testa    January 27, 2014

Overview

Below is a brief assignment for you to complete over the break. It should serve as a refresher, covering some of the basic concepts and skills you learned in PS 530, so that we can all start on the same page when class begins. Please complete the assignment and turn it in to us no later than Friday, January 24, 2014. You should turn in 1) a .pdf file containing your write-up with any tables and figures you'd like to include, and 2) a .R or .txt file containing the code you used to generate your analysis.

# Create some toy data, 500 observations
set.seed(1231985)
the.probs <- c(seq(1, 2, length.out = 12), rep(4, 6), rep(3, 2))/20
length(the.probs)
[1] 20
hrs <- sample(1:20, size = 500, replace = T, prob = the.probs) * 12
horde <- rbinom(500, 1, prob = 0.45)
y <- 2.4 * (hrs/12) - 0.07 * (hrs/12)^2 + horde * runif(500, -5, 4.2) + rnorm(500, 0, 3.75) + 66
practice <- data.frame(cbind(y, hrs, horde))
write.csv(practice, file = "practice.csv", row.names = F)
save(practice, file = "practice.rda")

1. Download the dataset practice.rda from Dropbox using the following code.

download.file("https://dl.dropboxusercontent.com/s/1sqkyo4m8riljfy/practice.rda",
    destfile = "~/Desktop/practice.rda", method = "curl")
load("~/Desktop/practice.rda")  # Change location if you like

This simulated dataset contains 500 observations of previous students in this course. For each student, you have information on his or her final grade, y; the total number of hours, hrs, he or she spent studying for the class; and an indicator of the student's World of Warcraft faction, horde, that takes a value of 1 for horde and 0 for alliance.[1]

[1] http://us.battle.net/wow/en/game/race/

2. Calculate two measures of the typical number of hours spent studying by these fictional previous students in the class. Discuss the differences and benefits of each measure of centrality or typicalness. What do the two measures together tell you about the distribution of hours spent studying? Hint: You can learn about the idea of typical values in Kaplan's textbook (http://www.macalester.edu/~kaplan/ism/statmodeling-review.pdf).

mean(practice$hrs)
[1] 153.3
median(practice$hrs)
[1] 168
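The mean falling well below the median suggests that the distribution of study hours is skewed to the left, with most students near the top of the range. An optional check, not part of the original answer key:

# Mean (153.3) < median (168): most mass at high values of hrs,
# with a longer tail toward low values
hist(practice$hrs, breaks = 20, xlab = "Hours spent studying", main = "")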

3. Describe the distribution of final grades, y. Begin by calculating the mean, variance, and standard deviation of y (you can use R's standard functions or, for practice, try doing it by hand). Next, calculate y's median, range, interquartile range, and 95% coverage interval. Finally, produce a figure that summarizes this information. Hint: You can learn about the idea of typical variation in Kaplan's textbook (http://www.macalester.edu/~kaplan/ism/statmodeling-review.pdf).

n <- length(practice$y)
mean(practice$y)
[1] 83.12
sum(practice$y)/n
[1] 83.12
var(practice$y)
[1] 44.07
sum((practice$y - mean(practice$y))^2)/(n - 1)
[1] 44.07
# Why n-1?
sd(practice$y)
[1] 6.638
sqrt(sum((practice$y - mean(practice$y))^2)/(n - 1))
[1] 6.638
quantile(practice$y, prob = c(0.25, 0.5, 0.75))
  25%   50%   75%
79.84 84.08 87.57
summary(practice$y)
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 54.7    79.8    84.1    83.1    87.6    99.7
par(mfrow = c(1, 3), mgp = c(1.5, 0.5, 0), oma = rep(0, 4))
with(practice, boxplot(y))
with(practice, hist(y))
with(practice, plot(ecdf(y)))

[Figure: boxplot of y, histogram of y, and empirical CDF of y, side by side]

4. How do horde members compare to alliance members in their grades? To answer this question, some use a difference of means. Please produce a difference of means and interpret it.
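The code above stops at the quartiles; the 95% coverage interval asked for in question 3, and the question 4 difference of means, follow the same pattern. A minimal sketch using the variables already loaded:

# Central 95% coverage interval of final grades
quantile(practice$y, prob = c(0.025, 0.975))
# Question 4: mean grade for horde minus mean grade for alliance
with(practice, mean(y[horde == 1]) - mean(y[horde == 0]))

Given the group means printed in question 5 (83.46 for alliance, 82.66 for horde), the difference should come out to about -0.80 points.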

5. Having calculated a difference of means, some would wonder, "Do we have enough information to exclude the idea that the difference of means is really zero?" Please answer this question without using any canned function (for example, no t.test() or lm()).

t.test(practice$y ~ practice$horde, var.equal = F)

    Welch Two Sample t-test
data:  practice$y by practice$horde
t = 1.307, df = 424.2, p-value = 0.1919
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.4022  1.9991
sample estimates:
mean in group 0 mean in group 1
          83.46           82.66

t.test(practice$y ~ practice$horde, var.equal = T)

    Two Sample t-test
data:  practice$y by practice$horde
t = 1.332, df = 498, p-value = 0.1835
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3795  1.9764
sample estimates:
mean in group 0 mean in group 1
          83.46           82.66

muh <- mean(practice$y[practice$horde == 1])    # Horde
mua <- mean(practice$y[practice$horde == 0])    # Alliance
nh <- length(practice$y[practice$horde == 1])   # Number of Horde
na <- length(practice$y[practice$horde == 0])   # Number of Alliance
varh <- var(practice$y[practice$horde == 1])    # Variance of y for Horde
vara <- var(practice$y[practice$horde == 0])    # Variance of y for Alliance
# Calculate the t-statistic by hand (unequal variances, Welch)
sigma <- sqrt(varh/nh + vara/na)
sigma
[1] 0.6108
df <- (varh/nh + vara/na)^2/((varh/nh)^2/(nh - 1) + (vara/na)^2/(na - 1))
t.stat <- (mua - muh)/sigma
p.val <- 2 * pt(-abs(t.stat), df = df)
# Equal variances
df.eq <- nh + na - 2
sigma.eq <- sqrt(((nh - 1) * varh + (na - 1) * vara)/df.eq * (1/nh + 1/na))
t.stat.eq <- (mua - muh)/sigma.eq
t.stat.eq
[1] 1.332
p.val.eq <- 2 * pt(-abs(t.stat.eq), df = df.eq)
p.val.eq
[1] 0.1835
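Another way to answer question 5 without canned functions is a permutation (randomization) test: shuffle the faction labels and see how often a shuffled difference of means is at least as large as the observed one. A minimal sketch, not part of the original answer key:

# Permutation test of the difference of means
set.seed(20140127)  # arbitrary seed, for reproducibility
obs.diff <- with(practice, mean(y[horde == 1]) - mean(y[horde == 0]))
perm.diffs <- replicate(10000, {
    h <- sample(practice$horde)  # shuffle the faction labels
    mean(practice$y[h == 1]) - mean(practice$y[h == 0])
})
# Two-sided p-value: share of shuffles at least as extreme as observed
mean(abs(perm.diffs) >= abs(obs.diff))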

6. Now, use a linear model to calculate the average difference between the final grades of those who support the horde versus the alliance. How should we interpret this model? What do the coefficients on the intercept and horde mean? What do the standard errors mean? What do the test statistics and p-values mean?

summary(lm(y ~ horde, data = practice))

Call:
lm(formula = y ~ horde, data = practice)
Residuals:
    Min      1Q  Median      3Q     Max
-27.917  -3.360   0.978   4.283  17.079
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   83.460      0.392  212.78   <2e-16 ***
horde         -0.798      0.600   -1.33     0.18
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.63 on 498 degrees of freedom
Multiple R-squared: 0.00355,  Adjusted R-squared: 0.00155
F-statistic: 1.77 on 1 and 498 DF,  p-value: 0.184

7. Suppose instead you had estimated the following model:

y = α0 + α1·beard01 + u

where beard01 is an indicator for facial hair.[2] Your model yields a coefficient estimate for α1, α̂1, of 1.23 with a standard error of 0.85 and 498 degrees of freedom. Assume you had prior knowledge that led you to believe that beards should only have a positive effect on final grades. Formulate the null and alternative hypotheses for this claim and calculate a test statistic and corresponding p-value. Are you conducting a one-tailed or two-tailed hypothesis test? Some might talk about whether the average difference in final grades is statistically significant. What does this mean? Plot your test statistic on the probability density function of the appropriate t-distribution.

[2] Fur counts as facial hair in this exercise.

a1 <- 1.23
se1 <- 0.85
a1.t <- a1/se1
2 * pt(-abs(a1.t), df = 498)  # two-sided p-value
[1] 0.1485
x <- seq(-4.5, 4.5, length.out = 100)
plot(x, dt(x, df = 498), type = "l")
abline(v = a1.t, lty = 2)

[Figure: density of the t-distribution with 498 degrees of freedom, with a dashed vertical line at the observed test statistic]
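The line above reports the two-sided p-value. Under the stated prior that beards can only help, the test is one-tailed, H0: α1 ≤ 0 against HA: α1 > 0, so the p-value is half the two-sided one. A sketch using the objects already defined:

# One-tailed p-value for H0: alpha1 <= 0 vs. HA: alpha1 > 0
pt(a1.t, df = 498, lower.tail = FALSE)  # about 0.074, half of 0.1485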

8. Create two figures, one for the horde and one for the alliance, showing how the final grade, y, varies with the average hours, hrs, spent studying in a week. Describe any differences or similarities you observe.

par(mfrow = c(1, 2))
with(practice, plot(x = hrs, y = y, type = "n", xlab = "Hours a Week Spent Studying",
    ylab = "Final Grade", main = "Horde"))
with(practice[practice$horde == 1, ], points(x = hrs, y = y, col = "red", cex = 0.5, pch = 20))
with(practice, plot(x = hrs, y = y, type = "n", xlab = "Hours a Week Spent Studying",
    ylab = "Final Grade", main = "Alliance"))
with(practice[practice$horde == 0, ], points(x = hrs, y = y, col = "black", cex = 0.5, pch = 20))

[Figure: side-by-side scatterplots of final grade against hours a week spent studying, one panel for Horde and one for Alliance]

par(mfrow = c(1, 1))

9. Estimate a simple linear regression predicting final grade, y, as a function of average hours a week spent studying, hrs. Interpret the coefficients, standard errors, and p-values from this model. Calculate a 95-percent confidence interval for the coefficient on hrs. Now calculate a 95-percent confidence interval for that coefficient using the percentile bootstrap method with 1,000 replications. How do the two confidence intervals compare? When would you prefer the bootstrap to the analytic confidence interval? What assumptions do you need to make for the bootstrap interval? What assumptions do you need to make for the analytic interval? Would these assumptions be reasonable, given what you might assume about the research design that generated these data?

fm1 <- lm(y ~ hrs, data = practice)
confint(fm1)[2, ]
  2.5 %  97.5 %
0.06874 0.08180
(ci.upper <- coef(fm1)[2] + qt(0.975, 498) * sqrt(vcov(fm1)[2, 2]))
   hrs
0.0818
(ci.lower <- coef(fm1)[2] - qt(0.975, 498) * sqrt(vcov(fm1)[2, 2]))
    hrs
0.06874
n <- 500
R <- 2000
bs.est <- rep(NA, R)
for (i in 1:R) {
    s <- sample(1:n, replace = T)
    f <- lm(y[s] ~ hrs[s], data = practice)
    coefs <- coef(f)
    bs.est[i] <- coefs[2]
}
quantile(bs.est, c(0.025, 0.975))
   2.5%   97.5%
0.06745 0.08283
confint(fm1)[2, ]
  2.5 %  97.5 %
0.06874 0.08180
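The loop above resamples by indexing y and hrs inside the formula; resampling whole rows of the data frame is equivalent here and extends more easily to models with several predictors. An optional sketch of that variant:

# Percentile bootstrap, resampling rows of the data frame
bs.est.rows <- rep(NA, R)
for (i in 1:R) {
    d <- practice[sample(nrow(practice), replace = TRUE), ]
    bs.est.rows[i] <- coef(lm(y ~ hrs, data = d))[2]
}
quantile(bs.est.rows, c(0.025, 0.975))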

10. Plot the residuals from your linear model against their predicted (fitted) values. What should this plot look like if the assumptions of OLS are met? What does it look like?

plot(x = fm1$fitted, y = fm1$residuals)

[Figure: residuals of fm1 plotted against fitted values]
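If the OLS assumptions hold, the residuals should scatter evenly around zero at every fitted value, with no trend or funnel shape. Adding a reference line and a smoother, as in this optional sketch, makes any curvature easier to spot:

plot(x = fm1$fitted, y = fm1$residuals)
abline(h = 0, lty = 2)                                 # reference line at zero
lines(lowess(fm1$fitted, fm1$residuals), col = "red")  # local smoother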

11. Propose (and estimate) an alternative model for the relationship between grades and time spent studying. Again calculate 95-percent confidence intervals using both the analytic method and the percentile bootstrap method. Compare your results to those obtained from the simple bivariate regression. How does the coefficient on hrs change? Did you include a coefficient for Warcraft faction (horde)? Why or why not? Overall, does your model do a better job of explaining variation in final grades? Bonus: What's the optimal amount of time someone should spend studying if they want to maximize their expected final grade?

practice$hrs.sq <- practice$hrs^2
fm2 <- lm(y ~ hrs + hrs.sq, data = practice)
fm2.1 <- lm(y ~ hrs + hrs.sq + horde, data = practice)
summary(fm2)

Call:
lm(formula = y ~ hrs + hrs.sq, data = practice)
Residuals:
    Min      1Q  Median      3Q     Max
-12.183  -2.874  -0.029   2.721  13.415
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.41e+01   8.01e-01    80.0   <2e-16 ***
hrs          2.22e-01   1.28e-02    17.3   <2e-16 ***
hrs.sq      -5.46e-04   4.66e-05   -11.7   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.13 on 497 degrees of freedom
Multiple R-squared: 0.614,  Adjusted R-squared: 0.613
F-statistic: 396 on 2 and 497 DF,  p-value: <2e-16

summary(fm2.1)

Call:
lm(formula = y ~ hrs + hrs.sq + horde, data = practice)
Residuals:
    Min      1Q  Median      3Q     Max
-11.912  -2.761  -0.105   2.750  13.694
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.44e+01   8.20e-01   78.44   <2e-16 ***
hrs          2.21e-01   1.28e-02   17.27   <2e-16 ***
hrs.sq      -5.45e-04   4.65e-05  -11.72   <2e-16 ***
horde       -4.80e-01   3.73e-01   -1.29      0.2
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.13 on 496 degrees of freedom
Multiple R-squared: 0.616,  Adjusted R-squared: 0.613
F-statistic: 265 on 3 and 496 DF,  p-value: <2e-16

plot(x = fm2$fitted, y = fm2$residuals)

[Figure: residuals of fm2 plotted against fitted values]

hrs.star <- -coef(fm2)[2]/(2 * coef(fm2)[3])
hrs.star.1 <- -coef(fm2.1)[2]/(2 * coef(fm2.1)[3])
par(mfrow = c(1, 1))
plot(x = practice$hrs, y = practice$y, pch = 20, col = "grey", type = "n")
points(x = practice$hrs, y = practice$y, pch = 20, col = "grey", cex = 0.5)
pred.df <- expand.grid(hrs = sort(unique(practice$hrs)), horde = 1)
pred.df$hrs.sq <- pred.df$hrs * pred.df$hrs
pred.y <- predict(fm2, newdata = pred.df)
lines(x = sort(unique(practice$hrs)), y = pred.y, col = "red", lty = 1)
abline(v = hrs.star, col = "red", lty = 2)

[Figure: scatterplot of final grade against hours studied, with the fitted quadratic curve and a dashed vertical line at hrs.star]
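Whether horde earns a place in the model can be checked with a nested-model F test, and the bonus has a closed form: a quadratic a + b·hrs + c·hrs² peaks at hrs* = -b/(2c), which with the printed coefficients is roughly 0.222/(2 × 0.000546) ≈ 203 hours. An optional sketch of the F test:

# Does adding horde improve on the quadratic model?
anova(fm2, fm2.1)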

n <- 500
R <- 1000
bs.est.2 <- matrix(NA, nrow = R, ncol = 2)
for (i in 1:R) {
    s <- sample(1:n, replace = T)
    f <- lm(y[s] ~ hrs[s] + hrs.sq[s], data = practice)
    coefs <- coef(f)
    bs.est.2[i, ] <- coefs[2:3]
}
quantile(bs.est.2[, 1], c(0.025, 0.975))
  2.5%  97.5%
0.1962 0.2470
quantile(bs.est.2[, 2], c(0.025, 0.975))
      2.5%      97.5%
-0.0006370 -0.0004516
confint(fm2)
                 2.5 %     97.5 %
(Intercept) 62.5512665 65.6990422
hrs          0.1964465  0.2468448
hrs.sq      -0.0006377 -0.0004546
par(mfrow = c(2, 2), pty = "s", mgp = c(1.5, 0.5, 0), oma = rep(0, 4))
plot(fm2)

[Figure: standard lm diagnostic plots for fm2: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance]

12. Do your results suggest a causal relationship between time spent studying and final grades in this class? What are some factors that might lead to a spurious relationship between studying and grades? Propose a strategy or strategies to identify the causal effect of an extra hour of studying a week on a student's final grade. Discuss the benefits, limitations, and potential difficulties of your approach. Hint: You'll need to be clear about what you mean by "studying causes grades." If you are not already comfortable with the idea of potential outcomes, you might want to learn about it. For example, see http://jakebowers.org/itvexperiments/gg_Field_Experiments_Ch02.pdf, http://jakebowers.org/itvexperiments/gg_Field_Experiments_Ch01.pdf, and, for a canonical piece, http://jakebowers.org/itvexperiments/holland86wdisc.pdf.
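For readers new to potential outcomes, a toy simulation can make the idea concrete: each unit has two potential grades, only one of which is ever observed, and random assignment lets a simple difference of means recover the average effect. This sketch is purely illustrative, with all numbers invented:

# Toy potential-outcomes simulation (illustrative only)
set.seed(1)
n <- 500
y0 <- rnorm(n, 80, 5)   # potential grade WITHOUT the extra hour of study
y1 <- y0 + 2            # potential grade WITH it (true effect: 2 points)
z <- sample(rep(0:1, n/2))      # randomly assign the extra hour
yobs <- ifelse(z == 1, y1, y0)  # only one potential outcome is observed
mean(yobs[z == 1]) - mean(yobs[z == 0])  # should be close to 2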