Stat 139 Homework 7 Solutions, Fall 2015

Problem 1. In class we learned that the classical simple linear regression model assumes the following distribution of responses:

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \quad i = 1, \dots, n, \qquad (1)$$

where $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$. Two estimators we'd like to calculate are:

(i) $\hat\mu_{Y|X} = \hat\beta_0 + \hat\beta_1 X$, the predicted mean value of the response, $\mu_Y$, for any individual with a specific value of the predictor, $X$. This can be used to build a confidence interval for where the true $\mu_Y$ will be located given $X$.

(ii) $\hat Y_j|X_j = \hat\beta_0 + \hat\beta_1 X_j + \epsilon_j$, the predicted value for a new individual response, $Y_j$, given that individual's value of the predictor, $X_j$. This can be used to build a prediction interval for where a new $Y_j$ will be located given its $X_j$.

In this problem we will determine the sampling distributions of these two estimators. [Note: the second is not technically an estimator since $\epsilon_j$ is not observable, but we can still determine some characteristics of this entity since we can assume the sampling distribution of $\epsilon_j$ provided above.]

(a) Calculate the expected value of $\hat\mu_{Y|X}$ and $\hat Y_j|X_j$. The sampling distribution results for $\hat\beta_0$ and $\hat\beta_1$ provided in class can be used directly for this problem.

$$E(\hat\mu_{Y|X}) = E(\hat\beta_0 + \hat\beta_1 X) = E(\hat\beta_0) + X E(\hat\beta_1) = \beta_0 + X\beta_1$$

$$E(\hat Y_j|X_j) = E(\hat\beta_0 + \hat\beta_1 X_j + \epsilon_j) = E(\hat\beta_0) + X_j E(\hat\beta_1) + E(\epsilon_j) = \beta_0 + X_j\beta_1 + 0$$

It turns out that $\mathrm{Cov}(\bar Y, \hat\beta_1) = 0$. In other words, the estimator of the slope of a regression line is not correlated with the average response. Intuitively this is true because the regression line has to pass through the point $(\bar X, \bar Y)$ regardless of the slope value.

(b) Show that:

$$\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = -\frac{\sigma^2 \bar X}{\sum_{i=1}^n (X_i - \bar X)^2}$$

using the fact above that $\mathrm{Cov}(\bar Y, \hat\beta_1) = 0$ and the properties of covariance: $\mathrm{Cov}(aX, Y) = a\,\mathrm{Cov}(X, Y)$ and $\mathrm{Cov}(X + W, Y) = \mathrm{Cov}(X, Y) + \mathrm{Cov}(W, Y)$.

$$\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = \mathrm{Cov}(\bar Y - \hat\beta_1 \bar X, \hat\beta_1) = \mathrm{Cov}(\bar Y, \hat\beta_1) - \bar X\,\mathrm{Cov}(\hat\beta_1, \hat\beta_1) = 0 - \bar X\,\mathrm{Var}(\hat\beta_1) = -\frac{\sigma^2 \bar X}{\sum_{i=1}^n (X_i - \bar X)^2}$$

(c) Determine $\mathrm{Var}(\hat\mu_{Y|X})$. Hint: $\sum_{i=1}^n (X_i - \bar X)^2 = \sum_{i=1}^n X_i^2 - n\bar X^2$ may be useful (though you may not need to use this property).

Note: we decided to use $X_j$ as the point where we are doing the calculation so as to not get confused with the observed $X$'s in the data set. Writing $S_{XX} = \sum_{i=1}^n (X_i - \bar X)^2$:

$$\mathrm{Var}(\hat\mu_{Y|X_j}) = \mathrm{Var}(\hat\beta_0 + \hat\beta_1 X_j) = \mathrm{Var}(\hat\beta_0) + X_j^2\,\mathrm{Var}(\hat\beta_1) + 2X_j\,\mathrm{Cov}(\hat\beta_0, \hat\beta_1)$$

$$= \sigma^2\left(\frac{1}{n} + \frac{\bar X^2}{S_{XX}}\right) + \frac{X_j^2\,\sigma^2}{S_{XX}} - \frac{2X_j\,\sigma^2 \bar X}{S_{XX}} = \sigma^2\left(\frac{1}{n} + \frac{\bar X^2 + X_j^2 - 2X_j \bar X}{S_{XX}}\right) = \sigma^2\left(\frac{1}{n} + \frac{(X_j - \bar X)^2}{S_{XX}}\right)$$
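The variance formula in part (c) can be sanity-checked without simulation: writing $\hat\mu_{Y|X_j} = \sum_i w_i Y_i$ with weights $w_i = \frac{1}{n} + \frac{(X_j - \bar X)(X_i - \bar X)}{S_{XX}}$, independence of the $Y_i$ gives $\mathrm{Var}(\hat\mu) = \sigma^2 \sum_i w_i^2$. A short Python sketch (not part of the original solutions; the predictor values are arbitrary) confirms that $\sum_i w_i^2$ collapses to the closed form above:

```python
# Numerical check that sum(w_i^2) = 1/n + (Xj - Xbar)^2 / Sxx, where
# mu-hat_{Y|Xj} = sum_i w_i * Y_i with w_i = 1/n + (Xj - Xbar)(X_i - Xbar)/Sxx.
# With independent Y_i of variance sigma^2, Var(mu-hat) = sigma^2 * sum(w_i^2).

X = [21.0, 30.0, 35.0, 42.0, 50.0, 63.0, 77.0]   # arbitrary predictor values
Xj = 75.0                                        # arbitrary point of prediction
n = len(X)
Xbar = sum(X) / n
Sxx = sum((x - Xbar) ** 2 for x in X)

w = [1.0 / n + (Xj - Xbar) * (x - Xbar) / Sxx for x in X]

lhs = sum(wi ** 2 for wi in w)          # sum of squared weights
rhs = 1.0 / n + (Xj - Xbar) ** 2 / Sxx  # closed form from part (c)

print(abs(lhs - rhs) < 1e-12)           # prints True: the expressions agree
```

The cross term vanishes because $\sum_i (X_i - \bar X) = 0$, which is exactly why the algebra in part (c) simplifies so cleanly.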

It also turns out that $\mathrm{Cov}(\hat\mu_{Y|X}, \epsilon_j) = 0$. In other words, the residuals around the line are not correlated with where the prediction is being made. Intuitively this is true because one of our assumptions in the regression model is that the variance is the same no matter what value of $X_j$ is observed. (Note: $\mathrm{Cov}(Y_j, \epsilon_j) \neq 0$.)

(d) Determine $\mathrm{Var}(\hat Y_j|X_j)$.

$$\mathrm{Var}(\hat Y_j|X_j) = \mathrm{Var}(\hat\mu_{Y|X_j} + \epsilon_j) = \mathrm{Var}(\hat\mu_{Y|X_j}) + \mathrm{Var}(\epsilon_j) + 2\,\mathrm{Cov}(\hat\mu_{Y|X_j}, \epsilon_j)$$

$$= \sigma^2\left(\frac{1}{n} + \frac{(X_j - \bar X)^2}{S_{XX}}\right) + \sigma^2 + 0 = \sigma^2\left(1 + \frac{1}{n} + \frac{(X_j - \bar X)^2}{S_{XX}}\right)$$

(e) Make an argument for why $\hat\mu_{Y|X}$ and $\hat Y_j|X_j$ are both Normally distributed.

Both of these estimators should be Normally distributed since they are comprised of linear combinations of Normally distributed random variables (linear combinations of $\hat\beta_0$, $\hat\beta_1 X_j$, and $\epsilon_j$).

(f) Where will $\hat\mu_{Y|X}$ lie 95% of the time? Where will $\hat Y_j|X_j$ lie 95% of the time? Note: these intervals can be used to build confidence intervals and prediction intervals at a particular value of $X$ by centering them at the estimates rather than the true values, and by using the usual regression estimate for $\sigma^2$.

These will lie plus or minus $\Phi^{-1}(0.975) = 1.96$ times their respective standard deviations around the predicted value at $X_j$. That is, $\hat\mu_{Y|X}$ will lie within

$$(\beta_0 + \beta_1 X_j) \pm 1.96\sqrt{\sigma^2\left(\frac{1}{n} + \frac{(X_j - \bar X)^2}{S_{XX}}\right)}$$

95% of the time, and $\hat Y_j|X_j$ will lie within the following bounds 95% of the time:

$$(\beta_0 + \beta_1 X_j) \pm 1.96\sqrt{\sigma^2\left(1 + \frac{1}{n} + \frac{(X_j - \bar X)^2}{S_{XX}}\right)}$$

Problem 2. A study was conducted to determine the association between the maximum distance at which a highway sign can be read (in feet) and the age of the driver (in years). Thirty drivers of various ages were studied. Sample means and variances for distance and age, and the correlation between these variables, are given in the accompanying table.

              n     sample mean   sample variance
  Distance    30    423.333       6678.16
  Age         30    51.0          474.207
  Correlation: r = -0.8012

(a) Find $\hat\beta_0$, $\hat\beta_1$, the standard error of $\hat\beta_1$, and the least-squares regression equation that would predict the distance at which a highway sign can be read given the age of the driver.

$$\hat\beta_1 = r\,\frac{s_Y}{s_X} = -0.8012\sqrt{\frac{6678.16}{474.207}} = -3.007$$

$$\hat\beta_0 = \bar Y - \hat\beta_1 \bar X = 423.333 - (51)(-3.007) = 576.7$$

$$s_{\hat\beta_1} = \sqrt{\frac{s^2}{\sum (X_i - \bar X)^2}} = \sqrt{\frac{(1 - r^2)s_Y^2}{(n-2)s_X^2}} = \sqrt{\frac{(1 - 0.8012^2)(6678.16)}{(28)(474.207)}} = 0.4244$$

$$\hat Y = \hat\beta_0 + \hat\beta_1 X = 576.7 - 3.007\,X$$

which uses the fact that the variance estimate of the residuals is $s^2 = MSE = SSE/(n-2) = (1 - r^2)\,SSY/(n-2) = (1 - r^2)s_Y^2(n-1)/(n-2)$.

(b) Is Age a significant predictor of Distance in this linear model? Conduct a formal hypothesis test at the $\alpha = 0.05$ level (include the usual elements of a test of hypothesis).

$H_0: \beta_1 = 0$ vs. $H_A: \beta_1 \neq 0$

$$t = \frac{\hat\beta_1}{s_{\hat\beta_1}} = \frac{-3.007}{0.4244} = -7.085, \qquad \text{p-value} = P(|t_{df=28}| > 7.085) \approx 0.0000001$$

Since our p-value is less than 0.05, we can reject the null hypothesis. There is evidence that distance is related to age; in fact, younger age is associated with being able to read the sign from a further distance.

(c) Using only the correlation coefficient ($r$) and the sample size, conduct a test to determine if there is a significant association (i.e. $H_0: \rho = 0$) between these two variables using $\alpha = 0.05$.

$$t = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}} = \frac{-0.8012\sqrt{28}}{\sqrt{1 - 0.8012^2}} = -7.085$$

We get the exact same t-statistic with the same d.f. as part (b), so we'll have the same p-value and come to the same conclusion.

(d) Comparing the results of parts (b) and (c) above, what do you conclude about these two tests?

These two tests are mathematically equivalent (t-statistic, degrees of freedom, and p-value) and will always come to the same conclusion. This can be shown algebraically.

(e) Using your results from part (a) above, calculate the 95% confidence interval for the slope of this regression line.

$$\hat\beta_1 \pm t_{df=28}\,s_{\hat\beta_1} = -3.007 \pm 2.0484(0.4244) = (-3.876, -2.138)$$

> qt(0.975,df=28)
[1] 2.048407

(f) Considering the lower and upper 95% confidence limits of the slope you calculated in part (e), how are these consistent with your results for parts (b) and (c)?

These results are consistent: we rejected the null hypothesis ($H_0: \beta_1 = 0$) that the slope is truly zero, and the confidence interval does not include the value zero, so either way a slope of zero is not a plausible assumption.
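All of the hand calculations in Problem 2 use only the summary statistics, so they are easy to reproduce mechanically. A small Python sketch (plain arithmetic mirroring the formulas above; this is a check, not the original R workflow):

```python
import math

# Summary statistics from the Problem 2 table
n = 30
ybar, s2_y = 423.333, 6678.16   # distance: sample mean, sample variance
xbar, s2_x = 51.0, 474.207      # age: sample mean, sample variance
r = -0.8012                     # correlation

# Part (a): slope, intercept, and standard error of the slope
b1 = r * math.sqrt(s2_y / s2_x)                         # beta1-hat = r * sY/sX
b0 = ybar - b1 * xbar                                   # beta0-hat
se_b1 = math.sqrt((1 - r**2) * s2_y / ((n - 2) * s2_x)) # SE of beta1-hat

# Part (b): t-statistic for H0: beta1 = 0
t_slope = b1 / se_b1

# Part (c): the algebraically equivalent t-statistic built from r alone
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

print(round(b1, 3), round(b0, 1), round(se_b1, 4))  # -3.007 576.7 0.4244
print(round(t_slope, 3), round(t_corr, 3))          # both about -7.085
```

The last two lines illustrate part (d): the two t-statistics are not merely close, they are the same expression after substituting $\hat\beta_1 = r\,s_Y/s_X$ and $s^2 = (1-r^2)s_Y^2(n-1)/(n-2)$.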

Problem 3. The data for the above problem are available in a CSV file on the class website under the filename HighwaySigns.csv.

(a) Make a scatterplot of this data (with fitted regression line) in R (do not include it here... you'll print it out for part (g)), run a regression model, and confirm the results you calculated in problem 2(a) for $\hat\beta_0$, $\hat\beta_1$, and $s_{\hat\beta_1}$.

> fit = lm(distance ~ age, data = HighwaySigns)
> summary(fit)

Call:
lm(formula = distance ~ age, data = HighwaySigns)

Residuals:
    Min      1Q  Median      3Q     Max
-78.231 -41.710   7.646  33.552 108.831

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 576.6819    23.4709  24.570  < 2e-16 ***
age          -3.0068     0.4243  -7.086 1.04e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 49.76 on 28 degrees of freedom
Multiple R-squared: 0.642, Adjusted R-squared: 0.6292
F-statistic: 50.21 on 1 and 28 DF,  p-value: 1.041e-07

> plot(distance ~ age, data = HighwaySigns, pch = 16, cex = 3)
> abline(fit, lwd = 3, col = "red")

[Scatterplot of distance vs. age with the fitted regression line.]

Based on the above output, we see that the estimates match our hand-calculated ones (ignoring rounding errors): $\hat\beta_1 = -3.0068$, $\hat\beta_0 = 576.68$, and $s_{\hat\beta_1} = 0.4243$.

(b) Using the results of the regression model from R, locate the calculated value of the t-test statistic and the associated p-value to determine if age is a significant predictor of distance, and compare these results to the results you obtained by hand in problem 2(b) above.

Based on the R output, we see the calculated t-statistic for the slope is $-7.086$ with a p-value of $1.04 \times 10^{-7}$, which agrees with the work done by hand.
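One quick consistency check on the R output above: with a single predictor, the overall F-statistic is the square of the slope's t-statistic (it is the same test, so the two p-values coincide). Verifying with the printed values, a check added here and not part of the original solutions:

```python
# With one predictor, F = t^2 for the slope (same null hypothesis, same p-value).
t_slope = -7.086   # t value for age from the R summary
F_stat = 50.21     # F-statistic from the R summary

# Tolerance allows for the rounding in the printed R output
print(abs(t_slope**2 - F_stat) < 0.05)   # prints True: 7.086^2 is about 50.21
```

The same relationship explains why both lines of the R output report the p-value 1.04e-07.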

(c) Using only the regression analysis results table and the sample means and variances given in problem 2, calculate a 95% confidence interval for the average distance at which a highway sign can be read by individuals 75 years of age.

$$\hat Y = \hat\beta_0 + \hat\beta_1 X = 576.7 - 3.007(75) = 351.2$$

95% confidence interval for $\mu_Y$ at $x = 75$:

$$\hat Y \pm t_{df=n-2}\,s\sqrt{\frac{1}{n} + \frac{(x - \bar x)^2}{(n-1)s_x^2}} = 351.2 \pm 2.0484(49.76)\sqrt{\frac{1}{30} + \frac{(75 - 51)^2}{(29)(474.207)}} = (323.2,\ 379.2)$$

(d) Use R to confirm the 95% confidence interval in part (c). You'll have to create a new dataframe with the predictor variable age set to 75, new = data.frame(age = 75), and then use the command predict with the linear model from part (a).

> new = data.frame(age = 75)
> predict(fit, new, interval = "confidence")
       fit      lwr      upr
1 351.1693 323.2135 379.1251

(e) Using only the regression analysis results table and the sample means and variances given in problem 2, calculate a 95% prediction interval for the distance at which a highway sign can be read by an individual 75 years of age.

$$\hat Y \pm t_{df=n-2}\,s\sqrt{1 + \frac{1}{n} + \frac{(x - \bar x)^2}{(n-1)s_x^2}} = 351.2 \pm 2.0484(49.76)\sqrt{1 + \frac{1}{30} + \frac{(75 - 51)^2}{(29)(474.207)}} = (245.51,\ 456.89)$$

(f) Use R to confirm the 95% prediction interval in part (e).

> predict(fit, new, interval = "prediction")
       fit      lwr      upr
1 351.1693 245.4732 456.8653

(g) Print out the scatterplot with least-squares line, enter your two intervals from parts (c) and (e) by hand onto the scatterplot, and explain the difference between the 95% confidence interval and the 95% prediction interval.

[Scatterplot of distance vs. age with the fitted line; the prediction interval at age 75 is drawn in blue and the confidence interval in orange.]

In the above graph the prediction interval is in blue and the confidence interval is in orange. The prediction interval is a reasonable interval estimate for where a single 75-year-old person would be predicted to be able to read a sign at the 95% confidence level, while the confidence interval is a range of plausible values for the average distance at which all 75-year-olds in the population are able to read a sign (i.e., where the underlying population is lying in the $Y$-direction at $X = 75$) at the 95% confidence level.

Problem 4. The data set malebirths.csv contains data for the proportion of male births in 4 different countries (Denmark, the Netherlands, Canada, and the United States) for a number of years. Use this data set to answer the following questions:

(a) Run four different simple linear regression models in R, one for each country separately. Create a table with four rows (one for each country) and four columns: one column each for the calculations $\hat\beta_1$, the standard error of $\hat\beta_1$, the related t-statistic, and the p-value related to this test. For which countries is the association significant?

  Country        beta1-hat      s.e.(beta1-hat)   t-stat   p-value
  Denmark        -4.29 x 10^-5  2.07 x 10^-5      -2.07    0.0442
  Netherlands    -8.08 x 10^-5  1.42 x 10^-5      -5.71    < 0.00001
  Canada         -1.11 x 10^-4  2.77 x 10^-5      -4.02    0.00074
  USA            -5.43 x 10^-5  9.39 x 10^-6      -5.78    0.00001

Based on the results of the linear regression models, we see that the proportion of births that are male babies has significantly decreased over time in all four countries. From most to least significant: Netherlands, USA, Canada, and Denmark.

(b) Present four different scatterplots (one for each country) with the observed points and the related fitted regression line as well. It would be helpful for interpretation if each plot had the same bounds on both axes. Be sure the plots are clearly labeled.

[Four scatterplots with fitted regression lines, one panel each for Denmark, the Netherlands, Canada, and the USA.]
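Each t-statistic in the Problem 4(a) table is just $\hat\beta_1 / s_{\hat\beta_1}$, so the table can be checked for internal consistency. A quick Python check (values copied from the table; the negative signs follow the stated downward trend, since the scrape of the original may have dropped them):

```python
# Check that t = beta1-hat / SE(beta1-hat) for each country in the table.
rows = {
    "Denmark":     (-4.29e-5, 2.07e-5, -2.07),
    "Netherlands": (-8.08e-5, 1.42e-5, -5.71),
    "Canada":      (-1.11e-4, 2.77e-5, -4.02),
    "USA":         (-5.43e-5, 9.39e-6, -5.78),
}
for country, (b1, se, t) in rows.items():
    # the slack allows for the table's 3-significant-figure rounding
    assert abs(b1 / se - t) < 0.03, country
print("table is internally consistent")
```

This also makes part (c) below concrete: the USA's t-statistic is largest in magnitude not because its slope is steepest, but because its standard error (9.39e-6) is by far the smallest.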

(c) Explain why the U.S. can have the largest of the 4 t-statistics (in magnitude) even though its slope is not the highest.

This can be explained by the fact that the standard error of the slope is smaller for the U.S. (because the spread of the points around the fitted regression line is much smaller in the vertical direction).

(d) Explain why the standard error of the estimated slope is smaller for the U.S. than for Canada, even though the sample size is the same.

This can be explained by the fact that the variance estimate for the residuals is much smaller (less spread around the line in the y-direction), so the precision of the slope as an estimate for the unknown true value generating this data is much better.

(e) Provide a reason why the standard deviations around the regression line might be different for the four countries (hint: the proportion of males can be thought of as an average within each country respectively).

This is because the points actually represent averages of zeros and ones: a measurement for each baby that is born, recording whether or not it is male. Since there are so many more observations (births) in the U.S., we'd expect the average of these zeros and ones to vary less (by the Law of Large Numbers).
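The Law of Large Numbers argument in part (e) can be made concrete: if a year's value is a proportion of roughly $p \approx 0.515$ over $N$ births, its standard deviation is about $\sqrt{p(1-p)/N}$. A hedged Python illustration (the birth counts below are rough orders of magnitude chosen for illustration, not values from the data set):

```python
import math

# SD of a yearly proportion of male births: sqrt(p*(1-p)/N) for N births.
p = 0.515                 # a typical proportion of male births
births = {                # illustrative annual birth counts (orders of magnitude)
    "Denmark": 60_000,
    "Canada": 350_000,
    "USA": 4_000_000,
}
sd = {c: math.sqrt(p * (1 - p) / n) for c, n in births.items()}
for country, s in sd.items():
    print(f"{country}: sd of yearly proportion ~ {s:.5f}")

# More births -> a less variable yearly proportion, hence tighter scatter
# around the regression line for the USA than for Canada or Denmark.
```

Whatever the exact counts, the ordering is what matters: the country with the most births per year has the least variable yearly proportion, which is exactly the pattern in the residual spreads from part (d).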