MA 575, Linear Models : Homework 3

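Question 1

$$
\begin{aligned}
RSS(\hat\beta_0, \hat\beta_1) &= \sum_{i=1}^n (\hat y_i - y_i)^2 \\
&= \sum_{i=1}^n (\hat\beta_0 + \hat\beta_1 x_i - y_i)^2 \\
&= \sum_{i=1}^n \left(\bar y - \frac{SXY}{SXX}\bar x + \frac{SXY}{SXX} x_i - y_i\right)^2 \\
&= \sum_{i=1}^n \left((\bar y - y_i) + \frac{SXY}{SXX}(x_i - \bar x)\right)^2 \\
&= \sum_{i=1}^n (\bar y - y_i)^2 - 2\frac{SXY}{SXX}\sum_{i=1}^n (y_i - \bar y)(x_i - \bar x) + \left(\frac{SXY}{SXX}\right)^2 \sum_{i=1}^n (x_i - \bar x)^2 \\
&= SYY - 2\frac{SXY^2}{SXX} + \left(\frac{SXY}{SXX}\right)^2 SXX \\
&= SYY - \frac{SXY^2}{SXX}
\end{aligned}
$$

Problem 2.7

Question 2.7.1

(a) Let's derive the formula of the estimator $\hat\beta_1$ (Equations (A.9) in Appendix (A.3)).

$$
RSS(\beta_1) = \sum_{i=1}^n (y_i - \beta_1 x_i)^2, \qquad
\frac{\partial RSS(\beta_1)}{\partial \beta_1} = -2\sum_{i=1}^n x_i (y_i - \beta_1 x_i)
$$

Setting the derivative to zero at $\hat\beta_1$ gives $\sum_{i=1}^n x_i (y_i - \hat\beta_1 x_i) = 0$. Therefore,

$$
\hat\beta_1 = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}
$$

(b) Let's show that the estimator $\hat\beta_1$ is unbiased.

$$
E[\hat\beta_1] = E\left[\frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}\right]
= \frac{\sum_{i=1}^n x_i E[y_i]}{\sum_{i=1}^n x_i^2}
= \beta_1 \frac{\sum_{i=1}^n x_i^2}{\sum_{i=1}^n x_i^2}
= \beta_1
$$

(c) Let's find the variance of our estimator $\hat\beta_1$.

$$
Var(\hat\beta_1) = E[\hat\beta_1^2] - E[\hat\beta_1]^2
$$

where

$$
E[\hat\beta_1^2] = \frac{E\left[\left(\sum_{i=1}^n x_i y_i\right)^2\right]}{\left(\sum_{i=1}^n x_i^2\right)^2}
= \frac{E\left[\sum_{i=1}^n \sum_{j=1}^n x_i x_j y_i y_j\right]}{\left(\sum_{i=1}^n x_i^2\right)^2}
$$

By independence of the $\{y_i\}_{i=1..n}$, we obtain:

$$
\begin{aligned}
E[\hat\beta_1^2] &= \frac{\sum_{i=1}^n x_i^2 E[y_i^2] + \sum_{i=1}^n \sum_{j \ne i} x_i x_j E[y_i] E[y_j]}{\left(\sum_{i=1}^n x_i^2\right)^2} \\
&= \frac{\sum_{i=1}^n x_i^2 \left(Var(y_i) + E[y_i]^2\right) + \beta_1^2 \sum_{i=1}^n x_i^2 \sum_{j \ne i} x_j^2}{\left(\sum_{i=1}^n x_i^2\right)^2} \\
&= \frac{\sum_{i=1}^n x_i^2 (\sigma^2 + \beta_1^2 x_i^2) + \beta_1^2 \sum_{i=1}^n x_i^2 \sum_{j \ne i} x_j^2}{\left(\sum_{i=1}^n x_i^2\right)^2} \\
&= \frac{\sigma^2 \sum_{i=1}^n x_i^2 + \beta_1^2 \left(\sum_{i=1}^n x_i^2\right)^2}{\left(\sum_{i=1}^n x_i^2\right)^2}
= \frac{\sigma^2}{\sum_{i=1}^n x_i^2} + \beta_1^2
\end{aligned}
$$

Therefore,

$$
Var(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^n x_i^2}
$$

(d) Let's find an unbiased estimator of $\sigma^2$. Let $RSS_0 = RSS(\hat\beta_1)$, then:

$$
\begin{aligned}
RSS_0 &= \sum_{i=1}^n (\hat y_i - y_i)^2 = \sum_{i=1}^n (\hat\beta_1 x_i - y_i)^2 \\
&= \sum_{i=1}^n \left(\hat\beta_1^2 x_i^2 - 2\hat\beta_1 x_i y_i + y_i^2\right) \\
&= \hat\beta_1 \left(\hat\beta_1 \sum_{i=1}^n x_i^2 - 2\sum_{i=1}^n x_i y_i\right) + \sum_{i=1}^n y_i^2 \\
&= \sum_{i=1}^n y_i^2 - \frac{\left(\sum_{i=1}^n x_i y_i\right)^2}{\sum_{i=1}^n x_i^2}
\end{aligned}
$$

Since $E[\sum y_i^2] = n\sigma^2 + \beta_1^2 \sum x_i^2$ and, from (c), $E[(\sum x_i y_i)^2] / \sum x_i^2 = \sigma^2 + \beta_1^2 \sum x_i^2$, we find that $E[RSS_0] = (n-1)\sigma^2$. We therefore choose:

$$
\hat\sigma^2 = \frac{RSS_0}{n-1}
$$

In our model, there are $n$ observations and one parameter. Therefore, $\hat\sigma^2$ has $n-1$ df.

Question 2.7.2

(a) Let's derive the ANOVA table.

We want to test the hypothesis $H_0 : \beta_0 = 0$. Here $RSS$ is the residual sum of squares of the model with an intercept, and $\hat\sigma^2 = RSS/(n-2)$ is its residual mean square.

    Source      df      SS             MS                        F                           p-value
    Regression  1       RSS_0 - RSS    (RSS_0 - RSS)/1           (RSS_0 - RSS)/sigma_hat^2
    Residuals   n - 2   RSS            sigma_hat^2 = RSS/(n-2)
    Total       n - 1   RSS_0

Table 1: ANOVA table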
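The results of Question 2.7.1 can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original solution: the sample size, the true slope, the noise level and the design points are all arbitrary assumptions chosen for illustration. It compares the simulated mean and variance of $\hat\beta_1$ and the simulated mean of $RSS_0$ with $\beta_1$, $\sigma^2 / \sum x_i^2$ and $(n-1)\sigma^2$.

```r
# Monte Carlo check of the regression-through-the-origin results.
# All numerical settings below are illustrative assumptions.
set.seed(1)
n     <- 20                          # sample size (assumed)
beta1 <- 0.5                         # true slope (assumed)
sigma <- 2                           # true error standard deviation (assumed)
x     <- seq(1, 10, length.out = n)  # fixed design points (assumed)

# Simulate one data set and return the slope estimate and RSS_0
one_fit <- function() {
  y  <- beta1 * x + rnorm(n, sd = sigma)
  b1 <- sum(x * y) / sum(x^2)        # estimator derived in (a)
  c(b1 = b1, rss0 = sum((y - b1 * x)^2))
}
sims <- replicate(20000, one_fit())  # 2 x 20000 matrix with rows "b1", "rss0"

mean_b1    <- mean(sims["b1", ])     # should be close to beta1
var_b1     <- var(sims["b1", ])      # should be close to sigma^2 / sum(x^2)
mean_rss0  <- mean(sims["rss0", ])   # should be close to (n - 1) * sigma^2
theory_var <- sigma^2 / sum(x^2)

print(c(mean_b1 = mean_b1, beta1 = beta1))
print(c(var_b1 = var_b1, theory_var = theory_var))
print(c(mean_rss0 = mean_rss0, theory = (n - 1) * sigma^2))
```

The simulated quantities only agree with the theoretical ones up to Monte Carlo error, so small discrepancies are expected.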

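(b) Let's show that the F statistic is equal to the square of the t test statistic. To do so, we need to prove that $F$ and $t^2$ are equal, where:

$$
F = \frac{RSS_0 - RSS}{\hat\sigma^2}
$$

$$
t^2 = \frac{\hat\beta_0^2}{\widehat{Var}(\hat\beta_0)}
= \frac{(\bar y - \hat\beta_1 \bar x)^2}{\hat\sigma^2 \left(\frac{1}{n} + \frac{\bar x^2}{SXX}\right)}
= \frac{\left(\bar y - \frac{SXY}{SXX}\bar x\right)^2}{\hat\sigma^2 \left(\frac{1}{n} + \frac{\bar x^2}{SXX}\right)}
= \frac{n\,(SXX\,\bar y - \bar x\,SXY)^2}{\hat\sigma^2\, SXX\,(SXX + n\bar x^2)}
$$

Therefore, proving that $F = t^2$ is equivalent to proving that

$$
RSS_0 - RSS = \frac{n\,(SXX\,\bar y - \bar x\,SXY)^2}{SXX\,(SXX + n\bar x^2)}
$$

Using the expression of $RSS_0$ from Question 2.7.1 (d) and $RSS = SYY - SXY^2/SXX$ from Question 1:

$$
\begin{aligned}
RSS_0 - RSS &= \sum_{i=1}^n y_i^2 - \frac{\left(\sum_{i=1}^n x_i y_i\right)^2}{\sum_{i=1}^n x_i^2} - SYY + \frac{SXY^2}{SXX} \\
&= \sum_{i=1}^n (y_i - \bar y)^2 + n\bar y^2 - \frac{\left(\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) + n\bar x \bar y\right)^2}{\sum_{i=1}^n (x_i - \bar x)^2 + n\bar x^2} - SYY + \frac{SXY^2}{SXX} \\
&= SYY + n\bar y^2 - \frac{(SXY + n\bar x\bar y)^2}{SXX + n\bar x^2} - SYY + \frac{SXY^2}{SXX}
\end{aligned}
$$

By putting all terms over a single denominator and removing the terms that cancel one another, we find that:

$$
RSS_0 - RSS = \frac{n\left[(\bar x\,SXY)^2 - 2\bar x\bar y\,SXY\,SXX + (\bar y\,SXX)^2\right]}{SXX\,(SXX + n\bar x^2)}
= \frac{n\,(\bar x\,SXY - \bar y\,SXX)^2}{SXX\,(SXX + n\bar x^2)}
$$

Question 2.7.3

(a) Let's fit a regression over the "snake" data.

    # Load the ALR package
    library(alr3)
    # Attach the snake data file
    attach(snake)
    # "- 1" removes the intercept
    m0.lm <- lm(Y ~ X - 1, data = snake)
    summary(m0.lm)

(b) What are the values of $\hat\beta_1$ and $\hat\sigma^2$?

    summary(m0.lm)

    Call:
    lm(formula = Y ~ X - 1, data = snake)

    Residuals:
        Min      1Q  Median      3Q     Max
    -2.4207 -1.4924 -0.1935  1.6515  3.0771

    Coefficients:
      Estimate Std. Error t value Pr(>|t|)
    X  0.52039    0.01318   39.48   <2e-16 ***
    ---
    Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

    Residual standard error: 1.7 on 16 degrees of freedom
    Multiple R-squared: 0.9898, Adjusted R-squared: 0.9892
    F-statistic: 1559 on 1 and 16 DF, p-value: < 2.2e-16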

Therefore, $\hat\beta_1 = 0.52039$ and $\hat\sigma^2 = 1.7^2 = 2.89$.

(c) What is the 95% confidence interval of $\hat\beta_1$?

We know that $\dfrac{\hat\beta_1 - \beta_1}{\widehat{se}(\hat\beta_1)} \sim T(n-1)$, therefore:

$$
I_{95\%} = \left[\hat\beta_1 - t_{1-\alpha/2}\,\widehat{se}(\hat\beta_1)\ ;\ \hat\beta_1 + t_{1-\alpha/2}\,\widehat{se}(\hat\beta_1)\right]
$$

where $t_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of a Student distribution with $n-1$ df.

    n <- dim(snake)[1]
    alpha <- 0.05
    # Student quantile with n - 1 degrees of freedom
    z <- qt(1 - alpha/2, n - 1)
    beta_hat <- m0.lm$coefficients
    # Standard error of the slope (row 1, column 2 of the coefficient table)
    sbeta_hat <- summary(m0.lm)$coefficients[1, 2]
    Inter95 <- c(beta_hat - z*sbeta_hat, beta_hat + z*sbeta_hat)
    Inter95

           X        X
    0.492451 0.548337

(d) Let's test that the intercept is 0.

    # Fit a regression with a non-null intercept
    m1.lm <- lm(Y ~ X, data = snake)
    summary(m1.lm)
    # ANOVA
    anova(m0.lm, m1.lm)

    Analysis of Variance Table

    Model 1: Y ~ X - 1
    Model 2: Y ~ X
      Res.Df    RSS Df Sum of Sq      F Pr(>F)
    1     16 46.226
    2     15 45.560  1    0.6663 0.2193 0.6463

The p-value is equal to 0.64. The ANOVA does not reject the hypothesis that the intercept is null.

Question 2.7.4

    plot(m0.lm, which = 1)

The model seems acceptable since the residuals are approximately centered around 0 (evenly spread around the x-axis) and show no dependence on one another (the scatter plot has a roughly round shape).

Problem 2.10

Question 2.10.1

Because the questions in problem 2.10 and problem 4 of the homework are fairly similar (only the maximum rank to consider and, for problem 4, the data change), writing an R function seems appropriate so that we do not waste time. The following code is suggested:

    run_lm <- function(X, Y, b1) {
      n <- length(X)
      n_char <- as.character(n)

      # 2.10.1
      # Run a linear regression using Zipf's formula
      logmod.lm <- lm(log(Y) ~ log(X))

      # Plot and save the predicted values vs the actual values
      filename <- paste("PredvsRank", n_char, ".png", sep = "")
      png(filename = filename)
      plot(log(X), log(Y), xlab = "log rank", ylab = "log frequency",
           main = paste("Log frequency vs Log rank, n = ", n_char, sep = ""))
      lines(log(X), logmod.lm$fitted.values)
      dev.off()

      # Plot the residuals QQ plot
      filename <- paste("ResQQplot", n_char, ".png", sep = "")
      png(filename = filename)
      plot(logmod.lm, which = 2, main = paste("Normal Q-Q plot, n = ", n_char, sep = ""))
      dev.off()

      # Output the results of the linear regression
      print(summary(logmod.lm))

      # 2.10.2 Compute the t-test for the slope (only when b1 is supplied)
      if (nargs() > 2) {
        b1_hat <- logmod.lm$coefficients[[2]]
        se_b1 <- summary(logmod.lm)$coefficients[2, 2]
        # Tests H0: slope = -b1
        t_stat <- (b1_hat + b1)/se_b1
        p_val <- 2*(1 - pt(abs(t_stat), n - 2))
        print(paste("p-val = ", as.character(p_val), sep = ""))
      }
    }

where Y is a vector containing the frequencies considered for the linear regression and X is the vector containing the ranks associated with the frequencies in Y. Finally, b1 is the value that one might want to test for the slope of the regression. This argument can be left empty, in which case no test is computed. To answer questions 2.10.1 and 2.10.2, run the following code:

    attach(MWwords)
    # Select the set corresponding to the n highest ranks
    n <- 50
    set <- MWwords$HamiltonRank <= n
    Y <- Hamilton[set]
    X <- HamiltonRank[set]
    run_lm(X, Y, 1)

We obtain the following graphs for the Normal Q-Q plot of the residuals and the estimated mean function. Zipf's law seems to model the frequencies fairly well as a function of the word's rank. Below are the numerical results of the linear regression.

    Call:
    lm(formula = log(Y) ~ log(X))

    Residuals:
    Min 1Q Median 3Q Max
    -0.57413 -0.05088 -0.001563 0.043448 0.18868

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   4.7714    0.03948  120.84   <2e-16 ***
    log(X)      -1.00764    0.01275  -79.04   <2e-16 ***
    ---
    Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

    Residual standard error: 0.07934 on 48 degrees of freedom
    Multiple R-squared: 0.9924, Adjusted R-squared: 0.9922
    F-statistic: 6248 on 1 and 48 DF, p-value: < 2.2e-16

Question 2.10.2

Let $H_0 : b = -1$, where $b$ is the slope of the R linear model (the function is called with b1 = 1 and not -1, because the test statistic is computed as $(\hat b + b_1)/\widehat{se}(\hat b)$). The p-value is

$$
P_{H_0}(|T| \ge |t|) = 2\left(1 - P_{H_0}(T \le |t|)\right) = 2\left[1 - P\left(T \le \frac{|\hat b + 1|}{\widehat{se}(\hat b)}\right)\right]
$$

where $T \sim \tau(n-2)$. This p-value is computed in the function presented above and we obtain:

    [1] "p-val = 0.551839164089"

The t test does not reject the hypothesis $H_0$.

Question 2.10.3

For this question, use the R code below:

    for (n in c(75, 100)) {
      set <- MWwords$HamiltonRank <= n
      Y <- Hamilton[set]
      X <- HamiltonRank[set]
      run_lm(X, Y)
    }

The model seems to work for n = 75, but the predicted values of the low frequencies are not well modeled in the case n = 100. Note: in the case n = 100, the data contain more than a hundred values because 3 words are ranked in the 100th position.

Question 4

To answer this question, run the R code below.

    load("simpsonswordfreq.rdata")
    attach(simpsons.wordfreq)
    for (n in c(2000, 3000)) {
      Y <- Frequency[1:n]
      X <- Rank[1:n]
      run_lm(X, Y)
    }

We obtain:

It seems that, in that case, Zipf's model is appropriate for the smallest frequencies.
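The slope test of Question 2.10.2 can be illustrated without the homework data sets. The following is a minimal sketch on simulated data: the intercept, the noise level and the sample size are assumptions made up for the example, and the frequencies are generated so that the slope on log rank is exactly -1. It computes the p-value with the same formula as the solution, then cross-checks it by refitting the model with log(rank) added to the response, so that the standard t-test reported by `summary()` tests the same hypothesis.

```r
# Illustrative t-test of H0: slope = -1 in a Zipf-type regression.
# The data are simulated; every numerical setting is an assumption.
set.seed(42)
n        <- 50
rank     <- 1:n
log_freq <- 5 - 1 * log(rank) + rnorm(n, sd = 0.08)  # true slope is -1

fit   <- lm(log_freq ~ log(rank))
coefs <- summary(fit)$coefficients
b_hat <- coefs[2, "Estimate"]
se_b  <- coefs[2, "Std. Error"]

# Manual two-sided p-value, as in the solution: t = (b_hat + 1) / se
t_stat <- (b_hat + 1) / se_b
p_val  <- 2 * (1 - pt(abs(t_stat), df = n - 2))

# Cross-check: adding log(rank) to the response shifts the slope by +1,
# so the usual test of "slope = 0" in this model is the test of "slope = -1"
fit_shift <- lm(log_freq + log(rank) ~ log(rank))
p_check   <- summary(fit_shift)$coefficients[2, "Pr(>|t|)"]

print(c(manual = p_val, shifted_model = p_check))
```

Both p-values should agree up to floating-point error, since shifting the response changes the slope estimate by one but leaves its standard error unchanged.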