Statistics 203 Introduction to Regression Models and ANOVA Practice Exam

Similar documents
ST430 Exam 2 Solutions

MS&E 226: Small Data

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

Variance Decomposition and Goodness of Fit

Inference for Regression

Exam Applied Statistical Regression. Good Luck!

Multiple Linear Regression

1 Use of indicator random variables. (Chapter 8)

Figure 1: The fitted line using the shipment route-number of ampules data. STAT5044: Regression and ANOVA The Solution of Homework #2 Inyoung Kim

Coefficient of Determination

Variance Decomposition in Regression James M. Murray, Ph.D. University of Wisconsin - La Crosse Updated: October 04, 2017

UNIVERSITY OF TORONTO Faculty of Arts and Science

UNIVERSITY OF MASSACHUSETTS. Department of Mathematics and Statistics. Basic Exam - Applied Statistics. Tuesday, January 17, 2017

Statistics 191 Introduction to Regression Analysis and Applied Statistics Practice Exam

Applied Statistics. J. Blanchet and J. Wadsworth. Institute of Mathematics, Analysis, and Applications EPF Lausanne

Ch 2: Simple Linear Regression

BMI 541/699 Lecture 22

MATH 644: Regression Analysis Methods

Logistic Regressions. Stat 430

Lecture 4 Multiple linear regression

ST430 Exam 1 with Answers

Chapter 4. Regression Models. Learning Objectives

Applied Regression Analysis

Week 7 Multiple factors. Ch , Some miscellaneous parts

Lecture 14: Introduction to Poisson Regression

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Chapter 4: Regression Models

Econometrics II. Seppo Pynnönen. Spring Department of Mathematics and Statistics, University of Vaasa, Finland

R Output for Linear Models using functions lm(), gls() & glm()

Comparing Nested Models

Linear Regression Model. Badr Missaoui

Simple Linear Regression

R Hints for Chapter 10

Ch 3: Multiple Linear Regression

14 Multiple Linear Regression

CAS MA575 Linear Models

Lecture 2. The Simple Linear Regression Model: Matrix Approach

Lecture 1 Intro to Spatial and Temporal Data

Biostatistics 380 Multiple Regression 1. Multiple Regression

5. Let W follow a normal distribution with mean of μ and the variance of 1. Then, the pdf of W is

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model

Lecture 6 Multiple Linear Regression, cont.

Linear Regression Models P8111

Chapter 3 - Linear Regression

Stat 5102 Final Exam May 14, 2015

Matrices and vectors A matrix is a rectangular array of numbers. Here s an example: A =

7/28/15. Review Homework. Overview. Lecture 6: Logistic Regression Analysis

22s:152 Applied Linear Regression. Chapter 8: 1-Way Analysis of Variance (ANOVA) 2-Way Analysis of Variance (ANOVA)

Homework 9 Sample Solution

Applied Statistics Preliminary Examination Theory of Linear Models August 2017

A Generalized Linear Model for Binomial Response Data. Copyright c 2017 Dan Nettleton (Iowa State University) Statistics / 46

SCHOOL OF MATHEMATICS AND STATISTICS

Stat 411/511 ESTIMATING THE SLOPE AND INTERCEPT. Charlotte Wickham. stat511.cwick.co.nz. Nov

Motor Trend Car Road Analysis

Regression Models. Chapter 4. Introduction. Introduction. Introduction

One-Way Analysis of Variance: A Guide to Testing Differences Between Multiple Groups

Linear Model Specification in R

Generating OLS Results Manually via R

Tests of Linear Restrictions

Logistic Regression 21/05

Multiple Linear Regression

R 2 and F -Tests and ANOVA

Regression and Models with Multiple Factors. Ch. 17, 18

Explanatory Variables Must be Linear Independent...

Regression with Qualitative Information. Part VI. Regression with Qualitative Information

22s:152 Applied Linear Regression. 1-way ANOVA visual:

Linear Modelling in Stata Session 6: Further Topics in Linear Modelling

Clinical Trials. Olli Saarela. September 18, Dalla Lana School of Public Health University of Toronto.

Exercise 5.4 Solution

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION (SOLUTIONS) ST3241 Categorical Data Analysis. (Semester II: )

General Linear Statistical Models - Part III

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

2. Regression Review

ANOVA CIVL 7012/8012

STA102 Class Notes Chapter Logistic Regression

Second Midterm Exam Name: Solutions March 19, 2014

AMS-207: Bayesian Statistics

The linear model. Our models so far are linear. Change in Y due to change in X? See plots for: o age vs. ahe o carats vs.

Biostatistics for physicists fall Correlation Linear regression Analysis of variance

Inference. ME104: Linear Regression Analysis Kenneth Benoit. August 15, August 15, 2012 Lecture 3 Multiple linear regression 1 1 / 58

General Linear Statistical Models

Example: Poisondata. 22s:152 Applied Linear Regression. Chapter 8: ANOVA

Introduction to the Analysis of Tabular Data

Five people were asked approximately how many hours of TV they watched per week. Their responses were as follows.

Introduction and Single Predictor Regression. Correlation

Analytics 512: Homework # 2 Tim Ahn February 9, 2016

Ordinary Least Squares Regression Explained: Vartanian

Handout 4: Simple Linear Regression

Inference for Regression Inference about the Regression Model and Using the Regression Line

Correlation 1. December 4, HMS, 2017, v1.1

SIMPLE REGRESSION ANALYSIS. Business Statistics

Linear Regression. Data Model. β, σ 2. Process Model. ,V β. ,s 2. s 1. Parameter Model

16.3 One-Way ANOVA: The Procedure

1 Introduction 1. 2 The Multiple Regression Model 1

Unit 27 One-Way Analysis of Variance

Lecture 6: Linear Regression

Exam details. Final Review Session. Things to Review

Simple linear regression

ANALYSES OF NCGS DATA FOR ALCOHOL STATUS CATEGORIES 1 22:46 Sunday, March 2, 2003

STATISTICS 174: APPLIED STATISTICS FINAL EXAM DECEMBER 10, 2002

Transcription:

Statistics 203 Introduction to Regression Models and ANOVA Practice Exam Prof. J. Taylor You may use your 4 single-sided pages of notes This exam is 7 pages long. There are 4 questions, first 3 worth 10 points, the fourth worth 15 points. 1

Q. 1) A model is fit to predict MPG (miles per gallon) of several makes of cars, based on WT (weight), SP (speed); VOL (cab volume) and HP (horse power). Here is the output of R s summary for that model. > lm1 = lm(mpg ~ WT+VOL+HP+SP) > summary(lm1) Call: lm(formula = MPG ~ WT + VOL + HP + SP) Residuals: Min 1Q Median 3Q Max -9.0108-2.7731 0.2733 1.8362 11.9854 Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 192.43775 23.53161 8.178 4.62e-12 *** WT -1.85980 0.21336-8.717 4.22e-13 *** VOL -0.01565 0.02283-0.685 0.495 HP 0.39221 0.08141 4.818 7.13e-06 *** SP -1.29482 0.24477-5.290 1.11e-06 *** --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0.1 1 Residual standard error: 3.653 on 77 degrees of freedom Multiple R-squared: 0.8733,Adjusted R-squared: 0.8667 F-statistic: 132.7 on 4 and 77 DF, p-value: < 2.2e-16 (a) Which entry above is σ, i.e. the usual estimate of the parameter σ in this model? (b) What is the error sum of squares for this model, i.e. SSE? (No need to compute this with a calculator, just give an expression using entries from the table). 2

(c) If you knew the SSE, how would you compute the total sum of squares of MP G, i.e. SST, from the above output? (d) Suppose you just wanted to test whether V OL is unrelated to MP G allowing for all the other effects in the model. Test this hypothesis with an F test. What is the p-value of this test? 3

(e) Finally, the research team has decided that they want to test the hypothesis that the coefficients for HP and V OL are zero. That is, that HP and V OL are unrelated to MP G when allowing or controlling for the other variables in the model. To do so, they fit another model. Here is R s summary of that model. > lm2 = lm(mpg ~ WT+SP) > summary(lm2) Call: lm(formula = MPG ~ WT + SP) Residuals: Min 1Q Median 3Q Max -10.5160-2.5085-0.8544 0.9377 16.6276 Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) 75.64938 3.89181 19.438 <2e-16 *** WT -0.99738 0.07774-12.830 <2e-16 *** SP -0.09816 0.04508-2.177 0.0325 * --- Signif. codes: 0 *** 0.001 ** 0.01 * 0.05. 0.1 1 Residual standard error: 4.184 on 79 degrees of freedom Multiple R-squared: 0.8294,Adjusted R-squared: 0.8251 F-statistic: 192.1 on 2 and 79 DF, p-value: < 2.2e-16 Describe how you could test this hypothesis using the above output. 4

Df SSE Use 2 33.24 Size 3 12.06 Size:Use 6 4.25 Residuals 108 154.62 Table 1: ANOVA output for gasoline consumption analysis Q. 2) In order to determine some of the factors that effect gasoline consumption, the fuel average miles per gallon for 120 families and their cars were tracked over several months. The cars were categorized in to 4 categories: compact, sedan, minivan, SUV; and the prevalent use was broken down into 3 categories: commuting to work, weekend, driving to school. The sums of squares of a two-way analysis of variance (ANOVA) model are presented above. (a) What assumptions does the two-way ANOVA model make? Be as precise as possible. (b) Compute the F -statistic used to test for the main effects of both Use and Size. (c) Based on the definition of the F distribution, if the denominator degrees of freedom of an F -statistic is large, what might you use to approximate the expected value of the F -statistic under the appropriate null hypothesis? Using that guideline, does there appear to be evidence for main effects of Use and Size? (d) Compute the F -statistic to test for an interaction between Use and Size. Using the above guideline, does there appear to be evidence for an interaction effect? 5

Q. 3) A liver specialist has come to you data relating the number of alcoholic beverages per week for each subject to the chances that they have develop liver disease (within some long follow-up period). The study controlled for the effects of Age as well as overall fitness level Fitness with a high value indicating a very fit subject. The results of a logistic regression used in the study were: Call: glm(formula = Y ~ Age + Drinks + Fitness, family = binomial(link= logit )) Deviance Residuals: Min 1Q Median 3Q Max -1.21798-0.26941-0.16629-0.08747 2.44946 Coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) 2.8436 5.1508 0.552 0.58090 Age -0.0993 0.1028-0.966 0.33418 Drinks 0.1601 0.1208 1.325 0.18503 Fitness -0.7584 0.2411-3.146 0.00166 (a) What is the estimated probability that a 50-year old who drinks 5 drinks a week of fitness level 3 will develop liver disease? Redo this calculation for a 50-year old of the same fitness level who does not drink any alcoholic beverages? (If you don t have a calculator, just don t simplify the expression). (b) Using an odds ratio, approximate how many more times likely is the adult who drinks 5 drinks likely to develop liver disease than the person who drinks none? (c) The liver specialist guesses that there might be some interaction between fitness level and the number of drinks per week on the chances of liver disease. Describe how you might fit a model that allows for this interaction and use this model to test the hypothesis (in approximately correct R code). (d) Suppose the specialist had decided to discretize Fitness into 3 levels, say (L,M,H). How would the rows of the summary coefficients table above be different? What about the rows in the model for part (c)? (Don t worry if you do not guess exactly what R would do, the general idea is sufficient.) 6

Q. 4) Consider a regression model Y n 1 = X n p β p 1 + δ n 1. Unlike the usual assumptions, though, in this model it is assumed that nature has conspired against us in that E(X T δ) 0 though β is the true parameter of interest. Suppose we also have access to other variables Z n q for which E(Z T δ) = 0. (This is the so-called instrumental variables model.) The instrumental variables estimate for β is ˆβ IV = (X T Z(Z T Z) 1 Z T X) 1 (X T Z(Z T Z) 1 Z T Y ). (1) The estimator ˆβ IV can be motivated from the equation Z T Y = (Z T X)β + Z T δ (2) and is the generalized least-squares estimator assuming that Cov(Z T δ) = σ 2 Z T Z. (a) What do we need to assume about p, q above so that ˆβ IV is welldefined? (b) Pretending that X T Z and Z T Z are constant, show that E( ˆβ IV ) = β. (c) Assume that the cases (X i, Z i, δ i ) for 1 i n are IID from some distribution F with δ i independent of Z i with E F (δ 1 ) = 0. What does the law of averages say about X T Z and Z T Z as n? (d) What does the law of averages say about Z T δ as n? What is the covariance of Z T δ? (Recall that Z i is independent of each δ i.) (e) Use parts (c) and (d) above to conclude that ˆβ IV is well-approximated in distribution by N(β, Var F (δ)(x T Z(Z T Z) 1 Z T X) 1 ). (In practice, the quantity Var F (δ) can be estimated by Y X ˆβ IV 2 /(n p). (f) Another estimator we might use upon inspection of (2) is the OLS estimator in this equation: β = (X T ZZ T X) 1 X T ZZ T Y. Show that E( β) = β. How might you approximate its distribution? 7