Statistics 191 Introduction to Regression Analysis and Applied Statistics Practice Exam


Prof. J. Taylor

You may use your 4 single-sided pages of notes. This exam is 14 pages long. There are 4 questions, each worth 10 points.

I understand and accept the Stanford University Honor Code.

Name:                          Signature:

1    2    3    4    Total

Q. 1) In order to study the relevance of different fields of study in the job market, the Stanford alumni association followed Stanford graduates in the first few years after graduation. Students were asked their starting salaries, as well as which sector (i.e. finance, technology, education) they were employed in. Within the finance field there were (among others) 20 Math & Computational Science (MCS) and 20 English graduates.

(a) The first question the researchers addressed was whether an MCS degree was worth more than an English degree in finance. After entering the data in R, they found the following:

> t.test(mcs, english, var.equal=T, alternative="greater")

        Two Sample t-test

data:  mcs and english
t = 3.718, df = 38, p-value = 0.0003228
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 5509.375      Inf
sample estimates:
mean of x mean of y
 69457.52  59377.05

Explain the results in the output above and what conclusions the researchers can draw.

The above results are from a two-sample t-test of whether the average salary is the same for MCS graduates as for English graduates. The researchers conducted a one-sided test in which the null hypothesis was

    H0: µ_MCS ≤ µ_English

and the alternative was

    Ha: µ_MCS > µ_English.

The t statistic is computed as

    T = (µ̂_MCS − µ̂_English) / SE(µ̂_MCS − µ̂_English)

and was observed to be 3.718, much larger than 0. This is strong evidence against H0, with a p-value of about 3 × 10⁻⁴. We reject H0 and conclude the data favor Ha.

(b) A friend of yours reads the results of this study and says: "Wow! If I choose MCS as my major, the probability I'd earn more than an English grad if we both start in finance is about 1 − 3 × 10⁻⁴!" Do you agree with your friend's statement? Explain.

We should not agree. The quantity 3 × 10⁻⁴ is the probability, computed under the null hypothesis, that a t statistic with the appropriate degrees of freedom would be as large as the one we observed. It is not the probability that the null hypothesis is true, nor the probability that one particular MCS graduate out-earns one particular English graduate; that is the error our friend has made.

(c) How would you obtain the same results in (a) using the lm command in R? Approximately correct syntax is OK here.

group = factor(c(rep("MCS", 20), rep("English", 20)))
Y = c(mcs, english)
summary(lm(Y ~ group))
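Part (c)'s claim that regressing salary on a group indicator reproduces the two-sample t-test can be checked numerically. The sketch below is not part of the exam: the actual mcs and english vectors are not given, so it uses made-up salary data, and it uses plain numpy in place of R. It computes the pooled t statistic directly and then again as the slope t statistic from the regression.

```python
# A numeric check (with made-up data) that the pooled two-sample t statistic
# equals the t statistic for the group slope in the regression Y ~ group.
import numpy as np

rng = np.random.default_rng(0)
mcs = 69000 + 9000 * rng.standard_normal(20)       # hypothetical salaries
english = 59000 + 9000 * rng.standard_normal(20)   # hypothetical salaries

# Pooled two-sample t statistic (equal variances, as in var.equal=T).
n1, n2 = len(mcs), len(english)
sp2 = ((n1 - 1) * mcs.var(ddof=1) + (n2 - 1) * english.var(ddof=1)) / (n1 + n2 - 2)
t_twosample = (mcs.mean() - english.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

# Regression of salary on a 0/1 group indicator: the slope is the difference
# in group means, and slope / SE(slope) is the same t statistic.
y = np.concatenate([mcs, english])
x = np.concatenate([np.ones(n1), np.zeros(n2)])
X = np.column_stack([np.ones(n1 + n2), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n1 + n2 - 2)             # equals the pooled variance
cov = sigma2 * np.linalg.inv(X.T @ X)
t_regression = beta[1] / np.sqrt(cov[1, 1])

print(t_twosample, t_regression)                   # identical up to rounding
```

The equality is exact, not approximate: with a 0/1 indicator the residual degrees of freedom are n1 + n2 − 2 and the residual variance equals the pooled variance.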

Q. 2) The study begun in Q. 1) was continued, expanding to several fields: education, finance and technology; as well as an additional degree: electrical engineering (EE). The means in each group were (after rounding)

           Education   Finance   Technology
EE             45000     70000        80000
MCS            45000     70000        65000
English        50000     60000        50000

(a) Sketch the interaction plot for this model, depicting the data in the above table. From this plot, does the type of degree affect your starting salary? What about the field you begin work in? Are there any interactions? Explain.

[Interaction plot not reproduced in this transcription: mean salary on the y-axis against field on the x-axis, with 1=Education, 2=Finance, 3=Technology, and one line per degree.]

As the graphs are not flat, we conclude there is evidence for a main effect of field. As the graphs do not lie one on top of another, we conclude there is evidence for a main effect of degree earned. As the graphs are not parallel, we see there is evidence of an interaction between the field one works in and the degree earned.

(b) The output below results from fitting this model. What kind of a regression model is it?

> anova(lm(Salary ~ Degree * Field))
Analysis of Variance Table

Response: Salary
              Df     Sum Sq Mean Sq F value
Degree         2 6.2969e+09       ?       ?
Field          2 1.7624e+10       ?       ?
Degree:Field   4 6.7457e+09       ?       ?
Residuals    171 1.4860e+10       ?
---

Make a table that includes all values overwritten with ?'s above. (Don't worry if you don't have a calculator, fractions are fine.)

This is a 2-way ANOVA model. Here is the completed table:

> anova(lm(Salary ~ Degree * Field))
Analysis of Variance Table

Response: Salary
              Df     Sum Sq         Mean Sq  F value
Degree         2 6.2969e+09    6.2969e+09/2  (6.2969e+09/2) / (1.4860e+10/171) = 36.2
Field          2 1.7624e+10    1.7624e+10/2  (1.7624e+10/2) / (1.4860e+10/171) = 101.4
Degree:Field   4 6.7457e+09    6.7457e+09/4  (6.7457e+09/4) / (1.4860e+10/171) = 19.4
Residuals    171 1.4860e+10  1.4860e+10/171
---
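As a quick arithmetic check (not part of the exam), the fractions above can be evaluated in Python using only the Df and Sum Sq columns printed by anova():

```python
# Evaluate the mean squares and F values filled in above:
# F(term) = MS(term) / MS(Residuals), where MS = Sum Sq / Df.
ss = {"Degree": (6.2969e9, 2), "Field": (1.7624e10, 2), "Degree:Field": (6.7457e9, 4)}
ss_resid, df_resid = 1.4860e10, 171
ms_resid = ss_resid / df_resid

f_values = {}
for term, (s, df) in ss.items():
    f_values[term] = (s / df) / ms_resid

print({k: round(v, 1) for k, v in f_values.items()})
# → {'Degree': 36.2, 'Field': 101.4, 'Degree:Field': 19.4}
```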

(c) How would you compute p-values for each entry of the F value column if you had R? Without explicitly computing p-values, do the F statistics support your conclusions in (a)? State the hypothesis that each F value is testing. You can use the fact that, with many degrees of freedom in the denominator, an F statistic has expected value approximately 1 if the null hypothesis is true.

The p-values would be

1 - pf(36.2, 2, 171)
1 - pf(101.4, 2, 171)
1 - pf(19.4, 4, 171)

The three F values test, respectively, the null hypotheses of no main effect of Degree, no main effect of Field, and no Degree:Field interaction. As all of these F statistics are much larger than 1, which is roughly the expected value of an F statistic with 171 denominator degrees of freedom under the null, we conclude there are main effects of both Degree and Field as well as an interaction, consistent with (a).
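For reference (not part of the exam), the same tail areas can be computed in Python: scipy's f.sf is the survival function of the F distribution, the analogue of R's 1 - pf, and avoids the loss of precision of subtracting a CDF value near 1.

```python
# Upper-tail F probabilities mirroring the three 1 - pf(...) calls above.
from scipy.stats import f

p_degree = f.sf(36.2, 2, 171)        # H0: no main effect of Degree
p_field = f.sf(101.4, 2, 171)        # H0: no main effect of Field
p_interaction = f.sf(19.4, 4, 171)   # H0: no Degree:Field interaction

# All three are vanishingly small, consistent with rejecting each null.
print(p_degree, p_field, p_interaction)
```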

Q. 3) The incidence of landslides in Northern California increases with the amount of rain that falls in any given year. Suppose we are given data of the following form, collected for each month over several years.

Landslides: The number of landslide events reported in the Santa Cruz mountains in that specific month.
Rainfall: The average rainfall in the Santa Cruz mountains in that month, measured in inches.
RainfallP: The average rainfall in the Santa Cruz mountains in the previous month, measured in inches.

(a) Consider the following R output:

MA = lm(Landslides ~ Rainfall + RainfallP)
MA
##
## Call:
## lm(formula = Landslides ~ Rainfall + RainfallP)
##
## Coefficients:
## (Intercept)     Rainfall    RainfallP
##      0.6699       0.1065       0.0274

MB = glm(Landslides ~ Rainfall + RainfallP, family = poisson())
MB
##
## Call:  glm(formula = Landslides ~ Rainfall + RainfallP, family = poisson())
##
## Coefficients:
## (Intercept)     Rainfall    RainfallP
##      0.1807       0.0414       0.0103
##
## Degrees of Freedom: 119 Total (i.e. Null);  117 Residual
## Null Deviance:       136
## Residual Deviance: 117    AIC: 436

Which model would you think to be the most natural model for the count of the number of landslide events in the Santa Cruz mountains? Explain.

As we are counting landslides, which arrive as non-negative integer counts of events, a Poisson model seems more appropriate. In the Poisson model the default link is the log link, meaning covariates act multiplicatively on the expected count. This is model MB.

(b) In whichever model you chose, what is the effect of receiving an additional 10 inches of rain in a given month?

We chose the Poisson model (MB) which, when fit, estimates that for every additional 10 inches of rain in a given month the expected number of landslides is multiplied by a factor of e^(10 × 0.0414) = e^0.414 ≈ 1.51. If we had chosen model (MA), we would instead estimate that the expected number of landslides increases additively by 10 × 0.1065 = 1.065.
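A quick check (not part of the exam) of the two effect sizes just computed, using the fitted coefficients printed in (a):

```python
# Multiplicative rain effect implied by the Poisson fit MB, whose Rainfall
# coefficient 0.0414 lives on the log scale.
import math

beta_rainfall_mb = 0.0414
factor = math.exp(10 * beta_rainfall_mb)   # effect of 10 extra inches of rain
print(round(factor, 3))                    # → 1.513

# For comparison, the linear model MA predicts an additive change instead.
beta_rainfall_ma = 0.1065
additive = 10 * beta_rainfall_ma
print(round(additive, 3))                  # → 1.065
```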

(c) How would you use anova to test whether the variable RainfallP is associated with Landslides? What is the null hypothesis in your test? The alternative?

Continuing with model (MB), we could test this with

anova(glm(Landslides ~ Rainfall, family=poisson()), MB, test="Chisq")

The null hypothesis is that the coefficient of RainfallP is zero, i.e. RainfallP is not associated with Landslides once Rainfall is in the model; the alternative is that this coefficient is nonzero.

(d) Suppose you hypothesize that there is a supersaturation effect. That is, when rainfall exceeds 15 inches in a given month, the relationship between Rainfall and Landslides changes. You decide to form a variable

HeavyRainfall = (Rainfall > 15)

Write the formula for a model that allows for a different slope and intercept in months of heavy rainfall. How many degrees of freedom does this model use to estimate the mean? (Assume that you have retained the variable RainfallP from above.)

The model

glm(Landslides ~ HeavyRainfall * Rainfall + RainfallP, family=poisson())

allows for different slopes and intercepts for the lines in months of heavy vs. light rainfall. This model has 5 parameters: (Intercept), Rainfall, HeavyRainfall, HeavyRainfall:Rainfall, and RainfallP, so we would say it uses 5 degrees of freedom to estimate the mean.

Gbar    N     T
30.4    7    80
32.1    4    90
36.2    9   100
33.5    3   110
38.4   11   120

Table 1: Data on average growth rate at different temperatures T. The column Gbar is the average of N different experiments at a fixed temperature T.

Q. 4) A lab scientist is interested in the effect of temperature, T, on the growth rate of a certain population of bacteria. For each of the temperatures in Table 1, the scientist performs a number of experiments, which can be found in the column N, and records the average growth rate (averaged across all experiments at that temperature) in the column Gbar of Table 1. That is, 30.4 is the average of the growth rates, G, from 7 different experiments each conducted at a temperature of 80.

The scientist is interested in fitting a linear regression model for the growth rate as a function of T. Assume that a simple linear model of the form

    G = β0 + β1 T + ε    (1)

is appropriate if the actual growth rates were observed (rather than Gbar, their averages over the N experiments). That is, ε is distributed as N(0, σ²) independently for each experiment, so its variance is constant across different values of T.

(a) The quantities Gbar are averages over several experiments. What is the relationship between the variances of the Gbar entries in Table 1 and σ² in (1)? Is their variance constant if (1) is correct?

The relationship is

    Var(Gbar_i) = Var(ε) / N_i = σ² / N_i.

The variance is therefore not constant across the rows of Table 1, since the N_i differ.

(b) You are given a choice between the following 3 outputs from R to find a confidence interval for the coefficient β1 in the model (1).

> # MODEL A
> confint(lm(Gbar ~ T, growth))
                  2.5 %     97.5 %
(Intercept) -2.45005910 35.8900591
T           -0.01581187  0.3638119

> # MODEL B
> confint(lm(Gbar ~ T, growth, weights=growth$N))
                 2.5 %     97.5 %
(Intercept) 0.13407264 31.3517067
T           0.03736585  0.3399493

> # MODEL C
> confint(lm(Gbar ~ T, growth, weights=1/growth$N))
                  2.5 %     97.5 %
(Intercept) -1.07474463 40.0286748
T           -0.06427241  0.3443322

Which of the three models above (A, B or C) have unbiased estimates of β1 if model (1) is correct?

Each of the three models has an unbiased estimate of β1: both weighted and unweighted least squares are unbiased as long as the mean model is correct, since the weights change the variance of the estimator but not its expected value.

(c) Which of the three models above (A, B or C) have valid confidence intervals for β1 if model (1) is correct? If a model has invalid confidence intervals, what is wrong with the confidence interval?

While the three models all have unbiased estimates of β1, they compute standard errors differently. As we know Var(Gbar_i) = σ²/N_i, the correct weights are proportional to 1/Var(Gbar_i), i.e. weights = N_i: this is model B. The other models will have valid centers for their confidence intervals, but their widths will be incorrect because their standard errors are computed under the wrong variance model (A assumes constant variance; C uses weights 1/N_i, the reverse of the correct weighting).
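The claim that model B's weighting is the correct one can be illustrated by simulation. The sketch below is not from the exam: it assumes hypothetical values of β0, β1 and σ, generates Gbar responses with Var(Gbar_i) = σ²/N_i as derived in (a), fits weighted least squares with weights N_i, and checks that the model-based variance σ²[(X'WX)⁻¹]₁₁ matches the empirical sampling variance of the slope estimate.

```python
# Simulation sketch: with weights w_i = N_i, weighted least squares gives the
# advertised variance for the slope when responses are averages of N_i runs.
import numpy as np

rng = np.random.default_rng(2)
T = np.array([80.0, 90.0, 100.0, 110.0, 120.0])   # temperatures from Table 1
N = np.array([7, 4, 9, 3, 11])                    # replicate counts from Table 1
beta0, beta1, sigma = 10.0, 0.2, 3.0              # hypothetical truth, not estimated

X = np.column_stack([np.ones_like(T), T])
W = np.diag(N.astype(float))                      # weights proportional to 1/Var(Gbar_i)
XtWX_inv = np.linalg.inv(X.T @ W @ X)

slopes = []
for _ in range(50_000):
    # Each Gbar_i is an average of N_i experiments, so its sd is sigma/sqrt(N_i).
    Gbar = beta0 + beta1 * T + sigma * rng.standard_normal(5) / np.sqrt(N)
    beta_hat = XtWX_inv @ X.T @ W @ Gbar          # weighted least squares fit
    slopes.append(beta_hat[1])

# Model-based variance of the WLS slope: sigma^2 * [(X'WX)^{-1}]_{11}.
var_model = sigma**2 * XtWX_inv[1, 1]
print(np.var(slopes), var_model)                  # the two should agree closely
```

With the wrong weights (all ones as in model A, or 1/N_i as in model C) the estimator stays unbiased, but the analogous model-based variance formula no longer matches the simulation, which is exactly why those confidence intervals have the wrong width.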