Example: 1982 State SAT Scores (First year state by state data available)


Lecture 11
Review of Section 3.5 from last Monday (on board)
Overview of today's example (on board)
Section 3.6, continued: nested F-tests (review on board first)
Section 3.4: interaction for quantitative variables (on board)
Polynomial regression, especially quadratic
Second-order models (including interaction)

Example: 1982 State SAT Scores (first year state-by-state data available)
Unit = a state in the United States
Response variable: Y = average combined SAT score
Potential predictors:
X1 = Takers = % taking the exam out of all eligible students in that state
X2 = Expend = amount spent by the state on public secondary schools, per student (in $100s)
Is Y related to one or both of these variables?
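As a setup sketch (the file name StateSAT.csv and the exact column layout are assumptions, not from the lecture; the column names follow the lm() calls used later):

    # Hypothetical setup: read the 1982 state SAT data into a data frame.
    StateSAT <- read.csv("StateSAT.csv")   # assumed file name
    str(StateSAT)   # expect columns sat, takers, expend for the 50 states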

Example: State SAT with X1 only
[Scatterplot: Y = combined SAT (800 to 1050) vs. X1 = % taking SAT (0 to 70)]
Things to notice:
Two clusters in the Takers range. Why?
Possible curved relationship.
As %Takers goes up, average SAT goes down.

Example: State SAT with X1 only
Would a curved line work better?
[Scatterplot: SAT vs. Takers with the linear fit; plot of mod1$resid vs. mod1$fitted: residuals vs. fitted values]
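A minimal R sketch of this first-pass fit and its diagnostic plot, assuming the StateSAT data frame above:

    # Simple linear model: SAT on % takers
    mod1 <- lm(sat ~ takers, data = StateSAT)

    # Scatterplot with the fitted line
    plot(sat ~ takers, data = StateSAT, xlab = "Takers (%)", ylab = "Combined SAT")
    abline(mod1)

    # Residuals vs fitted values: curvature here suggests a quadratic term
    plot(mod1$fitted, mod1$resid, xlab = "Fitted values", ylab = "Residuals")
    abline(h = 0, lty = 2)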

Example: State SAT with X2 only
[Scatterplot: Y = combined SAT vs. X2 = Expenditure]
It is not clear what the pattern is; Alaska is very influential.

Polynomial Regression
For a single predictor X:
Y = β0 + β1 X + ε (linear)
Y = β0 + β1 X + β2 X^2 + ε (quadratic; curve)
Y = β0 + β1 X + β2 X^2 + β3 X^3 + ε (cubic)
Y = β0 + β1 X + β2 X^2 + ... + βp X^p + ε (degree p)

Polynomial Regression in R
Method #1: Create new columns with powers of the predictor.
To avoid creating a new column:
Method #2: Use I( ) in the lm( ) call: quadmod = lm(sat ~ takers + I(takers^2))
Method #3: Use poly( ): quadmod = lm(sat ~ poly(takers, degree = 2, raw = TRUE))
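Method #1 is described but not shown; a sketch of all three, assuming the StateSAT data frame (all three fit the same quadratic):

    # Method #1: create a new column holding the squared predictor
    StateSAT$takers2 <- StateSAT$takers^2
    quadmod1 <- lm(sat ~ takers + takers2, data = StateSAT)

    # Method #2: I() inside the formula, no new column needed
    quadmod2 <- lm(sat ~ takers + I(takers^2), data = StateSAT)

    # Method #3: poly() with raw = TRUE for raw (not orthogonal) polynomials
    quadmod3 <- lm(sat ~ poly(takers, degree = 2, raw = TRUE), data = StateSAT)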

Quadratic Model
> Quad <- lm(sat ~ takers + I(takers^2), data = StateSAT)
> summary(Quad)
Call: lm(formula = sat ~ takers + I(takers^2), data = StateSAT)
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 1053.13110    9.27370 113.561  < 2e-16 ***
takers        -7.16159    0.89220  -8.027  2.3e-10 ***
I(takers^2)    0.07102    0.01405   5.055 6.99e-06 ***
Residual standard error: 29.93 on 47 degrees of freedom
Multiple R-squared: 0.8289, Adjusted R-squared: 0.8216
F-statistic: 113.8 on 2 and 47 DF, p-value: < 2.2e-16

Residual Plot Looks Good (Two clusters still obvious)

How to Choose the Polynomial Degree? Use the minimum degree needed to capture the structure of the data. Check the t-test for the highest power. (Generally) keep lower powers even if not significant.
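One way to apply these rules in R, as a sketch: fit one degree higher and read the t-test on the top power (StateSAT as above).

    # Does a cubic term add anything beyond the quadratic?
    cubicmod <- lm(sat ~ poly(takers, degree = 3, raw = TRUE), data = StateSAT)
    summary(cubicmod)   # check the t-test on the degree-3 coefficient;
                        # if it is not significant, fall back to the quadratic,
                        # keeping the lower powers in the model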

Recall: Interaction
Active = β0 + β1 Rest + β2 Gender + β3 Rest*Gender
The product term allows different Active-vs-Rest slopes for the two genders.
In general: interaction is present if the relationship between two variables (e.g., Y and X1) changes depending on a third variable (e.g., X2).
Modeling tip: include a product term to account for interaction.
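In R, that model would look like the sketch below; the data frame pulse and its column names are hypothetical stand-ins:

    # Interaction model: separate Active-vs-Rest slopes by gender
    intmod <- lm(Active ~ Rest + Gender + Rest:Gender, data = pulse)

    # Equivalent shorthand: Rest*Gender expands to Rest + Gender + Rest:Gender
    intmod2 <- lm(Active ~ Rest * Gender, data = pulse)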

Complete Second-order Models
Definition: A complete second-order model for two predictors is
Y = β0 + β1 X1 + β2 X2 + β3 X1^2 + β4 X2^2 + β5 X1 X2 + ε
with first-order terms (β1 X1 + β2 X2), quadratic terms (β3 X1^2 + β4 X2^2), and an interaction term (β5 X1 X2).

Special cases (R sketches below): just linear for both; linear + X1 X2 interaction; linear + quadratic; complete second-order.
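In R formula syntax, the four variants might be written as follows; this is a sketch with generic placeholders y, x1, x2, and dat, not the lecture's code:

    linmod  <- lm(y ~ x1 + x2, data = dat)                              # just linear for both
    intmod  <- lm(y ~ x1 + x2 + x1:x2, data = dat)                      # linear + interaction
    quadmod <- lm(y ~ x1 + I(x1^2) + x2 + I(x2^2), data = dat)          # linear + quadratic
    fullmod <- lm(y ~ x1 + I(x1^2) + x2 + I(x2^2) + x1:x2, data = dat)  # complete second-order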

Second-order Model for State SAT
Example: Try a full second-order model for Y = SAT using X1 = Takers and X2 = Expend:
Y = β0 + β1 X1 + β2 X2 + β3 X1^2 + β4 X2^2 + β5 X1 X2 + ε
secondorder = lm(sat ~ takers + I(takers^2) + expend + I(expend^2) + takers:expend, data = StateSAT)

Second-order Model for State SAT
> summary(secondorder)
Call: lm(formula = sat ~ takers + I(takers^2) + expend + I(expend^2) + takers:expend, data = StateSAT)
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    893.66830   36.14094  24.727  < 2e-16 ***
takers          -7.05561    0.83740  -8.426 9.96e-11 ***
I(takers^2)      0.07725    0.01328   5.816 6.28e-07 ***
expend          10.33333    2.49600   4.140 0.000155 ***
I(expend^2)     -0.11775    0.04426  -2.660 0.010851 *
takers:expend   -0.03344    0.03716  -0.900 0.373087
Residual standard error: 23.68 on 44 degrees of freedom
Multiple R-squared: 0.8997, Adjusted R-squared: 0.8883
F-statistic: 78.96 on 5 and 44 DF, p-value: < 2.2e-16
Do we really need the quadratic terms? Nested F-test.

anova(secondorder) [FULL MODEL]
              Df Sum Sq Mean Sq  F value    Pr(>F)
takers         1 181024  181024 322.8794 < 2.2e-16 ***
I(takers^2)    1  22886   22886  40.8198 9.035e-08 ***
expend         1  11700   11700  20.8678 3.956e-05 ***
I(expend^2)    1   5278    5278   9.4148  0.003677 **
takers:expend  1    454     454   0.8098  0.373087
Residuals     44  24669     561

firstorder = lm(sat ~ takers * expend, data = StateSAT)  [REDUCED = NO QUADRATIC]
anova(firstorder)
anova(firstorder, secondorder)  [COMPARE]
The quadratic terms are significant as a pair (as well as individually).


Do we really need the terms with Expend? Nested F-test: simultaneously test all three terms involving Expend in the second-order model, against the reduced model with Takers only.
From anova(secondorder): the three Expend terms explain 11700 + 5278 + 454 = 17432 of the sum of squares.
Test statistic: t.s. = (17432 / 3) / (24669 / 44) = 10.36
Compare to F(3, 44): P-value = 0.000028 (next slide).
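The arithmetic can be checked directly; a sketch using the sums of squares from the ANOVA table (only pf() from base R is needed):

    # Nested F-test by hand: 3 extra parameters, 44 residual df in the full model
    sse_full    <- 24669            # residual SS, full second-order model
    sse_reduced <- 24669 + 17432    # residual SS, Takers-only quadratic model
    F_stat <- ((sse_reduced - sse_full) / 3) / (sse_full / 44)
    F_stat                          # about 10.36
    pf(F_stat, df1 = 3, df2 = 44, lower.tail = FALSE)   # P-value, about 0.000028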

anova(secondorder) [FULL MODEL]
              Df Sum Sq Mean Sq  F value    Pr(>F)
takers         1 181024  181024 322.8794 < 2.2e-16 ***
I(takers^2)    1  22886   22886  40.8198 9.035e-08 ***
expend         1  11700   11700  20.8678 3.956e-05 ***
I(expend^2)    1   5278    5278   9.4148  0.003677 **
takers:expend  1    454     454   0.8098  0.373087
Residuals     44  24669     561

Takersmodel = lm(sat ~ takers + I(takers^2), data = StateSAT)  [REDUCED MODEL]
anova(Takersmodel)
anova(Takersmodel, secondorder)  [COMPARE]
The three new predictors reduce the SSE by 17432, a significant amount.

Second-order Model for State SAT
> summary(secondorder)
(output repeated from above; the relevant row:)
takers:expend   -0.03344    0.03716  -0.900 0.373087
Do we really need the interaction? t-test for takers:expend: t = -0.900, P-value = 0.373.

SUMMARY
The full second-order model is better than the model with no quadratic terms.
The full second-order model is better than the quadratic model with Takers only.
The model with no interaction seems acceptable.
Comparing adjusted R-squared:
Full model: 88.83%
No interaction: 88.88%
No quadratic terms: 76.38%
Takers only, quadratic: 82.16%
Expend only, quadratic: -3.59%, partly because of the extreme outlier for Alaska!
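This adjusted R-squared comparison can be assembled programmatically; a sketch, assuming the StateSAT data frame:

    # Collect adjusted R-squared (in %) across the candidate models
    models <- list(
      full        = lm(sat ~ takers + I(takers^2) + expend + I(expend^2) + takers:expend,
                       data = StateSAT),
      no_interact = lm(sat ~ takers + I(takers^2) + expend + I(expend^2), data = StateSAT),
      no_quad     = lm(sat ~ takers * expend, data = StateSAT),
      takers_quad = lm(sat ~ takers + I(takers^2), data = StateSAT),
      expend_quad = lm(sat ~ expend + I(expend^2), data = StateSAT)
    )
    round(100 * sapply(models, function(m) summary(m)$adj.r.squared), 2)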