Regression Analysis: Exploring relationships between variables. Stat 251


Introduction The objective of regression analysis is to explore the relationship between two (or more) variables so that information can be acquired about one of them through knowing values of the other(s). When x and y are deterministically related, once we are told the value of x, y is completely specified (by an equation, or what we call a model; think of functions from pre-calculus). As an example, suppose we want to rent a van for a day and the rental cost is $25.00 plus $0.30 per mile driven. Let x = # miles driven and y = the rental charge. Then the model for this situation is y = 25 + 0.3x. Drive the van 100 miles and the cost is y = 25 + 0.3(100) = $55. We call these relationship analyses regression, least-squares regression, ordinary least-squares regression, and more.
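The deterministic van-rental model above can be sketched in a few lines (Python here rather than the R used later in the course; the numbers come straight from the slide):

```python
# Deterministic model from the slide: y = 25 + 0.3x
def rental_cost(miles, base=25.00, per_mile=0.30):
    """Daily van-rental charge: $25.00 plus $0.30 per mile driven."""
    return base + per_mile * miles

print(rental_cost(100))  # driving 100 miles costs $55
```

Because the relationship is deterministic, knowing x pins down y exactly; regression deals with the more realistic case where it does not.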

Visual Exploration (graphs) The graph of choice for visually examining the relationship between x and y is called a scatterplot. Once we have determined the linear relationship between x and y, we can add in a regression line.

Terms and Concepts I We call these relationship analyses regression, or least-squares regression. Prediction: reference to future values. Estimation or explanation: reference to current or past events. Both fall under the blanket term prediction. The simplest deterministic mathematical relationship is a linear relationship (think y = mx + b).

Terms and Concepts II The model we will use is y = β0 + β1x + ε, where β0 is the y-intercept (the b part in y = mx + b), β1 is the slope of the line (the m part in y = mx + b), and ε is an error term (the difference between the observed y-value and the predicted y-value). The regression equation we will use is ŷ = b0 + b1x.

Assumptions of Regression
1. E(ε) = 0: the expected value (mean) of the residuals (errors) is 0
2. Var(ε) = σ² for all x: the variance is constant for all values of x
3. The residuals are independent
4. ε ~ N(0, σ²): the residuals should be approximately normal with mean = 0 and constant variance
(5. The underlying assumption is that there is in fact a linear relationship between x and y)

Formulas
b1 = Sxy / Sxx, where Sxy = Σxy − (Σx)(Σy)/n and Sxx = Σx² − (Σx)²/n
b0 = ȳ − b1x̄
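As a sketch of these computational formulas (Python rather than the course's R, with a small hypothetical hours-vs-score dataset, since the slide's data is not reproduced here), we can compute b1 and b0 from the raw sums and cross-check against NumPy's least-squares fit:

```python
import numpy as np

# Hypothetical hours-studied (x) vs. Math SAT score (y) data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([450.0, 480.0, 500.0, 550.0, 580.0, 620.0])
n = len(x)

# Computational formulas from the slide
Sxy = np.sum(x * y) - np.sum(x) * np.sum(y) / n
Sxx = np.sum(x ** 2) - np.sum(x) ** 2 / n
b1 = Sxy / Sxx                  # slope
b0 = y.mean() - b1 * x.mean()   # intercept

# Cross-check against NumPy's least-squares fit
slope, intercept = np.polyfit(x, y, 1)
print(b1, b0)
```

Both routes give the same line, which is the point: the formulas are just the closed-form solution of the least-squares problem.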

Example I Given are the dataset and summary statistics for a study that investigates the relationship between the number of hours spent studying for the Math SAT and the score earned. Find the equation of the least-squares regression line. To graph the line, choose the smallest and the largest x-value and plug them into the equation for the two points needed to graph. Use the equation to predict y (find ŷ) when x = 2 and when x = 15.
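A minimal sketch of the prediction step, assuming a hypothetical dataset in place of the one on the slide (Python rather than R):

```python
import numpy as np

# Hypothetical hours-studied (x) vs. Math SAT score (y) data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([450.0, 480.0, 500.0, 550.0, 580.0, 620.0])

b1, b0 = np.polyfit(x, y, 1)    # slope, intercept
line = np.poly1d([b1, b0])      # line(x) evaluates y-hat = b0 + b1*x

print(line(2))   # predicted score after 2 hours of study
print(line(15))  # x = 15 is far outside this data range, so this is extrapolation
```

Note the caution built into the second prediction: the fitted line is only trustworthy over the range of x-values actually observed.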

Terms and Concepts III The slope can be interpreted as the change in y per unit change in x, meaning that for every one-unit increase in x, y will change by the amount of the slope (positive or negative depending on the sign of the slope). The intercept can be interpreted as the value of y when x = 0. A lot of the time this may not make sense within the context of the dataset (the equation may give a value of y that is impossible when x = 0, especially when there is no value of x near 0 in the original dataset).

How Strong is the Relationship? There is a measure of the strength and direction of the linear relationship between x and y (when both x and y are quantitative), called correlation. Notation: ρ, called rho, is the population parameter (it looks like a p, and it should, as it is where our letter p came from); r, just called r, is the sample statistic. Properties of correlation:
Cannot be used to measure qualitative (categorical) variables like gender and marital status
Is unitless, meaning it has no units of measurement associated with it; changing units of measurement does not change r
−1 ≤ r ≤ 1
If r = −1, the linear relationship is strong (perfect in this case) and negative (the direction of the slope); if r = 1, the linear relationship is strong (perfect in this case) and positive
If r = 0, there is no linear relationship
r makes no distinction between x and y
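Two of these properties are easy to verify numerically. A short sketch (Python, hypothetical data) showing that r is unchanged by a change of units and by swapping x and y:

```python
import numpy as np

# Hypothetical quantitative paired data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([450.0, 480.0, 500.0, 550.0, 580.0, 620.0])

r = np.corrcoef(x, y)[0, 1]

# r is unitless: converting x from hours to minutes leaves it unchanged
r_minutes = np.corrcoef(x * 60, y)[0, 1]

# r makes no distinction between x and y
r_swapped = np.corrcoef(y, x)[0, 1]
print(r, r_minutes, r_swapped)
```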

Correlation

More About Correlation In addition to all the other things we have learned about r, the most important is what happens when we square r (and convert to a %). R² = r² × 100 to obtain a %. R² is called the Coefficient of Determination: it is the % of the variation in y that can be explained by the linear relationship between x and y. The higher it is, the better our model is (meaning it fits the data). 0% ≤ R² ≤ 100%. It is used in model selection to determine if the model we chose is the best fit for the data.

Example II Given are a dataset and its summary statistics. Calculate r and R².
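A sketch of the Example II computation, using a hypothetical dataset in place of the one on the slide:

```python
import numpy as np

# Hypothetical dataset standing in for the one on the slide
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([450.0, 480.0, 500.0, 550.0, 580.0, 620.0])

r = np.corrcoef(x, y)[0, 1]
R2 = r ** 2 * 100   # coefficient of determination, as a %

print(round(r, 4), round(R2, 1))
```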

Tests and CIs There are several tests associated with regression, all doing different things while looking similar (but not exactly alike). Tests: Test and CI for the parameter estimates (slope and intercept). Test for overall goodness of fit (does the data fit the model to some degree?); requires hefty calculations (statistical software output is easier). CI for the mean response; involves use of the regression line equation. PI (Prediction Interval) for a single response; involves use of the regression line equation.

Tests and CIs for β1 (Slope)
CI: b1 ± t(α/2, n−2) · s(b1), where s(b1) = s/√Sxx and s = √(SSE/(n−2))
Test: t = b1 / s(b1), df = n − 2
Hypotheses (2-tail tests only): H0: β1 = 0; Ha: β1 ≠ 0
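A sketch of the slope test and CI (Python's scipy in place of the course's R; the data is hypothetical, and `stderr` returned by `linregress` is the s(b1) of the slide):

```python
import numpy as np
from scipy import stats

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([450.0, 480.0, 500.0, 550.0, 580.0, 620.0])
n = len(x)

res = stats.linregress(x, y)    # slope, intercept, rvalue, pvalue, stderr, ...

# t statistic for H0: beta1 = 0 vs Ha: beta1 != 0
t_stat = res.slope / res.stderr

# 95% CI for the slope: b1 +/- t(0.025, n-2) * s(b1)
t_crit = stats.t.ppf(0.975, df=n - 2)
ci = (res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr)
print(t_stat, res.pvalue, ci)
```

The two-sided p-value reported by `linregress` matches the hand-built t test with n − 2 df.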

Tests and CIs for β0 (Intercept)
CI: b0 ± t(α/2, n−2) · s(b0), where s(b0) = s · √(1/n + x̄²/Sxx)
Test: t = b0 / s(b0), df = n − 2
Hypotheses (2-tail tests only): H0: β0 = 0; Ha: β0 ≠ 0

Overall Goodness-of-Fit (GoF) Test You will not be calculating this test by hand; rather, you will read the output from R, a statistical software package, and interpret the results. H0: The linear regression model is not appropriate. Ha: The linear regression model is appropriate. Look at the F test statistic and p-value to determine whether or not the linear regression model is appropriate. In simple linear regression (SLR), the F test for the model is the same as the t-test for the slope (the results will be the same). The main difference will be the values of the test statistics, but the p-values will be the same. The F and t distributions are related as follows: F = t², i.e., an F with (1, n−2) df is the square of a t with n−2 df. Reject H0 iff p-value ≤ α.
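The F = t² identity in SLR can be checked numerically. A sketch (Python, hypothetical data) that builds F from the ANOVA decomposition and compares it to the squared slope t statistic:

```python
import numpy as np
from scipy import stats

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([450.0, 480.0, 500.0, 550.0, 580.0, 620.0])
n = len(x)

res = stats.linregress(x, y)
t_stat = res.slope / res.stderr         # t-test statistic for the slope

# Overall F statistic from the ANOVA decomposition: F = MSR / MSE
y_hat = res.intercept + res.slope * x
ssr = np.sum((y_hat - y.mean()) ** 2)   # regression sum of squares, df1 = 1
sse = np.sum((y - y_hat) ** 2)          # error sum of squares, df2 = n - 2
F = (ssr / 1) / (sse / (n - 2))

print(F, t_stat ** 2)   # in SLR these agree
```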

Test and CI for ρ (Correlation) Test only: the CI is a bit on the complex side (feasible but complex), so we will only worry about the test.
t = r√(n − 2) / √(1 − r²), df = n − 2
Rejection region is the same as usual.
Hypotheses: H0: ρ = 0; Ha: ρ ≠ 0
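A sketch of this test (Python, hypothetical data), checking the hand-built t statistic against the p-value that `scipy.stats.pearsonr` reports:

```python
import numpy as np
from scipy import stats

# Hypothetical data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([450.0, 480.0, 500.0, 550.0, 580.0, 620.0])
n = len(x)

r, p_value = stats.pearsonr(x, y)

# Hand computation: t = r * sqrt(n - 2) / sqrt(1 - r^2), df = n - 2
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p_hand = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(p_value, p_hand)   # the two p-values agree
```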

Understanding R Output The lm() function in R fits the linear model; the summary() function displays the results. In the output you can find: the coefficients (intercept b0 and slope b1; the values for the regression equation ŷ = b0 + b1x are found here), the std. errors (se) of the coefficients, the test statistics and p-values for the 2-tail tests of β0 and β1 (H0: = 0; Ha: ≠ 0), the df of the t-tests for β0 and β1, R² (×100 gives a %), and the F test statistic with its df1 and df2 and the p-value for the F test.

Checking Assumptions There are diagnostic graphs we will check to make sure the assumptions of regression analysis are met Listed below are the assumptions along with the graphs to check and a description of what to look for

Diagnostics I The residuals should have a mean of 0. That means that, roughly, the histogram of the residuals should be centered around 0. This one has its highest bar right around 0, so this assumption has been met.
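In fact, for any least-squares fit with an intercept the residuals average to exactly 0 (up to rounding), which is why the histogram check is about shape and centering rather than the mean itself. A quick sketch with hypothetical data:

```python
import numpy as np

# Hypothetical data; any least-squares fit with an intercept behaves the same way
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([450.0, 480.0, 500.0, 550.0, 580.0, 620.0])

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# Least squares forces the residuals to average to 0
print(residuals.mean())
```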

Diagnostics II Constant variance. [Figure: residual plots contrasting constant variance with decreasing and increasing variance]

Diagnostics III Independence is a bit more complex to check than the other assumptions. Independence of the errors is important because if the errors are dependent, then we have issues with the reliability of our inferences. So there is no calculation to check here; assume it is met.

Diagnostics IV The histogram of the residuals is centered at 0 and looks normal, and in the normal Q-Q plot the points fall mostly along the y = x line, so the normality assumption is met.
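The Q-Q check can also be done numerically. A sketch using simulated normal residuals (the correlation of the Q-Q points, returned by `scipy.stats.probplot`, should be close to 1 when the residuals are normal):

```python
import numpy as np
from scipy import stats

# Simulated residuals drawn from a normal distribution (fixed seed)
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=1.0, size=200)

# probplot returns the Q-Q points plus the fitted reference line;
# a correlation qq_r close to 1 is consistent with normal residuals
(osm, osr), (qq_slope, qq_intercept, qq_r) = stats.probplot(residuals, dist="norm")
print(qq_r)
```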

(Diagnostics V: linear relationship; check the scatterplot of y vs. x for a linear pattern)