Unit 6 - Introduction to linear regression

Suggested reading: OpenIntro Statistics, Chapter 7
Suggested exercises:
- Part 1 - Relationship between two numerical variables: 7.7, 7.9, 7.11, 7.13, 7.15, 7.25, 7.27
- Part 2 - Linear regression with a single predictor: 7.17, 7.19, 7.23
- Part 3 - Inference for linear regression: 7.29, 7.31, 7.33

Reading: Section 7.1 of OpenIntro Statistics

LO 1. Define the explanatory variable as the independent variable (predictor), and the response variable as the dependent variable (predicted).

LO 2. Plot the explanatory variable (x) on the x-axis and the response variable (y) on the y-axis, and fit a linear regression model y = β0 + β1 x, where β0 is the intercept and β1 is the slope.
- Note that the point estimates (estimated from observed data) for β0 and β1 are b0 and b1, respectively.

LO 3. When describing the association between two numerical variables, evaluate
- direction: positive (as x increases, y increases) or negative (as x increases, y decreases)
- form: linear or not
- strength: determined by the scatter around the underlying relationship

LO 4. Define correlation as the linear association between two numerical variables.
- Note that a nonlinear relationship is simply called an association.

LO 5. Note that the correlation coefficient (R, also called Pearson's R) has the following properties:
- the magnitude (absolute value) of the correlation coefficient measures the strength of the linear association between two numerical variables
- the sign of the correlation coefficient indicates the direction of association
- the correlation coefficient is always between -1 and 1: -1 indicates perfect negative linear association, +1 indicates perfect positive linear association, and 0 indicates no linear relationship
- the correlation coefficient is unitless
- since the correlation coefficient is unitless, it is not affected by changes in the center or scale of either variable (such as unit conversions)
- the correlation of X with Y is the same as the correlation of Y with X

- the correlation coefficient is sensitive to outliers

LO 6. Recall that correlation does not imply causation.

LO 7. Define a residual (e) as the difference between the observed (y) and predicted (ŷ) values of the response variable: e_i = y_i − ŷ_i

Reading: Section 7.1 of OpenIntro Statistics

Test yourself:
1. Someone hands you the scatter diagram shown below, but has forgotten to label the axes. Can you calculate the correlation coefficient, or do you need the labels?
2. A teaching assistant gives a quiz. There are 10 questions on the quiz and no partial credit is given. After grading the papers, the TA writes down for each student the number of questions the student got right and the number wrong. What is the correlation of the number of questions right and wrong? Hint: Make up some data for the number of questions right, calculate the number of questions wrong, and plot them against each other.
3. Suppose you fit a linear regression model predicting score on an exam from the number of hours studied. Say you've studied for 4 hours. Would you prefer to be on the line, below the line, or above the line? What would the residual for your score be (0, negative, or positive)?

Reading: Section 7.2 of OpenIntro Statistics

LO 8. Define the least squares line as the line that minimizes the sum of the squared residuals, and list the conditions necessary for fitting such a line:
(1) linearity
(2) nearly normal residuals
(3) constant variability

LO 9. Define an indicator variable as a binary explanatory variable (with two levels).

LO 10. Calculate the estimate for the slope (b1) as b1 = R × (s_y / s_x),

where R is the correlation coefficient, s_y is the standard deviation of the response variable, and s_x is the standard deviation of the explanatory variable.

LO 11. Interpret the slope as follows:
- when x is numerical: "For each unit increase in x, we would expect y to be higher/lower on average by b1 units."
- when x is categorical: "The value of the response variable is predicted to be b1 units higher/lower for the other level of the explanatory variable than for the baseline level."
- Note that whether the response variable increases or decreases is determined by the sign of b1.

LO 12. Note that the least squares line always passes through the point of averages of the explanatory and response variables, (x̄, ȳ).

LO 13. Use the above property to calculate the estimate for the intercept (b0) as b0 = ȳ − b1 x̄, where b1 is the slope, ȳ is the average of the response variable, and x̄ is the average of the explanatory variable.

LO 14. Interpret the intercept as follows:
- when x is numerical: "When x = 0, we would expect y to equal, on average, b0."
- when x is categorical: "The expected average value of the response variable for the reference level of the explanatory variable is b0."

LO 15. Predict the value of the response variable for a given value of the explanatory variable, x, by plugging x into the linear model: ŷ = b0 + b1 x
- Only predict for values of x that are in the range of the observed data.
- Do not extrapolate beyond the range of the data unless you are confident that the linear pattern continues.

LO 16. Define R² as the percentage of the variability in the response variable explained by the explanatory variable.
- For a good model, we would like this number to be as close to 100% as possible.
- This value is calculated as the square of the correlation coefficient.

Test yourself:
1. We would not want to fit a least squares line to the data shown in the scatterplot below. Which of the conditions does it appear to violate?

2. Derive the formula for b0, given that the linear model is ŷ = b0 + b1 x and that the least squares line goes through (x̄, ȳ).
3. One study on male college students found their average height to be 70 inches with a standard deviation of 2 inches. Their average weight was 140 pounds, with a standard deviation of 25 pounds. The correlation between their height and weight was 0.60. Assuming that the two variables are linearly associated, write the linear model for predicting weight from height.
4. Is a male who is 72 inches tall and weighs 115 pounds on the line, below the line, or above the line?
5. Describe what an indicator variable is, and what levels 0 and 1 mean for such variables.
6. If the correlation between two variables y and x is 0.6, what percent of the variability in y does x explain?
7. The model below predicts GPA based on an indicator variable (0: not premed, 1: premed). Interpret the intercept and slope estimates in context of the data. ĝpa = 3.57 − 0.01 × premed

Reading: Section 7.3 of OpenIntro Statistics

LO 17. Define a leverage point as a point that lies away from the center of the data in the horizontal direction.

LO 18. Define an influential point as a point that influences (changes) the slope of the regression line.
- This is usually a leverage point that is away from the trajectory of the rest of the data.

LO 19. Do not remove outliers from an analysis without good reason.

LO 20. Be cautious about using a categorical explanatory variable when one of the levels has very few observations, as these may act as influential points.

Test yourself: Determine whether each of the three unusual observations in the plot below would be considered just an outlier, a leverage point, or an influential point.
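As a quick numeric sketch of the slope and intercept formulas (LO 10 and LO 13), the plain-Python snippet below uses only the summary statistics quoted in the height-and-weight question above; the variable names are my own, not from the text.

```python
# Least squares fit from summary statistics, using the figures quoted above:
# mean height 70 in (sd 2), mean weight 140 lb (sd 25), correlation R = 0.60.
R = 0.60
x_bar, s_x = 70, 2    # height (inches)
y_bar, s_y = 140, 25  # weight (pounds)

b1 = R * s_y / s_x        # slope (LO 10): 7.5 pounds per inch
b0 = y_bar - b1 * x_bar   # intercept (LO 13): -385 pounds

# Prediction within the range of the data (LO 15): a 72-inch-tall student
y_hat = b0 + b1 * 72      # 155 pounds

# R^2 (LO 16): fraction of the variability in weight explained by height
r_squared = R ** 2        # 0.36

print(b1, b0, y_hat, r_squared)
```

A 72-inch student weighing 115 pounds would therefore sit below the line, with a negative residual of 115 − 155 = −40 pounds.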

Reading: Section 7.4 of OpenIntro Statistics

LO 21. Determine whether an explanatory variable is a significant predictor of the response variable using the t-test and the associated p-value in the regression output.

LO 22. Set the null hypothesis for testing the significance of the predictor as H0: β1 = 0, and recognize that standard software output yields the p-value for the two-sided alternative hypothesis.
- Note that β1 = 0 means the regression line is horizontal, suggesting that there is no relationship between the explanatory and response variables.

LO 23. Calculate the T score for the hypothesis test as T_df = (b1 − null value) / SE_b1, with df = n − 2.
- Note that the T score has n − 2 degrees of freedom, since we lose one degree of freedom for each parameter we estimate, and in this case we estimate both the intercept and the slope.

LO 24. Note that a hypothesis test for the intercept is often irrelevant, since the intercept is usually outside the range of the data, and hence estimating it is usually an extrapolation.

LO 25. Calculate a confidence interval for the slope as b1 ± t*_df × SE_b1, where df = n − 2 and t*_df is the critical score associated with the given confidence level at the desired degrees of freedom.
- Note that the standard error of the slope estimate, SE_b1, can be found in the regression output.

Test yourself:
1. Given the regression output below for predicting y from x, where n = 100, confirm the T score and the p-value, determine whether x is a significant predictor of y, and interpret the p-value in context.

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   19.8581      6.1379     3.24    0.0017
x              0.2557      0.1055     2.42    0.0172

2. Calculate a 95% confidence interval for the slope given above.
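Using the estimates in the output above, the T score (LO 23) and a 95% confidence interval for the slope (LO 25) can be checked by hand. One assumption in the sketch below: the critical value t*_98 ≈ 1.984 is read off a t-table, since Python's standard library has no t-distribution.

```python
b1, se_b1 = 0.2557, 0.1055  # slope estimate and its SE, from the output above
n = 100
df = n - 2                  # 98 degrees of freedom

# T score for H0: beta1 = 0 (LO 23)
t_score = (b1 - 0) / se_b1
print(round(t_score, 2))    # 2.42, matching the t value column

# 95% confidence interval for the slope (LO 25);
# t*_98 ~ 1.984 is an approximate critical value from a t-table
t_star = 1.984
ci = (b1 - t_star * se_b1, b1 + t_star * se_b1)
print(tuple(round(v, 4) for v in ci))   # about (0.0464, 0.465)
```

Since the p-value for x (0.0172) is below 0.05 and the interval excludes 0, the two approaches agree that x is a significant predictor of y at the 5% level.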