Bivariate correlation, bivariate regression, multiple regression


Today: bivariate correlation, bivariate regression, multiple regression

Bivariate Correlation

The Pearson product-moment correlation (r) assesses the nature and strength of the linear relationship between two continuous variables:

$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$

r² represents the proportion of variance shared by the two variables; e.g. if r = 0.663 then r² = 0.439: X and Y share 43.9% of the variance in common.
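In R (the language used later in these slides), r and r² are one call away; a minimal sketch with made-up vectors:

# Pearson r and shared variance for two numeric vectors (hypothetical data)
x <- c(1.2, 2.3, 3.1, 4.8, 5.0, 6.4)
y <- c(2.0, 2.9, 3.3, 5.1, 4.8, 6.0)
r <- cor(x, y)  # Pearson product-moment correlation
r^2             # proportion of variance shared by x and y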

Bivariate Correlation

[Figure: six example scatterplots labeled r > 0, r < 0, and r = 0; one r = 0 panel shows a strong non-linear relationship]

Remember: r measures linear correlation.

Significance Tests

We can perform significance tests on r. H0: (population) r = 0; H1: (population) r ≠ 0 (two-tailed), or H1: (population) r < 0 (or r > 0) for a one-tailed test. The logic rests on the sampling distribution of r: if we repeatedly drew samples from populations that were not correlated at all, what proportion of the time would we get a value of r as extreme as the one we observe? If p < .05 we reject H0.

Significance Tests

We can perform an F-test with df = (1, N-2):

$F = \frac{r^2 (N - 2)}{1 - r^2}$

or equivalently a t-test with df = N-2:

$t = \frac{r \sqrt{N - 2}}{\sqrt{1 - r^2}}$

So for example, if we have an observed r = 0.663 based on a sample of 10 (X, Y) pairs: F_obs = 6.261, F_crit(1, 8, 0.05) = 5.32 (or compute p = 0.018); therefore reject H0.
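A quick sketch of this test in R; with raw data in hand one would normally just call cor.test() instead:

# F and t for an observed correlation, using the formulas above
# (r and N taken from the slide's example)
r <- 0.663
N <- 10
Fobs <- r^2 * (N - 2) / (1 - r^2)        # F with df = (1, N-2)
tobs <- r * sqrt(N - 2) / sqrt(1 - r^2)  # t with df = N-2; tobs^2 equals Fobs
Fcrit <- qf(0.95, 1, N - 2)              # critical F at alpha = .05
p <- pf(Fobs, 1, N - 2, lower.tail = FALSE)
c(Fobs = Fobs, Fcrit = Fcrit, p = p)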

Significance Tests

Be careful! Statistical significance does not equal scientific significance. E.g. let's say we have 112 data points and we compute r = 0.2134. We do an F-test: F_obs(1, 110) = 5.34, p < .05; reject H0: we have a significant correlation. But if r = 0.2134 then r² = 0.046: only 4.6% of the variance is shared between X and Y, and 95.4% of the variance is NOT shared. H0 is that r = 0, not that r is large (and not that r is "significant").

Bivariate Regression

X and Y are continuous variables, and Y is considered to be dependent on X. We want to predict a value of Y given a value of X; e.g. Y is a person's weight, X is a person's height:

$\hat{Y}_i = \beta_0 + \beta_1 X_i$

The estimate of Y, Ŷ_i, is equal to a constant (β_0) plus another constant (β_1) times the value of X. This is the equation for a straight line: β_0 is the Y-intercept and β_1 is the slope.

Bivariate Regression

We want to predict Y given X; we are modelling Y using a linear equation: $\hat{Y}_i = \beta_0 + \beta_1 X_i$.

Height (X)  Weight (Y)
55          140
61          150
67          152
83          220
65          190
82          195
70          175
58          130
65          155
61          160

[Figure: scatterplot of weight (Y) against height (X) with the fitted line]

β_0 = -7.2, β_1 = 2.6

Bivariate Regression

The slope means that every inch in height is associated with an additional 2.6 pounds of weight.

Bivariate Regression

How do we estimate the coefficients β_0 and β_1? For bivariate regression there are formulas:

$\beta_1 = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}$

$\beta_0 = \bar{Y} - \beta_1 \bar{X}$

These formulas estimate β_0 and β_1 according to a least-squares criterion: they are the two β values that minimize the sum of squared deviations between the estimated values of Y (the line of best fit) and the actual values of Y (the data).
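As a sketch in R, fitting the height/weight data above with lm() recovers the same least-squares estimates (the variable names here are mine):

# Height/weight data from the table above
height <- c(55, 61, 67, 83, 65, 82, 70, 58, 65, 61)
weight <- c(140, 150, 152, 220, 190, 195, 175, 130, 155, 160)
fit <- lm(weight ~ height)  # least-squares fit of weight = b0 + b1*height
coef(fit)                   # intercept approx -7.2, slope approx 2.6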

Bivariate Regression

How good is our line of best fit? A common measure is the Standard Error of Estimate:

$SE = \sqrt{\frac{\sum (Y - \hat{Y})^2}{N - 2}}$

where N is the number of (X, Y) pairs of data. SE gives a measure of the typical prediction error in units of Y; e.g. in our height/weight data SE = sqrt(1596 / 8) = 14.1 lbs.

Bivariate Regression

We can use SE to generate confidence intervals for our estimated values:

$\hat{Y} = (\beta_0 + \beta_1 X) \pm 1.96 \, SE$

So for example, if height = 72 inches, the predicted weight is -7.2 + 2.6*72 = 180 pounds, +/- 1.96(14.1) = 180 +/- 27.6 pounds. This means that if we took repeated samples from the population and recomputed the regression line and the interval each time, then 95% of those intervals would contain the true population mean weight of a 72-inch-tall individual. Obviously SE, and thus the width of the CI, depends on the size of the sample (N).
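In R, predict() on the lm() fit above computes exact small-sample intervals; the slide's +/- 1.96 SE rule is the large-sample approximation to these. A sketch:

# Intervals for a new height of 72 inches, using `fit` from above
new <- data.frame(height = 72)
predict(fit, new, interval = "confidence")  # CI for the mean weight at 72 in
predict(fit, new, interval = "prediction")  # wider interval for an individual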

Bivariate Regression

Another measure of fit is r², which gives the proportion of variance accounted for; e.g. r² = 0.58 means that 58% of the variance in Y is accounted for by X. r² is bounded by [0, 1]:

$r^2 = \frac{\sum (\hat{Y} - \bar{Y})^2}{\sum (Y - \bar{Y})^2}$

Linear Regression with Non-Linear Terms

[Figure: three scatterplots of the same curvilinear (X, Y) data with fitted curves: Y = β_0 + β_1 X (obviously non-linear relationship, poor fit); Y = β_0 + β_1 X² (better but not great); Y = β_0 + β_1 X³ (much better fit)]

Linear Regression with Non-Linear Terms

$Y = \beta_0 + \beta_1 X^3$

How do we do this? Just create a new variable X³, then perform linear regression using that instead of X. You will get your β coefficients and r², and you can generate predicted values of Y if you want.
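A minimal sketch of this in R, with simulated data (the slide's actual dataset isn't given):

# Simulated curvilinear data (hypothetical)
set.seed(1)
x <- seq(-3, 3, length.out = 100)
y <- 2 + 0.5 * x^3 + rnorm(100, sd = 1)
fit.cubic <- lm(y ~ I(x^3))     # same linear machinery, applied to x^3
summary(fit.cubic)$r.squared    # much higher than for lm(y ~ x)
plot(x, y); lines(x, fitted(fit.cubic))  # always plot the fit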

Always plot your data

This poorly fitting regression line, Y = β_0 + β_1 X, gives the following F-test: F(1, 99) = 266.2, p < .001, r² = 0.85. So we have accounted for 85% of the variance in Y using a straight line, even though the relationship is obviously non-linear. Is this good enough? What is H0? (Y = β_0.) If you never plotted the data you would never know that you can do a LOT better: with Y = β_0 + β_1 X³ we get r² = 0.99.

[Figure: the same curvilinear data fitted with a straight line and with a cubic]

Anscombe's quartet

Four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x, y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties. http://en.wikipedia.org/wiki/Anscombe's_quartet

Anscombe's quartet

[Figure: the four datasets plotted as scatterplots, each with its fitted regression line]

Anscombe's quartet

In all 4 cases:
mean(x) = 9
var(x) = 11
mean(y) = 7.50
var(y) = 4.122 or 4.127
cor(x, y) = 0.816
regression line: y = 3.00 + 0.500 x
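The quartet ships with base R as the data frame anscombe (columns x1..x4 and y1..y4), so these identical summary statistics are easy to verify:

# Summary statistics and fitted line for each of the four datasets
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean.x = mean(x), var.x = var(x), mean.y = mean(y), var.y = var(y),
    r = cor(x, y), coef(lm(y ~ x)))  # intercept ~3.00, slope ~0.500
})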

Multiple Regression

Same idea as bivariate regression: we want to predict values of a continuous variable Y, but instead of basing our prediction on a single variable X, we use several independent variables X_1, ..., X_k. The linear model is:

$\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_k X_k$

The βs are constants and X_1, ..., X_k are predictor variables. The β weights are those that minimize the total sum of squared error between the predicted and actual Y values.

An Example

Basketball data: http://www.gribblelab.org/stats/data/bball.csv

Data from 105 NBA players on seven variables:
- games played last season
- points scored per minute
- minutes played per game
- height
- field goal percentage
- age
- free throw percentage

You are the new coach. You want to develop a model that will let you predict points scored per minute based on the other 6 variables.

> mydata <- read.table("http://www.gribblelab.org/stats/data/bball.csv", header=T, sep=",")
> plot(mydata)

[Figure: scatterplot matrix (pairs plot) of GAMES, PPM, MPG, HGT, FGP, AGE and FTP]

Questions answered by Multiple Regression

What is the best single predictor? What is the best equation (model)? Does a certain variable add significantly to the predictive power?

What is the best single predictor?

Simply obtain the bivariate correlations between the dependent variable (Y) and each of the individual predictor variables (X1-X6). Which predictors have a significant correlation? The predictor with the maximum absolute correlation coefficient is the best single predictor (note that the largest |r| can be negative).

Points per minute (PPM) vs:

predictor       r         p
age             -0.0442   0.654
field goal %    0.4063    0.00...
free throw %    0.1655    0.092
games/season    -0.0598   0.544
height          0.2134    0.029
minutes/game    0.3562    0.00...
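A sketch of how this table could be produced in R from the data frame loaded above (column names taken from the scatterplot matrix):

# r and two-tailed p for PPM against every other variable
predictors <- setdiff(names(mydata), "PPM")
t(sapply(predictors, function(v) {
  ct <- cor.test(mydata$PPM, mydata[[v]])  # Pearson r, H0: r = 0
  c(r = unname(ct$estimate), p = ct$p.value)
}))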

What is the best model?

3 ways to do this:
forward regression
backward regression
stepwise regression

Forward Regression

1. Start with no IVs in the equation.
2. Check to see if any IVs significantly predict the DV.
3. If no, stop. If yes, add the best IV and go to step 4.
4. Check to see if any remaining IVs predict a significant unique amount of variance (unique = contributions of variance above and beyond the other variables already in the equation).
5. If no, stop. If yes, add the best one and go back to step 4.

Problem: we can still end up with variables in the equation that don't account for a significant unique proportion of variance.

[Diagram: forward regression example. X1 accounts for a significant proportion of variance, so add X1: Y = β_0 + β_1 X_1. X2 adds a significant proportion, so add X2: Y = β_0 + β_1 X_1 + β_2 X_2. X3 adds a significant proportion, so add X3: Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3. PROBLEM: X2 no longer accounts for a significant unique portion of variance, but forward regression leaves it in the equation.]

Backward Regression

1. Start with all IVs in the equation.
2. Check to see if any IVs are not significantly adding to the equation.
3. If no, stop. If yes, remove the worst IV (smallest r²) and go back to step 2.

Backward regression avoids the problem of ending up with variables in the equation that don't account for significant unique portions of variance.

[Diagram: backward regression example. Start with Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3. X1 does not account for a significant portion of unique variance, so remove X1. X2 and X3 each account for a significant unique proportion of variance, so stop: Y = β_0 + β_2 X_2 + β_3 X_3.]

Stepwise Regression

1. Start with no IVs in the equation.
2. Check to see if any IVs significantly predict the DV.
3. If no, stop. If yes, add the best IV (largest r²) and go to step 4.
4. Check to see if any remaining IVs add significantly to the equation.
5. If no, stop. If yes, add the best IV (largest r²) and go to step 6.
6. Check each IV currently in the equation to make sure each still contributes a unique portion of variance.
7. Remove any that don't.
8. Go to step 4.

[Diagram: stepwise regression example. X1 accounts for a significant proportion of variance, so add X1: Y = β_0 + β_1 X_1. X2 adds a significant proportion, so add X2: Y = β_0 + β_1 X_1 + β_2 X_2. X3 adds a significant proportion, so add X3: Y = β_0 + β_1 X_1 + β_2 X_2 + β_3 X_3. PROBLEM: X1 no longer accounts for a significant unique portion of variance, so stepwise regression removes X1.]

Building Models

Stepwise regression is almost exclusively used these days; backward and forward regression are not very common any more. How do we decide whether a variable is significant when it is added or removed? Two common criteria:
F-tests using a p-value cutoff (e.g. 5%): this is how SPSS does it.
The Akaike Information Criterion (AIC), another measure of the tradeoff between model simplicity and model goodness-of-fit: this is how R does it. http://en.wikipedia.org/wiki/Akaike_information_criterion
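In R, the step() function performs this AIC-based selection; a sketch using the basketball data loaded earlier (same column names as the scatterplot matrix):

# AIC-based model selection on the basketball data
full <- lm(PPM ~ GAMES + MPG + HGT + FGP + AGE + FTP, data = mydata)
step(full, direction = "both")      # stepwise: add and drop terms by AIC
step(full, direction = "backward")  # backward elimination by AIC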

Benchmarking

When we ask the question "does variable X2 contribute unique variance?" we are comparing one model against another. This is known as benchmarking. News flash: we have been doing this all along! We are comparing a full model and a restricted model:

restricted: $Y = \beta_0 + \beta_1 X_1$
full: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$

The F-test tests whether X2 adds unique variance over and above that already accounted for by the restricted model.
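In R this full-versus-restricted comparison is an anova() call on two nested lm() fits; a sketch with example predictors from the basketball data:

# Does MPG add unique variance over and above FGP?
restricted <- lm(PPM ~ FGP, data = mydata)
full <- lm(PPM ~ FGP + MPG, data = mydata)
anova(restricted, full)  # F-test comparing the nested models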