Correlation and Simple Linear Regression

Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University
Reference: 1. W. Navidi. Statistics for Engineers and Scientists. Chapter 7 (7.1-7.3) & Teaching Material

Introduction (1/2)
Often, scientists and engineers collect data in order to determine the nature of the relationship between two quantities.
An example is the heights and forearm lengths of men.
The points tend to slope upward and to the right, indicating that taller men tend to have longer forearms.
There is a positive association between height and forearm length.
Statistics-Berlin Chen 2

Introduction (2/2)
Many times, these ordered pairs of measurements fall approximately along a straight line when plotted.
In those situations, the data can be used to compute an equation for the line that best fits the data.
This line can be used for various things, one of which is predicting future values.

Correlation
Something we may be interested in is how closely related two physical characteristics are, for example the height and weight of a two-year-old child.
The quantity called the correlation coefficient is a measure of this.
We look at the direction of the relationship (positive or negative) and the strength of the relationship, and then we find a line that best fits the data.
In computing correlation, we can only use quantitative data (instead of qualitative data).

Example
[Figure: plot of height vs. forearm length for men, with the least-squares line superimposed]
We say that there is a positive association between height and forearm length, because the plot indicates that taller men tend to have longer forearms.
The slope is roughly constant throughout the plot, indicating that the points are clustered around a straight line.
The line superimposed on the plot is a special line known as the least-squares line.

Correlation Coefficient
The degree to which the points in a scatterplot tend to cluster around a line reflects the strength of the linear relationship between x and y.
The correlation coefficient is a numerical measure of the strength of the linear relationship between two variables.
The correlation coefficient is usually denoted by the letter r.
It is also called the sample correlation (cf. the population correlation):
$\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X - E[X])(Y - E[Y])]}{\sqrt{E[X^2] - (E[X])^2}\,\sqrt{E[Y^2] - (E[Y])^2}}$

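Computing Correlation Coefficient r
Let $(x_1, y_1), \dots, (x_n, y_n)$ represent n points on a scatterplot.
Compute the means and the standard deviations of the x's and y's:
$s_x = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$, $s_y = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2}$
Then convert each x and y to standard units. That is, compute the z-scores $(x_i - \bar{x})/s_x$ and $(y_i - \bar{y})/s_y$.
The correlation coefficient is the average of the products of the z-scores, except that we divide by n-1 instead of n:
$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$
Sometimes, this computation is more useful:
$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$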
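The two ways of computing r described on this slide can be sketched in Python as follows. This is a minimal illustration, not code from the slides: the function names and the height/forearm numbers are made up for the example.

```python
import math

def correlation_from_z_scores(xs, ys):
    """r as the sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divide by n - 1).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

def correlation_direct(xs, ys):
    """r from sums of deviations, without computing s_x and s_y."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical heights and forearm lengths (inches), for illustration only.
heights = [66.0, 68.5, 70.0, 71.5, 73.0, 75.0]
forearms = [9.8, 10.1, 10.6, 10.5, 11.0, 11.4]

r_z = correlation_from_z_scores(heights, forearms)
r_direct = correlation_direct(heights, forearms)
print(r_z, r_direct)
```

Both functions should agree up to floating-point rounding, since the second formula is the first one with the factors of n-1 cancelled.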

Comments on Correlation Coefficient
In principle, the correlation coefficient can be calculated for any set of points.
In many cases, the points constitute a random sample from a population of points. In this case, the correlation coefficient is called the sample correlation, and it is an estimate of the population correlation.
It is a fact that r is always between -1 and 1.
Positive values of r indicate that the least-squares line has a positive slope: greater values of one variable are associated with greater values of the other.
Negative values of r indicate that the least-squares line has a negative slope: greater values of one variable are associated with lesser values of the other.

Examples of Various Levels of Correlation
[Figure: scatterplots illustrating various levels of correlation]

More Comments
Values of r close to -1 or 1 indicate a strong linear relationship.
Values of r close to 0 indicate a weak linear relationship.
When r is equal to -1 or 1, all the points on the scatterplot lie exactly on a straight line.
If the points lie exactly on a horizontal or vertical line, then r is undefined.
If $r \neq 0$, then x and y are said to be correlated. If $r = 0$, then x and y are uncorrelated.

Properties of Correlation Coefficient r (1/2)
An important feature of r is that it is unitless. It is a pure number that can be compared between different samples.
r remains unchanged under each of the following operations:
Multiplying each value of a variable by a positive constant
Adding a constant to each value of a variable
Interchanging the values of x and y
If r = 0, this does not imply that there is no relationship between x and y. It just indicates that there is no linear relationship.
Example: for a quadratic relationship such as $y = 64x - 16x^2$, the sum $\sum_{i}(x_i - \bar{x})(y_i - \bar{y})$ can be 0, so that r = 0 even though y is completely determined by x.
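The properties listed on this slide can be checked numerically. The sketch below is illustrative only: the helper `correlation` and the sample values are assumptions for the example. The quadratic case evaluates y = 64x - 16x^2 at x values placed symmetrically around x = 2.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical measurements, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.3, 2.9, 4.1, 4.8, 6.2, 6.6]

r_original = correlation(xs, ys)
# Change of units: multiply by a positive constant and add a constant.
r_rescaled = correlation([2.54 * x + 10.0 for x in xs], ys)
# Interchange the roles of x and y.
r_swapped = correlation(ys, xs)

# Quadratic relationship sampled symmetrically around x = 2.
quad_x = [0.0, 1.0, 2.0, 3.0, 4.0]
quad_y = [64 * x - 16 * x ** 2 for x in quad_x]
r_quadratic = correlation(quad_x, quad_y)

print(r_original, r_rescaled, r_swapped, r_quadratic)
```

The first three values should coincide up to rounding, and the last should be zero: the relationship is exact but not linear.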

Properties of Correlation Coefficient r (2/2)
Outliers can greatly distort r, especially in small data sets, and present a serious problem for data analysts.
[Figure: scatterplot containing an outlier, annotated with its correlation coefficient]
Correlation is not causation.
For example, vocabulary size is strongly correlated with shoe size, but this is because both increase with age. Learning more words does not cause feet to grow, and vice versa. Age is confounding the results.

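Inference on the Population Correlation
If the random variables X and Y have a certain joint distribution called a bivariate normal distribution, then the sample correlation r can be used to construct confidence intervals and perform hypothesis tests on the population correlation, ρ. The following results make this possible.
Bivariate normal distribution: $Z = \begin{pmatrix} X \\ Y \end{pmatrix}$, $\mu_Z = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}$, $\Sigma_Z = \begin{pmatrix} \sigma_{X,X} & \sigma_{X,Y} \\ \sigma_{X,Y} & \sigma_{Y,Y} \end{pmatrix}$
Let $(x_1, y_1), \dots, (x_n, y_n)$ be a random sample from the joint distribution of X and Y, and let r be the sample correlation of the n points. Then the quantity (a function of r)
$W = \frac{1}{2}\ln\frac{1 + r}{1 - r}$
is approximately normal with mean
$\mu_W = \frac{1}{2}\ln\frac{1 + \rho}{1 - \rho}$
and variance
$\sigma_W^2 = \frac{1}{n - 3}$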

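Example 7.3
Question: Find a 95% confidence interval for the correlation between the reaction time to a visual stimulus (x) and that to an audio stimulus (y), given the following sample:
x: 161 203 235 176 201 188 228 211 191 178
y: 159 206 241 163 197 193 209 189 169 201
The sample correlation between x and y is r = 0.8159.
$W = \frac{1}{2}\ln\frac{1 + r}{1 - r} = \frac{1}{2}\ln\frac{1 + 0.8159}{1 - 0.8159} = 1.1444$
$\sigma_W$ is given by $\sigma_W = \sqrt{1/(10 - 3)} = 0.3780$
A 95% (two-sided) confidence interval for $\mu_W$ is
$1.1444 - 1.96(0.3780) < \mu_W < 1.1444 + 1.96(0.3780)$, that is, $0.4036 < \mu_W < 1.8852$
Note that the population correlation ρ can be expressed as
$\rho = \frac{e^{2\mu_W} - 1}{e^{2\mu_W} + 1}$
The corresponding 95% confidence interval for ρ is
$\frac{e^{2(0.4036)} - 1}{e^{2(0.4036)} + 1} < \rho < \frac{e^{2(1.8852)} - 1}{e^{2(1.8852)} + 1}$, that is, $0.383 < \rho < 0.955$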
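The arithmetic of Example 7.3 can be reproduced with a short Python sketch. The function name and the `z_crit` parameter are assumptions made for this illustration. The inputs r = 0.8159 and n = 10 and the normal critical value 1.96 are the ones used in the example.

```python
import math

def correlation_confidence_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for rho via W = 0.5 * ln((1 + r) / (1 - r)).

    Assumes the sample comes from a bivariate normal distribution and n > 3.
    z_crit is the standard normal critical value (1.96 for a 95% interval).
    """
    w = 0.5 * math.log((1 + r) / (1 - r))
    sigma_w = math.sqrt(1 / (n - 3))
    w_low = w - z_crit * sigma_w
    w_high = w + z_crit * sigma_w

    def to_rho(mu_w):
        # Invert the transformation: rho = (e^(2 mu) - 1) / (e^(2 mu) + 1).
        return (math.exp(2 * mu_w) - 1) / (math.exp(2 * mu_w) + 1)

    return to_rho(w_low), to_rho(w_high)

rho_low, rho_high = correlation_confidence_interval(0.8159, 10)
print(round(rho_low, 3), round(rho_high, 3))
```

The interval is not symmetric around r because the inverse transformation is nonlinear. This is expected when r is close to 1.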

Linear Model
When two variables have a linear relationship, the scatterplot tends to be clustered around a line known as the least-squares line.
The line that we are trying to fit is
ideal value: $l_i = \beta_0 + \beta_1 x_i$
measured value: $y_i = l_i + \varepsilon_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
where $y_i$ is the dependent variable, $x_i$ is the independent variable, and $\varepsilon_i$ is the measurement error.
$\beta_0$ and $\beta_1$ are called the regression coefficients.
We only know the values of x and y, so we must estimate the other quantities. This is what we call simple linear regression.
We use the data to estimate these quantities.

The Least-Squares Line
$\beta_0$ and $\beta_1$ cannot be determined because of measurement error, but they can be estimated by calculating the least-squares line
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
$\hat{\beta}_0$ and $\hat{\beta}_1$ are called the least-squares coefficients.
The least-squares line is the line that fits the data "best" (the data are contaminated with random errors).
fitted value: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
residual: $e_i = y_i - \hat{y}_i$

Residuals
For each data point $(x_i, y_i)$, the vertical distance to the point $(x_i, \hat{y}_i)$ on the least-squares line is $e_i = y_i - \hat{y}_i$. The quantity $\hat{y}_i$ is called the fitted value and the quantity $e_i$ is called the residual associated with the point $(x_i, y_i)$.
Points above the least-squares line have positive residuals.
Points below the line have negative residuals.
The closer the residuals are to 0, the closer the fitted values are to the observations and the better the line fits the data.
The least-squares line is the one that minimizes the sum of squared residuals $\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$.

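Finding the Equation of the Line
To find the least-squares line, we must determine estimates for the intercept $\beta_0$ and the slope $\beta_1$ that minimize the sum of the squared residuals
$E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$
These quantities are
$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$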
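As a rough illustration of these two formulas, the following Python sketch computes the least-squares intercept and slope and the resulting residuals. The function name and the five data points are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx                  # estimate of beta_1
    intercept = y_bar - slope * x_bar    # estimate of beta_0
    return intercept, slope

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = least_squares_line(xs, ys)
fitted = [intercept + slope * x for x in xs]
residuals = [y - f for y, f in zip(ys, fitted)]
print(intercept, slope, residuals)
```

With an intercept in the model, the residuals of a least-squares fit sum to zero (up to rounding), which is a quick sanity check on any implementation.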

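Some Shortcut Formulas
The expressions on the right are equivalent to those on the left, and are often easier to compute:
$\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$
$\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$
$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2$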

Cautions
Do not extrapolate the fitted line (such as the least-squares line) outside the range of the data. The linear relationship may not hold there.
We learned that we should not use the correlation coefficient when the relationship between x and y is not linear. The same holds for the least-squares line. When the scatterplot follows a curved pattern, it does not make sense to summarize it with a straight line.
If the relationship is curved, then we would want to fit a regression model that contains squared terms (i.e., polynomial regression).

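Measures of Goodness of Fit
A goodness-of-fit statistic is a quantity that measures how well a model explains a given set of data.
The quantity $r^2$ is the square of the correlation coefficient, and we call it the coefficient of determination:
$r^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2 - \sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$
Here $\sum_{i}(y_i - \bar{y})^2$ is the total sum of squares and $\sum_{i}(y_i - \hat{y}_i)^2$ is the error sum of squares.
The interpretation of $r^2$ is the proportion of variance explained by regression.
The numerator $\sum_{i}(y_i - \bar{y})^2 - \sum_{i}(y_i - \hat{y}_i)^2$ measures the reduction of spread of the points obtained by using the least-squares line rather than the line $y = \bar{y}$.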

Sums of Squares
$\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ is the error sum of squares (SSE) and measures the overall spread of the points around the least-squares line.
$\sum_{i=1}^{n}(y_i - \bar{y})^2$ is the total sum of squares (SST) and measures the overall spread of the points around the line $y = \bar{y}$.
The difference $\sum_{i=1}^{n}(y_i - \bar{y})^2 - \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2$ is called the regression sum of squares (SSR).
Clearly, the following relationship (the analysis of variance identity) holds:
Total sum of squares (SST) = regression sum of squares (SSR) + error sum of squares (SSE)
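The analysis of variance identity and the coefficient of determination can be verified numerically. The sketch below computes each sum of squares independently from its definition. The function name and the data are hypothetical, used only for illustration.

```python
def sums_of_squares(xs, ys):
    """Return (SST, SSR, SSE) for the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)              # spread around y = y_bar
    ssr = sum((f - y_bar) ** 2 for f in fitted)          # spread explained by the line
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # spread around the line
    return sst, ssr, sse

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

sst, ssr, sse = sums_of_squares(xs, ys)
r_squared = (sst - sse) / sst
print(sst, ssr, sse, r_squared)
```

Because SSR is computed here from the fitted values rather than as SST - SSE, agreement between `sst` and `ssr + sse` is a genuine check of the identity and not a tautology.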

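Uncertainties in the Least-Squares Coefficients
Assumptions for Errors in Linear Models: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
In the simplest situation, the following assumptions are satisfied:
1. The errors $\varepsilon_1, \dots, \varepsilon_n$ are random and independent. In particular, the magnitude of any error $\varepsilon_i$ does not influence the value of the next error $\varepsilon_{i+1}$.
2. The errors $\varepsilon_1, \dots, \varepsilon_n$ all have mean 0.
3. The errors $\varepsilon_1, \dots, \varepsilon_n$ all have the same variance, which we denote by $\sigma^2$ (variance of the error).
4. The errors $\varepsilon_1, \dots, \varepsilon_n$ are normally distributed.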

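Distribution of $y_i$
In the linear model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, under assumptions 1 through 4, the observations $y_1, \dots, y_n$ are independent random variables that follow the normal distribution. The mean and variance of $y_i$ are given by
$\mu_{y_i} = l_i = \beta_0 + \beta_1 x_i$
$\sigma_{y_i}^2 = \sigma^2$
The slope $\beta_1$ represents the change in the mean of y associated with an increase of one unit in the value of x.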

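Distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$ (1/2)
Under assumptions 1 through 4, the quantities $\hat{\beta}_0$ and $\hat{\beta}_1$ are normally distributed random variables.
After further manipulation, each estimate can be written as a linear combination of the $y_i$:
$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sum_{i=1}^{n}\left[\frac{x_i - \bar{x}}{\sum_{j=1}^{n}(x_j - \bar{x})^2}\right] y_i$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x} = \sum_{i=1}^{n}\left[\frac{1}{n} - \frac{\bar{x}(x_i - \bar{x})}{\sum_{j=1}^{n}(x_j - \bar{x})^2}\right] y_i$
$\mu_{\hat{\beta}_0} = \beta_0$ and $\mu_{\hat{\beta}_1} = \beta_1$, so $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased estimates.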

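Distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$ (2/2)
The standard deviations of $\hat{\beta}_0$ and $\hat{\beta}_1$ are
$\sigma_{\hat{\beta}_0} = \sigma\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$
$\sigma_{\hat{\beta}_1} = \frac{\sigma}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$
Since $\sigma$ is unknown, these are estimated by replacing $\sigma$ with s, giving $s_{\hat{\beta}_0}$ and $s_{\hat{\beta}_1}$, where
$s = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n - 2}} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - 2}} = \sqrt{\frac{(1 - r^2)\sum_{i=1}^{n}(y_i - \bar{y})^2}{n - 2}}$
is an estimate of the error standard deviation $\sigma$.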
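A minimal Python sketch of these estimates, assuming the model assumptions 1 through 4 hold, is given below. The function name and data are hypothetical. The formulas are the ones on this slide with the unknown σ replaced by s.

```python
import math

def coefficient_standard_errors(xs, ys):
    """Return (s, se_intercept, se_slope) for the least-squares fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    # Error sum of squares from the residuals e_i = y_i - fitted value.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))  # estimate of the error standard deviation
    se_intercept = s * math.sqrt(1 / n + x_bar ** 2 / s_xx)
    se_slope = s / math.sqrt(s_xx)
    return s, se_intercept, se_slope

# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

s, se_intercept, se_slope = coefficient_standard_errors(xs, ys)
print(s, se_intercept, se_slope)
```

Both standard errors shrink as the x values become more spread out, because the sum of squared deviations of x appears in the denominators.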

Notes
1. Since there is a measure of variation of x in the denominator in both of the uncertainties we just defined, the more spread out the x's are, the smaller the uncertainties in $\hat{\beta}_0$ and $\hat{\beta}_1$.
2. Use caution: if the range of x values extends beyond the range where the linear model holds, the results will not be valid.
3. The quantities $(\hat{\beta}_0 - \beta_0)/s_{\hat{\beta}_0}$ and $(\hat{\beta}_1 - \beta_1)/s_{\hat{\beta}_1}$ have Student's t distribution with n-2 degrees of freedom.

Confidence Intervals for $\beta_0$ and $\beta_1$
Level 100(1-α)% two-sided confidence intervals for $\beta_0$ and $\beta_1$ are given by
$\hat{\beta}_0 \pm t_{n-2,\,\alpha/2}\, s_{\hat{\beta}_0}$ and $\hat{\beta}_1 \pm t_{n-2,\,\alpha/2}\, s_{\hat{\beta}_1}$
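The interval formula can be sketched in a few lines of Python. To stay within the standard library, the critical value $t_{n-2,\alpha/2}$ is passed in as an argument and is assumed to come from a t-table. The function name and the numerical inputs (a slope estimate of 1.99 with standard error 0.0597 from n = 5 points) are hypothetical.

```python
def coefficient_confidence_interval(estimate, std_error, t_crit):
    """Two-sided interval: estimate +/- t_crit * std_error.

    t_crit is the upper alpha/2 point of Student's t with n - 2 degrees of
    freedom, supplied by the caller (for example, read from a t-table).
    """
    margin = t_crit * std_error
    return estimate - margin, estimate + margin

# Hypothetical slope estimate and standard error from n = 5 points.
# For a 95% interval with n - 2 = 3 degrees of freedom, t-tables give 3.182.
slope_low, slope_high = coefficient_confidence_interval(1.99, 0.0597, 3.182)
print(slope_low, slope_high)
```

With so few degrees of freedom the t critical value is much larger than the normal value 1.96, so the interval is correspondingly wider.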

Summary
We discussed:
Correlation
Least-squares line / regression
Uncertainties in the least-squares coefficients
Confidence intervals (and hypothesis tests) for least-squares coefficients