SIMPLE LINEAR REGRESSION and CORRELATION


Experimental Design and Statistical Methods Workshop
SIMPLE LINEAR REGRESSION and CORRELATION
Jesús Piedrafita Arilla (jesus.pedrafita@uab.cat), Departament de Ciència Animal i dels Aliments

Items
- Correlation: degree of association
- Regression: prediction
- The model
- Assumptions
- Matrix notation
- Protocol of analysis
- Plotting data
- ANOVA in regression
- Confidence intervals
- Analysis of residuals
- Influential observations
Basic commands: cor.test, lm, anova, seq, influence.measures. Libraries: car (scatterplot).

Analysis of several variables
Two main interests:
1. Estimating the degree of association between two variables: CORRELATION analysis.
2. Predicting the values of one variable given that we know the realised value of another variable(s): REGRESSION analysis. This analysis can also be used to understand the relationship among variables.
a) A response variable and an independent variable: simple (linear) regression.
b) A response variable and two or more independent variables: multiple (linear) regression.
c) When the relationship among variables is not linear: nonlinear regression.
d) If the variable is a dichotomous or binary variable: logistic regression.

Data example
Suppose we have recorded the age (years) and blood pressure (mm Hg) of 20 people, obtaining the data presented in the table.
[Table: 20 pairs of Age and Blood pressure values; ages range from 19 to 70 years and blood pressures from 120 to 146 mm Hg.]

Simple statistics
> ## Importing data
> BLOODP<-read.csv("bloodpress.csv", header=T)
> attach(BLOODP)
> options(na.action=na.exclude)
> summary(BLOODP)
      AGE           BLPRESS
 Min.   :19.0   Min.   :120.0
 1st Qu.:26.0   1st Qu.:125.5
 Median :44.5   Median :129.0
 Mean   :43.1   Mean   :131.5
 3rd Qu.:58.0   3rd Qu.:137.0
 Max.   :70.0   Max.   :146.0
To avoid problems in prediction when missing values are present, we must use options(na.action=na.exclude). With the current data set it would be unnecessary.

Plot of raw data
A plot for a pair of variables gives us a first impression about their relationship. It is also useful for detecting some extreme values.
> plot(AGE,BLPRESS)
> library(car)
> scatterplot(AGE,BLPRESS)
[Scatterplot of BLPRESS (mm Hg) against AGE (years)]
Data not obviously non-linear and no evidence of non-normality (boxplots not asymmetrical). No evidence of extreme values.

Correlation (Pearson)
The correlation is a measure of the degree of association between two variables. It is calculated as

r = cov(x,y) / (s_x s_y) = Σ(x_i − x̄)(y_i − ȳ) / √[ Σ(x_i − x̄)² · Σ(y_i − ȳ)² ]

r is an estimator of ρ, the population parameter.
> cor.test(BLPRESS, AGE)
Pearson's product-moment correlation
data: BLPRESS and AGE
t = 16.026, df = 18, p-value = 4.39e-12
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval: 0.9160502 0.9869976
sample estimates: cor 0.9666991
The denominator is the geometric mean of the sample variance estimates. This makes r range from −1 to 1. The closer an estimate is to −1 or 1, the stronger the correlation. H0: ρ = 0 is rejected.
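The coefficient and the t statistic that cor.test reports (t = r√(n − 2)/√(1 − r²), with n − 2 df) can be computed from first principles. A minimal sketch in Python rather than the workshop's R, on a tiny made-up data set; the function names are ours:

```python
import math

def pearson_r(x, y):
    """Pearson correlation: cov(x, y) divided by the product of the sample SDs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), the test of H0: rho = 0 with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
```

Because the formula is symmetric in x and y, pearson_r(x, y) equals pearson_r(y, x), just as cor.test gives the same answer for either ordering.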

Correlation sample size
The sample size required to have a particular correlation statistically different from 0 depends upon the same correlation coefficient:

z' = 0.5 ln[(1 + r)/(1 − r)]

Fisher's classic z-transformation to normalize the distribution of the Pearson correlation coefficient.

n = [ (z_{α/2} + z_β) / (z'_r − z'_{r0}) ]² + 3

r0 = 0 and r is the magnitude of the coefficient we want to estimate.
Sample size for a power of
  r     80%    90%
 0.1    782   1044
 0.2    194    258
 0.3     85    113
 0.4     47     62
 0.5     29     38
 0.6     20     25
 0.7     14     17
 0.8     10     12
 0.9      7      8
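The z'-based sample-size formula can be coded directly. A sketch in Python (the workshop itself uses R); the normal quantiles 1.96 (α = 0.05, two-sided) and 0.8416 (80% power) are our assumptions, and rounding up to the next integer reproduces most of the table above:

```python
import math

def fisher_z(r):
    """Fisher's z-transformation of a correlation coefficient."""
    return 0.5 * math.log((1 + r) / (1 - r))

def sample_size(r, z_alpha=1.96, z_beta=0.8416, r0=0.0):
    """n = ((z_alpha + z_beta) / (z'_r - z'_r0))^2 + 3, rounded up."""
    denom = fisher_z(r) - fisher_z(r0)
    return math.ceil(((z_alpha + z_beta) / denom) ** 2 + 3)
```

For example, sample_size(0.3) gives 85, matching the 80%-power column for r = 0.3.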

Simple linear regression - the model -

y_i = β0 + β1 x_i + ε_i

y is the dependent variable; β0 the intercept; β1 the regression coefficient (slope) = tg α, the increase of the dependent variable when the independent variable increases 1 unit; x the independent variable; ε the random error.
To estimate β0 and β1 we resort to the Least Squares methodology, i.e., minimize the sum of the squares of the deviations (red arrows) between actual (blue diamond) and predicted values (on the slope):

β̂1 = cov(x,y) / s²_x ;  β̂0 = ȳ − β̂1 x̄
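The two least-squares estimators above are a few lines of code. A Python sketch (not the slides' R; the function name is ours):

```python
def ols_fit(x, y):
    """Least-squares estimates for y = b0 + b1*x:
    slope b1 = S_xy / S_xx, intercept b0 = ybar - b1 * xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope
```

On exactly linear data such as y = 2x the fit recovers intercept 0 and slope 2.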

Assumptions in regression analysis
1. The variables x and y are linearly related (definition of the model).
2. Both variables are measured for each of n observations.
3. Variable x is measured without error (fixed).
4. Variable y is a set of random observations measured with error.
5. The errors are independent and normally distributed with homogeneous variance: ε ~ N(0, Iσ²_e)
Some of the above conditions can be seen in the figure: for each value (fixed) of x, there is a normal distribution of y (random), with mean on the regression line.

Matrix notation
The model can be written as

y = Xβ + ε

where y = (y₁, …, yₙ)', X is the n × 2 matrix whose i-th row is (1, x_i), β = (β0, β1)' and ε = (ε₁, …, εₙ)'.
As in ANOVA, we can minimize the sum of the squared errors and then we have the normal equations:

X'X β̂ = X'y,  so  β̂ = (X'X)⁻¹ X'y

Now X'X is not singular and can be solved without need of a generalised inverse (or restrictions). Note that X is not a matrix of 0s and 1s, but contains the values of the independent variable x.
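For simple regression the normal equations are a 2 × 2 linear system, so (X'X)⁻¹ can be written out by hand. A Python sketch of that calculation (the slides use R; the helper name is ours):

```python
def normal_equations(x, y):
    """Solve (X'X) beta = X'y for X with rows (1, x_i), inverting the 2x2 X'X."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    sxy = sum(a * b for a, b in zip(x, y))
    # X'X = [[n, sx], [sx, sxx]] and X'y = (sy, sxy)'
    det = n * sxx - sx * sx  # nonzero when X'X is nonsingular (x not constant)
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1
```

This agrees with the covariance-based formulas of the previous slide: det/n is just n·S_xx.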

Simple linear regression - Protocol -
1. Decide which variable is to be y and which is to be x.
2. Plot data, y in the vertical axis.
3. Check evenness of x and y variables by a box-plot.
4. Transform x and/or y if not even.
5. Compute regression; save residuals, fitted values and influence statistics. Calculate the Durbin-Watson statistic if data are in a logical order.
6. Plot studentized or standardized residuals against fitted values (or the x variable). Examine residual plots for outliers, consider rejection of outliers with studentized or standardized residuals > 3, and go to step 5.
7. Compare influence statistics with critical values:
Leverage > 2p/n
DFFITS (absolute value) > 2√(p/n)
Cook's D > 4/n
DFBETAS > 2/√n
where p = number of parameters in the model (number of β's) and n = number of data points in the regression. If two or more influence statistics (among the first three) are greater than the critical values, consider rejecting points and return to step 5.
8. If outliers or leverage points are a problem, consider using a robust regression method.
(Adapted from Fry, 1993)
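The rule-of-thumb critical values of step 7 depend only on p and n, so it is convenient to compute them once per fit. A Python sketch (names are ours, not from the slides):

```python
import math

def influence_cutoffs(p, n):
    """Rule-of-thumb critical values for influence statistics (protocol step 7).
    p = number of parameters in the model, n = number of data points."""
    return {
        "leverage": 2 * p / n,             # hat values
        "dffits":   2 * math.sqrt(p / n),  # |DFFITS|
        "cooks_d":  4 / n,                 # Cook's D
        "dfbetas":  2 / math.sqrt(n),      # |DFBETAS|
    }
```

With p = 2 and n = 20 (the blood-pressure example) this gives a leverage cutoff of 0.2, a DFFITS cutoff of about 0.63 and a Cook's D cutoff of 0.2, the critical values quoted in the output slides below.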

Simple linear regression - Results (1) -
> BLOODP.REG <- lm(BLPRESS ~ AGE); summary(BLOODP.REG)
Residuals:
    Min       1Q   Median       3Q      Max
-4.7908  -1.2777   0.1688   1.2875   2.786
Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept) 112.36660     1.28744    87.28   < 2e-16 ***
AGE           0.44509     0.02777    16.03   4.4e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: β1 = 0 is rejected. For each increment of 1 year, blood pressure increases 0.445 mm Hg:

ŷ = 112.367 + 0.445 x

t(β̂1) = β̂1 / s.e.(β̂1) = 0.44509 / 0.02777 = 16.03

Simple linear regression - Results (2) -
Residual standard error: 2.12 on 18 degrees of freedom
Multiple R-squared: 0.9345, Adjusted R-squared: 0.9309
F-statistic: 256.8 on 1 and 18 DF, p-value: 4.39e-12
> anova(BLOODP.REG)
Analysis of Variance Table
Response: BLPRESS
           Df   Sum Sq  Mean Sq  F value    Pr(>F)
AGE         1   1154.2   1154.2   256.84  4.39e-12 ***
Residuals  18    80.88     4.49
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that F = t². H0 (β1 = 0) is rejected.
R-squared is the square of the correlation coefficient. It represents the fraction of the total variation in blood pressure that is explained by the linear relationship with age. Adj R-Sq includes a correction to overcome the increment in R-squared with the number of regressors (k):

R² = SS_AGE / SS_TOTAL
R²_adj = 1 − (1 − R²)(N − 1) / (N − k − 1)
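The adjustment formula is a one-liner, which makes it easy to check the figures R prints. A Python sketch (function name ours):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1),
    where n is the sample size and k the number of regressors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)
```

Plugging in the values above (R² = 0.9345, n = 20, k = 1) reproduces the reported adjusted R² of 0.9309.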

ANOVA in regression

(y_i − ȳ) = (ŷ_i − ȳ) + (y_i − ŷ_i)
              due to      deviation from
            regression     regression

Squaring and summing on both sides of the equation we can arrive at the following ANOVA table:

Source                     d.f.   S.S.                M.S.                         F
Due to regression           1     β̂1·SP_xy            β̂1·SP_xy                     MS_Reg / MS_Error
Deviations from regression n−2    SS_y − β̂1·SP_xy     (SS_y − β̂1·SP_xy)/(n−2)
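The decomposition in the table can be verified numerically: the regression sum of squares is β̂1·SP_xy and the residual sum of squares is what is left of the total. A Python sketch on a small made-up data set (names ours):

```python
def regression_anova(x, y):
    """Split the total SS into regression and residual parts and form
    F = MS_reg / MS_err with 1 and n - 2 degrees of freedom."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    ss_total = sum((b - ybar) ** 2 for b in y)
    b1 = sxy / sxx
    ss_reg = b1 * sxy            # SS due to regression = b1 * SP_xy
    ss_err = ss_total - ss_reg   # deviations from regression, n - 2 df
    f = (ss_reg / 1) / (ss_err / (n - 2))
    return ss_reg, ss_err, f
```

The two pieces always add back to the total SS, which is the identity the slide's squared-and-summed equation expresses.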

Simple linear regression - Results (3) -
Confidence intervals for β1 (for β0 it is similar):

( β̂1 − t_{α/2} s.e.(β̂1) ,  β̂1 + t_{α/2} s.e.(β̂1) )

This can be done easily with R (both for b0 and b1):
> confint(BLOODP.REG, level=0.95)
                  2.5 %      97.5 %
(Intercept)  109.68594   115.04630
AGE          0.3867409   0.5034373
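The interval is just estimate ± t-quantile × standard error. A Python sketch; the default t quantile 2.101 (0.975 quantile for 18 df, matching n = 20 here) is our assumption, not a value from the slides:

```python
def slope_ci(b1, se_b1, t_crit=2.101):
    """Confidence interval for the slope: b1 +/- t_{alpha/2, n-2} * s.e.(b1).
    t_crit defaults to the 97.5% t quantile for 18 df (n = 20)."""
    half = t_crit * se_b1
    return b1 - half, b1 + half
```

With the AGE estimate 0.44509 and standard error 0.02777 this gives (0.387, 0.503), in line with the confint output above.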

Simple linear regression - Results (4) -
> data.frame(BLOODP, Predicted=fitted(BLOODP.REG), Residuals=resid(BLOODP.REG),
+ RIstudent=rstandard(BLOODP.REG), REstudent=rstudent(BLOODP.REG))
[Table: AGE, BLPRESS, Predicted, Residuals, RIstudent and REstudent for the 20 observations.]
For the first observation, ŷ1 = 112.37 + 0.445·20 = 121.3 and r1 = y1 − ŷ1 = 120 − 121.3 = −1.3.
RIstudent is an internally studentized residual, i.e., the residual divided by its own standard error (not uniform across observations). REstudent is an externally studentized residual. Only observation 11 is a weak outlier.

Some statistics useful for regression analysis
Internally studentized residual:

rs_i = r_i / √( MSE (1 − h_i) ) ~ t_{N−k}

Weak outlier: |rs_i| > 2 (95% confidence). Strong outlier: |rs_i| > 3.
Externally studentized residuals: calculated as the previous one, but removing observation i to calculate the s². Under H0, it follows a t distribution with N−k−1 df.
Leverage (h_i): standardized value of how much an observation deviates from the centre of the space of x values. Observations with high leverage can indicate an outlier in the x and are potentially influential. Computed as the diagonal elements of X(X'X)⁻¹X'.

DFFITS_i = ( ŷ_i − ŷ_{i(−i)} ) / ( s_{(−i)} √h_i ) = R-student_i · √( h_i / (1 − h_i) )

DFBETAS_{j,i} = ( β̂_j − β̂_{j(−i)} ) / ( s_{(−i)} √c_jj )

where c_jj are the diagonal elements of (X'X)⁻¹. Analyse only the DFBETAS corresponding to high values of DFFITS.
Cook's D: essentially a DFFITS statistic scaled and squared to make extreme values stand out more clearly.
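For simple regression the hat-matrix diagonal has a closed form, h_i = 1/n + (x_i − x̄)²/S_xx, which makes the leverage idea concrete: points far from x̄ get large h_i, and the h_i always sum to the number of fitted parameters (2 here). A Python sketch (name ours):

```python
def leverages(x):
    """Hat-matrix diagonal for simple linear regression:
    h_i = 1/n + (x_i - xbar)^2 / S_xx.  The h_i sum to p = 2."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    return [1 / n + (a - xbar) ** 2 / sxx for a in x]
```

The end points of the x range carry the highest leverage, which is why extreme x values are the ones to watch as potentially influential.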

Simple linear regression - Results (5) -
> influence.measures(BLOODP.REG)
[Table: dfb.1_, dfb.AGE, dffit, cov.r, cook.d and hat values for the 20 observations; observations 10 and 11 are flagged (*).]
All values of DFFIT are below the critical value 0.63 (= 2√(2/20)), i.e., there are no influential observations on the predicted values. DFBETAS (dfb.) test influence on the parameter estimates, and do not need to be examined because DFFIT values are low. Cook's D values are below the critical value 0.2 (= 4/20). Leverage is presented in hat; all values are lower than 0.2, the critical value.

Criteria to flag an observation as influential in R
In the protocol slide (slide 12) we presented some critical points to decide if an observation can be influential or not. These critical points are not statistical tests but rules of thumb; furthermore, there is no agreement among statisticians on the values. In fact, R puts a flag (a star) on an observation when:
- any of its absolute dfbetas values is greater than 1, or
- its absolute dffits value is greater than 3√(p/(n−p)), or
- abs(1 − covratio) is greater than 3p/(n−p), or
- its Cook's distance is greater than the 50% percentile of an F-distribution with p and n−p degrees of freedom, or
- its hat value is greater than 3p/n
where p denotes the number of model coefficients, including the intercept.

Some graphics about influential observations
[Four schematic plots of y against x: high leverage, influential; high leverage, not influential; low leverage, influential; low leverage, not influential.]

Simple linear regression - diagnostics -
> layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
> plot(BLOODP.REG)
[Four diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours.]
Residuals are distributed approximately at random: homogeneity of variance met. No important deviations in the Q-Q plot: response variable normal. None of the points approach the high Cook's distance contour(s): none of the observations are influential.

Some plots of residuals
[Three schematic plots of residuals e against fitted values ŷ: an ideal residual plot (random distribution around 0); a curved pattern indicating the model should involve curvature; and a funnel shape indicating heterogeneous variance.]

Simple linear regression - Regression line and CL -
[Plot of blood pressure (mm Hg) against age (years), showing the observations, the regression line Blood pressure = 0.44·Age + 112.3 with R² = 0.935, and the 95% upper and lower confidence limits.]

s_ŷ = √( MS_RES ( 1/n + (x − x̄)² / SS_xx ) )

Note that for greater values of x the standard error of predicted values is greater, and thus so are the CL. This is little distinguishable when the prediction is made in the interval of the observed x's.
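The widening of the confidence limits away from x̄ follows directly from the standard-error formula above. A Python sketch that evaluates it (names ours; MS_res is passed in as a number):

```python
import math

def se_fit(x0, x, mse):
    """Standard error of the fitted mean at x0:
    sqrt(MS_res * (1/n + (x0 - xbar)^2 / S_xx)).
    Smallest at x0 = xbar; grows as x0 moves away from xbar."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    return math.sqrt(mse * (1 / n + (x0 - xbar) ** 2 / sxx))
```

Evaluating it at the mean of x and at the edge of the x range shows the funnel shape of the confidence band.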

Simple linear regression - program of the graphic -
> ## Summary scatterplot
> # create a plot with solid dots (pch=16) and no axis or labels
> plot(BLPRESS~AGE, pch=16, axes=F, xlab="", ylab="")
> # put the x-axis (axis 1) with smaller label font size
> axis(1, cex.axis=.8)
> # put the x-axis label 3 lines down from the axis
> mtext(text="Age (years)", side=1, line=3)
> # put the y-axis (axis 2) with horizontal tick labels
> axis(2, las=1)
> # put the y-axis label 3 lines to the left of the axis
> mtext(text="Blood pressure (mm Hg)", side=2, line=3)
> # add the regression line from the fitted model
> abline(BLOODP.REG)
> # add the regression formula
> text(50,145,"Blood pressure = 0.44*Age+112.3", pos=2)
> # add the r squared value
> text(50,143,expression(paste(R^2==0.935)), pos=2)
> # create a sequence of 1000 numbers spanning the range of ages
> x<-seq(min(AGE), max(AGE), l=1000)
> # for each value of x, calculate the upper and lower 95% confidence
> y<-predict(BLOODP.REG, data.frame(AGE=x), interval="c")
> # plot the upper and lower 95% confidence limits
> matlines(x,y, lty=2, col=2)
> # put an L-shaped box to complete the axis
> box(bty="l")

References
Fry, J.C. 1993. Biological Data Analysis. IRL Press, Oxford.