Correlation. Bivariate normal densities with ρ = 0 (two-dimensional / bivariate normal density with correlation 0)
Correlation

Bivariate normal densities with ρ = 0.

Example: obesity index and blood pressure of n people randomly chosen from a population.

Correlation?
In everyday language: some sort of a relationship.
In mathematical language: a well-defined parameter.

Model: a sample from the distribution of (X, Y): (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n), assumed to come from a two-dimensional (bivariate) normal distribution:

$$ \begin{pmatrix} X_i \\ Y_i \end{pmatrix} \sim N_2\left( \begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \sigma_x^2 & \rho\sigma_x\sigma_y \\ \rho\sigma_x\sigma_y & \sigma_y^2 \end{pmatrix} \right) $$

ρ = ρ_xy is called the correlation; ρσ_xσ_y is called the covariance.
Bivariate densities with contour plots.

Example (contd.). Variables:
OBESE: obesity index, i.e. weight / ideal weight
BP: systolic blood pressure

Data set: one row per person, with columns OBS, SEX, OBESE, BP (the numeric values were not preserved in this transcription).

Scatter plot (different symbols for each sex).
Scatter plot after logarithmic transformation.

The correlation measures: to what extent does the plot look like a straight line? Not: how near the points are to the straight line.

The coefficient of correlation is estimated by

$$ r = r_{xy} = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{\sum_{i=1}^n (x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum_{i=1}^n (x_i-\bar x)^2 \sum_{i=1}^n (y_i-\bar y)^2}} $$

where

$$ S_{xy} = \sum_{i=1}^n (x_i-\bar x)(y_i-\bar y), \qquad S_{xx} = \sum_{i=1}^n (x_i-\bar x)^2, \qquad S_{yy} = \sum_{i=1}^n (y_i-\bar y)^2. $$

r assumes values between -1 and 1:
0 corresponds to independence;
+1 and -1 correspond to a perfect linear relationship;
r > 0 (r < 0): positive (negative) slope.

Test of independence (no correlation), H_0: ρ_xy = 0. Given a sample: is r_xy = S_xy / √(S_xx S_yy) near 0? This is measured by

$$ T = \frac{r_{xy}\,\sqrt{n-2}}{\sqrt{1-r_{xy}^2}}. $$

Under H_0, i.e. if ρ_xy is equal to 0, T has a t distribution with n - 2 degrees of freedom.

Whether the correlation is significantly different from 0 depends on:
the magnitude of the true correlation ρ_xy;
the number of observations n;
chance.
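As a concrete illustration of the formulas above, here is a small Python sketch (my own code, not part of the course material, which uses SAS Analyst) computing r and the test statistic T:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

def cor_t_statistic(x, y):
    """T = r * sqrt(n - 2) / sqrt(1 - r^2); ~ t_{n-2} under H0: rho = 0."""
    r = pearson_r(x, y)
    return r * sqrt(len(x) - 2) / sqrt(1 - r ** 2)
```

For instance, for the five points (1,2), (2,4), (3,5), (4,4), (5,5) one gets r ≈ 0.775 and T ≈ 2.12 on 3 degrees of freedom.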
The above correlation coefficient is based on the bivariate normal distribution: the so-called Pearson correlation.

SAS Analyst: Statistics/Descriptive/Correlations. Output: Correlation Analysis, 2 VAR Variables: LOBESE LBP; Simple Statistics (N, Mean, Std Dev, Sum, Minimum, Maximum for LOBESE and LBP); Pearson Correlation Coefficients / Prob > R under Ho: Rho=0 / N = 102. (The numeric values were not preserved in this transcription.)

Nonparametric correlation coefficients: Spearman's ρ and Kendall's τ.

SAS Analyst: Statistics/Descriptive/Correlations, Options, tick ... Output: Simple Statistics; Spearman Correlation Coefficients and Kendall Tau b Correlation Coefficients, each with Prob > R under Ho: Rho=0, N = 102. (Values again not preserved.)

Spearman's rank correlation: each variable is rank ordered on its own (obese → robese, bp → rbp). This gives the rank differences d_i (dif) for each person.

Estimated correlation coefficient (no ties):

$$ r_s = 1 - \frac{6\sum_i d_i^2}{n^3 - n} $$

r_s = 1 (-1): the scatter plot is strictly increasing (decreasing); not necessarily linear.

For the example we obtain: n = 102 and r_s = 0.30 (the value of Σ d_i² was not preserved).

Correction for ties: complicated, but ...
Correction for ties (Spearman): use the formula

$$ r_s = \frac{S_{r_x r_y}}{\sqrt{S_{r_x r_x}\,S_{r_y r_y}}} $$

where r_x and r_y denote the rank values robese and rbp, respectively.

Test of independence (no correlation), H_0: ρ_s = 0:

$$ T = \frac{r_s\,\sqrt{n-2}}{\sqrt{1-r_s^2}} $$

is approximately t_{n-2} distributed under H_0. The approximation holds for n ≥ 30; otherwise use tables.

Here: t = 3.14, n = 102, i.e. use t_100: p = 0.002.
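A Python sketch of Spearman's coefficient (my own illustration; all names are mine). It implements both the no-ties formula and the tie-corrected version, i.e. the Pearson correlation of the rank values; for brevity the ranking helper does not compute midranks, so both functions assume there are no ties:

```python
from math import sqrt

def ranks(v):
    """Ranks 1..n; assumes no ties (midranks are not handled)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_no_ties(x, y):
    """r_s = 1 - 6 * sum(d_i^2) / (n^3 - n)."""
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    n = len(x)
    return 1 - 6 * d2 / (n ** 3 - n)

def spearman_as_pearson(x, y):
    """Tie-corrected form: the Pearson correlation of the rank values."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)
```

Without ties the two functions agree exactly; only the rank order of the data matters, not the actual values.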
Regression

Relationship between 2 continuous variables. Not necessarily causality!

Purpose of regression analysis:
prediction;
test of relationship;
estimation;
correction, when comparing inhomogeneous groups.

Y: response variable, dependent variable.
X: explaining variable, covariate.

DATA: paired observations of X and Y per row (individuals / units): (x_i, y_i), i = 1, ..., n. Note: the x_i's can be chosen beforehand!

Examples:

1. Relationship between cholinesterase activity (CE) and time till awakening (TIME).
Response: TIME. Explaining variable: CE.
Questions: How long is the expected time till awakening for a given value of CE? How large is the uncertainty of this prediction?

2. Comparison of lung capacity (FEV_1) for smokers and non-smokers.
Problem: FEV_1 also depends on, e.g., height.
Response: FEV_1. Explaining variables: height, smoking habits.
Question: How much worse is the lung function in smokers?
Example (DGA p. 300): Is there a relationship between fasting blood glucose level (blodsuk) and mean velocity of circumferential shortening of the left heart ventricle (vcf) in diabetics? (n = 23)

Response: Y = vcf, %/sec. Covariate: X = blodsuk, mmol/l.

Scatter plot: Graphs/Scatter Plot/Two-Dimensional; blodsuk → X Axis, vcf → Y Axis (here one can also choose titles for the axes).

(Data listing with columns OBS, BLODSUK, VCF; the values were not preserved in this transcription.)

Model for a straight line: Y(X) = α + βX.

Interpretation:
α: intercept (intersection with the Y axis), e.g. the vcf of a diabetic with blood glucose value 0. Often an inadmissible extrapolation!
β: slope, regression coefficient, e.g. the difference in vcf between two diabetics who differ in their blood glucose values by 1 mmol/l. Often the parameter of greatest interest.
Statistical model:

Y_i = Y(X_i) = α + βX_i + ε_i,  ε_i ~ N(0, σ²), independent.

Solution / least squares estimators: estimation of α and β is done via the least squares method. Determine α and β such that the sum of the squared vertical deviations,

$$ \sum_{i=1}^n \big(y_i - (\alpha + \beta x_i)\big)^2 = \sum_{i=1}^n \varepsilon_i^2, $$

gets as small as possible.

Slope:

$$ \hat\beta = \frac{S_{xy}}{S_{xx}} = \frac{s_{xy}}{s_x^2}, $$

where the empirical covariance

$$ s_{xy} = \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)(y_i-\bar y) = \frac{S_{xy}}{n-1} $$

is a measure of the co-variation between the observed X and Y values, and

$$ s_x^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2 = \frac{S_{xx}}{n-1} $$

is the usual variance estimator for the X values.

Intercept: α̂ = ȳ - β̂ x̄.

Example, vcf vs. blood glucose: α̂ = 1.10, β̂ = 0.022, so the estimated regression line is vcf = 1.10 + 0.022 · blodsuk.

SAS Analyst: in the regression setting (cf. below), click Statistics and there tick Plot observed vs. independent.
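The least squares formulas translate directly into code; a minimal Python sketch (my own, for illustration):

```python
def least_squares(x, y):
    """Returns (alpha_hat, beta_hat): beta = Sxy/Sxx, alpha = ybar - beta*xbar."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    beta = sxy / sxx
    alpha = my - beta * mx
    return alpha, beta
```

On data lying exactly on y = 1 + 2x it recovers α̂ = 1 and β̂ = 2; on noisy data it returns the line minimizing the squared vertical deviations.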
Regression analysis in SAS Analyst: Statistics/Regression/Simple; vcf → Dependent, blodsuk → Explanatory; under Statistics tick Confidence limits for estimates and Correlation matrix of estimates.

Output: Dependent Variable: vcf. Analysis of Variance (Source = Model, Error, Corrected Total, with DF, Sum of Squares, Mean Square, F Value, Pr > F); Root MSE, R-Square, Dependent Mean, Adj R-Sq, Coeff Var; Parameter Estimates for Intercept and blodsuk (Estimate, Standard Error, t Value, Pr > |t|, 95% Confidence Limits); Correlation of Estimates. (The numeric values were not preserved in this transcription.)

Re-parametrization: often there is no good interpretation of α̂. A good idea is to re-parametrize / use a new explaining variable, e.g. regression of Y on Z, where Z = X - 10:

Y(Z) = α' + βZ = α' + β(X - 10) = (α' - 10β) + βX,   (1)

thus α = α' - 10β, i.e. α' = α + 10β is the y value of the original line at x = 10.

Interpretation of α' in the example: the vcf of a diabetic with blood glucose 10 mmol/l.
Realization in SAS Analyst: create a new variable sukker10 via Data/Transform/Compute: sukker10 = blodsuk - 10; then run the regression with blodsuk replaced by sukker10. (Parameter Estimates for Intercept and sukker10, 95% Confidence Limits, and Correlation of Estimates; the values were not preserved in this transcription.)

Variance estimation: the estimate of σ², i.e. the variance around the regression line, is

$$ \hat\sigma^2 = s^2 = \frac{1}{n-2}\sum_{i=1}^n (y_i - \hat\alpha - \hat\beta x_i)^2 \;\Big(= \frac{S_{yy} - \hat\beta S_{xy}}{n-2}\Big). $$

The estimate of the standard deviation around the regression line, the residual standard deviation (here called Root Mean Square Error), is σ̂ = s = √(s²).

How good / precise are the estimates of the unknown parameters α and β?

Slope: it can be shown that

$$ \hat\beta \sim N\Big(\beta, \frac{\sigma^2}{S_{xx}}\Big), $$

i.e. we get a precise estimate of the slope if
the observations are close to the line;
the variation in the x values is large.

The estimate s is used instead of σ, so

$$ \widehat{SE}(\hat\beta) = \frac{s}{\sqrt{S_{xx}}}. $$

This is the estimated standard error of β̂.
Intercept: similarly,

$$ \hat\alpha \sim N\Big(\alpha,\; \sigma^2\Big(\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\Big)\Big), \qquad \text{thus} \qquad \widehat{SE}(\hat\alpha) = s\sqrt{\frac{1}{n} + \frac{\bar x^2}{S_{xx}}}. $$

Note: the two estimates α̂ and β̂ are correlated:

$$ \mathrm{Cov}(\hat\alpha, \hat\beta) = -\bar x\,\mathrm{Var}(\hat\beta) = -\frac{\sigma^2 \bar x}{S_{xx}}. $$

If we center the covariates, i.e. use z_i = x_i - x̄ instead of x_i (call the new intercept α'), we get the estimates β̂ = S_xy / S_xx and α̂' = ȳ. These estimators are independent!

Tests and confidence intervals for the slope. Typical null hypothesis: H_0: β = 0. Test statistic:

$$ T = \frac{\hat\beta}{\widehat{SE}(\hat\beta)} \sim t_{n-2}. $$

95% confidence interval: β̂ ± t_{97.5%, n-2} · SE(β̂).

In the example we get: β̂ = 0.0220, s² = 0.0470, SE(β̂) = 0.0105, n = 23, t_{97.5%, 21} = 2.080, t = 0.0220/0.0105 = 2.10 ~ t_21, p = 0.048. 95% confidence interval: 0.0220 ± 2.080 · 0.0105 = (0.0002, 0.0438).

Tests and confidence intervals for the intercept. Null hypothesis: H_0: α = α_0. Test statistic:

$$ T = \frac{\hat\alpha - \alpha_0}{\widehat{SE}(\hat\alpha)} \sim t_{n-2}. $$

95% c.i. for α: α̂ ± t_{97.5%, n-2} · SE(α̂), e.g. (0.854, 1.342). However, this is not so interesting... Instead, we could replace blodsuk by blodsuk - 10; then the new intercept estimate would be 1.317 (0.045), with 95% c.i. 1.317 ± 2.080 · 0.045 = (1.223, 1.411). This can be interpreted...
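A Python sketch (my own, not from the course) of the slope inference above; the t quantile has to be supplied from a table, since the Python standard library has no t distribution:

```python
from math import sqrt

def slope_inference(x, y, t_crit):
    """Returns (beta, se, t, (lo, hi)):
    s^2 = SS_resid/(n-2), SE(beta) = s/sqrt(Sxx), CI = beta +/- t_crit*SE,
    where t_crit is the 97.5% quantile of t_{n-2}, looked up externally."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = my - beta * mx
    s = sqrt(sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    se = s / sqrt(sxx)
    t = beta / se
    return beta, se, t, (beta - t_crit * se, beta + t_crit * se)
```

The t statistic returned here coincides with the correlation test statistic T from the correlation pages: both test the same null hypothesis of no linear relationship.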
We can also test hypotheses on both α and β. But: don't rely on two tests at the same time; don't accept two parallel hypotheses.

Fitted / predicted values: ŷ(x) = α̂ + β̂x.

Moreover, we can construct:
a confidence interval for the line itself, in order to compare with other groups of people; constructed with the help of SE(α̂) and SE(β̂) and their mutual covariance;
a prediction interval (normal region) for single observations, to use as a diagnostic tool; constructed with the help of SE(α̂) and SE(β̂), their mutual covariance, and s².

Confidence intervals for the line ŷ(x_0) = α̂ + β̂x_0:

$$ \mathrm{Var}(\hat y(x_0)) = \sigma^2\Big(\frac{1}{n} + \frac{(x_0-\bar x)^2}{S_{xx}}\Big) $$

Large uncertainty if x_0 is far from x̄; the interval is narrowest at x_0 = x̄.

95% confidence intervals (pointwise, i.e. for each x_0):

$$ \hat\alpha + \hat\beta x_0 \pm t_{97.5\%,\,n-2}\; s\,\sqrt{\frac{1}{n} + \frac{(x_0-\bar x)^2}{S_{xx}}} $$

These limits get arbitrarily narrow as the sample size increases. This is often irrelevant!
Regression line with confidence intervals in SAS Analyst: in the regression setting, click Statistics and tick Plot observed vs. independent and Confidence limits.

Prediction intervals: in which region will typical observations of y = α + βx + ε lie, given x = x_0?

$$ \mathrm{Var}\big(y(x_0) - \hat y(x_0)\big) = \sigma^2\Big(1 + \frac{1}{n} + \frac{(x_0-\bar x)^2}{S_{xx}}\Big) $$

95% prediction intervals (pointwise):

$$ \hat\alpha + \hat\beta x_0 \pm t_{97.5\%,\,n-2}\; s\,\sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar x)^2}{S_{xx}}} $$

Prediction limits in SAS Analyst: in the regression setting, click Statistics and tick Plot observed vs. independent and Prediction limits.

Interpretation: the prediction intervals include about 95% of the future observations, also for large n. These limits don't get much narrower as the number of observations increases. They are used to assess whether a new person is atypical compared to the norm. Again, they are narrowest at x_0 = x̄.
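The two kinds of interval can be compared numerically; a Python sketch (mine) returning the pointwise 95% half-widths of the confidence and prediction bands at a given x_0:

```python
from math import sqrt

def band_halfwidths(x, y, x0, t_crit):
    """Half-widths at x0: confidence band t*s*sqrt(1/n + (x0-xbar)^2/Sxx)
    and prediction band t*s*sqrt(1 + 1/n + (x0-xbar)^2/Sxx)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = my - beta * mx
    s2 = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    g = 1 / n + (x0 - mx) ** 2 / sxx
    return t_crit * sqrt(s2 * g), t_crit * sqrt(s2 * (1 + g))
```

The prediction half-width is always the larger of the two (the extra "1 +" term), and both are smallest at x_0 = x̄.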
Analysis of variance scheme in regression analysis. Underlying question: does x have an important impact as an explaining variable? Estimate models with and without the explaining variable x.

Residual sum of squares (SS):
Without x: Σ_{i=1}^n (y_i - ȳ)² = S_yy = SS_total.
With x: Σ_{i=1}^n (y_i - (α̂ + β̂x_i))² = SS_resid.

x is a good explaining variable if SS_resid is small compared to SS_total.

Partition of the variation:

SS_total = Σ (y_i - ȳ)² = SS_resid + SS_model

Total variation = variation which cannot be explained + variation which can be explained.
Degrees of freedom: (n-1) = (n-2) + 1.

Null hypothesis: H_0: β = 0. Test statistic:

$$ F = \frac{SS_{model}/1}{SS_{resid}/(n-2)} \sim F_{1,\,n-2} \text{ under } H_0. $$

Note: T = β̂ / SE(β̂) = √F. Here: f = 4.414 = t² (with t = 2.10).

Coefficient of determination, R²: the proportion of the variation explained by the model compared to the total variation (in y):

$$ R^2 = 1 - \frac{SS_{resid}}{SS_{total}} = \frac{SS_{model}}{SS_{total}} = \Big(\frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}\Big)^2 = r^2 $$

For simple linear regression: the square of Corr(x, y), i.e. the degree of linear relationship. For multiple regression models: the square of Corr(ŷ, y). Here: R² = 0.17 (r = 0.42).
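A Python sketch (my own) of the partition and of the identities F = t² and R² = r²:

```python
def regression_anova(x, y):
    """Returns (F, R2): SS_total = SS_model + SS_resid,
    F = (SS_model/1) / (SS_resid/(n-2)), R2 = SS_model/SS_total."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = my - beta * mx
    ss_total = sum((yi - my) ** 2 for yi in y)
    ss_resid = sum((yi - alpha - beta * xi) ** 2 for xi, yi in zip(x, y))
    ss_model = ss_total - ss_resid
    return ss_model / (ss_resid / (n - 2)), ss_model / ss_total
```

For example, for the five points (1,2), (2,4), (3,5), (4,4), (5,5) one gets F = 4.5 and R² = 0.6, matching t ≈ 2.121 (t² = 4.5) and r ≈ 0.775 (r² = 0.6).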
Regression vs. correlation. If (X, Y) is bivariate normally distributed, then we can calculate the conditional distribution of Y given X = x. It can be seen that this is again a normal distribution: E(Y | X = x) is linear in x, and Var(Y | X = x) is independent of x. This means that one can perform a linear regression analysis of Y on X, as well as calculate a correlation coefficient:

$$ \hat\beta = \frac{S_{xy}}{S_{xx}}, \qquad r_{xy} = \frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}, \qquad \hat\beta = r_{xy}\sqrt{\frac{S_{yy}}{S_{xx}}}. $$

The test for β = 0 is identical to the test for ρ_xy = 0.

But take care: if β̂ and s² are fixed, then

$$ 1 - r_{xy}^2 = \frac{s^2}{s^2 + \hat\beta^2\,\frac{S_{xx}}{n-2}}. $$

If S_xx is large, then 1 - r²_xy is near 0, and r²_xy is near 1. Thus r²_xy can be arbitrarily close to 1 in case of strongly varying x values, e.g. if the central ones are left out. The correlation is irrelevant if the x values are influenced (chosen by design)!
Spurious correlation.

The correlation coefficient expresses a relationship, not agreement (e.g. there is a relationship between age and blood pressure, but of course there is no agreement).

The number itself is only meaningful if we have sampled randomly from a well-defined population. For Pearson's correlation this population should be well described by a bivariate normal distribution. In case of selective sampling the number can be manipulated and (in theory) get arbitrarily close to 1 (or -1).

Pearson's correlation measures the degree of a linear relationship. For nonlinear relationships one should use rank correlations (Spearman).

The test of H_0: ρ = 0 (independence) is OK if the conditions for a linear regression are fulfilled.

A statistically significant correlation can be theoretically interesting, but clinically uninteresting. The existence of a significant correlation between two variables does not necessarily mean that there is a causal relationship between them. For example, X and Y can be positively correlated for men and positively correlated for women, yet negatively correlated for human beings as a whole.
(Continuing the example: X and Y are apparently positively correlated, but uncorrelated within each age group; X and Y both increase with age.)

Misuse of correlations. The correlation coefficient is very often used to measure relationships between two variables, but:
the correlation coefficient expresses relationships, not agreement;
the correlation depends on the selection of the patients;
when comparing two measurement methods it is a completely senseless conclusion just to state that there is a significant relationship. Of course there is one, since the same thing was measured twice!

Model checking in simple linear regression. The statistical model was

Y_i = α + βx_i + ε_i,  ε_i ~ N(0, σ²), independent.

What should we check here?
linearity;
independence between the ε_i;
variance homogeneity (constant σ²);
normally distributed errors ε_i.

To this end we use the residuals (model deviations; observed minus fitted values): ε̂_i = y_i - ŷ_i, used mainly for graphical model checking. Note: there is no assumption of normality for the x_i!
We have assumed that the ε_i ~ N(0, σ²) are independent, so we would expect the same to hold for the residuals ε̂_i = y_i - ŷ_i. This is not true!
They are not independent (they sum to 0); this doesn't mean much if there are sufficiently many observations.
They do not all have the same variance: Var(ε̂_i) = σ²(1 - h_ii), where

$$ h_{ii} = \frac{1}{n} + \frac{(x_i-\bar x)^2}{S_{xx}} $$

is the leverage of the i-th observation.

Normalized / studentized residuals:

$$ r_i = \frac{\hat\varepsilon_i}{s\sqrt{1-h_{ii}}}, \qquad \mathrm{Var}(r_i) \approx 1. $$

Residual plots: the residuals ε̂_i or r_i are plotted vs.
the explaining variable x_i, to check linearity;
the fitted values ŷ_i, to check variance homogeneity and normality of the errors;
time, or consecutively, to check independence;
normal scores (i.e. a probability plot) or a histogram, to check normality.

The first three should give an impression of disorder (evenly scattered values, nothing systematic). The probability plot should fit a straight line.

SAS Analyst (variance homogeneity?): certain plots can be produced directly in the regression setting, by clicking Plots/Residual and then choosing Residual vs. Predicted.
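Leverage and studentized residuals follow directly from the formulas above; a Python sketch (mine, for illustration):

```python
from math import sqrt

def leverages(x):
    """h_ii = 1/n + (x_i - xbar)^2 / Sxx."""
    n = len(x)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    return [1 / n + (xi - mx) ** 2 / sxx for xi in x]

def studentized(x, y):
    """r_i = e_i / (s * sqrt(1 - h_ii))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = my - beta * mx
    resid = [yi - alpha - beta * xi for xi, yi in zip(x, y)]
    s = sqrt(sum(e ** 2 for e in resid) / (n - 2))
    return [e / (s * sqrt(1 - h)) for e, h in zip(resid, leverages(x))]
```

In simple linear regression the leverages always sum to 2 (the number of parameters in the model), so their average is 2/n; this gives a quick sanity check.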
What happens if the assumptions don't hold?

Linearity: the model gets uninterpretable. Remedies: transformation, more explaining variables, non-linear regression.
Variance homogeneity: estimation is inefficient (estimates have unnecessarily large variance). Remedies: transformation, weighted regression.
Independence: the variance estimate gets wrong. Difficult (repeated measurements).
Normally distributed errors: estimation is inefficient (a little). Remedies: transformation, robust regression.

If linearity is dubious:
add more covariates, e.g. a quadratic term blodsuk²: vcf = α + β₁·blodsuk + β₂·blodsuk², with the test of linearity H_0: β₂ = 0 (or another covariate, e.g. alder = age);
transform variables by logarithms, square root, or inverse;
non-linear regression.
Model with quadratic term. Output: Dependent Variable: vcf; Analysis of Variance (Model, Error, Corrected Total); Root MSE, R-Square, Dependent Mean, Adj R-Sq, Coeff Var; Parameter Estimates and 95% Confidence Limits for Intercept, sukker10 and blodsuk_i_anden (= blodsuk²); Correlation of Estimates. (The numeric values were not preserved in this transcription.)

Variance homogeneity (homoscedasticity): Var(ε_i) = σ², i = 1, ..., n, i.e. constant variance (or standard deviation). Which alternatives could there be?

Constant relative standard deviation, i.e. constant coefficient of variation (CV):

CV = standard deviation / mean value

This is often constant if one measures small positive quantities, e.g. concentrations; it will cause a trumpet shape in the residual plot; transform by logarithm.

Stratified experiment, e.g. in case of several instruments or laboratories: differences in variances can be checked with Bartlett's test (cf. next week).

Normally distributed errors: not critical for the fit itself; the least squares method yields the best estimate at any rate. The t distribution is based on the normality assumption, but actually on the normality assumption for the estimate β̂, and this is often okay in case of sufficiently many observations, due to the central limit theorem, which states that sums (and certain other functions) of many observations get more and more normally distributed.
Transformation: logarithm, square root, inverse. Why take logarithms?

Of the explaining variable, to achieve linearity: if there are successive doublings which have a constant effect, use logarithms to the base 2!

Of the response variable, to achieve linearity, or to achieve variance homogeneity:

$$ \mathrm{Var}(\log(y)) \approx \frac{\mathrm{Var}(y)}{y^2}, $$

i.e. a constant coefficient of variation of Y means a constant variance of log(Y) (the natural logarithm, to the base e).

Regression diagnostics: are the conclusions supported by the whole data set? Or are there observations with a rather large influence on the results?

Leverage = potential influence (from the hat matrix):

$$ h_{ii} = \frac{1}{n} + \frac{(x_i-\bar x)^2}{S_{xx}} $$

Observations with extreme x values can have a large influence on the results, but they do not necessarily: if they lie nicely with respect to the regression line, i.e. have a small residual, they need not affect the fit much.
Influential observations: those which have a combination of high leverage and a large residual.

Regression diagnostics: leave out the i-th person and find new estimates α̂_(i) and β̂_(i). Calculate Cook's distance, an aggregate measure of the changes in the parameter estimates. Split Cook's distance into its coordinates and specify: by how many SEs does β̂ change, e.g., if the i-th person is left out?

What to do with influential observations? Leave them out? State a measure of their influence?

Regression with the whole data set: ŷ(x) = 1.10 + 0.022·x, β̂ = 0.022 (0.010), t = 2.1, p = 0.048.
Regression without observation no. 13: β̂ = 0.011 (0.010), t = 1.05, i.e. far from significant.
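The leave-one-out idea behind Cook's distance and dfbetas is easy to sketch in Python (my own illustration of the principle, not the exact SAS computation):

```python
def loo_slopes(x, y):
    """beta_hat_(i): the slope re-estimated with observation i left out."""
    def slope(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        sxx = sum((a - mx) ** 2 for a in xs)
        return sxy / sxx
    return [slope(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(len(x))]
```

Comparing each β̂_(i) with β̂ (scaled by SE(β̂)) gives a dfbetas-type measure. For points lying exactly on a line, leaving any one out changes nothing; an influential point shifts the slope noticeably when removed.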
(Figure: changes in the parameter estimates and in the predicted values when the i-th observation is left out; dfbetas(lm.velo)[, 1], dfbetas(lm.velo)[, 2] and dffits(lm.velo) plotted against blood.glucose.)

Outliers: observations which don't fit into the relationship. They are not necessarily influential, and they don't necessarily have a large residual.

Predicted residuals: the residuals obtained at each x_i if the corresponding observation (x_i, y_i) is excluded from the estimation; used for detecting outliers. PRESS: Predicted Residuals SS.

What to do with outliers? Look more closely at them; they are often quite interesting. When can we exclude them? If they lie quite far away, i.e. have high leverage (remember to qualify the conclusions accordingly!), or if one can find the reason; and then all observations with that reason should be excluded!
Model checking and diagnostics in SAS Analyst: in the regression setting, use Save Data; tick Create and save diagnostics data; insert (click Add) the quantities to be saved (typically: Predicted, Residual, Student, Rstudent, Cookd, Press). Double-click the Diagnostics Table in the project tree, and save it by clicking File/Save as By SAS Name.
Faculty of Health Sciences
Statistics for exp. medical researchers: Regression and Correlation
Lene Theil Skovgaard, Sept. 28, 2015
Linear regression, estimation and testing, confidence ...
More informationApplied Statistics and Econometrics
Applied Statistics and Econometrics Lecture 6 Saul Lach September 2017 Saul Lach () Applied Statistics and Econometrics September 2017 1 / 53 Outline of Lecture 6 1 Omitted variable bias (SW 6.1) 2 Multiple
More informationInference for Regression Inference about the Regression Model and Using the Regression Line
Inference for Regression Inference about the Regression Model and Using the Regression Line PBS Chapter 10.1 and 10.2 2009 W.H. Freeman and Company Objectives (PBS Chapter 10.1 and 10.2) Inference about
More information6. Multiple regression - PROC GLM
Use of SAS - November 2016 6. Multiple regression - PROC GLM Karl Bang Christensen Department of Biostatistics, University of Copenhagen. http://biostat.ku.dk/~kach/sas2016/ kach@biostat.ku.dk, tel: 35327491
More information9 Correlation and Regression
9 Correlation and Regression SW, Chapter 12. Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then retakes the
More informationUnit 6 - Introduction to linear regression
Unit 6 - Introduction to linear regression Suggested reading: OpenIntro Statistics, Chapter 7 Suggested exercises: Part 1 - Relationship between two numerical variables: 7.7, 7.9, 7.11, 7.13, 7.15, 7.25,
More informationLinear Regression. Simple linear regression model determines the relationship between one dependent variable (y) and one independent variable (x).
Linear Regression Simple linear regression model determines the relationship between one dependent variable (y) and one independent variable (x). A dependent variable is a random variable whose variation
More informationMAT2377. Rafa l Kulik. Version 2015/November/26. Rafa l Kulik
MAT2377 Rafa l Kulik Version 2015/November/26 Rafa l Kulik Bivariate data and scatterplot Data: Hydrocarbon level (x) and Oxygen level (y): x: 0.99, 1.02, 1.15, 1.29, 1.46, 1.36, 0.87, 1.23, 1.55, 1.40,
More information14 Multiple Linear Regression
B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 14 Multiple Linear Regression 14.1 The multiple linear regression model In simple linear regression, the response variable y is expressed in
More informationWORKSHOP 3 Measuring Association
WORKSHOP 3 Measuring Association Concepts Analysing Categorical Data o Testing of Proportions o Contingency Tables & Tests o Odds Ratios Linear Association Measures o Correlation o Simple Linear Regression
More informationSimple Linear Regression Analysis
LINEAR REGRESSION ANALYSIS MODULE II Lecture - 6 Simple Linear Regression Analysis Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur Prediction of values of study
More informationStatistical Modelling in Stata 5: Linear Models
Statistical Modelling in Stata 5: Linear Models Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 07/11/2017 Structure This Week What is a linear model? How good is my model? Does
More informationINTRODUCING LINEAR REGRESSION MODELS Response or Dependent variable y
INTRODUCING LINEAR REGRESSION MODELS Response or Dependent variable y Predictor or Independent variable x Model with error: for i = 1,..., n, y i = α + βx i + ε i ε i : independent errors (sampling, measurement,
More informationCorrelation and Regression
Correlation and Regression 1 Overview Introduction Scatter Plots Correlation Regression Coefficient of Determination 2 Objectives of the topic 1. Draw a scatter plot for a set of ordered pairs. 2. Compute
More information9. Linear Regression and Correlation
9. Linear Regression and Correlation Data: y a quantitative response variable x a quantitative explanatory variable (Chap. 8: Recall that both variables were categorical) For example, y = annual income,
More informationSTATISTICS 479 Exam II (100 points)
Name STATISTICS 79 Exam II (1 points) 1. A SAS data set was created using the following input statement: Answer parts(a) to (e) below. input State $ City $ Pop199 Income Housing Electric; (a) () Give the
More informationSCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models
SCHOOL OF MATHEMATICS AND STATISTICS Linear and Generalised Linear Models Autumn Semester 2017 18 2 hours Attempt all the questions. The allocation of marks is shown in brackets. RESTRICTED OPEN BOOK EXAMINATION
More informationMidterm 2 - Solutions
Ecn 102 - Analysis of Economic Data University of California - Davis February 23, 2010 Instructor: John Parman Midterm 2 - Solutions You have until 10:20am to complete this exam. Please remember to put
More informationInferences for Regression
Inferences for Regression An Example: Body Fat and Waist Size Looking at the relationship between % body fat and waist size (in inches). Here is a scatterplot of our data set: Remembering Regression In
More informationSTAT2012 Statistical Tests 23 Regression analysis: method of least squares
23 Regression analysis: method of least squares L23 Regression analysis The main purpose of regression is to explore the dependence of one variable (Y ) on another variable (X). 23.1 Introduction (P.532-555)
More informationAcknowledgements. Outline. Marie Diener-West. ICTR Leadership / Team INTRODUCTION TO CLINICAL RESEARCH. Introduction to Linear Regression
INTRODUCTION TO CLINICAL RESEARCH Introduction to Linear Regression Karen Bandeen-Roche, Ph.D. July 17, 2012 Acknowledgements Marie Diener-West Rick Thompson ICTR Leadership / Team JHU Intro to Clinical
More informationOutline. Review regression diagnostics Remedial measures Weighted regression Ridge regression Robust regression Bootstrapping
Topic 19: Remedies Outline Review regression diagnostics Remedial measures Weighted regression Ridge regression Robust regression Bootstrapping Regression Diagnostics Summary Check normality of the residuals
More informationMS&E 226: Small Data
MS&E 226: Small Data Lecture 15: Examples of hypothesis tests (v5) Ramesh Johari ramesh.johari@stanford.edu 1 / 32 The recipe 2 / 32 The hypothesis testing recipe In this lecture we repeatedly apply the
More informationLecture 4 Multiple linear regression
Lecture 4 Multiple linear regression BIOST 515 January 15, 2004 Outline 1 Motivation for the multiple regression model Multiple regression in matrix notation Least squares estimation of model parameters
More information6. CORRELATION SCATTER PLOTS. PEARSON S CORRELATION COEFFICIENT: Definition
6. CORRELATION Scatter plots Pearson s correlation coefficient (r ). Definition Hypothesis test & CI Spearman s rank correlation coefficient rho (ρ) Correlation & causation Misuse of correlation Two techniques
More informationChapter 8: Correlation & Regression
Chapter 8: Correlation & Regression We can think of ANOVA and the two-sample t-test as applicable to situations where there is a response variable which is quantitative, and another variable that indicates
More informationLINEAR REGRESSION ANALYSIS. MODULE XVI Lecture Exercises
LINEAR REGRESSION ANALYSIS MODULE XVI Lecture - 44 Exercises Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur Exercise 1 The following data has been obtained on
More informationChapter 16. Simple Linear Regression and dcorrelation
Chapter 16 Simple Linear Regression and dcorrelation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will
More informationSTAT 4385 Topic 03: Simple Linear Regression
STAT 4385 Topic 03: Simple Linear Regression Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Spring, 2017 Outline The Set-Up Exploratory Data Analysis
More informationLecture 6 Multiple Linear Regression, cont.
Lecture 6 Multiple Linear Regression, cont. BIOST 515 January 22, 2004 BIOST 515, Lecture 6 Testing general linear hypotheses Suppose we are interested in testing linear combinations of the regression
More informationNonparametric Statistics. Leah Wright, Tyler Ross, Taylor Brown
Nonparametric Statistics Leah Wright, Tyler Ross, Taylor Brown Before we get to nonparametric statistics, what are parametric statistics? These statistics estimate and test population means, while holding
More informationGlossary. The ISI glossary of statistical terms provides definitions in a number of different languages:
Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the
More informationAny of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure.
STATGRAPHICS Rev. 9/13/213 Calibration Models Summary... 1 Data Input... 3 Analysis Summary... 5 Analysis Options... 7 Plot of Fitted Model... 9 Predicted Values... 1 Confidence Intervals... 11 Observed
More informationCh 3: Multiple Linear Regression
Ch 3: Multiple Linear Regression 1. Multiple Linear Regression Model Multiple regression model has more than one regressor. For example, we have one response variable and two regressor variables: 1. delivery
More informationAn overview of applied econometrics
An overview of applied econometrics Jo Thori Lind September 4, 2011 1 Introduction This note is intended as a brief overview of what is necessary to read and understand journal articles with empirical
More informationECON3150/4150 Spring 2016
ECON3150/4150 Spring 2016 Lecture 6 Multiple regression model Siv-Elisabeth Skjelbred University of Oslo February 5th Last updated: February 3, 2016 1 / 49 Outline Multiple linear regression model and
More informationInference for Regression
Inference for Regression Section 9.4 Cathy Poliak, Ph.D. cathy@math.uh.edu Office in Fleming 11c Department of Mathematics University of Houston Lecture 13b - 3339 Cathy Poliak, Ph.D. cathy@math.uh.edu
More informationSTAT 7030: Categorical Data Analysis
STAT 7030: Categorical Data Analysis 5. Logistic Regression Peng Zeng Department of Mathematics and Statistics Auburn University Fall 2012 Peng Zeng (Auburn University) STAT 7030 Lecture Notes Fall 2012
More informationAMS 315/576 Lecture Notes. Chapter 11. Simple Linear Regression
AMS 315/576 Lecture Notes Chapter 11. Simple Linear Regression 11.1 Motivation A restaurant opening on a reservations-only basis would like to use the number of advance reservations x to predict the number
More informationFinal Exam - Solutions
Ecn 102 - Analysis of Economic Data University of California - Davis March 19, 2010 Instructor: John Parman Final Exam - Solutions You have until 5:30pm to complete this exam. Please remember to put your
More informationData Analysis and Statistical Methods Statistics 651
y 1 2 3 4 5 6 7 x Data Analysis and Statistical Methods Statistics 651 http://www.stat.tamu.edu/~suhasini/teaching.html Lecture 32 Suhasini Subba Rao Previous lecture We are interested in whether a dependent
More informationSTA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007
STA 302 H1F / 1001 HF Fall 2007 Test 1 October 24, 2007 LAST NAME: SOLUTIONS FIRST NAME: STUDENT NUMBER: ENROLLED IN: (circle one) STA 302 STA 1001 INSTRUCTIONS: Time: 90 minutes Aids allowed: calculator.
More informationRegression: Main Ideas Setting: Quantitative outcome with a quantitative explanatory variable. Example, cont.
TCELL 9/4/205 36-309/749 Experimental Design for Behavioral and Social Sciences Simple Regression Example Male black wheatear birds carry stones to the nest as a form of sexual display. Soler et al. wanted
More informationECON3150/4150 Spring 2016
ECON3150/4150 Spring 2016 Lecture 4 - The linear regression model Siv-Elisabeth Skjelbred University of Oslo Last updated: January 26, 2016 1 / 49 Overview These lecture slides covers: The linear regression
More informationRegression Models - Introduction
Regression Models - Introduction In regression models there are two types of variables that are studied: A dependent variable, Y, also called response variable. It is modeled as random. An independent
More information401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.
401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis
More informationy response variable x 1, x 2,, x k -- a set of explanatory variables
11. Multiple Regression and Correlation y response variable x 1, x 2,, x k -- a set of explanatory variables In this chapter, all variables are assumed to be quantitative. Chapters 12-14 show how to incorporate
More informationMy data doesn t look like that..
Testing assumptions My data doesn t look like that.. We have made a big deal about testing model assumptions each week. Bill Pine Testing assumptions Testing assumptions We have made a big deal about testing
More informationMATH c UNIVERSITY OF LEEDS Examination for the Module MATH1725 (May-June 2009) INTRODUCTION TO STATISTICS. Time allowed: 2 hours
01 This question paper consists of 11 printed pages, each of which is identified by the reference. Only approved basic scientific calculators may be used. Statistical tables are provided at the end of
More informationMulticollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response.
Multicollinearity Read Section 7.5 in textbook. Multicollinearity occurs when two or more predictors in the model are correlated and provide redundant information about the response. Example of multicollinear
More informationStatistics 5100 Spring 2018 Exam 1
Statistics 5100 Spring 2018 Exam 1 Directions: You have 60 minutes to complete the exam. Be sure to answer every question, and do not spend too much time on any part of any question. Be concise with all
More information