Measuring relationships among multiple responses


Linear association (correlation, relatedness, shared information) between pair-wise responses is an important property used in almost all multivariate analyses. It can be quantified as the covariance or its standardized form, the correlation (r for a sample, estimating ρ for the population).

Parametric: the Pearson product-moment correlation coefficient, calculated by all statistical software (the default in SAS). It is the covariance between a pair of responses standardized by the standard deviations of the two responses, as calculated before.

> When should parametric or non-parametric tests be used?
>> A common misconception is that non-parametric procedures are assumption-free procedures. Non-parametric procedures are merely relaxed with regard to normality and equality of variance.

Non-parametric measures (in PROC CORR of Base SAS software; see the SAS Procedures Guide):
> Spearman rank-order correlation (the ranks are substituted into the Pearson correlation formula)
> Kendall's tau-b, based on the number of concordances and discordances of ranks in paired responses
> Hoeffding's measure of dependence (D), a nonparametric measure of association that detects more general departures from independence. The statistic approximates a weighted sum over observations of chi-square statistics for two-by-two classification tables (Hoeffding 1948). Each set of (x, y) values provides the cut points for the classification.
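To illustrate the point that the Spearman coefficient is simply the Pearson formula applied to ranks, the following sketch (not part of the original notes) ranks two variables with PROC RANK and then correlates the ranks. It uses the FITNESS data set introduced in the SAS example later in this section; the output data set and rank variable names are illustrative choices.

proc rank data=fitness out=fitness_ranks;
   var weight oxygen;
   ranks rank_weight rank_oxygen;   /* ranks of the two variables (ties get mid-ranks) */
run;

proc corr data=fitness_ranks pearson;
   var rank_weight rank_oxygen;     /* Pearson correlation of the ranks ...            */
run;

proc corr data=fitness spearman;
   var weight oxygen;               /* ... equals the Spearman correlation of the raw data */
run;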

The formula for Hoeffding's D is

D = 30 [ (n − 2)(n − 3) D1 + D2 − 2(n − 2) D3 ] / [ n(n − 1)(n − 2)(n − 3)(n − 4) ]

where D1 = Σi (Qi − 1)(Qi − 2), D2 = Σi (Ri − 1)(Ri − 2)(Si − 1)(Si − 2), and D3 = Σi (Ri − 2)(Si − 2)(Qi − 1). Here Ri is the rank of xi, Si is the rank of yi, and Qi (also called the bivariate rank) is 1 plus the number of points with both x and y values less than the i-th point. A point that is tied on only the x value or the y value contributes 1/2 to Qi if the other value is less than the corresponding value for the i-th point. A point that is tied on both x and y contributes 1/4 to Qi. PROC CORR obtains the Qi values by first ranking the data. The data are then double sorted by ranking observations according to values of the first variable and re-ranking the observations according to values of the second variable. Hoeffding's D statistic is computed using the number of interchanges of the first variable. When no ties occur among data set observations, the D statistic values are between −0.5 and 1, with 1 indicating complete dependence. However, when ties occur, the D statistic may result in a smaller value; that is, for a pair of variables with identical values, Hoeffding's D statistic may be less than 1. With a large number of ties in a small data set, the D statistic may even be less than −0.5. For more information on Hoeffding's D, see Hollander and Wolfe (1973, p. 228).

The correlation coefficient measures linear association.
> If r approaches zero, it only means a lack of linear association; it does not mean that the variables are not related, since there may be a non-linear relation.
> Mathematically, r ranges between −1 and 1, and thus it can be zero, but you may never see r = 0 with real data.

How do we make an inference from an r calculated from a sample to its population (testing a hypothesis)?

Hypothesis testing
Looking at the value of r? > How large is large? >> It depends on the field.
Statistical test: Ho: r = 0? or ρ = 0? (which form is correct?)
HA:
1) Two-sided test statistic: a t-test, t = r / SE_r, where SE_r = sqrt[(1 − r²)/(n − 2)], to be compared with a tabular t(α, n − 2) from a two-sided t-table.
2) One-sided test statistic: the same t-test as above, to be compared with a tabular t(2α, n − 2) from a two-sided t-table.
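A small DATA step sketch (not from the notes; the values of r and n below are arbitrary illustrative choices) carries out this computation and converts the t statistic to P values:

data corr_test;
   r    = 0.70;                           /* sample correlation coefficient */
   n    = 25;                             /* sample size                    */
   se_r = sqrt((1 - r**2)/(n - 2));       /* standard error of r            */
   t    = r / se_r;                       /* test statistic with n - 2 df   */
   p_two = 2*(1 - probt(abs(t), n - 2));  /* two-sided P value              */
   p_one = p_two/2;                       /* one-sided P value, valid only when the
                                             observed sign agrees with HA   */
   put t= p_two= p_one=;
run;

PROC CORR reports the two-sided P value automatically, so in practice this step is only needed when a one-sided test is of interest.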

Assumptions and ideal conditions for testing the hypothesis

Calculation of r does not rely on any assumption, but its statistical test of significance does, since it uses the laws of probability:
1. Independence of experimental units (EUs)
2. Bivariate normality
3. Randomness of both responses
> usually the case in observational studies
> not the case in experimental studies, where one variable (x) is fixed and the response to it (y) is random

Example using SAS: This example produces a correlation analysis with descriptive statistics, the Pearson product-moment correlation, the Spearman rank-order correlation, and Hoeffding's measure of dependence, D.

SAS program and DATA (I will try to explain all related SAS code, but I may miss some; ask questions when needed):

options nodate pageno=1 linesize=80 pagesize=60;

/* The data set FITNESS contains measurements from a study of physical
   fitness for 30 participants between the ages of 38 and 57. Each
   observation represents one person. Two observations contain missing
   values. */

data fitness;
   input Age Weight Oxygen Runtime @@;
   datalines;
;

proc corr data=fitness pearson spearman hoeffding;
   var weight oxygen runtime;
run;

The CORR Procedure

3 Variables: Weight Oxygen Runtime

[Output: the Simple Statistics table (N, Mean, Std Dev, Median, Minimum, Maximum for Weight, Oxygen, and Runtime), followed by the 3 x 3 matrix of Pearson correlation coefficients with Prob > |r| under H0: Rho=0 and the number of observations for each pair; the numeric values were not preserved in this transcription.]

[Output, continued: the corresponding 3 x 3 matrices of Spearman correlation coefficients (Prob > |r| under H0: Rho=0) and Hoeffding dependence coefficients (Prob > D under H0: D=0), each with the number of observations for each pair; the numeric values were not preserved in this transcription.]

Note: P values reported in the SAS output are for two-sided tests. If your HA is directional, and the statistical results confirm the hypothesized direction, then the actual P value is half of what is reported in the SAS output.

As an option in the PROC CORR statement, you may add COV to get the variance/covariance matrix, e.g.,

proc corr data=fitness cov;
   * the pearson, spearman, and hoeffding options are not included here,
     thus Pearson correlation is run by default;
   var weight oxygen runtime;
run;

Output: Measures of Association for a Physical Fitness Study

The CORR Procedure

3 Variables: Weight Oxygen Runtime

[Output: the Variances and Covariances table for Weight, Oxygen, and Runtime, where each cell reports Covariance / Row Var Variance / Col Var Variance / DF; the numeric values were not preserved in this transcription.]

Why are there three values (variance/covariance), sometimes the same and sometimes different, in each cell?

The same as before:

[Output: the Simple Statistics table and the 3 x 3 matrix of Pearson correlation coefficients (Prob > |r| under H0: Rho=0, with the number of observations for each pair) for Weight, Oxygen, and Runtime; the numeric values were not preserved in this transcription.]

Partial Correlation (SAS online documentation)

A partial correlation measures the strength of a relationship between two variables while controlling the effect of one or more additional variables. The Pearson partial correlation for a pair of variables may be defined as the correlation of errors after regression on the controlling variables.

Let y = (y1, y2, ..., yv) be the set of variables to correlate. Also let α = (α1, ..., αv) and β = (β1, ..., βv) be sets of regression parameters and z = (z1, z2, ..., zp) be the set of controlling variables, where αi is the intercept and βi is the slope for regressing yi on z. Suppose E(yi | z) = αi + z βi is a regression model for yi given z. The population Pearson partial correlation between the ith and the jth variables of y given z is defined as the correlation between the errors (yi − E(yi | z)) and (yj − E(yj | z)).

If the exact values of α and β are unknown, you can use a sample Pearson partial correlation to estimate the population Pearson partial correlation. For a given sample of observations, you estimate the sets of unknown parameters α and β using their least-squares estimators, which give the fitted least-squares regression model for each yi. The partial corrected sums of squares and crossproducts (CSSCP) of y given z are the corrected sums of squares and crossproducts of the residuals (observed minus fitted values). Using these partial corrected sums of squares and crossproducts, you can calculate the partial variances, partial covariances, and partial correlations.

PROC CORR derives the partial corrected sums of squares and crossproducts matrix by applying the Cholesky decomposition algorithm to the CSSCP matrix. For Pearson partial correlations, let

S = | S_zz  S_zy |
    | S_yz  S_yy |

be the partitioned CSSCP matrix between the two sets of variables, z and y:
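The statement that a Pearson partial correlation is the correlation of the residuals after regressing on the controlling variables can be verified directly. The sketch below is not part of the original notes; it uses the FITNESS data with Age as the single controlling variable, mirroring the partial-correlation example that follows, and the residual variable and data set names are illustrative choices.

proc reg data=fitness noprint;
   model weight oxygen = age;                       /* regress both variables on Age */
   output out=fit_resid r=res_weight res_oxygen;    /* keep the residuals            */
run;
quit;

proc corr data=fit_resid pearson;
   var res_weight res_oxygen;     /* correlation of the residuals ...                */
run;

proc corr data=fitness pearson;
   var weight oxygen;
   partial age;                   /* ... matches the Pearson partial correlation     */
run;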

PROC CORR calculates S_yy.z, the partial CSSCP matrix of y after controlling for z, by applying the Cholesky decomposition algorithm sequentially to the rows associated with z, the variables being partialled out.

After applying the Cholesky decomposition algorithm to each row associated with the variables z, PROC CORR checks all higher-numbered diagonal elements associated with z for singularity. After the Cholesky decomposition, a variable is considered singular if the value of the corresponding diagonal element is less than ε times the original unpartialled corrected sum of squares of that variable. You can specify the singularity criterion ε using the SINGULAR= option. For Pearson partial correlations, a controlling variable z is considered singular if the R² for predicting this variable from the variables that are already partialled out exceeds 1 − ε. When this happens, PROC CORR excludes the variable from the analysis. Similarly, a variable is considered singular if the R² for predicting this variable from the controlling variables exceeds 1 − ε. When this happens, its associated diagonal element and all higher-numbered elements in this row or column are set to zero.

After the Cholesky decomposition algorithm is performed on all rows associated with z, the resulting matrix has the form

T = | T_zz  T_zy   |
    | 0     S_yy.z |

where T_zz is an upper triangular matrix with T_zz' T_zz = S_zz and T_zz' T_zy = S_zy.

If S_zz is positive definite, then the partial CSSCP matrix S_yy.z is identical to the matrix derived from the formula

S_yy.z = S_yy − S_yz (S_zz)^-1 S_zy

The partial variance-covariance matrix is calculated with the variance divisor (the VARDEF= option). PROC CORR can then use the standard Pearson correlation formula on the partial variance-covariance matrix to calculate the Pearson partial correlation matrix.

Another way to calculate the Pearson partial correlation is to apply the Cholesky decomposition algorithm directly to the correlation matrix and to use the correlation formula on the resulting matrix. To derive the corresponding Spearman partial rank-order correlations and Kendall partial tau-b correlations, PROC CORR applies the Cholesky decomposition algorithm to the Spearman rank-order correlation matrix and the Kendall tau-b correlation matrix and uses the correlation formula. The singularity criterion for nonparametric partial correlations is identical to that for Pearson partial correlations, except that PROC CORR uses a matrix of nonparametric correlations and sets a singular variable's associated correlations to missing. The partial tau-b correlations range from −1 to 1. However, the sampling distribution of this partial tau-b is unknown; therefore, the probability values are not available.

When a correlation matrix (Pearson, Spearman, or Kendall tau-b) is positive definite, the resulting partial correlation between variables x and y after adjusting for a single variable z is identical to that obtained from the first-order partial correlation formula

r_xy.z = (r_xy − r_xz r_yz) / sqrt[ (1 − r_xz²)(1 − r_yz²) ]

where r_xy, r_xz, and r_yz are the appropriate correlations.

The formula for higher-order partial correlations is a straightforward extension of the above first-order formula. For example, when the correlation matrix is positive definite, the partial correlation between x and y controlling for both z1 and z2 is identical to the second-order partial correlation formula

r_xy.z1z2 = (r_xy.z1 − r_xz2.z1 r_yz2.z1) / sqrt[ (1 − r_xz2.z1²)(1 − r_yz2.z1²) ]

where r_xy.z1, r_xz2.z1, and r_yz2.z1 are first-order partial correlations among the variables x, y, and z2 given z1.
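As a quick numerical check of the first-order formula (a sketch that is not part of the notes; the three pairwise correlations are made-up illustrative values):

data partial_first_order;
   r_xy = 0.60;   /* correlation between x and y             */
   r_xz = 0.50;   /* correlation between x and the control z */
   r_yz = 0.40;   /* correlation between y and the control z */
   r_xy_z = (r_xy - r_xz*r_yz) / sqrt((1 - r_xz**2)*(1 - r_yz**2));
   put r_xy_z=;   /* first-order partial correlation of x and y given z */
run;

With these values the partial correlation is about 0.50, noticeably smaller than the raw r_xy = 0.60 once the shared association with z is removed.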

SAS example for partial correlation:

options nodate pageno=1 linesize=120 pagesize=60;

proc corr data=fitness spearman kendall cov nosimple outp=fitcorr;
   var weight oxygen runtime;
   partial age;
run;

Partial Correlations for a Fitness and Exercise Study

The CORR Procedure

1 Partial Variables: Age
3 Variables: Weight Oxygen Runtime

[Output: the Partial Covariance Matrix, DF = 26, for Weight, Oxygen, and Runtime; the numeric values were not preserved in this transcription.]

[Output: the Pearson Partial Correlation Coefficients, N = 28 (Prob > |r| under H0: Partial Rho=0), and the Spearman Partial Correlation Coefficients, N = 28 (Prob > |r| under H0: Partial Rho=0), for Weight, Oxygen, and Runtime; the numeric values were not preserved in this transcription.]

[Output: the Kendall Partial Tau-b Correlation Coefficients, N = 28, for Weight, Oxygen, and Runtime; no probability values are reported for partial tau-b; the numeric values were not preserved in this transcription.]

Confidence Interval for the Correlation Coefficient

Is having the actual value of a correlation coefficient and its significance level sufficient to make a judgment regarding the association between two responses? What else could improve the inference? Why? Measured by what?

Confidence interval for the correlation coefficient

A. Graphic (Johnson D. E., p. 37)
1. r = 0.85, n = 5, C.I.: −0.12 < ρ < 0.95. May we test Ho with a C.I.? What is the conclusion here?
2. r = 0.7, n = 25, C.I.: 0.41 < ρ < 0.85. What is the conclusion regarding Ho with the C.I.?

B. Fisher's approximation: used for n > 25

tanh[ arctanh(r) − z(α/2) / (n − 3)^0.5 ] < ρ < tanh[ arctanh(r) + z(α/2) / (n − 3)^0.5 ]

For r = 0.7, n = 25, C.I.: 0.42 < ρ < 0.86, which is close to what we obtained from the graph.

C. Ruben's approximation
The computation is a little complex; see Johnson D.E., page 39 for the relevant SAS code.

Power and sample size calculations for correlation?

Use of the correlation matrix in grouping variables
> reducing dimensionality
> relation to PCA and FA
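The Fisher interval above can be computed in a short DATA step; the following sketch is not from the notes, and r, n, and the 95% confidence level are illustrative choices (arctanh and tanh are written with LOG and EXP so only basic functions are needed):

data fisher_ci;
   r     = 0.7;                        /* sample correlation                  */
   n     = 25;                         /* sample size                         */
   z     = probit(0.975);              /* z(alpha/2) for a 95% interval       */
   zr    = 0.5*log((1 + r)/(1 - r));   /* Fisher z transform, arctanh(r)      */
   se    = 1/sqrt(n - 3);
   lo_z  = zr - z*se;
   hi_z  = zr + z*se;
   lower = (exp(2*lo_z) - 1)/(exp(2*lo_z) + 1);   /* back-transform with tanh */
   upper = (exp(2*hi_z) - 1)/(exp(2*hi_z) + 1);
   put lower= upper=;
run;

For r = 0.7 and n = 25 this gives roughly 0.42 < ρ < 0.86, matching the interval quoted above. Recent releases of PROC CORR can also produce Fisher z based confidence limits directly through the FISHER option on the PROC CORR statement.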
