Measuring relationships among multiple responses


Linear association (correlation, relatedness, shared information) between pairwise responses is an important property used in almost all multivariate analyses. It can be quantified as the covariance or, in its standardized form, the correlation (r for a sample, which estimates ρ for the population).

Parametric: the Pearson product-moment correlation coefficient, calculated by all statistical software (the default in SAS). It is the covariance between a pair of responses, standardized by the standard deviations of the two responses, as calculated before (the defining formula is written out after this list).

> when to use parametric or non-parametric tests?
>> a common misconception is that non-parametric procedures are assumption-free procedures. Non-parametric procedures are merely relaxed with regard to normality and equality of variances.

Non-parametric (in Proc CORR of Base SAS Software; see SAS Procedures Guide, Procedures):
> Spearman rank-order correlation (the ranks are substituted into the Pearson correlation formula)
> Kendall's tau-b, based on the number of concordances and discordances of ranks in paired responses
> Hoeffding's measure of dependence (D), a nonparametric measure of association that detects more general departures from independence. The statistic approximates a weighted sum over observations of chi-square statistics for two-by-two classification tables (Hoeffding 1948). Each set of (x, y) values provides the cut points for the classification.
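As a worked equation (the standard definition, stated here for completeness), the sample Pearson correlation standardizes the covariance by the two standard deviations:

$$ r = \frac{\operatorname{cov}(x, y)}{s_x \, s_y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \; \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$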

The formula for Hoeffding's D, as given in the SAS documentation, is

$$ D = 30 \, \frac{(n-2)(n-3)D_1 + D_2 - 2(n-2)D_3}{n(n-1)(n-2)(n-3)(n-4)} $$

where

$$ D_1 = \sum_i (Q_i - 1)(Q_i - 2), \quad D_2 = \sum_i (R_i - 1)(R_i - 2)(S_i - 1)(S_i - 2), \quad D_3 = \sum_i (R_i - 2)(S_i - 2)(Q_i - 1) $$

Here $R_i$ is the rank of $x_i$, $S_i$ is the rank of $y_i$, and $Q_i$ (also called the bivariate rank) is 1 plus the number of points with both x and y values less than the ith point. A point that is tied on only the x value or the y value contributes 1/2 to $Q_i$ if the other value is less than the corresponding value for the ith point. A point that is tied on both x and y contributes 1/4 to $Q_i$.

PROC CORR obtains the $Q_i$ values by first ranking the data. The data are then double sorted: observations are ranked according to values of the first variable and reranked according to values of the second variable. Hoeffding's D statistic is computed using the number of interchanges of the first variable. When no ties occur among data set observations, the D statistic values are between -0.5 and 1, with 1 indicating complete dependence. However, when ties occur, the D statistic may result in a smaller value. That is, for a pair of variables with identical values, the Hoeffding's D statistic may be less than 1. With a large number of ties in a small data set, the D statistic may be less than -0.5. For more information on Hoeffding's D, see Hollander and Wolfe (1973, p. 228).

The correlation coefficient measures linear association.
> if r approaches zero, it only means a lack of linear association; it does not mean that the variables are not related. There may be a non-linear relation.
> mathematically, r ranges between -1 and 1, and thus it can be zero, but you may never see r = 0 with real data.

How do we make an inference from an r calculated from a sample to its population (testing a hypothesis)?

Hypothesis testing
Looking at the value of r?
> how large is large?
>> depends on the field

Statistical test
Ho: r = 0 or ρ = 0? (which form is correct?)
HA: 1) two-sided or 2) one-sided

1) Two-sided test statistic: a t-test,

$$ t = \frac{r}{SE_r}, \qquad SE_r = \sqrt{\frac{1 - r^2}{n - 2}} $$

to be compared with a tabular $t_{\alpha,(n-2)}$ from a two-sided t-table.

2) One-sided test statistic: the same t-test as above, to be compared with a tabular $t_{2\alpha,(n-2)}$ from a two-sided t-table.
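A minimal SAS sketch of this t-test (illustrative only; the values of r and n below are hypothetical, not taken from the notes):

data r_test;
   r = 0.70;                                /* sample correlation coefficient */
   n = 25;                                  /* sample size */
   se_r = sqrt((1 - r**2) / (n - 2));       /* standard error of r */
   t = r / se_r;                            /* test statistic with n-2 df */
   p_two = 2 * (1 - probt(abs(t), n - 2));  /* two-sided P value */
   put t= p_two=;
run;

For a one-sided alternative in the hypothesized direction, halve p_two, as noted with the PROC CORR output below.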

Assumptions and ideal conditions for testing the hypothesis
The calculation of r does not rely on any assumption, but its statistical test of significance does, since it uses the laws of probability:
1. Independence of EUs (experimental units)
2. Bivariate normality
3. Randomness of both responses
> usually the case in observational studies
> not the case in experimental studies, where one variable (x) is fixed and the response to it (y) is random

Example using SAS: This example produces a correlation analysis with descriptive statistics, the Pearson product-moment correlation, the Spearman rank-order correlation, and Hoeffding's measure of dependence, D.

SAS program and data (I will try to explain all related SAS code, but I may miss some; ask questions when needed):

options nodate pageno=1 linesize=80 pagesize=60;

/* The data set FITNESS contains measurements from a study of physical
   fitness for 30 participants between the ages of 38 and 57. Each
   observation represents one person. Two observations contain missing
   values. */
data fitness;
   input Age Weight Runtime Oxygen @@;
   datalines;
57 73.37 12.63 39.407  54 79.38 11.17 46.080  52 76.32  9.63 45.441
50 70.87  8.92      .  51 67.25 11.08 45.118  54 91.63 12.88 39.203
51 73.71 10.47 45.790  57 59.08  9.93 50.545  49 76.32     . 48.673
48 61.24 11.5  47.920  52 82.78 10.5  47.467  44 73.03 10.13 50.541
45 87.66 14.03 37.388  45 66.45 11.12 44.754  47 79.15 10.6  47.273
54 83.12 10.33 51.855  49 81.42  8.95 40.836  51 77.91 10.00 46.672
48 91.63 10.25 46.774  49 73.37 10.08 50.388  44 89.47 11.37 44.609
40 75.07 10.07 45.313  44 85.84  8.65 54.7    42 68.15  8.17 59.571
38 89.02  9.22 49.874  47 77.45 11.63 44.811  40 75.98 11.95 45.681
43 81.19 10.85 49.091  44 81.42 13.08 39.442  38 81.87  8.63 60.055
;

proc corr data=fitness Pearson Spearman Hoeffding;
   var weight oxygen runtime;
run;

The CORR Procedure

3 Variables: Weight Oxygen Runtime

Simple Statistics

Variable     N      Mean   Std Dev    Median   Minimum   Maximum
Weight      30     77.71      8.34     77.68     59.08     91.63
Oxygen      29     47.06      5.32     46.67     37.39     60.06
Runtime     29     10.61      1.42     10.47      8.17     14.03

Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

             Weight    Oxygen   Runtime
Weight      1.00000   -0.20      0.15
                       0.30      0.43
                 30       29        29
Oxygen      -0.20     1.00000   -0.78
             0.30                <.01
               29         29       28
Runtime      0.15     -0.78     1.00000
             0.43      <.01
               29        28        29

Spearman Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

             Weight    Oxygen   Runtime
Weight      1.00000   -0.13      0.11
                       0.50      0.59
                 30       29        29
Oxygen      -0.13     1.00000   -0.68
             0.50                <.01
               29         29       28
Runtime      0.11     -0.68     1.00000
             0.59      <.01
               29        28        29

Hoeffding Dependence Coefficients
Prob > D under H0: D=0
Number of Observations

             Weight    Oxygen   Runtime
Weight       0.99     -0.02     -0.02
             <.01      0.98
               30        29        29
Oxygen      -0.02      1.00      0.17
             0.98                <.01
               29        29        28
Runtime     -0.02      0.17      1.00
                       <.01
               29        28        29

Note: the P values reported in the SAS output are for two-sided tests. If your HA is directional, and the statistical results confirm the hypothesized direction, then the actual P value is half of what is reported in the SAS output.

As an option in the PROC CORR statement, you may add COV to get the variance/covariance matrix, e.g.,

proc corr data=fitness COV; *pearson spearman hoeffding options not included here, thus it runs the Pearson correlation by default;
   var weight oxygen runtime;
run;

Output: Measures of Association for a Physical Fitness Study

The CORR Procedure
3 Variables: Weight Oxygen Runtime

Variances and Covariances
Covariance / Row Var Variance / Col Var Variance / DF

             Weight    Oxygen   Runtime
Weight       69.58     -8.88      1.82
             69.58     70.34     72.00
             69.58     28.32      2.01
                29        28        28
Oxygen       -8.88     28.32     -5.95
             28.32     28.32     28.27
             70.34     28.32      1.97
                28        28        27
Runtime       1.82     -5.95      2.01
              2.01      1.97      2.01
             72.00     28.27      2.01
                28        27        28

Why are there three values (plus the DF) for the variance/covariance statistic in each cell, and why are they sometimes the same and sometimes different?

The COV output is followed by the same Simple Statistics table and the same Pearson Correlation Coefficients table (Prob > |r| under H0: Rho=0, with the number of observations) as before.

Partial Correlation (SAS Online Doc)

A partial correlation measures the strength of a relationship between two variables while controlling for the effect of one or more additional variables. The Pearson partial correlation for a pair of variables may be defined as the correlation of the errors after regression on the controlling variables. Let $y_1$ and $y_2$ be the variables to correlate and z the set of controlling variables, and suppose

$$ y_1 = \alpha_1 + \mathbf{z}\boldsymbol{\beta}_1 + e_1, \qquad y_2 = \alpha_2 + \mathbf{z}\boldsymbol{\beta}_2 + e_2 $$

are regression models for $y_1$ and $y_2$ given z. The population Pearson partial correlation between $y_1$ and $y_2$ given z is defined as the correlation between the errors $e_1$ and $e_2$.

If the exact values of the regression parameters are unknown, you can use a sample Pearson partial correlation to estimate the population Pearson partial correlation. For a given sample of observations, you estimate the unknown parameters with their least-squares estimators, which gives the fitted least-squares regression models and their residuals.

The partial corrected sums of squares and crossproducts (CSSCP) of y given z are the corrected sums of squares and crossproducts of the residuals. Using these partial corrected sums of squares and crossproducts, you can calculate the partial variances, partial covariances, and partial correlations.

PROC CORR derives the partial corrected sums of squares and crossproducts matrix by applying the Cholesky decomposition algorithm to the CSSCP matrix. For Pearson partial correlations, let

$$ S = \begin{pmatrix} S_{zz} & S_{zy} \\ S_{yz} & S_{yy} \end{pmatrix} $$

be the partitioned CSSCP matrix between the two sets of variables, z and y.

PROC CORR calculates $S_{yy\cdot z}$, the partial CSSCP matrix of y after controlling for z, by applying the Cholesky decomposition algorithm sequentially to the rows associated with z, the variables being partialled out. After applying the Cholesky decomposition algorithm to each row associated with the z variables, PROC CORR checks all higher-numbered diagonal elements associated with z for singularity. After the Cholesky decomposition, a variable is considered singular if the value of the corresponding diagonal element is less than $\varepsilon$ times the original unpartialled corrected sum of squares of that variable. You can specify the singularity criterion $\varepsilon$ using the SINGULAR= option. For Pearson partial correlations, a controlling variable is considered singular if the $R^2$ for predicting this variable from the variables that are already partialled out exceeds $1 - \varepsilon$. When this happens, PROC CORR excludes the variable from the analysis. Similarly, a variable is considered singular if the $R^2$ for predicting this variable from the controlling variables exceeds $1 - \varepsilon$. When this happens, its associated diagonal element and all higher-numbered elements in this row or column are set to zero. After the Cholesky decomposition algorithm is performed on all rows associated with z, the resulting matrix has the form

$$ T = \begin{pmatrix} T_{zz} & T_{zy} \\ 0 & S_{yy\cdot z} \end{pmatrix} $$

where $T_{zz}$ is an upper triangular matrix with

$$ T_{zz}^{\prime} T_{zz} = S_{zz}, \qquad T_{zz}^{\prime} T_{zy} = S_{zy}, \qquad S_{yy\cdot z} = S_{yy} - T_{zy}^{\prime} T_{zy} $$

If $S_{zz}$ is positive definite, then the partial CSSCP matrix $S_{yy\cdot z}$ is identical to the matrix derived from the formula

$$ S_{yy\cdot z} = S_{yy} - S_{yz} S_{zz}^{-1} S_{zy} $$

The partial variance-covariance matrix is calculated with the variance divisor (VARDEF= option). PROC CORR can then use the standard Pearson correlation formula on the partial variance-covariance matrix to calculate the Pearson partial correlation matrix. Another way to calculate the Pearson partial correlation is to apply the Cholesky decomposition algorithm directly to the correlation matrix and to use the correlation formula on the resulting matrix.

To derive the corresponding Spearman partial rank-order correlations and Kendall partial tau-b correlations, PROC CORR applies the Cholesky decomposition algorithm to the Spearman rank-order correlation matrix and the Kendall tau-b correlation matrix and uses the correlation formula. The singularity criterion for nonparametric partial correlations is identical to that for Pearson partial correlations, except that PROC CORR uses a matrix of nonparametric correlations and sets a singular variable's associated correlations to missing. The partial tau-b correlations range from -1 to 1; however, the sampling distribution of this partial tau-b is unknown, so the probability values are not available.

When a correlation matrix (Pearson, Spearman, or Kendall tau-b) is positive definite, the resulting partial correlation between variables x and y after adjusting for a single variable z is identical to that obtained from the first-order partial correlation formula

$$ r_{xy\cdot z} = \frac{r_{xy} - r_{xz} r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}} $$

where $r_{xy}$, $r_{xz}$, and $r_{yz}$ are the appropriate correlations.
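A small numerical check of the partial CSSCP formula above (a sketch assuming SAS/IML is available; the matrix below is made up for illustration):

proc iml;
   /* partitioned CSSCP matrix: one controlling variable z, then y1, y2 */
   S   = {4 2 1,
          2 9 3,
          1 3 16};
   Szz = S[1, 1];          /* CSSCP of z */
   Szy = S[1, 2:3];        /* crossproducts of z with y */
   Syz = S[2:3, 1];
   Syy = S[2:3, 2:3];      /* CSSCP of y */
   Syy_z = Syy - Syz * inv(Szz) * Szy;   /* partial CSSCP of y given z */
   print Syy_z;
quit;

Applying the standard Pearson correlation formula to Syy_z (after dividing by the appropriate DF) then yields the partial correlations.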

The formula for higher-order partial correlations is a straightforward extension of the above first-order formula. For example, when the correlation matrix is positive definite, the partial correlation between x and y controlling for both $z_1$ and $z_2$ is identical to the second-order partial correlation formula

$$ r_{xy\cdot z_1 z_2} = \frac{r_{xy\cdot z_1} - r_{xz_2\cdot z_1} r_{yz_2\cdot z_1}}{\sqrt{(1 - r_{xz_2\cdot z_1}^2)(1 - r_{yz_2\cdot z_1}^2)}} $$

where $r_{xy\cdot z_1}$, $r_{xz_2\cdot z_1}$, and $r_{yz_2\cdot z_1}$ are first-order partial correlations among the variables x, y, and $z_2$ given $z_1$.
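The definition of a partial correlation as a correlation of residuals (see the beginning of this section) can be verified with the fitness data: regress each variable on the controlling variable, correlate the residuals, and compare with the PARTIAL statement. A sketch follows; the output data set and the residual names rw and ro are arbitrary choices, not part of the original notes:

proc reg data=fitness noprint;
   model weight oxygen = age;     /* two models with the same regressor */
   output out=resids r=rw ro;     /* rw, ro = residuals of weight, oxygen */
run;
quit;

proc corr data=resids;            /* correlation of the residuals */
   var rw ro;
run;

proc corr data=fitness;           /* the partial correlation coefficient
                                     should match the one above */
   var weight oxygen;
   partial age;
run;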

SAS example for partial correlation:

options nodate pageno=1 linesize=120 pagesize=60;
proc corr data=fitness spearman kendall cov nosimple outp=fitcorr;
   var weight oxygen runtime;
   partial age;
run;

Partial Correlations for a Fitness and Exercise Study

The CORR Procedure
1 Partial Variables: Age
3 Variables: Weight Oxygen Runtime

Partial Covariance Matrix, DF = 26

             Weight    Oxygen   Runtime
Weight       72.44    -12.75      2.07
Oxygen      -12.75     27.02     -5.59
Runtime       2.07     -5.59      1.95

Pearson Partial Correlation Coefficients, N = 28
Prob > |r| under H0: Partial Rho=0

             Weight    Oxygen   Runtime
Weight      1.00000   -0.28      0.17
                       0.14      0.38
Oxygen      -0.28     1.00000   -0.77
             0.14                <.01
Runtime      0.17     -0.77     1.00000
             0.38      <.01

Spearman Partial Correlation Coefficients, N = 28
Prob > |r| under H0: Partial Rho=0

             Weight    Oxygen   Runtime
Weight      1.00000   -0.16      0.09
                       0.41      0.67
Oxygen      -0.16     1.00000   -0.67
             0.41                0.01
Runtime      0.09     -0.67     1.00000
             0.67      0.01

Kendall Partial Tau b Correlation Coefficients, N = 28

             Weight    Oxygen   Runtime
Weight      1.00000   -0.09      0.03
Oxygen      -0.09     1.00000   -0.52
Runtime      0.03     -0.52     1.00000

(No probability values are printed because, as noted above, the sampling distribution of the partial tau-b is unknown.)

Confidence Interval for the Correlation Coefficient

Is having the actual value of a correlation coefficient and its significance level sufficient for making a judgment regarding the association between two responses? What else could improve the inference? Why? Measured by what?

Confidence interval for the correlation coefficient

A. Graphic (Johnson D. E., p. 37)
1. r = 0.85, n = 5, C.I.: -.12 < ρ < .95. May we test Ho with a C.I.? What is the conclusion here?
2. r = 0.7, n = 25, C.I.: .41 < ρ < .85. What is the conclusion regarding Ho with this C.I.?

B. Fisher's approximation, used for n > 25:

$$ \tanh\left(\operatorname{arctanh}(r) - \frac{z_{\alpha/2}}{\sqrt{n-3}}\right) < \rho < \tanh\left(\operatorname{arctanh}(r) + \frac{z_{\alpha/2}}{\sqrt{n-3}}\right) $$

For r = 0.7, n = 25, C.I.: .42 < ρ < .86, which is close to what we obtained from the graph (a small SAS sketch of this computation appears at the end of this section).

C. Ruben's approximation
The computation is a little more complex; see Johnson D. E., page 39, for the relevant SAS code.

Power and sample size calculations for correlation?

Use of the correlation matrix in grouping variables
> reducing dimensionality
> relation to PCA and FA
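A minimal sketch that computes Fisher's interval in a data step (the numbers reproduce the r = 0.7, n = 25 example above):

data fisher_ci;
   r = 0.70;
   n = 25;
   z  = 0.5 * log((1 + r) / (1 - r));   /* Fisher's z = arctanh(r) */
   se = 1 / sqrt(n - 3);
   zc = probit(0.975);                  /* z_{alpha/2} for a 95% C.I. */
   lower = tanh(z - zc * se);           /* about 0.42 */
   upper = tanh(z + zc * se);           /* about 0.86 */
   put lower= upper=;
run;

Recent releases of PROC CORR can also produce these limits directly via the FISHER option.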