Correlation analysis

Contents

1 Correlation analysis
  1.1 Distribution function and independence of random variables
  1.2 Measures of statistical links between two random variables
  1.3 Variance and correlation matrix
  1.4 Verification of independence
  1.5 Sample variance matrix
2 Exercises
3 Sample correlation coefficient
  3.1 The mean and variance of the sample correlation coefficient
  3.2 The confidence interval for the correlation coefficient
  3.3 Testing hypotheses about the correlation coefficient
4 Coefficient of multiple correlation
  4.1 The introduction of the multiple correlation coefficient
  4.2 The sample multiple correlation coefficient
5 Partial correlation coefficient
  5.1 The introduction of the partial correlation coefficient
  5.2 The sample partial correlation coefficient
6 Exercises

1 Correlation analysis

In practical situations it is very common to ask what the relationship between two or more random variables is; we then speak about the statistical linkage between these random variables. Correlation analysis methods are used to describe the intensity of this linkage and to express it numerically. Regression analysis methods are used for its analytical description.

Motivation 1
The figure shows the rise in prices in Taiwan in the years 1940 to 1946. The independent variable X is the year of observation and the dependent variable (response) Y is the index describing the rise in prices. The points follow an approximately linear trend (with one exception, the point at 1943), which is drawn in the figure. The variation of the points around this line is considerable.

Motivation 2
The figure shows the dependence of the stopping distance of a car Y (measured in meters) on its speed X (measured in km/h). The data were obtained when testing the quality of manufactured tires. The stopping distance follows a nonlinear trend and the variability of the measured values around the fitted curve is small.

In short, the task of correlation analysis is the statistical description of the strength of the linkage between X and Y, while the objective of regression analysis is the description of this stochastic relationship by mathematical functions.

1.1 Distribution function and independence of random variables

When describing the statistical links between random variables X and Y there are two extreme situations:

- deterministic linkage,
- independence of the variables.

The joint distribution function F(x, y) is equal to the probability of the joint occurrence of the events X ≤ x and Y ≤ y. Thus

F(x, y) = P(X ≤ x, Y ≤ y).

In general, the distribution function F(x, y) exhaustively describes the probabilistic behavior of the random variables X and Y.

The distribution function of the two-dimensional normal distribution can be taken as an example. This distribution depends on the mean values EX = μ_X, EY = μ_Y, the variances DX = σ_X², DY = σ_Y², and the parameter ρ. It is a generalization of the previously defined one-dimensional normal distribution and we denote it N_2(μ_X, μ_Y, σ_X², σ_Y², ρ). The graph of its density is the well-known bell-shaped surface.

Figure 1: N_2(μ_X, μ_Y, σ_X², σ_Y², ρ) for different values of the parameters μ_X, μ_Y, σ_X², σ_Y² and ρ.

Definition 1.1. Random variables X and Y are independent if and only if the joint distribution function and the marginal distribution functions F_X(x) of the random variable X and F_Y(y) of the random variable Y satisfy the multiplicative relationship

F(x, y) = F_X(x) F_Y(y)

for arbitrary values x and y.

Similarly, independence can be expressed by the joint density function in the continuous case, or by the joint probability function in the discrete case. In the discrete case, when the range of values of the random variable X is an at most countable set M_1 and the range of values of the random variable Y is an at most countable set M_2, we introduce the joint probability function of X and Y by

p(x, y) = P(X = x, Y = y) for (x, y) ∈ M_1 × M_2.

If p_1(x) = P(X = x) and p_2(y) = P(Y = y) are the probability functions of the variables X and Y, respectively, then independence of the discrete random variables X and Y can be simply characterized by

p(x, y) = p_1(x) p_2(y) for (x, y) ∈ M_1 × M_2.

Analogously, in the continuous case the probabilistic behavior of a random variable can be described by the density function. The joint distribution function F(x, y) corresponds to the density f(x, y), which can be determined by the formula

f(x, y) = ∂²F(x, y)/∂x∂y for all real x and y.

Independence of continuous random variables X and Y can be characterized by the relation

f(x, y) = f_1(x) f_2(y) for all real values x and y,

where f_1(x) is the density of the random variable X and f_2(y) is the density of the random variable Y.

When describing the joint independence of k random variables X_1, X_2, ..., X_k we introduce the joint distribution function

F(x_1, x_2, ..., x_k) = P(X_1 ≤ x_1, X_2 ≤ x_2, ..., X_k ≤ x_k).

The random variables X_1, X_2, ..., X_k are then independent if and only if

F(x_1, x_2, ..., x_k) = F_1(x_1) F_2(x_2) ··· F_k(x_k),

where the distribution functions on the right-hand side are the marginal distribution functions of the random variables X_1, X_2, ..., X_k.

The k-tuple of random variables (X_1, X_2, ..., X_k) is called a random vector and is denoted X. We write random vectors as columns, i.e.

X = (X_1, X_2, ..., X_k)',

where (X_1, X_2, ..., X_k)' denotes the transpose of the row vector (X_1, X_2, ..., X_k). Analogously, independence of discrete (or continuous) random variables X_1, X_2, ..., X_k can be expressed via the joint probability function (or the joint density function).

1.2 Measures of statistical links between two random variables

First, we focus on the statistical relationship between two random variables X and Y. We describe it using the covariance and the correlation coefficient.

Definition 1.2. The covariance cov(X, Y) of random variables X and Y is defined as

cov(X, Y) = E(X - EX)(Y - EY) = E(X - μ_X)(Y - μ_Y).

The covariance cov(X, Y) takes values between -σ_X σ_Y and σ_X σ_Y. For a random variable X we have cov(X, X) = DX. If X and Y are independent, their covariance is zero. In case the joint distribution of the random variables X and Y is normal, cov(X, Y) is equal to zero if and only if the random variables X and Y are independent.

Using the covariance we introduce the correlation coefficient of random variables X and Y, sometimes called the Pearson correlation coefficient, denoted ρ or more precisely ρ(X, Y).

Definition 1.3. The correlation coefficient is defined by the formula

ρ(X, Y) = cov(X, Y) / (σ_X σ_Y).

The correlation coefficient is the commonly used measure of the statistical link between random variables X and Y. Its advantage (in comparison with the covariance) is that it takes values between -1 and 1.

If the value is +1, there is a direct (increasing) linear relationship between X and Y. If the value is -1, there is an indirect (decreasing) linear relationship between X and Y. In both cases the statistical linkage between Y and X can be described by a line and the observed pairs of values of X and Y lie on this line; between Y and X there is thus a deterministic linear relationship.

In the case that the value of the correlation coefficient is equal to zero, we say that the random variables X and Y are uncorrelated.

For a random variable X, the correlation coefficient is ρ(X, X) = 1.

For N_2(μ_X, μ_Y, σ_X², σ_Y², ρ) the parameter ρ is equal to the correlation coefficient ρ(X, Y), and ρ(X, Y) = 0 if and only if the two variables X and Y are independent.

The square of the correlation coefficient is called the coefficient of determination. Its value (expressed in percent) is denoted d and indicates the percentage of the variability of the variable Y which can be explained by the variability of the variable X; thus d = 100ρ².

1.3 Variance and correlation matrix

A description of the statistical links between k random variables X_1, X_2, ..., X_k is often done simply by describing the statistical links between pairs of these variables. We therefore work with the covariances and correlation coefficients between the variables X_i and X_j.

Definition 1.4. The matrix

var(X) =
  [ DX_1           cov(X_1, X_2)  ...  cov(X_1, X_k) ]
  [ cov(X_2, X_1)  DX_2           ...  cov(X_2, X_k) ]
  [ ...            ...            ...  ...           ]
  [ cov(X_k, X_1)  cov(X_k, X_2)  ...  DX_k          ]

is called the variance (covariance) matrix of the random vector X = (X_1, X_2, ..., X_k)'.

Definition 1.5. The matrix of correlation coefficients

cor(X) =
  [ 1              ρ(X_1, X_2)  ...  ρ(X_1, X_k) ]
  [ ρ(X_2, X_1)    1            ...  ρ(X_2, X_k) ]
  [ ...            ...          ...  ...         ]
  [ ρ(X_k, X_1)    ρ(X_k, X_2)  ...  1           ]

is called the correlation matrix of the random vector X = (X_1, X_2, ..., X_k)'.

The variance-covariance matrix var(X) describes the probabilistic behavior of a random vector, similarly to how the variance describes the probabilistic behavior of a random variable. The correlation matrix cor(X) describes the structure of the statistical links between the studied random variables.

Other correlation coefficients

For a description of the statistical link between a random variable Y and a random vector X, the multiple correlation coefficient ρ(Y, X) can be introduced. It is the correlation coefficient between the random variable Y and the best linear prediction of Y obtained from the random vector X.

Finally, we mention the description of the statistical link between random variables Y and Z while eliminating the effect of other variables X_1, X_2, ..., X_k, the so-called partial correlation coefficient ρ(Y, Z.X).

In addition, there are a number of other measures of statistical linkage (e.g. Spearman's correlation coefficient, Kendall's correlation coefficient, etc.) which are used depending on what type of random variables is available.

1.4 Verification of independence

Assume two random variables X and Y; the task is to verify their independence. We take a random sample of size n from the joint distribution of X and Y and denote by x_i and y_i the observations of the pair (X, Y) obtained for the i-th statistical unit, i = 1, 2, ..., n. From these values we calculate:

The sample mean x̄ of X and the sample mean ȳ of Y by the formulas

x̄ = (1/n) Σ_{i=1}^n x_i   and   ȳ = (1/n) Σ_{i=1}^n y_i.

It can be shown that E x̄ = μ_X and E ȳ = μ_Y. This means that the values of x̄ and ȳ fluctuate around the unknown mean values μ_X, μ_Y; such estimates are called unbiased.

The sample variances s_x² and s_y² by the formulas

s_x² = (1/(n-1)) Σ_{i=1}^n (x_i - x̄)²   and   s_y² = (1/(n-1)) Σ_{i=1}^n (y_i - ȳ)².

Similarly, the sample variances s_x² and s_y² are unbiased estimates of σ_X² and σ_Y².

The sample covariance s_xy is given by the formula

s_xy = (1/(n-1)) Σ_{i=1}^n (x_i - x̄)(y_i - ȳ).

This estimate is again unbiased.

The sample correlation coefficient r_xy is given by the formula

r_xy = s_xy / (s_x s_y).

This estimate is not unbiased, but for large values of n it is approximately unbiased. Its values fluctuate around the unknown value of the correlation coefficient ρ(X, Y).
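The following is a minimal Python sketch (not part of the original text) of the sample quantities just defined, written with numpy; the data values are purely hypothetical and the function name is illustrative only.

```python
import numpy as np

def sample_statistics(x, y):
    """Compute the sample quantities of Section 1.4 for paired observations x, y."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()                    # sample means
    s2_x = ((x - x_bar) ** 2).sum() / (n - 1)            # sample variance of X
    s2_y = ((y - y_bar) ** 2).sum() / (n - 1)            # sample variance of Y
    s_xy = ((x - x_bar) * (y - y_bar)).sum() / (n - 1)   # sample covariance
    r_xy = s_xy / np.sqrt(s2_x * s2_y)                   # sample correlation coefficient
    return x_bar, y_bar, s2_x, s2_y, s_xy, r_xy

# Illustrative use on hypothetical paired data:
x = [1.2, 2.4, 3.1, 4.8, 5.0]
y = [2.0, 2.9, 3.5, 5.1, 5.6]
print(sample_statistics(x, y))
```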

Example 1.1. Suppose that the joint distribution of random variables X and Y is normal, N_2(μ_X, μ_Y, σ_X², σ_Y², ρ). Then independence is equivalent to uncorrelatedness. It can be verified by the statistical test based on the test statistic

t = r_xy √(n-2) / √(1 - r_xy²).

If |t| > t_{1-α/2}(n-2), we reject the hypothesis of independence of the random variables X and Y at the significance level α. In other words, the dependence of X and Y is statistically proven at the significance level α.

Example 1.2. Consider the shift operation of a firm. We observe the number of operating hours of a production line (variable X) and the monthly cost of its maintenance in thousands of CZK (variable Y). The results are given in the table below. The aim is to determine how the number of hours correlates with the cost of maintenance and to test whether the statistical link between these two variables is statistically significant.

x_i | 275 | 350 | 250 | 325 | 375 | 400 | 300
y_i | 149 | 170 | 140 | 164 | 192 | 200 | 165

1.5 Sample variance matrix

We now concentrate on the calculation of the sample variance and correlation matrix of a random vector X = (X_1, X_2, ..., X_k)'. As in the case of two random variables, we assume that the vector X is observed on n statistical units. The result of these observations is the data matrix

D =
  [ x_11 ... x_1k ]
  [ ...  ...  ... ]
  [ x_n1 ... x_nk ]

The i-th row contains the observation of the vector X for the i-th statistical unit and the j-th column contains the observations of the variable X_j for all statistical units.

The sample variance matrix is the matrix var(X) in which the covariances cov(X_i, X_j) are replaced by their sample counterparts s_ij.

Example 1.3. The sample variance matrix is denoted S and can be determined from the formula

S = (1/(n-1)) D' (I - (1/n) E) D,

where I stands for the identity matrix of type n × n and E is the n × n matrix of ones.

Example 1.4. Similarly, the sample correlation matrix is denoted R and can be estimated by the formula

R = D^{-1}(s_1, s_2, ..., s_k) S D^{-1}(s_1, s_2, ..., s_k),

where D^{-1}(s_1, s_2, ..., s_k) denotes the inverse of the diagonal matrix D(s_1, s_2, ..., s_k) whose elements on the main diagonal are the sample standard deviations s_i = √s_ii, i = 1, 2, ..., k.
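Below is a minimal numpy sketch (not part of the original text) of the matrix formulas from Examples 1.3 and 1.4; the function name is illustrative and the final check against numpy's built-in estimators is only a sanity test.

```python
import numpy as np

def sample_variance_and_correlation(D):
    """Sample variance matrix S and correlation matrix R from an n x k data matrix D,
    following the formulas of Examples 1.3 and 1.4."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    E = np.ones((n, n))                           # n x n matrix of ones
    C = np.eye(n) - E / n                         # centering matrix I - (1/n)E
    S = D.T @ C @ D / (n - 1)                     # sample variance (covariance) matrix
    inv_sd = np.diag(1.0 / np.sqrt(np.diag(S)))   # D^{-1}(s_1, ..., s_k)
    R = inv_sd @ S @ inv_sd                       # sample correlation matrix
    return S, R

# Sanity check on hypothetical data:
rng = np.random.default_rng(0)
D = rng.normal(size=(20, 3))
S, R = sample_variance_and_correlation(D)
assert np.allclose(S, np.cov(D, rowvar=False))
assert np.allclose(R, np.corrcoef(D, rowvar=False))
```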

2 Exercises

1. Random variables X and Y have the joint density

f(x, y) = x + y for 0 < x < 1, 0 < y < 1, and f(x, y) = 0 otherwise.

Determine the correlation coefficient ρ(X, Y). (A symbolic sketch of this computation is given after this exercise list.)

2. The amount of lactic acid (in mg per 100 ml of blood) was examined in primiparous mothers (values X_i) and in their newborns (values Y_i). The obtained results are shown in the following table:

X_i | 40 | 64 | 34 | 15 | 57 | 45
Y_i | 33 | 46 | 23 | 12 | 56 | 40

Calculate the sample correlation coefficient and determine whether there is a statistically significant dependence between the amounts of lactic acid in the blood of mothers and their newborns.

3. In selected European countries the relationship between alcohol consumption (variable X) and mortality from cirrhosis of the liver (the number of deaths per 100 000 inhabitants, variable Y) was examined. The data are taken from the monograph Anděl: Statistické metody. The obtained results are shown in the following table:

Country | FIN | NOR | IRL | NLD | SWE | GBR | BEL  | AUT  | DEU  | ITA  | FRA
X       | 3.9 | 4.2 | 5.6 | 5.7 | 6.6 | 7.2 | 10.8 | 10.9 | 12.3 | 15.7 | 24.7
Y       | 3.6 | 4.3 | 3.4 | 3.7 | 7.2 | 3.0 | 12.3 | 7.0  | 23.7 | 23.6 | 46.1

Calculate the sample correlation coefficient and decide whether there is a statistically significant dependence between alcohol consumption and mortality from cirrhosis of the liver.

4. The table below shows the numbers of deaths in London (values of the variable Y) from 1 Dec to 15 Dec 1952, when London suffered a particularly severe fog. Furthermore, there are the values of the variable X, which represents the average air pollution in Hall County in mg/m³, and the values of the variable Z, which represents the average content of sulfur dioxide (the number of parts per million).

Day | Y_i | x_i  | z_i
1   | 112 | 0.30 | 0.09
2   | 140 | 0.49 | 0.16
3   | 143 | 0.61 | 0.22
4   | 120 | 0.49 | 0.14
5   | 196 | 2.64 | 0.75
6   | 294 | 3.45 | 0.86
7   | 513 | 4.46 | 1.34
8   | 518 | 4.46 | 1.34
9   | 430 | 1.22 | 0.47
10  | 274 | 1.22 | 0.47
11  | 255 | 0.32 | 0.22
12  | 236 | 0.29 | 0.23
13  | 256 | 0.50 | 0.26
14  | 222 | 0.32 | 0.16
15  | 213 | 0.32 | 0.16

Determine the correlation coefficients r(X, Y), r(X, Z) and r(Y, Z) and test whether there is a statistically significant linkage between each pair of these variables.
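As referenced in Exercise 1, the following is a hedged sympy sketch (not part of the original text) showing how the moments of the density f(x, y) = x + y on the unit square can be set up symbolically; it only sets up the integrals and prints the resulting correlation coefficient when run.

```python
import sympy as sp

x, y = sp.symbols('x y', positive=True)
f = x + y  # joint density of Exercise 1 on (0, 1) x (0, 1)

def moment(expr):
    """E[expr] under the density f(x, y) = x + y on the unit square."""
    return sp.integrate(sp.integrate(expr * f, (x, 0, 1)), (y, 0, 1))

EX, EY = moment(x), moment(y)
var_x = moment(x**2) - EX**2
var_y = moment(y**2) - EY**2
cov_xy = moment(x * y) - EX * EY
rho = cov_xy / sp.sqrt(var_x * var_y)   # correlation coefficient rho(X, Y)
print(sp.simplify(rho))
```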

3 Sample correlation coefficient

3.1 The mean and variance of the sample correlation coefficient

The sample correlation coefficient r is defined by the formula

r = r_xy = s_xy / √(s_x² s_y²)

and its use for testing the independence of random variables X and Y, assuming their two-dimensional normal distribution, was already presented. Here we introduce its further properties and uses, especially those based on Fisher's z-transformation.

First of all, let us recall the simple computational formula for the sample correlation coefficient

r = (Σ_{i=1}^n x_i y_i - n x̄ ȳ) / √( (Σ_{i=1}^n x_i² - n x̄²)(Σ_{i=1}^n y_i² - n ȳ²) ).

The basic properties of r are:

1. Assuming normality, if ρ = 0, then for n ≥ 3

Er = 0,   var r = 1/(n-1).

2. In general there is only the approximate equality Er ≈ ρ; thus r is not an unbiased estimate of the correlation coefficient ρ.

3. The variance of the sample correlation coefficient is

var r ≈ (1 - ρ²)²/n

and it can be shown that for increasing sample size n the variance var r converges to zero.

Definition 3.1. The variable Z given by the formula

Z = (1/2) ln((1 + r)/(1 - r))

is called the Fisher z-transformation of the correlation coefficient r.

The Fisher z-transformation of the correlation coefficient r has the mean value

EZ ≈ (1/2) ln((1 + ρ)/(1 - ρ)),

where ρ is the correlation coefficient of the distribution from which the random sample was taken. The variance of the variable Z is approximately

var Z ≈ 1/(n - 3).

The distribution of the variable Z approaches a normal distribution with increasing n, with

EZ ≈ (1/2) ln((1 + ρ)/(1 - ρ))   and   var Z ≈ 1/(n - 3).

Then the random variable

U = √(n - 3) (Z - EZ)

has approximately the normal distribution N(0, 1).

3.2 The confidence interval for the correlation coefficient

We can also construct an (approximate) confidence interval for ρ via the z-transformation. If ρ is the actual value of the correlation coefficient, then

P( √(n - 3) |Z - (1/2) ln((1 + ρ)/(1 - ρ))| < u_{1-α/2} ) ≈ 1 - α.

From this formula the 100(1 - α)% confidence interval for ρ can be derived as

( (D - 1)/(D + 1), (H - 1)/(H + 1) ),

where

D = exp[ 2Z - 2u_{1-α/2}/√(n - 3) ],   H = exp[ 2Z + 2u_{1-α/2}/√(n - 3) ].

3.3 Testing hypotheses about the correlation coefficient

To test the hypothesis H_0: ρ = ρ_0, where ρ_0 ∈ (-1, 1) is a given number, against the alternative H_1: ρ ≠ ρ_0, we first calculate

Z = (1/2) ln((1 + r)/(1 - r)),   z_0 = (1/2) ln((1 + ρ_0)/(1 - ρ_0)).

Under H_0 the random variable

U = √(n - 3) (Z - z_0)

has approximately the normal distribution N(0, 1). Thus H_0 is rejected if |U| ≥ u_{1-α/2}.

If there are two independent random samples from bivariate normal distributions with correlation coefficients ρ_1 and ρ_2, the test of the hypothesis H_0: ρ_1 = ρ_2 against the alternative H_1: ρ_1 ≠ ρ_2 can be based on the statistic

U = (Z_1 - Z_2) / √( 1/(n_1 - 3) + 1/(n_2 - 3) ),

where Z_i = (1/2) ln((1 + r_i)/(1 - r_i)) for i = 1, 2. If |U| ≥ u_{1-α/2}, the null hypothesis H_0 is rejected at the significance level α.
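The following is a minimal Python sketch (not part of the original text) of the confidence interval from Section 3.2 and the one-sample test from Section 3.3, using scipy for the normal quantile; the numeric inputs in the usage lines are hypothetical.

```python
import numpy as np
from scipy import stats

def fisher_z(r):
    """Fisher z-transformation Z = (1/2) ln((1 + r) / (1 - r))."""
    return np.arctanh(r)

def correlation_ci(r, n, alpha=0.05):
    """Approximate 100(1 - alpha)% confidence interval for rho (Section 3.2)."""
    u = stats.norm.ppf(1 - alpha / 2)
    z = fisher_z(r)
    D = np.exp(2 * z - 2 * u / np.sqrt(n - 3))
    H = np.exp(2 * z + 2 * u / np.sqrt(n - 3))
    return (D - 1) / (D + 1), (H - 1) / (H + 1)

def test_rho_equals(r, n, rho0=0.0, alpha=0.05):
    """Test H0: rho = rho0 against H1: rho != rho0 (Section 3.3)."""
    U = np.sqrt(n - 3) * (fisher_z(r) - fisher_z(rho0))
    return U, abs(U) >= stats.norm.ppf(1 - alpha / 2)

# Illustrative use with hypothetical values r = 0.8, n = 25:
print(correlation_ci(0.8, 25))
print(test_rho_equals(0.8, 25, rho0=0.5))
```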

If there are k independent random samples from bivariate normal distributions with correlation coefficients ρ_i for i = 1, ..., k, the test of the null hypothesis H_0: ρ_1 = ρ_2 = ... = ρ_k against the alternative that H_0 does not hold can be based on the statistic

Q = Σ_{i=1}^k (n_i - 3)(Z_i - b)²,

where

b = (1/(n - 3k)) Σ_{i=1}^k (n_i - 3) Z_i,   n = n_1 + ... + n_k,

and Z_i = (1/2) ln((1 + r_i)/(1 - r_i)) for i = 1, 2, ..., k. Then H_0 is rejected at the significance level α if Q > χ²_{1-α}(k - 1). (A computational sketch of this test follows Example 3.1 below.)

Example 3.1. The amount of lactic acid (in mg per 100 ml of blood) was examined in primiparous mothers (values X_i) and in their newborns (values Y_i). The obtained results are shown in the following table:

X_i | 40 | 64 | 34 | 15 | 57 | 45
Y_i | 33 | 46 | 23 | 12 | 56 | 40

Determine the 95% confidence interval for ρ and, using the Fisher z-transformation, test the hypothesis H_0: ρ = 0 against the alternative that H_0 does not hold.
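As referenced above, this is a hedged Python sketch (not part of the original text) of the homogeneity test Q for k sample correlation coefficients; the degrees of freedom k - 1 follow the reconstruction above, and the numeric inputs are hypothetical.

```python
import numpy as np
from scipy import stats

def homogeneity_test(r, n, alpha=0.05):
    """Test H0: rho_1 = ... = rho_k using the Q statistic based on Fisher's z.
    r and n are the sample correlations and sample sizes of the k samples."""
    r, n = np.asarray(r, dtype=float), np.asarray(n, dtype=float)
    k = len(r)
    Z = np.arctanh(r)                              # Fisher z-transformations
    b = np.sum((n - 3) * Z) / (np.sum(n) - 3 * k)  # weighted mean of the Z_i
    Q = np.sum((n - 3) * (Z - b) ** 2)
    critical = stats.chi2.ppf(1 - alpha, df=k - 1)
    return Q, critical, Q > critical

# Illustrative use with three hypothetical samples:
print(homogeneity_test(r=[0.4, 0.5, 0.6], n=[50, 60, 40]))
```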

4 Coefficient of multiple correlation

4.1 The introduction of the multiple correlation coefficient

The ordinary correlation coefficient ρ measures the dependence between two random variables. There are, however, many situations where it is necessary to describe the relationship between a random variable Y and a random vector X = (X_1, ..., X_k)'.

Definition 4.1. Let V = var X be a regular matrix and P = cor X the correlation matrix. Further, let cor(Y, X) = (ρ_{Y X_1}, ρ_{Y X_2}, ..., ρ_{Y X_k}) be the vector of correlation coefficients between Y and the components of X (the so-called correlation matrix of Y and X). Then the coefficient of multiple correlation ρ_{Y,X} is introduced by the formula

ρ²_{Y,X} = cor(Y, X) P^{-1} (cor(Y, X))'.

Theorem 4.1. The multiple correlation coefficient ρ_{Y,X} is the greatest of all correlation coefficients between Y and any nonzero linear function of the vector X.

This statement describes very well the meaning of the multiple correlation coefficient and the way in which it measures the statistical link between Y and X. The theorem also ensures that the coefficient ρ_{Y,X} is never less than the absolute value of any correlation coefficient ρ_{Y,X_i}, i = 1, ..., k. This is widely used as a rule for checking calculations.

Very often the vector X has 2 components. Let us simplify the notation: X_0 = Y. Then ρ_ij, i, j = 0, 1, 2, is the correlation coefficient between X_i and X_j, and instead of ρ_{Y,X} we write ρ_{0.12}. Substituting into the formula above we get

ρ²_{0.12} = (ρ_01² + ρ_02² - 2 ρ_01 ρ_02 ρ_12) / (1 - ρ_12²).

The coefficient ρ_{0.12} measures the total dependence of the variable Y on the vector (X_1, X_2). It should be emphasized that this dependence can be very strong even if the dependence of Y on each component X_1, X_2 separately is small. We illustrate this fact with an example.

Example 4.1. Let ρ_01 = 0 and ρ_02 ≠ 0. Then

ρ²_{0.12} = ρ_02² / (1 - ρ_12²).

It is clear from the previous theorem that the fraction ρ_02²/(1 - ρ_12²) is never greater than 1. But the number ρ²_{0.12} can be arbitrarily close to 1 if ρ_12² is sufficiently large. More specifically, this can be explained as follows. Let Y and X_1 be independent variables, put X_2 = Y - X_1 and denote var Y = σ_0², var X_1 = σ_1². Then

ρ_01 = 0,   ρ_02 = σ_0 / √(σ_0² + σ_1²),   ρ_12 = -σ_1 / √(σ_0² + σ_1²),   ρ²_{0.12} = 1.

If σ_0 is sufficiently small and σ_1 large enough, ρ_02 is arbitrarily small. Thus the variable Y is not correlated with X_1 at all and only very little with X_2, yet Y is determined unambiguously by the two variables X_1 and X_2 together, because Y = X_1 + X_2.
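The following short Python sketch (not part of the original text) checks the two-component formula and the construction of Example 4.1 numerically; the standard deviations chosen below are hypothetical.

```python
import numpy as np

def rho2_multiple(rho01, rho02, rho12):
    """Squared multiple correlation rho^2_{0.12} of X0 = Y on (X1, X2)."""
    return (rho01**2 + rho02**2 - 2 * rho01 * rho02 * rho12) / (1 - rho12**2)

# Numerical check of Example 4.1: Y and X1 independent, X2 = Y - X1.
sigma0, sigma1 = 0.3, 5.0          # hypothetical standard deviations of Y and X1
rho01 = 0.0
rho02 = sigma0 / np.sqrt(sigma0**2 + sigma1**2)
rho12 = -sigma1 / np.sqrt(sigma0**2 + sigma1**2)
print(rho02)                                # small: Y is only weakly correlated with X2
print(rho2_multiple(rho01, rho02, rho12))   # equals 1 (up to rounding)
```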

4.2 The sample multiple correlation coefficient

The sample multiple correlation coefficient R_{Y,X} is introduced by the same formula as the multiple correlation coefficient ρ_{Y,X}, only the theoretical correlation coefficients are replaced by sample correlation coefficients.

Definition 4.2. The sample multiple correlation coefficient is defined by

R²_{Y,X} = R_{Y,X} R^{-1} (R_{Y,X})',

where R is the matrix of sample correlation coefficients between the components of the vector X and R_{Y,X} is the vector of sample correlation coefficients between Y and the components of the vector X.

Theorem 4.2. Assume that the sample multiple correlation coefficient is determined from a sample from the (k + 1)-dimensional normal distribution (k being the number of components of X) whose multiple correlation coefficient ρ_{Y,X} is equal to zero. Then the statistic

Z = ((n - k - 1)/k) · R²_{Y,X} / (1 - R²_{Y,X})

has the Fisher-Snedecor distribution F with k and n - k - 1 degrees of freedom.

The null hypothesis H_0 that the multiple correlation coefficient ρ_{Y,X} is equal to zero can be tested against the alternative that H_0 does not hold using the statistic Z. The null hypothesis H_0 is rejected at the significance level α when Z > F_{1-α}(k, n - k - 1).
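Below is a hedged Python sketch (not part of the original text) of the sample multiple correlation coefficient and the F test of Theorem 4.2; the data used in the usage lines are simulated and purely illustrative.

```python
import numpy as np
from scipy import stats

def multiple_correlation_test(y, X, alpha=0.05):
    """Sample multiple correlation R^2_{Y,X} and the F test of H0: rho_{Y,X} = 0."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)          # n x k matrix of regressors
    n, k = X.shape
    full = np.corrcoef(np.column_stack([y, X]), rowvar=False)
    r_yx = full[0, 1:]                      # sample correlations between Y and X_j
    R = full[1:, 1:]                        # sample correlation matrix of X
    R2 = r_yx @ np.linalg.solve(R, r_yx)    # R^2_{Y,X} = R_{Y,X} R^{-1} (R_{Y,X})'
    Z = (n - k - 1) / k * R2 / (1 - R2)     # F statistic from Theorem 4.2
    critical = stats.f.ppf(1 - alpha, k, n - k - 1)
    return R2, Z, Z > critical

# Illustrative use with simulated data:
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.0, -0.5]) + rng.normal(scale=0.5, size=30)
print(multiple_correlation_test(y, X))
```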

5 Partial correlation coefficient

5.1 The introduction of the partial correlation coefficient

The correlation coefficient measures the dependence of two random variables. This dependence need not be causal and it can be influenced by other associated variables. Consider two random variables Y and Z and a random vector X = (X_1, ..., X_k)'. In general, the vector X can influence both Y and Z. The question is what the relationship between Y and Z would be if the vector X did not influence these two variables. One solution is to create a situation where the vector X remains constant. In most cases, however, this is impossible and a mathematical solution is required. Let us imagine that the effect of the vector X is subtracted from the variable Y; formally this means that we subtract from Y its best linear prediction determined by the vector X. Similarly, we "clean" the variable Z from the influence of the vector X. Then we calculate the correlation coefficient between the random variables modified in the described manner.

Definition 5.1. This correlation coefficient is called the partial correlation coefficient between Y and Z given the vector X. We denote it ρ_{Y,Z.X} and define it by the formula

ρ_{Y,Z.X} = (ρ_YZ - cor(Y, X) P^{-1} cor(X, Z)) / √( [1 - cor(Y, X) P^{-1} cor(X, Y)] [1 - cor(Z, X) P^{-1} cor(X, Z)] ).

Unlike the multiple correlation coefficient, there is no general inequality between the ordinary correlation coefficient and the partial correlation coefficient. Therefore ρ_{Y,Z.X} can sometimes be smaller and sometimes greater than ρ_YZ.

If k = 1, so that the vector X has only one component, the partial correlation coefficient simplifies to

ρ_{Y,Z.X} = (ρ_YZ - ρ_YX ρ_XZ) / √( (1 - ρ_YX²)(1 - ρ_XZ²) ).

Example 5.1. At the end of the last century the dependence of a hay crop (variable Y, in England, measured in quintals per acre) on temperature (variable Z, equal to the sum of the daily temperatures above 42 °F ≈ 5.6 °C during the considered spring period) was examined. It was found that ρ_YZ = -0.40. This negative correlation is contrary to experience in the area, since grass grows more when the weather is warmer. When the spring precipitation (variable X, expressed in inches) was also taken into account, the paradox was explained. The other correlation coefficients were ρ_XY = 0.80 and ρ_XZ = -0.56. This is consistent with the fact that with more rainfall more grass grows, but when it rains more, the weather is colder. Then ρ_{Y,Z.X} = 0.10: if the influence of precipitation is excluded, the dependence between Y and Z is positive, although not very large.
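The following one-function Python sketch (not part of the original text) evaluates the simplified k = 1 formula and checks it against the numbers of Example 5.1.

```python
import numpy as np

def partial_corr(rho_yz, rho_yx, rho_xz):
    """Partial correlation rho_{Y,Z.X} for a single conditioning variable X (k = 1)."""
    return (rho_yz - rho_yx * rho_xz) / np.sqrt((1 - rho_yx**2) * (1 - rho_xz**2))

# Check of Example 5.1 (hay crop Y, temperature Z, precipitation X):
print(partial_corr(rho_yz=-0.40, rho_yx=0.80, rho_xz=-0.56))  # approximately 0.10
```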

5.2 The sample partial correlation coefficient

Definition 5.2. Assume a random sample of size n from the joint (k + 2)-dimensional distribution of the random variables Y, Z and the random vector X. The sample partial correlation coefficient R_{Y,Z.X} is given by the same formula as the partial correlation coefficient,

ρ_{Y,Z.X} = (ρ_YZ - cor(Y, X) P^{-1} cor(X, Z)) / √( [1 - cor(Y, X) P^{-1} cor(X, Y)] [1 - cor(Z, X) P^{-1} cor(X, Z)] ),

but with the theoretical correlation coefficients replaced by sample correlation coefficients taken from the sample correlation matrix.

For testing a zero value of the partial correlation coefficient the test statistic

T = R_{Y,Z.X} √(n - k - 2) / √(1 - R²_{Y,Z.X})

can be used. The statistic T has the Student distribution t(n - k - 2) under the condition that the considered random sample comes from the joint (k + 2)-dimensional normal distribution with ρ_{Y,Z.X} = 0 and n > k + 2. Thus the null hypothesis ρ_{Y,Z.X} = 0 is rejected at the significance level α if |T| ≥ t_{1-α/2}(n - k - 2).
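A minimal Python sketch (not part of the original text) of this test follows; it uses the t(n - k - 2) reference distribution stated above, and the numeric inputs in the usage line are hypothetical.

```python
import numpy as np
from scipy import stats

def partial_corr_test(r_partial, n, k, alpha=0.05):
    """t test of H0: rho_{Y,Z.X} = 0 for a sample partial correlation r_partial,
    sample size n and k conditioning variables (Section 5.2)."""
    T = r_partial * np.sqrt(n - k - 2) / np.sqrt(1 - r_partial**2)
    critical = stats.t.ppf(1 - alpha / 2, df=n - k - 2)
    return T, critical, abs(T) >= critical

# Illustrative (hypothetical) values: r_partial = 0.10, n = 20, k = 1
print(partial_corr_test(0.10, n=20, k=1))
```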

6 Exercises

1. In selected European countries the relationship between alcohol consumption (variable X) and mortality from cirrhosis of the liver (the number of deaths per 100 000 inhabitants, variable Y) was examined. The data are taken from the monograph Anděl: Statistické metody. The obtained results are shown in the following table:

Country | FIN | NOR | IRL | NLD | SWE | GBR | BEL  | AUT  | DEU  | ITA  | FRA
X       | 3.9 | 4.2 | 5.6 | 5.7 | 6.6 | 7.2 | 10.8 | 10.9 | 12.3 | 15.7 | 24.7
Y       | 3.6 | 4.3 | 3.4 | 3.7 | 7.2 | 3.0 | 12.3 | 7.0  | 23.7 | 23.6 | 46.1

Calculate the interval estimate of the theoretical correlation coefficient and, using the z-transformation, decide whether there is a statistically significant dependence between alcohol consumption and mortality from cirrhosis of the liver.

2. The amount of lactic acid (in mg per 100 ml of blood) was examined in primiparous mothers (values X_i) and in their newborns (values Y_i). The obtained results are shown in the following table:

X_i | 40 | 64 | 34 | 15 | 57 | 45
Y_i | 33 | 46 | 23 | 12 | 56 | 40

(a) Using Fisher's z-transformation decide whether there is a statistically significant dependence between the amounts of lactic acid in the blood of mothers and their newborns.
(b) Determine the 95% confidence interval for the theoretical correlation coefficient ρ.

3. A medical study monitored the concentrations of compounds A and B in the urine of patients with kidney disease. For 142 people suffering from disease of type 1 the sample correlation coefficient between compounds A and B was r_1 = 0.37, and for 175 people suffering from disease of type 2 it was r_2 = 0.55. For 100 healthy subjects the sample correlation coefficient between the concentrations of the two compounds was r_3 = 0.65. Using Fisher's z-transformation:

(a) For each pair of indices (i, j) = (1, 2), (1, 3), (2, 3), test the null hypothesis that the correlation coefficients r_i and r_j come from a pair of samples with the same theoretical correlation coefficient, i.e. the hypothesis ρ_i = ρ_j.
(b) Determine a 95% confidence interval for each correlation coefficient ρ_i, i = 1, 2, 3.
(c) Test the hypothesis H_0: ρ_1 = ρ_2 = ρ_3.

4. The table below shows the numbers of deaths in London (values of the variable Y) from 1 Dec to 15 Dec 1952, when London suffered a particularly severe fog. Furthermore, there are the values of the variable X, which represents the average air pollution in Hall County in mg/m³, and the values of the variable Z, which represents the average content of sulfur dioxide (the number of parts per million).

Day | Y_i | x_i  | z_i
1   | 112 | 0.30 | 0.09
2   | 140 | 0.49 | 0.16
3   | 143 | 0.61 | 0.22
4   | 120 | 0.49 | 0.14
5   | 196 | 2.64 | 0.75
6   | 294 | 3.45 | 0.86
7   | 513 | 4.46 | 1.34
8   | 518 | 4.46 | 1.34
9   | 430 | 1.22 | 0.47
10  | 274 | 1.22 | 0.47
11  | 255 | 0.32 | 0.22
12  | 236 | 0.29 | 0.23
13  | 256 | 0.50 | 0.26
14  | 222 | 0.32 | 0.16
15  | 213 | 0.32 | 0.16

(a) Determine the confidence intervals for the correlation coefficients ρ(X, Y), ρ(X, Z) and ρ(Y, Z).
(b) Using the z-transformation test the hypothesis that there is a statistically significant linkage between each pair of these variables.
(c) Determine the coefficient of multiple correlation r_{Y,(X,Z)}.
(d) Test the hypothesis that the theoretical coefficient of multiple correlation ρ_{Y,(X,Z)} = 0 and decide whether Y depends on the random vector (X, Z). Interpret the test results.
(e) Determine the partial correlation coefficients r_{Y,X.Z} and r_{Y,Z.X}.
(f) Test the hypothesis that the partial correlation coefficient ρ_{Y,X.Z} = 0. Interpret the results.
(g) Test the hypothesis that the partial correlation coefficient ρ_{Y,Z.X} = 0. Interpret the test result.