

Chapter 9

MULTIVARIATE DISTRIBUTIONS

John Wishart (1898-1956). British statistician. Wishart was an assistant to Pearson at University College and to Fisher at Rothamsted. In 1928 he derived the distribution which bears his name. As a professor of Statistics and Agriculture at Cambridge, he made outstanding contributions to experimental design as well. He combined his academic work with that of a consultant for international organizations involved in the application of statistical methods to agriculture.

9.1 BASIC CONCEPTS

A central problem in data analysis is deciding whether the properties found in a sample can be generalized to the population from which it was taken. In order to carry out this extrapolation we need to build a model of the system which generates the data, meaning that we assume a probability distribution for the random variable in the population. This chapter reviews some basic concepts for constructing multivariate statistical models and presents the distributions which will be used for inference in the following chapters.

9.1.1 Vector random variables

A vector random variable is the result of observing $p$ characteristics of an item in a population. For example, if we observe the age and weight of students at a university we will have the values of a bivariate random variable; if we observe the number of workers, the sales and the profit of the companies in a sector, we will have a trivariate vector.

We say that we have the joint distribution of a random vector when the following are specified:

1. The sample space of the possible values. Representing each value by a point in the space $R^p$ of dimension $p$, the sample space is, in general, a subset of this space.

2. The probabilities of each possible result of the sample space.

We say that a $p$-dimensional vector variable is discrete when each of the $p$ scalar variables which comprise it is discrete as well. For example, eye and hair color make up a discrete bivariate variable. Analogously, we say that the variable is continuous if its components are. When some of its components are discrete and others are continuous we say that the variable is mixed. For example, the variable formed by gender (0 = male, 1 = female), height and weight is a mixed trivariate variable. In this chapter, for the sake of simplicity and except when otherwise indicated, we will assume that the vector variable is continuous.

9.1.2 Joint distribution

The joint distribution function of a vector random variable, $F(x)$, is defined at the point $x_0 = (x_1^0, \ldots, x_p^0)$ by

$$F(x_0) = P(x \le x_0) = P(x_1 \le x_1^0, \ldots, x_p \le x_p^0),$$

where $P(x \le x_0)$ represents the probability that the variable takes values less than or equal to the particular value under consideration, $x_0$. Thus the distribution function accumulates the probabilities of the values which are less than or equal to the point considered, and it is therefore non-decreasing.

Although the distribution function is of great theoretical interest, in practice it is more useful to work with the density function for continuous variables or with the probability function for discrete variables. Let $p(x_0)$ be the probability function of a discrete variable, defined by $p(x_0) = P(x = x_0) = P(x_1 = x_1^0, \ldots, x_p = x_p^0)$. We say that the vector $x$ is absolutely continuous if there is a density function, $f(x)$, which satisfies

$$F(x_0) = \int_{-\infty}^{x_0} f(x)\,dx, \qquad (9.1)$$

where $dx = dx_1 \cdots dx_p$ and the integral is a multiple integral in dimension $p$. The probability density has the usual interpretation of a density: mass per unit of volume. Thus the joint density function must verify:

(a) $f(x) = f(x_1, \ldots, x_p) \ge 0$. The density is always non-negative.

(b) $\int f(x)\,dx = \int \cdots \int f(x_1, \ldots, x_p)\,dx_1 \cdots dx_p = 1$.

If we multiply the density at each point by the element of volume in $p$ dimensions (if $p = 2$ it will be the area of a rectangle, if $p = 3$ the volume of a parallelepiped, etc.) and we sum (integrate) over all points with non-zero density, we obtain the total probability mass, which is standardized to one.

The probability of outcomes defined as subsets of the sample space is equal to the total probability corresponding to the subset. These probabilities are calculated by integrating the density function over the subset. For example, for a bivariate variable and the outcome $A = (a < x_1 \le b;\; c < x_2 \le d)$:

$$P(A) = \int_a^b \int_c^d f(x_1, x_2)\,dx_2\,dx_1,$$

whereas, in general,

$$P(A) = \int_A f(x)\,dx.$$

In this chapter, in order to simplify the notation, we will use the letter $f$ when referring to the density function of any variable and will indicate the variable by the argument of the function, so that $f(x_1)$ is the density function of the variable $x_1$ and $f(x_1, x_2)$ is the density function of the bivariate variable $(x_1, x_2)$.

9.1.3 Marginal and conditional distributions

Given a $p$-dimensional random vector $(x_1, \ldots, x_p)$, we call the univariate distribution of each component $x_i$ its marginal distribution. In the marginal distribution each component is considered individually, ignoring the values of the remaining components. For example, for bivariate continuous variables the marginal distributions are obtained as

$$f(x_1) = \int f(x_1, x_2)\,dx_2, \qquad (9.2)$$

$$f(x_2) = \int f(x_1, x_2)\,dx_1, \qquad (9.3)$$

and represent the density function of each variable ignoring the values taken by the other. As mentioned earlier, the letter $f$ refers generically to the density function. The functions $f(x_1)$ and $f(x_1, x_2)$ are in general totally distinct and share only the fact that they are density functions, so that $f(\cdot) \ge 0$, $\int f(x_1)\,dx_1 = 1$ and $\int\!\int f(x_1, x_2)\,dx_1\,dx_2 = 1$.

In order to justify (9.2), we calculate the probability that the variable $x_1$ belongs to an interval $(a, b]$ starting from the joint distribution. Thus:

$$P(a < x_1 \le b) = P(a < x_1 \le b;\; -\infty < x_2 < \infty) = \int_a^b dx_1 \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_2 = \int_a^b f(x_1)\,dx_1,$$

which justifies (9.2). We can see that in this equation $x_1$ is any fixed value. Suppose that the accuracy of the measurement of $x_1$ is $\Delta x_1$, meaning that we will say that a value $x_1$ has occurred if a value is observed in the interval $x_1 \pm \Delta x_1/2$. The probability of this value is the density value at the center of the interval, $f(x_1)$, multiplied by the length of the base, $\Delta x_1$. If we multiply both sides of equation (9.2) by the constant $\Delta x_1$, we have on the left-hand side $f(x_1)\Delta x_1$, which is the probability of this value of $x_1$ calculated with its univariate distribution. On the right-hand side we have the sum of the probabilities of all the pairs of possible values $(x_1, x_2)$ when $x_1$ is fixed and $x_2$ takes all possible values. These probabilities are given by $f(x_1, x_2)\,dx_2\,\Delta x_1$, and summing over all possible values of $x_2$ we again obtain the probability of the value $x_1$.

If $x = (x_1, x_2)$, where $x_1$ and $x_2$ are themselves vector variables, the conditional distribution of $x_1$ for a given value of the variable $x_2 = x_2^0$ is defined by

$$f(x_1 \mid x_2^0) = \frac{f(x_1, x_2^0)}{f(x_2^0)}, \qquad (9.4)$$

assuming that $f(x_2^0) > 0$. This definition is consistent with the concept of conditional probability and with the density function of a variable. Assume, in order to simplify, that both variables are scalar. Then, multiplying both sides by $\Delta x_1$, we have

$$f(x_1 \mid x_2^0)\,\Delta x_1 = \frac{f(x_1, x_2^0)\,\Delta x_1\,\Delta x_2}{f(x_2^0)\,\Delta x_2},$$

and the first member represents the conditional probability, which is expressed as the ratio of the joint probability to the marginal probability.

From this definition it can be deduced that

$$f(x_1, x_2) = f(x_2 \mid x_1)\,f(x_1). \qquad (9.5)$$

The marginal distribution of $x_2$ can then be calculated, according to (9.3) and (9.5), as

$$f(x_2) = \int f(x_2 \mid x_1)\,f(x_1)\,dx_1, \qquad (9.6)$$

which has a clear intuitive interpretation.

If we multiply both sides of (9.6) by $\Delta x_2$, the element of volume, then on the left-hand side we have $f(x_2)\Delta x_2$, the probability of the given value of $x_2$. Formula (9.6) tells us that this probability can be calculated by first obtaining the probability of the value $x_2$ for each possible value of $x_1$, given by $f(x_2 \mid x_1)\Delta x_2$, and then multiplying each of these values by the probability of $x_1$, $f(x_1)\,dx_1$, which is equivalent to averaging the conditional probabilities with respect to the distribution of $x_1$.

As a result of (9.5) and (9.6), the conditional distribution $f(x_1 \mid x_2)$ can be expressed as

$$f(x_1 \mid x_2) = \frac{f(x_2 \mid x_1)\,f(x_1)}{\int f(x_2 \mid x_1)\,f(x_1)\,dx_1}, \qquad (9.7)$$

which is Bayes' theorem for density functions and constitutes a fundamental tool in Bayesian inference, which we will study in Chapter 11. For discrete variables the concepts are similar, but the integrals are replaced with sums, as shown in the example below.

Example: Table 9.1 gives the joint distribution of the discrete random variables $x_1$, the vote for one of four possible political parties, whose four possible values are $P_1$, $P_2$, $P_3$ and $P_4$, and $x_2$, the level of voter income, which takes the three values H (high), M (medium) and L (low). We calculate the marginal distributions, the conditional distribution of the votes for people of low income, and the conditional distribution of income for those who voted for party $P_4$.

Table 9.1. Joint distribution of votes ($P_1$ to $P_4$, rows) and income (H, M, L, columns) in a population.

In order to calculate the marginal distributions, we add a row and a column to the table in which we include the totals resulting from summing the rows and columns; this gives Table 9.2. For example, the marginal distribution of income indicates that the probability of high income is .2, of middle income .6 and of low income .2. We see that the marginal distributions are the totals obtained in the margins of the table (which explains their name) by adding the joint probabilities by rows and columns.

Table 9.2. Joint and marginal distribution of votes and income in a population.

In order to calculate the conditional distribution of votes for people with low income, we divide each cell of the low-income column by the total of that column. The resulting distribution is shown in Table 9.3.

  P_1   P_2   P_3   P_4
  .05   .20   .35   .40

Table 9.3. Conditional distribution of votes for people with low income.

For example, the value .05 is the result of dividing .01, the joint probability of low income and voting for $P_1$, by the marginal probability of low income, .2. This table indicates that the party preferred by people with low income is $P_4$, with 40% of the votes, followed by $P_3$ with 35%.

Table 9.4 gives the conditional distribution of income for the voters of party $P_4$. The most numerous group of voters for this party is the middle-income group (52.63%), followed by low income (42.11%) and high income (5.26%).

        H       M       L       Total
  P_4   .0526   .5263   .4211   1

Table 9.4. Conditional distribution of income for people who voted for $P_4$.
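The calculation can be reproduced numerically. The sketch below is only illustrative: the numerical entries of Table 9.1 did not survive the transcription, so the joint probabilities used here are hypothetical stand-ins chosen to be consistent with the marginals and conditional distributions quoted in the text.

```python
import numpy as np

# Hypothetical joint probability table p(vote, income); rows P1..P4, columns H, M, L.
joint = np.array([
    [0.04, 0.15, 0.01],   # P1
    [0.05, 0.15, 0.04],   # P2
    [0.10, 0.20, 0.07],   # P3
    [0.01, 0.10, 0.08],   # P4
])

marginal_votes = joint.sum(axis=1)    # sum over income -> P(vote)
marginal_income = joint.sum(axis=0)   # sum over votes  -> P(income)

cond_votes_given_low = joint[:, 2] / marginal_income[2]   # P(vote | income = L)
cond_income_given_P4 = joint[3, :] / marginal_votes[3]    # P(income | vote = P4)

print(marginal_income)                   # [0.2 0.6 0.2]
print(cond_votes_given_low)              # [0.05 0.2  0.35 0.4 ]   (Table 9.3)
print(cond_income_given_P4.round(4))     # [0.0526 0.5263 0.4211]  (Table 9.4)
```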

9.1.4 Independence

A fundamental concept in the study of random variables is that of independence. We say that two random vectors $x_1$, $x_2$ are independent if the value taken by one has no influence on the value taken by the other and vice versa. In other words, the distribution of $x_2$ does not depend on $x_1$ and is the same for any value of $x_1$. This is expressed mathematically as

$$f(x_2 \mid x_1) = f(x_2), \qquad (9.8)$$

which indicates that the conditional distribution is identical to the marginal one. Using (9.5), an equivalent definition of independence between two random vectors $x_1$, $x_2$ is

$$f(x_1, x_2) = f(x_1)\,f(x_2), \qquad (9.9)$$

meaning that two random vectors are independent if their joint distribution (their joint probability) is the product of the marginal distributions (of the individual probabilities).

In general, we say that the random variables $x_1, \ldots, x_p$ with joint density $f(x_1, \ldots, x_p)$ are independent if

$$f(x_1, \ldots, x_p) = f(x_1)\,f(x_2) \cdots f(x_p). \qquad (9.10)$$

Joint independence is a very strong condition: if $x_1, \ldots, x_p$ are independent, the same is true for any subset of variables $(x_1, \ldots, x_h)$ with $h \le p$, as well as for any set of functions of the individual variables, $g_1(x_1), \ldots, g_p(x_p)$. When the variables are independent we gain nothing from a joint study of them, and it is advisable to study them individually. It is easy to prove that if the variables $x_1$ and $x_2$ are independent and we build new variables $y_1 = g_1(x_1)$, $y_2 = g_2(x_2)$, where the first is a function of $x_1$ only and the second of $x_2$ only, then the variables $y_1$, $y_2$ are also independent.

9.1.5 The curse of dimensionality

The curse of dimensionality is a term coined by the mathematician R. Bellman to describe how the complexity of a problem increases as the dimension of the variables involved increases. In multivariate statistical analysis this problem appears in various ways. First, as the dimension increases the space becomes increasingly empty, making any inference from the data more difficult. This is a consequence of the fact that when the dimension of the space increases so does its volume (or hypervolume, in general), and since the total probability mass is one, the density of the random variable must diminish. As a result, the probability density of a high-dimensional random variable is very low over most of the space, or, equivalently, the space grows progressively emptier.

To illustrate the problem, assume that the density of a $p$-dimensional variable is uniform on the hypercube $[0, 1]^p$ and that all components are independent. For example, samples of this variable can be produced by taking sets of $p$ random numbers between zero and one. Let us consider the probability that a random value of this variable falls in the hypercube $[0, 0.9]^p$. For $p = 1$, the scalar case, this probability is $0.9$; for $p = 10$ it drops to $0.9^{10} = 0.35$, and for $p = 30$ it is $0.9^{30} = 0.04$. We can see that as the dimension of the space increases, any fixed set progressively empties.

A second problem is that the number of parameters needed to describe the data also increases with the dimension. To represent the mean and the covariance matrix in dimension $p$ we need $p + p(p+1)/2 = p(p+3)/2$ parameters, which is of order $p^2$. Thus the complexity of the data, measured by the number of parameters needed to represent them, grows in this case with the square of the dimension of the space.
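A quick computation illustrates both effects just described: the probability mass that the uniform distribution on $[0,1]^p$ assigns to the hypercube $[0, 0.9]^p$, and the number of parameters $p(p+3)/2$ needed for a mean vector plus a covariance matrix.

```python
# Probability of the hypercube [0, 0.9]^p under a uniform density on [0, 1]^p,
# and the parameter count p(p+3)/2 for a mean vector plus covariance matrix.
for p in (1, 10, 30):
    print(p, round(0.9 ** p, 2), p * (p + 3) // 2)
# 1  0.9  2
# 10 0.35 65
# 30 0.04 495
```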

For example, a sample of size 100 is a large sample for a one-dimensional variable, but it is quite small for a vector variable with $p = 20$. As a general rule, multivariate procedures need a ratio $n/p > 10$, and it is preferable for this ratio to be greater than 20. The result of an increase in dimension is an increase in the uncertainty of the problem: the joint forecast of the values of the variable becomes more difficult. In practice this problem diminishes if the variables are strongly dependent among themselves, since the probability density is then concentrated in certain regions of the space, defined by the dependence relationship, instead of being spread throughout the sample space. This dependence can be exploited, extending the methods seen in earlier chapters, in order to reduce the dimension of the space of the variables and thus avoid the curse of dimensionality.

9.2 PROPERTIES OF VECTOR VARIABLES

9.2.1 Mean vector

We use the term expectation or mean vector, $\mu$, of a multidimensional random variable $x$ to refer to the vector whose components are the expectations or means of the components of the random variable. We write the mean vector as

$$\mu = E[x], \qquad (9.11)$$

where it is understood that the expectation acting on a vector or matrix is the result of applying this operator to (taking the means of) each of its components. If the variable is continuous,

$$\mu = E[x] = \int x f(x)\,dx.$$

The expectation is a linear operator, meaning that for any matrix $A$ and vector $b$ we have

$$E[Ax + b] = A\,E[x] + b.$$

If $x = (x_1, x_2)$ we also have that, for scalars $a$ and $b$,

$$E[a x_1 + b x_2] = a\,E[x_1] + b\,E[x_2],$$

and if $x_1$ and $x_2$ are independent,

$$E[x_1 x_2] = E[x_1]\,E[x_2].$$

9.2.2 Expectation of a function

Generalizing the idea of expectation, if we have a scalar function $y = g(x)$ of a vector of random variables, the mean value of this function is calculated as

$$E[y] = \int y f(y)\,dy = \int \cdots \int g(x)\,f(x_1, \ldots, x_p)\,dx_1 \cdots dx_p. \qquad (9.12)$$

The first integral takes into account that $y$ is scalar, and if we know its density function $f(y)$ the expectation can be calculated in the usual way. The second shows that it is not necessary to compute $f(y)$ in order to determine the average value of $g(x)$: it is enough to weight the possible values by their probabilities. This definition is consistent, and it is easy to check that both methods lead to the same result. If $x = (x_1, x_2)$ and we define $y_1 = g_1(x_1)$, $y_2 = g_2(x_2)$, then if $x_1$ and $x_2$ are independent

$$E[y_1 y_2] = E(g_1(x_1))\,E(g_2(x_2)).$$

9.2.3 Variance and covariance matrix

We use the term covariance matrix of a random vector $x = (x_1, \ldots, x_p)$ of $R^p$ with mean vector $\mu = (\mu_1, \ldots, \mu_p)$ to refer to the square matrix of order $p$ given by

$$V_x = E[(x - \mu)(x - \mu)']. \qquad (9.13)$$

The matrix $V_x$ contains in its diagonal the variances of the components, which are denoted by $\sigma_i^2$. Outside the diagonal it contains the covariances between pairs of variables, denoted by $\sigma_{ij}$. The covariance matrix is symmetric and positive semidefinite. This means that for any vector $\omega$ it holds that $\omega' V_x \omega \ge 0$. In order to prove this property we define the one-dimensional variable

$$y = (x - \mu)'\omega,$$

where $\omega$ is an arbitrary vector of $R^p$. The variable $y$ has expectation zero, because $E(y) = E[(x - \mu)]'\omega = 0$, and its variance must be non-negative:

$$var(y) = E[y^2] = \omega' E[(x - \mu)(x - \mu)']\,\omega = \omega' V_x \omega \ge 0.$$
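The property can also be checked numerically. The following minimal sketch, using a simulated sample, verifies that $\omega' V_x \omega \ge 0$ for an arbitrary vector $\omega$ and that all eigenvalues of a covariance matrix are non-negative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))        # a sample from a 3-dimensional random vector
V = np.cov(X, rowvar=False)           # sample covariance matrix

omega = rng.normal(size=3)            # an arbitrary direction
print(omega @ V @ omega >= 0)                      # True: the quadratic form is non-negative
print(np.all(np.linalg.eigvalsh(V) >= -1e-12))     # True: eigenvalues >= 0 up to rounding
```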

We call the mean variance the average of the variances, $tr(V_x)/p$; the generalized variance, $|V_x|$; and the effective variance, $VP = |V_x|^{1/p}$, which is a global measure of the joint variability of all the variables that takes their dependence structure into account. The interpretation of these measures is similar to that studied in Chapter 3 for data distributions.

9.2.4 Transformations of random vectors

When working with density functions of random vectors it is important to remember that, as in the univariate case, the density function has dimensions: if $p = 1$, the univariate case, it is probability per unit of length; if $p = 2$, probability per unit of area; if $p = 3$, per unit of volume; and if $p > 3$, per unit of hypervolume. Therefore, if we change the units of measurement of the variables, the density function must also be modified. In general, let $x$ be a vector of $R^p$ with density function $f_x(x)$ and let $y$ be another vector of $R^p$ defined by the transformation

$$y_1 = g_1(x_1, \ldots, x_p), \;\ldots,\; y_p = g_p(x_1, \ldots, x_p),$$

where we assume that the inverse functions $x_1 = h_1(y_1, \ldots, y_p), \ldots, x_p = h_p(y_1, \ldots, y_p)$ exist and that all the functions involved are differentiable. Then it can be proved that the density function of the vector $y$ is given by

$$f_y(y) = f_x(x)\left|\frac{dx}{dy}\right|, \qquad (9.14)$$

where we use $f_y$ and $f_x$ to represent the density functions of the variables $y$ and $x$ in order to avoid confusion. The term $|dx/dy|$ is the Jacobian of the transformation (which adjusts the probability for the change of scale of the measurement), given by the determinant

$$\frac{dx}{dy} = \begin{vmatrix} \dfrac{\partial x_1}{\partial y_1} & \cdots & \dfrac{\partial x_1}{\partial y_p} \\ \vdots & & \vdots \\ \dfrac{\partial x_p}{\partial y_1} & \cdots & \dfrac{\partial x_p}{\partial y_p} \end{vmatrix},$$

which we assume is different from zero in the range of the transformation. An important case is that of linear transformations of the variable.

If we take $y = Ax$, where $A$ is a non-singular square matrix, the derivatives of the components of $x$ with respect to $y$ are obtained from $x = A^{-1}y$, and are thus given by the elements of the matrix $A^{-1}$. The Jacobian of the transformation is $|A^{-1}| = |A|^{-1}$, and the density function of the new variable $y$ is

$$f_y(y) = f_x(A^{-1}y)\,|A|^{-1}, \qquad (9.15)$$

an expression which indicates that, in order to obtain the density function of the variable $y$, we substitute $A^{-1}y$ into the density function of the variable $x$ and divide the result by the determinant of the matrix $A$.

9.2.5 Expectations of linear transformations

Let $x$ be a random vector of dimension $p$ and define a new random vector $y$ of dimension $m$ ($m \le p$) by

$$y = Ax, \qquad (9.16)$$

where $A$ is a rectangular matrix of dimensions $m \times p$. Letting $\mu_x$, $\mu_y$ be the mean vectors and $V_x$, $V_y$ the covariance matrices, we have

$$\mu_y = A\mu_x, \qquad (9.17)$$

which follows immediately by taking expectations in (9.16). Also,

$$V_y = A V_x A', \qquad (9.18)$$

where $A'$ is the transpose of $A$. Indeed, applying the definition of the covariance matrix and equations (9.16) and (9.17),

$$V_y = E[(y - \mu_y)(y - \mu_y)'] = E[A(x - \mu_x)(x - \mu_x)'A'] = A V_x A'.$$

Example: Clients of a transportation service evaluated the following aspects: punctuality ($x_1$), quickness ($x_2$) and cleanliness ($x_3$). The means, on a scale of zero to ten, were 7, 8 and 8.5 respectively, with variance covariance matrix $V_x$. Two indicators of service quality are built. The first is the average of the three scores, and the second is the difference between the average of punctuality and quickness, which indicates the reliability of the service, and cleanliness, which indicates comfort. Calculate the mean vector and covariance matrix of these two indicators.

The expression for the first indicator is

$$y_1 = \frac{x_1 + x_2 + x_3}{3}$$

and for the second

$$y_2 = \frac{x_1 + x_2}{2} - x_3.$$

These two equations can be written in matrix form as

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} 1/3 & 1/3 & 1/3 \\ 1/2 & 1/2 & -1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}.$$

The mean vector is

$$\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} = \begin{bmatrix} 1/3 & 1/3 & 1/3 \\ 1/2 & 1/2 & -1 \end{bmatrix}\begin{bmatrix} 7 \\ 8 \\ 8.5 \end{bmatrix} = \begin{bmatrix} 7.83 \\ -1 \end{bmatrix},$$

and the value 7.83 is a global measure of the average quality of the service, while minus one is the reliability/comfort indicator. The variance covariance matrix is obtained as $V_y = A V_x A'$, with $A$ the coefficient matrix above; the result indicates that the variability of both indicators is similar and that they are negatively related, since their covariance is negative.
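The computation of (9.17) and (9.18) for this example can be sketched as follows. The numerical entries of $V_x$ did not survive the transcription, so the matrix below is only a hypothetical stand-in; the means and the coefficient matrix are those of the example.

```python
import numpy as np

A = np.array([[1/3, 1/3, 1/3],
              [1/2, 1/2, -1.0]])     # coefficients of the two quality indicators
mu_x = np.array([7.0, 8.0, 8.5])     # means given in the example

# Hypothetical covariance matrix of (punctuality, quickness, cleanliness).
V_x = np.array([[1.0, 0.5, 0.6],
                [0.5, 1.0, 0.6],
                [0.6, 0.6, 1.2]])

mu_y = A @ mu_x                      # equation (9.17)
V_y = A @ V_x @ A.T                  # equation (9.18)
print(mu_y.round(2))                 # [ 7.83 -1.  ]
print(V_y.round(3))                  # e.g. [[ 0.733 -0.1  ]  [-0.1    0.75 ]]
```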

9.3 DEPENDENCE BETWEEN RANDOM VARIABLES

9.3.1 Conditional expectations

The expectation of a vector $x_1$ conditional on a given value of another vector $x_2$ is the expectation of the distribution of $x_1$ conditional on $x_2$, and is given by

$$E[x_1 \mid x_2] = \int x_1 f(x_1 \mid x_2)\,dx_1.$$

In general, this expression is a function of the value $x_2$. When $x_2$ is a fixed value, the conditional expectation is a constant. If $x_2$ is a random variable, the conditional expectation is also a random variable.

The expectation of a random vector $x_1$ can be calculated from the conditional expectations in two steps: in the first we calculate the expectation of $x_1$ conditional on $x_2$; the result is a random function which depends on the random variable $x_2$. In the second, we calculate the expectation of this function with respect to the distribution of $x_2$. Then:

$$E(x_1) = E[E(x_1 \mid x_2)]. \qquad (9.19)$$

This expression indicates that the expectation of a random variable can be obtained by averaging the conditional expectations by their probability of appearance; in other words, the expectation of the conditional mean is the marginal or unconditional expectation.

Proof:

$$E(x_1) = \int x_1 f(x_1)\,dx_1 = \int\!\!\int x_1 f(x_1, x_2)\,dx_1\,dx_2 = \int\!\!\int x_1 f(x_1 \mid x_2)f(x_2)\,dx_1\,dx_2 = \int f(x_2)\left[\int x_1 f(x_1 \mid x_2)\,dx_1\right]dx_2 = \int E[x_1 \mid x_2]\,f(x_2)\,dx_2 = E[E(x_1 \mid x_2)].$$

9.3.2 Conditional variances

The variance of $x_1$ conditional on $x_2$ is defined as the variance of the distribution of $x_1$ conditional on $x_2$. We use the notation $Var(x_1 \mid x_2) = V_{1/2}$, and this matrix has the properties of the covariance matrices studied earlier.

If $x_1$ is scalar, its variance can also be calculated from the properties of the conditional distribution. Specifically, it can be expressed as the sum of two terms: the first associated with the conditional means and the second with the conditional variances. In order to obtain this expression we start with the decomposition

$$x_1 - \mu_1 = [x_1 - E(x_1 \mid x_2)] + [E(x_1 \mid x_2) - \mu_1],$$

where $x_2$ is any random vector with finite conditional expectation $E(x_1 \mid x_2)$. Squaring this expression and taking expectations on both sides with respect to both $x_1$ and $x_2$, we have

$$var(x_1) = E[x_1 - E(x_1 \mid x_2)]^2 + E[E(x_1 \mid x_2) - \mu_1]^2 + 2E[(x_1 - E(x_1 \mid x_2))(E(x_1 \mid x_2) - \mu_1)].$$

The double product is zero, because

$$E[(x_1 - E(x_1 \mid x_2))(E(x_1 \mid x_2) - \mu_1)] = \int (E(x_1 \mid x_2) - \mu_1)\left[\int (x_1 - E(x_1 \mid x_2)) f(x_1 \mid x_2)\,dx_1\right] f(x_2)\,dx_2 = 0,$$

since the integral in brackets is null.

On the other hand, as in (9.19), $E[E(x_1 \mid x_2)] = E(x_1) = \mu_1$, so the second term is the expectation of the square of the random variable $E(x_1 \mid x_2)$ minus its mean, $\mu_1$; it is therefore the variance of the variable $E(x_1 \mid x_2)$. The first term can be evaluated by taking first the expectation with respect to $(x_1 \mid x_2)$, which leads to the variance $var(x_1 \mid x_2)$, and then the expected value of this variable with respect to $x_2$. We then obtain:

$$var(x_1) = E[var(x_1 \mid x_2)] + var[E(x_1 \mid x_2)]. \qquad (9.20)$$

This expression is known as the variance decomposition, since it decomposes the variability of the variable into two main sources of variation. On the one hand, variability exists because the variances of the conditional distributions, $var(x_1 \mid x_2)$, may be different, and the first term averages these variances. On the other hand, variability also exists because the means of the conditional distributions may be different, and the second term includes the differences between the conditional means, $E(x_1 \mid x_2)$, and the total mean, $\mu_1$, through the term $var[E(x_1 \mid x_2)]$.

We can see that the variance of the variable $x_1$ is, in general, greater than the average of the variances of the conditional distributions, due to the fact that in the conditionals the variability is measured from the conditional means, $E(x_1 \mid x_2)$, whereas $var(x_1)$ measures the variability with respect to the global mean, $\mu_1$. If all the conditional means are equal to $\mu_1$, which happens for example when $x_1$ and $x_2$ are independent, then the term $var[E(x_1 \mid x_2)]$ is zero and the variance is the weighted average of the conditional variances. If $E(x_1 \mid x_2)$ is not constant, the variance of $x_1$ increases with the variability of the conditional means.

This decomposition of the variance appears in the analysis of variance of univariate linear models:

$$\sum (x_i - \bar{x})^2/n = \sum (x_i - \hat{x}_i)^2/n + \sum (\hat{x}_i - \bar{x})^2/n,$$

where, in this expression, $\hat{x}_i$ is the estimate of the conditional mean in the linear model. The total variability, which corresponds to $var(x_1)$, is decomposed into two uncorrelated terms. On one side, the average of the estimates of $var(x_1 \mid x_2)$, which is calculated by averaging the differences between the variable and the conditional mean. On the other, the variability of the conditional expectations, which in linear models is estimated through the differences $\hat{x}_i - \bar{x}$.
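A simulation gives a quick check of (9.20). The model below is assumed only for illustration: $x_2 \sim N(0,1)$ and $x_1 \mid x_2 \sim N(2x_2, 1)$, so that $E[var(x_1 \mid x_2)] = 1$, $var[E(x_1 \mid x_2)] = 4$ and $var(x_1) = 5$.

```python
import numpy as np

rng = np.random.default_rng(0)
x2 = rng.normal(size=1_000_000)
x1 = 2 * x2 + rng.normal(size=x2.size)   # x1 | x2 ~ N(2*x2, 1)

print(np.var(x1))                        # close to 5
print(1 + np.var(2 * x2))                # E[var(x1|x2)] + var[E(x1|x2)], also close to 5
```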

9.3.3 Correlation matrix

The correlation matrix of a random vector $x$ with covariance matrix $V_x$ is defined by

$$R_x = D^{-1/2} V_x D^{-1/2},$$

where $D = diag(\sigma_1^2, \ldots, \sigma_p^2)$ is the diagonal matrix which contains the variances of the variables. The correlation matrix is then a square, symmetric matrix with ones in the diagonal and the correlation coefficients between pairs of variables outside the diagonal. The simple, or linear, correlation coefficients are given by

$$\rho_{ij} = \frac{\sigma_{ij}}{\sigma_i\sigma_j}.$$

The correlation matrix is also positive semidefinite. A global measure of the linear correlations existing in the set of variables is the dependence, defined by

$$D_x = 1 - |R_x|^{1/(p-1)},$$

whose interpretation for random variables is analogous to that presented in Chapter 3 for statistical variables. For $p = 2$ the matrix $R_x$ has ones in the diagonal and the coefficient $\rho_{12}$ outside, so that $|R_x| = 1 - \rho_{12}^2$, and the dependence $D_x = 1 - (1 - \rho_{12}^2) = \rho_{12}^2$ coincides with the coefficient of determination. Just as we saw in Chapter 3, in the general case where $p > 2$ the dependence is a geometric average of coefficients of determination.

9.3.4 Multiple correlation

A linear measure of the ability to predict $y$ using a linear function of the variables $x$ is the multiple correlation coefficient. Assuming, without loss of generality, that the variables have zero mean, we define the best linear prediction of $y$ as the function $\beta'x$ which minimizes $E(y - \beta'x)^2$. It can be shown that

$$\beta = V_x^{-1}V_{xy},$$

where $V_x$ is the covariance matrix of $x$ and $V_{xy}$ is the vector of covariances between $y$ and $x$. The simple correlation coefficient between the scalar variable $y$ and $\beta'x$ is called the multiple correlation coefficient. It can be proved that if we let $\sigma_{ii}$ be the diagonal terms of the covariance matrix $V$ of a vector variable and $\sigma^{ii}$ the diagonal terms of the matrix $V^{-1}$, then the multiple correlation coefficient $R_{i.R}$ between each variable ($i$) and the rest ($R$) is given by

$$R_{i.R}^2 = 1 - \frac{1}{\sigma_{ii}\,\sigma^{ii}}.$$

In particular, if $E(y \mid x)$ is a linear function of $x$, then $E(y \mid x) = \beta'x$ and $R_{i.R}^2$ can also be calculated as $1 - \sigma_{y|x}^2/\sigma_y^2$, where $\sigma_{y|x}^2$ is the variance of the conditional distribution $y \mid x$ and $\sigma_y^2$ is the marginal variance of $y$.
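The following sketch computes, for an assumed covariance matrix, the correlation matrix, the dependence $D_x$ and the squared multiple correlation of each variable with the rest using the formula above.

```python
import numpy as np

V = np.array([[2.0, 0.8, 0.4],     # an assumed covariance matrix, for illustration
              [0.8, 1.0, 0.3],
              [0.4, 0.3, 1.5]])
p = V.shape[0]

D_inv_half = np.diag(1 / np.sqrt(np.diag(V)))
R = D_inv_half @ V @ D_inv_half                  # R_x = D^{-1/2} V_x D^{-1/2}
dependence = 1 - np.linalg.det(R) ** (1 / (p - 1))

V_inv = np.linalg.inv(V)
R2_multiple = 1 - 1 / (np.diag(V) * np.diag(V_inv))   # R^2 of each variable on the rest

print(R.round(3))
print(round(dependence, 3))
print(R2_multiple.round(3))
```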

9.3.5 Partial correlations

Let us assume that we obtain the best linear approximation to a vector of variables $x_1$ of dimension $p_1 \times 1$ starting from another vector of variables $x_2$ of dimension $p_2 \times 1$. Supposing that the variables have zero mean, this implies calculating a vector $Bx_2$, where $B$ is a coefficient matrix of dimensions $p_1 \times p_2$ such that $\sum_{j=1}^{p_1} E(x_{1j} - \beta_j' x_2)^2$ is minimum, where $x_{1j}$ is the $j$-th component of the vector $x_1$ and $\beta_j'$ the $j$-th row of the matrix $B$. We let $V_{1/2}$ be the covariance matrix of the variable $x_1 - Bx_2$. If we standardize this covariance matrix in order to obtain correlations, the resulting correlation coefficients are called partial correlation coefficients between the components of $x_1$ given the variables $x_2$. The square, symmetric matrix of order $p_1$

$$R_{1/2} = D_{1/2}^{-1/2} V_{1/2} D_{1/2}^{-1/2}$$

is called the partial correlation matrix between the components of the vector $x_1$ when we control for (or condition on) the vector $x_2$, where $D_{1/2} = diag(\sigma_{1/2}^2, \ldots, \sigma_{p_1/2}^2)$ and $\sigma_{j/2}^2$ is the variance of the variable $x_{1j} - \beta_j' x_2$. In particular, if $E(x_1 \mid x_2)$ is linear in $x_2$, then $E(x_1 \mid x_2) = Bx_2$ and $V_{1/2}$ is the covariance matrix of the conditional distribution of $x_1 \mid x_2$.

9.4 MULTINOMIAL DISTRIBUTION

Suppose that we observe elements which we can classify into two classes, $A$ and $\bar{A}$: for example, newborns in a hospital classified as male ($A$) or female ($\bar{A}$), the days in a month as rainy ($A$) or not ($\bar{A}$), or the objects manufactured by a machine as good ($A$) or defective ($\bar{A}$). We assume that the process which generates the elements is stable, so that there is a constant probability that elements of either class appear, $P(A) = p = \text{constant}$, and that the process has no memory, in other words $P(A \mid A) = P(A \mid \bar{A})$. Suppose that we observe random elements of this process and define the variable

$$x = \begin{cases} 1, & \text{if the observation belongs to class } A \\ 0, & \text{otherwise.} \end{cases}$$

This variable follows a Bernoulli distribution, with $P(x = 1) = p$ and $P(x = 0) = 1 - p$. If we observe $n$ elements instead of one and define the variable $y = \sum_{i=1}^n x_i$, that is, we count the number of elements among the $n$ which belong to the first class, then the variable $y$ follows a binomial distribution with

$$P(y = r) = \frac{n!}{r!(n-r)!}\,p^r(1-p)^{n-r}.$$

We can generalize this distribution by allowing for $G$ classes instead of two. Let $p = (p_1, \ldots, p_G)$ be the vector of probabilities of belonging to each class, where $\sum p_j = 1$.

We can now define the $G$ random variables

$$x_j = \begin{cases} 1, & \text{if the observation belongs to group } j \\ 0, & \text{otherwise,} \end{cases} \qquad j = 1, \ldots, G,$$

and the result of an observation is a value of the vector of $G$ variables $x = (x_1, \ldots, x_G)$, which always takes the form $x = (0, \ldots, 1, \ldots, 0)$, since only one of the $G$ components can take the value one, namely the one associated with the class observed for this element. As a result, the components of this random variable are not independent, since they are bound by the equation

$$\sum_{j=1}^G x_j = 1.$$

In order to describe the result of the observation it would be enough to define $G - 1$ variables, as is done in the Bernoulli distribution, where a single variable is defined when there are two classes, since the value of the last variable is determined once the rest are known. Nevertheless, with more than two classes it is customary to work with the $G$ variables, and the distribution of the multivariate variable thus defined is called the point multinomial. Its probability function is

$$P(x_1, \ldots, x_G) = p_1^{x_1}\cdots p_G^{x_G} = \prod_j p_j^{x_j}.$$

Since only one of the $x_j$ is different from zero, the probability that the $j$-th is one is precisely $p_j$, the probability that the observed element belongs to class $j$.

Generalizing this distribution, let $(x_1, \ldots, x_n)$ be a sample of $n$ values of this point multinomial variable, which results from classifying $n$ elements of a sample into the $G$ classes. We use the term multinomial distribution to refer to the sum variable

$$y = \sum_{i=1}^n x_i,$$

which indicates the number of elements of the sample that fall into each of the classes. The components of this variable, $y = (y_1, \ldots, y_G)$, represent the observed frequencies of each class and can take the values $y_i = 0, 1, \ldots, n$, but they are always subject to the restriction

$$\sum y_i = n, \qquad (9.21)$$

and their probability function is

$$P(y_1 = n_1, \ldots, y_G = n_G) = \frac{n!}{n_1!\cdots n_G!}\,p_1^{n_1}\cdots p_G^{n_G},$$

where $\sum n_i = n$.

The combinatorial term takes into account the permutations of $n$ elements when there are $n_1, \ldots, n_G$ repetitions. It can be verified that

$$E(y) = np = \mu_y \qquad \text{and} \qquad Var(y) = n\left[diag(p) - pp'\right] = diag(\mu_y) - \frac{1}{n}\mu_y\mu_y',$$

where $diag(p)$ is a square matrix with the elements of $p$ in the diagonal and zeros outside. This matrix is singular, since the elements of $y$ are bound by the restriction (9.21). It is easy to verify that the marginal distributions are binomial, with

$$E[y_j] = np_j, \qquad SD[y_j] = \sqrt{np_j(1 - p_j)}.$$

Additionally, any conditional distribution is multinomial. With $G - 1$ variables, for example, when $y_G$ takes the fixed value $n_G$, we have a multinomial in the $G - 1$ remaining variables with sample size $n' = n - n_G$. The conditional distribution of $y_1, y_2$ when $y_3 = n_3, \ldots, y_G = n_G$ is binomial, with $n' = n - n_3 - n_4 - \cdots - n_G$, and so on.

Example: In a quality control process, elements can have three types of defects: slight ($A_1$), medium ($A_2$) and serious ($A_3$), and it is known that among the defective elements the probabilities are $p_1 = P(A_1) = 0.7$, $p_2 = P(A_2) = 0.2$ and $p_3 = P(A_3) = 0.1$. Calculate the probability that among the next three defective elements there will be exactly one with a serious defect.

The possible combinations of defects in the next three elements with exactly one serious defect are, without taking into account the order of appearance, $A_1A_1A_3$, $A_1A_2A_3$ and $A_2A_2A_3$, and their probabilities according to the multinomial distribution are:

$$P(x_1 = 2, x_2 = 0, x_3 = 1) = \frac{3!}{2!\,0!\,1!}\,0.7^2 \times 0.2^0 \times 0.1 = 0.147$$

$$P(x_1 = 1, x_2 = 1, x_3 = 1) = \frac{3!}{1!\,1!\,1!}\,0.7 \times 0.2 \times 0.1 = 0.084$$

$$P(x_1 = 0, x_2 = 2, x_3 = 1) = \frac{3!}{0!\,2!\,1!}\,0.7^0 \times 0.2^2 \times 0.1 = 0.012$$

Then:

$$P(x_3 = 1) = 0.147 + 0.084 + 0.012 = 0.243.$$

The same result can also be obtained using the binomial for $(\bar{A}_3, A_3)$ with probabilities $(0.9, 0.1)$:

$$P(x_3 = 1) = \binom{3}{1}\,0.1 \times 0.9^2 = 0.243.$$
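This example can be reproduced directly with scipy's multinomial and binomial distributions; the sketch below uses the probabilities given in the text.

```python
from scipy.stats import multinomial, binom

# P(exactly one serious defect among the next 3 defective items),
# with class probabilities (slight, medium, serious) = (0.7, 0.2, 0.1).
p = [0.7, 0.2, 0.1]
cases = [(2, 0, 1), (1, 1, 1), (0, 2, 1)]
prob = sum(multinomial.pmf(c, n=3, p=p) for c in cases)
print(round(prob, 3))                    # 0.243

# Equivalent computation with the marginal binomial of the serious-defect count
print(round(binom.pmf(1, 3, 0.1), 3))    # 0.243
```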

9.5 THE DIRICHLET DISTRIBUTION

The Dirichlet distribution is introduced in order to represent variables that lie between zero and one and whose sum is equal to one. Such data are known as compositional data. For example, suppose that we are studying the relative weight that consumers assign to a set of quality attributes, and that the evaluation of the importance of those attributes is carried out on a scale from zero to one. Thus, with three attributes a client can assign the scores (0.6, 0.3, 0.1), indicating that the first attribute has an importance of 60%, the second 30% and the third 10%. Other examples of this type of data are the proportion of time invested in certain activities, or the percentage composition of different substances found in a group of products. In all of these cases the data are continuous variable vectors $x = (x_1, \ldots, x_G)$ such that, by construction, $0 \le x_j \le 1$ and there is a restriction equation:

$$\sum_{j=1}^G x_j = 1.$$

An appropriate distribution to represent these situations is the Dirichlet distribution, whose density function is

$$f(x_1, \ldots, x_G) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\cdots\Gamma(\alpha_G)}\,x_1^{\alpha_1 - 1}\cdots x_G^{\alpha_G - 1},$$

where $\Gamma(\cdot)$ is the gamma function, $\alpha = (\alpha_1, \ldots, \alpha_G)$ is the vector of parameters that characterizes the distribution, and

$$\alpha_0 = \sum_{j=1}^G \alpha_j.$$

It can be proved that

$$E(x) = \alpha/\alpha_0 = \mu_x,$$

so that the parameters $\alpha_j$ indicate the relative expectation of each component, and

$$Var(x) = \frac{1}{\alpha_0 + 1}\left(\frac{1}{\alpha_0}diag(\alpha) - \frac{1}{\alpha_0^2}\alpha\alpha'\right).$$

This expression indicates that the variance of each component is

$$var(x_j) = \frac{\alpha_j(\alpha_0 - \alpha_j)}{\alpha_0^2(\alpha_0 + 1)},$$

and we see that the parameter $\alpha_0$ determines the variance of the components and that these variances decrease rapidly as $\alpha_0$ grows.
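A short simulation can be used to check the moment formulas above. The parameter vector below is an assumption chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([6.0, 3.0, 1.0])      # assumed parameters; mean = alpha/a0 = (0.6, 0.3, 0.1)
a0 = alpha.sum()
x = rng.dirichlet(alpha, size=200_000)

print(x.mean(axis=0).round(3))         # close to alpha / a0
print(x.var(axis=0).round(4))          # close to the theoretical variances below
print((alpha * (a0 - alpha) / (a0**2 * (a0 + 1))).round(4))
```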

The Dirichlet variables, as with multinomial ones, are subject to a restriction equation; they are therefore not linearly independent and their covariance matrix is singular. The covariances between two components are

$$cov(x_i, x_j) = -\frac{\alpha_i\alpha_j}{\alpha_0^2(\alpha_0 + 1)},$$

and these covariances also diminish with $\alpha_0$, but they increase when the expectations of the variables increase. The reader can appreciate the similarity between the multinomial formulas for probabilities, means and variances and those of the Dirichlet. This similarity is due to the fact that in both cases we classify the results into $G$ groups. The difference is that in the multinomial case we count how many of the $n$ observations appear in each group, whereas in the Dirichlet we measure the proportion that the element contains of each class. In the Dirichlet distribution the parameter $\alpha_0$ plays a role similar to that of the sample size, and the ratios $\alpha_j/\alpha_0$ to that of the probabilities.

9.6 THE K-DIMENSIONAL NORMAL DISTRIBUTION

The density function of the scalar normal random variable is

$$f(x) = (\sigma^2)^{-1/2}(2\pi)^{-1/2}\exp\left\{-\tfrac{1}{2}(x - \mu)^2\sigma^{-2}\right\},$$

and we write $x \sim N(\mu, \sigma^2)$ to indicate that $x$ has a normal distribution with mean $\mu$ and variance $\sigma^2$. Generalizing this function, we say that a vector $x$ follows a $p$-dimensional normal distribution if its density function is

$$f(x) = |V|^{-1/2}(2\pi)^{-p/2}\exp\left\{-\tfrac{1}{2}(x - \mu)'V^{-1}(x - \mu)\right\}. \qquad (9.22)$$

We write $x \sim N_p(\mu, V)$. Figure 9.1 shows a bivariate normal with $\mu = (0, 0)$ and

$$V = \begin{bmatrix} 1 & 1/\sqrt{3} \\ 1/\sqrt{3} & 1 \end{bmatrix},$$

together with its marginal distributions.

Figure 9.1: Representation of the bivariate normal distribution and its marginals.

The principal properties of the multivariate normal are:

1. The distribution is symmetric around $\mu$. The symmetry can be proven by replacing $x$ with $\mu \pm a$ in the density and observing that $f(\mu + a) = f(\mu - a)$.

2. The distribution has a single maximum at $\mu$. Since $V$ is positive definite, the term $(x - \mu)'V^{-1}(x - \mu)$ in the exponent is always positive, and the density $f(x)$ is maximum when that term is zero, which occurs when $x = \mu$.

3. The mean of the normal random vector is $\mu$ and its covariance matrix is $V$. These properties, which can be rigorously proven, are deduced from the comparison of the univariate and multivariate densities.

4. If $p$ random variables have a joint normal distribution and are uncorrelated, they are independent. The proof of this property is obtained by taking a diagonal matrix $V$ in (9.22) and verifying that $f(x) = f(x_1)\cdots f(x_p)$.

5. Any $p$-dimensional normal vector $x$ with non-singular matrix $V$ can be converted, using a linear transformation, into a $p$-dimensional normal vector $z$ with mean vector $0$ and variance covariance matrix equal to the identity, $I$. We call the density of $z$ the standard $p$-dimensional normal, which is given by

$$f(z) = \frac{1}{(2\pi)^{p/2}}\exp\left\{-\tfrac{1}{2}z'z\right\} = \prod_{i=1}^{p}\frac{1}{(2\pi)^{1/2}}\exp\left\{-\tfrac{1}{2}z_i^2\right\}. \qquad (9.23)$$

The proof of this property is the following. Since $V$ is positive definite, there is a square, symmetric matrix $A$, its square root, which verifies

$$V = AA'. \qquad (9.24)$$

Defining the new variable

$$z = A^{-1}(x - \mu), \qquad (9.25)$$

then $x = \mu + Az$ and, according to (9.14), the density function of $z$ is

$$f_z(z) = f_x(\mu + Az)\,|A|;$$

then, using $AV^{-1}A = I$, (9.23) is obtained. Therefore, any vector of normal variables $x$ in $R^p$ can be transformed into another vector of $R^p$ of independent normal variables with unit variance.

6. The marginal distributions are normal. If the variables are independent, the proof of this property is immediate. A general proof can be seen, for example, in Mardia et al. (1979).

7. Any subset of $h < p$ variables is $h$-dimensional normal. This is an extension of the previous property and is proved in the same way.

8. If $y$ is a $(k \times 1)$ vector, $k \le p$, defined by $y = Ax$, where $A$ is a $(k \times p)$ matrix, then $y$ is $k$-dimensional normal. In particular, any scalar variable $y = a'x$ (where $a$ is a non-zero $p \times 1$ vector) has a normal distribution. The proof can be seen, for example, in Mardia et al. (1979).

9. By cutting the density function with parallel hyperplanes the level curves are obtained. Their equation is

$$(x - \mu)'V^{-1}(x - \mu) = \text{constant}.$$

The level curves are therefore ellipses. If we consider that all the points on a level curve are at the same distance from the center of the distribution, the implied distance is called the Mahalanobis distance, and it is given by

$$D^2 = (x - \mu)'V^{-1}(x - \mu). \qquad (9.26)$$

As an illustration, take the simple case of the two univariate distributions shown in Figure 9.2. The observation $x = 3$, indicated with an X in the graph, is closer in Euclidean distance to the center of distribution A, which is zero, than to the center of B, which is ten. Nevertheless, with the Mahalanobis distance, the distance of point X to distribution A, which has a standard deviation of one, is $(3 - 0)^2/1^2 = 9$, whereas the distance to the center of B, which has a standard deviation of ten, is $(3 - 10)^2/10^2 = 0.7^2 = 0.49$; with this distance, point X is much closer to distribution B. This reflects the fact that it is much more likely that this point comes from distribution B than from A.

10. The Mahalanobis distance is distributed as a $\chi^2$ with $p$ degrees of freedom. In order to prove this, we apply the transformation (9.25) and, since $V^{-1} = A^{-1}A^{-1}$, we obtain $D^2 = z'z = \sum z_i^2$, where each $z_i$ is $N(0, 1)$. Thus $D^2 \sim \chi^2_p$.
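The univariate illustration and the $\chi^2_p$ property can both be checked numerically. The 3-dimensional covariance matrix in the second part is an assumption used only for the Monte Carlo check.

```python
import numpy as np
from scipy.stats import chi2

# Univariate example: A has mean 0 and sd 1, B has mean 10 and sd 10, point x = 3.
x = 3.0
print((x - 0) ** 2 / 1 ** 2)     # 9.0   (Mahalanobis distance to A)
print((x - 10) ** 2 / 10 ** 2)   # 0.49  (Mahalanobis distance to B)

# Monte Carlo check that D^2 = (x-mu)' V^{-1} (x-mu) ~ chi-squared with p d.f.
rng = np.random.default_rng(0)
mu = np.zeros(3)
V = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(mu, V, size=100_000)
D2 = np.einsum('ij,jk,ik->i', X - mu, np.linalg.inv(V), X - mu)
print(np.mean(D2 <= chi2.ppf(0.95, df=3)))   # close to 0.95
```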

Figure 9.2: Point X is closer to the center of distribution A with the Euclidean distance, but with the Mahalanobis distance it is closer to that of B.

9.6.1 Conditional distributions

We split a random vector into two parts, $x = (x_1, x_2)$, where $x_1$ is a vector of dimension $p_1$ and $x_2$ of dimension $p_2$, with $p_1 + p_2 = p$. We also split the covariance matrix of the vector $x$ into blocks linked to these two vectors:

$$V = \begin{bmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{bmatrix}, \qquad (9.27)$$

where, for example, $V_{11}$, the covariance matrix of the vector $x_1$, is square of order $p_1$; $V_{12}$, the covariance matrix between the vectors $x_1$ and $x_2$, has dimensions $p_1 \times p_2$; and $V_{22}$, the covariance matrix of the vector $x_2$, is square of order $p_2$. Suppose we wish to calculate the conditional distribution of the vector $x_1$ given the values of the vector $x_2$. It can be proved that this distribution is normal, with mean

$$E[x_1 \mid x_2] = \mu_1 + V_{12}V_{22}^{-1}(x_2 - \mu_2) \qquad (9.28)$$

and covariance matrix

$$Var[x_1 \mid x_2] = V_{11} - V_{12}V_{22}^{-1}V_{21}. \qquad (9.29)$$

In order to interpret these expressions, we first consider the bivariate case, where both variables are scalar with zero mean. The conditional mean is then

$$E[x_1 \mid x_2] = \sigma_{12}\sigma_{22}^{-1}x_2,$$

which is the usual expression for the regression line with slope $\beta = \sigma_{12}/\sigma_{22}$. The conditional variance around the regression line is

$$var[x_1 \mid x_2] = \sigma_{11} - \sigma_{12}^2/\sigma_{22} = \sigma_1^2(1 - \rho^2),$$

where $\rho = \sigma_{12}/(\sigma_{11}^{1/2}\sigma_{22}^{1/2})$ is the correlation coefficient between the two variables. This expression indicates that the variability of the conditional distribution is always less than that of the marginal one, and the reduction in variability increases with $\rho^2$.

We now suppose that $x_1$ is scalar but $x_2$ is a vector. The expression for the conditional mean provides the equation of the multiple regression

$$E[x_1 \mid x_2] = \mu_1 + \beta'(x_2 - \mu_2),$$

where $\beta = V_{22}^{-1}V_{21}$, with $V_{21}$ being the vector of covariances between $x_1$ and the components of $x_2$.

The variance of this conditional distribution is

$$var[x_1 \mid x_2] = \sigma_1^2(1 - R^2),$$

where $R^2 = V_{12}V_{22}^{-1}V_{21}/\sigma_1^2$ is the multiple correlation coefficient. In the general case these expressions correspond to the set of multiple regressions of the components of $x_1$ on the variables $x_2$, which is known as multivariate regression.

Proof. The conditional distribution is

$$f(x_1 \mid x_2) = \frac{f(x_1, x_2)}{f(x_2)}.$$

Since the distributions $f(x_1, x_2)$ and $f(x_2)$ are multivariate normal, when the quotient is calculated we will have the quotient of the determinants and the difference of the exponents. We begin by calculating the exponents, which are

$$(x - \mu)'V^{-1}(x - \mu) - (x_2 - \mu_2)'V_{22}^{-1}(x_2 - \mu_2). \qquad (9.30)$$

We are going to decompose the first quadratic form into the terms corresponding to $x_1$ and $x_2$. To do this we partition $(x - \mu)$ as $(x_1 - \mu_1, x_2 - \mu_2)$, we partition $V$ as in (9.27), and we use the expression for the inverse of a partitioned matrix (see Section 2.2.3). Writing $B = V_{11} - V_{12}V_{22}^{-1}V_{21}$, which is the matrix appearing in (9.29), and calculating the product, we get

$$(x - \mu)'V^{-1}(x - \mu) = (x_1 - \mu_1)'B^{-1}(x_1 - \mu_1) - (x_1 - \mu_1)'B^{-1}V_{12}V_{22}^{-1}(x_2 - \mu_2) - (x_2 - \mu_2)'V_{22}^{-1}V_{21}B^{-1}(x_1 - \mu_1) + (x_2 - \mu_2)'V_{22}^{-1}(x_2 - \mu_2) + (x_2 - \mu_2)'V_{22}^{-1}V_{21}B^{-1}V_{12}V_{22}^{-1}(x_2 - \mu_2).$$

The fourth term of this expansion cancels in the difference (9.30), and the remaining four can be grouped as

$$\left(x_1 - \mu_1 - V_{12}V_{22}^{-1}(x_2 - \mu_2)\right)'B^{-1}\left(x_1 - \mu_1 - V_{12}V_{22}^{-1}(x_2 - \mu_2)\right).$$

This shows that the exponent of the distribution corresponds to a normal variable with mean vector and covariance matrix equal to those given in (9.28) and (9.29). We now prove that the quotient of determinants leads to the same covariance matrix. According to Section 2.3.5,

$$|V| = |V_{22}|\,|V_{11} - V_{12}V_{22}^{-1}V_{21}| = |V_{22}|\,|B|.$$

Since $|V_{22}|$ appears in the denominator, the quotient leaves the single term $|B|$. Finally, the normalizing constant gives $(2\pi)^{-p/2 + p_2/2} = (2\pi)^{-p_1/2}$. In conclusion, the resulting expression is the density function of a multivariate normal distribution of order $p_1$, with mean vector given by (9.28) and covariance matrix given by (9.29).

Example: The spending of a group of consumers on two products $(x, y)$ follows a bivariate normal distribution with means 2 and 3 euros respectively and covariance matrix

$$\begin{bmatrix} 1 & 0.8 \\ 0.8 & 2 \end{bmatrix}.$$

Calculate the conditional distribution of spending on product $y$ for consumers who spend 4 euros on product $x$.

The conditional distribution is $f(y \mid x = 4) = f(4, y)/f_x(4)$. The marginal distribution of $x$ is normal, $N(2, 1)$. The terms of the joint distribution $f(x, y)$ are

$$|V|^{1/2} = \left(\sigma_1^2\sigma_2^2(1 - \varrho^2)\right)^{1/2} = \sigma_1\sigma_2\sqrt{1 - \varrho^2},$$

$$V^{-1} = \frac{1}{\sigma_1^2\sigma_2^2(1 - \varrho^2)}\begin{bmatrix} \sigma_2^2 & -\varrho\sigma_1\sigma_2 \\ -\varrho\sigma_1\sigma_2 & \sigma_1^2 \end{bmatrix},$$

where in this example $\sigma_1^2 = 1$, $\sigma_2^2 = 2$ and $\varrho = 0.8/\sqrt{2} = 0.566$. The exponent of the bivariate normal $f(x, y)$ is

$$A = -\frac{1}{2(1 - \varrho^2)}\left\{\left(\frac{x - \mu_1}{\sigma_1}\right)^2 + \left(\frac{y - \mu_2}{\sigma_2}\right)^2 - 2\varrho\frac{(x - \mu_1)(y - \mu_2)}{\sigma_1\sigma_2}\right\}.$$

As a result we have

$$f(y \mid x) = \frac{\left(\sigma_1\sigma_2\sqrt{1 - \varrho^2}\right)^{-1}(2\pi)^{-1}\exp\{A\}}{\sigma_1^{-1}(2\pi)^{-1/2}\exp\left\{-\frac{1}{2}\left(\frac{x - \mu_1}{\sigma_1}\right)^2\right\}} = \frac{1}{\sigma_2\sqrt{1 - \varrho^2}\sqrt{2\pi}}\exp\left\{-\frac{1}{2}B\right\},$$

where the resulting term in the exponent, which we denote by $B$, is

$$B = \frac{1}{1 - \varrho^2}\left[\left(\frac{x - \mu_1}{\sigma_1}\right)^2 + \left(\frac{y - \mu_2}{\sigma_2}\right)^2 - 2\varrho\frac{(x - \mu_1)(y - \mu_2)}{\sigma_1\sigma_2} - \left(\frac{x - \mu_1}{\sigma_1}\right)^2(1 - \varrho^2)\right]$$

$$= \frac{1}{1 - \varrho^2}\left[\frac{y - \mu_2}{\sigma_2} - \varrho\frac{x - \mu_1}{\sigma_1}\right]^2 = \frac{1}{\sigma_2^2(1 - \varrho^2)}\left[y - \left(\mu_2 + \varrho\frac{\sigma_2}{\sigma_1}(x - \mu_1)\right)\right]^2.$$

This exponent corresponds to a normal distribution with mean

$$E[y \mid x] = \mu_2 + \varrho\frac{\sigma_2}{\sigma_1}(x - \mu_1),$$

which is the regression line, and standard deviation

$$SD[y \mid x] = \sigma_2\sqrt{1 - \varrho^2}.$$

For $x = 4$:

$$E[y \mid 4] = 3 + 0.8\,(4 - 2) = 4.6.$$

Since there is a positive correlation of 0.566 between the spending on both products, the consumers who spend more on one also spend, on average, more on the other. The variability of the conditional distribution is

$$Var[y \mid 4] = \sigma_2^2(1 - \varrho^2) = 2(1 - 0.32) = 1.36,$$

which is less than the variance of the marginal distribution, because conditioning gives us more information.
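Equations (9.28) and (9.29) give the same result directly. The sketch below applies them to the spending example.

```python
import numpy as np

# Conditional distribution of y given x for the spending example:
# mu = (2, 3), V = [[1, 0.8], [0.8, 2]], conditioning on x = 4.
mu = np.array([2.0, 3.0])
V = np.array([[1.0, 0.8],
              [0.8, 2.0]])
x_obs = 4.0

cond_mean = mu[1] + V[1, 0] / V[0, 0] * (x_obs - mu[0])   # equation (9.28)
cond_var = V[1, 1] - V[1, 0] * V[0, 1] / V[0, 0]          # equation (9.29)

print(cond_mean)   # 4.6
print(cond_var)    # 1.36
```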

Figure 9.3: Density of the standard bivariate normal.

9.7 ELLIPTICAL DISTRIBUTIONS

The multivariate normal distribution is a particular case of a family of distributions frequently used in multivariate analysis: the elliptical distributions. As an introduction, we first look at the simplest case, spherical distributions.

9.7.1 Spherical distributions

We say that a vector variable $x = (x_1, \ldots, x_p)$ follows a spherical distribution if its density function depends on the variable only through the Euclidean distance $x'x = \sum_{i=1}^p x_i^2$. This property implies that:

1. The equiprobability contours of the distribution are spheres centered at the origin.

2. The distribution is invariant under rotations. Thus, if we define a new variable $y = Cx$, where $C$ is an orthogonal matrix, the density of the variable $y$ is the same as that of $x$.

One example of a spherical distribution, studied in the previous section, is the standard multivariate normal, whose density is

$$f(x) = \frac{1}{(2\pi)^{p/2}}\exp\left(-\tfrac{1}{2}x'x\right) = \prod_{i=1}^p\frac{1}{(2\pi)^{1/2}}\exp\left(-\tfrac{1}{2}x_i^2\right).$$

This density is shown for the bivariate case in Figure 9.3. Here the two scalar variables which form the vector are independent. This property is characteristic of the normal distribution since, in general, the components of spherical distributions are dependent.

Figure 9.4: Density of the bivariate double exponential.

For example, the multivariate Cauchy distribution, given by

$$f(x) = \frac{\Gamma\left(\frac{p+1}{2}\right)}{\pi^{(p+1)/2}(1 + x'x)^{(p+1)/2}}, \qquad (9.31)$$

has heavier tails than the normal distribution, as in the univariate case. It is easy to prove that this function cannot be written as a product of univariate Cauchy densities: its components are uncorrelated, but they are not independent. Another important spherical distribution is the double exponential. In the bivariate case this distribution has density function

$$f(x) = \frac{1}{2\pi}\exp\left(-\sqrt{x'x}\right)$$

and, although the density function may appear similar to the normal, its tails are much heavier. This distribution is shown in Figure 9.4.

9.7.2 Elliptical distributions

If the variable $x$ has a spherical distribution, $A$ is a square matrix of dimension $p$ and $m$ is a vector of dimension $p$, then the variable

$$y = m + Ax \qquad (9.32)$$

has an elliptical distribution. Since a spherical variable has mean zero and covariance matrix $cI$, it follows that an elliptical variable has mean $m$ and covariance matrix $V = cAA'$. Elliptical distributions have the following properties:

1. Their density function depends on the variable only through the Mahalanobis distance $(y - m)'V^{-1}(y - m)$.

2. The equiprobability contours are ellipsoids centered at the point $m$.

The general multivariate normal distribution is the best known of the elliptical distributions. Another member of this family is the multivariate t distribution. Although there are different versions of this distribution, the most common is constructed by dividing each component of a vector of multivariate normal variables $N_p(m, V)$ by the same scalar variable: the square root of a chi-squared variable divided by its degrees of freedom.
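The construction (9.32) is easy to simulate. The sketch below generates an elliptical (in this case normal) sample from a spherical one; the values of $m$ and $A$ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = np.array([1.0, 2.0])
A = np.array([[1.0, 0.0],
              [0.8, 0.6]])            # any square matrix; the covariance is V = A A'

x = rng.normal(size=(100_000, 2))     # spherical draws (standard bivariate normal)
y = m + x @ A.T                       # elliptical draws, y = m + A x

print(y.mean(axis=0).round(2))            # close to m
print(np.cov(y, rowvar=False).round(2))   # close to A @ A.T
```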


More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

Chp 4. Expectation and Variance

Chp 4. Expectation and Variance Chp 4. Expectation and Variance 1 Expectation In this chapter, we will introduce two objectives to directly reflect the properties of a random variable or vector, which are the Expectation and Variance.

More information

Probability and Distributions

Probability and Distributions Probability and Distributions What is a statistical model? A statistical model is a set of assumptions by which the hypothetical population distribution of data is inferred. It is typically postulated

More information

Joint Distributions. (a) Scalar multiplication: k = c d. (b) Product of two matrices: c d. (c) The transpose of a matrix:

Joint Distributions. (a) Scalar multiplication: k = c d. (b) Product of two matrices: c d. (c) The transpose of a matrix: Joint Distributions Joint Distributions A bivariate normal distribution generalizes the concept of normal distribution to bivariate random variables It requires a matrix formulation of quadratic forms,

More information

PCMI Introduction to Random Matrix Theory Handout # REVIEW OF PROBABILITY THEORY. Chapter 1 - Events and Their Probabilities

PCMI Introduction to Random Matrix Theory Handout # REVIEW OF PROBABILITY THEORY. Chapter 1 - Events and Their Probabilities PCMI 207 - Introduction to Random Matrix Theory Handout #2 06.27.207 REVIEW OF PROBABILITY THEORY Chapter - Events and Their Probabilities.. Events as Sets Definition (σ-field). A collection F of subsets

More information

3 Continuous Random Variables

3 Continuous Random Variables Jinguo Lian Math437 Notes January 15, 016 3 Continuous Random Variables Remember that discrete random variables can take only a countable number of possible values. On the other hand, a continuous random

More information

Lecture 2: Review of Probability

Lecture 2: Review of Probability Lecture 2: Review of Probability Zheng Tian Contents 1 Random Variables and Probability Distributions 2 1.1 Defining probabilities and random variables..................... 2 1.2 Probability distributions................................

More information

STA2603/205/1/2014 /2014. ry II. Tutorial letter 205/1/

STA2603/205/1/2014 /2014. ry II. Tutorial letter 205/1/ STA263/25//24 Tutorial letter 25// /24 Distribution Theor ry II STA263 Semester Department of Statistics CONTENTS: Examination preparation tutorial letterr Solutions to Assignment 6 2 Dear Student, This

More information

Stat 5101 Notes: Algorithms (thru 2nd midterm)

Stat 5101 Notes: Algorithms (thru 2nd midterm) Stat 5101 Notes: Algorithms (thru 2nd midterm) Charles J. Geyer October 18, 2012 Contents 1 Calculating an Expectation or a Probability 2 1.1 From a PMF........................... 2 1.2 From a PDF...........................

More information

Probability Theory and Statistics. Peter Jochumzen

Probability Theory and Statistics. Peter Jochumzen Probability Theory and Statistics Peter Jochumzen April 18, 2016 Contents 1 Probability Theory And Statistics 3 1.1 Experiment, Outcome and Event................................ 3 1.2 Probability............................................

More information

TAMS39 Lecture 2 Multivariate normal distribution

TAMS39 Lecture 2 Multivariate normal distribution TAMS39 Lecture 2 Multivariate normal distribution Martin Singull Department of Mathematics Mathematical Statistics Linköping University, Sweden Content Lecture Random vectors Multivariate normal distribution

More information

Covariance. Lecture 20: Covariance / Correlation & General Bivariate Normal. Covariance, cont. Properties of Covariance

Covariance. Lecture 20: Covariance / Correlation & General Bivariate Normal. Covariance, cont. Properties of Covariance Covariance Lecture 0: Covariance / Correlation & General Bivariate Normal Sta30 / Mth 30 We have previously discussed Covariance in relation to the variance of the sum of two random variables Review Lecture

More information

Algorithms for Uncertainty Quantification

Algorithms for Uncertainty Quantification Algorithms for Uncertainty Quantification Tobias Neckel, Ionuț-Gabriel Farcaș Lehrstuhl Informatik V Summer Semester 2017 Lecture 2: Repetition of probability theory and statistics Example: coin flip Example

More information

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015

Part IA Probability. Theorems. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015 Part IA Probability Theorems Based on lectures by R. Weber Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures.

More information

Notes on Random Vectors and Multivariate Normal

Notes on Random Vectors and Multivariate Normal MATH 590 Spring 06 Notes on Random Vectors and Multivariate Normal Properties of Random Vectors If X,, X n are random variables, then X = X,, X n ) is a random vector, with the cumulative distribution

More information

The Multivariate Normal Distribution. In this case according to our theorem

The Multivariate Normal Distribution. In this case according to our theorem The Multivariate Normal Distribution Defn: Z R 1 N(0, 1) iff f Z (z) = 1 2π e z2 /2. Defn: Z R p MV N p (0, I) if and only if Z = (Z 1,..., Z p ) T with the Z i independent and each Z i N(0, 1). In this

More information

3d scatterplots. You can also make 3d scatterplots, although these are less common than scatterplot matrices.

3d scatterplots. You can also make 3d scatterplots, although these are less common than scatterplot matrices. 3d scatterplots You can also make 3d scatterplots, although these are less common than scatterplot matrices. > library(scatterplot3d) > y par(mfrow=c(2,2)) > scatterplot3d(y,highlight.3d=t,angle=20)

More information

Random Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R

Random Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R In probabilistic models, a random variable is a variable whose possible values are numerical outcomes of a random phenomenon. As a function or a map, it maps from an element (or an outcome) of a sample

More information

Multivariate Random Variable

Multivariate Random Variable Multivariate Random Variable Author: Author: Andrés Hincapié and Linyi Cao This Version: August 7, 2016 Multivariate Random Variable 3 Now we consider models with more than one r.v. These are called multivariate

More information

01 Probability Theory and Statistics Review

01 Probability Theory and Statistics Review NAVARCH/EECS 568, ROB 530 - Winter 2018 01 Probability Theory and Statistics Review Maani Ghaffari January 08, 2018 Last Time: Bayes Filters Given: Stream of observations z 1:t and action data u 1:t Sensor/measurement

More information

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2 Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate

More information

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline.

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline. Practitioner Course: Portfolio Optimization September 10, 2008 Before we define dependence, it is useful to define Random variables X and Y are independent iff For all x, y. In particular, F (X,Y ) (x,

More information

BASICS OF PROBABILITY

BASICS OF PROBABILITY October 10, 2018 BASICS OF PROBABILITY Randomness, sample space and probability Probability is concerned with random experiments. That is, an experiment, the outcome of which cannot be predicted with certainty,

More information

Lectures on Elementary Probability. William G. Faris

Lectures on Elementary Probability. William G. Faris Lectures on Elementary Probability William G. Faris February 22, 2002 2 Contents 1 Combinatorics 5 1.1 Factorials and binomial coefficients................. 5 1.2 Sampling with replacement.....................

More information

Lecture 25: Review. Statistics 104. April 23, Colin Rundel

Lecture 25: Review. Statistics 104. April 23, Colin Rundel Lecture 25: Review Statistics 104 Colin Rundel April 23, 2012 Joint CDF F (x, y) = P [X x, Y y] = P [(X, Y ) lies south-west of the point (x, y)] Y (x,y) X Statistics 104 (Colin Rundel) Lecture 25 April

More information

Multiple Random Variables

Multiple Random Variables Multiple Random Variables This Version: July 30, 2015 Multiple Random Variables 2 Now we consider models with more than one r.v. These are called multivariate models For instance: height and weight An

More information

III - MULTIVARIATE RANDOM VARIABLES

III - MULTIVARIATE RANDOM VARIABLES Computational Methods and advanced Statistics Tools III - MULTIVARIATE RANDOM VARIABLES A random vector, or multivariate random variable, is a vector of n scalar random variables. The random vector is

More information

THE QUEEN S UNIVERSITY OF BELFAST

THE QUEEN S UNIVERSITY OF BELFAST THE QUEEN S UNIVERSITY OF BELFAST 0SOR20 Level 2 Examination Statistics and Operational Research 20 Probability and Distribution Theory Wednesday 4 August 2002 2.30 pm 5.30 pm Examiners { Professor R M

More information

ACM 116: Lectures 3 4

ACM 116: Lectures 3 4 1 ACM 116: Lectures 3 4 Joint distributions The multivariate normal distribution Conditional distributions Independent random variables Conditional distributions and Monte Carlo: Rejection sampling Variance

More information

Preliminary Statistics Lecture 3: Probability Models and Distributions (Outline) prelimsoas.webs.com

Preliminary Statistics Lecture 3: Probability Models and Distributions (Outline) prelimsoas.webs.com 1 School of Oriental and African Studies September 2015 Department of Economics Preliminary Statistics Lecture 3: Probability Models and Distributions (Outline) prelimsoas.webs.com Gujarati D. Basic Econometrics,

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Introduction to Normal Distribution

Introduction to Normal Distribution Introduction to Normal Distribution Nathaniel E. Helwig Assistant Professor of Psychology and Statistics University of Minnesota (Twin Cities) Updated 17-Jan-2017 Nathaniel E. Helwig (U of Minnesota) Introduction

More information

Statistics for scientists and engineers

Statistics for scientists and engineers Statistics for scientists and engineers February 0, 006 Contents Introduction. Motivation - why study statistics?................................... Examples..................................................3

More information

Inverse of a Square Matrix. For an N N square matrix A, the inverse of A, 1

Inverse of a Square Matrix. For an N N square matrix A, the inverse of A, 1 Inverse of a Square Matrix For an N N square matrix A, the inverse of A, 1 A, exists if and only if A is of full rank, i.e., if and only if no column of A is a linear combination 1 of the others. A is

More information

This does not cover everything on the final. Look at the posted practice problems for other topics.

This does not cover everything on the final. Look at the posted practice problems for other topics. Class 7: Review Problems for Final Exam 8.5 Spring 7 This does not cover everything on the final. Look at the posted practice problems for other topics. To save time in class: set up, but do not carry

More information

Topic 2: Probability & Distributions. Road Map Probability & Distributions. ECO220Y5Y: Quantitative Methods in Economics. Dr.

Topic 2: Probability & Distributions. Road Map Probability & Distributions. ECO220Y5Y: Quantitative Methods in Economics. Dr. Topic 2: Probability & Distributions ECO220Y5Y: Quantitative Methods in Economics Dr. Nick Zammit University of Toronto Department of Economics Room KN3272 n.zammit utoronto.ca November 21, 2017 Dr. Nick

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016

Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016 8. For any two events E and F, P (E) = P (E F ) + P (E F c ). Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016 Sample space. A sample space consists of a underlying

More information

Statistics, Data Analysis, and Simulation SS 2015

Statistics, Data Analysis, and Simulation SS 2015 Statistics, Data Analysis, and Simulation SS 2015 08.128.730 Statistik, Datenanalyse und Simulation Dr. Michael O. Distler Mainz, 27. April 2015 Dr. Michael O. Distler

More information

Joint Probability Distributions and Random Samples (Devore Chapter Five)

Joint Probability Distributions and Random Samples (Devore Chapter Five) Joint Probability Distributions and Random Samples (Devore Chapter Five) 1016-345-01: Probability and Statistics for Engineers Spring 2013 Contents 1 Joint Probability Distributions 2 1.1 Two Discrete

More information

Discrete Distributions

Discrete Distributions Discrete Distributions STA 281 Fall 2011 1 Introduction Previously we defined a random variable to be an experiment with numerical outcomes. Often different random variables are related in that they have

More information

Notes for Math 324, Part 19

Notes for Math 324, Part 19 48 Notes for Math 324, Part 9 Chapter 9 Multivariate distributions, covariance Often, we need to consider several random variables at the same time. We have a sample space S and r.v. s X, Y,..., which

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

ECON Fundamentals of Probability

ECON Fundamentals of Probability ECON 351 - Fundamentals of Probability Maggie Jones 1 / 32 Random Variables A random variable is one that takes on numerical values, i.e. numerical summary of a random outcome e.g., prices, total GDP,

More information

CHAPTER 6 SOME CONTINUOUS PROBABILITY DISTRIBUTIONS. 6.2 Normal Distribution. 6.1 Continuous Uniform Distribution

CHAPTER 6 SOME CONTINUOUS PROBABILITY DISTRIBUTIONS. 6.2 Normal Distribution. 6.1 Continuous Uniform Distribution CHAPTER 6 SOME CONTINUOUS PROBABILITY DISTRIBUTIONS Recall that a continuous random variable X is a random variable that takes all values in an interval or a set of intervals. The distribution of a continuous

More information

Elements of Probability Theory

Elements of Probability Theory Short Guides to Microeconometrics Fall 2016 Kurt Schmidheiny Unversität Basel Elements of Probability Theory Contents 1 Random Variables and Distributions 2 1.1 Univariate Random Variables and Distributions......

More information

Introduction to bivariate analysis

Introduction to bivariate analysis Introduction to bivariate analysis When one measurement is made on each observation, univariate analysis is applied. If more than one measurement is made on each observation, multivariate analysis is applied.

More information

SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416)

SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416) SUMMARY OF PROBABILITY CONCEPTS SO FAR (SUPPLEMENT FOR MA416) D. ARAPURA This is a summary of the essential material covered so far. The final will be cumulative. I ve also included some review problems

More information

Stat 206: Sampling theory, sample moments, mahalanobis

Stat 206: Sampling theory, sample moments, mahalanobis Stat 206: Sampling theory, sample moments, mahalanobis topology James Johndrow (adapted from Iain Johnstone s notes) 2016-11-02 Notation My notation is different from the book s. This is partly because

More information

Random Vectors 1. STA442/2101 Fall See last slide for copyright information. 1 / 30

Random Vectors 1. STA442/2101 Fall See last slide for copyright information. 1 / 30 Random Vectors 1 STA442/2101 Fall 2017 1 See last slide for copyright information. 1 / 30 Background Reading: Renscher and Schaalje s Linear models in statistics Chapter 3 on Random Vectors and Matrices

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

Introduction to bivariate analysis

Introduction to bivariate analysis Introduction to bivariate analysis When one measurement is made on each observation, univariate analysis is applied. If more than one measurement is made on each observation, multivariate analysis is applied.

More information

4. CONTINUOUS RANDOM VARIABLES

4. CONTINUOUS RANDOM VARIABLES IA Probability Lent Term 4 CONTINUOUS RANDOM VARIABLES 4 Introduction Up to now we have restricted consideration to sample spaces Ω which are finite, or countable; we will now relax that assumption We

More information

Part IA Probability. Definitions. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015

Part IA Probability. Definitions. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015 Part IA Probability Definitions Based on lectures by R. Weber Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures.

More information

Lecture Notes Part 2: Matrix Algebra

Lecture Notes Part 2: Matrix Algebra 17.874 Lecture Notes Part 2: Matrix Algebra 2. Matrix Algebra 2.1. Introduction: Design Matrices and Data Matrices Matrices are arrays of numbers. We encounter them in statistics in at least three di erent

More information

5.1 Consistency of least squares estimates. We begin with a few consistency results that stand on their own and do not depend on normality.

5.1 Consistency of least squares estimates. We begin with a few consistency results that stand on their own and do not depend on normality. 88 Chapter 5 Distribution Theory In this chapter, we summarize the distributions related to the normal distribution that occur in linear models. Before turning to this general problem that assumes normal

More information

Basics on Probability. Jingrui He 09/11/2007

Basics on Probability. Jingrui He 09/11/2007 Basics on Probability Jingrui He 09/11/2007 Coin Flips You flip a coin Head with probability 0.5 You flip 100 coins How many heads would you expect Coin Flips cont. You flip a coin Head with probability

More information

ME 597: AUTONOMOUS MOBILE ROBOTICS SECTION 2 PROBABILITY. Prof. Steven Waslander

ME 597: AUTONOMOUS MOBILE ROBOTICS SECTION 2 PROBABILITY. Prof. Steven Waslander ME 597: AUTONOMOUS MOBILE ROBOTICS SECTION 2 Prof. Steven Waslander p(a): Probability that A is true 0 pa ( ) 1 p( True) 1, p( False) 0 p( A B) p( A) p( B) p( A B) A A B B 2 Discrete Random Variable X

More information

STAT 4385 Topic 01: Introduction & Review

STAT 4385 Topic 01: Introduction & Review STAT 4385 Topic 01: Introduction & Review Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Spring, 2016 Outline Welcome What is Regression Analysis? Basics

More information

2. Matrix Algebra and Random Vectors

2. Matrix Algebra and Random Vectors 2. Matrix Algebra and Random Vectors 2.1 Introduction Multivariate data can be conveniently display as array of numbers. In general, a rectangular array of numbers with, for instance, n rows and p columns

More information

Statistical Methods in Particle Physics

Statistical Methods in Particle Physics Statistical Methods in Particle Physics Lecture 3 October 29, 2012 Silvia Masciocchi, GSI Darmstadt s.masciocchi@gsi.de Winter Semester 2012 / 13 Outline Reminder: Probability density function Cumulative

More information

Formulas for probability theory and linear models SF2941

Formulas for probability theory and linear models SF2941 Formulas for probability theory and linear models SF2941 These pages + Appendix 2 of Gut) are permitted as assistance at the exam. 11 maj 2008 Selected formulae of probability Bivariate probability Transforms

More information

Today. Probability and Statistics. Linear Algebra. Calculus. Naïve Bayes Classification. Matrix Multiplication Matrix Inversion

Today. Probability and Statistics. Linear Algebra. Calculus. Naïve Bayes Classification. Matrix Multiplication Matrix Inversion Today Probability and Statistics Naïve Bayes Classification Linear Algebra Matrix Multiplication Matrix Inversion Calculus Vector Calculus Optimization Lagrange Multipliers 1 Classical Artificial Intelligence

More information

Interpreting Regression Results

Interpreting Regression Results Interpreting Regression Results Carlo Favero Favero () Interpreting Regression Results 1 / 42 Interpreting Regression Results Interpreting regression results is not a simple exercise. We propose to split

More information

1: PROBABILITY REVIEW

1: PROBABILITY REVIEW 1: PROBABILITY REVIEW Marek Rutkowski School of Mathematics and Statistics University of Sydney Semester 2, 2016 M. Rutkowski (USydney) Slides 1: Probability Review 1 / 56 Outline We will review the following

More information

Review of Probability Theory

Review of Probability Theory Review of Probability Theory Arian Maleki and Tom Do Stanford University Probability theory is the study of uncertainty Through this class, we will be relying on concepts from probability theory for deriving

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Regression. Oscar García

Regression. Oscar García Regression Oscar García Regression methods are fundamental in Forest Mensuration For a more concise and general presentation, we shall first review some matrix concepts 1 Matrices An order n m matrix is

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables

More information

p(z)

p(z) Chapter Statistics. Introduction This lecture is a quick review of basic statistical concepts; probabilities, mean, variance, covariance, correlation, linear regression, probability density functions and

More information