Multivariate Statistics


1 Multivariate Statistics Chapter 2: Multivariate distributions and inference Pedro Galeano Departamento de Estadística Universidad Carlos III de Madrid Course 2016/2017 Master in Mathematical Engineering Pedro Galeano (Course 2016/2017) Multivariate Statistics - Chapter 3 Master in Mathematical Engineering 1 / 92

2 Chapter outline 1 Introduction. 2 Basic concepts. 3 Multivariate distributions. 4 Statistical inference. 5 Hypothesis testing.

3 Introduction Multivariate statistical analysis is concerned with analysing and understanding data in high dimensions. Therefore, we assume that we are given a set of n observations of a multivariate random variable x in $\mathbb{R}^p$. Thus, each observation has p dimensions and is an observed value of the multivariate random variable x, which is composed of p random variables: $x = (x_1, \ldots, x_p)'$, where each $x_j$, for $j = 1, \ldots, p$, is a univariate random variable. In this chapter we give an introduction to the basic probability tools useful in multivariate statistical analysis.

4 Introduction In particular, we present: the basic probability tools used to describe a multivariate random variable, including marginal and conditional distributions and the concept of independence; the mean vector, the covariance matrix and the correlation matrix of a multivariate random variable and their counterparts for marginal and conditional distributions; the basic techniques needed to derive the distribution of transformations with special emphasis on linear transformations; several multivariate distributions, including the multivariate Gaussian distribution, along with most of its companion distributions and other interesting alternatives; and statistical inference for multivariate samples, including parameter estimation and hypothesis testing.

5 Basic concepts We can say that we have the joint distribution of a multivariate random variable when the following are specified: 1. The sample space of the possible values, which, in general, is a subset of $\mathbb{R}^p$. 2. The probabilities of each possible result of the sample space. We say that a p-dimensional random variable is discrete when each of the p scalar variables that comprise it is discrete. Analogously, we say that the variable is continuous if all its components are continuous.

6 Basic concepts Let $x = (x_1, \ldots, x_p)'$ be a multivariate random variable. The cumulative distribution function (cdf) of x at a point $x^0 = (x_1^0, \ldots, x_p^0)'$ is denoted by $F_x(x^0)$ and is given by: $F_x(x^0) = \Pr(x \le x^0) = \Pr(x_1 \le x_1^0, \ldots, x_p \le x_p^0)$

7 Basic concepts For continuous multivariate random variables, a nonnegative probability density function (pdf) $f_x$ exists, such that: $F_x(x^0) = \int_{-\infty}^{x_1^0} \cdots \int_{-\infty}^{x_p^0} f_x(x_1, \ldots, x_p)\, dx_1 \cdots dx_p$. Note that: $\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_x(x_1, \ldots, x_p)\, dx_1 \cdots dx_p = 1$. Note also that the cdf $F_x$ is differentiable with: $f_x(x) = \frac{\partial^p F_x(x)}{\partial x_1 \cdots \partial x_p}$

8 Basic concepts For discrete multivariate random variables, the values of the random variable are concentrated on a countable or finite set of points $\{c_j\}_{j \in J}$. The probability of events of the form $x \in D$, for a certain set $D \subseteq \{c_j\}_{j \in J}$, can be computed as: $\Pr(x \in D) = \sum_{j : c_j \in D} \Pr(x = c_j)$. For simplicity we will focus on continuous multivariate random variables.

9 Basic concepts The marginal density function of a subset of the elements of x, say $(x_{i_1}, \ldots, x_{i_j})'$, is obtained by integrating out the remaining components: $f_{x_{i_1}, \ldots, x_{i_j}}(x_{i_1}, \ldots, x_{i_j}) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_x(x_1, \ldots, x_p) \prod_{k \notin \{i_1, \ldots, i_j\}} dx_k$. In particular, the marginal density function of each $x_j$, for $j = 1, \ldots, p$, is given by: $f_{x_j}(x_j) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f_x(x_1, \ldots, x_p) \prod_{k \ne j} dx_k$

10 Basic concepts Let $x = (x_1, \ldots, x_p)'$ and $y = (y_1, \ldots, y_q)'$ be two multivariate random variables with density functions $f_x$ and $f_y$, respectively, and joint density function $f_{x,y}$. Then, the conditional density function of y given x is given by: $f_{y|x}(y|x) = \frac{f_{x,y}(x, y)}{f_x(x)}$

11 Basic concepts From the previous definition, we can deduce that the pdf of (x, y) is given by: $f_{x,y}(x, y) = f_{y|x}(y|x)\, f_x(x) = f_{x|y}(x|y)\, f_y(y)$. As a consequence: $f_{y|x}(y|x) = \frac{f_{x|y}(x|y)\, f_y(y)}{f_x(x)} = \frac{f_{x|y}(x|y)\, f_y(y)}{\int f_{x,y}(x, y)\, dy} = \frac{f_{x|y}(x|y)\, f_y(y)}{\int f_{x|y}(x|y)\, f_y(y)\, dy}$. This is Bayes' Theorem, one of the most important results in Statistics, as it is the basis of Bayesian inference.
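As a quick numerical sanity check, Bayes' Theorem can be verified on a small discrete example (the joint pmf below is made up; in the discrete case the integral in the denominator becomes a sum):

```python
import numpy as np

# Hypothetical joint pmf of (x, y) on a 2x3 grid; rows index x, columns index y.
joint = np.array([[0.10, 0.20, 0.05],
                  [0.15, 0.30, 0.20]])

f_x = joint.sum(axis=1)             # marginal of x
f_y = joint.sum(axis=0)             # marginal of y
f_y_given_x = joint / f_x[:, None]  # f_{y|x} = f_{x,y} / f_x
f_x_given_y = joint / f_y[None, :]  # f_{x|y} = f_{x,y} / f_y

# Bayes' Theorem: f_{y|x} = f_{x|y} f_y / sum_y f_{x|y} f_y
numer = f_x_given_y * f_y[None, :]
bayes = numer / numer.sum(axis=1, keepdims=True)

assert np.allclose(bayes, f_y_given_x)
```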

12 Basic concepts The multivariate random variables x and y are independent if, and only if: $f_{x,y}(x, y) = f_x(x)\, f_y(y)$. Therefore, if x and y are independent, then: $f_{y|x}(y|x) = f_y(y)$ and $f_{x|y}(x|y) = f_x(x)$. Independence can be interpreted as follows: knowing $y = y^0$ does not change the probability assessments on x, and conversely. In general, the p univariate random variables $x_1, \ldots, x_p$ are independent if, and only if: $f_{x_1, \ldots, x_p}(x_1, \ldots, x_p) = f_{x_1}(x_1) \cdots f_{x_p}(x_p)$

13 Basic concepts It is important to note that different multivariate pdf's may have the same marginal pdf's. For instance, it is easy to see that the bivariate pdf's given by: $f_{x_1,x_2}(x_1, x_2) = 1$, for $0 < x_1, x_2 < 1$, and $f_{x_1,x_2}(x_1, x_2) = 1 + (2x_1 - 1)(2x_2 - 1)$, for $0 < x_1, x_2 < 1$, both have the marginal pdf's given by: $f_{x_1}(x_1) = 1$, for $0 < x_1 < 1$, and $f_{x_2}(x_2) = 1$, for $0 < x_2 < 1$, respectively.

14 Basic concepts An elegant concept for connecting marginals with joint cdf's is given by copulae. For simplicity of presentation we concentrate on the p = 2 dimensional case. A 2-dimensional copula is a function $C : [0, 1]^2 \to [0, 1]$ with the following properties: 1. For every $u \in [0, 1]$: $C(0, u) = C(u, 0) = 0$. 2. For every $u \in [0, 1]$: $C(1, u) = C(u, 1) = u$. 3. For every $(u_1, u_2), (v_1, v_2) \in [0, 1] \times [0, 1]$ with $u_1 \le v_1$ and $u_2 \le v_2$: $C(v_1, v_2) - C(v_1, u_2) - C(u_1, v_2) + C(u_1, u_2) \ge 0$

15 Basic concepts The usefulness of a copula function C is explained by Sklar's Theorem. Sklar's Theorem: Let $F_{x_1,x_2}$ be a bivariate cdf with marginal cdf's $F_{x_1}$ and $F_{x_2}$. Then, a copula $C_{x_1,x_2}$ exists with: $F_{x_1,x_2}(x_1, x_2) = C_{x_1,x_2}(F_{x_1}(x_1), F_{x_2}(x_2))$ for every $(x_1, x_2) \in \mathbb{R}^2$. If $F_{x_1}$ and $F_{x_2}$ are continuous, then $C_{x_1,x_2}$ is unique. On the other hand, if $C_{x_1,x_2}$ is a copula and $F_{x_1}$ and $F_{x_2}$ are cdf's, then the function $F_{x_1,x_2}$ defined above is a bivariate cdf with marginals $F_{x_1}$ and $F_{x_2}$. Therefore, a copula function links a multivariate distribution to its one-dimensional marginals.

16 Basic concepts Theorem: Let $x_1$ and $x_2$ be random variables with cdf's $F_{x_1}$ and $F_{x_2}$, and bivariate cdf $F_{x_1,x_2}$. Then, $x_1$ and $x_2$ are independent if and only if: $C_{x_1,x_2}(F_{x_1}, F_{x_2}) = F_{x_1} F_{x_2}$. The previous copula function is called the independence copula. Other copula functions will be given in this chapter.
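The three defining properties of a 2-dimensional copula, together with the independence copula $C(u_1, u_2) = u_1 u_2$ of the theorem above, can be checked numerically; this is a minimal sketch with made-up evaluation points:

```python
import numpy as np

# The independence copula C(u1, u2) = u1 * u2.
def C(u1, u2):
    return u1 * u2

u = np.linspace(0.0, 1.0, 11)

# Property 1: grounded at zero; Property 2: uniform margins.
assert np.allclose(C(0.0, u), 0.0) and np.allclose(C(u, 0.0), 0.0)
assert np.allclose(C(1.0, u), u) and np.allclose(C(u, 1.0), u)

# Property 3 (2-increasing): the C-volume of the rectangle [u1,v1] x [u2,v2]
# is nonnegative; for the independence copula it equals (v1-u1)*(v2-u2).
u1, v1, u2, v2 = 0.2, 0.6, 0.3, 0.9
volume = C(v1, v2) - C(v1, u2) - C(u1, v2) + C(u1, u2)
assert volume >= 0.0
```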

17 Basic concepts Let $x = (x_1, \ldots, x_p)'$ be a multivariate random variable. The expectation or mean vector of x is the vector $\mu_x$ whose components are the expectations or means of the components of the random variable, i.e.: $\mu_x = E[x] = (E[x_1], \ldots, E[x_p])'$ where $E[x_j] = \int x_j f_{x_j}(x_j)\, dx_j$ and $f_{x_j}(x_j)$ is the marginal density function of $x_j$.

18 Basic concepts The covariance matrix of the multivariate random variable x with mean vector $\mu_x$ is a symmetric and positive semidefinite matrix given by: $\Sigma_x = E[(x - \mu_x)(x - \mu_x)']$. The diagonal elements of $\Sigma_x$ are the variances of the components, given by: $\sigma^2_{x,j} = \int (x_j - \mu_{x,j})^2 f_{x_j}(x_j)\, dx_j$, for $j = 1, \ldots, p$. The elements outside the main diagonal are the covariances between pairs of variables: $\sigma_{x,jk} = \int\int (x_j - \mu_{x,j})(x_k - \mu_{x,k}) f_{x_j,x_k}(x_j, x_k)\, dx_j\, dx_k$, for $j, k = 1, \ldots, p$.

19 Basic concepts The correlation matrix of the multivariate random variable x with covariance matrix $\Sigma_x$ is given by: $\varrho_x = \Delta_x^{-1/2} \Sigma_x \Delta_x^{-1/2}$ where $\Delta_x$ is a diagonal matrix with the variances of the components of x. The elements outside the main diagonal are the correlations between pairs of variables, given by: $\rho_{x,jk} = \frac{\sigma_{x,jk}}{\sigma_{x,j}\, \sigma_{x,k}}$
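As a sketch of the relation $\varrho_x = \Delta_x^{-1/2} \Sigma_x \Delta_x^{-1/2}$, the following computes a correlation matrix from a made-up covariance matrix:

```python
import numpy as np

# Made-up 3x3 covariance matrix (symmetric, positive definite).
Sigma = np.array([[4.0,  2.0,  0.0],
                  [2.0,  9.0, -3.0],
                  [0.0, -3.0,  4.0]])

# Delta^{-1/2}: diagonal matrix with 1 / standard deviations.
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(Sigma)))
rho = D_inv_sqrt @ Sigma @ D_inv_sqrt

assert np.allclose(np.diag(rho), 1.0)                 # unit diagonal
assert np.isclose(rho[0, 1], 2.0 / (2.0 * 3.0))       # sigma_12 / (sigma_1 sigma_2)
```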

20 Basic concepts Let $x = (x_1, \ldots, x_p)'$ be a multivariate random variable and let $(x_{i_1}, \ldots, x_{i_j})'$ be a subset of the elements of x. Then, the mean vector and the covariance and correlation matrices of $(x_{i_1}, \ldots, x_{i_j})'$ are obtained by extracting the corresponding elements of the mean vector and the covariance and correlation matrices of x.

21 Basic concepts Let $x = (x_1, \ldots, x_p)'$ and $y = (y_1, \ldots, y_q)'$ be two random variables with density functions $f_x$ and $f_y$, respectively, and let $f_{y|x}$ be the conditional density function of y given x. The conditional expectation of y given x is given by: $E_{y|x}[y|x] = \int y\, f_{y|x}(y|x)\, dy$ which depends on x. An important property of $E_{y|x}[y|x]$ is that $E_y[y] = E_x[E_{y|x}[y|x]]$. Then, to compute $E_y[y]$, we can first compute $E_{y|x}[y|x]$ and then take the expectation with respect to the distribution of x.

22 Basic concepts Similarly, the conditional covariance and correlation matrices are the covariance and correlation matrices of the multivariate random variable y|x. In particular, the conditional covariance matrix contains the conditional variances, $\mathrm{Var}_{y_j|x}[y_j|x]$, and the conditional covariances, $\mathrm{Cov}_{y_j,y_k|x}[y_j, y_k|x]$. An important property of $\mathrm{Var}_{y_j|x}[y_j|x]$ is that: $\mathrm{Var}_{y_j}[y_j] = E_x[\mathrm{Var}_{y_j|x}[y_j|x]] + \mathrm{Var}_x[E_{y_j|x}[y_j|x]]$. This is usually called the law of total variance.

23 Basic concepts Let $x = (x_1, \ldots, x_p)'$ and $y = (y_1, \ldots, y_q)'$ be two multivariate random variables with mean vectors $\mu_x$ and $\mu_y$ and covariance matrices $\Sigma_x$ and $\Sigma_y$, respectively. The covariance matrix between x and y is a $p \times q$ matrix given by: $\mathrm{Cov}[x, y] = E[(x - \mu_x)(y - \mu_y)']$. Similarly, the correlation matrix between x and y is a $p \times q$ matrix given by: $\mathrm{Cor}[x, y] = \Delta_x^{-1/2}\, \mathrm{Cov}[x, y]\, \Delta_y^{-1/2}$ where $\Delta_x$ and $\Delta_y$ are diagonal matrices whose elements are the diagonal elements of $\Sigma_x$ and $\Sigma_y$, respectively.

24 Basic concepts Let $x = (x_1, \ldots, x_p)'$ be a multivariate variable with pdf $f_x$ and let $y = (y_1, \ldots, y_p)'$ be a new variable given by: $y = g(x)$ where g is a function with differentiable inverse given by: $x = g^{-1}(y) = h(y)$. Therefore, the pdf of y is given by: $f_y(y) = f_x(x) \left|\det\left(\frac{\partial x}{\partial y}\right)\right| = f_x(h(y)) \left|\det\left(\frac{\partial h(y)}{\partial y}\right)\right|$ where $\frac{\partial x}{\partial y}$ is the Jacobian of the transformation, $\det(\cdot)$ stands for determinant and $|\cdot|$ denotes the absolute value function.

25 Basic concepts Consider the particular case of a linear transformation, $y = Ax + b$, where A is a non-singular $p \times p$ matrix and b is a $p \times 1$ vector. Then, we have that $x = A^{-1}(y - b)$ while $\frac{\partial x}{\partial y} = A^{-1}$. Therefore: $f_y(y) = f_x\left(A^{-1}(y - b)\right) \left|\det\left(A^{-1}\right)\right|$

26 Basic concepts The previous case only considers transformations from a p-dimensional random variable to another p-dimensional random variable. The case of transformations from a p-dimensional random variable to a q-dimensional random variable, with $p \ne q$, is more difficult to handle. Therefore, we focus on the mean vector and the covariance matrix of the transformed random variable. Let $x = (x_1, \ldots, x_p)'$ be a multivariate random variable and let $y = (y_1, \ldots, y_q)'$ be such that: $y = Ax + b$ where A is a $q \times p$ matrix and b is a $q \times 1$ column vector. Then, letting $\mu_x$ and $\mu_y$ be the mean vectors and $\Sigma_x$ and $\Sigma_y$ be the covariance matrices of x and y, respectively, we have: $\mu_y = A\mu_x + b$, $\Sigma_y = A\Sigma_x A'$
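The moment formulas $\mu_y = A\mu_x + b$ and $\Sigma_y = A\Sigma_x A'$ can be illustrated by Monte Carlo; all parameters below are made up, and a Gaussian sampler is used only for convenience (the formulas hold for any distribution with these moments):

```python
import numpy as np

rng = np.random.default_rng(1)
mu_x = np.array([1.0, -2.0, 0.5])
Sigma_x = np.array([[2.0, 0.5, 0.0],
                    [0.5, 1.0, 0.3],
                    [0.0, 0.3, 1.5]])
A = np.array([[1.0,  0.0, 2.0],     # maps p = 3 down to q = 2
              [0.0, -1.0, 1.0]])
b = np.array([0.0, 3.0])

mu_y = A @ mu_x + b                 # theoretical mean of y
Sigma_y = A @ Sigma_x @ A.T         # theoretical covariance of y

x = rng.multivariate_normal(mu_x, Sigma_x, size=200_000)
y = x @ A.T + b                     # apply y = Ax + b row by row

assert np.allclose(y.mean(axis=0), mu_y, atol=0.05)
assert np.allclose(np.cov(y, rowvar=False), Sigma_y, atol=0.1)
```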

27 Multivariate distributions The multivariate Gaussian distribution is a generalization to two or more dimensions of the univariate Gaussian (or Normal) distribution. The latter is often characterized by its resemblance to the shape of a bell, which is why it is popularly referred to as the bell curve. The Gaussian distribution is used extensively in both theoretical and applied statistics research. Although it is well known that real data rarely obey the dictates of the Gaussian distribution, it does provide us with a useful approximation to reality. The pdf of a univariate Gaussian random variable with mean $\mu_x = E(x)$ and variance $\sigma_x^2 = \mathrm{Var}(x)$ is: $f_x(x) = (2\pi\sigma_x^2)^{-1/2} \exp\left(-\frac{(x - \mu_x)^2}{2\sigma_x^2}\right)$, $-\infty < x < \infty$, and we denote it as $x \sim N(\mu_x, \sigma_x^2)$.

28 Multivariate distributions [Figure: PDF of N(0,1) in blue, N(1,1) in green and N(0,2) in orange]

29 Multivariate distributions Generalizing the univariate Gaussian distribution, the pdf of a multivariate Gaussian random variable $x = (x_1, \ldots, x_p)'$ with mean vector $\mu_x = E(x)$ and covariance matrix $\Sigma_x = \mathrm{Cov}(x)$ is given by: $f_x(x) = (2\pi)^{-p/2} |\Sigma_x|^{-1/2} \exp\left(-\frac{1}{2}(x - \mu_x)' \Sigma_x^{-1} (x - \mu_x)\right)$ where $-\infty < x_j < \infty$, for $j = 1, \ldots, p$. We denote it as $x \sim N_p(\mu_x, \Sigma_x)$. The next slides show some examples of pdfs of bivariate Gaussian distributions.
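A minimal check of the density formula against SciPy's implementation, at a single made-up evaluation point:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up parameters and evaluation point for a bivariate Gaussian.
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x0 = np.array([0.5, 0.5])

p = len(mu)
diff = x0 - mu
quad = diff @ np.linalg.inv(Sigma) @ diff   # (x - mu)' Sigma^{-1} (x - mu)
pdf_formula = (2 * np.pi) ** (-p / 2) * np.linalg.det(Sigma) ** (-0.5) * np.exp(-0.5 * quad)

assert np.isclose(pdf_formula, multivariate_normal(mu, Sigma).pdf(x0))
```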

30 Multivariate distributions [Figure: PDF of the multivariate standard Gaussian]

31 Multivariate distributions [Figure: PDF of a bivariate Gaussian with correlation]

32 Multivariate distributions [Figure: PDF of a bivariate Gaussian with correlation]

33 Multivariate distributions How is the $N_p(\mu_x, \Sigma_x)$ distribution related to the $N_p(0_p, I_p)$ distribution (the standard multivariate Gaussian distribution)? Through a linear transformation, as follows. Let $x \sim N_p(\mu_x, \Sigma_x)$ and $y = \Sigma_x^{-1/2}(x - \mu_x)$. Then, $y \sim N_p(0_p, I_p)$. How can we create $N_p(\mu_x, \Sigma_x)$ variables on the basis of $N_p(0_p, I_p)$ variables? We use the inverse linear transformation: $x = \Sigma_x^{1/2} y + \mu_x$. Additionally, it is of interest to know the distribution of a Gaussian variable after it has been linearly transformed. Let $x \sim N_p(\mu_x, \Sigma_x)$, A a $q \times p$ matrix and b a $q \times 1$ column vector. Then, $y = Ax + b$ has a $N_q(A\mu_x + b, A\Sigma_x A')$ distribution. Therefore, y also has a Gaussian distribution.
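The construction $x = \Sigma_x^{1/2} y + \mu_x$ translates into a sampler as sketched below; the symmetric square root is computed by eigendecomposition (a Cholesky factor would serve equally well), and $\mu_x$, $\Sigma_x$ are made up:

```python
import numpy as np

rng = np.random.default_rng(42)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Symmetric square root Sigma^{1/2} via the spectral decomposition.
vals, vecs = np.linalg.eigh(Sigma)
Sigma_half = vecs @ np.diag(np.sqrt(vals)) @ vecs.T
assert np.allclose(Sigma_half @ Sigma_half, Sigma)

# Standard Gaussian draws y ~ N_p(0, I), then x = Sigma^{1/2} y + mu.
y = rng.standard_normal((100_000, 2))
x = y @ Sigma_half + mu       # Sigma_half is symmetric, so no transpose needed

assert np.allclose(x.mean(axis=0), mu, atol=0.05)
assert np.allclose(np.cov(x, rowvar=False), Sigma, atol=0.05)
```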

34 Multivariate distributions The level curves or contours are the curves obtained by cutting the probability density function by parallel hyperplanes. In other words, the level curves are sets of points with the same density value. In the multivariate Gaussian case, their equation is given by: $(x - \mu_x)' \Sigma_x^{-1} (x - \mu_x) = c$ where c is a constant. Therefore, the level curves of multivariate Gaussian distributions are ellipsoids. The next two slides show the level curves for the Gaussian distributions considered in the previous plots, with and without a sample of 100 points generated from these distributions.

35 Multivariate distributions [Figure: level curves for Gaussians with correlation 0, correlation .9, and a third correlation]

36 Multivariate distributions [Figure: level curves for Gaussians with correlation 0, correlation .9, and a third correlation, with 100 sampled points]

37 Multivariate distributions The level curves of the multivariate Gaussian distribution give us a notion of distance between points. Note that all the points on a level curve have the same density and form an ellipsoid. Therefore, it is reasonable to assume that all the points on a level curve are at the same distance from the center of the distribution. The implied distance is the Mahalanobis distance between x and $\mu_x$, given by: $D_M(x, \mu_x)^2 = (x - \mu_x)' \Sigma_x^{-1} (x - \mu_x)$. If $x \sim N_p(\mu_x, \Sigma_x)$, the squared Mahalanobis distance has a $\chi^2_p$ distribution, i.e., $D_M(x, \mu_x)^2 \sim \chi^2_p$. The Mahalanobis distance plays an important role in many problems such as outlier detection, classification, clustering and so on.
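A sketch of outlier screening with the Mahalanobis distance: under the Gaussian model, squared distances follow a $\chi^2_p$ law, so about 1% of observations should exceed its 0.99 quantile. All parameters below are made up:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
mu = np.array([0.0, 0.0, 0.0])
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 2.0, 0.5],
                  [0.0, 0.5, 1.0]])
x = rng.multivariate_normal(mu, Sigma, size=50_000)

# Squared Mahalanobis distances D_M(x_i, mu)^2 for all rows at once.
Sigma_inv = np.linalg.inv(Sigma)
diff = x - mu
d2 = np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)

# Under Gaussianity, d2 ~ chi^2_3: roughly 1% exceed the 0.99 quantile.
cutoff = chi2.ppf(0.99, df=3)
frac_flagged = (d2 > cutoff).mean()
assert abs(frac_flagged - 0.01) < 0.005
```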

38 Multivariate distributions [Figure: random sample and its Mahalanobis distances]

39 Multivariate distributions It is useful to know more about the multivariate Gaussian distribution, since it is often a good approximation in many situations. It is often of interest to partition x into sub-variables. Therefore, we partition x, its mean vector $\mu_x$ and its covariance matrix $\Sigma_x$ as: $x = (x_{(1)}', x_{(2)}')'$, $\mu_x = (\mu_{x(1)}', \mu_{x(2)}')'$ and $\Sigma_x = \begin{pmatrix} \Sigma_{x(11)} & \Sigma_{x(12)} \\ \Sigma_{x(21)} & \Sigma_{x(22)} \end{pmatrix}$ where $x_{(1)}$ and $x_{(2)}$ have dimensions q and p − q, respectively. Then, $x_{(1)} \sim N_q(\mu_{x(1)}, \Sigma_{x(11)})$, $x_{(2)} \sim N_{p-q}(\mu_{x(2)}, \Sigma_{x(22)})$ and $\mathrm{Cov}(x_{(1)}, x_{(2)}) = \Sigma_{x(12)}$. Moreover, $x_{(1)}$ and $x_{(2)}$ are independent if and only if $\Sigma_{x(12)} = 0_{(q, p-q)}$, where $0_{(q, p-q)}$ is a $q \times (p - q)$ matrix of zeros.

40 Multivariate distributions If $\Sigma_{x(22)} > 0$, then the conditional distribution of $x_{(1)}$ given $x_{(2)}$ is Gaussian with mean: $\mu_{x(1)} + \Sigma_{x(12)} \Sigma_{x(22)}^{-1} \left(x_{(2)} - \mu_{x(2)}\right)$ and covariance matrix: $\Sigma_{x(11)} - \Sigma_{x(12)} \Sigma_{x(22)}^{-1} \Sigma_{x(21)}$. If $x_{(1)}$ and $x_{(2)}$ are independent and distributed as $N_q(\mu_{x(1)}, \Sigma_{x(11)})$ and $N_{p-q}(\mu_{x(2)}, \Sigma_{x(22)})$, respectively, then $x = (x_{(1)}', x_{(2)}')'$ has the multivariate Gaussian distribution: $N_p\left(\begin{pmatrix} \mu_{x(1)} \\ \mu_{x(2)} \end{pmatrix}, \begin{pmatrix} \Sigma_{x(11)} & 0_{(q, p-q)} \\ 0_{(p-q, q)} & \Sigma_{x(22)} \end{pmatrix}\right)$
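For q = 1 and p = 2 the conditional-mean and conditional-covariance formulas reduce to scalars, which makes them easy to check by hand; the numbers below are made up:

```python
import numpy as np

# Made-up bivariate Gaussian, partitioned into x(1) = x_1 and x(2) = x_2.
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
S11, S12 = Sigma[0, 0], Sigma[0, 1]
S21, S22 = Sigma[1, 0], Sigma[1, 1]

x2_obs = 3.0   # hypothetical observed value of x_2

# mu1 + S12 S22^{-1} (x2 - mu2)  and  S11 - S12 S22^{-1} S21
cond_mean = mu[0] + S12 / S22 * (x2_obs - mu[1])
cond_var = S11 - S12 / S22 * S21

assert np.isclose(cond_mean, 1.8)    # 1 + 0.8 * (3 - 2)
assert np.isclose(cond_var, 1.36)    # 2 - 0.8 * 0.8
```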

41 Multivariate distributions The multivariate Gaussian distribution belongs to the large family of elliptical distributions, which has recently gained a lot of attention in financial mathematics. The simplest case of elliptical distributions is the subclass of spherical distributions. We say that a vector variable $x = (x_1, \ldots, x_p)'$ follows a spherical distribution if its density function depends on the variable only through $x'x$. Therefore, the level curves of the distribution are spheres centered at the origin and the distribution is invariant under rotations. In other words, if we define $y = Cx$, where C is an orthogonal matrix, the density of the variable y is the same as that of x. This is only one of the possible ways to define spherical distributions. We can see spherical distributions as an extension of the standard multivariate Gaussian distribution $N_p(0_p, I_p)$.

42 Multivariate distributions The variable $x = (x_1, \ldots, x_p)'$ follows an elliptical distribution if its density function depends on x only through $(x - m)' V^{-1} (x - m)$, where m is a $p \times 1$ column vector and V is a $p \times p$ matrix (not necessarily the mean and the covariance matrix of x). Elliptical distributions have level curves that are ellipsoids centered at m. The multivariate Gaussian distribution is the best known elliptical distribution. Indeed, elliptical distributions can be seen as an extension of the $N_p(\mu_x, \Sigma_x)$.

43 Multivariate distributions Let $y \sim N_p(0_p, \Sigma)$ and $u \sim \chi^2_\nu$ be independent. The multivariate random variable: $x = \mu + \sqrt{\nu / u}\; y$ has a multivariate Student's t distribution with parameters $\mu$, $\Sigma$ and $\nu$. For $\nu > 2$, the mean of the distribution is $\mu$ and the covariance matrix is $\frac{\nu}{\nu - 2}\Sigma$. The parameter $\nu$ is called the degrees of freedom parameter. The density function of a multivariate Student's t distribution is given by: $f_x(x) = \frac{\Gamma\left(\frac{\nu + p}{2}\right)}{(\pi\nu)^{p/2}\, \Gamma\left(\frac{\nu}{2}\right)}\, |\Sigma|^{-1/2} \left(1 + \frac{(x - \mu)' \Sigma^{-1} (x - \mu)}{\nu}\right)^{-\frac{\nu + p}{2}}$. The multivariate Student's t distribution belongs to the class of elliptical distributions. In particular, if $\mu = 0_p$ and $\Sigma = I_p$, this distribution belongs to the class of spherical distributions.
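The stochastic construction above translates directly into a sampler; this sketch (with made-up $\mu$, $\Sigma$ and $\nu$) checks the covariance $\frac{\nu}{\nu - 2}\Sigma$ by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(3)
nu = 5.0
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.4],
                  [0.4, 1.0]])

# x = mu + sqrt(nu / u) * y, with y ~ N_p(0, Sigma) and u ~ chi^2_nu independent.
n = 400_000
y = rng.multivariate_normal([0.0, 0.0], Sigma, size=n)
u = rng.chisquare(nu, size=n)
x = mu + np.sqrt(nu / u)[:, None] * y

# For nu > 2 the mean is mu and the covariance is nu / (nu - 2) * Sigma.
assert np.allclose(x.mean(axis=0), mu, atol=0.02)
assert np.allclose(np.cov(x, rowvar=False), nu / (nu - 2.0) * Sigma, atol=0.05)
```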

44 Multivariate distributions [Figure: PDF of a bivariate Student's t distribution with 5 degrees of freedom]

45 Multivariate distributions Elliptical distributions share many properties with Gaussian distributions: marginal and conditional distributions are also elliptical, and the conditional means are a linear function of the determining variables. Nevertheless, the Gaussian distribution is the only one in the family to have the property whereby if the covariance matrix is diagonal, all the component variables are independent.

46 Multivariate distributions A distribution is called heavy-tailed if it has higher probability density in its tail area than a Gaussian distribution with the same mean vector and covariance matrix. The multivariate Student's t distribution is an example of a heavy-tailed distribution. Other examples of heavy-tailed distributions include the multivariate generalized hyperbolic distribution, the multivariate Laplace distribution and multivariate mixtures of distributions. In particular, we briefly review multivariate mixtures of distributions.

47 Multivariate distributions Mixture modelling concerns modelling a statistical distribution by a mixture (or weighted sum) of different distributions. For many choices of component density functions, the mixture model can approximate any continuous density to arbitrary accuracy, provided that the number of component density functions is sufficiently large and the parameters of the model are chosen correctly. The density function of a multivariate random variable $x = (x_1, \ldots, x_p)'$ that follows a mixture distribution is given by: $f_x(x) = \sum_{g=1}^G \pi_g f_{x,g}(x)$ where: $\pi_1, \ldots, \pi_G$ are weights such that $\sum_{g=1}^G \pi_g = 1$; and $f_{x,1}(x), \ldots, f_{x,G}(x)$ are multivariate pdf's.

48 Multivariate distributions Note that mixture distributions have an interesting interpretation in terms of heterogeneous populations. Assume a population on which we have defined the multivariate random variable x and that can be subdivided into G more homogeneous groups. Then, $\pi_1, \ldots, \pi_G$ can be seen as the proportions of elements in the groups $1, \ldots, G$, while $f_{x,1}(x), \ldots, f_{x,G}(x)$ are the multivariate pdf's associated with each subpopulation.
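A two-component bivariate Gaussian mixture (all parameters made up) can be sketched as follows; the grid sum checks that the mixture density integrates to one:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up mixture: f(x) = pi_1 f_1(x) + pi_2 f_2(x), with G = 2 components.
weights = [0.3, 0.7]
comps = [
    multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.0], [0.0, 1.0]]),
    multivariate_normal(mean=[3.0, 3.0], cov=[[1.0, 0.5], [0.5, 2.0]]),
]

def mixture_pdf(x):
    return sum(w * c.pdf(x) for w, c in zip(weights, comps))

# The mixture density should integrate to 1; Riemann sum on a coarse grid.
grid = np.linspace(-8.0, 12.0, 401)
X, Y = np.meshgrid(grid, grid)
pts = np.column_stack([X.ravel(), Y.ravel()])
total = mixture_pdf(pts).sum() * (grid[1] - grid[0]) ** 2
assert abs(total - 1.0) < 1e-3
```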

49 Multivariate distributions [Figure: PDF of a mixture distribution]

50 Multivariate distributions [Figure: level curves for a mixture of Gaussian distributions]

51 Multivariate distributions [Figure: PDF of a mixture distribution]

52 Multivariate distributions [Figure: level curves for a mixture of Gaussian distributions]

53 Multivariate distributions One main problem in multivariate analysis is how to model the dependence of the components of a multivariate random variable. We have seen several multivariate distributions that model this dependence. However, these models, except perhaps mixtures, are not flexible enough to model multivariate dependence. As seen before, copulae represent an elegant concept for connecting marginals with joint cumulative distribution functions. Copulas are functions that join or couple multivariate distribution functions to their 1-dimensional marginal distribution functions.

54 Multivariate distributions Let $x = (x_1, \ldots, x_p)'$ be a multivariate random variable and let $F_{x_j}$, for $j = 1, \ldots, p$, be the marginal distribution functions of the components of x. Using copulae, the marginal distribution functions can be modelled separately from their dependence structure and then coupled together to form the multivariate distribution $F_x$. The formal definition of a p-dimensional copula function is more complex than in the 2-dimensional case. However, the intuition is the same as in the 2-dimensional case, so we do not provide its formal definition here.

55 Multivariate distributions Sklar's Theorem in p dimensions: Let $F_x$ be a p-dimensional distribution function with marginal distribution functions $F_{x_1}, \ldots, F_{x_p}$. Then, a p-dimensional copula $C_x$ exists such that for all $(x_1, \ldots, x_p) \in \mathbb{R}^p$: $F_x(x_1, \ldots, x_p) = C_x\left(F_{x_1}(x_1), \ldots, F_{x_p}(x_p)\right)$. Moreover, if $F_{x_1}, \ldots, F_{x_p}$ are continuous then $C_x$ is unique. Conversely, if $C_x$ is a copula and $F_{x_1}, \ldots, F_{x_p}$ are distribution functions, then $F_x$ defined above is a p-dimensional distribution function with marginals $F_{x_1}, \ldots, F_{x_p}$.

56 Multivariate distributions Let $F_z$ denote the univariate standard Gaussian distribution function and $F_x$ the p-dimensional Gaussian distribution function with mean vector $0_p$ and covariance (as well as correlation) matrix $\Sigma_x$. Then, the function: $C^{\mathrm{Gauss}}_{x,\Sigma_x}(u) = F_x\left(F_z^{-1}(u_1), \ldots, F_z^{-1}(u_p)\right)$ is the p-dimensional Gaussian copula with correlation matrix $\Sigma_x$, where $u = (u_1, \ldots, u_p)' \in [0, 1]^p$. If $\Sigma_x \ne I_p$, then the corresponding Gaussian copula allows one to generate joint symmetric dependence. However, it is not possible to model tail dependence, i.e., joint extreme events have zero probability.

57 Multivariate distributions The function: $C^{GH}_{x,\theta}(u) = \exp\left(-\left(\sum_{j=1}^p (-\log u_j)^\theta\right)^{1/\theta}\right)$ is the p-dimensional Gumbel-Hougaard copula function, where $\theta \in [1, \infty)$. Unlike the Gaussian copula, $C^{GH}_{x,\theta}$ can generate an upper tail dependence.
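The Gumbel-Hougaard formula is easy to implement directly; for $\theta = 1$ it reduces to the independence copula $\prod_j u_j$, and as $\theta \to \infty$ it approaches the comonotonicity bound $\min_j u_j$ (the evaluation point below is made up):

```python
import numpy as np

# Gumbel-Hougaard copula C(u) = exp(-(sum_j (-log u_j)^theta)^(1/theta)).
def gumbel_copula(u, theta):
    u = np.asarray(u, dtype=float)
    return np.exp(-np.sum((-np.log(u)) ** theta) ** (1.0 / theta))

u = [0.4, 0.7, 0.9]

# theta = 1: independence copula, C(u) = u1 * u2 * u3.
assert np.isclose(gumbel_copula(u, 1.0), np.prod(u))

# Very large theta: approaches min(u_j), the strongest positive dependence.
assert np.isclose(gumbel_copula(u, 200.0), min(u), atol=1e-3)
```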

58 Multivariate distributions [Figure: PDF of a copula distribution]

59 Statistical inference In multivariate statistics, we observe the values of a multivariate random variable $x = (x_1, \ldots, x_p)'$ and obtain a sample $x_i = (x_{i1}, \ldots, x_{ip})'$, for $i = 1, \ldots, n$, summarised in a data matrix X. For a given random sample, $x_1, \ldots, x_n$, the idea of statistical inference is to analyse the properties of the population random variable x. If we do not know the distribution of x, statistical inference can often be performed using some observable functions of the sample $x_1, \ldots, x_n$, i.e., statistics. Examples of statistics are the sample mean and the sample covariance matrix.

60 Statistical inference To get an idea of the relationship between a statistic and the corresponding population counterpart, one has to derive the sampling distribution of the statistic. Given a random sample, $x_1, \ldots, x_n$, of the population random variable x such that $E[x] = \mu_x$ and $\mathrm{Cov}[x] = \Sigma_x$, the sample mean vector $\bar{x}$ and the sample covariance matrix $S_x$ verify the following properties: 1. $E[\bar{x}] = \mu_x$. 2. $\mathrm{Cov}[\bar{x}] = \frac{1}{n}\Sigma_x$. 3. $E[S_x] = \Sigma_x$.
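The three properties can be illustrated by Monte Carlo over repeated samples; the bivariate Gaussian population below is made up, and $S_x$ here uses the divisor n − 1:

```python
import numpy as np

rng = np.random.default_rng(11)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
n, reps = 10, 100_000

# reps independent samples of size n from the population.
data = rng.multivariate_normal(mu, Sigma, size=(reps, n))   # (reps, n, 2)
means = data.mean(axis=1)                                   # one xbar per sample
centered = data - means[:, None, :]
S = np.einsum('rij,rik->rjk', centered, centered) / (n - 1) # one S_x per sample

assert np.allclose(means.mean(axis=0), mu, atol=0.02)                  # E[xbar] = mu
assert np.allclose(np.cov(means, rowvar=False), Sigma / n, atol=0.02)  # Cov[xbar] = Sigma/n
assert np.allclose(S.mean(axis=0), Sigma, atol=0.02)                   # E[S_x] = Sigma
```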

61 Statistical inference Statistical inference often requires more than just the mean and/or the covariance of a statistic. We need the sampling distribution of the statistic to derive confidence intervals or to define rejection regions in hypothesis testing for a given significance level. For instance, in the Gaussian case, we have the following result. Theorem: Let $x_1, \ldots, x_n$ be i.i.d. with $x_i \sim N(\mu_x, \Sigma_x)$. Then, $\bar{x} \sim N(\mu_x, \frac{1}{n}\Sigma_x)$. The central limit theorem shows that even if the parent distribution is not Gaussian, when the sample size n is large, the sample mean vector $\bar{x}$ has an approximate Gaussian distribution. Central Limit Theorem (CLT): Let $x_1, \ldots, x_n$ be i.i.d. with $x_i \sim (\mu_x, \Sigma_x)$. Then, the distribution of $\sqrt{n}(\bar{x} - \mu_x)$ is asymptotically $N(0_p, \Sigma_x)$, i.e., $\sqrt{n}(\bar{x} - \mu_x) \xrightarrow{d} N(0_p, \Sigma_x)$ as $n \to \infty$.

62 Statistical inference

The next two slides show multivariate kernel density estimates of 2000 sample mean vectors, computed from 2000 samples of a certain bivariate random variable.

The first slide corresponds to the case n = 5; the second slide to the case n = 100.

It is easy to see that the second estimate appears closer to a bivariate Gaussian distribution than the first one.

63 Statistical inference

[Figure: kernel density estimate of the 2000 sample mean vectors for n = 5; axes x1 and x2.]

64 Statistical inference

[Figure: kernel density estimate of the 2000 sample mean vectors for n = 100; axes x1 and x2.]

65 Statistical inference

If we assume that we know the distribution of the multivariate random variable x, then the main goal of statistical inference is to estimate the parameters of this distribution.

Let θ = (θ_1, ..., θ_r)′ be the vector of parameters of a certain distribution with density function f(· | θ). The aim is to estimate the vector θ from an i.i.d. sample x_1, ..., x_n from x.

The most important method to carry out this task is maximum likelihood estimation (MLE).

66 Statistical inference

Let x_1, ..., x_n be an i.i.d. sample of x. Then, the joint pdf of x_1, ..., x_n is given by:

f(x_1, ..., x_n | θ) = ∏_{i=1}^n f(x_i | θ)

Note that the sample (the data matrix X) is known but θ is unknown. In MLE, θ is treated as a variable and X is held fixed, leading to the likelihood function:

l(θ | X) = ∏_{i=1}^n f(x_i | θ)

where x_i = (x_i1, ..., x_ip)′.

The likelihood function measures how plausible each value of θ is given the observed data X (note that, in general, it is not a pdf in θ).

67 Maximum likelihood estimation

The maximum likelihood estimate (MLE) of θ, denoted by θ̂, is the value of θ that maximizes l(θ | X), i.e.:

θ̂ = arg max_θ l(θ | X)

In other words, the MLE θ̂ is the value of θ that maximizes the probability of obtaining the sample under study.

Often it is easier to maximize the logarithm of the likelihood function, named the log-likelihood function or support function:

L(θ | X) = log l(θ | X)

which is equivalent since the logarithm is a monotone one-to-one function. Hence,

θ̂ = arg max_θ l(θ | X) = arg max_θ L(θ | X)
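A minimal numerical sketch of this equivalence (univariate Gaussian mean with known unit variance, simulated data; the grid search is for illustration only): maximizing the likelihood and maximizing the log-likelihood over the same grid of θ values select the same point, which agrees with the sample mean:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.0, size=500)  # i.i.d. sample, true mean 2

theta_grid = np.linspace(0.0, 4.0, 4001)

# Log-likelihood of a N(theta, 1) model on the grid (constants in theta dropped).
loglik = np.array([-0.5 * np.sum((x - t) ** 2) for t in theta_grid])

# The raw likelihood would underflow for n = 500, so we exponentiate after
# subtracting the maximum -- a monotone shift that preserves the argmax.
lik = np.exp(loglik - loglik.max())

theta_hat_log = theta_grid[np.argmax(loglik)]
theta_hat_lik = theta_grid[np.argmax(lik)]

print(theta_hat_log == theta_hat_lik)        # same maximizer
print(abs(theta_hat_log - x.mean()) < 1e-3)  # matches the sample mean
```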

68 Maximum likelihood estimation

Usually, the maximisation cannot be performed analytically, and nonlinear optimization techniques are required.

In this case, given a data matrix X and the likelihood function, numerical methods are used to determine the value of θ maximising L(θ | X) or l(θ | X). These numerical methods are typically based on Newton-Raphson techniques.
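A sketch of the Newton-Raphson idea for a scalar parameter (MLE of the rate of an exponential sample; this example is not from the slides, and the closed form λ̂ = 1/x̄ lets us check the iteration):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(scale=0.5, size=1000)  # true rate = 2

# Log-likelihood of Exp(rate r): L(r) = n log r - r * sum(x).
# Score:   L'(r)  = n / r - sum(x)
# Hessian: L''(r) = -n / r**2
n, s = x.size, x.sum()

r = 1.0  # starting value
for _ in range(50):
    score = n / r - s
    hess = -n / r**2
    step = score / hess
    r = r - step  # Newton-Raphson update: r_new = r - L'(r) / L''(r)
    if abs(step) < 1e-12:
        break

print(r, 1.0 / x.mean())  # the iteration converges to the closed-form MLE
```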

69 Maximum likelihood estimation

Let x_1, ..., x_n be a simple random sample from x ∼ N(µ_x, Σ_x). Then, the joint density function is:

f(x_1, ..., x_n | µ_x, Σ_x) = ∏_{i=1}^n (2π)^{−p/2} |Σ_x|^{−1/2} exp( −(x_i − µ_x)′ Σ_x^{−1} (x_i − µ_x) / 2 )

Then, the support function is given by:

L(µ_x, Σ_x | X) = −(np/2) log 2π − (n/2) log |Σ_x| − (1/2) ∑_{i=1}^n (x_i − µ_x)′ Σ_x^{−1} (x_i − µ_x)

Next, note that we can write:

∑_{i=1}^n (x_i − µ_x)′ Σ_x^{−1} (x_i − µ_x) = Tr[ Σ_x^{−1} ( ∑_{i=1}^n (x_i − µ_x)(x_i − µ_x)′ ) ]

70 Maximum likelihood estimation

On the other hand, adding and subtracting the sample mean vector x̄ in (x_i − µ_x) leads to:

∑_{i=1}^n (x_i − µ_x)(x_i − µ_x)′ = ∑_{i=1}^n (x_i − x̄ + x̄ − µ_x)(x_i − x̄ + x̄ − µ_x)′ = ∑_{i=1}^n (x_i − x̄)(x_i − x̄)′ + n (x̄ − µ_x)(x̄ − µ_x)′

because the cross terms ∑_{i=1}^n (x_i − x̄)(x̄ − µ_x)′ and ∑_{i=1}^n (x̄ − µ_x)(x_i − x̄)′ are both matrices of zeros.

71 Maximum likelihood estimation

Consequently:

∑_{i=1}^n (x_i − µ_x)′ Σ_x^{−1} (x_i − µ_x) = Tr[ Σ_x^{−1} ( ∑_{i=1}^n (x_i − x̄)(x_i − x̄)′ + n (x̄ − µ_x)(x̄ − µ_x)′ ) ] = Tr[ Σ_x^{−1} ( ∑_{i=1}^n (x_i − x̄)(x_i − x̄)′ ) ] + n (x̄ − µ_x)′ Σ_x^{−1} (x̄ − µ_x)

72 Maximum likelihood estimation

Therefore, the support function can be written as:

L(µ_x, Σ_x | X) = −(np/2) log 2π − (n/2) log |Σ_x| − (1/2) ( Tr[ Σ_x^{−1} ( ∑_{i=1}^n (x_i − x̄)(x_i − x̄)′ ) ] + n (x̄ − µ_x)′ Σ_x^{−1} (x̄ − µ_x) )

Now, L(µ_x, Σ_x | X) depends on µ_x only through the last term, and that term is maximized when (x̄ − µ_x)′ Σ_x^{−1} (x̄ − µ_x) = 0. Therefore, the MLE of µ_x is µ̂_x = x̄.

73 Maximum likelihood estimation

It remains to maximize:

L(Σ_x | X, µ_x = x̄) = −(np/2) log 2π − (n/2) log |Σ_x| − (1/2) Tr[ Σ_x^{−1} ( ∑_{i=1}^n (x_i − x̄)(x_i − x̄)′ ) ]

For that, we need a result from matrix algebra: given a p × p symmetric positive definite matrix B and a scalar b > 0, it follows that:

−b log |Σ_x| − (1/2) Tr( Σ_x^{−1} B ) ≤ −b log |B| + pb log(2b) − pb

with equality when Σ_x = (1/2b) B. Then, taking b = n/2 and B = ∑_{i=1}^n (x_i − x̄)(x_i − x̄)′ shows that the MLE of Σ_x is:

Σ̂_x = (1/n) ∑_{i=1}^n (x_i − x̄)(x_i − x̄)′

Note that the MLE of Σ_x is not the sample covariance matrix but a re-scaled version of it.
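The closed-form Gaussian MLEs are easy to compute in practice: µ̂_x = x̄ and Σ̂_x is the denominator-n (biased) covariance matrix. A sketch with simulated data (parameters chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.3], [0.3, 2.0]], size=400)
n = X.shape[0]

mu_hat = X.mean(axis=0)    # MLE of the mean vector: the sample mean
D = X - mu_hat             # centred data
Sigma_hat = (D.T @ D) / n  # MLE of the covariance: denominator n, not n-1

# The sample covariance matrix uses 1/(n-1); the MLE is its re-scaled version.
S = np.cov(X, rowvar=False)
print(np.allclose(Sigma_hat, (n - 1) / n * S))  # True
```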

74 Maximum likelihood estimation

The next theorem gives the asymptotic sampling distribution of the MLE, which turns out to be Gaussian.

Theorem. Suppose that the sample x_1, ..., x_n is i.i.d. If θ̂ is the MLE of θ ∈ R^r, i.e., θ̂ = arg max_θ L(θ | X), then, under some regularity conditions, as n → ∞:

√n (θ̂ − θ) →_d N(0_r, F^{−1})

where F denotes the Fisher information matrix given by:

F = −(1/n) E[ ∂²L(θ | X) / ∂θ ∂θ′ ]

As a consequence of this theorem, we see that under regularity conditions the MLE is asymptotically unbiased, efficient (minimum variance) and Gaussian distributed. It is also a consistent estimator of θ.

75 Hypothesis testing

We now turn our interest towards hypothesis testing. In particular, we go over a general methodology to construct tests, called the likelihood ratio method, and apply it to the case of Gaussian populations.

We assume an r-dimensional parameter vector θ that takes values in Ω ⊆ R^r. We want to test the hypothesis H_0 that the unknown parameter θ belongs to some subspace of R^r, called the null set and denoted by Ω_0 ⊆ R^r. Consequently, we want to test the hypothesis:

H_0 : θ ∈ Ω_0

versus the alternative hypothesis:

H_1 : θ ∈ Ω

which does not restrict θ to Ω_0.

76 Hypothesis testing

For example, consider a multivariate Gaussian N(µ_x, Σ_x). To test whether µ_x equals a certain fixed value µ_0, we construct the test problem:

H_0 : µ_x = µ_0
H_1 : no constraints on µ_x

In this example we have Ω_0 = {µ_0} and Ω = R^p.

77 Hypothesis testing

Define l_0 = max_{θ∈Ω_0} l(θ | X) and l = max_{θ∈Ω} l(θ | X), the values of the maximized likelihood under H_0 and H_1, respectively.

Consider the likelihood ratio (LR) given by:

LR = l_0 / l

By construction, 0 ≤ LR ≤ 1, and one tends to favour H_0 if the LR is high (close to 1) and H_1 if the LR is low (not close to 1).

The likelihood ratio test (LRT) tells us exactly when to favour H_0 over H_1. It is based on the statistic:

λ = −2 ln LR = −2 (ln l_0 − ln l)

The LRT statistic λ is asymptotically distributed as a χ² distribution with degrees of freedom equal to the difference between the dimensions of the spaces Ω and Ω_0.

78 Hypothesis testing

Given a sample from a population N(µ_x, Σ_x), we want to test the hypothesis:

H_0 : µ_x = µ_0

against the alternative:

H_1 : µ_x ≠ µ_0

It is possible to show that the likelihood ratio test statistic is given by:

λ = n log ( |Σ̂_0| / |Σ̂_x| )

where:

Σ̂_0 = (1/n) ∑_{i=1}^n (x_i − µ_0)(x_i − µ_0)′

and λ has an asymptotic χ² distribution with p degrees of freedom.
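A sketch of this test on simulated Gaussian data (the data and µ_0 are made up; scipy's chi-square survival function supplies the p-value):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
n, p = 300, 4
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)  # H0 is true here

mu0 = np.zeros(p)  # hypothesised mean under H0
x_bar = X.mean(axis=0)

# MLE of Sigma under H1 (deviations from x_bar) and under H0 (deviations from mu0).
D1 = X - x_bar
Sigma_hat = (D1.T @ D1) / n
D0 = X - mu0
Sigma_0 = (D0.T @ D0) / n

# LRT statistic: lambda = n log(|Sigma_0| / |Sigma_hat|), asymptotically chi2_p under H0.
lam = n * (np.linalg.slogdet(Sigma_0)[1] - np.linalg.slogdet(Sigma_hat)[1])
p_value = chi2.sf(lam, df=p)
print(lam, p_value)
```

Using `slogdet` avoids overflow in the determinants; λ is always nonnegative because Σ̂_0 = Σ̂_x + (x̄ − µ_0)(x̄ − µ_0)′.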

79 Illustrative example (I)

Consider the daily log-returns (in percentages) of four major European stock indices: Germany (DAX), Switzerland (SMI), France (CAC) and UK (FTSE), from 1991 to […]. We want to test the null hypothesis that the mean vector of returns is zero (assuming Gaussianity).

The estimated mean vector is:

x̄ = (0.065, 0.081, 0.043, 0.043)′

The covariance matrix estimated under H_0 is:

Σ̂_0 = […]

80 Illustrative example (I)

The covariance matrix estimated under H_1 is:

Σ̂_x = […]

The value of the statistic is λ = […], with associated p-value […]. Thus, we reject H_0 at the 5% significance level but we cannot reject H_0 at the 1% significance level.

81 Hypothesis testing

Given a sample of a population N(µ_x, Σ_x), we want to test the hypothesis:

H_0 : Σ_x = Σ_0

against the alternative:

H_1 : Σ_x ≠ Σ_0

It is possible to show that the likelihood ratio test statistic is given by:

λ = n log ( |Σ_0| / |Σ̂_x| ) + n Tr( Σ_0^{−1} Σ̂_x ) − np

which has an asymptotic χ² distribution with p(p + 1)/2 degrees of freedom.
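A sketch of the covariance test, again on simulated data, with Σ_0 = I as an arbitrary null value:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
n, p = 500, 3
X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)  # H0 is true here

Sigma0 = np.eye(p)                      # hypothesised covariance under H0
D = X - X.mean(axis=0)
Sigma_hat = (D.T @ D) / n               # MLE of Sigma under H1

A = np.linalg.solve(Sigma0, Sigma_hat)  # Sigma0^{-1} Sigma_hat

# lambda = n log(|Sigma0|/|Sigma_hat|) + n tr(Sigma0^{-1} Sigma_hat) - n p,
# asymptotically chi2 with p(p+1)/2 degrees of freedom under H0.
lam = (n * (np.linalg.slogdet(Sigma0)[1] - np.linalg.slogdet(Sigma_hat)[1])
       + n * np.trace(A) - n * p)
p_value = chi2.sf(lam, df=p * (p + 1) // 2)
print(lam, p_value)
```

λ is nonnegative here too: writing the eigenvalues of Σ_0^{−1} Σ̂_x as λ_i, the statistic equals n ∑ (λ_i − log λ_i − 1) ≥ 0.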

82 Hypothesis testing

It is also of interest to know whether Σ_x is diagonal, in which case the univariate variables are independent. In this case, we gain nothing from analyzing them jointly since they have no information in common. Then we test:

H_0 : Σ_x diagonal

against the alternative:

H_1 : Σ_x unrestricted

It is possible to show that the likelihood ratio test statistic is given by:

λ = −n log |R_x|

which has an asymptotic χ² distribution with p(p − 1)/2 degrees of freedom.
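A sketch of the diagonality test using the sample correlation matrix (the data are simulated with strong correlations, so H_0 should be rejected):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(8)
n, p = 400, 3
Sigma = np.array([[1.0, 0.7, 0.3],
                  [0.7, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])  # strongly correlated variables
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

R = np.corrcoef(X, rowvar=False)     # sample correlation matrix

# lambda = -n log |R_x|, asymptotically chi2 with p(p-1)/2 df under H0.
# Since |R_x| <= 1, lambda is always nonnegative.
lam = -n * np.linalg.slogdet(R)[1]
p_value = chi2.sf(lam, df=p * (p - 1) // 2)
print(lam, p_value)                  # large lambda, tiny p-value: reject H0
```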

83 Illustrative example (I)

Consider again the daily log-returns (in percentages) of the four major European stock indices. We test the null hypothesis of independence (assuming Gaussianity).

The estimated correlation matrix is:

R_x = […]

The value of the statistic is λ = […], with an associated p-value of virtually zero. Thus, we reject H_0 at the usual significance levels.

84 Hypothesis testing

Assume that we have observed a sample of size n of a p-dimensional variable x = (x_1, ..., x_p)′ that can be split into G groups, so that there are n_1 observations of group 1, and so on.

Our goal here is to check whether the means of the G groups are equal or not, assuming Gaussianity and that the covariance matrix Σ_x is the same for all the groups. Then, the hypothesis to be tested is:

H_0 : µ_1 = ··· = µ_G = µ_x

and the alternative hypothesis is:

H_1 : not all the µ_g are equal

This problem is known as the multivariate analysis of variance (MANOVA).

85 Hypothesis testing

The likelihood ratio test method leads to the statistic:

λ = n log ( |Σ̂_x| / |S_W| )

where Σ̂_x is the MLE of Σ_x under Gaussianity, and S_W = W/n, where:

W = ∑_{g=1}^G ∑_{i=1}^{n_g} (x_ig − x̄_g)(x_ig − x̄_g)′

where x_ig is the i-th observation in group g and x̄_g is the sample mean vector of the observations in group g.

W is usually called the within-groups variability matrix, or the matrix of deviations with respect to the means of each group.

86 Hypothesis testing

The statistic λ has an asymptotic χ² distribution with p(G − 1) degrees of freedom. However, this approximation can be improved for small sample sizes. For instance, the statistic:

λ_0 = m log ( |Σ̂_x| / |S_W| )

asymptotically follows a χ² distribution with p(G − 1) degrees of freedom, where m = n − 1 − (p + G)/2.

87 Hypothesis testing

This test can be derived in an alternative way. Let:

T = n Σ̂_x = ∑_{g=1}^G ∑_{i=1}^{n_g} (x_ig − x̄)(x_ig − x̄)′

be the total variability of the data, which measures the deviations with respect to a common mean.

The matrix T can be decomposed as the sum of two matrices. The first one is the matrix W defined previously. The second one measures the between-groups variability, explained by the differences between means, which we denote by B:

B = ∑_{g=1}^G n_g (x̄_g − x̄)(x̄_g − x̄)′

Therefore, we can write:

T (Total variability) = B (Explained variability) + W (Residual variability)

88 Hypothesis testing

In order to test whether the means are equal, we can compare the sizes of the matrices T and W, taking their determinants as the measure of size. Then, we can propose a test based on the ratio |T| / |W|.

For moderate sample sizes, this test is similar to the likelihood ratio test based on the statistic λ_0, which can also be written as:

λ_0 = m log ( |Σ̂_x| / |S_W| ) = m log ( |T| / |W| )
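The T = B + W decomposition and the corrected statistic λ_0 can be sketched directly. The three simulated groups below have different means (the group sizes and mean vectors are made up), so H_0 should be rejected:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(9)
p, G, n_g = 2, 3, 60
mus = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
groups = [rng.multivariate_normal(m, np.eye(p), size=n_g) for m in mus]

X = np.vstack(groups)
n = X.shape[0]
x_bar = X.mean(axis=0)

# Within-groups (W) and between-groups (B) variability matrices.
W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
B = sum(len(g) * np.outer(g.mean(axis=0) - x_bar, g.mean(axis=0) - x_bar) for g in groups)
T = (X - x_bar).T @ (X - x_bar)

assert np.allclose(T, B + W)  # total = explained + residual

# Corrected LRT statistic: lambda_0 = m log(|T|/|W|), m = n - 1 - (p + G)/2,
# asymptotically chi2 with p(G-1) df under H0 (equal group means).
m = n - 1 - (p + G) / 2
lam0 = m * (np.linalg.slogdet(T)[1] - np.linalg.slogdet(W)[1])
p_value = chi2.sf(lam0, df=p * (G - 1))
print(lam0, p_value)          # means differ, so the p-value is small
```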

89 Illustrative example (II)

We consider the Iris dataset, consisting of four univariate variables measured on 150 flowers of 3 different species (setosa, versicolor and virginica). There are 50 flowers of each species:

x1: Length of the sepal (in mm.).
x2: Width of the sepal (in mm.).
x3: Length of the petal (in mm.).
x4: Width of the petal (in mm.).

The next slide shows the scatterplot matrix of the dataset.

90 Illustrative example (II)

[Figure: scatterplot matrix of the Iris dataset, with variables Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.]

91 Illustrative example (II)

We test the equality of means for the 3 groups of the Iris dataset. The sample mean vectors of the 3 groups are:

x̄_1 = […]    x̄_2 = […]    x̄_3 = […]

The value of the statistic λ is […], with an associated p-value of virtually zero. Thus, we reject H_0. On the other hand, the value of the statistic λ_0 is […], also with a p-value of virtually zero, so we reject H_0 with this statistic as well.

Consequently, we reject that the three subsets of observations have the same means.

92 Chapter outline

1 Introduction.
2 Basic concepts.
3 Multivariate distributions.
4 Statistical inference.
5 Hypothesis testing.

We are now ready for:

Chapter 3: Principal components


The Instability of Correlations: Measurement and the Implications for Market Risk The Instability of Correlations: Measurement and the Implications for Market Risk Prof. Massimo Guidolin 20254 Advanced Quantitative Methods for Asset Pricing and Structuring Winter/Spring 2018 Threshold

More information

STAT 4385 Topic 01: Introduction & Review

STAT 4385 Topic 01: Introduction & Review STAT 4385 Topic 01: Introduction & Review Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Spring, 2016 Outline Welcome What is Regression Analysis? Basics

More information

Heteroskedasticity; Step Changes; VARMA models; Likelihood ratio test statistic; Cusum statistic.

Heteroskedasticity; Step Changes; VARMA models; Likelihood ratio test statistic; Cusum statistic. 47 3!,57 Statistics and Econometrics Series 5 Febrary 24 Departamento de Estadística y Econometría Universidad Carlos III de Madrid Calle Madrid, 126 2893 Getafe (Spain) Fax (34) 91 624-98-49 VARIANCE

More information

Dependence. MFM Practitioner Module: Risk & Asset Allocation. John Dodson. September 11, Dependence. John Dodson. Outline.

Dependence. MFM Practitioner Module: Risk & Asset Allocation. John Dodson. September 11, Dependence. John Dodson. Outline. MFM Practitioner Module: Risk & Asset Allocation September 11, 2013 Before we define dependence, it is useful to define Random variables X and Y are independent iff For all x, y. In particular, F (X,Y

More information

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012

Gaussian Processes. Le Song. Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Gaussian Processes Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 01 Pictorial view of embedding distribution Transform the entire distribution to expected features Feature space Feature

More information

1. Density and properties Brief outline 2. Sampling from multivariate normal and MLE 3. Sampling distribution and large sample behavior of X and S 4.

1. Density and properties Brief outline 2. Sampling from multivariate normal and MLE 3. Sampling distribution and large sample behavior of X and S 4. Multivariate normal distribution Reading: AMSA: pages 149-200 Multivariate Analysis, Spring 2016 Institute of Statistics, National Chiao Tung University March 1, 2016 1. Density and properties Brief outline

More information

Variations. ECE 6540, Lecture 10 Maximum Likelihood Estimation

Variations. ECE 6540, Lecture 10 Maximum Likelihood Estimation Variations ECE 6540, Lecture 10 Last Time BLUE (Best Linear Unbiased Estimator) Formulation Advantages Disadvantages 2 The BLUE A simplification Assume the estimator is a linear system For a single parameter

More information

Multivariate Non-Normally Distributed Random Variables

Multivariate Non-Normally Distributed Random Variables Multivariate Non-Normally Distributed Random Variables An Introduction to the Copula Approach Workgroup seminar on climate dynamics Meteorological Institute at the University of Bonn 18 January 2008, Bonn

More information

If we want to analyze experimental or simulated data we might encounter the following tasks:

If we want to analyze experimental or simulated data we might encounter the following tasks: Chapter 1 Introduction If we want to analyze experimental or simulated data we might encounter the following tasks: Characterization of the source of the signal and diagnosis Studying dependencies Prediction

More information

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2

Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, Jeffreys priors. exp 1 ) p 2 Stat260: Bayesian Modeling and Inference Lecture Date: February 10th, 2010 Jeffreys priors Lecturer: Michael I. Jordan Scribe: Timothy Hunter 1 Priors for the multivariate Gaussian Consider a multivariate

More information

Statistics and Data Analysis

Statistics and Data Analysis Statistics and Data Analysis The Crash Course Physics 226, Fall 2013 "There are three kinds of lies: lies, damned lies, and statistics. Mark Twain, allegedly after Benjamin Disraeli Statistics and Data

More information

conditional cdf, conditional pdf, total probability theorem?

conditional cdf, conditional pdf, total probability theorem? 6 Multiple Random Variables 6.0 INTRODUCTION scalar vs. random variable cdf, pdf transformation of a random variable conditional cdf, conditional pdf, total probability theorem expectation of a random

More information

Lecture 5: LDA and Logistic Regression

Lecture 5: LDA and Logistic Regression Lecture 5: and Logistic Regression Hao Helen Zhang Hao Helen Zhang Lecture 5: and Logistic Regression 1 / 39 Outline Linear Classification Methods Two Popular Linear Models for Classification Linear Discriminant

More information

Multivariate Regression

Multivariate Regression Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the

More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

Exercises Chapter 4 Statistical Hypothesis Testing

Exercises Chapter 4 Statistical Hypothesis Testing Exercises Chapter 4 Statistical Hypothesis Testing Advanced Econometrics - HEC Lausanne Christophe Hurlin University of Orléans December 5, 013 Christophe Hurlin (University of Orléans) Advanced Econometrics

More information

Independent Component (IC) Models: New Extensions of the Multinormal Model

Independent Component (IC) Models: New Extensions of the Multinormal Model Independent Component (IC) Models: New Extensions of the Multinormal Model Davy Paindaveine (joint with Klaus Nordhausen, Hannu Oja, and Sara Taskinen) School of Public Health, ULB, April 2008 My research

More information

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score

More information

Bayesian Decision Theory

Bayesian Decision Theory Bayesian Decision Theory Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2017 CS 551, Fall 2017 c 2017, Selim Aksoy (Bilkent University) 1 / 46 Bayesian

More information

Correlation analysis. Contents

Correlation analysis. Contents Correlation analysis Contents 1 Correlation analysis 2 1.1 Distribution function and independence of random variables.......... 2 1.2 Measures of statistical links between two random variables...........

More information

Multivariate Statistical Analysis

Multivariate Statistical Analysis Multivariate Statistical Analysis Fall 2011 C. L. Williams, Ph.D. Lecture 9 for Applied Multivariate Analysis Outline Addressing ourliers 1 Addressing ourliers 2 Outliers in Multivariate samples (1) For

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

STA 2201/442 Assignment 2

STA 2201/442 Assignment 2 STA 2201/442 Assignment 2 1. This is about how to simulate from a continuous univariate distribution. Let the random variable X have a continuous distribution with density f X (x) and cumulative distribution

More information

Vector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.

Vector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis. Vector spaces DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Vector space Consists of: A set V A scalar

More information

Statistics for scientists and engineers

Statistics for scientists and engineers Statistics for scientists and engineers February 0, 006 Contents Introduction. Motivation - why study statistics?................................... Examples..................................................3

More information

Lecture 11. Probability Theory: an Overveiw

Lecture 11. Probability Theory: an Overveiw Math 408 - Mathematical Statistics Lecture 11. Probability Theory: an Overveiw February 11, 2013 Konstantin Zuev (USC) Math 408, Lecture 11 February 11, 2013 1 / 24 The starting point in developing the

More information

MULTIVARIATE HOMEWORK #5

MULTIVARIATE HOMEWORK #5 MULTIVARIATE HOMEWORK #5 Fisher s dataset on differentiating species of Iris based on measurements on four morphological characters (i.e. sepal length, sepal width, petal length, and petal width) was subjected

More information

An Introduction to Multivariate Statistical Analysis

An Introduction to Multivariate Statistical Analysis An Introduction to Multivariate Statistical Analysis Third Edition T. W. ANDERSON Stanford University Department of Statistics Stanford, CA WILEY- INTERSCIENCE A JOHN WILEY & SONS, INC., PUBLICATION Contents

More information

Lecture 11. Multivariate Normal theory

Lecture 11. Multivariate Normal theory 10. Lecture 11. Multivariate Normal theory Lecture 11. Multivariate Normal theory 1 (1 1) 11. Multivariate Normal theory 11.1. Properties of means and covariances of vectors Properties of means and covariances

More information

Math 423/533: The Main Theoretical Topics

Math 423/533: The Main Theoretical Topics Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Multivariate random variables

Multivariate random variables DS-GA 002 Lecture notes 3 Fall 206 Introduction Multivariate random variables Probabilistic models usually include multiple uncertain numerical quantities. In this section we develop tools to characterize

More information

University of Cambridge Engineering Part IIB Module 3F3: Signal and Pattern Processing Handout 2:. The Multivariate Gaussian & Decision Boundaries

University of Cambridge Engineering Part IIB Module 3F3: Signal and Pattern Processing Handout 2:. The Multivariate Gaussian & Decision Boundaries University of Cambridge Engineering Part IIB Module 3F3: Signal and Pattern Processing Handout :. The Multivariate Gaussian & Decision Boundaries..15.1.5 1 8 6 6 8 1 Mark Gales mjfg@eng.cam.ac.uk Lent

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2009 Prof. Gesine Reinert Our standard situation is that we have data x = x 1, x 2,..., x n, which we view as realisations of random

More information

Multivariate Distributions

Multivariate Distributions Copyright Cosma Rohilla Shalizi; do not distribute without permission updates at http://www.stat.cmu.edu/~cshalizi/adafaepov/ Appendix E Multivariate Distributions E.1 Review of Definitions Let s review

More information

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

BASICS OF PROBABILITY

BASICS OF PROBABILITY October 10, 2018 BASICS OF PROBABILITY Randomness, sample space and probability Probability is concerned with random experiments. That is, an experiment, the outcome of which cannot be predicted with certainty,

More information

Lecture Note 1: Probability Theory and Statistics

Lecture Note 1: Probability Theory and Statistics Univ. of Michigan - NAME 568/EECS 568/ROB 530 Winter 2018 Lecture Note 1: Probability Theory and Statistics Lecturer: Maani Ghaffari Jadidi Date: April 6, 2018 For this and all future notes, if you would

More information

Lecture 2: Repetition of probability theory and statistics

Lecture 2: Repetition of probability theory and statistics Algorithms for Uncertainty Quantification SS8, IN2345 Tobias Neckel Scientific Computing in Computer Science TUM Lecture 2: Repetition of probability theory and statistics Concept of Building Block: Prerequisites:

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

COM336: Neural Computing

COM336: Neural Computing COM336: Neural Computing http://www.dcs.shef.ac.uk/ sjr/com336/ Lecture 2: Density Estimation Steve Renals Department of Computer Science University of Sheffield Sheffield S1 4DP UK email: s.renals@dcs.shef.ac.uk

More information

Classification Methods II: Linear and Quadratic Discrimminant Analysis

Classification Methods II: Linear and Quadratic Discrimminant Analysis Classification Methods II: Linear and Quadratic Discrimminant Analysis Rebecca C. Steorts, Duke University STA 325, Chapter 4 ISL Agenda Linear Discrimminant Analysis (LDA) Classification Recall that linear

More information

Quick Tour of Basic Probability Theory and Linear Algebra

Quick Tour of Basic Probability Theory and Linear Algebra Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra CS224w: Social and Information Network Analysis Fall 2011 Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra Outline Definitions

More information

1. Introduction to Multivariate Analysis

1. Introduction to Multivariate Analysis 1. Introduction to Multivariate Analysis Isabel M. Rodrigues 1 / 44 1.1 Overview of multivariate methods and main objectives. WHY MULTIVARIATE ANALYSIS? Multivariate statistical analysis is concerned with

More information

TAMS39 Lecture 2 Multivariate normal distribution

TAMS39 Lecture 2 Multivariate normal distribution TAMS39 Lecture 2 Multivariate normal distribution Martin Singull Department of Mathematics Mathematical Statistics Linköping University, Sweden Content Lecture Random vectors Multivariate normal distribution

More information