Multivariate Analysis and Likelihood Inference
Multivariate Analysis and Likelihood Inference

Outline
1. Joint Distribution of Random Variables
2. Principal Component Analysis (PCA)
3. Multivariate Normal Distribution
4. Likelihood Inference
Joint density function

The distribution of a random vector $X = (X_1, \dots, X_k)^T$ is characterized by its joint density function $f$ such that
$$P\{X_1 \le a_1, \dots, X_k \le a_k\} = \int_{-\infty}^{a_k} \cdots \int_{-\infty}^{a_1} f(x_1, \dots, x_k)\, dx_1 \cdots dx_k$$
when $X$ has a differentiable distribution function (the left-hand side above), and for the case of discrete $X_i$,
$$f(a_1, \dots, a_k) = P\{X_1 = a_1, \dots, X_k = a_k\}.$$

Change of variables

Suppose $X$ has a differentiable distribution function with density function $f_X$. Let $g : \mathbb{R}^k \to \mathbb{R}^k$ be continuously differentiable and one-to-one, and let $Y = g(X)$. Since $g$ is a one-to-one transformation, we can solve $Y = g(X)$ to obtain $X = h(Y)$; $h = (h_1, \dots, h_k)^T$ is the inverse of the transformation $g$. The density function $f_Y$ of $Y$ is given by
$$f_Y(y) = f_X(h(y))\, \bigl|\det(\partial h/\partial y)\bigr|,$$
where $\partial h/\partial y$ is the Jacobian matrix $(\partial h_i/\partial y_j)_{1 \le i,j \le k}$. The inverse function theorem of multivariate calculus gives $\partial h/\partial y = (\partial g/\partial x)^{-1}$, or equivalently $\partial x/\partial y = (\partial y/\partial x)^{-1}$.
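As a numerical sanity check (not part of the original notes), the change-of-variables formula can be verified for the one-dimensional map $g(x) = e^x$: if $X \sim N(0,1)$, then $f_Y(y) = f_X(\log y)\,|1/y|$ should coincide with the lognormal density. A minimal sketch using NumPy and SciPy:

```python
import numpy as np
from scipy import stats

# Y = g(X) = exp(X) with X ~ N(0, 1); inverse h(y) = log(y), dh/dy = 1/y.
# Change of variables: f_Y(y) = f_X(log y) * |1/y|, the lognormal density.
y = np.linspace(0.1, 5.0, 50)
f_Y = stats.norm.pdf(np.log(y)) / y          # f_X(h(y)) |det(dh/dy)|
f_lognorm = stats.lognorm.pdf(y, s=1.0)      # reference lognormal density (s = 1)
max_err = np.max(np.abs(f_Y - f_lognorm))    # should be at machine-precision level
```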
Mean and covariance matrix

The mean vector of $Y = (Y_1, \dots, Y_k)^T$ is $E(Y) = (EY_1, \dots, EY_k)^T$. The covariance matrix of $Y$ is
$$\mathrm{Cov}(Y) = (\mathrm{Cov}(Y_i, Y_j))_{1 \le i,j \le k} = E\bigl\{(Y - E(Y))(Y - E(Y))^T\bigr\}.$$
The following linearity (and bilinearity) properties hold for $EY$ (and $\mathrm{Cov}(Y)$). For a nonrandom $p \times k$ matrix $A$ and $p \times 1$ vector $B$,
$$E(AY + B) = A\,EY + B, \qquad \mathrm{Cov}(AY + B) = A\,\mathrm{Cov}(Y)\,A^T.$$

Eigenvalue and eigenvector

Definition 1. Let $V$ be a $p \times p$ matrix. A complex number $\lambda$ is called an eigenvalue of $V$ if there exists a $p \times 1$ vector $a \ne 0$ such that $Va = \lambda a$. Such a vector $a$ is called an eigenvector of $V$ corresponding to the eigenvalue $\lambda$.

We can rewrite $Va = \lambda a$ as $(V - \lambda I)a = 0$. Since $a \ne 0$, this implies that $\lambda$ is a solution of the equation $\det(V - \lambda I) = 0$. Since $\det(V - \lambda I)$ is a polynomial of degree $p$ in $\lambda$, there are $p$ eigenvalues, not necessarily distinct. If $V$ is symmetric, then all its eigenvalues are real and can be ordered as $\lambda_1 \ge \cdots \ge \lambda_p$. Moreover,
$$\mathrm{tr}(V) = \lambda_1 + \cdots + \lambda_p, \qquad \det(V) = \lambda_1 \cdots \lambda_p.$$
If $a$ is an eigenvector of $V$ corresponding to the eigenvalue $\lambda$, then so is $ca$ for $c \ne 0$. Moreover, premultiplying $\lambda a = Va$ by $a^T$ yields $\lambda = a^T V a / \|a\|^2$.
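The facts $\mathrm{tr}(V) = \sum_i \lambda_i$, $\det(V) = \prod_i \lambda_i$, and the Rayleigh-quotient identity $\lambda = a^T V a / \|a\|^2$ can be checked numerically; a small sketch with NumPy (the random symmetric matrix here is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
V = A @ A.T                      # symmetric nonnegative definite, like a covariance matrix
lam, vecs = np.linalg.eigh(V)    # real eigenvalues since V is symmetric (ascending order)
trace_ok = np.isclose(np.trace(V), lam.sum())       # tr(V) = sum of eigenvalues
det_ok = np.isclose(np.linalg.det(V), lam.prod())   # det(V) = product of eigenvalues
# Premultiplying Va = lambda*a by a^T gives lambda = a^T V a / ||a||^2
a = vecs[:, -1]                  # eigenvector of the largest eigenvalue
rayleigh = a @ V @ a / (a @ a)
```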
Eigenvalue and eigenvector

We shall now focus on the case where $V$ is the covariance matrix of a random vector $x = (X_1, \dots, X_p)^T$.

Proposition 1. The eigenvalues of $V$ are real and nonnegative since $V$ is symmetric and nonnegative definite.

Consider the linear combination $a^T x$ with $\|a\| = 1$ that has the largest variance over all such linear combinations. Maximizing $a^T V a$ ($= \mathrm{Var}(a^T x)$) over $a$ with $\|a\| = 1$ yields $Va = \lambda a$. Since $a \ne 0$, this implies that $\lambda$ is an eigenvalue of $V$ and $a$ is the corresponding eigenvector, and that $\lambda = a^T V a$. Denote $\lambda_1 = \max_{a : \|a\| = 1} a^T V a$ and let $a_1$ be the corresponding eigenvector with $\|a_1\| = 1$.

Next consider $a^T x$ that maximizes $\mathrm{Var}(a^T x) = a^T V a$ subject to $a_1^T a = 0$ and $\|a\| = 1$. Its solution should satisfy
$$\frac{\partial}{\partial a_i}\bigl\{a^T V a + \lambda(1 - a^T a) + \eta\, a_1^T a\bigr\} = 0 \quad \text{for } i = 1, \dots, p.$$
This implies that $\lambda$ is an eigenvalue of $V$ with corresponding unit eigenvector $a_2$ that is orthogonal to $a_1$. Proceeding inductively in this way, we obtain the eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$ of $V$ with the optimization characterization
$$\lambda_{k+1} = \max_{a : \|a\| = 1,\ a^T a_j = 0 \text{ for } 1 \le j \le k} a^T V a.$$
The maximizer $a_{k+1}$ of the above equation is an eigenvector corresponding to the eigenvalue $\lambda_{k+1}$.

Definition 2. $a_i^T x$ is called the $i$th principal component of $x$.
Principal components

(a) From the preceding derivation, $\lambda_i = \mathrm{Var}(a_i^T x)$.

(b) The elements of the eigenvectors $a_i$ are called factor loadings in PCA. Note that $(a_1, \dots, a_p)$ is an orthogonal matrix, i.e.,
$$I = (a_1, \dots, a_p)(a_1, \dots, a_p)^T = a_1 a_1^T + \cdots + a_p a_p^T.$$
Moreover,
$$V = \lambda_1 a_1 a_1^T + \cdots + \lambda_p a_p a_p^T.$$

(c) Since $V = \mathrm{Cov}(x)$ and $x = (X_1, \dots, X_p)^T$, $\mathrm{tr}(V) = \sum_{i=1}^p \mathrm{Var}(X_i)$. Hence it follows from the fact $\mathrm{tr}(V) = \sum_{i=1}^p \lambda_i$ that
$$\lambda_1 + \cdots + \lambda_p = \sum_{i=1}^p \mathrm{Var}(X_i).$$

PCA of sample covariance matrix

Suppose $x_1, \dots, x_n$ are $n$ independent observations from a multivariate population with mean $\mu$ and covariance matrix $V$. The mean vector $\mu$ can be estimated by $\bar{x} = \sum_{i=1}^n x_i / n$, and the covariance matrix can be estimated by the sample covariance matrix
$$\widehat{V} = \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T / (n - 1).$$
Let $\hat{a}_j = (\hat{a}_{1j}, \dots, \hat{a}_{pj})^T$ be the eigenvector corresponding to the $j$th largest eigenvalue $\hat{\lambda}_j$ of the sample covariance matrix $\widehat{V}$.
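The sample-covariance PCA described above can be sketched in a few lines of NumPy (the simulated data and population covariance matrix are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
pop_cov = np.array([[4.0, 1.0, 0.0],
                    [1.0, 2.0, 0.0],
                    [0.0, 0.0, 1.0]])        # illustrative population covariance
x = rng.multivariate_normal(np.zeros(3), pop_cov, size=n)
xbar = x.mean(axis=0)
V_hat = (x - xbar).T @ (x - xbar) / (n - 1)  # sample covariance matrix
lam, a = np.linalg.eigh(V_hat)               # ascending order; reverse for largest first
lam, a = lam[::-1], a[:, ::-1]               # lam[j-1], a[:, j-1] = j-th largest pair
total_var = np.trace(V_hat)                  # eigenvalues sum to the total variance
```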
PCA of sample covariance matrix

For fixed $p$, the following asymptotic results have been established under the assumption $\lambda_1 > \cdots > \lambda_p > 0$ as $n \to \infty$:

(i) $\sqrt{n}(\hat{a}_i - a_i)$ has a limiting normal distribution with mean $0$ and covariance matrix
$$\sum_{h \ne i} \frac{\lambda_i \lambda_h}{(\lambda_i - \lambda_h)^2}\, a_h a_h^T.$$
Note that $\hat{a}_i$ and $a_i$ are chosen so that they have the same sign.

(ii) $\sqrt{n}(\hat{\lambda}_i - \lambda_i)$ has a limiting $N(0, 2\lambda_i^2)$ distribution. Moreover, for $i \ne j$, the limiting distribution of $\sqrt{n}(\hat{\lambda}_i - \lambda_i, \hat{\lambda}_j - \lambda_j)$ is that of two independent normals.

Approximating the covariance structure

Even when $p$ is large, if many eigenvalues $\lambda_h$ are small compared with the largest $\lambda_i$'s, the covariance structure of the data can be approximated by a few principal components with large eigenvalues.
Representation of data in terms of principal components

Let $x_i = (x_{i1}, \dots, x_{ip})^T$, $i = 1, \dots, n$, be the multivariate sample. Let $X_k = (x_{1k}, \dots, x_{nk})^T$, $1 \le k \le p$, and define
$$Y_j = \hat{a}_{1j} X_1 + \cdots + \hat{a}_{pj} X_p, \quad 1 \le j \le p,$$
where $\hat{a}_j = (\hat{a}_{1j}, \dots, \hat{a}_{pj})^T$ is the eigenvector corresponding to the $j$th largest eigenvalue $\hat{\lambda}_j$ of the sample covariance matrix $\widehat{V}$, with $\|\hat{a}_j\| = 1$. Since the matrix $\widehat{A} := (\hat{a}_{ij})_{1 \le i,j \le p}$ is orthogonal (i.e., $\widehat{A}\widehat{A}^T = I$), it follows that the observed data $X_k$ can be expressed in terms of the principal components $Y_j$ as
$$X_k = \hat{a}_{k1} Y_1 + \cdots + \hat{a}_{kp} Y_p.$$
Moreover, the sample correlation matrix of the transformed data $y_{tj}$ is the identity matrix.

PCA of correlation matrices

An alternative to $\mathrm{Cov}(x)$ is the correlation matrix $\mathrm{Corr}(x)$. Since a primary goal of PCA is to uncover low-dimensional subspaces around which $x$ tends to concentrate, scaling the components of $x$ by their standard deviations may work better for this purpose, and applying PCA to sample correlation matrices may give more stable results.
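A short sketch (with simulated, illustrative data) confirming both the reconstruction $X_k = \sum_j \hat{a}_{kj} Y_j$ and the fact that the sample principal components are uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400
mix = np.array([[1.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 1.0]])      # hypothetical mixing matrix for the sample
x = rng.standard_normal((n, 3)) @ mix
xc = x - x.mean(axis=0)                # centered data
V_hat = xc.T @ xc / (n - 1)
lam, A = np.linalg.eigh(V_hat)         # columns of A are the unit eigenvectors
Y = xc @ A                             # columns are the sample principal components
reconstructed = Y @ A.T                # X_k = sum_j a_kj Y_j, since A A^T = I
max_recon_err = np.max(np.abs(reconstructed - xc))
cov_Y = Y.T @ Y / (n - 1)              # diagonal: principal components are uncorrelated
off_diag = cov_Y - np.diag(np.diag(cov_Y))
```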
Example: PCA of U.S. Treasury-LIBOR swap rates

[Figure 1: Daily swap rates of four of the eight maturities $T = 1, 2, 3, 4, 5, 7, 10, 30$ from July 3, 2000 to July 15, 2005.]

We now apply PCA to account for the variance of daily changes in swap rates with a few principal components (or factors).

[Figure 2: The differenced series for 1-year (top) and 30-year (bottom) swap rates.]

Let $d_{kt} = r_{kt} - r_{k,t-1}$ denote the daily changes in the $k$-year swap rates during the period.
Example: PCA of U.S. Treasury-LIBOR swap rates

The sample mean vector $\hat{\mu}$, sample covariance matrix $\widehat{\Sigma}$, and sample correlation matrix $\widehat{R}$ are computed from the differenced data $\{(d_{1t}, \dots, d_{8t}),\ 1 \le t \le 1256\}$. [Numerical entries of $\hat{\mu}$, $\widehat{\Sigma}$, and $\widehat{R}$ not recovered in transcription.]
Example: PCA of U.S. Treasury-LIBOR swap rates

[Table 1: PCA of covariance matrix of swap rate data: eigenvalues $\hat{\lambda}_i$, proportions of variance, and factor loadings for PC1-PC8; numerical entries not recovered in transcription.]

[Table 2: PCA of correlation matrix of swap rate data: eigenvalues $\hat{\lambda}_i$, proportions of variance, and factor loadings for PC1-PC8; numerical entries not recovered in transcription.]
Example: PCA of U.S. Treasury-LIBOR swap rates

Let $x_{tk} = (d_{tk} - \hat{\mu}_k)/\hat{\sigma}_k$, where $\hat{\sigma}_k$ is the sample standard deviation of $d_{tk}$. We represent $X_k = (x_{1k}, \dots, x_{nk})^T$ in terms of the principal components $Y_1, \dots, Y_8$:
$$X_k = \hat{a}_{k1} Y_1 + \cdots + \hat{a}_{kp} Y_p.$$
Tables 1 and 2 show the PCA results using the sample covariance matrix $\widehat{\Sigma}$ and the sample correlation matrix $\widehat{R}$. From Table 2, the proportions of the overall variance explained by the first three principal components are 90.8%, 6.7%, and 1.3%, respectively. Hence 98.9% of the overall variance is explained by the first three principal components, yielding the approximation
$$(d_k - \hat{\mu}_k)/\hat{\sigma}_k \doteq \hat{a}_{k1} Y_1 + \hat{a}_{k2} Y_2 + \hat{a}_{k3} Y_3$$
for the differenced swap rates.

[Figure 3: Eigenvectors of the first three principal components, which represent the parallel shift, tilt, and convexity components of swap rate movements.]
Example: PCA of U.S. Treasury-LIBOR swap rates

The graphs of the factor loadings show the following constituents of interest rate or yield curve movements documented in the literature:

(a) Parallel shift component. The factor loadings of the first principal component are roughly constant over different maturities. This means that a change in the swap rate for one maturity is accompanied by roughly the same change for other maturities.

(b) Tilt component. The factor loadings of the second principal component change monotonically with maturity. Changes in short-maturity and long-maturity swap rates in this component have opposite signs.

(c) Curvature component. The factor loadings of the third principal component are different for the midterm rates and the average of short- and long-term rates, revealing a curvature of the graph that resembles the convex shape of the relationship between the rates and their maturities.

Joint, marginal and conditional distributions

An $m \times 1$ random vector $Y$ is said to have a multivariate normal distribution $N(\mu, V)$ if its density function is
$$f(y) = \frac{1}{(2\pi)^{m/2} \sqrt{\det(V)}} \exp\Bigl\{-\frac{1}{2}(y - \mu)^T V^{-1} (y - \mu)\Bigr\}, \quad y \in \mathbb{R}^m.$$
Suppose that $Y \sim N(\mu, V)$ is partitioned as
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad V = \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix},$$
where $Y_1$ and $\mu_1$ have dimension $r\,(< m)$ and $V_{11}$ is $r \times r$. Then $Y_1 \sim N(\mu_1, V_{11})$, $Y_2 \sim N(\mu_2, V_{22})$, and the conditional distribution of $Y_1$ given $Y_2 = y_2$ is
$$N\bigl(\mu_1 + V_{12} V_{22}^{-1}(y_2 - \mu_2),\ V_{11} - V_{12} V_{22}^{-1} V_{21}\bigr).$$
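The conditional-distribution formula can be evaluated directly; the following sketch uses hypothetical numbers for $\mu$, $V$, and the conditioning value $y_2$:

```python
import numpy as np

# Partitioned multivariate normal: conditional distribution of Y1 given Y2 = y2.
mu = np.array([1.0, 2.0, 0.0])               # illustrative mean vector
V = np.array([[2.0, 0.5, 0.3],
              [0.5, 1.0, 0.2],
              [0.3, 0.2, 1.5]])              # illustrative covariance matrix
r = 1                                        # Y1 is the first component
mu1, mu2 = mu[:r], mu[r:]
V11, V12 = V[:r, :r], V[:r, r:]
V21, V22 = V[r:, :r], V[r:, r:]
y2 = np.array([2.5, -0.5])                   # observed value of Y2
cond_mean = mu1 + V12 @ np.linalg.solve(V22, y2 - mu2)   # mu1 + V12 V22^{-1}(y2 - mu2)
cond_cov = V11 - V12 @ np.linalg.solve(V22, V21)         # V11 - V12 V22^{-1} V21
```

Conditioning always shrinks the variance: here `cond_cov[0, 0]` is strictly smaller than the marginal variance `V[0, 0]`.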
Orthogonality and independence

Let $\zeta$ be a standard normal random vector of dimension $p$ and let $A$, $B$ be $r \times p$ and $q \times p$ nonrandom matrices such that $AB^T = 0$. Then the components of $U_1 := A\zeta$ are uncorrelated with those of $U_2 := B\zeta$ because $E(U_1 U_2^T) = A\,E(\zeta\zeta^T)\,B^T = 0$. Note that
$$U = \begin{pmatrix} U_1 \\ U_2 \end{pmatrix} = \begin{pmatrix} A \\ B \end{pmatrix} \zeta$$
has a multivariate normal distribution with mean $0$ and covariance matrix
$$V = \begin{pmatrix} AA^T & 0 \\ 0 & BB^T \end{pmatrix}.$$
Then the density function of $U$ is a product of the density functions of $U_1$ and $U_2$, so $U_1$ and $U_2$ are independent. Note also that $U_1^T U_2 = 0$; i.e., they are orthogonal vectors.

Sample covariance matrix and Wishart distribution

Let $Y_1, \dots, Y_n$ be independent $m \times 1$ random $N(\mu, \Sigma)$ vectors with $n > m$ and positive definite $\Sigma$. Define
$$\bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i, \qquad W = \sum_{i=1}^n (Y_i - \bar{Y})(Y_i - \bar{Y})^T.$$
It can be shown that the sample mean vector $\bar{Y} \sim N(\mu, \Sigma/n)$, and that $\bar{Y}$ and the sample covariance matrix $W/(n-1)$ are independent. We next generalize $\sum_{t=1}^n (y_t - \bar{y})^2 / \sigma^2 \sim \chi^2_{n-1}$ to the multivariate case.

Wishart distribution

Suppose $Y_1, \dots, Y_n$ are independent $m \times 1$ $N(0, \Sigma)$ random vectors. Then the random matrix $W = \sum_{i=1}^n Y_i Y_i^T$ is said to have a Wishart distribution, denoted by $W_m(\Sigma, n)$.
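The construction $W = \sum_{i=1}^n Y_i Y_i^T$ gives a direct way to simulate Wishart draws; a Monte Carlo sketch (with an illustrative $\Sigma$) checking the mean property $E(W) = n\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n_df, B = 2, 10, 2000
Sigma = np.array([[2.0, 0.7],
                  [0.7, 1.0]])               # hypothetical covariance matrix
L = np.linalg.cholesky(Sigma)
# W = sum_{i=1}^{n} Y_i Y_i^T with Y_i ~ N(0, Sigma) has the W_m(Sigma, n) distribution.
Z = rng.standard_normal((B, n_df, m)) @ L.T  # B replicates of n_df N(0, Sigma) vectors
W = np.einsum('bij,bik->bjk', Z, Z)          # B Wishart draws (sum of outer products)
W_mean = W.mean(axis=0)                      # Monte Carlo estimate of E(W) = n * Sigma
rel_err = np.max(np.abs(W_mean - n_df * Sigma)) / np.max(n_df * Sigma)
```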
Wishart distribution

The density function of the Wishart distribution $W_m(\Sigma, n)$ generalizes that of the chi-square distribution:
$$f(W) = \frac{\det(W)^{(n-m-1)/2} \exp\bigl\{-\frac{1}{2}\mathrm{tr}(\Sigma^{-1} W)\bigr\}}{[2^m \det(\Sigma)]^{n/2}\, \Gamma_m(n/2)}, \quad W > 0,$$
in which $W > 0$ denotes that $W$ is positive definite and $\Gamma_m(\cdot)$ denotes the multivariate gamma function
$$\Gamma_m(t) = \pi^{m(m-1)/4} \prod_{i=1}^m \Gamma\Bigl(t - \frac{i-1}{2}\Bigr).$$
Note that the usual gamma function corresponds to the case $m = 1$.

The following properties of the Wishart distribution are generalizations of some well-known properties of the chi-square and gamma$(\alpha, \beta)$ distributions.

(a) If $W \sim W_m(\Sigma, n)$, then $E(W) = n\Sigma$.

(b) Let $W_1, \dots, W_k$ be independently distributed with $W_j \sim W_m(\Sigma, n_j)$, $j = 1, \dots, k$. Then $\sum_{j=1}^k W_j \sim W_m(\Sigma, \sum_{j=1}^k n_j)$.

(c) Let $W \sim W_m(\Sigma, n)$ and let $A$ be a nonrandom $m \times m$ nonsingular matrix. Then $AWA^T \sim W_m(A\Sigma A^T, n)$. In particular, $a^T W a \sim (a^T \Sigma a)\,\chi^2_n$ for all nonrandom vectors $a \ne 0$.
Multivariate-t distribution

If $Z \sim N(0, \Sigma)$ and $W \sim W_m(\Sigma, k)$ such that the $m \times 1$ vector $Z$ and the $m \times m$ matrix $W$ are independent, then $(W/k)^{-1/2} Z$ is said to have the $m$-variate $t$-distribution with $k$ degrees of freedom. This distribution has the density function
$$f(t) = \frac{\Gamma\bigl((k+m)/2\bigr)}{(\pi k)^{m/2}\, \Gamma(k/2)} \Bigl(1 + \frac{t^T t}{k}\Bigr)^{-(k+m)/2}, \quad t \in \mathbb{R}^m.$$
If $t$ has the $m$-variate $t$-distribution with $k$ degrees of freedom such that $k \ge m$, then
$$\frac{k - m + 1}{km}\, t^T t \ \text{has the } F_{m, k-m+1}\text{-distribution}. \tag{1}$$

Applying (1) to Hotelling's $T^2$-statistic
$$T^2 = n(\bar{Y} - \mu)^T \bigl(W/(n-1)\bigr)^{-1} (\bar{Y} - \mu), \tag{2}$$
where $\bar{Y}$ and $W$ are the sample mean vector and the matrix of centered outer products defined above, yields
$$\frac{n - m}{m(n-1)}\, T^2 \sim F_{m, n-m}, \tag{3}$$
noting that
$$\bigl[W/(n-1)\bigr]^{-1/2} \sqrt{n}\,(\bar{Y} - \mu) = \bigl[W_m(\Sigma, n-1)/(n-1)\bigr]^{-1/2} N(0, \Sigma)$$
has the multivariate $t$-distribution.
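Hotelling's $T^2$ and the $F$-statistic in (3) can be computed as follows; the sample here is simulated under a hypothetical $\mu$ and $\Sigma$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, m = 50, 3
mu = np.zeros(m)                               # hypothesized mean under H0
Sigma = np.eye(m) + 0.5                        # illustrative covariance matrix
Y = rng.multivariate_normal(mu, Sigma, size=n)
Ybar = Y.mean(axis=0)
W = (Y - Ybar).T @ (Y - Ybar)                  # (n - 1) times the sample covariance
S = W / (n - 1)
T2 = n * (Ybar - mu) @ np.linalg.solve(S, Ybar - mu)  # Hotelling's T^2
F_stat = (n - m) / (m * (n - 1)) * T2                 # ~ F_{m, n-m} under H0
p_value = 1 - stats.f.cdf(F_stat, m, n - m)
```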
Multivariate-t distribution

In the definition of the multivariate $t$-distribution, it is assumed that $Z$ and $W$ share the same $\Sigma$. More generally, we can consider the case where $Z \sim N(0, V)$ instead. By considering $V^{-1/2} Z$ instead of $Z$, we can assume that $V = I_m$. Then the density function of $(W/k)^{-1/2} Z$, with independent $Z \sim N(0, I_m)$ and $W \sim W_m(\Sigma, k)$, has the general form
$$f(t) = \frac{\Gamma\bigl((k+m)/2\bigr)}{(\pi k)^{m/2}\, \Gamma(k/2)\, \sqrt{\det(\Sigma)}} \Bigl(1 + \frac{t^T \Sigma^{-1} t}{k}\Bigr)^{-(k+m)/2}, \quad t \in \mathbb{R}^m.$$
This density function can be used to define multivariate $t$-copulas.

Method of maximum likelihood (MLE)

Suppose $X_1, \dots, X_n$ have joint density function $f_\theta(x_1, \dots, x_n)$, $\theta \in \mathbb{R}^p$. The likelihood function based on the observations $X_1, \dots, X_n$ is $L(\theta) = f_\theta(X_1, \dots, X_n)$, and the MLE $\hat{\theta}$ of $\theta$ is the maximizer of $L(\theta)$, or equivalently of the log-likelihood $l(\theta) = \log f_\theta(X_1, \dots, X_n)$.

Example: MLE in Gaussian regression models

Consider the regression model $y_t = \beta^T x_t + \epsilon_t$, in which the regressors $x_t$ are measurable with respect to $\mathcal{F}_{t-1} = \sigma\{y_{t-1}, x_{t-1}, \dots, y_1, x_1\}$, and the $\epsilon_t$ are i.i.d. $N(0, \sigma^2)$. Letting $\theta = (\beta, \sigma^2)$, the likelihood function is
$$L(\theta) = \prod_{t=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\bigl\{-(y_t - \beta^T x_t)^2 / (2\sigma^2)\bigr\}.$$
It then follows that the MLE of $\beta$ is the same as the OLS estimator $\hat{\beta}$ that minimizes $\sum_{t=1}^n (y_t - \beta^T x_t)^2$, and the MLE of $\sigma^2$ is $\hat{\sigma}^2 = n^{-1} \sum_{t=1}^n (y_t - \hat{\beta}^T x_t)^2$.
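A sketch (with simulated regressors and a hypothetical $\beta$) illustrating that the MLE of $\beta$ coincides with OLS and that the MLE of $\sigma^2$ uses the divisor $n$ rather than $n - p$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 2
X = rng.standard_normal((n, p))              # illustrative regressors
beta_true = np.array([1.5, -0.5])            # hypothetical true coefficients
sigma_true = 0.3
y = X @ beta_true + sigma_true * rng.standard_normal(n)
# The Gaussian MLE of beta minimizes sum (y_t - beta^T x_t)^2, i.e., it is OLS.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2_hat = np.mean(resid ** 2)             # MLE of sigma^2: divisor n, not n - p
```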
Consistency of MLE

The classical asymptotic theory of maximum likelihood deals with i.i.d. $X_t$, so that $l(\theta) = \sum_{t=1}^n \log f_\theta(X_t)$. Let $\theta_0$ denote the true parameter value. By the law of large numbers,
$$n^{-1} l(\theta) \to E_{\theta_0}\{\log f_\theta(X_1)\} \quad \text{as } n \to \infty, \tag{4}$$
with probability 1, for every fixed $\theta$.

Jensen's inequality

Suppose $X$ has a finite mean and $\varphi : \mathbb{R} \to \mathbb{R}$ is convex, i.e., $\lambda\varphi(x) + (1 - \lambda)\varphi(y) \ge \varphi(\lambda x + (1 - \lambda)y)$ for all $0 < \lambda < 1$ and $x, y \in \mathbb{R}$. (If the inequality is strict, then $\varphi$ is called strictly convex.) Then $E\varphi(X) \ge \varphi(EX)$, and the inequality is strict if $\varphi$ is strictly convex, unless $X$ is degenerate (i.e., it takes only one value).

Consistency of MLE

Since $-\log x$ is a (strictly) convex function, it follows from Jensen's inequality that
$$E_{\theta_0}\Bigl(\log \frac{f_\theta(X_1)}{f_{\theta_0}(X_1)}\Bigr) \le \log E_{\theta_0}\Bigl(\frac{f_\theta(X_1)}{f_{\theta_0}(X_1)}\Bigr) = \log 1 = 0;$$
the inequality is strict unless the random variable $f_\theta(X_1)/f_{\theta_0}(X_1)$ is degenerate under $P_{\theta_0}$, which means $P_{\theta_0}\{f_\theta(X_1) = f_{\theta_0}(X_1)\} = 1$ since $E_{\theta_0}\{f_\theta(X_1)/f_{\theta_0}(X_1)\} = 1$. Thus, except in the case where $f_\theta(X_1) = f_{\theta_0}(X_1)$ with probability 1,
$$E_{\theta_0} \log f_\theta(X_1) < E_{\theta_0} \log f_{\theta_0}(X_1) \quad \text{for } \theta \ne \theta_0.$$
This suggests that the MLE is consistent (i.e., it converges to $\theta_0$ with probability 1 as $n \to \infty$). A rigorous proof of consistency of the MLE involves sharpening the argument in (4) by considering (instead of fixed $\theta$) $\sup_{\theta \in U} n^{-1} l(\theta)$ over neighborhoods $U$ of $\theta \ne \theta_0$.
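The strict form of Jensen's inequality underlying this consistency argument, $E \log X < \log EX$ for nondegenerate positive $X$, can be illustrated by simulation (an Exponential(1) sample is used here purely as an example):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.exponential(1.0, size=100_000)   # nondegenerate positive random variable
lhs = np.mean(np.log(x))                 # estimate of E log X  (= -Euler gamma here)
rhs = np.log(np.mean(x))                 # estimate of log E X  (= log 1 = 0 here)
gap = rhs - lhs                          # strictly positive: log is strictly concave
```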
Asymptotic normality of MLE

Suppose $f_\theta(x)$ is smooth in $\theta$. Then we can find the MLE by solving the equation $\nabla l(\theta) = 0$, where $\nabla l$ is the (gradient) vector of partial derivatives $\partial l / \partial \theta_i$. For large $n$, the consistent estimate $\hat{\theta}$ is near $\theta_0$, and therefore Taylor's theorem yields
$$0 = \nabla l(\hat{\theta}) = \nabla l(\theta_0) + \nabla^2 l(\theta^*)(\hat{\theta} - \theta_0), \tag{5}$$
where $\theta^*$ lies between $\hat{\theta}$ and $\theta_0$ (and is therefore close to $\theta_0$) and
$$\nabla^2 l(\theta) = \Bigl(\frac{\partial^2 l}{\partial \theta_i \partial \theta_j}\Bigr)_{1 \le i,j \le p} \tag{6}$$
is the (Hessian) matrix of second partial derivatives. From (5) it follows that
$$\hat{\theta} - \theta_0 = -\bigl(\nabla^2 l(\theta^*)\bigr)^{-1} \nabla l(\theta_0). \tag{7}$$

The central limit theorem (CLT) can be applied to
$$\nabla l(\theta_0) = \sum_{t=1}^n \nabla \log f_\theta(X_t)\big|_{\theta = \theta_0}, \tag{8}$$
noting that the gradient vector $\nabla \log f_\theta(X_t)|_{\theta = \theta_0}$ has mean $0$ and covariance matrix
$$I(\theta_0) = \Bigl(-E_{\theta_0}\Bigl\{\frac{\partial^2}{\partial \theta_i \partial \theta_j} \log f_\theta(X_t)\Big|_{\theta = \theta_0}\Bigr\}\Bigr)_{1 \le i,j \le p}. \tag{9}$$
The matrix $I(\theta_0)$ is called the Fisher information matrix. By the law of large numbers,
$$n^{-1} \nabla^2 l(\theta_0) = n^{-1} \sum_{t=1}^n \nabla^2 \log f_\theta(X_t)\big|_{\theta = \theta_0} \to -I(\theta_0) \tag{10}$$
with probability 1.
Asymptotic normality of MLE

Moreover, $n^{-1} \nabla^2 l(\theta^*) = n^{-1} \nabla^2 l(\theta_0) + o(1)$ with probability 1, by the consistency of $\hat{\theta}$. From (7) and (10), it follows that
$$\sqrt{n}\,(\hat{\theta} - \theta_0) = -\bigl(n^{-1} \nabla^2 l(\theta^*)\bigr)^{-1} \bigl(\nabla l(\theta_0)/\sqrt{n}\bigr) \tag{11}$$
converges in distribution to $N(0, I^{-1}(\theta_0))$ as $n \to \infty$, in view of the following theorem.

Slutsky's theorem

If $X_n$ converges in distribution to $X$ and $Y_n$ converges in probability to some nonrandom $C$, then $Y_n X_n$ converges in distribution to $CX$.

It is often more convenient to work with the observed Fisher information matrix $-\nabla^2 l(\hat{\theta})$ than with the Fisher information matrix $nI(\theta_0)$, and to use the following variant of the limiting normal distribution of (11):
$$(\hat{\theta} - \theta_0)^T \bigl(-\nabla^2 l(\hat{\theta})\bigr)(\hat{\theta} - \theta_0) \ \text{has a limiting } \chi^2_p\text{-distribution}. \tag{12}$$

The preceding asymptotic theory has assumed that the $X_t$ are i.i.d. so that the law of large numbers and the CLT can be applied. More generally, without assuming that the $X_t$ are independent, the log-likelihood function can still be written as a sum:
$$l_n(\theta) = \sum_{t=1}^n \log f_\theta(X_t \mid X_1, \dots, X_{t-1}). \tag{13}$$
It can be shown that $\{\nabla l_n(\theta_0),\ n \ge 1\}$ is a martingale, or equivalently,
$$E\bigl\{\nabla \log f_\theta(X_t \mid X_1, \dots, X_{t-1})\big|_{\theta = \theta_0} \,\bigm|\, X_1, \dots, X_{t-1}\bigr\} = 0,$$
and martingale strong laws and CLTs can be used to establish the consistency and asymptotic normality of $\hat{\theta}$ under certain conditions, so (12) still holds in this more general setting.
Asymptotic inference: confidence ellipsoid

In view of (12), an approximate $100(1-\alpha)\%$ confidence ellipsoid for $\theta_0$ is
$$\bigl\{\theta : (\hat{\theta} - \theta)^T \bigl(-\nabla^2 l_n(\hat{\theta})\bigr)(\hat{\theta} - \theta) \le \chi^2_{p;\,1-\alpha}\bigr\}, \tag{14}$$
where $\chi^2_{p;\,1-\alpha}$ is the $(1-\alpha)$th quantile of the $\chi^2_p$-distribution.

Asymptotic inference: hypothesis testing

To test $H_0 : \theta \in \Theta_0$, where $\Theta_0$ is a $q$-dimensional subspace of the parameter space with $0 \le q < p$, the generalized likelihood ratio (GLR) statistic is
$$\Lambda = 2\bigl\{l_n(\hat{\theta}) - \sup_{\theta \in \Theta_0} l_n(\theta)\bigr\}, \tag{15}$$
where $l_n(\theta)$ is the log-likelihood function defined in (13). The derivation of the limiting normal distribution of (11) can be modified to show that
$$\Lambda \ \text{has a limiting } \chi^2_{p-q} \text{ distribution under } H_0. \tag{16}$$
Hence the GLR test with significance level (type I error probability) $\alpha$ rejects $H_0$ if $\Lambda$ exceeds $\chi^2_{p-q;\,1-\alpha}$.
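A sketch of the GLR test for a one-parameter exponential model (the model and null hypothesis are illustrative, not from the notes): here $p = 1$ and $q = 0$, so $\Lambda$ is compared with the $\chi^2_1$ distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 200
theta0 = 1.0                        # H0: theta = theta0 (a point null, so q = 0, p = 1)
x = rng.exponential(1 / theta0, size=n)
theta_hat = 1 / x.mean()            # unrestricted MLE of the Exponential rate

def loglik(theta):
    # Exponential(theta) log-likelihood: n log(theta) - theta * sum(x)
    return n * np.log(theta) - theta * x.sum()

Lambda = 2 * (loglik(theta_hat) - loglik(theta0))   # GLR statistic (15)
p_value = 1 - stats.chi2.cdf(Lambda, df=1)          # limiting chi^2_{p-q} = chi^2_1
```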
Asymptotic inference: delta method

Let $g : \mathbb{R}^p \to \mathbb{R}$. If $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$. If $g$ is continuously differentiable in some neighborhood of the true parameter value $\theta_0$, then
$$g(\hat{\theta}) = g(\theta_0) + \bigl(\nabla g(\theta^*)\bigr)^T (\hat{\theta} - \theta_0),$$
where $\theta^*$ lies between $\hat{\theta}$ and $\theta_0$. Therefore $g(\hat{\theta}) - g(\theta_0)$ is asymptotically normal with mean $0$ and variance
$$\bigl(\nabla g(\hat{\theta})\bigr)^T \bigl(-\nabla^2 l_n(\hat{\theta})\bigr)^{-1} \bigl(\nabla g(\hat{\theta})\bigr). \tag{17}$$
This is called the delta method for evaluating the asymptotic variance of $g(\hat{\theta})$. The above asymptotic normality of $g(\hat{\theta}) - g(\theta_0)$ can be used to construct confidence intervals for $g(\theta_0)$.

Parametric bootstrap

When $\theta_0$ is unknown, the parametric bootstrap draws bootstrap samples $(X_{b,1}, \dots, X_{b,n})$, $b = 1, \dots, B$, from $F_{\hat{\theta}}$, where $\hat{\theta}$ is the MLE of $\theta$. These bootstrap samples can then be used to estimate the bias and standard error of $\hat{\theta}$. To construct a bootstrap confidence region for $g(\theta_0)$, we can use
$$\bigl(g(\hat{\theta}) - g(\theta_0)\bigr) \Big/ \sqrt{\bigl(\nabla g(\hat{\theta})\bigr)^T \bigl(-\nabla^2 l_n(\hat{\theta})\bigr)^{-1} \bigl(\nabla g(\hat{\theta})\bigr)}$$
as an approximate pivot whose (asymptotic) distribution does not depend on the unknown $\theta_0$, and use the quantiles of its bootstrap distribution in place of standard normal quantiles.
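A parametric-bootstrap sketch for a hypothetical Exponential($\theta$) model, estimating the bias and standard error of the MLE $\hat{\theta} = 1/\bar{x}$; the bootstrap standard error should be close to the asymptotic value $\hat{\theta}/\sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(6)
n, B = 100, 500
theta0 = 2.0                       # hypothetical true rate (unknown in practice)
x = rng.exponential(1 / theta0, size=n)
theta_hat = 1 / x.mean()           # MLE of the rate
# Parametric bootstrap: resample from the fitted model F_{theta_hat}.
boot = np.empty(B)
for b in range(B):
    xb = rng.exponential(1 / theta_hat, size=n)
    boot[b] = 1 / xb.mean()        # MLE recomputed on each bootstrap sample
bias_est = boot.mean() - theta_hat # bootstrap estimate of the bias of theta_hat
se_est = boot.std(ddof=1)          # bootstrap standard error of theta_hat
```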
More informationMLES & Multivariate Normal Theory
Merlise Clyde September 6, 2016 Outline Expectations of Quadratic Forms Distribution Linear Transformations Distribution of estimates under normality Properties of MLE s Recap Ŷ = ˆµ is an unbiased estimate
More informationMatrix Approach to Simple Linear Regression: An Overview
Matrix Approach to Simple Linear Regression: An Overview Aspects of matrices that you should know: Definition of a matrix Addition/subtraction/multiplication of matrices Symmetric/diagonal/identity matrix
More informationLecture 3 September 1
STAT 383C: Statistical Modeling I Fall 2016 Lecture 3 September 1 Lecturer: Purnamrita Sarkar Scribe: Giorgio Paulon, Carlos Zanini Disclaimer: These scribe notes have been slightly proofread and may have
More informationMultivariate Statistics
Multivariate Statistics Chapter 2: Multivariate distributions and inference Pedro Galeano Departamento de Estadística Universidad Carlos III de Madrid pedro.galeano@uc3m.es Course 2016/2017 Master in Mathematical
More informationST 740: Linear Models and Multivariate Normal Inference
ST 740: Linear Models and Multivariate Normal Inference Alyson Wilson Department of Statistics North Carolina State University November 4, 2013 A. Wilson (NCSU STAT) Linear Models November 4, 2013 1 /
More informationCommon-Knowledge / Cheat Sheet
CSE 521: Design and Analysis of Algorithms I Fall 2018 Common-Knowledge / Cheat Sheet 1 Randomized Algorithm Expectation: For a random variable X with domain, the discrete set S, E [X] = s S P [X = s]
More informationStatistics of stochastic processes
Introduction Statistics of stochastic processes Generally statistics is performed on observations y 1,..., y n assumed to be realizations of independent random variables Y 1,..., Y n. 14 settembre 2014
More informationStatistics 351 Probability I Fall 2006 (200630) Final Exam Solutions. θ α β Γ(α)Γ(β) (uv)α 1 (v uv) β 1 exp v }
Statistics 35 Probability I Fall 6 (63 Final Exam Solutions Instructor: Michael Kozdron (a Solving for X and Y gives X UV and Y V UV, so that the Jacobian of this transformation is x x u v J y y v u v
More informationLecture 1: August 28
36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 1: August 28 Our broad goal for the first few lectures is to try to understand the behaviour of sums of independent random
More informationSTAT 135 Lab 13 (Review) Linear Regression, Multivariate Random Variables, Prediction, Logistic Regression and the δ-method.
STAT 135 Lab 13 (Review) Linear Regression, Multivariate Random Variables, Prediction, Logistic Regression and the δ-method. Rebecca Barter May 5, 2015 Linear Regression Review Linear Regression Review
More informationTHE UNIVERSITY OF CHICAGO Graduate School of Business Business 41912, Spring Quarter 2008, Mr. Ruey S. Tsay. Solutions to Final Exam
THE UNIVERSITY OF CHICAGO Graduate School of Business Business 41912, Spring Quarter 2008, Mr. Ruey S. Tsay Solutions to Final Exam 1. (13 pts) Consider the monthly log returns, in percentages, of five
More informationEC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)
1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For
More informationQualifying Exam CS 661: System Simulation Summer 2013 Prof. Marvin K. Nakayama
Qualifying Exam CS 661: System Simulation Summer 2013 Prof. Marvin K. Nakayama Instructions This exam has 7 pages in total, numbered 1 to 7. Make sure your exam has all the pages. This exam will be 2 hours
More informationChp 4. Expectation and Variance
Chp 4. Expectation and Variance 1 Expectation In this chapter, we will introduce two objectives to directly reflect the properties of a random variable or vector, which are the Expectation and Variance.
More informationPart IA Probability. Definitions. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015
Part IA Probability Definitions Based on lectures by R. Weber Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures.
More informationTAMS39 Lecture 10 Principal Component Analysis Factor Analysis
TAMS39 Lecture 10 Principal Component Analysis Factor Analysis Martin Singull Department of Mathematics Mathematical Statistics Linköping University, Sweden Content - Lecture Principal component analysis
More informationTesting a Normal Covariance Matrix for Small Samples with Monotone Missing Data
Applied Mathematical Sciences, Vol 3, 009, no 54, 695-70 Testing a Normal Covariance Matrix for Small Samples with Monotone Missing Data Evelina Veleva Rousse University A Kanchev Department of Numerical
More informationEconomics 583: Econometric Theory I A Primer on Asymptotics
Economics 583: Econometric Theory I A Primer on Asymptotics Eric Zivot January 14, 2013 The two main concepts in asymptotic theory that we will use are Consistency Asymptotic Normality Intuition consistency:
More information7 Multivariate Statistical Models
7 Multivariate Statistical Models 7.1 Introduction Often we are not interested merely in a single random variable but rather in the joint behavior of several random variables, for example, returns on several
More informationLecture 2: Review of Probability
Lecture 2: Review of Probability Zheng Tian Contents 1 Random Variables and Probability Distributions 2 1.1 Defining probabilities and random variables..................... 2 1.2 Probability distributions................................
More informationδ -method and M-estimation
Econ 2110, fall 2016, Part IVb Asymptotic Theory: δ -method and M-estimation Maximilian Kasy Department of Economics, Harvard University 1 / 40 Example Suppose we estimate the average effect of class size
More informationNotes on the Multivariate Normal and Related Topics
Version: July 10, 2013 Notes on the Multivariate Normal and Related Topics Let me refresh your memory about the distinctions between population and sample; parameters and statistics; population distributions
More informationCanonical Correlation Analysis of Longitudinal Data
Biometrics Section JSM 2008 Canonical Correlation Analysis of Longitudinal Data Jayesh Srivastava Dayanand N Naik Abstract Studying the relationship between two sets of variables is an important multivariate
More informationIntroduction to Normal Distribution
Introduction to Normal Distribution Nathaniel E. Helwig Assistant Professor of Psychology and Statistics University of Minnesota (Twin Cities) Updated 17-Jan-2017 Nathaniel E. Helwig (U of Minnesota) Introduction
More informationGaussian random variables inr n
Gaussian vectors Lecture 5 Gaussian random variables inr n One-dimensional case One-dimensional Gaussian density with mean and standard deviation (called N, ): fx x exp. Proposition If X N,, then ax b
More informationGaussian vectors and central limit theorem
Gaussian vectors and central limit theorem Samy Tindel Purdue University Probability Theory 2 - MA 539 Samy T. Gaussian vectors & CLT Probability Theory 1 / 86 Outline 1 Real Gaussian random variables
More informationMultivariate Gaussian Analysis
BS2 Statistical Inference, Lecture 7, Hilary Term 2009 February 13, 2009 Marginal and conditional distributions For a positive definite covariance matrix Σ, the multivariate Gaussian distribution has density
More informationTopics in Probability and Statistics
Topics in Probability and tatistics A Fundamental Construction uppose {, P } is a sample space (with probability P), and suppose X : R is a random variable. The distribution of X is the probability P X
More informationAsymptotic Statistics-III. Changliang Zou
Asymptotic Statistics-III Changliang Zou The multivariate central limit theorem Theorem (Multivariate CLT for iid case) Let X i be iid random p-vectors with mean µ and and covariance matrix Σ. Then n (
More informationReview of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley
Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate
More information3. Probability and Statistics
FE661 - Statistical Methods for Financial Engineering 3. Probability and Statistics Jitkomut Songsiri definitions, probability measures conditional expectations correlation and covariance some important
More informationSampling Distributions
Merlise Clyde Duke University September 8, 2016 Outline Topics Normal Theory Chi-squared Distributions Student t Distributions Readings: Christensen Apendix C, Chapter 1-2 Prostate Example > library(lasso2);
More information1. Let A be a 2 2 nonzero real matrix. Which of the following is true?
1. Let A be a 2 2 nonzero real matrix. Which of the following is true? (A) A has a nonzero eigenvalue. (B) A 2 has at least one positive entry. (C) trace (A 2 ) is positive. (D) All entries of A 2 cannot
More informationProbability Lecture III (August, 2006)
robability Lecture III (August, 2006) 1 Some roperties of Random Vectors and Matrices We generalize univariate notions in this section. Definition 1 Let U = U ij k l, a matrix of random variables. Suppose
More informationLIST OF FORMULAS FOR STK1100 AND STK1110
LIST OF FORMULAS FOR STK1100 AND STK1110 (Version of 11. November 2015) 1. Probability Let A, B, A 1, A 2,..., B 1, B 2,... be events, that is, subsets of a sample space Ω. a) Axioms: A probability function
More informationChapter 6. Convergence. Probability Theory. Four different convergence concepts. Four different convergence concepts. Convergence in probability
Probability Theory Chapter 6 Convergence Four different convergence concepts Let X 1, X 2, be a sequence of (usually dependent) random variables Definition 1.1. X n converges almost surely (a.s.), or with
More informationSubmitted to the Brazilian Journal of Probability and Statistics
Submitted to the Brazilian Journal of Probability and Statistics Multivariate normal approximation of the maximum likelihood estimator via the delta method Andreas Anastasiou a and Robert E. Gaunt b a
More informationMonte-Carlo MMD-MA, Université Paris-Dauphine. Xiaolu Tan
Monte-Carlo MMD-MA, Université Paris-Dauphine Xiaolu Tan tan@ceremade.dauphine.fr Septembre 2015 Contents 1 Introduction 1 1.1 The principle.................................. 1 1.2 The error analysis
More informationStatement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:.
MATHEMATICAL STATISTICS Homework assignment Instructions Please turn in the homework with this cover page. You do not need to edit the solutions. Just make sure the handwriting is legible. You may discuss
More informationLecture 15. Hypothesis testing in the linear model
14. Lecture 15. Hypothesis testing in the linear model Lecture 15. Hypothesis testing in the linear model 1 (1 1) Preliminary lemma 15. Hypothesis testing in the linear model 15.1. Preliminary lemma Lemma
More informationFinal Exam. 1. (6 points) True/False. Please read the statements carefully, as no partial credit will be given.
1. (6 points) True/False. Please read the statements carefully, as no partial credit will be given. (a) If X and Y are independent, Corr(X, Y ) = 0. (b) (c) (d) (e) A consistent estimator must be asymptotically
More informationThe Delta Method and Applications
Chapter 5 The Delta Method and Applications 5.1 Local linear approximations Suppose that a particular random sequence converges in distribution to a particular constant. The idea of using a first-order
More information2. Matrix Algebra and Random Vectors
2. Matrix Algebra and Random Vectors 2.1 Introduction Multivariate data can be conveniently display as array of numbers. In general, a rectangular array of numbers with, for instance, n rows and p columns
More informationPrincipal component analysis
Principal component analysis Angela Montanari 1 Introduction Principal component analysis (PCA) is one of the most popular multivariate statistical methods. It was first introduced by Pearson (1901) and
More informationStatistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach
Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score
More informationSTATS 200: Introduction to Statistical Inference. Lecture 29: Course review
STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout
More informationMa 3/103: Lecture 24 Linear Regression I: Estimation
Ma 3/103: Lecture 24 Linear Regression I: Estimation March 3, 2017 KC Border Linear Regression I March 3, 2017 1 / 32 Regression analysis Regression analysis Estimate and test E(Y X) = f (X). f is the
More informationChapter 2: Fundamentals of Statistics Lecture 15: Models and statistics
Chapter 2: Fundamentals of Statistics Lecture 15: Models and statistics Data from one or a series of random experiments are collected. Planning experiments and collecting data (not discussed here). Analysis:
More informationIII - MULTIVARIATE RANDOM VARIABLES
Computational Methods and advanced Statistics Tools III - MULTIVARIATE RANDOM VARIABLES A random vector, or multivariate random variable, is a vector of n scalar random variables. The random vector is
More informationGaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008
Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:
More informationTable of Contents. Multivariate methods. Introduction II. Introduction I
Table of Contents Introduction Antti Penttilä Department of Physics University of Helsinki Exactum summer school, 04 Construction of multinormal distribution Test of multinormality with 3 Interpretation
More information