arxiv: v1 [math.st] 24 Aug 2014

Size: px

Start display at page:

Download "arxiv: v1 [math.st] 24 Aug 2014"

Cuthbert Thomas
5 years ago
Views:

1 MAXIMUM LIKELIHOOD ESTIMATION FOR LINEAR GAUSSIAN COVARIANCE MODELS arxiv: v1 [math.st] 24 Aug 2014 By Piotr Zwiernik, Caroline Uhler, and Donald Richards University of California, Berkeley, IST Austria, Penn State University We study parameter estimation in linear Gaussian covariance models, which are p-dimensional Gaussian models with linear constraints on the covariance matrix. Maximum likelihood estimation for this class of models leads to a non-convex optimization problem which typically has many local optima. We prove that the log-likelihood function is concave over a large region of the cone of positive definite matrices. Using recent results on the asymptotic distribution of extreme eigenvalues of the Wishart distribution, we provide sufficient conditions for any hill climbing method to converge to the global optimum. The proofs of these results utilize large-sample asymptotic theory under the scheme n/p γ > 1. Remarkably, our numerical simulations indicate that our results remain valid for min{n, p} as small as 2. An important consequence of this analysis is that for sample sizes n 14p, maximum likelihood estimation for linear Gaussian covariance models behaves as if it were a convex optimization problem. 1. Introduction. In various statistical applications the covariance matrix carries a certain structure and has to satisfy certain constraints. Pouramadhi [Pou11] gives an excellent review about covariance estimation in general. In this paper, we study Gaussian models with linear constraints on the covariance matrix. Such models arise for example in maximum likelihood estimation of correlation matrices for Gaussian populations [RM94, SWY00, SO91] or covariance matrices with prescribed zeros [DR04, CDR07], applications of stationary stochastic processes from repeated time series data [And70, And73], Brownian motion tree models for phylogenetic analyses [Fel73, Fel81], and network tomography models for analyzing the structure of the connections in the Internet [TYBN04, EDBN10]. To define Gaussian models with linear constraints on the covariance matrix, let S p be the set of symmetric p p matrices and denote by S p 0 the open convex cone in S p of positive definite matrices. For v = (v 1,... v r ) R r, the r-dimensional Euclidean space, a random vector X taking values in R p is said MSC 2010 subject classifications: Primary 62H12, 62F30; Secondary 90C25, 52A41 Keywords and phrases: Linear Gaussian covariance model; Convex optimization; Tracy- Widom law; Wishart distribution; Eigenvalues of random matrices; Brownian motion tree model; Network tomography; 1

2 2 P. ZWIERNIK, C. UHLER, D. RICHARDS to satisfy the linear Gaussian covariance model given by G 0, G 1,..., G r S p if X follows a multivariate Gaussian distribution with mean vector µ and covariance matrix Σ v, denoted X N p (µ, Σ v ), where Σ v = G 0 + r v ig i. In this paper we make no assumptions on the mean vector µ and always use the sample mean as estimator for µ. Then the linear covariance model, which we denote by M(G) with G = (G 0,..., G r ), can be parametrized by { (1.1) Θ G = v = (v 1,... v r ) R r r } G0 + v i G i S p 0. We assume that G 1,..., G r are linearly independent, meaning that Σ v G 0 is the zero matrix if and only if v = 0. This assumption ensures that the model is identifiable. We also assume that G 0 is orthogonal to the linear subspace spanned by G 1,..., G r. Typically G 0 is either the zero matrix or the identity matrix and the linear constraints on the diagonal and the offdiagonal elements are disjoint, hence satisfying orthogonality. Linear Gaussian covariance models were introduced by Anderson [And70], motivated by the linear structure of covariance matrices in different time series models. Anderson studied maximum likelihood estimation for these models and suggested iterative procedures such as the Newton-Raphson method [And70] and a scoring method [And73] for finding the maximum likelihood estimator (MLE). An example of a linear Gaussian covariance model is the set of all p p correlation matrices; this model is defined by Θ G = {v R (p 2) Ip + } v ij (E ij + E ji ) S p 0, 1 i<j p where I p denotes the identity matrix of size p p and E ij is a p p matrix, where the (i, j) entry is 1 and all other entries are 0. Another example of a linear Gaussian covariance model is the Brownian motion tree model [Fel73]: Given a rooted tree T on the set of nodes V = {1,..., r} and with p < r leaves, the corresponding Brownian motion tree model consists of all covariance matrices of the form Σ v = i V v i e de(i) e T de(i), where e de(i) R p is a 0/1-vector with entry 1 at position j if leaf j is a descendent of node i and 0 otherwise. Note that every leaf is a descendent of itself. For example, when T is a star tree on 3 leaves, then v 1 + v 4 v 4 v 4 (1.2) Σ v = v 4 v 2 + v 4 v 4, v 4 v 4 v 3 + v 4

3 where v 1, v 2, v 3 parametrize the leaves and v 4 parametrizes the root. We draw from N (µ, Σ v ) a random sample X 1,..., X n and form the corresponding sample covariance matrix 3 S n = 1 n n (X i X)(X i X) T, where X = 1 n n X i. Up to a constant term, the log-likelihood function l: Θ G R, is given by (1.3) l(v) = n 2 log det Σ v n 2 tr(s nσ 1 v ). Since the linear constraints are on the covariance matrix, the log-likelihood function is typically non-convex and may have multiple local optima. This complicates inference and parameter estimation considerably. A classical example of this problem is the case of estimating the correlation matrix of a Gaussian model, a venerable problem that has been sought after for decades [RM94, SWY00, SO91]. In Section 3 we show how the theory we develop in this paper can be used to solve this problem. A special class of linear Gaussian covariance models where maximum likelihood estimation is unproblematic are models where Σ and Σ 1 have the same pattern, i.e., Σ = G 0 + r v i G i and Σ 1 = G 0 + r w i G i. An example of such models are Brownian motion tree models on the star tree as given in (1.2) with equal variances. In fact, Szatrowski showed in a series of papers [Sza78, Sza80, Sza85] that the MLE for linear Gaussian covariance models has an explicit representation, i.e., it is a known linear combination of entries of the sample covariance matrix, if and only if Σ and Σ 1 have the same pattern. Szatrowski proved that for this model class the MLE is the average of the corresponding elements of the sample covariance matrix and that Anderson s scoring method [And73] yields the MLE in one iteration when started in any positive definite matrix that is in the model. In this paper, we show that for general linear Gaussian covariance models the problem of maximizing the log-likelihood function is, with high probability, essentially a convex optimization problem and the MLE can be found using any hill-climbing method such as the Newton-Raphson algorithm. To be more precise, in Section 2 we analyze the likelihood function for linear Gaussian covariance models, proving that the log-likelihood function is

4 4 P. ZWIERNIK, C. UHLER, D. RICHARDS strictly concave in the convex region (1.4) 2Sn := {v Θ G 0 Σ v 2S n }. We also prove that 2Sn contains the true data-generating parameter, the global maximum of the likelihood function and the least squares estimator with high probability. Therefore, the problem of maximizing the loglikelihood function is essentially a convex problem and any hill-climbing algorithm, when initiated at the least squares estimator, will remain in the convex region 2Sn and converge to the MLE in a monotonic increasing manner. Note that the region 2Sn is contained in a larger subset of Θ G where the likelihood function is concave. This subset is, in turn, contained in an even larger region where the likelihood function is unimodal. Therefore, the probability bounds that we derive in this paper are lower bounds for the exact probabilities that the optimization procedure is well behaved in the sense that any hill-climbing method will converge monotonically to the MLE. In addition to the theoretical analysis of the behavior of the log-likelihood function, we also investigate the performance of the Newton-Raphson method for finding the MLE on simulated data. In Section 3, we discuss the problem of estimating a correlation matrix, a covariance matrix with linear constraints on the diagonal. In Section 4, we analyze how the MLE compares to the least squares estimator in terms of various loss functions and show that for linear Gaussian covariance models the MLE is usually a better estimator than the least squares estimator. 2. Convexity of the likelihood function. Given a sample covariance matrix S n based on a random sample of size n from N (µ, Σ ), we denote by 2Sn the convex subset of the parameter space as in (1.4). Consider the set D 2Sn = {Σ S p 0 0 Σ 2S n}. Note that D 2Sn and 2Sn are random and depend on the number of observations n. By construction, D 2Sn contains a matrix Σ v (i.e., D 2Sn Σ v ) if and only if 2Sn contains v (i.e., 2Sn v). Therefore, we can identify 2Sn with D 2Sn and use the notations 2Sn Σ v and 2Sn v interchangeably. In this section, we analyze the log-likelihood function l(v) for linear Gaussian covariance models. We prove that l(v) is strictly concave over the region 2Sn. Then we give probabilistic guarantees for the true data-generating covariance matrix Σ, the global optimum ˆv, and the least squares estimator v, to be contained in the convex region 2Sn.

5 2.1. The log-likelihood function. We denote by i = / v i the partial derivative with respect to v i, i = 1,..., r. The log-likelihood function is given in (1.3), and its first-order partial derivatives are i l(v) = n 2 tr(σ 1 v Since i Σ v = G i for i = 1,..., r we obtain and hence (2.1) i Σ v ) + n 2 tr(s nσ 1 v ( i Σ v )Σ 1 v ). 2 n i l(v) = tr(σ 1 v G i ) + tr(s n Σ 1 v G i Σ 1 2 n v l(v) = [tr((σ 1 v Σ 1 v v ) S n Σ 1 )G i )] r. Next, we analyze the second-order partial derivatives of l(v). For every i, j = 1,..., r a simple computation shows that 2 n 2 l(v) = tr(σ 1 v G i (2Σ 1 v S n Σ 1 v i j In particular, for every y R r we obtain (2.2) y T ( v T v l(v)) y = tr(σ 1 v v Σ y (2Σ 1 S n Σ 1 v Σ 1 v )G j ). v Σ 1 v )Σ y ), where Σ y = r y ig i, and we note that Σ y is symmetric but not necessarily positive definite. We now use Equation (2.2) to prove that l(v) is negative definite on the convex region 2Sn defined in (1.4): 5 Lemma 2.1. If 2Sn v, then v T v l(v) is negative definite. Proof. We reorder the terms in Equation (2.2) to obtain for each y R r, y T ( v T v l(v)) y = tr(σ 1 v Σ y (2Σ 1 = tr((2s n Σ v )Σ 1 v v S n Σ 1 v By assumption 2S n Σ v 0. Also, Σ 1 v Σ y Σ 1 v Σ y Σ 1 v so we have y T ( v v T l(v)) y 0. Σ 1 v )Σ y ) Σ y Σ 1 v Σ y Σ 1 v ). is positive semidefinite, Therefore, v v T l(v) is negative semidefinite. Next, suppose that y T ( v v T l(v)) y = 0 for some y. Again because 2S n Σ v is positive definite and Σ 1 v Σ y Σ 1 v Σ y Σ 1 v is positive semidefinite, it follows that Σ 1 v Σ y Σ 1 v Σ y Σ 1 v = 0, equivalently, Σ y Σ 1 v Σ y = 0. Since Σ y Σ 1 v Σ y has the same set of eigenvalues as Σ 1/2 v Σ 2 yσ 1/2 v, we obtain Σ y = 0, which holds if and only if y = 0. This completes the proof.

6 6 P. ZWIERNIK, C. UHLER, D. RICHARDS As a direct consequence of Lemma 2.1 we obtain the following result: Proposition 2.2. The log-likelihood function l: Θ G R is strictly concave on 2Sn. In particular, maximizing l(v) over 2Sn is a convex problem The true covariance matrix and the Tracy-Widom law. For A S p, we denote by λ min (A) the minimal eigenvalue of A and by λ max (A) the maximal eigenvalue of A. We denote by A the spectral norm of A, i.e., A = max( λ min (A), λ max (A) ). In the following, we will often make use of the well-known fact that (2.3) A < ɛ if and only if ɛ I p A ɛ I p. Given a sample covariance matrix S n based on a random sample of size n p from N (µ, Σ ), then ns n follows a Wishart distribution, W p (n 1, Σ ). If Σ = I p, we say that ns n follows a white Wishart (or standard Wishart) distribution. Thus, W n 1 := nσ 1/2 S n Σ 1/2 W p (n 1, I p ); and the condition 2S n Σ 0 is equivalent to W n 1 n 2 I p, which holds if and only if λ min (W n 1 ) > n/2. Since by assumption Σ 0 we obtain the following lemma: Proposition 2.3. The probability that the true data-generating covariance matrix Σ is contained in 2Sn does not depend on Σ and is equal to the probability that λ min (W n 1 ) > n/2, where W n 1 W p (n 1, I p ). It is generally non-trivial to approximate the distributions of the extreme eigenvalues of the Wishart distribution; Muirhead [Mui82, Section 9.7] surveyed the results available and provided expressions for the distributions of the extreme eigenvalues in terms of zonal polynomial series expansions, which are difficult to approximate accurately. Nevertheless, substantial progress has been achieved recently in the asymptotic scenario in which n and p are large and n/p γ 1. It is noteworthy that these asymptotic approximations are very accurate even for values of n as small as 2. We refer to [Bai99, Joh01] for a review of these results. In order to describe the probability that the true data-generating covariance matrix Σ lies in 2Sn, we build on the recent work of Ma [Ma12], who gave improved approximations for the extreme eigenvalues of the white Wishart distribution. Ma defined the quantities, ( µ n,p = (n 1 2 )1/2 (p 1 2 )1/2) 2, σ n,p = ((n ( 1 2 )1/2 (p 1 2 )1/2) (p 1 2 ) 1/2 (n 1 2 ) 1/2) 1/3

7 7 and (2.4) τ n,p = σ n,p µ n,p, ν n,p = log(µ n,p ) τ 2 n,p. In the following, we denote by F the Tracy-Widom distribution [TW96]. Then, Ma proved the following theorem: Theorem 2.4. (Ma [Ma12]) Suppose that W n has a white Wishart distribution, W p (n, I p ), with n > p. Then, as n and n/p γ (1, ), log λ min (W n ) ν n,p τ n,p L G, where G(z) = 1 F ( z) denotes the reflected Tracy-Widom distribution. As a consequence of Ma s theorem we obtain: P(λ min (W n 1 ) > n ( log 2 ) = P λmin (W n 1 ) ν n 1,p > log(n/2) ν ) n 1,p τ n 1,p τ n 1,p ( ) log(n/2) νn 1,p 1 G, τ n 1,p where, for sequences a n,p and b n,p, we use the notation a n,p b n,p to describe that a n,p /b n,p 1 as n and n/p γ (1, ). So, we conclude that ( ) (2.5) P( 2Sn Σ νn 1,p log(n/2) ) F. τ n 1,p Ma also showed that the approximation by the reflected Tracy-Widom law in Theorem 2.4 is of order O(min(n, p) 2/3 ). Although this statement is proven rigorously only for even p (see [Ma12, Theorem 2]), numerical simulations in [Ma12] suggest that this result holds also for odd p. Another important consequence of Theorem 2.4 is that P( 2Sn Σ ) is very well approximated by a function that depends only on the ratio n/p. In Figure 1, we provide graphs from simulated data to analyze the accuracy of the approximation by the Tracy-Widom law for small p. By Proposition 2.3, P( 2Sn Σ ) = P(λ min (W n 1 ) > n 2 ), a quantity which does not depend on Σ, so we used the white Wishart distribution in our simulations. In each plot in Figure 1, the dimension p is fixed and the sample size n varies between p and 20p. The solid blue line is the simulated probability that λ min (W n 1 ) > n/2 and the dashed black line is the corresponding approximation resulting from the Tracy-Widom distribution, for which we used

8 8 P. ZWIERNIK, C. UHLER, D. RICHARDS p=3 p=5 p=10 Frequency Simulated probability T W approximation 0.95 line Frequency Simulated probability T W approximation 0.95 line Frequency Simulated probability T W approximation 0.95 line Sample size (a) p = Sample size (b) p = Sample size (c) p = 10 Fig 1. In each plot, p {3, 5, 10} is fixed and n varies between p and 20p. The solid blue line provides the simulated values of P(λ min(w n 1) > n/2) and the dashed black line is the corresponding approximation by the reflected Tracy-Widom distribution. the R package RMTstat[JMPS09]. The values for each pair (p, n) are based on 10,000 replications. Figure 1 shows that the approximation is extremely good even for small values of p. For large values of n/p the two lines coincide even when p = 3. The approximation is nearly perfect over the whole range of n as long as p 5 and if p 10 the two lines coincide. As an example, suppose we wish to obtain P( 2Sn Σ ) > We use the approximation in (2.5) and the fact that F (0.98) 0.95 to compute n such that (2.6) ν n 1,p log(n/2) τ n 1,p > 0.98 for given p. In Table 1, we provide for various values of p the minimal sample size resulting in P( 2Sn Σ ) > It is interesting to study the behavior of the curves in Figure 1 for growing values of p. Our simulations show that its graph converges to the graph of the step function, that is a function which is zero for γ < γ and is one for γ γ, where γ is some fixed number. This observation suggests that for large p the minimal sample size such that P( 2Sn Σ ) > 0.95 is equal to the minimal sample size such that P( 2Sn Σ ) > 0. in the following result we formalize this observation. Table 1 The minimal sample sizes resulting in P( 2Sn Σ ) > p large n p

9 Proposition 2.5. Define a sequence (f n,p ) n,p=1 of numbers given by ( ) νn 1,p log(n/2) := F, f n,p τ n 1,p where ν n 1,p, τ n 1,p are defined as in (2.4) and F is the Tracy-Widom distribution. Define γ = Suppose that n, p and n/p γ > 1. Then { 1 if γ γ f n,p 0 if γ < γ. Proof. Consider the expression ν n 1,p log(n/2) τ n 1,p and fix n = γp letting p go to infinity. Direct computation shows that the limit of this expression is + if γ γ and otherwise. The result follows from F ( ) = 0 and F (+ ) = 1. Proposition 2.5 shows that the choice of the threshold 0.95 to construct Table 1 is irrelevant for large p. Moreover, note that the results in this section do not depend on the chosen model. They can be applied to any Gaussian model, where the interest lies in estimating the covariance matrix The global maximum of the log-likelihood function. We denote the global maximum of the log-likelihood function l(v) by ˆv and the corresponding covariance matrix by Σˆv. In the following theorem, we apply certain concentration of measure results given in the appendix to obtain finite sample bounds on the probability that 2Sn Σˆv. By construction, if Σˆv exists, then Σˆv 0. Hence it suffices to analyze the probability that 2S n Σˆv 0. Theorem 2.6. Let ˆv be the MLE based on a sample covariance matrix S n of a linear Gaussian covariance model with parameter v R r and corresponding true covariance matrix Σ v. Let ɛ = 1 2 log( 1 2 ) Then, p ψ 1 ( n i 2 2(n 1)p ) + ( p log( 2 n 1 ) + p ψ( n i 2 )) 2 P( 2Sn Σˆv ) 1 n 2 ɛ 2 ɛ 2, where ψ( ) denotes the digamma function and ψ 1 ( ) the trigamma function. As n, p 1 2(n 1)p n 2 ɛ 2 ψ 1 ( n i 2 ) + ( p log( 2 n 1 ) + p ɛ 2 ψ( n i 2 )) 2 9 = 1 4p nɛ 2 +O ( 1 n 2 ).

10 10 P. ZWIERNIK, C. UHLER, D. RICHARDS Proof. Let w Θ G be a point outside of 2Sn that gives the highest value of the likelihood function. Such a w must exists since the unconstrained likelihood over S p 0 has a unique maximum at S n. Since Σ w is positive definite, it lies outside of 2Sn if and only if 2S n Σ w 0. This means that Σ 1/2 w S n Σ 1/2 w 1 2 I p, or equivalently, λ min (Σ 1/2 w S n Σ 1/2 w ) 1 2. Since l(ˆv) l(v ) and the log-likelihood function is concave over 2Sn, we obtain P( 2Sn Σˆv ) P(l(v ) > l(w)) = P(l(v ) l(w)). Now note that P(l(v ) l(w)) = P(log det(σ w )+tr(s n Σ 1 w ) log det(σ v ) tr(s n Σ 1 v ) 0). Since tr(s n Σ 1 v ) = 1 n tr(w n 1), where W n 1 W(n 1, I p ), we can apply Chebyshev s inequality (see (A.1) in the Appendix) to obtain that with probability at least 1 2(n 1)p n 2 ɛ 2, (2.7) In addition, n 1 n p ɛ tr(s nσ 1 v ) n 1 n p + ɛ. log det S n log det Σ v = log det W n 1 p log(n), and we can again apply Chebyshev s inequality (see (A.2) in the Appendix) to obtain that with probability at least p p det := 1 ψ 1( n i 2 ) + ( p log( 2 n ) + p n i ψ( 2 )) 2 ɛ 2, we have that (2.8) log det S n ɛ log det Σ v log det S n + ɛ. The events (2.7) and (2.8) are not independent. So we use P(A B) P(A) + P(B) 1 and find that with probability at least 1 2(n 1)p n 2 ɛ 2 p det, log det(σ w ) + tr(s n Σ 1 w ) log det(σ v ) tr(s n Σ 1 v ) log det(σ w ) + tr(s n Σ 1 w ) log det(s n ) n 1 n p 2ɛ log det(σ 1/2 w S n Σw 1/2 ) + tr(s n Σ 1 w ) p 2ɛ.

11 We now show that the last expression is nonnegative when ɛ 1 2 log( 1 2 ) 1 4. We denote the eigenvalues of Σ 1/2 w S n Σ 1/2 w by λ 1 λ p. Then log det(σ 1/2 w S n Σ 1/2 w ) + tr(s n Σ 1 w ) p 2ɛ = 11 p (λ i 1 log(λ i )) 2ɛ. Note that the function f(x) = x 1 log(x) is non-negative for x > 0, strictly decreasing for 0 < x < 1, reaches a global minimum at x = 1, and is strictly increasing for x > 1. Also note that by assumption 0 < λ 1 < 1/2. Hence, p (λ i 1 log(λ i )) 2ɛ p (λ i 1 log(λ i )) 1/2 log(1/2) 2ɛ i=2 0, 1/2 log(1/2) 2ɛ when ɛ 1 2 log( 1 2 ) 1 2(n 1)p 4. Since 1 p n 2 ɛ 2 det is an increasing function in ɛ, we plug in ɛ = 1 2 log( 1 2 ) 1 4. As n, 1 2p nɛ 2 p det which completes the proof. = 1 2p nɛ 2 p ( 2 n + O( 1 ) ) + ( p log( 2 n 2 n ) + p( log( n 2 ) + O( 1 n ))) 2 ɛ 2 = 1 4p ( 1 ) nɛ 2 + O n 2, More accurate finite sample bounds for this result can be obtained by applying inequalities for the quantiles of the χ 2 -distribution instead of Chebyshev s inequality. Such bounds are given in Proposition A.1 in the Appendix Least squares estimation for linear Gaussian covariance models. Anderson [And70] describes an unbiased estimator for the covariance matrix and suggests using this unbiased estimator as a starting point for the Newton- Raphson algorithm. He treats the case G 0 = 0 and suggests the unbiased estimator given by the solution to the following set of linear equations: (2.9) r v i tr(g i G j ) = n n 1 tr(s ng j ), j = 1,..., r. We denote by v the estimator obtained from solving (2.9) without the scaling by. As we show now, v is the least squares estimator and is obtained by n n 1

12 12 P. ZWIERNIK, C. UHLER, D. RICHARDS orthogonally projecting S n onto the linear subspace defining M(G): Let G denote the p 2 r-matrix whose columns are given by vec(g i ), i = 1,..., r. Define H G := G(G T G) 1 G T. The fact that G T G is positive definite and hence invertible follows from the assumption that G i, i = 1,..., r, are linearly independent. The matrix H G is a symmetric matrix representing the projection in R p2 onto the linear subspace spanned by vec(g i ), i = 1,..., r. In particular, it holds that HG 2 = H G. The defining equations for v can be expressed as v = (G T G) 1 G T vec(s n ), and since G 0 is orthogonal to the linear subspace spanned by G 1,..., G r, vec(σ v ) = vec(g 0 ) + G v = vec(g 0 ) + H G vec(s n ). For instance, for Brownian motion tree models G 0 = 0 and Σ v can be seen as a sort of average over all paths between siblings in the tree. For a star tree model with p leaves Σ v is of the form (2.10) (Σ v ) ii = (S n ) ii and (Σ v ) ij = ( 1 p ) (S n ) ij. 2 i<j We now prove that Σ v also lies in 2Sn with high probability. Observe that if W n has a standard Wishart distribution W p (n, I p ), then we can write W n = AA T, where A is a p n matrix whose entries are independent standard normal random variables. We denote the minimal and maximal singular values of A by s min (A) and s max (A), respectively. So we have s min (A) = λmin (W n ) and s max (A) = λ max (W n ). We will make use of the following two results from [DS01] and [Ver10, Section 5.3.1]: Theorem 2.7 (Theorem II.13 in [DS01]). Let A be a p n matrix whose entries are independent standard normal random variables. Then for every t 0, with probability at least 1 2 exp( t 2 /2) we have n p t smin (A) s max (A) n + p + t. Lemma 2.8 (Lemma 5.36 in [Ver10]). Consider a matrix B that satisfies BB T I p max(δ, δ 2 ) for some δ > 0. Then 1 δ s min (B) s max (B) 1 + δ.

13 We now use these results to give exponential finite sample bounds on the probability that Σ v 2Sn. As we will see, these bounds hold also when n and p are large as long as their ratio n/p is at least Let ˆv be the MLE based on a sample covariance matrix S n of a linear Gaussian covariance model with parameter v R r and corresponding true covariance matrix Σ v. Theorem 2.9. Let Σ v denote the least squares estimator based on a sample covariance matrix S n of a linear Gaussian covariance model given by Σ v. Let γ = n/p and suppose that γ > Then 13 ɛ = ɛ(γ) := γ > 0 and 2S n Σ v 0 with probability at least 1 2 exp( nɛ 2 /2). Also, Σ v 0 with probability at least 1 2 exp( nɛ 2 /2). Consequently, P( 2Sn Σ v ) 1 4 exp( nɛ 2 /2). Proof. Note that 2S n Σ v 0 if and only if C := Σ 1/2 v S nσ 1/2 v + Σ 1/2 v (S n Σ v )Σ 1/2 v 0, or equivalently, if λ min (C) > 0. The eigenvalue stability inequality (see [Tao12, Equation (1.63)]) says that for symmetric matrices A and B and for any eigenvalue λ i, (2.11) λ i (A + B) λ i (A) B. As a special case of this inequality we obtain λ min (A + B) λ min (A) B and hence (2.12) λ min (C) λ min (Σ 1/2 v S nσ 1/2 Σ v ) 1/2 v (S n Σ v )Σ 1/2 v λ min (Σ 1/2 v S nσ 1/2 v ) Σ 1 v S n Σ v = 1 n λ min(w n 1 ) Σ 1 v S n Σ v, where W n 1 := nσ 1/2 v S nσ 1/2 v W (n 1, I p). For the second inequality we used the fact that ABA A 2 B and that A 2 = A 2 for A positive definite. Since v is the least squares estimator, S n Σ v F S n Σ v F.

14 14 P. ZWIERNIK, C. UHLER, D. RICHARDS So as a consequence of standard inequalities on matrix norms we obtain S n Σ v S n Σ v F S n Σ v F p S n Σ v. This implies that (2.13) λ min (C) 1 n λ min(w n 1 ) p Σ 1 v S n Σ v = 1 n λ min(w n 1 ) p ( ) Σ 1 v Σ 1/2 1 v n W n 1 I p Σ 1/2 v 1 n λ min(w n 1 ) p Σ 1 v Σ v 1 n W n 1 I p. Let κ := Σ 1 v Σv 1 denote the condition number of Σ v. Define δ := 2 + κ p κ p(2 + κ p)). 2 For fixed κ this expression as a function of p [1, ) is monotonically decreasing and converges to 1/2. Hence, δ > 1 2. In addition, δ is bounded above by n λ min(w n 1 ) 1 δ and hence the last line in (2.13) can be further bounded as follows: 2, which is attained when p = 1 and κ = 1. Suppose that 1 n W n 1 I p δ. Then, by Lemma 2.8, 1 n λ min(w n 1 ) κ p 1 n W n 1 I p (1 δ)2 κδ p = 0. As a consequence, P(2S n Σ v 0) = P(λ min (C) 0) ( ) 1 P n W n 1 I p δ, and it is therefore sufficient to bound P ( 1 n W n 1 I p δ ). Note that 1 n W n 1 I p δ is equivalent to (2.14) (1 δ)n λ min (W n 1 ) λ max (W n 1 ) (1 + δ)n. By Theorem 2.7 we have that for every t 0 n 1 n γ t λ min (W n 1 ) λ max (W n 1 ) n 1 + n γ + t with probability at least 1 2 exp( t 2 /2). To bound the probability of event (2.14) we fix t as follows: t := ( ) 1 n 1 n + δ n 1. γ

15 15 Since δ 1 2 and n 1 n < 1, t n ( ) , γ which is nonnegative when γ So since t 0, we can apply Theorem 2.7 to conclude that ( n 1 n 2 n ) 1 + δ λ min (W n 1 ) λ max (W n 1 ) n 1 + δ with probability at least 1 2 exp( t 2 /2). Note, however, that n 1 δ ( n 2 n 1 n 1 + δ ), whenever n 15, which holds by the constraints on γ. Hence also with probability at least 1 2 exp( t 2 /2) we have that n 1 δ λmin (W n 1 ) λ max (W n 1 ) n 1 + δ, which is equivalent to (2.14). This establishes the exponential bound on the probability that 2S n Σ v 0. The same bound holds for the event Σ v 0, since Σ v 0 is equivalent to C := Σ 1/2 v S nσ 1/2 v + Σ 1/2 v (Σ v S n )Σ 1/2 v 0 or in other words λ min (C ) > 0. Again by the eigenvalue stability inequality (2.11) we have λ min (C ) λ min (Σ 1/2 v S nσ 1/2 v ) Σ 1/2 v (S n Σ v )Σ 1/2 v, the same expression as in (2.12). Finally, because the two events {Σ v 0} and {2S n Σ v 0} are not independent, we utilize the inequality P(A B) P(A) + P(B) 1 to complete the proof. We now show simulation results for P( Sn Σ v ) for a specific example, namely for Brownian motion tree models on the star tree. Example Consider the model M(G), where G 0 is the zero matrix, G i = E ii for i = 1,..., p, where E ii is a matrix with exactly one 1 at position (i, i) and zeros otherwise, and G p+1 = 11 T. This linear covariance model corresponds to a Brownian motion tree model on the star tree. Suppose that the true covariance matrix Σ v is given by vi = i for i = 1,..., p and = 1. We performed simulations similar to the ones that led to Figure 1. v p+1

16 16 P. ZWIERNIK, C. UHLER, D. RICHARDS p=3 p=5 p=10 Frequency Simulated probability T W approximation 0.95 line Frequency Simulated probability T W approximation 0.95 line Frequency Simulated probability T W approximation 0.95 line Sample size (a) p = Sample size (b) p = Sample size (c) p = 10 Fig 2. Illustration of Example In each plot p {3, 5, 10} is fixed and n varies between p and 20p. The solid blue curve shows the simulated probability that 0 Σ v 2S n and the dashed black curve is the approximation of the probability that 0 Σ v 2S n by the reflected Tracy-Widom distribution (see Figure 1). For fixed p {3, 5, 10} we let n vary between p and 20p. For each pair (p, n) we generated 10, 000 times a sample of size n from N p (0, Σ v ), computed the corresponding sample covariance matrix and the least squares estimator, and checked if the least squares estimator lies in the region 2Sn. The simulated probabilities of the event 0 Σ v 2S n are given by the solid blue line in Figure 2. As in Figure 1 the dashed black line represents the approximated probabilities (by the Tracy-Widom law) of the event 0 Σ v 2S n. Figure 1 indicates that on average Σ v lies in 2Sn more often than Σ v. 3. Estimating correlation matrices. In this section, we discuss the problem of maximum likelihood estimation of correlation matrices, a venerable problem that becomes difficult even for 3 3 matrices [RM94, SWY00, SO91]. We demonstrate how this problem can be solved using the Newton- Raphson method and show how the Newton-Raphson algorithm performs for estimating a 3 3 and a 4 4 correlation matrix. We start the algorithm in the least squares estimator. For correlation models with no additional structure, the least squares estimator is given by v ij = (S n ) ij, 1 i < j p and the corresponding correlation matrix is Σ v = I p + v ij (E ij + E ji ), 1 i<j p where E ij is the zero matrix with one 1-entry in position (i, j). Let v (k) be

17 17 Paths Paths Paths Ratio S Newton steps (a) n = 10 Ratio S Newton steps (b) n = 50 Ratio S Newton steps (c) n = 100 Fig 3. Newton paths for the 3 3 correlation matrix (a). Ratio Paths S Newton steps (a) n = 10 Ratio Paths S Newton steps (b) n = 50 Ratio Paths S Newton steps (c) n = 100 Fig 4. Newton paths for the 4 4 correlation matrix (b). the k-th step estimate of the parameter vector v obtained using the Newton- Raphson algorithm. In step k + 1 we compute the update r k+1 := v (k+1) v (k) = v T v l(v (k) ) l(v (k) ). The gradient is given in (2.1) and the Hessian in Lemma 2.1. In the following, we show how the Newton method performed on two examples. We sampled n observations from a multivariate normal distribution with correlation matrix 1 1/2 1/3 1/4 1 1/2 1/3 (a) Σ v = 1/2 1 1/4, (b) Σ v = 1/2 1 1/5 1/6 1/3 1/5 1 1/7. 1/3 1/4 1 1/4 1/6 1/7 1 In Figure 3 we show the Newton paths for the 3 3 correlation matrix given in (a) and in Figure 4 for the 4 4 correlation matrix given in (b). Note

18 18 P. ZWIERNIK, C. UHLER, D. RICHARDS Table 2 Probability that the MLE is on the boundary of 2Sn. 3 3 correlation matrix 4 4 correlation matrix P(Σˆv 0) P(2S n Σˆv 0) P(Σˆv 0) P(2S n Σˆv 0) n = (±0.046) (±0.030) (±0.044) (±0.024) n = (±0.017) (±0.005) (±0.022) (±0.006) n = (±0.005) (±0.000) (±0.007) (±0.000) that the scaling of the y-axis is different in every plot for better visibility of the different Newton paths. We plotted the ratio of the likelihood of the correlation matrix obtained by the Newton algorithm and the likelihood of the true data-generating correlation matrix. We show the first 10 steps of the Newton algorithm. The last point is the ratio of the likelihood of the sample covariance matrix and the true data-generating covariance matrix. For the MLE Σˆv it holds that 1 likelihood(σˆv) likelihood(σ v ) likelihood(s n) likelihood(σ v ), which is also what we observe in Figures 3 and 4. This is an indication that the Newton-Raphson algorithm has converged to the MLE. To produce Figures 3 and 4 we ran 50 simulations for n = 10, n = 50 and n = 100 and plotted the simulations for which Σ v was not singular. Otherwise the corresponding likelihood is undefined. In Table 2 we give the average and standard deviation for P(Σˆv 0) and P(2S n Σˆv 0), corresponding to the probability of landing on the boundary of the cone of positive definite matrices and the boundary of the region 2Sn. In Figures 3 and 4 we see that the Newton-Raphson method converges in about 3 steps for estimating 3 3 correlation matrices with 100 observations. It takes slightly longer for less observations and for 4 4 matrices, but in all scenarios the Newton-Raphson method seems to converge in under 10 steps. 4. Comparison of the MLE and the least squares estimator. In this section, we analyze through simulations how the MLE compares to the least squares estimator with respect to various loss functions. We argue that for linear Gaussian covariance models the MLE is usually a better estimator than the least squares estimator. One reason is that especially for small sample size Σ v can be negative definite, while Σˆv (if it exists) is always positive semidefinite. Furthermore, as we show in the following simulation study, even when Σ v is positive definite, the MLE usually has a smaller loss compared to Σ v.

19 We analyze four different loss functions: (a) the l -loss: ˆΣ Σ, (b) the Frobenius loss: ˆΣ Σ F, (c) the quadratic loss: ˆΣΣ 1, I p (d) the entropy loss: tr(ˆσσ 1 ) log det(ˆσσ 1 ) p The loss functions (a) and (b) are standard loss functions, the loss functions (c) and (d) were proposed in [RB94]. As an example we study the time series model for circular serial correlation coefficients discussed in [And70, (5.9)]. This model is generated by G 0 = 0, G 1 = I p and G 2 defined by { 1 if i j = 1 mod p (G 2 ) ij = 0 otherwise. In Figure 5 we show the losses resulting from the four different loss functions above when simulating data under two time series models for circular serial correlation coefficients on 10 nodes with true covariance matrix v = (0, 1, 0.3) and v = (0, 1, 0.45). Note that the covariance matrix is singular for v = (0, 1, 0.5). Every point in Figure 5 corresponds to 1000 simulations and we considered only the simulations where the least squares estimator is positive semidefinite. We see that especially for small sample sizes and when the true covariance matrix is close to being singular, the MLE has a significantly smaller loss than the least squares estimator. This is to be expected in particular for the entropy loss, since this loss function seems to favor the MLE. However, it may be surprising that even with respect to the l -loss the MLE compares favorably to the least squares estimator. This shows that pushing the estimator to be positive semidefinite, as is the case for the MLE, puts a constraint on all entries jointly and leads even entrywise to an improved estimator. We here only show our simulation results for the time series model for circular serial correlation coefficients. However, we made the same observations for various other models. 5. Discussion. The likelihood function for linear Gaussian covariance models is, in general, multimodal. However, as we have proved in this paper, the multimodality matters only if the model is highly misspecified or if the sample size is not large enough to compensate for the dimension of the model. More precisely, we identified a convex region 2Sn on which the likelihood function is strictly concave. Using recent results linking the distribution of extreme eigenvalues of the Wishart distribution to the Tracy-Widom law, we gave asymptotic certificates for when the true covariance matrix and the 19

20 20 P. ZWIERNIK, C. UHLER, D. RICHARDS Max loss Frobenius loss squared Squared loss Entropy loss Loss Unbiased estimator MLE Loss Unbiased estimator MLE Loss Unbiased estimator MLE Loss Unbiased estimator MLE log(sample size) (a) v = (0, 1, 0.3), l -loss log(sample size) (b) v = (0, 1, 0.3), Frobenius loss log(sample size) (c) v = (0, 1, 0.3), quadratic loss log(sample size) (d) v = (0, 1, 0.3), entropy loss Max loss Frobenius loss squared Squared loss Entropy loss Loss Unbiased estimator MLE Loss Unbiased estimator MLE Loss Unbiased estimator MLE Loss Unbiased estimator MLE log(sample size) (e) v = (0, 1, 0.45), l -loss log(sample size) (f) v = (0, 1, 0.45), Frobenius loss log(sample size) (g) v = (0, 1, 0.45), quadratic loss log(sample size) (h) v = (0, 1, 0.45), entropy loss Fig 5. Comparison of MLE and least squares estimator using different loss functions for the time series model for circular serial correlation coefficients on 10 nodes defined by v = (0, 1, 0.3) (top line) and v = (0, 1, 0.45) (bottom line). global maximum of the likelihood function both lie in the region 2Sn. Since it is known that approximation by the Tracy-Widom law is accurate even for very small sample sizes, this makes our result useful for applications. An important corollary of our work is that: In the case of linear models for p p covariance matrices, where p is as low as 2, a sample size of 14 p suffices for the true covariance matrix to lie in 2Sn with probability at least Therefore, since the goal is to estimate the true covariance matrix, the estimation process should focus on the region 2Sn, and then a boundary point that maximizes the likelihood function over 2Sn could possibly be of more interest than the global maximum. We emphasize that our results provide lower bounds on the probabilities that the maximum likelihood estimation problem for linear Gaussian covariance models is well behaved. This is due to the fact that 2Sn is contained in a larger region over which the likelihood function is strictly concave, and this region is contained in an even larger region over which the likelihood function is unimodal. We believe that the analysis of these larger regions, and an analysis of the robustness with respect to model misspecification, will lead to many interesting research questions.

21 APPENDIX A: CONCENTRATION OF MEASURE INEQUALITIES FOR THE WISHART DISTRIBUTION Let W n be a white Wishart random variable, i.e., W n W p (n, I p ). In the following, we derive finite sample bounds for P( tr(w n ) np ɛ) and for P( log det W n p log(n) ɛ). It is well known that tr(w n ) has a χ 2 np distribution; see, e.g. [Mui82, Theorem ]. So, by Chebyshev s inequality we find that (A.1) P( Tr(W n ) np ɛ) 1 2np ɛ 2. Finite sample bounds that are more accurate than (A.1) can be obtained using [IL06, Ing10, LM00]. We now present the bounds by Laurent and Massart [LM00]. Let X χ 2 d ; using the original formulation of [LM00, page 1325], we obtain for any positive x, P(X d 2 dx + 2x) exp( x), P(d X 2 dx) exp( x). This can be rewritten for ɛ > 0 as ( P(X d ɛ) exp 2( 1 ) ) d + ɛ d(d + 2ɛ), ) P(d X ɛ) exp ( ɛ2. 4d Consequently, we obtain the following result: 21 Proposition A.1. Let W n W p (n, I p ). Then, P( tr(w n ) np ɛ) 1 exp ( 1 ( np + ɛ np(np + 2ɛ)) ) ) exp ( ɛ2. 2 4np We now derive finite sample bounds for log det(w n ). In order to apply Chebyshev s inequality, we calculate the mean and variance of log det(w n ). First, we calculate the moment-generating function of log det(w n ): M(t) := E exp(t log det(w n )) p = E(det(W n ) t ) = E Yi t = p E(Y t i ),

22 22 P. ZWIERNIK, C. UHLER, D. RICHARDS where Y i χ 2 n+1 i and the Y i are mutually independent. It is well known that E(Yi t ) = 2t Γ(t (n + 1 i)) Γ( 1 2 (n + 1 i)), t > 1 2, and in that case we obtain Hence, log M(t) = pt log 2 + M(t) = p p 2 t Γ(t (n + 1 i)) Γ( 1 2 (n + 1 i)). log Γ(t (n + 1 i)) log Γ( 1 2 (n + 1 i)). Differentiating with respect to t and then setting t = 0, we obtain E(log det(w n )) = d dt log M(t) t=0 = p log 2 + p ψ( 1 2 (n + 1 i)), where ψ(t) = d log Γ(t)/dt is the digamma function. Differentiating a second time and then setting t = 0, we obtain Var(log det(w n )) = d2 dt 2 log M(t) t=0 = p ψ 1 ( 1 2 (n + 1 i)), where ψ 1 (t) = d 2 log Γ(t)/dt 2 is the trigamma function. Therefore, by the non-central version of Chebyshev s inequality, we have P( log det W n p log(n)) ɛ) p 1 ψ 1( n+1 i 2 ) + ( p log ( ) 2 n + p n+1 i ψ( 2 ) ) 2 (A.2) ɛ 2. ACKNOWLEDGMENTS We would like to thank Robert D. Nowak, Ming Yuan, Lior Pachter and Mathias Drton for helpful discussions. P.Z. s research is supported by the European Union 7th Framework Programme (PIOF-GA ). This project was started while C.U. was visiting the Simons Institute at UC Berkeley for the program on Theoretical

23 Foundations of Big Data during the fall 2013 semester. D.R. s research was partially supported by the U.S. National Science Foundation grant DMS and by a Romberg Guest Professorship at the Heidelberg University Graduate School for Mathematical and Computational Methods in the Sciences, funded by German Universities Excellence Initiative grant GSC 220/2. REFERENCES [And70] T. W. Anderson. Estimation of covariance matrices which are linear combinations or whose inverses are linear combinations of given matrices. In Essays in Probability and Statistics, pages University of North Carolina Press, Chapel Hill, N.C., [And73] T. W. Anderson. Asymptotically efficient estimation of covariance matrices with linear structure. Annals of Statistics, 1: , [Bai99] Z. D. Bai. Methodologies in spectral analysis of large-dimensional random matrices, a review. Statistica Sinica, 9(3): , With comments by G. J. Rodgers and J. W. Silverstein; and a rejoinder by the author. [CDR07] S. Chaudhuri, M. Drton, and T. Richardson. Estimation of a covariance matrix with zeros. Biometrika, 94: , [DR04] M. Drton and T. S. Richardson. Multimodality of the likelihood in the bivariate seemingly unrelated regressions model. Biometrika, 91(2): , [DS01] K. R. Davidson and S. J. Szarek. Local operator theory, random matrices and banach spaces. Handbook of the Geometry of Banach Spaces, 1: , [EDBN10] B. Eriksson, G. Dasarathy, P. Barford, and R. Nowak. Toward the practical use of network tomography for internet topology discovery. In IEEE INFOCOM, pages 1 9, [Fel73] J. Felsenstein. Maximum-likelihood estimation of evolutionary trees from continuous characters. American Journal of Human Genetics, 25: , [Fel81] J. Felsenstein. Evolutionary trees from gene frequencies and quantitative characters: Finding maximum likelihood estimates. Evolution, 35: , [IL06] T. Inglot and T. Ledwina. Asymptotic optimality of new adaptive test in regression model. Annales de l Institut Henri Poincare (B) Probability and Statistics, 42: , [Ing10] T. Inglot. Inequalities for quantiles of the chi-square distribution. Probability and Mathematical Statistics, 30: , [JMPS09] I. M. Johnstone, Z. Ma, P. O. Perry, and M. Shahram. RMTstat: Distributions, Statistics and Tests derived from Random Matrix Theory, R package version 0.2. [Joh01] I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 29: , [LM00] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 28: , [Ma12] Z. Ma. Accuracy of the Tracy-Widom limits for the extreme eigenvalues in white Wishart matrices. Bernoulli, 18: ,

24 24 P. ZWIERNIK, C. UHLER, D. RICHARDS [Mui82] R. J. Muirhead. Aspects of Multivariate Statistical Theory. John Wiley & Sons, Inc., New York, Wiley Series in Probability and Mathematical Statistics. [Pou11] M. Pourahmadi. Covariance estimation: The GLM and regularization perspectives. Statistical Science, 3: , [RB94] B. Ruoyong and J. O. Berger. Estimation of a covariance matrix using the reference prior. Annals of Statistics, 22: , [RM94] P. J. Rousseeuw and G. Molenberghs. The shape of correlation matrices. The American Statistician, 48: , [SO91] A. Stuart and J. K. Ord. Kendalls Advanced Theory of Statistics 2. Classical Inference and Relationship. Arnold, [SWY00] C. G. Small, J. Wang, and Z. Yang. Eliminating multiple root problems in estimation. Statistical Science, 15: , [Sza78] T. H. Szatrowski. Explicit solutions, one iteration convergence and averaging in the multivariate normal estimation problem for patterned means and covariances. Annals of the Institute of Statistical Mathematics A, 30:81 88, [Sza80] T. H. Szatrowski. Necessary and sufficient conditions for explicit solutions in the multivariate normal estimation problem for patterned means and covariances. Annals of Statistics, 8: , [Sza85] T. H. Szatrowski. Patterned covariances. In Encyclopedia of Statistical Sciences, pages Wiley, New York, [Tao12] T. Tao. Topics in Random Matrix Theory, volume 30. American Mathematical Society, Providence, R.I., [TW96] C. A. Tracy and H. Widom. On orthogonal and symplectic matrix ensembles. Communications in Mathematical Physics, 177: , [TYBN04] Y. Tsang, M. Yildiz, P. Barford, and R. Nowak. Network radar: tomography from round trip time measurements. In Internet Measurement Conference (IMC), pages , [Ver10] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arxiv preprint arxiv: , P. Zwiernik Department of Statistics University of California, Berkeley Berkeley, CA 94720, U.S.A. pzwiernik@berkeley.edu D. Richards Department of Statistics Penn State University University Park, PA 16802, U.S.A. richards@stat.psu.edu C. Uhler IST Austria 3400 Klosterneuburg Austria caroline.uhler@ist.ac.at

Parameter estimation in linear Gaussian covariance models

Parameter estimation in linear Gaussian covariance models Caroline Uhler (IST Austria) Joint work with Piotr Zwiernik (UC Berkeley) and Donald Richards (Penn State University) Big Data Reunion Workshop