arxiv: v1 [math.st] 24 Aug 2014

Size: px
Start display at page:

Download "arxiv: v1 [math.st] 24 Aug 2014"

Transcription

1 MAXIMUM LIKELIHOOD ESTIMATION FOR LINEAR GAUSSIAN COVARIANCE MODELS arxiv: v1 [math.st] 24 Aug 2014 By Piotr Zwiernik, Caroline Uhler, and Donald Richards University of California, Berkeley, IST Austria, Penn State University We study parameter estimation in linear Gaussian covariance models, which are p-dimensional Gaussian models with linear constraints on the covariance matrix. Maximum likelihood estimation for this class of models leads to a non-convex optimization problem which typically has many local optima. We prove that the log-likelihood function is concave over a large region of the cone of positive definite matrices. Using recent results on the asymptotic distribution of extreme eigenvalues of the Wishart distribution, we provide sufficient conditions for any hill climbing method to converge to the global optimum. The proofs of these results utilize large-sample asymptotic theory under the scheme n/p γ > 1. Remarkably, our numerical simulations indicate that our results remain valid for min{n, p} as small as 2. An important consequence of this analysis is that for sample sizes n 14p, maximum likelihood estimation for linear Gaussian covariance models behaves as if it were a convex optimization problem. 1. Introduction. In various statistical applications the covariance matrix carries a certain structure and has to satisfy certain constraints. Pouramadhi [Pou11] gives an excellent review about covariance estimation in general. In this paper, we study Gaussian models with linear constraints on the covariance matrix. Such models arise for example in maximum likelihood estimation of correlation matrices for Gaussian populations [RM94, SWY00, SO91] or covariance matrices with prescribed zeros [DR04, CDR07], applications of stationary stochastic processes from repeated time series data [And70, And73], Brownian motion tree models for phylogenetic analyses [Fel73, Fel81], and network tomography models for analyzing the structure of the connections in the Internet [TYBN04, EDBN10]. To define Gaussian models with linear constraints on the covariance matrix, let S p be the set of symmetric p p matrices and denote by S p 0 the open convex cone in S p of positive definite matrices. For v = (v 1,... v r ) R r, the r-dimensional Euclidean space, a random vector X taking values in R p is said MSC 2010 subject classifications: Primary 62H12, 62F30; Secondary 90C25, 52A41 Keywords and phrases: Linear Gaussian covariance model; Convex optimization; Tracy- Widom law; Wishart distribution; Eigenvalues of random matrices; Brownian motion tree model; Network tomography; 1

2 2 P. ZWIERNIK, C. UHLER, D. RICHARDS to satisfy the linear Gaussian covariance model given by G 0, G 1,..., G r S p if X follows a multivariate Gaussian distribution with mean vector µ and covariance matrix Σ v, denoted X N p (µ, Σ v ), where Σ v = G 0 + r v ig i. In this paper we make no assumptions on the mean vector µ and always use the sample mean as estimator for µ. Then the linear covariance model, which we denote by M(G) with G = (G 0,..., G r ), can be parametrized by { (1.1) Θ G = v = (v 1,... v r ) R r r } G0 + v i G i S p 0. We assume that G 1,..., G r are linearly independent, meaning that Σ v G 0 is the zero matrix if and only if v = 0. This assumption ensures that the model is identifiable. We also assume that G 0 is orthogonal to the linear subspace spanned by G 1,..., G r. Typically G 0 is either the zero matrix or the identity matrix and the linear constraints on the diagonal and the offdiagonal elements are disjoint, hence satisfying orthogonality. Linear Gaussian covariance models were introduced by Anderson [And70], motivated by the linear structure of covariance matrices in different time series models. Anderson studied maximum likelihood estimation for these models and suggested iterative procedures such as the Newton-Raphson method [And70] and a scoring method [And73] for finding the maximum likelihood estimator (MLE). An example of a linear Gaussian covariance model is the set of all p p correlation matrices; this model is defined by Θ G = {v R (p 2) Ip + } v ij (E ij + E ji ) S p 0, 1 i<j p where I p denotes the identity matrix of size p p and E ij is a p p matrix, where the (i, j) entry is 1 and all other entries are 0. Another example of a linear Gaussian covariance model is the Brownian motion tree model [Fel73]: Given a rooted tree T on the set of nodes V = {1,..., r} and with p < r leaves, the corresponding Brownian motion tree model consists of all covariance matrices of the form Σ v = i V v i e de(i) e T de(i), where e de(i) R p is a 0/1-vector with entry 1 at position j if leaf j is a descendent of node i and 0 otherwise. Note that every leaf is a descendent of itself. For example, when T is a star tree on 3 leaves, then v 1 + v 4 v 4 v 4 (1.2) Σ v = v 4 v 2 + v 4 v 4, v 4 v 4 v 3 + v 4

3 where v 1, v 2, v 3 parametrize the leaves and v 4 parametrizes the root. We draw from N (µ, Σ v ) a random sample X 1,..., X n and form the corresponding sample covariance matrix 3 S n = 1 n n (X i X)(X i X) T, where X = 1 n n X i. Up to a constant term, the log-likelihood function l: Θ G R, is given by (1.3) l(v) = n 2 log det Σ v n 2 tr(s nσ 1 v ). Since the linear constraints are on the covariance matrix, the log-likelihood function is typically non-convex and may have multiple local optima. This complicates inference and parameter estimation considerably. A classical example of this problem is the case of estimating the correlation matrix of a Gaussian model, a venerable problem that has been sought after for decades [RM94, SWY00, SO91]. In Section 3 we show how the theory we develop in this paper can be used to solve this problem. A special class of linear Gaussian covariance models where maximum likelihood estimation is unproblematic are models where Σ and Σ 1 have the same pattern, i.e., Σ = G 0 + r v i G i and Σ 1 = G 0 + r w i G i. An example of such models are Brownian motion tree models on the star tree as given in (1.2) with equal variances. In fact, Szatrowski showed in a series of papers [Sza78, Sza80, Sza85] that the MLE for linear Gaussian covariance models has an explicit representation, i.e., it is a known linear combination of entries of the sample covariance matrix, if and only if Σ and Σ 1 have the same pattern. Szatrowski proved that for this model class the MLE is the average of the corresponding elements of the sample covariance matrix and that Anderson s scoring method [And73] yields the MLE in one iteration when started in any positive definite matrix that is in the model. In this paper, we show that for general linear Gaussian covariance models the problem of maximizing the log-likelihood function is, with high probability, essentially a convex optimization problem and the MLE can be found using any hill-climbing method such as the Newton-Raphson algorithm. To be more precise, in Section 2 we analyze the likelihood function for linear Gaussian covariance models, proving that the log-likelihood function is

4 4 P. ZWIERNIK, C. UHLER, D. RICHARDS strictly concave in the convex region (1.4) 2Sn := {v Θ G 0 Σ v 2S n }. We also prove that 2Sn contains the true data-generating parameter, the global maximum of the likelihood function and the least squares estimator with high probability. Therefore, the problem of maximizing the loglikelihood function is essentially a convex problem and any hill-climbing algorithm, when initiated at the least squares estimator, will remain in the convex region 2Sn and converge to the MLE in a monotonic increasing manner. Note that the region 2Sn is contained in a larger subset of Θ G where the likelihood function is concave. This subset is, in turn, contained in an even larger region where the likelihood function is unimodal. Therefore, the probability bounds that we derive in this paper are lower bounds for the exact probabilities that the optimization procedure is well behaved in the sense that any hill-climbing method will converge monotonically to the MLE. In addition to the theoretical analysis of the behavior of the log-likelihood function, we also investigate the performance of the Newton-Raphson method for finding the MLE on simulated data. In Section 3, we discuss the problem of estimating a correlation matrix, a covariance matrix with linear constraints on the diagonal. In Section 4, we analyze how the MLE compares to the least squares estimator in terms of various loss functions and show that for linear Gaussian covariance models the MLE is usually a better estimator than the least squares estimator. 2. Convexity of the likelihood function. Given a sample covariance matrix S n based on a random sample of size n from N (µ, Σ ), we denote by 2Sn the convex subset of the parameter space as in (1.4). Consider the set D 2Sn = {Σ S p 0 0 Σ 2S n}. Note that D 2Sn and 2Sn are random and depend on the number of observations n. By construction, D 2Sn contains a matrix Σ v (i.e., D 2Sn Σ v ) if and only if 2Sn contains v (i.e., 2Sn v). Therefore, we can identify 2Sn with D 2Sn and use the notations 2Sn Σ v and 2Sn v interchangeably. In this section, we analyze the log-likelihood function l(v) for linear Gaussian covariance models. We prove that l(v) is strictly concave over the region 2Sn. Then we give probabilistic guarantees for the true data-generating covariance matrix Σ, the global optimum ˆv, and the least squares estimator v, to be contained in the convex region 2Sn.

5 2.1. The log-likelihood function. We denote by i = / v i the partial derivative with respect to v i, i = 1,..., r. The log-likelihood function is given in (1.3), and its first-order partial derivatives are i l(v) = n 2 tr(σ 1 v Since i Σ v = G i for i = 1,..., r we obtain and hence (2.1) i Σ v ) + n 2 tr(s nσ 1 v ( i Σ v )Σ 1 v ). 2 n i l(v) = tr(σ 1 v G i ) + tr(s n Σ 1 v G i Σ 1 2 n v l(v) = [tr((σ 1 v Σ 1 v v ) S n Σ 1 )G i )] r. Next, we analyze the second-order partial derivatives of l(v). For every i, j = 1,..., r a simple computation shows that 2 n 2 l(v) = tr(σ 1 v G i (2Σ 1 v S n Σ 1 v i j In particular, for every y R r we obtain (2.2) y T ( v T v l(v)) y = tr(σ 1 v v Σ y (2Σ 1 S n Σ 1 v Σ 1 v )G j ). v Σ 1 v )Σ y ), where Σ y = r y ig i, and we note that Σ y is symmetric but not necessarily positive definite. We now use Equation (2.2) to prove that l(v) is negative definite on the convex region 2Sn defined in (1.4): 5 Lemma 2.1. If 2Sn v, then v T v l(v) is negative definite. Proof. We reorder the terms in Equation (2.2) to obtain for each y R r, y T ( v T v l(v)) y = tr(σ 1 v Σ y (2Σ 1 = tr((2s n Σ v )Σ 1 v v S n Σ 1 v By assumption 2S n Σ v 0. Also, Σ 1 v Σ y Σ 1 v Σ y Σ 1 v so we have y T ( v v T l(v)) y 0. Σ 1 v )Σ y ) Σ y Σ 1 v Σ y Σ 1 v ). is positive semidefinite, Therefore, v v T l(v) is negative semidefinite. Next, suppose that y T ( v v T l(v)) y = 0 for some y. Again because 2S n Σ v is positive definite and Σ 1 v Σ y Σ 1 v Σ y Σ 1 v is positive semidefinite, it follows that Σ 1 v Σ y Σ 1 v Σ y Σ 1 v = 0, equivalently, Σ y Σ 1 v Σ y = 0. Since Σ y Σ 1 v Σ y has the same set of eigenvalues as Σ 1/2 v Σ 2 yσ 1/2 v, we obtain Σ y = 0, which holds if and only if y = 0. This completes the proof.

6 6 P. ZWIERNIK, C. UHLER, D. RICHARDS As a direct consequence of Lemma 2.1 we obtain the following result: Proposition 2.2. The log-likelihood function l: Θ G R is strictly concave on 2Sn. In particular, maximizing l(v) over 2Sn is a convex problem The true covariance matrix and the Tracy-Widom law. For A S p, we denote by λ min (A) the minimal eigenvalue of A and by λ max (A) the maximal eigenvalue of A. We denote by A the spectral norm of A, i.e., A = max( λ min (A), λ max (A) ). In the following, we will often make use of the well-known fact that (2.3) A < ɛ if and only if ɛ I p A ɛ I p. Given a sample covariance matrix S n based on a random sample of size n p from N (µ, Σ ), then ns n follows a Wishart distribution, W p (n 1, Σ ). If Σ = I p, we say that ns n follows a white Wishart (or standard Wishart) distribution. Thus, W n 1 := nσ 1/2 S n Σ 1/2 W p (n 1, I p ); and the condition 2S n Σ 0 is equivalent to W n 1 n 2 I p, which holds if and only if λ min (W n 1 ) > n/2. Since by assumption Σ 0 we obtain the following lemma: Proposition 2.3. The probability that the true data-generating covariance matrix Σ is contained in 2Sn does not depend on Σ and is equal to the probability that λ min (W n 1 ) > n/2, where W n 1 W p (n 1, I p ). It is generally non-trivial to approximate the distributions of the extreme eigenvalues of the Wishart distribution; Muirhead [Mui82, Section 9.7] surveyed the results available and provided expressions for the distributions of the extreme eigenvalues in terms of zonal polynomial series expansions, which are difficult to approximate accurately. Nevertheless, substantial progress has been achieved recently in the asymptotic scenario in which n and p are large and n/p γ 1. It is noteworthy that these asymptotic approximations are very accurate even for values of n as small as 2. We refer to [Bai99, Joh01] for a review of these results. In order to describe the probability that the true data-generating covariance matrix Σ lies in 2Sn, we build on the recent work of Ma [Ma12], who gave improved approximations for the extreme eigenvalues of the white Wishart distribution. Ma defined the quantities, ( µ n,p = (n 1 2 )1/2 (p 1 2 )1/2) 2, σ n,p = ((n ( 1 2 )1/2 (p 1 2 )1/2) (p 1 2 ) 1/2 (n 1 2 ) 1/2) 1/3

7 7 and (2.4) τ n,p = σ n,p µ n,p, ν n,p = log(µ n,p ) τ 2 n,p. In the following, we denote by F the Tracy-Widom distribution [TW96]. Then, Ma proved the following theorem: Theorem 2.4. (Ma [Ma12]) Suppose that W n has a white Wishart distribution, W p (n, I p ), with n > p. Then, as n and n/p γ (1, ), log λ min (W n ) ν n,p τ n,p L G, where G(z) = 1 F ( z) denotes the reflected Tracy-Widom distribution. As a consequence of Ma s theorem we obtain: P(λ min (W n 1 ) > n ( log 2 ) = P λmin (W n 1 ) ν n 1,p > log(n/2) ν ) n 1,p τ n 1,p τ n 1,p ( ) log(n/2) νn 1,p 1 G, τ n 1,p where, for sequences a n,p and b n,p, we use the notation a n,p b n,p to describe that a n,p /b n,p 1 as n and n/p γ (1, ). So, we conclude that ( ) (2.5) P( 2Sn Σ νn 1,p log(n/2) ) F. τ n 1,p Ma also showed that the approximation by the reflected Tracy-Widom law in Theorem 2.4 is of order O(min(n, p) 2/3 ). Although this statement is proven rigorously only for even p (see [Ma12, Theorem 2]), numerical simulations in [Ma12] suggest that this result holds also for odd p. Another important consequence of Theorem 2.4 is that P( 2Sn Σ ) is very well approximated by a function that depends only on the ratio n/p. In Figure 1, we provide graphs from simulated data to analyze the accuracy of the approximation by the Tracy-Widom law for small p. By Proposition 2.3, P( 2Sn Σ ) = P(λ min (W n 1 ) > n 2 ), a quantity which does not depend on Σ, so we used the white Wishart distribution in our simulations. In each plot in Figure 1, the dimension p is fixed and the sample size n varies between p and 20p. The solid blue line is the simulated probability that λ min (W n 1 ) > n/2 and the dashed black line is the corresponding approximation resulting from the Tracy-Widom distribution, for which we used

8 8 P. ZWIERNIK, C. UHLER, D. RICHARDS p=3 p=5 p=10 Frequency Simulated probability T W approximation 0.95 line Frequency Simulated probability T W approximation 0.95 line Frequency Simulated probability T W approximation 0.95 line Sample size (a) p = Sample size (b) p = Sample size (c) p = 10 Fig 1. In each plot, p {3, 5, 10} is fixed and n varies between p and 20p. The solid blue line provides the simulated values of P(λ min(w n 1) > n/2) and the dashed black line is the corresponding approximation by the reflected Tracy-Widom distribution. the R package RMTstat[JMPS09]. The values for each pair (p, n) are based on 10,000 replications. Figure 1 shows that the approximation is extremely good even for small values of p. For large values of n/p the two lines coincide even when p = 3. The approximation is nearly perfect over the whole range of n as long as p 5 and if p 10 the two lines coincide. As an example, suppose we wish to obtain P( 2Sn Σ ) > We use the approximation in (2.5) and the fact that F (0.98) 0.95 to compute n such that (2.6) ν n 1,p log(n/2) τ n 1,p > 0.98 for given p. In Table 1, we provide for various values of p the minimal sample size resulting in P( 2Sn Σ ) > It is interesting to study the behavior of the curves in Figure 1 for growing values of p. Our simulations show that its graph converges to the graph of the step function, that is a function which is zero for γ < γ and is one for γ γ, where γ is some fixed number. This observation suggests that for large p the minimal sample size such that P( 2Sn Σ ) > 0.95 is equal to the minimal sample size such that P( 2Sn Σ ) > 0. in the following result we formalize this observation. Table 1 The minimal sample sizes resulting in P( 2Sn Σ ) > p large n p

9 Proposition 2.5. Define a sequence (f n,p ) n,p=1 of numbers given by ( ) νn 1,p log(n/2) := F, f n,p τ n 1,p where ν n 1,p, τ n 1,p are defined as in (2.4) and F is the Tracy-Widom distribution. Define γ = Suppose that n, p and n/p γ > 1. Then { 1 if γ γ f n,p 0 if γ < γ. Proof. Consider the expression ν n 1,p log(n/2) τ n 1,p and fix n = γp letting p go to infinity. Direct computation shows that the limit of this expression is + if γ γ and otherwise. The result follows from F ( ) = 0 and F (+ ) = 1. Proposition 2.5 shows that the choice of the threshold 0.95 to construct Table 1 is irrelevant for large p. Moreover, note that the results in this section do not depend on the chosen model. They can be applied to any Gaussian model, where the interest lies in estimating the covariance matrix The global maximum of the log-likelihood function. We denote the global maximum of the log-likelihood function l(v) by ˆv and the corresponding covariance matrix by Σˆv. In the following theorem, we apply certain concentration of measure results given in the appendix to obtain finite sample bounds on the probability that 2Sn Σˆv. By construction, if Σˆv exists, then Σˆv 0. Hence it suffices to analyze the probability that 2S n Σˆv 0. Theorem 2.6. Let ˆv be the MLE based on a sample covariance matrix S n of a linear Gaussian covariance model with parameter v R r and corresponding true covariance matrix Σ v. Let ɛ = 1 2 log( 1 2 ) Then, p ψ 1 ( n i 2 2(n 1)p ) + ( p log( 2 n 1 ) + p ψ( n i 2 )) 2 P( 2Sn Σˆv ) 1 n 2 ɛ 2 ɛ 2, where ψ( ) denotes the digamma function and ψ 1 ( ) the trigamma function. As n, p 1 2(n 1)p n 2 ɛ 2 ψ 1 ( n i 2 ) + ( p log( 2 n 1 ) + p ɛ 2 ψ( n i 2 )) 2 9 = 1 4p nɛ 2 +O ( 1 n 2 ).

10 10 P. ZWIERNIK, C. UHLER, D. RICHARDS Proof. Let w Θ G be a point outside of 2Sn that gives the highest value of the likelihood function. Such a w must exists since the unconstrained likelihood over S p 0 has a unique maximum at S n. Since Σ w is positive definite, it lies outside of 2Sn if and only if 2S n Σ w 0. This means that Σ 1/2 w S n Σ 1/2 w 1 2 I p, or equivalently, λ min (Σ 1/2 w S n Σ 1/2 w ) 1 2. Since l(ˆv) l(v ) and the log-likelihood function is concave over 2Sn, we obtain P( 2Sn Σˆv ) P(l(v ) > l(w)) = P(l(v ) l(w)). Now note that P(l(v ) l(w)) = P(log det(σ w )+tr(s n Σ 1 w ) log det(σ v ) tr(s n Σ 1 v ) 0). Since tr(s n Σ 1 v ) = 1 n tr(w n 1), where W n 1 W(n 1, I p ), we can apply Chebyshev s inequality (see (A.1) in the Appendix) to obtain that with probability at least 1 2(n 1)p n 2 ɛ 2, (2.7) In addition, n 1 n p ɛ tr(s nσ 1 v ) n 1 n p + ɛ. log det S n log det Σ v = log det W n 1 p log(n), and we can again apply Chebyshev s inequality (see (A.2) in the Appendix) to obtain that with probability at least p p det := 1 ψ 1( n i 2 ) + ( p log( 2 n ) + p n i ψ( 2 )) 2 ɛ 2, we have that (2.8) log det S n ɛ log det Σ v log det S n + ɛ. The events (2.7) and (2.8) are not independent. So we use P(A B) P(A) + P(B) 1 and find that with probability at least 1 2(n 1)p n 2 ɛ 2 p det, log det(σ w ) + tr(s n Σ 1 w ) log det(σ v ) tr(s n Σ 1 v ) log det(σ w ) + tr(s n Σ 1 w ) log det(s n ) n 1 n p 2ɛ log det(σ 1/2 w S n Σw 1/2 ) + tr(s n Σ 1 w ) p 2ɛ.

11 We now show that the last expression is nonnegative when ɛ 1 2 log( 1 2 ) 1 4. We denote the eigenvalues of Σ 1/2 w S n Σ 1/2 w by λ 1 λ p. Then log det(σ 1/2 w S n Σ 1/2 w ) + tr(s n Σ 1 w ) p 2ɛ = 11 p (λ i 1 log(λ i )) 2ɛ. Note that the function f(x) = x 1 log(x) is non-negative for x > 0, strictly decreasing for 0 < x < 1, reaches a global minimum at x = 1, and is strictly increasing for x > 1. Also note that by assumption 0 < λ 1 < 1/2. Hence, p (λ i 1 log(λ i )) 2ɛ p (λ i 1 log(λ i )) 1/2 log(1/2) 2ɛ i=2 0, 1/2 log(1/2) 2ɛ when ɛ 1 2 log( 1 2 ) 1 2(n 1)p 4. Since 1 p n 2 ɛ 2 det is an increasing function in ɛ, we plug in ɛ = 1 2 log( 1 2 ) 1 4. As n, 1 2p nɛ 2 p det which completes the proof. = 1 2p nɛ 2 p ( 2 n + O( 1 ) ) + ( p log( 2 n 2 n ) + p( log( n 2 ) + O( 1 n ))) 2 ɛ 2 = 1 4p ( 1 ) nɛ 2 + O n 2, More accurate finite sample bounds for this result can be obtained by applying inequalities for the quantiles of the χ 2 -distribution instead of Chebyshev s inequality. Such bounds are given in Proposition A.1 in the Appendix Least squares estimation for linear Gaussian covariance models. Anderson [And70] describes an unbiased estimator for the covariance matrix and suggests using this unbiased estimator as a starting point for the Newton- Raphson algorithm. He treats the case G 0 = 0 and suggests the unbiased estimator given by the solution to the following set of linear equations: (2.9) r v i tr(g i G j ) = n n 1 tr(s ng j ), j = 1,..., r. We denote by v the estimator obtained from solving (2.9) without the scaling by. As we show now, v is the least squares estimator and is obtained by n n 1

12 12 P. ZWIERNIK, C. UHLER, D. RICHARDS orthogonally projecting S n onto the linear subspace defining M(G): Let G denote the p 2 r-matrix whose columns are given by vec(g i ), i = 1,..., r. Define H G := G(G T G) 1 G T. The fact that G T G is positive definite and hence invertible follows from the assumption that G i, i = 1,..., r, are linearly independent. The matrix H G is a symmetric matrix representing the projection in R p2 onto the linear subspace spanned by vec(g i ), i = 1,..., r. In particular, it holds that HG 2 = H G. The defining equations for v can be expressed as v = (G T G) 1 G T vec(s n ), and since G 0 is orthogonal to the linear subspace spanned by G 1,..., G r, vec(σ v ) = vec(g 0 ) + G v = vec(g 0 ) + H G vec(s n ). For instance, for Brownian motion tree models G 0 = 0 and Σ v can be seen as a sort of average over all paths between siblings in the tree. For a star tree model with p leaves Σ v is of the form (2.10) (Σ v ) ii = (S n ) ii and (Σ v ) ij = ( 1 p ) (S n ) ij. 2 i<j We now prove that Σ v also lies in 2Sn with high probability. Observe that if W n has a standard Wishart distribution W p (n, I p ), then we can write W n = AA T, where A is a p n matrix whose entries are independent standard normal random variables. We denote the minimal and maximal singular values of A by s min (A) and s max (A), respectively. So we have s min (A) = λmin (W n ) and s max (A) = λ max (W n ). We will make use of the following two results from [DS01] and [Ver10, Section 5.3.1]: Theorem 2.7 (Theorem II.13 in [DS01]). Let A be a p n matrix whose entries are independent standard normal random variables. Then for every t 0, with probability at least 1 2 exp( t 2 /2) we have n p t smin (A) s max (A) n + p + t. Lemma 2.8 (Lemma 5.36 in [Ver10]). Consider a matrix B that satisfies BB T I p max(δ, δ 2 ) for some δ > 0. Then 1 δ s min (B) s max (B) 1 + δ.

13 We now use these results to give exponential finite sample bounds on the probability that Σ v 2Sn. As we will see, these bounds hold also when n and p are large as long as their ratio n/p is at least Let ˆv be the MLE based on a sample covariance matrix S n of a linear Gaussian covariance model with parameter v R r and corresponding true covariance matrix Σ v. Theorem 2.9. Let Σ v denote the least squares estimator based on a sample covariance matrix S n of a linear Gaussian covariance model given by Σ v. Let γ = n/p and suppose that γ > Then 13 ɛ = ɛ(γ) := γ > 0 and 2S n Σ v 0 with probability at least 1 2 exp( nɛ 2 /2). Also, Σ v 0 with probability at least 1 2 exp( nɛ 2 /2). Consequently, P( 2Sn Σ v ) 1 4 exp( nɛ 2 /2). Proof. Note that 2S n Σ v 0 if and only if C := Σ 1/2 v S nσ 1/2 v + Σ 1/2 v (S n Σ v )Σ 1/2 v 0, or equivalently, if λ min (C) > 0. The eigenvalue stability inequality (see [Tao12, Equation (1.63)]) says that for symmetric matrices A and B and for any eigenvalue λ i, (2.11) λ i (A + B) λ i (A) B. As a special case of this inequality we obtain λ min (A + B) λ min (A) B and hence (2.12) λ min (C) λ min (Σ 1/2 v S nσ 1/2 Σ v ) 1/2 v (S n Σ v )Σ 1/2 v λ min (Σ 1/2 v S nσ 1/2 v ) Σ 1 v S n Σ v = 1 n λ min(w n 1 ) Σ 1 v S n Σ v, where W n 1 := nσ 1/2 v S nσ 1/2 v W (n 1, I p). For the second inequality we used the fact that ABA A 2 B and that A 2 = A 2 for A positive definite. Since v is the least squares estimator, S n Σ v F S n Σ v F.

14 14 P. ZWIERNIK, C. UHLER, D. RICHARDS So as a consequence of standard inequalities on matrix norms we obtain S n Σ v S n Σ v F S n Σ v F p S n Σ v. This implies that (2.13) λ min (C) 1 n λ min(w n 1 ) p Σ 1 v S n Σ v = 1 n λ min(w n 1 ) p ( ) Σ 1 v Σ 1/2 1 v n W n 1 I p Σ 1/2 v 1 n λ min(w n 1 ) p Σ 1 v Σ v 1 n W n 1 I p. Let κ := Σ 1 v Σv 1 denote the condition number of Σ v. Define δ := 2 + κ p κ p(2 + κ p)). 2 For fixed κ this expression as a function of p [1, ) is monotonically decreasing and converges to 1/2. Hence, δ > 1 2. In addition, δ is bounded above by n λ min(w n 1 ) 1 δ and hence the last line in (2.13) can be further bounded as follows: 2, which is attained when p = 1 and κ = 1. Suppose that 1 n W n 1 I p δ. Then, by Lemma 2.8, 1 n λ min(w n 1 ) κ p 1 n W n 1 I p (1 δ)2 κδ p = 0. As a consequence, P(2S n Σ v 0) = P(λ min (C) 0) ( ) 1 P n W n 1 I p δ, and it is therefore sufficient to bound P ( 1 n W n 1 I p δ ). Note that 1 n W n 1 I p δ is equivalent to (2.14) (1 δ)n λ min (W n 1 ) λ max (W n 1 ) (1 + δ)n. By Theorem 2.7 we have that for every t 0 n 1 n γ t λ min (W n 1 ) λ max (W n 1 ) n 1 + n γ + t with probability at least 1 2 exp( t 2 /2). To bound the probability of event (2.14) we fix t as follows: t := ( ) 1 n 1 n + δ n 1. γ

15 15 Since δ 1 2 and n 1 n < 1, t n ( ) , γ which is nonnegative when γ So since t 0, we can apply Theorem 2.7 to conclude that ( n 1 n 2 n ) 1 + δ λ min (W n 1 ) λ max (W n 1 ) n 1 + δ with probability at least 1 2 exp( t 2 /2). Note, however, that n 1 δ ( n 2 n 1 n 1 + δ ), whenever n 15, which holds by the constraints on γ. Hence also with probability at least 1 2 exp( t 2 /2) we have that n 1 δ λmin (W n 1 ) λ max (W n 1 ) n 1 + δ, which is equivalent to (2.14). This establishes the exponential bound on the probability that 2S n Σ v 0. The same bound holds for the event Σ v 0, since Σ v 0 is equivalent to C := Σ 1/2 v S nσ 1/2 v + Σ 1/2 v (Σ v S n )Σ 1/2 v 0 or in other words λ min (C ) > 0. Again by the eigenvalue stability inequality (2.11) we have λ min (C ) λ min (Σ 1/2 v S nσ 1/2 v ) Σ 1/2 v (S n Σ v )Σ 1/2 v, the same expression as in (2.12). Finally, because the two events {Σ v 0} and {2S n Σ v 0} are not independent, we utilize the inequality P(A B) P(A) + P(B) 1 to complete the proof. We now show simulation results for P( Sn Σ v ) for a specific example, namely for Brownian motion tree models on the star tree. Example Consider the model M(G), where G 0 is the zero matrix, G i = E ii for i = 1,..., p, where E ii is a matrix with exactly one 1 at position (i, i) and zeros otherwise, and G p+1 = 11 T. This linear covariance model corresponds to a Brownian motion tree model on the star tree. Suppose that the true covariance matrix Σ v is given by vi = i for i = 1,..., p and = 1. We performed simulations similar to the ones that led to Figure 1. v p+1

16 16 P. ZWIERNIK, C. UHLER, D. RICHARDS p=3 p=5 p=10 Frequency Simulated probability T W approximation 0.95 line Frequency Simulated probability T W approximation 0.95 line Frequency Simulated probability T W approximation 0.95 line Sample size (a) p = Sample size (b) p = Sample size (c) p = 10 Fig 2. Illustration of Example In each plot p {3, 5, 10} is fixed and n varies between p and 20p. The solid blue curve shows the simulated probability that 0 Σ v 2S n and the dashed black curve is the approximation of the probability that 0 Σ v 2S n by the reflected Tracy-Widom distribution (see Figure 1). For fixed p {3, 5, 10} we let n vary between p and 20p. For each pair (p, n) we generated 10, 000 times a sample of size n from N p (0, Σ v ), computed the corresponding sample covariance matrix and the least squares estimator, and checked if the least squares estimator lies in the region 2Sn. The simulated probabilities of the event 0 Σ v 2S n are given by the solid blue line in Figure 2. As in Figure 1 the dashed black line represents the approximated probabilities (by the Tracy-Widom law) of the event 0 Σ v 2S n. Figure 1 indicates that on average Σ v lies in 2Sn more often than Σ v. 3. Estimating correlation matrices. In this section, we discuss the problem of maximum likelihood estimation of correlation matrices, a venerable problem that becomes difficult even for 3 3 matrices [RM94, SWY00, SO91]. We demonstrate how this problem can be solved using the Newton- Raphson method and show how the Newton-Raphson algorithm performs for estimating a 3 3 and a 4 4 correlation matrix. We start the algorithm in the least squares estimator. For correlation models with no additional structure, the least squares estimator is given by v ij = (S n ) ij, 1 i < j p and the corresponding correlation matrix is Σ v = I p + v ij (E ij + E ji ), 1 i<j p where E ij is the zero matrix with one 1-entry in position (i, j). Let v (k) be

17 17 Paths Paths Paths Ratio S Newton steps (a) n = 10 Ratio S Newton steps (b) n = 50 Ratio S Newton steps (c) n = 100 Fig 3. Newton paths for the 3 3 correlation matrix (a). Ratio Paths S Newton steps (a) n = 10 Ratio Paths S Newton steps (b) n = 50 Ratio Paths S Newton steps (c) n = 100 Fig 4. Newton paths for the 4 4 correlation matrix (b). the k-th step estimate of the parameter vector v obtained using the Newton- Raphson algorithm. In step k + 1 we compute the update r k+1 := v (k+1) v (k) = v T v l(v (k) ) l(v (k) ). The gradient is given in (2.1) and the Hessian in Lemma 2.1. In the following, we show how the Newton method performed on two examples. We sampled n observations from a multivariate normal distribution with correlation matrix 1 1/2 1/3 1/4 1 1/2 1/3 (a) Σ v = 1/2 1 1/4, (b) Σ v = 1/2 1 1/5 1/6 1/3 1/5 1 1/7. 1/3 1/4 1 1/4 1/6 1/7 1 In Figure 3 we show the Newton paths for the 3 3 correlation matrix given in (a) and in Figure 4 for the 4 4 correlation matrix given in (b). Note

18 18 P. ZWIERNIK, C. UHLER, D. RICHARDS Table 2 Probability that the MLE is on the boundary of 2Sn. 3 3 correlation matrix 4 4 correlation matrix P(Σˆv 0) P(2S n Σˆv 0) P(Σˆv 0) P(2S n Σˆv 0) n = (±0.046) (±0.030) (±0.044) (±0.024) n = (±0.017) (±0.005) (±0.022) (±0.006) n = (±0.005) (±0.000) (±0.007) (±0.000) that the scaling of the y-axis is different in every plot for better visibility of the different Newton paths. We plotted the ratio of the likelihood of the correlation matrix obtained by the Newton algorithm and the likelihood of the true data-generating correlation matrix. We show the first 10 steps of the Newton algorithm. The last point is the ratio of the likelihood of the sample covariance matrix and the true data-generating covariance matrix. For the MLE Σˆv it holds that 1 likelihood(σˆv) likelihood(σ v ) likelihood(s n) likelihood(σ v ), which is also what we observe in Figures 3 and 4. This is an indication that the Newton-Raphson algorithm has converged to the MLE. To produce Figures 3 and 4 we ran 50 simulations for n = 10, n = 50 and n = 100 and plotted the simulations for which Σ v was not singular. Otherwise the corresponding likelihood is undefined. In Table 2 we give the average and standard deviation for P(Σˆv 0) and P(2S n Σˆv 0), corresponding to the probability of landing on the boundary of the cone of positive definite matrices and the boundary of the region 2Sn. In Figures 3 and 4 we see that the Newton-Raphson method converges in about 3 steps for estimating 3 3 correlation matrices with 100 observations. It takes slightly longer for less observations and for 4 4 matrices, but in all scenarios the Newton-Raphson method seems to converge in under 10 steps. 4. Comparison of the MLE and the least squares estimator. In this section, we analyze through simulations how the MLE compares to the least squares estimator with respect to various loss functions. We argue that for linear Gaussian covariance models the MLE is usually a better estimator than the least squares estimator. One reason is that especially for small sample size Σ v can be negative definite, while Σˆv (if it exists) is always positive semidefinite. Furthermore, as we show in the following simulation study, even when Σ v is positive definite, the MLE usually has a smaller loss compared to Σ v.

19 We analyze four different loss functions: (a) the l -loss: ˆΣ Σ, (b) the Frobenius loss: ˆΣ Σ F, (c) the quadratic loss: ˆΣΣ 1, I p (d) the entropy loss: tr(ˆσσ 1 ) log det(ˆσσ 1 ) p The loss functions (a) and (b) are standard loss functions, the loss functions (c) and (d) were proposed in [RB94]. As an example we study the time series model for circular serial correlation coefficients discussed in [And70, (5.9)]. This model is generated by G 0 = 0, G 1 = I p and G 2 defined by { 1 if i j = 1 mod p (G 2 ) ij = 0 otherwise. In Figure 5 we show the losses resulting from the four different loss functions above when simulating data under two time series models for circular serial correlation coefficients on 10 nodes with true covariance matrix v = (0, 1, 0.3) and v = (0, 1, 0.45). Note that the covariance matrix is singular for v = (0, 1, 0.5). Every point in Figure 5 corresponds to 1000 simulations and we considered only the simulations where the least squares estimator is positive semidefinite. We see that especially for small sample sizes and when the true covariance matrix is close to being singular, the MLE has a significantly smaller loss than the least squares estimator. This is to be expected in particular for the entropy loss, since this loss function seems to favor the MLE. However, it may be surprising that even with respect to the l -loss the MLE compares favorably to the least squares estimator. This shows that pushing the estimator to be positive semidefinite, as is the case for the MLE, puts a constraint on all entries jointly and leads even entrywise to an improved estimator. We here only show our simulation results for the time series model for circular serial correlation coefficients. However, we made the same observations for various other models. 5. Discussion. The likelihood function for linear Gaussian covariance models is, in general, multimodal. However, as we have proved in this paper, the multimodality matters only if the model is highly misspecified or if the sample size is not large enough to compensate for the dimension of the model. More precisely, we identified a convex region 2Sn on which the likelihood function is strictly concave. Using recent results linking the distribution of extreme eigenvalues of the Wishart distribution to the Tracy-Widom law, we gave asymptotic certificates for when the true covariance matrix and the 19

20 20 P. ZWIERNIK, C. UHLER, D. RICHARDS Max loss Frobenius loss squared Squared loss Entropy loss Loss Unbiased estimator MLE Loss Unbiased estimator MLE Loss Unbiased estimator MLE Loss Unbiased estimator MLE log(sample size) (a) v = (0, 1, 0.3), l -loss log(sample size) (b) v = (0, 1, 0.3), Frobenius loss log(sample size) (c) v = (0, 1, 0.3), quadratic loss log(sample size) (d) v = (0, 1, 0.3), entropy loss Max loss Frobenius loss squared Squared loss Entropy loss Loss Unbiased estimator MLE Loss Unbiased estimator MLE Loss Unbiased estimator MLE Loss Unbiased estimator MLE log(sample size) (e) v = (0, 1, 0.45), l -loss log(sample size) (f) v = (0, 1, 0.45), Frobenius loss log(sample size) (g) v = (0, 1, 0.45), quadratic loss log(sample size) (h) v = (0, 1, 0.45), entropy loss Fig 5. Comparison of MLE and least squares estimator using different loss functions for the time series model for circular serial correlation coefficients on 10 nodes defined by v = (0, 1, 0.3) (top line) and v = (0, 1, 0.45) (bottom line). global maximum of the likelihood function both lie in the region 2Sn. Since it is known that approximation by the Tracy-Widom law is accurate even for very small sample sizes, this makes our result useful for applications. An important corollary of our work is that: In the case of linear models for p p covariance matrices, where p is as low as 2, a sample size of 14 p suffices for the true covariance matrix to lie in 2Sn with probability at least Therefore, since the goal is to estimate the true covariance matrix, the estimation process should focus on the region 2Sn, and then a boundary point that maximizes the likelihood function over 2Sn could possibly be of more interest than the global maximum. We emphasize that our results provide lower bounds on the probabilities that the maximum likelihood estimation problem for linear Gaussian covariance models is well behaved. This is due to the fact that 2Sn is contained in a larger region over which the likelihood function is strictly concave, and this region is contained in an even larger region over which the likelihood function is unimodal. We believe that the analysis of these larger regions, and an analysis of the robustness with respect to model misspecification, will lead to many interesting research questions.

21 APPENDIX A: CONCENTRATION OF MEASURE INEQUALITIES FOR THE WISHART DISTRIBUTION Let W n be a white Wishart random variable, i.e., W n W p (n, I p ). In the following, we derive finite sample bounds for P( tr(w n ) np ɛ) and for P( log det W n p log(n) ɛ). It is well known that tr(w n ) has a χ 2 np distribution; see, e.g. [Mui82, Theorem ]. So, by Chebyshev s inequality we find that (A.1) P( Tr(W n ) np ɛ) 1 2np ɛ 2. Finite sample bounds that are more accurate than (A.1) can be obtained using [IL06, Ing10, LM00]. We now present the bounds by Laurent and Massart [LM00]. Let X χ 2 d ; using the original formulation of [LM00, page 1325], we obtain for any positive x, P(X d 2 dx + 2x) exp( x), P(d X 2 dx) exp( x). This can be rewritten for ɛ > 0 as ( P(X d ɛ) exp 2( 1 ) ) d + ɛ d(d + 2ɛ), ) P(d X ɛ) exp ( ɛ2. 4d Consequently, we obtain the following result: 21 Proposition A.1. Let W n W p (n, I p ). Then, P( tr(w n ) np ɛ) 1 exp ( 1 ( np + ɛ np(np + 2ɛ)) ) ) exp ( ɛ2. 2 4np We now derive finite sample bounds for log det(w n ). In order to apply Chebyshev s inequality, we calculate the mean and variance of log det(w n ). First, we calculate the moment-generating function of log det(w n ): M(t) := E exp(t log det(w n )) p = E(det(W n ) t ) = E Yi t = p E(Y t i ),

22 22 P. ZWIERNIK, C. UHLER, D. RICHARDS where Y i χ 2 n+1 i and the Y i are mutually independent. It is well known that E(Yi t ) = 2t Γ(t (n + 1 i)) Γ( 1 2 (n + 1 i)), t > 1 2, and in that case we obtain Hence, log M(t) = pt log 2 + M(t) = p p 2 t Γ(t (n + 1 i)) Γ( 1 2 (n + 1 i)). log Γ(t (n + 1 i)) log Γ( 1 2 (n + 1 i)). Differentiating with respect to t and then setting t = 0, we obtain E(log det(w n )) = d dt log M(t) t=0 = p log 2 + p ψ( 1 2 (n + 1 i)), where ψ(t) = d log Γ(t)/dt is the digamma function. Differentiating a second time and then setting t = 0, we obtain Var(log det(w n )) = d2 dt 2 log M(t) t=0 = p ψ 1 ( 1 2 (n + 1 i)), where ψ 1 (t) = d 2 log Γ(t)/dt 2 is the trigamma function. Therefore, by the non-central version of Chebyshev s inequality, we have P( log det W n p log(n)) ɛ) p 1 ψ 1( n+1 i 2 ) + ( p log ( ) 2 n + p n+1 i ψ( 2 ) ) 2 (A.2) ɛ 2. ACKNOWLEDGMENTS We would like to thank Robert D. Nowak, Ming Yuan, Lior Pachter and Mathias Drton for helpful discussions. P.Z. s research is supported by the European Union 7th Framework Programme (PIOF-GA ). This project was started while C.U. was visiting the Simons Institute at UC Berkeley for the program on Theoretical

23 Foundations of Big Data during the fall 2013 semester. D.R. s research was partially supported by the U.S. National Science Foundation grant DMS and by a Romberg Guest Professorship at the Heidelberg University Graduate School for Mathematical and Computational Methods in the Sciences, funded by German Universities Excellence Initiative grant GSC 220/2. REFERENCES [And70] T. W. Anderson. Estimation of covariance matrices which are linear combinations or whose inverses are linear combinations of given matrices. In Essays in Probability and Statistics, pages University of North Carolina Press, Chapel Hill, N.C., [And73] T. W. Anderson. Asymptotically efficient estimation of covariance matrices with linear structure. Annals of Statistics, 1: , [Bai99] Z. D. Bai. Methodologies in spectral analysis of large-dimensional random matrices, a review. Statistica Sinica, 9(3): , With comments by G. J. Rodgers and J. W. Silverstein; and a rejoinder by the author. [CDR07] S. Chaudhuri, M. Drton, and T. Richardson. Estimation of a covariance matrix with zeros. Biometrika, 94: , [DR04] M. Drton and T. S. Richardson. Multimodality of the likelihood in the bivariate seemingly unrelated regressions model. Biometrika, 91(2): , [DS01] K. R. Davidson and S. J. Szarek. Local operator theory, random matrices and banach spaces. Handbook of the Geometry of Banach Spaces, 1: , [EDBN10] B. Eriksson, G. Dasarathy, P. Barford, and R. Nowak. Toward the practical use of network tomography for internet topology discovery. In IEEE INFOCOM, pages 1 9, [Fel73] J. Felsenstein. Maximum-likelihood estimation of evolutionary trees from continuous characters. American Journal of Human Genetics, 25: , [Fel81] J. Felsenstein. Evolutionary trees from gene frequencies and quantitative characters: Finding maximum likelihood estimates. Evolution, 35: , [IL06] T. Inglot and T. Ledwina. Asymptotic optimality of new adaptive test in regression model. Annales de l Institut Henri Poincare (B) Probability and Statistics, 42: , [Ing10] T. Inglot. Inequalities for quantiles of the chi-square distribution. Probability and Mathematical Statistics, 30: , [JMPS09] I. M. Johnstone, Z. Ma, P. O. Perry, and M. Shahram. RMTstat: Distributions, Statistics and Tests derived from Random Matrix Theory, R package version 0.2. [Joh01] I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 29: , [LM00] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 28: , [Ma12] Z. Ma. Accuracy of the Tracy-Widom limits for the extreme eigenvalues in white Wishart matrices. Bernoulli, 18: ,

24 24 P. ZWIERNIK, C. UHLER, D. RICHARDS [Mui82] R. J. Muirhead. Aspects of Multivariate Statistical Theory. John Wiley & Sons, Inc., New York, Wiley Series in Probability and Mathematical Statistics. [Pou11] M. Pourahmadi. Covariance estimation: The GLM and regularization perspectives. Statistical Science, 3: , [RB94] B. Ruoyong and J. O. Berger. Estimation of a covariance matrix using the reference prior. Annals of Statistics, 22: , [RM94] P. J. Rousseeuw and G. Molenberghs. The shape of correlation matrices. The American Statistician, 48: , [SO91] A. Stuart and J. K. Ord. Kendalls Advanced Theory of Statistics 2. Classical Inference and Relationship. Arnold, [SWY00] C. G. Small, J. Wang, and Z. Yang. Eliminating multiple root problems in estimation. Statistical Science, 15: , [Sza78] T. H. Szatrowski. Explicit solutions, one iteration convergence and averaging in the multivariate normal estimation problem for patterned means and covariances. Annals of the Institute of Statistical Mathematics A, 30:81 88, [Sza80] T. H. Szatrowski. Necessary and sufficient conditions for explicit solutions in the multivariate normal estimation problem for patterned means and covariances. Annals of Statistics, 8: , [Sza85] T. H. Szatrowski. Patterned covariances. In Encyclopedia of Statistical Sciences, pages Wiley, New York, [Tao12] T. Tao. Topics in Random Matrix Theory, volume 30. American Mathematical Society, Providence, R.I., [TW96] C. A. Tracy and H. Widom. On orthogonal and symplectic matrix ensembles. Communications in Mathematical Physics, 177: , [TYBN04] Y. Tsang, M. Yildiz, P. Barford, and R. Nowak. Network radar: tomography from round trip time measurements. In Internet Measurement Conference (IMC), pages , [Ver10] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arxiv preprint arxiv: , P. Zwiernik Department of Statistics University of California, Berkeley Berkeley, CA 94720, U.S.A. pzwiernik@berkeley.edu D. Richards Department of Statistics Penn State University University Park, PA 16802, U.S.A. richards@stat.psu.edu C. Uhler IST Austria 3400 Klosterneuburg Austria caroline.uhler@ist.ac.at

Parameter estimation in linear Gaussian covariance models

Parameter estimation in linear Gaussian covariance models Parameter estimation in linear Gaussian covariance models Caroline Uhler (IST Austria) Joint work with Piotr Zwiernik (UC Berkeley) and Donald Richards (Penn State University) Big Data Reunion Workshop

More information

Concentration Inequalities for Random Matrices

Concentration Inequalities for Random Matrices Concentration Inequalities for Random Matrices M. Ledoux Institut de Mathématiques de Toulouse, France exponential tail inequalities classical theme in probability and statistics quantify the asymptotic

More information

Optimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X.

Optimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X. Optimization Background: Problem: given a function f(x) defined on X, find x such that f(x ) f(x) for all x X. The value x is called a maximizer of f and is written argmax X f. In general, argmax X f may

More information

Multivariate Analysis and Likelihood Inference

Multivariate Analysis and Likelihood Inference Multivariate Analysis and Likelihood Inference Outline 1 Joint Distribution of Random Variables 2 Principal Component Analysis (PCA) 3 Multivariate Normal Distribution 4 Likelihood Inference Joint density

More information

Total positivity in Markov structures

Total positivity in Markov structures 1 based on joint work with Shaun Fallat, Kayvan Sadeghi, Caroline Uhler, Nanny Wermuth, and Piotr Zwiernik (arxiv:1510.01290) Faculty of Science Total positivity in Markov structures Steffen Lauritzen

More information

University of Luxembourg. Master in Mathematics. Student project. Compressed sensing. Supervisor: Prof. I. Nourdin. Author: Lucien May

University of Luxembourg. Master in Mathematics. Student project. Compressed sensing. Supervisor: Prof. I. Nourdin. Author: Lucien May University of Luxembourg Master in Mathematics Student project Compressed sensing Author: Lucien May Supervisor: Prof. I. Nourdin Winter semester 2014 1 Introduction Let us consider an s-sparse vector

More information

High-dimensional asymptotic expansions for the distributions of canonical correlations

High-dimensional asymptotic expansions for the distributions of canonical correlations Journal of Multivariate Analysis 100 2009) 231 242 Contents lists available at ScienceDirect Journal of Multivariate Analysis journal homepage: www.elsevier.com/locate/jmva High-dimensional asymptotic

More information

Extended Bayesian Information Criteria for Gaussian Graphical Models

Extended Bayesian Information Criteria for Gaussian Graphical Models Extended Bayesian Information Criteria for Gaussian Graphical Models Rina Foygel University of Chicago rina@uchicago.edu Mathias Drton University of Chicago drton@uchicago.edu Abstract Gaussian graphical

More information

Random matrices: Distribution of the least singular value (via Property Testing)

Random matrices: Distribution of the least singular value (via Property Testing) Random matrices: Distribution of the least singular value (via Property Testing) Van H. Vu Department of Mathematics Rutgers vanvu@math.rutgers.edu (joint work with T. Tao, UCLA) 1 Let ξ be a real or complex-valued

More information

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ).

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ). .8.6 µ =, σ = 1 µ = 1, σ = 1 / µ =, σ =.. 3 1 1 3 x Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ ). The Gaussian distribution Probably the most-important distribution in all of statistics

More information

Orthogonal decompositions in growth curve models

Orthogonal decompositions in growth curve models ACTA ET COMMENTATIONES UNIVERSITATIS TARTUENSIS DE MATHEMATICA Volume 4, Orthogonal decompositions in growth curve models Daniel Klein and Ivan Žežula Dedicated to Professor L. Kubáček on the occasion

More information

Semidefinite Programming

Semidefinite Programming Semidefinite Programming Notes by Bernd Sturmfels for the lecture on June 26, 208, in the IMPRS Ringvorlesung Introduction to Nonlinear Algebra The transition from linear algebra to nonlinear algebra has

More information

Sparse Covariance Selection using Semidefinite Programming

Sparse Covariance Selection using Semidefinite Programming Sparse Covariance Selection using Semidefinite Programming A. d Aspremont ORFE, Princeton University Joint work with O. Banerjee, L. El Ghaoui & G. Natsoulis, U.C. Berkeley & Iconix Pharmaceuticals Support

More information

Chapter 4: Asymptotic Properties of the MLE (Part 2)

Chapter 4: Asymptotic Properties of the MLE (Part 2) Chapter 4: Asymptotic Properties of the MLE (Part 2) Daniel O. Scharfstein 09/24/13 1 / 1 Example Let {(R i, X i ) : i = 1,..., n} be an i.i.d. sample of n random vectors (R, X ). Here R is a response

More information

TAMS39 Lecture 2 Multivariate normal distribution

TAMS39 Lecture 2 Multivariate normal distribution TAMS39 Lecture 2 Multivariate normal distribution Martin Singull Department of Mathematics Mathematical Statistics Linköping University, Sweden Content Lecture Random vectors Multivariate normal distribution

More information

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 18.466 Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 1. MLEs in exponential families Let f(x,θ) for x X and θ Θ be a likelihood function, that is, for present purposes,

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

An Algebraic and Geometric Perspective on Exponential Families

An Algebraic and Geometric Perspective on Exponential Families An Algebraic and Geometric Perspective on Exponential Families Caroline Uhler (IST Austria) Based on two papers: with Mateusz Micha lek, Bernd Sturmfels, and Piotr Zwiernik, and with Liam Solus and Ruriko

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

A CLASS OF ORTHOGONALLY INVARIANT MINIMAX ESTIMATORS FOR NORMAL COVARIANCE MATRICES PARAMETRIZED BY SIMPLE JORDAN ALGEBARAS OF DEGREE 2

A CLASS OF ORTHOGONALLY INVARIANT MINIMAX ESTIMATORS FOR NORMAL COVARIANCE MATRICES PARAMETRIZED BY SIMPLE JORDAN ALGEBARAS OF DEGREE 2 Journal of Statistical Studies ISSN 10-4734 A CLASS OF ORTHOGONALLY INVARIANT MINIMAX ESTIMATORS FOR NORMAL COVARIANCE MATRICES PARAMETRIZED BY SIMPLE JORDAN ALGEBARAS OF DEGREE Yoshihiko Konno Faculty

More information

ELEMENTS OF PROBABILITY THEORY

ELEMENTS OF PROBABILITY THEORY ELEMENTS OF PROBABILITY THEORY Elements of Probability Theory A collection of subsets of a set Ω is called a σ algebra if it contains Ω and is closed under the operations of taking complements and countable

More information

Decomposable and Directed Graphical Gaussian Models

Decomposable and Directed Graphical Gaussian Models Decomposable Decomposable and Directed Graphical Gaussian Models Graphical Models and Inference, Lecture 13, Michaelmas Term 2009 November 26, 2009 Decomposable Definition Basic properties Wishart density

More information

p(d θ ) l(θ ) 1.2 x x x

p(d θ ) l(θ ) 1.2 x x x p(d θ ).2 x 0-7 0.8 x 0-7 0.4 x 0-7 l(θ ) -20-40 -60-80 -00 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ ˆ 2 3 4 5 6 7 θ θ x FIGURE 3.. The top graph shows several training points in one dimension, known or assumed to

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Lecture 4: Purifications and fidelity

Lecture 4: Purifications and fidelity CS 766/QIC 820 Theory of Quantum Information (Fall 2011) Lecture 4: Purifications and fidelity Throughout this lecture we will be discussing pairs of registers of the form (X, Y), and the relationships

More information

Machine Learning 2017

Machine Learning 2017 Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section

More information

STAT 730 Chapter 4: Estimation

STAT 730 Chapter 4: Estimation STAT 730 Chapter 4: Estimation Timothy Hanson Department of Statistics, University of South Carolina Stat 730: Multivariate Analysis 1 / 23 The likelihood We have iid data, at least initially. Each datum

More information

A note on profile likelihood for exponential tilt mixture models

A note on profile likelihood for exponential tilt mixture models Biometrika (2009), 96, 1,pp. 229 236 C 2009 Biometrika Trust Printed in Great Britain doi: 10.1093/biomet/asn059 Advance Access publication 22 January 2009 A note on profile likelihood for exponential

More information

Asymptotic Distribution of the Largest Eigenvalue via Geometric Representations of High-Dimension, Low-Sample-Size Data

Asymptotic Distribution of the Largest Eigenvalue via Geometric Representations of High-Dimension, Low-Sample-Size Data Sri Lankan Journal of Applied Statistics (Special Issue) Modern Statistical Methodologies in the Cutting Edge of Science Asymptotic Distribution of the Largest Eigenvalue via Geometric Representations

More information

Covariance function estimation in Gaussian process regression

Covariance function estimation in Gaussian process regression Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian

More information

Lecture 3 September 1

Lecture 3 September 1 STAT 383C: Statistical Modeling I Fall 2016 Lecture 3 September 1 Lecturer: Purnamrita Sarkar Scribe: Giorgio Paulon, Carlos Zanini Disclaimer: These scribe notes have been slightly proofread and may have

More information

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods

More information

Flat and multimodal likelihoods and model lack of fit in curved exponential families

Flat and multimodal likelihoods and model lack of fit in curved exponential families Mathematical Statistics Stockholm University Flat and multimodal likelihoods and model lack of fit in curved exponential families Rolf Sundberg Research Report 2009:1 ISSN 1650-0377 Postal address: Mathematical

More information

Permutation-invariant regularization of large covariance matrices. Liza Levina

Permutation-invariant regularization of large covariance matrices. Liza Levina Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Principal Component Analysis

Principal Component Analysis Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used

More information

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2.

APPENDIX A. Background Mathematics. A.1 Linear Algebra. Vector algebra. Let x denote the n-dimensional column vector with components x 1 x 2. APPENDIX A Background Mathematics A. Linear Algebra A.. Vector algebra Let x denote the n-dimensional column vector with components 0 x x 2 B C @. A x n Definition 6 (scalar product). The scalar product

More information

More Powerful Tests for Homogeneity of Multivariate Normal Mean Vectors under an Order Restriction

More Powerful Tests for Homogeneity of Multivariate Normal Mean Vectors under an Order Restriction Sankhyā : The Indian Journal of Statistics 2007, Volume 69, Part 4, pp. 700-716 c 2007, Indian Statistical Institute More Powerful Tests for Homogeneity of Multivariate Normal Mean Vectors under an Order

More information

Testing Some Covariance Structures under a Growth Curve Model in High Dimension

Testing Some Covariance Structures under a Growth Curve Model in High Dimension Department of Mathematics Testing Some Covariance Structures under a Growth Curve Model in High Dimension Muni S. Srivastava and Martin Singull LiTH-MAT-R--2015/03--SE Department of Mathematics Linköping

More information

P n. This is called the law of large numbers but it comes in two forms: Strong and Weak.

P n. This is called the law of large numbers but it comes in two forms: Strong and Weak. Large Sample Theory Large Sample Theory is a name given to the search for approximations to the behaviour of statistical procedures which are derived by computing limits as the sample size, n, tends to

More information

If g is also continuous and strictly increasing on J, we may apply the strictly increasing inverse function g 1 to this inequality to get

If g is also continuous and strictly increasing on J, we may apply the strictly increasing inverse function g 1 to this inequality to get 18:2 1/24/2 TOPIC. Inequalities; measures of spread. This lecture explores the implications of Jensen s inequality for g-means in general, and for harmonic, geometric, arithmetic, and related means in

More information

Journal of Multivariate Analysis. Sphericity test in a GMANOVA MANOVA model with normal error

Journal of Multivariate Analysis. Sphericity test in a GMANOVA MANOVA model with normal error Journal of Multivariate Analysis 00 (009) 305 3 Contents lists available at ScienceDirect Journal of Multivariate Analysis journal homepage: www.elsevier.com/locate/jmva Sphericity test in a GMANOVA MANOVA

More information

Chapter 3 Transformations

Chapter 3 Transformations Chapter 3 Transformations An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Linear Transformations A function is called a linear transformation if 1. for every and 2. for every If we fix the bases

More information

The largest eigenvalues of the sample covariance matrix. in the heavy-tail case

The largest eigenvalues of the sample covariance matrix. in the heavy-tail case The largest eigenvalues of the sample covariance matrix 1 in the heavy-tail case Thomas Mikosch University of Copenhagen Joint work with Richard A. Davis (Columbia NY), Johannes Heiny (Aarhus University)

More information

Exponential families also behave nicely under conditioning. Specifically, suppose we write η = (η 1, η 2 ) R k R p k so that

Exponential families also behave nicely under conditioning. Specifically, suppose we write η = (η 1, η 2 ) R k R p k so that 1 More examples 1.1 Exponential families under conditioning Exponential families also behave nicely under conditioning. Specifically, suppose we write η = η 1, η 2 R k R p k so that dp η dm 0 = e ηt 1

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

ON VARIANCE COVARIANCE COMPONENTS ESTIMATION IN LINEAR MODELS WITH AR(1) DISTURBANCES. 1. Introduction

ON VARIANCE COVARIANCE COMPONENTS ESTIMATION IN LINEAR MODELS WITH AR(1) DISTURBANCES. 1. Introduction Acta Math. Univ. Comenianae Vol. LXV, 1(1996), pp. 129 139 129 ON VARIANCE COVARIANCE COMPONENTS ESTIMATION IN LINEAR MODELS WITH AR(1) DISTURBANCES V. WITKOVSKÝ Abstract. Estimation of the autoregressive

More information

An Introduction to Multivariate Statistical Analysis

An Introduction to Multivariate Statistical Analysis An Introduction to Multivariate Statistical Analysis Third Edition T. W. ANDERSON Stanford University Department of Statistics Stanford, CA WILEY- INTERSCIENCE A JOHN WILEY & SONS, INC., PUBLICATION Contents

More information

INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS

INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS STEVEN P. LALLEY AND ANDREW NOBEL Abstract. It is shown that there are no consistent decision rules for the hypothesis testing problem

More information

Exponential tail inequalities for eigenvalues of random matrices

Exponential tail inequalities for eigenvalues of random matrices Exponential tail inequalities for eigenvalues of random matrices M. Ledoux Institut de Mathématiques de Toulouse, France exponential tail inequalities classical theme in probability and statistics quantify

More information

Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

More information

Symmetric Matrices and Eigendecomposition

Symmetric Matrices and Eigendecomposition Symmetric Matrices and Eigendecomposition Robert M. Freund January, 2014 c 2014 Massachusetts Institute of Technology. All rights reserved. 1 2 1 Symmetric Matrices and Convexity of Quadratic Functions

More information

arxiv: v1 [math.st] 22 Jun 2018

arxiv: v1 [math.st] 22 Jun 2018 Hypothesis testing near singularities and boundaries arxiv:1806.08458v1 [math.st] Jun 018 Jonathan D. Mitchell, Elizabeth S. Allman, and John A. Rhodes Department of Mathematics & Statistics University

More information

A Single-letter Upper Bound for the Sum Rate of Multiple Access Channels with Correlated Sources

A Single-letter Upper Bound for the Sum Rate of Multiple Access Channels with Correlated Sources A Single-letter Upper Bound for the Sum Rate of Multiple Access Channels with Correlated Sources Wei Kang Sennur Ulukus Department of Electrical and Computer Engineering University of Maryland, College

More information

Upper Bound for Intermediate Singular Values of Random Sub-Gaussian Matrices 1

Upper Bound for Intermediate Singular Values of Random Sub-Gaussian Matrices 1 Upper Bound for Intermediate Singular Values of Random Sub-Gaussian Matrices 1 Feng Wei 2 University of Michigan July 29, 2016 1 This presentation is based a project under the supervision of M. Rudelson.

More information

Regression: Lecture 2

Regression: Lecture 2 Regression: Lecture 2 Niels Richard Hansen April 26, 2012 Contents 1 Linear regression and least squares estimation 1 1.1 Distributional results................................ 3 2 Non-linear effects and

More information

Finite Population Sampling and Inference

Finite Population Sampling and Inference Finite Population Sampling and Inference A Prediction Approach RICHARD VALLIANT ALAN H. DORFMAN RICHARD M. ROYALL A Wiley-Interscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane

More information

Statistical Inference with Monotone Incomplete Multivariate Normal Data

Statistical Inference with Monotone Incomplete Multivariate Normal Data Statistical Inference with Monotone Incomplete Multivariate Normal Data p. 1/4 Statistical Inference with Monotone Incomplete Multivariate Normal Data This talk is based on joint work with my wonderful

More information

Statistical Inference On the High-dimensional Gaussian Covarianc

Statistical Inference On the High-dimensional Gaussian Covarianc Statistical Inference On the High-dimensional Gaussian Covariance Matrix Department of Mathematical Sciences, Clemson University June 6, 2011 Outline Introduction Problem Setup Statistical Inference High-Dimensional

More information

Problems in Linear Algebra and Representation Theory

Problems in Linear Algebra and Representation Theory Problems in Linear Algebra and Representation Theory (Most of these were provided by Victor Ginzburg) The problems appearing below have varying level of difficulty. They are not listed in any specific

More information

EE/ACM Applications of Convex Optimization in Signal Processing and Communications Lecture 17

EE/ACM Applications of Convex Optimization in Signal Processing and Communications Lecture 17 EE/ACM 150 - Applications of Convex Optimization in Signal Processing and Communications Lecture 17 Andre Tkacenko Signal Processing Research Group Jet Propulsion Laboratory May 29, 2012 Andre Tkacenko

More information

Stochastic Design Criteria in Linear Models

Stochastic Design Criteria in Linear Models AUSTRIAN JOURNAL OF STATISTICS Volume 34 (2005), Number 2, 211 223 Stochastic Design Criteria in Linear Models Alexander Zaigraev N. Copernicus University, Toruń, Poland Abstract: Within the framework

More information

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania Submitted to the Annals of Statistics DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING By T. Tony Cai and Linjun Zhang University of Pennsylvania We would like to congratulate the

More information

1 Data Arrays and Decompositions

1 Data Arrays and Decompositions 1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is

More information

Review (Probability & Linear Algebra)

Review (Probability & Linear Algebra) Review (Probability & Linear Algebra) CE-725 : Statistical Pattern Recognition Sharif University of Technology Spring 2013 M. Soleymani Outline Axioms of probability theory Conditional probability, Joint

More information

Latent Variable Models and EM algorithm

Latent Variable Models and EM algorithm Latent Variable Models and EM algorithm SC4/SM4 Data Mining and Machine Learning, Hilary Term 2017 Dino Sejdinovic 3.1 Clustering and Mixture Modelling K-means and hierarchical clustering are non-probabilistic

More information

Gaussian Graphical Models: An Algebraic and Geometric Perspective

Gaussian Graphical Models: An Algebraic and Geometric Perspective Gaussian Graphical Models: An Algebraic and Geometric Perspective Caroline Uhler arxiv:707.04345v [math.st] 3 Jul 07 Abstract Gaussian graphical models are used throughout the natural sciences, social

More information

Random Matrices and Multivariate Statistical Analysis

Random Matrices and Multivariate Statistical Analysis Random Matrices and Multivariate Statistical Analysis Iain Johnstone, Statistics, Stanford imj@stanford.edu SEA 06@MIT p.1 Agenda Classical multivariate techniques Principal Component Analysis Canonical

More information

MATH 5720: Unconstrained Optimization Hung Phan, UMass Lowell September 13, 2018

MATH 5720: Unconstrained Optimization Hung Phan, UMass Lowell September 13, 2018 MATH 57: Unconstrained Optimization Hung Phan, UMass Lowell September 13, 18 1 Global and Local Optima Let a function f : S R be defined on a set S R n Definition 1 (minimizers and maximizers) (i) x S

More information

Rectangular Young tableaux and the Jacobi ensemble

Rectangular Young tableaux and the Jacobi ensemble Rectangular Young tableaux and the Jacobi ensemble Philippe Marchal October 20, 2015 Abstract It has been shown by Pittel and Romik that the random surface associated with a large rectangular Young tableau

More information

Inference For High Dimensional M-estimates. Fixed Design Results

Inference For High Dimensional M-estimates. Fixed Design Results : Fixed Design Results Lihua Lei Advisors: Peter J. Bickel, Michael I. Jordan joint work with Peter J. Bickel and Noureddine El Karoui Dec. 8, 2016 1/57 Table of Contents 1 Background 2 Main Results and

More information

N. L. P. NONLINEAR PROGRAMMING (NLP) deals with optimization models with at least one nonlinear function. NLP. Optimization. Models of following form:

N. L. P. NONLINEAR PROGRAMMING (NLP) deals with optimization models with at least one nonlinear function. NLP. Optimization. Models of following form: 0.1 N. L. P. Katta G. Murty, IOE 611 Lecture slides Introductory Lecture NONLINEAR PROGRAMMING (NLP) deals with optimization models with at least one nonlinear function. NLP does not include everything

More information

Linear Methods for Prediction

Linear Methods for Prediction Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we

More information

For iid Y i the stronger conclusion holds; for our heuristics ignore differences between these notions.

For iid Y i the stronger conclusion holds; for our heuristics ignore differences between these notions. Large Sample Theory Study approximate behaviour of ˆθ by studying the function U. Notice U is sum of independent random variables. Theorem: If Y 1, Y 2,... are iid with mean µ then Yi n µ Called law of

More information

x x2 2 + x3 3 x4 3. Use the divided-difference method to find a polynomial of least degree that fits the values shown: (b)

x x2 2 + x3 3 x4 3. Use the divided-difference method to find a polynomial of least degree that fits the values shown: (b) Numerical Methods - PROBLEMS. The Taylor series, about the origin, for log( + x) is x x2 2 + x3 3 x4 4 + Find an upper bound on the magnitude of the truncation error on the interval x.5 when log( + x)

More information

component risk analysis

component risk analysis 273: Urban Systems Modeling Lec. 3 component risk analysis instructor: Matteo Pozzi 273: Urban Systems Modeling Lec. 3 component reliability outline risk analysis for components uncertain demand and uncertain

More information

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ Lawrence D. Brown University

More information

CONCENTRATION OF THE MIXED DISCRIMINANT OF WELL-CONDITIONED MATRICES. Alexander Barvinok

CONCENTRATION OF THE MIXED DISCRIMINANT OF WELL-CONDITIONED MATRICES. Alexander Barvinok CONCENTRATION OF THE MIXED DISCRIMINANT OF WELL-CONDITIONED MATRICES Alexander Barvinok Abstract. We call an n-tuple Q 1,...,Q n of positive definite n n real matrices α-conditioned for some α 1 if for

More information

Statistical Machine Learning for Structured and High Dimensional Data

Statistical Machine Learning for Structured and High Dimensional Data Statistical Machine Learning for Structured and High Dimensional Data (FA9550-09- 1-0373) PI: Larry Wasserman (CMU) Co- PI: John Lafferty (UChicago and CMU) AFOSR Program Review (Jan 28-31, 2013, Washington,

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline.

Dependence. Practitioner Course: Portfolio Optimization. John Dodson. September 10, Dependence. John Dodson. Outline. Practitioner Course: Portfolio Optimization September 10, 2008 Before we define dependence, it is useful to define Random variables X and Y are independent iff For all x, y. In particular, F (X,Y ) (x,

More information

Approximating the Covariance Matrix with Low-rank Perturbations

Approximating the Covariance Matrix with Low-rank Perturbations Approximating the Covariance Matrix with Low-rank Perturbations Malik Magdon-Ismail and Jonathan T. Purnell Department of Computer Science Rensselaer Polytechnic Institute Troy, NY 12180 {magdon,purnej}@cs.rpi.edu

More information

A Multivariate Two-Sample Mean Test for Small Sample Size and Missing Data

A Multivariate Two-Sample Mean Test for Small Sample Size and Missing Data A Multivariate Two-Sample Mean Test for Small Sample Size and Missing Data Yujun Wu, Marc G. Genton, 1 and Leonard A. Stefanski 2 Department of Biostatistics, School of Public Health, University of Medicine

More information

Robustness of Principal Components

Robustness of Principal Components PCA for Clustering An objective of principal components analysis is to identify linear combinations of the original variables that are useful in accounting for the variation in those original variables.

More information

Synthesis of Gaussian and non-gaussian stationary time series using circulant matrix embedding

Synthesis of Gaussian and non-gaussian stationary time series using circulant matrix embedding Synthesis of Gaussian and non-gaussian stationary time series using circulant matrix embedding Vladas Pipiras University of North Carolina at Chapel Hill UNC Graduate Seminar, November 10, 2010 (joint

More information

A sequential hypothesis test based on a generalized Azuma inequality 1

A sequential hypothesis test based on a generalized Azuma inequality 1 A sequential hypothesis test based on a generalized Azuma inequality 1 Daniël Reijsbergen a,2, Werner Scheinhardt b, Pieter-Tjerk de Boer b a Laboratory for Foundations of Computer Science, University

More information

CHAPTER 6. Representations of compact groups

CHAPTER 6. Representations of compact groups CHAPTER 6 Representations of compact groups Throughout this chapter, denotes a compact group. 6.1. Examples of compact groups A standard theorem in elementary analysis says that a subset of C m (m a positive

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

Learning Multiple Tasks with a Sparse Matrix-Normal Penalty

Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Yi Zhang and Jeff Schneider NIPS 2010 Presented by Esther Salazar Duke University March 25, 2011 E. Salazar (Reading group) March 25, 2011 1

More information

Optimization of Quadratic Forms: NP Hard Problems : Neural Networks

Optimization of Quadratic Forms: NP Hard Problems : Neural Networks 1 Optimization of Quadratic Forms: NP Hard Problems : Neural Networks Garimella Rama Murthy, Associate Professor, International Institute of Information Technology, Gachibowli, HYDERABAD, AP, INDIA ABSTRACT

More information

On Bayesian Computation

On Bayesian Computation On Bayesian Computation Michael I. Jordan with Elaine Angelino, Maxim Rabinovich, Martin Wainwright and Yun Yang Previous Work: Information Constraints on Inference Minimize the minimax risk under constraints

More information

EIGENVALUES AND EIGENVECTORS 3

EIGENVALUES AND EIGENVECTORS 3 EIGENVALUES AND EIGENVECTORS 3 1. Motivation 1.1. Diagonal matrices. Perhaps the simplest type of linear transformations are those whose matrix is diagonal (in some basis). Consider for example the matrices

More information

Nonlinear Programming Models

Nonlinear Programming Models Nonlinear Programming Models Fabio Schoen 2008 http://gol.dsi.unifi.it/users/schoen Nonlinear Programming Models p. Introduction Nonlinear Programming Models p. NLP problems minf(x) x S R n Standard form:

More information

A Test for Order Restriction of Several Multivariate Normal Mean Vectors against all Alternatives when the Covariance Matrices are Unknown but Common

A Test for Order Restriction of Several Multivariate Normal Mean Vectors against all Alternatives when the Covariance Matrices are Unknown but Common Journal of Statistical Theory and Applications Volume 11, Number 1, 2012, pp. 23-45 ISSN 1538-7887 A Test for Order Restriction of Several Multivariate Normal Mean Vectors against all Alternatives when

More information

14 : Theory of Variational Inference: Inner and Outer Approximation

14 : Theory of Variational Inference: Inner and Outer Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2014 14 : Theory of Variational Inference: Inner and Outer Approximation Lecturer: Eric P. Xing Scribes: Yu-Hsin Kuo, Amos Ng 1 Introduction Last lecture

More information

Logistic Regression. Seungjin Choi

Logistic Regression. Seungjin Choi Logistic Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/

More information