Parameter estimation in linear Gaussian covariance models

Caroline Uhler (IST Austria)
Joint work with Piotr Zwiernik (UC Berkeley) and Donald Richards (Penn State University)

Big Data Reunion Workshop, Simons Institute, UC Berkeley
December 16, 2014

Caroline Uhler (IST Austria) · Linear Gaussian covariance models · Berkeley, December 2014
Linear Gaussian covariance model

S^p : (real) symmetric p×p matrices
S^p_{>0} : cone of (real) symmetric p×p positive definite matrices

Definition (Linear Gaussian covariance model). A random vector X ∈ R^p satisfies the linear Gaussian covariance model M_G given by G = (G_0, G_1, ..., G_r), G_i ∈ S^p, if

  X ~ N_p(µ, Σ_θ)  and  Σ_θ = G_0 + Σ_{i=1}^r θ_i G_i,  θ = (θ_1, ..., θ_r) ∈ R^r.

M_G is parametrized by a spectrahedron:

  Θ_G = { θ = (θ_1, ..., θ_r) ∈ R^r : G_0 + Σ_{i=1}^r θ_i G_i ∈ S^p_{>0} }
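As a small numerical sketch of this definition (the matrices G_1, G_2 below are hypothetical examples, not from the talk), the parametrization Σ_θ and the spectrahedron condition θ ∈ Θ_G can be checked directly:

```python
import numpy as np

# Hypothetical 3x3 model: Sigma_theta = G0 + theta_1*G1 + theta_2*G2
G0 = np.eye(3)
G1 = np.array([[0., 1., 0.], [1., 0., 0.], [0., 0., 0.]])
G2 = np.array([[0., 0., 1.], [0., 0., 0.], [1., 0., 0.]])

def sigma(theta, G0, Gs):
    """Covariance matrix Sigma_theta = G0 + sum_i theta_i * G_i."""
    return G0 + sum(t * G for t, G in zip(theta, Gs))

def in_spectrahedron(theta, G0, Gs, tol=1e-10):
    """theta lies in Theta_G iff Sigma_theta is positive definite."""
    return np.linalg.eigvalsh(sigma(theta, G0, Gs)).min() > tol

inside = in_spectrahedron([0.3, 0.2], G0, [G1, G2])   # diagonally dominant, PD
outside = in_spectrahedron([1.5, 0.0], G0, [G1, G2])  # has a negative eigenvalue
```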
Examples of linear Gaussian covariance models

Correlation matrices:
  Θ_G = { θ ∈ R^{C(p,2)} : I_p + Σ_{1≤i<j≤p} θ_ij (E_ij + E_ji) ∈ S^p_{>0} },
where E_ij is the p×p matrix with entry (i, j) equal to 1 and all other entries 0.

Covariance matrices with prescribed zeros (relevance networks): Butte et al. (2000), Chaudhuri, Drton & Richardson (2007)

Stationary stochastic processes from repeated time series data: Anderson (1970, 1973)

Brownian motion tree models: phylogenetic models (Felsenstein (1973, 1981))

Network tomography models for analyzing the structure of the connections in the Internet (Eriksson et al. (2010), Tsang et al. (2004))
Maximum likelihood estimation

X_1, ..., X_n : sample from N_p(µ, Σ*), where Σ* is the true covariance matrix
X̄ = (1/n) Σ_{i=1}^n X_i : sample mean, used to estimate µ
S_n = (1/n) Σ_{i=1}^n (X_i − X̄)(X_i − X̄)^T : sample covariance matrix
S_n ∈ S^p_{>0} with probability 1 if n ≥ p

Log-likelihood function l(·; S_n) : S^p_{>0} → R:

  l(Σ; S_n) = −(n/2) log det Σ − (n/2) tr(S_n Σ^{−1})

For linear Gaussian covariance models we constrain l(·; S_n) to Θ_G:

  MLE: Σ̂ := argmax_{θ ∈ Θ_G} l(Σ_θ; S_n)
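A minimal numeric sketch of these quantities (the data are simulated here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 3, 200
X = rng.standard_normal((n, p))      # sample of size n from N_p(0, I_p)
Xbar = X.mean(axis=0)                # sample mean
Sn = (X - Xbar).T @ (X - Xbar) / n   # sample covariance (1/n convention)

def loglik(Sigma, Sn, n):
    """l(Sigma; S_n) = -(n/2) log det Sigma - (n/2) tr(S_n Sigma^{-1})."""
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * n * logdet - 0.5 * n * np.trace(Sn @ np.linalg.inv(Sigma))
```

Over all of S^p_{>0} the maximizer of l(·; S_n) is S_n itself; the constrained problem above restricts Σ to the model M_G.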
Maximum likelihood estimation

For K := Σ^{−1},

  l(K; S_n) = (n/2) log det K − (n/2) tr(S_n K)

is concave in K ∈ S^p_{>0}, with global maximum at K = S_n^{−1}.

A concave function constrained to an affine subspace remains concave, so ML estimation in Gaussian graphical models is a convex problem.

l(Σ; S_n) is not concave over all of S^p_{>0}, but it is strictly concave over the random convex region
  { Σ ∈ S^p_{>0} : Σ ≺ 2S_n }.
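A sketch of a membership test for this concavity region, checking the Loewner order Σ ≺ 2S_n via the eigenvalues of 2S_n − Σ (function name is mine):

```python
import numpy as np

def in_concavity_region(Sigma, Sn, tol=1e-10):
    """True iff Sigma is PD and Sigma < 2*S_n in the Loewner order,
    i.e. Sigma lies in the region where l(.; S_n) is strictly concave."""
    if np.linalg.eigvalsh(Sigma).min() <= tol:
        return False
    return np.linalg.eigvalsh(2 * Sn - Sigma).min() > tol

# S_n itself always lies in the region, since 2*S_n - S_n = S_n is PD.
Sn = np.array([[1.0, 0.3], [0.3, 2.0]])
```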
Numerical optimization of the likelihood

Newton-Raphson method: start at the natural least-squares estimator Σ̄, given by the solution θ̄ of the linear system

  Σ_{j=1}^r θ_j tr(G_i G_j) = tr(S_n G_i)  for all i = 1, ..., r

Update:

  θ^(k+1) = θ^(k) − (∇²_θ l(θ^(k); S_n))^{−1} ∇_θ l(θ^(k); S_n)

Our observation: for simulated data, the Newton-Raphson algorithm typically converges in 2-3 steps. It converges to a point Σ̂ with larger likelihood than the true (data-generating) covariance matrix Σ*, and usually Σ̂ ≺ 2S_n.
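A sketch of this procedure in code, on the toy 2×2 correlation model Σ_θ = I + θ(E_12 + E_21). The gradient and Hessian formulas are my own matrix-calculus reconstruction of ∇_θ l and ∇²_θ l, not taken from the slides:

```python
import numpy as np

def grad_hess(theta, G0, Gs, Sn, n):
    """Gradient and Hessian of l(theta; S_n), where
    Sigma_theta = G0 + sum_i theta_i * G_i and P = Sigma_theta^{-1}."""
    Sigma = G0 + sum(t * G for t, G in zip(theta, Gs))
    P = np.linalg.inv(Sigma)
    r = len(Gs)
    g = np.empty(r)
    H = np.empty((r, r))
    for i, Gi in enumerate(Gs):
        g[i] = -0.5 * n * (np.trace(P @ Gi) - np.trace(P @ Sn @ P @ Gi))
        for j, Gj in enumerate(Gs):
            H[i, j] = -0.5 * n * (-np.trace(P @ Gj @ P @ Gi)
                                  + np.trace(P @ Gj @ P @ Sn @ P @ Gi)
                                  + np.trace(P @ Sn @ P @ Gj @ P @ Gi))
    return g, H

def newton_mle(G0, Gs, Sn, n, steps=10):
    # Start at the least-squares estimator:
    # solve sum_j theta_j tr(G_i G_j) = tr(S_n G_i) for all i.
    A = np.array([[np.trace(Gi @ Gj) for Gj in Gs] for Gi in Gs])
    b = np.array([np.trace(Sn @ Gi) for Gi in Gs])
    theta = np.linalg.solve(A, b)
    for _ in range(steps):
        g, H = grad_hess(theta, G0, Gs, Sn, n)
        theta = theta - np.linalg.solve(H, g)
    return theta

# Toy instance: 2x2 correlation model with a fixed "observed" S_n.
G0 = np.eye(2)
G1 = np.array([[0.0, 1.0], [1.0, 0.0]])
Sn = np.array([[1.0, 0.4], [0.4, 1.2]])
theta_hat = newton_mle(G0, [G1], Sn, n=50)
```

On this instance the least-squares start is θ̄ = 0.4 and Newton quickly settles at a stationary point of the likelihood near θ ≈ 0.34.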
Estimating correlation matrices

A venerable problem that becomes difficult even for 3×3 matrices: Rousseeuw & Molenberghs (1994), Small, Wang & Yang (2000), Stuart & Ord (1991)

Simulations using the Newton-Raphson algorithm, for two examples:

  Σ* = [ 1    1/2  1/3 ]         Σ* = [ 1    1/2  1/3  1/4 ]
       [ 1/2  1    1/4 ]   and        [ 1/2  1    1/5  1/6 ]
       [ 1/3  1/4  1   ]              [ 1/3  1/5  1    1/7 ]
                                      [ 1/4  1/6  1/7  1   ]

The least-squares estimator is Σ̄ = I_p + Σ_{1≤i<j≤p} (S_n)_ij (E_ij + E_ji).

Plot the ratio of likelihoods L(Σ^(t))/L(Σ*) along the Newton path and compare it to L(S_n)/L(Σ*).
Estimating correlation matrices: Simulations

[Figure: likelihood-ratio paths L(Σ^(t))/L(Σ*) over 10 Newton steps, with the value for S_n marked for comparison. Six panels: rows for the 3×3 and 4×4 examples, columns for sample sizes n = 10, 50, 100.]
Geometric picture

Σ* = true covariance matrix, S_n = sample covariance matrix
{ Σ ∈ S^p_{>0} : Σ ≺ 2S_n } is a (random!) convex region

Clearly: S_n ≺ 2S_n.
With high probability, Σ*, Σ̂ and Σ̄ lie in this region as well.

[Figure: sketch of the convex region below 2S_n, containing S_n, Σ*, Σ̂ and Σ̄.]
Wishart distribution

For an i.i.d. sample X_1, ..., X_n from N_p(µ, Σ), n S_n has the Wishart distribution W_p(n − 1, Σ).

If Q ∈ R^{p×p} is full rank and Y ~ W_p(n, Σ), then Q Y Q^T ~ W_p(n, Q Σ Q^T).

So taking Q = Σ^{−1/2}, W_{n−1} := n Σ^{−1/2} S_n Σ^{−1/2} has the standard Wishart distribution W_p(n − 1, I_p).

Hence:
  P(Σ* ≺ 2S_n) = P(2S_n − Σ* ≻ 0)
               = P(2 (Σ*)^{−1/2} S_n (Σ*)^{−1/2} − I_p ≻ 0)
               = P(W_{n−1} − (n/2) I_p ≻ 0)
               = P(λ_min(W_{n−1}) > n/2)

Note: P(Σ* ≺ 2S_n) does not depend on Σ*.
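Because the probability reduces to a statement about the standard Wishart matrix W_{n−1}, it can be estimated by simulation without knowing Σ*. A Monte Carlo sketch (function name and parameters are mine):

```python
import numpy as np

def prob_sigma_in_region(p, n, reps=2000, rng=None):
    """Monte Carlo estimate of P(lambda_min(W_{n-1}) > n/2), where
    W_{n-1} ~ W_p(n-1, I_p); by the computation above this equals
    P(Sigma* < 2*S_n) for ANY true covariance matrix Sigma*."""
    rng = np.random.default_rng(rng)
    hits = 0
    for _ in range(reps):
        Z = rng.standard_normal((n - 1, p))
        W = Z.T @ Z                      # ~ W_p(n-1, I_p)
        if np.linalg.eigvalsh(W).min() > n / 2:
            hits += 1
    return hits / reps
```

For example, with p = 3 the estimated probability is high for n well above 14p and near zero for n close to p, matching the simulations on the next slide.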
Minimum eigenvalue of a Wishart matrix

What is known about the distribution of λ_min(W_n)?

R. Muirhead, Aspects of Multivariate Statistical Theory (1982):
The distribution of λ_min(W_n) is known, but it is expressed in terms of complicated functions that are hard to evaluate.
Approximating the integral P(λ_min(W_n) > n/2) is hard.
Convergence to the asymptotic distribution (for n → ∞) is very slow.

However, a recent development in random matrix theory: the asymptotic distribution of λ_min(W_n) as n, p → ∞ with n/p → γ > 1 is given by the Tracy-Widom distribution, with convergence rate O(min(n, p)^{−2/3}) (Ma, 2012).
Approximating P(Σ* ≺ 2S_n) for small p and n

[Figure: for each p ∈ {3, 5, 10}, the simulated probability as a function of the sample size n (varying between p and 20p), together with the Tracy-Widom approximation and the 0.95 line.]

For n > 14p it holds with probability at least 0.95 that Σ* ≺ 2S_n.

The curves above converge to the step function f : (1, ∞) → [0, 1] with f(n/p) = 1(n/p ≥ 6 + 4√2), where 6 + 4√2 ≈ 11.66.
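The threshold is consistent with the Marchenko-Pastur edge: asymptotically λ_min(W_{n−1}) ≈ n(1 − √(p/n))², and n(1 − 1/√γ)² > n/2 rearranges to γ > 6 + 4√2. A sketch of the resulting rule of thumb (function name is mine):

```python
import numpy as np

GAMMA_STAR = 6 + 4 * np.sqrt(2)  # ~= 11.657

def enough_samples(n, p):
    """Asymptotic indicator f(n/p) = 1(n/p >= 6 + 4*sqrt(2)): True when
    Sigma* < 2*S_n holds with probability tending to 1 as n, p grow."""
    return n / p >= GAMMA_STAR

ok = enough_samples(n=140, p=10)      # satisfies the practical n > 14p cutoff
not_ok = enough_samples(n=100, p=10)  # n/p = 10 < 11.66
```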
Conclusions and Discussion

The likelihood function for linear Gaussian covariance models is, in general, multimodal.

However, multimodality is relevant only if the sample size is not sufficiently large to compensate for the model dimension.

We derived asymptotic conditions that guarantee that Σ*, Σ̂ and Σ̄ are contained in the convex region { Σ : Σ ≺ 2S_n }.

Our results provide lower bounds on the probability that the maximum likelihood estimation problem for linear Gaussian covariance models is well behaved.

The region { Σ : Σ ≺ 2S_n } is contained in a larger region over which the likelihood function is strictly concave, and that region is in turn contained in an even larger region over which the likelihood function is unimodal.

We are studying these regions and working on extensions to learning the model.
Reference

Zwiernik, Uhler & Richards: Maximum likelihood estimation for linear Gaussian covariance models (arXiv:1408.5604)