Parameter estimation in linear Gaussian covariance models


Parameter estimation in linear Gaussian covariance models
Caroline Uhler (IST Austria)
Joint work with Piotr Zwiernik (UC Berkeley) and Donald Richards (Penn State University)
Big Data Reunion Workshop, Simons Institute, UC Berkeley, December 16, 2014

Linear Gaussian covariance model

$S^p$: (real) symmetric $p \times p$ matrices. $S^p_{\succ 0}$: cone of (real) symmetric $p \times p$ positive definite matrices.

Definition (Linear Gaussian covariance model). A random vector $X \in \mathbb{R}^p$ satisfies the linear Gaussian covariance model $M_G$ given by $G = (G_0, G_1, \ldots, G_r)$, $G_i \in S^p$, if $X \sim N_p(\mu, \Sigma_\theta)$ and
$$\Sigma_\theta = G_0 + \sum_{i=1}^r \theta_i G_i, \qquad \theta = (\theta_1, \ldots, \theta_r) \in \mathbb{R}^r.$$

$M_G$ is parametrized by a spectrahedron
$$\Theta_G = \Big\{ \theta = (\theta_1, \ldots, \theta_r) \in \mathbb{R}^r \;\Big|\; G_0 + \sum_{i=1}^r \theta_i G_i \in S^p_{\succ 0} \Big\}.$$
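In code, such a model is just a list of symmetric matrices. A minimal numpy sketch (the helper names sigma_theta and in_spectrahedron are illustrative, not from the talk) that tests membership in $\Theta_G$ via a Cholesky factorization:

```python
import numpy as np

def sigma_theta(theta, G0, Gs):
    # Sigma_theta = G_0 + sum_i theta_i * G_i
    return G0 + sum(t * G for t, G in zip(theta, Gs))

def in_spectrahedron(theta, G0, Gs):
    # theta lies in Theta_G iff Sigma_theta is positive definite;
    # a Cholesky factorization succeeds exactly in that case.
    try:
        np.linalg.cholesky(sigma_theta(theta, G0, Gs))
        return True
    except np.linalg.LinAlgError:
        return False
```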

Examples of linear Gaussian covariance models

Correlation matrices:
$$\Theta_G = \Big\{ \theta \in \mathbb{R}^{\binom{p}{2}} \;\Big|\; I_p + \sum_{1 \le i < j \le p} \theta_{ij} \, (E_{ij} + E_{ji}) \in S^p_{\succ 0} \Big\},$$
where $E_{ij}$ is the $p \times p$ matrix whose $(i, j)$ entry is 1 and all other entries are 0.

Covariance matrices with prescribed zeros (relevance networks): Butte et al. (2000), Chaudhuri, Drton & Richardson (2007)

Stationary stochastic processes from repeated time series data: Anderson (1970, 1973)

Brownian motion tree models: phylogenetic models (Felsenstein (1973, 1981)); network tomography models for analyzing the structure of the connections in the Internet (Eriksson et al. (2010), Tsang et al. (2004))
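For the correlation-matrix example, the generators can be built mechanically. A small sketch (correlation_basis is an illustrative name) producing $G_0 = I_p$ and the $E_{ij} + E_{ji}$ basis, which plugs directly into the membership test above:

```python
import numpy as np

def correlation_basis(p):
    # G_0 = I_p; one generator E_ij + E_ji for each pair i < j,
    # so theta fills in the off-diagonal entries of a correlation matrix.
    G0 = np.eye(p)
    Gs = []
    for i in range(p):
        for j in range(i + 1, p):
            G = np.zeros((p, p))
            G[i, j] = G[j, i] = 1.0
            Gs.append(G)
    return G0, Gs
```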

Maximum likelihood estimation

$X_1, \ldots, X_n$: sample from $N_p(\mu, \Sigma^*)$, with $\Sigma^*$ the true covariance matrix.
$\bar{X} = \frac{1}{n} \sum_{i=1}^n X_i$: sample mean, used to estimate $\mu$.
$S_n = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^T$: sample covariance matrix.
$S_n \in S^p_{\succ 0}$ with probability 1 if $n > p$.

Log-likelihood function $\ell(\,\cdot\,; S_n) : S^p_{\succ 0} \to \mathbb{R}$:
$$\ell(\Sigma; S_n) = -\frac{n}{2} \log \det \Sigma - \frac{n}{2} \operatorname{tr}(S_n \Sigma^{-1}).$$

For linear Gaussian covariance models we constrain $\ell(\,\cdot\,; S_n)$ to $\Theta_G$:
$$\hat{\Sigma} := \operatorname*{arg\,max}_{\theta \in \Theta_G} \ell(\Sigma_\theta; S_n).$$
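As a concrete reference point, here is a minimal numpy sketch of these two quantities (the names sample_cov and log_likelihood are illustrative):

```python
import numpy as np

def sample_cov(X):
    # S_n = (1/n) sum_i (X_i - Xbar)(X_i - Xbar)^T for an n x p data matrix X
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / X.shape[0]

def log_likelihood(Sigma, Sn, n):
    # l(Sigma; S_n) = -(n/2) log det Sigma - (n/2) tr(S_n Sigma^{-1})
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        return -np.inf  # outside the positive definite cone
    return -0.5 * n * (logdet + np.trace(Sn @ np.linalg.inv(Sigma)))
```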

Maximum likelihood estimation

For $K := \Sigma^{-1}$,
$$\ell(K; S_n) = \frac{n}{2} \log \det K - \frac{n}{2} \operatorname{tr}(S_n K)$$
is concave in $K \in S^p_{\succ 0}$, with global maximum at $K = S_n^{-1}$.

A concave function constrained to an affine subspace remains concave, so ML estimation in Gaussian graphical models is a convex problem.

$\ell(\Sigma; S_n)$ is not concave for all $\Sigma \in S^p_{\succ 0}$, but it is strictly concave over the random convex region $\{\Sigma \in S^p_{\succ 0} \mid \Sigma \prec 2S_n\}$.
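Testing whether a given $\Sigma$ lies in this region reduces to an eigenvalue check on $2S_n - \Sigma$; a one-function numpy sketch (below_2Sn is an illustrative name):

```python
import numpy as np

def below_2Sn(Sigma, Sn):
    # Sigma < 2*S_n in the Loewner order iff 2*S_n - Sigma is positive
    # definite, i.e. its smallest eigenvalue is strictly positive.
    return np.linalg.eigvalsh(2 * Sn - Sigma)[0] > 0
```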

Numerical optimization of the likelihood

Newton-Raphson method: start at the natural least-squares estimator $\bar{\Sigma} = \Sigma_{\bar{\theta}}$, where $\bar{\theta}$ is the solution to
$$\sum_{j=1}^r \bar{\theta}_j \operatorname{tr}(G_i G_j) = \operatorname{tr}(S_n G_i) \quad \text{for all } i = 1, \ldots, r.$$

Update:
$$\theta^{(k+1)} = \theta^{(k)} - \big(\nabla_\theta^2 \, \ell(\theta^{(k)}; S_n)\big)^{-1} \, \nabla_\theta \, \ell(\theta^{(k)}; S_n).$$

Our observation: for simulated data the Newton-Raphson algorithm typically converges in 2-3 steps. It converges to a point $\hat{\Sigma}$ with larger likelihood than the true (data-generating) covariance matrix $\Sigma^*$, and usually $\hat{\Sigma} \prec 2S_n$. A numerical sketch of the procedure follows below.
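A sketch of the procedure, with the model given as (G0, Gs) as above. The analytic gradient follows from standard matrix calculus; for brevity the Hessian is approximated by finite differences of that gradient rather than computed in closed form (a simplification — the talk does not specify the implementation):

```python
import numpy as np

def grad_loglik(theta, Sn, n, G0, Gs):
    # dl/dtheta_i = (n/2) tr((Sigma^{-1} S_n Sigma^{-1} - Sigma^{-1}) G_i)
    Sigma = G0 + sum(t * G for t, G in zip(theta, Gs))
    P = np.linalg.inv(Sigma)
    M = P @ Sn @ P - P
    return 0.5 * n * np.array([np.trace(M @ G) for G in Gs])

def newton_mle(Sn, n, G0, Gs, steps=10, h=1e-6):
    # Initialize at the least-squares estimator:
    # solve sum_j theta_j tr(G_i G_j) = tr(S_n G_i) for all i.
    r = len(Gs)
    A = np.array([[np.trace(Gi @ Gj) for Gj in Gs] for Gi in Gs])
    b = np.array([np.trace(Sn @ Gi) for Gi in Gs])
    theta = np.linalg.solve(A, b)
    for _ in range(steps):
        g = grad_loglik(theta, Sn, n, G0, Gs)
        if np.linalg.norm(g) < 1e-10:
            break
        # Hessian approximated by forward differences of the analytic gradient
        H = np.column_stack([
            (grad_loglik(theta + h * e, Sn, n, G0, Gs) - g) / h
            for e in np.eye(r)])
        H = 0.5 * (H + H.T)  # symmetrize
        theta = theta - np.linalg.solve(H, g)
    return theta
```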

Estimating correlation matrices

A venerable problem that becomes difficult even for $3 \times 3$ matrices: Rousseeuw & Molenberghs (1994), Small, Wang & Yang (2000), Stuart & Ord (1991).

Simulations using the Newton-Raphson algorithm, for two examples:
$$\Sigma^* = \begin{pmatrix} 1 & 1/2 & 1/3 \\ 1/2 & 1 & 1/4 \\ 1/3 & 1/4 & 1 \end{pmatrix} \quad \text{and} \quad \Sigma^* = \begin{pmatrix} 1 & 1/2 & 1/3 & 1/4 \\ 1/2 & 1 & 1/5 & 1/6 \\ 1/3 & 1/5 & 1 & 1/7 \\ 1/4 & 1/6 & 1/7 & 1 \end{pmatrix}.$$

The least-squares estimator is given by $\bar{\Sigma} = I_p + \sum_{1 \le i < j \le p} (S_n)_{ij} \, (E_{ij} + E_{ji})$.

Plot the ratio of likelihoods $L(\Sigma^{(t)})/L(\Sigma^*)$ along the Newton path and compare it to $L(S_n)/L(\Sigma^*)$.
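A hypothetical end-to-end run of the $3 \times 3$ example, reusing the helpers sketched on the previous slides (sample_cov, correlation_basis, newton_mle, log_likelihood); log-likelihood differences play the role of the plotted likelihood ratios:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma_star = np.array([[1.0, 1/2, 1/3],
                       [1/2, 1.0, 1/4],
                       [1/3, 1/4, 1.0]])
n, p = 50, 3
X = rng.multivariate_normal(np.zeros(p), Sigma_star, size=n)
Sn = sample_cov(X)
G0, Gs = correlation_basis(p)
theta_hat = newton_mle(Sn, n, G0, Gs)
Sigma_hat = G0 + sum(t * G for t, G in zip(theta_hat, Gs))

# log-likelihood differences correspond to the plotted likelihood ratios
print(log_likelihood(Sigma_hat, Sn, n) - log_likelihood(Sigma_star, Sn, n))
print(log_likelihood(Sn, Sn, n) - log_likelihood(Sigma_star, Sn, n))
```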

Estimating correlation matrices: Simulations

[Figure: likelihood-ratio paths over 10 Newton steps for the $3 \times 3$ model (top row) and the $4 \times 4$ model (bottom row), at sample sizes $n = 10$, $n = 50$ and $n = 100$; each panel plots the ratio $L(\Sigma^{(t)})/L(\Sigma^*)$ against the Newton step, with the value $L(S_n)/L(\Sigma^*)$ marked as S.]

Geometric picture

$\Sigma^*$ = true covariance matrix, $S_n$ = sample covariance matrix.
$\{\Sigma \in S^p_{\succ 0} \mid \Sigma \prec 2S_n\}$ is a (random!) convex region.
Clearly: $S_n$ lies in this region.
With high probability, $\Sigma^*$, $\hat{\Sigma}$ and $\bar{\Sigma}$ all lie in this region.

Wishart distribution

For an i.i.d. sample $X_1, \ldots, X_n$ from $N_p(\mu, \Sigma)$, $n S_n$ has the Wishart distribution $W_p(n-1, \Sigma)$.
If $Q \in \mathbb{R}^{p \times p}$ has full rank and $Y \sim W_p(n, \Sigma)$, then $Q Y Q^T \sim W_p(n, Q \Sigma Q^T)$.
So taking $Q = \Sigma^{-1/2}$, the matrix $W_{n-1} := n \, \Sigma^{-1/2} S_n \Sigma^{-1/2}$ has the standard Wishart distribution $W_p(n-1, I_p)$.

Hence:
$$P(\Sigma^* \prec 2S_n) = P(2S_n - \Sigma^* \succ 0) = P\big(2 (\Sigma^*)^{-1/2} S_n (\Sigma^*)^{-1/2} - I_p \succ 0\big) = P\big(W_{n-1} - \tfrac{n}{2} I_p \succ 0\big) = P\big(\lambda_{\min}(W_{n-1}) > n/2\big).$$

Note: $P(\Sigma^* \prec 2S_n)$ does not depend on $\Sigma^*$.
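Since this probability is distribution-free, it can be estimated by direct simulation of standard Wishart matrices. A minimal Monte Carlo sketch (the function name is illustrative):

```python
import numpy as np

def prob_sigma_below_2Sn(n, p, reps=20000, seed=0):
    # Monte Carlo estimate of P(lambda_min(W_{n-1}) > n/2), where
    # W_{n-1} = Z^T Z ~ W_p(n-1, I_p) for an (n-1) x p standard normal Z.
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        Z = rng.standard_normal((n - 1, p))
        lam_min = np.linalg.eigvalsh(Z.T @ Z)[0]
        hits += lam_min > n / 2
    return hits / reps

# e.g. prob_sigma_below_2Sn(50, 3) is close to 1,
# while prob_sigma_below_2Sn(10, 3) is noticeably smaller
```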

Minimum eigenvalue of a Wishart matrix

What is known about the distribution of $\lambda_{\min}(W_n)$?

R. Muirhead, Aspects of Multivariate Statistical Theory (1982): the distribution of $\lambda_{\min}(W_n)$ is known, but it is expressed in terms of complicated functions that are hard to evaluate. Approximating the integral $P(\lambda_{\min}(W_n) > n/2)$ is hard, and convergence to the asymptotic distribution (for $n \to \infty$) is very slow.

However, a recent development in random matrix theory: the asymptotic distribution of $\lambda_{\min}(W_n)$ as $n, p \to \infty$ with $n/p \to \gamma > 1$ is given by the Tracy-Widom distribution, with convergence rate $O(\min(n, p)^{-2/3})$ (Ma, 2012).

Approximating $P(\Sigma^* \prec 2S_n)$ for small $p$ and $n$

[Figure: three panels, (a) $p = 3$, (b) $p = 5$, (c) $p = 10$, each plotting the simulated probability and the Tracy-Widom approximation against the sample size, with a 0.95 reference line.]

In each plot, $p \in \{3, 5, 10\}$ is fixed and $n$ varies between $p$ and $20p$.
For $n > 14p$ it holds with probability 0.95 that $\Sigma^* \prec 2S_n$.
The curves above converge (as $p \to \infty$) to the graph of $f : (1, \infty) \to [0, 1]$ with $f(n/p) = \mathbf{1}(n/p \ge 6 + 4\sqrt{2})$, where $6 + 4\sqrt{2} \approx 11.66$.
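Where the constant $6 + 4\sqrt{2}$ comes from can be checked against the Bai-Yin law $\lambda_{\min}(W_{n-1})/n \to (1 - \gamma^{-1/2})^2$ for white Wishart matrices (an assumption brought in here; the slides do not show this step):

```latex
% Assuming the Bai--Yin limit \lambda_{\min}(W_{n-1})/n \to (1 - \gamma^{-1/2})^2
% a.s. as n, p \to \infty with n/p \to \gamma > 1, the condition
% \lambda_{\min}(W_{n-1}) > n/2 holds in the limit iff
\begin{align*}
\bigl(1 - \gamma^{-1/2}\bigr)^2 > \tfrac12
  &\iff \gamma^{-1/2} < 1 - \tfrac{1}{\sqrt{2}}
   \iff \gamma^{-1} < \tfrac{3}{2} - \sqrt{2} \\
  &\iff \gamma > \frac{2}{3 - 2\sqrt{2}}
        = \frac{2\,(3 + 2\sqrt{2})}{9 - 8}
        = 6 + 4\sqrt{2} \approx 11.66.
\end{align*}
```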

Conclusions and discussion

The likelihood function for linear Gaussian covariance models is, in general, multimodal. However, multimodality is relevant only if the sample size is not sufficiently large to compensate for the model dimension.

We derived asymptotic conditions that guarantee that $\Sigma^*$, $\hat{\Sigma}$ and $\bar{\Sigma}$ are contained in the convex region $\{\Sigma \mid \Sigma \prec 2S_n\}$. Our results provide lower bounds on the probability that the maximum likelihood estimation problem for linear Gaussian covariance models is well behaved.

The region $\{\Sigma \mid \Sigma \prec 2S_n\}$ is contained in a larger region over which the likelihood function is strictly concave, and this region is contained in an even larger region over which the likelihood function is unimodal. We are studying these regions and working on extensions to learning the model.

Reference

Zwiernik, Uhler & Richards: Maximum likelihood estimation for linear Gaussian covariance models (arXiv:1408.5604)