High Dimensional Inverse Covariance Matrix Estimation via Linear Programming Ming Yuan October 24, 2011
Gaussian Graphical Model
$X = (X_1, \ldots, X_p)' \sim N(\mu, \Sigma)$
Inverse covariance matrix: $\Sigma^{-1} = \Omega = (\omega_{ij})_{p \times p}$
$\omega_{ij} = 0$ iff $X_i$ and $X_j$ are conditionally independent given the remaining variables
Previous Work
Yuan and Lin (2007), Banerjee, El Ghaoui and d'Aspremont (2008), d'Aspremont et al. (2008), Friedman et al. (2008):
$\min_\Omega \; \mathrm{tr}(\hat\Sigma\Omega) - \log\det(\Omega) + \lambda \sum_{i \neq j} |\omega_{ij}|$,
where $\hat\Sigma$ is the sample covariance matrix. These methods may not perform well when $p$ is larger than $n$.
Meinshausen and Bühlmann (2006): neighborhood selection approach, which focuses on identifying the correct graphical model rather than on estimation.
Motivation
Focus of this paper:
Estimate a high dimensional inverse covariance matrix that can be well approximated by sparse matrices.
An estimation procedure via linear programming that has the potential to be used in very high dimensional problems.
Oracle inequalities are established for the estimation error.
Methodology
Write $X_{-i} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_p)'$. Then
$X_i \mid X_{-i} \sim N\big(\mu_i + \Sigma_{i,-i}\Sigma_{-i,-i}^{-1}(X_{-i} - \mu_{-i}),\; \Sigma_{ii} - \Sigma_{i,-i}\Sigma_{-i,-i}^{-1}\Sigma_{-i,i}\big)$.
Equivalently, $X_i = \alpha_i + X_{-i}'\theta^{(i)} + \epsilon_i$, where
$\alpha_i = \mu_i - \Sigma_{i,-i}\Sigma_{-i,-i}^{-1}\mu_{-i}$ is a scalar,
$\theta^{(i)} = \Sigma_{-i,-i}^{-1}\Sigma_{-i,i}$ is a $(p-1)$-dimensional vector, and
$\epsilon_i \sim N\big(0,\; \Sigma_{ii} - \Sigma_{i,-i}\Sigma_{-i,-i}^{-1}\Sigma_{-i,i}\big)$.
Regression and Inverse Covariance Matrix
$\Omega_{ii} = \big(\Sigma_{ii} - \Sigma_{i,-i}\Sigma_{-i,-i}^{-1}\Sigma_{-i,i}\big)^{-1} = (\mathrm{Var}(\epsilon_i))^{-1}$;
$\Omega_{-i,i} = -\big(\Sigma_{ii} - \Sigma_{i,-i}\Sigma_{-i,-i}^{-1}\Sigma_{-i,i}\big)^{-1}\Sigma_{-i,-i}^{-1}\Sigma_{-i,i} = -(\mathrm{Var}(\epsilon_i))^{-1}\theta^{(i)}$.
Thus sparsity in the entries of $\Omega$ corresponds to sparsity in the $\theta^{(i)}$.
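These block-inverse identities are easy to verify numerically. A minimal sketch in Python (NumPy assumed; the dimension, index, and seed are arbitrary choices for illustration) builds a positive definite $\Sigma$, computes $\theta^{(i)}$ and $\mathrm{Var}(\epsilon_i)$ from the formulas above, and checks them against the entries of $\Omega = \Sigma^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
B = rng.standard_normal((p, p))
Sigma = B @ B.T + p * np.eye(p)        # a positive definite covariance matrix
Omega = np.linalg.inv(Sigma)

i = 2
idx = np.delete(np.arange(p), i)
# theta^(i) = Sigma_{-i,-i}^{-1} Sigma_{-i,i}
theta = np.linalg.solve(Sigma[np.ix_(idx, idx)], Sigma[idx, i])
# Var(eps_i) = Sigma_ii - Sigma_{i,-i} Sigma_{-i,-i}^{-1} Sigma_{-i,i}
var_eps = Sigma[i, i] - Sigma[i, idx] @ theta

assert np.isclose(Omega[i, i], 1.0 / var_eps)        # Omega_ii = 1 / Var(eps_i)
assert np.allclose(Omega[idx, i], -theta / var_eps)  # Omega_{-i,i} = -Omega_ii theta^(i)
```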
Initial Estimate
Let $Z_i = X_i - \bar X_i$. The Dantzig selector estimate of $\theta^{(i)}$ is
$\min_{\beta \in \mathbb{R}^{p-1}} \|\beta\|_{\ell_1}$ subject to $\big\|E_n[(Z_i - Z_{-i}'\beta)Z_{-i}]\big\|_{\ell_\infty} \le \delta$,
where $E_n$ denotes the sample average and $\delta > 0$ is a tuning parameter. Since $E_n Z_i Z_j = S_{ij}$, this is equivalent to
$\min_{\beta \in \mathbb{R}^{p-1}} \|\beta\|_{\ell_1}$ subject to $\|S_{-i,i} - S_{-i,-i}\beta\|_{\ell_\infty} \le \delta$.
Once an estimate $\hat\theta^{(i)}$ is obtained,
$\widehat{\mathrm{Var}}(\epsilon_i) = E_n\big(X_i - \bar X_i - Z_{-i}'\hat\theta^{(i)}\big)^2 = S_{ii} - 2\,\hat\theta^{(i)\prime} S_{-i,i} + \hat\theta^{(i)\prime} S_{-i,-i}\hat\theta^{(i)}$.
Plugging $\hat\theta^{(i)}$ and $\widehat{\mathrm{Var}}(\epsilon_i)$ into the identities of the previous slide gives the $i$-th column of an initial estimate $\tilde\Omega$; we repeat this procedure for $i = 1, \ldots, p$.
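Each per-variable problem can be handed to an off-the-shelf LP solver: splitting $\beta = u - v$ with $u, v \ge 0$ turns $\|\beta\|_{\ell_1}$ into a linear objective, and the $\ell_\infty$ constraint becomes $2(p-1)$ linear inequalities. A minimal sketch assuming SciPy (the function name and interface are illustrative, not the paper's):

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(S, i, delta):
    """Per-variable Dantzig selector:
        min ||beta||_1  s.t.  ||S_{-i,i} - S_{-i,-i} beta||_inf <= delta,
    recast as a linear program via beta = u - v with u, v >= 0."""
    p = S.shape[0]
    idx = np.delete(np.arange(p), i)
    A = S[np.ix_(idx, idx)]            # S_{-i,-i}
    b = S[idx, i]                      # S_{-i,i}
    m = p - 1
    c = np.ones(2 * m)                 # objective: sum(u) + sum(v) = ||beta||_1
    # |A(u - v) - b| <= delta, written as two stacks of inequalities
    A_ub = np.block([[A, -A], [-A, A]])
    b_ub = np.concatenate([b + delta, delta - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    u, v = res.x[:m], res.x[m:]
    return u - v
```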
Symmetrization
$\tilde\Omega$ itself is not a valid estimate, since it is not symmetric.
However, $\tilde\Omega$ can be a good estimate in a certain matrix operator norm.
Matrix Operator Norm
For a vector $x = (x_1, x_2, \ldots, x_p)'$,
$\|x\|_{\ell_q} = \big(|x_1|^q + \cdots + |x_p|^q\big)^{1/q}$.
For a square matrix $A_{p \times p}$,
$\|A\|_{\ell_q} = \sup_{x \neq 0} \frac{\|Ax\|_{\ell_q}}{\|x\|_{\ell_q}}$.
Special cases:
$\|A\|_{\ell_1} = \max_{1 \le j \le p} \sum_{i=1}^p |a_{ij}|$ (maximum absolute column sum);
$\|A\|_{\ell_\infty} = \max_{1 \le i \le p} \sum_{j=1}^p |a_{ij}|$ (maximum absolute row sum);
$\|A\|_{\ell_2} = \sigma_{\max}(A)$, the largest singular value of $A$.
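As a quick reference, all three special cases are one-liners in NumPy; a minimal sketch (the function name is mine):

```python
import numpy as np

def matrix_operator_norms(A):
    """Return the l1, l_infinity, and l2 operator norms of A."""
    l1 = np.abs(A).sum(axis=0).max()    # max absolute column sum
    linf = np.abs(A).sum(axis=1).max()  # max absolute row sum
    l2 = np.linalg.norm(A, 2)           # largest singular value
    return l1, linf, l2
```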
Symmetrization
Construct $\hat\Omega$ by
$\hat\Omega = \arg\min_{\Omega \text{ symmetric}} \|\Omega - \tilde\Omega\|_{\ell_1}$.
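Since the $\ell_1$ operator norm is a maximum of column-wise absolute sums, this symmetrization step is itself a linear program. A minimal sketch using CVXPY as an assumed off-the-shelf modeling tool (not the paper's own solver):

```python
import cvxpy as cp

def symmetrize(Omega_tilde):
    """Solve min_{Omega symmetric} ||Omega - Omega_tilde||_{l1},
    where ||.||_{l1} is the maximum absolute column sum (an LP)."""
    p = Omega_tilde.shape[0]
    Omega = cp.Variable((p, p), symmetric=True)
    # l1 operator norm: max over columns of the column absolute sum
    objective = cp.Minimize(cp.max(cp.sum(cp.abs(Omega - Omega_tilde), axis=0)))
    cp.Problem(objective).solve()
    return Omega.value
```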
Algorithm Summary
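Putting the pieces together: one Dantzig-selector LP per variable, plug-in variance estimates forming $\tilde\Omega$ column by column, then the symmetrization LP. A minimal end-to-end sketch reusing the hypothetical helpers from the previous slides:

```python
import numpy as np

def inverse_covariance_lp(X, delta):
    """X: n x p data matrix; delta: Dantzig selector tuning parameter.
    Returns the symmetrized inverse covariance estimate."""
    n, p = X.shape
    S = np.cov(X, rowvar=False, bias=True)   # sample covariance S = E_n Z Z'
    Omega_tilde = np.zeros((p, p))
    for i in range(p):
        idx = np.delete(np.arange(p), i)
        theta = dantzig_selector(S, i, delta)   # per-variable LP (earlier sketch)
        # plug-in residual variance: S_ii - 2 theta'S_{-i,i} + theta'S_{-i,-i} theta
        var_eps = (S[i, i] - 2 * theta @ S[idx, i]
                   + theta @ S[np.ix_(idx, idx)] @ theta)
        # column i of the initial estimate, via the regression identities
        Omega_tilde[i, i] = 1.0 / var_eps
        Omega_tilde[idx, i] = -theta / var_eps
    return symmetrize(Omega_tilde)              # symmetrization LP (earlier sketch)
```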
Oracle Set
$A \succ 0$: matrix $A$ is symmetric and positive definite.
$\lambda_{\min}(A)$ and $\lambda_{\max}(A)$: the smallest and largest eigenvalues of $A$, respectively.
$\|A\|_{\max} = \max_{1 \le i,j \le p} |a_{ij}|$.
The approximation error is measured by $\|\Sigma_0\Omega - I\|_{\max}$.
Oracle Inequality
Theorem 1. There exist constants $C_1, C_2$ depending only on $\nu, \tau, \lambda_{\min}(\Omega_0)$ and $\lambda_{\max}(\Omega_0)$, and $C_3$ depending only on $c_0$, such that, for any $A > 0$, with probability at least $1 - P_A$,
$\|\hat\Omega - \Omega_0\|_{\ell_1} \le C_1 \inf_{\Omega \in \mathcal{O}(\nu,\eta,\tau)} \big(\|\Omega - \Omega_0\|_{\ell_1} + \deg(\Omega)\,\delta\big)$,
provided that
$\inf_{\Omega \in \mathcal{O}(\nu,\eta,\tau)} \big(\|\Omega - \Omega_0\|_{\ell_1} + \deg(\Omega)\,\delta\big) \le C_2$
and
$\delta \ge \nu\eta + C_3 \nu\tau \lambda_{\min}^{-1}(\Omega_0)\big((A+1)n^{-1}\log p\big)^{1/2}$,
where $\deg(\Omega) = \max_i \sum_j I(\Omega_{ij} \neq 0)$.
Oracle Inequality
Recall that for a symmetric matrix $A$,
$\|A\|_{\ell_\infty} = \|A\|_{\ell_1}$ and $\|A\|_{\ell_2} \le \big(\|A\|_{\ell_1}\|A\|_{\ell_\infty}\big)^{1/2} = \|A\|_{\ell_1}$.
Corollary 2. Under the same conditions as Theorem 1, for any $A > 0$, with probability at least $1 - P_A$,
$\|\hat\Omega - \Omega_0\|_{\ell_\infty},\; \|\hat\Omega - \Omega_0\|_{\ell_2} \le C_1 \inf_{\Omega \in \mathcal{O}(\nu,\eta,\tau)} \big(\|\Omega - \Omega_0\|_{\ell_1} + \deg(\Omega)\,\delta\big)$.
Note: the proposed $\hat\Omega$ is not guaranteed to be positive definite, but Corollary 2 suggests that with overwhelming probability it is, since $\lambda_{\min}(\hat\Omega) \ge \lambda_{\min}(\Omega_0) - \|\hat\Omega - \Omega_0\|_{\ell_2}$.
Oracle Inequality
Moreover, a positive definite estimate of $\Omega$ can always be constructed from $\hat\Omega$ by replacing its negative eigenvalues with $\delta$.
Corollary 3. Under the same conditions as Theorem 1, for any $A > 0$, with probability at least $1 - P_A$,
$\|\hat\Omega^{-1} - \Sigma_0\|_{\ell_2},\; \|\hat\Omega - \Omega_0\|_{\ell_2} \le C_1 \inf_{\Omega \in \mathcal{O}(\nu,\eta,\tau)} \big(\|\Omega - \Omega_0\|_{\ell_1} + \deg(\Omega)\,\delta\big)$.
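The eigenvalue replacement described above is straightforward to carry out; a minimal sketch for a symmetric input (the function name is mine):

```python
import numpy as np

def force_positive_definite(Omega_hat, delta):
    """Replace negative eigenvalues of a symmetric estimate with delta."""
    w, V = np.linalg.eigh(Omega_hat)   # eigendecomposition of symmetric matrix
    w = np.where(w < 0, delta, w)      # lift negative eigenvalues to delta
    return (V * w) @ V.T               # reassemble V diag(w) V'
```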
Sparse Models
Define
$\mathcal{M}_1(\tau_0, \nu_0, d) = \big\{A \succ 0 : \|A\|_{\ell_1} \le \tau_0,\; \nu_0^{-1} \le \lambda_{\min}(A) \le \lambda_{\max}(A) \le \nu_0,\; \deg(A) \le d\big\}$,
where $\tau_0, \nu_0 > 1$ and $\deg(A) = \max_i \sum_j I(A_{ij} \neq 0)$.
Theorem 4. Assume that $d(n^{-1}\log p)^{1/2} = o(1)$. Then
$\sup_{\Omega_0 \in \mathcal{M}_1(\tau_0,\nu_0,d)} \|\hat\Omega - \Omega_0\|_{\ell_q} = O_p\Big(d\sqrt{\tfrac{\log p}{n}}\Big)$,
provided that $\delta = C(n^{-1}\log p)^{1/2}$ and $C$ is large enough.
Sparse Models
Theorem 5. Assume that $d(n^{-1}\log p)^{1/2} = o(1)$. Then there exists a constant $C > 0$ depending only on $\tau_0$ and $\nu_0$ such that
$\inf_{\check\Omega} \sup_{\Omega_0 \in \mathcal{M}_1(\tau_0,\nu_0,d)} P\Big\{\|\check\Omega - \Omega_0\|_{\ell_1} \ge Cd\sqrt{\tfrac{\log p}{n}}\Big\} > 0$,
where the infimum is taken over all estimators $\check\Omega$ based on the observations $X^{(1)}, \ldots, X^{(n)}$.
Note: the estimability of a sparse inverse covariance matrix is dictated by its degree $\deg(A)$, as opposed to the total number of nonzero entries.
Approximately Sparse Models
In many applications, the inverse covariance matrix is only approximately sparse. Define
$\mathcal{M}_2(\tau_0, \nu_0, \alpha, M) = \big\{A \succ 0 : \|A^{-1}\|_{\ell_1} \le \tau_0,\; \nu_0^{-1} \le \lambda_{\min}(A) \le \lambda_{\max}(A) \le \nu_0,\; \max_{1 \le i \le p}\sum_{j=1}^p |A_{ij}|^\alpha \le M\big\}$,
where $\tau_0, \nu_0 > 1$ and $0 < \alpha < 1$.
$\mathcal{M}_1$ is the limiting case of $\mathcal{M}_2$ as $\alpha$ approaches 0.
Approximately Sparse Models
Theorem 6. Assume that $M(n^{-1}\log p)^{\frac{1-\alpha}{2}} = o(1)$. Then
$\sup_{\Omega_0 \in \mathcal{M}_2(\tau_0,\nu_0,\alpha,M)} \|\hat\Omega - \Omega_0\|_{\ell_q} = O_p\Big(M\big(\tfrac{\log p}{n}\big)^{\frac{1-\alpha}{2}}\Big)$,
provided that $\delta = C(n^{-1}\log p)^{1/2}$ and $C$ is sufficiently large.
Theorem 7. Assume that $M(n^{-1}\log p)^{\frac{1-\alpha}{2}} = o(1)$. Then there exists a constant $C > 0$ depending only on $\tau_0$ and $\nu_0$ such that
$\inf_{\check\Omega} \sup_{\Omega_0 \in \mathcal{M}_2(\tau_0,\nu_0,\alpha,M)} P\Big\{\|\check\Omega - \Omega_0\|_{\ell_1} \ge CM\big(\tfrac{\log p}{n}\big)^{\frac{1-\alpha}{2}}\Big\} > 0$
and
$\inf_{\check\Sigma} \sup_{\Sigma_0 \in \mathcal{M}_2(\tau_0,\nu_0,\alpha,M)} P\Big\{\|\check\Sigma - \Sigma_0\|_{\ell_1} \ge CM\big(\tfrac{\log p}{n}\big)^{\frac{1-\alpha}{2}}\Big\} > 0$,
where the infima are taken over all estimators based on the observations $X^{(1)}, \ldots, X^{(n)}$.
Simulation Result
$n = 50$; $p = 25, 50, 100,$ or $200$.
$\Sigma_{0,ij} = \rho^{|i-j|}$, where $\rho = 0.1, 0.2, \ldots, 0.7$.
Set $\delta = (2n^{-1}\log p)^{1/2}$ for all simulation studies.
Methods for comparison:
(1) $\ell_1$ penalized likelihood estimate (Yuan and Lin 2007);
(2) neighborhood selection approach (Meinshausen and Bühlmann 2006).
Estimation error is measured by $\|\hat C - C\|_{\ell_2}$.
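A minimal sketch of this simulation setup in Python, reusing the hypothetical inverse_covariance_lp pipeline from earlier; the seed and the particular $(p, \rho)$ pair are arbitrary, and $\delta = (2n^{-1}\log p)^{1/2}$ follows the setting above:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, rho = 50, 100, 0.5
# Sigma_0 with entries rho^{|i-j|}
Sigma0 = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
delta = np.sqrt(2 * np.log(p) / n)
X = rng.multivariate_normal(np.zeros(p), Sigma0, size=n)
Omega_hat = inverse_covariance_lp(X, delta)                    # earlier sketch
error = np.linalg.norm(Omega_hat - np.linalg.inv(Sigma0), 2)   # l2 operator norm
print(f"operator-norm estimation error: {error:.3f}")
```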