Robust and sparse Gaussian graphical modelling under cell-wise contamination
Shota Katayama (1), Hironori Fujisawa (2) and Mathias Drton (3)
(1) Tokyo Institute of Technology, Japan; (2) The Institute of Statistical Mathematics, Japan; (3) University of Washington, USA

1 Introduction

Let Y = (Y_1, ..., Y_p)^T be a p-dimensional random vector representing a multivariate observation. The conditional independence graph of Y is the undirected graph G = (V, E) whose vertex set V = {1, ..., p} indexes the individual variables and whose edge set E indicates conditional dependences among them. More precisely, (i, j) ∉ E if and only if Y_i and Y_j are conditionally independent given Y_{V\{i,j}} = {Y_k : k ≠ i, j}. For a Gaussian vector, the edge set E corresponds to the support of the precision matrix. Indeed, it is well known that if Y follows a multivariate Gaussian distribution N_p(μ, Σ) with mean vector μ and covariance matrix Σ, then (i, j) ∉ E if and only if Ω_ij = 0, where Ω = Σ^{-1}. Inference of the conditional independence graph sheds light on direct as opposed to indirect interactions and has received much recent attention (Drton & Maathuis, 2017). In particular, for high-dimensional Gaussian problems, several techniques have been developed that exploit sparsity when inferring the support of the precision matrix Ω. Meinshausen & Bühlmann (2006) suggested fitting node-wise linear regression models with an ℓ1 penalty to recover the support of each row. Yuan & Lin (2007), Banerjee et al. (2008) and Friedman et al. (2008) considered the graphical lasso (Glasso), which is based on the ℓ1-penalized log-likelihood function. Cai et al. (2011) proposed constrained ℓ1 minimization for inverse matrix estimation (CLIME), which may be formulated as a linear program. In fields such as bioinformatics and economics, data are often not only high-dimensional but also subject to contamination. While suited to high dimensionality, the techniques mentioned above are sensitive to contamination.
Moreover, traditional robust methods may not be appropriate when the number of variables is large, because they are based on a model in which an observation vector is either entirely free of contamination or fully contaminated.
Hence, an observation vector is treated as an outlier even if only one of its many variables is contaminated. As a result, these methods down-weight the entire vector regardless of whether it contains clean values for some variables. Such information loss can become fatal as the dimension increases. As a more realistic model for high-dimensional data, Alqallaf et al. (2002) considered cell-wise contamination: the observations X_1, ..., X_n with p variables are generated by

X_i = (I_p − E_i) Y_i + E_i Z_i,  i = 1, ..., n.  (1)

Here, I_p is the p × p identity matrix and each E_i = diag(E_i1, ..., E_ip) is a diagonal random matrix with the E_ij independent and Bernoulli distributed with P(E_ij = 1) = ε_j. The random vectors Y_i and Z_i are independent; Y_i ~ N_p(μ, Σ) corresponds to a clean sample, while Z_i contaminates some elements of X_i. Our goal is to develop a robust method for estimating the conditional independence graph G of Y_i from the cell-wise contaminated observations X_i. Techniques such as node-wise regression, Glasso and CLIME process an estimate of the covariance matrix. Our strategy is thus simply to apply these procedures using a covariance matrix estimator that is robust against cell-wise contamination. However, while many researchers have considered the traditional whole-vector contamination framework (see, e.g., Maronna et al., 2006, Chapter 6), there are fewer existing methods for cell-wise contamination. Specifically, we are aware of three approaches: the use of alternative t-distributions (Finegold & Drton, 2011), the use of rank correlations (Loh & Tan, 2015; Öllerer & Croux, 2015), and the pairwise covariance estimation method of Tarr et al. (2016), who adopt an idea of Gnanadesikan & Kettenring (1972). In contrast, in this talk we provide a robust covariance matrix estimator via the γ-divergence proposed by Fujisawa & Eguchi (2008).
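Model (1) is straightforward to simulate. The following sketch is our own illustration (the AR(1)-type Σ, the common cell probability ε_j ≡ ε, and the N(10, 1) contaminating distribution are arbitrary choices, not from the paper); it also shows why row-wise robustness fails here: a row is fully clean only with probability (1 − ε)^p.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, eps = 200, 5, 0.05          # sample size, dimension, cell contamination level

# Clean covariance with an AR(1)-type structure (illustrative choice).
Sigma = np.array([[0.5 ** abs(j - k) for k in range(p)] for j in range(p)])
Y = rng.multivariate_normal(np.zeros(p), Sigma, size=n)   # clean samples Y_i

# Cell-wise contamination: each cell is independently replaced with
# probability eps by a draw from the contaminating distribution.
E = rng.binomial(1, eps, size=(n, p))                     # E_ij ~ Bernoulli(eps)
Z = rng.normal(10.0, 1.0, size=(n, p))                    # outlying cells
X = (1 - E) * Y + E * Z                                   # model (1), element-wise

print("fraction of contaminated cells:", E.mean())
print("fraction of rows with at least one bad cell:", (E.sum(axis=1) > 0).mean())
```

Already for moderate p the second fraction approaches one, so a whole-vector robust method would down-weight nearly every observation.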
The γ-divergence can automatically reduce the impact of contamination, and it is known to be robust even when the amount of contamination is large.

2 Preliminaries

2.1 Graph estimation

For concise presentation, we focus on node-wise regression, Glasso and CLIME. Let ˆΣ be an estimator of Σ. For index sets A and B, define ˆΣ_{A,B} as the sub-matrix of ˆΣ with the rows in A and the columns in B. We use the shorthand \j for the set V \ {j}, so that ˆΣ_{\j,\j} denotes the sub-matrix with both rows and columns in V \ {j}. In ℓ1-penalized node-wise regression, one finds

ˆβ^{(j)} = argmin_{β ∈ R^{p−1}} (1/2) β^T ˆΣ_{\j,\j} β − ˆΣ_{j,\j} β + λ ‖β‖_1,  j = 1, ..., p,

where the tuning parameter λ > 0 controls the strength of the penalty ‖β‖_1. Large λ yields high sparsity of ˆβ^{(j)}. After obtaining ˆβ^{(1)}, ..., ˆβ^{(p)}, the edge set E is estimated by the AND rule Ê = {(i, j) : ˆβ^{(j)}_i ≠ 0 and ˆβ^{(i)}_j ≠ 0} or the OR rule Ê = {(i, j) : ˆβ^{(j)}_i ≠ 0 or ˆβ^{(i)}_j ≠ 0}. Node-wise regression is well-defined for any positive semidefinite estimate ˆΣ. The Glasso estimator is obtained by solving

ˆΩ_Glasso = argmin_{Ω ∈ R^{p×p}} tr(ˆΣΩ) − log det Ω + λ ‖Ω‖_1,  (2)

where ‖Ω‖_1 is the element-wise ℓ1 norm of Ω and λ > 0 is a tuning parameter that controls the sparsity of Ω. The edge set may be estimated by Ê = {(i, j) : ˆΩ_ij ≠ 0}. Efficient algorithms for the Glasso are given in Friedman et al. (2008) and Hsieh et al. (2011). For convergence, the former requires ˆΣ + λI_p to be positive semidefinite, while the latter requires the same of ˆΣ. Finally, we review the CLIME method. Let

ˆΩ^0 = argmin_{Ω ∈ R^{p×p}} ‖Ω‖_1 subject to ‖ˆΣΩ − I_p‖_∞ ≤ λ,  (3)

where ‖·‖_∞ denotes the element-wise infinity norm. Generally, ˆΩ^0 = (ˆω^0_ij) is not symmetric. The CLIME estimator is defined through a simple symmetrization, namely

ˆΩ_CLIME,ij = ˆω^0_ij I(|ˆω^0_ij| ≤ |ˆω^0_ji|) + ˆω^0_ji I(|ˆω^0_ij| > |ˆω^0_ji|),  i, j = 1, ..., p,

and the edge set is estimated as for the Glasso. Cai et al. (2011) translated the matrix optimization problem (3) into p vector optimization problems, each of which can be solved by linear programming. The CLIME essentially needs positive definiteness of ˆΣ; without it, the optimization problem may be infeasible or return inadequate solutions.

2.2 Robust inference via the γ-divergence

Let f and f_n be the data-generating and empirical densities, respectively. In robust inference one typically assumes that f = (1 − ε)g + εh, where g is the density of the clean data, h is the density of the contamination, and ε ≥ 0 is the contamination level. For estimation of g, consider a model with densities g_θ indexed by a parameter θ.
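Returning to the graph-estimation step of Section 2.1, ℓ1-penalized node-wise regression with the AND rule can be sketched with a simple coordinate descent operating directly on a covariance estimate (a minimal illustration with arbitrary test data; not the authors' code):

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * max(abs(x) - t, 0.0)

def nodewise_and_rule(Sigma_hat, lam, n_sweeps=200):
    """Coordinate descent for min_beta 0.5 b'S_{-j,-j}b - S_{j,-j}b + lam||b||_1
    for each j, followed by the AND rule on the fitted coefficients."""
    p = Sigma_hat.shape[0]
    B = np.zeros((p, p))                  # B[j, k]: coefficient of variable k when regressing j
    for j in range(p):
        idx = [k for k in range(p) if k != j]
        A = Sigma_hat[np.ix_(idx, idx)]   # Sigma_hat_{\j,\j}
        b = Sigma_hat[idx, j]             # Sigma_hat_{\j,j}
        beta = np.zeros(p - 1)
        for _ in range(n_sweeps):
            for k in range(p - 1):
                r = b[k] - A[k] @ beta + A[k, k] * beta[k]   # partial residual
                beta[k] = soft_threshold(r, lam) / A[k, k]
        B[j, idx] = beta
    E_hat = (B != 0) & (B.T != 0)         # AND rule; OR rule would use |
    np.fill_diagonal(E_hat, False)
    return E_hat

# Population check on a chain graph: tridiagonal precision matrix.
Omega = np.diag(np.full(4, 2.0)) + np.diag(np.full(3, 0.8), 1) + np.diag(np.full(3, 0.8), -1)
Sigma = np.linalg.inv(Omega)
E_hat = nodewise_and_rule(Sigma, lam=0.05)
print(E_hat.astype(int))
```

Since the population coefficient of variable k in the regression of variable j equals −Ω_jk/Ω_jj, the chain edges (0,1), (1,2), (2,3) carry nonzero coefficients and survive the soft-thresholding.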
The Kullback-Leibler (KL) divergence between f_n and g_θ results in a biased estimate unless ε = 0. To overcome this, Fujisawa & Eguchi (2008) proposed the γ-divergence, given by

d_γ(f_n, g_θ) = −(1/γ) log ∫ f_n(x) g_θ(x)^γ dx + (1/(1+γ)) log ∫ g_θ(x)^{1+γ} dx,

where γ > 0 is a constant that controls the trade-off between efficiency and robustness. In fact, the γ-divergence reduces to the KL divergence as γ → 0. The estimator obtained by minimizing d_γ(f_n, g_θ) over the parameter space is highly robust. Roughly speaking, in the limit n → ∞, d_γ(f_n, g_θ) can be regarded as d_γ(f, g_θ), and

d_γ(f, g_θ) = −(1/γ) log{ (1 − ε) ∫ g(x) g_θ(x)^γ dx + ε ν(θ; γ) } + (1/(1+γ)) log ∫ g_θ(x)^{1+γ} dx,

where ν(θ; γ) = ∫ h(x) g_θ(x)^γ dx. The γ-divergence provides robust estimates whenever ν(θ; γ) ≈ 0 over the parameter space considered. In that case, we see that

d_γ(f, g_θ) ≈ d_γ(g, g_θ) − (1/γ) log(1 − ε),

so the contamination density h is automatically ignored and the minimizer of d_γ(f, g_θ) is approximately equal to the minimizer of d_γ(g, g_θ). This is a favorable property, because when g = g_{θ0}, the minimizer of d_γ(g, g_θ) is θ0. Fortunately, ν(θ; γ) is close to zero when h lies in the tail of g_θ. To illustrate this, assume for a moment that g_θ is the density of N(θ, 1). If h is the density of N(α, 1), then ν(θ; γ) = c_{1,γ} exp{−c_{2,γ}(α − θ)²} for some constants c_{1,γ}, c_{2,γ} > 0. Thus, ν(θ; γ) is small whenever α is not too close to the set of parameters θ that give the better-fitting densities g_θ.

3 Methods

3.1 Proposed methodology

As noted in Section 2.1, we seek a robust covariance estimate ˆΣ for use in graph estimation. In this section, we construct such an estimate via the γ-divergence. Our estimator ˆΣ is built element-wise and exhibits robustness to cell-wise contamination. We begin by writing each covariance as

Σ_jk = √(σ_jj σ_kk) ρ_jk,  (4)

where σ_jj = Var(Y_ij), σ_kk = Var(Y_ik) and ρ_jk = Corr(Y_ij, Y_ik), for j, k = 1, ..., p. Here, Y_i = (Y_i1, ..., Y_ip)^T is the i-th unobserved clean sample in (1).
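The robustness of the γ-divergence can be seen numerically by minimizing the empirical objective for the N(θ, 1) location model over a grid of θ. This is our own illustration (the contamination fraction, its location at 8, and γ = 0.5 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
gamma = 0.5

# 90% clean N(0, 1) observations plus 10% contamination centred at 8.
x = np.concatenate([rng.normal(0.0, 1.0, 180), rng.normal(8.0, 1.0, 20)])

# Empirical gamma-cross-entropy for the N(theta, 1) model; with the variance
# fixed, the term log int g_theta^(1+gamma) dx is free of theta and is dropped.
grid = np.linspace(-2.0, 10.0, 2401)
vals = -np.log(np.mean(np.exp(-gamma * (x[None, :] - grid[:, None]) ** 2 / 2.0),
                       axis=1)) / gamma
theta_hat = grid[np.argmin(vals)]

print("sample mean (MLE, non-robust):", x.mean())   # dragged toward the contamination
print("gamma-divergence minimizer:", theta_hat)     # stays near the clean centre
```

The contaminated points contribute weights of order exp(−γ(8 − θ)²/2) near the clean centre, which is numerically negligible; this is exactly the ν(θ; γ) ≈ 0 mechanism described above.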
We now derive estimates of the variances and the correlation in (4) based on the observations X_i = (X_i1, ..., X_ip)^T, which under cell-wise contamination may have some of their elements corrupted. Fixing a coordinate j ∈ {1, ..., p}, let g_{(μ_j, σ_jj)} be the density of N(μ_j, σ_jj), and let f_n^{(j)} be the empirical density of X_1j, ..., X_nj. The robust estimators of μ_j and σ_jj based on the γ-divergence are given by

(ˆμ_j, ˆσ_jj) = argmin_{μ_j, σ_jj} d_γ(f_n^{(j)}, g_{(μ_j, σ_jj)}),

d_γ(f_n^{(j)}, g_{(μ_j, σ_jj)}) = −(1/γ) log[ (1/n) Σ_{i=1}^n exp{ −γ(X_ij − μ_j)² / (2σ_jj) } ] + (1/(2(1+γ))) log σ_jj,

up to an additive constant not depending on (μ_j, σ_jj). Fujisawa & Eguchi (2008) gave an efficient iterative algorithm to compute (ˆμ_j, ˆσ_jj). Let μ_j^t and σ_jj^t denote the values at iteration t, starting from initializations μ_j^0 and σ_jj^0. The algorithm repeats the following steps until convergence:

μ_j^{t+1} ← Σ_{i=1}^n w_ij^t X_ij,  σ_jj^{t+1} ← (1 + γ) Σ_{i=1}^n w_ij^t (X_ij − μ_j^{t+1})²,

where the weights are updated as

w_ij^t = exp{ −γ(X_ij − μ_j^t)² / (2σ_jj^t) } / Σ_{i'=1}^n exp{ −γ(X_i'j − μ_j^t)² / (2σ_jj^t) }.

We take the median of X_1j, ..., X_nj as μ_j^0 and the median absolute deviation (MAD) as σ_jj^0. After ˆμ_j and ˆσ_jj are obtained for j = 1, ..., p, we estimate each correlation ρ_jk from the standardized observations Z_ij = (X_ij − ˆμ_j)/√ˆσ_jj. Let h_ρ be the bivariate standard normal density with correlation ρ, and let f_n^{(j,k)} be the empirical density of (Z_1j, Z_1k), ..., (Z_nj, Z_nk). Our correlation estimator is

ˆρ_jk = argmin_{|ρ| < 1} d_γ(f_n^{(j,k)}, h_ρ),  (5)

d_γ(f_n^{(j,k)}, h_ρ) = −(1/γ) log[ (1/n) Σ_{i=1}^n exp{ −γ(Z_ij² + Z_ik² − 2ρ Z_ij Z_ik) / (2(1 − ρ²)) } ] + (1/(2(1+γ))) log(1 − ρ²),

again up to an additive constant. The required univariate optimization problem can be solved with standard techniques; we provide a projected gradient descent algorithm below. Finally, we obtain the estimator of Σ as ˆΣ_jk = √(ˆσ_jj ˆσ_kk) ˆρ_jk. We now outline the projected gradient descent algorithm for computing ˆρ_jk from (5). To avoid numerical singularity, we replace the restriction |ρ| < 1 by |ρ| ≤ R with R < 1.
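The iterative reweighting updates for (ˆμ_j, ˆσ_jj) above can be sketched as follows. This is an illustrative implementation, not the authors' reference code; the scaling of the MAD by the usual normal-consistency factor 1.4826, and the test data, are our own choices.

```python
import numpy as np

def robust_mean_var(x, gamma=0.5, n_iter=100, tol=1e-8):
    """Gamma-divergence estimates of a univariate mean and variance via
    iterative reweighting, with median/MAD initialization."""
    mu = np.median(x)
    # Squared, consistency-scaled MAD as the starting variance (our choice).
    sigma = (1.4826 * np.median(np.abs(x - mu))) ** 2
    for _ in range(n_iter):
        w = np.exp(-gamma * (x - mu) ** 2 / (2.0 * sigma))
        w /= w.sum()                                  # normalized weights w_ij
        mu_new = np.sum(w * x)                        # weighted mean update
        sigma_new = (1.0 + gamma) * np.sum(w * (x - mu_new) ** 2)
        converged = abs(mu_new - mu) + abs(sigma_new - sigma) < tol
        mu, sigma = mu_new, sigma_new
        if converged:
            break
    return mu, sigma

rng = np.random.default_rng(2)
# 95% clean N(1, 4) observations plus 5% far-away contamination.
x = np.concatenate([rng.normal(1.0, 2.0, 190), rng.normal(25.0, 1.0, 10)])
mu_hat, sigma_hat = robust_mean_var(x)
print(mu_hat, sigma_hat)   # should be close to the clean values 1 and 4
```

The factor (1 + γ) in the variance update is essential: the exponential weights tilt the clean sample toward its centre, shrinking the weighted variance to σ/(1 + γ) at the normal model, and the factor exactly undoes this shrinkage.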
For simpler notation, let d_γ(ρ) = d_γ(f_n^{(j,k)}, h_ρ). The gradient of this function is

∇d_γ(ρ) = −(1/(1 − ρ²)²) Σ_{i=1}^n w_i { (1 + ρ²) Z_ij Z_ik − ρ(Z_ij² + Z_ik²) } − ρ / ((1+γ)(1 − ρ²)),

where

w_i = exp{ −γ(Z_ij² + Z_ik² − 2ρ Z_ij Z_ik) / (2(1 − ρ²)) } / Σ_{i'=1}^n exp{ −γ(Z_i'j² + Z_i'k² − 2ρ Z_i'j Z_i'k) / (2(1 − ρ²)) }.

The objective function d_γ(ρ) is locally approximated around ρ' by

φ_s(ρ; ρ') = d_γ(ρ') + ∇d_γ(ρ')(ρ − ρ') + (s/2)(ρ − ρ')²,

where s > 0 is a step-size parameter. We select s sufficiently large that d_γ(ρ) ≤ φ_s(ρ; ρ'). The projected gradient descent method minimizes φ_s(ρ; ρ') over |ρ| ≤ R instead of d_γ(ρ); the minimizer is sgn(˜ρ) min(|˜ρ|, R) with ˜ρ = ρ' − s^{-1} ∇d_γ(ρ'). Algorithm 1 summarizes the procedure.

Algorithm 1: Projected gradient descent for ˆρ
Input: standardized data (Z_1j, Z_1k), ..., (Z_nj, Z_nk); divergence parameter γ > 0.
Initialize t = 0, ρ^0 = 0, s^0 = 0.01, and fix R < 1.
repeat
    s^{t,0} ← s^t, t_in ← 0
    repeat
        ν^{t,t_in} ← ρ^t − ∇d_γ(ρ^t) / s^{t,t_in}
        ρ^{t,t_in+1} ← sgn(ν^{t,t_in}) min(|ν^{t,t_in}|, R)
        s^{t,t_in+1} ← 2 s^{t,t_in}
        t_in ← t_in + 1
    until d_γ(ρ^{t,t_in}) ≤ φ_s(ρ^{t,t_in}; ρ^t) for the current step size s
    ρ^{t+1} ← ρ^{t,t_in}, s^{t+1} ← s^{t,t_in}/2, t ← t + 1
until convergence

3.2 Projection of the covariance matrix estimate

We cannot directly plug an estimate of Σ into the methods of Section 2.1 if it is not positive semidefinite. Node-wise regression and the Glasso require a positive semidefinite estimate, and CLIME needs a positive definite one. However, if an estimate ˆΣ is not positive semidefinite, we may project it onto the set of positive (semi)definite matrices. We simply proceed by solving

min_S ‖S − ˆΣ‖_F subject to S ⪰ δI_p,

where δ ≥ 0, ‖·‖_F denotes the Frobenius norm, and S ⪰ δI_p means that S − δI_p is positive semidefinite. The solution can be computed by taking the eigenvalue decomposition of ˆΣ and replacing each eigenvalue λ_j by max(λ_j, δ), j = 1, ..., p.

References

Alqallaf, F. A., Konis, K. P., Martin, R. D. & Zamar, R. H. (2002), Scalable robust covariance and correlation estimates for data mining. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA.
Banerjee, O., El Ghaoui, L. & d'Aspremont, A. (2008), Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9.
Boudt, K., Cornelissen, J. & Croux, C. (2012), The Gaussian rank correlation estimator: robustness properties. Statistics and Computing, 22.
Cai, T., Liu, W. & Luo, X. (2011), A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106.
Drton, M. & Maathuis, M. H. (2017), Structure learning in graphical modeling. Annual Review of Statistics and Its Application, 4.
Finegold, M. & Drton, M. (2011), Robust graphical modeling of gene networks using classical and alternative t-distributions. The Annals of Applied Statistics, 2A.
Friedman, J. H., Hastie, T. & Tibshirani, R. (2008), Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9.
Fujisawa, H. & Eguchi, S. (2008), Robust parameter estimation with a small bias against heavy contamination. Journal of Multivariate Analysis, 99.
Gnanadesikan, R. & Kettenring, J. R. (1972), Robust estimates, residuals and outlier detection with multiresponse data. Biometrics, 28.
Hsieh, C.-J., Sustik, M. A., Dhillon, I. S. & Ravikumar, P. (2011), Sparse inverse covariance matrix estimation using quadratic approximation. In Advances in Neural Information Processing Systems.
Liu, H., Han, F., Yuan, M. & Lafferty, J. (2012), High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40.
Loh, P.-L. & Tan, X. L. (2015), High-dimensional robust precision matrix estimation: Cellwise corruption under ε-contamination. arXiv preprint.
Maronna, R. A., Martin, R. D. & Yohai, V. (2006), Robust Statistics: Theory and Methods. Wiley, New York.
Meinshausen, N. & Bühlmann, P. (2006), High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34.
Öllerer, V. & Croux, C. (2015), Robust high-dimensional precision matrix estimation. In Modern Nonparametric, Robust and Multivariate Methods. Springer International Publishing.
Rousseeuw, P. J. & Croux, C. (1993), Alternatives to the median absolute deviation. Journal of the American Statistical Association, 88.
Tarr, G., Müller, S. & Weber, N. C. (2016), Robust estimation of precision matrices under cellwise contamination. Computational Statistics and Data Analysis, 93.
Yuan, M. & Lin, Y. (2007), Model selection and estimation in the Gaussian graphical model. Biometrika, 94.
More informationLearning Binary Classifiers for Multi-Class Problem
Research Memorandum No. 1010 September 28, 2006 Learning Binary Classifiers for Multi-Class Problem Shiro Ikeda The Institute of Statistical Mathematics 4-6-7 Minami-Azabu, Minato-ku, Tokyo, 106-8569,
More informationSparse Permutation Invariant Covariance Estimation: Final Talk
Sparse Permutation Invariant Covariance Estimation: Final Talk David Prince Biostat 572 dprince3@uw.edu May 31, 2012 David Prince (UW) SPICE May 31, 2012 1 / 19 Electronic Journal of Statistics Vol. 2
More informationEstimating Sparse Precision Matrix with Bayesian Regularization
Estimating Sparse Precision Matrix with Bayesian Regularization Lingrui Gan, Naveen N. Narisetty, Feng Liang Department of Statistics University of Illinois at Urbana-Champaign Problem Statement Graphical
More informationVariable Selection for Highly Correlated Predictors
Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu arxiv:1709.04840v1 [stat.me] 14 Sep 2017 Abstract Penalty-based variable selection methods are powerful in selecting relevant covariates
More informationInverse Covariance Estimation with Missing Data using the Concave-Convex Procedure
Inverse Covariance Estimation with Missing Data using the Concave-Convex Procedure Jérôme Thai 1 Timothy Hunter 1 Anayo Akametalu 1 Claire Tomlin 1 Alex Bayen 1,2 1 Department of Electrical Engineering
More informationSparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28
Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:
More informationGenetic Networks. Korbinian Strimmer. Seminar: Statistical Analysis of RNA-Seq Data 19 June IMISE, Universität Leipzig
Genetic Networks Korbinian Strimmer IMISE, Universität Leipzig Seminar: Statistical Analysis of RNA-Seq Data 19 June 2012 Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 1 Paper G. I. Allen and Z. Liu.
More informationRobust methods and model selection. Garth Tarr September 2015
Robust methods and model selection Garth Tarr September 2015 Outline 1. The past: robust statistics 2. The present: model selection 3. The future: protein data, meat science, joint modelling, data visualisation
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationInequalities on partial correlations in Gaussian graphical models
Inequalities on partial correlations in Gaussian graphical models containing star shapes Edmund Jones and Vanessa Didelez, School of Mathematics, University of Bristol Abstract This short paper proves
More informationVariables. Cho-Jui Hsieh The University of Texas at Austin. ICML workshop on Covariance Selection Beijing, China June 26, 2014
for a Million Variables Cho-Jui Hsieh The University of Texas at Austin ICML workshop on Covariance Selection Beijing, China June 26, 2014 Joint work with M. Sustik, I. Dhillon, P. Ravikumar, R. Poldrack,
More informationProbabilistic Graphical Models
School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 5, 06 Reading: See class website Eric Xing @ CMU, 005-06
More informationStructure estimation for Gaussian graphical models
Faculty of Science Structure estimation for Gaussian graphical models Steffen Lauritzen, University of Copenhagen Department of Mathematical Sciences Minikurs TUM 2016 Lecture 3 Slide 1/48 Overview of
More informationOPTIMISATION CHALLENGES IN MODERN STATISTICS. Co-authors: Y. Chen, M. Cule, R. Gramacy, M. Yuan
OPTIMISATION CHALLENGES IN MODERN STATISTICS Co-authors: Y. Chen, M. Cule, R. Gramacy, M. Yuan How do optimisation problems arise in Statistics? Let X 1,...,X n be independent and identically distributed
More informationInference in high-dimensional graphical models arxiv: v1 [math.st] 25 Jan 2018
Inference in high-dimensional graphical models arxiv:1801.08512v1 [math.st] 25 Jan 2018 Jana Janková Seminar for Statistics ETH Zürich Abstract Sara van de Geer We provide a selected overview of methodology
More informationRobust Inverse Covariance Estimation under Noisy Measurements
Jun-Kun Wang Intel-NTU, National Taiwan University, Taiwan Shou-de Lin Intel-NTU, National Taiwan University, Taiwan WANGJIM123@GMAIL.COM SDLIN@CSIE.NTU.EDU.TW Abstract This paper proposes a robust method
More informationGroup Sparse Priors for Covariance Estimation
Group Sparse Priors for Covariance Estimation Benjamin M. Marlin, Mark Schmidt, and Kevin P. Murphy Department of Computer Science University of British Columbia Vancouver, Canada Abstract Recently it
More informationHigh Dimensional Semiparametric Gaussian Copula Graphical Models
High Dimeional Semiparametric Gaussian Copula Graphical Models Han Liu, Fang Han, Ming Yuan, John Lafferty and Larry Wasserman February 9, 2012 arxiv:1202.2169v1 [stat.ml] 10 Feb 2012 Contents Abstract:
More informationIndirect multivariate response linear regression
Biometrika (2016), xx, x, pp. 1 22 1 2 3 4 5 6 C 2007 Biometrika Trust Printed in Great Britain Indirect multivariate response linear regression BY AARON J. MOLSTAD AND ADAM J. ROTHMAN School of Statistics,
More informationLinear Regression (9/11/13)
STA561: Probabilistic machine learning Linear Regression (9/11/13) Lecturer: Barbara Engelhardt Scribes: Zachary Abzug, Mike Gloudemans, Zhuosheng Gu, Zhao Song 1 Why use linear regression? Figure 1: Scatter
More informationTractable Upper Bounds on the Restricted Isometry Constant
Tractable Upper Bounds on the Restricted Isometry Constant Alex d Aspremont, Francis Bach, Laurent El Ghaoui Princeton University, École Normale Supérieure, U.C. Berkeley. Support from NSF, DHS and Google.
More informationBiostatistics Advanced Methods in Biostatistics IV
Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu Lecture 12 1 / 36 Tip + Paper Tip: As a statistician the results
More informationRobust variable selection through MAVE
This is the author s final, peer-reviewed manuscript as accepted for publication. The publisher-formatted version may be available through the publisher s web site or your institution s library. Robust
More informationSelection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty
Journal of Data Science 9(2011), 549-564 Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Masaru Kanba and Kanta Naito Shimane University Abstract: This paper discusses the
More informationComputational and Statistical Aspects of Statistical Machine Learning. John Lafferty Department of Statistics Retreat Gleacher Center
Computational and Statistical Aspects of Statistical Machine Learning John Lafferty Department of Statistics Retreat Gleacher Center Outline Modern nonparametric inference for high dimensional data Nonparametric
More informationSparse and Locally Constant Gaussian Graphical Models
Sparse and Locally Constant Gaussian Graphical Models Jean Honorio, Luis Ortiz, Dimitris Samaras Department of Computer Science Stony Brook University Stony Brook, NY 794 {jhonorio,leortiz,samaras}@cs.sunysb.edu
More informationIntroduction to graphical models: Lecture III
Introduction to graphical models: Lecture III Martin Wainwright UC Berkeley Departments of Statistics, and EECS Martin Wainwright (UC Berkeley) Some introductory lectures January 2013 1 / 25 Introduction
More information25 : Graphical induced structured input/output models
10-708: Probabilistic Graphical Models 10-708, Spring 2016 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Raied Aljadaany, Shi Zong, Chenchen Zhu Disclaimer: A large
More informationMachine Learning for OR & FE
Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com
More informationLecture 2 Part 1 Optimization
Lecture 2 Part 1 Optimization (January 16, 2015) Mu Zhu University of Waterloo Need for Optimization E(y x), P(y x) want to go after them first, model some examples last week then, estimate didn t discuss
More informationA Robust Approach to Regularized Discriminant Analysis
A Robust Approach to Regularized Discriminant Analysis Moritz Gschwandtner Department of Statistics and Probability Theory Vienna University of Technology, Austria Österreichische Statistiktage, Graz,
More informationMachine Learning Linear Models
Machine Learning Linear Models Outline II - Linear Models 1. Linear Regression (a) Linear regression: History (b) Linear regression with Least Squares (c) Matrix representation and Normal Equation Method
More informationBayesian Learning of Sparse Gaussian Graphical Models. Duke University Durham, NC, USA. University of South Carolina Columbia, SC, USA I.
Bayesian Learning of Sparse Gaussian Graphical Models Minhua Chen, Hao Wang, Xuejun Liao and Lawrence Carin Electrical and Computer Engineering Department Duke University Durham, NC, USA Statistics Department
More information