Supplementary Materials to Learning Sparse Causal Gaussian Networks With Experimental Intervention: Regularization and Coordinate Descent


Fei Fu and Qing Zhou

1 Proofs

1.1 Proof of Proposition 1

We want to minimize
\[
g = \frac{1}{2}\log\bigl[(\beta_{kj}-\xi_{kj})^2 + (c_{kj}-\xi_{kj}^2)\bigr] + \eta|\beta_{kj}|
\]
over $\beta_{kj}$. After differentiating $g$ with respect to $\beta_{kj}$ and setting the derivative to zero, we obtain, for $\beta_{kj} > 0$,
\[
\eta\beta_{kj}^2 - (2\eta\xi_{kj}-1)\beta_{kj} + (c_{kj}\eta - \xi_{kj}) = 0, \tag{1}
\]
and for $\beta_{kj} < 0$,
\[
\eta\beta_{kj}^2 - (2\eta\xi_{kj}+1)\beta_{kj} + (c_{kj}\eta + \xi_{kj}) = 0. \tag{2}
\]
Apparently, both (1) and (2) have the same discriminant $\Delta = 1 - 4(c_{kj}-\xi_{kj}^2)\eta^2$. The only possible minimizers of $g$ are $0$, positive real roots of (1), or negative real roots of (2). In the rest of the proof, we will only show that Proposition 1 holds when $\xi_{kj} \ge 0$. The proof for $\xi_{kj} < 0$ is analogous.

First, consider $\xi_{kj} = 0$. It is easily seen that $g$ is minimized at $\beta_{kj} = 0$, which is included in the third case of the proposition.

Now consider the case when $\xi_{kj} > 0$. In this case, let
\[
\beta_1 = \frac{2\eta\xi_{kj} - 1 + \sqrt{\Delta}}{2\eta}, \qquad \beta_2 = \frac{2\eta\xi_{kj} - 1 - \sqrt{\Delta}}{2\eta}
\]
be the two possible real roots of (1). If (1) has two real roots, $\beta_2$ is a local maximum. Also note that if $\xi_{kj} > 0$, (2) can only have positive real roots. Thus, $g$ can only be minimized at $0$ or at $\beta_1$ if it is real, and we only need to find out when $0$ or $\beta_1$ minimizes $g$. There are four cases:

Case 1. $\Delta > 0$ and $\beta_1 > 0 > \beta_2$: This is equivalent to $0 < \eta < \xi_{kj}/c_{kj}$. In this case, we have $g(\beta_1) < g(0)$ (Figure S1A).

Case 2. $\Delta > 0$ and $\beta_1 > \beta_2 \ge 0$: This is equivalent to $\xi_{kj}/c_{kj} \le \eta < 1/\bigl(2\sqrt{c_{kj}-\xi_{kj}^2}\bigr)$ and $\eta > (2\xi_{kj})^{-1}$. In this case, $\beta_1$ is a local minimum and $\beta_2$ is a local maximum (Figure S1B). Thus, we need to compare $g(\beta_1)$ with $g(0)$ to determine $\arg\min_{\beta_{kj}} g$.

Figure S1: Examples illustrating different scenarios for minimizing $g$ over $\beta_{kj}$ when $\xi_{kj} > 0$. Each panel plots $g$ against $\tilde\beta_{kj}$, with (A) $\xi_{kj}=0.8$, $c_{kj}=0.8$, $\gamma=0.5$; (B) $\xi_{kj}=0.9$, $c_{kj}=0.82$, $\gamma=2$; (C) $\xi_{kj}=0.4$, $c_{kj}=0.8$, $\gamma=0.5$; (D) $\xi_{kj}=0.9$, $c_{kj}=0.85$, $\gamma=2.5$.

Case 3. $\Delta > 0$ and $0 \ge \beta_1 > \beta_2$: This is equivalent to $\xi_{kj}/c_{kj} \le \eta < 1/\bigl(2\sqrt{c_{kj}-\xi_{kj}^2}\bigr)$ and $\eta \le (2\xi_{kj})^{-1}$. In this case, neither $\beta_1$ nor $\beta_2$ is positive, so $\arg\min_{\beta_{kj}} g = 0$ (Figure S1C).

Case 4. $\Delta \le 0$: This is equivalent to $\eta \ge 1/\bigl(2\sqrt{c_{kj}-\xi_{kj}^2}\bigr)$. If $\Delta < 0$, clearly $\arg\min_{\beta_{kj}} g = 0$. If $\Delta = 0$, $\beta_1 = \beta_2$ is an inflection point if they are positive (Figure S1D), so it is also true that $\arg\min_{\beta_{kj}} g = 0$.

Therefore, we have shown that Proposition 1 holds.
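To make the case analysis concrete, the following small Python sketch implements the case-by-case minimizer for $\xi_{kj} \ge 0$ and checks it against a grid search. It is an illustration only, not the authors' code; the function names and the test values are made up, and it assumes $c_{kj} > \xi_{kj}^2$ so that the objective is well defined.

import numpy as np

def g(beta, xi, c, eta):
    # Single-coordinate objective from the proof above:
    # g(beta) = 0.5*log((beta - xi)^2 + (c - xi^2)) + eta*|beta|
    return 0.5 * np.log((beta - xi) ** 2 + (c - xi ** 2)) + eta * np.abs(beta)

def argmin_g(xi, c, eta):
    # Case-by-case minimizer for xi >= 0 (the case xi < 0 is symmetric).
    if xi == 0:
        return 0.0                                   # minimized at zero
    disc = 1.0 - 4.0 * (c - xi ** 2) * eta ** 2      # discriminant of equation (1)
    if disc <= 0:                                    # Case 4
        return 0.0
    beta1 = (2 * eta * xi - 1 + np.sqrt(disc)) / (2 * eta)
    if beta1 <= 0:                                   # Case 3
        return 0.0
    if eta < xi / c:                                 # Case 1
        return beta1
    # Case 2: compare the positive stationary point with zero
    return beta1 if g(beta1, xi, c, eta) < g(0.0, xi, c, eta) else 0.0

# Quick numerical check against a dense grid (illustrative values).
xi, c, eta = 0.9, 0.82, 0.4
grid = np.linspace(-2.0, 2.0, 200001)
print(argmin_g(xi, c, eta), grid[np.argmin(g(grid, xi, c, eta))])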

1.2 Proof of Theorem 3

We omit the proof of the first part of Theorem 3, since it is similar to that of Theorem 2. To prove the second part, let $B = \{j : \phi_j \ne 0\}$. For notational ease, by permuting the indices we first rewrite the parameter as $\theta = (\theta_a, \theta_b)$, where $\theta_a = \phi_A$ and $\theta_b$ collects $\phi_B$ and the variance parameters. Let $r = |A|$ be the number of zero components of $\phi$. Now we only need to show that, with probability tending to 1, for any $\theta_b$ within $O_p(n^{-1/2})$ of its true value and any constant $C > 0$,
\[
(0, \theta_b) = \arg\max_{\|\theta_a\| \le C/\sqrt{n}} R\bigl((\theta_a, \theta_b)\bigr). \tag{3}
\]
To establish (3), we again study the behavior of $R(\theta)$ around the point $(0, \theta_b)$ by expanding $L(\theta)$ around $(0, \theta_b)$. Let $a_n = 1/\sqrt{n}$, $\theta^o = (0, \theta_b)$, and $u = (u_a, 0)$ be such that $\|u\| \le C$ and $\theta^o + a_n u \in \Omega$. Then we have the following result, similar to that in the proof of Theorem 2:
\[
\begin{aligned}
R(\theta^o + a_n u) - R(\theta^o)
&= \sum_{k=1}^{p}\Bigl[\alpha_k \nabla L_k(\theta_k^o)^T u_k\{1+o_p(1)\} - \frac{\alpha_k n_k}{2}\, u_k^T I(\theta_k^o)\, u_k\{1+o_p(1)\}\Bigr] - \lambda_n\sqrt{n}\sum_{j=1}^{r}\tau_j|u_j| \\
&= \sum_{k=1}^{p}\Bigl[\alpha_k \nabla L_k(\theta_k^o)^T u_k\{1+o_p(1)\} - \frac{\alpha_k n_k}{2}\, u_k^T I(\theta_k^o)\, u_k\{1+o_p(1)\}\Bigr] - \lambda_n n^{\gamma/2}\sqrt{n}\sum_{j=1}^{r}\bigl|\sqrt{n}\,\hat\phi_j\bigr|^{-\gamma}|u_j|.
\end{aligned}
\tag{4}
\]
Note that both the first and the second terms in the last line of (4) are $O_p(1)$ for any fixed constant $C$. Since $\hat\phi_j$ is $\sqrt{n}$-consistent, we have $\sqrt{n}\,\hat\phi_j = O_p(1)$ for $j = 1, \dots, r$. Then the third term in the last line of (4) is of order $\lambda_n n^{(\gamma+1)/2}$. Therefore, (3) holds, and the proof is complete.
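As a reading aid, the step between the two lines of (4) and the stated order of the third term follow from a short calculation; the sketch below assumes the adaptive weights have the form $\tau_j = |\hat\phi_j|^{-\gamma}$, which is what the last line of (4) suggests:
\[
\lambda_n\sqrt{n}\,\tau_j|u_j|
 = \lambda_n\sqrt{n}\,|\hat\phi_j|^{-\gamma}|u_j|
 = \lambda_n\sqrt{n}\,n^{\gamma/2}\bigl|\sqrt{n}\,\hat\phi_j\bigr|^{-\gamma}|u_j|
 = \lambda_n n^{(\gamma+1)/2}\bigl|\sqrt{n}\,\hat\phi_j\bigr|^{-\gamma}|u_j|,
\]
and since $\sqrt{n}\,\hat\phi_j = O_p(1)$ for $j = 1,\dots,r$, the factor $|\sqrt{n}\,\hat\phi_j|^{-\gamma}$ is bounded away from zero in probability, so the penalty term grows at the rate $\lambda_n n^{(\gamma+1)/2}$ and dominates the $O_p(1)$ terms.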

2 Supplementary Algorithm

The following algorithm is used in the second step of the CD algorithm to check the acyclicity constraint. Its time complexity is $O(|V| + |E|)$.

Algorithm S1: Check whether a DAG $G$ remains acyclic if an edge $i \to j$ is added.

function Cycle(G, i, j)
    for v ∈ V \ {i} do
        C_v ← 0
    end for
    C_i ← 1
    Q ← ∅
    Q ← ENQUEUE(Q, i)
    while Q ≠ ∅ do
        u ← DEQUEUE(Q)
        for v ∈ Π_u^G do
            if v = j then
                return true
            else if C_v = 0 then
                C_v ← 1
                Q ← ENQUEUE(Q, v)
            end if
        end for
    end while
    return false
end function
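For concreteness, here is a small, self-contained Python sketch of the same breadth-first check. It is an illustration only, not the authors' implementation; the adjacency-list representation, the function name would_create_cycle, and the example graph are made up.

from collections import deque

def would_create_cycle(parents, i, j):
    """Return True if adding the edge i -> j to the DAG would create a cycle.

    parents[v] lists the parents of node v (nodes u with an existing edge u -> v).
    Adding i -> j creates a cycle iff j is already an ancestor of i, so we run a
    breadth-first search from i that follows parent links, as in Algorithm S1.
    """
    visited = {i}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        for v in parents.get(u, []):
            if v == j:
                return True
            if v not in visited:
                visited.add(v)
                queue.append(v)
    return False

# Tiny illustrative example with edges 1 -> 2 and 2 -> 3:
parents = {2: [1], 3: [2]}
print(would_create_cycle(parents, 3, 1))  # True: 1 is already an ancestor of 3
print(would_create_cycle(parents, 1, 3))  # False: adding 1 -> 3 keeps the graph acyclic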

3 Supplementary Figures

3.1 Demonstration of convergence

Figure S2 demonstrates the convergence of the CD algorithm on a simulated data set with p = 200. The figure plots the maximum absolute difference (MAD) in the coefficient matrix between two adjacent iterations. The two bumps at iterations 16 and 25 reflect changes in the active set of blocks after a complete cycle. The decrease in the MAD before, between, and after the bumps indicates convergence given a fixed active set of blocks, while the decreasing MAD at the bumps (iterations 1, 16, and 25) demonstrates convergence in the structure (the active set). At the final iteration, the CD algorithm again cycles through all blocks (which allows the active set to be updated), but the active set stays the same and the MAD is already below the threshold, so the algorithm stops. That only a few changes in the active set occur is due to the use of a warm start, which often gives a good initial estimate.

Figure S2: A typical plot for the convergence of the CD algorithm (p = 200, $\beta_{ij} = 0.5$; maximum absolute difference versus iteration t).
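As an illustration of the stopping rule described above, the following minimal Python snippet computes the MAD between two successive coefficient matrices; the matrices, the helper name, and the threshold are hypothetical, not taken from the original supplement.

import numpy as np

def max_abs_diff(B_old, B_new):
    # Maximum absolute difference (MAD) in the coefficient matrix between two adjacent iterations.
    return np.max(np.abs(B_new - B_old))

# Illustrative check with two small coefficient matrices (values are arbitrary):
B_prev = np.array([[0.0, 0.50], [0.0, 0.0]])
B_curr = np.array([[0.0, 0.48], [0.0, 0.0]])
print(max_abs_diff(B_prev, B_curr) < 1e-4)  # False: the change is still above this illustrative threshold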

3.2 Choice of α

Figure S3 shows the sensitivity of the simulation results with p = 100 and $\beta_{ij} = 0.5$ (Table 1) to the choice of α [Equation (11) in the main paper]. The solid line plots TPR versus α, the dot-dashed line plots R/P versus α, and the long-dashed line plots FP/P versus α. The sum of the latter two curves gives FDR versus α. See Table 1 for the notation.

Figure S3: Simulation results for different values of α (p = 100, $\beta_{ij} = 0.5$).
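The FDR curve in Figure S3 is simply the sum of the two plotted ratios. A tiny sketch of that relationship (the counts are made up; P, R, and FP follow the notation of Table 1 in the main text):

def fdr(P, R, FP):
    # As described for Figure S3: FDR = R/P + FP/P = (R + FP)/P, where P is the number of
    # predicted edges, R the number of reversed edges, and FP the number of false positives.
    return (R + FP) / P

print(fdr(P=200, R=15, FP=10))  # 0.125 with these hypothetical counts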

4 Supplementary Tables

All notation in the supplementary tables is defined in Table 1 in the main text.

Table S1 shows the large-sample performance of the CD algorithm and the KO method. The data were simulated in the same way as described in Section 5.1 in the main text. Here, the sample size is n = 6000 and the number of edges is 2p for each data set.

Table S1: Large sample results for the CD algorithm and the KO method. Columns: p and $\beta_{ij}$; P, E, R, M, FP, TPR, and FDR for the CD algorithm; TPR and FDR for the KO method.

Table S2 shows the performance of the CD algorithm for DAGs with different degrees of sparsity. The data were simulated with intervention as described in Section 5.1 in the main text. We fixed the sample size of each data set to n = 5p, where p is the number of nodes, and varied the number of edges from p to 4p. The coefficients were $\beta_{ij} = 0.5$ for all the results in this table.

Table S2: Performance of the CD algorithm for DAGs with different degrees of sparsity. Columns: p, number of edges, TPR, and FDR.

The following two tables summarize the results of the CD algorithm on observational data, as well as a comparison with the PC-based method on observational data.

Table S3: Performance of the CD algorithm on observational data. Columns: p and $\beta_{ij}$; P, E, R, M, FP, TPR, and FDR.

Table S4: Performance comparison between the PC-based method and the CD algorithm on observational data. Columns: p and $\beta_{ij}$; P, TPR, and FDR for the PC-based method; P, TPR, and FDR for the CD algorithm.
