High-dimensional causal inference, graphical modeling and structural equation models

1 High-dimensional causal inference, graphical modeling and structural equation models Peter Bühlmann Seminar für Statistik, ETH Zürich

2 cannot do confirmatory causal inference without randomized intervention experiments... but we can do better than proceeding naively

3 Goal in genomics: if we made an intervention at a single gene, what would be its effect on a phenotype of interest? We want to infer/predict such effects without actually doing the intervention, i.e. from observational data (from observations of a steady-state system). It doesn't need to be genes, and we can generalize to interventions at more than one variable/gene.

5 Policy making James Heckman: Nobel Prize Economics 2000 e.g.: Pritzker Consortium on Early Childhood Development identifies when and how child intervention programs can be most influential

6 Genomics 1. Flowering of Arabidopsis thaliana phenotype/response variable of interest: Y = days to bolting (flowering) covariates X = gene expressions of p genes remark: gene expression = the process by which information from a gene is used in the synthesis of a functional gene product (e.g. a protein) question: infer/predict the effect of knocking-out/knocking-down (or enhancing) a single gene (expression) on the phenotype/response variable Y

7 2. Gene expressions of yeast p = 5360 genes phenotype of interest: Y = expression of first gene covariates X = gene expressions from all other genes and then phenotype of interest: Y = expression of second gene covariates X = gene expressions from all other genes and so on infer/predict the effects of a single gene knock-down on all other genes

8 consider the framework of an intervention effect = causal effect (mathematically defined; see later)

9 Regression, the statistical workhorse : the wrong approach
we could use a linear model (fitted from n observational data points)
Y = Σ_{j=1}^p β_j X^(j) + ε,  Var(X^(j)) ≡ 1 for all j
β_j measures the effect of variable X^(j) in terms of association, i.e. the change of Y as a function of X^(j) when keeping all other variables X^(k) fixed
not very realistic for the intervention problem: if we change e.g. one gene, some others will also change, and these others are not (cannot be) kept fixed

11 and indeed: [figure: true positives vs. false positives (up to 4,000) for IDA, Lasso, Elastic net and Random guessing] can do much better than (penalized) regression!

13 Effects of single gene knock-downs on all other genes (yeast) (Maathuis, Colombo, Kalisch & PB, 2010)
p = 5360 genes (expression of genes)
231 gene knock-downs → intervention effects; the truth is known in good approximation (thanks to intervention experiments)
goal: prediction of the true large intervention effects based on observational data with no knock-downs, n = 63 observational data points
[figure: true positives vs. false positives (up to 4,000) for IDA, Lasso, Elastic net and Random guessing]

14 A bit more specifically
univariate response Y, p-dimensional covariate X
question: what is the effect of setting the jth component of X to a certain value x: do(X^(j) = x)
this is a question of intervention type, not the effect of X^(j) on Y when keeping all other variables fixed (regression effect)
Reichenbach, 1956; Suppes, 1970; Rubin, Dawid, Holland, Pearl, Glymour, Scheines, Spirtes,...

15 Intervention calculus (a review)
dynamic notion of an effect: if we set a variable X^(j) to a value x (intervention), some other variables X^(k) (k ≠ j) and maybe Y will change
we want to quantify the total effect of X^(j) on Y, including all changed X^(k) on Y
a graph or influence diagram will be very useful
[figure: example DAG on X^(1), X^(2), X^(3), X^(4) and Y]

16 for simplicity: just consider DAGs (Directed Acyclic Graphs) [with hidden variables (Spirtes, Glymour & Scheines (1993); Colombo et al. (2012)): much more complicated and not validated with real data]
random variables are represented as nodes in the DAG
assume a Markov condition, saying that X^(j), given X^(pa(j)), is conditionally independent of its non-descendant variables
→ recursive factorization of the joint distribution:
P(Y, X^(1),..., X^(p)) = P(Y | X^(pa(Y))) ∏_{j=1}^p P(X^(j) | X^(pa(j)))
for intervention calculus: use the truncated factorization (e.g. Pearl)

18 assume Markov property for the causal DAG
[figure: DAG without intervention and the truncated DAG under the intervention do(X^(2) = x)]
non-intervention:
P(Y, X^(1), X^(2), X^(3), X^(4)) = P(Y | X^(1), X^(3)) P(X^(1) | X^(2)) P(X^(2) | X^(3), X^(4)) P(X^(3)) P(X^(4))
intervention do(X^(2) = x):
P(Y, X^(1), X^(3), X^(4) | do(X^(2) = x)) = P(Y | X^(1), X^(3)) P(X^(1) | X^(2) = x) P(X^(3)) P(X^(4))

19 truncated factorization for do(X^(2) = x):
P(Y, X^(1), X^(3), X^(4) | do(X^(2) = x)) = P(Y | X^(1), X^(3)) P(X^(1) | X^(2) = x) P(X^(3)) P(X^(4))
P(Y | do(X^(2) = x)) = ∫ P(Y, X^(1), X^(3), X^(4) | do(X^(2) = x)) dx^(1) dx^(3) dx^(4)
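
To make the truncated factorization concrete, here is a minimal R sketch (not from the slides) for the five-variable example above, with arbitrarily chosen linear Gaussian structural equations (the coefficients 0.8, 0.7, 1.2, 0.5, -0.9 are illustrative assumptions): it simulates the observational distribution and the intervention distribution under do(X^(2) = x) by replacing the structural equation for X^(2) with the constant x, and compares E[Y | do(X^(2) = x)] with the naive conditional mean E[Y | X^(2) ≈ x].

# minimal sketch; coefficients are illustrative, not from the slides
set.seed(1)
simulate <- function(n, do.x2 = NULL) {
  x4 <- rnorm(n)
  x3 <- rnorm(n)
  x2 <- if (is.null(do.x2)) 0.8 * x3 + 0.7 * x4 + rnorm(n) else rep(do.x2, n)
  x1 <- 1.2 * x2 + rnorm(n)
  y  <- 0.5 * x1 - 0.9 * x3 + rnorm(n)
  data.frame(y, x1, x2, x3, x4)
}
obs <- simulate(10^5)                      # observational data
int <- simulate(10^5, do.x2 = 1)           # data generated under do(X2 = 1)
mean(int$y)                                # E[Y | do(X2 = 1)]: approx 0.5 * 1.2 = 0.6
mean(obs$y[abs(obs$x2 - 1) < 0.1])         # naive E[Y | X2 ~ 1]: differs, since X3 confounds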

20 the truncated factorization is a mathematical consequence of the Markov condition (with respect to the causal DAG) for the observational probability distribution P (plus assumption that structural eqns. are autonomous )

21 the intervention distribution P(Y | do(X^(2) = x)) can be calculated from
the observational data distribution → need to estimate conditional distributions
an influence diagram (causal DAG) → need to estimate the structure of a graph/influence diagram
intervention effect: E[Y | do(X^(2) = x)] = ∫ y P(y | do(X^(2) = x)) dy
intervention effect at x_0: ∂/∂x E[Y | do(X^(2) = x)] |_{x = x_0}
in the Gaussian case: (Y, X^(1),..., X^(p)) ∼ N_{p+1}(μ, Σ), ∂/∂x E[Y | do(X^(2) = x)] ≡ θ_2 for all x

22 The backdoor criterion (Pearl, 1993)
we only need to know the local parental set: for Z = X^(pa(X)), if Y ∉ pa(X):
P(Y | do(X = x)) = ∫ P(Y | X = x, Z) dP(Z)
the parental set might not be the minimal set but always suffices
this is a consequence of the global Markov property: X_A ⊥ X_B | X_S whenever A and B are d-separated by S

23 Gaussian case
for the Gaussian case: ∂/∂x E[Y | do(X^(j) = x)] ≡ θ_j for all x
for Y ∉ pa(j): θ_j is the regression parameter of X^(j) in
Y = θ_j X^(j) + Σ_{k ∈ pa(j)} θ_k X^(k) + error
→ only need the parental set and regression
[figure: example DAG with j = 2, pa(j) = {3, 4}]
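
A small R sketch of the parent-adjustment idea, continuing the illustrative linear Gaussian example with assumed coefficients used above: regressing Y on X^(2) together with pa(2) = {X^(3), X^(4)} recovers the total causal effect of X^(2) (here 0.5 * 1.2 = 0.6), whereas the coefficient of X^(2) in a regression of Y on all covariates does not.

set.seed(2)
n <- 10^5
x4 <- rnorm(n); x3 <- rnorm(n)
x2 <- 0.8 * x3 + 0.7 * x4 + rnorm(n)        # pa(2) = {3, 4}
x1 <- 1.2 * x2 + rnorm(n)
y  <- 0.5 * x1 - 0.9 * x3 + rnorm(n)        # total effect of X2 on Y: 0.5 * 1.2 = 0.6
coef(lm(y ~ x2 + x3 + x4))["x2"]            # parent adjustment: approx 0.6
coef(lm(y ~ x1 + x2 + x3 + x4))["x2"]       # regression on all variables: approx 0 (association, not causation)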

24 when having no unmeasured confounder (variable): intervention effect (as defined) = causal effect recap: causal effect = effect from a randomized trial (but we want to infer it without a randomized study... because often we cannot do it, or it is too expensive) structural equation models provide a different (but closely related) route for quantifying intervention effects

27 Inferring intervention effects from the observational distribution
main problem: inferring the DAG from observational data is impossible! can only infer an equivalence class of DAGs (several DAGs can encode exactly the same conditional independence relationships)
Example: X → Y (X causes Y) and X ← Y (Y causes X) are Markov equivalent
a lot of work about identifiability: Verma & Pearl (1991); Spirtes, Glymour & Scheines (1993); Tian & Pearl; Lauritzen & Richardson (2002); Shpitser & Pearl; VanderWeele & Robins; Drton, Foygel & Sullivant (2011);...

28 Markov equivalence class of DAGs
P a family of probability distributions on R^{p+1}
Definition: M(D) = {P ∈ P; P is Markov w.r.t. DAG D}, and D ∼ D' ⇔ M(D) = M(D')
the definition depends on the family P; typical models:
P = Gaussian distributions
P = nonparametric model/set of all distributions
P = set of distributions from an additive structural equation model (see later)

29 A graphical characterization (Frydenberg, 1990; Verma & Pearl, 1991)
for P = {Gaussian distributions} or P = {nonparametric distributions} it holds:
D ∼ D' ⇔ D and D' have the same skeleton and the same v-structures

30 for a DAG D, we write its corresponding Markov equivalence class E(D) = {D'; D' ∼ D}

31 Equivalence class of DAGs
Several DAGs can encode exactly the same conditional independence relationships. Such DAGs form an equivalence class.
Example: unshielded triple X1 - X2 - X3: the orientations X1 → X2 → X3, X1 ← X2 ← X3 and X1 ← X2 → X3 (no v-structure) encode the same conditional independencies, whereas X1 → X2 ← X3 (a v-structure) encodes different ones.
All DAGs in an equivalence class have the same skeleton and the same v-structures.
An equivalence class can be uniquely represented by a completed partially directed acyclic graph (CPDAG).
[figure: a CPDAG and the DAGs it represents]

32 we cannot estimate causal/intervention effects from the observational distribution, but we will be able to estimate lower bounds of causal effects
conceptual procedure: probability distribution P from a DAG, generating the data → true underlying equivalence class of DAGs (CPDAG)
find all DAG-members of the true equivalence class (CPDAG): D_1,..., D_m
for every DAG-member D_r and every variable X^(j): single intervention effect θ_{r,j}
summarize them by the multi-set Θ = {θ_{r,j}; r = 1,..., m; j = 1,..., p} (an identifiable parameter)

34 IDA (oracle version): oracle → (PC-algorithm) CPDAG → DAG_1,..., DAG_m → (do-calculus) effect_1,..., effect_m → multi-set Θ
[figure: schematic of the oracle IDA pipeline]

35 If you want a single number for every variable... instead of the multi-set Θ = {θ_{r,j}; r = 1,..., m; j = 1,..., p}:
minimal absolute value α_j = min_r |θ_{r,j}| (j = 1,..., p), with |θ_{true,j}| ≥ α_j
the minimal absolute effect α_j is a lower bound for the true absolute intervention effect

36 Computationally tractable algorithm
searching all DAGs is computationally infeasible if p is large (we can actually do this only up to p ≈ 15-20)
instead of finding all m DAGs within an equivalence class: compute all intervention effects without finding all DAGs (Maathuis, Kalisch & PB, 2009)
key idea: exploring local aspects of the graph is sufficient

37 [figure: data → (PC-algorithm) CPDAG → (do-calculus, local) effect_1,..., effect_q → multi-set Θ_L]
the local Θ_L = Θ up to multiplicities (Maathuis, Kalisch & PB, 2009)

38 Estimation from finite samples
notation: drop the Y-notation (Y = X^(1), X^(2),..., X^(p))
difficult part: estimation of the CPDAG (equivalence class of DAGs), e.g. with pcAlgo(dm = d, alpha = 0.05) in R
structural learning: P → CPDAG (equivalence class of DAGs)
two main approaches:
multiple testing of conditional dependencies: PC-algorithm as prime example
score-based methods: MLE as prime example

40 Faithfulness assumption (necessary for conditional dependence testing approaches) a distribution P is called faithful to a DAG G if all conditional independencies can be inferred from the graph (can infer some conditional independencies from a Markov assumption; but we require here all conditional independencies) assuming faithfulness: can infer the CPDAG from a list of conditional (in-)dependence relations

42 What does it mean?
X^(1) ← ε^(1), X^(2) ← α X^(1) + ε^(2), X^(3) ← β X^(1) + γ X^(2) + ε^(3), with ε^(1), ε^(2), ε^(3) i.i.d. N(0, 1)
enforce marginal independence of X^(1) and X^(3): β + αγ = 0, e.g. α = β = 1, γ = -1
then Σ = (1, 1, 0; 1, 2, -1; 0, -1, 2) has a zero (1,3)-entry although the DAG contains the edge X^(1) → X^(3), while Σ^{-1} = (3, -2, -1; -2, 2, 1; -1, 1, 1) does not
failure of faithfulness due to cancellation of coefficients.
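
A quick numerical check of this cancellation (a sketch simulating the three equations above): the marginal correlation of X^(1) and X^(3) is essentially zero even though X^(1) is a parent of X^(3), while the partial correlation given X^(2) is not.

set.seed(3)
n  <- 10^5
x1 <- rnorm(n)
x2 <- x1 + rnorm(n)                 # alpha = 1
x3 <- x1 - x2 + rnorm(n)            # beta = 1, gamma = -1, so beta + alpha*gamma = 0
cor(x1, x3)                         # approx 0: unfaithful marginal independence
cor(resid(lm(x1 ~ x2)), resid(lm(x3 ~ x2)))   # clearly non-zero: dependence given X2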

43 failure of exact faithfulness is rare (Lebesgue measure zero), but for statistical estimation (in the Gaussian case) one often requires strong faithfulness (Robins, Scheines, Spirtes & Wasserman, 2003):
min{ |ρ(i, j | S)|; ρ(i, j | S) ≠ 0, i ≠ j, |S| ≤ d } ≥ τ, τ ≍ sqrt(d log(p)/n)
(d is the maximal degree of the skeleton of the DAG)

44 ... strong faithfulness can be rather severe (Uhler, Raskutti, PB & Yu, 2013)
[figure: 3 nodes, full graph; the surface of unfaithful distributions due to exact cancellation, with a strip of strongly-unfaithful distributions around it → large volume!]
strong faithfulness is restrictive in high dimensions

45 The PC-algorithm (Spirtes & Glymour, 1991)
crucial assumption: the distribution P is faithful to the true underlying DAG
less crucial but convenient: Gaussian assumption for Y, X^(1),..., X^(p) → can work with partial correlations
input: the MLE Σ̂, but we only need to consider many small sub-matrices of it (assuming sparsity of the graph)
output: based on a clever data-dependent (random) sequence of multiple tests → estimated CPDAG

46 PC-algorithm: a rough outline for estimating the skeleton of the underlying DAG
1. start with the full graph
2. remove edge i - j if |Ĉor(X^(i), X^(j))| is small (Fisher's Z-transform and null-distribution of zero correlation)
3. partial correlations of order 1: remove edge i - j if Parcor(X^(i), X^(j) | X^(k)) is small for some k in the current neighborhood of i or j (thanks to faithfulness)
[figure: full graph → correlation screening → partial correlations of order 1 → ... → stopped]

47 4. move up to partial correlations of order 2: remove edge i - j if the partial correlation Parcor(X^(i), X^(j) | X^(k), X^(l)) is small for some k, l in the current neighborhood of i or j (thanks to faithfulness)
5. continue until removal of edges is not possible anymore, i.e. stop at the minimal order of partial correlation where edge-removal becomes impossible
[figure: full graph → correlation screening → partial correlations of order 1 → ... → stopped]
an additional step of the algorithm is needed for estimating directions; this yields an estimate of the CPDAG (equivalence class of DAGs)
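
As an illustration of the conditional-independence tests used in these steps, here is a small self-contained R sketch (not the pcalg implementation): it computes a sample partial correlation ρ̂(i, j | S) by inverting the correlation matrix of the involved variables and tests it for zero with Fisher's Z-transform, which is the building block for the edge-removal decisions above; the toy chain example is an assumption for demonstration.

# partial correlation of variables i and j given the set S, from a data matrix x
parcor <- function(x, i, j, S = integer(0)) {
  P <- solve(cor(x[, c(i, j, S), drop = FALSE]))
  -P[1, 2] / sqrt(P[1, 1] * P[2, 2])
}
# Fisher's Z test for parcor(i, j | S) = 0 at level alpha; TRUE means: remove edge i - j
ci.test <- function(x, i, j, S = integer(0), alpha = 0.01) {
  r <- parcor(x, i, j, S)
  z <- 0.5 * log((1 + r) / (1 - r))                              # Fisher's Z-transform
  sqrt(nrow(x) - length(S) - 3) * abs(z) <= qnorm(1 - alpha / 2)
}
# toy example: chain X1 -> X2 -> X3
set.seed(4)
n <- 500
x1 <- rnorm(n); x2 <- x1 + rnorm(n); x3 <- x2 + rnorm(n)
x <- cbind(x1, x2, x3)
ci.test(x, 1, 3)            # FALSE: X1, X3 marginally dependent, keep edge at order 0
ci.test(x, 1, 3, S = 2)     # TRUE: X1 indep. X3 given X2, remove edge 1-3 at order 1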

48 one tuning parameter (cut-off parameter) α for truncation of the estimated Z-transformed partial correlations
if the graph is sparse (few neighbors): few iterations only, and only low-order partial correlations play a role
and thus: the estimation algorithm works for p ≫ n problems

49 Computational complexity
crudely bounded to be polynomial in p; the sparser the underlying structure, the faster the algorithm
we can easily do the computations for sparse cases with large p within a few hours of CPU time
[figure: log10(processor time in seconds) versus log10(p) for two expected neighborhood sizes E[N]]

50 IDA (Intervention calculus when the DAG is Absent) (Maathuis, Colombo, Kalisch & PB, 2010)
1. PC-algorithm → CPDAG
2. local algorithm → Θ̂_local
3. lower bounds for absolute causal effects α̂_j
R-package: pcalg
this is what we used in the yeast example to score the importance of the genes according to the size of α̂_j
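
A short usage sketch with the pcalg R package along the lines of these three steps; the data matrix dm (rows = observations, columns = variables), the target/cause positions and the tuning parameters are placeholders, and the exact call details should be checked against the pcalg documentation.

library(pcalg)
# step 1: estimate the CPDAG with the PC-algorithm (Gaussian conditional independence tests)
suffStat <- list(C = cor(dm), n = nrow(dm))
pc.fit <- pc(suffStat, indepTest = gaussCItest, alpha = 0.01, labels = colnames(dm))
# step 2: multi-set of possible causal effects of variable 2 on variable 1 (local method)
effects <- ida(x.pos = 2, y.pos = 1, mcov = cov(dm), graphEst = pc.fit@graph, method = "local")
# step 3: lower bound for the absolute causal effect of variable 2 on variable 1
alpha.2 <- min(abs(effects))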

51 Statistical theory (Kalisch & PB, 2007; Maathuis, Kalisch & PB, 2009)
n i.i.d. observational data points; p variables; high-dimensional setting where p ≫ n
assumptions:
X^(1),..., X^(p) ∼ N_p(0, Σ), Markov and faithful to the true DAG
high-dimensionality: log(p) ≪ n
sparsity: maximal degree d = max_j |ne(j)| ≪ n
coherence: maximal (partial) correlations bounded, max{ |ρ_{i,j|S}|; i ≠ j, |S| ≤ d } ≤ C < 1
signal strength/strong faithfulness: min{ |ρ_{i,j|S}|; ρ_{i,j|S} ≠ 0, i ≠ j, |S| ≤ d } ≫ sqrt(d log(p)/n)
Then, for a suitable tuning parameter and 0 < δ < 1:
P[estimated CPDAG = true CPDAG] = 1 - O(exp(-C n^{1-δ}))
P[Θ̂_L = Θ as sets] = 1 - O(exp(-C n^{1-δ}))
(i.e. consistency of the lower bounds for causal effects)

52 Main strategy of a proof
we have to analyze an algorithm...
understand the population version: in particular, one can show that the algorithm stops after d or d - 1 steps
analysis of the noisy version: additional errors; use a union bound argument to control the overall error (seems very rough, but leads to an asymptotically optimal regime, e.g. for the signal strength)

53 The role of sparsity in causal inference
as usual: sparsity is necessary for accurate estimation in the presence of noise
but here: sparsity (so-called protectedness) is crucial for identifiability as well
X → Y (X causes Y) vs. X ← Y (Y causes X): cannot tell from observational data the direction of the arrow; the same situation arises with a full graph with more than 2 nodes
causal identification really needs sparsity; the better the sparsity, the tighter the bounds for causal effects

54 Maximum likelihood estimation
[portrait: R.A. Fisher]
a Gaussian DAG is a Gaussian linear structural equation model:
X^(1) ← ε^(1)
X^(2) ← β_21 X^(1) + ε^(2)
X^(3) ← β_31 X^(1) + β_32 X^(2) + ε^(3)
in general: X^(j) ← Σ_{k=1}^p β_jk X^(k) + ε^(j) (j = 1,..., p), with β_jk ≠ 0 ⇔ edge k → j
in matrix notation: X = BX + ε, ε ∼ N_p(0, diag(σ_1^2,..., σ_p^2))  (a reparameterization)
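
A minimal R sketch of this reparameterization (an illustrative 3-variable example with assumed coefficients): for a B whose non-zero entries respect an acyclic order, X = BX + ε can be solved as X = (I - B)^{-1} ε, which is a convenient way to simulate from a Gaussian linear structural equation model.

set.seed(5)
p <- 3
B <- matrix(0, p, p)
B[2, 1] <- 0.9          # edge 1 -> 2
B[3, 1] <- -0.5         # edge 1 -> 3
B[3, 2] <- 0.7          # edge 2 -> 3
sigma <- c(1, 1, 1)     # error standard deviations
n <- 1000
eps <- matrix(rnorm(n * p), n, p) %*% diag(sigma)
X <- eps %*% t(solve(diag(p) - B))    # each row solves x = B x + eps
round(cov(X), 2)                      # implied covariance (I - B)^{-1} diag(sigma^2) (I - B)^{-T}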

55 (Σ̂, D̂) = argmin_{Σ; D a DAG} -ℓ(Σ, D; data) + λ |D|
= argmin_{B; {σ_j^2; j}} -ℓ(B, {σ_j^2; j}; data) + λ ||B||_0, with ||B||_0 = Σ_{i,j} 1(B_ij ≠ 0),
under the non-convex constraint that B corresponds to no directed cycles
severe non-convex problem due to the no-directed-cycle constraint (an ℓ0-penalty rather than e.g. ℓ1 doesn't make the problem much harder)

56 toy-example:
X^(1) ← β_1 X^(2) + ε_1
X^(2) ← β_2 X^(1) + ε_2
[figure: the feasible region in the (β_1, β_2)-plane consists of the two coordinate axes only (no directed cycle allowed)]
non-convex parameter space! (no straightforward way to do a convex relaxation)

57 computation:
dynamic programming algorithm (up to p ≈ 20) (Silander and Myllymäki, 2006)
greedy algorithms on equivalence classes (Chickering, 2002; Hauser & PB, 2012)
statistical properties of penalized MLE (for large p): van de Geer & PB (2013)
no faithfulness assumption required to obtain the minimal-edge I-MAP!

58 Successes in biology
Effects of single gene knock-downs on all other genes in yeast (Maathuis, Colombo, Kalisch & PB, 2010)
n = 63 observational data points
[figure: true positives vs. false positives (up to 4,000) for IDA, Lasso, Elastic net and Random guessing]

59 Arabidopsis thaliana (Stekhoven, Moraes, Sveinbjörnsson, Hennig, Maathuis & PB, 2012)
response Y: days to bolting (flowering) of the plant (aim: fast flowering plants)
covariates X: gene-expression profile
observational data with n = 47 observations and p genes
lower bound estimates α̂_j for the causal effect of every gene/variable on Y (using the PC-algorithm)
apply stability selection (Meinshausen & PB, 2010), assigning uncertainties via control of PCER = E[V]/p ≤ q^2/((2 π_thr - 1) p^2) (per-comparison error rate)
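
A sketch of the stability selection idea used here (illustrative: a simple placeholder ranking statistic stands in for the causal lower bounds α̂_j, and the defaults q, π_thr, B are my choices): variables are ranked on many random subsamples of half the data, and one keeps those that land in the top q with relative frequency at least π_thr; the per-comparison error rate is then bounded as on the slide.

# stability selection sketch: x is an n x p data matrix, y the response
stable.top <- function(x, y, q = 20, pi.thr = 0.75, B = 100) {
  p <- ncol(x); n <- nrow(x)
  freq <- numeric(p)
  for (b in 1:B) {
    sub <- sample(n, floor(n / 2))                     # random subsample of half the data
    score <- abs(cor(x[sub, ], y[sub]))                # placeholder ranking statistic
    top <- order(score, decreasing = TRUE)[1:q]        # top-q variables on this subsample
    freq[top] <- freq[top] + 1 / B
  }
  list(selected = which(freq >= pi.thr),               # stably selected variables
       pcer.bound = q^2 / ((2 * pi.thr - 1) * p^2))    # per-comparison error rate bound
}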

60 Causal gene ranking (columns of the original table: rank, summary causal effect, median expression, error (PCER), gene name; numeric entries and full locus identifiers not reproduced here):
1 AT2G (AGL20/SOC1), 2 AT4G (ATCSLG1), 3 AT1G (PDR12), 4 AT3G (replication protein-related), 5 AT5G (ATSUC6), 6 AT4G (FRI), 7 AT1G (ATCSLA10), 8 AT1G (AtGH9B5), 9 AT3G (protein coding), 10 AT1G (protein coding), 11 AT2G (protein coding), 12 AT2G (protein coding), 13 AT2G (AVAP5), 14 AT3G (CYP72A9), 15 AT1G (protein coding), 16 AT5G (CHR4), 17 AT3G (DWF4), 18 AT5G (FLC), 19 AT1G (peroxidase, putative), 20 AT1G (unknown protein)
biological validation by gene knockout experiments in progress; in red (on the slide): biologically known genes responsible for flowering

61 performed a validation experiment with the mutants corresponding to these top 20 - 3 = 17 genes
14 mutants were easily available → could only test 14 genes
more than usual, mutants showed low germination or survival...
9 among the 14 mutants survived (sufficiently strongly), i.e. 9 mutants for which we have an outcome
3 among the 9 mutants (genes) showed a significant effect for Y relative to the wildtype (non-mutated plant)
that is: besides the three known genes, we find three additional genes which exhibit a significant difference in terms of time to flowering

62 in short: bounds on causal effects (the α̂_j's) based on observational data lead to interesting predictions for interventions in genomics (i.e. which genes would exhibit a large intervention effect), and these predictions have been validated using experiments

63 Fully identifiable cases
recap: if P = {Gaussian distributions}, then the cardinality of the Markov equivalence class of a DAG D, |E(D)|, is often > 1; the same is true for P = {nonparametric distributions}
but under some additional constraints, the Markov equivalence class becomes smaller or even consists of one DAG only (i.e., identifiable)

64 structural equation model (SEM): X_j ← f_j(X_{pa(j)}, ε_j) (j = 1,..., p)
e.g. X_j ← Σ_{k ∈ pa(j)} f_{jk}(X_k) + ε_j (j = 1,..., p)

65 three types of identifiable SEMs, where the true DAG is identifiable from the observational distribution:
LiNGAM (Shimizu et al., 2006): X_j ← Σ_{k ∈ pa(j)} B_{jk} X_k + ε_j (j = 1,..., p), with the ε_j's non-Gaussian; as with independent component analysis (ICA), identifying Gaussian components is the hardest
linear Gaussian with equal error variances (Peters & PB, 2014): X_j ← Σ_{k ∈ pa(j)} B_{jk} X_k + ε_j (j = 1,..., p), Var(ε_j) ≡ σ^2 for all j

66 nonlinear SEM with additive noise (Schölkopf and co-workers: Hoyer et al., 2009; Peters et al., 2013): X_j ← f_j(X_{pa(j)}) + ε_j (j = 1,..., p)
toy example: p = 2 with DAG X → Y (X = X^(1), Y = X^(2))
truth: Y = X^3 + ε, ε indep. of X
reverse model: X = Y^{1/3} + η, η not indep. of Y
[figure: scatterplots of the data and of the errors ε and η for the forward and reverse models]
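
A small R sketch illustrating why this toy example is identifiable (the sample size, the distribution of X and the kernel-based HSIC dependence measure are my choices, not from the slides): in the true direction the noise ε = Y - X^3 is independent of X, whereas the reverse-model noise η = X - Y^{1/3} clearly depends on Y, which a simple biased HSIC statistic picks up.

set.seed(6)
n <- 300
x <- rnorm(n)
y <- x^3 + rnorm(n)                          # true model: Y = X^3 + eps
eps <- y - x^3                               # forward noise
eta <- x - sign(y) * abs(y)^(1/3)            # reverse-model noise: X = Y^(1/3) + eta
hsic <- function(a, b) {                     # biased HSIC estimate with Gaussian kernels
  rbf <- function(v) { d2 <- as.matrix(dist(v))^2; exp(-d2 / median(d2[d2 > 0])) }
  K <- rbf(a); L <- rbf(b)
  H <- diag(n) - matrix(1 / n, n, n)         # centering matrix
  sum(diag(K %*% H %*% L %*% H)) / n^2
}
hsic(x, eps)    # small: eps is (nearly) independent of X
hsic(y, eta)    # much larger: eta is not independent of Y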

67 strategy to infer structure and model parameters (for identifiable models): order search (order among the variables) & subsequent (sparse) regression

68 order search: MLE for the best permutation/order
candidate order/permutation π (of {1,..., p}) → regressions of X_{π(j)} versus X_{π(j-1)}, X_{π(j-2)},..., X_{π(1)} (when sparse: only some of the regressors are selected)
evaluation score for π: the likelihood of all the regressions X_{π(j)} vs. X_{π(j-1)},..., X_{π(1)}: ∏_{j=1}^p p(x_{π(j)} | x_{π(j-1)},..., x_{π(1)})
need a model for the regressions: functional form of the regressions (e.g. additive, linear) and distribution of the error terms (e.g. Gaussian, nonparametric)
→ can compute the likelihood with estimated parameters and search over the orderings/permutations

69 search over the orderings/permutations: there are p! permutations...
trick: preliminary neighborhood selection aiming for an undirected graph containing the skeleton of the DAG (i.e. the Markov blanket or a superset thereof)
[figure: superset of the skeleton]
and do order search and estimation restricted to the superset-skeleton → much reduced search space
can use unpenalized MLE (for the best order/permutation restricted to the superset-skeleton)
that is: regularization in the neighborhood selection suffices!

70 CAM: Causal Additive Model (PB, Peters & Ernest, 2013)
X_j ← Σ_{k ∈ pa(j)} f_{jk}(X_k) + ε_j (j = 1,..., p), ε_j ∼ N(0, σ_j^2)
the underlying DAG is identifiable from the joint distribution
statistically feasible for estimation due to the additive functions; good empirical performance
log-likelihood equals (up to a constant term): -Σ_{j=1}^p log(σ̂_j^π), with (σ̂_j^π)^2 = n^{-1} Σ_{i=1}^n (X_{i,π(j)} - Σ_{k<j} f̂_{j,π(k)}(X_{i,π(k)}))^2
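
To make this score concrete, here is a small R sketch (a toy with p = 2 and an assumed nonlinear additive model; smoothing splines stand in for the additive fits): for each candidate order, every variable is regressed on its predecessors and the score -Σ_j log(σ̂_j^π) is computed; the causal order X → Y obtains the higher score.

set.seed(8)
n <- 500
x <- rnorm(n)
y <- x^3 + rnorm(n)                      # true order: X before Y
log.var <- function(res) log(mean(res^2))
# score of order (X, Y): variance of X and residual variance of Y after a smooth fit on X
score.xy <- -(log.var(x) + log.var(y - predict(smooth.spline(x, y), x)$y))
# score of order (Y, X): variance of Y and residual variance of X after a smooth fit on Y
score.yx <- -(log.var(y) + log.var(x - predict(smooth.spline(y, x), y)$y))
c(score.xy, score.yx)                    # the true order (X, Y) scores higher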

71 fitting method
1. preliminary neighborhood selection (for a superset of the skeleton of the DAG or the Markov blanket): nodewise sparse additive regression of one variable X_j against all others {X_k; k ≠ j}; easy (using CV), works well and is established (Ravikumar et al. (2009), Meier, van de Geer & PB (2009),...)
2. estimate a best order by unpenalized MLE restricted to the superset-skeleton
3. based on the estimated order π̂: additive sparse model fitting restricted to the superset-skeleton → edge pruning
for step 2: greedy search restricted to the superset-skeleton
[figure: illustration of steps 1-3]

72 for step 2: greedy search restricted to the superset-skeleton
exhaustive computation is possible for trees (Bürge, MSc thesis 2014)
greedy methods are shown to find the optimum for restricted classes of DAGs (Peters, Balakrishnan and Wainwright, in progress)
unpenalized MLE for the best order search: very convenient; regularization is decoupled and only used in the sparse additive regressions

73 statistical consistency: high-dimensional p ≫ n setting (PB, Peters & Ernest, 2013)
main assumptions:
maximal degree of the DAG is O(1) ( sparse )
sufficiently large identifiability constant (w.r.t. Kullback-Leibler divergence) between true and wrong orderings:
ξ_p := p^{-1} min_{π ∉ Π_0} ( E_{θ^0}[-log(p^π_θ(X))] - E_{θ^0}[-log(p_{θ^0}(X))] ), requiring ξ_p ≫ sqrt(log(p)/n)
exponential moments for the X_j's and certain smooth functions thereof
then: P[π̂ ∈ Π_0] → 1 (n → ∞), where Π_0 is the set of true permutations
→ consistent recovery of the true functions f^0_{jk} and the DAG D^0; consistent estimation of E[Y | do(X = x)]

74 remark: preliminary neighborhood selection yields empirical and computational improvements
it seems to improve the theoretical results: e.g. in the linear Gaussian case, less stringent conditions than for ℓ0-regularized MLE (van de Geer & PB, 2013)

75 Empirical results to illustrate what can be achieved with CAM
p = 100, n = 200; the true model is CAM (additive SEM) with Gaussian errors
[figure: boxplots of SHD and SID to the true DAG for CAM, RESIT, LiNGAM, PC, CPC and GES (lower and upper SID bounds for the CPDAG-based methods)]
SHD: Structural Hamming Distance; SID: Structural Intervention Distance (Peters & PB, 2013)
RESIT (Mooij et al. 2009) cannot be used for p = 100
the CAM method is impressive where the true functions are non-monotone and nonlinear (sampled from a Gaussian process); for monotone functions: still good, but less impressive gains

77 Gene expressions from isoprenoid pathways in Arabidopsis thaliana (Wille et al., 2004)
p = 39, n = 118
[figure: chloroplast (MEP pathway), cytoplasm (MVA pathway) and mitochondrion gene networks; left: top 20 edges from CAM, right: edges retained by stability selection]
solid edges: estimated from data
stability selection: expected no. of false positives ≤ 2 (Meinshausen & PB, 2010)

78 When knowing the order of the variables: a new approach...
Theorem (Ernest & PB, 2014). For a general nonlinear SEM with functions f_j ∈ L^2(P) and a regularity condition on the joint density/distribution, namely for all I ⊆ {1,..., p}: p(x) ≥ M p(x_I) p(x_{I^c}) for some 0 < M ≤ 1, it holds that
E[Y | do(X = x)] = g(x), where g is the component for X in the best additive L^2-approximation of Y versus X, {X_k; k ∈ S(X)}, e.g. S(X) = {k; k < j_X} with j_X the index corresponding to X:
g^app = argmin over additive functions of E[(Y - f_{j_X}(X) - Σ_{k ∈ S(X)} f_k(X_k))^2], and g^app_{j_X}(·) = g(·)

79 implication: we only need to run additive model fitting, even for an arbitrary nonlinear SEM, e.g. even if f_j(X_{pa(j)}, ε_j) is very complicated...!
call it ord-additive modeling/fitting
very robust against model misspecification! misspecification could be dependent ε_j's, i.e. due to hidden variables
if we were to consider E[Y | do(X_1 = x_1, X_2 = x_2)], we would need to fit a first-order interaction model

81 Swiss Air Ltd. ticket pricing and revenue (Ghielmetti et al., in progress) do not need to worry about complicated non-linearities!

82 estimation in practice: additive model fitting of Y versus X, {X_k; k ∈ S(X)} (Hastie & Tibshirani (1990); Wood (2006);...)
high-dimensional scenario when |S(X)| is large, e.g. relative to a sample size n ≈ 50: consistency and optimal convergence rate results for sparse additive model fitting if the problem is sparse (maximal degree of the true DAG is small) under identifiability aka restricted eigenvalue conditions (Ravikumar et al. (2009); Meier, van de Geer & PB (2009))
additive model fitting is well-established and works well, including in the high-dimensional setting
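
A sketch of ord-additive fitting with the mgcv package (the data-generating model, the variable ordering and the use of per-term predictions are my illustrative choices; exact mgcv details should be checked against its documentation): Y is regressed additively on X and the variables preceding it in the order, and the smooth term for X, up to an additive constant, estimates the intervention effect curve x ↦ E[Y | do(X = x)].

library(mgcv)
set.seed(9)
n <- 500
x3 <- rnorm(n)                                   # assumed order: X3, X2, X1, Y
x2 <- sin(x3) + rnorm(n)
x1 <- 0.8 * x2^2 + rnorm(n)                      # X = X1 is the intervention variable
y  <- 2 * x1 - x3 + rnorm(n)                     # true effect curve: E[Y | do(X1 = x)] = 2 x + const
d  <- data.frame(y, x1, x2, x3)
fit <- gam(y ~ s(x1) + s(x2) + s(x3), data = d)  # additive fit of Y on X1 and its predecessors
grid <- data.frame(x1 = seq(-2, 2, length.out = 50), x2 = 0, x3 = 0)
g.hat <- predict(fit, newdata = grid, type = "terms")[, "s(x1)"]   # component for X1, up to a constant
plot(grid$x1, g.hat, type = "l")                 # approximately linear with slope 2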

84 MEP pathway in Arabidopsis thaliana (Wille et al., 2004)
p = 14 gene expressions, n = 118
order of the variables: biological information w.r.t. up-/downstream position in the pathway
[figure: chloroplast MEP pathway with genes DXPS1-3, DXR, MCT, CMK, MECPS, HDS, HDR, IPPI1, GPPS, GGPPS, PPDS1, PPDS2]
rank directed gene pairs (X causes Y) by ∫ (Ê[Y | do(X = x)] - Ê[Y])^2 dx
take the top 10 scoring directed gene pairs and check their stability
stability selection: E[false positives] ≤ 1 (Meinshausen & PB, 2010)
→ do not need to worry about complicated nonlinearities!

85 ord-additive regression: a local operation
inferring E[Y | do(X = x)] when the DAG (or order) is known:
ord-additive regression of Y versus X, pa(X): local
integration of all directed paths from X to Y: global
[figure: squared error of the Entire Path, Partial Path and Parent Adjustment approaches versus the SHD to the true DAG (percentage of correct edges), for a non-sparse and a sparse DAG]
ord-additive regression is much more reliable and robust when having imperfect knowledge of the DAG or of the order of the variables, and it is computationally much faster

86 Conclusions
1. Beware of over-interpretation! so far, based on current data, we cannot reliably infer a causal network despite theorems... (perturbation of the data yields unstable networks)
2. Causal inference relies on subtle, uncheckable(!) assumptions → experimental validations are important (simple organisms in biology are great for pursuing this!); statistical (and other) inference is often not confirmatory
3. many technical issues in identifiability, high-dimensional statistical inference and optimization

87 4. but there is potential: for stable ranking/prediction of intervention/causal effects... causal inference from purely observed data could have practical value in the prioritization and design of perturbation experiments Editorial in Nature Methods (April 2010) this can be very useful in computational biology and in this sense: causal inference from observational data is much further developed than 30 years ago when it was thought to be impossible

88 Thank you!
R-package: pcalg (Kalisch, Mächler, Colombo, Maathuis & PB, 2012)
References:
Ernest, J. and Bühlmann, P. (2014). On the role of additive regression for (high-dimensional) causal inference. arXiv preprint.
Bühlmann, P., Peters, J. and Ernest, J. (2013). CAM: Causal Additive Models, high-dimensional order search and penalized regression. arXiv preprint.
Peters, J. and Bühlmann, P. (2014). Identifiability of Gaussian structural equation models with equal error variances. Biometrika 101, 219-228.
Uhler, C., Raskutti, G., Bühlmann, P. and Yu, B. (2013). Geometry of the faithfulness assumption in causal inference. Annals of Statistics 41, 436-463.
van de Geer, S. and Bühlmann, P. (2013). ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. Annals of Statistics 41.
Kalisch, M., Mächler, M., Colombo, D., Maathuis, M.H. and Bühlmann, P. (2012). Causal inference using graphical models with the R package pcalg. Journal of Statistical Software 47 (11).
Stekhoven, D.J., Moraes, I., Sveinbjörnsson, G., Hennig, L., Maathuis, M.H. and Bühlmann, P. (2011). Causal stability ranking. Bioinformatics 28.
Hauser, A. and Bühlmann, P. (2012). Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research 13.
Maathuis, M.H., Colombo, D., Kalisch, M. and Bühlmann, P. (2010). Predicting causal effects in large-scale systems from observational data. Nature Methods 7.
Maathuis, M.H., Kalisch, M. and Bühlmann, P. (2009). Estimating high-dimensional intervention effects from observational data. Annals of Statistics 37, 3133-3164.


More information

Variable Selection in Multivariate Linear Regression Models with Fewer Observations than the Dimension

Variable Selection in Multivariate Linear Regression Models with Fewer Observations than the Dimension Variable Selection in Multivariate Linear Regression Models with Fewer Observations than the Dimension (Last Modified: November 3 2008) Mariko Yamamura 1, Hirokazu Yanagihara 2 and Muni S. Srivastava 3

More information

Abstract. Three Methods and Their Limitations. N-1 Experiments Suffice to Determine the Causal Relations Among N Variables

Abstract. Three Methods and Their Limitations. N-1 Experiments Suffice to Determine the Causal Relations Among N Variables N-1 Experiments Suffice to Determine the Causal Relations Among N Variables Frederick Eberhardt Clark Glymour 1 Richard Scheines Carnegie Mellon University Abstract By combining experimental interventions

More information

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Graphical Models for Non-Negative Data Using Generalized Score Matching

Graphical Models for Non-Negative Data Using Generalized Score Matching Graphical Models for Non-Negative Data Using Generalized Score Matching Shiqing Yu Mathias Drton Ali Shojaie University of Washington University of Washington University of Washington Abstract A common

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage Lingrui Gan, Naveen N. Narisetty, Feng Liang Department of Statistics University of Illinois at Urbana-Champaign Problem Statement

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Introduction. Basic Probability and Bayes Volkan Cevher, Matthias Seeger Ecole Polytechnique Fédérale de Lausanne 26/9/2011 (EPFL) Graphical Models 26/9/2011 1 / 28 Outline

More information

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation

Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Properties of optimizations used in penalized Gaussian likelihood inverse covariance matrix estimation Adam J. Rothman School of Statistics University of Minnesota October 8, 2014, joint work with Liliana

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models 02-710 Computational Genomics Systems biology Putting it together: Data integration using graphical models High throughput data So far in this class we discussed several different types of high throughput

More information

arxiv: v1 [cs.lg] 26 May 2017

arxiv: v1 [cs.lg] 26 May 2017 Learning Causal tructures Using Regression Invariance ariv:1705.09644v1 [cs.lg] 26 May 2017 AmirEmad Ghassami, aber alehkaleybar, Negar Kiyavash, Kun Zhang Department of ECE, University of Illinois at

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

Causal Inference & Reasoning with Causal Bayesian Networks

Causal Inference & Reasoning with Causal Bayesian Networks Causal Inference & Reasoning with Causal Bayesian Networks Neyman-Rubin Framework Potential Outcome Framework: for each unit k and each treatment i, there is a potential outcome on an attribute U, U ik,

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 5, 06 Reading: See class website Eric Xing @ CMU, 005-06

More information

Gaussian Graphical Models and Graphical Lasso

Gaussian Graphical Models and Graphical Lasso ELE 538B: Sparsity, Structure and Inference Gaussian Graphical Models and Graphical Lasso Yuxin Chen Princeton University, Spring 2017 Multivariate Gaussians Consider a random vector x N (0, Σ) with pdf

More information

High-dimensional Covariance Estimation Based On Gaussian Graphical Models

High-dimensional Covariance Estimation Based On Gaussian Graphical Models High-dimensional Covariance Estimation Based On Gaussian Graphical Models Shuheng Zhou, Philipp Rutimann, Min Xu and Peter Buhlmann February 3, 2012 Problem definition Want to estimate the covariance matrix

More information

Automatic Causal Discovery

Automatic Causal Discovery Automatic Causal Discovery Richard Scheines Peter Spirtes, Clark Glymour Dept. of Philosophy & CALD Carnegie Mellon 1 Outline 1. Motivation 2. Representation 3. Discovery 4. Using Regression for Causal

More information

Variable Selection for Highly Correlated Predictors

Variable Selection for Highly Correlated Predictors Variable Selection for Highly Correlated Predictors Fei Xue and Annie Qu Department of Statistics, University of Illinois at Urbana-Champaign WHOA-PSI, Aug, 2017 St. Louis, Missouri 1 / 30 Background Variable

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

arxiv: v1 [cs.ai] 16 Sep 2015

arxiv: v1 [cs.ai] 16 Sep 2015 Causal Model Analysis using Collider v-structure with Negative Percentage Mapping arxiv:1509.090v1 [cs.ai 16 Sep 015 Pramod Kumar Parida Electrical & Electronics Engineering Science University of Johannesburg

More information

GRAPHICAL MODELS FOR HIGH DIMENSIONAL GENOMIC DATA. Min Jin Ha

GRAPHICAL MODELS FOR HIGH DIMENSIONAL GENOMIC DATA. Min Jin Ha GRAPHICAL MODELS FOR HIGH DIMENSIONAL GENOMIC DATA Min Jin Ha A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the

More information

11 : Gaussian Graphic Models and Ising Models

11 : Gaussian Graphic Models and Ising Models 10-708: Probabilistic Graphical Models 10-708, Spring 2017 11 : Gaussian Graphic Models and Ising Models Lecturer: Bryon Aragam Scribes: Chao-Ming Yen 1 Introduction Different from previous maximum likelihood

More information

Uncertainty quantification in high-dimensional statistics

Uncertainty quantification in high-dimensional statistics Uncertainty quantification in high-dimensional statistics Peter Bühlmann ETH Zürich based on joint work with Sara van de Geer Nicolai Meinshausen Lukas Meier 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70

More information

High-dimensional Covariance Estimation Based On Gaussian Graphical Models

High-dimensional Covariance Estimation Based On Gaussian Graphical Models Journal of Machine Learning Research 12 (211) 2975-326 Submitted 9/1; Revised 6/11; Published 1/11 High-dimensional Covariance Estimation Based On Gaussian Graphical Models Shuheng Zhou Department of Statistics

More information

Covariance function estimation in Gaussian process regression

Covariance function estimation in Gaussian process regression Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian

More information