Comparing Bayesian Networks and Structure Learning Algorithms


1 Comparing Bayesian Networks and Structure Learning Algorithms (and other graphical models) Department of Statistical Sciences October 20, 2009

2 Introduction

3 Introduction Graphical models
Graphical models are defined by the combination of:
- a network structure, either an undirected graph (Markov networks [2], gene association networks, correlation networks, etc.) or a directed graph (Bayesian networks [7]). Each node corresponds to a random variable.
- a global probability distribution which can be factorized into a set of local probability distributions (one for each node) according to the topology of the graph.
This allows a compact representation of the joint distribution of large numbers of random variables and simplifies inference on their parameters.

4 Introduction A simple Bayesian network: Watson's lawn
[Figure: network with nodes RAIN, SPRINKLER and GRASS WET, with conditional probability tables for RAIN, for SPRINKLER given RAIN, and for GRASS WET given SPRINKLER and RAIN.]
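To make the factorization concrete, here is a minimal sketch of the lawn network in Python. It assumes the arcs RAIN → SPRINKLER, RAIN → GRASS WET and SPRINKLER → GRASS WET, as the table headers on the slide suggest. All probability values are hypothetical, chosen only for illustration.

```python
from itertools import product

# Hypothetical conditional probability tables (values are NOT from the slides).
p_rain = {True: 0.2, False: 0.8}
# p_sprinkler_given_rain[rain][sprinkler]
p_sprinkler_given_rain = {
    True: {True: 0.01, False: 0.99},
    False: {True: 0.4, False: 0.6},
}
# P(GRASS WET = True | SPRINKLER, RAIN), keyed by (sprinkler, rain)
p_wet_true = {
    (True, True): 0.99,
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}

def joint(rain, sprinkler, wet):
    """Joint probability as the product of the three local distributions."""
    p_wet = p_wet_true[(sprinkler, rain)]
    local_wet = p_wet if wet else 1.0 - p_wet
    return p_rain[rain] * p_sprinkler_given_rain[rain][sprinkler] * local_wet

states = [True, False]
# The local distributions define a proper joint distribution over 8 states.
total = sum(joint(r, s, w) for r, s, w in product(states, repeat=3))
# Marginal probability that the grass is wet, obtained by summing the joint.
p_wet_marginal = sum(joint(r, s, True) for r, s in product(states, repeat=2))
```

The joint distribution over eight states is specified by only seven free parameters spread over three small tables, which is the compact representation the introduction refers to.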

5 Introduction The problem
Almost all literature on graphical models focuses on the study of the parameters of the local probability distributions (such as conditional probabilities or partial linear correlations).
- This makes comparing models learned with different algorithms difficult, because they maximize different scores, use different estimators for the parameters, work under different sets of hypotheses, etc.
- Unless the true global probability distribution is known, it's difficult to assess the quality of the estimated models.
- The few measures of structural difference are completely descriptive in nature (e.g. Hamming distance [6] or SHD [10]), and have no easy interpretation.

6 Modeling undirected network structures

7 Modeling undirected network structures Edges and univariate Bernoulli random variables
Each edge $e_i$ in an undirected graph $U = (V, E)$ has only two possible states,
$$e_i = \begin{cases} 1 & \text{if } e_i \in E \\ 0 & \text{otherwise.} \end{cases}$$
Therefore it can be modeled as a Bernoulli random variable $E_i$:
$$e_i \sim E_i = \begin{cases} 1 & e_i \in E \text{ with probability } p_i \\ 0 & e_i \notin E \text{ with probability } 1 - p_i \end{cases}$$
where $p_i$ is the probability that the edge $e_i$ belongs to the graph. Let's denote it as $E_i \sim Ber(p_i)$.

8 Modeling undirected network structures Edge sets as multivariate Bernoulli
The natural extension of this approach is to model any set $W$ of edges (such as $E$ or $\{V \times V\}$) as a multivariate Bernoulli random variable $W \sim Ber_k(p)$. It is uniquely identified by the parameter set
$$p = \{ p_w : w \subseteq W, w \neq \emptyset \},$$
which represents the dependence structure [8] among the marginal distributions $W_i \sim Ber(p_i)$, $i = 1, \ldots, k$ of the edges.

9 Modeling undirected network structures Estimation of the parameters of W
The parameter set $p$ of $W$ can be estimated via bootstrap [3] as in Friedman et al. [4] or Imoto et al. [5]:
1. For $b = 1, 2, \ldots, m$:
   1.1 re-sample a new data set $D_b$ from the original data $D$ using either parametric or nonparametric bootstrap.
   1.2 learn a graphical model $U_b = (V, E_b)$ from $D_b$.
2. Estimate the probability of each subset $w$ of $W$ as
$$\hat{p}_w = \frac{1}{m} \sum_{b=1}^{m} I_{\{w \subseteq E_b\}}(U_b).$$
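The bootstrap procedure can be sketched as follows. The slides do not fix a learning algorithm, so `learn_edges` below is a hypothetical stand-in that thresholds sample correlations. The data set and all function names are likewise assumptions made for illustration.

```python
import itertools
import numpy as np

def learn_edges(data, threshold=0.3):
    """Toy stand-in for a structure learner (an assumption, not an
    algorithm from the slides): keep edge (i, j) when the absolute
    sample correlation of columns i and j exceeds `threshold`."""
    corr = np.corrcoef(data, rowvar=False)
    n_vars = data.shape[1]
    return {(i, j) for i, j in itertools.combinations(range(n_vars), 2)
            if abs(corr[i, j]) > threshold}

def bootstrap_edge_sets(data, m=200, seed=0):
    """Steps 1.1 and 1.2: resample rows with replacement (nonparametric
    bootstrap) and learn one edge set E_b per replicate."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    return [learn_edges(data[rng.integers(0, n, size=n)]) for _ in range(m)]

def subset_probability(edge_sets, w):
    """Step 2: fraction of replicates whose edge set contains all of w."""
    w = set(w)
    return sum(w <= e_b for e_b in edge_sets) / len(edge_sets)

# Hypothetical data: X0 and X1 are strongly dependent, X2 is unrelated noise.
rng = np.random.default_rng(42)
x0 = rng.normal(size=500)
data = np.column_stack([x0,
                        x0 + 0.1 * rng.normal(size=500),
                        rng.normal(size=500)])

edge_sets = bootstrap_edge_sets(data)
p_01 = subset_probability(edge_sets, {(0, 1)})          # edge between X0 and X1
p_02 = subset_probability(edge_sets, {(0, 2)})          # edge between X0 and X2
p_01_02 = subset_probability(edge_sets, {(0, 1), (0, 2)})  # both edges together
```

Any structure learning algorithm that returns an edge set could replace `learn_edges` without changing the rest of the sketch.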

10 Properties of the multivariate Bernoulli distribution

11 Properties of the multivariate Bernoulli distribution Moments
The first two moments of a multivariate Bernoulli variable $W = [W_1, W_2, \ldots, W_k]$ are
$$P = [\mathrm{E}(W_1), \ldots, \mathrm{E}(W_k)]^T, \qquad \Sigma = [\sigma_{ij}] = [\mathrm{COV}(W_i, W_j)]$$
where
$$\mathrm{E}(W_i) = p_i$$
$$\mathrm{COV}(W_i, W_j) = \mathrm{E}(W_i W_j) - \mathrm{E}(W_i)\mathrm{E}(W_j) = p_{ij} - p_i p_j$$
$$\mathrm{VAR}(W_i) = \mathrm{COV}(W_i, W_i) = p_i - p_i^2$$
and can be estimated using
$$\hat{p}_i = \frac{1}{m} \sum_{b=1}^{m} I_{\{e_i \in E_b\}}(U_b) \quad \text{and} \quad \hat{p}_{ij} = \frac{1}{m} \sum_{b=1}^{m} I_{\{e_i \in E_b, e_j \in E_b\}}(U_b).$$
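These estimators reduce to a few matrix operations once the bootstrapped edge sets are stored as a 0/1 indicator matrix. The sketch below assumes that layout (one row per bootstrap replicate, one column per edge). The function name and the example matrix are illustrative.

```python
import numpy as np

def bernoulli_moments(indicators):
    """Estimate P and Sigma from an (m, k) 0/1 matrix whose entry (b, i)
    is 1 when edge e_i belongs to the b-th bootstrapped edge set E_b."""
    w = np.asarray(indicators, dtype=float)
    m = w.shape[0]
    p_hat = w.mean(axis=0)                         # estimates of p_i
    p_joint = (w.T @ w) / m                        # estimates of p_ij (p_ii = p_i)
    sigma_hat = p_joint - np.outer(p_hat, p_hat)   # p_ij - p_i p_j
    return p_hat, sigma_hat

# Four hypothetical bootstrap replicates over k = 3 edges.
indicators = np.array([[1, 1, 0],
                       [1, 0, 0],
                       [1, 1, 1],
                       [1, 0, 1]])
p_hat, sigma_hat = bernoulli_moments(indicators)
```

In this example the first edge appears in every replicate, so its variance and all its covariances are zero. The other two edges appear half of the time and co-occur a quarter of the time, which gives variance 1/4 and covariance zero.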

12 Properties of the multivariate Bernoulli distribution Uncorrelation and independence
Theorem. Let $B_i$ and $B_j$ be two Bernoulli random variables. Then $B_i$ and $B_j$ are independent if and only if their covariance is zero:
$$B_i \perp\!\!\!\perp B_j \iff \mathrm{COV}(B_i, B_j) = 0$$
Theorem. Let $B = [B_1, B_2, \ldots, B_k]^T$ and $C = [C_1, C_2, \ldots, C_l]^T$, $k, l \in \mathbb{N}$, be two multivariate Bernoulli random variables. Then $B$ and $C$ are independent if and only if
$$B \perp\!\!\!\perp C \iff \mathrm{COV}(B, C) = O$$
where $O$ is the zero matrix.
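A short argument for the univariate theorem, given here as a sketch rather than as the proof used in the slides: zero covariance means $p_{ij} = p_i p_j$, and for two binary variables that single equality forces all four joint probabilities to factorize.

```latex
\begin{align*}
P(B_i = 1, B_j = 1) &= p_{ij} = p_i p_j \\
P(B_i = 1, B_j = 0) &= p_i - p_{ij} = p_i (1 - p_j) \\
P(B_i = 0, B_j = 1) &= p_j - p_{ij} = (1 - p_i) p_j \\
P(B_i = 0, B_j = 0) &= 1 - p_i - p_j + p_{ij} = (1 - p_i)(1 - p_j)
\end{align*}
```

The converse is the general fact that independent variables are uncorrelated.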

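13 Properties of the multivariate Bernoulli distribution Uncorrelation and independence (an example)
Let $B = [B_1\ B_2\ B_3]^T = B^{(1)} + B^{(2)}$ with $B^{(1)} = [0\ B_2\ 0]^T$ and $B^{(2)} = [B_1\ 0\ B_3]^T$; then we have
$$\mathrm{COV}(B^{(1)}, B^{(2)}) = \mathrm{E}\left( \begin{bmatrix} 0 \\ B_2 \\ 0 \end{bmatrix} \begin{bmatrix} B_1 & 0 & B_3 \end{bmatrix} \right) - \mathrm{E}\left( \begin{bmatrix} 0 \\ B_2 \\ 0 \end{bmatrix} \right) \mathrm{E}\left( \begin{bmatrix} B_1 & 0 & B_3 \end{bmatrix} \right)$$
$$= \mathrm{E} \begin{bmatrix} 0 & 0 & 0 \\ B_1 B_2 & 0 & B_2 B_3 \\ 0 & 0 & 0 \end{bmatrix} - \begin{bmatrix} 0 \\ p_2 \\ 0 \end{bmatrix} \begin{bmatrix} p_1 & 0 & p_3 \end{bmatrix}$$
$$= \begin{bmatrix} 0 & 0 & 0 \\ p_{12} & 0 & p_{23} \\ 0 & 0 & 0 \end{bmatrix} - \begin{bmatrix} 0 & 0 & 0 \\ p_1 p_2 & 0 & p_2 p_3 \\ 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 \\ p_{12} - p_1 p_2 & 0 & p_{23} - p_2 p_3 \\ 0 & 0 & 0 \end{bmatrix}$$
$$= O \iff B^{(1)} \perp\!\!\!\perp B^{(2)}$$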

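14 Properties of the multivariate Bernoulli distribution Constraints on the covariance matrix Σ
The marginal variances of the edges are bounded, because
$$p_i \in [0, 1] \implies \sigma_{ii} = p_i - p_i^2 \in \left[0, \tfrac{1}{4}\right].$$
The maximum is attained for $p_i = \tfrac{1}{2}$, and the minimum for both $p_i = 0$ and $p_i = 1$. By the Cauchy-Schwarz inequality [1] the covariances are bounded too:
$$0 \leq \sigma_{ij}^2 \leq \sigma_{ii} \sigma_{jj} \leq \tfrac{1}{16} \implies |\sigma_{ij}| \in \left[0, \tfrac{1}{4}\right].$$
These result in similar bounds on the eigenvalues $\lambda_1, \ldots, \lambda_k$ of $\Sigma$:
$$0 \leq \lambda_i \leq \frac{k}{4} \quad \text{and} \quad 0 \leq \sum_{i=1}^{k} \lambda_i \leq \frac{k}{4}.$$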
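The bounds are easy to check numerically. The sketch below simulates a hypothetical set of bootstrapped edge indicators (the edge probabilities are random and not taken from the slides), computes the covariance matrix and verifies each constraint up to floating-point rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
k, m = 5, 100

# Hypothetical bootstrap output: row b flags which of the k edges are in E_b.
edge_probs = rng.random(k)
indicators = (rng.random((m, k)) < edge_probs).astype(float)

p_hat = indicators.mean(axis=0)
sigma_hat = indicators.T @ indicators / m - np.outer(p_hat, p_hat)
eigenvalues = np.linalg.eigvalsh(sigma_hat)   # sigma_hat is symmetric

tol = 1e-12  # slack for floating-point rounding
variances = np.diag(sigma_hat)
variances_bounded = bool(np.all((variances >= -tol) & (variances <= 0.25 + tol)))
covariances_bounded = bool(np.all(np.abs(sigma_hat) <= 0.25 + tol))
eigenvalues_bounded = bool(np.all((eigenvalues >= -tol) & (eigenvalues <= k / 4 + tol)))
trace_bounded = bool(-tol <= eigenvalues.sum() <= k / 4 + tol)
```

The covariance is computed with divisor $m$, so it is the exact covariance matrix of the empirical distribution of the indicators and the bounds apply to it directly.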

15 Properties of the multivariate Bernoulli distribution Constraints on Σ: a graphical representation
[Figure: three example covariance matrices $\Sigma_1$, $\Sigma_2$ and $\Sigma_3$.]

16 Measures of Structure Variability

17 Measures of Structure Variability Entropy of the bootstrapped models
Let's consider the graphical models $U_1, \ldots, U_m$ learned from the bootstrap samples.
- minimum entropy: all the models learned from the bootstrap samples have the same structure. In this case:
$$p_i = \begin{cases} 1 & \text{if } e_i \in E \\ 0 & \text{otherwise} \end{cases} \quad \text{and} \quad \Sigma = O.$$
- intermediate entropy: several models are observed with different frequencies $m_b$, $\sum m_b = m$, so
$$\hat{p}_i = \frac{1}{m} \sum_{b : e_i \in E_b} m_b \quad \text{and} \quad \hat{p}_{ij} = \frac{1}{m} \sum_{b : e_i \in E_b, e_j \in E_b} m_b.$$
- maximum entropy: all possible models appear with the same frequency, which results in
$$p_i = \tfrac{1}{2} \quad \text{and} \quad \Sigma = \tfrac{1}{4} I_k.$$

18 Measures of Structure Variability Entropy of the bootstrapped models
[Figure: two panels, "minimum entropy" and "maximum entropy".]

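19 Measures of Structure Variability Univariate measures of variability
- the generalized variance
$$\mathrm{VAR}_G(\Sigma) = \det(\Sigma) = \prod_{i=1}^{k} \lambda_i \in \left[0, \frac{1}{4^k}\right]$$
- the total variance
$$\mathrm{VAR}_T(\Sigma) = \mathrm{tr}(\Sigma) = \sum_{i=1}^{k} \lambda_i \in \left[0, \frac{k}{4}\right]$$
- the squared Frobenius matrix norm
$$\mathrm{VAR}_N(\Sigma) = \left|\left|\left| \Sigma - \frac{k}{4} I_k \right|\right|\right|_F^2 = \sum_{i=1}^{k} \left( \lambda_i - \frac{k}{4} \right)^2 \in \left[ \frac{k(k-1)^2}{16}, \frac{k^3}{16} \right]$$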

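20 Measures of Structure Variability Measures of structure variability
$$\overline{\mathrm{VAR}}_T(\Sigma) = \frac{\mathrm{VAR}_T(\Sigma)}{\max_\Sigma \mathrm{VAR}_T(\Sigma)} = \frac{4}{k} \mathrm{VAR}_T(\Sigma)$$
$$\overline{\mathrm{VAR}}_G(\Sigma) = \frac{\mathrm{VAR}_G(\Sigma)}{\max_\Sigma \mathrm{VAR}_G(\Sigma)} = 4^k \, \mathrm{VAR}_G(\Sigma)$$
$$\overline{\mathrm{VAR}}_N(\Sigma) = \frac{\max_\Sigma \mathrm{VAR}_N(\Sigma) - \mathrm{VAR}_N(\Sigma)}{\max_\Sigma \mathrm{VAR}_N(\Sigma) - \min_\Sigma \mathrm{VAR}_N(\Sigma)} = \frac{k^3 - 16 \, \mathrm{VAR}_N(\Sigma)}{k(2k - 1)}$$
All of them vary in the $[0, 1]$ interval and associate high values to networks whose structure displays a high entropy in the bootstrap samples.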
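The three normalized measures can be computed from the eigenvalues of $\Sigma$. The following is a minimal sketch; the function name and the dictionary keys are illustrative choices, not notation from the slides.

```python
import numpy as np

def normalized_variability(sigma):
    """Normalized total variance, generalized variance and squared
    Frobenius norm of a k x k edge covariance matrix, each in [0, 1]."""
    sigma = np.asarray(sigma, dtype=float)
    k = sigma.shape[0]
    eigenvalues = np.linalg.eigvalsh(sigma)       # sigma is symmetric
    var_t = eigenvalues.sum()                     # tr(Sigma)
    var_g = np.prod(eigenvalues)                  # det(Sigma)
    var_n = np.sum((eigenvalues - k / 4) ** 2)    # squared Frobenius norm
    return {
        "total": float(4 * var_t / k),
        "generalized": float(4 ** k * var_g),
        "frobenius": float((k ** 3 - 16 * var_n) / (k * (2 * k - 1))),
    }

k = 3
min_entropy = normalized_variability(np.zeros((k, k)))   # Sigma = O
max_entropy = normalized_variability(np.eye(k) / 4)      # Sigma = I_k / 4
```

The two calls reproduce the extreme cases: all three measures are 0 for $\Sigma = O$ (minimum entropy) and 1 for $\Sigma = \tfrac{1}{4} I_k$ (maximum entropy).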

21 Measures of Structure Variability Structure variability (total variance)
[Figure: two panels, "minimum entropy" and "maximum entropy".]

22 Measures of Structure Variability Structure variability (Frobenius norm)
[Figure: two panels, "minimum entropy" and "maximum entropy".]

23 Measures of Structure Variability Applications
- compare the performance of different combinations of learning algorithms and network scores/independence tests on the same data.
- study the performance of an algorithm at different sample sizes by changing the size of the bootstrap samples. The simplest way is to test the hypothesis
$$H_0 : \Sigma = \tfrac{1}{4} I_k \qquad H_1 : \Sigma \neq \tfrac{1}{4} I_k$$
using either parametric tests or parametric bootstrap.
- apply many techniques from classical multivariate statistics (such as principal components), graph theory (path analysis) and linear algebra (matrix decompositions).

24 Measures of Structure Variability Comparing learning algorithms performance
[Figure: p-value against sample size for the gs, iamb and mmhc learning algorithms.]

25 Measures of Structure Variability Comparing statistical tests performance
[Figure: p-value against sample size for two tests; legend entries "mi" and "x".]

26 Further Applications

27 Further Applications Distances in the space of graphs
The availability of the first two moments of the random vector $E$ allows the computation of the Mahalanobis distance
$$D_{U'} = (E' - \mathrm{E}(E))^T \, \Sigma^{-1} \, (E' - \mathrm{E}(E))$$
of any possible graphical structure $U' = (V, E')$ with the same vertex set. This method works even when the true network structure is not known, and gives a better representation of the geometry of the space of the graphs than Hamming distance.
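A minimal sketch of this distance follows, with candidate structures encoded as 0/1 edge indicator vectors. The moments in the example are hypothetical. Using the pseudo-inverse in place of $\Sigma^{-1}$ is an assumption made here so that the function still returns a value when $\Sigma$ is singular (for example when an edge appears in every bootstrap sample).

```python
import numpy as np

def mahalanobis_distance(e_candidate, p_hat, sigma_hat):
    """Quadratic form (e - P)^T Sigma^+ (e - P) for a candidate edge
    indicator vector e; Sigma^+ is the Moore-Penrose pseudo-inverse
    (an assumption, it equals the inverse when Sigma is invertible)."""
    diff = np.asarray(e_candidate, dtype=float) - np.asarray(p_hat, dtype=float)
    return float(diff @ np.linalg.pinv(sigma_hat) @ diff)

# Hypothetical moments for k = 2 positively correlated edges.
p_hat = np.array([0.5, 0.5])
sigma_hat = np.array([[0.25, 0.20],
                      [0.20, 0.25]])

d_both = mahalanobis_distance([1, 1], p_hat, sigma_hat)   # both edges present
d_one = mahalanobis_distance([1, 0], p_hat, sigma_hat)    # only the first edge
```

Both candidates differ from the other by a single edge, yet the structure containing both edges is much closer to the bootstrap distribution because the two edges tend to appear together.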

28 Further Applications Extensions to directed graphs
Each arc $a_i = (v_j, v_k)$ in a directed graph $G = (V, A)$ has three possible states
$$a_i = \begin{cases} -1 & \text{if } a_i = \{v_j \leftarrow v_k\} \text{ (backward)} \\ 0 & \text{if } a_i \notin A \\ 1 & \text{if } a_i = \{v_j \rightarrow v_k\} \text{ (forward)} \end{cases}$$
and therefore it can be modeled as a trinomial random variable $A_i$, which is essentially a multinomial random variable with three states. Variability measures (and their normalized variants) can be extended from the undirected case as
$$\mathrm{VAR}(A_i) = \mathrm{VAR}(E_i) + 4 \, P(\text{forward}) \, P(\text{backward}) \in [0, 1]$$
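The variance identity can be verified directly from the definition of $A_i$. The sketch below assumes that $E_i$ is the indicator of the arc being present in either direction, so that $E_i \sim Ber(P(\text{forward}) + P(\text{backward}))$. The function names are illustrative.

```python
def arc_variance(p_forward, p_backward):
    """Variance of A_i taking values +1 (forward), -1 (backward), 0 (absent)."""
    mean = p_forward - p_backward          # E(A_i)
    second_moment = p_forward + p_backward  # E(A_i^2)
    return second_moment - mean ** 2

def edge_variance(p_forward, p_backward):
    """Variance of the undirected indicator E_i (arc present in any direction)."""
    p = p_forward + p_backward
    return p - p ** 2

p_f, p_b = 0.3, 0.2
lhs = arc_variance(p_f, p_b)
rhs = edge_variance(p_f, p_b) + 4 * p_f * p_b
max_variance = arc_variance(0.5, 0.5)   # arc always present, direction undecided
```

The extra term $4 \, P(\text{forward}) \, P(\text{backward})$ measures uncertainty about the direction of the arc. The variance reaches its upper bound of 1 when the arc is always present but points each way half of the time.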

29 Thank you.

30 References

31 References References I
[1] R. B. Ash. Probability and Measure Theory. Academic Press, 2nd edition.
[2] D. I. Edwards. Introduction to Graphical Modelling. Springer.
[3] B. Efron and R. Tibshirani. An Introduction to the Bootstrap. Chapman & Hall, 1993.

32 References References II
[4] N. Friedman, M. Goldszmidt, and A. Wyner. Data Analysis with Bayesian Networks: A Bootstrap Approach. In Proceedings of the 15th Annual Conference on Uncertainty in Artificial Intelligence (UAI-99). Morgan Kaufmann.
[5] S. Imoto, S. Y. Kim, H. Shimodaira, S. Aburatani, K. Tashiro, S. Kuhara, and S. Miyano. Bootstrap Analysis of Gene Networks Based on Bayesian Networks and Nonparametric Regression. Genome Informatics, 13.
[6] D. Jungnickel. Graphs, Networks and Algorithms. Springer, 3rd edition, 2008.

33 References References III
[7] K. Korb and A. Nicholson. Bayesian Artificial Intelligence. Chapman and Hall.
[8] F. Krummenauer. Limit Theorems for Multivariate Discrete Distributions. Metrika, 47(1):47-69.
[9] M. Scutari. Structure Variability in Bayesian Networks. Working Paper, Department of Statistical Sciences. Deposited in arXiv in the Statistics - Methodology archive.

34 References References IV
[10] I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm. Machine Learning, 65(1):31-78, 2006.
