Charles University in Prague
Faculty of Mathematics and Physics
Department of Probability and Mathematical Statistics

Abstract of Doctoral Thesis
INDEPENDENCE MODELS

Mgr. Petr Šimeček

Supervisor: RNDr. Milan Studený, DrSc., ÚTIA AV ČR
Study program: mathematics
Study branch: m4, probability and mathematical statistics

Prague 2007
This doctoral thesis was written within the doctoral study programme completed by the candidate at the Faculty of Mathematics and Physics of Charles University.

Candidate: Mgr. Petr Šimeček

Supervisor: RNDr. Milan Studený, DrSc.
Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic
Pod Vodárenskou věží 4, Praha 8

Opponents:
Doc. RNDr. Petr Lachout, CSc.
Department of Probability and Mathematical Statistics, Faculty of Mathematics and Physics, Charles University, Sokolovská 83, Praha 8

Ing. František Matúš, CSc.
Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 4, Praha 8

The abstract was distributed on [date]. The defence takes place on [date] at 12:00 before the committee for the defence of doctoral theses in the branch m4 - Probability and Mathematical Statistics at the Faculty of Mathematics and Physics of Charles University, Sokolovská 83, Praha 8, in the lecture room KPMS Praktikum.

The thesis is available at the Doctoral Study Office of the Faculty of Mathematics and Physics of Charles University, Ke Karlovu 3, Praha 2.

Chair of the branch board: Prof. RNDr. Marie Hušková, DrSc.
Department of Probability and Mathematical Statistics, Faculty of Mathematics and Physics, Charles University, Sokolovská 83, Praha 8
Contents

Introduction
1 Basic Concepts
  1.1 Conditional Independence
  1.2 Gaussian Distributional Framework
  1.3 Discrete Distributional Framework
  1.4 Independence Models
  1.5 Graphical Independence Models
2 No Finite Characterization Exists
3 Representable Models Over N = {1, 2, 3, 4}
  3.1 Gaussian Distributional Framework
  3.2 Discrete Distributional Framework
    3.2.1 Discrete representability
    3.2.2 Positive binary representability
    3.2.3 Positive discrete representability
4 Probability Distributions Over an Independence Model
  4.1 Connection Between Graphical Independence Models and Exponential Families In Regular Distributional Framework
  4.2 Model Search
  4.3 Simulation Study
References
List of Publications
Introduction

An essential part of probabilistic reasoning is the theory of graphical models, which endows a multivariate probability distribution with an independence structure. An interesting question, originally posed by J. Pearl, is the problem of probabilistic representability: for which lists of conditional independence constraints does there exist a random vector satisfying these, and only these, independences? In this work the known results are collected and the problem is studied further in several distributional frameworks. The focus is particularly on the special case of four random variables, where the solution is found for a general Gaussian distribution (extending the result of R. Lněnička) and partially for discrete and binary distributions. Finally, the achieved results are used to derive a variance matrix estimator in the case of a regular Gaussian distribution.

This abstract briefly covers the contents of the thesis, omitting proofs and focusing on the most important topics. Lemmas and theorems are not necessarily in the same order as in the thesis (see references).

1 Basic Concepts

For the reader's convenience, auxiliary results related to probability theory and independence models are recalled in this section.

The object of our interest is a random vector ξ = (ξ_a)_{a∈N} indexed by a finite set N = {1, 2, ..., n}. Its distribution and its density with respect to some product σ-finite measure are denoted by P and f. For A ⊆ N, the subvector (ξ_a)_{a∈A} is denoted by ξ_A; ξ_∅ is presumed to be a constant. Analogously, P_A and f_A are the marginal distribution and density; x_A is the corresponding coordinate projection of a constant vector x = (x_a)_{a∈N}. A singleton {a} is shortened to a, and a union of sets A ∪ B is written simply as the juxtaposition AB. For a square matrix Σ, let Σ^{-1}, Σ^-, |Σ| and rank(Σ) denote its inverse, generalized inverse, determinant and rank (the dimension of the space spanned by its rows, or equivalently its columns), respectively.
1.1 Conditional Independence

Provided A, B, C are pairwise disjoint subsets of N, ξ_A ⊥⊥ ξ_B | ξ_C stands for the statement "ξ_A and ξ_B are conditionally independent given ξ_C". In particular, unconditional independence (C = ∅) is abbreviated as ξ_A ⊥⊥ ξ_B. The meaning of the conditional independence (CI) statement ξ_A ⊥⊥ ξ_B | ξ_C is as follows: if the value of ξ_C is known, then knowledge about ξ_A is irrelevant (useless for prediction, ...) for knowledge of ξ_B, and vice versa. There are many equivalent ways to formalize this definition. Let us list a few of them in Lemma 1 (see [2], pp. 29 for a detailed discussion).
Lemma 1. If A, B and C are pairwise disjoint subsets of N, then statements i)-vi) are equivalent:

i) f_{AB|C}(x_{AB} | x_C) = f_{A|C}(x_A | x_C) · f_{B|C}(x_B | x_C)
ii) f_{ABC}(x_{ABC}) · f_C(x_C) = f_{AC}(x_{AC}) · f_{BC}(x_{BC})
iii) f_{A|BC}(x_A | x_{BC}) = f_{A|C}(x_A | x_C)
iv) f_{AC|B}(x_{AC} | x_B) = f_{A|C}(x_A | x_C) · f_{C|B}(x_C | x_B)
v) f_{ABC}(x_{ABC}) = f_{A|C}(x_A | x_C) · f_{BC}(x_{BC})
vi) ∃ functions g, h: f_{ABC}(x_{ABC}) = g(x_{AC}) · h(x_{BC})

where f_{A|B} denotes the conditional density f_{A|B}(x_A | x_B) = f_{AB}(x_{AB}) / f_B(x_B), which is defined under the condition f_B(x_B) > 0. Each of the equations above should be understood in the sense that it holds for all x such that all conditional densities in the equation are defined.

Moreover, if the multiinformation¹ (MI) of ξ is finite, then another characterization of CI exists (see [15] for a proof):

Lemma 2. If ξ is a random vector with finite MI and A, B, C are pairwise disjoint subsets of N, then

  MI(ξ_{ABC}) + MI(ξ_C) − MI(ξ_{AC}) − MI(ξ_{BC}) ≥ 0

and

  MI(ξ_{ABC}) + MI(ξ_C) − MI(ξ_{AC}) − MI(ξ_{BC}) = 0  ⟺  ξ_A ⊥⊥ ξ_B | ξ_C.

If ξ does not have a density with respect to the given σ-finite measure, then generalized conditional independence ξ_A ⊥⊥ ξ_B | ξ_C may be defined as

  L(ξ_{AB} | ξ_C = x_C) = L(ξ_A | ξ_C = x_C) ⊗ L(ξ_B | ξ_C = x_C)  P_C-a.e.,

i.e. the conditional distribution of ξ_{AB} given ξ_C is the product of the corresponding conditional distributions of ξ_A and ξ_B given ξ_C. This generalization will be used only in the case of a Gaussian distribution whose variance matrix is not regular.

As the following Lemma 3 shows, any CI statement can be written in terms of elementary CI statements, i.e. as a collection (ξ_{A_i} ⊥⊥ ξ_{B_i} | ξ_{C_i})_i such that |A_i| = |B_i| = 1 for all i.

Lemma 3. Let A, B, C be pairwise disjoint subsets of N:

  ξ_A ⊥⊥ ξ_B | ξ_C  ⟺  ( ∀ a ∈ A, b ∈ B, D ⊆ ABC \ {a, b} such that C ⊆ D:  ξ_a ⊥⊥ ξ_b | ξ_D ).

Proof. See [4], Lemma 3 for a proof.

¹ That is, the Kullback-Leibler divergence (relative entropy) between the joint distribution and the product of its one-dimensional marginals.
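Lemma 1 ii) can be checked numerically for a small discrete distribution. A minimal sketch in Python (the toy conditional probabilities below are my own illustrative choice, not taken from the thesis):

```python
from itertools import product

# Toy binary variables (xi_a, xi_b, xi_c) with xi_a ⊥⊥ xi_b | xi_c built in
# by construction: p(a, b, c) = p(c) p(a|c) p(b|c).
p_c = {0: 0.4, 1: 0.6}
p_a_given_c = {0: {0: 0.2, 1: 0.8}, 1: {0: 0.7, 1: 0.3}}  # p_a_given_c[c][a]
p_b_given_c = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.1, 1: 0.9}}

joint = {(a, b, c): p_c[c] * p_a_given_c[c][a] * p_b_given_c[c][b]
         for a, b, c in product((0, 1), repeat=3)}

def marg(keep):
    """Marginal pmf over the coordinates named in `keep` ('a', 'b', 'c')."""
    idx = {'a': 0, 'b': 1, 'c': 2}
    out = {}
    for x, p in joint.items():
        key = tuple(x[idx[v]] for v in keep)
        out[key] = out.get(key, 0.0) + p
    return out

p_abc, p_ac, p_bc, pc = marg('abc'), marg('ac'), marg('bc'), marg('c')

# Lemma 1 ii): f_{ABC} f_C = f_{AC} f_{BC} at every point.
ok = all(abs(p_abc[(a, b, c)] * pc[(c,)] - p_ac[(a, c)] * p_bc[(b, c)]) < 1e-12
         for a, b, c in product((0, 1), repeat=3))
print(ok)
```

Because the joint factorizes through x_C by construction, the product characterization holds at every point.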
1.2 Gaussian Distributional Framework

Definition. A Gaussian distribution of a random vector ξ = (ξ_1, ..., ξ_n) is a probability distribution specified by its characteristic function

  φ_ξ(t) = E exp(i t'ξ) = exp( i t'µ − t'Σt / 2 ),

where µ ∈ R^n and a symmetric positive semi-definite matrix Σ ∈ R^{n×n} are the mean and variance parameters, respectively. If Σ is positive definite (Σ > 0), i.e. regular, the distribution is called a regular Gaussian distribution; it has a density with respect to Lebesgue measure

  f(x) = (2π)^{−n/2} |Σ|^{−1/2} exp( −(x − µ)' Σ^{-1} (x − µ) / 2 )

and finite multiinformation

  MI(ξ) = −(1/2) ( log|Σ| − Σ_{a=1}^{n} log σ_{aa} ).

On the other hand, if Σ is not regular then ξ takes values in an affine subspace of R^n, i.e. there exists a nonzero a ∈ R^n such that

  a'(ξ − µ) = 0 (a.s.),   (1)

see [1], pp. 71; neither does a density with respect to Lebesgue measure exist, nor is the multiinformation generally finite.

Given a matrix Σ = (σ_ab)_{a,b∈{1,...,n}} and subsets A, B of {1, 2, ..., n}, the submatrix with rows A and columns B is denoted by Σ_{A×B} = (σ_ab)_{a∈A, b∈B}. The following lemma shows that both marginal and conditional distributions of a Gaussian distribution are again Gaussian.

Lemma 4. Let A and B be disjoint subsets of N = {1, 2, ..., n}.

i) The marginal distribution of ξ_A is a Gaussian distribution with variance matrix Σ_{A×A}.
ii) The conditional distribution of ξ_A given ξ_B = x_B is a Gaussian distribution with variance matrix Σ_{A×A|B} = Σ_{A×A} − Σ_{A×B} Σ_{B×B}^- Σ_{B×A}, where Σ_{B×B}^- is any generalized inverse of Σ_{B×B}.
iii) Moreover, if Σ > 0 then also Σ_{A×A} > 0 and Σ_{A×A|B} > 0.
Proof. See [2], pp. 256 for a proof of i) and ii). Regularity of Σ_{A×A} is obvious; the Cholesky decomposition of Σ > 0 yields Σ_{A×A|B} > 0.

Let us emphasize that the variance matrix Σ_{A×A|B} of the conditional distribution does not depend on the choice of x_B.

Lemma 5. Let a and b be distinct elements of {1, ..., n} and C ⊆ {1, ..., n} \ ab:

i) ξ_a ⊥⊥ ξ_b ⟺ σ_ab = 0.
ii) If Σ_{C×C} > 0, then ξ_a ⊥⊥ ξ_b | ξ_C ⟺ |Σ_{aC×bC}| = 0.
iii) If D ⊆ C is such that Σ_{D×D} > 0 and rank(Σ_{D×D}) = rank(Σ_{C×C}), then ξ_a ⊥⊥ ξ_b | ξ_C ⟺ ξ_a ⊥⊥ ξ_b | ξ_D ⟺ |Σ_{aD×bD}| = 0.
iv) If Σ > 0 then ξ_a ⊥⊥ ξ_b | ξ_C ⟺ |(Σ^{-1})_{N\(aC) × N\(bC)}| = 0.

Proof. The first part is a well-known fact (cf. [2], pp. 257). The second follows from the Cholesky decomposition, yielding

  |Σ_{aC×bC}| = 0 ⟺ σ_ab − Σ_{a×C} Σ_{C×C}^{-1} Σ_{C×b} = 0 ⟺ (Σ_{ab×ab|C})_{a,b} = 0 ⟺ ξ_a ⊥⊥ ξ_b | ξ_C.

The third part follows from Lemma 4 ii) by constructing the generalized inverse (cf. [9], section 1b.5) as

  Σ_{C×C}^- = ( Σ_{D×D}^{-1}  0
                0             0 ).

See [3] for a proof of iv).

1.3 Discrete Distributional Framework

Definition. A random vector ξ = (ξ_1, ..., ξ_n) is called discrete if each ξ_a takes values in a state space X_a such that 1 < |X_a| < ∞. If |X_a| = 2 for all a, then ξ is called binary. In the binary case, without loss of generality we assume X_a = {−1, 1} for all a.

Definition. A discrete random vector ξ is called positive if

  0 < P(ξ = x) < 1  for all x ∈ X = ∏_{a=1}^{n} X_a.
In the case of a discretely distributed random vector, MI is always finite and the variables ξ_a and ξ_b are conditionally independent given ξ_C if and only if

  P(ξ_{abC} = x_{abC}) P(ξ_C = x_C) = P(ξ_{aC} = x_{aC}) P(ξ_{bC} = x_{bC})  for all x_{abC} ∈ X_{abC}.  (2)

As shown in the thesis, the class of binary distributions can be parametrized by its moments, A ⊆ N,

  e_A = E ∏_{a∈A} ξ_a = Σ_{x_N ∈ {−1,1}^N} P(ξ_N = x_N) ∏_{a∈A} x_a,  (3)

where e_∅ is presumed to be one. The parametrization (3) implies

  P(ξ_A = x_A) = (1/2)^{|A|} Σ_{B⊆A} ( ∏_{b∈B} x_b ) e_B

and the CI condition (2) may be written in the following form: ξ_a ⊥⊥ ξ_b | ξ_C if and only if, for all x_C ∈ {−1, 1}^C,

  ( Σ_{D⊆C} e_{abD} ∏_{d∈D} x_d ) ( Σ_{D⊆C} e_D ∏_{d∈D} x_d ) = ( Σ_{D⊆C} e_{aD} ∏_{d∈D} x_d ) ( Σ_{D⊆C} e_{bD} ∏_{d∈D} x_d ).

The CI statements ξ_a ⊥⊥ ξ_b | ξ_C for |C| = 0, 1, 2 are below:

  ξ_a ⊥⊥ ξ_b:
    e_ab = e_a e_b

  ξ_a ⊥⊥ ξ_b | ξ_c:
    e_abc + e_ab e_c = e_a e_bc + e_b e_ac
    e_ab + e_abc e_c = e_ac e_bc + e_a e_b

  ξ_a ⊥⊥ ξ_b | ξ_cd:
    e_abcd + e_abc e_d + e_abd e_c + e_ab e_cd = e_a e_bcd + e_b e_acd + e_ac e_bd + e_ad e_bc
    e_abc + e_abd e_cd + e_ab e_c + e_d e_abcd = e_ad e_bcd + e_bd e_acd + e_a e_bc + e_b e_ac
    e_abd + e_abc e_cd + e_ab e_d + e_c e_abcd = e_ac e_bcd + e_bc e_acd + e_a e_bd + e_b e_ad
    e_ab + e_c e_abc + e_d e_abd + e_cd e_abcd = e_a e_b + e_ac e_bc + e_ad e_bd + e_acd e_bcd

1.4 Independence Models

Let N = {1, 2, ..., n} be a finite set and let T_N denote the set of all triplets ⟨ab|C⟩ such that ab is an (unordered) pair of distinct elements of N and C ⊆ N \ ab.

Definition. Subsets of T_N will be referred to here as formal independence models over N.
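For |N| = 4 the set T_N of triplets can be enumerated mechanically; a small sketch (representing ⟨ab|C⟩ as a pair of frozensets is my own convention, not notation from the thesis):

```python
from itertools import combinations

# T_N: all triplets <ab|C> with ab an unordered pair of distinct elements
# of N and C a subset of N \ ab.  A formal independence model over N is
# any subset of T_N.
N = {1, 2, 3, 4}
T_N = {(frozenset(ab), frozenset(C))
       for ab in combinations(sorted(N), 2)
       for r in range(len(N) - 1)
       for C in combinations(sorted(N - set(ab)), r)}
print(len(T_N))   # 6 unordered pairs * 2^2 conditioning sets = 24
```

So over four variables there are 2^24 formal independence models, of which only a small fraction turn out to be representable.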
The independence model I(ξ) induced by a random vector ξ = (ξ_1, ..., ξ_n) is the independence model over {1, 2, ..., n} defined as

  I(ξ) = { ⟨xy|Z⟩ ; ξ_x ⊥⊥ ξ_y | ξ_Z }.

Let us emphasize that the independence model I(ξ) uniquely determines all other conditional independencies among subvectors of ξ (due to Lemma 3).

Definition. An independence model I is said to be Gaussian (g), regular Gaussian (g+), discrete (d), positive discrete (d+), binary (b) and positive binary (b+) representable if there exists a random vector ξ that is Gaussian, regular Gaussian, discrete, positive discrete, binary and positive binary, respectively, and I = I(ξ).

Lemma 6. Let a, b, c be distinct elements of N = {1, ..., n} and D ⊆ N \ abc. If an independence model I over N is either discrete or Gaussian representable, then

  ( { ⟨ab|cD⟩, ⟨ac|D⟩ } ⊆ I )  ⟺  ( { ⟨ac|bD⟩, ⟨ab|D⟩ } ⊆ I ).  (4)

Moreover, if I is positive discrete or regular Gaussian representable, then

  ( { ⟨ab|cD⟩, ⟨ac|bD⟩ } ⊆ I )  ⟹  ( { ⟨ab|D⟩, ⟨ac|D⟩ } ⊆ I ).  (5)

Proof. These are the so-called semigraphoid and graphoid properties; cf. [2] or [15] for more details.

Diagrams proposed by R. Lněnička will be used for the visualisation of independence models I over N with |N| ≤ 4. Each element of N is plotted as a dot. If ⟨ab|∅⟩ ∈ I, then the dots corresponding to a and b are joined by a line. If ⟨ab|c⟩ ∈ I, then we put a line between the dots corresponding to a and b and add a small line in the middle pointing in the direction of c. If both ⟨ab|c⟩ and ⟨ab|d⟩ are elements of I, then only one line with two small lines in the middle is plotted. Finally, ⟨ab|cd⟩ ∈ I is visualised by a brace between a and b. For instance, see the diagram of I = { ⟨12⟩, ⟨23|1⟩, ⟨23|4⟩, ⟨34|12⟩, ⟨14⟩, ⟨14|2⟩ } in Figure 1.

Fig. 1: Diagram of an independence model.
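Property (4) of Lemma 6 can be verified mechanically for any formal independence model. A sketch under my own triplet encoding (pairs of frozensets; the helper names are not from the thesis):

```python
from itertools import permutations, chain, combinations

# Semigraphoid check for property (4) of Lemma 6 over N = {1,2,3,4}:
# for all distinct a, b, c and D ⊆ N \ abc,
#   {<ab|cD>, <ac|D>} ⊆ I  must hold iff  {<ac|bD>, <ab|D>} ⊆ I.
N = {1, 2, 3, 4}

def subsets(s):
    s = list(s)
    return map(frozenset,
               chain.from_iterable(combinations(s, r) for r in range(len(s) + 1)))

def is_semigraphoid(I):
    for a, b, c in permutations(N, 3):
        for D in subsets(N - {a, b, c}):
            left = {(frozenset({a, b}), D | {c}), (frozenset({a, c}), D)} <= I
            right = {(frozenset({a, c}), D | {b}), (frozenset({a, b}), D)} <= I
            if left != right:
                return False
    return True

t = lambda a, b, C: (frozenset({a, b}), frozenset(C))
print(is_semigraphoid({t(1, 2, {3}), t(1, 3, ())}))   # False: violates (4)
print(is_semigraphoid({t(1, 2, {3}), t(1, 3, ()), t(1, 3, {2}), t(1, 2, ())}))  # True
```

The first model contains ⟨12|3⟩ and ⟨13|∅⟩ but not the pair ⟨13|2⟩, ⟨12|∅⟩ required by (4), so it cannot be discrete or Gaussian representable; the second model completes the pair and passes.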
Two independence models I and J over N and M are isomorphic if there exists a bijective mapping π from N to M such that ⟨xy|Z⟩ ∈ I ⟺ ⟨π(x)π(y)|π(Z)⟩ ∈ J, where π(Z) stands for {π(z); z ∈ Z}. See Figure 2 for an example of three isomorphic models. An equivalence class of independence models with respect to the isomorphism relation will be referred to as a type. It is easy to see that models of the same type are either all representable or none of them is representable. Consequently, we can classify an entire type as representable or non-representable.

Fig. 2: Example of three isomorphic models.

Definition. If I is an independence model over N = {1, ..., n} and E, F are disjoint subsets of N, then the minor I[F/E] is the independence model over N \ EF given by

  I[F/E] = { ⟨ab|C⟩ ; E ∩ (abC) = ∅, ⟨ab|CF⟩ ∈ I }.

Minorization of independence models is an analogy of the operations of marginalization and conditioning of distributions. Note that I[∅/∅] = I and (I[F_1/E_1])[F_2/E_2] = I[F_1F_2/E_1E_2].

In the discrete distributional framework, the intersection of representable models is also representable (see Lemma 21, pp. 46 in the thesis, or [17] for a proof):

Lemma 7. If I and I′ are d-representable (d+-representable) models, then I ∩ I′ is d-representable (d+-representable).

From Lemma 7 it follows that if I is a d-representable (d+-representable) independence model, then all its minors I[F/E] are also d-representable (d+-representable); i.e. the classes of d/d+ representable models are closed under the operation of minorization (cf. Lemma 22, pp. 46 in the thesis). The classes of g/g+ representable models are also closed under minors² due to Lemma 4 (cf. Lemma 11, pp. 26 in the thesis).

² However, this property does not hold for the classes of b/b+ representable models: I = { ⟨ab|cd⟩, ⟨ab|d⟩ } is b+ representable, but I[d/∅] = { ⟨ab|c⟩, ⟨ab⟩ } is not b-representable.
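The minor operation has a direct transcription; a sketch under the same illustrative triplet encoding (pairs of frozensets, my own convention):

```python
# Minor I[F/E] of a formal independence model I: keep the triplets <ab|CF>
# whose elements avoid E (and whose pair avoids F), and strip F from the
# conditioning set.
def minor(I, E, F):
    E, F = frozenset(E), frozenset(F)
    out = set()
    for ab, C in I:
        if ab & (E | F) or C & E:
            continue                # a, b and C must avoid E; a, b must avoid F
        if F <= C:                  # only triplets conditioned on all of F survive
            out.add((ab, C - F))
    return out

ab = lambda a, b: frozenset({a, b})
I = {(ab(1, 2), frozenset({4})), (ab(1, 3), frozenset())}
print(minor(I, E=set(), F={4}) == {(ab(1, 2), frozenset())})   # True
```

The toy model above conditions ⟨12|4⟩ on F = {4}, which the minor turns into the unconditional statement ⟨12|∅⟩, while ⟨13|∅⟩ is dropped because it is not conditioned on F.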
1.5 Graphical Independence Models

In applications, the independence model determining a class of probability distributions is usually given not as a list of prescribed conditional independencies but as an undirected or directed acyclic graph. This is not only for reasons of interpretation³ but also because of the nice theoretical properties of such models (like exact formulas for ML estimates and computationally efficient ways to evaluate conditional distributions in the case of decomposable models). The theory is described in detail in [2] and [8]. For a short introduction see Chapter 4 of the thesis. Here, only undirected graphs and independence models derived by the global Markov property will be presented.

Definition. An undirected graph G = (N, E) is a pair of a finite vertex set N and a set of undirected edges between these vertices,

  E ⊆ { ab ; a, b ∈ N, a ≠ b }.

Usually, vertices are visualized as dots and edges as lines between them (see Figure 3). A path between two distinct vertices a, b ∈ N is a sequence of k ≥ 2 vertices a = v_1, ..., v_k = b such that v_i v_{i+1} ∈ E for i = 1, 2, ..., k−1. Let us assume a, b ∈ N are two distinct vertices and C ⊆ N \ ab. The vertices a and b are said to be separated by C (in symbols a ⊥ b | C) if every path between a and b contains a vertex from C.

Fig. 3: An undirected graph G.

Definition. The independence model derived from a graph⁴ G = (N, E) is the independence model over N defined as

  I(G) = { ⟨ab|C⟩ ; a ⊥ b | C in the graph G }.

³ An edge stands for a significant correlation, an arrow means direct causality.
⁴ There exist more ways to derive a model from a graph; see pp. 61 in the thesis. For reasons of simplicity we restrict ourselves here to the so-called global separation criterion.
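The global separation criterion is a plain reachability test once the vertices of C are deleted. A minimal sketch using the graph G of Figure 3 (edge set as stated in the Example):

```python
from collections import deque

# a ⊥ b | C in an undirected graph iff b is unreachable from a after the
# vertices of C are removed, i.e. every path from a to b hits C.
edges = {(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (3, 5)}
adj = {v: set() for v in range(1, 6)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def separated(a, b, C):
    """True iff a ⊥ b | C in the undirected graph."""
    seen, queue = {a} | set(C), deque([a])   # vertices of C act as blocked
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w == b:
                return False
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return True

print(separated(2, 5, {3}))   # True: vertex 3 blocks every path from 2 to 5
print(separated(2, 5, {1}))   # False: e.g. the path 2-3-5 avoids vertex 1
```

Enumerating `separated(a, b, C)` over all pairs and conditioning sets reproduces the model I(G) listed in the Example.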
Example. Consider the graph G = (N, E) in Figure 3. The vertex set is N = {1, 2, 3, 4, 5}; the graph contains the edges E = {12, 13, 23, 24, 34, 35}. Every two vertices are connected by a path. Vertices 2 and 5 are separated by the vertex 3 or by the set of vertices 34, but not by the vertex 1 or 4 alone. The independence model derived from the graph G is

  I(G) = { ⟨14|23⟩, ⟨14|235⟩, ⟨15|3⟩, ⟨15|34⟩, ⟨15|23⟩, ⟨15|234⟩, ⟨25|3⟩, ⟨25|34⟩, ⟨25|13⟩, ⟨25|134⟩, ⟨45|3⟩, ⟨45|13⟩, ⟨45|23⟩, ⟨45|123⟩ }.

The main drawback of this approach is that there are many more representable models than just the graphical ones. For instance, in the case |N| = 4 there are only 11 graphical types, while there are 80 g-representable, 53 g+-representable, 1098 d-representable and at least 304 d+-representable types. On the other hand, it is unfeasible to fit data to most of the non-graphical independence models.

2 No Finite Characterization Exists

Definition. An independence model J over M is a forbidden minor for a class of independence models C if no model in C has a minor isomorphic to J. A class of independence models C is characterized by a finite collection of forbidden minors D if any independence model I satisfies: I ∈ C if and only if no minor of I is isomorphic to an element of D. Note that there are other possible ways to define a finite characterization, such as a characterization by special inference rules (cf. [17]).

Theorem 1. None of the classes of g/g+/d/d+/b/b+ representable models can be characterized by a finite collection of forbidden minors.

Theorem 1 is proved at the end of Chapters 2 and 3 of the thesis, Theorems 3 and 5 (or see [11]).

3 Representable Models Over N = {1, 2, 3, 4}

For N consisting of fewer than four elements, the problem of representability is easy to solve. All models over |N| ≤ 2 are representable in any of the above-mentioned senses. Potential candidates for representability over N = {1, 2, 3} are the models fulfilling properties (4) and (5) from Lemma 6; these fall into 10 types (8 types in the case of positive representability).
Regarding d/d+ representability there are no other constraints (all such models are representable). Regarding b/b+ representability, I = { ⟨ab|c⟩, ⟨ab⟩ } is not binary representable (cf. Lemma 19, pp. 42 in the thesis). Regarding g/g+ representability, I = { ⟨ab|c⟩, ⟨ab⟩ }, I = { ⟨ab⟩, ⟨ac⟩ } and I = { ⟨ab⟩, ⟨ac⟩, ⟨bc⟩ } are not Gaussian representable (cf. Lemma 10, pp. 24 in the thesis). That is why we focus on N = {1, 2, 3, 4} from now until the end of the section.
3.1 Gaussian Distributional Framework

All minors of a g+ representable model must themselves be g+ representable. The list of all types over N = {1, 2, 3, 4} that have only g+ representable minors is plotted in Figure 4 on page 12. As shown in [3], the last 5 of them, M54-M58, are not g+ representable. For the rest of them, a corresponding variance matrix has been found by a search through the space of all integer positive definite matrices. Therefore, there are 629 g+ representable independence models corresponding to 53 types, M1-M53.

For general Gaussian representability, the following two properties are shown in [12] (or the thesis, Lemma 13, pp. 29):

Lemma 8. Let I be a g-representable independence model and a, b, c, d distinct elements of N.

i) If { ⟨ab|c⟩, ⟨ac|b⟩ } ⊆ I, then either ξ_b ≡ ξ_c or { ⟨ab⟩, ⟨ac⟩ } ⊆ I.
ii) If { ⟨ab|cd⟩, ⟨ac|bd⟩ } ⊆ I, then either ξ_b ≡ ξ_c, or { ⟨ab|d⟩, ⟨ac|d⟩ } ⊆ I, or ⟨ad|bc⟩ ∈ I,

where ξ_b ≡ ξ_c stands for functional dependence⁵ between ξ_b and ξ_c, yielding { ⟨ab|c⟩, ⟨bd|c⟩, ⟨ac|b⟩, ⟨cd|b⟩, ⟨ab|cd⟩, ⟨bd|ac⟩, ⟨ac|bd⟩, ⟨cd|ab⟩ } ⊆ I.

M1-M88 in Figures 4 and 5 are the types having only g-representable minors and not contradicting Lemma 8. However, it is proved in [12] (or the thesis, Lemma 14, pp. 30) that just M54-M58 and M86-M88 are not g-representable. Therefore, there are 877 g-representable independence models falling into 80 types. The achieved results are summarized in the following theorem:

Theorem 2. There exist 877 g-representable independence models over |N| = 4, assorting into 80 types. Among them there are 629 g+ representable independence models (53 types), and these are exactly the models meeting condition (5).

There exists a conjecture that all g+ representable independence models are b+ representable. To show this for |N| = 4, b+ representations of M25 and M37 are needed.
The former analogous conjecture that all g-representable independence models are b-representable was rejected due to M71, which is g-representable but cannot be represented by a discrete distribution (or any other distribution with finite multiinformation). The problem is listed in the Open Problems chapter of [15].

⁵ i.e. the correlation between ξ_b and ξ_c equals ±1
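The induced model I(ξ) of a regular Gaussian vector can be enumerated directly from its variance matrix via Lemma 5 ii). A minimal sketch (the example matrix is my own illustrative choice, not one of the representations from Figure 4; variables are 0-based here):

```python
from itertools import combinations

# <ab|C> belongs to I(xi) iff det Sigma_{aC x bC} = 0 (Lemma 5 ii),
# valid here because the example matrix is positive definite.
N = (0, 1, 2, 3)

def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def induced_model(S, tol=1e-9):
    I = set()
    for a, b in combinations(N, 2):
        rest = [v for v in N if v not in (a, b)]
        for r in range(len(rest) + 1):
            for C in combinations(rest, r):
                sub = [[S[i][j] for j in (b,) + C] for i in (a,) + C]
                if abs(det(sub)) < tol:
                    I.add((frozenset((a, b)), frozenset(C)))
    return I

# rho_12 = rho_14 * rho_24 = 0.25 with variable 3 independent of the rest,
# so in particular the elementary statement 12|4 holds (cf. constraint M2).
S = [[1.0, 0.25, 0.0, 0.5],
     [0.25, 1.0, 0.0, 0.5],
     [0.0, 0.0, 1.0, 0.0],
     [0.5, 0.5, 0.0, 1.0]]
I = induced_model(S)
print((frozenset((0, 1)), frozenset((3,))) in I)   # True: 12|4 is induced
```

Running this over candidate integer matrices is, in spirit, the kind of search used to find the g+ representations of M1-M53, although the thesis does not prescribe this particular code.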
Fig. 4: List of types M1-M53 and their Gaussian representations.
Fig. 5: List of types M54-M80 and their Gaussian representations.
3.2 Discrete Distributional Framework

The problem of (general) discrete representability has been solved by F. Matúš in [5], [6] and [7]; see [16] for an overview of the results. Only partial results exist for d+ representability and b/b+ representability at this moment (cf. [14] or Chapter 3 of the thesis).

3.2.1 Discrete representability

In brief, due to Lemma 7 the intersection of two d-representable models is also d-representable. As a result, the class of all d-representable models over N can be described by a set C of so-called irreducible models, i.e. nontrivial⁶ d-representable models that cannot be written as an intersection of two other d-representable models. It is not difficult to verify that a nontrivial independence model I is d-representable if and only if there exists A ⊆ C such that I = ⋂_{C′∈A} C′. As shown in [5], [6] and [7], there are only 13 types of irreducible models; see Figure 6.

Fig. 6: Irreducible types.

Theorem 3. There exist d-representable independence models assorting into 1098 types.

⁶ i.e. different from the set of all triplets over N, T_N
3.2.2 Positive binary representability

There are many special properties of b+ representable models. A few of them are listed below (proved in Lemma 20, pp. 43 in the thesis):

Lemma 9. If I is a b+ representable independence model over N and a, b, c, d are distinct elements of N, then:

i) { ⟨ab|cd⟩, ⟨ab|c⟩, ⟨ad⟩, ⟨cd⟩ } ⊆ I ⟹ { ⟨ad|c⟩, ⟨cd|a⟩ } ⊆ I or ⟨bd⟩ ∈ I
ii) { ⟨ab|cd⟩, ⟨ab|c⟩, ⟨ad⟩, ⟨bd⟩ } ⊆ I ⟹ { ⟨ad|b⟩, ⟨bd|a⟩ } ⊆ I or ⟨cd⟩ ∈ I
iii) { ⟨ab|cd⟩, ⟨ab⟩, ⟨bc⟩, ⟨cd⟩, ⟨ad⟩ } ⊆ I ⟹ { ⟨ad|c⟩, ⟨cd|a⟩ } ⊆ I or { ⟨bd|c⟩, ⟨cd|b⟩ } ⊆ I
iv) { ⟨ab|cd⟩, ⟨ad|b⟩, ⟨cd|b⟩ } ⊆ I ⟹ { ⟨bc⟩, ⟨cd⟩, ⟨bc|d⟩ } ⊆ I or ⟨ab|c⟩ ∈ I
v) { ⟨ab|cd⟩, ⟨ab⟩, ⟨bc⟩, ⟨bd⟩, ⟨ab|c⟩, ⟨bc|a⟩ } ⊆ I ⟹ ⟨ad⟩ ∈ I or ⟨ac⟩ ∈ I or ⟨cd⟩ ∈ I or ⟨ad|c⟩ ∈ I or { ⟨ab|d⟩, ⟨bd|a⟩ } ⊆ I
vi) { ⟨ab|cd⟩, ⟨ab|d⟩, ⟨bc|a⟩ } ⊆ I ⟹ ⟨ac⟩ ∈ I or ⟨ad⟩ ∈ I or ⟨cd⟩ ∈ I or ⟨cd|a⟩ ∈ I or ⟨ac|d⟩ ∈ I

Using the parametrization from Section 1, a search for rational representations was performed. The result is 139 b+ representable types. However, many types still remain undecided with respect to b+ representability.

3.2.3 Positive discrete representability

On the one hand, all d+ representable independence models must be d-representable and must fulfill condition (5). There are 5547 such models (356 types). On the other hand, all b+ representable models are d+ representable. Closing the 139 b+ representable types found above under intersection (thanks to Lemma 7), we obtain 4555 d+ representable models (299 types). The 356 − 299 = 57 undecided types are shown in Figure 7. Further, the types P12, P49, P50, P52, P54 (5 types, 52 models) were found to be d+ representable by a direct search for a representation (with a little help from computational software). We may conclude (Theorem 4, pp. 51 in the thesis):

Theorem 4. There exist between 4607 and 5547 d+ representable independence models (304-356 types).

It has been conjectured that all 5547 models (356 types) are d+ representable. To prove this, representations of the types P2, P3, P19, P24, P30, P32, P37, P39, P41, P45, P46 and P57 from Figure 7 have to be found.
Fig. 7: d-representable types meeting (5) but not found as an intersection of the 139 types known to be b+ representable.
4 Probability Distributions Over an Independence Model

Definition. If I is an independence model, then the family of distributions of random vectors ξ such that I ⊆ I(ξ) will be denoted by P(I).

Usually we fix the distributional framework; this is indicated by adding a lower index to P. For example, P_g({⟨12⟩}) is the family of all Gaussian distributions of ξ such that ξ_1 ⊥⊥ ξ_2 (i.e. the element σ_12 of the variance matrix is zero).

4.1 Connection Between Graphical Independence Models and Exponential Families In Regular Distributional Framework

In the regular Gaussian distributional framework there is an interesting connection between graphical independence models and exponential families: all graphical models are g+ representable, and the families of (regular) Gaussian distributions over them are exponential, as stated in the following lemma (cf. [2]).

Lemma 10. If G is an undirected graph, then P_{g+}(I(G)) is exponential.

More importantly, this property can be reversed. A sketch of a proof is included in [13], but the idea comes from oral communication with F. Matúš.

Lemma 11. If P_{g+}(I) is exponential, then there exists an undirected graph G such that I = I(G).

4.2 Model Search

Searching for an appropriate statistical model based on data consists of two processes: parametrization of P(I) and estimation of the parameters (usually by maximization of the log-likelihood function) over P(I) for fixed I; and searching for the right independence model I, either by CI testing or by maximization of some information criterion (like AIC or BIC). For an overview see Sections 4.2 and 4.3 of the thesis.

Consider now |N| = 4 and the regular Gaussian distributional framework, so that all representable models can be parameterized. There are 53 g+ representable types, but only 11 of them are graphical. Let X be a dataset with 4 columns and k rows (observations). The estimate of the mean parameter is the same for all models:

  µ̂_a = (1/k) Σ_{b=1}^{k} X_{ba},  a = 1, ..., 4.
The variance matrix Σ may be parameterized as

  Σ = diag(σ_11, σ_22, σ_33, σ_44) · Σ_0 · diag(σ_11, σ_22, σ_33, σ_44),

where Σ_0 is the correlation matrix

  Σ_0 = ( 1     ρ_12  ρ_13  ρ_14
          ρ_21  1     ρ_23  ρ_24
          ρ_31  ρ_32  1     ρ_34
          ρ_41  ρ_42  ρ_43  1    ).

Often a parametrization by the inverse variance matrix V = Σ^{-1} is more natural. Analogously to the above, V may be parameterized as

  V = diag(v_11, v_22, v_33, v_44) · V_0 · diag(v_11, v_22, v_33, v_44),

where V_0 = (w_ij)_{i,j=1,...,4} is a matrix with ones on the diagonal. The correspondence between independence restrictions on Σ and on V has been described in Lemma 5 iv).

The parameters σ_11, ..., σ_44 can be estimated as

  σ̂²_aa = (1/k) Σ_{b=1}^{k} (X_{ba} − µ̂_a)²,  a = 1, ..., 4;

analogously, v_11, ..., v_44 can be estimated from the diagonal of the maximum likelihood estimate of Σ^{-1} in the saturated model (no independence restrictions). The constraints on the remaining parameters for each of the types M1-M53 (see Figure 4, pp. 12) are listed below. The estimates of the off-diagonal elements of Σ_0 or V_0 must be found by some iterative optimization method (like optim in the R environment, [10]).

M1) ρ_12 = 0
M2) ρ_12 = ρ_14 ρ_24
M3) ρ_12 = (ρ_14 ρ_24 + ρ_13 ρ_23 − ρ_13 ρ_24 ρ_34 − ρ_14 ρ_23 ρ_34) / (1 − ρ²_34)
M4) ρ_12 = 0, ρ_13 = ρ_14 (ρ_23 ρ_34 − ρ_24) / (ρ_24 ρ_34 − ρ_23)
M5) ρ_12 = 0, ρ_34 = (ρ_23 (1 − ρ²_14) + ρ_13 ρ_14 ρ_24) / ρ_24
M6) ρ_12 = 0, ρ_34 = ρ_13 ρ_14 + ρ_23 ρ_24
M7) ρ_12 = 0, ρ_34 = ρ_13 ρ_14
M8) ρ_12 = ρ_14 ρ_24, ρ_34 = ρ_13 ρ_14
M9) w_12 = 0, w_34 = 0
M10) ρ_12 = ρ_14 ρ_24, ρ_13 = (ρ_34 + ρ_23 ρ_24 ρ²_14 − ρ_23 ρ_24 − ρ²_14 ρ²_24 ρ_34) / (ρ_14 (1 − ρ²_24))
M11) ρ_12 = 0, ρ_23 = ρ_24 ρ_34
M12) ρ_12 = 0, ρ_34 = 0
M13) ρ_12 = ρ_14 ρ_24, ρ_13 = ρ_14 ρ_24 ρ_23
M14) ρ_12 = ρ_14 ρ_24, ρ_13 = ρ_14 (1 − ρ²_23 − ρ²_24 + ρ_23 ρ_24 ρ_34) / (ρ_34 − ρ_23 ρ_24)
M15) ρ_14 = ρ_13 ρ_34, ρ_12 = ρ_24 ρ_13 ρ_34
M16) w_14 = 0, w_23 = 0, w_12 = w_13 w_24 w_34 / (1 − w²_34)
M17) ρ_12 = 0, ρ_14 = ρ_13 ρ_34, ρ_23 = ρ_34 (1 − ρ²_13) / ρ_24
M18) ρ_12 = 0, ρ_34 = ρ_13 ρ_14, ρ_24 = −ρ_14 (1 − ρ²_13 − ρ²_23) / (ρ_13 ρ_23)
M19) ρ_12 = 0, ρ_34 = 0, ρ_13 = −ρ_23 (1 − ρ²_14) / (ρ_14 ρ_24)
M20) w_12 = 0, w_34 = 0, w_13 = w_14 w_24 / w_23
M21) ρ_12 = 0, ρ_34 = ρ_13 ρ_14, ρ_23 = −ρ_24 ρ_14 (1 − ρ²_13) / (ρ_13 (1 − ρ²_14))
M22) ρ_12 = 0, ρ_34 = ρ_23 ρ_24, ρ_13 = ρ_23 ρ_24 ρ_14
M23) ρ_12 = 0, ρ_23 = ρ_24 ρ_34, ρ_14 = ρ_13 ρ_34
M24) ρ_12 = 0, ρ_34 = ρ_23 ρ_24, ρ_14 = ρ_13 ρ_23 ρ_24
M25) ρ_12 = 0, ρ_14 = ρ_13 ρ_34, ρ_23 = (1 − ρ²_24 − ρ²_34) / (ρ_24 ρ_34)
M26) ρ_12 = ρ_14 ρ_24, ρ_13 = ρ_14 ρ_24 ρ_23, ρ_34 = ρ_23 ρ_24
M27) ρ_12 = 0, ρ_14 = ρ_13 ρ_34, ρ_24 = ρ_23 (1 − ρ²_13 ρ²_34) / (ρ_34 (1 − ρ²_13))
M28) ρ_12 = ρ_14 ρ_24, ρ_23 = ρ_14 ρ_24 ρ_13, ρ_34 = ρ_14 ρ²_24 ρ_13
M29) w_14 = 0, w_12 = w_13 w_23, w_34 = w_23 w_24
M30) ρ_12 = 0, ρ_34 = 0, ρ_13 = −ρ_14 ρ_24 / ρ_23
M31) w_23 = 0, w_14 = w_13 w_34, w_12 = w_13 w_34 w_24
M32) w_34 = 0, w_12 = w_14 w_24, w_13 = w_14 w_24 w_23
M33) ρ_12 = 0, ρ_23 = 0
M34) w_12 = 0, w_23 = 0
M35) ρ_12 = 0, ρ_34 = 0, ρ_13 = ±ρ_24, ρ_14 = ρ_23
M36) ρ_34 = ρ_23 ρ_24, ρ_12 = ±ρ_23 ρ_24, ρ_13 = ±ρ_24, ρ_14 = ±ρ_23
M37) ρ_12 = 0, ρ_23 = ρ_24 ρ_34, ρ_13 = ±√(1 − ρ²_24), ρ_14 = ±ρ_34 √(1 − ρ²_24)
M38) ρ_12 = 0, ρ_14 = 0, ρ_23 = ρ_24 (1 − ρ²_13) / ρ_34
M39) ρ_12 = 0, ρ_23 = 0, ρ_13 = ρ_14 ρ_34
M40) w_14 = 0, w_24 = 0, w_13 = w_12 (1 − w²_34) / w_23
M41) w_14 = 0, w_24 = 0, w_12 = w_13 w_23
M42) ρ_12 = 0, ρ_13 = 0, ρ_23 = 0
M43) w_12 = 0, w_13 = 0, w_23 = 0
M44) ρ_12 = 0, ρ_23 = 0, ρ_14 = ρ_13 ρ_34
M45) ρ_12 = 0, ρ_23 = 0, ρ_34 = 0
M46) w_12 = 0, w_23 = 0, w_34 = 0
M47) ρ_12 = 0, ρ_13 = 0, ρ_14 = 0
M48) ρ_12 = 0, ρ_23 = 0, ρ_13 = 0, ρ_14 = 0
M49) w_12 = 0, w_13 = 0, w_23 = 0, w_14 = 0
M50) ρ_12 = 0, ρ_14 = 0, ρ_23 = 0, ρ_34 = 0
M51) ρ_12 = 0, ρ_13 = 0, ρ_14 = 0, ρ_23 = 0, ρ_34 = 0
M52) ρ_12 = 0, ρ_13 = 0, ρ_14 = 0, ρ_24 = 0, ρ_23 = 0, ρ_34 = 0
M53) saturated model (no constraints)

Model selection can be done by the AIC/BIC criterion. The dimension D of the parameter space equals 14 minus the number of independence constraints for the given type. For example, in the case of M11, D = 14 − 2 = 12 because there are constraints on ρ_12 and ρ_23.
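The dimension bookkeeping can be sketched as follows (the constraint counts are read off the list above for a few example types; `loglik` stands for the maximized log-likelihood of the fitted model, and the function names are mine):

```python
import math

# D = 14 (4 means + 4 variances + 6 off-diagonal parameters) minus the
# number of independence constraints of the type (illustrative subset).
constraints = {'M1': 1, 'M11': 2, 'M12': 2, 'M52': 6, 'M53': 0}

def dimension(model):
    return 14 - constraints[model]

def aic(loglik, model):
    # AIC = -2 * log-likelihood + 2 * D
    return -2.0 * loglik + 2.0 * dimension(model)

def bic(loglik, model, k):
    # BIC = -2 * log-likelihood + D * log(k), k = number of observations
    return -2.0 * loglik + dimension(model) * math.log(k)

print(dimension('M11'))   # 12, as computed for M11 in the text
```

Model selection then amounts to fitting each candidate type and keeping the one with the smallest AIC or BIC value.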
4.3 Simulation Study

The variance matrix estimates and the model selection algorithm from the previous subsection have been implemented in the R package yest attached to the thesis. The model selection process takes about 9 minutes (AMD Sempron 3500+, 1.8 GHz, 1 GB RAM). Note that because the sufficient statistic for all the estimates above is the maximum likelihood parameter estimate in the saturated model, the size of the dataset does not strongly influence the computation time.

Validity has been verified by the following algorithm:

1. For each of the 53 types, take one model (denoted by I) and randomly generate an inverse variance matrix corresponding to I (denoted by V).
2. Generate a dataset with k rows, each row from a multivariate normal distribution with zero mean parameter and variance matrix V^{-1}.
3. Based on AIC/BIC, select one of the 629 g+ representable independence models (denoted by J).
4. The model selection is called successful if I = J; otherwise, it is not successful.

The transcription to R follows:

    I <- ind.identification(type=testovany.typ)$model
    V <- ind.rgauss(model=I)
    data <- generate.data(V, k)
    J <- model.selection(data)$model
    if (I == J) print("Success") else print("Failure")

Fig. 8: Success rate of AIC (left) and BIC (right) model selection; probability of success plotted against log(k).
The simulation was repeated four times for each of the 53 types and 15 possible values of k. A graph of the average success rate as a function of log(k) is shown in Figure 8.

Failure of model selection does not automatically imply bad behavior of the variance matrix estimate. The simulation study was therefore repeated, but this time the Kullback-Leibler divergence⁷ between the estimated and the true distribution was measured. Four estimates were evaluated: an estimate based on the saturated model (the usual one), an estimate based on BIC model selection over graphical models, an estimate based on BIC model selection over all independence models, and an estimate based on the actual independence model.

The results are visualized in Figure 9: the trimmed⁸ (20%) average difference between the KL divergence of the inspected estimate from the true model and the KL divergence of the estimate based on the saturated model from the true model is drawn as a function of the logarithm of the number of lines k in the dataset.

Fig. 9: Saturated model full line, graphical models dashed line, independence models dotted line, actual independence model dash-dotted line.

We may conclude that to have a good chance (> 80%) of determining the true independence model by BIC model selection, we need about 50 thousand observations. However, for such a large amount of data the quality of the variance estimator is similar to that of the estimator based on the saturated model (the usual one). It seems that the improvement (in the sense of average KL divergence in comparison with the estimator based on the saturated model) is significant if the number of observations is between 200 and

It is obvious that the quality of estimation of the variance matrix using just the graphical models is unsatisfactory if the true model is not graphical and the number of observations is below

An interesting direction of future research might be the study of the distributional families above in terms of curved exponential families (to speed up the computation) and the estimation of variance matrices for more than four Gaussian distributed random variables.

⁷ That is, $\mathrm{DIV}(\mu, \Sigma, \bar{\mu}, \bar{\Sigma}) = \frac{1}{2}\left(\log\frac{|\bar{\Sigma}|}{|\Sigma|} + \mathrm{tr}(\bar{\Sigma}^{-1}\Sigma) + (\bar{\mu}-\mu)^T \bar{\Sigma}^{-1} (\bar{\mu}-\mu) - 4\right)$, where $\mu, \Sigma$ and $\bar{\mu}, \bar{\Sigma}$ are the parameters of the first and the second distribution.
⁸ Without trimming the figures are similar but a bit more messy.
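The divergence formula in footnote 7 splits coordinate-wise when both covariance matrices are diagonal, which makes it easy to sanity-check by hand. The following Python sketch is my illustration, not the thesis code (the thesis works with full 4x4 matrices through the R package yest): it implements that diagonal special case, together with a 20% trimmed mean of the kind used for the curves in Figure 9.

```python
import math

def kl_gauss_diag(mu, var, mu_bar, var_bar):
    """KL divergence between Gaussians with diagonal covariances.

    Diagonal specialization of the DIV formula: the log-determinant,
    trace, and quadratic terms each split per coordinate, and the
    constant 4 generalizes to the dimension d."""
    d = len(mu)
    return 0.5 * sum(
        math.log(vb / v) + v / vb + (mb - m) ** 2 / vb
        for m, v, mb, vb in zip(mu, var, mu_bar, var_bar)
    ) - 0.5 * d

def trimmed_mean(xs, frac=0.2):
    """Average after discarding a fraction frac of the smallest and a
    fraction frac of the largest values."""
    xs = sorted(xs)
    k = int(len(xs) * frac)
    kept = xs[k:len(xs) - k]
    return sum(kept) / len(kept)
```

For identical parameters the divergence is exactly zero, and the trimmed mean discards extreme KL differences before averaging, which is the smoothing effect footnote 8 alludes to.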
List of Publications

P. Šimeček. A short note on structure learning. In Jana Šafránková, editor, Proceedings of the 14th Annual Conference of Doctoral Students - WDS 2004. Matfyzpress, Prague.

P. Šimeček and M. Studený. Využití Hilbertovy báze k ověřování shodnosti strukturálních a kombinatorických imsetů. In Jaromír Antoch and Gejza Dohnal, editors, Sborník prací 13. letní školy JČMF ROBUST. Jednota českých matematiků a fyziků.

P. Šimeček. On the minimal probability of intersection of conditionally independent events. Submitted to Kybernetika.

P. Šimeček. Gene Expression Data Analysis for In Vitro Toxicology. In Jaromír Antoch and Gejza Dohnal, editors, Sborník prací 14. letní školy JČMF ROBUST. Jednota českých matematiků a fyziků.

Č. Štuka and P. Šimeček. Studium souvislostí mezi úspěšností studia medicíny, známkami na střední škole a výsledky přijímacích zkoušek. In Sborník konference MEDSOFT. Action M Agency.

P. Šimeček. Classes of Gaussian, discrete and binary representable independence models have no finite characterization. In M. Hušková and M. Janžura, editors, Proceedings of Prague Stochastics 2006, CD volume. MATFYZPRESS.

P. Šimeček. Gaussian representation of independence models over four random variables. In A. Rizzi, editor, Proceedings of COMPSTAT 2006, CD volume. Springer.

P. Šimeček. A short note on discrete representability of independence models. In M. Studený and J. Vomlel, editors, Proceedings of Workshop on Probabilistic Graphical Models 2006. Action M Agency.

P. Šimeček. Independence models. In J. Vejnarová, editor, Proceedings of Workshop on Uncertainty Processing 2006. University of Economics.