Total positivity in Markov structures

Steffen Lauritzen (1)
Department of Mathematical Sciences, Faculty of Science
CRM, Montreal, July 2016
(1) Based on joint work with Shaun Fallat, Kayvan Sadeghi, Caroline Uhler, Nanny Wermuth, and Piotr Zwiernik (arXiv:1510.01290)
Slide 1/27

1 Positive association and multivariate total positivity
2 Multivariate Gaussian MTP$_2$ distributions
3 Conditional independence and Markov properties
4 Totally positive Markov distributions
5 Special instances of total positivity
Slide 2/27

Positive dependence and Simpson's paradox
Two real-valued random variables $X$ and $Y$ are positively associated if $\mathrm{Cov}\{f(X), g(Y)\} \geq 0$ for all non-decreasing $f$ and $g$.
The Yule-Simpson paradox says that $X$ and $Y$ may be positively associated, yet negatively associated conditionally on a third variable $Z$.
Multivariate total positivity (MTP$_2$) ensures that this cannot happen: associations can never change sign due to changes of context. Hence they might be of a causal nature...
Slide 3/27

Multivariate total positivity for functions
Let $f : \mathcal{X} = \prod_{v \in V} \mathcal{X}_v \to \mathbb{R}$, where the $\mathcal{X}_v$ are either discrete or open subsets of $\mathbb{R}$.
Definition. $f$ is multivariate totally positive of order two (MTP$_2$) if
    $f(x) f(y) \leq f(x \wedge y) f(x \vee y)$ for all $x, y \in \mathcal{X}$.
Here $\wedge$ (minimum) and $\vee$ (maximum) should be applied coordinatewise.
In the bivariate case, this property is known simply as total positivity or TP$_2$ (Karlin and Rinott, 1980).
A function $g$ is supermodular if
    $g(x \wedge y) + g(x \vee y) \geq g(x) + g(y)$ for all $x, y \in \mathcal{X}$.
Thus $g$ is supermodular iff $\exp(g)$ is MTP$_2$.
Slide 4/27
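
The definition can be checked by brute force when every $\mathcal{X}_v$ is a finite grid. The following is a minimal sketch under that assumption; the helper name `is_mtp2` and the two test functions are illustrative and not part of the slides.

```python
import math
from itertools import product

def is_mtp2(f, grids, tol=1e-12):
    """Check f(x) f(y) <= f(min(x, y)) f(max(x, y)) for all x, y on a finite grid.

    `grids` is one finite sequence of values per coordinate; min and max are
    taken coordinatewise, as in the definition.
    """
    points = list(product(*grids))
    for x in points:
        for y in points:
            meet = tuple(min(a, b) for a, b in zip(x, y))
            join = tuple(max(a, b) for a, b in zip(x, y))
            if f(x) * f(y) > f(meet) * f(join) + tol:
                return False
    return True

grids = [(0, 1, 2), (0, 1, 2)]
# exp(g) with supermodular g(x) = x_0 * x_1
attract = lambda x: math.exp(x[0] * x[1])
# exp(g) with g(x) = -x_0 * x_1, which is not supermodular
repel = lambda x: math.exp(-x[0] * x[1])

print(is_mtp2(attract, grids), is_mtp2(repel, grids))
```

The two examples illustrate the last statement of the slide: $x_0 x_1$ is supermodular, so its exponential passes the check, while the exponential of $-x_0 x_1$ fails it.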

Example
For $d = 2$ and $x_1 \leq x_2$, $y_1 \leq y_2$, the condition for MTP$_2$ simply becomes
    $f(x_1, y_2) f(x_2, y_1) \leq f(x_1, y_1) f(x_2, y_2)$,
or, alternatively,
    $\det \begin{pmatrix} f(x_1, y_1) & f(x_1, y_2) \\ f(x_2, y_1) & f(x_2, y_2) \end{pmatrix} \geq 0$.
Slide 5/27

Multivariate total positivity for distributions
For $\mathcal{X} = \prod_{v \in V} \mathcal{X}_v$ as before we adopt a standard base measure $\mu = \otimes_{v \in V} \mu_v$, where $\mu_v$ is counting measure if $\mathcal{X}_v$ is discrete and Lebesgue measure if $\mathcal{X}_v$ is an open subset of $\mathbb{R}$. We then define:
Definition. A distribution $P$ is said to be multivariate totally positive of order two (MTP$_2$) if its density w.r.t. the standard base measure $\mu$ is MTP$_2$.
Introduced and studied by Karlin and Rinott (1980), using results (the FKG inequality) from the fundamental paper by Fortuin et al. (1971).
We shall occasionally say "$X$ is MTP$_2$" instead of "the distribution of $X$ is MTP$_2$".
Slide 6/27

Example
For $d = 2$, let $f$ be the density of a Gaussian distribution with mean zero and covariance matrix
    $\Sigma = \begin{pmatrix} \sigma_{xx} & \sigma_{xy} \\ \sigma_{yx} & \sigma_{yy} \end{pmatrix}$.
Then
    $f(x_1, y_2) f(x_2, y_1) \leq f(x_1, y_1) f(x_2, y_2)$
if and only if $\sigma_{yx} \geq 0$, since this is equivalent to the mixed terms in the exponents satisfying
    $(x_1 y_2 + x_2 y_1)\,\sigma_{xy}/\det(\Sigma) \leq (x_1 y_1 + x_2 y_2)\,\sigma_{xy}/\det(\Sigma)$,
and if $\sigma_{xy} > 0$ this is equivalent to
    $(x_1 y_1 + x_2 y_2) - (x_1 y_2 + x_2 y_1) = (x_2 - x_1)(y_2 - y_1) \geq 0$.
Slide 7/27

Example
Consider binary $X$ and $Y$ with $p_{ij} = P(X = i, Y = j)$ for $i, j \in \{0, 1\}$. Then $P$ is MTP$_2$ if and only if
    $p_{01} p_{10} \leq p_{00} p_{11}$,
i.e. iff the odds-ratio $\theta = p_{00} p_{11} / (p_{01} p_{10})$ satisfies $\theta \geq 1$.
For three MTP$_2$ binary variables $X, Y, Z$ we have, for example,
    $p_{01k} p_{10k} \leq p_{00k} p_{11k}$, $k = 0, 1$,
and thus the conditional odds-ratios satisfy $\theta_k = p_{00k} p_{11k} / (p_{01k} p_{10k}) \geq 1$.
Slide 8/27
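
As a small numerical illustration of the odds-ratio conditions, the sketch below computes the conditional odds-ratios $\theta_k$ for a three-way table. The counts are hypothetical (chosen only for this example, summing to 100) and the function name is an assumption; since odds-ratios are scale-invariant, counts can be used in place of probabilities.

```python
# Hypothetical counts n[i][j][k] for (X, Y, Z) = (i, j, k); not data from the slides.
n = [
    [[30, 10], [10, 10]],
    [[5, 5], [10, 20]],
]

def conditional_odds_ratio(table, k):
    """theta_k = p_00k p_11k / (p_01k p_10k) for the X-Y association given Z = k."""
    return (table[0][0][k] * table[1][1][k]) / (table[0][1][k] * table[1][0][k])

thetas = [conditional_odds_ratio(n, k) for k in (0, 1)]
print(thetas)  # [6.0, 4.0]
```

Both conditional odds-ratios are at least 1, as the slide requires for the $X$-$Y$ association. This checks only one of the three pairs; a full MTP$_2$ check would also examine the $X$-$Z$ and $Y$-$Z$ associations.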

Examples of MTP$_2$ distributions
Mostly from Karlin and Rinott (1980):
- Characteristic roots of a Wishart matrix $W$, or of $W_1 W_2^{-1}$, or of $W_1 (W_1 + W_2)^{-1}$, where $W_1 \perp\!\!\!\perp W_2$ (Dykstra and Hewett, 1978);
- Ferromagnetic (attractive) Ising models (Lebowitz, 1972);
- Multivariate logistic density (Gumbel, 1961);
- Gaussian free fields (random height landscapes) (Dynkin, 1980);
- Markov chains with TP$_2$ transition densities;
- Order statistics $(X_{(1)}, \ldots, X_{(n)})$ if $X_1, \ldots, X_n$ are i.i.d. with density $f$;
- Gaussian latent tree models as in phylogenetics (Zwiernik, 2015);
- Many other examples...
Slide 9/27

Fundamental properties
A wealth of probability inequalities are satisfied for MTP$_2$ distributions (Karlin and Rinott, 1980). Also:
Proposition. Assume $X$ is MTP$_2$. Then
- If $A \subseteq V$, then the marginal $X_A = (X_v)_{v \in A}$ is MTP$_2$;
- If $C \subseteq V$, then the conditional distribution $\mathcal{L}(X_{V \setminus C} \mid X_C = x_C)$ is MTP$_2$ for almost all $x_C \in \mathcal{X}_C$;
- If $X$ is discrete and $Y$ is obtained from $X$ by collapsing neighboring states, then $Y$ is MTP$_2$;
- If $\varphi = (\varphi_v)_{v \in V}$ are non-decreasing, then $Y = \varphi(X)$ is MTP$_2$.
Slide 10/27

Positive association and MTP$_2$
Proposition. If $X$ is MTP$_2$, then $X$ is positively associated: if $f$ and $g$ are non-decreasing in each of their arguments, then
    $\mathrm{Cov}\{f(X), g(X)\} \geq 0$.
Proof. Discrete case by Fortuin et al. (1971). General case by Sarkar (1969).
Slide 11/27

Covariance and independence
Proposition. If $X$ is positively associated and $A, B \subseteq V$ are disjoint, then
    $X_A \perp\!\!\!\perp X_B \iff \mathrm{Cov}(X_u, X_v) = 0$ for all $u \in A$, $v \in B$.
Proof. Shown in Lebowitz (1972).
Such a result is usually special for the Gaussian distribution.
So learning MTP$_2$ structure may be based on correlation analysis.
Slide 12/27

Multivariate Gaussian MTP$_2$ distributions
Proposition. Let $X \sim N_V(0, \Sigma)$. Then $X$ is MTP$_2$ if and only if $K = \Sigma^{-1}$ is a positive definite Minkowski matrix (M-matrix), i.e. iff
    $k_{uv} \leq 0$ for $u \neq v$ and $u, v \in V$.
Proof. See Bølviken (1982) and Karlin and Rinott (1983).
Since $k_{uv}$ is proportional to the negative partial correlation between $X_u$ and $X_v$, $X$ is MTP$_2$ if and only if all partial correlations are non-negative.
Note also that this is a convex restriction in $K$.
Slide 13/27
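
The M-matrix criterion is easy to test numerically. The sketch below is a minimal pure-Python illustration, assuming $K$ is given as a symmetric list of lists; the function names and the example matrices are hypothetical. Positive definiteness is checked by attempting a Cholesky factorisation.

```python
import math

def is_positive_definite(K):
    """Attempt a Cholesky factorisation of symmetric K; it succeeds iff K is positive definite."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    return False
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True

def is_gaussian_mtp2(K):
    """True if K is positive definite with all off-diagonal entries <= 0 (an M-matrix)."""
    n = len(K)
    off_diagonal_ok = all(K[u][v] <= 0 for u in range(n) for v in range(n) if u != v)
    return off_diagonal_ok and is_positive_definite(K)

K_chain = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]   # non-positive off-diagonals
K_mixed = [[2, 1, 0], [1, 2, -1], [0, -1, 2]]     # one positive off-diagonal entry
K_indef = [[1, -2], [-2, 1]]                      # right signs, but not positive definite

print(is_gaussian_mtp2(K_chain), is_gaussian_mtp2(K_mixed), is_gaussian_mtp2(K_indef))
```

Only `K_chain` passes: `K_mixed` has a positive off-diagonal entry (a negative partial correlation) and `K_indef` is not a valid concentration matrix.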

Mathematics marks
             Mechanics  Vectors  Algebra  Analysis  Statistics
Mechanics         5.24    -2.44    -2.74      0.01       -0.14
Vectors           0.33    10.43    -4.71     -0.79       -0.17
Algebra           0.23     0.28    26.95     -7.05       -4.70
Analysis         -0.00     0.08     0.43      9.88       -2.02
Statistics        0.02     0.02     0.36      0.25        6.45
Empirical partial correlations (below the diagonal) and concentrations ($\times 1000$, on and above the diagonal) for 88 examination marks in five mathematical subjects. Off-diagonal concentrations carry the opposite sign of the corresponding partial correlations.
Essentially MTP$_2$.
Slide 14/27
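
The two halves of the table are linked by $\rho_{uv \mid \text{rest}} = -k_{uv} / \sqrt{k_{uu} k_{vv}}$. The sketch below recomputes the partial correlations from the concentrations. It assumes the off-diagonal concentrations have the signs implied by this formula and by the tabulated partial correlations (negative everywhere except the Mechanics-Analysis entry); the variable and function names are illustrative.

```python
import math

subjects = ["Mechanics", "Vectors", "Algebra", "Analysis", "Statistics"]

# Concentrations (x 1000); off-diagonal signs are an assumption, see the lead-in.
K = [
    [5.24, -2.44, -2.74, 0.01, -0.14],
    [-2.44, 10.43, -4.71, -0.79, -0.17],
    [-2.74, -4.71, 26.95, -7.05, -4.70],
    [0.01, -0.79, -7.05, 9.88, -2.02],
    [-0.14, -0.17, -4.70, -2.02, 6.45],
]

def partial_correlations(K):
    """rho_uv = -k_uv / sqrt(k_uu k_vv) for u != v; the diagonal is set to 1."""
    n = len(K)
    return [[1.0 if u == v else -K[u][v] / math.sqrt(K[u][u] * K[v][v])
             for v in range(n)] for u in range(n)]

rho = partial_correlations(K)
for u in range(1, len(subjects)):
    for v in range(u):
        print(f"{subjects[u]:<10} {subjects[v]:<10} {rho[u][v]: .2f}")
```

The only negative partial correlation is the one between Mechanics and Analysis, and it rounds to $-0.00$, which is why the slide calls the data essentially MTP$_2$.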

Mathematics marks under MTP$_2$
Fitting under the MTP$_2$ constraint yields $\hat{K}$, which conforms with the graphical model below.
[Figure: undirected graph on the nodes Vectors, Analysis, Algebra, Mechanics, Statistics]
Slide 15/27

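Abstract conditional independence
An independence model $\perp_\sigma$ is a ternary relation over subsets of $V$. It is a semi-graphoid if for disjoint subsets $A, B, C, D$:
(S1) if $A \perp_\sigma B \mid C$ then $B \perp_\sigma A \mid C$ (symmetry);
(S2) if $A \perp_\sigma (B \cup D) \mid C$ then $A \perp_\sigma B \mid C$ and $A \perp_\sigma D \mid C$ (decomposition);
(S3) if $A \perp_\sigma (B \cup C) \mid D$ then $A \perp_\sigma B \mid (C \cup D)$ (weak union);
(S4) if $A \perp_\sigma B \mid C$ and $A \perp_\sigma D \mid (B \cup C)$, then $A \perp_\sigma (B \cup D) \mid C$ (contraction).
Any probabilistic independence model $\perp\!\!\!\perp_P$ is a semi-graphoid.
It is a graphoid if (S1)-(S4) hold and
(S5) if $A \perp_\sigma B \mid (C \cup D)$ and $A \perp_\sigma C \mid (B \cup D)$ then $A \perp_\sigma (B \cup C) \mid D$ (intersection).
If $X$ has a density $f > 0$, its associated independence model $\perp\!\!\!\perp_P$ is a graphoid.
Slide 16/27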

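Conditional independence and total positivity
A probability distribution on $\mathcal{X}$ defines an independence model $\perp\!\!\!\perp_P$ by
    $A \perp\!\!\!\perp_P B \mid S \iff X_A \perp\!\!\!\perp_P X_B \mid X_S$.
Proposition (Fallat et al. 2016). If $X$ is MTP$_2$, its independence model $\perp\!\!\!\perp_P$ satisfies
(S6) $(A \perp\!\!\!\perp_P B \mid C) \wedge (A \perp\!\!\!\perp_P D \mid C) \implies A \perp\!\!\!\perp_P (B \cup D) \mid C$ (composition);
(S7) $(u \perp\!\!\!\perp_P v \mid C) \wedge (u \perp\!\!\!\perp_P v \mid (C \cup w)) \implies (u \perp\!\!\!\perp_P w \mid C) \vee (v \perp\!\!\!\perp_P w \mid C)$ (singleton transitivity);
(S8) $(A \perp\!\!\!\perp_P B \mid C) \wedge D \subseteq V \setminus (A \cup B) \implies A \perp\!\!\!\perp_P B \mid (C \cup D)$ (upward stability).
These are all fulfilled for separation $\perp_G$ in undirected graphs, but not necessarily for any probabilistic independence model $\perp\!\!\!\perp_P$.
Slide 17/27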

Markov properties
Let $P$ be a probability distribution on $\mathcal{X} = \prod_{v \in V} \mathcal{X}_v$. The pairwise independence graph $G(P) = (V, E)$ is defined through the relation
    $uv \notin E \iff u \perp\!\!\!\perp_P v \mid V \setminus \{u, v\}$.
In other words, $G(P)$ is the smallest graph $G$ such that $P$ is pairwise Markov w.r.t. $G$.
We say that $P$ is globally Markov w.r.t. a graph $G$ if
    $A \perp_G B \mid S \implies A \perp\!\!\!\perp_P B \mid S$,
where $\perp_G$ is separation in the graph $G$. Further, we say that $P$ is faithful to $G$ if
    $A \perp_G B \mid S \iff A \perp\!\!\!\perp_P B \mid S$,
i.e. if the independence models $\perp\!\!\!\perp_P$ and $\perp_G$ are identical.
Slide 18/27
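
Separation $\perp_G$ in an undirected graph can be decided by a breadth-first search that is not allowed to enter $S$. The sketch below assumes the graph is given as an adjacency dictionary and that $A$, $B$, $S$ are disjoint; the function name `separated` and the example graph are illustrative.

```python
from collections import deque

def separated(adj, A, B, S):
    """True if every path from A to B in the undirected graph `adj` passes through S."""
    A, B, S = set(A), set(B), set(S)
    seen = set(A - S)
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        if u in B:
            return False          # reached B without entering S
        for w in adj[u]:
            if w not in seen and w not in S:
                seen.add(w)
                queue.append(w)
    return True

# Chain 1 - 2 - 3 with an extra node 4 attached to 2.
adj = {1: {2}, 2: {1, 3, 4}, 3: {2}, 4: {2}}

print(separated(adj, {1}, {3}, {2}))   # True: 2 blocks the only path
print(separated(adj, {1}, {3}, {4}))   # False: the path 1 - 2 - 3 avoids 4
```

For a distribution that is faithful to $G$, each `True` corresponds to a conditional independence in $P$ and each `False` to a conditional dependence.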

A main result
Theorem (Fallat et al. 2016). Assume the distribution $P$ of $X$ is MTP$_2$ with strictly positive density $f > 0$. Then $P$ is faithful to $G(P)$.
In other words, for MTP$_2$ distributions, the pairwise independence graph yields a complete picture of the independence relations in $P$.
It also implies that if $P$ is faithful to a DAG $D$ and $P$ is MTP$_2$, then $D$ must be perfect, i.e. all parents of a common child are connected. So in this case, the undirected version of the DAG is chordal.
Slide 19/27

Graph decompositions and total positivity
Consider a chordal graph $G$ and an associated junction tree $T$ of cliques.
Theorem (Fallat et al. 2016). If all separators $S$ in $T$ are singletons, a distribution $P$ is MTP$_2$ if and only if all clique marginals $P_C$, $C \in \mathcal{C}$, are MTP$_2$.
Note that, in particular, this covers trees. If the separators are not singletons, it is easy to construct counterexamples.
And since the MTP$_2$ property is closed under marginalization, this implies that latent tree models with pairwise MTP$_2$ associations are MTP$_2$.
Slide 20/27

Pairwise interaction models
Theorem (Fallat et al. 2016). A distribution of the form
    $p(x) = \frac{1}{Z} \prod_{uv \in E} \psi_{uv}(x_u, x_v)$,
where the $\psi_{uv}$ are positive functions and $Z$ is a normalizing constant, is MTP$_2$ if and only if each $\psi_{uv}$ is an MTP$_2$ function.
This covers, in particular, ferromagnetic Ising models.
Slide 21/27
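
For an Ising model with $\psi_{uv}(x_u, x_v) = \exp(J_{uv} x_u x_v)$ on $\{-1, 1\}$, each $\psi_{uv}$ is MTP$_2$ exactly when $J_{uv} \geq 0$. The sketch below checks the theorem by brute force on a three-node chain; the coupling values and helper names are hypothetical, and the normalizing constant $Z$ is omitted because it cancels in the MTP$_2$ inequality.

```python
import math
from itertools import product

def ising_weights(J, n):
    """Unnormalised weights prod_{uv} exp(J_uv x_u x_v) on {-1, 1}^n."""
    return {x: math.exp(sum(j * x[u] * x[v] for (u, v), j in J.items()))
            for x in product((-1, 1), repeat=n)}

def is_mtp2(w, tol=1e-12):
    """Brute-force check of w(x) w(y) <= w(min(x, y)) w(max(x, y)) for all x, y."""
    for x in w:
        for y in w:
            meet = tuple(map(min, x, y))
            join = tuple(map(max, x, y))
            if w[x] * w[y] > w[meet] * w[join] * (1 + tol):
                return False
    return True

ferro = ising_weights({(0, 1): 0.8, (1, 2): 0.5}, 3)   # all couplings >= 0
anti = ising_weights({(0, 1): -0.8, (1, 2): 0.5}, 3)   # one negative coupling

print(is_mtp2(ferro), is_mtp2(anti))
```

The small relative tolerance guards against rounding in the cases where the inequality holds with equality, for instance for configurations that differ only at the non-adjacent nodes 0 and 2.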

Higher order interactions
Let $X = (X_v)_{v \in V}$ take values in $\mathcal{X} = \prod_{v \in V} \mathcal{X}_v$, where each $\mathcal{X}_v$ is finite. Let $\mathcal{D}$ denote the power set of $V$. If $p(x) > 0$ for all $x$, we can expand
    $\log p(x) = \sum_{D \in \mathcal{D}} \theta_D(x)$,    (1)
where the interactions $\theta_D$ depend on $x$ through $x_D$ only.
For uniqueness, we may w.l.o.g. assume $0 \in \mathcal{X}_v$ and require that $\theta_D(x) = 0$ whenever $x_d = 0$ for some $d \in D$.
In the binary case we may use simpler notation by letting $\theta_D(1_D) := \theta_D$ for all $D \in \mathcal{D}$.
Slide 22/27

Higher order interactions
For a fixed pair $u, w \in V$, we define $\gamma_{uw}$ on $\mathcal{X}$ by
    $\gamma_{uw}(x) = \sum_{D : \{u, w\} \subseteq D} \theta_D(x)$.
Proposition (Fallat et al. 2016). Let $P$ be strictly positive. Then $P$ is MTP$_2$ if and only if for all $A \subseteq V$ with $|A| \geq 2$ and any given $u, w \in V$ the function $\gamma_{uw}$ is non-negative, non-decreasing, and supermodular over $\mathcal{X}_A$, where $\mathcal{X}_A$ consists of those $x \in \mathcal{X}$ with support $A$.
Slide 23/27

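Binary log-linear expansions
For the binary case, the previous result specializes:
Corollary (Bartolucci and Forcina (2000)). Let $P$ be a binary distribution with
    $\log p(x) = \sum_{D \in \mathcal{D} :\, x_d = 1 \text{ for all } d \in D} \theta_D$.
Then $P$ is MTP$_2$ if and only if for all $A \subseteq V$ with $|A| \geq 2$ and all $\{u, w\} \subseteq V$ we have
    $\sum_{D : \{u, w\} \subseteq D \subseteq A} \theta_D \geq 0$.
Slide 24/27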
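
The corollary can be compared with a direct check of the MTP$_2$ inequality on a small example. The sketch below recovers the interactions $\theta_D$ by Möbius inversion of $\log p(1_B)$, where $1_B$ is the indicator vector of $B$, and then tests the condition on the sums. All function names and the two test distributions are hypothetical; nodes are labelled $0, \ldots, n-1$.

```python
import math
from itertools import combinations, product

def subsets(items):
    """All subsets of `items` as sorted tuples."""
    items = tuple(items)
    for r in range(len(items) + 1):
        yield from combinations(items, r)

def interactions(p, n):
    """Moebius inversion: theta_D = sum over B subset of D of (-1)^(|D|-|B|) log p(1_B)."""
    def indicator(B):
        return tuple(1 if v in B else 0 for v in range(n))
    return {D: sum((-1) ** (len(D) - len(B)) * math.log(p[indicator(B)])
                   for B in subsets(D))
            for D in subsets(range(n))}

def mtp2_by_interactions(p, n, tol=1e-9):
    """Check sum_{D: {u,w} <= D <= A} theta_D >= 0 for all A with |A| >= 2 and u, w in A."""
    theta = interactions(p, n)
    for A in subsets(range(n)):
        for u, w in combinations(A, 2):
            if sum(theta[D] for D in subsets(A) if u in D and w in D) < -tol:
                return False
    return True

def mtp2_direct(p, tol=1e-12):
    """Brute-force check of p(x) p(y) <= p(min(x, y)) p(max(x, y))."""
    return all(p[x] * p[y] <= p[tuple(map(min, x, y))] * p[tuple(map(max, x, y))] * (1 + tol)
               for x in p for y in p)

def from_pairwise(theta01, theta12):
    """Hypothetical binary distribution with log p(x) = theta01 x0 x1 + theta12 x1 x2 + const."""
    w = {x: math.exp(theta01 * x[0] * x[1] + theta12 * x[1] * x[2])
         for x in product((0, 1), repeat=3)}
    z = sum(w.values())
    return {x: v / z for x, v in w.items()}

p_pos = from_pairwise(1.0, 0.5)
p_neg = from_pairwise(-1.0, 0.5)
print(mtp2_by_interactions(p_pos, 3), mtp2_direct(p_pos))
print(mtp2_by_interactions(p_neg, 3), mtp2_direct(p_neg))
```

Pairs $\{u, w\}$ not contained in $A$ give an empty sum, so only pairs inside $A$ need to be enumerated. The tolerances absorb floating-point error in sums whose exact value is zero.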

Causal betweenness
Let $X = (X_1 = 1_A, X_2 = 1_B, X_3 = 1_C)$ be binary indicator functions of events $A$, $B$, $C$. Reichenbach (1956) says $B$ is causally between $A$ and $C$ if $P(C \mid B \cap A) = P(C \mid B)$ and
    $1 > P(C \mid B) > P(C \mid A) > P(C) > 0$,
    $1 > P(A \mid B) > P(A \mid C) > P(A) > 0$.
In general, causal betweenness does not imply MTP$_2$: if we let $p_{101} = 0$, $p_{000} = 4/10$, and $p_{ijk} = 1/10$ for the remaining six possibilities, $B$ is causally between $A$ and $C$, but $X$ is not MTP$_2$ since
    $0 = p_{101} p_{000} < p_{100} p_{001}$.
However, if $P(X = x) > 0$ for all $x$ and $B$ is causally between $A$ and $C$, then $P$ is MTP$_2$.
Conversely, if $P(X = x) > 0$ for all $x$, $P$ is MTP$_2$, and the independence graph of $P$ is $1 - 2 - 3$, then $B$ is causally between $A$ and $C$. This follows from the faithfulness of $P$.
Slide 25/27
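
The counterexample on this slide can be verified with exact rational arithmetic. The sketch below encodes the stated probabilities, with $x = (1_A, 1_B, 1_C)$, and checks Reichenbach's conditions and the MTP$_2$ inequality; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

# p_101 = 0, p_000 = 4/10, and 1/10 for the remaining six configurations.
p = {x: Fraction(1, 10) for x in product((0, 1), repeat=3)}
p[(1, 0, 1)] = Fraction(0)
p[(0, 0, 0)] = Fraction(4, 10)

def prob(event):
    return sum(q for x, q in p.items() if event(x))

def cond(event, given):
    return prob(lambda x: event(x) and given(x)) / prob(given)

A = lambda x: x[0] == 1
B = lambda x: x[1] == 1
C = lambda x: x[2] == 1
A_and_B = lambda x: A(x) and B(x)

screening_off = cond(C, A_and_B) == cond(C, B)
chain_C = 1 > cond(C, B) > cond(C, A) > prob(C) > 0
chain_A = 1 > cond(A, B) > cond(A, C) > prob(A) > 0
causally_between = screening_off and chain_C and chain_A

mtp2 = all(p[x] * p[y] <= p[tuple(map(min, x, y))] * p[tuple(map(max, x, y))]
           for x in p for y in p)

print(causally_between, mtp2)  # True False
```

Here $P(C \mid B) = 1/2$, $P(C \mid A) = 1/3$ and $P(C) = 3/10$, and the same values hold with $A$ and $C$ interchanged, so both chains of inequalities are satisfied while the MTP$_2$ inequality fails for $x = (1, 0, 0)$ and $y = (0, 0, 1)$.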

Some implications for structural learning
- A distribution is signed MTP$_2$ if sign changes $\sigma_v \in \{-1, 1\}$ can be allocated to $X_v$ so that $Y_v = \sigma_v X_v$, $v \in V$, is MTP$_2$;
- The MTP$_2$ restriction is convex in $\log f$, hence lends itself to convex optimization;
- So a potential learning strategy first finds a Chow-Liu tree, then changes signs so associations along edges are positive, and finally optimizes a scoring function (e.g. penalized likelihood) under MTP$_2$ constraints.
To be explored, so watch this space...
Slide 26/27
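
The sign-change step of this strategy is simple on a tree: fix the sign at a root and propagate along edges. The sketch below covers only that step, assuming a spanning tree with a non-zero signed association on each edge is already available (for instance from a Chow-Liu fit, which is not implemented here). The function name and the edge values are hypothetical.

```python
from collections import deque

def sign_switches(n, signed_edges, root=0):
    """Assign sigma_v in {-1, 1} so that sigma_u * sigma_v * s_uv > 0 on every tree edge.

    `signed_edges` maps an edge (u, v) of a spanning tree on nodes 0..n-1
    to its non-zero signed association s_uv.
    """
    adj = {v: [] for v in range(n)}
    for (u, v), s in signed_edges.items():
        adj[u].append((v, s))
        adj[v].append((u, s))
    sigma = {root: 1}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v, s in adj[u]:
            if v not in sigma:
                # flip v relative to u exactly when the edge association is negative
                sigma[v] = sigma[u] * (1 if s > 0 else -1)
                queue.append(v)
    return sigma

edges = {(0, 1): -0.6, (1, 2): 0.4, (1, 3): -0.3}
sigma = sign_switches(4, edges)
print(sigma)  # {0: 1, 1: -1, 2: -1, 3: 1}
```

Because a tree has no cycles, such an assignment always exists and is unique up to a global sign flip. On a graph with cycles the signs may be inconsistent, which is one reason the tree is a natural starting point.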

There are many more things to be said... Thank you! Slide 27/27

Bartolucci, F. and Forcina, A. (2000). A likelihood ratio test for MTP$_2$ within binary variables. Ann. Statist., 28(4):1206–1218.
Bølviken, E. (1982). Probability inequalities for the multivariate normal with non-negative partial correlations. Scand. J. Statist., 9:49–58.
Dykstra, R. L. and Hewett, J. E. (1978). Positive dependence of the roots of a Wishart matrix. The Annals of Statistics, 6(1):235–238.
Dynkin, E. (1980). Markov processes and random fields. Bulletin of the American Mathematical Society, 3(3):975–999.
Fallat, S., Lauritzen, S., Sadeghi, K., Uhler, C., Wermuth, N., and Zwiernik, P. (2016). Total positivity in Markov structures. Annals of Statistics, to appear. arXiv:1510.01290.

Fortuin, C. M., Kasteleyn, P. W., and Ginibre, J. (1971). Correlation inequalities on some partially ordered sets. Comm. Math. Phys., 22(2):89–103.
Gumbel, E. J. (1961). Bivariate logistic distributions. Journal of the American Statistical Association, 56(294):335–349.
Karlin, S. and Rinott, Y. (1980). Classes of orderings of measures and related correlation inequalities. I. Multivariate totally positive distributions. J. Multiv. Anal., 10(4):467–498.
Karlin, S. and Rinott, Y. (1983). M-matrices as covariance matrices of multinormal distributions. Linear Algebra Appl., 52:419–438.
Lebowitz, J. L. (1972). Bounds on the correlations and analyticity properties of ferromagnetic Ising spin systems. Comm. Math. Phys., 28(4):313–321.
Reichenbach, H. (1956). The Direction of Time. University of California Press, Berkeley, CA.

Sarkar, T. K. (1969). Some lower bounds of reliability. Tech. Report No. 124, Department of Operations Research and Department of Statistics, Stanford University.
Zwiernik, P. (2015). Semialgebraic Statistics and Latent Tree Models. Number 146 in Monographs on Statistics and Applied Probability. Chapman & Hall.