Total positivity in Markov structures


Steffen Lauritzen, Department of Mathematical Sciences, Faculty of Science
CRM, Montreal, July 2016
Based on joint work with Shaun Fallat, Kayvan Sadeghi, Caroline Uhler, Nanny Wermuth, and Piotr Zwiernik (arXiv:1510.01290).
Slide 1/27

1. Positive association and multivariate total positivity
2. Multivariate Gaussian MTP$_2$ distributions
3. Conditional independence and Markov properties
4. Totally positive Markov distributions
5. Special instances of total positivity
Slide 2/27

Positive dependence and Simpson's paradox
Two real-valued random variables $X$ and $Y$ are positively associated if $\mathrm{Cov}\{f(X), g(Y)\} \ge 0$ for all non-decreasing $f$, $g$.
The Yule-Simpson paradox says that we may have $X$ and $Y$ positively associated, yet negatively associated conditionally on a third variable $Z$.
Multivariate total positivity (MTP$_2$) ensures that this cannot happen: associations can never change sign due to changes of context. Hence such associations might be of a causal nature...
Slide 3/27

Multivariate total positivity for functions
Let $f : \mathcal{X} = \prod_{v \in V} \mathcal{X}_v \to \mathbb{R}$, where the $\mathcal{X}_v$ are either discrete or open subsets of $\mathbb{R}$.
Definition. $f$ is multivariate totally positive of order two (MTP$_2$) if
$$f(x) f(y) \le f(x \wedge y) f(x \vee y) \quad \text{for all } x, y \in \mathcal{X}.$$
Here $\wedge$ and $\vee$ should be applied coordinatewise. In the bivariate case, this property is known simply as total positivity or TP$_2$ (Karlin and Rinott, 1980).
A function $g$ is supermodular if
$$g(x \wedge y) + g(x \vee y) \ge g(x) + g(y) \quad \text{for all } x, y \in \mathcal{X}.$$
Thus $g$ is supermodular iff $\exp(g)$ is MTP$_2$.
Slide 4/27
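
The definition can be checked by brute force when every coordinate has finitely many states. The following is a minimal sketch under that assumption; the function name `is_mtp2`, the grid, and the test function `g` are illustrative choices and not part of the slides.

```python
import math
from itertools import product


def is_mtp2(f, levels, rel_tol=1e-9):
    """Brute-force check of f(x) f(y) <= f(x meet y) f(x join y) on a finite grid.

    f      : function mapping a tuple x to a non-negative number
    levels : one list of states per coordinate (the finite state spaces)
    """
    points = list(product(*levels))
    for x in points:
        for y in points:
            meet = tuple(min(a, b) for a, b in zip(x, y))  # coordinatewise minimum
            join = tuple(max(a, b) for a, b in zip(x, y))  # coordinatewise maximum
            # A relative tolerance absorbs floating-point rounding in equality cases.
            if f(x) * f(y) > f(meet) * f(join) * (1 + rel_tol):
                return False
    return True


def g(x):
    # Illustrative supermodular function: a sum of products of coordinates.
    return x[0] * x[1] + x[1] * x[2]


grid = [[0, 1, 2]] * 3
supermodular_case = is_mtp2(lambda x: math.exp(g(x)), grid)   # exp(g), g supermodular
submodular_case = is_mtp2(lambda x: math.exp(-g(x)), grid)    # exp(-g) violates MTP2
print(supermodular_case, submodular_case)
```

The check costs one comparison per pair of grid points, so it is only practical for small state spaces.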

Example
For $d = 2$ and $x_1 \le x_2$, $y_1 \le y_2$, the condition for MTP$_2$ simply becomes
$$f(x_1, y_2) f(x_2, y_1) \le f(x_1, y_1) f(x_2, y_2),$$
or, alternatively,
$$\det \begin{pmatrix} f(x_1, y_1) & f(x_1, y_2) \\ f(x_2, y_1) & f(x_2, y_2) \end{pmatrix} \ge 0.$$
Slide 5/27

Multivariate total positivity for distributions
For $\mathcal{X} = \prod_{v \in V} \mathcal{X}_v$ as before we adopt a standard base measure $\mu = \otimes_{v \in V} \mu_v$, where $\mu_v$ is counting measure if $\mathcal{X}_v$ is discrete and Lebesgue measure if $\mathcal{X}_v$ is an open subset of $\mathbb{R}$. We then define:
Definition. A distribution $P$ is said to be multivariate totally positive of order two (MTP$_2$) if its density w.r.t. the standard base measure $\mu$ is MTP$_2$.
Introduced and studied by Karlin and Rinott (1980), using results (the FKG inequality) from the fundamental paper by Fortuin et al. (1971).
We shall occasionally say "$X$ is MTP$_2$" instead of "the distribution of $X$ is MTP$_2$".
Slide 6/27

Example
For $d = 2$, let $f$ be the density of a Gaussian distribution with mean zero and covariance matrix
$$\Sigma = \begin{pmatrix} \sigma_{xx} & \sigma_{xy} \\ \sigma_{yx} & \sigma_{yy} \end{pmatrix}.$$
Then $f(x_1, y_2) f(x_2, y_1) \le f(x_1, y_1) f(x_2, y_2)$ if and only if $\sigma_{yx} \ge 0$, since this is equivalent to the mixed terms in the exponents satisfying
$$(x_1 y_2 + x_2 y_1)\, \sigma_{xy} / \det(\Sigma) \le (x_1 y_1 + x_2 y_2)\, \sigma_{xy} / \det(\Sigma),$$
and if $\sigma_{xy} > 0$ this is equivalent to
$$(x_1 y_1 + x_2 y_2) - (x_1 y_2 + x_2 y_1) = (x_2 - x_1)(y_2 - y_1) \ge 0.$$
Slide 7/27

Example
Consider binary $X$ and $Y$ with $p_{ij} = P(X = i, Y = j)$ for $i, j \in \{0, 1\}$. Then $P$ is MTP$_2$ if and only if
$$p_{01} p_{10} \le p_{00} p_{11},$$
i.e. iff the odds ratio $\theta = p_{00} p_{11} / (p_{01} p_{10})$ satisfies $\theta \ge 1$.
For three MTP$_2$ binary variables $X, Y, Z$ we have, for example,
$$p_{01k}\, p_{10k} \le p_{00k}\, p_{11k}, \quad k = 0, 1,$$
and thus the conditional odds ratios satisfy $\theta_k = p_{00k}\, p_{11k} / (p_{01k}\, p_{10k}) \ge 1$.
Slide 8/27
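
As a small numerical illustration of the odds-ratio conditions above, the sketch below computes the conditional odds ratios $\theta_k$ and the marginal odds ratio of $(X, Y)$ for a 2x2x2 table. The cell counts are hypothetical and chosen only for illustration; the function names are likewise assumptions.

```python
from fractions import Fraction

# Hypothetical cell counts n[(i, j, k)] for binary (X, Y, Z). Odds ratios are
# unchanged when counts are normalised to probabilities, so counts suffice.
counts = {
    (0, 0, 0): 4, (0, 1, 0): 1, (1, 0, 0): 1, (1, 1, 0): 2,
    (0, 0, 1): 2, (0, 1, 1): 1, (1, 0, 1): 1, (1, 1, 1): 4,
}


def odds_ratio(n00, n01, n10, n11):
    """Odds ratio n00 * n11 / (n01 * n10) of a 2x2 table, as an exact fraction."""
    return Fraction(n00 * n11, n01 * n10)


def conditional_odds_ratios(n):
    """theta_k for the (X, Y) table given Z = k, k = 0, 1."""
    return [odds_ratio(n[0, 0, k], n[0, 1, k], n[1, 0, k], n[1, 1, k]) for k in (0, 1)]


def marginal_odds_ratio(n):
    """Odds ratio of the (X, Y) margin, obtained by summing over Z."""
    m = {(i, j): n[i, j, 0] + n[i, j, 1] for i in (0, 1) for j in (0, 1)}
    return odds_ratio(m[0, 0], m[0, 1], m[1, 0], m[1, 1])


theta_k = conditional_odds_ratios(counts)
theta = marginal_odds_ratio(counts)
print(theta_k, theta)
```

These conditions on the $(X, Y)$ odds ratios are only the ones listed in the example; a full MTP$_2$ check must also cover the other pairs of variables.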

Examples of MTP$_2$ distributions
Mostly from Karlin and Rinott (1980):
- Characteristic roots of a Wishart matrix $W$, or of $W_1 W_2^{-1}$, or of $W_1 (W_1 + W_2)^{-1}$, where $W_1 \perp\!\!\!\perp W_2$ (Dykstra and Hewett, 1978);
- Ferromagnetic (attractive) Ising models (Lebowitz, 1972);
- Multivariate logistic density (Gumbel, 1961);
- Gaussian free fields (random height landscapes) (Dynkin, 1980);
- Markov chains with TP$_2$ transition densities;
- Order statistics $(X_{(1)}, \ldots, X_{(n)})$ if $X_1, \ldots, X_n$ are i.i.d. with density $f$;
- Gaussian latent tree models as in phylogenetics (Zwiernik, 2015);
- Many other examples...
Slide 9/27

Fundamental properties
A wealth of probability inequalities are satisfied for MTP$_2$ distributions (Karlin and Rinott, 1980). Also:
Proposition. Assume $X$ is MTP$_2$. Then
- If $A \subseteq V$, then the marginal $X_A = (X_v)_{v \in A}$ is MTP$_2$;
- If $C \subseteq V$, then the conditional distribution $\mathcal{L}(X_{V \setminus C} \mid X_C = x_C)$ is MTP$_2$ for almost all $x_C \in \mathcal{X}_C$;
- If $X$ is discrete and $Y$ is obtained from $X$ by collapsing neighboring states, then $Y$ is MTP$_2$;
- If $\phi = (\phi_v)_{v \in V}$ are non-decreasing, then $Y = \phi(X)$ is MTP$_2$.
Slide 10/27

Positive association and MTP$_2$
Proposition. If $X$ is MTP$_2$, then $X$ is positively associated: if $f$ and $g$ are non-decreasing in each of their arguments, then
$$\mathrm{Cov}\{f(X), g(X)\} \ge 0.$$
Proof. Discrete case by Fortuin et al. (1971). General case by Sarkar (1969).
Slide 11/27

Covariance and independence
Proposition. If $X$ is positively associated and $A, B \subseteq V$ are disjoint, then
$$X_A \perp\!\!\!\perp X_B \iff \mathrm{Cov}(X_u, X_v) = 0 \text{ for all } u \in A,\ v \in B.$$
Proof. Shown in Lebowitz (1972).
Such a result is usually special for the Gaussian distribution.
So learning MTP$_2$ structure may be based on correlation analysis.
Slide 12/27

Multivariate Gaussian MTP$_2$ distributions
Proposition. Let $X \sim N_V(0, \Sigma)$. Then $X$ is MTP$_2$ if and only if $K = \Sigma^{-1}$ is a positive definite Minkowski matrix (M-matrix), i.e. iff
$$k_{uv} \le 0 \quad \text{for } u \ne v \text{ and } u, v \in V.$$
Proof. See Bølviken (1982) and Karlin and Rinott (1983).
Since $k_{uv}$ is proportional to the negative partial correlation between $X_u$ and $X_v$, $X$ is MTP$_2$ if and only if all partial correlations are non-negative.
Note also that this is a convex restriction in $K$.
Slide 13/27
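
The M-matrix criterion is easy to test numerically once a concentration matrix $K$ is given. The sketch below works in pure Python and takes $K$ directly (not $\Sigma$) as input; the function names and the three small example matrices are illustrative assumptions, not taken from the slides.

```python
import math


def is_positive_definite(k):
    """Try a Cholesky factorisation; it succeeds iff the symmetric matrix k is positive definite."""
    n = len(k)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = k[i][j] - sum(chol[i][m] * chol[j][m] for m in range(j))
            if i == j:
                if s <= 0:
                    return False  # non-positive pivot: not positive definite
                chol[i][i] = math.sqrt(s)
            else:
                chol[i][j] = s / chol[j][j]
    return True


def is_gaussian_mtp2(k, tol=1e-12):
    """Check that k is positive definite with non-positive off-diagonal entries."""
    n = len(k)
    off_diagonal_ok = all(k[u][v] <= tol for u in range(n) for v in range(n) if u != v)
    return off_diagonal_ok and is_positive_definite(k)


# Illustrative concentration matrices.
k_mtp2 = [[2, -1, 0], [-1, 2, -1], [0, -1, 2]]        # M-matrix
k_mixed_sign = [[2, 1, 0], [1, 2, -1], [0, -1, 2]]    # positive off-diagonal entry
k_indefinite = [[1, -2], [-2, 1]]                     # not positive definite
print(is_gaussian_mtp2(k_mtp2), is_gaussian_mtp2(k_mixed_sign), is_gaussian_mtp2(k_indefinite))
```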

Mathematics marks

             Mechanics  Vectors  Algebra  Analysis  Statistics
Mechanics         5.24    -2.44    -2.74      0.01       -0.14
Vectors           0.33    10.43    -4.71     -0.79       -0.17
Algebra           0.23     0.28    26.95     -7.05       -4.70
Analysis         -0.00     0.08     0.43      9.88       -2.02
Statistics        0.02     0.02     0.36      0.25        6.45

Empirical partial correlations (below the diagonal) and concentrations ($\times 1000$, on and above the diagonal) for 88 examination marks in five mathematical subjects.
Essentially MTP$_2$.
Slide 14/27
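
The two halves of the table are linked by $\rho_{uv \cdot \text{rest}} = -k_{uv} / \sqrt{k_{uu} k_{vv}}$. The sketch below recomputes the partial correlations from the concentrations. One assumption is made and should be read with care: the off-diagonal concentrations are entered with negative signs wherever the reported partial correlation is positive, since that is what the relation above requires; the function name is illustrative.

```python
import math

SUBJECTS = ["Mechanics", "Vectors", "Algebra", "Analysis", "Statistics"]

# Concentrations (x1000), upper triangle including the diagonal.
# Assumption: off-diagonal signs are inferred from the partial correlations.
UPPER = {
    (0, 0): 5.24, (0, 1): -2.44, (0, 2): -2.74, (0, 3): 0.01, (0, 4): -0.14,
    (1, 1): 10.43, (1, 2): -4.71, (1, 3): -0.79, (1, 4): -0.17,
    (2, 2): 26.95, (2, 3): -7.05, (2, 4): -4.70,
    (3, 3): 9.88, (3, 4): -2.02,
    (4, 4): 6.45,
}


def partial_correlations(upper, n):
    """Return the matrix of partial correlations -k_uv / sqrt(k_uu * k_vv)."""
    k = [[upper[(min(u, v), max(u, v))] for v in range(n)] for u in range(n)]
    return [
        [1.0 if u == v else -k[u][v] / math.sqrt(k[u][u] * k[v][v]) for v in range(n)]
        for u in range(n)
    ]


rho = partial_correlations(UPPER, len(SUBJECTS))
for u in range(len(SUBJECTS)):
    for v in range(u):
        print(f"{SUBJECTS[u]:>10} / {SUBJECTS[v]:<10} {rho[u][v]: .2f}")
```

The common factor 1000 cancels in the ratio, so the scaled concentrations can be used directly. Only the Mechanics/Analysis entry comes out (marginally) negative, which matches the remark that the data are essentially MTP$_2$.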

Mathematics marks under MTP$_2$
Fitting under the MTP$_2$ constraint yields $\hat{K}$, which conforms with the graphical model below.
[Figure: undirected graph on the vertices Vectors, Analysis, Algebra, Mechanics, Statistics]
Slide 15/27

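Abstract conditional independence
An independence model $\perp_\sigma$ is a ternary relation over subsets of $V$. It is a semi-graphoid if for disjoint subsets $A$, $B$, $C$, $D$:
(S1) if $A \perp_\sigma B \mid C$ then $B \perp_\sigma A \mid C$ (symmetry);
(S2) if $A \perp_\sigma (B \cup D) \mid C$ then $A \perp_\sigma B \mid C$ and $A \perp_\sigma D \mid C$ (decomposition);
(S3) if $A \perp_\sigma (B \cup C) \mid D$ then $A \perp_\sigma B \mid (C \cup D)$ (weak union);
(S4) if $A \perp_\sigma B \mid C$ and $A \perp_\sigma D \mid (B \cup C)$, then $A \perp_\sigma (B \cup D) \mid C$ (contraction).
Any probabilistic independence model $\perp\!\!\!\perp_P$ is a semi-graphoid.
It is a graphoid if (S1)-(S4) hold and
(S5) if $A \perp_\sigma B \mid (C \cup D)$ and $A \perp_\sigma C \mid (B \cup D)$ then $A \perp_\sigma (B \cup C) \mid D$ (intersection).
If $X$ has a density $f > 0$, its associated independence model $\perp\!\!\!\perp_P$ is a graphoid.
Slide 16/27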

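Conditional independence and total positivity
A probability distribution on $\mathcal{X}$ defines an independence model $\perp\!\!\!\perp_P$ by
$$A \perp\!\!\!\perp_P B \mid S \iff X_A \perp\!\!\!\perp_P X_B \mid X_S.$$
Proposition (Fallat et al. 2016). If $X$ is MTP$_2$, its independence model $\perp\!\!\!\perp_P$ satisfies
(S6) $(A \perp\!\!\!\perp_P B \mid C) \wedge (A \perp\!\!\!\perp_P D \mid C) \implies A \perp\!\!\!\perp_P (B \cup D) \mid C$ (composition);
(S7) $(u \perp\!\!\!\perp_P v \mid C) \wedge (u \perp\!\!\!\perp_P v \mid (C \cup \{w\})) \implies (u \perp\!\!\!\perp_P w \mid C) \vee (v \perp\!\!\!\perp_P w \mid C)$ (singleton transitivity);
(S8) $(A \perp\!\!\!\perp_P B \mid C) \wedge D \subseteq V \setminus (A \cup B) \implies A \perp\!\!\!\perp_P B \mid (C \cup D)$ (upward stability).
These are all fulfilled for separation $\perp_G$ in undirected graphs, but not necessarily for any probabilistic independence model $\perp\!\!\!\perp_P$.
Slide 17/27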

Markov properties
Let $P$ be a probability distribution on $\mathcal{X} = \prod_{v \in V} \mathcal{X}_v$. The pairwise independence graph $G(P) = (V, E)$ is defined through the relation
$$uv \notin E \iff u \perp\!\!\!\perp_P v \mid V \setminus \{u, v\}.$$
In other words, $G(P)$ is the smallest graph $G$ such that $P$ is pairwise Markov w.r.t. $G$.
We say that $P$ is globally Markov w.r.t. a graph $G$ if
$$A \perp_G B \mid S \implies A \perp\!\!\!\perp_P B \mid S,$$
where $\perp_G$ is separation in the graph $G$. Further, we say that $P$ is faithful to $G$ if
$$A \perp_G B \mid S \iff A \perp\!\!\!\perp_P B \mid S,$$
i.e. if the independence models $\perp\!\!\!\perp_P$ and $\perp_G$ are identical.
Slide 18/27

A main result
Theorem (Fallat et al. 2016). Assume the distribution $P$ of $X$ is MTP$_2$ with strictly positive density $f > 0$. Then $P$ is faithful to $G(P)$.
In other words, for MTP$_2$ distributions, the pairwise independence graph yields a complete picture of the independence relations in $P$.
It also implies that if $P$ is faithful to a DAG $D$ and $P$ is MTP$_2$, then $D$ must be perfect, i.e. all parents in the DAG are connected. So in this case, the undirected version of the DAG is chordal.
Slide 19/27

Graph decompositions and total positivity
Consider a chordal graph $G$ and an associated junction tree $T$ of cliques.
Theorem (Fallat et al. 2016). If all separators $S$ in $T$ are singletons, a distribution $P$ is MTP$_2$ if and only if all clique marginals $P_C$, $C \in \mathcal{C}$, are MTP$_2$.
Note in particular that this covers trees. If the separators are not singletons, it is easy to construct counterexamples.
And since the MTP$_2$ property is closed under marginalization, this implies that latent tree models with pairwise MTP$_2$ associations are MTP$_2$.
Slide 20/27

Pairwise interaction models
Theorem (Fallat et al. (2016)). A distribution of the form
$$p(x) = \frac{1}{Z} \prod_{uv \in E} \psi_{uv}(x_u, x_v),$$
where the $\psi_{uv}$ are positive functions and $Z$ is a normalizing constant, is MTP$_2$ if and only if each $\psi_{uv}$ is an MTP$_2$ function.
This covers, in particular, ferromagnetic Ising models.
Slide 21/27
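
The theorem can be illustrated on a small Ising model with spins in $\{-1, +1\}$ and potentials $\psi_{uv}(x_u, x_v) = \exp(J_{uv} x_u x_v)$. The sketch below is only a brute-force sanity check on one example, not a proof; the 4-cycle, the coupling values, and all function names are assumptions made for illustration. The normalizing constant $Z$ cancels in the MTP$_2$ inequality, so the unnormalized product is used.

```python
import math
from itertools import product

EDGES = [(0, 1), (1, 2), (2, 3), (3, 0)]  # hypothetical 4-cycle


def psi(j, a, b):
    """Ising pair potential exp(j * a * b) for spins a, b in {-1, +1}."""
    return math.exp(j * a * b)


def potential_is_tp2(j):
    """2x2 determinant condition for psi on the ordered states -1 < +1."""
    return psi(j, -1, -1) * psi(j, 1, 1) >= psi(j, -1, 1) * psi(j, 1, -1)


def unnormalised_p(x, couplings):
    """Product of the pair potentials over all edges (Z is omitted)."""
    return math.prod(psi(couplings[e], x[e[0]], x[e[1]]) for e in EDGES)


def joint_is_mtp2(couplings, rel_tol=1e-9):
    """Brute-force check of p(x) p(y) <= p(x meet y) p(x join y) over all spin pairs."""
    states = list(product((-1, 1), repeat=4))
    for x in states:
        for y in states:
            meet = tuple(map(min, x, y))
            join = tuple(map(max, x, y))
            lhs = unnormalised_p(x, couplings) * unnormalised_p(y, couplings)
            rhs = unnormalised_p(meet, couplings) * unnormalised_p(join, couplings)
            if lhs > rhs * (1 + rel_tol):
                return False
    return True


ferromagnetic = {e: 0.5 for e in EDGES}   # all couplings positive
frustrated = dict(ferromagnetic)
frustrated[(0, 1)] = -0.5                 # one antiferromagnetic edge

print(all(potential_is_tp2(j) for j in ferromagnetic.values()), joint_is_mtp2(ferromagnetic))
print(all(potential_is_tp2(j) for j in frustrated.values()), joint_is_mtp2(frustrated))
```

In both cases the answer for the potentials agrees with the answer for the joint distribution, as the theorem predicts.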

Higher order interactions
Let $X = (X_v)_{v \in V}$ take values in $\mathcal{X} = \prod_{v \in V} \mathcal{X}_v$, where each $\mathcal{X}_v$ is finite. Let $\mathcal{D}$ denote the power set of $V$. If $p(x) > 0$ for all $x$, we can expand
$$\log p(x) = \sum_{D \in \mathcal{D}} \theta_D(x), \qquad (1)$$
where the interactions $\theta_D$ depend on $x$ through $x_D$ only.
For uniqueness, we may w.l.o.g. assume $0 \in \mathcal{X}_v$ and require that $\theta_D(x) = 0$ whenever $x_d = 0$ for some $d \in D$.
In the binary case we may use simpler notation by letting $\theta_D(1_D) := \theta_D$ for all $D \in \mathcal{D}$.
Slide 22/27

Higher order interactions
For a fixed pair $u, w \in V$, we define $\gamma_{uw}$ on $\mathcal{X}$ by
$$\gamma_{uw}(x) = \sum_{D : \{u, w\} \subseteq D} \theta_D(x).$$
Proposition (Fallat et al. (2016)). Let $P$ be strictly positive. Then $P$ is MTP$_2$ if and only if for all $A \subseteq V$ with $|A| \ge 2$ and any given $u, w \in V$ the function $\gamma_{uw}$ is non-negative, non-decreasing, and supermodular over $\mathcal{X}_A$, where $\mathcal{X}_A$ are those $x$ with support $A$.
Slide 23/27

Binary log-linear expansions
For the binary case, the previous result specializes:
Corollary (Bartolucci and Forcina (2000)). Let $P$ be a binary distribution with
$$\log p(x) = \sum_{D \subseteq \{v : x_v = 1\}} \theta_D.$$
Then $P$ is MTP$_2$ if and only if for all $A$ with $|A| \ge 2$ and all $\{u, w\} \subseteq V$ we have
$$\sum_{D : \{u, w\} \subseteq D \subseteq A} \theta_D \ge 0.$$
Slide 24/27
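
The condition in the corollary can be evaluated directly for a small binary distribution. The sketch below assumes the convention that $\log p(x)$ is the sum of $\theta_D$ over subsets $D$ of the support of $x$, recovers the $\theta_D$ by Möbius inversion, and then checks the partial sums. The function names and the two three-variable test distributions are hypothetical.

```python
import math
from itertools import combinations, product


def subsets(items):
    """All subsets of items, as frozensets."""
    items = tuple(items)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def interactions(p, n):
    """theta_D = sum over B subset of D of (-1)^(|D| - |B|) * log p(1_B)."""
    def indicator(b):
        return tuple(1 if v in b else 0 for v in range(n))
    return {
        d: sum((-1) ** (len(d) - len(b)) * math.log(p[indicator(b)]) for b in subsets(d))
        for d in subsets(range(n))
    }


def satisfies_condition(p, n, tol=1e-9):
    """Check sum_{D: {u,w} <= D <= A} theta_D >= 0 for all A with |A| >= 2 and u, w in A."""
    theta = interactions(p, n)
    for a in subsets(range(n)):
        if len(a) < 2:
            continue
        for u, w in combinations(sorted(a), 2):
            total = sum(t for d, t in theta.items() if {u, w} <= d <= a)
            if total < -tol:
                return False
    return True


def normalise(weights):
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}


states = list(product((0, 1), repeat=3))
# Hypothetical examples: positive pairwise interactions vs. one negative interaction.
p_attractive = normalise({x: math.exp(0.5 * (x[0] * x[1] + x[1] * x[2])) for x in states})
p_repulsive = normalise({x: math.exp(0.5 * x[0] * x[1] - 0.5 * x[1] * x[2]) for x in states})
print(satisfies_condition(p_attractive, 3), satisfies_condition(p_repulsive, 3))
```

Pairs $\{u, w\}$ not contained in $A$ give an empty sum, so restricting the loop to $u, w \in A$ loses nothing.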

Causal betweenness
Let $X = (X_1 = 1_A, X_2 = 1_B, X_3 = 1_C)$ be binary indicator functions of events $A$, $B$, $C$. Reichenbach (1956) says $B$ is causally between $A$ and $C$ if $P(C \mid B \cap A) = P(C \mid B)$ and
$$1 > P(C \mid B) > P(C \mid A) > P(C) > 0, \qquad 1 > P(A \mid B) > P(A \mid C) > P(A) > 0.$$
In general, causal betweenness does not imply MTP$_2$; if we let $p_{101} = 0$, $p_{000} = 4/10$, and $p_{ijk} = 1/10$ for the remaining six possibilities, $B$ is causally between $A$ and $C$, but $X$ is not MTP$_2$ since $0 = p_{101}\, p_{000} < p_{100}\, p_{001}$.
However, if $P(X = x) > 0$ for all $x$ and $B$ is causally between $A$ and $C$, then $P$ is MTP$_2$.
Conversely, if $P(X = x) > 0$ for all $x$, $P$ is MTP$_2$, and the independence graph of $P$ is the path 1 - 2 - 3, then $B$ is causally between $A$ and $C$. This follows from the faithfulness of $P$.
Slide 25/27
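
The counterexample on this slide can be verified with exact rational arithmetic. The sketch below encodes the stated probabilities and checks Reichenbach's conditions and the failing MTP$_2$ inequality; the helper names `prob` and `cond` are illustrative.

```python
from fractions import Fraction
from itertools import product

# p[(i, j, k)] = P(X1 = i, X2 = j, X3 = k), with X1 = 1_A, X2 = 1_B, X3 = 1_C.
p = {x: Fraction(1, 10) for x in product((0, 1), repeat=3)}
p[(1, 0, 1)] = Fraction(0)
p[(0, 0, 0)] = Fraction(4, 10)


def prob(event):
    """Probability of the set of outcomes x for which event(x) is true."""
    return sum(q for x, q in p.items() if event(x))


def cond(event, given):
    """Conditional probability P(event | given)."""
    return prob(lambda x: event(x) and given(x)) / prob(given)


def A(x):
    return x[0] == 1


def B(x):
    return x[1] == 1


def C(x):
    return x[2] == 1


total_mass = sum(p.values())
screening_off = cond(C, lambda x: A(x) and B(x)) == cond(C, B)
chain_c = 1 > cond(C, B) > cond(C, A) > prob(C) > 0
chain_a = 1 > cond(A, B) > cond(A, C) > prob(A) > 0
# MTP2 would require p_100 * p_001 <= p_000 * p_101 (meet 000, join 101).
mtp2_violated = p[(1, 0, 0)] * p[(0, 0, 1)] > p[(0, 0, 0)] * p[(1, 0, 1)]
print(total_mass, screening_off, chain_c, chain_a, mtp2_violated)
```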

Some implications for structural learning
- A distribution is signed MTP$_2$ if signs $\sigma_v \in \{-1, 1\}$ can be allocated to $X_v$ so that $Y_v = \sigma_v X_v$, $v \in V$, is MTP$_2$;
- The MTP$_2$ restriction is convex in $\log f$, hence lends itself to convex optimization;
- So a potential learning strategy first finds a Chow-Liu tree, then changes signs so that associations along edges are positive, and finally optimizes a scoring function (e.g. penalized likelihood) under MTP$_2$ constraints.
To be explored, so watch this space...
Slide 26/27
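
The middle step of the strategy above (changing signs so that associations along tree edges are positive) can be sketched as a breadth-first traversal of the tree. This is only an illustration of that single step under stated assumptions: the tree is taken as given and connected, every edge carries a non-zero correlation, and the function name, the vertex labels, and the correlation values are all hypothetical. Finding the Chow-Liu tree and the constrained optimization are not shown.

```python
from collections import deque


def allocate_signs(n, tree_edges):
    """Return signs sigma_v in {-1, 1} with sigma_u * sigma_v * corr_uv > 0 on every tree edge.

    n          : number of vertices, labelled 0, ..., n - 1
    tree_edges : dict mapping an edge (u, v) to its non-zero correlation
    """
    neighbours = {v: [] for v in range(n)}
    for (u, v), corr in tree_edges.items():
        neighbours[u].append((v, corr))
        neighbours[v].append((u, corr))

    sigma = [0] * n      # 0 marks a vertex that has not been visited yet
    sigma[0] = 1         # the root keeps its sign
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v, corr in neighbours[u]:
            if sigma[v] == 0:
                # Keep the parent's sign for a positive edge, flip it for a negative one.
                sigma[v] = sigma[u] if corr > 0 else -sigma[u]
                queue.append(v)
    return sigma


# Hypothetical tree on four vertices with signed edge correlations.
edges = {(0, 1): -0.6, (1, 2): 0.4, (1, 3): -0.3}
sigma = allocate_signs(4, edges)
print(sigma)
```

On a tree such a sign allocation always exists because there are no cycles; positivity along the tree edges alone does not guarantee that the resulting distribution is MTP$_2$.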

There are many more things to be said... Thank you! Slide 27/27

Bartolucci, F. and Forcina, A. (2000). A likelihood ratio test for MTP$_2$ within binary variables. Ann. Statist., 28(4):1206-1218.
Bølviken, E. (1982). Probability inequalities for the multivariate normal with non-negative partial correlations. Scand. J. Statist., 9:49-58.
Dykstra, R. L. and Hewett, J. E. (1978). Positive dependence of the roots of a Wishart matrix. Ann. Statist., 6(1):235-238.
Dynkin, E. (1980). Markov processes and random fields. Bull. Amer. Math. Soc., 3(3):975-999.
Fallat, S., Lauritzen, S., Sadeghi, K., Uhler, C., Wermuth, N., and Zwiernik, P. (2016). Total positivity in Markov structures. Ann. Statist., to appear. arXiv:1510.01290.
Fortuin, C. M., Kasteleyn, P. W., and Ginibre, J. (1971). Correlation inequalities on some partially ordered sets. Comm. Math. Phys., 22(2):89-103.
Gumbel, E. J. (1961). Bivariate logistic distributions. J. Amer. Statist. Assoc., 56(294):335-349.
Karlin, S. and Rinott, Y. (1980). Classes of orderings of measures and related correlation inequalities. I. Multivariate totally positive distributions. J. Multiv. Anal., 10(4):467-498.
Karlin, S. and Rinott, Y. (1983). M-matrices as covariance matrices of multinormal distributions. Linear Algebra Appl., 52:419-438.
Lebowitz, J. L. (1972). Bounds on the correlations and analyticity properties of ferromagnetic Ising spin systems. Comm. Math. Phys., 28(4):313-321.
Reichenbach, H. (1956). The Direction of Time. University of California Press, Berkeley, CA.
Sarkar, T. K. (1969). Some lower bounds of reliability. Tech. Report No. 124, Department of Operations Research and Department of Statistics, Stanford University.
Zwiernik, P. (2015). Semialgebraic Statistics and Latent Tree Models. Number 146 in Monographs on Statistics and Applied Probability. Chapman & Hall.