Extremal Entropy: Information Geometry, Numerical Entropy Mapping, and Machine Learning Application of Associated Conditional Independences


1 Extremal Entropy: Information Geometry, Numerical Entropy Mapping, and Machine Learning Application of Associated Conditional Independences Ph.D. Dissertation Defense Yunshu Liu Adaptive Signal Processing and Information Theory Research Group ECE Department, Drexel University, Philadelphia, PA

2 Motivation. Conditional independence of discrete random variables and the region of entropic vectors, together with their applications: p-representable semimatroids; probabilistic modeling with graphical models and sum-product networks; calculating the network coding capacity region; determining limits of large coded distributed storage systems; finding theoretical lower bounds for secret sharing schemes; classification via information guided dataset partitioning; structure learning (e.g. protein-drug interaction); and communication (decoding LDPC codes).

3 Main Contribution 1: Enumerating k-atom supports and mapping them to the entropic region. Introducing the concept of non-isomorphic distribution support enumeration for entropy vectors. [Figure: the transversal of the orbits Z(v) of supports under the symmetric group.] Generating the list of non-isomorphic k-atom, N-variable supports through the algorithm of snakes and ladders. Optimizing entropy inner bounds from k-atom distributions. [Figure: the Shannon outer bound Γ_4, a non-Shannon outer bound, the inner bound generated from distributions, and the hyperplane Ingleton_34 = 0.]

4 Main Contribution 2: Characterization of the entropic region via Information Geometry. Characterizing information inequalities with information geometry. [Figure: p_X and its projections p_Q and p_E onto nested e-autoparallel submanifolds E ⊆ Q ⊆ S.] Pythagorean relation: D(p_X ‖ p_E) = D(p_X ‖ p_Q) + D(p_Q ‖ p_E). Markov chain in E: X_{A\B} ↔ X_{A∩B} ↔ X_{B\A}. Entropy submodularity: D(p_Q ‖ p_E) = h_A + h_B − h_{A∪B} − h_{A∩B} ≥ 0. Mapping k-atom support distributions with information geometry. [Figure: the 4-atom support in e-coordinates, showing the line of distributions with the given 0.5 marginals, the DFZ four-atom-conjecture point (0.35, 0.15, 0.15, 0.35), the uniform point (0.25 each), the regions Ingleton_34 > 0 and Ingleton_34 < 0, the hyperplane Ingleton_34 = 0, and the hyperplane E: I(X_3;X_4) = 0.]

5 Main Contribution 3: Information guided dataset partitioning. Exploit context-specific independence for supervised learning. [Figure: a two-branch sum node with weights w_1 and w_2; one branch factorizes X_A independently of (X_{\A}, Y), the other factorizes X_B independently of (X_{\B}, Y).] Design information theoretic score functions to partition the dataset. [Figure: the original training data (X, Y) and test data X split into blocks (X1, Y1)/(X2, Y2) with corresponding test blocks X1/X2.]

6 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

7 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

8 The Region of Entropic Vectors Γ*_N. Joint entropy viewed as a vector: for N discrete random variables there is a (2^N − 1)-dimensional vector h = (h_A : A ⊆ N, A ≠ ∅) ∈ R^{2^N − 1} associated with them, called the entropic vector. For example: N = 2: h = (h_1, h_2, h_12); N = 3: h = (h_1, h_2, h_12, h_3, h_13, h_23, h_123). The region of entropic vectors is the set of all valid entropic vectors, Γ*_N = {h ∈ R^{2^N − 1} : h is entropic}. Example for two random variables: Γ*_2 is a polyhedral cone given by h_1 + h_2 ≥ h_12, h_12 ≥ h_1 and h_12 ≥ h_2.
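A minimal sketch (not from the thesis) of computing an entropic vector numerically, assuming the joint distribution is given as a dict from outcome tuples to probabilities:

```python
import itertools
import numpy as np

def entropy_bits(probs):
    """Shannon entropy in bits of a collection of probabilities (zeros ignored)."""
    p = np.asarray([q for q in probs if q > 0], dtype=float)
    return float(-(p * np.log2(p)).sum())

def entropic_vector(pmf):
    """Entropic vector of a joint pmf: maps each nonempty subset A of
    variable indices to the joint entropy h_A = H(X_A)."""
    n = len(next(iter(pmf)))
    h = {}
    for r in range(1, n + 1):
        for A in itertools.combinations(range(n), r):
            marginal = {}
            for outcome, p in pmf.items():
                key = tuple(outcome[i] for i in A)
                marginal[key] = marginal.get(key, 0.0) + p
            h[A] = entropy_bits(marginal.values())
    return h

# Two independent fair bits: h_1 = h_2 = 1 and h_12 = 2.
print(entropic_vector({(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}))
```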

9 The Region of Entropic Vectors Γ*_N. Shannon-type information inequalities define the Shannon outer bound Γ_N = {h ∈ R^{2^N − 1} : h_A + h_B ≥ h_{A∪B} + h_{A∩B} for all A, B ⊆ N; h_P ≥ h_Q ≥ 0 for all Q ⊆ P ⊆ N}. Relationship between Γ*_N and Γ_N: Γ*_2 = Γ_2 and the closure of Γ*_3 equals Γ_3, i.e. all information inequalities on N ≤ 3 variables are Shannon-type. [Figure: the polyhedral cone Γ_2 with apex at (0,0,0) and extreme rays through (1,0,1), (0,1,1) and (1,1,1), in the coordinates h_1 = H(X_1), h_2 = H(X_2), h_12 = H(X_1,X_2).] For N = 2, using h_1 + h_2 ≥ h_12, h_12 ≥ h_1 and h_12 ≥ h_2 we obtain Γ_2.

10 The Region of Entropic Vectors Γ*_N. The closures of Γ*_2 and Γ*_3 are polyhedral cones and fully characterized. The closure of Γ*_4 is a non-polyhedral but convex cone, so we use outer bounds and inner bounds to approximate it. Outer bounds for Γ*_N come from non-Shannon-type inequalities. [Figure: nested regions: the Shannon outer bound Γ_N, the Zhang-Yeung outer bound, a better outer bound, and the entropic region Γ*_N.] For N ≥ 4 the closure of Γ*_N is strictly contained in Γ_N, indicating that non-Shannon-type information inequalities exist for N ≥ 4. First non-Shannon-type information inequality (Zhang-Yeung 1): 2I(X_3;X_4) ≤ I(X_1;X_2) + I(X_1;X_3,X_4) + 3I(X_3;X_4|X_1) + I(X_3;X_4|X_2).
1 Zhen Zhang and Raymond W. Yeung, On Characterization of Entropy Function via Information Inequalities, IEEE Trans. on Information Theory, July 1998.

11 The Region of Entropic Vectors Γ*_N. Inner bound for Γ*_4 (N = 4): the Ingleton inequality Ingleton_ij ≥ 0, where Ingleton_ij = I(X_k;X_l|X_i) + I(X_k;X_l|X_j) + I(X_i;X_j) − I(X_k;X_l) and {k,l} = {1,2,3,4} \ {i,j}. The Ingleton inequality holds for the ranks (dimensions) of finite collections of subspaces of a linear space, meaning linear codes can only produce entropic vectors with Ingleton_ij ≥ 0; the corresponding region is the Ingleton inner bound. [Figure: the Shannon outer bound Γ_4, the Zhang-Yeung outer bound, a better outer bound, the entropic region Γ*_4, and the Ingleton inner bound.] How can we generate better inner bounds from distributions?

12 The Region of Entropic Vectors Γ*_N. Motivation of the main results in contributions 1 and 2: What type of distributions generate entropy vectors in the gap? (Distribution support enumeration.) What are the properties of distributions on the boundary and in the gap? (Information geometric characterization.) How can we generate better inner bounds from distributions? [Figure: the gap between the Shannon outer bound Γ_4 and the Ingleton inner bound Ingleton_ij ≥ 0.]

13 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

14 Enumerating k-atom supports and mapping them to the entropic region. References: Yunshu Liu and John MacLaren Walsh, Non-isomorphic Distribution Supports for Calculating Entropic Vectors, in 53rd Annual Allerton Conference on Communication, Control, and Computing, Oct. Yunshu Liu and John MacLaren Walsh, Mapping the Entropic Region with Support Enumeration & Information Geometry, submitted to IEEE Transactions on Information Theory on Dec.

15 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

16 Equivalence of k-atom Supports. An example of a 3-variable, 4-atom support with atom probabilities: (0, 0, 0) → α; (0, 2, 2) → β − α; (1, 0, 1) → γ − α; (1, 1, 3) → 1 + α − β − γ. (1) In distribution support (1), each row corresponds to one outcome (atom) and each column represents a variable. Assign variables from left to right as X_1, X_2 and X_3, so that X = (X_1, X_2, X_3). If the alphabet sizes are |X_1| = 2, |X_2| = 3 and |X_3| = 4, then X has 24 possible outcomes in total; among these, only 4 have non-zero probability, so we call this a 4-atom distribution support.

17 Equivalence of k-atom Supports. Equivalence of two k-atom supports: two k-atom supports S and S′, with |S| = |S′| = k, are said to be equivalent, for the purposes of tracing out the entropy region, if they yield the same set of entropic vectors, up to a permutation of the random variables. For example, the 4-atom support A = [(0, 0, 0); (0, 2, 2); (1, 0, 1); (1, 1, 3)] and the 4-atom support B = [(1, 2, 1); (3, 4, 1); (1, 3, 0); (2, 1, 0)] yield the same set of entropic vectors.

18 Equivalence of k-atom Supports. The outcomes of each variable in a k-atom support induce a set partition of the atoms, and entropies remain unchanged when the atoms are relabeled by the symmetric group S_k. Consider the 4-atom support in (1): (0, 0, 0), (0, 2, 2), (1, 0, 1), (1, 1, 3). The outcomes of X_1 can be encoded as {{1, 2}, {3, 4}}, the outcomes of X_2 as {{1, 3}, {2}, {4}}, and the outcomes of X_3 as {{1}, {2}, {3}, {4}}. How do we encode and find equivalent supports for multiple variables? (A small sketch of this encoding follows.)
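A minimal sketch (not from the thesis) of this encoding: each column of the support is mapped to the set partition of atom indices it induces.

```python
def column_to_partition(column):
    """Set partition of atom indices (1-based) induced by one variable's
    column: atoms sharing the same value fall in the same block."""
    blocks = {}
    for atom, value in enumerate(column, start=1):
        blocks.setdefault(value, []).append(atom)
    return sorted(blocks.values())

# Columns of the 4-atom support [(0,0,0), (0,2,2), (1,0,1), (1,1,3)]
support = [(0, 0, 0), (0, 2, 2), (1, 0, 1), (1, 1, 3)]
for j in range(3):
    column = [row[j] for row in support]
    print(f"X_{j + 1}:", column_to_partition(column))
# X_1: [[1, 2], [3, 4]], X_2: [[1, 3], [2], [4]], X_3: [[1], [2], [3], [4]]
```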

19 Orbit data structure. Orbit of an element under a group action: let a finite group Z act on a finite set V. For v ∈ V, the orbit of v under Z is defined as Z(v) = {zv : z ∈ Z}. A transversal T of the orbits under the group Z is a collection containing one canonical representative of each orbit, e.g. the least element under some ordering of V. [Figure: the set V partitioned into orbits Z(v), with the transversal picking one point from each orbit.]

20 Non-isomorphic k-atom supports: set partitions. A set partition of N_1^k = {1, ..., k} is a set B = {B_1, ..., B_t} consisting of t subsets B_1, ..., B_t of N_1^k that are pairwise disjoint, B_i ∩ B_j = ∅ for i ≠ j, and whose union is N_1^k, so that N_1^k = ∪_{i=1}^t B_i. Let Π(N_1^k) denote the set of all set partitions of N_1^k. The cardinality of Π(N_1^k) is the k-th Bell number. Examples of set partitions: Π(N_1^3) = {{{1, 2, 3}}, {{1, 2}, {3}}, {{1, 3}, {2}}, {{2, 3}, {1}}, {{1}, {2}, {3}}}; Π(N_1^4) = {{{1, 2, 3, 4}}, {{1, 2, 3}, {4}}, {{1, 2, 4}, {3}}, {{1, 3, 4}, {2}}, {{2, 3, 4}, {1}}, {{1, 2}, {3, 4}}, {{1, 3}, {2, 4}}, {{1, 4}, {2, 3}}, {{1, 2}, {3}, {4}}, {{1, 3}, {2}, {4}}, {{1, 4}, {2}, {3}}, {{2, 3}, {1}, {4}}, {{2, 4}, {1}, {3}}, {{3, 4}, {1}, {2}}, {{1}, {2}, {3}, {4}}}.
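A minimal sketch (not the thesis's enumeration machinery) that generates all set partitions recursively; the counts reproduce the Bell numbers.

```python
def set_partitions(elements):
    """Yield every set partition of the given list of elements."""
    if len(elements) <= 1:
        yield [list(elements)]
        return
    first, rest = elements[0], elements[1:]
    for partition in set_partitions(rest):
        # place `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or into a new block of its own
        yield [[first]] + partition

print(len(list(set_partitions([1, 2, 3]))))     # 5  (Bell number B_3)
print(len(list(set_partitions([1, 2, 3, 4]))))  # 15 (Bell number B_4)
```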

21 Non-isomorphic k-atom supports. Use combinations of N set partitions to represent distribution supports for N variables. Let Ξ_N be the collection of all sets of N set partitions of N_1^k whose meet is the finest partition (the set of singletons): Ξ_N := {ξ ⊆ Π(N_1^k) : |ξ| = N, ∧_{B ∈ ξ} B = {{1}, ..., {k}}}, (2) where the meet of two partitions B, B′ is defined as B ∧ B′ = {B_i ∩ B′_j : B_i ∈ B, B′_j ∈ B′, B_i ∩ B′_j ≠ ∅}. (A small sketch of the meet computation follows.)
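A minimal sketch (assumption: partitions are represented as lists of blocks) of the meet operation and the finest-partition check used in the definition of Ξ_N:

```python
def meet(partition_a, partition_b):
    """Meet of two set partitions: all nonempty pairwise intersections
    of a block of the first with a block of the second."""
    return sorted(
        sorted(set(block_a) & set(block_b))
        for block_a in partition_a
        for block_b in partition_b
        if set(block_a) & set(block_b)
    )

# Partitions induced by X_1 and X_2 of the 4-atom support in (1):
A = [[1, 2], [3, 4]]
B = [[1, 3], [2], [4]]
print(meet(A, B))   # [[1], [2], [3], [4]] is already the finest partition,
                    # so these two variables alone distinguish all four atoms.
```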

22 Non-isomorphic k-atom supports. Ξ_N represents all the distribution supports needed to calculate entropic vectors for N variables and k atoms; selecting one representative from each equivalence class gives a list of all non-isomorphic k-atom, N-variable supports. [Figure: the orbits of supports under S_k and their transversal.] The problem of generating the list of all non-isomorphic k-atom, N-variable supports is therefore equivalent to obtaining a transversal of the orbits of S_k acting on Ξ_N.

23 Non-isomorphic k-atom supports. Calculate the transversal via the algorithm of snakes and ladders 2. For a given set N_1^k, the algorithm of snakes and ladders first computes the transversal of Ξ_1, then recursively increases the subset size i, where the enumeration of Ξ_i depends on the result for Ξ_{i−1}: Ξ_1 → Ξ_2 → ... → Ξ_{N−1} → Ξ_N. [Table: number of non-isomorphic k-atom, N-variable supports for small N and k; the numerical entries were lost in transcription.]
2 Anton Betten, Michael Braun, Harald Fripertinger et al., Error-Correcting Linear Codes: Classification by Isometry and Applications, Springer, 2006.

24 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

25 Maximal Ingleton Violation and the Four Atom Conjecture. Ingleton score 3: given a probability distribution on four random variables, we define the Ingleton score to be Ingleton score = Ingleton_ij / H(X_i, X_j, X_k, X_l), (3) where Ingleton_ij = I(X_k;X_l|X_i) + I(X_k;X_l|X_j) + I(X_i;X_j) − I(X_k;X_l). The Ingleton score quantifies how much a distribution violates one of the six Ingleton inequalities.
3 Randall Dougherty, Chris Freiling, Kenneth Zeger, Non-Shannon Information Inequalities in Four Random Variables, arXiv: v1.

26 Maximal Ingleton Violation and the Four Atom Conjecture. Four Atom Conjecture: the lowest reported Ingleton score is approximately −0.089373, attained by the 4-atom distribution (0, 0, 0, 0) → 0.35; (0, 1, 1, 0) → 0.15; (1, 0, 1, 0) → 0.15; (1, 1, 1, 1) → 0.35. (4) An even lower Ingleton score was reported by F. Matúš and L. Csirmaz 4 through a transformation of entropic vectors, thus without a distribution associated with it.
4 F. Matus and L. Csirmaz, Entropy region and convolution, arXiv: v1.
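A quick numerical check (a sketch, not the thesis's code) that this distribution attains an Ingleton score of about −0.0894; here the score is taken as the most negative of the six Ingleton expressions divided by the joint entropy.

```python
import itertools
import numpy as np

PMF = {(0, 0, 0, 0): 0.35, (0, 1, 1, 0): 0.15, (1, 0, 1, 0): 0.15, (1, 1, 1, 1): 0.35}

def H(variables):
    """Joint entropy (bits) of the variables with the given indices under PMF."""
    marg = {}
    for outcome, p in PMF.items():
        key = tuple(outcome[v] for v in variables)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * np.log2(p) for p in marg.values() if p > 0)

def I(a, b, cond=()):
    """Conditional mutual information I(X_a; X_b | X_cond) in bits."""
    c = tuple(cond)
    return H((a,) + c) + H((b,) + c) - H((a, b) + c) - H(c)

def ingleton(i, j):
    k, l = [v for v in range(4) if v not in (i, j)]
    return I(k, l, (i,)) + I(k, l, (j,)) + I(i, j) - I(k, l)

score = min(ingleton(i, j) for i, j in itertools.combinations(range(4), 2))
score /= H((0, 1, 2, 3))
print(round(score, 6))   # approximately -0.089373
```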

27 Maximal Ingleton Violation and the Four Atom Conjecture. [Table: number of non-isomorphic supports for four variables versus the number of those that are Ingleton-violating, as a function of the number of atoms k; the numerical entries were lost in transcription.] The only Ingleton-violating 5-atom support that is not grown from the 4-atom support (4): (0, 0, 0, 0), (0, 0, 1, 1), (0, 1, 1, 0), (1, 0, 1, 0), (1, 1, 1, 0). (5)

28 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

29 Optimizing Inner Bounds to Entropy from k-atom Distributions. k-atom inner bound generation for four variables: randomly generate enough linear cost functions from the gap between the Shannon outer bound and the Ingleton inner bound; given a k-atom support, find the distribution that optimizes each of the cost functions and save it if it violates Ingleton; then take the convex hull of all the entropic vectors from the k-atom distributions that violate Ingleton_kl to get the k-atom inner bound. [Figure: the Shannon outer bound Γ_4 with random cost functions and the Ingleton cost function drawn in the gap.] A sketch of the per-support optimization step follows.
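A minimal numerical sketch of the per-support step under stated assumptions: a fixed 4-atom support, a linear cost vector over the 15 joint entropies, and an off-the-shelf constrained optimizer (SLSQP via scipy) with random restarts. This is only an illustration, not the thesis's optimization procedure.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

SUPPORT = [(0, 0, 0, 0), (0, 1, 1, 0), (1, 0, 1, 0), (1, 1, 1, 1)]  # support (4)
SUBSETS = [A for r in range(1, 5) for A in itertools.combinations(range(4), r)]

def entropic_vector(p):
    """15-dimensional entropic vector (bits) of the distribution p over SUPPORT."""
    h = []
    for A in SUBSETS:
        marg = {}
        for atom, pa in zip(SUPPORT, p):
            key = tuple(atom[i] for i in A)
            marg[key] = marg.get(key, 0.0) + pa
        h.append(-sum(q * np.log2(q) for q in marg.values() if q > 1e-12))
    return np.array(h)

def maximize_cost(cost, restarts=10, seed=0):
    """Maximize cost . h(p) over the probability simplex on SUPPORT."""
    rng = np.random.default_rng(seed)
    constraints = [{"type": "eq", "fun": lambda p: p.sum() - 1.0}]
    bounds = [(1e-9, 1.0)] * len(SUPPORT)
    best = None
    for _ in range(restarts):
        p0 = rng.dirichlet(np.ones(len(SUPPORT)))
        res = minimize(lambda p: -cost @ entropic_vector(p), p0,
                       bounds=bounds, constraints=constraints)
        if best is None or res.fun < best.fun:
            best = res
    return best.x, entropic_vector(best.x)

cost = np.random.default_rng(1).normal(size=len(SUBSETS))  # one random cost direction
p_opt, h_opt = maximize_cost(cost)
print(p_opt, h_opt[:3])
```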

30 Optimizing Inner Bounds to Entropy from k-atom Distributions. [Figure: a three-dimensional view comparing the Shannon outer bound Γ_4, a non-Shannon outer bound, the hyperplane Ingleton_34 = 0, and the inner bound generated from two points.] [Table: volume of inner and outer bounds as a percentage of the pyramid: the Shannon outer bound (100%), the outer bound from Dougherty et al. 5, the 4,5,6-atom inner bound, the 4,5-atom inner bound, the 4-atom inner bound, and the inner bound from the 4-atom-conjecture point only; the numerical entries were lost in transcription.]
5 Randall Dougherty, Chris Freiling, Kenneth Zeger, Non-Shannon Information Inequalities in Four Random Variables, arXiv: v1.

31 Optimizing Inner Bounds to Entropy from k-atom Distributions. Three-dimensional projection of the 4-atom inner bound (a 3-d projection introduced by F. Matúš and L. Csirmaz).

32 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

33 Characterization of the Entropic Region via Information Geometry. References: Yunshu Liu and John MacLaren Walsh, Bounding the entropic region via information geometry, in IEEE Information Theory Workshop, Sep. Yunshu Liu and John MacLaren Walsh, Mapping the Entropic Region with Support Enumeration & Information Geometry, submitted to IEEE Transactions on Information Theory on Dec.

34 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

35 Manifold of Probability Distributions. Example of two binary discrete random variables: let X = (X_1, X_2) with X_i ∈ {−1, 1}, and consider the family of all strictly positive probability distributions S = {p(x) : p(x) > 0, Σ_i p(X = C_i) = 1}. Outcomes: C_0 = (−1, −1), C_1 = (−1, 1), C_2 = (1, −1), C_3 = (1, 1). Parameterization: p(X = C_i) = η_i for i = 1, 2, 3 and p(X = C_0) = 1 − Σ_{j=1}^3 η_j. The vector η = (η_1, η_2, η_3) is called the m-coordinate of S. [Figure: the probability simplex in η-coordinates with vertices (0,0,0), (1,0,0), (0,1,0), (0,0,1).]

36 Manifold of Probability Distributions. Example of two binary discrete random variables: a new coordinate {θ_i} is defined as θ_i = ln( η_i / (1 − Σ_{j=1}^3 η_j) ) for i = 1, 2, 3. (6) The vector θ = (θ_1, θ_2, θ_3) is called the e-coordinate of S. [Figure: in η-coordinates the family S is the interior of the simplex; in θ-coordinates it is the whole space.]
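A minimal sketch (not from the thesis) of converting between the two coordinate systems for this four-outcome example:

```python
import numpy as np

def eta_to_theta(eta):
    """e-coordinates from m-coordinates: theta_i = ln(eta_i / p0), p0 = 1 - sum(eta)."""
    eta = np.asarray(eta, dtype=float)
    return np.log(eta / (1.0 - eta.sum()))

def theta_to_eta(theta):
    """Inverse map: eta_i = exp(theta_i) / (1 + sum_j exp(theta_j))."""
    e = np.exp(np.asarray(theta, dtype=float))
    return e / (1.0 + e.sum())

eta = np.array([0.25, 0.25, 0.25])   # the uniform distribution on the 4 outcomes
theta = eta_to_theta(eta)            # -> [0. 0. 0.]
print(theta, theta_to_eta(theta))    # the round trip recovers eta
```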

37 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

38 e-autoparallel and m-autoparallel submanifolds. e-autoparallel submanifold: a subfamily E is said to be an e-autoparallel submanifold of S if, for some coordinate λ, there exist a matrix A and a vector b such that the e-coordinate θ of S satisfies θ(p) = Aλ(p) + b for all p ∈ E. (7) If λ is one-dimensional, E is called an e-geodesic. An e-autoparallel submanifold of S is a hyperplane in e-coordinates; an e-geodesic of S is a straight line in e-coordinates. [Figure: E drawn as a hyperplane in θ-coordinates.]

39 e-autoparallel and m-autoparallel submanifolds. m-autoparallel submanifold: a subfamily M is said to be an m-autoparallel submanifold of S if, for some coordinate γ, there exist a matrix A and a vector b such that the m-coordinate η of S satisfies η(p) = Aγ(p) + b for all p ∈ M. (8) If γ is one-dimensional, M is called an m-geodesic.

40 e-autoparallel and m-autoparallel submanifolds. Two independent bits: the set of all product distributions is defined as E = {p(x) : p(x) = p_{X_1}(x_1) p_{X_2}(x_2)}, (9) with p(X_1 = 1) = p_1 and p(X_2 = 1) = p_2. Parameterization: define λ_i = (1/2) ln( p_i / (1 − p_i) ) for i = 1, 2. Then E is an e-autoparallel submanifold (a plane in θ-coordinates) of S: θ(p) = (θ_1, θ_2, θ_3)ᵀ = [[0, 2], [2, 0], [2, 2]] (λ_1, λ_2)ᵀ = Aλ(p).

41 e-autoparallel and m-autoparallel submanifolds. Two independent bits: E is an e-autoparallel submanifold but not an m-autoparallel submanifold. [Figures: E drawn in θ-coordinates (a plane) and in η-coordinates (not a plane).]

42 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

43 Projections and the Pythagorean Theorem. KL-divergence: on the manifold of probability mass functions for random variables taking values in the set X, we can define the Kullback-Leibler divergence, or relative entropy, measured in bits, according to D(p_X ‖ q_X) = Σ_{x ∈ X} p_X(x) log( p_X(x) / q_X(x) ). (10)

44 Projections and the Pythagorean Theorem. Information projection: let p_X be a point in S and let E be an e-autoparallel submanifold of S. The projection point Π_E(p_X) in E minimizes the KL-divergence: D(p_X ‖ Π_E(p_X)) = min_{q_X ∈ E} D(p_X ‖ q_X). (11) The m-geodesic connecting p_X and Π_E(p_X) is orthogonal to E at Π_E(p_X). In geometry, orthogonal usually means a right angle; in information geometry, orthogonal usually means that certain quantities are uncorrelated. [Figure: the m-geodesic from p_X meeting the e-autoparallel submanifold E orthogonally at Π_E(p_X).]

45 Projections and the Pythagorean Theorem. For q_X ∈ E, we have the following Pythagorean relation 6: D(p_X ‖ q_X) = D(p_X ‖ Π_E(p_X)) + D(Π_E(p_X) ‖ q_X) for all q_X ∈ E. (12) [Figure: the m-geodesic from p_X to Π_E(p_X) and a point q_X on the e-autoparallel submanifold E.]
6 Shun-ichi Amari and Hiroshi Nagaoka, Methods of Information Geometry, Oxford University Press, 2000.

46 Projections and the Pythagorean Theorem. From the Pythagorean relation to information inequalities: for S = {p(x) : p(x) = q_X(x_1, ..., x_N)}, consider the submanifolds Q = {p_X : p_X = q_{X_{A∪B}} · q_{X_{(A∪B)^c}}} and E = {p_X : p_X = q_{X_{A\B} | X_{A∩B}} · q_{X_{B\A} | X_{A∩B}} · q_{X_{A∩B}} · q_{X_{(A∪B)^c}}}. Q and E are e-autoparallel submanifolds of S with E ⊆ Q ⊆ S. Let p_Q and p_E denote the projections of p_X onto Q and E respectively. Pythagorean relation: D(p_X ‖ p_E) = D(p_X ‖ p_Q) + D(p_Q ‖ p_E). Markov chain in E: X_{A\B} ↔ X_{A∩B} ↔ X_{B\A}. Entropy submodularity: D(p_Q ‖ p_E) = h_A + h_B − h_{A∪B} − h_{A∩B} ≥ 0. [Figure: p_X with its projections p_Q and p_E onto the nested submanifolds E ⊆ Q ⊆ S, with legs (i), (ii), (iii) of the Pythagorean decomposition.]

47 Projections and the Pythagorean Theorem. From the Pythagorean relation to information inequalities, an example: for S = {p(x) : p(x) = q_X(x_1, x_2, x_3)}, consider the submanifolds Q = {p(x) : p(x) = q_{X_1 X_3}(x_1, x_3) q_{X_2}(x_2)} and E = {p(x) : p(x) = q_{X_1}(x_1) q_{X_2}(x_2) q_{X_3}(x_3)}. The projection of p_X onto Q is p_Q = p_{X_1 X_3}(x_1, x_3) p_{X_2}(x_2), and the projection of p_X onto E is p_E = p_{X_1}(x_1) p_{X_2}(x_2) p_{X_3}(x_3). Pythagorean relation: D(p_X ‖ p_E) = D(p_X ‖ p_Q) + D(p_Q ‖ p_E), i.e. (iii) = (i) + (ii), where (i) D(p_X ‖ p_Q) = h_13 + h_2 − h_123 ≥ 0, (ii) D(p_Q ‖ p_E) = h_1 + h_3 − h_13 ≥ 0, (iii) D(p_X ‖ p_E) = h_1 + h_2 + h_3 − h_123 ≥ 0.
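A minimal numerical check (not from the thesis) of these three identities and of the Pythagorean decomposition for a random joint distribution of three binary variables:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)   # random pmf of (X1, X2, X3)

def H(t):
    """Joint entropy in bits of a pmf array."""
    t = t[t > 0]
    return float(-(t * np.log2(t)).sum())

def D(a, b):
    """KL divergence in bits between two pmfs of the same shape."""
    m = a > 0
    return float((a[m] * np.log2(a[m] / b[m])).sum())

p13 = p.sum(axis=1)                          # p(x1, x3)
p1, p2, p3 = p.sum(axis=(1, 2)), p.sum(axis=(0, 2)), p.sum(axis=(0, 1))
pQ = np.einsum('ik,j->ijk', p13, p2)         # projection onto Q
pE = np.einsum('i,j,k->ijk', p1, p2, p3)     # projection onto E

h1, h2, h3, h13, h123 = H(p1), H(p2), H(p3), H(p13), H(p)
print(np.isclose(D(p, pQ), h13 + h2 - h123))         # identity (i)
print(np.isclose(D(pQ, pE), h1 + h3 - h13))          # identity (ii)
print(np.isclose(D(p, pE), D(p, pQ) + D(pQ, pE)))    # Pythagorean relation
```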

48 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

49 Characterization of 4-atom supports for four variables. Recall the 4-atom support in (4): (0,0,0,0), (0,1,1,0), (1,0,1,0), (1,1,1,1). [Figures: this 4-atom support shown in m-coordinates (left) and in e-coordinates (right), marking the line of distributions with the given 0.5 marginals, the DFZ four-atom-conjecture point (atom probabilities 0.35, 0.15, 0.15, 0.35), the uniform point (0.25 each), and the hyperplane E: I(X_3;X_4) = 0.] The three-dimensional picture parametrizes the atom probabilities p(0000), p(0110), p(1010) and p(1111) = 1 − p(0000) − p(0110) − p(1010); these also determine the marginal distributions of X_3 and X_4, since p(X_3 = 0) = p(0000) and p(X_4 = 0) = p(0000) + p(0110) + p(1010).

50 Characterization of 4-atom supports for four variables. Ingleton_34 = 0 is e-autoparallel, i.e. a hyperplane in θ-coordinates, for the given 4-atom support. [Figure: the e-coordinate plot of the support, showing the regions Ingleton_34 > 0 and Ingleton_34 < 0, the hyperplane Ingleton_34 = 0, the hyperplane E: I(X_3;X_4) = 0, the line of distributions with the given 0.5 marginals, the DFZ four-atom-conjecture point, and the uniform point.]

51 Characterization of 5-atom supports for four variables. Recall the 5-atom support in equation (5): (0,0,0,0), (0,0,1,1), (0,1,1,0), (1,0,1,0), (1,1,1,0). Fixing one e-coordinate θ_0 to an arbitrary value, we plot the resulting three-dimensional manifold. [Figure: level sets of Ingleton_34 for this support, e.g. Ingleton_34 = 0.1, 0, −0.01; the surface Ingleton_34 = 0 is no longer a hyperplane.] The result shows that the e-autoparallel property of Ingleton_34 = 0 is unique to the given 4-atom support.

52 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

53 Motivation. Conditional independence of discrete random variables and the region of entropic vectors, together with their applications: p-representable semimatroids; probabilistic modeling with graphical models and sum-product networks; calculating the network coding capacity region; determining limits of large coded distributed storage systems; finding theoretical lower bounds for secret sharing schemes; classification via information guided dataset partitioning; structure learning (e.g. protein-drug interaction); and communication (decoding LDPC codes).

54 From Conditional Independence to Probabilistic Models References Yunshu Liu and John MacLaren Walsh, Information Guided Dataset Partitioning for Supervised Learning, Conference on Uncertainty in Artificial Intelligence, submitted on March 01, 2016.

55 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

56 p-representable semimatroids. Let N = {1, 2, ..., N} and let S(N) be the family of all couples (ij|K), where K ⊆ N and i, j are singletons in N \ K. If we include the cases where i = j, there are, for example, 18 such couples for three variables and 56 such couples for N = 4. A subset L ⊆ S(N) is called a p-representable semimatroid (probabilistically representable) if there exists a system of N random variables X = {X_i}_{i ∈ N} such that L = {(ij|K) ∈ S(N) : X_i is conditionally independent of X_j given X_K, i.e. I(X_i;X_j|X_K) = 0}.
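A minimal sketch (not from the thesis) that extracts the conditional independence structure induced by a given joint distribution by testing I(X_i;X_j|X_K) = 0 for every couple with i ≠ j; the i = j couples (functional dependencies) are omitted for brevity, and indices in the output are 1-based.

```python
import itertools
import numpy as np

def joint_entropy(pmf, A):
    """H(X_A) in bits; pmf maps outcome tuples to probabilities."""
    marg = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[v] for v in A)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * np.log2(p) for p in marg.values() if p > 0)

def ci_couples(pmf, tol=1e-9):
    """All couples (i, j, K) with I(X_i; X_j | X_K) = 0 under pmf."""
    n = len(next(iter(pmf)))
    couples = []
    for r in range(n - 1):
        for K in itertools.combinations(range(n), r):
            rest = [v for v in range(n) if v not in K]
            for i, j in itertools.combinations(rest, 2):
                cmi = (joint_entropy(pmf, K + (i,)) + joint_entropy(pmf, K + (j,))
                       - joint_entropy(pmf, K + (i, j)) - joint_entropy(pmf, K))
                if abs(cmi) < tol:
                    couples.append((i + 1, j + 1, tuple(k + 1 for k in K)))
    return couples

# X1 a fair bit independent of X2, and X3 = X2: expect e.g. (1, 3, (2,)) and (1, 2, (3,)).
pmf = {(a, b, b): 0.25 for a in (0, 1) for b in (0, 1)}
print(ci_couples(pmf))
```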

57 p-representable semimatroids. František Matúš and Milan Studený found the minimal set of configurations of conditional independence that can occur within four discrete random variables. Milan Studený studied the minimal set of conditional independence structures that can be described by undirected graphs, directed acyclic graphs and chain graphs for four variables. František Matúš and Milan Studený pointed out that there is a huge difference between the above two results, meaning many conditional independence structures cannot be represented by traditional graph-based models.

58 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

59 Probabilistic Modeling: Bayesian networks. Bayesian networks (directed graphical models): a Bayesian network consists of a collection of probability distributions P over X = {X_1, ..., X_K} that factorize over a directed acyclic graph (DAG) in the following way: p(X) = p(X_1, ..., X_K) = Π_{k=1}^K p(X_k | pa_k), where pa_k denotes the direct parents of X_k in the DAG. Example of a Bayesian network: consider a joint distribution p(X) = p(X_1, X_2, X_3) over three variables; for the DAG with edges X_1 → X_2 and X_1 → X_3 we can write p(X_1, X_2, X_3) = p(X_1) p(X_2 | X_1) p(X_3 | X_1).

60 Probabilistic Modeling: limitations of graphical models. Problems with graphical models: they can only represent a limited number of conditional independence relations; inference is usually intractable beyond tree-structured graphs; and they cannot represent conditional independence relations that hold exclusively in certain contexts (context-specific independence).

61 Probabilistic Modeling: Sum-Product Networks. Sum-product network, definition: a sum-product network (SPN) is defined as follows. (1) A tractable univariate distribution is an SPN. (2) A product of SPNs with disjoint scopes is an SPN. (3) A weighted sum of SPNs with the same scope is an SPN, provided all weights are positive. (4) Nothing else is an SPN. Note 1: a univariate distribution is tractable iff its partition function and its mode can be computed in O(1) time. Note 2: the scope of an SPN is the set of variables that appear in it.

62 Probabilistic Modeling: Sum-Product Networks. Sum-product network representation: an SPN can be represented as a tree with univariate distributions as leaves, sums and products as internal nodes, and the edges from a sum node to its children labeled with the corresponding weights. [Figure: an SPN over X_1, X_2, X_3 with a root sum node (weights 0.8 and 0.2) over two product nodes of Bernoulli leaves.] Its network polynomial is P(x) = 0.8 (1.0 x_1 + 0.0 x̄_1)(0.3 x_2 + 0.7 x̄_2)(0.8 x_3 + 0.2 x̄_3) + 0.2 (0.0 x_1 + 1.0 x̄_1)(0.5 x_2 + 0.5 x̄_2)(0.9 x_3 + 0.1 x̄_3).
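A minimal sketch evaluating this network polynomial at complete assignments; the second coefficient of each leaf is assumed to be one minus the first, so that every leaf is a normalized Bernoulli.

```python
def spn_prob(x1, x2, x3):
    """Evaluate the example SPN at a complete 0/1 assignment of (X1, X2, X3)."""
    def leaf(p_one, x):
        # Bernoulli leaf: p_one * x + (1 - p_one) * (1 - x)
        return p_one * x + (1.0 - p_one) * (1 - x)
    branch1 = leaf(1.0, x1) * leaf(0.3, x2) * leaf(0.8, x3)
    branch2 = leaf(0.0, x1) * leaf(0.5, x2) * leaf(0.9, x3)
    return 0.8 * branch1 + 0.2 * branch2

# The eight assignment probabilities sum to one, so the SPN defines a valid pmf.
print(sum(spn_prob(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))
```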

63 Probabilistic Modeling: Sum-Product Networks. Sum-product network representation: SPNs can represent context-specific independence. [Figure: an SPN over X_1, X_2, X_3 whose two sum-branches use different factorizations, so that different independence relations hold in each context.]

64 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

65 Information guided dataset partitioning for supervised learning. What we learned from structure learning of sum-product networks. [Figure: the recursive structure-learning scheme: on the current block of data, try to establish approximate independence among the variables and return a product over the groups; if no independence is found, cluster similar instances and return a weighted sum (weights w_1, w_2, w_3) over the clusters; recurse on each piece.]

66 Information guided dataset partitioning. A supervised learning algorithm should generalize from the training data to unseen situations in a reasonable way. Partitioning data for supervised learning: for supervised learning on a discrete dataset, we have N feature variables X_i, i ∈ N, and a label variable Y. Denote by X_A and X_B the subsets of variables indexed by A ⊆ N and B ⊆ N. [Figure: context-specific independence of a dataset in a sum-product network: a two-branch sum node with weights w_1 and w_2, where one branch factorizes X_A independently of (X_{\A}, Y) and the other factorizes X_B independently of (X_{\B}, Y).]

67 Information guided dataset partitioning. Literature on divide-and-conquer machine learning algorithms, which are designed to work only with particular machine learning algorithms, e.g. Support Vector Machines 7 or Ridge Regression 8: 1: randomly divide the samples evenly and uniformly at random, or divide the samples through clustering; 2: compute a local cost function and build a model on each local partition of the data; 3: merge the local solutions directly, or combine them to yield a better initialization for the global problem.
7 Cho-Jui Hsieh, Si Si, Inderjit S. Dhillon, A Divide-and-Conquer Solver for Kernel Support Vector Machines, ICML.
8 Yuchen Zhang, John C. Duchi, Martin J. Wainwright, Divide and Conquer Kernel Ridge Regression, JMLR 2015.

68 Information guided dataset partitioning. Partitioning data for supervised learning. Task: divide both training and test data into blocks of instances and conquer them completely separately. Goal: generate classification performance comparable to that obtained on the unpartitioned data. Challenge: the huge number of possible partitions. [Figure: the original training data (X, Y) and test data X split into blocks (X1, Y1) with test block X1 and (X2, Y2) with test block X2.]

69 Information guided dataset partitioning. Our motivations for partitioning data for supervised learning: to be able to run machine learning algorithms on very large datasets in parallel without adding communication overhead; and, after finding multiple partition strategies giving comparable scores, to ensemble the resulting classifiers across the different partitioning strategies to get further performance improvements. Ranking

70 Information guided dataset partitioning for supervised learning. Partition by frequent values of feature variables: 1. use frequent values of feature variables as candidate partition criteria; 2. design score functions that calculate entropies and conditional mutual informations among the X_i, i ∈ N, and Y; 3. divide the dataset based on the partitions giving lower scores. Score function: consider a given K-partition of the data as a hidden variable C taking values c_1, c_2, ..., c_K, and choose the partition solving min_C S(C). (13)

71 Information guided dataset partitioning for supervised learning. Designing score functions with information measures: we want to minimize I(X_i;Y|C), to minimize I(X_i;Y|C = c_j), and to minimize I(X_i;Y|C = c_j) relative to I(X_i;Y); a larger H(C) is preferred to encourage a more even partition, and a larger H(Y|C) to make sure the partition is not based on the value of Y. This leads to S_0(C) = min_{i ∈ N} I(X_i;Y|C) / (H(Y|C) · H(C)).

72 Information guided dataset partitioning for supervised learning. S_1(C) = Σ_{c_j} min_{i ∈ N} I(X_i;Y|C = c_j) / (H(Y|C = c_j) · H(C)), and S_2(C) = Σ_{c_j} min_{i ∈ N} I(X_i;Y|C = c_j) / (I(X_i;Y) · H(Y|C = c_j) · H(C)). Furthermore, we can use the sum of the k smallest of the N context-specific conditional mutual informations: S_3(C, k) = Σ_{c_j} (Σ_{smallest k} I(X_i;Y|C = c_j)) / (H(C) · H(Y|C = c_j)) and S_4(C, k) = Σ_{c_j} (Σ_{smallest k} I(X_i;Y|C = c_j) / I(X_i;Y)) / (H(C) · (H(Y|C = c_j) + 0.1)). A sketch of computing one such score from data follows.
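A minimal sketch, with assumptions: empirical plug-in entropies and a small epsilon added to the denominator to avoid division by zero; the epsilon is not part of the thesis's definition. It computes the S_1 score for a candidate 2-partition.

```python
import numpy as np
from collections import Counter

def entropy_bits(labels):
    """Empirical Shannon entropy in bits of a sequence of discrete labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y):
    """Empirical I(X;Y) in bits from two equal-length label sequences."""
    return entropy_bits(x) + entropy_bits(y) - entropy_bits(list(zip(x, y)))

def score_S1(feature_columns, y, c, eps=1e-12):
    """S_1: for each cell c_j, the smallest context-specific mutual information
    min_i I(X_i; Y | C = c_j), normalized by H(Y | C = c_j) and H(C), summed over cells."""
    y, c = np.asarray(y), np.asarray(c)
    H_C = entropy_bits(c.tolist())
    total = 0.0
    for cj in np.unique(c):
        mask = c == cj
        H_Y_cj = entropy_bits(y[mask].tolist())
        mi = min(mutual_information(np.asarray(col)[mask].tolist(), y[mask].tolist())
                 for col in feature_columns)
        total += mi / (H_Y_cj * H_C + eps)
    return total

# Toy data: two features, a binary label, and a candidate 2-partition C.
X1 = [0, 0, 1, 1, 0, 1, 0, 1]
X2 = [0, 1, 0, 1, 1, 0, 1, 0]
Y  = [0, 0, 1, 1, 0, 1, 1, 0]
C  = [0, 0, 0, 0, 1, 1, 1, 1]
print(score_S1([X1, X2], Y, C))
```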

73 Information guided dataset partitioning for supervised learning. If only a 2-partition is allowed, C = {c_1, c_2}, the following score function can be used, where we calculate the root-mean-square error (RMSE) between the context-specific conditional mutual informations over all feature variables: S_5(C) = sqrt( Σ_{i ∈ N} [I(X_i;Y|C = c_1) − I(X_i;Y|C = c_2)]² / N ) / H(C).

74 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

75 Experiment 1: Click-Through Rate (CTR) prediction. Results are tested on one of the online advertising click-through rate (CTR) datasets from Kaggle. The task is to predict the probability that a user clicks an advertisement based on historical training data. The training data have 40 million instances and the test data have 4 million instances. [Table: feature-value pairs of the click-through rate data used as partition criteria, with the ratio of each value in the dataset: partition 1: app_id-ecad; partition 2: device_id-a99f214a; partition 3: site_category-50e219e; the remaining rows and the ratio column were lost in transcription.]

76 Experiment 1: Click-Through Rate (CTR) prediction. [Table: scores S_1(C), S_2(C), S_3(C,3), S_4(C,3) and S_5(C) of the different partitions; numerical entries lost in transcription.] [Table: classification results of the different partitions and of the original unpartitioned data under FTRL-Proximal and Factorization Machines, reported in log loss, where logloss(y, p) = −(1/N) Σ_{i=1}^N [y_i log(p_i) + (1 − y_i) log(1 − p_i)]; numerical entries lost in transcription.]

77 Experiment 1: Click-Through Rate (CTR) prediction

78 Experiment 2: Customer relationship prediction. Results are tested on the KDD Cup 2009 customer relationship prediction dataset. The task is to predict the probability that customers of a telecom company buy upgrades or add-ons. The results can be used to analyze customers' buying preferences, helping the company offer better customer service. [Table: frequent feature-value pairs of the customer relationship data used as partition criteria, with the ratio of each value: Var200-NaN, Var218-cJvF, Var212-NhsEn4L, Var211-L84s, Var205-09_Q, Var228-F2FyR07IdsN7I, Var193-RO, Var227-RAYp for partitions 1 through 8; the ratio column was lost in transcription.]

79 Experiment 2: Customer relationship prediction. [Table: scores S_1(C), S_2(C), S_3(C,10), S_4(C,10) and S_5(C) of the different partitions for the customer relationship data, together with the AUC of GBDT for each partition, the AUC of the original dataset without partitioning, and the AUC of the ensemble of the top-4 partition strategies; numerical entries lost in transcription.] AUC of GBDT: the Area Under the ROC Curve (AUC) is used as the evaluation metric, with Gradient Boosting Decision Trees as the evaluation algorithm.

80 Experiment 2: Customer relationship prediction

81 Publications. Yunshu Liu and John MacLaren Walsh, Mapping the Region of Entropic Vectors with Support Enumeration & Information Geometry, IEEE Trans. Inform. Theory, submitted on December 08. Yunshu Liu and John MacLaren Walsh, Information Guided Dataset Partitioning for Supervised Learning, Conference on Uncertainty in Artificial Intelligence, submitted on March 01, 2016. Yunshu Liu, John MacLaren Walsh, Only One Nonlinear Non-Shannon Inequality is Necessary for Four Variables, in Proceedings of the 2nd Int. Electronic Conference on Entropy and Its Applications, Nov. Yunshu Liu and John MacLaren Walsh, Non-isomorphic Distribution Supports for Calculating Entropic Vectors, in 53rd Annual Allerton Conference on Communication, Control, and Computing, Oct. Yunshu Liu, John MacLaren Walsh, Bounding the entropic region via information geometry, in IEEE Information Theory Workshop, Sep.

82 Summary. [Figures recapping the three contributions: the transversal of orbits used to enumerate non-isomorphic supports; the Shannon and non-Shannon outer bounds, the inner bound, and the hyperplane Ingleton_34 = 0; the information-geometric picture of the 4-atom support in e-coordinates with the DFZ four-atom-conjecture point, the uniform point, and the hyperplanes Ingleton_34 = 0 and I(X_3;X_4) = 0; and the information-guided partitioning of training and test data into blocks.]

83 Thanks! Questions?


More information

5 Mutual Information and Channel Capacity

5 Mutual Information and Channel Capacity 5 Mutual Information and Channel Capacity In Section 2, we have seen the use of a quantity called entropy to measure the amount of randomness in a random variable. In this section, we introduce several

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr 1 /

More information

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018 Data Mining CS57300 Purdue University Bruno Ribeiro February 8, 2018 Decision trees Why Trees? interpretable/intuitive, popular in medical applications because they mimic the way a doctor thinks model

More information

Algebraic matroids are almost entropic

Algebraic matroids are almost entropic accepted to Proceedings of the AMS June 28, 2017 Algebraic matroids are almost entropic František Matúš Abstract. Algebraic matroids capture properties of the algebraic dependence among elements of extension

More information

14 : Theory of Variational Inference: Inner and Outer Approximation

14 : Theory of Variational Inference: Inner and Outer Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2014 14 : Theory of Variational Inference: Inner and Outer Approximation Lecturer: Eric P. Xing Scribes: Yu-Hsin Kuo, Amos Ng 1 Introduction Last lecture

More information

A new computational approach for determining rate regions and optimal codes for coded networks

A new computational approach for determining rate regions and optimal codes for coded networks A new computational approach for determining rate regions and optimal codes for coded networks Congduan Li, Jayant Apte, John MacLaren Walsh, Steven Weber Drexel University, Dept. of ECE, Philadelphia,

More information

A computational approach for determining rate regions and codes using entropic vector bounds

A computational approach for determining rate regions and codes using entropic vector bounds 1 A computational approach for determining rate regions and codes using entropic vector bounds Congduan Li, John MacLaren Walsh, Steven Weber Drexel University, Dept. of ECE, Philadelphia, PA 19104, USA

More information

Capacity Region of the Permutation Channel

Capacity Region of the Permutation Channel Capacity Region of the Permutation Channel John MacLaren Walsh and Steven Weber Abstract We discuss the capacity region of a degraded broadcast channel (DBC) formed from a channel that randomly permutes

More information

Chapter 16. Structured Probabilistic Models for Deep Learning

Chapter 16. Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 1 Chapter 16 Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 2 Structured Probabilistic Models way of using graphs to describe

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

Network Combination Operations Preserving the Sufficiency of Linear Network Codes

Network Combination Operations Preserving the Sufficiency of Linear Network Codes Network Combination Operations Preserving the Sufficiency of Linear Network Codes Congduan Li, Steven Weber, John MacLaren Walsh ECE Department, Drexel University Philadelphia, PA 904 Abstract Operations

More information

Junction Tree, BP and Variational Methods

Junction Tree, BP and Variational Methods Junction Tree, BP and Variational Methods Adrian Weller MLSALT4 Lecture Feb 21, 2018 With thanks to David Sontag (MIT) and Tony Jebara (Columbia) for use of many slides and illustrations For more information,

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

10708 Graphical Models: Homework 2

10708 Graphical Models: Homework 2 10708 Graphical Models: Homework 2 Due Monday, March 18, beginning of class Feburary 27, 2013 Instructions: There are five questions (one for extra credit) on this assignment. There is a problem involves

More information

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information.

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information. L65 Dept. of Linguistics, Indiana University Fall 205 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission rate

More information

Dept. of Linguistics, Indiana University Fall 2015

Dept. of Linguistics, Indiana University Fall 2015 L645 Dept. of Linguistics, Indiana University Fall 2015 1 / 28 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission

More information

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check

More information

14 : Theory of Variational Inference: Inner and Outer Approximation

14 : Theory of Variational Inference: Inner and Outer Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 14 : Theory of Variational Inference: Inner and Outer Approximation Lecturer: Eric P. Xing Scribes: Maria Ryskina, Yen-Chia Hsu 1 Introduction

More information

11. Learning graphical models

11. Learning graphical models Learning graphical models 11-1 11. Learning graphical models Maximum likelihood Parameter learning Structural learning Learning partially observed graphical models Learning graphical models 11-2 statistical

More information

Upper Bounds on the Capacity of Binary Intermittent Communication

Upper Bounds on the Capacity of Binary Intermittent Communication Upper Bounds on the Capacity of Binary Intermittent Communication Mostafa Khoshnevisan and J. Nicholas Laneman Department of Electrical Engineering University of Notre Dame Notre Dame, Indiana 46556 Email:{mhoshne,

More information

PROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

PROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS PROBABILITY AND INFORMATION THEORY Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Probability space Rules of probability

More information

Lecture 1: Introduction, Entropy and ML estimation

Lecture 1: Introduction, Entropy and ML estimation 0-704: Information Processing and Learning Spring 202 Lecture : Introduction, Entropy and ML estimation Lecturer: Aarti Singh Scribes: Min Xu Disclaimer: These notes have not been subjected to the usual

More information

Submodularity in Machine Learning

Submodularity in Machine Learning Saifuddin Syed MLRG Summer 2016 1 / 39 What are submodular functions Outline 1 What are submodular functions Motivation Submodularity and Concavity Examples 2 Properties of submodular functions Submodularity

More information

CHAPTER-17. Decision Tree Induction

CHAPTER-17. Decision Tree Induction CHAPTER-17 Decision Tree Induction 17.1 Introduction 17.2 Attribute selection measure 17.3 Tree Pruning 17.4 Extracting Classification Rules from Decision Trees 17.5 Bayesian Classification 17.6 Bayes

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 5: Vector Data: Support Vector Machine Instructor: Yizhou Sun yzsun@cs.ucla.edu October 18, 2017 Homework 1 Announcements Due end of the day of this Thursday (11:59pm)

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Probabilistic and Logistic Circuits: A New Synthesis of Logic and Machine Learning

Probabilistic and Logistic Circuits: A New Synthesis of Logic and Machine Learning Probabilistic and Logistic Circuits: A New Synthesis of Logic and Machine Learning Guy Van den Broeck KULeuven Symposium Dec 12, 2018 Outline Learning Adding knowledge to deep learning Logistic circuits

More information

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro Decision Trees CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Classification without Models Well, partially without a model } Today: Decision Trees 2015 Bruno Ribeiro 2 3 Why Trees? } interpretable/intuitive,

More information

Chapter 2: Entropy and Mutual Information. University of Illinois at Chicago ECE 534, Natasha Devroye

Chapter 2: Entropy and Mutual Information. University of Illinois at Chicago ECE 534, Natasha Devroye Chapter 2: Entropy and Mutual Information Chapter 2 outline Definitions Entropy Joint entropy, conditional entropy Relative entropy, mutual information Chain rules Jensen s inequality Log-sum inequality

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

LECTURE 3. Last time:

LECTURE 3. Last time: LECTURE 3 Last time: Mutual Information. Convexity and concavity Jensen s inequality Information Inequality Data processing theorem Fano s Inequality Lecture outline Stochastic processes, Entropy rate

More information

The local equivalence of two distances between clusterings: the Misclassification Error metric and the χ 2 distance

The local equivalence of two distances between clusterings: the Misclassification Error metric and the χ 2 distance The local equivalence of two distances between clusterings: the Misclassification Error metric and the χ 2 distance Marina Meilă University of Washington Department of Statistics Box 354322 Seattle, WA

More information

Support Vector Machines, Kernel SVM

Support Vector Machines, Kernel SVM Support Vector Machines, Kernel SVM Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, 2017 1 / 40 Outline 1 Administration 2 Review of last lecture 3 SVM

More information

4.1 Notation and probability review

4.1 Notation and probability review Directed and undirected graphical models Fall 2015 Lecture 4 October 21st Lecturer: Simon Lacoste-Julien Scribe: Jaime Roquero, JieYing Wu 4.1 Notation and probability review 4.1.1 Notations Let us recall

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 9: Expectation Maximiation (EM) Algorithm, Learning in Undirected Graphical Models Some figures courtesy

More information

Bregman Divergences for Data Mining Meta-Algorithms

Bregman Divergences for Data Mining Meta-Algorithms p.1/?? Bregman Divergences for Data Mining Meta-Algorithms Joydeep Ghosh University of Texas at Austin ghosh@ece.utexas.edu Reflects joint work with Arindam Banerjee, Srujana Merugu, Inderjit Dhillon,

More information

Characteristic-Dependent Linear Rank Inequalities and Network Coding Applications

Characteristic-Dependent Linear Rank Inequalities and Network Coding Applications haracteristic-ependent Linear Rank Inequalities and Network oding pplications Randall ougherty, Eric Freiling, and Kenneth Zeger bstract Two characteristic-dependent linear rank inequalities are given

More information

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H.

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H. Appendix A Information Theory A.1 Entropy Shannon (Shanon, 1948) developed the concept of entropy to measure the uncertainty of a discrete random variable. Suppose X is a discrete random variable that

More information

Undirected Graphical Models

Undirected Graphical Models Undirected Graphical Models 1 Conditional Independence Graphs Let G = (V, E) be an undirected graph with vertex set V and edge set E, and let A, B, and C be subsets of vertices. We say that C separates

More information

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany Syllabus Fri. 21.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 27.10. (2) A.1 Linear Regression Fri. 3.11. (3) A.2 Linear Classification Fri. 10.11. (4) A.3 Regularization

More information

Hands-On Learning Theory Fall 2016, Lecture 3

Hands-On Learning Theory Fall 2016, Lecture 3 Hands-On Learning Theory Fall 016, Lecture 3 Jean Honorio jhonorio@purdue.edu 1 Information Theory First, we provide some information theory background. Definition 3.1 (Entropy). The entropy of a discrete

More information