Extremal Entropy: Information Geometry, Numerical Entropy Mapping, and Machine Learning Application of Associated Conditional Independences
1 Extremal Entropy: Information Geometry, Numerical Entropy Mapping, and Machine Learning Application of Associated Conditional Independences Ph.D. Dissertation Defense Yunshu Liu Adaptive Signal Processing and Information Theory Research Group ECE Department, Drexel University, Philadelphia, PA
2 Motivation. Conditional independence of discrete random variables and the region of entropic vectors, with applications: p-representable semimatroids; probabilistic modeling (graphical models, sum-product networks); calculating network coding capacity regions; determining limits of large coded distributed storage systems; finding theoretical lower bounds for secret sharing schemes; classification (information guided dataset partitioning); structure learning (protein-drug interaction); communication (decoding LDPC codes).
3 Main Contribution 1: Enumerating k-atom supports and mapping them to the entropic region. Introduces the concept of non-isomorphic distribution support enumeration for entropy vectors; generates the list of non-isomorphic k-atom, N-variable supports through the algorithm of snakes and ladders; optimizes entropy inner bounds from k-atom distributions. [Figures: orbit/transversal diagram; the Shannon outer bound, a non-Shannon outer bound, and an inner bound near the hyperplane Ingleton_34 = 0.]
4 Main Contribution 2: Characterization of the entropic region via Information Geometry. Characterizing information inequalities with information geometry: for submanifolds E ⊆ Q ⊆ S, the Pythagorean relation D(p_X||p_E) = D(p_X||p_Q) + D(p_Q||p_E) holds; the Markov chain in E is X_{A\B} ↔ X_{A∩B} ↔ X_{B\A}; and entropy submodularity follows from D(p_Q||p_E) = h_A + h_B − h_{A∪B} − h_{A∩B} ≥ 0. Also: mapping k-atom support distributions with information geometry. [Figure: e-coordinate picture of the 4-atom support with given marginals β = γ = 0.5, marking the DFZ four-atom-conjecture point (α = 0.35), the uniform 4-atom point (α = 0.25), the hyperplane Ingleton_34 = 0 separating Ingleton_34 > 0 from Ingleton_34 < 0, and the hyperplane E: I(X_3; X_4) = 0.]
5 Main Contribution 3: Information guided dataset partitioning. Exploits context-specific independence for supervised learning; designs information theoretic score functions to partition a dataset, splitting both training and test data into blocks such as (X_A, Y) and (X_B, Y) that are then handled separately.
6 Outline The Region of Entropic Vectors Γ*_N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning
8 The Region of Entropic Vectors Γ*_N. Joint entropy viewed as a vector: with N discrete random variables we associate the (2^N − 1)-dimensional vector h = (h_A : A ⊆ N) ∈ R^{2^N − 1} of all joint entropies, called the entropic vector. For example, N = 2: h = (h_1, h_2, h_12); N = 3: h = (h_1, h_2, h_3, h_12, h_13, h_23, h_123). The region of entropic vectors collects all valid entropic vectors: Γ*_N = {h ∈ R^{2^N − 1} | h is entropic}. Example for two random variables: Γ*_2 is the polyhedral cone h_1 + h_2 ≥ h_12, h_12 ≥ h_1, h_12 ≥ h_2.
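As a concrete check of the definition above, here is a small sketch (illustrative code, not the dissertation's) that computes the entropic vector of a joint pmf by brute-force marginalization:

```python
from itertools import chain, combinations
from math import log2

def entropic_vector(joint, n):
    """Entropic vector of a joint pmf {outcome tuple: probability}:
    one entry h[A] per nonempty subset A of the n variable indices."""
    def subset_entropy(subset):
        marg = {}
        for outcome, p in joint.items():
            key = tuple(outcome[i] for i in subset)
            marg[key] = marg.get(key, 0.0) + p
        return -sum(p * log2(p) for p in marg.values() if p > 0)
    subsets = chain.from_iterable(
        combinations(range(n), r) for r in range(1, n + 1))
    return {A: subset_entropy(A) for A in subsets}

# N = 2 example: X1 a fair bit and X2 = X1, so h1 = h2 = h12 = 1 bit
h = entropic_vector({(0, 0): 0.5, (1, 1): 0.5}, 2)
```

All three inequalities cutting out Γ*_2 (h_1 + h_2 ≥ h_12, h_12 ≥ h_1, h_12 ≥ h_2) can be verified directly on h.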
9 The Region of Entropic Vectors Γ*_N. Shannon-type information inequalities define the polyhedral cone Γ_N = {h ∈ R^{2^N − 1} | h_A + h_B ≥ h_{A∪B} + h_{A∩B} ∀ A, B ⊆ N; h_P ≥ h_Q ≥ 0 ∀ Q ⊆ P ⊆ N}. Relationship between Γ*_n and Γ_n: Γ*_2 = Γ_2 and the closure of Γ*_3 equals Γ_3, so all information inequalities on N ≤ 3 variables are Shannon-type. [Figure: for N = 2, the inequalities h_1 + h_2 ≥ h_12, h_12 ≥ h_1, and h_12 ≥ h_2 carve out Γ*_2 as the cone on the extreme rays through (0,1,1), (1,0,1), (1,1,1), with axes h_1 = H(X_1), h_2 = H(X_2), h_12 = H(X_1, X_2).]
10 The Region of Entropic Vectors Γ*_N. Γ*_2 and Γ*_3 are polyhedral cones (up to closure) and fully characterized; the closure of Γ*_4 is a non-polyhedral but convex cone, so we approximate it with outer and inner bounds. Outer bounds for Γ*_N: the Shannon outer bound Γ_N, the Zhang-Yeung outer bound, and successively better outer bounds built from non-Shannon-type inequalities. For N ≥ 4 we have Γ̄*_N ⊊ Γ_N, indicating that non-Shannon-type information inequalities exist for N ≥ 4. The first non-Shannon-type information inequality (Zhang-Yeung^1): 2I(X_3; X_4) ≤ I(X_1; X_2) + I(X_1; X_3, X_4) + 3I(X_3; X_4 | X_1) + I(X_3; X_4 | X_2). [1] Zhen Zhang and Raymond W. Yeung, On Characterization of Entropy Function via Information Inequalities, IEEE Trans. on Information Theory, July 1998.
11 The Region of Entropic Vectors Γ*_N. Inner bound for Γ*_4 (N = 4): the Ingleton inequality Ingleton_ij ≥ 0, where Ingleton_ij = I(X_k; X_l | X_i) + I(X_k; X_l | X_j) + I(X_i; X_j) − I(X_k; X_l). The Ingleton inequality holds for ranks (dimensions) of finite subsets of linear spaces, meaning linear codes always give Ingleton_ij ≥ 0; the resulting polyhedral cone is the Ingleton inner bound. Question: how to generate better inner bounds from distributions?
12 The Region of Entropic Vectors Γ*_N. Motivation for the main results in contributions 1 and 2: What type of distributions generate entropy vectors in the gap between the Ingleton inner bound and the Shannon outer bound? (distribution support enumeration) What are the properties of distributions on the boundary and in the gap? (information geometric characterization) How to generate better inner bounds from distributions?
14 Enumerating k-atom supports and map to Entropic Region. References: Yunshu Liu and John MacLaren Walsh, Non-isomorphic Distribution Supports for Calculating Entropic Vectors, 53rd Annual Allerton Conference on Communication, Control, and Computing, Oct. 2015; Yunshu Liu and John MacLaren Walsh, Mapping the Entropic Region with Support Enumeration & Information Geometry, submitted to the IEEE Transactions on Information Theory, Dec.
16 Equivalence of k-atom Supports. An example of a 3-variable, 4-atom support (1): atoms (0, 0, 0), (0, 2, 2), (1, 0, 1), (1, 1, 3) with probabilities α, β − α, γ − α, and 1 + α − β − γ respectively. In support (1), each row corresponds to one outcome (atom) and each column represents a variable; assign variables from left to right as X_1, X_2, X_3, so that X = (X_1, X_2, X_3). If |X_1| = 2, |X_2| = 3 and |X_3| = 4, then X has 24 possible outcomes in total; only 4 of them have non-zero probability, so we call this a 4-atom distribution support.
17 Equivalence of k-atom Supports. Two k-atom supports S, S′ with |S| = |S′| = k will be said to be equivalent, for the purposes of tracing out the entropy region, if they yield the same set of entropic vectors, up to a permutation of the random variables. Example: 4-atom support A [(0,0,0), (0,2,2), (1,0,1), (1,1,3)] and 4-atom support B [(1,2,1), (3,4,1), (1,3,0), (2,1,0)] yield the same set of entropic vectors.
18 Equivalence of k-atom Supports. The outcomes of each variable in a k-atom support form a set partition of the atoms, and entropy remains unchanged when set partitions are relabeled under the symmetric group S_k. Consider the 4-atom support (1): (0,0,0), (0,2,2), (1,0,1), (1,1,3). The outcomes of X_1 can be encoded as {{1, 2}, {3, 4}}, of X_2 as {{1, 3}, {2}, {4}}, and of X_3 as {{1}, {2}, {3}, {4}}. How do we encode and find equivalent supports for multiple variables?
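The encoding just described can be sketched directly (illustrative code, not the dissertation's):

```python
def column_partitions(support):
    """Encode each variable (column) of a k-atom support as a set partition
    of the atom indices {1, ..., k}, grouping atoms that share a value."""
    partitions = []
    for col in range(len(support[0])):
        blocks = {}
        for atom_idx, row in enumerate(support, start=1):
            blocks.setdefault(row[col], set()).add(atom_idx)
        # frozensets make a partition hashable and order-free
        partitions.append(frozenset(frozenset(b) for b in blocks.values()))
    return partitions

# the 4-atom support (1)
parts = column_partitions([(0, 0, 0), (0, 2, 2), (1, 0, 1), (1, 1, 3)])
# X1 -> {{1,2},{3,4}}, X2 -> {{1,3},{2},{4}}, X3 -> {{1},{2},{3},{4}}
```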
19 Orbit data structure. Orbit of elements under a group action: let a finite group Z act on a finite set V; for v ∈ V, the orbit of v under Z is defined as Z(v) = {zv | z ∈ Z}. A transversal T of the orbits under Z is a collection of canonical representatives of all the orbits, e.g. the least element of each orbit under some ordering of V.
20 Non-isomorphic k-atom supports. Set partitions: a set partition of N_1^k = {1, ..., k} is a set B = {B_1, ..., B_t} consisting of t subsets of N_1^k that are pairwise disjoint (B_i ∩ B_j = ∅, i ≠ j) and whose union is N_1^k, so that N_1^k = ∪_{i=1}^t B_i. Let Π(N_1^k) denote the set of all set partitions of N_1^k; the cardinality of Π(N_1^k) is commonly known as a Bell number. Examples: Π(N_1^3) = {{{1,2,3}}, {{1,2},{3}}, {{1,3},{2}}, {{2,3},{1}}, {{1},{2},{3}}}; Π(N_1^4) = {{{1,2,3,4}}, {{1,2,3},{4}}, {{1,2,4},{3}}, {{1,3,4},{2}}, {{2,3,4},{1}}, {{1,2},{3,4}}, {{1,3},{2,4}}, {{1,4},{2,3}}, {{1,2},{3},{4}}, {{1,3},{2},{4}}, {{1,4},{2},{3}}, {{2,3},{1},{4}}, {{2,4},{1},{3}}, {{3,4},{1},{2}}, {{1},{2},{3},{4}}}.
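The set partitions above are easy to enumerate recursively, and the counts reproduce the Bell numbers (a small sketch, not the dissertation's code):

```python
def set_partitions(k):
    """All set partitions of {1, ..., k}, built by inserting element k into
    each block of a partition of {1, ..., k-1} or into a new singleton."""
    if k == 0:
        return [[]]
    result = []
    for smaller in set_partitions(k - 1):
        for i in range(len(smaller)):
            result.append(smaller[:i] + [smaller[i] + [k]] + smaller[i + 1:])
        result.append(smaller + [[k]])
    return result

# |Pi(N_1^k)| follows the Bell numbers: 1, 2, 5, 15, 52, ...
bell = [len(set_partitions(k)) for k in range(1, 6)]
```

In particular |Π(N_1^3)| = 5 and |Π(N_1^4)| = 15, matching the two example lists.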
21 Non-isomorphic k-atom supports. Use combinations of N set partitions to represent distribution supports for N variables. Let Ξ_N be the collection of all sets of N set partitions of N_1^k whose meet is the finest partition (the set of singletons): Ξ_N := {ξ | ξ ⊆ Π(N_1^k), |ξ| = N, ∧_{B ∈ ξ} B = {{1}, ..., {k}}}. (2) Here the meet of two partitions B, B′ is defined as B ∧ B′ = {B_i ∩ B′_j | B_i ∈ B, B′_j ∈ B′, B_i ∩ B′_j ≠ ∅}.
22 Non-isomorphic k-atom supports. Ξ_N represents all the distribution supports needed to calculate entropic vectors for N variables and k atoms; selecting one representative from each equivalence class leads to a list of all non-isomorphic k-atom, N-variable supports. The problem of generating this list is equivalent to obtaining a transversal of the orbits of S_k acting on Ξ_N.
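For tiny k and N the transversal can be computed by brute force, which makes the orbit formulation concrete (the dissertation uses the far more scalable snakes-and-ladders algorithm; all helper names here are illustrative):

```python
from itertools import combinations, permutations

def set_partitions(k):
    """All set partitions of {1, ..., k}."""
    if k == 0:
        return [[]]
    out = []
    for smaller in set_partitions(k - 1):
        for i in range(len(smaller)):
            out.append(smaller[:i] + [smaller[i] + [k]] + smaller[i + 1:])
        out.append(smaller + [[k]])
    return out

def canon(partition):
    """Canonical, hashable form: sorted tuple of sorted blocks."""
    return tuple(sorted(tuple(sorted(b)) for b in partition))

def meet(p, q):
    """Meet of two partitions: all nonempty pairwise block intersections."""
    return canon([c for c in (set(a) & set(b) for a in p for b in q) if c])

def transversal(k, n):
    """One representative per orbit of S_k acting on Xi_n."""
    parts = [canon(p) for p in set_partitions(k)]
    finest = canon([[i] for i in range(1, k + 1)])
    xi = []
    for combo in combinations(parts, n):
        m = combo[0]
        for p in combo[1:]:
            m = meet(m, p)
        if m == finest:                       # the meet condition defining Xi_n
            xi.append(frozenset(combo))
    reps = set()
    for candidate in xi:
        orbit = set()
        for perm in permutations(range(1, k + 1)):
            relabel = dict(zip(range(1, k + 1), perm))
            orbit.add(frozenset(
                canon([[relabel[e] for e in block] for block in p])
                for p in candidate))
        reps.add(min(orbit, key=sorted))       # least element as representative
    return reps
```

For example, transversal(3, 2) yields three non-isomorphic 3-atom, 2-variable supports: the type {{1,2,3}}/{{1},{2},{3}}, the type {{1,2},{3}}/{{1,3},{2}}, and the type {{1,2},{3}}/{{1},{2},{3}}.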
23 Non-isomorphic k-atom supports. The transversal is calculated via the algorithm of snakes and ladders^2: for a given set N_1^k, the algorithm first computes the transversal of Ξ_1, then recursively increases the subset size i, where the enumeration of Ξ_i depends on the result for Ξ_{i−1}: Ξ_1 → Ξ_2 → ... → Ξ_{N−1} → Ξ_N. [Table: numbers of non-isomorphic k-atom, N-variable supports by N and k; the numeric entries were lost in transcription.] [2] Anton Betten, Michael Braun, Harald Fripertinger et al., Error-Correcting Linear Codes: Classification by Isometry and Applications, Springer, 2006.
25 Maximal Ingleton Violation and the Four Atom Conjecture. Ingleton score^3: given a probability distribution, we define the Ingleton score as Ingleton_score = Ingleton_ij / H(X_i, X_j, X_k, X_l), (3) where Ingleton_ij = I(X_k; X_l | X_i) + I(X_k; X_l | X_j) + I(X_i; X_j) − I(X_k; X_l). The Ingleton score measures how much a distribution violates one of the six Ingleton inequalities. [3] Randall Dougherty, Chris Freiling, Kenneth Zeger, Non-Shannon Information Inequalities in Four Random Variables, arXiv preprint, v1.
26 Maximal Ingleton Violation and the Four Atom Conjecture. Four Atom Conjecture: the lowest reported Ingleton score, approximately −0.08937, is attained by the 4-atom distribution (4): (0,0,0,0) with probability 0.35, (0,1,1,0) with 0.15, (1,0,1,0) with 0.15, and (1,1,1,1) with 0.35. An even lower Ingleton score was reported by F. Matúš and L. Csirmaz^4 through a transformation (convolution) of entropic vectors, and thus without a distribution attaining it. [4] F. Matúš and L. Csirmaz, Entropy region and convolution, arXiv preprint, v1.
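The score of distribution (4) can be checked numerically. A minimal sketch (my own helpers, not the dissertation's code), taking the score as the minimum of Ingleton_ij over the six pairs normalized by the joint entropy:

```python
from itertools import combinations
from math import log2

def entropy(joint, subset):
    """Entropy in bits of the marginal on the given variable indices."""
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in subset)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def cond_mi(joint, a, b, c=()):
    """I(X_a; X_b | X_c) computed via entropies."""
    A, B, C = set(a), set(b), set(c)
    h = lambda s: entropy(joint, sorted(s))
    return h(A | C) + h(B | C) - h(A | B | C) - h(C)

def ingleton_score(joint, n=4):
    """min over pairs {i,j} of Ingleton_ij, normalized by H(X_1, ..., X_n)."""
    best = float('inf')
    for i, j in combinations(range(n), 2):
        k, l = (m for m in range(n) if m not in (i, j))
        ing = (cond_mi(joint, [k], [l], [i]) + cond_mi(joint, [k], [l], [j])
               + cond_mi(joint, [i], [j]) - cond_mi(joint, [k], [l]))
        best = min(best, ing)
    return best / entropy(joint, range(n))

# the Four Atom Conjecture distribution (4)
p4 = {(0, 0, 0, 0): 0.35, (0, 1, 1, 0): 0.15,
      (1, 0, 1, 0): 0.15, (1, 1, 1, 1): 0.35}
score = ingleton_score(p4)      # approximately -0.0894
```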
27 Maximal Ingleton Violation and the Four Atom Conjecture. [Table: number of non-isomorphic Ingleton-violating supports for four variables, by number of atoms k; the counts were lost in transcription.] The only Ingleton-violating 5-atom support that is not grown from the 4-atom support (4): (0,0,0,0), (0,0,1,1), (0,1,1,0), (1,0,1,0), (1,1,1,0). (5)
29 Optimizing Inner Bounds to Entropy from k-atom Distributions. k-atom inner bound generation for four variables: (1) randomly generate enough cost functions from the gap between the Shannon outer bound and the Ingleton inner bound; (2) given a k-atom support, find the distribution that optimizes each cost function, saving it if it violates Ingleton; (3) take the convex hull of all the entropic vectors from the k-atom distributions that violate Ingleton_kl to obtain the k-atom inner bound.
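A toy version of this workflow (the actual procedure optimizes many random cost functions numerically across many supports; here we simply grid-search the atom probabilities of the fixed support (4) for the most negative normalized Ingleton_ij):

```python
from itertools import combinations, product
from math import log2

ATOMS = [(0, 0, 0, 0), (0, 1, 1, 0), (1, 0, 1, 0), (1, 1, 1, 1)]

def entropy(joint, subset):
    marg = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in subset)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

def cond_mi(joint, a, b, c=()):
    A, B, C = set(a), set(b), set(c)
    h = lambda s: entropy(joint, sorted(s))
    return h(A | C) + h(B | C) - h(A | B | C) - h(C)

best_score, best_weights = float('inf'), None
for ia, ib, ic in product(range(1, 18), repeat=3):   # weights on a 1/20 grid
    id_ = 20 - ia - ib - ic
    if id_ < 1:
        continue
    w = (ia / 20, ib / 20, ic / 20, id_ / 20)
    joint = dict(zip(ATOMS, w))
    h_all = entropy(joint, range(4))
    for i, j in combinations(range(4), 2):
        k, l = (m for m in range(4) if m not in (i, j))
        ing = (cond_mi(joint, [k], [l], [i]) + cond_mi(joint, [k], [l], [j])
               + cond_mi(joint, [i], [j]) - cond_mi(joint, [k], [l]))
        if ing / h_all < best_score:
            best_score, best_weights = ing / h_all, w
# the grid minimum lands near the conjectured optimum, score close to -0.0894
```

Replacing the grid with a numerical optimizer, and looping over all non-isomorphic supports, is the step that produces the inner bound.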
30 Optimizing Inner Bounds to Entropy from k-atom Distributions. [Figure/table: volumes of inner and outer bounds as a percent of the pyramid between the Shannon outer bound (100%) and the hyperplane Ingleton_34 = 0 — the outer bound from Dougherty et al.^5, the 4,5,6-atom inner bound, the 4,5-atom inner bound, the 4-atom inner bound, and the hull of the 4-atom-conjecture point only; the numeric percentages were lost in transcription.] [5] Randall Dougherty, Chris Freiling, Kenneth Zeger, Non-Shannon Information Inequalities in Four Random Variables, arXiv preprint, v1.
31 Optimizing Inner Bounds to Entropy from k-atom Distributions. Three-dimensional projection of the 4-atom inner bound (this 3-D projection was introduced by F. Matúš and L. Csirmaz^4).
33 Characterization of Entropic Region via Information Geometry. References: Yunshu Liu and John MacLaren Walsh, Bounding the entropic region via information geometry, IEEE Information Theory Workshop, Sep. 2013; Yunshu Liu and John MacLaren Walsh, Mapping the Entropic Region with Support Enumeration & Information Geometry, submitted to the IEEE Transactions on Information Theory, Dec.
35 Manifold of Probability distributions. Example of two binary discrete random variables: let X = (X_1, X_2) with X_i ∈ {−1, 1}, and consider the family of all strictly positive probability distributions S = {p(x) | p(x) > 0, Σ_i p(X = C_i) = 1}. Outcomes: C_0 = (−1, −1), C_1 = (−1, 1), C_2 = (1, −1), C_3 = (1, 1). Parameterization: p(X = C_i) = η_i for i = 1, 2, 3 and p(X = C_0) = 1 − Σ_{j=1}^3 η_j. The vector η = (η_1, η_2, η_3) is called the m-coordinate of S. [Figure: the probability simplex in η-coordinates, with vertices (1,0,0), (0,1,0), (0,0,1) and the origin (0,0,0).]
36 Manifold of Probability distributions. A new coordinate system {θ^i} is defined as θ^i = ln( η_i / (1 − Σ_{j=1}^3 η_j) ) for i = 1, 2, 3. (6) The vector θ = (θ^1, θ^2, θ^3) is called the e-coordinate of S. [Figure: the open simplex in η-coordinates maps onto the whole space in θ-coordinates.]
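Equation (6) and its inverse are easy to sketch (illustrative code):

```python
from math import exp, log

def eta_to_theta(eta):
    """e-coordinates from m-coordinates: theta_i = ln(eta_i / eta_0)."""
    eta0 = 1.0 - sum(eta)          # probability of the reference outcome C_0
    return [log(e / eta0) for e in eta]

def theta_to_eta(theta):
    """Inverse map: eta_i = exp(theta_i) / (1 + sum_j exp(theta_j))."""
    z = 1.0 + sum(exp(t) for t in theta)
    return [exp(t) / z for t in theta]

eta = [0.1, 0.2, 0.3]              # a point in the open simplex, eta0 = 0.4
theta = eta_to_theta(eta)
back = theta_to_eta(theta)
```

The interior of the simplex in η maps onto all of R^3 in θ, matching the two pictures on the slide.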
38 e-autoparallel submanifold and m-autoparallel submanifold. A subfamily E is said to be an e-autoparallel submanifold of S if for some coordinate λ there exist a matrix A and a vector b such that, for the e-coordinate θ of S, θ(p) = A λ(p) + b for all p ∈ E. (7) If λ is one-dimensional, E is called an e-geodesic. An e-autoparallel submanifold of S is a hyperplane in e-coordinates; an e-geodesic of S is a straight line in e-coordinates.
39 e-autoparallel submanifold and m-autoparallel submanifold. A subfamily M is said to be an m-autoparallel submanifold of S if for some coordinate γ there exist a matrix A and a vector b such that, for the m-coordinate η of S, η(p) = A γ(p) + b for all p ∈ M. (8) If γ is one-dimensional, M is called an m-geodesic.
40 e-autoparallel submanifold and m-autoparallel submanifold. Two independent bits: the set of all product distributions E = {p(x) | p(x) = p_{X_1}(x_1) p_{X_2}(x_2)} (9) with p(X_1 = 1) = p_1 and p(X_2 = 1) = p_2. Parameterization: define λ_i = (1/2) ln( p_i / (1 − p_i) ) for i = 1, 2. Then E is an e-autoparallel submanifold (a plane in θ-coordinates) of S: under the outcome labeling of slide 35, θ(p) = (2λ_2, 2λ_1, 2λ_1 + 2λ_2)^T = A λ(p) with A = [[0, 2], [2, 0], [2, 2]].
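A quick numerical check of this claim, under the labeling C_0 = (−1,−1), C_1 = (−1,1), C_2 = (1,−1), C_3 = (1,1) from slide 35 (the matrix A = [[0,2],[2,0],[2,2]], b = 0 is my reconstruction from that labeling, since the transcription garbled it):

```python
from math import log

def theta_of_product(p1, p2):
    """e-coordinates of the product pmf with p(X1=1)=p1, p(X2=1)=p2."""
    eta0 = (1 - p1) * (1 - p2)                       # outcome C0 = (-1,-1)
    eta = [(1 - p1) * p2, p1 * (1 - p2), p1 * p2]    # outcomes C1, C2, C3
    return [log(e / eta0) for e in eta]

def lam(p):
    return 0.5 * log(p / (1 - p))

p1, p2 = 0.3, 0.8
l1, l2 = lam(p1), lam(p2)
theta = theta_of_product(p1, p2)
predicted = [2 * l2, 2 * l1, 2 * l1 + 2 * l2]        # A @ (l1, l2), b = 0
```

The two vectors agree for every (p_1, p_2) in (0, 1)^2, i.e. the independence family really is a plane in e-coordinates.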
41 e-autoparallel submanifold and m-autoparallel submanifold. Two independent bits: E is an e-autoparallel submanifold but not an m-autoparallel submanifold. [Figure: E is a plane in θ-coordinates but a curved surface in η-coordinates.]
43 Projections and Pythagorean Theorem. KL-divergence: on the manifold of probability mass functions for random variables taking values in the set X, we can define the Kullback-Leibler divergence, or relative entropy, measured in bits, according to D(p_X || q_X) = Σ_{x ∈ X} p_X(x) log( p_X(x) / q_X(x) ). (10)
44 Projections and Pythagorean Theorem. Information projection: let p_X be a point in S and let E be an e-autoparallel submanifold of S. The projection point Π_E(p_X) ∈ E minimizes the KL-divergence: D(p_X || Π_E(p_X)) = min_{q_X ∈ E} D(p_X || q_X). (11) The m-geodesic connecting p_X and Π_E(p_X) is orthogonal to E at Π_E(p_X)^5. In geometry, orthogonal usually means a right angle; in information geometry, it usually means certain quantities are uncorrelated.
45 Projections and Pythagorean Theorem. For q_X ∈ E we have the following Pythagorean relation^6: D(p_X || q_X) = D(p_X || Π_E(p_X)) + D(Π_E(p_X) || q_X) for all q_X ∈ E. (12) [6] Shun-ichi Amari and Hiroshi Nagaoka, Methods of Information Geometry, Oxford University Press, 2000.
46 Projections and Pythagorean Theorem. From the Pythagorean relation to information inequalities: for S = {p(x) | p(x) = q_X(x_1, ..., x_N)}, consider the submanifolds Q = {p_X | p_X = q_{X_{A∪B}} q_{X_{(A∪B)^c}}} and E = {p_X | p_X = q_{X_{A\B} | X_{A∩B}} q_{X_{B\A} | X_{A∩B}} q_{X_{A∩B}} q_{X_{(A∪B)^c}}}. Q and E are e-autoparallel submanifolds of S with E ⊆ Q ⊆ S. Let p_Q and p_E be the projections of p_X onto Q and E respectively. Pythagorean relation: D(p_X || p_E) = D(p_X || p_Q) + D(p_Q || p_E); the Markov chain in E is X_{A\B} ↔ X_{A∩B} ↔ X_{B\A}; and entropy submodularity follows from D(p_Q || p_E) = h_A + h_B − h_{A∪B} − h_{A∩B} ≥ 0.
47 Projections and Pythagorean Theorem. From the Pythagorean relation to information inequalities, an example: for S = {p(x) | p(x) = q_X(x_1, x_2, x_3)}, consider the submanifolds Q = {p(x) | p(x) = q_{X_1 X_3}(x_1, x_3) q_{X_2}(x_2)} and E = {p(x) | p(x) = q_{X_1}(x_1) q_{X_2}(x_2) q_{X_3}(x_3)}. Projection of p_X onto Q: p_Q = p_{X_1 X_3}(x_1, x_3) p_{X_2}(x_2); projection onto E: p_E = p_{X_1}(x_1) p_{X_2}(x_2) p_{X_3}(x_3). Pythagorean relation, (iii) = (i) + (ii): (i) D(p_X || p_Q) = h_13 + h_2 − h_123 ≥ 0; (ii) D(p_Q || p_E) = h_1 + h_3 − h_13 ≥ 0; (iii) D(p_X || p_E) = h_1 + h_2 + h_3 − h_123 ≥ 0.
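This decomposition can be verified numerically; a self-contained sketch using an arbitrary strictly positive pmf on three bits (the projections onto Q and E are the corresponding products of marginals):

```python
from math import log2

def marginal(joint, subset):
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in subset)
        out[key] = out.get(key, 0.0) + p
    return out

def kl(p, q):
    """KL divergence in bits between pmfs on the same outcome set."""
    return sum(pv * log2(pv / q[x]) for x, pv in p.items() if pv > 0)

# arbitrary strictly positive pmf on (X1, X2, X3)
weights = [3, 1, 2, 2, 1, 3, 4, 4]
joint = {((i >> 2) & 1, (i >> 1) & 1, i & 1): w / 20
         for i, w in enumerate(weights)}

p13 = marginal(joint, (0, 2)); p2 = marginal(joint, (1,))
p1 = marginal(joint, (0,));    p3 = marginal(joint, (2,))
pQ = {x: p13[(x[0], x[2])] * p2[(x[1],)] for x in joint}          # onto Q
pE = {x: p1[(x[0],)] * p2[(x[1],)] * p3[(x[2],)] for x in joint}  # onto E

lhs = kl(joint, pE)               # (iii)
rhs = kl(joint, pQ) + kl(pQ, pE)  # (i) + (ii)
```

The identity lhs = rhs holds exactly for any joint pmf, which is precisely the Pythagorean relation of equation (12).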
49 Characterization of 4-atom supports for four variables. Recall the 4-atom support in (4): [(0,0,0,0), (0,1,1,0), (1,0,1,0), (1,1,1,1)], parameterized as p(0000) = α, p(0110) = β − α, p(1010) = γ − α, p(1111) = 1 + α − β − γ, with marginals p(X_3 = 0) = α and p(X_4 = 0) = β + γ − α. [Figures: the 4-atom support in m-coordinates (left) and e-coordinates (right), marking the line of distributions with given marginals β = γ = 0.5, the DFZ four-atom-conjecture point (α = 0.35, β = γ = 0.5), the uniform 4-atom point (α = 0.25, β = γ = 0.5), and the hyperplane E: I(X_3; X_4) = 0.]
50 Characterization of 4-atom supports for four variables. Ingleton_34 = 0 is e-autoparallel (a hyperplane in θ-coordinates) for the given 4-atom support. [Figure: in e-coordinates, the hyperplane Ingleton_34 = 0 separates the region Ingleton_34 > 0 from Ingleton_34 < 0; the line of distributions with marginals β = γ = 0.5 passes through the DFZ four-atom-conjecture point (α = 0.35) and the uniform 4-atom point (α = 0.25); the hyperplane E: I(X_3; X_4) = 0 is also shown.]
51 Characterization of 5-atom supports for four variables. Recall the 5-atom support in equation (5): [(0,0,0,0), (0,0,1,1), (0,1,1,0), (1,0,1,0), (1,1,1,0)]. Fixing one e-coordinate θ^0 to an arbitrary value, we plot the resulting 3-dimensional manifold. [Figure: level sets Ingleton_34 = 0.1, 0, −0.01, −0.02, which are curved rather than hyperplanes.] The result shows that the e-autoparallel property of Ingleton_34 = 0 is unique to the given 4-atom support.
54 From Conditional Independence to Probabilistic Models. References: Yunshu Liu and John MacLaren Walsh, Information Guided Dataset Partitioning for Supervised Learning, submitted to the Conference on Uncertainty in Artificial Intelligence, March 1, 2016.
56 p-representable semimatroids. Let N = {1, 2, ..., N} and let S(N) be the family of all couples (i, j | K), where K ⊆ N and i, j are singletons in N \ K. Including the cases where i = j, there are, for example, 18 such couples for three variables and 56 couples for N = 4. A subset L ⊆ S(N) is called a p-representable semimatroid (probabilistically representable) if there exists a system of N random variables X = {X_i}_{i ∈ N} such that L = {(i, j | K) ∈ S(N) | X_i is conditionally independent of X_j given X_K, i.e. I(X_i; X_j | X_K) = 0}.
57 p-representable semimatroids. František Matúš and Milan Studený found the minimal set of configurations of conditional independence that can occur within four discrete random variables. Milan Studený studied the minimal set of conditional independence structures that can be described by undirected graphs, directed acyclic graphs, and chain graphs on four variables. Matúš and Studený point out a huge difference between these two results: many conditional independence structures cannot be represented by traditional graph-based models.
59 Probabilistic Modeling: Bayesian networks. Bayesian networks (directed graphical models): a Bayesian network consists of a collection of probability distributions P over X = {X_1, ..., X_K} that factorize over a directed acyclic graph (DAG) as p(X) = p(X_1, ..., X_K) = Π_{k=1}^K p(X_k | pa_k), where pa_k denotes the direct parent nodes of X_k. Example: for a joint distribution p(X) = p(X_1, X_2, X_3) over three variables with X_1 the parent of both X_2 and X_3, we can write p(X_1, X_2, X_3) = p(X_1) p(X_2 | X_1) p(X_3 | X_1).
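A minimal numerical illustration of this factorization (all CPT numbers below are made up): build the joint for the DAG X_2 ← X_1 → X_3 and confirm the conditional independence it encodes, I(X_2; X_3 | X_1) = 0.

```python
from math import log2

p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_x3_given_x1 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.9, 1: 0.1}}

# joint from the Bayesian network factorization p(x1) p(x2|x1) p(x3|x1)
joint = {(x1, x2, x3): p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x1[x1][x3]
         for x1 in (0, 1) for x2 in (0, 1) for x3 in (0, 1)}

def entropy(j, subset):
    marg = {}
    for x, p in j.items():
        key = tuple(x[i] for i in subset)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * log2(p) for p in marg.values() if p > 0)

# I(X2; X3 | X1) = h_12 + h_13 - h_123 - h_1 should vanish for this DAG
cmi = (entropy(joint, (0, 1)) + entropy(joint, (0, 2))
       - entropy(joint, (0, 1, 2)) - entropy(joint, (0,)))
```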
60 Probabilistic Modeling: limitations of graphical models. Problems with graphical models: they can represent only a limited number of conditional independence relations; inference is usually intractable beyond tree-structured graphs; and they cannot represent conditional independence relations that hold exclusively in certain contexts.
61 Probabilistic Modeling: Sum-Product Networks. Definition: a sum-product network (SPN) is defined as follows. (1) A tractable univariate distribution is an SPN. (2) A product of SPNs with disjoint scopes is an SPN. (3) A weighted sum of SPNs with the same scope is an SPN, provided all weights are positive. (4) Nothing else is an SPN. Note 1: a univariate distribution is tractable iff its partition function and its mode can be computed in O(1) time. Note 2: the scope of an SPN is the set of variables that appear in it.
62 Probabilistic Modeling: Sum-Product Networks. Representation: an SPN can be represented as a tree with univariate distributions as leaves, sums and products as internal nodes, and the edges from a sum node to its children labeled with the corresponding weights. Example network polynomial (x_i is the indicator of X_i = 1 and x̄_i of X_i = 0; the complement coefficients follow because each leaf distribution sums to one): P(x) = 0.8 (1.0x_1 + 0.0x̄_1)(0.3x_2 + 0.7x̄_2)(0.8x_3 + 0.2x̄_3) + 0.2 (0.0x_1 + 1.0x̄_1)(0.5x_2 + 0.5x̄_2)(0.9x_3 + 0.1x̄_3).
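Evaluating this example's network polynomial directly (the complement coefficients are my reconstruction from the transcription, assuming each leaf distribution sums to one):

```python
from itertools import product

def spn(x1, x2, x3):
    """Network polynomial of the example SPN: a 0.8/0.2 weighted sum of two
    products of Bernoulli leaves, where leaf(x, p) = p*x + (1 - p)*xbar."""
    leaf = lambda x, p_one: p_one if x == 1 else 1.0 - p_one
    branch1 = leaf(x1, 1.0) * leaf(x2, 0.3) * leaf(x3, 0.8)
    branch2 = leaf(x1, 0.0) * leaf(x2, 0.5) * leaf(x3, 0.9)
    return 0.8 * branch1 + 0.2 * branch2

# summing over all assignments shows the SPN is already normalized
total = sum(spn(*x) for x in product((0, 1), repeat=3))
```

This bottom-up evaluation in time linear in the network size is what makes inference in SPNs tractable.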
63 Probabilistic Modeling: Sum-Product Networks. SPNs can represent context-specific independence. [Figure: an SPN over X_1, X_2, X_3 in which the independence structure among the variables differs between the two sum-node branches, i.e. depends on the context.]
65 Information guided dataset partitioning for supervised learning
What we learned from structure learning of Sum-product Networks:
[Figure: recursive structure learning. At each step, either establish approximate independence among the variables and recurse on each variable subset (a product node), or, if no independence is found, cluster similar instances and recurse on each cluster with mixture weights w_1, w_2, w_3 (a sum node).]
66 Figure: Context-Specific Independence of a dataset in a Sum-Product Network
Information guided dataset partitioning
Supervised learning algorithm: generalize from the training data to unseen situations in a reasonable way.
Partition data for supervised learning
For supervised learning on a discrete dataset, we have N feature variables X_i for i ∈ [N] and a label variable Y. Denote by X_A and X_B subsets of the feature variables, for A ⊆ [N] and B ⊆ [N].
[Figure: a sum node with weights w_1, w_2 splitting the instances so that X_A is approximately independent of (X_{[N]\A}, Y) in one block and X_B of (X_{[N]\B}, Y) in the other.]
67 Information guided dataset partitioning
Literature on divide-and-conquer machine learning algorithms, each designed to work only with a particular machine learning algorithm, e.g. Support Vector Machines [7] or Ridge Regression [8]:
1. Randomly divide samples evenly and uniformly at random, or divide samples through clustering.
2. Compute a local cost function and build a model on each local partition of the data.
3. Directly merge the local solutions together, or combine them to yield a better initialization for the global problem.
[7] Cho-Jui Hsieh, Si Si, Inderjit S. Dhillon, A Divide-and-Conquer Solver for Kernel Support Vector Machines, ICML 2014.
[8] Yuchen Zhang, John C. Duchi, Martin J. Wainwright, Divide and Conquer Kernel Ridge Regression, JMLR 2015.
68 Information guided dataset partitioning
Partition data for supervised learning
Task: divide both training and test data into blocks of instances and conquer them completely separately.
Goal: achieve classification performance comparable to that obtained on the unpartitioned data.
Challenge: huge number of possible partitions.
[Figure: the full dataset (Training X, Y; Test X) split into two blocks (Training X1, Y1; Test X1) and (Training X2, Y2; Test X2).]
69 Information guided dataset partitioning
Our motivations for partitioning data for supervised learning:
Run machine learning algorithms on very large datasets in parallel without adding communication overhead.
Find multiple partition strategies with comparable scores, then ensemble the resulting classifiers across the different partitioning strategies for further performance improvements.
70 Information guided dataset partitioning for supervised learning
Partition by frequent values of feature variables:
1. Use frequent values of feature variables as candidate partition criteria.
2. Design score functions that compute entropies and conditional mutual informations among the X_i for i ∈ [N] and Y.
3. Divide the dataset according to the partition with the lowest score.
Score function
Treat a given K-partition of the data as a hidden variable C taking values c_1, c_2, ..., c_K, and solve
min_C S(C)    (13)
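Step 1 above can be sketched as follows; the helper name `candidate_partitions` and the `min_ratio` frequency threshold are illustrative assumptions, not the dissertation's code. Each sufficiently frequent feature value v of feature f induces a candidate 2-partition C with c_1 = {f = v} and c_2 = {f ≠ v}:

```python
# Generate candidate 2-partitions from frequent feature values (sketch).
from collections import Counter

def candidate_partitions(features, min_ratio=0.3):
    """features: list of columns, one per feature variable.
    Returns (feature index, value, block assignment per instance)."""
    cands = []
    n = len(features[0])
    for f, col in enumerate(features):
        for v, cnt in Counter(col).items():
            if cnt / n >= min_ratio:               # v is frequent enough
                cands.append((f, v, [int(x != v) for x in col]))
    return cands

# Two candidates here: value 0 of feature 0 (ratio 0.75) and 'a' of feature 1 (0.5).
print(candidate_partitions([[0, 0, 0, 1], ["a", "b", "a", "c"]], min_ratio=0.5))
```

Each returned block assignment is then scored with S(C) from (13), and the lowest-scoring candidate is used to split the data.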
71 Information guided dataset partitioning for supervised learning
Designing score functions with information measures:
minimize I(X_i; Y | C)
minimize I(X_i; Y | C = c_j)
minimize I(X_i; Y | C = c_j) / I(X_i; Y)
larger H(C), to encourage a more even partition
larger H(Y | C), to make sure the partition is not based on the value of Y
S_0(C) = min_{i ∈ [N]} I(X_i; Y | C) / [H(Y | C) H(C)]
72 Information guided dataset partitioning for supervised learning
S_1(C) = Σ_{c_j} [min_{i ∈ [N]} I(X_i; Y | C = c_j)] / [H(Y | C = c_j) H(C)]
S_2(C) = Σ_{c_j} [min_{i ∈ [N]} I(X_i; Y | C = c_j) / I(X_i; Y)] / [H(Y | C = c_j) H(C)]
Furthermore, we can instead sum the k smallest of the N context-specific conditional mutual informations:
S_3(C, k) = Σ_{c_j} [Σ_{smallest k} I(X_i; Y | C = c_j)] / [H(C) H(Y | C = c_j)]
S_4(C, k) = Σ_{c_j} [Σ_{smallest k} I(X_i; Y | C = c_j) / I(X_i; Y)] / [H(C) (H(Y | C = c_j) + 0.1)]
73 Information guided dataset partitioning for supervised learning
If only a 2-partition is allowed, C = {c_1, c_2}, and the following score function can be used, where we compute the root-mean-square error (RMSE) of the difference in context-specific conditional mutual information over all feature variables:
S_5(C) = sqrt( (1/N) Σ_{i ∈ [N]} [I(X_i; Y | C = c_1) − I(X_i; Y | C = c_2)]² ) / H(C)
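The scores above can be computed from plug-in (empirical count) entropy estimates. Below is a hedged sketch of S_5 for a 2-partition; the helper names are mine, and the estimators are the naive maximum-likelihood ones rather than anything specified in the dissertation:

```python
# Plug-in estimate of the 2-partition score S_5 (sketch).
import math
from collections import Counter

def entropy(samples):
    n = len(samples)
    return -sum(c / n * math.log2(c / n) for c in Counter(samples).values())

def mutual_info(xs, ys):
    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def score_s5(features, labels, block_ids):
    """features: list of columns; block_ids assigns each instance to c_1 or c_2."""
    blocks = sorted(set(block_ids))
    assert len(blocks) == 2, "S_5 is defined for 2-partitions only"
    diffs = []
    for col in features:
        mi = []
        for b in blocks:                      # context-specific I(X_i; Y | C = b)
            idx = [i for i, c in enumerate(block_ids) if c == b]
            mi.append(mutual_info([col[i] for i in idx],
                                  [labels[i] for i in idx]))
        diffs.append((mi[0] - mi[1]) ** 2)
    rmse = math.sqrt(sum(diffs) / len(features))
    return rmse / entropy(block_ids)          # divide by H(C): favor even splits
```

Minimizing this score simultaneously favors blocks on which the feature-label dependence looks the same (small RMSE) and an even split (large H(C) in the denominator).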
74 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning
75 Experiment 1: Click-Through Rate (CTR) prediction
Results are reported on one of the online advertising click-through rate (CTR) datasets from Kaggle.
The task is to predict the probability that a user clicks an advertisement, based on historical training data.
The training data have 40 million instances; the test data have 4 million instances.
[Table: Feature-value pairs of click-through rate data, giving for each partition ID a frequent feature-value pair and its ratio in the dataset; legible pairs include app_id-ecad, device_id-a99f214a, and site_category-50e219e; remaining entries and ratios illegible.]
76 Experiment 1: Click-Through Rate (CTR) prediction
[Table: Scores S_1(C), S_2(C), S_3(C, 3), S_4(C, 3), S_5(C) of the different partitions; numeric entries illegible.]
[Table: Classification results of the different partitions and of the original unpartitioned data under FTRL-Proximal and Factorization Machines, reported in log loss, where logloss(y, p) = −(1/N) Σ_{i=1}^{N} [y_i log(p_i) + (1 − y_i) log(1 − p_i)].]
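Written out as code, the log-loss metric from the table is the negative average Bernoulli log-likelihood of the predicted click probabilities; the clipping constant `eps` below is a standard numerical safeguard I have added, not part of the slide:

```python
# Log loss (negative average Bernoulli log-likelihood) for binary labels.
import math

def log_loss(y_true, p_pred, eps=1e-15):
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)      # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

print(round(log_loss([1, 0], [0.9, 0.1]), 6))  # 0.105361: confident, correct predictions
```

Lower is better; a classifier that always predicts the empirical base rate gives H(Y) in nats, so comparing partitioned and unpartitioned log loss directly measures how much predictive information the split preserves.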
77 Experiment 1: Click-Through Rate(CTR) prediction
78 Experiment 2: Customer relationship prediction
Results are reported on the KDD Cup 2009 customer relationship prediction dataset.
The task is to predict the probability that telecom company users buy upgrades or add-ons.
The results can be used to analyze customers' buying preferences, helping the company offer better customer service.
[Table: Frequent feature-value pairs of customer relationship data, one per partition ID: 1 Var200-NaN, 2 Var218-cJvF, 3 Var212-NhsEn4L, 4 Var211-L84s, 5 Var205-09_Q, 6 Var228-F2FyR07IdsN7I, 7 Var193-RO, 8 Var227-RAYp; ratios illegible.]
79 Experiment 2: Customer relationship prediction
[Table: Scores S_1(C), S_2(C), S_3(C, 10), S_4(C, 10), S_5(C) and AUC of GBDT for the different partitions of the customer relationship data; numeric entries illegible.]
AUC of GBDT: Area Under the ROC Curve (AUC) as the evaluation metric, with Gradient Boosting Decision Trees as the evaluation algorithm.
AUC of the original dataset without partitioning: (value illegible).
Ensembling the prediction results from the top-4 partition strategies gives a further improved AUC.
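AUC, the metric used above, equals the probability that a randomly chosen positive instance is scored higher than a randomly chosen negative one, with ties counting one half. A minimal pairwise sketch (fine for small evaluations, though it is O(|pos|·|neg|) rather than the sort-based O(n log n) form):

```python
# AUC via its rank interpretation: fraction of (positive, negative) pairs
# ordered correctly, ties counted as half.

def auc(y_true, scores):
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2]))  # 0.75: one of four pairs misordered
```

Because AUC depends only on the ranking of scores, ensembling predictions from several partition strategies can improve it even when each block's classifier is calibrated differently.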
80 Experiment 2: Customer relationship prediction
81 Publications
Yunshu Liu and John MacLaren Walsh, Mapping the Region of Entropic Vectors with Support Enumeration & Information Geometry, IEEE Trans. Inform. Theory, submitted on December 08.
Yunshu Liu and John MacLaren Walsh, Information Guided Dataset Partitioning for Supervised Learning, Conference on Uncertainty in Artificial Intelligence, submitted on March 01.
Yunshu Liu and John MacLaren Walsh, Only One Nonlinear Non-Shannon Inequality is Necessary for Four Variables, in Proceedings of the 2nd Int. Electronic Conference on Entropy and Its Applications, Nov.
Yunshu Liu and John MacLaren Walsh, Non-isomorphic Distribution Supports for Calculating Entropic Vectors, in 53rd Annual Allerton Conference on Communication, Control, and Computing, Oct.
Yunshu Liu and John MacLaren Walsh, Bounding the Entropic Region via Information Geometry, in IEEE Information Theory Workshop, Sep.
82 Summary
[Figure: recap of the contributions — transversal orbits Z(·) for enumerating non-isomorphic k-atom, N-variable supports; the Shannon and non-Shannon outer bounds and inner bounds around the hyperplane Ingleton_34 = 0, including the DFZ four-atom-conjecture point (Ingleton_34 > 0) and the uniform four-atom distribution (Ingleton_34 < 0); the information-geometric picture with the hyperplane E: I(X3; X4) = 0; and information guided partitioning of (Training X, Y; Test X) into blocks (X1, Y1) and (X2, Y2).]
83 Thanks! Questions?
More information4.1 Notation and probability review
Directed and undirected graphical models Fall 2015 Lecture 4 October 21st Lecturer: Simon Lacoste-Julien Scribe: Jaime Roquero, JieYing Wu 4.1 Notation and probability review 4.1.1 Notations Let us recall
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 9: Expectation Maximiation (EM) Algorithm, Learning in Undirected Graphical Models Some figures courtesy
More informationBregman Divergences for Data Mining Meta-Algorithms
p.1/?? Bregman Divergences for Data Mining Meta-Algorithms Joydeep Ghosh University of Texas at Austin ghosh@ece.utexas.edu Reflects joint work with Arindam Banerjee, Srujana Merugu, Inderjit Dhillon,
More informationCharacteristic-Dependent Linear Rank Inequalities and Network Coding Applications
haracteristic-ependent Linear Rank Inequalities and Network oding pplications Randall ougherty, Eric Freiling, and Kenneth Zeger bstract Two characteristic-dependent linear rank inequalities are given
More information3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H.
Appendix A Information Theory A.1 Entropy Shannon (Shanon, 1948) developed the concept of entropy to measure the uncertainty of a discrete random variable. Suppose X is a discrete random variable that
More informationUndirected Graphical Models
Undirected Graphical Models 1 Conditional Independence Graphs Let G = (V, E) be an undirected graph with vertex set V and edge set E, and let A, B, and C be subsets of vertices. We say that C separates
More informationLars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
Syllabus Fri. 21.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 27.10. (2) A.1 Linear Regression Fri. 3.11. (3) A.2 Linear Classification Fri. 10.11. (4) A.3 Regularization
More informationHands-On Learning Theory Fall 2016, Lecture 3
Hands-On Learning Theory Fall 016, Lecture 3 Jean Honorio jhonorio@purdue.edu 1 Information Theory First, we provide some information theory background. Definition 3.1 (Entropy). The entropy of a discrete
More information