Extremal Entropy: Information Geometry, Numerical Entropy Mapping, and Machine Learning Application of Associated Conditional Independences


1 Extremal Entropy: Information Geometry, Numerical Entropy Mapping, and Machine Learning Application of Associated Conditional Independences Ph.D. Dissertation Defense Yunshu Liu Adaptive Signal Processing and Information Theory Research Group ECE Department, Drexel University, Philadelphia, PA

2 Motivation. Conditional independence of discrete random variables and the region of entropic vectors, together with their applications: p-representable semimatroids; probabilistic modeling with graphical models and sum-product networks; calculating the network coding capacity region; determining limits of large coded distributed storage systems; finding theoretical lower bounds for secret sharing schemes; classification via information guided dataset partitioning; structure learning (e.g. protein-drug interaction); and communication (decoding LDPC codes).

3 Main Contribution 1: Enumerating k-atom supports and mapping them to the entropic region. Introducing the concept of non-isomorphic distribution support enumeration for entropy vectors. [Figure: the transversal of the orbits Z(v) of supports under the symmetric group.] Generating the list of non-isomorphic k-atom, N-variable supports through the algorithm of snakes and ladders. Optimizing entropy inner bounds from k-atom distributions. [Figure: the Shannon outer bound Γ_4, a non-Shannon outer bound, the inner bound generated from distributions, and the hyperplane Ingleton_34 = 0.]

4 Main Contribution 2: Characterization of the entropic region via Information Geometry. Characterizing information inequalities with information geometry. [Figure: p_X and its projections p_Q and p_E onto nested e-autoparallel submanifolds E ⊆ Q ⊆ S.] Pythagorean relation: D(p_X ‖ p_E) = D(p_X ‖ p_Q) + D(p_Q ‖ p_E). Markov chain in E: X_{A\B} ↔ X_{A∩B} ↔ X_{B\A}. Entropy submodularity: D(p_Q ‖ p_E) = h_A + h_B − h_{A∪B} − h_{A∩B} ≥ 0. Mapping k-atom support distributions with information geometry. [Figure: the 4-atom support in e-coordinates, showing the line of distributions with the given 0.5 marginals, the DFZ four-atom-conjecture point (0.35, 0.15, 0.15, 0.35), the uniform point (0.25 each), the regions Ingleton_34 > 0 and Ingleton_34 < 0, the hyperplane Ingleton_34 = 0, and the hyperplane E: I(X_3;X_4) = 0.]

5 Main Contribution 3: Information guided dataset partitioning. Exploit context-specific independence for supervised learning. [Figure: a two-branch sum node with weights w_1 and w_2; one branch factorizes X_A independently of (X_{\A}, Y), the other factorizes X_B independently of (X_{\B}, Y).] Design information theoretic score functions to partition the dataset. [Figure: the original training data (X, Y) and test data X split into blocks (X1, Y1)/(X2, Y2) with corresponding test blocks X1/X2.]

6 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

7 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

8 The Region of Entropic Vectors Γ*_N. Joint entropy viewed as a vector: for N discrete random variables there is a (2^N − 1)-dimensional vector h = (h_A : A ⊆ N, A ≠ ∅) ∈ R^{2^N − 1} associated with them, called the entropic vector. For example: N = 2: h = (h_1, h_2, h_12); N = 3: h = (h_1, h_2, h_12, h_3, h_13, h_23, h_123). The region of entropic vectors is the set of all valid entropic vectors, Γ*_N = {h ∈ R^{2^N − 1} : h is entropic}. Example for two random variables: Γ*_2 is a polyhedral cone given by h_1 + h_2 ≥ h_12, h_12 ≥ h_1 and h_12 ≥ h_2.
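A minimal sketch (not from the thesis) of computing an entropic vector numerically, assuming the joint distribution is given as a dict from outcome tuples to probabilities:

```python
import itertools
import numpy as np

def entropy_bits(probs):
    """Shannon entropy in bits of a collection of probabilities (zeros ignored)."""
    p = np.asarray([q for q in probs if q > 0], dtype=float)
    return float(-(p * np.log2(p)).sum())

def entropic_vector(pmf):
    """Entropic vector of a joint pmf: maps each nonempty subset A of
    variable indices to the joint entropy h_A = H(X_A)."""
    n = len(next(iter(pmf)))
    h = {}
    for r in range(1, n + 1):
        for A in itertools.combinations(range(n), r):
            marginal = {}
            for outcome, p in pmf.items():
                key = tuple(outcome[i] for i in A)
                marginal[key] = marginal.get(key, 0.0) + p
            h[A] = entropy_bits(marginal.values())
    return h

# Two independent fair bits: h_1 = h_2 = 1 and h_12 = 2.
print(entropic_vector({(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}))
```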

9 The Region of Entropic Vectors Γ*_N. Shannon-type information inequalities define the Shannon outer bound Γ_N = {h ∈ R^{2^N − 1} : h_A + h_B ≥ h_{A∪B} + h_{A∩B} for all A, B ⊆ N; h_P ≥ h_Q ≥ 0 for all Q ⊆ P ⊆ N}. Relationship between Γ*_N and Γ_N: Γ*_2 = Γ_2 and the closure of Γ*_3 equals Γ_3, i.e. all information inequalities on N ≤ 3 variables are Shannon-type. [Figure: the polyhedral cone Γ_2 with apex at (0,0,0) and extreme rays through (1,0,1), (0,1,1) and (1,1,1), in the coordinates h_1 = H(X_1), h_2 = H(X_2), h_12 = H(X_1,X_2).] For N = 2, using h_1 + h_2 ≥ h_12, h_12 ≥ h_1 and h_12 ≥ h_2 we obtain Γ_2.

10 The Region of Entropic Vectors Γ*_N. The closures of Γ*_2 and Γ*_3 are polyhedral cones and fully characterized. The closure of Γ*_4 is a non-polyhedral but convex cone, so we use outer bounds and inner bounds to approximate it. Outer bounds for Γ*_N come from non-Shannon-type inequalities. [Figure: nested regions: the Shannon outer bound Γ_N, the Zhang-Yeung outer bound, a better outer bound, and the entropic region Γ*_N.] For N ≥ 4 the closure of Γ*_N is strictly contained in Γ_N, indicating that non-Shannon-type information inequalities exist for N ≥ 4. First non-Shannon-type information inequality (Zhang-Yeung 1): 2I(X_3;X_4) ≤ I(X_1;X_2) + I(X_1;X_3,X_4) + 3I(X_3;X_4|X_1) + I(X_3;X_4|X_2).
1 Zhen Zhang and Raymond W. Yeung, On Characterization of Entropy Function via Information Inequalities, IEEE Trans. on Information Theory, July 1998.

11 The Region of Entropic Vectors Γ*_N. Inner bound for Γ*_4 (N = 4): the Ingleton inequality Ingleton_ij ≥ 0, where Ingleton_ij = I(X_k;X_l|X_i) + I(X_k;X_l|X_j) + I(X_i;X_j) − I(X_k;X_l) and {k,l} = {1,2,3,4} \ {i,j}. The Ingleton inequality holds for the ranks (dimensions) of finite collections of subspaces of a linear space, meaning linear codes can only produce entropic vectors with Ingleton_ij ≥ 0; the corresponding region is the Ingleton inner bound. [Figure: the Shannon outer bound Γ_4, the Zhang-Yeung outer bound, a better outer bound, the entropic region Γ*_4, and the Ingleton inner bound.] How can we generate better inner bounds from distributions?

12 The Region of Entropic Vectors Γ*_N. Motivation of the main results in contributions 1 and 2: What type of distributions generate entropy vectors in the gap? (Distribution support enumeration.) What are the properties of distributions on the boundary and in the gap? (Information geometric characterization.) How can we generate better inner bounds from distributions? [Figure: the gap between the Shannon outer bound Γ_4 and the Ingleton inner bound Ingleton_ij ≥ 0.]

13 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

14 Enumerating k-atom supports and mapping them to the entropic region. References: Yunshu Liu and John MacLaren Walsh, Non-isomorphic Distribution Supports for Calculating Entropic Vectors, in 53rd Annual Allerton Conference on Communication, Control, and Computing, Oct. Yunshu Liu and John MacLaren Walsh, Mapping the Entropic Region with Support Enumeration & Information Geometry, submitted to IEEE Transactions on Information Theory on Dec.

15 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

16 Equivalence of k-atom Supports. An example of a 3-variable, 4-atom support with atom probabilities: (0, 0, 0) → α; (0, 2, 2) → β − α; (1, 0, 1) → γ − α; (1, 1, 3) → 1 + α − β − γ. (1) In distribution support (1), each row corresponds to one outcome (atom) and each column represents a variable. Assign variables from left to right as X_1, X_2 and X_3, so that X = (X_1, X_2, X_3). If the alphabet sizes are |X_1| = 2, |X_2| = 3 and |X_3| = 4, then X has 24 possible outcomes in total; among these, only 4 have non-zero probability, so we call this a 4-atom distribution support.

17 Equivalence of k-atom Supports. Equivalence of two k-atom supports: two k-atom supports S and S′, with |S| = |S′| = k, are said to be equivalent, for the purposes of tracing out the entropy region, if they yield the same set of entropic vectors, up to a permutation of the random variables. For example, the 4-atom support A = [(0, 0, 0); (0, 2, 2); (1, 0, 1); (1, 1, 3)] and the 4-atom support B = [(1, 2, 1); (3, 4, 1); (1, 3, 0); (2, 1, 0)] yield the same set of entropic vectors.

18 Equivalence of k-atom Supports. The outcomes of each variable in a k-atom support induce a set partition of the atoms, and entropies remain unchanged when the atoms are relabeled by the symmetric group S_k. Consider the 4-atom support in (1): (0, 0, 0), (0, 2, 2), (1, 0, 1), (1, 1, 3). The outcomes of X_1 can be encoded as {{1, 2}, {3, 4}}, the outcomes of X_2 as {{1, 3}, {2}, {4}}, and the outcomes of X_3 as {{1}, {2}, {3}, {4}}. How do we encode and find equivalent supports for multiple variables? (A small sketch of this encoding follows.)
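A minimal sketch (not from the thesis) of this encoding: each column of the support is mapped to the set partition of atom indices it induces.

```python
def column_to_partition(column):
    """Set partition of atom indices (1-based) induced by one variable's
    column: atoms sharing the same value fall in the same block."""
    blocks = {}
    for atom, value in enumerate(column, start=1):
        blocks.setdefault(value, []).append(atom)
    return sorted(blocks.values())

# Columns of the 4-atom support [(0,0,0), (0,2,2), (1,0,1), (1,1,3)]
support = [(0, 0, 0), (0, 2, 2), (1, 0, 1), (1, 1, 3)]
for j in range(3):
    column = [row[j] for row in support]
    print(f"X_{j + 1}:", column_to_partition(column))
# X_1: [[1, 2], [3, 4]], X_2: [[1, 3], [2], [4]], X_3: [[1], [2], [3], [4]]
```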

19 Orbit data structure. Orbit of an element under a group action: let a finite group Z act on a finite set V. For v ∈ V, the orbit of v under Z is defined as Z(v) = {zv : z ∈ Z}. A transversal T of the orbits under the group Z is a collection containing one canonical representative of each orbit, e.g. the least element under some ordering of V. [Figure: the set V partitioned into orbits Z(v), with the transversal picking one point from each orbit.]

20 Non-isomorphic k-atom supports: set partitions. A set partition of N_1^k = {1, ..., k} is a set B = {B_1, ..., B_t} consisting of t subsets B_1, ..., B_t of N_1^k that are pairwise disjoint, B_i ∩ B_j = ∅ for i ≠ j, and whose union is N_1^k, so that N_1^k = ∪_{i=1}^t B_i. Let Π(N_1^k) denote the set of all set partitions of N_1^k. The cardinality of Π(N_1^k) is the k-th Bell number. Examples of set partitions: Π(N_1^3) = {{{1, 2, 3}}, {{1, 2}, {3}}, {{1, 3}, {2}}, {{2, 3}, {1}}, {{1}, {2}, {3}}}; Π(N_1^4) = {{{1, 2, 3, 4}}, {{1, 2, 3}, {4}}, {{1, 2, 4}, {3}}, {{1, 3, 4}, {2}}, {{2, 3, 4}, {1}}, {{1, 2}, {3, 4}}, {{1, 3}, {2, 4}}, {{1, 4}, {2, 3}}, {{1, 2}, {3}, {4}}, {{1, 3}, {2}, {4}}, {{1, 4}, {2}, {3}}, {{2, 3}, {1}, {4}}, {{2, 4}, {1}, {3}}, {{3, 4}, {1}, {2}}, {{1}, {2}, {3}, {4}}}.
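A minimal sketch (not the thesis's enumeration machinery) that generates all set partitions recursively; the counts reproduce the Bell numbers.

```python
def set_partitions(elements):
    """Yield every set partition of the given list of elements."""
    if len(elements) <= 1:
        yield [list(elements)]
        return
    first, rest = elements[0], elements[1:]
    for partition in set_partitions(rest):
        # place `first` into each existing block in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or into a new block of its own
        yield [[first]] + partition

print(len(list(set_partitions([1, 2, 3]))))     # 5  (Bell number B_3)
print(len(list(set_partitions([1, 2, 3, 4]))))  # 15 (Bell number B_4)
```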

21 Non-isomorphic k-atom supports. Use combinations of N set partitions to represent distribution supports for N variables. Let Ξ_N be the collection of all sets of N set partitions of N_1^k whose meet is the finest partition (the set of singletons): Ξ_N := {ξ ⊆ Π(N_1^k) : |ξ| = N, ∧_{B ∈ ξ} B = {{1}, ..., {k}}}, (2) where the meet of two partitions B, B′ is defined as B ∧ B′ = {B_i ∩ B′_j : B_i ∈ B, B′_j ∈ B′, B_i ∩ B′_j ≠ ∅}. (A small sketch of the meet computation follows.)
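A minimal sketch (assumption: partitions are represented as lists of blocks) of the meet operation and the finest-partition check used in the definition of Ξ_N:

```python
def meet(partition_a, partition_b):
    """Meet of two set partitions: all nonempty pairwise intersections
    of a block of the first with a block of the second."""
    return sorted(
        sorted(set(block_a) & set(block_b))
        for block_a in partition_a
        for block_b in partition_b
        if set(block_a) & set(block_b)
    )

# Partitions induced by X_1 and X_2 of the 4-atom support in (1):
A = [[1, 2], [3, 4]]
B = [[1, 3], [2], [4]]
print(meet(A, B))   # [[1], [2], [3], [4]] is already the finest partition,
                    # so these two variables alone distinguish all four atoms.
```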

22 Non-isomorphic k-atom supports. Ξ_N represents all the distribution supports needed to calculate entropic vectors for N variables and k atoms; selecting one representative from each equivalence class gives a list of all non-isomorphic k-atom, N-variable supports. [Figure: the orbits of supports under S_k and their transversal.] The problem of generating the list of all non-isomorphic k-atom, N-variable supports is therefore equivalent to obtaining a transversal of the orbits of S_k acting on Ξ_N.

23 Non-isomorphic k-atom supports. Calculate the transversal via the algorithm of snakes and ladders 2. For a given set N_1^k, the algorithm of snakes and ladders first computes the transversal of Ξ_1, then recursively increases the subset size i, where the enumeration of Ξ_i depends on the result for Ξ_{i−1}: Ξ_1 → Ξ_2 → ... → Ξ_{N−1} → Ξ_N. [Table: number of non-isomorphic k-atom, N-variable supports for small N and k; the numerical entries were lost in transcription.]
2 Anton Betten, Michael Braun, Harald Fripertinger et al., Error-Correcting Linear Codes: Classification by Isometry and Applications, Springer, 2006.

24 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

25 Maximal Ingleton Violation and the Four Atom Conjecture. Ingleton score 3: given a probability distribution on four random variables, we define the Ingleton score to be Ingleton score = Ingleton_ij / H(X_i, X_j, X_k, X_l), (3) where Ingleton_ij = I(X_k;X_l|X_i) + I(X_k;X_l|X_j) + I(X_i;X_j) − I(X_k;X_l). The Ingleton score quantifies how much a distribution violates one of the six Ingleton inequalities.
3 Randall Dougherty, Chris Freiling, Kenneth Zeger, Non-Shannon Information Inequalities in Four Random Variables, arXiv: v1.

26 Maximal Ingleton Violation and the Four Atom Conjecture. Four Atom Conjecture: the lowest reported Ingleton score is approximately −0.089373, attained by the 4-atom distribution (0, 0, 0, 0) → 0.35; (0, 1, 1, 0) → 0.15; (1, 0, 1, 0) → 0.15; (1, 1, 1, 1) → 0.35. (4) An even lower Ingleton score was reported by F. Matúš and L. Csirmaz 4 through a transformation of entropic vectors, thus without a distribution associated with it.
4 F. Matus and L. Csirmaz, Entropy region and convolution, arXiv: v1.
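A quick numerical check (a sketch, not the thesis's code) that this distribution attains an Ingleton score of about −0.0894; here the score is taken as the most negative of the six Ingleton expressions divided by the joint entropy.

```python
import itertools
import numpy as np

PMF = {(0, 0, 0, 0): 0.35, (0, 1, 1, 0): 0.15, (1, 0, 1, 0): 0.15, (1, 1, 1, 1): 0.35}

def H(variables):
    """Joint entropy (bits) of the variables with the given indices under PMF."""
    marg = {}
    for outcome, p in PMF.items():
        key = tuple(outcome[v] for v in variables)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * np.log2(p) for p in marg.values() if p > 0)

def I(a, b, cond=()):
    """Conditional mutual information I(X_a; X_b | X_cond) in bits."""
    c = tuple(cond)
    return H((a,) + c) + H((b,) + c) - H((a, b) + c) - H(c)

def ingleton(i, j):
    k, l = [v for v in range(4) if v not in (i, j)]
    return I(k, l, (i,)) + I(k, l, (j,)) + I(i, j) - I(k, l)

score = min(ingleton(i, j) for i, j in itertools.combinations(range(4), 2))
score /= H((0, 1, 2, 3))
print(round(score, 6))   # approximately -0.089373
```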

27 Maximal Ingleton Violation and the Four Atom Conjecture. [Table: number of non-isomorphic supports for four variables versus the number of those that are Ingleton-violating, as a function of the number of atoms k; the numerical entries were lost in transcription.] The only Ingleton-violating 5-atom support that is not grown from the 4-atom support (4): (0, 0, 0, 0), (0, 0, 1, 1), (0, 1, 1, 0), (1, 0, 1, 0), (1, 1, 1, 0). (5)

28 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

29 Optimizing Inner Bounds to Entropy from k-atom Distributions. k-atom inner bound generation for four variables: randomly generate enough linear cost functions from the gap between the Shannon outer bound and the Ingleton inner bound; given a k-atom support, find the distribution that optimizes each of the cost functions and save it if it violates Ingleton; then take the convex hull of all the entropic vectors from the k-atom distributions that violate Ingleton_kl to get the k-atom inner bound. [Figure: the Shannon outer bound Γ_4 with random cost functions and the Ingleton cost function drawn in the gap.] A sketch of the per-support optimization step follows.
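A minimal numerical sketch of the per-support step under stated assumptions: a fixed 4-atom support, a linear cost vector over the 15 joint entropies, and an off-the-shelf constrained optimizer (SLSQP via scipy) with random restarts. This is only an illustration, not the thesis's optimization procedure.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

SUPPORT = [(0, 0, 0, 0), (0, 1, 1, 0), (1, 0, 1, 0), (1, 1, 1, 1)]  # support (4)
SUBSETS = [A for r in range(1, 5) for A in itertools.combinations(range(4), r)]

def entropic_vector(p):
    """15-dimensional entropic vector (bits) of the distribution p over SUPPORT."""
    h = []
    for A in SUBSETS:
        marg = {}
        for atom, pa in zip(SUPPORT, p):
            key = tuple(atom[i] for i in A)
            marg[key] = marg.get(key, 0.0) + pa
        h.append(-sum(q * np.log2(q) for q in marg.values() if q > 1e-12))
    return np.array(h)

def maximize_cost(cost, restarts=10, seed=0):
    """Maximize cost . h(p) over the probability simplex on SUPPORT."""
    rng = np.random.default_rng(seed)
    constraints = [{"type": "eq", "fun": lambda p: p.sum() - 1.0}]
    bounds = [(1e-9, 1.0)] * len(SUPPORT)
    best = None
    for _ in range(restarts):
        p0 = rng.dirichlet(np.ones(len(SUPPORT)))
        res = minimize(lambda p: -cost @ entropic_vector(p), p0,
                       bounds=bounds, constraints=constraints)
        if best is None or res.fun < best.fun:
            best = res
    return best.x, entropic_vector(best.x)

cost = np.random.default_rng(1).normal(size=len(SUBSETS))  # one random cost direction
p_opt, h_opt = maximize_cost(cost)
print(p_opt, h_opt[:3])
```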

30 Optimizing Inner Bounds to Entropy from k-atom Distributions. [Figure: a three-dimensional view comparing the Shannon outer bound Γ_4, a non-Shannon outer bound, the hyperplane Ingleton_34 = 0, and the inner bound generated from two points.] [Table: volume of inner and outer bounds as a percentage of the pyramid: the Shannon outer bound (100%), the outer bound from Dougherty et al. 5, the 4,5,6-atom inner bound, the 4,5-atom inner bound, the 4-atom inner bound, and the inner bound from the 4-atom-conjecture point only; the numerical entries were lost in transcription.]
5 Randall Dougherty, Chris Freiling, Kenneth Zeger, Non-Shannon Information Inequalities in Four Random Variables, arXiv: v1.

31 Optimizing Inner Bounds to Entropy from k-atom Distributions. Three-dimensional projection of the 4-atom inner bound (a 3-d projection introduced by F. Matúš and L. Csirmaz).

32 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

33 Characterization of the Entropic Region via Information Geometry. References: Yunshu Liu and John MacLaren Walsh, Bounding the entropic region via information geometry, in IEEE Information Theory Workshop, Sep. Yunshu Liu and John MacLaren Walsh, Mapping the Entropic Region with Support Enumeration & Information Geometry, submitted to IEEE Transactions on Information Theory on Dec.

34 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

35 Manifold of Probability Distributions. Example of two binary discrete random variables: let X = (X_1, X_2) with X_i ∈ {−1, 1}, and consider the family of all strictly positive probability distributions S = {p(x) : p(x) > 0, Σ_i p(X = C_i) = 1}. Outcomes: C_0 = (−1, −1), C_1 = (−1, 1), C_2 = (1, −1), C_3 = (1, 1). Parameterization: p(X = C_i) = η_i for i = 1, 2, 3 and p(X = C_0) = 1 − Σ_{j=1}^3 η_j. The vector η = (η_1, η_2, η_3) is called the m-coordinate of S. [Figure: the probability simplex in η-coordinates with vertices (0,0,0), (1,0,0), (0,1,0), (0,0,1).]

36 Manifold of Probability Distributions. Example of two binary discrete random variables: a new coordinate {θ_i} is defined as θ_i = ln( η_i / (1 − Σ_{j=1}^3 η_j) ) for i = 1, 2, 3. (6) The vector θ = (θ_1, θ_2, θ_3) is called the e-coordinate of S. [Figure: in η-coordinates the family S is the interior of the simplex; in θ-coordinates it is the whole space.]
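A minimal sketch (not from the thesis) of converting between the two coordinate systems for this four-outcome example:

```python
import numpy as np

def eta_to_theta(eta):
    """e-coordinates from m-coordinates: theta_i = ln(eta_i / p0), p0 = 1 - sum(eta)."""
    eta = np.asarray(eta, dtype=float)
    return np.log(eta / (1.0 - eta.sum()))

def theta_to_eta(theta):
    """Inverse map: eta_i = exp(theta_i) / (1 + sum_j exp(theta_j))."""
    e = np.exp(np.asarray(theta, dtype=float))
    return e / (1.0 + e.sum())

eta = np.array([0.25, 0.25, 0.25])   # the uniform distribution on the 4 outcomes
theta = eta_to_theta(eta)            # -> [0. 0. 0.]
print(theta, theta_to_eta(theta))    # the round trip recovers eta
```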

37 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

38 e-autoparallel and m-autoparallel submanifolds. e-autoparallel submanifold: a subfamily E is said to be an e-autoparallel submanifold of S if, for some coordinate λ, there exist a matrix A and a vector b such that the e-coordinate θ of S satisfies θ(p) = Aλ(p) + b for all p ∈ E. (7) If λ is one-dimensional, E is called an e-geodesic. An e-autoparallel submanifold of S is a hyperplane in e-coordinates; an e-geodesic of S is a straight line in e-coordinates. [Figure: E drawn as a hyperplane in θ-coordinates.]

39 e-autoparallel and m-autoparallel submanifolds. m-autoparallel submanifold: a subfamily M is said to be an m-autoparallel submanifold of S if, for some coordinate γ, there exist a matrix A and a vector b such that the m-coordinate η of S satisfies η(p) = Aγ(p) + b for all p ∈ M. (8) If γ is one-dimensional, M is called an m-geodesic.

40 e-autoparallel and m-autoparallel submanifolds. Two independent bits: the set of all product distributions is defined as E = {p(x) : p(x) = p_{X_1}(x_1) p_{X_2}(x_2)}, (9) with p(X_1 = 1) = p_1 and p(X_2 = 1) = p_2. Parameterization: define λ_i = (1/2) ln( p_i / (1 − p_i) ) for i = 1, 2. Then E is an e-autoparallel submanifold (a plane in θ-coordinates) of S: θ(p) = (θ_1, θ_2, θ_3)ᵀ = [[0, 2], [2, 0], [2, 2]] (λ_1, λ_2)ᵀ = Aλ(p).

41 e-autoparallel and m-autoparallel submanifolds. Two independent bits: E is an e-autoparallel submanifold but not an m-autoparallel submanifold. [Figures: E drawn in θ-coordinates (a plane) and in η-coordinates (not a plane).]

42 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

43 Projections and the Pythagorean Theorem. KL-divergence: on the manifold of probability mass functions for random variables taking values in the set X, we can define the Kullback-Leibler divergence, or relative entropy, measured in bits, according to D(p_X ‖ q_X) = Σ_{x ∈ X} p_X(x) log( p_X(x) / q_X(x) ). (10)

44 Projections and the Pythagorean Theorem. Information projection: let p_X be a point in S and let E be an e-autoparallel submanifold of S. The projection point Π_E(p_X) in E minimizes the KL-divergence: D(p_X ‖ Π_E(p_X)) = min_{q_X ∈ E} D(p_X ‖ q_X). (11) The m-geodesic connecting p_X and Π_E(p_X) is orthogonal to E at Π_E(p_X). In geometry, orthogonal usually means a right angle; in information geometry, orthogonal usually means that certain quantities are uncorrelated. [Figure: the m-geodesic from p_X meeting the e-autoparallel submanifold E orthogonally at Π_E(p_X).]

45 Projections and the Pythagorean Theorem. For q_X ∈ E, we have the following Pythagorean relation 6: D(p_X ‖ q_X) = D(p_X ‖ Π_E(p_X)) + D(Π_E(p_X) ‖ q_X) for all q_X ∈ E. (12) [Figure: the m-geodesic from p_X to Π_E(p_X) and a point q_X on the e-autoparallel submanifold E.]
6 Shun-ichi Amari and Hiroshi Nagaoka, Methods of Information Geometry, Oxford University Press, 2000.

46 Projections and the Pythagorean Theorem. From the Pythagorean relation to information inequalities: for S = {p(x) : p(x) = q_X(x_1, ..., x_N)}, consider the submanifolds Q = {p_X : p_X = q_{X_{A∪B}} · q_{X_{(A∪B)^c}}} and E = {p_X : p_X = q_{X_{A\B} | X_{A∩B}} · q_{X_{B\A} | X_{A∩B}} · q_{X_{A∩B}} · q_{X_{(A∪B)^c}}}. Q and E are e-autoparallel submanifolds of S with E ⊆ Q ⊆ S. Let p_Q and p_E denote the projections of p_X onto Q and E respectively. Pythagorean relation: D(p_X ‖ p_E) = D(p_X ‖ p_Q) + D(p_Q ‖ p_E). Markov chain in E: X_{A\B} ↔ X_{A∩B} ↔ X_{B\A}. Entropy submodularity: D(p_Q ‖ p_E) = h_A + h_B − h_{A∪B} − h_{A∩B} ≥ 0. [Figure: p_X with its projections p_Q and p_E onto the nested submanifolds E ⊆ Q ⊆ S, with legs (i), (ii), (iii) of the Pythagorean decomposition.]

47 Projections and the Pythagorean Theorem. From the Pythagorean relation to information inequalities, an example: for S = {p(x) : p(x) = q_X(x_1, x_2, x_3)}, consider the submanifolds Q = {p(x) : p(x) = q_{X_1 X_3}(x_1, x_3) q_{X_2}(x_2)} and E = {p(x) : p(x) = q_{X_1}(x_1) q_{X_2}(x_2) q_{X_3}(x_3)}. The projection of p_X onto Q is p_Q = p_{X_1 X_3}(x_1, x_3) p_{X_2}(x_2), and the projection of p_X onto E is p_E = p_{X_1}(x_1) p_{X_2}(x_2) p_{X_3}(x_3). Pythagorean relation: D(p_X ‖ p_E) = D(p_X ‖ p_Q) + D(p_Q ‖ p_E), i.e. (iii) = (i) + (ii), where (i) D(p_X ‖ p_Q) = h_13 + h_2 − h_123 ≥ 0, (ii) D(p_Q ‖ p_E) = h_1 + h_3 − h_13 ≥ 0, (iii) D(p_X ‖ p_E) = h_1 + h_2 + h_3 − h_123 ≥ 0.
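A minimal numerical check (not from the thesis) of these three identities and of the Pythagorean decomposition for a random joint distribution of three binary variables:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)   # random pmf of (X1, X2, X3)

def H(t):
    """Joint entropy in bits of a pmf array."""
    t = t[t > 0]
    return float(-(t * np.log2(t)).sum())

def D(a, b):
    """KL divergence in bits between two pmfs of the same shape."""
    m = a > 0
    return float((a[m] * np.log2(a[m] / b[m])).sum())

p13 = p.sum(axis=1)                          # p(x1, x3)
p1, p2, p3 = p.sum(axis=(1, 2)), p.sum(axis=(0, 2)), p.sum(axis=(0, 1))
pQ = np.einsum('ik,j->ijk', p13, p2)         # projection onto Q
pE = np.einsum('i,j,k->ijk', p1, p2, p3)     # projection onto E

h1, h2, h3, h13, h123 = H(p1), H(p2), H(p3), H(p13), H(p)
print(np.isclose(D(p, pQ), h13 + h2 - h123))         # identity (i)
print(np.isclose(D(pQ, pE), h1 + h3 - h13))          # identity (ii)
print(np.isclose(D(p, pE), D(p, pQ) + D(pQ, pE)))    # Pythagorean relation
```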

48 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

49 Characterization of 4-atom supports for four variables. Recall the 4-atom support in (4): (0,0,0,0), (0,1,1,0), (1,0,1,0), (1,1,1,1). [Figures: this 4-atom support shown in m-coordinates (left) and in e-coordinates (right), marking the line of distributions with the given 0.5 marginals, the DFZ four-atom-conjecture point (atom probabilities 0.35, 0.15, 0.15, 0.35), the uniform point (0.25 each), and the hyperplane E: I(X_3;X_4) = 0.] The three-dimensional picture parametrizes the atom probabilities p(0000), p(0110), p(1010) and p(1111) = 1 − p(0000) − p(0110) − p(1010); these also determine the marginal distributions of X_3 and X_4, since p(X_3 = 0) = p(0000) and p(X_4 = 0) = p(0000) + p(0110) + p(1010).

50 Characterization of 4-atom supports for four variables. Ingleton_34 = 0 is e-autoparallel, i.e. a hyperplane in θ-coordinates, for the given 4-atom support. [Figure: the e-coordinate plot of the support, showing the regions Ingleton_34 > 0 and Ingleton_34 < 0, the hyperplane Ingleton_34 = 0, the hyperplane E: I(X_3;X_4) = 0, the line of distributions with the given 0.5 marginals, the DFZ four-atom-conjecture point, and the uniform point.]

51 Characterization of 5-atom supports for four variables. Recall the 5-atom support in equation (5): (0,0,0,0), (0,0,1,1), (0,1,1,0), (1,0,1,0), (1,1,1,0). Fixing one e-coordinate θ_0 to an arbitrary value, we plot the resulting three-dimensional manifold. [Figure: level sets of Ingleton_34 for this support, e.g. Ingleton_34 = 0.1, 0, −0.01; the surface Ingleton_34 = 0 is no longer a hyperplane.] The result shows that the e-autoparallel property of Ingleton_34 = 0 is unique to the given 4-atom support.

52 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

53 Motivation. Conditional independence of discrete random variables and the region of entropic vectors, together with their applications: p-representable semimatroids; probabilistic modeling with graphical models and sum-product networks; calculating the network coding capacity region; determining limits of large coded distributed storage systems; finding theoretical lower bounds for secret sharing schemes; classification via information guided dataset partitioning; structure learning (e.g. protein-drug interaction); and communication (decoding LDPC codes).

54 From Conditional Independence to Probabilistic Models References Yunshu Liu and John MacLaren Walsh, Information Guided Dataset Partitioning for Supervised Learning, Conference on Uncertainty in Artificial Intelligence, submitted on March 01, 2016.

55 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

56 p-representable semimatroids. Let N = {1, 2, ..., N} and let S(N) be the family of all couples (ij|K), where K ⊆ N and i, j are singletons in N \ K. If we include the cases where i = j, there are, for example, 18 such couples for three variables and 56 such couples for N = 4. A subset L ⊆ S(N) is called a p-representable semimatroid (probabilistically representable) if there exists a system of N random variables X = {X_i}_{i ∈ N} such that L = {(ij|K) ∈ S(N) : X_i is conditionally independent of X_j given X_K, i.e. I(X_i;X_j|X_K) = 0}.
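A minimal sketch (not from the thesis) that extracts the conditional independence structure induced by a given joint distribution by testing I(X_i;X_j|X_K) = 0 for every couple with i ≠ j; the i = j couples (functional dependencies) are omitted for brevity, and indices in the output are 1-based.

```python
import itertools
import numpy as np

def joint_entropy(pmf, A):
    """H(X_A) in bits; pmf maps outcome tuples to probabilities."""
    marg = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[v] for v in A)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * np.log2(p) for p in marg.values() if p > 0)

def ci_couples(pmf, tol=1e-9):
    """All couples (i, j, K) with I(X_i; X_j | X_K) = 0 under pmf."""
    n = len(next(iter(pmf)))
    couples = []
    for r in range(n - 1):
        for K in itertools.combinations(range(n), r):
            rest = [v for v in range(n) if v not in K]
            for i, j in itertools.combinations(rest, 2):
                cmi = (joint_entropy(pmf, K + (i,)) + joint_entropy(pmf, K + (j,))
                       - joint_entropy(pmf, K + (i, j)) - joint_entropy(pmf, K))
                if abs(cmi) < tol:
                    couples.append((i + 1, j + 1, tuple(k + 1 for k in K)))
    return couples

# X1 a fair bit independent of X2, and X3 = X2: expect e.g. (1, 3, (2,)) and (1, 2, (3,)).
pmf = {(a, b, b): 0.25 for a in (0, 1) for b in (0, 1)}
print(ci_couples(pmf))
```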

57 p-representable semimatroids. František Matúš and Milan Studený found the minimal set of configurations of conditional independence that can occur within four discrete random variables. Milan Studený studied the minimal set of conditional independence structures that can be described by undirected graphs, directed acyclic graphs and chain graphs for four variables. František Matúš and Milan Studený pointed out that there is a huge difference between the above two results, meaning many conditional independence structures cannot be represented by traditional graph-based models.

58 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

59 Probabilistic Modeling: Bayesian networks. Bayesian networks (directed graphical models): a Bayesian network consists of a collection of probability distributions P over X = {X_1, ..., X_K} that factorize over a directed acyclic graph (DAG) in the following way: p(X) = p(X_1, ..., X_K) = Π_{k=1}^K p(X_k | pa_k), where pa_k denotes the direct parents of X_k in the DAG. Example of a Bayesian network: consider a joint distribution p(X) = p(X_1, X_2, X_3) over three variables; for the DAG with edges X_1 → X_2 and X_1 → X_3 we can write p(X_1, X_2, X_3) = p(X_1) p(X_2 | X_1) p(X_3 | X_1).

60 Probabilistic Modeling: limitations of graphical models. Problems with graphical models: they can only represent a limited number of conditional independence relations; inference is usually intractable beyond tree-structured graphs; and they cannot represent conditional independence relations that hold exclusively in certain contexts (context-specific independence).

61 Probabilistic Modeling: Sum-Product Networks. Sum-product network, definition: a sum-product network (SPN) is defined as follows. (1) A tractable univariate distribution is an SPN. (2) A product of SPNs with disjoint scopes is an SPN. (3) A weighted sum of SPNs with the same scope is an SPN, provided all weights are positive. (4) Nothing else is an SPN. Note 1: a univariate distribution is tractable iff its partition function and its mode can be computed in O(1) time. Note 2: the scope of an SPN is the set of variables that appear in it.

62 Probabilistic Modeling: Sum-Product Networks. Sum-product network representation: an SPN can be represented as a tree with univariate distributions as leaves, sums and products as internal nodes, and the edges from a sum node to its children labeled with the corresponding weights. [Figure: an SPN over X_1, X_2, X_3 with a root sum node (weights 0.8 and 0.2) over two product nodes of Bernoulli leaves.] Its network polynomial is P(x) = 0.8 (1.0 x_1 + 0.0 x̄_1)(0.3 x_2 + 0.7 x̄_2)(0.8 x_3 + 0.2 x̄_3) + 0.2 (0.0 x_1 + 1.0 x̄_1)(0.5 x_2 + 0.5 x̄_2)(0.9 x_3 + 0.1 x̄_3).
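A minimal sketch evaluating this network polynomial at complete assignments; the second coefficient of each leaf is assumed to be one minus the first, so that every leaf is a normalized Bernoulli.

```python
def spn_prob(x1, x2, x3):
    """Evaluate the example SPN at a complete 0/1 assignment of (X1, X2, X3)."""
    def leaf(p_one, x):
        # Bernoulli leaf: p_one * x + (1 - p_one) * (1 - x)
        return p_one * x + (1.0 - p_one) * (1 - x)
    branch1 = leaf(1.0, x1) * leaf(0.3, x2) * leaf(0.8, x3)
    branch2 = leaf(0.0, x1) * leaf(0.5, x2) * leaf(0.9, x3)
    return 0.8 * branch1 + 0.2 * branch2

# The eight assignment probabilities sum to one, so the SPN defines a valid pmf.
print(sum(spn_prob(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))
```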

63 Probabilistic Modeling: Sum-Product Networks. Sum-product network representation: SPNs can represent context-specific independence. [Figure: an SPN over X_1, X_2, X_3 whose two sum-branches use different factorizations, so that different independence relations hold in each context.]

64 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

65 Information guided dataset partitioning for supervised learning. What we learned from structure learning of sum-product networks. [Figure: the recursive structure-learning scheme: on the current block of data, try to establish approximate independence among the variables and return a product over the groups; if no independence is found, cluster similar instances and return a weighted sum (weights w_1, w_2, w_3) over the clusters; recurse on each piece.]

66 Information guided dataset partitioning. A supervised learning algorithm should generalize from the training data to unseen situations in a reasonable way. Partitioning data for supervised learning: for supervised learning on a discrete dataset, we have N feature variables X_i, i ∈ N, and a label variable Y. Denote by X_A and X_B the subsets of variables indexed by A ⊆ N and B ⊆ N. [Figure: context-specific independence of a dataset in a sum-product network: a two-branch sum node with weights w_1 and w_2, where one branch factorizes X_A independently of (X_{\A}, Y) and the other factorizes X_B independently of (X_{\B}, Y).]

67 Information guided dataset partitioning. Literature on divide-and-conquer machine learning algorithms, which are designed to work only with particular machine learning algorithms, e.g. Support Vector Machines 7 or Ridge Regression 8: 1: randomly divide the samples evenly and uniformly at random, or divide the samples through clustering; 2: compute a local cost function and build a model on each local partition of the data; 3: merge the local solutions directly, or combine them to yield a better initialization for the global problem.
7 Cho-Jui Hsieh, Si Si, Inderjit S. Dhillon, A Divide-and-Conquer Solver for Kernel Support Vector Machines, ICML.
8 Yuchen Zhang, John C. Duchi, Martin J. Wainwright, Divide and Conquer Kernel Ridge Regression, JMLR 2015.

68 Information guided dataset partitioning. Partitioning data for supervised learning. Task: divide both training and test data into blocks of instances and conquer them completely separately. Goal: generate classification performance comparable to that obtained on the unpartitioned data. Challenge: the huge number of possible partitions. [Figure: the original training data (X, Y) and test data X split into blocks (X1, Y1) with test block X1 and (X2, Y2) with test block X2.]

69 Information guided dataset partitioning. Our motivations for partitioning data for supervised learning: to be able to run machine learning algorithms on very large datasets in parallel without adding communication overhead; and, after finding multiple partition strategies giving comparable scores, to ensemble the resulting classifiers across the different partitioning strategies to get further performance improvements. Ranking

70 Information guided dataset partitioning for supervised learning. Partition by frequent values of feature variables: 1. use frequent values of feature variables as candidate partition criteria; 2. design score functions that calculate entropies and conditional mutual informations among the X_i, i ∈ N, and Y; 3. divide the dataset based on the partitions giving lower scores. Score function: consider a given K-partition of the data as a hidden variable C taking values c_1, c_2, ..., c_K, and choose the partition solving min_C S(C). (13)

71 Information guided dataset partitioning for supervised learning. Designing score functions with information measures: we want to minimize I(X_i;Y|C), to minimize I(X_i;Y|C = c_j), and to minimize I(X_i;Y|C = c_j) relative to I(X_i;Y); a larger H(C) is preferred to encourage a more even partition, and a larger H(Y|C) to make sure the partition is not based on the value of Y. This leads to S_0(C) = min_{i ∈ N} I(X_i;Y|C) / (H(Y|C) · H(C)).

72 Information guided dataset partitioning for supervised learning. S_1(C) = Σ_{c_j} min_{i ∈ N} I(X_i;Y|C = c_j) / (H(Y|C = c_j) · H(C)), and S_2(C) = Σ_{c_j} min_{i ∈ N} I(X_i;Y|C = c_j) / (I(X_i;Y) · H(Y|C = c_j) · H(C)). Furthermore, we can use the sum of the k smallest of the N context-specific conditional mutual informations: S_3(C, k) = Σ_{c_j} (Σ_{smallest k} I(X_i;Y|C = c_j)) / (H(C) · H(Y|C = c_j)) and S_4(C, k) = Σ_{c_j} (Σ_{smallest k} I(X_i;Y|C = c_j) / I(X_i;Y)) / (H(C) · (H(Y|C = c_j) + 0.1)). A sketch of computing one such score from data follows.
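A minimal sketch, with assumptions: empirical plug-in entropies and a small epsilon added to the denominator to avoid division by zero; the epsilon is not part of the thesis's definition. It computes the S_1 score for a candidate 2-partition.

```python
import numpy as np
from collections import Counter

def entropy_bits(labels):
    """Empirical Shannon entropy in bits of a sequence of discrete labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def mutual_information(x, y):
    """Empirical I(X;Y) in bits from two equal-length label sequences."""
    return entropy_bits(x) + entropy_bits(y) - entropy_bits(list(zip(x, y)))

def score_S1(feature_columns, y, c, eps=1e-12):
    """S_1: for each cell c_j, the smallest context-specific mutual information
    min_i I(X_i; Y | C = c_j), normalized by H(Y | C = c_j) and H(C), summed over cells."""
    y, c = np.asarray(y), np.asarray(c)
    H_C = entropy_bits(c.tolist())
    total = 0.0
    for cj in np.unique(c):
        mask = c == cj
        H_Y_cj = entropy_bits(y[mask].tolist())
        mi = min(mutual_information(np.asarray(col)[mask].tolist(), y[mask].tolist())
                 for col in feature_columns)
        total += mi / (H_Y_cj * H_C + eps)
    return total

# Toy data: two features, a binary label, and a candidate 2-partition C.
X1 = [0, 0, 1, 1, 0, 1, 0, 1]
X2 = [0, 1, 0, 1, 1, 0, 1, 0]
Y  = [0, 0, 1, 1, 0, 1, 1, 0]
C  = [0, 0, 0, 0, 1, 1, 1, 1]
print(score_S1([X1, X2], Y, C))
```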

73 Information guided dataset partitioning for supervised learning. If only a 2-partition is allowed, C = {c_1, c_2}, the following score function can be used, where we calculate the root-mean-square error (RMSE) between the context-specific conditional mutual informations over all feature variables: S_5(C) = sqrt( Σ_{i ∈ N} [I(X_i;Y|C = c_1) − I(X_i;Y|C = c_2)]² / N ) / H(C).

74 Outline The Region of Entropic Vectors Γ N Enumerating k-atom supports and map to Entropic Region Non-isomorphic k-atom supports Maximal Ingleton Violation and the Four Atom Conjecture Optimizing Entropy Inner Bounds from k-atom Distribution Characterization of Entropic Region via Information Geometry Manifold of Probability distributions e-autoparallel submanifold and m-autoparallel submanifold Projections and Pythagorean Theorem Information Geometry of k-atom support From Conditional Independence to Probabilistic Models p-representable semimatroids Probabilistic Models Information guided dataset partitioning for supervised learning Experiments on Information guided dataset partitioning

75 Experiment 1: Click-Through Rate (CTR) prediction. Results are tested on one of the online advertising click-through rate (CTR) datasets from Kaggle. The task is to predict the probability that a user clicks an advertisement based on historical training data. The training data have 40 million instances and the test data have 4 million instances. [Table: feature-value pairs of the click-through rate data used as partition criteria, with the ratio of each value in the dataset: partition 1: app_id-ecad; partition 2: device_id-a99f214a; partition 3: site_category-50e219e; the remaining rows and the ratio column were lost in transcription.]

76 Experiment 1: Click-Through Rate (CTR) prediction. [Table: scores S_1(C), S_2(C), S_3(C,3), S_4(C,3) and S_5(C) of the different partitions; numerical entries lost in transcription.] [Table: classification results of the different partitions and of the original unpartitioned data under FTRL-Proximal and Factorization Machines, reported in log loss, where logloss(y, p) = −(1/N) Σ_{i=1}^N [y_i log(p_i) + (1 − y_i) log(1 − p_i)]; numerical entries lost in transcription.]

77 Experiment 1: Click-Through Rate (CTR) prediction

78 Experiment 2: Customer relationship prediction. Results are tested on the KDD Cup 2009 customer relationship prediction dataset. The task is to predict the probability that customers of a telecom company buy upgrades or add-ons. The results can be used to analyze customers' buying preferences, helping the company offer better customer service. [Table: frequent feature-value pairs of the customer relationship data used as partition criteria, with the ratio of each value: Var200-NaN, Var218-cJvF, Var212-NhsEn4L, Var211-L84s, Var205-09_Q, Var228-F2FyR07IdsN7I, Var193-RO, Var227-RAYp for partitions 1 through 8; the ratio column was lost in transcription.]

79 Experiment 2: Customer relationship prediction. [Table: scores S_1(C), S_2(C), S_3(C,10), S_4(C,10) and S_5(C) of the different partitions for the customer relationship data, together with the AUC of GBDT for each partition, the AUC of the original dataset without partitioning, and the AUC of the ensemble of the top-4 partition strategies; numerical entries lost in transcription.] AUC of GBDT: the Area Under the ROC Curve (AUC) is used as the evaluation metric, with Gradient Boosting Decision Trees as the evaluation algorithm.

80 Experiment 2: Customer relationship prediction

81 Publications. Yunshu Liu and John MacLaren Walsh, Mapping the Region of Entropic Vectors with Support Enumeration & Information Geometry, IEEE Trans. Inform. Theory, submitted on December 08. Yunshu Liu and John MacLaren Walsh, Information Guided Dataset Partitioning for Supervised Learning, Conference on Uncertainty in Artificial Intelligence, submitted on March 01, 2016. Yunshu Liu, John MacLaren Walsh, Only One Nonlinear Non-Shannon Inequality is Necessary for Four Variables, in Proceedings of the 2nd Int. Electronic Conference on Entropy and Its Applications, Nov. Yunshu Liu and John MacLaren Walsh, Non-isomorphic Distribution Supports for Calculating Entropic Vectors, in 53rd Annual Allerton Conference on Communication, Control, and Computing, Oct. Yunshu Liu, John MacLaren Walsh, Bounding the entropic region via information geometry, in IEEE Information Theory Workshop, Sep.

82 Summary. [Figures recapping the three contributions: the transversal of orbits used to enumerate non-isomorphic supports; the Shannon and non-Shannon outer bounds, the inner bound, and the hyperplane Ingleton_34 = 0; the information-geometric picture of the 4-atom support in e-coordinates with the DFZ four-atom-conjecture point, the uniform point, and the hyperplanes Ingleton_34 = 0 and I(X_3;X_4) = 0; and the information-guided partitioning of training and test data into blocks.]

83 Thanks! Questions?


More information

5 Mutual Information and Channel Capacity

5 Mutual Information and Channel Capacity 5 Mutual Information and Channel Capacity In Section 2, we have seen the use of a quantity called entropy to measure the amount of randomness in a random variable. In this section, we introduce several

More information

Expectation Maximization

Expectation Maximization Expectation Maximization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr 1 /

More information

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018 Data Mining CS57300 Purdue University Bruno Ribeiro February 8, 2018 Decision trees Why Trees? interpretable/intuitive, popular in medical applications because they mimic the way a doctor thinks model

More information

Algebraic matroids are almost entropic

Algebraic matroids are almost entropic accepted to Proceedings of the AMS June 28, 2017 Algebraic matroids are almost entropic František Matúš Abstract. Algebraic matroids capture properties of the algebraic dependence among elements of extension

More information

14 : Theory of Variational Inference: Inner and Outer Approximation

14 : Theory of Variational Inference: Inner and Outer Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2014 14 : Theory of Variational Inference: Inner and Outer Approximation Lecturer: Eric P. Xing Scribes: Yu-Hsin Kuo, Amos Ng 1 Introduction Last lecture

More information

A new computational approach for determining rate regions and optimal codes for coded networks

A new computational approach for determining rate regions and optimal codes for coded networks A new computational approach for determining rate regions and optimal codes for coded networks Congduan Li, Jayant Apte, John MacLaren Walsh, Steven Weber Drexel University, Dept. of ECE, Philadelphia,

More information

A computational approach for determining rate regions and codes using entropic vector bounds

A computational approach for determining rate regions and codes using entropic vector bounds 1 A computational approach for determining rate regions and codes using entropic vector bounds Congduan Li, John MacLaren Walsh, Steven Weber Drexel University, Dept. of ECE, Philadelphia, PA 19104, USA

More information

Capacity Region of the Permutation Channel

Capacity Region of the Permutation Channel Capacity Region of the Permutation Channel John MacLaren Walsh and Steven Weber Abstract We discuss the capacity region of a degraded broadcast channel (DBC) formed from a channel that randomly permutes

More information

Chapter 16. Structured Probabilistic Models for Deep Learning

Chapter 16. Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 1 Chapter 16 Structured Probabilistic Models for Deep Learning Peng et al.: Deep Learning and Practice 2 Structured Probabilistic Models way of using graphs to describe

More information

Based on slides by Richard Zemel

Based on slides by Richard Zemel CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

More information

Network Combination Operations Preserving the Sufficiency of Linear Network Codes

Network Combination Operations Preserving the Sufficiency of Linear Network Codes Network Combination Operations Preserving the Sufficiency of Linear Network Codes Congduan Li, Steven Weber, John MacLaren Walsh ECE Department, Drexel University Philadelphia, PA 904 Abstract Operations

More information

Junction Tree, BP and Variational Methods

Junction Tree, BP and Variational Methods Junction Tree, BP and Variational Methods Adrian Weller MLSALT4 Lecture Feb 21, 2018 With thanks to David Sontag (MIT) and Tony Jebara (Columbia) for use of many slides and illustrations For more information,

More information

CSCI-567: Machine Learning (Spring 2019)

CSCI-567: Machine Learning (Spring 2019) CSCI-567: Machine Learning (Spring 2019) Prof. Victor Adamchik U of Southern California Mar. 19, 2019 March 19, 2019 1 / 43 Administration March 19, 2019 2 / 43 Administration TA3 is due this week March

More information

10708 Graphical Models: Homework 2

10708 Graphical Models: Homework 2 10708 Graphical Models: Homework 2 Due Monday, March 18, beginning of class Feburary 27, 2013 Instructions: There are five questions (one for extra credit) on this assignment. There is a problem involves

More information

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information.

Introduction to Information Theory. Uncertainty. Entropy. Surprisal. Joint entropy. Conditional entropy. Mutual information. L65 Dept. of Linguistics, Indiana University Fall 205 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission rate

More information

Dept. of Linguistics, Indiana University Fall 2015

Dept. of Linguistics, Indiana University Fall 2015 L645 Dept. of Linguistics, Indiana University Fall 2015 1 / 28 Information theory answers two fundamental questions in communication theory: What is the ultimate data compression? What is the transmission

More information

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check

More information

14 : Theory of Variational Inference: Inner and Outer Approximation

14 : Theory of Variational Inference: Inner and Outer Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 14 : Theory of Variational Inference: Inner and Outer Approximation Lecturer: Eric P. Xing Scribes: Maria Ryskina, Yen-Chia Hsu 1 Introduction

More information

11. Learning graphical models

11. Learning graphical models Learning graphical models 11-1 11. Learning graphical models Maximum likelihood Parameter learning Structural learning Learning partially observed graphical models Learning graphical models 11-2 statistical

More information

Upper Bounds on the Capacity of Binary Intermittent Communication

Upper Bounds on the Capacity of Binary Intermittent Communication Upper Bounds on the Capacity of Binary Intermittent Communication Mostafa Khoshnevisan and J. Nicholas Laneman Department of Electrical Engineering University of Notre Dame Notre Dame, Indiana 46556 Email:{mhoshne,

More information

PROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS

PROBABILITY AND INFORMATION THEORY. Dr. Gjergji Kasneci Introduction to Information Retrieval WS PROBABILITY AND INFORMATION THEORY Dr. Gjergji Kasneci Introduction to Information Retrieval WS 2012-13 1 Outline Intro Basics of probability and information theory Probability space Rules of probability

More information

Lecture 1: Introduction, Entropy and ML estimation

Lecture 1: Introduction, Entropy and ML estimation 0-704: Information Processing and Learning Spring 202 Lecture : Introduction, Entropy and ML estimation Lecturer: Aarti Singh Scribes: Min Xu Disclaimer: These notes have not been subjected to the usual

More information

Submodularity in Machine Learning

Submodularity in Machine Learning Saifuddin Syed MLRG Summer 2016 1 / 39 What are submodular functions Outline 1 What are submodular functions Motivation Submodularity and Concavity Examples 2 Properties of submodular functions Submodularity

More information

CHAPTER-17. Decision Tree Induction

CHAPTER-17. Decision Tree Induction CHAPTER-17 Decision Tree Induction 17.1 Introduction 17.2 Attribute selection measure 17.3 Tree Pruning 17.4 Extracting Classification Rules from Decision Trees 17.5 Bayesian Classification 17.6 Bayes

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 5: Vector Data: Support Vector Machine Instructor: Yizhou Sun yzsun@cs.ucla.edu October 18, 2017 Homework 1 Announcements Due end of the day of this Thursday (11:59pm)

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Probabilistic and Logistic Circuits: A New Synthesis of Logic and Machine Learning

Probabilistic and Logistic Circuits: A New Synthesis of Logic and Machine Learning Probabilistic and Logistic Circuits: A New Synthesis of Logic and Machine Learning Guy Van den Broeck KULeuven Symposium Dec 12, 2018 Outline Learning Adding knowledge to deep learning Logistic circuits

More information

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro Decision Trees CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Classification without Models Well, partially without a model } Today: Decision Trees 2015 Bruno Ribeiro 2 3 Why Trees? } interpretable/intuitive,

More information

Chapter 2: Entropy and Mutual Information. University of Illinois at Chicago ECE 534, Natasha Devroye

Chapter 2: Entropy and Mutual Information. University of Illinois at Chicago ECE 534, Natasha Devroye Chapter 2: Entropy and Mutual Information Chapter 2 outline Definitions Entropy Joint entropy, conditional entropy Relative entropy, mutual information Chain rules Jensen s inequality Log-sum inequality

More information

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

More information

LECTURE 3. Last time:

LECTURE 3. Last time: LECTURE 3 Last time: Mutual Information. Convexity and concavity Jensen s inequality Information Inequality Data processing theorem Fano s Inequality Lecture outline Stochastic processes, Entropy rate

More information

The local equivalence of two distances between clusterings: the Misclassification Error metric and the χ 2 distance

The local equivalence of two distances between clusterings: the Misclassification Error metric and the χ 2 distance The local equivalence of two distances between clusterings: the Misclassification Error metric and the χ 2 distance Marina Meilă University of Washington Department of Statistics Box 354322 Seattle, WA

More information

Support Vector Machines, Kernel SVM

Support Vector Machines, Kernel SVM Support Vector Machines, Kernel SVM Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, 2017 1 / 40 Outline 1 Administration 2 Review of last lecture 3 SVM

More information

4.1 Notation and probability review

4.1 Notation and probability review Directed and undirected graphical models Fall 2015 Lecture 4 October 21st Lecturer: Simon Lacoste-Julien Scribe: Jaime Roquero, JieYing Wu 4.1 Notation and probability review 4.1.1 Notations Let us recall

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 9: Expectation Maximiation (EM) Algorithm, Learning in Undirected Graphical Models Some figures courtesy

More information

Bregman Divergences for Data Mining Meta-Algorithms

Bregman Divergences for Data Mining Meta-Algorithms p.1/?? Bregman Divergences for Data Mining Meta-Algorithms Joydeep Ghosh University of Texas at Austin ghosh@ece.utexas.edu Reflects joint work with Arindam Banerjee, Srujana Merugu, Inderjit Dhillon,

More information

Characteristic-Dependent Linear Rank Inequalities and Network Coding Applications

Characteristic-Dependent Linear Rank Inequalities and Network Coding Applications haracteristic-ependent Linear Rank Inequalities and Network oding pplications Randall ougherty, Eric Freiling, and Kenneth Zeger bstract Two characteristic-dependent linear rank inequalities are given

More information

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H.

3. If a choice is broken down into two successive choices, the original H should be the weighted sum of the individual values of H. Appendix A Information Theory A.1 Entropy Shannon (Shanon, 1948) developed the concept of entropy to measure the uncertainty of a discrete random variable. Suppose X is a discrete random variable that

More information

Undirected Graphical Models

Undirected Graphical Models Undirected Graphical Models 1 Conditional Independence Graphs Let G = (V, E) be an undirected graph with vertex set V and edge set E, and let A, B, and C be subsets of vertices. We say that C separates

More information

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany Syllabus Fri. 21.10. (1) 0. Introduction A. Supervised Learning: Linear Models & Fundamentals Fri. 27.10. (2) A.1 Linear Regression Fri. 3.11. (3) A.2 Linear Classification Fri. 10.11. (4) A.3 Regularization

More information

Hands-On Learning Theory Fall 2016, Lecture 3

Hands-On Learning Theory Fall 2016, Lecture 3 Hands-On Learning Theory Fall 016, Lecture 3 Jean Honorio jhonorio@purdue.edu 1 Information Theory First, we provide some information theory background. Definition 3.1 (Entropy). The entropy of a discrete

More information