Maximization of the information divergence from the multinomial distributions
Jozef Juríček
Charles University in Prague, Faculty of Mathematics and Physics, Department of Probability and Mathematical Statistics

Supervisor: Ing. František Matúš, CSc.
Academy of Sciences of the Czech Republic, Institute of Information Theory and Automation, Department of Decision-Making Theory

Abstract. The explicit solution of the problem of maximization of the information divergence from the family of multinomial distributions is presented, using a result of N. Ay and A. Knauf for the problem of maximization of multi-information, which is a special case of maximization of the information divergence from hierarchical models. The problem studied in this paper is a generalization of the binomial case, which was solved in [3]. The problem of maximization of information divergence from an exponential family has emerged in probabilistic models for evolution and learning in neural networks that are based on infomax principles. The maximizers admit an interpretation as stochastic systems with high complexity w.r.t. the exponential family.

1 Introduction

Let $\mu, \nu$ be nonzero measures on a finite set $Z$ and let $f : Z \to \mathbb{R}^d$. Let $\mathcal{E} = \mathcal{E}_{\mu,f} = \{Q_{\mu,f,\vartheta} : \vartheta \in \mathbb{R}^d\}$ be the (full) exponential family determined by the reference measure $\mu$ and the directional statistic $f$, where $Q_{\mu,f,\vartheta}$ is the probability measure (pm) given by

$$Q_{\mu,f,\vartheta}(z) = e^{\langle \vartheta, f(z)\rangle - \Lambda_{\mu,f}(\vartheta)}\,\mu(z), \qquad z \in Z,$$

where $\langle\cdot,\cdot\rangle$ denotes the scalar product and

$$\Lambda_{\mu,f}(\vartheta) = \ln \sum_{z \in Z} e^{\langle \vartheta, f(z)\rangle}\,\mu(z).$$

The information divergence (relative entropy; Kullback-Leibler divergence) of a pm $P$ (on $Z$) from $\nu$ is

$$D(P\,\|\,\nu) = \begin{cases} \sum_{z \in s(P)} P(z) \ln \dfrac{P(z)}{\nu(z)}, & s(P) \subseteq s(\nu), \\ +\infty, & \text{otherwise,} \end{cases}$$

where $s(\cdot)$ is the support function, i.e. $s(\nu) = \{z \in Z : \nu(z) > 0\}$. The information divergence of a pm $P$ (on $Z$) from the exponential family $\mathcal{E}$ is defined by

$$D(P\,\|\,\mathcal{E}) = \inf_{Q \in \mathcal{E}} D(P\,\|\,Q).$$

This work studies the maximization of the function $P \mapsto D(P\,\|\,\mathcal{M})$, where $\mathcal{M}$ is a family of multinomial distributions (which is the closure of an exponential family). This problem is a generalization of the binomial case, which was solved in [3].

Problem 1.1 (Maximization of divergence from the multinomial family). Let $N$ be the number of identical and independent trials and $n$ the number of possible outcomes in each trial. Let $p_j$ be the probability of realization of the $j$-th outcome ($\sum_{j=1}^n p_j = 1$, $p_1,\dots,p_n \in [0;1]$) in each trial. The multinomial distribution (with parameters $N, n, p_1,\dots,p_n$) is then the joint distribution of the numbers of realizations of the outcomes over all $N$ trials. Let $Z := \{z = (z_1,\dots,z_n) \in \{0,1,\dots,N\}^n : \sum_{j=1}^n z_j = N\}$ be the state space of (random variables with) multinomial distributions (with $N, n$ fixed). Let $\mathcal{P}$ be the set of all pm's on $Z$ and $\mathcal{M}$ the set of all multinomial distributions (with $N, n$ fixed). Finally, let $\mathcal{P}_+$ be the set of all strictly positive pm's on $Z$ and $\mathcal{M}_+ := \mathcal{M} \cap \mathcal{P}_+$. The problem is to calculate $\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M})$ and to find every $P_{\sup}$ and $P^*_{\sup}$ such that $D(P_{\sup}\,\|\,P^*_{\sup}) = D(P_{\sup}\,\|\,\mathcal{M}) = \sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M})$.

AMS 2000 Math. Subject Classification. Primary 94A17. Secondary 62B10, 60A10, 52A20.
Keywords and phrases. Kullback-Leibler divergence, relative entropy, exponential family, hierarchical models, multinomial distribution, information projection, log-Laplace transform, cumulant generating function.
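The following is a small numerical sketch, not part of the paper, illustrating Problem 1.1: it evaluates $D(P\,\|\,\mathcal{M})$ for a given pm $P$ on $Z$ by minimizing $D(P\,\|\,Q_p)$ over the multinomial parameter $p$. All helper names and the use of SciPy's Nelder-Mead minimizer are illustrative choices, not the paper's method.

```python
# Numerical sketch (not from the paper): approximate D(P || M) for a pm P on Z
# by minimizing D(P || Q_p) over the multinomial parameter p.
import itertools
import numpy as np
from math import factorial
from scipy.optimize import minimize

def state_space(N, n):
    """All z = (z_1, ..., z_n) with nonnegative integer entries summing to N."""
    return [z for z in itertools.product(range(N + 1), repeat=n) if sum(z) == N]

def multinomial_pm(N, z_list, p):
    """Q_p(z) = (N choose z) * prod_j p_j^{z_j} for every z in z_list."""
    coef = lambda z: factorial(N) / np.prod([factorial(k) for k in z])
    return np.array([coef(z) * np.prod([pj ** zj for pj, zj in zip(p, z)]) for z in z_list])

def divergence(P, Q):
    """D(P || Q); +infinity if the support of P is not contained in the support of Q."""
    mask = P > 0
    if np.any(Q[mask] == 0):
        return np.inf
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

def divergence_from_family(P, N, n, z_list, restarts=5):
    """Approximate inf_p D(P || Q_p) by numerical minimization over the simplex."""
    def objective(theta):                        # softmax parametrization p = p(theta)
        w = np.exp(theta - theta.max())
        return divergence(P, multinomial_pm(N, z_list, w / w.sum()))
    starts = np.random.default_rng(0).normal(size=(restarts, n))
    return min(minimize(objective, x0, method="Nelder-Mead").fun for x0 in starts)

if __name__ == "__main__":
    N, n = 3, 2
    Z = state_space(N, n)                        # [(0, 3), (1, 2), (2, 1), (3, 0)]
    P = np.array([0.5 if z in [(3, 0), (0, 3)] else 0.0 for z in Z])
    print(divergence_from_family(P, N, n, Z), 2 * np.log(2))
```

For $N = 3$, $n = 2$ and $P = \tfrac12(\delta_{(3,0)} + \delta_{(0,3)})$ the printed value is close to $2\ln 2$, in agreement with Example 3.5 and Corollary 3.3 below.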
Example 1.2 ($N = 2$, $n = 2$). $Z = \{z_{20} = (2,0),\ z_{11} = (1,1),\ z_{02} = (0,2)\}$,

$$\mathcal{P} = \{(P(z_{20}), P(z_{11}), P(z_{02})) = (p_{20}, p_{11}, p_{02}) : (p_{20}, p_{11}, p_{02}) \in [0;1]^3,\ p_{20} + p_{11} + p_{02} = 1\},$$

$$\mathcal{M} = \{(P(z_{20}), P(z_{11}), P(z_{02})) = (p^2,\ 2p(1-p),\ (1-p)^2) : p \in [0;1]\}.$$

The situation is illustrated on Figure 1.

Figure 1: The simplex $\mathcal{P}$ (with vertices $\delta_{z_{20}}$, $\delta_{z_{11}}$, $\delta_{z_{02}}$) and the exponential family $\mathcal{M}$ for $N = 2$, $n = 2$.

Example 1.3 ($N = 3$, $n = 2$). $Z = \{z_{30} = (3,0),\ z_{21} = (2,1),\ z_{12} = (1,2),\ z_{03} = (0,3)\}$,

$$\mathcal{P} = \{(P(z_{30}), P(z_{21}), P(z_{12}), P(z_{03})) = (p_{30}, p_{21}, p_{12}, p_{03}) : (p_{30}, p_{21}, p_{12}, p_{03}) \in [0;1]^4,\ p_{30} + p_{21} + p_{12} + p_{03} = 1\},$$

$$\mathcal{M} = \{(P(z_{30}), P(z_{21}), P(z_{12}), P(z_{03})) = (p^3,\ 3p^2(1-p),\ 3p(1-p)^2,\ (1-p)^3) : p \in [0;1]\}.$$

The situation is illustrated on Figure 2.

Figure 2: The simplex $\mathcal{P}$ (with vertices $\delta_{z_{30}}$, $\delta_{z_{21}}$, $\delta_{z_{12}}$, $\delta_{z_{03}}$) and the exponential family $\mathcal{M}$ for $N = 3$, $n = 2$.

The general problem of maximization of the information divergence from an exponential family has emerged in probabilistic models for evolution and learning in neural networks based on infomax principles. Maximizers of $D(\cdot\,\|\,\mathcal{E})$ admit an interpretation as stochastic systems with high complexity w.r.t. the exponential family $\mathcal{E}$ [1].
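As a quick check, not carried out in the paper, the divergence of the point mass at $z_{11}$ from $\mathcal{M}$ in Example 1.2 can be computed directly; it already attains the value $(N-1)\ln n = \ln 2$ that Corollary 3.3 later identifies as the supremum:

```latex
D(\delta_{z_{11}} \,\|\, Q_p) = \ln\frac{1}{Q_p(z_{11})} = \ln\frac{1}{2p(1-p)},
\qquad
D(\delta_{z_{11}} \,\|\, \mathcal{M}) = \min_{p \in [0;1]} \ln\frac{1}{2p(1-p)} = \ln 2,
```

the minimum being attained at $p = \tfrac12$.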
2 Preliminaries

This section reviews some facts about exponential families and information projections. Let $\operatorname{Lin}(A)$ denote the linear span of a set $A$ of real functions on $Z$.

Lemma 2.1. Let $\mu, \nu$ be strictly positive measures on a finite set $Z$, $f : Z \to \mathbb{R}^{d_f}$, $g : Z \to \mathbb{R}^{d_g}$, and let $\mathcal{E}_{\mu,f} \supseteq \mathcal{E}_{\nu,g}$ be two exponential families. Then $\mathcal{E}_{\mu,f} = \mathcal{E}_{\nu,f}$.

Proof. Notice that $\nu/\nu(Z) = Q_{\nu,g,0} \in \mathcal{E}_{\nu,g} \subseteq \mathcal{E}_{\mu,f}$. Then there exists $\vartheta_0 \in \mathbb{R}^{d_f}$ such that

$$\frac{\nu(z)}{\nu(Z)} = Q_{\mu,f,\vartheta_0}(z) = e^{\langle\vartheta_0, f(z)\rangle - \Lambda(\vartheta_0)}\,\mu(z).$$

Now $\mu$ can be expressed as $\mu(z) = \frac{\nu(z)}{\nu(Z)}\, e^{\Lambda(\vartheta_0) - \langle\vartheta_0, f(z)\rangle}$. It can be seen that for every $\vartheta \in \mathbb{R}^{d_f}$

$$Q_{\mu,f,\vartheta}(z) = e^{\langle\vartheta, f(z)\rangle - \Lambda(\vartheta)}\,\mu(z) = \frac{e^{\langle\vartheta, f(z)\rangle + \Lambda(\vartheta_0) - \langle\vartheta_0, f(z)\rangle}\,\nu(z)}{\sum_{z' \in Z} e^{\langle\vartheta, f(z')\rangle + \Lambda(\vartheta_0) - \langle\vartheta_0, f(z')\rangle}\,\nu(z')} = e^{\langle\vartheta - \vartheta_0, f(z)\rangle - \Lambda_{\nu,f}(\vartheta - \vartheta_0)}\,\nu(z) = Q_{\nu,f,\vartheta - \vartheta_0}(z).$$

This proves $\mathcal{E}_{\mu,f} \subseteq \mathcal{E}_{\nu,f}$, and the equality follows by symmetry.

Lemma 2.2. Let $\nu$ be a nonzero measure, $f = (f_1,\dots,f_{d_f})$, $f_i : Z \to \mathbb{R}$, $i = 1,\dots,d_f$, and $g = (g_1,\dots,g_{d_g})$, $g_j : Z \to \mathbb{R}$, $j = 1,\dots,d_g$. Then

$$\mathcal{E}_{\nu,g} \subseteq \mathcal{E}_{\nu,f} \iff \operatorname{Lin}\{1, g_1,\dots,g_{d_g}\} \subseteq \operatorname{Lin}\{1, f_1,\dots,f_{d_f}\},$$
$$\mathcal{E}_{\nu,g} = \mathcal{E}_{\nu,f} \iff \operatorname{Lin}\{1, g_1,\dots,g_{d_g}\} = \operatorname{Lin}\{1, f_1,\dots,f_{d_f}\}.$$

Proof. It is easy to see that $\mathcal{E}_{\nu,f} = \mathcal{E}_{\nu,(1,f)}$. The rest follows from the fact that the exponential function is injective.

Corollary 2.3. Using the notation of Lemma 2.2 and $D_f := \dim \operatorname{Lin}\{1, f_1,\dots,f_{d_f}\} - 1$, there exists $h = (h_1,\dots,h_{D_f})$, $h_i : Z \to \mathbb{R}$, $i = 1,\dots,D_f$, such that $\mathcal{E}_{\nu,f} = \mathcal{E}_{\nu,h}$ and $\{h_i,\ i = 1,\dots,D_f\}$ are linearly independent and linearly independent with $1$ (on $Z$). Moreover, if $\mathcal{E}_{\nu,g} \subseteq \mathcal{E}_{\nu,f}$, then $\dim \operatorname{Lin}\{1, g_1,\dots,g_{d_g}\} - 1 =: D_g \le D_f$ and, for $h_g := (h_1,\dots,h_{D_g})$, it holds that $\mathcal{E}_{\nu,g} = \mathcal{E}_{\nu,h_g}$.

Proof. By Lemma 2.2 and Steinitz's exchange theorem.

The nonnegative integer $D_f$ is the dimension of the exponential family $\mathcal{E}_{\nu,f}$.

Theorem 2.4 (Uniqueness of the generalized rI-projection). For every pm $P$ (on $Z$) and every exponential family $\mathcal{E} = \mathcal{E}_{\nu,f}$ with $s(\nu) = Z$, there exists a unique pm $P^{\mathcal{E}}$ in the closure of $\mathcal{E}$ (the generalized reverse information projection; generalized rI-projection) such that $D(P\,\|\,P^{\mathcal{E}}) = D(P\,\|\,\mathcal{E})$.

Proof. For details, see [2].

3 Multinomial family

For $n, N \in \mathbb{N}$ denote $[0:N] := \{0,\dots,N\}$, $[1:n] := \{1,\dots,n\}$,

$$Z := \Big\{z = (z_1,\dots,z_n) \in [0:N]^n : \sum_{j=1}^n z_j = N\Big\},$$

and for $z \in Z$ denote $\binom{N}{z} := \frac{N!}{\prod_{j=1}^n z_j!}$. The set of all pm's on $Z$ will be denoted $\mathcal{P} := \{P = (P(z))_{z \in Z} \in [0;1]^Z : \sum_{z \in Z} P(z) = 1\}$, and the set of strictly positive pm's $\mathcal{P}_+ := \{P = (P(z))_{z \in Z} \in (0;1)^Z : \sum_{z \in Z} P(z) = 1\}$. The family of multinomial distributions (multinomial family) is the set of pm's

$$\mathcal{M} = \Big\{Q : Q(z) = \binom{N}{z} \prod_{j=1}^n p_j^{z_j},\ z \in Z;\ (p_j)_{j=1}^n = (p(j))_{j \in [1:n]} = p \in \mathcal{P}([1:n])\Big\}.$$

Denote

$$\mathcal{M}_+ = \mathcal{M} \cap \mathcal{P}_+ = \Big\{Q : Q(z) = \binom{N}{z} \prod_{j=1}^n p_j^{z_j},\ z \in Z;\ (p_j)_{j=1}^n = (p(j))_{j \in [1:n]} = p \in \mathcal{P}_+([1:n])\Big\}.$$
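The computational core of the proof of Lemma 2.1 is the reparametrization identity $Q_{\mu,f,\vartheta} = Q_{\nu,f,\vartheta-\vartheta_0}$ when $\nu/\nu(Z) = Q_{\mu,f,\vartheta_0}$. A minimal numerical sketch of this identity follows; it is not from the paper and all names are illustrative.

```python
# Numerical check of the reparametrization step used in the proof of Lemma 2.1:
# if nu is proportional to Q_{mu,f,theta0}, then Q_{mu,f,theta} = Q_{nu,f,theta-theta0}
# for every theta, so the family is unchanged when the reference measure is replaced
# by a strictly positive member of the family.
import numpy as np

rng = np.random.default_rng(1)
Zsize, d = 6, 2
f = rng.normal(size=(Zsize, d))          # directional statistic f : Z -> R^d
mu = rng.uniform(0.5, 2.0, size=Zsize)   # strictly positive reference measure

def Q(ref, theta):
    """Q_{ref,f,theta}(z) = exp(<theta, f(z)> - Lambda(theta)) * ref(z)."""
    w = np.exp(f @ theta) * ref
    return w / w.sum()

theta0 = rng.normal(size=d)
nu = Q(mu, theta0)                       # nu / nu(Z) = Q_{mu,f,theta0}

for _ in range(3):
    theta = rng.normal(size=d)
    assert np.allclose(Q(mu, theta), Q(nu, theta - theta0))
print("Q_{mu,f,theta} == Q_{nu,f,theta-theta0} for all sampled theta")
```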
It is easy to see that the multinomial family is the closure of an exponential family, $\mathcal{M} = \overline{\mathcal{E}_{\mu,f}}$ and $\mathcal{M}_+ = \mathcal{E}_{\mu,f}$, with $\mu(z) = \binom{N}{z}$ and $f(z) = z$. Its dimension is equal to $n-1$ and, for $\vartheta \in \mathbb{R}^n$ and $Q_{\mu,f,\vartheta} =: Q_p \in \mathcal{E}_{\mu,f}$, one has

$$p_j = \frac{e^{\vartheta_j}}{\sum_{k=1}^n e^{\vartheta_k}}.$$

Let $(X_1,\dots,X_N)$ be a random vector with identical marginal distributions $X_k \sim p \in \mathcal{P}([1:n])$, $k = 1,\dots,N$. Denote $V_j := |\{i \in [1:N] : X_i = j\}|$, $j = 1,\dots,n$. Then $V = (V_1,\dots,V_n) \sim Q_p \in \mathcal{M}$ if and only if $X_1,\dots,X_N$ are mutually independent.

Now the problem of maximization of $D(P\,\|\,\mathcal{M}) = D(P\,\|\,\mathcal{E}_{\mu,f})$ can be formulated in a different, equivalent way. Denote by $X = [1:n]^N$ the state space of the random vector $(X_1,\dots,X_N)$. For $x = (x_1,\dots,x_N) \in X$ and a permutation $\pi : [1:N] \to [1:N]$ let $x_\pi = (x_{\pi(1)},\dots,x_{\pi(N)})$. The set of all permutations $\pi$ on $[1:N]$ will be denoted $[1:N]!$. Denote:

$E := \{P \in \mathcal{P}(X) : P(x) = P(x_\pi),\ x \in X,\ \pi \in [1:N]!\}$ (the exchangeable pm's),
$F := \{Q \in \mathcal{P}(X) : Q(x) = \prod_{i=1}^N Q_i(x_i),\ x \in X\}$ (the factorizable pm's), where $Q_i(x_i) = \sum_{x' \in X :\, x'_i = x_i} Q(x')$, $i = 1,\dots,N$.

Finally, $E_+ := \mathcal{P}_+(X) \cap E$ and $F_+ := \mathcal{P}_+(X) \cap F$.

Lemma 3.1. With the previous notation and $X_z := \{x \in X : \forall j \in [1:n] : |\{i \in [1:N] : x_i = j\}| = z_j\}$, it holds:

(i) The mapping $h : \mathcal{P} \to E$ such that $h(P) = \tilde P$, $\tilde P(x) = P(z)/\binom{N}{z}$ for $z \in Z$ s.t. $x \in X_z$, is a bijection, $h(\mathcal{M}) = E \cap F$, and, for $h^{-1} : E \to \mathcal{P}$, the inverse of $h$, $h^{-1}(\tilde P) = P$ with $P(z) = \binom{N}{z}\,\tilde P(x)$ for any $x \in X_z$.

(ii) For any $P, Q \in \mathcal{P}$, it holds that $D(P\,\|\,Q) = D(h(P)\,\|\,h(Q))$.

(iii) For any $P \in E$ and $Q \in F \setminus (E \cap F)$, there exists $\pi \in [1:N]!$ such that, for $Q_\pi$ defined by $Q_\pi(x) = Q(x_\pi)$, it holds that $Q_\pi \ne Q$ and $D(P\,\|\,Q) = D(P\,\|\,Q_\pi)$.

(iv) For any $P \in E$: $D(P\,\|\,F) = \inf_{Q \in E \cap F} D(P\,\|\,Q)$ and $\arg\inf_{Q \in E \cap F} D(P\,\|\,Q) = P^F \in E \cap F$.

(v) $\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = \sup_{P \in E} D(P\,\|\,E \cap F) = \sup_{P \in E} D(P\,\|\,F) \le \sup_{P \in \mathcal{P}(X)} D(P\,\|\,F)$ and

$$\arg\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = h^{-1}\big(\arg\sup_{P \in E} D(P\,\|\,E \cap F)\big) = h^{-1}\big(E \cap \arg\sup_{P \in \mathcal{P}(X)} D(P\,\|\,F)\big),$$

the last equality holding whenever $\sup_{P \in E} D(P\,\|\,F) = \sup_{P \in \mathcal{P}(X)} D(P\,\|\,F)$.

Proof. Due to the uniqueness of the rI-projection (Theorem 2.4), (iii) implies (iv). The other propositions are straightforward.

It is well known that, for $P \in \mathcal{P}(X)$, $D(P\,\|\,F) = I(P)$, the multi-information of $P$. The problem of maximizing the multi-information over $\mathcal{P}(X)$ has an explicit solution and was solved in [1].

Theorem 3.2 (Maximizers of $D(\cdot\,\|\,F) = I(\cdot)$). The set of maximizers of $D(\cdot\,\|\,F) = I(\cdot)$ is equal to

$$\arg\sup_{P \in \mathcal{P}(X)} D(P\,\|\,F) = \Big\{P_\Pi = \frac{1}{n} \sum_{j=1}^n \delta_{(j,\,\pi_2(j),\dots,\pi_N(j))} : \Pi = (\pi_2,\dots,\pi_N) \in [1:n]!^{\,(N-1)}\Big\},$$

with $D(P_\Pi\,\|\,F) = (N-1)\ln(n)$ and $P_\Pi^F = U_X = \frac{1}{n^N}\sum_{x \in X} \delta_x$, $\Pi \in [1:n]!^{\,(N-1)}$.

Proof. For details, see [1], Theorem 4.3 and Corollary 4.10.

Denote, for $j, k, l \in [1:n]$, $k < l$, by $e_j = e_{j,j}$ the $j$-th standard unit vector of length $n$ and by $e_{k,l}$ the $0$-$1$ vector of length $n$ with ones exactly at the positions $k$ and $l$; further $\epsilon_{j,j} := \delta_{2e_{j,j}}$ and $\epsilon_{k,l} := 2\,\delta_{e_{k,l}}$.
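A small sketch, not from the paper, checks Theorem 3.2 numerically for $N = 3$, $n = 2$: the Ay-Knauf maximizer $P_\Pi$ has multi-information $D(P_\Pi\,\|\,F) = (N-1)\ln(n)$, where the multi-information is computed via the standard identity $I(P) = \sum_i H(X_i) - H(X_1,\dots,X_N)$, i.e. the divergence from the product of the marginals. Variable names are illustrative, not the paper's notation.

```python
# Check that the maximizer P_Pi of Theorem 3.2 attains multi-information (N-1) ln(n).
import itertools
import numpy as np

N, n = 3, 2
X = list(itertools.product(range(1, n + 1), repeat=N))     # X = [1:n]^N

def entropy(p):
    p = np.asarray([q for q in p if q > 0])
    return float(-(p * np.log(p)).sum())

def multi_information(P):
    """sum_i H(X_i) - H(X_1,...,X_N) for a pm P given as a dict {x: P(x)}."""
    joint = list(P.values())
    marginals = []
    for i in range(N):
        m = {}
        for x, px in P.items():
            m[x[i]] = m.get(x[i], 0.0) + px
        marginals.append(list(m.values()))
    return sum(entropy(m) for m in marginals) - entropy(joint)

# P_Pi with Pi = (Id, Id): mass 1/n on each diagonal point (j, j, ..., j)
P_Pi = {x: 0.0 for x in X}
for j in range(1, n + 1):
    P_Pi[(j,) * N] = 1.0 / n

print(multi_information(P_Pi), (N - 1) * np.log(n))        # both equal 2 ln 2 = 1.386...
```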
Corollary 3.3 (The set of maximizers of $D(\cdot\,\|\,\mathcal{M})$). Using the notation of Lemma 3.1, it holds that

$$\arg\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = h^{-1}\big(E \cap \arg\sup_{P \in \mathcal{P}(X)} D(P\,\|\,F)\big).$$

For $N = 2$,

$$\arg\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = \Big\{P^\pi = \frac{1}{n}\sum_{j \in [1:n]:\, j \le \pi(j)} \epsilon_{j,\pi(j)} : \pi \in [1:n]!,\ [\pi(j) = k] \Leftrightarrow [\pi(k) = j],\ j, k \in [1:n]\Big\}.$$

For $N > 2$, the only maximizer is

$$P^{\mathrm{Id}} = \frac{1}{n}\sum_{j=1}^n \delta_{N e_j}.$$

Moreover, $\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = (N-1)\ln(n)$ and, for every maximizer $P_{\sup}$, it holds that $P^*_{\sup}(z) = \binom{N}{z}/n^N$, $z \in Z$.

Proof. To avoid trivial cases, let $n, N \ge 2$. By Lemma 3.1 and Theorem 3.2: $\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = \sup_{P \in E} D(P\,\|\,F) \le \sup_{P \in \mathcal{P}(X)} D(P\,\|\,F) = (N-1)\ln(n)$. It is easy to see that $P_0 = P_{(\mathrm{Id},\dots,\mathrm{Id})}$ is a maximizer (on $\mathcal{P}(X)$) and even $P_0 \in E$ ($\mathrm{Id}$ is the identity mapping on $[1:n]$). To find the remaining maximizers (on $\mathcal{P}(X)$) which also belong to $E$, take another maximizer $P \in E$, $P_0 \ne P = P_{(\pi_2,\dots,\pi_N)}$, so that $\pi_i \ne \mathrm{Id}$ and $\pi_i(j) \ne j$ for some $i \in [2:N]$ and $j \in [1:n]$. Thus $(j,\dots,\pi_i(j),\dots) \in s(P)$ and (from the fact that $P \in E$) also $(\pi_i(j),\dots,j,\dots) \in s(P)$. If $N > 2$, then $(j,\dots,\pi_i(j),\dots,k,\dots) \in s(P)$ and also $(\pi_i(j),\dots,j,\dots,k,\dots) \in s(P)$ for some $k \in [1:n]$. Hence, for some $l \in [2:N]$, $\pi_l$ is not injective; but $\pi_l$ is a permutation, which is a contradiction. The rest simply follows.

In the binomial case ($n = 2$), the application of the rI-projection theorem (Theorem 2.4), of the result of N. Ay and A. Knauf in [1] (Theorem 3.2) and of Lemma 3.1 (prop. (v)) substantially simplifies the proof of the result given in [3] (see the proof of the proposition there).

Example 3.4 (Ad: Example 1.2, $N = 2$, $n = 2$).
$\arg\sup_{P \in \mathcal{P}(X)} D(P\,\|\,F) = \{\tfrac12(\delta_{11} + \delta_{22}),\ \tfrac12(\delta_{12} + \delta_{21})\}$
$\arg\sup_{P \in E} D(P\,\|\,E \cap F) = E \cap \arg\sup_{P \in \mathcal{P}(X)} D(P\,\|\,F) = \{\tfrac12(\delta_{11} + \delta_{22}),\ \tfrac12(\delta_{12} + \delta_{21})\}$
$\arg\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = h^{-1}\big(E \cap \arg\sup_{P \in \mathcal{P}(X)} D(P\,\|\,F)\big) = \{\tfrac12(\delta_{20} + \delta_{02}),\ \delta_{11}\}$
$\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = \ln 2$

Figure 3 illustrates how the maximization of the information divergence from the multinomial family is related to the maximization of multi-information and to Lemma 3.1, prop. (iii). Correspondingly, the situation in the simplex $\mathcal{P}$ is depicted on Figure 4(a).

Example 3.5 (Ad: Example 1.3, $N = 3$, $n = 2$).
$\arg\sup_{P \in \mathcal{P}(X)} D(P\,\|\,F) = \{\tfrac12(\delta_{111} + \delta_{222}),\ \tfrac12(\delta_{112} + \delta_{221}),\ \tfrac12(\delta_{121} + \delta_{212}),\ \tfrac12(\delta_{122} + \delta_{211})\}$
$\arg\sup_{P \in E} D(P\,\|\,E \cap F) = \{\tfrac12(\delta_{111} + \delta_{222})\}$
$\arg\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = \{\tfrac12(\delta_{30} + \delta_{03})\}$
$\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = 2\ln 2$

The maximization problem in the simplex $\mathcal{P}$ is illustrated on Figure 4(b).

Example 3.6 ($N = 2$, $n = 3$).
$\arg\sup_{P \in \mathcal{P}(X)} D(P\,\|\,F) = \{\tfrac13(\delta_{11} + \delta_{22} + \delta_{33}),\ \tfrac13(\delta_{11} + \delta_{23} + \delta_{32}),\ \tfrac13(\delta_{13} + \delta_{22} + \delta_{31}),\ \tfrac13(\delta_{12} + \delta_{21} + \delta_{33}),\ \tfrac13(\delta_{12} + \delta_{23} + \delta_{31}),\ \tfrac13(\delta_{13} + \delta_{21} + \delta_{32})\}$
$\arg\sup_{P \in E} D(P\,\|\,E \cap F) = \{\tfrac13(\delta_{11} + \delta_{22} + \delta_{33}),\ \tfrac13(\delta_{11} + \delta_{23} + \delta_{32}),\ \tfrac13(\delta_{13} + \delta_{22} + \delta_{31}),\ \tfrac13(\delta_{12} + \delta_{21} + \delta_{33})\}$
$\arg\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = \{\tfrac13(\delta_{200} + \delta_{020} + \delta_{002}),\ \tfrac13\delta_{200} + \tfrac23\delta_{011},\ \tfrac13\delta_{020} + \tfrac23\delta_{101},\ \tfrac13\delta_{002} + \tfrac23\delta_{110}\}$
$\sup_{P \in \mathcal{P}} D(P\,\|\,\mathcal{M}) = \ln 3$
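A brute-force sketch, not in the paper and with illustrative names only: for $N = 2$, $n = 2$ it scans a grid over the simplex of pm's on $Z$ and a grid over the binomial parameter $p$, confirming numerically that the largest divergence found approaches $(N-1)\ln n = \ln 2$ and is attained near the maximizers of Example 3.4.

```python
# Grid search over pm's P on Z = {(2,0), (1,1), (0,2)} for the largest D(P || M), N = n = 2.
import numpy as np

ps = np.linspace(1e-6, 1.0 - 1e-6, 1001)
Qs = np.stack([ps ** 2, 2 * ps * (1 - ps), (1 - ps) ** 2], axis=1)   # Q_p for all grid p's

def divergence_from_binomials(P):
    """min over the p-grid of D(P || Q_p), with the convention 0 * ln 0 = 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(P > 0, P * np.log(P / Qs), 0.0)
    return float(terms.sum(axis=1).min())

best_val, best_P = -np.inf, None
for a in np.linspace(0.0, 1.0, 101):
    for b in np.linspace(0.0, 1.0 - a, 101):
        P = np.array([a, b, 1.0 - a - b])        # (P(2,0), P(1,1), P(0,2))
        val = divergence_from_binomials(P)
        if val > best_val:
            best_val, best_P = val, P

print(best_val, np.log(2))   # the grid maximum approaches sup D(P || M) = ln 2
print(best_P)                # close to a maximizer from Example 3.4 / Corollary 3.3
```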
Figure 3: Relation between the maximization of the information divergence and of the multi-information for $N = 2$, $n = 2$. (a) The simplex $\mathcal{P}(X)$; (b) the factorizable pm's $F$.

Figure 4: Ad Figure 1 and Figure 2: maximization in the simplex $\mathcal{P}$. (a) $N = 2$, $n = 2$; (b) $N = 3$, $n = 2$.

References

[1] Ay, N., Knauf, A. (2006). Maximizing multi-information. Kybernetika.
[2] Csiszár, I., Matúš, F. (2003). Information projections revisited. IEEE Transactions on Information Theory.
[3] Matúš, F. (2004). Maximization of information divergences from binary i.i.d. sequences. Proceedings IPMU 2004, Perugia, Italy.