Information Theory in Intelligent Decision Making


Information Theory in Intelligent Decision Making
Adaptive Systems and Algorithms Research Groups
School of Computer Science, University of Hertfordshire, United Kingdom
June 7, 2015

Motivation

Artificial Intelligence
- modelling cognition in humans
- realizing human-level intelligent behaviour in machines (just performance: not necessarily imitating the biological substrate)
- a jumble of various ideas to get the above points working

Question
Is there a joint way of understanding cognition?

Probability
- we have probability theory as a theory of uncertainty
- we have information theory for endowing probability with a sense of metrics

Random Variables

Def.: Event Space
Consider an event space Ω = {ω_1, ω_2, ...}, finite or countably infinite, with a (probability) measure P_Ω : Ω → [0, 1] such that Σ_ω P_Ω(ω) = 1. The ω are called events.

Def.: Random Variable
A random variable X is a map X : Ω → 𝒳 with some outcome space 𝒳 = {x_1, x_2, ...} and induced probability measure P_X(x) = P_Ω(X⁻¹(x)). We also write P_X(x) ≡ P(X = x) ≡ p(x).
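As a concrete illustration of the induced measure (a minimal sketch; the die example and the parity variable are my own, not from the slides):

```python
# Sketch: a random variable as a map from an event space to an outcome space,
# with the induced measure P_X(x) = P_Omega(X^{-1}(x)).
from collections import defaultdict

# Event space: a fair six-sided die.
P_Omega = {w: 1/6 for w in range(1, 7)}

# Random variable X: parity of the roll.
def X(w):
    return "even" if w % 2 == 0 else "odd"

# Induced measure: sum P_Omega over the preimage X^{-1}(x).
P_X = defaultdict(float)
for w, prob in P_Omega.items():
    P_X[X(w)] += prob

print(dict(P_X))  # {'odd': 0.5, 'even': 0.5} (up to float rounding)
```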

Neyman-Pearson Lemma I

Lemma
Consider observations x_1, x_2, ..., x_n of a random variable X and two potential hypotheses (distributions) p_1 and p_2 they could have been based upon. Consider the test for hypothesis p_1 to be given as (x_1, x_2, ..., x_n) ∈ A, where

A = { (x_1, x_2, ..., x_n) : p_1(x_1, x_2, ..., x_n) / p_2(x_1, x_2, ..., x_n) > C }

with some C ∈ ℝ_+. Assume the rate α of false negatives, p_1(A^c) (generated by p_1, but not in A), to be given, and let β be the rate of false positives, p_2(A). Then any test with false negative rate α′ ≤ α has false positive rate β′ ≥ β. (Cover and Thomas, 2006)

Neyman-Pearson Lemma II

Proof (Cover and Thomas, 2006)
Let A be as above and B some other acceptance region; let χ_A and χ_B be the indicator functions. Then for all x:

[χ_A(x) − χ_B(x)] [p_1(x) − C p_2(x)] ≥ 0.

Multiplying out and integrating:

0 ≤ ∫_A (p_1 − C p_2) − ∫_B (p_1 − C p_2)
  = (1 − α) − Cβ − (1 − α′) + Cβ′
  = C(β′ − β) − (α − α′),

so α′ ≤ α implies β′ ≥ β.

Neyman-Pearson Lemma III

Consideration
- assume the events x_i i.i.d.
- the test becomes: Π_i p_1(x_i) / p_2(x_i) > C
- logarithmize: Σ_i log [p_1(x_i) / p_2(x_i)] > κ (:= log C)

Note: Kullback-Leibler Divergence
Average evidence growth per sample:

D_KL(p_1 ‖ p_2) = E_{p_1}[ log p_1(X)/p_2(X) ] = Σ_{x ∈ 𝒳} p_1(x) log [p_1(x) / p_2(x)]

Neyman-Pearson Lemma IV

[Figure: cumulative log-likelihood ratio ("log sum") over up to 10000 samples for the hypothesis pairs 0.40 vs 0.60, 0.50 vs 0.60 and 0.55 vs 0.60.]

Neyman-Pearson Lemma V

[Figure: the same "log sum" curves with straight lines of slope D_KL overlaid (dkl_04·x, dkl_05·x, dkl_055·x), showing the evidence growing linearly at the respective D_KL rate.]
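The behaviour in the two plots can be reproduced with a short simulation. The following sketch is my own reconstruction, assuming the curves compare Bernoulli-type hypotheses with the stated parameters and that the data are drawn under p_1; it accumulates the log-likelihood ratio and compares it with n · D_KL(p_1 ‖ p_2).

```python
# Sketch: cumulative log-likelihood ratio for two Bernoulli hypotheses.
# With data drawn under p1, the sum grows roughly like n * D_KL(p1 || p2).
import math
import random

def dkl_bernoulli(p1, p2):
    """D_KL(Ber(p1) || Ber(p2)) in bits."""
    return p1 * math.log2(p1 / p2) + (1 - p1) * math.log2((1 - p1) / (1 - p2))

def log_sum(p1, p2, n, seed=0):
    """Cumulative log2 p1(x_i)/p2(x_i) over n samples drawn from Ber(p1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = 1 if rng.random() < p1 else 0
        total += math.log2((p1 if x else 1 - p1) / (p2 if x else 1 - p2))
    return total

n = 10_000
for p1 in (0.40, 0.50, 0.55):
    p2 = 0.60
    print(f"p1={p1} vs p2={p2}: log sum = {log_sum(p1, p2, n):8.1f}, "
          f"n*D_KL = {n * dkl_bernoulli(p1, p2):8.1f}")
```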

Part I Information Theory


Structural Motivation

Intrinsic Pathways to Information Theory:
Laplace's principle, physical entropy, Shannon axioms, typicality theory, optimal communication, optimal Bayes, Rate Distortion, information geometry, AI.

Optimal Communication

Codes
- task: send messages (disambiguate states) from sender to receiver
- consider self-delimiting codes (without an extra delimiting character)
- simple example: prefix codes

Def.: Prefix Codes
Codes in which no codeword is a prefix of another codeword.

Prefix Codes

[Figure: binary code tree with branches labelled 0 and 1, illustrating a prefix code.]

Kraft Inequality

Theorem
Assume events x ∈ 𝒳 = {x_1, x_2, ..., x_k} are coded using prefix codewords over an alphabet of size b = |B|, with lengths l_1, l_2, ..., l_k for the respective events. Then

Σ_{i=1}^{k} b^{−l_i} ≤ 1.

Proof Sketch (Cover and Thomas, 2006)
Let l_max be the length of the longest codeword and expand the code tree fully to level l_max. The fully expanded leaves are either:
1. codewords;
2. descendants of codewords;
3. neither.
A codeword of length l_i has b^{l_max − l_i} full-tree descendants, which must be different for the different codewords, and there cannot be more than b^{l_max} leaves in total. Hence Σ_i b^{l_max − l_i} ≤ b^{l_max}.

Remark
The converse also holds.
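A quick numerical illustration (my own toy code, not from the slides): check prefix-freeness and the Kraft sum for a small binary code.

```python
# Sketch: verify the Kraft inequality sum_i b**(-l_i) <= 1 for a binary prefix code.
def kraft_sum(lengths, b=2):
    """Kraft sum for codeword lengths over an alphabet of size b."""
    return sum(b ** (-l) for l in lengths)

def is_prefix_free(codewords):
    """True if no codeword is a prefix of another."""
    return not any(c1 != c2 and c2.startswith(c1)
                   for c1 in codewords for c2 in codewords)

code = {"a": "0", "b": "10", "c": "110", "d": "111"}   # a binary prefix code
print(is_prefix_free(list(code.values())))             # True
print(kraft_sum(len(w) for w in code.values()))        # 1.0 (equality: the tree is full)
```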

Considerations: Most Compact Code

Assume
We want to code a stream of events x ∈ 𝒳 appearing with probability p(x).

Minimize
The average code length E[L] = Σ_i p(x_i) l_i under the constraint Σ_i b^{−l_i} = 1.

Note
1. try to make the l_i as small as possible,
2. i.e. make the b^{−l_i} as large as possible,
3. limited by the Kraft inequality, which ideally becomes an equality, Σ_i b^{−l_i} = 1; as the l_i are integers, that is typically not achieved exactly.

Result
Differentiating the Lagrangian Σ_i p(x_i) l_i + λ Σ_i b^{−l_i} with respect to the l_i gives the codeword lengths of the shortest code:

l_i = −log_b p(x_i)

Average Codeword Length = Σ_i p(x_i) l_i = −Σ_x p(x) log p(x)

In the following, assume binary log.
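To see the "average codeword length ≈ entropy" result numerically, here is a sketch (my own, using Python's heapq; the distribution is a made-up dyadic example) that builds a Huffman code and compares its expected length with the entropy.

```python
# Sketch: expected Huffman codeword length vs. entropy for a toy distribution.
import heapq
import itertools
import math

def huffman_lengths(probs):
    """Return {symbol: codeword length} for a binary Huffman code."""
    counter = itertools.count()            # tie-breaker so heapq never compares dicts
    heap = [(p, next(counter), {sym: 0}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, _, a = heapq.heappop(heap)
        p2, _, b = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**a, **b}.items()}
        heapq.heappush(heap, (p1 + p2, next(counter), merged))
    return heap[0][2]

p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}      # toy distribution
lengths = huffman_lengths(p)
avg_len = sum(p[s] * lengths[s] for s in p)
entropy = -sum(q * math.log2(q) for q in p.values())
print(lengths)                 # e.g. {'a': 1, 'b': 2, 'c': 3, 'd': 3}
print(avg_len, entropy)        # 1.75 1.75 (dyadic probabilities: equality)
```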

Entropy

Def.: Entropy
Consider the random variable X. Then the entropy H(X) of X is defined as

H(X) [≡ H(p)] := −Σ_x p(x) log p(x)

with the convention 0 log 0 ≡ 0.

Interpretations
- average optimal codeword length
- uncertainty (about the next sample of X)
- physical entropy
- much more...

Quote
"Why don't you call it entropy. In the first place, a mathematical development very much like yours already exists in Boltzmann's statistical mechanics, and in the second place, no one understands entropy very well, so in any discussion you will be in a position of advantage."
John von Neumann

Meditation: Probability/Code Mismatch

Consider events x following a probability p(x), but a modeller mistakenly assuming the probability q(x), with "optimal" code lengths −log q(x). Then the code-length waste per symbol is given by

−Σ_x p(x) log q(x) + Σ_x p(x) log p(x) = Σ_x p(x) log [p(x)/q(x)] = D_KL(p ‖ q)
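A tiny numerical check (the two distributions are my own toy example): the waste equals the cross-entropy minus the entropy, which is D_KL(p ‖ q).

```python
# Sketch: code-length waste when coding a p-source with a q-optimal code.
import math

p = {"a": 0.5, "b": 0.3, "c": 0.2}   # true distribution (toy example)
q = {"a": 0.25, "b": 0.25, "c": 0.5} # mistaken model

cross_entropy = -sum(p[x] * math.log2(q[x]) for x in p)   # average code length used
entropy       = -sum(p[x] * math.log2(p[x]) for x in p)   # best achievable
dkl           =  sum(p[x] * math.log2(p[x] / q[x]) for x in p)

print(cross_entropy - entropy, dkl)  # the two agree (up to float rounding): waste = D_KL(p || q)
```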

A Tip of Types (Cover and Thomas, 2006)

Method of Types: Motivation
- consider sequences with the same empirical distribution
- how many of these have a particular distribution?
- what is the probability of such a sequence?

Sketch of the Method
- consider, w.l.o.g., the binary event set 𝒳 = {0, 1}
- consider a sample x^(n) = (x_1, ..., x_n) ∈ 𝒳^n
- the type p_{x^(n)} is the empirical distribution of symbols y ∈ 𝒳 in the sample x^(n), i.e. p_{x^(n)}(y) counts how often the symbol y appears in x^(n), divided by n
- let P_n be the set of types with denominator n
- for p ∈ P_n, call the set of all sequences x^(n) ∈ 𝒳^n with type p the type class C(p) = {x^(n) : p_{x^(n)} = p}

Type Theorem

Type Count
If |𝒳| = 2, one has |P_n| = n + 1 different types for sequences of length n (easy to generalize).

Important
|P_n| grows only polynomially, but |𝒳^n| grows exponentially with n. It follows that at least one type must contain exponentially many sequences. This corresponds to the macrostate in physics.

Theorem (Cover and Thomas, 2006)
If x_1, x_2, ..., x_n is an i.i.d. sample sequence drawn from q, then the probability of x^(n) depends only on its type and is given by

2^{−n [H(p_{x^(n)}) + D_KL(p_{x^(n)} ‖ q)]}

Corollary
If x^(n) has type q, then its probability is given by 2^{−n H(q)} (here we interpret the probability q as a type). A large value of H(q) indicates many possible candidates x^(n) and high uncertainty; a small value, few candidates and low uncertainty.
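A small check of the theorem (my own toy sequence and generating distribution): compute the probability of a binary sequence directly and via its type.

```python
# Sketch: probability of an i.i.d. binary sequence computed via its type.
import math

def h_bits(p):
    """Binary entropy of a Bernoulli(p) type, in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def dkl_bits(p, q):
    """D_KL(Ber(p) || Ber(q)) in bits (assumes 0 < q < 1)."""
    total = 0.0
    if p > 0:
        total += p * math.log2(p / q)
    if p < 1:
        total += (1 - p) * math.log2((1 - p) / (1 - q))
    return total

x = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]      # a sample sequence (made up)
q = 0.6                                 # generating distribution, P(X = 1)
n = len(x)
p_type = sum(x) / n                     # empirical type

direct  = (q ** sum(x)) * ((1 - q) ** (n - sum(x)))
by_type = 2 ** (-n * (h_bits(p_type) + dkl_bits(p_type, q)))
print(direct, by_type)                  # the two probabilities coincide
```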

Laplace's Principle of Insufficient Reason I

Scenario
Consider 𝒳. A probability distribution is assumed on 𝒳, but it is unknown. Laplace's principle of insufficient reason states that, in the absence of any reason to assume that the outcomes are inequivalent, the probability distribution on 𝒳 is taken to be the equidistribution.

Question
How to generalize when something is known?

Answer: Types

Dominant Sample Sequence
Remember: each sequence in the type class C(q) has probability ≈ 2^{−n H(q)}, so C(q) contains on the order of 2^{n H(q)} sequences. A priori, a probability q maximizing H(q) will therefore generate sequence types dominating all others.

Maximum Entropy Principle
Maximize H(q) with respect to q.
Result: the equidistribution q(x) = 1/|𝒳|.

Sanov's Theorem I

Theorem
Consider an i.i.d. sequence X_1, X_2, ..., X_n of random variables, distributed according to q(x). Let further E be a set of probability distributions. Then (amongst other statements), if E is closed and p* = argmin_{p ∈ E} D_KL(p ‖ q), one has

(1/n) log q^(n)(E) → −D_KL(p* ‖ q).

[Diagram: the distribution q outside the set E, with p* the element of E closest to q.]

Sanov's Theorem II

Interpretation
p is unknown, but one knows constraints on p (e.g. some condition, such as a mean value Ū = Σ_x p(x) U(x), must be attained, i.e. the set E is given); then the dominating types are those close to p*.

Special Case
If the prior q is the equidistribution (indifference), then minimizing D_KL(p ‖ q) under the constraints E is equivalent to maximizing H(p) under these constraints. → Jaynes' Maximum Entropy Principle

Sanov's Theorem III

Jaynes' Principle
- generalization of Laplace's principle
- the maximally uncommitted distribution

Maximum Entropy Distributions I

No constraints
We are interested in maximizing H(X) = −Σ_x p(x) log p(x) over all probabilities p. The probability p lives in the simplex Δ = {q ∈ ℝ^|𝒳| : Σ_i q_i = 1, q_i ≥ 0}. The maximization has to respect constraints, of which we consider here only Σ_x p(x) = 1; the edge constraints (q_i ≥ 0) happen not to be invoked here.

Maximum Entropy Distributions II

No constraints
Unconstrained maximization via Lagrange:

max_p [ −Σ_x p(x) log p(x) + λ Σ_x p(x) ]

Taking the derivative with respect to p(x) gives

−log p(x) − 1 + λ = 0.

Thus p(x) = e^{λ−1} ≡ 1/|𝒳|, the equidistribution.

Maximum Entropy Distributions: Linear Constraints

The constraints are now

Σ_x p(x) = 1,   Σ_x p(x) f(x) = f̄.

Deriving the Lagrangian

−Σ_x p(x) log p(x) + λ Σ_x p(x) + μ Σ_x p(x) f(x)

with respect to p(x) gives

−log p(x) − 1 + λ + μ f(x) = 0,

so that one has the Boltzmann/Gibbs distribution

p(x) = e^{λ−1+μ f(x)} = (1/Z) e^{μ f(x)}.
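A small numerical sketch (my own toy example, not from the slides): find the multiplier μ such that the Gibbs distribution p(x) ∝ e^{μ f(x)} matches a prescribed mean f̄, here by bisection on a four-element outcome space with f(x) = x.

```python
# Sketch: maximum-entropy distribution on {0,1,2,3} with a prescribed mean,
# i.e. the Gibbs form p(x) ~ exp(mu * f(x)) with f(x) = x.
import math

xs = [0, 1, 2, 3]          # toy outcome space
target_mean = 1.2          # prescribed value of E[f(X)] (made up)

def f(x):
    return x               # constraint feature

def gibbs(mu):
    weights = [math.exp(mu * f(x)) for x in xs]
    Z = sum(weights)
    return [w / Z for w in weights]

def mean_f(mu):
    return sum(p * f(x) for p, x in zip(gibbs(mu), xs))

# mean_f is monotonically increasing in mu, so bisection finds the multiplier.
lo, hi = -20.0, 20.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if mean_f(mid) < target_mean else (lo, mid)

p = gibbs(0.5 * (lo + hi))
print([round(q, 4) for q in p], round(sum(q * x for q, x in zip(p, xs)), 4))
```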

Conditional Kullback-Leibler

D_KL can be conditioned:

D_KL[p(y|x) ‖ q(y|x)]

D_KL[p(y|X) ‖ q(y|X)] = Σ_x p(x) D_KL[p(y|x) ‖ q(y|x)]

Kullback-Leibler and Bayes (Biehl, 2013)

We want to estimate p(x|θ), where θ is the parameter, and we observe y. Seek the best q(x|y) for this y in the following sense:
1. minimize the D_KL from the true distribution to the model distribution q,
2. averaged over the possible observations y,
3. averaged over θ:

min_q ∫ dθ p(θ) Σ_y p(y|θ) D_KL[p(x|θ) ‖ q(x|y)]

Result
q(x|y) is the Bayesian inference obtained from p(y|x) and p(x).
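A numerical check of this claim (the probability tables below are my own toy example, not from Biehl's memo): the conditional q(x|y) built by Bayesian averaging, q(x|y) = Σ_θ p(θ|y) p(x|θ), attains a lower value of the averaged-KL objective than randomly drawn alternatives.

```python
# Sketch: check that the Bayesian construction q(x|y) = sum_theta p(theta|y) p(x|theta)
# minimizes the theta- and y-averaged KL objective. All tables are made-up toys.
import itertools
import math
import random

thetas, ys, xs = range(2), range(2), range(2)
p_theta = [0.4, 0.6]
p_y_given_theta = [[0.7, 0.3], [0.2, 0.8]]
p_x_given_theta = [[0.9, 0.1], [0.3, 0.7]]

def objective(q):
    """E_theta E_{y|theta} D_KL[p(x|theta) || q(x|y)] for q indexed as q[y][x]."""
    total = 0.0
    for t, y, x in itertools.product(thetas, ys, xs):
        px = p_x_given_theta[t][x]
        if px > 0:
            total += p_theta[t] * p_y_given_theta[t][y] * px * math.log(px / q[y][x])
    return total

# Bayesian candidate: posterior-weighted mixture of the p(x|theta).
q_bayes = []
for y in ys:
    joint = [p_theta[t] * p_y_given_theta[t][y] for t in thetas]
    post = [j / sum(joint) for j in joint]                    # p(theta|y)
    q_bayes.append([sum(post[t] * p_x_given_theta[t][x] for t in thetas) for x in xs])

best = objective(q_bayes)
rng = random.Random(0)
for _ in range(1000):                          # random alternative conditionals q(x|y)
    q = [[a, 1 - a] for a in (rng.uniform(0.01, 0.99), rng.uniform(0.01, 0.99))]
    assert objective(q) >= best - 1e-12        # no alternative beats the Bayesian choice
print("objective at the Bayesian q:", round(best, 6))
```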

Conditional Entropies

Special Case: Conditional Entropy

H(Y|X = x) := −Σ_y p(y|x) log p(y|x)

H(Y|X) := −Σ_x p(x) Σ_y p(y|x) log p(y|x)

Information
Reduction of entropy (uncertainty) by knowing another variable:

I(X; Y) := H(Y) − H(Y|X)
         = H(X) − H(X|Y)
         = H(X) + H(Y) − H(X, Y)
         = D_KL[p(x, y) ‖ p(x)p(y)]
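A quick sketch (the joint table is my own toy example) that computes these quantities from a joint distribution and confirms that the different expressions for I(X; Y) agree.

```python
# Sketch: entropy, conditional entropy and mutual information from a joint table.
import math

# Toy joint distribution p(x, y), rows = x, columns = y.
p_xy = [[0.30, 0.10],
        [0.15, 0.45]]

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_x = [sum(row) for row in p_xy]
p_y = [sum(col) for col in zip(*p_xy)]

H_X  = H(p_x)
H_Y  = H(p_y)
H_XY = H([p for row in p_xy for p in row])
H_Y_given_X = H_XY - H_X                    # chain rule

# Mutual information three ways.
I1 = H_Y - H_Y_given_X
I2 = H_X + H_Y - H_XY
I3 = sum(p_xy[i][j] * math.log2(p_xy[i][j] / (p_x[i] * p_y[j]))
         for i in range(2) for j in range(2) if p_xy[i][j] > 0)
print(round(I1, 6), round(I2, 6), round(I3, 6))   # all three agree
```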

Rate/Distortion Theory: Coding Below Specifications

Reminder
Information theory is about sending messages. We considered the most compact codes over a given noiseless channel. Now consider the situation where either
1. the channel is not noiseless but has noisy characteristics p(x̂|x), or
2. we cannot afford to spend an average of H(X) bits per symbol to transmit.

Question
What happens? Total collapse of transmission?

Rate/Distortion Theory I

Distortion Compromise
- no longer insist on perfect transmission
- accept a compromise: measure the distortion d(x, x̂) between the original x and the transmitted x̂
- small distortion good, large distortion bad

Theorem: Rate Distortion Function
Given p(x) for the generation of symbols X,

R(D) := min_{p(x̂|x) : E[d(X, X̂)] = D} I(X; X̂)

where the mean is taken over p(x, x̂) = p(x̂|x) p(x).
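To illustrate the definition (my own sketch, not a method from the slides), the code below brute-forces R(D) for a Bernoulli(p) source with Hamming distortion by a grid search over binary test channels p(x̂|x) whose expected distortion does not exceed D, and compares the result with the known closed form H(p) − H(D) (valid for D ≤ min(p, 1 − p)).

```python
# Sketch: brute-force the rate-distortion function R(D) of a Bernoulli(p) source
# with Hamming distortion, by grid search over binary test channels p(xhat|x).
import math

def h(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_info(px, channel):
    """I(X; Xhat) for px = [p0, p1] and channel[x][xhat]."""
    p_joint = [[px[x] * channel[x][xh] for xh in (0, 1)] for x in (0, 1)]
    p_xh = [p_joint[0][xh] + p_joint[1][xh] for xh in (0, 1)]
    return sum(p_joint[x][xh] * math.log2(p_joint[x][xh] / (px[x] * p_xh[xh]))
               for x in (0, 1) for xh in (0, 1) if p_joint[x][xh] > 0)

p = 0.3
px = [1 - p, p]
D = 0.1
steps = 200
best = float("inf")
for i in range(steps + 1):
    for j in range(steps + 1):
        a, b = i / steps, j / steps          # a = p(xhat=1|x=0), b = p(xhat=0|x=1)
        channel = [[1 - a, a], [b, 1 - b]]
        distortion = px[0] * a + px[1] * b   # expected Hamming distortion
        if distortion <= D:
            best = min(best, mutual_info(px, channel))

print(round(best, 4), round(h(p) - h(D), 4))   # grid estimate vs. H(p) - H(D)
```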

Rate/Distortion Theory II

Distortion
[Figure: curve r(x) over x ∈ [0, 1], y-axis from 0 to 1.8.]

First Example: Infotaxis (Vergassola et al., 2007)

Part II References

Biehl, M. (2013). Kullback-Leibler and Bayes. Internal memo.

Cover, T. M. and Thomas, J. A. (2006). Elements of Information Theory. Wiley, 2nd edition.

Vergassola, M., Villermaux, E., and Shraiman, B. I. (2007). Infotaxis as a strategy for searching without gradients. Nature, 445:406–409.