Maximum Entropy - Lecture 8

1 Introduction

In the previous discussions we have not been particularly concerned with the selection of the initial prior probability to use in Bayes' theorem, as the result in most cases is insensitive to the initial values. This is not always the case, however, and we need a way of selecting initial probabilities among many possible choices. We can use the principle of maximum entropy, which is defined by information theory. Information entropy is a measure of the uncertainty in a probability distribution. A data stream is an information transfer to which we can assign a set of possible probability distributions. The principle of maximum entropy states that the set of probabilities, {p_i}, should be assigned so as to maximize the entropy.

As an example, suppose there are 12 balls, all but one of equal weight. You have a balance as an instrument and wish to determine the best way to find the odd ball. This is a problem in optimization and is solved by choosing the most probable path.

2 Review of information theory

Shannon's theorem states that the information carried by an outcome of probability p_i, drawn from a set {p_i}, is

I = \ln(1/p_i)

The entropy of the ensemble is

S = \sum_i p_i \ln(1/p_i)

As previously noted, the logarithm could be taken in any base, but in information theory it is almost always base 2, since information is coded in bits. Thus N random variables, each with entropy S, can be compressed without loss of information into NS or more bits.

Consider an ensemble of N independent, identically distributed random variables. The outcome of this distribution, x = (x_1, ..., x_N), most likely belongs to a subset A_x having 2^{NS} members, each with probability close to 2^{-NS}. This is due to the equipartition of probabilities. Thus as N \to \infty the number of bits per outcome required to specify an outcome approaches S, i.e.

S_\delta / N \to \text{constant as } N \to \infty

where S_\delta refers to the smallest subset of outcomes having the highest probabilities (the largest information content). This is shown in Figure 1.

Figure 1: A plot of S_\delta/N (vertical axis) vs. \delta (horizontal axis) for various N
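
As a quick numerical check of these definitions (not part of the original notes - a minimal Python sketch with an arbitrarily chosen distribution), the code below evaluates the information content I = ln(1/p_i) of individual outcomes and the ensemble entropy S, in both nats and bits.

```python
import math

def information(p):
    """Information content of an outcome with probability p, I = ln(1/p)."""
    return math.log(1.0 / p)

def entropy(probs, base=math.e):
    """Ensemble entropy S = sum_i p_i log(1/p_i); zero-probability terms contribute 0."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

# Example distribution (arbitrary choice, for illustration only).
probs = [0.5, 0.25, 0.125, 0.125]

for p in probs:
    print(f"p = {p:5.3f}   I = ln(1/p) = {information(p):.3f} nats")

print("Entropy (nats):", entropy(probs))
print("Entropy (bits):", entropy(probs, base=2))  # base 2, as used when coding in bits
```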

3 Application of information theory

Suppose we have a gas of N point particles. Initially all particles are contained in a volume V, and suppose we know the position of each particle. Divide the volume into 10^6 small volumes \delta v, and to each of the 10^6 volumes assign a probability p_i of finding a gas particle within it. Thus the probability is p_i = n_i/N. For an even distribution, n_i = N/10^6. The entropy is

S = \sum_{i=1}^{10^6} p_i \ln(1/p_i) = \sum_i (n_i/N) \ln(N/n_i)

In the case of an even distribution,

S = \sum_{i=1}^{10^6} \frac{N/10^6}{N} \ln\!\left(\frac{N}{N/10^6}\right) = \ln(10^6)

Note that the entropy depends on the measurement scale. If the volume size is decreased by a factor of 10^3 (i.e. the accuracy of the measurement of the particle positions increases), the entropy increases to

S = \ln(10^9)

For macroscopic systems we most often use relative rather than absolute measures of entropy, due to the large number of particles.

Now allow the particles to move. The state of the system is determined by the distribution of particles among the volumes. We assume that a particle is equally likely to move into any box in any given time step.
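
The scale dependence of the entropy is easy to verify numerically. The following sketch (illustrative only; it simply exploits the fact that every term in the sum is identical for an even distribution) computes S for 10^6 and for 10^9 boxes.

```python
import math

def even_distribution_entropy(num_boxes):
    """Entropy S = sum_i (n_i/N) ln(N/n_i) for an even distribution n_i = N/num_boxes.
    Every one of the num_boxes terms equals (1/num_boxes) * ln(num_boxes)."""
    p = 1.0 / num_boxes
    return num_boxes * p * math.log(1.0 / p)   # = ln(num_boxes)

print("S with 10^6 boxes:", even_distribution_entropy(10**6))  # ln(10^6) ~ 13.8
print("S with 10^9 boxes:", even_distribution_entropy(10**9))  # ln(10^9) ~ 20.7
```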

There are 10^6 configurations with all particles in one box. In this case the relative entropy is

S(\text{all in 1 box}) = \sum_{i=1}^{10^6} p_i \ln(1/p_i) = 0

This is because p_1 = n_1/10^6 = 1 (here N = 10^6 particles, all of them in box 1), so its term vanishes, and p_i \ln(1/p_i) = 0 for the empty boxes with p_i = 0.

Now suppose the particles are put in 2 boxes. The number of such distributions is

\binom{10^6}{2} = \frac{10^6!}{2!\,(10^6 - 2)!} \approx 5 \times 10^{11}

The entropy in this case is

S(\text{all in 2 boxes}) = \frac{1}{2}\ln(2) + \frac{1}{2}\ln(2) = \ln(2)

There are 5 \times 10^{11} two-box distributions compared with 10^6 one-box distributions, so the probability of finding a two-box rather than a one-box configuration is

P = \frac{5 \times 10^{11}}{5 \times 10^{11} + 10^6} \approx 1 - 10^{-5}

Now suppose the particles distribute almost equally, with 1/2 of the boxes having one less particle and 1/2 having one more particle. The number of configurations is

\binom{10^6}{10^6/2} \approx 10^{3 \times 10^5}

Each of these configurations has entropy approximately \ln(10^6). The result of all this is that we started the system with entropy 0, and the probability that it moves to a higher-entropy configuration is

P = 1 - 10^{-3 \times 10^5} \approx 1

It is overwhelmingly probable that, as time passes, a macroscopic system will increase in entropy until it reaches a maximum. This provides an explanation (if not a proof) of the 2nd law of thermodynamics and the foundation of the maximum entropy principle.
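
The counting above can be reproduced with a short calculation. The sketch below (an illustration, not from the notes) works with log10 of the multiplicities via math.lgamma to avoid overflow, and compares the one-box, two-box, and nearly even configurations for 10^6 boxes.

```python
import math

M = 10**6  # number of boxes (equal to the number of particles in this example)

def log10_binomial(n, k):
    """log10 of the binomial coefficient C(n, k), via log-gamma to avoid overflow."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)) / math.log(10)

log10_one_box = math.log10(M)              # 10^6 ways to put all particles in one box
log10_two_box = log10_binomial(M, 2)       # ~ 5 x 10^11 ways to choose the two boxes
log10_even    = log10_binomial(M, M // 2)  # ~ 10^(3 x 10^5) nearly even configurations

print("log10(# one-box configurations):     ", log10_one_box)
print("log10(# two-box configurations):     ", log10_two_box)
print("log10(# nearly even configurations): ", log10_even)

# Probability of a two-box rather than a one-box configuration; compare with the
# value quoted in the text.
p_two_box = 10**log10_two_box / (10**log10_two_box + M)
print("P(two-box | one- or two-box) ~", p_two_box)
```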

4 Example - Texas armadillos and Dos Equis

We cannot explore the application of maximum entropy in detail, but we can give an illustrative example. We look at Texas armadillos, classified by whether they are left-handed and whether they drink Dos Equis beer [paraphrased from Jaynes, Maximum Entropy and Bayesian Methods]. Suppose observation establishes that 3/4 of the armadillos in Texas are left-handed and 3/4 drink Dos Equis. We fill in the data in Table 1.

Figure 2: A Dos Equis drinking armadillo

Table 1: Probability table for Texas armadillos

  Beer         Left Handed   Right Handed   Probability
  Dos Equis       p_11           p_12           3/4
  Other           p_21           p_22           1/4
  Total           3/4            1/4             1

Armadillos come in quantized units, so for a total of N armadillos we have probabilities p_{ij} \approx N_{ij}/N, which leads to the equations

p_{11} + p_{12} = 3/4
p_{21} + p_{22} = 1/4
p_{11} + p_{21} = 3/4
p_{12} + p_{22} = 1/4

Solve these equations letting p_{22} = q, and put the table in the form

\begin{pmatrix} 0.5 + q & 0.25 - q \\ 0.25 - q & q \end{pmatrix}

We find that p_{12} = p_{21}. It can also be shown that, when assigning probabilities, maximizing the entropy is the only choice that does not introduce correlations between the variables.

The number of armadillos in Texas is large, so there is a large number of possible assignments {N_{ij}}. We count the ways of realizing a given table with the multinomial coefficient

W = \frac{N!}{N_{11}!\, N_{12}!\, N_{21}!\, N_{22}!}
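
To check the claim that the maximum-entropy assignment is the uncorrelated one, the following sketch (not part of the original notes; the grid scan is just one convenient way to do the maximization) scans q over its allowed range 0 <= q <= 1/4 and finds that the entropy peaks at q = 1/16, which is exactly the independent table built from the marginals 3/4 and 1/4.

```python
import math

def table(q):
    """Probability table parameterized by q = p_22, as fixed by the marginal constraints."""
    return [0.5 + q, 0.25 - q,
            0.25 - q, q]

def entropy(probs):
    """S = sum p ln(1/p), with zero-probability cells contributing 0."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)

# Scan the allowed range 0 <= q <= 1/4 on a fine grid and keep the entropy maximizer.
best_q = max((i * 0.25 / 10000 for i in range(10001)), key=lambda q: entropy(table(q)))

print("entropy-maximizing q ~", best_q)          # ~ 1/16 = 0.0625
print("independent table   :", [0.75 * 0.75, 0.75 * 0.25, 0.25 * 0.75, 0.25 * 0.25])
print("max-entropy table   :", table(best_q))
```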

As an example, choose N = 4. There are two possible solutions.

Solution 1, q = 0:

\frac{1}{4}\begin{pmatrix} 2 & 1 \\ 1 & 0 \end{pmatrix}

This has multiplicity W = 12 and entropy S = \sum p_{ij} \log(1/p_{ij}) = 0.45.

Solution 2, q = 1/N = 1/4:

\frac{1}{4}\begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}

This has multiplicity W = 4 and entropy S = \sum p_{ij} \log(1/p_{ij}) = 0.24. (The quoted entropy values correspond to base-10 logarithms.)

The solution with the greater entropy occurs 75% of the time (12/(12 + 4)).

Now suppose we change the analysis by introducing a connection between drinking Dos Equis and handedness - a gene, perhaps. This introduces a constraint and a correlation. The solution can be developed as above, and these results would be iterated in a Bayesian approach with the above prior.
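
The N = 4 counting can be reproduced by enumerating the integer tables consistent with the marginals. A short sketch (illustrative; base-10 logarithms are used so the entropies match the values quoted above):

```python
import math

N = 4  # total number of armadillos

def multiplicity(counts):
    """Multinomial coefficient W = N! / (N11! N12! N21! N22!)."""
    w = math.factorial(N)
    for n in counts:
        w //= math.factorial(n)
    return w

def entropy10(counts):
    """S = sum p log10(1/p) with p = n/N, skipping empty cells."""
    return sum((n / N) * math.log10(N / n) for n in counts if n > 0)

# Enumerate integer tables (N11, N12, N21, N22) with row sums 3, 1 and column sums 3, 1.
solutions = []
for n11 in range(N + 1):
    n12, n21 = 3 - n11, 3 - n11
    n22 = 1 - n12
    counts = (n11, n12, n21, n22)
    if all(0 <= n <= N for n in counts):
        solutions.append(counts)

total_W = sum(multiplicity(c) for c in solutions)
for c in solutions:
    W = multiplicity(c)
    print(f"table {c}: W = {W:2d}, S = {entropy10(c):.2f}, fraction = {W / total_W:.2f}")
```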