Why maximize entropy?

TSILB Version 1.0, 19 May 1982
(TSILB = This Space Intentionally Left Blank. Contributors include: Peter Doyle.)

It is commonly accepted that if one is asked to select a distribution satisfying a bunch of constraints, and if these constraints do not determine a unique distribution, then one is best off picking the distribution having maximum entropy. The idea is that this distribution incorporates the least possible information. Explicitly, we start off with no reason to prefer any distribution over any other, and then we are given some information about the distribution, namely that it satisfies some constraints. We want to be as conservative as possible; we want to extract as little as possible from what we have been told about the distribution; we don't want to jump to any conclusions; if we are going to come to any conclusions we want to be forced to them.

Lying behind this conservative attitude there is doubtless an Occam's razor kind of attitude: We tend to prefer, in the language of the LSAT, "the narrowest principle covering the facts." There is also an element of sad experience: We easily call to mind a host of memories of being burned when we jumped to conclusions.

For some time I have had the idea of making this latter feeling precise, by interpreting the process of picking a distribution satisfying a bunch of constraints as a strategy in a game we play with God. God tells us the constraints, we pick a distribution meeting those constraints, and then we have to pay according to how badly we did in guessing the distribution. The maximum entropy distribution should be our optimal strategy in this game.

Copyright (C) 1982 Peter G. Doyle. This work is freely redistributable under the terms of the GNU Free Documentation License.

Last night I recognized for the first time what the rules of this game with God would have to be, or rather one possible set of rules (perhaps there are other possibilities). I haven't yet convinced myself that these are the only natural rules for such a game, or even that they are all that natural. In thinking about these rules, the important question will be: Is this game just something that was rigged up to justify the maximum entropy distribution? After all, any point is the location of the maximum of some function. Does the statement that choosing the maximum entropy distribution is the optimal strategy in this game have any real mathematical content? (I purposely say mathematical rather than philosophical, from the prejudice that one can never have the latter without the former. "Except as a man handled an axe, he had no way of knowing a fool.") Obviously I think the answers to these questions are favorable in the case of the game I'm proposing, but I haven't taken the time to think through them carefully yet. (Added later: Now that I've finished writing this I'm much more confident.)

IDEA OF THE GAME: We are told the constraints, we pick a distribution, God gets to pick the real distribution, satisfying the constraints of course, some disinterested party picks an outcome according to the real distribution that God has just picked, and we have to pay according to how surprised we are to see that outcome.

Of course the big question is, how much do we have to pay? The big answer is, minus the log of the probability we assigned to the outcome. Actually, it is better to have us pay $-\log(p_i/(1/n))$, where $p_i$ is the probability we assigned to the point that got picked (let's call outcomes "points"), and $n$ is the total number of possible outcomes. To put it more positively, we get paid $\log(p_i/(1/n))$, the log of the factor by which we changed the weight of the point that got picked from the value it is given by the uniform distribution. A big factor means "I thought so"; a little factor means "Fooled me!". We choose this factor rather than the new weight itself so that if we start with a non-uniform a priori distribution the theory will continue to work, and so if no constraints are made at all and we stick with the a priori distribution then no money changes hands, and because it feels like the right thing to do. We take the log of the factor because we are trying to measure surprise and independent surprises should add, and because it feels like the right thing to do.
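Since this payoff rule is the heart of the game, here is a minimal sketch of it in Python. This is my own illustration, not part of the original note: the three-point example, the prior, and all the names are assumptions.

    import math

    def payment(p, mu, i):
        # What we get paid when point i is drawn: the log of the factor
        # by which we moved point i's weight away from the prior mu.
        return math.log(p[i] / mu[i])

    def expected_payment(q, p, mu):
        # Expected payment when the real distribution (God's pick) is q.
        # This is the quantity nu_mu(q, p) that appears later in the proof.
        return sum(qi * payment(p, mu, i) for i, qi in enumerate(q))

    mu = [1/3, 1/3, 1/3]  # uniform a priori distribution (the 1/n case)
    p = [1/2, 1/4, 1/4]   # our guess

    print(payment(p, mu, 0))  # log(3/2) > 0: "I thought so"
    print(payment(p, mu, 1))  # log(3/4) < 0: "Fooled me!"

    # If the real distribution is our guess, our expected payment is the
    # relative entropy of p with respect to mu, which is nonnegative; if
    # the real distribution is something else, we can expect to do worse.
    print(expected_payment(p, p, mu))   # >= 0
    print(expected_payment(mu, p, mu))  # here (1/3)*log(27/32) < 0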

PROOF that choosing the maximum entropy distribution is the optimal strategy in this game. Suppose for simplicity that there is only one constraint. I should have said before that the kind of constraints I am thinking about are constraints of the form: the function $f$ having value $f_i$ at point $i$ has expected value $\bar f$, i.e.
$$\sum_i p_i f_i = \bar f.$$
The maximum entropy distribution is obtained via a Gibbs factor:
$$p_i = \mu_i e^{\alpha f_i}/Z, \quad \text{where} \quad Z = \sum_i \mu_i e^{\alpha f_i},$$
and $\mu$ is the a priori distribution. So we end up getting paid $\alpha f_i - \log Z$ if $i$ is picked. Our expected payment is therefore $\alpha \bar f - \log Z$ no matter what distribution God picks. This appears significant and makes us think we're on the right track.
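To see the Gibbs machinery in action, here is a small Python sketch, again my own illustration under assumed data: six points, uniform prior, and the mean of $f(i) = i$ constrained to 4.5. It bisects on $\alpha$ until the constraint holds, then prints the expected payment $\alpha \bar f - \log Z$ that results no matter which constraint-satisfying distribution God picks.

    import math

    def gibbs(mu, f, alpha):
        # Gibbs distribution p_i = mu_i * exp(alpha * f_i) / Z.
        w = [m * math.exp(alpha * fi) for m, fi in zip(mu, f)]
        Z = sum(w)
        return [wi / Z for wi in w], Z

    def solve_alpha(mu, f, fbar, lo=-50.0, hi=50.0):
        # Bisect on alpha until the Gibbs mean of f equals fbar.
        # (The mean is increasing in alpha, so bisection works.)
        for _ in range(200):
            mid = (lo + hi) / 2
            p, _ = gibbs(mu, f, mid)
            mean = sum(pi * fi for pi, fi in zip(p, f))
            lo, hi = (mid, hi) if mean < fbar else (lo, mid)
        return (lo + hi) / 2

    mu = [1/6] * 6          # uniform a priori distribution on six points
    f = [1, 2, 3, 4, 5, 6]  # the constrained function
    fbar = 4.5              # its required expected value

    alpha = solve_alpha(mu, f, fbar)
    p_me, Z = gibbs(mu, f, alpha)

    print(p_me)                        # the maximum entropy distribution
    print(alpha * fbar - math.log(Z))  # expected payment, whatever God picks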

And indeed, let's look at the problem this way: We are supposed to pick a distribution from a collection $C$ of distributions so as to attain
$$\max_p \min_q \sum_i q_i \log(p_i/\mu_i).$$
We want to verify that this is equivalent to maximizing entropy, i.e. that
$$\arg\max_p \min_q \sum_i q_i \log(p_i/\mu_i) = \arg\max_p \Bigl(-\sum_i p_i \log(p_i/\mu_i)\Bigr) = \arg\min_p \sum_i p_i \log(p_i/\mu_i),$$
where $\arg\max$ means the location of the maximum. If we call the quantity
$$\nu_\mu(q, p) = \sum_i q_i \log(p_i/\mu_i)$$
the degree to which $q$ verifies $p$, then to optimize our strategy we want to pick $p$ so as to maximize the minimum verification. We want to know if this is the same as picking $p$ so as to minimize the self-verification. (That's pessimism for you.)

But in the case we are talking about we have seen that for the maximum entropy distribution (that is, the least self-fulfilling distribution) the degree of verification doesn't depend on the $q$ chosen. So here we have a distribution $p_{\mathrm{me}}$ that doesn't like itself any better than anyone else likes it. That is, this distribution is its own worst enemy. But it is a general fact of life that a distribution likes itself at least as well as it likes anyone else:
$$\sum_i q_i \log(p_i/\mu_i) = \sum_i q_i \log(q_i/\mu_i) - \sum_i q_i \log(q_i/p_i) \le \sum_i q_i \log(q_i/\mu_i),$$
since $\sum_i q_i \log(q_i/p_i) \ge 0$. So $p_{\mathrm{me}}$ is the least despised distribution in the world. In symbols:
$$\min_{q \in C} \nu_\mu(q, p_{\mathrm{me}}) = \nu_\mu(p_{\mathrm{me}}, p_{\mathrm{me}}) \quad \text{and} \quad \nu_\mu(p_{\mathrm{me}}, p) \le \nu_\mu(p_{\mathrm{me}}, p_{\mathrm{me}}),$$
and so
$$\min_{q \in C} \nu_\mu(q, p) \le \nu_\mu(p_{\mathrm{me}}, p) \le \nu_\mu(p_{\mathrm{me}}, p_{\mathrm{me}}) = \min_{q \in C} \nu_\mu(q, p_{\mathrm{me}}),$$
so
$$\arg\max_p \min_{q \in C} \nu_\mu(q, p) = p_{\mathrm{me}} = \arg\min_p \nu_\mu(p, p). \quad \text{q.e.d.}$$

PROBLEM: For which sets $C$ of distributions does the equality
$$\arg\max_p \min_{q \in C} \nu(q, p) = \arg\min_p \nu(p, p)$$
hold? All sets? Seems unlikely. Sets convex in a suitable sense? Check out Csiszár's paper on I-divergence geometry. These are the sets from which it makes sense to pick the maximum entropy distribution.
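For what it's worth, here is a quick numerical check of the max-min story on one such (convex) constraint set, in the same assumed six-point setup as above; the particular members of $C$ are my own illustration. Everyone in $C$ verifies $p_{\mathrm{me}}$ to exactly the same degree, and nobody verifies $p_{\mathrm{me}}$ better than they verify themselves.

    import math

    def nu(q, p, mu):
        # nu_mu(q, p) = sum_i q_i log(p_i / mu_i); terms with q_i = 0 drop out.
        return sum(qi * math.log(pi / mi)
                   for qi, pi, mi in zip(q, p, mu) if qi > 0)

    mu = [1/6] * 6
    f = [1, 2, 3, 4, 5, 6]
    fbar = 4.5

    # Recompute p_me by bisection on alpha (repeated from the previous
    # sketch so that this one stands alone).
    lo, hi = -50.0, 50.0
    for _ in range(200):
        alpha = (lo + hi) / 2
        w = [m * math.exp(alpha * fi) for m, fi in zip(mu, f)]
        p_me = [wi / sum(w) for wi in w]
        mean = sum(pi * fi for pi, fi in zip(p_me, f))
        lo, hi = (alpha, hi) if mean < fbar else (lo, alpha)

    # A few members of C: each of these also has mean 4.5.
    qs = [p_me,
          [0, 0, 0, 0.5, 0.5, 0],
          [0, 0, 0.5, 0, 0, 0.5],
          [0.3, 0, 0, 0, 0, 0.7]]

    # Every q in C verifies p_me to the same degree alpha*fbar - log Z...
    print([round(nu(q, p_me, mu), 6) for q in qs])

    # ...and each q likes itself at least as well as it likes p_me.
    print(all(nu(q, q, mu) >= nu(q, p_me, mu) for q in qs))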

Addendum 21 May 1982: Talking to Roger Rosenkrantz has made me realize that there is no reason to limit my original alternatives to the set $C$ of God's alternatives. Let me simply be given the information that God will pick from the set $C$. If I want to choose some distribution outside of the set $C$ for my best guess, fine. Then, according to Roger, the reason we take the log in deciding how much I will be paid is that this is (roughly?) the only function with the property that when $C$ is a one-element set I am always best off choosing that element. This seems like a pretty cogent reason for using the log.

Things to think about:

Mixed strategies. What if all I tell God is a distribution of distributions, and the disinterested party picks one of my distributions independently of picking the point according to God's distribution? Does this change anything?

Contradictory information. What if I am told that the real distribution belongs to both $C$ and $D$, which are disjoint? To make sense of this, let me assume that one of the two constraints is the real one, and that this will be decided by flipping a $p$-weighted coin at some point in the proceedings. Investigate this.