A Brief Review of Probability, Bayesian Statistics, and Information Theory
Brendan Frey
Electrical and Computer Engineering, University of Toronto
frey@psi.toronto.edu
http://www.psi.toronto.edu
A system is described by a set of random variables $x_1, \ldots, x_N$ with domains $A_1, \ldots, A_N$. A configuration is an assignment of a value to every variable. The sample space is the set of possible configurations and is given by the product of the domains: $A_1 \times \cdots \times A_N$.

Ex: A die. $x \in \{1, \ldots, 6\}$ (# of dots on die).

Ex: 2 dice. $x_1 \in \{1, \ldots, 6\}$, $x_2 \in \{1, \ldots, 6\}$; the sample space $\{1, \ldots, 6\} \times \{1, \ldots, 6\}$ has 36 configurations.

Ex: 2 dice. $z = x_1 + x_2$, $z \in \{2, \ldots, 12\}$. Less intuitive than above.

Ex: 2 dice, angle of hand. $x_1, x_2 \in \{1, \ldots, 6\}$, $\theta \in [0, 2\pi)$; the sample space mixes discrete and continuous domains.
The probability of configuration $x$, $P(x)$, is a real number that satisfies $0 \le P(x) \le 1$ and $\sum_x P(x) = 1$.

Ex: 2 unbiased dice. $P(x_1, x_2) = 1/36$ for each of the 36 configurations.

The probability density for configuration $x$, $p(x)$, is a real number that satisfies $p(x) \ge 0$ and $\int p(x)\,dx = 1$, where $dx$ is a differential volume of the sample space. NOTE: $p(x) > 1$ is possible.
"! A random experiment or simulation produces a configuration 4. Discrete case: In experiments, the fraction of times configuration ' occurs converges to as. Continuous case: Suppose is a region of. In experiments, the fraction of times a configuration in occurs converges to as.
! #" % is the probability of given the value of $ Imagine throwing away all experiments where $ equal to the given value is not & '( ) *+ and $ are independent if $ 6 $ % and $ Knowing tells us nothing about the value of and vice versa,
In general, $P(x, y) = P(x \mid y)\,P(y) = P(y \mid x)\,P(x)$ (the chain rule).

If $x$ and $y$ are independent, $P(x, y) = P(x)\,P(y)$.

From the chain rule and normalization, $P(x) = \sum_y P(x, y)$. $P(x)$ is sometimes called the marginal of $P(x, y)$.

For densities, $p(x) = \int p(x, y)\,dy$.
Since $P(x, y) = P(x \mid y)\,P(y) = P(y \mid x)\,P(x)$, using $P(y) = \sum_x P(y \mid x)\,P(x)$, we get Bayes' rule:

$P(x \mid y) = \dfrac{P(y \mid x)\,P(x)}{\sum_{x'} P(y \mid x')\,P(x')}$

For observed $y$ and hidden $x$, we call $P(x)$ the prior, $P(y \mid x)$ the likelihood and $P(x \mid y)$ the posterior.

For densities, $p(x \mid y) = \dfrac{p(y \mid x)\,p(x)}{\int p(y \mid x')\,p(x')\,dx'}$.
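A small numerical sketch of Bayes' rule for a discrete hidden variable (assumes NumPy; the two hypotheses and the data are hypothetical):

    import numpy as np

    prior = np.array([0.5, 0.5])                 # P(x): "fair coin" vs "biased coin (0.9 heads)"
    likelihood = np.array([0.5**8, 0.9**8])      # P(y | x): probability of observing 8 heads in a row
    joint = likelihood * prior                   # P(y | x) P(x)
    posterior = joint / joint.sum()              # P(x | y) = P(y|x)P(x) / sum_x' P(y|x')P(x')
    print(posterior)                             # posterior mass concentrates on the biased coin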
The expected value of $f(x)$ is $E[f(x)] = \sum_x f(x)\,P(x)$. $f(x)$ can be a vector, e.g., $f(x) = x$ or $f(x) = (x, x^2)$.

The variance of $x$ is $\mathrm{Var}(x) = E\big[(x - E[x])^2\big] = E[x^2] - E[x]^2$.

If $x$ and $y$ are independent, $\mathrm{Var}(x + y) = \mathrm{Var}(x) + \mathrm{Var}(y)$.
The covariance of $x$ and $y$ is $\mathrm{Cov}(x, y) = E\big[(x - E[x])(y - E[y])\big] = E[xy] - E[x]\,E[y]$.

If $x$ and $y$ are independent, $\mathrm{Cov}(x, y) = 0$ (not vice versa).

In general, $\mathrm{Var}(x + y) = \mathrm{Var}(x) + \mathrm{Var}(y) + 2\,\mathrm{Cov}(x, y)$.

The covariance matrix of $x = (x_1, \ldots, x_N)$, for $x \in \mathbb{R}^N$, is the $N \times N$ matrix $\Sigma$ with entries $\Sigma_{ij} = \mathrm{Cov}(x_i, x_j)$.
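A quick check of the variance/covariance identity on samples (assumes NumPy; the correlated pair below is hypothetical):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(size=100_000)
    y = 0.5 * x + rng.normal(size=100_000)       # y is correlated with x

    lhs = np.var(x + y)
    rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, bias=True)[0, 1]
    print(lhs, rhs)                              # Var(x+y) = Var(x) + Var(y) + 2 Cov(x,y)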
$x \in \{0, 1\}$ (e.g., coin toss), where $\theta$ is the probability that $x$ is 1:

$P(x) = \theta^x (1 - \theta)^{1 - x}$

Sometimes, we parameterize using the log-odds $a = \ln\dfrac{\theta}{1 - \theta}$, so that $\theta = \dfrac{1}{1 + e^{-a}}$.
$\theta \in [0, 1]$ (e.g., prior for the probability that a coin will land heads up):

$p(\theta) = \dfrac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}$ for $0 \le \theta \le 1$, and $p(\theta) = 0$ otherwise.
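A sketch evaluating the Beta density (assumes NumPy and SciPy; the values of $\alpha$ and $\beta$ are hypothetical):

    import numpy as np
    from scipy.special import gamma
    from scipy.stats import beta

    a, b, theta = 2.0, 3.0, 0.4
    direct = gamma(a + b) / (gamma(a) * gamma(b)) * theta**(a - 1) * (1 - theta)**(b - 1)
    print(direct, beta(a, b).pdf(theta))         # the formula matches the library density
    print(beta(1, 1).pdf(theta))                 # alpha = beta = 1 gives the uniform prior (density 1)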
Machine learning and statistics study how models are learned from data. In Bayesian machine learning and statistics, the model is considered to be a hidden variable with a prior distribution. Given the data, the posterior distribution over models can be used to make predictions, interpret the data, etc. Maximum likelihood (ML) estimation and maximum a posteriori (MAP) estimation can be viewed as approximations to Bayesian learning, where the most probable model is selected. (In ML estimation, the prior over models is assumed to be uniform.)
Suppose we flip a coin a bunch of times and see $H$ heads and $T$ tails.

In a frequentist approach, we estimate the probability of heads as $\hat{\theta} = \dfrac{H}{H + T}$.

In the Bayesian approach, we first specify a prior, say that the probability of seeing a head, $\theta$, is uniform on $[0, 1]$. Using Bayes' rule, we obtain

$p(\theta \mid H, T) \propto \theta^H (1 - \theta)^T$,

which is a Beta distribution with mode $\dfrac{H}{H + T}$ and mean $\dfrac{H + 1}{H + T + 2}$.

This distribution can be used to make decisions, compute confidence intervals, or interpret the data. For example, the minimum squared-loss estimate of $\theta$ is the posterior mean $\dfrac{H + 1}{H + T + 2}$. This is closer to the prior than the frequentist estimate.
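A sketch of the coin example in plain Python (the counts H and T below are hypothetical):

    H, T = 3, 1                                  # observed heads and tails
    freq = H / (H + T)                           # frequentist estimate
    post_mode = H / (H + T)                      # mode of the Beta posterior
    post_mean = (H + 1) / (H + T + 2)            # posterior mean: minimum squared-loss estimate
    print(freq, post_mode, post_mean)            # 0.75, 0.75, 0.667 (pulled toward the uniform prior)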
Entropy is a measure of the maximum average amount of information that a random variable can convey in its value.

The entropy of a discrete variable $x$ is $H(x) = -\sum_x P(x) \log_2 P(x)$ bits.

For a discrete variable, $H(x) \ge 0$, since $0 \le P(x) \le 1$. The more uniform $P(x)$ is, the greater the entropy.

If $\log_2 \frac{1}{P(x)}$ is an integer for all $x$, $H(x)$ bits of information can be conveyed using an encoder that uses $\log_2 \frac{1}{P(x)}$ bits to pick $x$.

If $\ln$ (natural logarithm) is used instead of $\log_2$, information is measured in nats.
$x$   $P(x)$   $\log_2 \frac{1}{P(x)}$   String
1     0.5      1                         0
2     0.125    3                         100
3     0.125    3                         101
4     0.125    3                         110
5     0.125    3                         111
Average string length: 2 bits.

Imagine we have a queue of random bits (e.g., a compressed image) that we'd like to convey. We can use this information to produce a series of experiments for $x$. Each experiment is produced thus:

Draw a bit from the queue.
If the bit is 0, set $x = 1$ and terminate the experiment.
If the bit is 1, draw two more bits and use these to pick $x = 2$, 3, 4 or 5, and terminate the experiment.

This procedure picks $x$ according to $P(x)$ and conveys an average of $0.5 \times 1 + 0.5 \times 3 = 2$ bits per experiment.
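A sketch simulating the bit-queue procedure above (plain Python; the queue is filled with pseudo-random bits):

    import random

    random.seed(0)
    bits = [random.randint(0, 1) for _ in range(300_000)]    # the queue of random bits
    i, experiments, used = 0, 0, 0
    while i + 3 <= len(bits):
        if bits[i] == 0:
            x, step = 1, 1                           # bit 0: x = 1, one bit consumed
        else:
            x = 2 + 2 * bits[i + 1] + bits[i + 2]    # bit 1: two more bits pick x in {2,3,4,5}
            step = 3
        i += step
        used += step
        experiments += 1
    print(used / experiments)                        # approaches 2 bits per experiment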
Instead of encoding a bit string into a random variable, we can encode $x$ into a bit string using a source code. The decoder uses the bit string to recover $x$.

It turns out that if $x$ has a distribution $P(x)$, then the minimum average bit string length is $H(x) = -\sum_x P(x) \log_2 P(x)$.

If $\log_2 \frac{1}{P(x)}$ is an integer for all $x$, the minimum can be achieved by mapping each $x$ to a bit string with length $\log_2 \frac{1}{P(x)}$.
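A sketch computing the entropy and code lengths for the distribution in the table above (assumes NumPy):

    import numpy as np

    P = np.array([0.5, 0.125, 0.125, 0.125, 0.125])   # P(x) for x = 1..5
    lengths = np.log2(1.0 / P)                         # log2(1/P(x)) = 1, 3, 3, 3, 3
    print(lengths, np.sum(P * lengths))                # H(x) = 2 bits, matching the table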
$x \in \{0, 1\}$ (e.g., coin toss, bit from a magnetic disk), where $\theta$ is the probability that $x$ is 1.

Entropy of a Bernoulli variable: $H(x) = -\theta \log_2 \theta - (1 - \theta) \log_2 (1 - \theta)$.

[Plot: entropy of a Bernoulli variable (bits) versus the probability that the Bernoulli variable equals 1; the curve $-\theta \log_2 \theta - (1-\theta)\log_2(1-\theta)$ peaks at 1 bit when $\theta = 0.5$.]
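A sketch that regenerates the plot (assumes NumPy and matplotlib):

    import numpy as np
    import matplotlib.pyplot as plt

    theta = np.linspace(1e-6, 1 - 1e-6, 500)
    H = -theta * np.log2(theta) - (1 - theta) * np.log2(1 - theta)
    plt.plot(theta, H)
    plt.xlabel("Probability that Bernoulli variable equals 1")
    plt.ylabel("Entropy (bits)")
    plt.show()                                   # peaks at 1 bit when theta = 0.5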
Relative entropy is a measure of the average excess string length when the wrong source code is used.

Suppose the true distribution for $x$ is $P(x)$, so the minimum average string length is $-\sum_x P(x) \log_2 P(x)$.

Suppose we use bit strings determined from the wrong distribution, $Q(x)$. The average string length will be $-\sum_x P(x) \log_2 Q(x)$.

The average excess string length is the relative entropy:

$D(P \,\|\, Q) = \sum_x P(x) \log_2 \dfrac{P(x)}{Q(x)} \ge 0$
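A sketch showing that the excess code length equals the relative entropy (assumes NumPy; P and Q are hypothetical):

    import numpy as np

    P = np.array([0.5, 0.25, 0.25])              # true distribution
    Q = np.array([1/3, 1/3, 1/3])                # wrong distribution used to build the code
    true_len = np.sum(P * np.log2(1 / P))        # minimum average string length
    wrong_len = np.sum(P * np.log2(1 / Q))       # average length with code lengths log2(1/Q(x))
    kl = np.sum(P * np.log2(P / Q))              # D(P || Q)
    print(wrong_len - true_len, kl)              # identical, and always >= 0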
Suppose we try to use a source code to compress a real variable $x \in \mathbb{R}$ with density $p(x)$. We can create infinitesimal bins, where the bin at $x$ will have probability $p(x)\,dx$.

The minimum average string length for this distribution is $-\int p(x) \log_2\!\big(p(x)\,dx\big)\,dx$.

This average length ("entropy") is infinite; i.e., $x$ conveys infinite information. However, on the next page, we see that the relative entropy is finite...
Suppose we use bit strings determined from the wrong density $q(x)$. Under this density, the bin at $x$ will have probability $q(x)\,dx$. The average string length is $-\int p(x) \log_2\!\big(q(x)\,dx\big)\,dx$.

The relative entropy (excess average string length) is

$D(p \,\|\, q) = \int p(x) \log_2 \dfrac{p(x)}{q(x)}\,dx \ge 0$

Since the relative entropy is finite, we refer to $-\int p(x) \log_2 p(x)\,dx$ as the entropy, although it may be NEGATIVE!
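A sketch of the binning argument (assumes NumPy and SciPy; the densities p and q are hypothetical Gaussians):

    import numpy as np
    from scipy.stats import norm

    dx = 1e-3
    x = np.arange(-10, 10, dx)
    p = norm.pdf(x, 0, 1) * dx                   # probability of the bin at x under p
    q = norm.pdf(x, 1, 2) * dx                   # probability of the bin at x under the wrong density q
    p /= p.sum(); q /= q.sum()                   # normalise the discretisation

    code_length = np.sum(p * np.log2(1 / p))     # grows without bound as dx -> 0
    rel_entropy = np.sum(p * np.log2(p / q))     # stays finite as dx -> 0
    diff_entropy = code_length + np.log2(dx)     # approximates -integral p(x) log2 p(x) dx
    print(code_length, rel_entropy, diff_entropy)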
7 " 4 5 (eg, distribution of failure times), nats The entropy increases as 7 otherwise bits increases
$x \in \mathbb{R}$ (e.g., a variable that is a sum of a large number of other real random variables):

$p(x) = \dfrac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\dfrac{(x - \mu)^2}{2\sigma^2}\right)$

$H(x) = \frac{1}{2}\ln(2\pi e \sigma^2)$ nats $= \frac{1}{2}\log_2(2\pi e \sigma^2)$ bits.

The entropy increases as $\sigma$ increases.
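The same kind of Monte Carlo check for the Gaussian entropy (assumes NumPy; $\mu$ and $\sigma$ are hypothetical):

    import numpy as np

    rng = np.random.default_rng(3)
    mu, sigma = 0.0, 2.0
    x = rng.normal(mu, sigma, size=1_000_000)
    log_p = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
    print(-np.mean(log_p), 0.5 * np.log(2 * np.pi * np.e * sigma**2))   # both about 2.11 nats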
$x \in \mathbb{R}^N$:

$p(x) = \dfrac{1}{(2\pi)^{N/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$

where $\mu \in \mathbb{R}^N$, $\Sigma$ is an $N \times N$ positive definite matrix and $|\Sigma|$ is the determinant; $\Sigma$ is the covariance matrix.
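A sketch evaluating the multivariate Gaussian density (assumes NumPy and SciPy; $\mu$, $\Sigma$ and $x$ are hypothetical):

    import numpy as np
    from scipy.stats import multivariate_normal

    mu = np.array([0.0, 1.0])
    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])               # positive definite covariance matrix
    x = np.array([0.3, 0.8])

    N = len(mu)
    d = x - mu
    direct = np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / \
             np.sqrt((2 * np.pi) ** N * np.linalg.det(Sigma))
    print(direct, multivariate_normal(mu, Sigma).pdf(x))   # formula matches the library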
Suppose we have an invertible function $y = f(x)$ and a density $p_x(x)$.

When a small volume is mapped from $x$-space to $y$-space, the probability in the volume should stay constant. However, because the volume may change shape, the probability density will change, and the Jacobian captures this effect.

Conservation of probability mass gives $p_y(y) = p_x(x) \left|\dfrac{\partial x}{\partial y}\right|$, and $\left|\dfrac{\partial x}{\partial y}\right|$ is called the Jacobian.

For $x \in \mathbb{R}^N$ and $y \in \mathbb{R}^N$, $\dfrac{\partial x}{\partial y}$ is an $N \times N$ matrix of derivatives.
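A sketch checking conservation of probability for the hypothetical invertible map $y = e^x$ (assumes NumPy):

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.normal(0.0, 1.0, size=1_000_000)     # x ~ N(0, 1)
    y = np.exp(x)                                # invertible map, dx/dy = 1/y

    def p_y(v):
        # Change of variables: p_y(y) = p_x(x) |dx/dy| with x = ln(y).
        p_x = np.exp(-0.5 * np.log(v) ** 2) / np.sqrt(2 * np.pi)
        return p_x / v

    y0, h = 1.0, 0.01
    estimate = np.mean(np.abs(y - y0) < h) / (2 * h)   # density of y at y0, estimated from samples
    print(estimate, p_y(1.0))                          # both about 0.40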
Probability: A. Leon-Garcia, Probability and Random Processes for Electrical Engineering, Addison-Wesley, New York, NY, 1994.

Bayesian learning: R. M. Neal, Bayesian Learning for Neural Networks, Springer, New York, NY, 1996.

Information theory: T. M. Cover and J. A. Thomas, Elements of Information Theory, John Wiley & Sons, New York, NY, 1991.

Matrix identities, useful when we study Gaussian models: http://www.psi.toronto.edu/matrix/matrix.html