Introduction to Probability and Statistics (Continued)


1 Introduction to Probability and Statistics (Continued)
Prof. Nicholas Zabaras
Center for Informatics and Computational Science
University of Notre Dame, Notre Dame, Indiana, USA
URL: https://cics.nd.edu/
August 9, 2018

2 Contents
Markov and Chebyshev Inequalities; The Law of Large Numbers; Central Limit Theorem; Monte Carlo Approximation of Distributions; Estimating π; Accuracy of Monte Carlo Approximation; Approximating the Binomial with a Gaussian; Approximating the Poisson Distribution with a Gaussian; Application of the CLT to Noise Signals
Information Theory; Entropy; KL Divergence; Jensen's Inequality; Mutual Information; Maximal Information Coefficient

3 References
Following closely Chris Bishop's PRML book and Kevin Murphy's Machine Learning: A Probabilistic Perspective.
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.
Bertsekas, D. and J. Tsitsiklis (2008). Introduction to Probability, 2nd Edition. Athena Scientific.
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer.

4 Markov and Chebyshev Inequalities
Markov's inequality: if X is a non-negative integrable random variable, then for any a > 0:
  Pr[X ≥ a] ≤ E[X]/a
Indeed: E[X] = ∫₀^∞ x p(x) dx ≥ ∫_a^∞ x p(x) dx ≥ a ∫_a^∞ p(x) dx = a Pr[X ≥ a].
You can generalize this using any non-negative function f of the random variable X:
  Pr[f(X) ≥ a] ≤ E[f(X)]/a
Using f(X) = (X − E[X])², we derive the Chebyshev inequality:
  Pr[|X − E[X]| ≥ a] ≤ Var[X]/a² = σ²/a²
In terms of standard deviations (a = 2σ), we can restate this as:
  Pr[|X − E[X]| ≥ 2σ] ≤ 1/4
Thus the probability of X being more than 2σ away from E[X] is at most 1/4.
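A minimal sketch (not from the slides) that checks the Markov and Chebyshev bounds empirically with NumPy; the exponential distribution, sample size, and threshold a are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # non-negative, E[X] = 1, Var[X] = 1
a = 3.0

# Markov: Pr[X >= a] <= E[X]/a
print(np.mean(x >= a), "<=", x.mean() / a)

# Chebyshev: Pr[|X - E[X]| >= 2*sigma] <= 1/4
sigma = x.std()
print(np.mean(np.abs(x - x.mean()) >= 2 * sigma), "<=", 0.25)
```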

5 The Law of Large Numbers (LLN)
Let Xᵢ, i = 1, 2, ..., n, be independent and identically distributed (i.i.d.) random variables with finite mean E[Xᵢ] = μ and variance Var[Xᵢ] = σ². Let
  X̄ₙ = (1/n) Σ_{i=1}^n Xᵢ
Note that
  E[X̄ₙ] = (1/n) Σ_{i=1}^n μ = μ,   Var[X̄ₙ] = (1/n²) Σ_{i=1}^n σ² = σ²/n
Weak LLN:  lim_{n→∞} Pr[|X̄ₙ − μ| ≥ ε] = 0 for every ε > 0
Strong LLN:  lim_{n→∞} X̄ₙ = μ almost surely
This means that with probability one, the average of any realizations x₁, x₂, ... of the random variables X₁, X₂, ... converges to the mean.
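As an illustration (my own, not part of the slides), a short NumPy sketch of the LLN: the running average of i.i.d. draws approaches the true mean as n grows; the Gamma distribution here is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 2.0 * 3.0                                   # mean of Gamma(shape=2, scale=3)
x = rng.gamma(shape=2.0, scale=3.0, size=100_000)

running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
for n in (10, 100, 1_000, 100_000):
    print(f"n={n:>6d}  sample mean={running_mean[n-1]:.4f}  true mean={mu}")
```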

6 Statistical Inference: Parametric & Non-Parametric Approach
Assume that we have a set of observations S = {x₁, x₂, ..., x_N}, x_j ∈ ℝⁿ. The problem is to infer the underlying probability distribution that gives rise to the data S.
Parametric problem: The underlying probability density has a specified known form that depends on a number of parameters. The problem of interest is to infer those parameters.
Non-parametric problem: No analytical expression for the probability density is available. The description consists of defining the dependency or independency of the data. This leads to numerical exploration.
A typical situation for the parametric model is when the distribution is the PDF of a random variable X ∈ ℝⁿ.

7 Example of the Law of Large Numbers
Assume that we sample S = {x₁, x₂, ..., x_N}, where the x_j are realizations of X ~ N(x₀, Σ). We consider a parametric model in which both the mean x₀ and the covariance Σ are unknown.
The probability density of X is:
  π(x | x₀, Σ) = (2π)^{−n/2} (det Σ)^{−1/2} exp( −(1/2) (x − x₀)ᵀ Σ⁻¹ (x − x₀) )
Our problem is to estimate x₀ and Σ.

8 Empirical Mean and Empirical Covariance
From the law of large numbers, we estimate:
  x₀ = E[X] ≈ (1/N) Σ_{j=1}^N x_j = x̄
To compute the covariance matrix, note that if X₁, X₂, ... are i.i.d., so are f(X₁), f(X₂), ... for any function f.
Then we can compute:
  Σ = cov(X) = E[(x − E[X])(x − E[X])ᵀ] ≈ (1/N) Σ_{j=1}^N (x_j − x̄)(x_j − x̄)ᵀ = Σ̄
The above formulas define the empirical mean and the empirical covariance.
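A small sketch (illustration only, not from the slides) computing the empirical mean and empirical covariance of samples drawn from a 2-D Gaussian with NumPy; the particular x₀ and Σ are made-up values for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
x0 = np.array([1.0, -2.0])                       # true mean (assumed for the demo)
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])                   # true covariance (assumed)
X = rng.multivariate_normal(x0, Sigma, size=50_000)

x_bar = X.mean(axis=0)                                   # empirical mean
Sigma_bar = (X - x_bar).T @ (X - x_bar) / X.shape[0]     # empirical covariance (1/N convention)

print(x_bar)      # close to x0
print(Sigma_bar)  # close to Sigma
```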

9 The Central Limit Theorem
Let X₁, X₂, ..., X_N be independent and identically distributed (i.i.d.) continuous random variables, each with expectation μ and variance σ². Define:
  X̄_N = (1/N) Σ_{j=1}^N X_j,   Z_N = (X₁ + X₂ + ... + X_N − Nμ)/(σ√N) = (X̄_N − μ)/(σ/√N)
As N → ∞, the distribution of Z_N converges to the distribution of a standard normal random variable:
  lim_{N→∞} P(Z_N ≤ x) = ∫_{−∞}^x (1/√(2π)) e^{−t²/2} dt
For N large,
  X̄_N ~ N(μ, σ²/N)   (approximately)
This is somewhat of a justification for why assuming Gaussian noise is so common.

10 The CLT and the Gaussian Distribution
As an example, assume N variables X₁, X₂, ..., X_N, each of which has a uniform distribution over [0, 1], and then consider the distribution of the sample mean (X₁ + X₂ + ... + X_N)/N. For large N, this distribution tends to a Gaussian. The convergence as N increases can be rapid.
(Figure: histograms of the sample mean for increasing N; MatLab code.)
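A quick sketch (mine, not the lecture's MatLab code) of this experiment in NumPy: the sample mean of N uniforms concentrates around 0.5 with the standard deviation predicted by the CLT; the values of N are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
trials = 100_000
for N in (1, 2, 10):
    means = rng.uniform(0.0, 1.0, size=(trials, N)).mean(axis=1)
    # For U[0,1], the CLT predicts mean 0.5 and std sqrt(1/12)/sqrt(N)
    print(f"N={N:2d}  sample std={means.std():.4f}  CLT std={np.sqrt(1/12/N):.4f}")
```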

11 The CLT and the Gaussian Distribution
We plot a histogram of (1/N) Σ_{i=1}^N x_{ij}, for j = 1:10000, where x_{ij} ~ Beta(1, 5). As N → ∞, the distribution of this sample mean tends towards a Gaussian.
(Figure: histograms for increasing values of N. Run centrallimitdemo from PMTK.)

12 Accuracy of Monte Carlo Approximation
In the Monte Carlo approximation of the mean μ = E[f(x)] with the sample mean μ̄, we have:
  μ̄ − μ ~ N(0, σ²/N),   where   σ² = Var[f(x)] = E[f(x)²] − E[f(x)]²
The variance is itself estimated from the samples:
  σ̄² = (1/N) Σ_{s=1}^N ( f(x_s) − μ̄ )²
We can now derive the following error bars (using central intervals):
  Pr[ μ − 1.96 σ̄/√N ≤ μ̄ ≤ μ + 1.96 σ̄/√N ] = 0.95
The number of samples needed to drop the error below ε is then:
  1.96 σ̄/√N ≤ ε   ⟹   N ≥ (1.96)² σ̄²/ε² ≈ 4 σ̄²/ε²
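A sketch of these error bars in NumPy (assumptions: f(x) = x² with x ~ N(0, 1), so the true mean is 1; the tolerance ε is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(4)
N = 10_000
x = rng.standard_normal(N)
f = x**2                                  # true E[f(x)] = 1 for x ~ N(0,1)

mu_bar = f.mean()
sigma_bar = f.std()
half_width = 1.96 * sigma_bar / np.sqrt(N)
print(f"estimate = {mu_bar:.4f} +/- {half_width:.4f} (true mean = 1)")

# Samples needed to bring the 95% error below eps:
eps = 0.01
print("N needed ~", int(np.ceil((1.96 * sigma_bar / eps) ** 2)))
```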

13 Monte Carlo Approximation of Distributions
Computing the distribution of y = x², where p(x) is uniform. The MC approximation is shown on the right: take samples from p(x), square them, and then plot the histogram.
(Figure: p(x), p(y), and the histogram of the squared samples. Run changeofvarsdemod from PMTK.)

14 Accuracy of Monte Carlo Approximation
The accuracy of MC increases with the number of samples. Histograms (on the left) and kernel-density-estimated pdfs (on the right) are shown for increasing numbers of samples; the actual distribution is shown in red.
(Run mcaccuracydemo from PMTK.)

15 Example of CLT: Estimating π by MC
Use the CLT to approximate π. Let x, y ~ U[−r, r]. Then
  π r² = ∫_{−r}^{r} ∫_{−r}^{r} I(x² + y² ≤ r²) dx dy
       = (2r)(2r) ∫_{−r}^{r} ∫_{−r}^{r} I(x² + y² ≤ r²) p(x) p(y) dx dy
       = 4 r² E[ I(x² + y² ≤ r²) ]
so π = 4 E[I(x² + y² ≤ r²)], and the MC estimate is
  π̄ = (4/N) Σ_{s=1}^N I(x_s² + y_s² ≤ r²),   x_s, y_s ~ U[−r, r]
Here x, y are uniform random variables on [−r, +r]. We find π̄ = 3.1416 with a small standard error.
(Run mcestimatepi from PMTK.)
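A self-contained sketch (not the PMTK script) of the same estimator in NumPy, including the CLT-based standard error from the previous slide; r = 1 here is an arbitrary choice since the estimator does not depend on r.

```python
import numpy as np

rng = np.random.default_rng(5)
N, r = 100_000, 1.0
x = rng.uniform(-r, r, N)
y = rng.uniform(-r, r, N)

inside = (x**2 + y**2 <= r**2).astype(float)
f = 4.0 * inside                          # pi = E[4 * I(x^2 + y^2 <= r^2)]
pi_hat = f.mean()
se = f.std() / np.sqrt(N)                 # CLT standard error of the sample mean
print(f"pi_hat = {pi_hat:.4f} +/- {1.96 * se:.4f}")
```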

16 CLT: The Binomial Tends to a Gaussian
One consequence of the CLT is that the binomial distribution
  Bin(m | N, θ) = (N choose m) θ^m (1 − θ)^{N−m}
which is a distribution over m defined by the sum of N observations of the random binary variable x, will tend to a Gaussian as N → ∞:
  (x₁ + x₂ + ... + x_N)/N = m/N ~ N( θ, θ(1 − θ)/N ),   i.e.   m ~ N( Nθ, Nθ(1 − θ) )
(Figure: histogram of a binomial distribution; Matlab code.)

17 Poisson Process
Consider that we count the number of photons from a light source. Let N(t) be the number of photons observed in the time interval [0, t]. N(t) is an integer-valued random variable. We make the following assumptions:
a. Stationarity: Let Δ₁ and Δ₂ be any two time intervals of equal length, and n any non-negative integer. Assume that
  Prob(n photons in Δ₁) = Prob(n photons in Δ₂)
b. Independent increments: Let Δ₁, Δ₂, ..., Δₙ be non-overlapping time intervals and k₁, k₂, ..., kₙ non-negative integers. Denote by A_j the event
  A_j = {k_j photons arrive in the time interval Δ_j}
Assume that these events are mutually independent, i.e.
  P(A₁ ∩ A₂ ∩ ... ∩ Aₙ) = P(A₁) P(A₂) ... P(Aₙ)
c. Negligible probability of coincidence: Assume that the probability of two or more events at the same time is negligible. More precisely, N(0) = 0 and
  lim_{h→0} P(N(h) > 1)/h = 0

18 Poisson Process
If these assumptions hold, then for a given time t, N(t) is a Poisson process:
  P(N(t) = n) = ((λt)ⁿ / n!) e^{−λt},   λ > 0,  n = 0, 1, 2, ...
Let us fix t = T = observation time and define the random variable N = N(T). Let us also define the parameter θ = λT. We then denote:
  N ~ Poisson(θ),   P(N = n) = (θⁿ / n!) e^{−θ}
D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing, 2007.
S. Ghahramani, Fundamentals of Probability, 1996.

19 Poisson Process
Consider the Poisson (discrete) distribution
  P(N = n) = Poisson(n | θ) = (θⁿ / n!) e^{−θ},   θ > 0,  n = 0, 1, 2, ...
The mean and the variance are both equal to θ:
  E[N] = Σ_{n=0}^∞ n Poisson(n | θ) = θ,   Var[N] = θ

20 Approximating a Poisson Distribution With a Gaussian
Theorem: A random variable X ~ Poisson(θ) can be considered as the sum of n independent random variables Xᵢ ~ Poisson(θ/n).ᵃ
According to the Central Limit Theorem, when n is large enough, take
  Xᵢ ~ Poisson(θ/n)  ≈  N(θ/n, θ/n)
Then X = Σ_{i=1}^n Xᵢ is, by the Theorem, a draw from Poisson(θ), and from the CLT it also follows a Gaussian for large n, with
  E[X] = n (θ/n) = θ,   Var[X] = n (θ/n) = θ
Thus X ≈ N(θ, θ). The approximation of a Poisson distribution with a Gaussian for large n is a result of the CLT!
ᵃ For a proof that the sum of independent Poisson random variables is a Poisson random variable, see this document.
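A short sketch (my own, using SciPy) comparing the Poisson pmf with its N(θ, θ) approximation; the two values of θ are arbitrary and simply show the approximation improving as θ grows.

```python
import numpy as np
from scipy import stats

for theta in (2.0, 20.0):
    n = np.arange(0, int(theta + 6 * np.sqrt(theta)))
    poisson_pmf = stats.poisson.pmf(n, mu=theta)
    gauss_pdf = stats.norm.pdf(n, loc=theta, scale=np.sqrt(theta))
    max_abs_diff = np.max(np.abs(poisson_pmf - gauss_pdf))
    print(f"theta={theta:5.1f}  max |Poisson - Gaussian| = {max_abs_diff:.4f}")
```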

21 Approximating a Poisson Distribution with a Gaussian
(Figure) Poisson distributions (dots) vs their Gaussian approximations (solid line) for various values of the mean θ. The higher the θ, the smaller the distance between the two distributions. See this MatLab implementation.

22 Kullback-Leibler Distance Between Two Densities
Let us consider the following two distributions:
  Poisson(n | θ) = (θⁿ / n!) e^{−θ}
  Gaussian(x | θ, θ) = (1/√(2πθ)) exp( −(x − θ)² / (2θ) )
We often use the Kullback-Leibler distance to define the distance between two distributions. In particular, in approximating the Poisson distribution with a Gaussian distribution, we have:
  KL( Poisson(· | θ), Gaussian(· | θ, θ) ) = Σ_{n=0}^∞ Poisson(n | θ) log [ Poisson(n | θ) / Gaussian(n | θ, θ) ]
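A sketch (illustration, not the course's MatLab code) that evaluates this sum numerically with SciPy, truncating it far out in the tail; it shows the KL distance shrinking as θ grows.

```python
import numpy as np
from scipy import stats

def kl_poisson_gaussian(theta, n_max=2000):
    # KL(Poisson(.|theta) || N(theta, theta)) evaluated on the integers 0..n_max
    n = np.arange(n_max + 1)
    p = stats.poisson.pmf(n, mu=theta)
    q = stats.norm.pdf(n, loc=theta, scale=np.sqrt(theta))
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

for theta in (1, 2, 5, 10, 50):
    print(f"theta={theta:3d}  KL = {kl_poisson_gaussian(theta):.4f}")
```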

23 Approximating a Poisson Distribution With a Gaussian
(Figure) The KL distance of the Poisson distribution from its Gaussian approximation as a function of the mean θ, on a logarithmic scale. The horizontal line indicates where the KL distance has dropped to 1/10 of its value at θ = 1. See the following MatLab implementation.

24 Application of the CLT: Noise Signals
Consider discrete sampling where the output is noise of length n. The noise vector x ∈ ℝⁿ is a realization of a random variable X.
We estimate the mean and the variance of the noise in a single measurement as follows:
  (1/n) Σ_{j=1}^n x_j ≈ 0,   s² ≈ (1/n) Σ_{j=1}^n x_j²
To improve the signal-to-noise ratio, we repeat the measurement and average the noise vector signals:
  x̄ = (1/N) Σ_{k=1}^N x^(k)
The average noise is a realization of the random variable:
  X̄ = (1/N) Σ_{k=1}^N X^(k)
If X^(1), X^(2), ... are i.i.d., X̄ is asymptotically a Gaussian by the CLT, and its variance is s²/N. We need to repeat the experiment until the noise level s/√N drops below a given tolerance.

25 Noise Reduction By Averaging
(Figure) Gaussian noise vectors of length 50 are used; s is the std of a single noise vector. The panels show the averaged noise for increasing N, together with the estimated noise level s/√N. See the following MatLab implementation.
D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing, 2007.
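A minimal sketch of this averaging experiment (assumptions: noise length 50, unit-variance Gaussian noise, a few arbitrary values of N) showing the empirical noise level decaying like s/√N.

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma = 50, 1.0
single = rng.normal(0.0, sigma, size=n)
s = single.std()                                  # std estimated from one measurement

for N in (1, 5, 10, 25):
    avg = rng.normal(0.0, sigma, size=(N, n)).mean(axis=0)   # average of N noise vectors
    print(f"N={N:3d}  empirical std={avg.std():.3f}  predicted s/sqrt(N)={s/np.sqrt(N):.3f}")
```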

26 Introduction to Information Theory
Information theory is concerned with representing data in a compact fashion (data compression or source coding), and with transmitting and storing it in a way that is robust to errors (error correction or channel coding).
Compactly representing data requires allocating short codewords to highly probable bit strings and reserving longer codewords for less probable bit strings. E.g., in natural language, common words ("a", "the", "and") are much shorter than rare words.
D. MacKay, Information Theory, Inference and Learning Algorithms (Video Lectures)

27 Introduction to Information Theory
Decoding messages sent over noisy channels requires having a good probability model of the kinds of messages that people tend to send. We need models that can predict which kinds of data are likely and which unlikely.
David MacKay, Information Theory, Inference and Learning Algorithms, 2003 (available online).
Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, Wiley, 2006.
Viterbi, A. J. and J. K. Omura (1979). Principles of Digital Communication and Coding. McGraw-Hill.

28 Introduction to Information Theory
Consider a discrete random variable x. We ask how much information ("degree of surprise") is received when we observe (learn) a specific value of this variable.
Observing a highly probable event provides little additional information.
If we have two events x and y that are unrelated, then the information gain from observing both of them should be h(x, y) = h(x) + h(y). Two unrelated events will be statistically independent, so p(x, y) = p(x)p(y).

29 Entropy
From h(x, y) = h(x) + h(y) and p(x, y) = p(x)p(y), it is easily shown that h(x) must be given by the logarithm of p(x), and so we have
  h(x) = −log₂ p(x) ≥ 0
The units of h(x) are bits ("binary digits"). Low probability events correspond to high information content.
When transmitting a random variable, the average amount of transmitted information is the entropy of X:
  H[X] = −Σ_{k=1}^K p(X = k) log₂ p(X = k)
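A tiny helper (a sketch, not PMTK code) for the entropy of a discrete distribution in bits, applied to a uniform distribution over 8 states and to a small non-uniform one.

```python
import numpy as np

def entropy_bits(p):
    """Entropy -sum p log2 p of a discrete distribution, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy_bits(np.full(8, 1/8)))          # uniform over 8 states: 3.0 bits
print(entropy_bits([0.5, 0.25, 0.25]))        # 1.5 bits
```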

30 Noiseless Coding Theorem (Shannon)
Example (coding theory): x is a discrete random variable with 8 possible states; how many bits are needed to transmit the state of x?
All states equally likely:
  H[x] = −8 × (1/8) log₂(1/8) = 3 bits
Example 2: consider a variable having 8 possible states {a, b, c, d, e, f, g, h} for which the respective (non-uniform) probabilities are (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64). The entropy in this case is smaller than for the uniform distribution:
  H[x] = −(1/2) log₂(1/2) − (1/4) log₂(1/4) − (1/8) log₂(1/8) − (1/16) log₂(1/16) − 4 × (1/64) log₂(1/64) = 2 bits
The average code length of an optimal code is also 2 bits.
Note: shorter codes for the more probable events vs longer codes for the less probable events.
Shannon's Noiseless Coding Theorem (1948): The entropy is a lower bound on the number of bits needed to transmit the state of a random variable.
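A small check (my own sketch) that the non-uniform example indeed gives 2 bits, and that a code whose word lengths equal −log₂ p has the same average length.

```python
import numpy as np

p = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
H = -np.sum(p * np.log2(p))
print(H)                                   # 2.0 bits

# Average length of a prefix code with word lengths -log2 p (1, 2, 3, 4, 6, 6, 6, 6 bits)
lengths = -np.log2(p)
print(np.sum(p * lengths))                 # also 2.0 bits
```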

31 Alternative Definition of Entropy
Consider a set of N identical objects that are to be divided amongst a set of bins, such that there are nᵢ objects in the i-th bin. Consider the number of different ways of allocating the objects to the bins. In the i-th bin there are nᵢ! ways of reordering the objects (microstates), and so the total number of ways of allocating the N objects to the bins is given by (the multiplicity)
  W = N! / Πᵢ nᵢ!
The entropy is defined as
  H = (1/N) ln W = (1/N) ln N! − (1/N) Σᵢ ln nᵢ!
We now consider the limit N → ∞ (with nᵢ/N held fixed) and use Stirling's approximation ln N! ≈ N ln N − N, ln nᵢ! ≈ nᵢ ln nᵢ − nᵢ:
  H = −lim_{N→∞} Σᵢ (nᵢ/N) ln(nᵢ/N) = −Σᵢ pᵢ ln pᵢ
where pᵢ = nᵢ/N is the probability of an object being assigned to the i-th bin. The occupation numbers pᵢ correspond to macrostates.

32 Alternative Definition of Entropy
Interpret the bins as the states xᵢ of a discrete random variable X, where p(X = xᵢ) = pᵢ. The entropy of the random variable X is then
  H[p] = −Σᵢ p(xᵢ) ln p(xᵢ)
Distributions p(x) that are sharply peaked around a few values will have a relatively low entropy, whereas those that are spread more evenly across many values will have higher entropy.

33 Maximum Entropy: Uniform Distribution
The maximum-entropy configuration can be found by maximizing H using a Lagrange multiplier to enforce the normalization constraint on the probabilities. Thus we maximize
  H̃ = −Σᵢ p(xᵢ) ln p(xᵢ) + λ ( Σᵢ p(xᵢ) − 1 )
We find p(xᵢ) = 1/M, where M is the number of possible states, and H = ln M.
To verify that the stationary point is indeed a maximum, we can evaluate the second derivative of the entropy, which gives
  ∂²H̃ / ∂p(xᵢ) ∂p(xⱼ) = −Iᵢⱼ / pᵢ
where Iᵢⱼ are the elements of the identity matrix.
For any discrete distribution with M states, we have H[x] ≤ ln M:
  H[x] = −Σᵢ p(xᵢ) ln p(xᵢ) = Σᵢ p(xᵢ) ln( 1/p(xᵢ) ) ≤ ln( Σᵢ p(xᵢ) · 1/p(xᵢ) ) = ln M
Here we used Jensen's inequality (for the concave function ln).

34 Example: Biosequence Analysis
Recall the DNA sequence logo example from earlier. The height of each bar is defined to be 2 − H, where H is the entropy of that distribution and 2 (= log₂ 4) is the maximum possible entropy. Thus a bar of height 0 corresponds to a uniform distribution, whereas a bar of height 2 corresponds to a deterministic distribution.
(Figure: sequence logo, bits vs sequence position. Run seqlogodemo from PMTK.)

35 Binary Variable
Consider a binary random variable X ∈ {0, 1} with p(X = 1) = θ and p(X = 0) = 1 − θ.
Hence the entropy becomes the binary entropy function
  H[X] = −θ log₂ θ − (1 − θ) log₂(1 − θ)
The maximum value of 1 bit occurs when the distribution is uniform, θ = 0.5.
(Figure: H(X) vs p(X = 1). MatLab function bernoullientropyfig from PMTK.)
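A small sketch of the binary entropy function (my own helper, not the PMTK figure code), evaluated at a few values of θ.

```python
import numpy as np

def binary_entropy(theta):
    """H(theta) = -theta*log2(theta) - (1-theta)*log2(1-theta), with H(0) = H(1) = 0."""
    theta = np.asarray(theta, dtype=float)
    h = np.zeros_like(theta)
    inside = (theta > 0) & (theta < 1)
    t = theta[inside]
    h[inside] = -t * np.log2(t) - (1 - t) * np.log2(1 - t)
    return h

print(binary_entropy(np.array([0.0, 0.1, 0.5, 0.9, 1.0])))   # maximum of 1 bit at theta = 0.5
```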

36 Differential Entropy
Divide x into bins of width Δ. Assuming p(x) is continuous, for each such bin there must exist an xᵢ such that
  ∫_{iΔ}^{(i+1)Δ} p(x) dx = p(xᵢ) Δ   (the probability of falling in bin i)
  H_Δ = −Σᵢ p(xᵢ) Δ ln( p(xᵢ) Δ ) = −Σᵢ p(xᵢ) Δ ln p(xᵢ) − ln Δ
  lim_{Δ→0} { −Σᵢ p(xᵢ) Δ ln p(xᵢ) } = −∫ p(x) ln p(x) dx   (can be negative)
The ln Δ term is omitted since it diverges as Δ → 0 (indicating that infinitely many bits are needed to describe a continuous variable exactly).

37 Differential Entropy
For a density defined over multiple continuous variables, denoted collectively by the vector x, the differential entropy is given by
  H[x] = −∫ p(x) ln p(x) dx
Differential entropy (unlike the discrete entropy) can be negative.
When transforming variables y(x), use p(x) dx = p(y) dy; e.g., if y = Ax then
  H[y] = −∫ p(y) ln p(y) dy = H[x] + ln |det A|

38 Differential Entropy and the Gaussian Distribution
The distribution that maximizes the differential entropy, subject to constraints on normalization and on the first two moments,
  ∫ p(x) dx = 1,   ∫ x p(x) dx = μ,   ∫ (x − μ)² p(x) dx = σ²
is a Gaussian. Using calculus of variations with Lagrange multipliers for the three constraints, δH = 0 gives
  p(x) = (1/√(2πσ²)) exp( −(x − μ)² / (2σ²) )
Evaluating the differential entropy of the Gaussian, we obtain
  H[x] = (1/2) ln(2πeσ²),   and for a d-dimensional Gaussian   H[x] = (1/2) ln( (2πe)^d det Σ )
Note that H[x] < 0 for σ² < 1/(2πe).
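A quick numerical check (my own sketch; the values of μ and σ are arbitrary) that a Riemann-sum approximation of −∫ p ln p dx matches the closed form (1/2) ln(2πeσ²).

```python
import numpy as np
from scipy import stats

mu, sigma = 0.5, 2.0
x = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 200_001)
dx = x[1] - x[0]
p = stats.norm.pdf(x, loc=mu, scale=sigma)

h_numeric = -np.sum(p * np.log(p)) * dx           # Riemann-sum approximation of -int p ln p
h_formula = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(h_numeric, h_formula)                       # the two should agree closely
```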

39 Kullback-Leibler Divergence and Cross Entropy
Consider some unknown distribution p(x), and suppose that we have modeled it using an approximating distribution q(x). If we use q(x) to construct a coding scheme for the purpose of transmitting values of x to a receiver, then the additional information needed to specify x is:
  KL(p ‖ q) = −∫ p(x) ln q(x) dx − ( −∫ p(x) ln p(x) dx ) = −∫ p(x) ln( q(x)/p(x) ) dx
(We transmit using q(x), but average with respect to the exact probability p(x).)
The cross entropy is defined as:
  H(p, q) = −∫ p(x) ln q(x) dx

40 KL Divergence and Cross Entropy
  H(p, q) = −∫ p(x) ln q(x) dx
The cross entropy is the average number of bits needed to encode data coming from a source with distribution p when we use model q to define our codebook. H(p) = H(p, p) is the expected number of bits using the true model.
The KL divergence
  KL(p ‖ q) = −∫ p(x) ln q(x) dx + ∫ p(x) ln p(x) dx = −∫ p(x) ln( q(x)/p(x) ) dx
is the average number of extra bits needed to encode the data, because we used distribution q to encode the data instead of the true distribution p.
The "extra number of bits" interpretation makes it clear that KL(p ‖ q) ≥ 0, and that the KL divergence is equal to zero iff q = p.
The KL distance is not a symmetric quantity, that is, KL(p ‖ q) ≠ KL(q ‖ p) in general.
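A small sketch (mine) computing entropy, cross entropy, and KL divergence for two discrete distributions, confirming H(p, q) = H(p) + KL(p ‖ q) and the asymmetry of KL; the two distributions are arbitrary.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

H_p = -np.sum(p * np.log2(p))                 # entropy of p (bits)
H_pq = -np.sum(p * np.log2(q))                # cross entropy H(p, q)
KL_pq = np.sum(p * np.log2(p / q))            # KL(p || q)

print(H_p, H_pq, KL_pq, H_p + KL_pq)          # H(p, q) = H(p) + KL(p || q)
print(np.sum(q * np.log2(q / p)))             # KL(q || p) differs: KL is not symmetric
```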

41 KL Divergence Between Two Gaussians
Consider p(x) = N(x | μ, σ²) and q(x) = N(x | m, s²):
  KL(p ‖ q) = −∫ p(x) ln q(x) dx + ∫ p(x) ln p(x) dx
The first term can be computed using the moments and normalization condition of a Gaussian,
  −∫ N(x | μ, σ²) ln q(x) dx = (1/2) ln(2πs²) + ( σ² + (μ − m)² ) / (2s²)
and the second term is minus the differential entropy of a Gaussian, −(1/2) ln(2πeσ²). Finally we obtain:
  KL(p ‖ q) = ln(s/σ) + ( σ² + (μ − m)² ) / (2s²) − 1/2
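A sketch (my own check) comparing the closed-form KL between two univariate Gaussians with a Monte Carlo estimate of E_p[ln p(x) − ln q(x)]; the parameter values are arbitrary.

```python
import numpy as np
from scipy import stats

mu, sigma = 0.0, 1.0          # p = N(mu, sigma^2)
m, s = 1.0, 2.0               # q = N(m, s^2)

kl_closed = np.log(s / sigma) + (sigma**2 + (mu - m)**2) / (2 * s**2) - 0.5

rng = np.random.default_rng(7)
x = rng.normal(mu, sigma, size=1_000_000)
kl_mc = np.mean(stats.norm.logpdf(x, mu, sigma) - stats.norm.logpdf(x, m, s))

print(kl_closed, kl_mc)       # the two numbers should agree closely
```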

42 KL Divergence Between Two Gaussians
Consider now p(x) = N(x | μ, Σ) and q(x) = N(x | m, L), with x ∈ ℝᴰ:
  −∫ p(x) ln q(x) dx = (1/2) [ D ln(2π) + ln det L + Tr(L⁻¹Σ) + (μ − m)ᵀ L⁻¹ (μ − m) ]
  −∫ p(x) ln p(x) dx = (1/2) [ D ln(2π) + ln det Σ + D ]
Subtracting the two terms:
  KL(p ‖ q) = (1/2) [ ln( det L / det Σ ) − D + Tr(L⁻¹Σ) + (μ − m)ᵀ L⁻¹ (μ − m) ]

43 Jensen's Inequality
For a convex function f, Jensen's inequality states (and can be proven easily by induction):
  f( Σ_{i=1}^M λᵢ xᵢ ) ≤ Σ_{i=1}^M λᵢ f(xᵢ),   λᵢ ≥ 0,  Σᵢ λᵢ = 1
For M = 2 this is equivalent to our requirement for convexity, f″(x) ≥ 0. Assume f″(x) > 0 (strict convexity) for any x. By Taylor expansion about x₀,
  f(x) = f(x₀) + f′(x₀)(x − x₀) + (1/2) f″(x*)(x − x₀)² ≥ f(x₀) + f′(x₀)(x − x₀)
For x₀ = λa + (1 − λ)b:
  f(a) ≥ f(x₀) + f′(x₀)(a − x₀)
  f(b) ≥ f(x₀) + f′(x₀)(b − x₀)
Multiplying the first by λ, the second by 1 − λ, and adding (the f′(x₀) terms cancel since λ(a − x₀) + (1 − λ)(b − x₀) = 0), Jensen's inequality follows:
  λ f(a) + (1 − λ) f(b) ≥ f( λa + (1 − λ)b )

44 Jensen's Inequality
Conversely, assume Jensen's inequality holds; we show that f is convex (f′ is non-decreasing). Set b = a + 2ε > a, ε > 0. Using Jensen's inequality with λ = 1/2:
  (1/2) f(a) + (1/2) f(b) ≥ f( (1/2)a + (1/2)b ) = f(b − ε) = f(a + ε)
Hence
  f(b) − f(b − ε) ≥ f(a + ε) − f(a)
For small ε we thus have
  f′(b) ε ≥ f′(a) ε,   i.e.   f′(b) ≥ f′(a)
so f′ is non-decreasing and f(·) is convex.

45 Jensen's Inequality
Using Jensen's inequality f( Σᵢ λᵢ xᵢ ) ≤ Σᵢ λᵢ f(xᵢ) with λᵢ = p(xᵢ) for a discrete random variable results in:
  f( Σᵢ xᵢ p(xᵢ) ) ≤ Σᵢ f(xᵢ) p(xᵢ),   i.e.   f(E[x]) ≤ E[f(x)]
We can generalize this result to continuous random variables:
  f( ∫ x p(x) dx ) ≤ ∫ f(x) p(x) dx
We will use this shortly in the context of the KL distance. We often use Jensen's inequality for concave functions (e.g. log x). In that case, be sure you reverse the inequality!

46 Jensen's Inequality: Example
As another example of Jensen's inequality, consider the arithmetic and geometric means of a set of positive real variables:
  x̄_A = (1/M) Σ_{i=1}^M xᵢ,   x̄_G = ( Π_{i=1}^M xᵢ )^{1/M}
Using Jensen's inequality for the concave function f(x) = ln x, i.e. E[ln x] ≤ ln E[x], we can show:
  ln x̄_G = (1/M) Σ_{i=1}^M ln xᵢ ≤ ln( (1/M) Σ_{i=1}^M xᵢ ) = ln x̄_A   ⟹   x̄_G ≤ x̄_A

47 The Kullback-Leibler Divergence
Using Jensen's inequality f(E[x]) ≤ E[f(x)] (−ln is a convex function), we can show that:
  KL(p ‖ q) = −∫ p(x) ln( q(x)/p(x) ) dx ≥ −ln ∫ p(x) ( q(x)/p(x) ) dx = −ln ∫ q(x) dx = 0
Thus we derive the following information inequality:
  KL(p ‖ q) ≥ 0,   with KL(p ‖ q) = 0 if and only if p(x) = q(x)

48 Principle of Insufficient Reason
An important consequence of the information inequality is that the discrete distribution with the maximum entropy is the uniform distribution. More precisely, H(X) ≤ log |X|, where |X| is the number of states of X, with equality iff p(x) is uniform. To see this, let u(x) = 1/|X|. Then
  0 ≤ KL(p ‖ u) = Σ_x p(x) log p(x) − Σ_x p(x) log u(x) = −H(X) + log |X|
This principle of insufficient reason argues in favor of using uniform distributions when there are no other reasons to favor one distribution over another.

49 The Kullback-Leibler Divergence
Data compression is in some way related to density estimation. The Kullback-Leibler divergence measures the distance between two distributions and is zero when the two densities are identical.
Suppose the data is generated from an unknown p(x) that we try to approximate with a parametric model q(x | θ). Suppose we have observed training points xₙ ~ p(x), n = 1, ..., N. Then:
  KL(p ‖ q) = ∫ p(x) ln( p(x)/q(x | θ) ) dx ≈ (1/N) Σ_{n=1}^N [ −ln q(xₙ | θ) + ln p(xₙ) ]
(a sample-average approximation of the mean).

50 The KL Divergence Vs. MLE
  KL(p ‖ q) ≈ (1/N) Σ_{n=1}^N [ −ln q(xₙ | θ) + ln p(xₙ) ]
Note that only the first term is a function of q. Thus minimizing KL(p ‖ q) is equivalent to maximizing the likelihood function for θ under the distribution q.
So the MLE estimate minimizes the KL divergence to the empirical distribution
  p_emp(x) = (1/N) Σ_{n=1}^N δ(x − xₙ)
  arg min_q KL(p_emp ‖ q) = arg min_q [ ∫ p_emp(x) ln( p_emp(x)/q(x | θ) ) dx ] = arg min_q [ const − (1/N) Σ_{n=1}^N ln q(xₙ | θ) ]

51 Conditional Entropy
For a joint distribution p(x, y), the conditional entropy is
  H[y|x] = −∫∫ p(y, x) ln p(y|x) dy dx
This represents the average information needed to specify y if we already know the value of x.
Using p(y, x) = p(y|x) p(x) and substituting inside the logarithm in
  H[x, y] = −∫∫ p(x, y) ln p(x, y) dy dx
it is easily seen that the conditional entropy satisfies the relation
  H[x, y] = H[y|x] + H[x]
where H[x, y] is the differential entropy of p(x, y) and H[x] is the differential entropy of p(x).

52 Conditional Entropy for Discrete Variables
Consider the conditional entropy for discrete variables:
  H[y|x] = −Σᵢ Σⱼ p(yᵢ, xⱼ) ln p(yᵢ|xⱼ)
To understand further the meaning of conditional entropy, let us consider the implications of H[y|x] = 0. We have:
  H[y|x] = −Σᵢ Σⱼ p(yᵢ|xⱼ) ln p(yᵢ|xⱼ) p(xⱼ) = 0
From this we can conclude that for each xⱼ such that p(xⱼ) > 0, the following must hold:
  Σᵢ p(yᵢ|xⱼ) ln p(yᵢ|xⱼ) = 0
Since −p log p = 0 implies p = 0 or p = 1, and since p(yᵢ|xⱼ) is normalized, there is only one yᵢ such that p(yᵢ|xⱼ) = 1, with all other p(·|xⱼ) = 0. Thus y is a function of x.

53 Mutual Information
If the variables x and y are not independent, we can gain some idea of whether they are close to being independent by considering the KL divergence between the joint distribution and the product of the marginals. This is the mutual information:
  I[x, y] = KL( p(x, y) ‖ p(x) p(y) ) = −∫∫ p(x, y) ln( p(x) p(y) / p(x, y) ) dx dy ≥ 0
with I[x, y] = 0 iff x and y are independent.
The mutual information is related to the conditional entropy through
  I[x, y] = −∫∫ p(x, y) ln( p(y) / p(y|x) ) dx dy = H[y] − H[y|x]

54 Mutual Information
The mutual information represents the reduction in the uncertainty about x once we learn the value of y (and conversely):
  I[x, y] = H[x] − H[x|y] = H[y] − H[y|x]
In a Bayesian setting, p(x) is the prior, p(x|y) the posterior, and I[x, y] represents the reduction in uncertainty in x once we observe y.

55 Note that H[x, y] ≤ H[x] + H[y]
This is easy to prove by noticing that
  I[x, y] = H[y] − H[y|x] ≥ 0   (it is a KL divergence)
and
  H[x, y] = H[y|x] + H[x]
from which
  H[x, y] = H[y|x] + H[x] ≤ H[x] + H[y]
The equality here is true only if x, y are independent:
  H[x, y] = −∫∫ p(x, y) ln p(x, y) dy dx = −∫∫ p(x, y) [ ln p(x) + ln p(y) ] dy dx = H[x] + H[y]   (sufficiency)
  H[x] + H[y] − H[x, y] = I[x, y] = 0 ⟹ p(x, y) = p(x) p(y)   (necessity)

56 Mutual Information for Correlated Gaussians
Consider two correlated Gaussian variables:
  (X, Y) ~ N( (0, 0), [[σ², ρσ²], [ρσ², σ²]] )
For each of these variables we can write:
  H[X] = H[Y] = (1/2) ln(2πeσ²)
The joint entropy is given similarly as
  H[X, Y] = (1/2) ln( (2πe)² det Σ ) = (1/2) ln( (2πe)² σ⁴ (1 − ρ²) )
Thus:
  I[x, y] = H[X] + H[Y] − H[X, Y] = −(1/2) ln(1 − ρ²)
Note: for ρ = 0, I[x, y] = 0 (independent X, Y); for ρ = ±1, I[x, y] → ∞ (linearly correlated X and Y).
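A sketch (assumptions: σ = 1 and a simple histogram-based plug-in estimator) comparing a binned estimate of the mutual information of correlated Gaussian samples with the analytic value −(1/2) ln(1 − ρ²); as discussed a couple of slides below, such binned estimates depend on the number and placement of the bins.

```python
import numpy as np

def mi_binned(x, y, bins=40):
    """Plug-in mutual information estimate (in nats) from a 2-D histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask]))

rng = np.random.default_rng(8)
for rho in (0.0, 0.5, 0.9):
    cov = [[1.0, rho], [rho, 1.0]]
    x, y = rng.multivariate_normal([0, 0], cov, size=200_000).T
    print(f"rho={rho:.1f}  binned MI={mi_binned(x, y):.3f}  analytic={-0.5*np.log(1-rho**2):.3f}")
```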

57 Pointwise Mutual Information
A quantity which is closely related to MI is the pointwise mutual information or PMI. For two events (not random variables) x and y, this is defined as
  PMI(x, y) := log [ p(x, y) / ( p(x) p(y) ) ] = log [ p(x|y) / p(x) ] = log [ p(y|x) / p(y) ]
This measures the discrepancy between these events occurring together compared to what would be expected by chance.
Clearly the MI, I[x, y], of X and Y is just the expected value of the PMI.
PMI is the amount we learn from updating the prior p(x) into the posterior p(x|y), or equivalently, updating the prior p(y) into the posterior p(y|x).

58 Mutual Information
For continuous random variables, it is common to first discretize or quantize them into bins, and to compute how many values fall into each histogram bin (Scott 1979). The number of bins used, and the location of the bin boundaries, can have a significant effect on the results.
One can also estimate the MI directly, without performing density estimation (Learned-Miller, 2004). Another approach is to try many different bin sizes and locations, and to compute the maximum MI achieved.
Scott, D. (1979). On optimal and data-based histograms. Biometrika 66(3).
Learned-Miller, E. (2004). Hyperspacings and the estimation of information theoretic quantities. Technical Report 04-04, U. Mass. Amherst Comp. Sci. Dept.
Reshef, D., Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti (2011, December). Detecting novel associations in large data sets. Science 334.
Speed, T. (2011, December). A correlation for the 21st century. Science 334.
*Use the MatLab function mimixeddemo from Kevin Murphy's PMTK.

59 Maximal Information Coefficient
This statistic, appropriately normalized, is known as the maximal information coefficient (MIC). We first define:
  m(x, y) = max_{G ∈ G(x, y)} I( X(G); Y(G) ) / log min(x, y)
Here G(x, y) is the set of 2d grids of size x × y, and X(G), Y(G) represent a discretization of the variables onto this grid. (The maximization over bin locations is performed efficiently using dynamic programming.)
Now define the MIC as
  MIC = max_{x, y : xy < B} m(x, y)
Reshef, D., Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti (2011, December). Detecting novel associations in large data sets. Science 334.

60 Maximal Information Coefficient
The MIC is defined as:
  MIC = max_{x, y : xy < B} m(x, y),   m(x, y) = max_{G ∈ G(x, y)} I( X(G); Y(G) ) / log min(x, y)
B is some sample-size-dependent bound on the number of bins we can use and still reliably estimate the distribution (Reshef et al. suggest B ~ N^0.6).
MIC lies in the range [0, 1], where 0 represents no relationship between the variables, and 1 represents a noise-free relationship of any form, not just linear.
Reshef, D., Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti (2011, December). Detecting novel associations in large data sets. Science 334.

61 Correlation Coefficient Vs MIC
(Figure; see mutualinfoallpairsmixed and mimixeddemo from PMTK3.)
The data consists of 357 variables measuring a variety of social, economic, etc. indicators, collected by WHO. On the left, we see the correlation coefficient (CC) plotted against the MIC for all 63,566 variable pairs. On the right, we see scatter plots for particular pairs of variables.
Reshef, D., Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti (2011, December). Detecting novel associations in large data sets. Science 334.

62 Correlation Coefficient Vs MIC
The point marked C has a low CC and a low MIC. From the corresponding scatter plot we see that there is no relationship between these two variables.
The points marked D and H have high CC (in absolute value) and high MIC, and we see from the scatter plots that they represent nearly linear relationships.

63 Correlation Coefficient Vs MIC
The points E, F, and G have low CC but high MIC. They correspond to non-linear (and sometimes, as in E and F, one-to-many) relationships between the variables.
Statistics (such as MIC) based on mutual information can be used to discover interesting relationships between variables in a way that correlation coefficients cannot.
