Introduction to Probability and Statistics (Continued)


1 Introduction to Probability and Statistics (Continued)
Prof. Nicholas Zabaras
Center for Informatics and Computational Science
University of Notre Dame, Notre Dame, Indiana, USA
URL: https://cics.nd.edu/
August 9, 2018

2 Contents
Markov and Chebyshev Inequalities; The Law of Large Numbers; Central Limit Theorem; Monte Carlo Approximation of Distributions; Estimating π; Accuracy of Monte Carlo Approximation; Approximating the Binomial with a Gaussian; Approximating the Poisson Distribution with a Gaussian; Application of the CLT to Noise Signals
Information Theory; Entropy; KL Divergence; Jensen's Inequality; Mutual Information; Maximal Information Coefficient

3 References
Following closely Chris Bishop's PRML book and Kevin Murphy's Machine Learning: A Probabilistic Perspective.
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge University Press.
Bertsekas, D. and J. Tsitsiklis (2008). Introduction to Probability, 2nd Edition. Athena Scientific.
Wasserman, L. (2004). All of Statistics: A Concise Course in Statistical Inference. Springer.

4 Markov and Chebyshev Inequalities
Markov's inequality: if X is a non-negative integrable random variable, then for any a > 0:
  Pr[X ≥ a] ≤ E[X]/a
Indeed: E[X] = ∫₀^∞ x p(x) dx ≥ ∫_a^∞ x p(x) dx ≥ a ∫_a^∞ p(x) dx = a Pr[X ≥ a].
You can generalize this using any non-negative function f of the random variable X:
  Pr[f(X) ≥ a] ≤ E[f(X)]/a
Using f(X) = (X − E[X])², we derive the Chebyshev inequality:
  Pr[|X − E[X]| ≥ a] ≤ Var[X]/a² = σ²/a²
In terms of standard deviations (a = 2σ), we can restate this as:
  Pr[|X − E[X]| ≥ 2σ] ≤ 1/4
Thus the probability of X being more than 2σ away from E[X] is at most 1/4.
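A minimal sketch (not from the slides) that checks the Markov and Chebyshev bounds empirically with NumPy; the exponential distribution, sample size, and threshold a are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # non-negative, E[X] = 1, Var[X] = 1
a = 3.0

# Markov: Pr[X >= a] <= E[X]/a
print(np.mean(x >= a), "<=", x.mean() / a)

# Chebyshev: Pr[|X - E[X]| >= 2*sigma] <= 1/4
sigma = x.std()
print(np.mean(np.abs(x - x.mean()) >= 2 * sigma), "<=", 0.25)
```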

5 The Law of Large Numbers (LLN)
Let Xᵢ, i = 1, 2, ..., n, be independent and identically distributed (i.i.d.) random variables with finite mean E[Xᵢ] = μ and variance Var[Xᵢ] = σ². Let
  X̄ₙ = (1/n) Σ_{i=1}^n Xᵢ
Note that
  E[X̄ₙ] = (1/n) Σ_{i=1}^n μ = μ,   Var[X̄ₙ] = (1/n²) Σ_{i=1}^n σ² = σ²/n
Weak LLN:  lim_{n→∞} Pr[|X̄ₙ − μ| ≥ ε] = 0 for every ε > 0
Strong LLN:  lim_{n→∞} X̄ₙ = μ almost surely
This means that with probability one, the average of any realizations x₁, x₂, ... of the random variables X₁, X₂, ... converges to the mean.
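As an illustration (my own, not part of the slides), a short NumPy sketch of the LLN: the running average of i.i.d. draws approaches the true mean as n grows; the Gamma distribution here is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 2.0 * 3.0                                   # mean of Gamma(shape=2, scale=3)
x = rng.gamma(shape=2.0, scale=3.0, size=100_000)

running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
for n in (10, 100, 1_000, 100_000):
    print(f"n={n:>6d}  sample mean={running_mean[n-1]:.4f}  true mean={mu}")
```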

6 Statistical Inference: Parametric & Non-Parametric Approach
Assume that we have a set of observations S = {x₁, x₂, ..., x_N}, x_j ∈ ℝⁿ. The problem is to infer the underlying probability distribution that gives rise to the data S.
Parametric problem: The underlying probability density has a specified known form that depends on a number of parameters. The problem of interest is to infer those parameters.
Non-parametric problem: No analytical expression for the probability density is available. The description consists of defining the dependency or independency of the data. This leads to numerical exploration.
A typical situation for the parametric model is when the distribution is the PDF of a random variable X ∈ ℝⁿ.

7 Example of the Law of Large Numbers
Assume that we sample S = {x₁, x₂, ..., x_N}, where the x_j are realizations of X ~ N(x₀, Σ). We consider a parametric model in which both the mean x₀ and the covariance Σ are unknown.
The probability density of X is:
  π(x | x₀, Σ) = (2π)^{−n/2} (det Σ)^{−1/2} exp( −(1/2) (x − x₀)ᵀ Σ⁻¹ (x − x₀) )
Our problem is to estimate x₀ and Σ.

8 Empirical Mean and Empirical Covariance
From the law of large numbers, we estimate:
  x₀ = E[X] ≈ (1/N) Σ_{j=1}^N x_j = x̄
To compute the covariance matrix, note that if X₁, X₂, ... are i.i.d., so are f(X₁), f(X₂), ... for any function f.
Then we can compute:
  Σ = cov(X) = E[(x − E[X])(x − E[X])ᵀ] ≈ (1/N) Σ_{j=1}^N (x_j − x̄)(x_j − x̄)ᵀ = Σ̄
The above formulas define the empirical mean and the empirical covariance.
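A small sketch (illustration only, not from the slides) computing the empirical mean and empirical covariance of samples drawn from a 2-D Gaussian with NumPy; the particular x₀ and Σ are made-up values for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
x0 = np.array([1.0, -2.0])                       # true mean (assumed for the demo)
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])                   # true covariance (assumed)
X = rng.multivariate_normal(x0, Sigma, size=50_000)

x_bar = X.mean(axis=0)                                   # empirical mean
Sigma_bar = (X - x_bar).T @ (X - x_bar) / X.shape[0]     # empirical covariance (1/N convention)

print(x_bar)      # close to x0
print(Sigma_bar)  # close to Sigma
```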

9 The Central Limit Theorem
Let X₁, X₂, ..., X_N be independent and identically distributed (i.i.d.) continuous random variables, each with expectation μ and variance σ². Define:
  X̄_N = (1/N) Σ_{j=1}^N X_j,   Z_N = (X₁ + X₂ + ... + X_N − Nμ)/(σ√N) = (X̄_N − μ)/(σ/√N)
As N → ∞, the distribution of Z_N converges to the distribution of a standard normal random variable:
  lim_{N→∞} P(Z_N ≤ x) = ∫_{−∞}^x (1/√(2π)) e^{−t²/2} dt
For N large,
  X̄_N ~ N(μ, σ²/N)   (approximately)
This is somewhat of a justification for why assuming Gaussian noise is so common.

10 The CLT and the Gaussian Distribution
As an example, assume N variables X₁, X₂, ..., X_N, each of which has a uniform distribution over [0, 1], and then consider the distribution of the sample mean (X₁ + X₂ + ... + X_N)/N. For large N, this distribution tends to a Gaussian. The convergence as N increases can be rapid.
(Figure: histograms of the sample mean for increasing N; MatLab code.)
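A quick sketch (mine, not the lecture's MatLab code) of this experiment in NumPy: the sample mean of N uniforms concentrates around 0.5 with the standard deviation predicted by the CLT; the values of N are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
trials = 100_000
for N in (1, 2, 10):
    means = rng.uniform(0.0, 1.0, size=(trials, N)).mean(axis=1)
    # For U[0,1], the CLT predicts mean 0.5 and std sqrt(1/12)/sqrt(N)
    print(f"N={N:2d}  sample std={means.std():.4f}  CLT std={np.sqrt(1/12/N):.4f}")
```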

11 The CLT and the Gaussian Distribution
We plot a histogram of (1/N) Σ_{i=1}^N x_{ij}, for j = 1:10000, where x_{ij} ~ Beta(1, 5). As N → ∞, the distribution of this sample mean tends towards a Gaussian.
(Figure: histograms for increasing values of N. Run centrallimitdemo from PMTK.)

12 Accuracy of Monte Carlo Approximation
In the Monte Carlo approximation of the mean μ = E[f(x)] with the sample mean μ̄, we have:
  μ̄ − μ ~ N(0, σ²/N),   where   σ² = Var[f(x)] = E[f(x)²] − E[f(x)]²
The variance is itself estimated from the samples:
  σ̄² = (1/N) Σ_{s=1}^N ( f(x_s) − μ̄ )²
We can now derive the following error bars (using central intervals):
  Pr[ μ − 1.96 σ̄/√N ≤ μ̄ ≤ μ + 1.96 σ̄/√N ] = 0.95
The number of samples needed to drop the error below ε is then:
  1.96 σ̄/√N ≤ ε   ⟹   N ≥ (1.96)² σ̄²/ε² ≈ 4 σ̄²/ε²
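A sketch of these error bars in NumPy (assumptions: f(x) = x² with x ~ N(0, 1), so the true mean is 1; the tolerance ε is arbitrary).

```python
import numpy as np

rng = np.random.default_rng(4)
N = 10_000
x = rng.standard_normal(N)
f = x**2                                  # true E[f(x)] = 1 for x ~ N(0,1)

mu_bar = f.mean()
sigma_bar = f.std()
half_width = 1.96 * sigma_bar / np.sqrt(N)
print(f"estimate = {mu_bar:.4f} +/- {half_width:.4f} (true mean = 1)")

# Samples needed to bring the 95% error below eps:
eps = 0.01
print("N needed ~", int(np.ceil((1.96 * sigma_bar / eps) ** 2)))
```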

13 Monte Carlo Approximation of Distributions
Computing the distribution of y = x², where p(x) is uniform. The MC approximation is shown on the right: take samples from p(x), square them, and then plot the histogram.
(Figure: p(x), p(y), and the histogram of the squared samples. Run changeofvarsdemod from PMTK.)

14 Accuracy of Monte Carlo Approximation
The accuracy of MC increases with the number of samples. Histograms (on the left) and kernel-density-estimated pdfs (on the right) are shown for increasing numbers of samples; the actual distribution is shown in red.
(Run mcaccuracydemo from PMTK.)

15 Example of CLT: Estimating π by MC
Use the CLT to approximate π. Let x, y ~ U[−r, r]. Then
  π r² = ∫_{−r}^{r} ∫_{−r}^{r} I(x² + y² ≤ r²) dx dy
       = (2r)(2r) ∫_{−r}^{r} ∫_{−r}^{r} I(x² + y² ≤ r²) p(x) p(y) dx dy
       = 4 r² E[ I(x² + y² ≤ r²) ]
so π = 4 E[I(x² + y² ≤ r²)], and the MC estimate is
  π̄ = (4/N) Σ_{s=1}^N I(x_s² + y_s² ≤ r²),   x_s, y_s ~ U[−r, r]
Here x, y are uniform random variables on [−r, +r]. We find π̄ = 3.1416 with a small standard error.
(Run mcestimatepi from PMTK.)
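A self-contained sketch (not the PMTK script) of the same estimator in NumPy, including the CLT-based standard error from the previous slide; r = 1 here is an arbitrary choice since the estimator does not depend on r.

```python
import numpy as np

rng = np.random.default_rng(5)
N, r = 100_000, 1.0
x = rng.uniform(-r, r, N)
y = rng.uniform(-r, r, N)

inside = (x**2 + y**2 <= r**2).astype(float)
f = 4.0 * inside                          # pi = E[4 * I(x^2 + y^2 <= r^2)]
pi_hat = f.mean()
se = f.std() / np.sqrt(N)                 # CLT standard error of the sample mean
print(f"pi_hat = {pi_hat:.4f} +/- {1.96 * se:.4f}")
```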

16 CLT: The Binomial Tends to a Gaussian
One consequence of the CLT is that the binomial distribution
  Bin(m | N, θ) = (N choose m) θ^m (1 − θ)^{N−m}
which is a distribution over m defined by the sum of N observations of the random binary variable x, will tend to a Gaussian as N → ∞:
  (x₁ + x₂ + ... + x_N)/N = m/N ~ N( θ, θ(1 − θ)/N ),   i.e.   m ~ N( Nθ, Nθ(1 − θ) )
(Figure: histogram of a binomial distribution; Matlab code.)

17 Poisson Process
Consider that we count the number of photons from a light source. Let N(t) be the number of photons observed in the time interval [0, t]. N(t) is an integer-valued random variable. We make the following assumptions:
a. Stationarity: Let Δ₁ and Δ₂ be any two time intervals of equal length, and n any non-negative integer. Assume that
  Prob(n photons in Δ₁) = Prob(n photons in Δ₂)
b. Independent increments: Let Δ₁, Δ₂, ..., Δₙ be non-overlapping time intervals and k₁, k₂, ..., kₙ non-negative integers. Denote by A_j the event
  A_j = {k_j photons arrive in the time interval Δ_j}
Assume that these events are mutually independent, i.e.
  P(A₁ ∩ A₂ ∩ ... ∩ Aₙ) = P(A₁) P(A₂) ... P(Aₙ)
c. Negligible probability of coincidence: Assume that the probability of two or more events at the same time is negligible. More precisely, N(0) = 0 and
  lim_{h→0} P(N(h) > 1)/h = 0

18 Poisson Process
If these assumptions hold, then for a given time t, N(t) is a Poisson process:
  P(N(t) = n) = ((λt)ⁿ / n!) e^{−λt},   λ > 0,  n = 0, 1, 2, ...
Let us fix t = T = observation time and define the random variable N = N(T). Let us also define the parameter θ = λT. We then denote:
  N ~ Poisson(θ),   P(N = n) = (θⁿ / n!) e^{−θ}
D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing, 2007.
S. Ghahramani, Fundamentals of Probability, 1996.

19 Poisson Process
Consider the Poisson (discrete) distribution
  P(N = n) = Poisson(n | θ) = (θⁿ / n!) e^{−θ},   θ > 0,  n = 0, 1, 2, ...
The mean and the variance are both equal to θ:
  E[N] = Σ_{n=0}^∞ n Poisson(n | θ) = θ,   Var[N] = θ

20 Approximating a Poisson Distribution With a Gaussian
Theorem: A random variable X ~ Poisson(θ) can be considered as the sum of n independent random variables Xᵢ ~ Poisson(θ/n).ᵃ
According to the Central Limit Theorem, when n is large enough, take
  Xᵢ ~ Poisson(θ/n)  ≈  N(θ/n, θ/n)
Then X = Σ_{i=1}^n Xᵢ is, by the Theorem, a draw from Poisson(θ), and from the CLT it also follows a Gaussian for large n, with
  E[X] = n (θ/n) = θ,   Var[X] = n (θ/n) = θ
Thus X ≈ N(θ, θ). The approximation of a Poisson distribution with a Gaussian for large n is a result of the CLT!
ᵃ For a proof that the sum of independent Poisson random variables is a Poisson random variable, see this document.
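A short sketch (my own, using SciPy) comparing the Poisson pmf with its N(θ, θ) approximation; the two values of θ are arbitrary and simply show the approximation improving as θ grows.

```python
import numpy as np
from scipy import stats

for theta in (2.0, 20.0):
    n = np.arange(0, int(theta + 6 * np.sqrt(theta)))
    poisson_pmf = stats.poisson.pmf(n, mu=theta)
    gauss_pdf = stats.norm.pdf(n, loc=theta, scale=np.sqrt(theta))
    max_abs_diff = np.max(np.abs(poisson_pmf - gauss_pdf))
    print(f"theta={theta:5.1f}  max |Poisson - Gaussian| = {max_abs_diff:.4f}")
```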

21 Approximating a Poisson Distribution with a Gaussian
(Figure) Poisson distributions (dots) vs their Gaussian approximations (solid line) for various values of the mean θ. The higher the θ, the smaller the distance between the two distributions. See this MatLab implementation.

22 Kullback-Leibler Distance Between Two Densities
Let us consider the following two distributions:
  Poisson(n | θ) = (θⁿ / n!) e^{−θ}
  Gaussian(x | θ, θ) = (1/√(2πθ)) exp( −(x − θ)² / (2θ) )
We often use the Kullback-Leibler distance to define the distance between two distributions. In particular, in approximating the Poisson distribution with a Gaussian distribution, we have:
  KL( Poisson(· | θ), Gaussian(· | θ, θ) ) = Σ_{n=0}^∞ Poisson(n | θ) log [ Poisson(n | θ) / Gaussian(n | θ, θ) ]
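A sketch (illustration, not the course's MatLab code) that evaluates this sum numerically with SciPy, truncating it far out in the tail; it shows the KL distance shrinking as θ grows.

```python
import numpy as np
from scipy import stats

def kl_poisson_gaussian(theta, n_max=2000):
    # KL(Poisson(.|theta) || N(theta, theta)) evaluated on the integers 0..n_max
    n = np.arange(n_max + 1)
    p = stats.poisson.pmf(n, mu=theta)
    q = stats.norm.pdf(n, loc=theta, scale=np.sqrt(theta))
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

for theta in (1, 2, 5, 10, 50):
    print(f"theta={theta:3d}  KL = {kl_poisson_gaussian(theta):.4f}")
```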

23 Approximating a Poisson Distribution With a Gaussian
(Figure) The KL distance of the Poisson distribution from its Gaussian approximation as a function of the mean θ, on a logarithmic scale. The horizontal line indicates where the KL distance has dropped to 1/10 of its value at θ = 1. See the following MatLab implementation.

24 Application of the CLT: Noise Signals
Consider discrete sampling where the output is noise of length n. The noise vector x ∈ ℝⁿ is a realization of a random variable X.
We estimate the mean and the variance of the noise in a single measurement as follows:
  (1/n) Σ_{j=1}^n x_j ≈ 0,   s² ≈ (1/n) Σ_{j=1}^n x_j²
To improve the signal-to-noise ratio, we repeat the measurement and average the noise vector signals:
  x̄ = (1/N) Σ_{k=1}^N x^(k)
The average noise is a realization of the random variable:
  X̄ = (1/N) Σ_{k=1}^N X^(k)
If X^(1), X^(2), ... are i.i.d., X̄ is asymptotically a Gaussian by the CLT, and its variance is s²/N. We need to repeat the experiment until the noise level s/√N drops below a given tolerance.

25 Noise Reduction By Averaging
(Figure) Gaussian noise vectors of length 50 are used; s is the std of a single noise vector. The panels show the averaged noise for increasing N, together with the estimated noise level s/√N. See the following MatLab implementation.
D. Calvetti and E. Somersalo, Introduction to Bayesian Scientific Computing, 2007.
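A minimal sketch of this averaging experiment (assumptions: noise length 50, unit-variance Gaussian noise, a few arbitrary values of N) showing the empirical noise level decaying like s/√N.

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma = 50, 1.0
single = rng.normal(0.0, sigma, size=n)
s = single.std()                                  # std estimated from one measurement

for N in (1, 5, 10, 25):
    avg = rng.normal(0.0, sigma, size=(N, n)).mean(axis=0)   # average of N noise vectors
    print(f"N={N:3d}  empirical std={avg.std():.3f}  predicted s/sqrt(N)={s/np.sqrt(N):.3f}")
```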

26 Introduction to Information Theory
Information theory is concerned with representing data in a compact fashion (data compression or source coding), and with transmitting and storing it in a way that is robust to errors (error correction or channel coding).
Compactly representing data requires allocating short codewords to highly probable bit strings and reserving longer codewords for less probable bit strings. E.g., in natural language, common words ("a", "the", "and") are much shorter than rare words.
D. MacKay, Information Theory, Inference and Learning Algorithms (Video Lectures)

27 Introduction to Information Theory
Decoding messages sent over noisy channels requires having a good probability model of the kinds of messages that people tend to send. We need models that can predict which kinds of data are likely and which unlikely.
David MacKay, Information Theory, Inference and Learning Algorithms, 2003 (available online).
Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, Wiley, 2006.
Viterbi, A. J. and J. K. Omura (1979). Principles of Digital Communication and Coding. McGraw-Hill.

28 Introduction to Information Theory
Consider a discrete random variable x. We ask how much information ("degree of surprise") is received when we observe (learn) a specific value of this variable.
Observing a highly probable event provides little additional information.
If we have two events x and y that are unrelated, then the information gain from observing both of them should be h(x, y) = h(x) + h(y). Two unrelated events will be statistically independent, so p(x, y) = p(x)p(y).

29 Entropy
From h(x, y) = h(x) + h(y) and p(x, y) = p(x)p(y), it is easily shown that h(x) must be given by the logarithm of p(x), and so we have
  h(x) = −log₂ p(x) ≥ 0
The units of h(x) are bits ("binary digits"). Low probability events correspond to high information content.
When transmitting a random variable, the average amount of transmitted information is the entropy of X:
  H[X] = −Σ_{k=1}^K p(X = k) log₂ p(X = k)
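A tiny helper (a sketch, not PMTK code) for the entropy of a discrete distribution in bits, applied to a uniform distribution over 8 states and to a small non-uniform one.

```python
import numpy as np

def entropy_bits(p):
    """Entropy -sum p log2 p of a discrete distribution, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy_bits(np.full(8, 1/8)))          # uniform over 8 states: 3.0 bits
print(entropy_bits([0.5, 0.25, 0.25]))        # 1.5 bits
```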

30 Noiseless Coding Theorem (Shannon)
Example (coding theory): x is a discrete random variable with 8 possible states; how many bits are needed to transmit the state of x?
All states equally likely:
  H[x] = −8 × (1/8) log₂(1/8) = 3 bits
Example 2: consider a variable having 8 possible states {a, b, c, d, e, f, g, h} for which the respective (non-uniform) probabilities are (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64). The entropy in this case is smaller than for the uniform distribution:
  H[x] = −(1/2) log₂(1/2) − (1/4) log₂(1/4) − (1/8) log₂(1/8) − (1/16) log₂(1/16) − 4 × (1/64) log₂(1/64) = 2 bits
The average code length of an optimal code is also 2 bits.
Note: shorter codes for the more probable events vs longer codes for the less probable events.
Shannon's Noiseless Coding Theorem (1948): The entropy is a lower bound on the number of bits needed to transmit the state of a random variable.
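A small check (my own sketch) that the non-uniform example indeed gives 2 bits, and that a code whose word lengths equal −log₂ p has the same average length.

```python
import numpy as np

p = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
H = -np.sum(p * np.log2(p))
print(H)                                   # 2.0 bits

# Average length of a prefix code with word lengths -log2 p (1, 2, 3, 4, 6, 6, 6, 6 bits)
lengths = -np.log2(p)
print(np.sum(p * lengths))                 # also 2.0 bits
```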

31 Alternative Definition of Entropy
Consider a set of N identical objects that are to be divided amongst a set of bins, such that there are nᵢ objects in the i-th bin. Consider the number of different ways of allocating the objects to the bins. In the i-th bin there are nᵢ! ways of reordering the objects (microstates), and so the total number of ways of allocating the N objects to the bins is given by (the multiplicity)
  W = N! / Πᵢ nᵢ!
The entropy is defined as
  H = (1/N) ln W = (1/N) ln N! − (1/N) Σᵢ ln nᵢ!
We now consider the limit N → ∞ (with nᵢ/N held fixed) and use Stirling's approximation ln N! ≈ N ln N − N, ln nᵢ! ≈ nᵢ ln nᵢ − nᵢ:
  H = −lim_{N→∞} Σᵢ (nᵢ/N) ln(nᵢ/N) = −Σᵢ pᵢ ln pᵢ
where pᵢ = nᵢ/N is the probability of an object being assigned to the i-th bin. The occupation numbers pᵢ correspond to macrostates.

32 Alternative Definition of Entropy
Interpret the bins as the states xᵢ of a discrete random variable X, where p(X = xᵢ) = pᵢ. The entropy of the random variable X is then
  H[p] = −Σᵢ p(xᵢ) ln p(xᵢ)
Distributions p(x) that are sharply peaked around a few values will have a relatively low entropy, whereas those that are spread more evenly across many values will have higher entropy.

33 Maximum Entropy: Uniform Distribution
The maximum-entropy configuration can be found by maximizing H using a Lagrange multiplier to enforce the normalization constraint on the probabilities. Thus we maximize
  H̃ = −Σᵢ p(xᵢ) ln p(xᵢ) + λ ( Σᵢ p(xᵢ) − 1 )
We find p(xᵢ) = 1/M, where M is the number of possible states, and H = ln M.
To verify that the stationary point is indeed a maximum, we can evaluate the second derivative of the entropy, which gives
  ∂²H̃ / ∂p(xᵢ) ∂p(xⱼ) = −Iᵢⱼ / pᵢ
where Iᵢⱼ are the elements of the identity matrix.
For any discrete distribution with M states, we have H[x] ≤ ln M:
  H[x] = −Σᵢ p(xᵢ) ln p(xᵢ) = Σᵢ p(xᵢ) ln( 1/p(xᵢ) ) ≤ ln( Σᵢ p(xᵢ) · 1/p(xᵢ) ) = ln M
Here we used Jensen's inequality (for the concave function ln).

34 Example: Biosequence Analysis
Recall the DNA sequence logo example from earlier. The height of each bar is defined to be 2 − H, where H is the entropy of that distribution and 2 (= log₂ 4) is the maximum possible entropy. Thus a bar of height 0 corresponds to a uniform distribution, whereas a bar of height 2 corresponds to a deterministic distribution.
(Figure: sequence logo, bits vs sequence position. Run seqlogodemo from PMTK.)

35 Binary Variable
Consider a binary random variable X ∈ {0, 1} with p(X = 1) = θ and p(X = 0) = 1 − θ.
Hence the entropy becomes the binary entropy function
  H[X] = −θ log₂ θ − (1 − θ) log₂(1 − θ)
The maximum value of 1 bit occurs when the distribution is uniform, θ = 0.5.
(Figure: H(X) vs p(X = 1). MatLab function bernoullientropyfig from PMTK.)
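A small sketch of the binary entropy function (my own helper, not the PMTK figure code), evaluated at a few values of θ.

```python
import numpy as np

def binary_entropy(theta):
    """H(theta) = -theta*log2(theta) - (1-theta)*log2(1-theta), with H(0) = H(1) = 0."""
    theta = np.asarray(theta, dtype=float)
    h = np.zeros_like(theta)
    inside = (theta > 0) & (theta < 1)
    t = theta[inside]
    h[inside] = -t * np.log2(t) - (1 - t) * np.log2(1 - t)
    return h

print(binary_entropy(np.array([0.0, 0.1, 0.5, 0.9, 1.0])))   # maximum of 1 bit at theta = 0.5
```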

36 Differential Entropy
Divide x into bins of width Δ. Assuming p(x) is continuous, for each such bin there must exist an xᵢ such that
  ∫_{iΔ}^{(i+1)Δ} p(x) dx = p(xᵢ) Δ   (the probability of falling in bin i)
  H_Δ = −Σᵢ p(xᵢ) Δ ln( p(xᵢ) Δ ) = −Σᵢ p(xᵢ) Δ ln p(xᵢ) − ln Δ
  lim_{Δ→0} { −Σᵢ p(xᵢ) Δ ln p(xᵢ) } = −∫ p(x) ln p(x) dx   (can be negative)
The ln Δ term is omitted since it diverges as Δ → 0 (indicating that infinitely many bits are needed to describe a continuous variable exactly).

37 Differential Entropy
For a density defined over multiple continuous variables, denoted collectively by the vector x, the differential entropy is given by
  H[x] = −∫ p(x) ln p(x) dx
Differential entropy (unlike the discrete entropy) can be negative.
When transforming variables y(x), use p(x) dx = p(y) dy; e.g., if y = Ax then
  H[y] = −∫ p(y) ln p(y) dy = H[x] + ln |det A|

38 Differential Entropy and the Gaussian Distribution
The distribution that maximizes the differential entropy, subject to constraints on normalization and on the first two moments,
  ∫ p(x) dx = 1,   ∫ x p(x) dx = μ,   ∫ (x − μ)² p(x) dx = σ²
is a Gaussian. Using calculus of variations with Lagrange multipliers for the three constraints, δH = 0 gives
  p(x) = (1/√(2πσ²)) exp( −(x − μ)² / (2σ²) )
Evaluating the differential entropy of the Gaussian, we obtain
  H[x] = (1/2) ln(2πeσ²),   and for a d-dimensional Gaussian   H[x] = (1/2) ln( (2πe)^d det Σ )
Note that H[x] < 0 for σ² < 1/(2πe).
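A quick numerical check (my own sketch; the values of μ and σ are arbitrary) that a Riemann-sum approximation of −∫ p ln p dx matches the closed form (1/2) ln(2πeσ²).

```python
import numpy as np
from scipy import stats

mu, sigma = 0.5, 2.0
x = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 200_001)
dx = x[1] - x[0]
p = stats.norm.pdf(x, loc=mu, scale=sigma)

h_numeric = -np.sum(p * np.log(p)) * dx           # Riemann-sum approximation of -int p ln p
h_formula = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
print(h_numeric, h_formula)                       # the two should agree closely
```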

39 Kullback-Leibler Divergence and Cross Entropy
Consider some unknown distribution p(x), and suppose that we have modeled it using an approximating distribution q(x). If we use q(x) to construct a coding scheme for the purpose of transmitting values of x to a receiver, then the additional information needed to specify x is:
  KL(p ‖ q) = −∫ p(x) ln q(x) dx − ( −∫ p(x) ln p(x) dx ) = −∫ p(x) ln( q(x)/p(x) ) dx
(We transmit using q(x), but average with respect to the exact probability p(x).)
The cross entropy is defined as:
  H(p, q) = −∫ p(x) ln q(x) dx

40 KL Divergence and Cross Entropy
  H(p, q) = −∫ p(x) ln q(x) dx
The cross entropy is the average number of bits needed to encode data coming from a source with distribution p when we use model q to define our codebook. H(p) = H(p, p) is the expected number of bits using the true model.
The KL divergence
  KL(p ‖ q) = −∫ p(x) ln q(x) dx + ∫ p(x) ln p(x) dx = −∫ p(x) ln( q(x)/p(x) ) dx
is the average number of extra bits needed to encode the data, because we used distribution q to encode the data instead of the true distribution p.
The "extra number of bits" interpretation makes it clear that KL(p ‖ q) ≥ 0, and that the KL divergence is equal to zero iff q = p.
The KL distance is not a symmetric quantity, that is, KL(p ‖ q) ≠ KL(q ‖ p) in general.
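A small sketch (mine) computing entropy, cross entropy, and KL divergence for two discrete distributions, confirming H(p, q) = H(p) + KL(p ‖ q) and the asymmetry of KL; the two distributions are arbitrary.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

H_p = -np.sum(p * np.log2(p))                 # entropy of p (bits)
H_pq = -np.sum(p * np.log2(q))                # cross entropy H(p, q)
KL_pq = np.sum(p * np.log2(p / q))            # KL(p || q)

print(H_p, H_pq, KL_pq, H_p + KL_pq)          # H(p, q) = H(p) + KL(p || q)
print(np.sum(q * np.log2(q / p)))             # KL(q || p) differs: KL is not symmetric
```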

41 KL Divergence Between Two Gaussians
Consider p(x) = N(x | μ, σ²) and q(x) = N(x | m, s²):
  KL(p ‖ q) = −∫ p(x) ln q(x) dx + ∫ p(x) ln p(x) dx
The first term can be computed using the moments and normalization condition of a Gaussian,
  −∫ N(x | μ, σ²) ln q(x) dx = (1/2) ln(2πs²) + ( σ² + (μ − m)² ) / (2s²)
and the second term is minus the differential entropy of a Gaussian, −(1/2) ln(2πeσ²). Finally we obtain:
  KL(p ‖ q) = ln(s/σ) + ( σ² + (μ − m)² ) / (2s²) − 1/2
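A sketch (my own check) comparing the closed-form KL between two univariate Gaussians with a Monte Carlo estimate of E_p[ln p(x) − ln q(x)]; the parameter values are arbitrary.

```python
import numpy as np
from scipy import stats

mu, sigma = 0.0, 1.0          # p = N(mu, sigma^2)
m, s = 1.0, 2.0               # q = N(m, s^2)

kl_closed = np.log(s / sigma) + (sigma**2 + (mu - m)**2) / (2 * s**2) - 0.5

rng = np.random.default_rng(7)
x = rng.normal(mu, sigma, size=1_000_000)
kl_mc = np.mean(stats.norm.logpdf(x, mu, sigma) - stats.norm.logpdf(x, m, s))

print(kl_closed, kl_mc)       # the two numbers should agree closely
```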

42 KL Divergence Between Two Gaussians
Consider now p(x) = N(x | μ, Σ) and q(x) = N(x | m, L), with x ∈ ℝᴰ:
  −∫ p(x) ln q(x) dx = (1/2) [ D ln(2π) + ln det L + Tr(L⁻¹Σ) + (μ − m)ᵀ L⁻¹ (μ − m) ]
  −∫ p(x) ln p(x) dx = (1/2) [ D ln(2π) + ln det Σ + D ]
Subtracting the two terms:
  KL(p ‖ q) = (1/2) [ ln( det L / det Σ ) − D + Tr(L⁻¹Σ) + (μ − m)ᵀ L⁻¹ (μ − m) ]

43 Jensen's Inequality
For a convex function f, Jensen's inequality states (and can be proven easily by induction):
  f( Σ_{i=1}^M λᵢ xᵢ ) ≤ Σ_{i=1}^M λᵢ f(xᵢ),   λᵢ ≥ 0,  Σᵢ λᵢ = 1
For M = 2 this is equivalent to our requirement for convexity, f″(x) ≥ 0. Assume f″(x) > 0 (strict convexity) for any x. By Taylor expansion about x₀,
  f(x) = f(x₀) + f′(x₀)(x − x₀) + (1/2) f″(x*)(x − x₀)² ≥ f(x₀) + f′(x₀)(x − x₀)
For x₀ = λa + (1 − λ)b:
  f(a) ≥ f(x₀) + f′(x₀)(a − x₀)
  f(b) ≥ f(x₀) + f′(x₀)(b − x₀)
Multiplying the first by λ, the second by 1 − λ, and adding (the f′(x₀) terms cancel since λ(a − x₀) + (1 − λ)(b − x₀) = 0), Jensen's inequality follows:
  λ f(a) + (1 − λ) f(b) ≥ f( λa + (1 − λ)b )

44 Jensen's Inequality
Conversely, assume Jensen's inequality holds; we show that f is convex (f′ is non-decreasing). Set b = a + 2ε > a, ε > 0. Using Jensen's inequality with λ = 1/2:
  (1/2) f(a) + (1/2) f(b) ≥ f( (1/2)a + (1/2)b ) = f(b − ε) = f(a + ε)
Hence
  f(b) − f(b − ε) ≥ f(a + ε) − f(a)
For small ε we thus have
  f′(b) ε ≥ f′(a) ε,   i.e.   f′(b) ≥ f′(a)
so f′ is non-decreasing and f(·) is convex.

45 Jensen's Inequality
Using Jensen's inequality f( Σᵢ λᵢ xᵢ ) ≤ Σᵢ λᵢ f(xᵢ) with λᵢ = p(xᵢ) for a discrete random variable results in:
  f( Σᵢ xᵢ p(xᵢ) ) ≤ Σᵢ f(xᵢ) p(xᵢ),   i.e.   f(E[x]) ≤ E[f(x)]
We can generalize this result to continuous random variables:
  f( ∫ x p(x) dx ) ≤ ∫ f(x) p(x) dx
We will use this shortly in the context of the KL distance. We often use Jensen's inequality for concave functions (e.g. log x). In that case, be sure you reverse the inequality!

46 Jensen's Inequality: Example
As another example of Jensen's inequality, consider the arithmetic and geometric means of a set of positive real variables:
  x̄_A = (1/M) Σ_{i=1}^M xᵢ,   x̄_G = ( Π_{i=1}^M xᵢ )^{1/M}
Using Jensen's inequality for the concave function f(x) = ln x, i.e. E[ln x] ≤ ln E[x], we can show:
  ln x̄_G = (1/M) Σ_{i=1}^M ln xᵢ ≤ ln( (1/M) Σ_{i=1}^M xᵢ ) = ln x̄_A   ⟹   x̄_G ≤ x̄_A

47 The Kullback-Leibler Divergence
Using Jensen's inequality f(E[x]) ≤ E[f(x)] (−ln is a convex function), we can show that:
  KL(p ‖ q) = −∫ p(x) ln( q(x)/p(x) ) dx ≥ −ln ∫ p(x) ( q(x)/p(x) ) dx = −ln ∫ q(x) dx = 0
Thus we derive the following information inequality:
  KL(p ‖ q) ≥ 0,   with KL(p ‖ q) = 0 if and only if p(x) = q(x)

48 Principle of Insufficient Reason
An important consequence of the information inequality is that the discrete distribution with the maximum entropy is the uniform distribution. More precisely, H(X) ≤ log |X|, where |X| is the number of states of X, with equality iff p(x) is uniform. To see this, let u(x) = 1/|X|. Then
  0 ≤ KL(p ‖ u) = Σ_x p(x) log p(x) − Σ_x p(x) log u(x) = −H(X) + log |X|
This principle of insufficient reason argues in favor of using uniform distributions when there are no other reasons to favor one distribution over another.

49 The Kullback-Leibler Divergence
Data compression is in some way related to density estimation. The Kullback-Leibler divergence measures the distance between two distributions and is zero when the two densities are identical.
Suppose the data is generated from an unknown p(x) that we try to approximate with a parametric model q(x | θ). Suppose we have observed training points xₙ ~ p(x), n = 1, ..., N. Then:
  KL(p ‖ q) = ∫ p(x) ln( p(x)/q(x | θ) ) dx ≈ (1/N) Σ_{n=1}^N [ −ln q(xₙ | θ) + ln p(xₙ) ]
(a sample-average approximation of the mean).

50 The KL Divergence Vs. MLE
  KL(p ‖ q) ≈ (1/N) Σ_{n=1}^N [ −ln q(xₙ | θ) + ln p(xₙ) ]
Note that only the first term is a function of q. Thus minimizing KL(p ‖ q) is equivalent to maximizing the likelihood function for θ under the distribution q.
So the MLE estimate minimizes the KL divergence to the empirical distribution
  p_emp(x) = (1/N) Σ_{n=1}^N δ(x − xₙ)
  arg min_q KL(p_emp ‖ q) = arg min_q [ ∫ p_emp(x) ln( p_emp(x)/q(x | θ) ) dx ] = arg min_q [ const − (1/N) Σ_{n=1}^N ln q(xₙ | θ) ]

51 Conditional Entropy
For a joint distribution p(x, y), the conditional entropy is
  H[y|x] = −∫∫ p(y, x) ln p(y|x) dy dx
This represents the average information needed to specify y if we already know the value of x.
Using p(y, x) = p(y|x) p(x) and substituting inside the logarithm in
  H[x, y] = −∫∫ p(x, y) ln p(x, y) dy dx
it is easily seen that the conditional entropy satisfies the relation
  H[x, y] = H[y|x] + H[x]
where H[x, y] is the differential entropy of p(x, y) and H[x] is the differential entropy of p(x).

52 Conditional Entropy for Discrete Variables
Consider the conditional entropy for discrete variables:
  H[y|x] = −Σᵢ Σⱼ p(yᵢ, xⱼ) ln p(yᵢ|xⱼ)
To understand further the meaning of conditional entropy, let us consider the implications of H[y|x] = 0. We have:
  H[y|x] = −Σᵢ Σⱼ p(yᵢ|xⱼ) ln p(yᵢ|xⱼ) p(xⱼ) = 0
From this we can conclude that for each xⱼ such that p(xⱼ) > 0, the following must hold:
  Σᵢ p(yᵢ|xⱼ) ln p(yᵢ|xⱼ) = 0
Since −p log p = 0 implies p = 0 or p = 1, and since p(yᵢ|xⱼ) is normalized, there is only one yᵢ such that p(yᵢ|xⱼ) = 1, with all other p(·|xⱼ) = 0. Thus y is a function of x.

53 Mutual Information
If the variables x and y are not independent, we can gain some idea of whether they are close to being independent by considering the KL divergence between the joint distribution and the product of the marginals. This is the mutual information:
  I[x, y] = KL( p(x, y) ‖ p(x) p(y) ) = −∫∫ p(x, y) ln( p(x) p(y) / p(x, y) ) dx dy ≥ 0
with I[x, y] = 0 iff x and y are independent.
The mutual information is related to the conditional entropy through
  I[x, y] = −∫∫ p(x, y) ln( p(y) / p(y|x) ) dx dy = H[y] − H[y|x]

54 Mutual Information
The mutual information represents the reduction in the uncertainty about x once we learn the value of y (and conversely):
  I[x, y] = H[x] − H[x|y] = H[y] − H[y|x]
In a Bayesian setting, p(x) is the prior, p(x|y) the posterior, and I[x, y] represents the reduction in uncertainty in x once we observe y.

55 Note that H[x, y] ≤ H[x] + H[y]
This is easy to prove by noticing that
  I[x, y] = H[y] − H[y|x] ≥ 0   (it is a KL divergence)
and
  H[x, y] = H[y|x] + H[x]
from which
  H[x, y] = H[y|x] + H[x] ≤ H[x] + H[y]
The equality here is true only if x, y are independent:
  H[x, y] = −∫∫ p(x, y) ln p(x, y) dy dx = −∫∫ p(x, y) [ ln p(x) + ln p(y) ] dy dx = H[x] + H[y]   (sufficiency)
  H[x] + H[y] − H[x, y] = I[x, y] = 0 ⟹ p(x, y) = p(x) p(y)   (necessity)

56 Mutual Information for Correlated Gaussians
Consider two correlated Gaussian variables:
  (X, Y) ~ N( (0, 0), [[σ², ρσ²], [ρσ², σ²]] )
For each of these variables we can write:
  H[X] = H[Y] = (1/2) ln(2πeσ²)
The joint entropy is given similarly as
  H[X, Y] = (1/2) ln( (2πe)² det Σ ) = (1/2) ln( (2πe)² σ⁴ (1 − ρ²) )
Thus:
  I[x, y] = H[X] + H[Y] − H[X, Y] = −(1/2) ln(1 − ρ²)
Note: for ρ = 0, I[x, y] = 0 (independent X, Y); for ρ = ±1, I[x, y] → ∞ (linearly correlated X and Y).
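A sketch (assumptions: σ = 1 and a simple histogram-based plug-in estimator) comparing a binned estimate of the mutual information of correlated Gaussian samples with the analytic value −(1/2) ln(1 − ρ²); as discussed a couple of slides below, such binned estimates depend on the number and placement of the bins.

```python
import numpy as np

def mi_binned(x, y, bins=40):
    """Plug-in mutual information estimate (in nats) from a 2-D histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask]))

rng = np.random.default_rng(8)
for rho in (0.0, 0.5, 0.9):
    cov = [[1.0, rho], [rho, 1.0]]
    x, y = rng.multivariate_normal([0, 0], cov, size=200_000).T
    print(f"rho={rho:.1f}  binned MI={mi_binned(x, y):.3f}  analytic={-0.5*np.log(1-rho**2):.3f}")
```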

57 Pointwise Mutual Information
A quantity which is closely related to MI is the pointwise mutual information or PMI. For two events (not random variables) x and y, this is defined as
  PMI(x, y) := log [ p(x, y) / ( p(x) p(y) ) ] = log [ p(x|y) / p(x) ] = log [ p(y|x) / p(y) ]
This measures the discrepancy between these events occurring together compared to what would be expected by chance.
Clearly the MI, I[x, y], of X and Y is just the expected value of the PMI.
PMI is the amount we learn from updating the prior p(x) into the posterior p(x|y), or equivalently, updating the prior p(y) into the posterior p(y|x).

58 Mutual Information
For continuous random variables, it is common to first discretize or quantize them into bins, and to compute how many values fall into each histogram bin (Scott 1979). The number of bins used, and the location of the bin boundaries, can have a significant effect on the results.
One can also estimate the MI directly, without performing density estimation (Learned-Miller, 2004). Another approach is to try many different bin sizes and locations, and to compute the maximum MI achieved.
Scott, D. (1979). On optimal and data-based histograms. Biometrika 66(3).
Learned-Miller, E. (2004). Hyperspacings and the estimation of information theoretic quantities. Technical Report 04-04, U. Mass. Amherst Comp. Sci. Dept.
Reshef, D., Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti (2011, December). Detecting novel associations in large data sets. Science 334.
Speed, T. (2011, December). A correlation for the 21st century. Science 334.
*Use the MatLab function mimixeddemo from Kevin Murphy's PMTK.

59 Maximal Information Coefficient
This statistic, appropriately normalized, is known as the maximal information coefficient (MIC). We first define:
  m(x, y) = max_{G ∈ G(x, y)} I( X(G); Y(G) ) / log min(x, y)
Here G(x, y) is the set of 2d grids of size x × y, and X(G), Y(G) represent a discretization of the variables onto this grid. (The maximization over bin locations is performed efficiently using dynamic programming.)
Now define the MIC as
  MIC = max_{x, y : xy < B} m(x, y)
Reshef, D., Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti (2011, December). Detecting novel associations in large data sets. Science 334.

60 Maximal Information Coefficient
The MIC is defined as:
  MIC = max_{x, y : xy < B} m(x, y),   m(x, y) = max_{G ∈ G(x, y)} I( X(G); Y(G) ) / log min(x, y)
B is some sample-size-dependent bound on the number of bins we can use and still reliably estimate the distribution (Reshef et al. suggest B ~ N^0.6).
MIC lies in the range [0, 1], where 0 represents no relationship between the variables, and 1 represents a noise-free relationship of any form, not just linear.
Reshef, D., Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti (2011, December). Detecting novel associations in large data sets. Science 334.

61 Correlation Coefficient Vs MIC
(Figure; see mutualinfoallpairsmixed and mimixeddemo from PMTK3.)
The data consists of 357 variables measuring a variety of social, economic, etc. indicators, collected by WHO. On the left, we see the correlation coefficient (CC) plotted against the MIC for all 63,566 variable pairs. On the right, we see scatter plots for particular pairs of variables.
Reshef, D., Y. Reshef, H. Finucane, S. Grossman, G. McVean, P. Turnbaugh, E. Lander, M. Mitzenmacher, and P. Sabeti (2011, December). Detecting novel associations in large data sets. Science 334.

62 Correlation Coefficient Vs MIC
The point marked C has a low CC and a low MIC. From the corresponding scatter plot we see that there is no relationship between these two variables.
The points marked D and H have high CC (in absolute value) and high MIC, and we see from the scatter plots that they represent nearly linear relationships.

63 Correlation Coefficient Vs MIC
The points E, F, and G have low CC but high MIC. They correspond to non-linear (and sometimes, as in E and F, one-to-many) relationships between the variables.
Statistics (such as MIC) based on mutual information can be used to discover interesting relationships between variables in a way that correlation coefficients cannot.
