Hands-On Learning Theory Fall 2016, Lecture 3


Jean Honorio (jhonorio@purdue.edu)

1 Information Theory

First, we provide some information theory background.

Definition 3.1 (Entropy). The entropy of a discrete random variable $x$ with support $\mathcal{X}$ and probability mass function $p$ is defined as:

$$H(x) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$

A basic property of the entropy of a discrete random variable $x$ is that:

$$0 \le H(x) \le \log |\mathcal{X}|$$

In fact, the entropy is maximal for the discrete uniform distribution. That is, $(\forall x \in \mathcal{X})\; p(x) = 1/|\mathcal{X}|$, in which case $H(x) = \log |\mathcal{X}|$.

Definition 3.2 (Conditional entropy). The conditional entropy of $y$ given $x$ is defined as:

$$H(y \mid x) = \sum_{v \in \mathcal{X}} p_x(v)\, H(y \mid x = v) = -\sum_{v \in \mathcal{X}} p_x(v) \sum_{y \in \mathcal{Y}} p_{y|x}(y \mid v) \log p_{y|x}(y \mid v) = -\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p_{xy}(x, y) \log p_{y|x}(y \mid x)$$

Theorem 3.1 (Chain rule for entropy).

$$H(x, y) = H(x) + H(y \mid x)$$

Similarly:

$$H(x, y \mid z) = H(x \mid z) + H(y \mid x, z)$$

(See [1] if interested in the proof.)
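As a quick numerical sanity check (not part of the original notes), the following Python sketch computes entropies for a small, hypothetical joint pmf `p_xy` and verifies the chain rule $H(x, y) = H(x) + H(y \mid x)$ as well as the bound $H(x) \le \log |\mathcal{X}|$; the pmf values are arbitrary illustrative choices.

```python
import numpy as np

def entropy(p):
    """Entropy H(x) = -sum_x p(x) log p(x), ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Hypothetical joint pmf of (x, y) on a 2 x 3 support (sums to 1).
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

p_x = p_xy.sum(axis=1)            # marginal of x
p_y = p_xy.sum(axis=0)            # marginal of y

# Conditional entropy H(y|x) = -sum_{x,y} p(x,y) log p(y|x)
H_y_given_x = -np.sum(p_xy * np.log(p_xy / p_x[:, None]))

# Chain rule: H(x, y) = H(x) + H(y|x)
assert np.isclose(entropy(p_xy.ravel()), entropy(p_x) + H_y_given_x)

# Entropy is at most that of the uniform distribution: H(x) <= log |X|
assert entropy(p_x) <= np.log(len(p_x)) + 1e-12
print(entropy(p_x), np.log(len(p_x)))
```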

Definition 3.3 (Mutual information).

$$I(x, y) = \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p_{xy}(x, y) \log \frac{p_{xy}(x, y)}{p_x(x)\, p_y(y)}$$

A basic property of the mutual information of random variables $x$ and $y$ is that:

$$I(x, y) \ge 0$$

Furthermore, the mutual information can be expressed in terms of the entropy:

$$I(x, y) = H(x) - H(x \mid y)$$

Note that random variables $x$ and $y$ are independent if and only if $I(x, y) = 0$.

Theorem 3.2 (Conditioning reduces entropy).

$$H(x \mid y) \le H(x)$$

Proof. $0 \le I(x, y) = H(x) - H(x \mid y)$.

Definition 3.4 (Conditional mutual information).

$$I(x, y \mid z) = H(x \mid z) - H(x \mid y, z)$$

Theorem 3.3 (Chain rule for mutual information).

$$I(x, (y, z)) = I(x, y) + I(x, z \mid y)$$

(See [1] if interested in the proof.)

Definition 3.5 (Markov chain). Random variables $x$, $y$ and $z$ are said to form a Markov chain $x \to y \to z$ if and only if their joint probability distribution can be written as:

$$p_{xyz}(x, y, z) = p_x(x)\, p_{y|x}(y \mid x)\, p_{z|y}(z \mid y)$$

Equivalently, random variables $x$, $y$ and $z$ form a Markov chain $x \to y \to z$ if and only if $x$ and $z$ are conditionally independent given $y$, and thus $I(x, z \mid y) = 0$.

2 Data Processing Inequality

The way to understand the following inequality in an intuitive fashion is as follows. Assume there is a random variable $x$. Then we process $x$ by possibly adding some additional randomness (e.g., noise) and/or losing some information (e.g., rounding) and we obtain a variable $y$. Similarly, we process $y$ by possibly adding some additional randomness and/or losing some information and we obtain a variable $z$. There is likely more information about $x$ in $y$ than in $z$.
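This intuition can be checked numerically. The sketch below (a toy example of my own, not from the notes) builds a Markov chain $x \to y \to z$ from a hypothetical prior `p_x` and hypothetical transition matrices, and verifies that $I(x, y) \ge I(x, z)$ (the data processing inequality, formalized as Theorem 3.4 below), that mutual information is nonnegative, and that conditioning reduces entropy.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(p_xy):
    """I(x,y) = sum_{x,y} p(x,y) log [ p(x,y) / (p(x) p(y)) ]."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log((p_xy / (p_x * p_y))[mask]))

# Markov chain x -> y -> z: p(x,y,z) = p(x) p(y|x) p(z|y).
p_x = np.array([0.3, 0.7])
p_y_given_x = np.array([[0.9, 0.1],     # rows: x, columns: y
                        [0.2, 0.8]])
p_z_given_y = np.array([[0.8, 0.2],     # z is a noisy copy of y
                        [0.3, 0.7]])

p_xy = p_x[:, None] * p_y_given_x
p_xz = p_xy @ p_z_given_y               # marginalize out y

I_xy = mutual_information(p_xy)
I_xz = mutual_information(p_xz)
print(I_xy, I_xz)
assert I_xy >= 0 and I_xz >= 0          # mutual information is nonnegative
assert I_xy >= I_xz - 1e-12             # more information about x in y than in z

# Conditioning reduces entropy: H(x|y) = H(x) - I(x,y) <= H(x)
H_x = entropy(p_x)
assert H_x - I_xy <= H_x + 1e-12
```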

Theorem 3.4 (Data processing inequality). If $x \to y \to z$ then $I(x, y) \ge I(x, z)$, or equivalently $H(x \mid y) \le H(x \mid z)$.

Proof. By using the chain rule for mutual information (Theorem 3.3), we have:

$$I(x, (y, z)) = I(x, y) + I(x, z \mid y) = I(x, z) + I(x, y \mid z)$$

Since $I(x, z \mid y) = 0$ and since $I(x, y \mid z) \ge 0$, we get:

$$I(x, y) \ge I(x, z)$$

The second statement follows immediately since:

$$H(x) - H(x \mid y) = I(x, y) \ge I(x, z) = H(x) - H(x \mid z)$$

3 Fano's inequality

Fano's inequality allows us to derive information-theoretic lower bounds on the sample complexity. The setting for the analysis is as follows. Nature picks a true hypothesis $f$ from some distribution of hypotheses. Then, a dataset $S$ of $n$ samples is produced, conditioned on the choice of $f$. The learner then infers $\hat{f}$ from the dataset $S$. The probability of error of the learner is given by $P[\hat{f} \ne f]$. By lower-bounding this probability of error, one can find the necessary number of samples for learning. (Analyses as in the previous lecture allow us to find a sufficient number of samples.)

Theorem 3.5 (Fano's inequality). For any estimator $\hat{f}$ with $k$ possible outcomes, such that $f \to S \to \hat{f}$, we have:

$$P[\hat{f} \ne f] \ge \frac{H(f \mid S) - \log 2}{\log k}$$

Proof. Define the following error random variable:

$$w = \begin{cases} 1 & \text{if } \hat{f} \ne f \\ 0 & \text{if } \hat{f} = f \end{cases}$$

First, we analyze some quantities that will be useful later. Since conditioning reduces entropy (Theorem 3.2), we have that $H(w \mid \hat{f}) \le H(w) \le \log 2$. Note that $H(f \mid w = 0, \hat{f}) = 0$, since if $w = 0$ then $f = \hat{f}$. Also, note that since conditioning reduces entropy (Theorem 3.2), $H(f \mid w = 1, \hat{f}) \le H(f) \le \log k$. By Definition 3.2, we have:

$$H(f \mid w, \hat{f}) = P[w = 0]\, H(f \mid w = 0, \hat{f}) + P[w = 1]\, H(f \mid w = 1, \hat{f}) \le P[w = 1] \log k = P[\hat{f} \ne f] \log k$$

Finally, since $w$ is a deterministic function of $f$ and $\hat{f}$, we have $H(w \mid f, \hat{f}) = 0$. By using the chain rule for entropy (Theorem 3.1) in two different ways, we have:

$$H(w, f \mid \hat{f}) = H(f \mid \hat{f}) + H(w \mid f, \hat{f}) = H(w \mid \hat{f}) + H(f \mid w, \hat{f})$$

and thus, from the conclusions at the beginning of the proof, we have:

$$H(f \mid \hat{f}) = H(w \mid \hat{f}) + H(f \mid w, \hat{f}) - H(w \mid f, \hat{f}) \le \log 2 + P[\hat{f} \ne f] \log k - 0$$

By the data processing inequality (Theorem 3.4), and since $f \to S \to \hat{f}$ is a Markov chain, we have $H(f \mid S) \le H(f \mid \hat{f})$, and thus the above implies:

$$H(f \mid S) \le \log 2 + P[\hat{f} \ne f] \log k$$

By rearranging the terms, we prove our claim.

Corollary 3.1 (Fano's inequality). For any estimator $\hat{f}$ with $k$ possible outcomes, such that $f \to S \to \hat{f}$, where $f$ is chosen by nature uniformly at random (also from $k$ possible outcomes), we have:

$$P[\hat{f} \ne f] \ge 1 - \frac{I(f, S) + \log 2}{\log k}$$

Proof. By a property of the mutual information, we have $H(f \mid S) = H(f) - I(f, S)$. Since $f$ is chosen uniformly at random from $k$ possible outcomes, we have $H(f) = \log k$, and we prove our claim.

The key in using Fano's inequality is to define a hypothesis class $\mathcal{F}$ for which $k = |\mathcal{F}|$ is large, while the mutual information $I(f, S)$ is small and of order $n$.
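To illustrate Theorem 3.5 and Corollary 3.1, here is a small numerical sketch (not part of the original notes). It uses a hypothetical discrete channel `p_S_given_f`, a uniform prior over $k = 3$ hypotheses, and the identity estimator $\hat{f}(S) = S$, which is a deterministic function of $S$, so $f \to S \to \hat{f}$ holds; all numbers are arbitrary illustrative choices.

```python
import numpy as np

k = 3
p_f = np.full(k, 1.0 / k)                       # nature picks f uniformly at random
# Hypothetical channel p(S|f): S is a noisy observation of f.
p_S_given_f = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.2, 0.2, 0.6]])

p_fS = p_f[:, None] * p_S_given_f               # joint p(f, S)
p_S = p_fS.sum(axis=0)

# H(f|S) = -sum_{f,S} p(f,S) log p(f|S)
H_f_given_S = -np.sum(p_fS * np.log(p_fS / p_S[None, :]))
# I(f,S) = H(f) - H(f|S), with H(f) = log k for the uniform prior
I_fS = np.log(k) - H_f_given_S

# Estimator f_hat(S) = S; its error probability is 1 minus the mass on the diagonal.
P_error = 1.0 - np.sum(np.diag(p_fS))

theorem_bound = (H_f_given_S - np.log(2)) / np.log(k)       # Theorem 3.5
corollary_bound = 1.0 - (I_fS + np.log(2)) / np.log(k)      # Corollary 3.1
assert np.isclose(theorem_bound, corollary_bound)           # identical for uniform f
assert P_error >= theorem_bound
print(P_error, theorem_bound)
```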

4 Upper Bounds on the Mutual Information

One key step in the application of Fano's inequality is to upper-bound the mutual information $I(f, S)$. Next, we review some important definitions and inequalities from information theory.

Definition 3.6 (Kullback-Leibler (KL) divergence). Assume that a random variable $x$ has support $\mathcal{X}$. Assume that there are two probability density functions $p$ and $q$, which define two probability distributions $P = p(\cdot)$ and $Q = q(\cdot)$ respectively. The KL divergence is defined as:

$$KL(P \| Q) = \int_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}\, dx$$

One important property of the KL divergence for independent random variables is the following. Let $P_{xy} = p_{xy}(\cdot)$ and $P_x P_y = p_x(\cdot)\, p_y(\cdot)$. Assume that $x$ and $y$ are independent, and thus $P_{xy} = P_x P_y$; likewise, assume that $Q_{xy} = Q_x Q_y$. We have:

$$KL(P_{xy} \| Q_{xy}) = KL(P_x \| Q_x) + KL(P_y \| Q_y) \qquad (1)$$

(The proof of the above might be left for homework very soon.)

Let $P_{xy} = p_{xy}(\cdot)$ and $P_x P_y = p_x(\cdot)\, p_y(\cdot)$. We can define the mutual information as:

$$I(x, y) = KL(P_{xy} \| P_x P_y) = \int_{x \in \mathcal{X}, y \in \mathcal{Y}} p_{xy}(x, y) \log \frac{p_{xy}(x, y)}{p_x(x)\, p_y(y)}\, dx\, dy$$

By the well-known identities $p_{f,S}(f, S) = p_f(f)\, p_{S|f}(S \mid f)$ and $p_S(S) = \sum_{f \in \mathcal{F}} p_{f,S}(f, S)$, and since $f$ follows a uniform distribution $p_f(f) = 1/k$, we have:

$$I(f, S) = \int_S \sum_{f \in \mathcal{F}} p_{f,S}(f, S) \log \frac{p_{f,S}(f, S)}{p_f(f)\, p_S(S)}\, dS = \int_S \sum_{f \in \mathcal{F}} p_f(f)\, p_{S|f}(S \mid f) \log \frac{p_f(f)\, p_{S|f}(S \mid f)}{p_f(f)\, p_S(S)}\, dS = \frac{1}{k} \sum_{f \in \mathcal{F}} \int_S p_{S|f}(S \mid f) \log \frac{p_{S|f}(S \mid f)}{p_S(S)}\, dS = \frac{1}{k} \sum_{f \in \mathcal{F}} KL(P_{S|f} \| P_S)$$

In the above, we use the distribution $P_{S|f} = p_{S|f}(\cdot \mid f)$ as well as the distribution $P_S$ with density $p_S(S) = \frac{1}{k} \sum_{f \in \mathcal{F}} p_{S|f}(S \mid f)$.

Furthermore, from the convexity of the KL divergence, we can show that:

$$I(f, S) \le \frac{1}{k^2} \sum_{f \in \mathcal{F}} \sum_{f' \in \mathcal{F}} KL(P_{S|f} \| P_{S|f'}) \qquad (2)$$

(The proof of the above might be left for homework very soon.)
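Equations (1) and (2) can also be verified numerically on discrete toy distributions. The sketch below (illustrative only) checks the additivity of the KL divergence over independent coordinates and the convexity bound of eq. (2), reusing the same hypothetical channel as in the previous sketch.

```python
import numpy as np

def kl(p, q):
    """Discrete sketch of Definition 3.6: KL(P || Q) = sum_x p(x) log [p(x)/q(x)]."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Eq. (1): KL is additive across independent coordinates.
p1, q1 = np.array([0.6, 0.4]), np.array([0.5, 0.5])
p2, q2 = np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.3, 0.3])
kl_product = kl(np.outer(p1, p2).ravel(), np.outer(q1, q2).ravel())
assert np.isclose(kl_product, kl(p1, q1) + kl(p2, q2))

# I(f,S) as an average KL to the mixture, and the convexity bound of eq. (2).
k = 3
p_S_given_f = np.array([[0.7, 0.2, 0.1],      # same hypothetical channel as above
                        [0.1, 0.8, 0.1],
                        [0.2, 0.2, 0.6]])
p_S_bar = p_S_given_f.mean(axis=0)            # P_S = (1/k) sum_f P_{S|f}

I_fS = np.mean([kl(p_S_given_f[j], p_S_bar) for j in range(k)])
pairwise = np.mean([kl(p_S_given_f[j], p_S_given_f[jp])
                    for j in range(k) for jp in range(k)])
print(I_fS, pairwise)
assert I_fS <= pairwise + 1e-12               # eq. (2)
```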

5 Application: Empirical Risk Minimization with a Finite Hypothesis Class

Here we will prove a negative result in a setting similar to Theorem 2.1. First, some necessary definitions.

Definition 3.7. The multivariate normal distribution of a random vector $x \in \mathbb{R}^k$ with mean $\mu \in \mathbb{R}^k$ and (symmetric and positive definite) covariance $\Sigma \in \mathbb{R}^{k \times k}$ is defined by the probability density function:

$$p(x) = \frac{1}{\sqrt{(2\pi)^k \det \Sigma}}\, e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}$$

For shortness, we write $x \sim \mathcal{N}(\mu, \Sigma)$. Let $\mathcal{N}_1 = \mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}_2 = \mathcal{N}(\mu_2, \Sigma_2)$; then:

$$KL(\mathcal{N}_1 \| \mathcal{N}_2) = \frac{1}{2} \left( \mathrm{tr}(\Sigma_2^{-1} \Sigma_1) + (\mu_2 - \mu_1)^T \Sigma_2^{-1} (\mu_2 - \mu_1) - k + \log \frac{\det \Sigma_2}{\det \Sigma_1} \right)$$

Note that when $\Sigma_1 = \Sigma_2 = I$, the KL divergence becomes:

$$KL(\mathcal{N}_1 \| \mathcal{N}_2) = \frac{1}{2} \|\mu_2 - \mu_1\|_2^2 \qquad (3)$$

(The proof of the above might be left for homework very soon.)
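Here is a short sketch of the Gaussian KL formula above (not part of the notes): it implements the closed form and confirms the special case of eq. (3) for identity covariances; the means are arbitrary illustrative choices.

```python
import numpy as np

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, Sigma1) || N(mu2, Sigma2)) via the closed form quoted above."""
    k = len(mu1)
    inv2 = np.linalg.inv(sigma2)
    diff = mu2 - mu1
    _, logdet1 = np.linalg.slogdet(sigma1)
    _, logdet2 = np.linalg.slogdet(sigma2)
    return 0.5 * (np.trace(inv2 @ sigma1) + diff @ inv2 @ diff - k
                  + logdet2 - logdet1)

# Hypothetical means; identity covariances recover eq. (3): KL = 0.5 ||mu2 - mu1||^2
mu1 = np.array([1.0, 0.0, 0.0])
mu2 = np.array([0.0, 1.0, 0.0])
I3 = np.eye(3)
assert np.isclose(kl_gaussian(mu1, I3, mu2, I3),
                  0.5 * np.sum((mu2 - mu1) ** 2))
print(kl_gaussian(mu1, I3, mu2, I3))   # 0.5 * 2 = 1.0
```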

Next, we show our negative result. As we mentioned before, our main goal will be to upper-bound the mutual information $I(f, S)$ in order to apply Fano's inequality.

Theorem 3.6. Assume that nature picks a true hypothesis $f$ from some distribution of hypotheses with support $\mathcal{F}$, where $|\mathcal{F}| = k$. Then, a dataset of $n$ samples is produced, conditioned on the choice of $f$. The learner then infers $\hat{f}$ from the dataset. Under the same setting as in Theorem 2.1, there exists a specific prediction problem and data distribution such that if $n \le \frac{\log k - 2\log 2}{2}$, then learning fails, i.e., $P[\hat{f} \ne f] \ge 1/2$ for any mechanism (or algorithm) that a learner could use for picking $\hat{f}$.

Proof. Recall that in Theorem 2.1, we assume that $\mathcal{F}$ is a finite set of hypotheses, i.e., $\mathcal{F} = \{f_1, \dots, f_k\}$ where $k < +\infty$ and $(\forall j)\; f_j : \mathcal{X} \to \mathcal{Y}$. Here, we further assume that $\mathcal{X} = \mathbb{R}^k$ and $\mathcal{Y} = \{-1, +1\}$, and that $f_j(x)$ is the sign of the $j$-th element of the $k$-dimensional vector $x$, i.e., $f_j(x) = \mathrm{sgn}(x_j)$. (For clarity, we are now using a super-index for the sample index and a sub-index for the vector entry.)

Assume that nature picks a true hypothesis $f$ uniformly at random from $\mathcal{F}$. Then, a dataset $S = (x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})$ of $n$ samples is produced, conditioned on the choice of $f$. We assume that $P[y = +1 \mid f = f_j] = P[y = -1 \mid f = f_j] = 1/2$. We also assume that $x \mid y = +1, f = f_j \sim \mathcal{N}(\mu^{(j)}, I)$ and $x \mid y = -1, f = f_j \sim \mathcal{N}(-\mu^{(j)}, I)$, where $\mu^{(j)}_i = 1[i = j]$. That is, if $f = f_j$ then every hypothesis $f' \in \mathcal{F} \setminus \{f_j\}$ has on average a 50% risk, since the sign of a normal random variable with mean zero is either $+1$ or $-1$ with 50% probability.

Note that for any pair of distributions $P_{xy}$ and $Q_{xy}$ where $p_y(+1) = p_y(-1) = q_y(+1) = q_y(-1) = 1/2$, we have:

$$\begin{aligned}
KL(P_{xy} \| Q_{xy}) &= \sum_{y \in \{-1,+1\}} \int_{x \in \mathcal{X}} p_y(y)\, p_{x|y}(x) \log \frac{p_y(y)\, p_{x|y}(x)}{q_y(y)\, q_{x|y}(x)}\, dx \\
&= \frac{1}{2} \left( \int_{x \in \mathcal{X}} p_{x|y=+1}(x) \log \frac{p_{x|y=+1}(x)}{q_{x|y=+1}(x)}\, dx + \int_{x \in \mathcal{X}} p_{x|y=-1}(x) \log \frac{p_{x|y=-1}(x)}{q_{x|y=-1}(x)}\, dx \right) \\
&= \frac{1}{2} \left( KL(P_{x|y=+1} \| Q_{x|y=+1}) + KL(P_{x|y=-1} \| Q_{x|y=-1}) \right) \qquad (4)
\end{aligned}$$

By eq. (2); by eq. (1), since $S$ is a dataset of $n$ independent samples $(x^{(i)}, y^{(i)})$ for $i = 1, \dots, n$; by eq. (4); and by eq. (3), since $x \mid y$ is normally distributed, we have:

$$\begin{aligned}
I(f, S) &\le \frac{1}{k^2} \sum_{j=1}^{k} \sum_{j'=1}^{k} KL(P_{S|f_j} \| P_{S|f_{j'}}) \\
&= \frac{n}{k^2} \sum_{j=1}^{k} \sum_{j'=1}^{k} KL(P_{x,y|f_j} \| P_{x,y|f_{j'}}) \\
&= \frac{n}{2k^2} \sum_{j=1}^{k} \sum_{j'=1}^{k} \left( KL(P_{x|f_j, y=+1} \| P_{x|f_{j'}, y=+1}) + KL(P_{x|f_j, y=-1} \| P_{x|f_{j'}, y=-1}) \right) \\
&= \frac{n}{2k^2} \sum_{j=1}^{k} \sum_{j'=1}^{k} \left( KL(\mathcal{N}(\mu^{(j)}, I) \| \mathcal{N}(\mu^{(j')}, I)) + KL(\mathcal{N}(-\mu^{(j)}, I) \| \mathcal{N}(-\mu^{(j')}, I)) \right) \\
&= \frac{n}{2k^2} \sum_{j=1}^{k} \sum_{j'=1}^{k} \left( \frac{1}{2} \|\mu^{(j)} - \mu^{(j')}\|_2^2 + \frac{1}{2} \|\mu^{(j)} - \mu^{(j')}\|_2^2 \right) \\
&= \frac{n}{2k^2} \sum_{j=1}^{k} \sum_{j'=1}^{k} \|\mu^{(j)} - \mu^{(j')}\|_2^2 = \frac{n}{2k^2} \sum_{j=1}^{k} \sum_{j'=1}^{k} 2 \cdot 1[j \ne j'] = \frac{n(k^2 - k)}{k^2} \le n
\end{aligned}$$

By Corollary 3.1, we have:

$$P[\hat{f} \ne f] \ge 1 - \frac{I(f, S) + \log 2}{\log k} \ge 1 - \frac{n + \log 2}{\log k}$$

By solving for $n$ in the above, we obtain that if $n \le \frac{\log k - 2\log 2}{2}$, then the right-hand side is at least $1/2$, and thus $P[\hat{f} \ne f] \ge 1/2$.
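The resulting sample-size threshold can be tabulated directly. The sketch below (illustrative only) evaluates the Fano lower bound with $I(f, S) \le n(k^2 - k)/k^2$ for a few values of $k$, confirming that for $n$ below $(\log k - 2\log 2)/2$ the error probability is at least $1/2$.

```python
import numpy as np

def error_lower_bound(n, k):
    """Fano lower bound for the Gaussian construction: I(f,S) <= n(k^2 - k)/k^2."""
    I_upper = n * (k ** 2 - k) / k ** 2
    return 1.0 - (I_upper + np.log(2)) / np.log(k)

for k in [64, 1024, 10 ** 6]:
    n_threshold = (np.log(k) - 2 * np.log(2)) / 2   # n below this => error >= 1/2
    n = int(np.floor(n_threshold))
    print(k, n_threshold, error_lower_bound(n, k))
    assert error_lower_bound(n, k) >= 0.5
```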

References

[1] T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, 2nd edition, 2006.
