Hands-On Learning Theory Fall 2016, Lecture 3


Jean Honorio (jhonorio@purdue.edu)

1 Information Theory

First, we provide some information theory background.

Definition 3.1 (Entropy). The entropy of a discrete random variable $x$ with support $\mathcal{X}$ and probability mass function $p$ is defined as:

$$H(x) = -\sum_{x \in \mathcal{X}} p(x) \log p(x)$$

A basic property of the entropy of a discrete random variable $x$ is that:

$$0 \le H(x) \le \log |\mathcal{X}|$$

In fact, the entropy is maximal for the discrete uniform distribution. That is, $(\forall x \in \mathcal{X})\; p(x) = 1/|\mathcal{X}|$, in which case $H(x) = \log |\mathcal{X}|$.

Definition 3.2 (Conditional entropy). The conditional entropy of $y$ given $x$ is defined as:

$$H(y \mid x) = \sum_{v \in \mathcal{X}} p_x(v)\, H(y \mid x = v) = -\sum_{v \in \mathcal{X}} p_x(v) \sum_{y \in \mathcal{Y}} p_{y|x}(y \mid v) \log p_{y|x}(y \mid v) = -\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p_{xy}(x, y) \log p_{y|x}(y \mid x)$$

Theorem 3.1 (Chain rule for entropy).

$$H(x, y) = H(x) + H(y \mid x)$$

Similarly:

$$H(x, y \mid z) = H(x \mid z) + H(y \mid x, z)$$

(See [1] if interested in the proof.)
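As a quick numerical sanity check (not part of the original notes), the following Python sketch computes entropies for a small, hypothetical joint pmf `p_xy` and verifies the chain rule $H(x, y) = H(x) + H(y \mid x)$ as well as the bound $H(x) \le \log |\mathcal{X}|$; the pmf values are arbitrary illustrative choices.

```python
import numpy as np

def entropy(p):
    """Entropy H(x) = -sum_x p(x) log p(x), ignoring zero-probability entries."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Hypothetical joint pmf of (x, y) on a 2 x 3 support (sums to 1).
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.05, 0.30]])

p_x = p_xy.sum(axis=1)            # marginal of x
p_y = p_xy.sum(axis=0)            # marginal of y

# Conditional entropy H(y|x) = -sum_{x,y} p(x,y) log p(y|x)
H_y_given_x = -np.sum(p_xy * np.log(p_xy / p_x[:, None]))

# Chain rule: H(x, y) = H(x) + H(y|x)
assert np.isclose(entropy(p_xy.ravel()), entropy(p_x) + H_y_given_x)

# Entropy is at most that of the uniform distribution: H(x) <= log |X|
assert entropy(p_x) <= np.log(len(p_x)) + 1e-12
print(entropy(p_x), np.log(len(p_x)))
```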

Definition 3.3 (Mutual information).

$$I(x, y) = \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p_{xy}(x, y) \log \frac{p_{xy}(x, y)}{p_x(x)\, p_y(y)}$$

A basic property of the mutual information of random variables $x$ and $y$ is that:

$$I(x, y) \ge 0$$

Furthermore, the mutual information can be expressed in terms of the entropy:

$$I(x, y) = H(x) - H(x \mid y)$$

Note that random variables $x$ and $y$ are independent if and only if $I(x, y) = 0$.

Theorem 3.2 (Conditioning reduces entropy).

$$H(x \mid y) \le H(x)$$

Proof. $0 \le I(x, y) = H(x) - H(x \mid y)$.

Definition 3.4 (Conditional mutual information).

$$I(x, y \mid z) = H(x \mid z) - H(x \mid y, z)$$

Theorem 3.3 (Chain rule for mutual information).

$$I(x, (y, z)) = I(x, y) + I(x, z \mid y)$$

(See [1] if interested in the proof.)

Definition 3.5 (Markov chain). Random variables $x$, $y$ and $z$ are said to form a Markov chain $x \to y \to z$ if and only if their joint probability distribution can be written as:

$$p_{xyz}(x, y, z) = p_x(x)\, p_{y|x}(y \mid x)\, p_{z|y}(z \mid y)$$

Equivalently, random variables $x$, $y$ and $z$ form a Markov chain $x \to y \to z$ if and only if $x$ and $z$ are conditionally independent given $y$, and thus $I(x, z \mid y) = 0$.

2 Data Processing Inequality

The way to understand the following inequality in an intuitive fashion is as follows. Assume there is a random variable $x$. Then we process $x$ by possibly adding some additional randomness (e.g., noise) and/or losing some information (e.g., rounding) and we obtain a variable $y$. Similarly, we process $y$ by possibly adding some additional randomness and/or losing some information and we obtain a variable $z$. There is likely more information about $x$ in $y$ than in $z$.
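This intuition can be checked numerically. The sketch below (a toy example of my own, not from the notes) builds a Markov chain $x \to y \to z$ from a hypothetical prior `p_x` and hypothetical transition matrices, and verifies that $I(x, y) \ge I(x, z)$ (the data processing inequality, formalized as Theorem 3.4 below), that mutual information is nonnegative, and that conditioning reduces entropy.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(p_xy):
    """I(x,y) = sum_{x,y} p(x,y) log [ p(x,y) / (p(x) p(y)) ]."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log((p_xy / (p_x * p_y))[mask]))

# Markov chain x -> y -> z: p(x,y,z) = p(x) p(y|x) p(z|y).
p_x = np.array([0.3, 0.7])
p_y_given_x = np.array([[0.9, 0.1],     # rows: x, columns: y
                        [0.2, 0.8]])
p_z_given_y = np.array([[0.8, 0.2],     # z is a noisy copy of y
                        [0.3, 0.7]])

p_xy = p_x[:, None] * p_y_given_x
p_xz = p_xy @ p_z_given_y               # marginalize out y

I_xy = mutual_information(p_xy)
I_xz = mutual_information(p_xz)
print(I_xy, I_xz)
assert I_xy >= 0 and I_xz >= 0          # mutual information is nonnegative
assert I_xy >= I_xz - 1e-12             # more information about x in y than in z

# Conditioning reduces entropy: H(x|y) = H(x) - I(x,y) <= H(x)
H_x = entropy(p_x)
assert H_x - I_xy <= H_x + 1e-12
```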

Theorem 3.4 (Data processing inequality). If $x \to y \to z$ then $I(x, y) \ge I(x, z)$, or equivalently $H(x \mid y) \le H(x \mid z)$.

Proof. By using the chain rule for mutual information (Theorem 3.3), we have:

$$I(x, (y, z)) = I(x, y) + I(x, z \mid y) = I(x, z) + I(x, y \mid z)$$

Since $I(x, z \mid y) = 0$ and since $I(x, y \mid z) \ge 0$, we get:

$$I(x, y) \ge I(x, z)$$

The second statement follows immediately since:

$$H(x) - H(x \mid y) = I(x, y) \ge I(x, z) = H(x) - H(x \mid z)$$

3 Fano's inequality

Fano's inequality allows us to derive information-theoretic lower bounds on the sample complexity. The setting for the analysis is as follows. Nature picks a true hypothesis $f$ from some distribution of hypotheses. Then, a dataset $S$ of $n$ samples is produced, conditioned on the choice of $f$. The learner then infers $\hat{f}$ from the dataset $S$. The probability of error of the learner is given by $P[\hat{f} \ne f]$. By lower-bounding this probability of error, one can find the necessary number of samples for learning. (Analyses as in the previous lecture allow us to find a sufficient number of samples.)

Theorem 3.5 (Fano's inequality). For any estimator $\hat{f}$ with $k$ possible outcomes, such that $f \to S \to \hat{f}$, we have:

$$P[\hat{f} \ne f] \ge \frac{H(f \mid S) - \log 2}{\log k}$$

Proof. Define the following error random variable:

$$w = \begin{cases} 1 & \text{if } \hat{f} \ne f \\ 0 & \text{if } \hat{f} = f \end{cases}$$

First, we analyze some quantities that will be useful later. Since conditioning reduces entropy (Theorem 3.2), we have that $H(w \mid \hat{f}) \le H(w) \le \log 2$. Note that $H(f \mid w = 0, \hat{f}) = 0$, since if $w = 0$ then $f = \hat{f}$. Also, note that since conditioning reduces entropy (Theorem 3.2), $H(f \mid w = 1, \hat{f}) \le H(f) \le \log k$. By Definition 3.2, we have:

$$H(f \mid w, \hat{f}) = P[w = 0]\, H(f \mid w = 0, \hat{f}) + P[w = 1]\, H(f \mid w = 1, \hat{f}) \le P[w = 1] \log k = P[\hat{f} \ne f] \log k$$

Finally, since $w$ is a deterministic function of $f$ and $\hat{f}$, we have $H(w \mid f, \hat{f}) = 0$. By using the chain rule for entropy (Theorem 3.1) in two different ways, we have:

$$H(w, f \mid \hat{f}) = H(f \mid \hat{f}) + H(w \mid f, \hat{f}) = H(w \mid \hat{f}) + H(f \mid w, \hat{f})$$

and thus, from the conclusions at the beginning of the proof, we have:

$$H(f \mid \hat{f}) = H(w \mid \hat{f}) + H(f \mid w, \hat{f}) - H(w \mid f, \hat{f}) \le \log 2 + P[\hat{f} \ne f] \log k - 0$$

By the data processing inequality (Theorem 3.4), and since $f \to S \to \hat{f}$ is a Markov chain, we have $H(f \mid S) \le H(f \mid \hat{f})$, and thus the above implies:

$$H(f \mid S) \le \log 2 + P[\hat{f} \ne f] \log k$$

By rearranging the terms, we prove our claim.

Corollary 3.1 (Fano's inequality). For any estimator $\hat{f}$ with $k$ possible outcomes, such that $f \to S \to \hat{f}$, where $f$ is chosen by nature uniformly at random (also from $k$ possible outcomes), we have:

$$P[\hat{f} \ne f] \ge 1 - \frac{I(f, S) + \log 2}{\log k}$$

Proof. By a property of the mutual information, we have $H(f \mid S) = H(f) - I(f, S)$. Since $f$ is chosen uniformly at random from $k$ possible outcomes, we have $H(f) = \log k$, and we prove our claim.

The key in using Fano's inequality is to define a hypothesis class $\mathcal{F}$ for which $k = |\mathcal{F}|$ is large, while the mutual information $I(f, S)$ is small and of order $n$.
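To illustrate Theorem 3.5 and Corollary 3.1, here is a small numerical sketch (not part of the original notes). It uses a hypothetical discrete channel `p_S_given_f`, a uniform prior over $k = 3$ hypotheses, and the identity estimator $\hat{f}(S) = S$, which is a deterministic function of $S$, so $f \to S \to \hat{f}$ holds; all numbers are arbitrary illustrative choices.

```python
import numpy as np

k = 3
p_f = np.full(k, 1.0 / k)                       # nature picks f uniformly at random
# Hypothetical channel p(S|f): S is a noisy observation of f.
p_S_given_f = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.2, 0.2, 0.6]])

p_fS = p_f[:, None] * p_S_given_f               # joint p(f, S)
p_S = p_fS.sum(axis=0)

# H(f|S) = -sum_{f,S} p(f,S) log p(f|S)
H_f_given_S = -np.sum(p_fS * np.log(p_fS / p_S[None, :]))
# I(f,S) = H(f) - H(f|S), with H(f) = log k for the uniform prior
I_fS = np.log(k) - H_f_given_S

# Estimator f_hat(S) = S; its error probability is 1 minus the mass on the diagonal.
P_error = 1.0 - np.sum(np.diag(p_fS))

theorem_bound = (H_f_given_S - np.log(2)) / np.log(k)       # Theorem 3.5
corollary_bound = 1.0 - (I_fS + np.log(2)) / np.log(k)      # Corollary 3.1
assert np.isclose(theorem_bound, corollary_bound)           # identical for uniform f
assert P_error >= theorem_bound
print(P_error, theorem_bound)
```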

4 Upper Bounds on the Mutual Information

One key step in the application of Fano's inequality is to upper-bound the mutual information $I(f, S)$. Next, we review some important definitions and inequalities from information theory.

Definition 3.6 (Kullback-Leibler (KL) divergence). Assume that a random variable $x$ has support $\mathcal{X}$. Assume that there are two probability density functions $p$ and $q$, which define two probability distributions $P = p(\cdot)$ and $Q = q(\cdot)$ respectively. The KL divergence is defined as:

$$KL(P \| Q) = \int_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}\, dx$$

One important property of the KL divergence for independent random variables is the following. Let $P_{xy} = p_{xy}(\cdot)$ and $P_x P_y = p_x(\cdot)\, p_y(\cdot)$. Assume that $x$ and $y$ are independent, and thus $P_{xy} = P_x P_y$; likewise, assume that $Q_{xy} = Q_x Q_y$. We have:

$$KL(P_{xy} \| Q_{xy}) = KL(P_x \| Q_x) + KL(P_y \| Q_y) \qquad (1)$$

(The proof of the above might be left for homework very soon.)

Let $P_{xy} = p_{xy}(\cdot)$ and $P_x P_y = p_x(\cdot)\, p_y(\cdot)$. We can define the mutual information as:

$$I(x, y) = KL(P_{xy} \| P_x P_y) = \int_{x \in \mathcal{X}, y \in \mathcal{Y}} p_{xy}(x, y) \log \frac{p_{xy}(x, y)}{p_x(x)\, p_y(y)}\, dx\, dy$$

By the well-known identities $p_{f,S}(f, S) = p_f(f)\, p_{S|f}(S \mid f)$ and $p_S(S) = \sum_{f \in \mathcal{F}} p_{f,S}(f, S)$, and since $f$ follows a uniform distribution $p_f(f) = 1/k$, we have:

$$I(f, S) = \int_S \sum_{f \in \mathcal{F}} p_{f,S}(f, S) \log \frac{p_{f,S}(f, S)}{p_f(f)\, p_S(S)}\, dS = \int_S \sum_{f \in \mathcal{F}} p_f(f)\, p_{S|f}(S \mid f) \log \frac{p_f(f)\, p_{S|f}(S \mid f)}{p_f(f)\, p_S(S)}\, dS = \frac{1}{k} \sum_{f \in \mathcal{F}} \int_S p_{S|f}(S \mid f) \log \frac{p_{S|f}(S \mid f)}{p_S(S)}\, dS = \frac{1}{k} \sum_{f \in \mathcal{F}} KL(P_{S|f} \| P_S)$$

In the above, we use the distribution $P_{S|f} = p_{S|f}(\cdot \mid f)$ as well as the distribution $P_S$ with density $p_S(S) = \frac{1}{k} \sum_{f \in \mathcal{F}} p_{S|f}(S \mid f)$.

Furthermore, from the convexity of the KL divergence, we can show that:

$$I(f, S) \le \frac{1}{k^2} \sum_{f \in \mathcal{F}} \sum_{f' \in \mathcal{F}} KL(P_{S|f} \| P_{S|f'}) \qquad (2)$$

(The proof of the above might be left for homework very soon.)
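Equations (1) and (2) can also be verified numerically on discrete toy distributions. The sketch below (illustrative only) checks the additivity of the KL divergence over independent coordinates and the convexity bound of eq. (2), reusing the same hypothetical channel as in the previous sketch.

```python
import numpy as np

def kl(p, q):
    """Discrete sketch of Definition 3.6: KL(P || Q) = sum_x p(x) log [p(x)/q(x)]."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# Eq. (1): KL is additive across independent coordinates.
p1, q1 = np.array([0.6, 0.4]), np.array([0.5, 0.5])
p2, q2 = np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.3, 0.3])
kl_product = kl(np.outer(p1, p2).ravel(), np.outer(q1, q2).ravel())
assert np.isclose(kl_product, kl(p1, q1) + kl(p2, q2))

# I(f,S) as an average KL to the mixture, and the convexity bound of eq. (2).
k = 3
p_S_given_f = np.array([[0.7, 0.2, 0.1],      # same hypothetical channel as above
                        [0.1, 0.8, 0.1],
                        [0.2, 0.2, 0.6]])
p_S_bar = p_S_given_f.mean(axis=0)            # P_S = (1/k) sum_f P_{S|f}

I_fS = np.mean([kl(p_S_given_f[j], p_S_bar) for j in range(k)])
pairwise = np.mean([kl(p_S_given_f[j], p_S_given_f[jp])
                    for j in range(k) for jp in range(k)])
print(I_fS, pairwise)
assert I_fS <= pairwise + 1e-12               # eq. (2)
```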

5 Application: Empirical Risk Minimization with a Finite Hypothesis Class

Here we will prove a negative result in a setting similar to Theorem 2.1. First, some necessary definitions.

Definition 3.7. The multivariate normal distribution of a random vector $x \in \mathbb{R}^k$ with mean $\mu \in \mathbb{R}^k$ and (symmetric and positive definite) covariance $\Sigma \in \mathbb{R}^{k \times k}$ is defined by the probability density function:

$$p(x) = \frac{1}{\sqrt{(2\pi)^k \det \Sigma}}\, e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}$$

For shortness, we write $x \sim \mathcal{N}(\mu, \Sigma)$. Let $\mathcal{N}_1 = \mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}_2 = \mathcal{N}(\mu_2, \Sigma_2)$; then:

$$KL(\mathcal{N}_1 \| \mathcal{N}_2) = \frac{1}{2} \left( \mathrm{tr}(\Sigma_2^{-1} \Sigma_1) + (\mu_2 - \mu_1)^T \Sigma_2^{-1} (\mu_2 - \mu_1) - k + \log \frac{\det \Sigma_2}{\det \Sigma_1} \right)$$

Note that when $\Sigma_1 = \Sigma_2 = I$, the KL divergence becomes:

$$KL(\mathcal{N}_1 \| \mathcal{N}_2) = \frac{1}{2} \|\mu_2 - \mu_1\|_2^2 \qquad (3)$$

(The proof of the above might be left for homework very soon.)
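Here is a short sketch of the Gaussian KL formula above (not part of the notes): it implements the closed form and confirms the special case of eq. (3) for identity covariances; the means are arbitrary illustrative choices.

```python
import numpy as np

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, Sigma1) || N(mu2, Sigma2)) via the closed form quoted above."""
    k = len(mu1)
    inv2 = np.linalg.inv(sigma2)
    diff = mu2 - mu1
    _, logdet1 = np.linalg.slogdet(sigma1)
    _, logdet2 = np.linalg.slogdet(sigma2)
    return 0.5 * (np.trace(inv2 @ sigma1) + diff @ inv2 @ diff - k
                  + logdet2 - logdet1)

# Hypothetical means; identity covariances recover eq. (3): KL = 0.5 ||mu2 - mu1||^2
mu1 = np.array([1.0, 0.0, 0.0])
mu2 = np.array([0.0, 1.0, 0.0])
I3 = np.eye(3)
assert np.isclose(kl_gaussian(mu1, I3, mu2, I3),
                  0.5 * np.sum((mu2 - mu1) ** 2))
print(kl_gaussian(mu1, I3, mu2, I3))   # 0.5 * 2 = 1.0
```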

Next, we show our negative result. As we mentioned before, our main goal will be to upper-bound the mutual information $I(f, S)$ in order to apply Fano's inequality.

Theorem 3.6. Assume that nature picks a true hypothesis $f$ from some distribution of hypotheses with support $\mathcal{F}$, where $|\mathcal{F}| = k$. Then, a dataset of $n$ samples is produced, conditioned on the choice of $f$. The learner then infers $\hat{f}$ from the dataset. Under the same setting as in Theorem 2.1, there exists a specific prediction problem and data distribution such that if $n \le \frac{\log k - 2\log 2}{2}$, then learning fails, i.e., $P[\hat{f} \ne f] \ge 1/2$ for any mechanism (or algorithm) that a learner could use for picking $\hat{f}$.

Proof. Recall that in Theorem 2.1, we assume that $\mathcal{F}$ is a finite set of hypotheses, i.e., $\mathcal{F} = \{f_1, \dots, f_k\}$ where $k < +\infty$ and $(\forall j)\; f_j : \mathcal{X} \to \mathcal{Y}$. Here, we further assume that $\mathcal{X} = \mathbb{R}^k$ and $\mathcal{Y} = \{-1, +1\}$, and that $f_j(x)$ is the sign of the $j$-th element of the $k$-dimensional vector $x$, i.e., $f_j(x) = \mathrm{sgn}(x_j)$. (For clarity, we are now using a super-index for the sample index and a sub-index for the vector entry.)

Assume that nature picks a true hypothesis $f$ uniformly at random from $\mathcal{F}$. Then, a dataset $S = (x^{(1)}, y^{(1)}), \dots, (x^{(n)}, y^{(n)})$ of $n$ samples is produced, conditioned on the choice of $f$. We assume that $P[y = +1 \mid f = f_j] = P[y = -1 \mid f = f_j] = 1/2$. We also assume that $x \mid y = +1, f = f_j \sim \mathcal{N}(\mu^{(j)}, I)$ and $x \mid y = -1, f = f_j \sim \mathcal{N}(-\mu^{(j)}, I)$, where $\mu^{(j)}_i = 1[i = j]$. That is, if $f = f_j$ then every hypothesis $f' \in \mathcal{F} \setminus \{f_j\}$ has on average a 50% risk, since the sign of a normal random variable with mean zero is either $+1$ or $-1$ with 50% probability.

Note that for any pair of distributions $P_{xy}$ and $Q_{xy}$ where $p_y(+1) = p_y(-1) = q_y(+1) = q_y(-1) = 1/2$, we have:

$$\begin{aligned}
KL(P_{xy} \| Q_{xy}) &= \sum_{y \in \{-1,+1\}} \int_{x \in \mathcal{X}} p_y(y)\, p_{x|y}(x) \log \frac{p_y(y)\, p_{x|y}(x)}{q_y(y)\, q_{x|y}(x)}\, dx \\
&= \frac{1}{2} \left( \int_{x \in \mathcal{X}} p_{x|y=+1}(x) \log \frac{p_{x|y=+1}(x)}{q_{x|y=+1}(x)}\, dx + \int_{x \in \mathcal{X}} p_{x|y=-1}(x) \log \frac{p_{x|y=-1}(x)}{q_{x|y=-1}(x)}\, dx \right) \\
&= \frac{1}{2} \left( KL(P_{x|y=+1} \| Q_{x|y=+1}) + KL(P_{x|y=-1} \| Q_{x|y=-1}) \right) \qquad (4)
\end{aligned}$$

By eq. (2); by eq. (1), since $S$ is a dataset of $n$ independent samples $(x^{(i)}, y^{(i)})$ for $i = 1, \dots, n$; by eq. (4); and by eq. (3), since $x \mid y$ is normally distributed, we have:

$$\begin{aligned}
I(f, S) &\le \frac{1}{k^2} \sum_{j=1}^{k} \sum_{j'=1}^{k} KL(P_{S|f_j} \| P_{S|f_{j'}}) \\
&= \frac{n}{k^2} \sum_{j=1}^{k} \sum_{j'=1}^{k} KL(P_{x,y|f_j} \| P_{x,y|f_{j'}}) \\
&= \frac{n}{2k^2} \sum_{j=1}^{k} \sum_{j'=1}^{k} \left( KL(P_{x|f_j, y=+1} \| P_{x|f_{j'}, y=+1}) + KL(P_{x|f_j, y=-1} \| P_{x|f_{j'}, y=-1}) \right) \\
&= \frac{n}{2k^2} \sum_{j=1}^{k} \sum_{j'=1}^{k} \left( KL(\mathcal{N}(\mu^{(j)}, I) \| \mathcal{N}(\mu^{(j')}, I)) + KL(\mathcal{N}(-\mu^{(j)}, I) \| \mathcal{N}(-\mu^{(j')}, I)) \right) \\
&= \frac{n}{2k^2} \sum_{j=1}^{k} \sum_{j'=1}^{k} \left( \frac{1}{2} \|\mu^{(j)} - \mu^{(j')}\|_2^2 + \frac{1}{2} \|\mu^{(j)} - \mu^{(j')}\|_2^2 \right) \\
&= \frac{n}{2k^2} \sum_{j=1}^{k} \sum_{j'=1}^{k} \|\mu^{(j)} - \mu^{(j')}\|_2^2 = \frac{n}{2k^2} \sum_{j=1}^{k} \sum_{j'=1}^{k} 2 \cdot 1[j \ne j'] = \frac{n(k^2 - k)}{k^2} \le n
\end{aligned}$$

By Corollary 3.1, we have:

$$P[\hat{f} \ne f] \ge 1 - \frac{I(f, S) + \log 2}{\log k} \ge 1 - \frac{n + \log 2}{\log k}$$

By solving for $n$ in the above, we obtain that if $n \le \frac{\log k - 2\log 2}{2}$, then the right-hand side is at least $1/2$, and thus $P[\hat{f} \ne f] \ge 1/2$.
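The resulting sample-size threshold can be tabulated directly. The sketch below (illustrative only) evaluates the Fano lower bound with $I(f, S) \le n(k^2 - k)/k^2$ for a few values of $k$, confirming that for $n$ below $(\log k - 2\log 2)/2$ the error probability is at least $1/2$.

```python
import numpy as np

def error_lower_bound(n, k):
    """Fano lower bound for the Gaussian construction: I(f,S) <= n(k^2 - k)/k^2."""
    I_upper = n * (k ** 2 - k) / k ** 2
    return 1.0 - (I_upper + np.log(2)) / np.log(k)

for k in [64, 1024, 10 ** 6]:
    n_threshold = (np.log(k) - 2 * np.log(2)) / 2   # n below this => error >= 1/2
    n = int(np.floor(n_threshold))
    print(k, n_threshold, error_lower_bound(n, k))
    assert error_lower_bound(n, k) >= 0.5
```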

References

[1] T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, 2nd edition, 2006.
