Machine Learning Theory Lecture 4


Nicholas Harvey
October 9

1 Basic Probability

One of the first concentration bounds that you learn in probability theory is Markov's inequality. It bounds the right tail of a random variable, using very few assumptions.

Theorem 1.1 (Markov's Inequality). Let $Y$ be a real-valued random variable that assumes only nonnegative values. Then, for all $a > 0$,
$$\Pr[Y \ge a] \le \frac{E[Y]}{a}.$$
References: Wikipedia; [1, Equation B.3]; Grimmett-Stirzaker, Lemma 7.2.7; Durrett.

1.1 Unions and Intersections

Another useful tool is the union bound. We typically use it to show that no bad events should happen.

Fact 1.2 (Union Bound). Let $E_1$ and $E_2$ be arbitrary events, not necessarily independent. Then
$$\Pr[E_1 \cup E_2] \le \Pr[E_1] + \Pr[E_2].$$

Often we use it in the reverse direction, to show that all good events should happen.

Fact 1.3 (Reverse Union Bound). Let $F_1$ and $F_2$ be arbitrary events, not necessarily independent. Suppose that $\Pr[F_1] \ge 1 - p_1$ and $\Pr[F_2] \ge 1 - p_2$. Then
$$\Pr[F_1 \cap F_2] \ge 1 - (p_1 + p_2).$$

Proof. Let $E_i = \overline{F_i}$. Then $\Pr[E_i] \le p_i$. Then
$$\begin{aligned}
\Pr[F_1 \cap F_2] &= 1 - \Pr[\overline{F_1 \cap F_2}] && \text{(complementary event)} \\
&= 1 - \Pr[\overline{F_1} \cup \overline{F_2}] && \text{(De Morgan's law)} \\
&= 1 - \Pr[E_1 \cup E_2] && \text{(definition of } E_i\text{)} \\
&\ge 1 - (p_1 + p_2) && \text{(union bound).}
\end{aligned}$$

Another useful trick concerns the union of independent events. If $F_1$ and $F_2$ are each likely to happen, and independent, then their union is even more likely to happen.
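Markov's inequality and the two union-bound facts above can be checked exactly on a small finite probability space. The following is a minimal sketch, not part of the original notes: the sample space (one roll of a fair die), the events `F1` and `F2`, and all helper names are hypothetical choices made for illustration.

```python
from fractions import Fraction

# Hypothetical sample space: one roll of a fair six-sided die.
OUTCOMES = range(1, 7)
P = Fraction(1, 6)  # probability of each outcome


def prob(event):
    """Exact probability of the outcomes w for which event(w) is true."""
    return sum(P for w in OUTCOMES if event(w))


def expectation(f):
    """Exact expectation of f(w) under the uniform distribution."""
    return sum(P * f(w) for w in OUTCOMES)


def markov_holds(a):
    """Check Pr[Y >= a] <= E[Y] / a for Y = face value (nonnegative)."""
    return prob(lambda w: w >= a) <= expectation(lambda w: w) / a


# Two "good" events (illustrative, not independent).
F1 = lambda w: w <= 5
F2 = lambda w: w >= 2
p1 = 1 - prob(F1)  # failure probability of F1
p2 = 1 - prob(F2)  # failure probability of F2

# Union bound on the bad events E_i = complement of F_i.
bad = prob(lambda w: (not F1(w)) or (not F2(w)))
union_holds = bad <= p1 + p2

# Reverse union bound on the good events.
both = prob(lambda w: F1(w) and F2(w))
reverse_union_holds = both >= 1 - (p1 + p2)

print(all(markov_holds(a) for a in range(1, 7)), union_holds, reverse_union_holds)
```

In this example the two bad events are disjoint, so both the union bound and the reverse union bound hold with equality. Exact rational arithmetic via `fractions.Fraction` avoids any floating-point doubt in that tight case.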

Fact 1.4 (Union Bound). Let $F_1$ and $F_2$ be independent events. Suppose that $\Pr[F_1] \ge 1 - p_1$ and $\Pr[F_2] \ge 1 - p_2$. Then
$$\Pr[F_1 \cup F_2] \ge 1 - p_1 p_2.$$

Proof. Observe that $\Pr[\overline{F_i}] \le p_i$. So
$$\begin{aligned}
\Pr[F_1 \cup F_2] &= 1 - \Pr[\overline{F_1} \cap \overline{F_2}] && \text{(De Morgan's law)} \\
&= 1 - \Pr[\overline{F_1}] \Pr[\overline{F_2}] && \text{(independence)} \\
&\ge 1 - p_1 p_2.
\end{aligned}$$

2 Hoeffding's Inequality

Theorem 2.1. Let $X_1, \ldots, X_n$ be independent random variables such that $X_i$ always lies in the interval $[0, 1]$. Define $X = \sum_{i=1}^n X_i$. Then
$$\Pr\big[\,|X - E[X]| \ge t\,\big] \le 2 \exp(-2t^2/n) \qquad \forall t \ge 0.$$
References: Wikipedia; [1, Lemma B.6].

We will prove a weaker result in which the constant in the exponent is decreased from 2 to 1/2.

Simplifications. First of all, we will center the random variables, which cleans up the inequality by eliminating the expectation. Define $\hat{X}_i = X_i - E[X_i]$ and $\hat{X} = \sum_{i=1}^n \hat{X}_i$. Note that $\hat{X}_i \in [-1, 1]$ (see footnote 1). Our main argument is to prove that
$$\Pr[\hat{X} \ge t] \le \exp(-t^2/2n). \tag{2.1}$$
The same argument also applies to $-\hat{X}$, so we get that
$$\Pr[\hat{X} \le -t] = \Pr[-\hat{X} \ge t] \le \exp(-t^2/2n).$$
Combining them with a union bound, we get
$$\Pr\big[\,|X - E[X]| \ge t\,\big] = \Pr\big[\,|\hat{X}| \ge t\,\big] \le \Pr[\hat{X} \ge t] + \Pr[\hat{X} \le -t] \le 2 \exp(-t^2/2n).$$
This proves the theorem (with the weaker exponent).

Proof (of (2.1)). The Hoeffding inequality crucially relies on mutual independence of the random variables $\hat{X}_i$. How can we exploit independence in the proof? What special properties do independent random variables have? One basic property is that
$$E[A \cdot B] = E[A] \cdot E[B] \tag{2.2}$$
for any independent random variables $A$ and $B$.

Footnote 1. This step is where the argument is not careful enough to obtain the optimal exponent: $\hat{X}_i$ is supported on an interval of length 1, not an interval of length 2.
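To get a feel for how the bound in Hoeffding's inequality and the weaker bound proved in these notes compare with the truth, one can compute an exact tail in a special case. The sketch below assumes each $X_i$ is a fair coin flip in $\{0, 1\}$, so that $X$ is Binomial$(n, 1/2)$; this choice of distribution and the function names are illustrative assumptions, not something fixed by the notes.

```python
import math


def exact_two_sided_tail(n, t):
    """Exact Pr[|X - n/2| >= t] for X ~ Binomial(n, 1/2)."""
    mean = n / 2
    favourable = sum(math.comb(n, k) for k in range(n + 1) if abs(k - mean) >= t)
    return favourable / 2**n


def hoeffding_bound(n, t):
    """Bound with the optimal constant: 2 exp(-2 t^2 / n)."""
    return 2 * math.exp(-2 * t**2 / n)


def weaker_bound(n, t):
    """Bound with the weaker constant: 2 exp(-t^2 / (2n))."""
    return 2 * math.exp(-t**2 / (2 * n))


if __name__ == "__main__":
    n = 100
    for t in (5, 10, 20):
        print(t, exact_two_sided_tail(n, t), hoeffding_bound(n, t), weaker_bound(n, t))
```

For every $t$ the exact tail sits below the sharp bound, which in turn sits below the weaker one. The gap between the two bounds grows quickly with $t$, which is why the constant in the exponent matters.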

Key Idea #1: The Hoeffding inequality has nothing to do with products of random variables; it is about sums of random variables. So one trick we could try is to convert sums into products using the exponential function. Fix some parameter $\lambda > 0$ whose value we will choose later. Define
$$Y_i = \exp(\lambda \hat{X}_i), \qquad Y = \exp(\lambda \hat{X}) = \exp\Big(\lambda \sum_{i=1}^n \hat{X}_i\Big) = \prod_{i=1}^n \exp(\lambda \hat{X}_i) = \prod_{i=1}^n Y_i.$$
It is easy to check that, since $\{X_1, \ldots, X_n\}$ is mutually independent, so are $\{\hat{X}_1, \ldots, \hat{X}_n\}$ and $\{Y_1, \ldots, Y_n\}$. Therefore, by (2.2),
$$E[Y] = \prod_{i=1}^n E[Y_i]. \tag{2.3}$$
So far this all seems quite good. We want to prove that $\hat{X}$ is small, which is equivalent to proving that $Y$ is small. Using (2.3), we can do this by showing that the $E[Y_i]$ terms are small. Doing so involves an extremely useful tool.

Key Idea #2: The second main idea is a clever trick to bound terms of the form $E[\exp(\lambda A)]$, where $A$ is a mean-zero random variable. We discuss this idea in more detail in the next subsection.

We will use Claim 2.2 to show
$$E[Y_i] = E[\exp(\lambda \hat{X}_i)] \le \exp(\lambda^2/2)$$
(see footnote 2). Thus, combining this with (2.3),
$$E[Y] \le \prod_{i=1}^n \exp(\lambda^2/2) = \exp(\lambda^2 n/2). \tag{2.4}$$
Now we are ready to prove Hoeffding's inequality:
$$\begin{aligned}
\Pr[\hat{X} \ge t] &= \Pr\big[\exp(\lambda \hat{X}) \ge \exp(\lambda t)\big] && \text{(by monotonicity of } e^x\text{)} \\
&\le \frac{E[\exp(\lambda \hat{X})]}{\exp(\lambda t)} && \text{(by Markov's inequality (Theorem 1.1))} \\
&= E[Y] \cdot \exp(-\lambda t) \\
&\le \exp(\lambda^2 n/2 - \lambda t) && \text{(by (2.4))} \\
&= \exp(-t^2/2n),
\end{aligned}$$
by optimizing to get $\lambda = t/n$.

Footnote 2. If we were more careful here and instead used Lemma 2.4, we could improve the exponent in (2.1) from 1/2 to 2.
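The final optimization step can be sanity-checked numerically. The sketch below minimizes the exponent $\lambda^2 n/2 - \lambda t$ over a grid and compares the result with the closed form $\lambda = t/n$. It also evaluates the exponent $\lambda^2 n/8 - \lambda t$, which is what one would get from Hoeffding's Lemma with an interval of length 1, as the footnote suggests. The values of `n` and `t`, the grid, and the function names are arbitrary illustrative choices.

```python
import math


def weak_exponent(lam, n, t):
    """Exponent from (2.4) plus Markov: lam^2 * n / 2 - lam * t."""
    return lam**2 * n / 2 - lam * t


def sharp_exponent(lam, n, t):
    """Exponent assuming E[Y_i] <= exp(lam^2 / 8), i.e. interval length 1."""
    return lam**2 * n / 8 - lam * t


n, t = 100, 15

# Closed-form minimizers, found by setting the derivative in lam to zero.
lam_weak = t / n
lam_sharp = 4 * t / n

# Brute-force check of the first minimizer over a grid of lam values in (0, 1].
grid = [k * 0.001 for k in range(1, 1001)]
lam_grid = min(grid, key=lambda lam: weak_exponent(lam, n, t))

print(lam_weak, lam_grid)
print(weak_exponent(lam_weak, n, t), -t**2 / (2 * n))
print(sharp_exponent(lam_sharp, n, t), -2 * t**2 / n)
```

The grid minimizer agrees with $t/n$, and the minimized exponents match $-t^2/2n$ and $-2t^2/n$ respectively, consistent with the constants 1/2 and 2 discussed in the notes.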

2.1 Exponentiated Mean-Zero RVs

The second main idea of Hoeffding's inequality is the following claim.

Claim 2.2. Let $A$ be a random variable such that $|A| \le 1$ with probability 1 and $E[A] = 0$. Then for any $\lambda > 0$, we have $E[\exp(\lambda A)] \le \exp(\lambda^2/2)$.

Intuitively, the expectation should be maximized by the random variable $A$ that is uniform on $\{-1, +1\}$. In this case,
$$E[\exp(\lambda A)] = \tfrac{1}{2} e^{\lambda} + \tfrac{1}{2} e^{-\lambda} \le e^{\lambda^2/2}.$$
This inequality is a nice bound on the hyperbolic cosine function (Claim 2.3). The full proof of Claim 2.2 basically reduces to the case of $A \in \{-1, 1\}$ using convexity of $e^x$.

Proof. Define $p = (1 + A)/2$ and $q = (1 - A)/2$. Observe that $p, q \ge 0$, $p + q = 1$, and $p - q = A$. By convexity,
$$\exp(\lambda A) = \exp\big(\lambda (p - q)\big) = \exp\big(\lambda p + (-\lambda) q\big) \le p \exp(\lambda) + q \exp(-\lambda) = \frac{e^{\lambda} + e^{-\lambda}}{2} + \frac{A}{2} (e^{\lambda} - e^{-\lambda}).$$
Thus,
$$E[\exp(\lambda A)] \le E\Big[\frac{e^{\lambda} + e^{-\lambda}}{2} + \frac{A}{2} (e^{\lambda} - e^{-\lambda})\Big] = \frac{e^{\lambda} + e^{-\lambda}}{2},$$
since $E[A] = 0$. This last quantity is bounded by the following technical claim.

Claim 2.3 (Approximation of Cosh). For any real $x$, we have $(e^x + e^{-x})/2 \le \exp(x^2/2)$.

A more general result can be found in Alon & Spencer, Lemma A.1.5.

Proof. First observe that the product of all the even numbers at most $2n$ does not exceed the product of all numbers at most $2n$. In symbols,
$$2^n (n!) = \prod_{i=1}^{n} (2i) \le \prod_{i=1}^{2n} i = (2n)!.$$
Now to bound $(e^x + e^{-x})/2$, we write it as a Taylor series and observe that the odd terms cancel:
$$\frac{e^x + e^{-x}}{2} = \frac{1}{2} \sum_{n \ge 0} \frac{x^n}{n!} + \frac{1}{2} \sum_{n \ge 0} \frac{(-x)^n}{n!} = \sum_{n \ge 0} \frac{x^{2n}}{(2n)!} \le \sum_{n \ge 0} \frac{x^{2n}}{2^n (n!)} = \sum_{n \ge 0} \frac{(x^2/2)^n}{n!} = \exp(x^2/2).$$

A common scenario is that $A$ is mean-zero, but lies in an asymmetric interval $[a, b]$, where $a < 0 < b$. A slightly tighter version of these MGF bounds can be derived for this scenario.

Lemma 2.4 (Hoeffding's Lemma). Let $A$ be a random variable such that $A \in [a, b]$ with probability 1 and $E[A] = 0$. Then for any $\lambda > 0$, we have
$$E[\exp(\lambda A)] \le \exp\big(\lambda^2 (b - a)^2 / 8\big).$$
References: Wikipedia; [1, Lemma B.7].

The proof uses ideas similar to the proof of Claim 2.2, except we cannot use Claim 2.3 and must instead use an ad-hoc calculus argument.
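The cosh bound and the two moment-generating-function bounds above are easy to spot-check numerically. The sketch below uses a hypothetical mean-zero random variable $A$ taking the value $-1$ with probability $1/3$ and $0.5$ with probability $2/3$, so $A \in [a, b]$ with $a = -1$ and $b = 0.5$. The distribution and all function names are illustrative assumptions, and a numerical check on a grid is of course not a proof.

```python
import math


def cosh_bound_holds(x):
    """Check (e^x + e^-x)/2 <= exp(x^2 / 2) at a single point x."""
    return math.cosh(x) <= math.exp(x * x / 2)


def mgf(values, probs, lam):
    """E[exp(lam * A)] for a finitely supported random variable A."""
    return sum(p * math.exp(lam * a) for a, p in zip(values, probs))


# Hypothetical mean-zero A supported on {-1, 0.5}, inside [-1, 1].
values = [-1.0, 0.5]
probs = [1 / 3, 2 / 3]
a, b = min(values), max(values)
mean = sum(p * v for v, p in zip(values, probs))

for lam in (0.5, 1.0, 2.0):
    exact = mgf(values, probs, lam)
    symmetric_bound = math.exp(lam**2 / 2)                 # bound for |A| <= 1
    hoeffding_lemma = math.exp(lam**2 * (b - a) ** 2 / 8)  # bound for A in [a, b]
    print(lam, exact, hoeffding_lemma, symmetric_bound)
```

Because $b - a = 1.5 < 2$ here, the bound from Hoeffding's Lemma is strictly smaller than the symmetric bound $\exp(\lambda^2/2)$, which illustrates why the asymmetric version is described as slightly tighter.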

References

[1] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
