Learning Theory: Lecture Notes


Kamalika Chaudhuri, October 4

1 Concentration of Averages

Concentration of measure is very useful in showing bounds on the errors of machine-learning algorithms. We will begin with a basic concentration inequality, which shows the concentration of measure of averages of a number of independent random variables.

Theorem 1 (Hoeffding's Inequality) Let X_1, ..., X_n be independent and bounded random variables such that a_i ≤ X_i ≤ b_i. Then,

    Pr( | (X_1 + ... + X_n)/n − E[(X_1 + ... + X_n)/n] | ≥ ɛ ) ≤ 2e^(−2n²ɛ² / Σ_i (b_i − a_i)²)

Example 1: Estimating the Bias of a Coin. Consider a coin with bias p, and suppose we toss it n times. If X is the number of heads obtained, Hoeffding's Inequality gives us:

    Pr( |X/n − p| ≥ ɛ ) ≤ 2e^(−2nɛ²)

(A numerical sanity check of this bound appears below, after Theorem 2.)

2 Concentration of Lipschitz Functions

Hoeffding's Inequality shows that the mean of independent random variables is tightly concentrated around its expectation. It turns out that similar concentration bounds can be obtained for smooth, or Lipschitz, functions.

Definition 1 A function f : R^n → R is said to be λ-Lipschitz wrt the L_p metric if for all x and y,

    |f(x) − f(y)| ≤ λ ||x − y||_p

We will only consider functions which are Lipschitz with respect to the L_1 and the L_2 metrics. For example, if x = (x_1, ..., x_n), then the function f_m(x) = (x_1 + ... + x_n)/n is (1/n)-Lipschitz with respect to the L_1 metric.

Theorem 2 (Concentration of Lipschitz Functions wrt the L_1 metric) Let X_1, ..., X_n be independent and bounded random variables such that a_i ≤ X_i ≤ b_i, and let f be a function. If f is λ-Lipschitz with respect to the L_1 metric, then,

    Pr( |f(X_1, ..., X_n) − E[f(X_1, ..., X_n)]| ≥ ɛ ) ≤ 2e^(−2ɛ² / λ² Σ_i (b_i − a_i)²)
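As promised, here is a minimal numerical sanity check of Example 1, which is also the special case f = f_m, λ = 1/n of Theorem 2. This is a sketch in Python with NumPy; the choices of n, p, ɛ, and the number of trials are illustrative assumptions, not values from the notes.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, eps = 1000, 0.3, 0.05   # illustrative choices, not from the notes
    trials = 20000                # number of independent n-toss experiments

    # X ~ Binomial(n, p) is the number of heads in n tosses.
    heads = rng.binomial(n, p, size=trials)

    # Empirical probability of the deviation event |X/n - p| >= eps.
    empirical = np.mean(np.abs(heads / n - p) >= eps)

    # Hoeffding bound for the same event: 2 exp(-2 n eps^2).
    bound = 2 * np.exp(-2 * n * eps**2)

    print("empirical:", empirical)   # typically far below the bound
    print("bound:    ", bound)       # 2 e^{-5} ~ 0.0135 for these parameters

As expected, the bound holds with room to spare; Hoeffding's Inequality is not tight for any particular distribution, but it requires nothing beyond independence and boundedness.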

Concentration bounds can also be shown for functions which are λ-Lipschitz with respect to the L_2 metric.

Theorem 3 (Concentration of Lipschitz Functions wrt the L_2 metric) Let S^(d−1) be the surface of the unit sphere in d dimensions, and let µ be the uniform measure on S^(d−1). Let f : R^d → R be λ-Lipschitz wrt the L_2 metric. Then,

    µ( f ≥ median(f) + ɛ ) ≤ 4e^(−ɛ²d / 2λ²)

Example 2: Concentration of Volume on the Sphere. Let X ∼ µ, let w be any fixed unit vector, and let f be the function:

    f(x) = ⟨x, w⟩

Then f is 1-Lipschitz wrt the L_2 metric, because:

    |f(x) − f(y)| = |⟨x − y, w⟩| ≤ ||w||_2 ||x − y||_2 = ||x − y||_2

Observe that median(f) = 0 due to symmetry. Applying the theorem above to f(x) and −f(x), we get that for any unit vector w,

    µ( |⟨w, X⟩| > ɛ ) ≤ 8e^(−ɛ²d / 2)

This implies that most of the volume of a d-dimensional sphere is concentrated around the equator.
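Example 2 can also be illustrated numerically. The sketch below (Python/NumPy; the dimension d, threshold ɛ, and sample count are illustrative assumptions) draws uniform points on S^(d−1) by normalizing standard Gaussian vectors and compares the empirical mass of the event |⟨w, X⟩| > ɛ with the bound 8e^(−ɛ²d/2).

    import numpy as np

    rng = np.random.default_rng(0)
    d, eps, m = 500, 0.1, 20_000   # illustrative choices

    # Uniform points on S^{d-1}: normalize standard Gaussian vectors
    # (valid because the Gaussian density is rotation-invariant).
    X = rng.standard_normal((m, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)

    # f(x) = <x, w> with w = e_1; by symmetry any fixed unit vector behaves the same.
    proj = X[:, 0]

    empirical = np.mean(np.abs(proj) > eps)   # mass away from the equator
    bound = 8 * np.exp(-eps**2 * d / 2)

    print("empirical:", empirical)  # ~0.025 here (proj has std ~ 1/sqrt(d))
    print("bound:    ", bound)      # 8 e^{-2.5} ~ 0.66 here

The bound is loose at this scale, but it decays exponentially in d: for fixed ɛ, increasing the dimension drives both the empirical mass and the bound toward zero.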

We will next prove Hoeffding's Inequality, but first we need to recall a few basic probability and geometric facts.

3 Some Basic Facts

Fact 1 (Linearity of Expectation) For any two random variables X and Y,

    E[X + Y] = E[X] + E[Y]

Fact 2 (Variance) For a random variable X,

    Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²

Fact 3 (Linearity of Variance) If X_1, ..., X_n are independent random variables, then:

    Var(X_1 + ... + X_n) = Var(X_1) + Var(X_2) + ... + Var(X_n)

Fact 4 (Union Bound) For any two events A and B,

    Pr(A ∪ B) ≤ Pr(A) + Pr(B)

Fact 5 (Jensen's Inequality) If f is a convex function, then

    E[f(X)] ≥ f(E[X])

4 Some Basic Concentration Inequalities

As an exercise, we first look at two (weaker) concentration inequalities and their proofs.

Theorem 4 (Markov's Inequality) For any random variable X and any a > 0,

    Pr( |X| ≥ a ) ≤ E[|X|] / a

Proof: Observe that a · 1(|X| ≥ a) ≤ |X|. Taking expectations on both sides gives a · Pr(|X| ≥ a) ≤ E[|X|], which is the inequality.

Markov's Inequality in turn can be applied to prove stronger concentration inequalities.

Theorem 5 (Chebyshev's Inequality) For any random variable X,

    Pr( |X − E[X]| ≥ a ) ≤ Var(X) / a²

Proof: Let Z = (X − E[X])². Applying Markov's Inequality to Z, we get:

    Pr( |X − E[X]| ≥ a ) = Pr( Z ≥ a² ) ≤ E[(X − E[X])²] / a² = Var(X) / a²

Usually Chebyshev's Inequality gives a stronger bound than Markov's Inequality. However, Markov's Inequality also requires less of the random variable: it only requires E[|X|] to be finite, whereas Chebyshev's Inequality requires both E[X] and Var(X) to be finite.

Example 3: Symmetric Random Walks on the Line. Consider the following stochastic process: we start at the origin, and at each time step t, we take a step to the left w.p. 1/2 and to the right w.p. 1/2. What is our position after n time steps? More formally, for each time step t, we define a random variable X_t to represent each step of the walk as follows:

    X_t = +1 with probability 1/2, and −1 with probability 1/2

Since we start at the origin, the position S_n after n steps is:

    S_n = X_1 + X_2 + ... + X_n

Observe that, using the linearity of expectation, E[S_n] = 0, and using the linearity of variance (as the steps X_t are independent), Var(S_n) = n. If we apply Markov's Inequality to |S_n|, we get that for c > 1,

    Pr( |S_n| ≥ c√n ) ≤ E[|S_n|] / (c√n) ≤ √(E[S_n²]) / (c√n) = √(Var(S_n)) / (c√n) = 1/c

(the second inequality holds since E[|S_n|]² ≤ E[S_n²]). Applying Chebyshev's Inequality,

    Pr( |S_n| ≥ c√n ) ≤ Var(S_n) / (c√n)² = 1/c²

Thus Chebyshev's Inequality provides the better bound.
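A short simulation makes the comparison in Example 3 concrete (Python/NumPy; n, c, and the number of walks are illustrative assumptions). Since the number of +1 steps is Binomial(n, 1/2), we can sample S_n directly rather than summing individual steps.

    import numpy as np

    rng = np.random.default_rng(0)
    n, c, walks = 10_000, 2.0, 100_000   # illustrative choices

    # S_n = X_1 + ... + X_n with X_t = +/-1 equiprobably; if H ~ Binomial(n, 1/2)
    # counts the +1 steps, then S_n = 2H - n has the same distribution.
    H = rng.binomial(n, 0.5, size=walks)
    S = 2 * H - n

    empirical = np.mean(np.abs(S) >= c * np.sqrt(n))
    markov = 1 / c        # Markov on |S_n|, using E|S_n| <= sqrt(E[S_n^2]) = sqrt(n)
    chebyshev = 1 / c**2  # Chebyshev, using Var(S_n) = n

    print("empirical:      ", empirical)   # ~0.046 (CLT: a two-sided 2-sigma event)
    print("Markov bound:   ", markov)      # 0.50
    print("Chebyshev bound:", chebyshev)   # 0.25

Both bounds hold, Chebyshev's is tighter, and the true tail (which the central limit theorem approximates) is smaller still; the Hoeffding bound proved below is e^(−2c²·n/(4n)) = e^(−c²/2) per side, smaller than either.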

5 Proof of Hoeffding's Inequality

In the proof of Chebyshev's Inequality, we used Markov's Inequality on (X − E[X])² to get a stronger bound; to prove Hoeffding's Inequality, we will extend this idea further. To do so, we need the concept of moment generating functions.

Definition 2 The moment generating function ψ(t) of a random variable X is defined as the function:

    ψ(t) = E[e^(tX)]

Example 4: Moment Generating Functions. Suppose X is a random variable which represents the outcome of a coin toss with bias p. Then the moment generating function (m.g.f.) of X is:

    E[e^(tX)] = pe^t + (1 − p)

In general, if X is a discrete random variable which takes values x_1, ..., x_k w.p. p_1, ..., p_k, then,

    E[e^(tX)] = p_1 e^(tx_1) + p_2 e^(tx_2) + ... + p_k e^(tx_k)

If X is a standard normal variable, then the m.g.f. of X is E[e^(tX)] = e^(t²/2).

In general, moment generating functions may not always be defined. But if ψ(t) is defined in an interval [−δ, δ] around 0, then:

1. All moments of X are finite, and E[X^k] = (d^k ψ(t) / dt^k) |_(t=0).

2. If X and Y are two random variables such that ψ_X(t) = ψ_Y(t) for all t ∈ [−δ, δ], then X and Y have the same distribution.

Fact 6 If X and Y are two independent random variables, then

    E[e^(t(X+Y))] = E[e^(tX)] · E[e^(tY)]

Before we prove Hoeffding's Inequality, we need one more lemma.

Lemma 1 If X is a random variable such that E[X] = 0 and a ≤ X ≤ b, then, for any t > 0,

    E[e^(tX)] ≤ e^(t²(b−a)²/8)

Proof: Recall that e^(tx) is a convex function of x. Writing x = λa + (1 − λ)b for some λ ∈ [0, 1], convexity gives:

    e^(tx) ≤ λe^(ta) + (1 − λ)e^(tb)

Plugging in λ = (b − x)/(b − a), we get that:

    e^(tx) ≤ ((b − x)/(b − a)) e^(ta) + ((x − a)/(b − a)) e^(tb)

Taking expectations on both sides and noting that E[X] = 0, we get:

    E[e^(tX)] ≤ (b e^(ta) − a e^(tb)) / (b − a)

We can show using simple calculus that the right hand side is at most e^(t²(b−a)²/8).
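The final calculus step of the lemma is easy to check numerically. The sketch below (Python/NumPy; the support points a, b and the grid of t values are illustrative assumptions) takes the zero-mean two-point distribution on {a, b}, for which E[e^(tX)] equals (b e^(ta) − a e^(tb))/(b − a) exactly, and verifies that it stays below e^(t²(b−a)²/8).

    import numpy as np

    a, b = -1.0, 3.0                      # illustrative support with a < 0 < b
    pa, pb = b / (b - a), -a / (b - a)    # weights making E[X] = 0
    t = np.linspace(0.01, 2.0, 200)

    mgf = pa * np.exp(t * a) + pb * np.exp(t * b)   # E[e^{tX}]
    lemma_bound = np.exp(t**2 * (b - a)**2 / 8)     # e^{t^2 (b-a)^2 / 8}

    # Lemma 1 asserts mgf <= lemma_bound for every t > 0.
    print(bool(np.all(mgf <= lemma_bound)))         # True
    print(float(np.max(mgf / lemma_bound)))         # worst-case ratio, <= 1

The two-point distribution is the extremal case here: any other zero-mean X on [a, b] has an m.g.f. dominated by this one, which is why the proof only needs to bound the right hand side above.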

We are now ready to prove Hoeffding's Inequality.

Theorem 6 (Hoeffding's Inequality, restated) Let X_1, ..., X_n be independent and bounded random variables such that a_i ≤ X_i ≤ b_i. Then,

    Pr( (X_1 + ... + X_n) − E[X_1 + ... + X_n] ≥ ɛ ) ≤ e^(−2ɛ² / Σ_i (b_i − a_i)²)

(This bounds the one-sided deviation of the sum; the two-sided bound on the average in Theorem 1 follows by applying it to the X_i and the −X_i, and replacing ɛ by nɛ.)

Proof: Let S_n = X_1 + ... + X_n, and let Y_i = X_i − E[X_i]. Then a_i − E[X_i] ≤ Y_i ≤ b_i − E[X_i], and E[Y_i] = 0. For any t > 0,

    Pr( S_n − E[S_n] ≥ ɛ ) = Pr( Y_1 + ... + Y_n ≥ ɛ ) = Pr( e^(t(Y_1 + ... + Y_n)) ≥ e^(tɛ) ) ≤ E[e^(t(Y_1 + ... + Y_n))] / e^(tɛ)

where the last step follows from applying Markov's Inequality. Using the independence of the Y_i's (Fact 6), we get that:

    E[e^(t(Y_1 + ... + Y_n))] / e^(tɛ) = E[e^(tY_1)] E[e^(tY_2)] ··· E[e^(tY_n)] / e^(tɛ)

Using Lemma 1, the right hand side is at most:

    e^(t²(b_1−a_1)²/8) ··· e^(t²(b_n−a_n)²/8) / e^(tɛ) = e^(t² Σ_i (b_i − a_i)²/8 − tɛ)

Plugging in t = 4ɛ / Σ_i (b_i − a_i)², this is at most e^(−2ɛ² / Σ_i (b_i − a_i)²).
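To double-check the last step (a sketch in Python/NumPy; ɛ and the interval widths b_i − a_i are illustrative assumptions), one can scan the exponent t² Σ_i (b_i − a_i)²/8 − tɛ over t and confirm that it is minimized at t = 4ɛ / Σ_i (b_i − a_i)², where it equals −2ɛ² / Σ_i (b_i − a_i)².

    import numpy as np

    eps = 0.5
    widths = np.array([1.0, 2.0, 0.5])     # illustrative values of (b_i - a_i)
    sigma = np.sum(widths**2)              # sum_i (b_i - a_i)^2

    t = np.linspace(1e-4, 2.0, 200_001)
    exponent = t**2 * sigma / 8 - t * eps  # exponent of e^{t^2 sigma/8 - t eps}

    t_star = 4 * eps / sigma               # the t chosen in the proof
    best = -2 * eps**2 / sigma             # the claimed minimum value

    print(t[np.argmin(exponent)], t_star)  # numerical minimizer vs 4 eps / sigma
    print(exponent.min(), best)            # both approximately -2 eps^2 / sigma

Since the exponent is a quadratic in t, setting its derivative tΣ_i(b_i − a_i)²/4 − ɛ to zero gives exactly t = 4ɛ/Σ_i(b_i − a_i)²; the scan is just a sanity check of that algebra.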