Learning Theory: Lecture Notes


Kamalika Chaudhuri, October 4

1 Concentration of Averages

Concentration of measure is very useful in showing bounds on the errors of machine-learning algorithms. We will begin with a basic concentration inequality, which shows the concentration of measure of averages of a number of independent random variables.

Theorem 1 (Hoeffding's Inequality) Let X_1, ..., X_n be independent and bounded random variables such that a_i ≤ X_i ≤ b_i. Then,

    Pr( | (X_1 + ... + X_n)/n − E[(X_1 + ... + X_n)/n] | ≥ ɛ ) ≤ 2e^(−2n²ɛ² / Σ_i (b_i − a_i)²)

Example 1: Estimating the Bias of a Coin. Consider a coin with bias p, and suppose we toss it n times. If X is the number of heads obtained, Hoeffding's Inequality gives us:

    Pr( |X/n − p| ≥ ɛ ) ≤ 2e^(−2nɛ²)

(A numerical sanity check of this bound appears below, after Theorem 2.)

2 Concentration of Lipschitz Functions

Hoeffding's Inequality shows that the mean of independent random variables is tightly concentrated around its expectation. It turns out that similar concentration bounds can be obtained for smooth, or Lipschitz, functions.

Definition 1 A function f : R^n → R is said to be λ-Lipschitz wrt the L_p metric if for all x and y,

    |f(x) − f(y)| ≤ λ ||x − y||_p

We will only consider functions which are Lipschitz with respect to the L_1 and the L_2 metrics. For example, if x = (x_1, ..., x_n), then the function f_m(x) = (x_1 + ... + x_n)/n is (1/n)-Lipschitz with respect to the L_1 metric.

Theorem 2 (Concentration of Lipschitz Functions wrt the L_1 metric) Let X_1, ..., X_n be independent and bounded random variables such that a_i ≤ X_i ≤ b_i, and let f be a function. If f is λ-Lipschitz with respect to the L_1 metric, then,

    Pr( |f(X_1, ..., X_n) − E[f(X_1, ..., X_n)]| ≥ ɛ ) ≤ 2e^(−2ɛ² / λ² Σ_i (b_i − a_i)²)
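As promised, here is a minimal numerical sanity check of Example 1, which is also the special case f = f_m, λ = 1/n of Theorem 2. This is a sketch in Python with NumPy; the choices of n, p, ɛ, and the number of trials are illustrative assumptions, not values from the notes.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, eps = 1000, 0.3, 0.05   # illustrative choices, not from the notes
    trials = 20000                # number of independent n-toss experiments

    # X ~ Binomial(n, p) is the number of heads in n tosses.
    heads = rng.binomial(n, p, size=trials)

    # Empirical probability of the deviation event |X/n - p| >= eps.
    empirical = np.mean(np.abs(heads / n - p) >= eps)

    # Hoeffding bound for the same event: 2 exp(-2 n eps^2).
    bound = 2 * np.exp(-2 * n * eps**2)

    print("empirical:", empirical)   # typically far below the bound
    print("bound:    ", bound)       # 2 e^{-5} ~ 0.0135 for these parameters

As expected, the bound holds with room to spare; Hoeffding's Inequality is not tight for any particular distribution, but it requires nothing beyond independence and boundedness.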

Concentration bounds can also be shown for functions which are λ-Lipschitz with respect to the L_2 metric.

Theorem 3 (Concentration of Lipschitz Functions wrt the L_2 metric) Let S^(d−1) be the surface of the unit sphere in d dimensions, and let µ be the uniform measure on S^(d−1). Let f : R^d → R be λ-Lipschitz wrt the L_2 metric. Then,

    µ( f ≥ median(f) + ɛ ) ≤ 4e^(−ɛ²d / 2λ²)

Example 2: Concentration of Volume on the Sphere. Let X ∼ µ, let w be any fixed unit vector, and let f be the function:

    f(x) = ⟨x, w⟩

Then f is 1-Lipschitz wrt the L_2 metric, because:

    |f(x) − f(y)| = |⟨x − y, w⟩| ≤ ||w||_2 ||x − y||_2 = ||x − y||_2

Observe that median(f) = 0 due to symmetry. Applying the theorem above to f(x) and −f(x), we get that for any unit vector w,

    µ( |⟨w, X⟩| > ɛ ) ≤ 8e^(−ɛ²d / 2)

This implies that most of the volume of a d-dimensional sphere is concentrated around the equator.
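Example 2 can also be illustrated numerically. The sketch below (Python/NumPy; the dimension d, threshold ɛ, and sample count are illustrative assumptions) draws uniform points on S^(d−1) by normalizing standard Gaussian vectors and compares the empirical mass of the event |⟨w, X⟩| > ɛ with the bound 8e^(−ɛ²d/2).

    import numpy as np

    rng = np.random.default_rng(0)
    d, eps, m = 500, 0.1, 20_000   # illustrative choices

    # Uniform points on S^{d-1}: normalize standard Gaussian vectors
    # (valid because the Gaussian density is rotation-invariant).
    X = rng.standard_normal((m, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)

    # f(x) = <x, w> with w = e_1; by symmetry any fixed unit vector behaves the same.
    proj = X[:, 0]

    empirical = np.mean(np.abs(proj) > eps)   # mass away from the equator
    bound = 8 * np.exp(-eps**2 * d / 2)

    print("empirical:", empirical)  # ~0.025 here (proj has std ~ 1/sqrt(d))
    print("bound:    ", bound)      # 8 e^{-2.5} ~ 0.66 here

The bound is loose at this scale, but it decays exponentially in d: for fixed ɛ, increasing the dimension drives both the empirical mass and the bound toward zero.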

We will next prove Hoeffding's Inequality, but first we need to recall a few basic probability and geometric facts.

3 Some Basic Facts

Fact 1 (Linearity of Expectation) For any two random variables X and Y,

    E[X + Y] = E[X] + E[Y]

Fact 2 (Variance) For a random variable X,

    Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²

Fact 3 (Linearity of Variance) If X_1, ..., X_n are independent random variables, then:

    Var(X_1 + ... + X_n) = Var(X_1) + Var(X_2) + ... + Var(X_n)

Fact 4 (Union Bound) For any two events A and B,

    Pr(A ∪ B) ≤ Pr(A) + Pr(B)

Fact 5 (Jensen's Inequality) If f is a convex function, then

    E[f(X)] ≥ f(E[X])

4 Some Basic Concentration Inequalities

As an exercise, we first look at two (weaker) concentration inequalities and their proofs.

Theorem 4 (Markov's Inequality) For any random variable X and any a > 0,

    Pr( |X| ≥ a ) ≤ E[|X|] / a

Proof: Observe that a · 1(|X| ≥ a) ≤ |X|. Taking expectations on both sides gives a · Pr(|X| ≥ a) ≤ E[|X|], which is the inequality.

Markov's Inequality in turn can be applied to prove stronger concentration inequalities.

Theorem 5 (Chebyshev's Inequality) For any random variable X,

    Pr( |X − E[X]| ≥ a ) ≤ Var(X) / a²

Proof: Let Z = (X − E[X])². Applying Markov's Inequality to Z, we get:

    Pr( |X − E[X]| ≥ a ) = Pr( Z ≥ a² ) ≤ E[(X − E[X])²] / a² = Var(X) / a²

Usually Chebyshev's Inequality gives a stronger bound than Markov's Inequality. However, Markov's Inequality also requires less of the random variable: it only requires E[|X|] to be finite, whereas Chebyshev's Inequality requires both E[X] and Var(X) to be finite.

Example 3: Symmetric Random Walks on the Line. Consider the following stochastic process: we start at the origin, and at each time step t, we take a step to the left w.p. 1/2 and to the right w.p. 1/2. What is our position after n time steps? More formally, for each time step t, we define a random variable X_t to represent each step of the walk as follows:

    X_t = +1 with probability 1/2, and −1 with probability 1/2

Since we start at the origin, the position S_n after n steps is:

    S_n = X_1 + X_2 + ... + X_n

Observe that, using the linearity of expectation, E[S_n] = 0, and using the linearity of variance (as the steps X_t are independent), Var(S_n) = n. If we apply Markov's Inequality to |S_n|, we get that for c > 1,

    Pr( |S_n| ≥ c√n ) ≤ E[|S_n|] / (c√n) ≤ √(E[S_n²]) / (c√n) = √(Var(S_n)) / (c√n) = 1/c

(the second inequality holds since E[|S_n|]² ≤ E[S_n²]). Applying Chebyshev's Inequality,

    Pr( |S_n| ≥ c√n ) ≤ Var(S_n) / (c√n)² = 1/c²

Thus Chebyshev's Inequality provides the better bound.
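A short simulation makes the comparison in Example 3 concrete (Python/NumPy; n, c, and the number of walks are illustrative assumptions). Since the number of +1 steps is Binomial(n, 1/2), we can sample S_n directly rather than summing individual steps.

    import numpy as np

    rng = np.random.default_rng(0)
    n, c, walks = 10_000, 2.0, 100_000   # illustrative choices

    # S_n = X_1 + ... + X_n with X_t = +/-1 equiprobably; if H ~ Binomial(n, 1/2)
    # counts the +1 steps, then S_n = 2H - n has the same distribution.
    H = rng.binomial(n, 0.5, size=walks)
    S = 2 * H - n

    empirical = np.mean(np.abs(S) >= c * np.sqrt(n))
    markov = 1 / c        # Markov on |S_n|, using E|S_n| <= sqrt(E[S_n^2]) = sqrt(n)
    chebyshev = 1 / c**2  # Chebyshev, using Var(S_n) = n

    print("empirical:      ", empirical)   # ~0.046 (CLT: a two-sided 2-sigma event)
    print("Markov bound:   ", markov)      # 0.50
    print("Chebyshev bound:", chebyshev)   # 0.25

Both bounds hold, Chebyshev's is tighter, and the true tail (which the central limit theorem approximates) is smaller still; the Hoeffding bound proved below is e^(−2c²·n/(4n)) = e^(−c²/2) per side, smaller than either.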

5 Proof of Hoeffding's Inequality

In the proof of Chebyshev's Inequality, we used Markov's Inequality on (X − E[X])² to get a stronger bound; to prove Hoeffding's Inequality, we will extend this idea further. To do so, we need the concept of moment generating functions.

Definition 2 The moment generating function ψ(t) of a random variable X is defined as the function:

    ψ(t) = E[e^(tX)]

Example 4: Moment Generating Functions. Suppose X is a random variable which represents the outcome of a coin toss with bias p. Then the moment generating function (m.g.f.) of X is:

    E[e^(tX)] = pe^t + (1 − p)

In general, if X is a discrete random variable which takes values x_1, ..., x_k w.p. p_1, ..., p_k, then,

    E[e^(tX)] = p_1 e^(tx_1) + p_2 e^(tx_2) + ... + p_k e^(tx_k)

If X is a standard normal variable, then the m.g.f. of X is E[e^(tX)] = e^(t²/2).

In general, moment generating functions may not always be defined. But if ψ(t) is defined in an interval [−δ, δ] around 0, then:

1. All moments of X are finite, and E[X^k] = (d^k ψ(t) / dt^k) |_(t=0).

2. If X and Y are two random variables such that ψ_X(t) = ψ_Y(t) for all t ∈ [−δ, δ], then X and Y have the same distribution.

Fact 6 If X and Y are two independent random variables, then

    E[e^(t(X+Y))] = E[e^(tX)] · E[e^(tY)]

Before we prove Hoeffding's Inequality, we need one more lemma.

Lemma 1 If X is a random variable such that E[X] = 0 and a ≤ X ≤ b, then, for any t > 0,

    E[e^(tX)] ≤ e^(t²(b−a)²/8)

Proof: Recall that e^(tx) is a convex function of x. Writing x = λa + (1 − λ)b for some λ ∈ [0, 1], convexity gives:

    e^(tx) ≤ λe^(ta) + (1 − λ)e^(tb)

Plugging in λ = (b − x)/(b − a), we get that:

    e^(tx) ≤ ((b − x)/(b − a)) e^(ta) + ((x − a)/(b − a)) e^(tb)

Taking expectations on both sides and noting that E[X] = 0, we get:

    E[e^(tX)] ≤ (b e^(ta) − a e^(tb)) / (b − a)

We can show using simple calculus that the right hand side is at most e^(t²(b−a)²/8).
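The final calculus step of the lemma is easy to check numerically. The sketch below (Python/NumPy; the support points a, b and the grid of t values are illustrative assumptions) takes the zero-mean two-point distribution on {a, b}, for which E[e^(tX)] equals (b e^(ta) − a e^(tb))/(b − a) exactly, and verifies that it stays below e^(t²(b−a)²/8).

    import numpy as np

    a, b = -1.0, 3.0                      # illustrative support with a < 0 < b
    pa, pb = b / (b - a), -a / (b - a)    # weights making E[X] = 0
    t = np.linspace(0.01, 2.0, 200)

    mgf = pa * np.exp(t * a) + pb * np.exp(t * b)   # E[e^{tX}]
    lemma_bound = np.exp(t**2 * (b - a)**2 / 8)     # e^{t^2 (b-a)^2 / 8}

    # Lemma 1 asserts mgf <= lemma_bound for every t > 0.
    print(bool(np.all(mgf <= lemma_bound)))         # True
    print(float(np.max(mgf / lemma_bound)))         # worst-case ratio, <= 1

The two-point distribution is the extremal case here: any other zero-mean X on [a, b] has an m.g.f. dominated by this one, which is why the proof only needs to bound the right hand side above.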

We are now ready to prove Hoeffding's Inequality.

Theorem 6 (Hoeffding's Inequality, restated) Let X_1, ..., X_n be independent and bounded random variables such that a_i ≤ X_i ≤ b_i. Then,

    Pr( (X_1 + ... + X_n) − E[X_1 + ... + X_n] ≥ ɛ ) ≤ e^(−2ɛ² / Σ_i (b_i − a_i)²)

(This bounds the one-sided deviation of the sum; the two-sided bound on the average in Theorem 1 follows by applying it to the X_i and the −X_i, and replacing ɛ by nɛ.)

Proof: Let S_n = X_1 + ... + X_n, and let Y_i = X_i − E[X_i]. Then a_i − E[X_i] ≤ Y_i ≤ b_i − E[X_i], and E[Y_i] = 0. For any t > 0,

    Pr( S_n − E[S_n] ≥ ɛ ) = Pr( Y_1 + ... + Y_n ≥ ɛ ) = Pr( e^(t(Y_1 + ... + Y_n)) ≥ e^(tɛ) ) ≤ E[e^(t(Y_1 + ... + Y_n))] / e^(tɛ)

where the last step follows from applying Markov's Inequality. Using the independence of the Y_i's (Fact 6), we get that:

    E[e^(t(Y_1 + ... + Y_n))] / e^(tɛ) = E[e^(tY_1)] E[e^(tY_2)] ··· E[e^(tY_n)] / e^(tɛ)

Using Lemma 1, the right hand side is at most:

    e^(t²(b_1−a_1)²/8) ··· e^(t²(b_n−a_n)²/8) / e^(tɛ) = e^(t² Σ_i (b_i − a_i)²/8 − tɛ)

Plugging in t = 4ɛ / Σ_i (b_i − a_i)², this is at most e^(−2ɛ² / Σ_i (b_i − a_i)²).
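To double-check the last step (a sketch in Python/NumPy; ɛ and the interval widths b_i − a_i are illustrative assumptions), one can scan the exponent t² Σ_i (b_i − a_i)²/8 − tɛ over t and confirm that it is minimized at t = 4ɛ / Σ_i (b_i − a_i)², where it equals −2ɛ² / Σ_i (b_i − a_i)².

    import numpy as np

    eps = 0.5
    widths = np.array([1.0, 2.0, 0.5])     # illustrative values of (b_i - a_i)
    sigma = np.sum(widths**2)              # sum_i (b_i - a_i)^2

    t = np.linspace(1e-4, 2.0, 200_001)
    exponent = t**2 * sigma / 8 - t * eps  # exponent of e^{t^2 sigma/8 - t eps}

    t_star = 4 * eps / sigma               # the t chosen in the proof
    best = -2 * eps**2 / sigma             # the claimed minimum value

    print(t[np.argmin(exponent)], t_star)  # numerical minimizer vs 4 eps / sigma
    print(exponent.min(), best)            # both approximately -2 eps^2 / sigma

Since the exponent is a quadratic in t, setting its derivative tΣ_i(b_i − a_i)²/4 − ɛ to zero gives exactly t = 4ɛ/Σ_i(b_i − a_i)²; the scan is just a sanity check of that algebra.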