Lecture 3: August 31

36-705: Intermediate Statistics, Fall 2018
Lecturer: Siva Balakrishnan

This lecture will be mostly a summary of other useful exponential tail bounds. We will not prove any of these in lecture; some of them follow similar lines of using Chernoff's method in clever ways. I can provide references if you are curious. In particular, we will go through:

1. Bernstein's inequality: sharper concentration for bounded random variables.
2. McDiarmid's inequality: concentration of Lipschitz functions of bounded random variables.
3. Levy's inequality/Tsirelson's inequality: concentration of Lipschitz functions of Gaussian random variables.
4. The $\chi^2$ tail bound.

Finally, we will see an application of the $\chi^2$ tail bound in proving the Johnson-Lindenstrauss lemma.

3.1 Bernstein's inequality

One nice thing about the Gaussian tail inequality was that it explicitly depended on the variance of the random variable $X$: roughly, the inequality guaranteed us that the deviation from the mean was at most $\sigma \sqrt{2\log(2/\delta)/n}$ with probability at least $1-\delta$. On the other hand, Hoeffding's bound depended only on the bounds of the random variable but not explicitly on the variance of the RVs. The bound $b-a$ provides a (possibly loose) upper bound on the standard deviation. One might at least hope that if the random variables were bounded, and additionally had small variance, we might be able to improve on Hoeffding's bound. This is indeed the case. Such inequalities are typically known as Bernstein inequalities.

As a concrete example, suppose we had $X_1, \ldots, X_n$ which were i.i.d. from a distribution with mean $\mu$, bounded support $[a,b]$, and variance $\mathbb{E}[(X-\mu)^2] = \sigma^2$. Then,
$$\mathbb{P}(|\hat{\mu} - \mu| \geq t) \leq 2 \exp\left( - \frac{nt^2}{2(\sigma^2 + (b-a)t)} \right).$$

Roughly, this inequality says that with probability at least $1-\delta$,
$$|\hat{\mu} - \mu| \leq \sqrt{\frac{4\sigma^2 \ln(2/\delta)}{n}} + \frac{4(b-a)\ln(2/\delta)}{n}.$$
As an exercise, work through the above algebra. Up to some small constants this is never worse than Hoeffding's bound, which just comes from using the worst-case upper bound $\sigma \leq b-a$. When the RVs have small variance, i.e. $\sigma$ is small, this bound can be much sharper than Hoeffding's bound. These are cases where one has a random variable that occasionally takes large values (so the bounds are not great) but has much smaller variance. Intuitively, it captures more of the Chebyshev effect, i.e. that random variables with small variance should be tightly concentrated around their mean.
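To make the comparison with Hoeffding concrete, here is a minimal numerical sketch (not part of the notes) that evaluates both deviation widths for the mean of bounded variables; the values of $n$, $\delta$, and $\sigma^2$ are arbitrary illustrative choices, and the Bernstein width uses the loose constants from the display above.

```python
import numpy as np

# Minimal numerical sketch (not part of the notes): compare the Hoeffding
# deviation width with the Bernstein-type width from the display above, for
# the mean of n i.i.d. variables supported on [a, b] = [0, 1].

n, delta = 10_000, 0.05
a, b = 0.0, 1.0
log_term = np.log(2 / delta)

# Hoeffding: |mu_hat - mu| <= (b - a) * sqrt(log(2/delta) / (2n)) w.p. 1 - delta
hoeffding = np.sqrt((b - a) ** 2 * log_term / (2 * n))

for sigma2 in [0.25, 0.01, 0.0001]:
    # Bernstein-type width (loose constants, as in the notes)
    bernstein = np.sqrt(4 * sigma2 * log_term / n) + 4 * (b - a) * log_term / n
    print(f"sigma^2 = {sigma2:7.4f}:  Hoeffding = {hoeffding:.5f},  "
          f"Bernstein = {bernstein:.5f}")
```

As the variance shrinks, the Bernstein width shrinks with it, while the Hoeffding width stays fixed.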

3.2 McDiarmid's inequality

So far we have mostly been focused on the concentration of averages. A natural question is whether other functions of i.i.d. random variables also show exponential concentration. It turns out that many other functions do concentrate sharply, and roughly the main property of the function that we need is that if we change the value of one random variable the function does not change dramatically. Formally, we have i.i.d. RVs $X_1, \ldots, X_n$, where each $X_i \in \mathbb{R}$. We have a function $f: \mathbb{R}^n \to \mathbb{R}$ that satisfies the property that
$$|f(x_1, \ldots, x_n) - f(x_1, \ldots, x_{k-1}, x_k', x_{k+1}, \ldots, x_n)| \leq L_k,$$
for every $x_1, \ldots, x_n, x_k' \in \mathbb{R}$, i.e. the function changes by at most $L_k$ if its $k$-th coordinate is changed. This is known as the bounded difference condition. If the random variables $X_1, \ldots, X_n$ are i.i.d. then for all $t \geq 0$,
$$\mathbb{P}\left(|f(X_1, \ldots, X_n) - \mathbb{E}[f(X_1, \ldots, X_n)]| \geq t\right) \leq 2 \exp\left( - \frac{2t^2}{\sum_{k=1}^{n} L_k^2} \right).$$

Example 1: A simple example of this inequality in action is to see that it directly implies the Hoeffding bound. In this case the function of interest is the average:
$$f(X_1, \ldots, X_n) = \frac{1}{n}\sum_{i=1}^{n} X_i,$$
and since the random variables are bounded we have that each $L_k \leq (b-a)/n$. This in turn directly yields Hoeffding's bound (with slightly better constants).
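As a quick sanity check on Example 1, here is a minimal simulation sketch (illustration only, not from the notes) that evaluates the McDiarmid bound for the sample mean of Uniform[0, 1] variables and compares it with an empirical tail estimate; the sample size, threshold $t$, and repetition count are arbitrary choices.

```python
import numpy as np

# Minimal simulation sketch (illustration only): McDiarmid's inequality for the
# sample mean of n i.i.d. Uniform[0, 1] variables, as in Example 1.  Changing
# one coordinate moves the mean by at most L_k = (b - a)/n.

rng = np.random.default_rng(0)
n, t, reps = 200, 0.05, 20_000
a, b = 0.0, 1.0

L = np.full(n, (b - a) / n)                        # bounded-difference constants
mcdiarmid = 2 * np.exp(-2 * t ** 2 / np.sum(L ** 2))

X = rng.uniform(a, b, size=(reps, n))
deviations = np.abs(X.mean(axis=1) - (a + b) / 2)  # E[mean] = 1/2 here
empirical = np.mean(deviations >= t)

print(f"McDiarmid bound: {mcdiarmid:.4f},  empirical tail frequency: {empirical:.4f}")
```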

Example 2: A perhaps more interesting example is that of U-statistics. A U-statistic is defined by a kernel, which is just a function of two random variables, i.e. $g: \mathbb{R}^2 \to \mathbb{R}$. The U-statistic is then given as:
$$U(X_1, \ldots, X_n) := \binom{n}{2}^{-1} \sum_{j < k} g(X_j, X_k).$$
There are many examples of U-statistics, for instance:

1. Variance: The usual estimator of the sample variance,
$$\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \hat{\mu})^2,$$
is the U-statistic that arises from taking $g(X_j, X_k) = \frac{1}{2}(X_j - X_k)^2$.

2. Mean absolute deviation: If we take $g(X_j, X_k) = |X_j - X_k|$, this leads to a U-statistic that is an unbiased estimator of the mean absolute deviation $\mathbb{E}|X_1 - X_2|$.

For bounded U-statistics, i.e. if $|g(X_i, X_j)| \leq b$, we can apply McDiarmid's inequality to obtain a concentration bound. Note that since each random variable $X_i$ participates in $(n-1)$ terms, we have that
$$|U(X_1, \ldots, X_n) - U(X_1, \ldots, X_i', \ldots, X_n)| \leq \binom{n}{2}^{-1}(n-1)(2b) = \frac{4b}{n},$$
so McDiarmid's inequality tells us that
$$\mathbb{P}\left(|U(X_1, \ldots, X_n) - \mathbb{E}[U(X_1, \ldots, X_n)]| \geq t\right) \leq 2\exp\left(-\frac{nt^2}{8b^2}\right).$$

3.3 Levy's inequality

There is a similar concentration inequality that applies to functions of Gaussian random variables that are sufficiently smooth. In this case, the assumption is quite different. We assume that
$$|f(X_1, \ldots, X_n) - f(Y_1, \ldots, Y_n)| \leq L \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2},$$
for all $X_1, \ldots, X_n, Y_1, \ldots, Y_n \in \mathbb{R}$. For such functions we have that if $X_1, \ldots, X_n \sim N(0,1)$, then
$$\mathbb{P}\left(|f(X_1, \ldots, X_n) - \mathbb{E}[f(X_1, \ldots, X_n)]| \geq t\right) \leq 2\exp\left(-\frac{t^2}{2L^2}\right).$$
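One standard illustration of Levy's inequality, sketched below (not part of the notes), takes $f$ to be the Euclidean norm, which satisfies the Lipschitz condition above with $L = 1$ by the triangle inequality; the dimension, repetition count, and $t$ are arbitrary choices, and the empirical mean of the simulated norms stands in for $\mathbb{E}[f]$.

```python
import numpy as np

# Minimal sketch (illustration only, not from the notes): f(x) = ||x||_2 is
# 1-Lipschitz in the sense above, so Levy's inequality gives
# P(|f - E[f]| >= t) <= 2 exp(-t^2 / 2), with no dependence on the dimension n.

rng = np.random.default_rng(1)
n, reps, t = 500, 10_000, 1.5

X = rng.standard_normal(size=(reps, n))
norms = np.linalg.norm(X, axis=1)                  # one draw of f per row
empirical = np.mean(np.abs(norms - norms.mean()) >= t)
levy_bound = 2 * np.exp(-t ** 2 / 2)

print(f"n = {n}: empirical tail = {empirical:.5f}, Levy bound = {levy_bound:.5f}")
```

Increasing $n$ leaves both the bound and the empirical tail essentially unchanged, which is exactly the dimension-free behaviour the inequality promises.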

3.4 $\chi^2$ tail bounds

A $\chi^2$ random variable with $n$ degrees of freedom, denoted by $Y \sim \chi^2_n$, is a RV that is a sum of squares of $n$ i.i.d. standard Gaussian RVs, i.e. $Y = \sum_{i=1}^{n} X_i^2$ where each $X_i \sim N(0,1)$. The expected value $\mathbb{E}[X_i^2] = 1$, and we have the $\chi^2$ tail bound:
$$\mathbb{P}\left( \left| \frac{1}{n}\sum_{k=1}^{n} Z_k^2 - 1 \right| \geq t \right) \leq 2\exp(-nt^2/8) \quad \text{for all } t \in (0,1).$$
You will derive this in your HW using the Chernoff method. Analogous to the class of sub-Gaussian RVs, $\chi^2$ random variables belong to a class of what are known as sub-exponential random variables. The main noteworthy difference is that the Gaussian-type behaviour of the tail only holds for small values of the deviation $t$.
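Here is a minimal simulation sketch (illustration only, not from the notes) comparing the $\chi^2$ tail bound above with empirical tail frequencies; the degrees of freedom and repetition count are arbitrary choices.

```python
import numpy as np

# Minimal simulation sketch (illustration only): compare the chi-squared tail
# bound P(|(1/n) sum_k Z_k^2 - 1| >= t) <= 2 exp(-n t^2 / 8) with empirical
# tail frequencies for a few values of t in (0, 1).

rng = np.random.default_rng(2)
n, reps = 200, 50_000

Z = rng.standard_normal(size=(reps, n))
avg = (Z ** 2).mean(axis=1)                        # draws of (1/n) * chi^2_n

for t in [0.1, 0.2, 0.3]:
    empirical = np.mean(np.abs(avg - 1.0) >= t)
    bound = 2 * np.exp(-n * t ** 2 / 8)
    print(f"t = {t:.1f}: empirical tail = {empirical:.5f}, bound = {bound:.5f}")
```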

Detour: the union bound. This is also known as Boole's inequality. It says that if we have events $A_1, \ldots, A_n$, then
$$\mathbb{P}\left( \bigcup_{i=1}^{n} A_i \right) \leq \sum_{i=1}^{n} \mathbb{P}(A_i).$$
In particular, if we consider a case where each event $A_i$ is a failure of some type, then the above inequality says that the probability that even a single failure occurs is at most the sum of the probabilities of each failure.

Example: the Johnson-Lindenstrauss lemma. One very nice application of $\chi^2$ tail bounds is in the analysis of what are known as random projections. Suppose we have a data set $X_1, \ldots, X_n \in \mathbb{R}^d$ where $d$ is quite large. Storing such a dataset might be expensive, and as a result we often resort to sketching or random projection, where the goal is to create a map $F: \mathbb{R}^d \to \mathbb{R}^m$, with $m \ll d$. We then instead store the mapped dataset $\{F(X_1), \ldots, F(X_n)\}$. The challenge is to design this map $F$ in a way that preserves essential features of the original dataset. In particular, we would like that for every pair $(X_i, X_j)$ we have
$$(1-\epsilon)\|X_i - X_j\|_2^2 \leq \|F(X_i) - F(X_j)\|_2^2 \leq (1+\epsilon)\|X_i - X_j\|_2^2,$$
i.e. the map preserves all the pairwise distances up to a $(1 \pm \epsilon)$ factor. Of course, if $m$ is large we might expect this is not too difficult. The Johnson-Lindenstrauss lemma is quite stunning: it says that a simple randomized construction will produce such a map with probability at least $1-\delta$ provided that
$$m \geq \frac{16\log(n/\delta)}{\epsilon^2}.$$
Notice that this is completely independent of the original dimension $d$ and depends only logarithmically on the number of points. This map can result in huge savings in storage cost while still essentially preserving all the pairwise distances.

The map itself is quite simple: we construct a matrix $Z \in \mathbb{R}^{m \times d}$, where each entry of $Z$ is i.i.d. $N(0,1)$. We then define the map as
$$F(X_i) = \frac{Z X_i}{\sqrt{m}}.$$
Now let us fix a pair $(X_j, X_k)$ and consider
$$\frac{\|F(X_j) - F(X_k)\|_2^2}{\|X_j - X_k\|_2^2} = \frac{\|Z(X_j - X_k)\|_2^2}{m\,\|X_j - X_k\|_2^2} = \frac{1}{m}\sum_{i=1}^{m} \underbrace{\frac{\langle Z_i, X_j - X_k\rangle^2}{\|X_j - X_k\|_2^2}}_{T_i}.$$
Now, for some fixed numbers $a_1, \ldots, a_d$, the distribution of $\sum_{j=1}^{d} a_j Z_{ij}$ is Gaussian with mean 0 and variance $\sum_{j=1}^{d} a_j^2$. So each term $T_i$ is an independent $\chi^2_1$ random variable. Now, applying the $\chi^2$ tail bound, we obtain that
$$\mathbb{P}\left( \left| \frac{\|F(X_j) - F(X_k)\|_2^2}{\|X_j - X_k\|_2^2} - 1 \right| \geq \epsilon \right) \leq 2\exp(-m\epsilon^2/8).$$
Thus for the fixed pair $(X_j, X_k)$ the probability that our map fails to preserve the distance is exponentially small, i.e. it is at most $2\exp(-m\epsilon^2/8)$. Now, to find the probability that our map fails to preserve any of our $\binom{n}{2}$ pairwise distances, we simply apply the union bound to conclude that the probability of any failure is at most
$$\mathbb{P}(\text{failure}) \leq \binom{n}{2}\, 2\exp(-m\epsilon^2/8).$$
Now, it is straightforward to verify that if $m \geq 16\log(n/\delta)/\epsilon^2$, then this probability is at most $\delta$, as desired. An important point to note is that the exponential concentration is what leads to such a small value for $m$ (i.e. it only needs to grow logarithmically with the sample size).
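The construction above is easy to try directly. The sketch below (illustration only, not from the notes) draws a Gaussian projection matrix, maps a synthetic dataset from $\mathbb{R}^d$ to $\mathbb{R}^m$ with $m$ chosen according to the lemma, and reports the worst relative distortion of the pairwise squared distances; the values of $n$, $d$, $\epsilon$, and $\delta$ are arbitrary illustrative choices.

```python
import numpy as np

# Minimal sketch (illustration only, not from the notes) of the construction
# above: draw Z with i.i.d. N(0, 1) entries, map F(x) = Z x / sqrt(m), and
# check the worst relative distortion of pairwise squared distances.

rng = np.random.default_rng(3)
n, d = 50, 5_000
epsilon, delta = 0.25, 0.05
m = int(np.ceil(16 * np.log(n / delta) / epsilon ** 2))   # dimension from the lemma

X = rng.standard_normal(size=(n, d))                       # original points in R^d
Z = rng.standard_normal(size=(m, d))
F = X @ Z.T / np.sqrt(m)                                   # projected points in R^m

worst = 0.0
for i in range(n):
    for j in range(i + 1, n):
        orig_sq = np.sum((X[i] - X[j]) ** 2)
        proj_sq = np.sum((F[i] - F[j]) ** 2)
        worst = max(worst, abs(proj_sq / orig_sq - 1.0))

print(f"projected from d = {d} down to m = {m}")
print(f"worst relative distortion of squared distances: {worst:.3f} (target: {epsilon})")
```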