Chapter 7. Basic Probability Theory


I-Liang Chern, October 20, 2016

What kinds of matrices satisfy the restricted isometry property (RIP)?
- random matrices with i.i.d. Gaussian entries;
- random matrices with i.i.d. Bernoulli entries ($\pm 1$);
- random matrices with i.i.d. subgaussian entries;
- the random Fourier ensemble;
- random ensembles in bounded orthogonal systems.
In each case, with $m = O(s \ln N)$, the matrix satisfies the RIP with very high probability ($1 - e^{-Cm}$).
These notes follow S. Foucart and H. Rauhut, A Mathematical Introduction to Compressive Sensing, Springer 2013.

Outline of this chapter: basic probability
- Basic probability theory
- Moments and tails
- Concentration inequalities

Basic notions of probability theory
- Probability space
- Random variables
- Expectation and variance
- Sum of independent random variables

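Probability space
A probability space is a triple $(\Omega, \mathcal{F}, P)$, where $\Omega$ is the sample space, $\mathcal{F}$ is the set of events, and $P$ is the probability measure. The sample space is the set of all possible outcomes. The collection of events $\mathcal{F}$ should be a $\sigma$-algebra:
1. $\emptyset \in \mathcal{F}$;
2. if $E \in \mathcal{F}$, then $E^c \in \mathcal{F}$;
3. $\mathcal{F}$ is closed under countable unions, i.e. if $E_i \in \mathcal{F}$, $i = 1, 2, \dots$, then $\bigcup_{i=1}^{\infty} E_i \in \mathcal{F}$.
The probability measure $P : \mathcal{F} \to [0, 1]$ satisfies
1. $P(\Omega) = 1$;
2. if the $E_i$ are mutually exclusive (i.e. $E_i \cap E_j = \emptyset$ for $i \ne j$), then $P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i)$.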

Examples
1. Bernoulli trial: a trial with only two outcomes, success or failure. We represent it as $\Omega = \{1, 0\}$. The collection of events is $\mathcal{F} = 2^{\Omega} = \{\emptyset, \{0\}, \{1\}, \{0, 1\}\}$. The probability is $P(\{1\}) = p$, $P(\{0\}) = 1 - p$, with $0 \le p \le 1$. If we denote the outcome of a Bernoulli trial by $x$, i.e. $x = 1$ or $0$, then $P(x) = p^{x}(1 - p)^{1 - x}$.
2. Binomial trials: perform Bernoulli trials $n$ times independently. An outcome has the form $(x_1, x_2, \dots, x_n)$, where $x_i = 0$ or $1$ is the outcome of the $i$th trial. There are $2^n$ outcomes. The sample space is $\Omega = \{(x_1, \dots, x_n) \mid x_i = 0 \text{ or } 1\}$. The collection of events $\mathcal{F} = 2^{\Omega}$ is the collection of all subsets of $\Omega$. The probability is
$$P(\{(x_1, \dots, x_n)\}) := p^{\sum x_i}(1 - p)^{n - \sum x_i}.$$

Properties of the probability measure
1. $P(E^c) = 1 - P(E)$;
2. if $E \subset F$, then $P(E) \le P(F)$;
3. if $\{E_n, n \ge 1\}$ is either increasing (i.e. $E_n \subset E_{n+1}$) or decreasing to $E$, then $\lim_{n \to \infty} P(E_n) = P(E)$.

Independence and conditional probability
Let $A, B \in \mathcal{F}$ and $P(B) \ne 0$. The conditional probability $P(A \mid B)$ (the probability of $A$ given $B$) is defined to be
$$P(A \mid B) := \frac{P(A \cap B)}{P(B)}.$$
Two events $A$ and $B$ are called independent if $P(A \cap B) = P(A)P(B)$. In this case $P(A \mid B) = P(A)$ and, when $P(A) \ne 0$, $P(B \mid A) = P(B)$.
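The definitions above can be checked by direct enumeration on a small sample space. The sketch below assumes two fair dice as the experiment; the events `A` and `B` and the helper `prob` are illustrative choices, not part of the original slides.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs from two fair dice, all 36 outcomes equally likely.
omega = list(product(range(1, 7), repeat=2))

def prob(event):
    """P(event) for an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] % 2 == 0         # the first die is even
B = lambda w: w[0] + w[1] == 7      # the two dice sum to 7
A_and_B = lambda w: A(w) and B(w)

p_A_given_B = prob(A_and_B) / prob(B)               # P(A | B) = P(A and B) / P(B)
independent = prob(A_and_B) == prob(A) * prob(B)    # product rule for independence

print(p_A_given_B, prob(A), independent)
```

Exact fractions avoid rounding, so the independence test is an exact equality rather than a tolerance check.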

Random variables
Outline:
- Discrete random variables
- Continuous random variables
- Expectation and variance

Discrete random variables
A discrete random variable is a mapping $x : \Omega \to \{a_1, a_2, \dots\}$; its set of values is denoted by $\Omega_x$. The random variable $x$ induces a probability on the discrete set $\Omega_x := \{a_1, a_2, \dots\}$, with probability $P_x(\{a_k\}) := P(\{x = a_k\})$ and with $\sigma$-algebra $\mathcal{F}_x = 2^{\Omega_x}$, the collection of all subsets of $\Omega_x$. We call the function $a_k \mapsto P_x(a_k)$ the probability mass function of $x$. Once we have $(\Omega_x, \mathcal{F}_x, P_x)$, we can work with this probability space alone if we are only concerned with $x$, and forget the original probability space $(\Omega, \mathcal{F}, P)$.

Binomial random variable
Let $x_k$ be the $k$th ($k = 1, \dots, n$) outcome of $n$ independent Bernoulli trials. Clearly, $x_k$ is a random variable. Let $S_n = \sum_{i=1}^{n} x_i$ be the number of successes in the $n$ Bernoulli trials. Then $S_n$ is also a random variable. The sample space that $S_n$ induces is $\Omega_{S_n} = \{0, 1, \dots, n\}$, with
$$P_{S_n}(S_n = k) = \binom{n}{k} p^{k}(1 - p)^{n - k}.$$
The binomial distribution models the following uncertainties:
- the number of successes in $n$ independent Bernoulli trials;
- the relative increase or decrease of a stock in a day.
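To see how $S_n$ induces a distribution on $\{0, 1, \dots, n\}$, one can sum the probabilities of all outcomes in $\{0,1\}^n$ with exactly $k$ successes and compare with the binomial formula. This is a minimal sketch; the function names and the values of `n` and `p` are illustrative assumptions.

```python
from itertools import product
from math import comb

def binomial_pmf(k, n, p):
    """Closed form: C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def induced_pmf(n, p):
    """Push the outcome probabilities on {0,1}^n forward through S_n = x_1 + ... + x_n."""
    pmf = [0.0] * (n + 1)
    for outcome in product((0, 1), repeat=n):
        k = sum(outcome)                           # number of successes in this outcome
        pmf[k] += p ** k * (1 - p) ** (n - k)      # probability of this single outcome
    return pmf

n, p = 5, 0.3
brute = induced_pmf(n, p)
closed = [binomial_pmf(k, n, p) for k in range(n + 1)]
print(brute)
print(closed)
```

The brute-force version enumerates $2^n$ outcomes, so it is only meant for small $n$.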

Poisson random variable
The Poisson random variable $x$ takes values $0, 1, 2, \dots$ with probability
$$P(x = k) = e^{-\lambda}\frac{\lambda^{k}}{k!},$$
where $\lambda > 0$ is a parameter. The sample space is $\Omega = \{0, 1, 2, \dots\}$. The probability satisfies
$$P(\Omega) = \sum_{k=0}^{\infty} P(x = k) = e^{-\lambda}\sum_{k=0}^{\infty}\frac{\lambda^{k}}{k!} = 1.$$
The Poisson distribution can be used to model the following uncertainties:
- the number of customers visiting a specific counter in a day;
- the number of particles hitting a specific radiation detector in a certain period of time;
- the number of phone calls to a specific phone in a week.
The parameter $\lambda$ is different for different cases. It can be estimated by experiments.
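The normalization $P(\Omega) = 1$ can be checked numerically by truncating the series. The sketch below builds the probabilities with the recurrence $P(x = k) = P(x = k-1)\,\lambda/k$, which avoids large factorials; the value of `lam` and the cutoff `kmax` are arbitrary choices for illustration.

```python
from math import exp

def poisson_pmf(lam, kmax):
    """Return [P(x=0), ..., P(x=kmax)] for the Poisson distribution with parameter lam."""
    probs = [exp(-lam)]                       # P(x = 0) = e^{-lam}
    for k in range(1, kmax + 1):
        probs.append(probs[-1] * lam / k)     # P(x = k) = P(x = k-1) * lam / k
    return probs

lam = 3.5
probs = poisson_pmf(lam, 60)
total = sum(probs)    # truncated version of sum_{k>=0} P(x = k)
print(total)
```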

Continuous random variables
A continuous random variable is a (Borel) measurable mapping $x : \Omega \to \mathbb{R}$. This means that $x^{-1}([a, b)) \in \mathcal{F}$ for any $[a, b)$. It induces a probability space $(\Omega_x, \mathcal{F}_x, P_x)$ by
- $\Omega_x = x(\Omega)$;
- $\mathcal{F}_x = \{A \subset \mathbb{R} \mid x^{-1}(A) \in \mathcal{F}\}$;
- $P_x(A) := P(x^{-1}(A))$.
In particular, define
$$F_x(t) := P_x((-\infty, t)) = P(\{x < t\}),$$
called the (cumulative) distribution function. Its derivative with respect to $t$ is called the probability density function:
$$p_x(t) = \frac{dF_x(t)}{dt}.$$
In other words,
$$P(\{a \le x < b\}) = F_x(b) - F_x(a) = \int_a^b p_x(t)\,dt.$$
Thus, $x$ can be completely characterized by the density function $p_x$ on $\mathbb{R}$.

Gaussian distribution
The density function of the Gaussian distribution is
$$p(x) := \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-|x - \mu|^2/(2\sigma^2)}, \qquad -\infty < x < \infty.$$
A random variable $x$ with the above probability density function is called a Gaussian random variable and is denoted by $x \sim N(\mu, \sigma)$. The Gaussian distribution is used to model:
- the motion of a big particle (called a Brownian particle) in water;
- the limit of the binomial distribution with infinitely many trials.

Exponential distribution
The density is
$$p(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0, \\ 0 & \text{if } x < 0. \end{cases}$$
The exponential distribution is used to model
- the length of a telephone call;
- the waiting time until the next earthquake.
Laplace distribution
The density function is given by
$$p(x) = \frac{\lambda}{2}\, e^{-\lambda |x|}.$$
It is used to model some noise in images. It is also used as a prior in Bayesian regression (LASSO).

Remarks
A probability mass function, which is supported on discrete values in $\mathbb{R}$, can be viewed as a special case of a probability density function by introducing the notion of the delta function. The delta function is defined by
$$\int \delta(x - a) f(x)\,dx = f(a).$$
Thus, a discrete random variable $x$ taking the value $a_i$ with probability $p_i$ has the probability density function
$$p(x) = \sum_i p_i\,\delta(x - a_i), \qquad \int_a^b p(x)\,dx = \sum_{a < a_i < b} p_i.$$

Expectation
Given a random variable $x$ with pdf $p(x)$, we define the expectation of $x$ by
$$E(x) = \int x\,p(x)\,dx.$$
If $f$ is a continuous function on $\mathbb{R}$, then $f(x)$ is again a random variable. Its expectation is denoted by $E(f(x))$. We have
$$E(f(x)) = \int f(x)\,p(x)\,dx.$$

The $k$th moment of $x$ is defined to be $m_k := E(x^k)$. In particular, the first moment and the second central moment have special names:
- mean: $\mu := E(x)$;
- variance: $\mathrm{Var}(x) := E((x - E(x))^2)$.
The variance measures the spread of the values of a random variable.

Examples
1. Bernoulli distribution: mean $\mu = p$, variance $\sigma^2 = p(1 - p)$.
2. Binomial distribution $S_n$: mean $\mu = np$, variance $\sigma^2 = np(1 - p)$.
3. Poisson distribution: mean $\mu = \lambda$, variance $\sigma^2 = \lambda$.
4. Normal distribution $N(\mu, \sigma)$: mean $\mu$, variance $\sigma^2$.
5. Uniform distribution on $[a, b]$: mean $\mu = (a + b)/2$, variance $\sigma^2 = (b - a)^2/12$.
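The Bernoulli and binomial entries in this list can be verified directly from the definitions of mean and variance. The following is a small sketch; the dictionary representation of a pmf and the values of `n` and `p` are assumptions made for the illustration.

```python
from math import comb

def mean_and_variance(pmf):
    """pmf maps each value a_k to its probability P(x = a_k)."""
    mu = sum(a * q for a, q in pmf.items())
    var = sum((a - mu) ** 2 * q for a, q in pmf.items())
    return mu, var

n, p = 12, 0.25
bernoulli = {1: p, 0: 1 - p}
binomial = {k: comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)}

mu_b, var_b = mean_and_variance(bernoulli)   # compare with p and p(1 - p)
mu_s, var_s = mean_and_variance(binomial)    # compare with np and np(1 - p)
print(mu_b, var_b, mu_s, var_s)
```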

Joint probability
Let $x$ and $y$ be two random variables on $(\Omega, \mathcal{F}, P)$. The joint probability distribution of $(x, y)$ is the measure on $\mathbb{R}^2$ defined by
$$\mu(A) := P((x, y) \in A) \quad \text{for any Borel set } A.$$
The derivative of $\mu$ with respect to the Lebesgue measure $dx\,dy$ is called the joint probability density:
$$P((x, y) \in A) = \int_A p_{(x,y)}(x, y)\,dx\,dy.$$

Independent random variables
Two random variables $x$ and $y$ are called independent if the events $(a < x < b)$ and $(c < y < d)$ are independent for any $a < b$ and $c < d$. If $x$ and $y$ are independent, then by taking $A = (a, b) \times (c, d)$ we can show that
$$\int_{(a,b)\times(c,d)} p_{(x,y)}(x, y)\,dx\,dy = P(a < x < b \text{ and } c < y < d) = P(a < x < b)\,P(c < y < d) = \left(\int_a^b p_x(x)\,dx\right)\left(\int_c^d p_y(y)\,dy\right).$$
This yields $p_{(x,y)}(x, y) = p_x(x)\,p_y(y)$.
If $x$ and $y$ are independent, then $E[xy] = E[x]E[y]$.
The covariance of $x$ and $y$ is defined as
$$\mathrm{cov}[x, y] := E[(x - E[x])(y - E[y])].$$
Two random variables are called uncorrelated if their covariance is 0.

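Sum of independent random variables
If $x$ and $y$ are independent, then
$$F_{x+y}(z) = P(x + y < z) = \int_{-\infty}^{\infty} dy \int_{-\infty}^{z - y} dx\; p_x(x)\,p_y(y).$$
We differentiate this in $z$ and get
$$p_{x+y}(z) = \int_{-\infty}^{\infty} p_x(z - y)\,p_y(y)\,dy =: (p_x * p_y)(z).$$
Consider $n$ independent, identically distributed (i.i.d.) random variables $\{x_i\}_{i=1}^{n}$ with mean $\mu$ and variance $\sigma^2$. Their average is
$$\bar{x}_n := \frac{1}{n}(x_1 + \dots + x_n),$$
which has mean $\mu$ and variance $\sigma^2/n$:
$$E[(\bar{x}_n - \mu)^2] = E\left[\left(\frac{1}{n}\sum_{i=1}^{n} x_i - \mu\right)^2\right] = E\left[\left(\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)\right)^2\right] = E\left[\frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}(x_i - \mu)(x_j - \mu)\right] = E\left[\frac{1}{n^2}\sum_{i=1}^{n}(x_i - \mu)^2\right] = \frac{\sigma^2}{n}.$$
The cross terms with $i \ne j$ have zero expectation by independence.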
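The convolution formula has a direct discrete analogue, with the integral replaced by a sum over values. The sketch below assumes two independent fair dice as the example; the function `convolve` and the dictionary representation of a pmf are illustrative, not taken from the slides.

```python
from fractions import Fraction

def convolve(p, q):
    """pmf of x + y for independent x and y with pmfs p and q given as {value: probability}."""
    out = {}
    for a, pa in p.items():
        for b, pb in q.items():
            # Independence: P(x = a, y = b) = P(x = a) * P(y = b).
            out[a + b] = out.get(a + b, 0) + pa * pb
    return out

die = {k: Fraction(1, 6) for k in range(1, 7)}
two_dice = convolve(die, die)
print(two_dice)
```

Applying `convolve` repeatedly gives the pmf of a sum of several independent variables.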

Limits of sums of random variables
- Moments and tails
- Gaussian, subgaussian, subexponential distributions
- Law of large numbers, central limit theorem
- Concentration inequalities

Tails and moments: Markov inequality
Theorem (Markov's inequality) Let $x$ be a random variable. Then
$$P(|x| \ge t) \le \frac{E[|x|]}{t} \quad \text{for all } t > 0.$$
Proof. Note that $P(|x| \ge t) = E[I_{\{|x| \ge t\}}]$, where
$$I_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A, \\ 0 & \text{otherwise} \end{cases}$$
is the indicator function of $A$, which satisfies
$$I_{\{|x| \ge t\}} \le \frac{|x|}{t}.$$
Thus,
$$P(|x| \ge t) = E[I_{\{|x| \ge t\}}] \le \frac{E[|x|]}{t}.$$

Remarks.
For $p > 0$,
$$P(|x| \ge t) = P(|x|^p \ge t^p) \le \frac{E[|x|^p]}{t^p}.$$
For $p = 2$, applying Markov's inequality to $x - \mu$, we obtain the Chebyshev inequality:
$$P(|x - \mu|^2 \ge t^2) \le \frac{\sigma^2}{t^2}.$$
For $\theta > 0$,
$$P(x \ge t) = P(\exp(\theta x) \ge \exp(\theta t)) \le \exp(-\theta t)\,E[\exp(\theta x)].$$
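To get a feel for how the three bounds above compare, here is a sketch for the exponential distribution with $\lambda = 1$, chosen because everything is available in closed form: the tail is $e^{-t}$, the mean and variance are both 1, and $E[\exp(\theta x)] = 1/(1 - \theta)$ for $\theta < 1$. The function names are illustrative; the exponential-moment bound is minimized by hand at $\theta = 1 - 1/t$, valid for $t > 1$.

```python
from math import exp

def exact_tail(t):
    """P(x >= t) for the exponential density with lambda = 1."""
    return exp(-t)

def markov_bound(t):
    """E[|x|] / t with E[x] = 1."""
    return 1.0 / t

def chebyshev_bound(t):
    """P(x >= t) <= P(|x - 1| >= t - 1) <= sigma^2 / (t - 1)^2 with sigma^2 = 1, for t > 1."""
    return 1.0 / (t - 1) ** 2

def exponential_moment_bound(t):
    """min over 0 < theta < 1 of exp(-theta t) / (1 - theta), attained at theta = 1 - 1/t (t > 1)."""
    return t * exp(1.0 - t)

for t in (2.0, 5.0, 10.0):
    print(t, exact_tail(t), markov_bound(t), chebyshev_bound(t), exponential_moment_bound(t))
```

For large $t$ the exponential-moment bound is the only one of the three that decays exponentially, which is the reason it drives the concentration inequalities later in the chapter.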

Law of large numbers
Theorem (Law of large numbers) Let $x_1, \dots, x_n$ be i.i.d. with mean $\mu$ and variance $\sigma^2$. Then the sample average
$$\bar{x}_n := \frac{1}{n}(x_1 + \dots + x_n)$$
converges in probability to its expected value:
$$\bar{x}_n - \mu \xrightarrow{P} 0 \quad \text{as } n \to \infty.$$
That is, for any $\epsilon > 0$,
$$\lim_{n \to \infty} P(|\bar{x}_n - \mu| \ge \epsilon) = 0.$$

Proof.
1. Since the $x_j$ are i.i.d., the mean and variance of $\bar{x}_n$ are $E[\bar{x}_n] = \mu$ and
$$\sigma^2(\bar{x}_n) = E[(\bar{x}_n - \mu)^2] = E\left[\left(\frac{1}{n}\sum_{j=1}^{n}(x_j - \mu)\right)^2\right] = \frac{1}{n^2}\sum_{j,k=1}^{n} E[(x_j - \mu)(x_k - \mu)] = \frac{1}{n^2}\sum_{j=1}^{n} E[(x_j - \mu)^2] = \frac{1}{n}\sigma^2.$$
2. We apply Chebyshev's inequality to $\bar{x}_n$:
$$P(|\bar{x}_n - \mu| \ge \epsilon) = P(|\bar{x}_n - \mu|^2 \ge \epsilon^2) \le \frac{\sigma(\bar{x}_n)^2}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \to 0 \quad \text{as } n \to \infty.$$
Remarks.
- The condition on the variance can be removed. It is assumed here because the proof uses Chebyshev's inequality, which requires finite variance.
- No convergence rate is given here. Concentration inequalities provide rate estimates, which need tail control.
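For Bernoulli trials the deviation probability in the theorem can be summed directly from the binomial pmf, so the Chebyshev estimate $\sigma^2/(n\epsilon^2)$ can be compared with the true value. This is a sketch under the assumption of fair coin flips ($p = 1/2$); the function names and the choice $\epsilon = 1/16$ are illustrative.

```python
from math import comb

def deviation_probability(n, p, eps):
    """P(|S_n / n - p| >= eps) for S_n ~ Binomial(n, p), summed from the pmf."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n + 1) if abs(k / n - p) >= eps)

def chebyshev_bound(n, p, eps):
    """sigma^2 / (n eps^2) with sigma^2 = p (1 - p)."""
    return p * (1 - p) / (n * eps ** 2)

p, eps = 0.5, 0.0625   # eps = 1/16 keeps every k/n strictly away from the threshold
for n in (10, 100, 1000):
    print(n, deviation_probability(n, p, eps), chebyshev_bound(n, p, eps))
```

The deviation probability shrinks much faster than the Chebyshev estimate as $n$ grows, which is what the concentration inequalities later quantify.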

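Tail probability can be controlled by moments
Lemma (Markov) For $p > 0$,
$$P(|x| \ge t) = P(|x|^p \ge t^p) \le \frac{E[|x|^p]}{t^p}.$$
Proposition If $x$ is a random variable satisfying
$$E[|x|^p] \le \alpha^p \beta\, p^{p/2} \quad \text{for all } p \ge 2,$$
then
$$P(|x| \ge e^{1/2}\alpha u) \le \beta e^{-u^2/2} \quad \text{for all } u \ge \sqrt{2}.$$
Proof.
1. Use Markov's inequality:
$$P(|x| \ge \sqrt{e}\,\alpha u) \le \frac{E[|x|^p]}{(\sqrt{e}\,\alpha u)^p} \le \beta\left(\frac{\alpha\sqrt{p}}{\sqrt{e}\,\alpha u}\right)^p.$$
2. Choosing $p = u^2$ (which requires $u^2 \ge 2$), the right-hand side becomes $\beta e^{-u^2/2}$, which is the estimate.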

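Moments can be controlled by tail probability
Proposition The moments of a random variable $x$ can be expressed as
$$E[|x|^p] = p\int_0^{\infty} P(|x| \ge t)\,t^{p-1}\,dt, \qquad p > 0.$$
Proof.
1. Use Fubini's theorem, writing $s$ for the integration variable:
$$E[|x|^p] = \int_{\Omega} |x|^p\,dP = \int_{\Omega}\int_0^{|x|^p} 1\,ds\,dP = \int_{\Omega}\int_0^{\infty} I_{\{|x|^p \ge s\}}\,ds\,dP = \int_0^{\infty}\int_{\Omega} I_{\{|x|^p \ge s\}}\,dP\,ds = \int_0^{\infty} P(|x|^p \ge s)\,ds$$
$$= p\int_0^{\infty} P(|x|^p \ge t^p)\,t^{p-1}\,dt = p\int_0^{\infty} P(|x| \ge t)\,t^{p-1}\,dt,$$
where the last line uses the substitution $s = t^p$.
2. Here $I_{\{|x|^p \ge s\}}$ is a random variable which is 1 when $|x|^p \ge s$ and 0 otherwise.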
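The tail-integral formula can be tested numerically on a distribution whose moments are known. The sketch below assumes the exponential distribution with $\lambda = 1$, for which $P(|x| \ge t) = e^{-t}$ and $E[x^p] = \Gamma(p + 1)$; the midpoint rule, the truncation point `upper`, and the step count are illustrative choices.

```python
from math import exp, gamma

def moment_from_tail(p, tail, upper=60.0, steps=100_000):
    """Midpoint-rule approximation of p * integral_0^upper tail(t) * t^(p-1) dt."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h                  # midpoint of the i-th subinterval
        total += tail(t) * t ** (p - 1)
    return p * total * h

tail = lambda t: exp(-t)   # P(|x| >= t) for the exponential density with lambda = 1

for p in (2, 3):
    print(p, moment_from_tail(p, tail), gamma(p + 1))
```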

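Moments can be controlled by tail probability
Proposition Suppose $x$ is a random variable satisfying
$$P(|x| \ge e^{1/2}\alpha u) \le \beta e^{-u^2/2} \quad \text{for all } u > 0.$$
Then for all $p > 0$,
$$E[|x|^p] \le \beta\alpha^p (2e)^{p/2}\,\Gamma\left(\frac{p}{2} + 1\right).$$
Proposition If $P(|x| \ge t) \le \beta e^{-\kappa t^2}$ for all $t > 0$, then
$$E[|x|^n] \le \frac{n\beta}{2}\,\kappa^{-n/2}\,\Gamma\left(\frac{n}{2}\right).$$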

Subgaussian and subexponential distributions
Definition A random variable $x$ is called subgaussian if there exist constants $\beta, \kappa > 0$ such that
$$P(|x| \ge t) \le \beta e^{-\kappa t^2} \quad \text{for all } t > 0.$$
It is called subexponential if there exist constants $\beta, \kappa > 0$ such that
$$P(|x| \ge t) \le \beta e^{-\kappa t} \quad \text{for all } t > 0.$$
Notice that $x$ is subgaussian if and only if $x^2$ is subexponential.

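Subgaussian
Proposition A random variable $x$ is subgaussian if and only if there exist $c, C > 0$ such that $E[\exp(c x^2)] \le C$.
Proof.
($\Rightarrow$)
1. Estimating moments from the tail, we get $E[x^{2n}] \le \beta\kappa^{-n} n!$.
2. Expand the exponential function:
$$E[\exp(c x^2)] = 1 + \sum_{n=1}^{\infty}\frac{c^n E[x^{2n}]}{n!} \le 1 + \beta\sum_{n=1}^{\infty}\frac{c^n\kappa^{-n} n!}{n!} \le C,$$
where the geometric series converges provided $c < \kappa$.
($\Leftarrow$) From Markov's inequality,
$$P(|x| \ge t) = P(\exp(c x^2) \ge e^{c t^2}) \le E[\exp(c x^2)]\,e^{-c t^2} \le C e^{-c t^2}.$$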

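Subgaussian with mean 0
Proposition A random variable $x$ is subgaussian with $E[x] = 0$ if and only if there exists $c > 0$ such that
$$E[\exp(\theta x)] \le \exp(c\theta^2) \quad \text{for all } \theta \in \mathbb{R}.$$
Proof ($\Leftarrow$).
1. Apply Markov's inequality: for $\theta > 0$,
$$P(x \ge t) = P(\exp(\theta x) \ge \exp(\theta t)) \le E[\exp(\theta x)]\,e^{-\theta t} \le e^{c\theta^2 - \theta t}.$$
The optimal choice $\theta = t/(2c)$ yields $P(x \ge t) \le e^{-t^2/(4c)}$.
2. Repeating this for $-x$, we also get $P(-x \ge t) \le e^{-t^2/(4c)}$. Thus,
$$P(|x| \ge t) = P(x \ge t) + P(-x \ge t) \le 2e^{-t^2/(4c)}.$$
3. To show $E[x] = 0$, we use
$$1 + \theta E[x] \le E[\exp(\theta x)] \le e^{c\theta^2}.$$
Letting $\theta \to 0$ from both sides, we obtain $E[x] = 0$.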

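Proof ($\Rightarrow$).
1. It is enough to prove the statement for $\theta \ge 0$. For $\theta < 0$, we replace $x$ by $-x$.
2. For small $\theta$, $0 \le \theta < \theta_0$, expand the exponential and use $E[x] = 0$, the moment estimate via the tail (with a constant $C$), and Stirling's formula:
$$E[\exp(\theta x)] = 1 + \sum_{n=2}^{\infty}\frac{\theta^n E[x^n]}{n!} \le 1 + \beta\sum_{n=2}^{\infty}\frac{\theta^n C^n\kappa^{-n/2} n^{n/2}}{n!} \le 1 + \theta^2\,\frac{\beta(Ce)^2}{\sqrt{2\pi}\,\kappa}\sum_{n=0}^{\infty}\left(\frac{Ce\theta_0}{\sqrt{\kappa}}\right)^n$$
$$= 1 + \theta^2\,\frac{\beta(Ce)^2}{\sqrt{2\pi}\,\kappa}\cdot\frac{1}{1 - Ce\theta_0\kappa^{-1/2}} = 1 + c_1\theta^2 \le \exp(c_1\theta^2).$$
Here $\theta_0 = \sqrt{\kappa}/(2Ce)$, which satisfies $Ce\theta_0\kappa^{-1/2} = 1/2 < 1$.
3. For $\theta \ge \theta_0$, we bound $E[\exp(\theta x - c_2\theta^2)]$. Let $\tilde{c} > 0$ and $\tilde{C} \ge 1$ be constants with $E[\exp(\tilde{c}x^2)] \le \tilde{C}$, which exist because $x$ is subgaussian, and set $c_2 = 1/(4\tilde{c})$. With $q := \sqrt{c_2}\,\theta - x/(2\sqrt{c_2})$ we have $\theta x - c_2\theta^2 = -q^2 + x^2/(4c_2)$, so
$$E[\exp(\theta x - c_2\theta^2)] = E\left[\exp\left(-q^2 + \frac{x^2}{4c_2}\right)\right] \le E\left[\exp\left(\frac{x^2}{4c_2}\right)\right] = E[\exp(\tilde{c}x^2)] \le \tilde{C}.$$
4. Defining $\rho = \ln(\tilde{C})\,\theta_0^{-2}$ yields
$$E[\exp(\theta x)] \le \tilde{C}e^{c_2\theta^2} = \tilde{C}e^{(-\rho + (\rho + c_2))\theta^2} \le \tilde{C}e^{-\rho\theta_0^2}\,e^{(\rho + c_2)\theta^2} = e^{(\rho + c_2)\theta^2}.$$
Setting $c_3 = \max(c_1, c_2 + \rho)$ completes the proof.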

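Bounded random variable
Corollary If a random variable $x$ has mean 0 and $|x| \le B$ almost surely, then
$$E[\exp(\theta x)] \le \exp(B^2\theta^2/2).$$
Proof.
1. We write $x = (-B)t + (1 - t)B$, where $t = (B - x)/(2B)$ is a random variable with $0 \le t \le 1$ and $E[t] = 1/2$.
2. By Jensen's inequality (convexity of the exponential), $e^{\theta x} \le t e^{-B\theta} + (1 - t)e^{B\theta}$. Taking expectations,
$$E[\exp(\theta x)] \le \frac{1}{2}e^{-B\theta} + \frac{1}{2}e^{B\theta} = \sum_{n=0}^{\infty}\frac{(\theta B)^{2n}}{(2n)!} \le \sum_{n=0}^{\infty}\frac{(\theta B)^{2n}}{2^n n!} = \exp(B^2\theta^2/2).$$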
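The last step of this proof rests on the termwise inequality $(2n)! \ge 2^n n!$, and the corollary is tight in the sense that the symmetric two-point variable $x = \pm B$ attains the intermediate bound $\cosh(B\theta)$. The following sketch checks both numerically; the grid of $\theta$ values and the value of `B` are arbitrary illustrative choices.

```python
from math import cosh, exp, factorial

def mgf_symmetric_two_point(B, theta):
    """E[exp(theta x)] when x = +B or -B with probability 1/2 each."""
    return cosh(B * theta)

def mgf_bound(B, theta):
    """Right-hand side of the corollary: exp(B^2 theta^2 / 2)."""
    return exp(B * B * theta * theta / 2)

B = 2.0
thetas = [i / 10 for i in range(-50, 51)]
bound_holds = all(mgf_symmetric_two_point(B, th) <= mgf_bound(B, th) * (1 + 1e-12)
                  for th in thetas)
termwise = all(factorial(2 * n) >= 2 ** n * factorial(n) for n in range(20))
print(bound_holds, termwise)
```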

Exponentially decaying tails, cumulant function $\ln E[\exp(\theta x)]$
When the tail decays fast, the corresponding moment information can be grouped into $E[\exp(\theta x)]$. The function
$$C_x(\theta) := \ln E[\exp(\theta x)]$$
is called the cumulant function of $x$.

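Example of $C_x$
Let $g \sim N(0, 1)$. Then
$$E[\exp(a g^2 + \theta g)] = \frac{1}{\sqrt{1 - 2a}}\exp\left(\frac{\theta^2}{2(1 - 2a)}\right), \quad \text{for } a < 1/2,\ \theta \in \mathbb{R}.$$
This is from
$$E[\exp(a g^2 + \theta g)] = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\exp(a x^2 + \theta x)\exp(-x^2/2)\,dx.$$
In particular,
$$E[\exp(\theta g)] = \exp\left(\frac{\theta^2}{2}\right) = \sum_{n=0}^{\infty}\frac{\theta^{2n}}{2^n n!}.$$
On the other hand,
$$E[\exp(\theta g)] = \sum_{j=0}^{\infty}\frac{\theta^j E[g^j]}{j!}.$$
By comparing the two expansions for $E[\exp(\theta g)]$, we obtain
$$E[g^{2n+1}] = 0, \qquad E[g^{2n}] = \frac{(2n)!}{2^n n!}, \qquad n = 0, 1, \dots$$
$$C_g(\theta) := \ln E[\exp(\theta g)] = \frac{\theta^2}{2}.$$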
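The moment formula $E[g^{2n}] = (2n)!/(2^n n!)$ can be cross-checked against a direct numerical integration of $x^k$ against the standard Gaussian density. This is a sketch; the midpoint rule, the integration window `half_width`, and the step count are illustrative assumptions.

```python
from math import exp, factorial, pi, sqrt

def gaussian_moment_formula(n):
    """E[g^(2n)] = (2n)! / (2^n n!) for g ~ N(0, 1)."""
    return factorial(2 * n) // (2 ** n * factorial(n))

def gaussian_moment_numeric(k, half_width=12.0, steps=20_000):
    """Midpoint-rule approximation of (1 / sqrt(2 pi)) * integral of x^k exp(-x^2 / 2) dx."""
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * h
        total += x ** k * exp(-x * x / 2)
    return total * h / sqrt(2 * pi)

print([gaussian_moment_formula(n) for n in range(5)])
print(gaussian_moment_numeric(4), gaussian_moment_numeric(3))
```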

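Examples of $C_x$
Let the random variable $x$ have the pdf $\chi_{[-B,B]}/(2B)$. Then
$$E[\exp(\theta x)] = \frac{1}{2B}\int_{-B}^{B}\exp(\theta x)\,dx = \frac{e^{B\theta} - e^{-B\theta}}{2B\theta} = \sum_{n=0}^{\infty}\frac{(B\theta)^{2n}}{(2n + 1)!}.$$
On the other hand, $E[\exp(\theta x)] = \sum_{k=0}^{\infty}\frac{\theta^k E[x^k]}{k!}$. By comparing these two expansions, we obtain
$$E[x^{2n+1}] = 0, \qquad E[x^{2n}] = \frac{B^{2n}}{2n + 1}, \qquad n = 0, 1, \dots$$
$$C_x(\theta) = \ln\left(e^{B\theta} - e^{-B\theta}\right) - \ln\theta + C \quad (\theta > 0), \qquad C = -\ln(2B).$$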

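Examples of $C_x$
The pdf of the Rademacher distribution is $p_\epsilon(x) = (\delta(x + 1) + \delta(x - 1))/2$. Then
$$E[\exp(\theta\epsilon)] = \frac{1}{2}\int e^{\theta x}(\delta(x + 1) + \delta(x - 1))\,dx = \frac{e^{\theta} + e^{-\theta}}{2} = \sum_{n=0}^{\infty}\frac{\theta^{2n}}{(2n)!}.$$
Thus, we get
$$E[\epsilon^{2n+1}] = 0, \qquad E[\epsilon^{2n}] = 1, \qquad n = 1, 2, \dots$$
$$C_\epsilon(\theta) = \ln(e^{\theta} + e^{-\theta}) + C, \qquad C = -\ln 2.$$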

Concentration inequalities
Motivation: in the law of large numbers, if the distribution satisfies a suitable decay condition, e.g. its tail decays exponentially fast or it even has finite support, then we have sharp estimates of how fast $\bar{x}_n$ tends to $\mu$. This is the large deviation theory below. The rate is controlled by the cumulant function $C_x(\theta) := \ln E[\exp(\theta x)]$.
Theorem (Cramér's theorem) Let $x_1, \dots, x_n$ be independent random variables with cumulant-generating functions $C_{x_l}$. Then for $x > 0$,
$$P\left(\frac{1}{n}\sum_{l=1}^{n} x_l \ge x\right) \le \exp(-nI(x)), \qquad I(x) := \sup_{\theta > 0}\left[\theta x - \frac{1}{n}\sum_{l=1}^{n} C_{x_l}(\theta)\right].$$

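Proof.
1. By Markov's inequality and the independence of the $x_l$, for any $\theta > 0$,
$$P(\bar{x}_n \ge x) = P(\exp(\theta\bar{x}_n) \ge \exp(\theta x)) \le e^{-\theta x}E[\exp(\theta\bar{x}_n)] = e^{-\theta x}E\left[\exp\left(\sum_{l=1}^{n}\frac{\theta x_l}{n}\right)\right]$$
$$= e^{-\theta x}E\left[\prod_{l=1}^{n}\exp\left(\frac{\theta x_l}{n}\right)\right] = e^{-\theta x}\prod_{l=1}^{n}E\left[\exp\left(\frac{\theta x_l}{n}\right)\right]$$
$$= e^{-\theta x}\exp\left(\sum_{l=1}^{n}\ln E\left[\exp\left(\frac{\theta x_l}{n}\right)\right]\right) = \exp\left(-\theta x + \sum_{l=1}^{n}C_{x_l}\left(\frac{\theta}{n}\right)\right)$$
$$= \exp\left(-n\left(\theta' x - \frac{1}{n}\sum_{l=1}^{n}C_{x_l}(\theta')\right)\right), \qquad \theta' := \theta/n.$$
2. Since $\theta' > 0$ is arbitrary, taking the infimum of the right-hand side over $\theta'$ gives $P(\bar{x}_n \ge x) \le \exp(-nI(x))$. Here,
$$I(x) = \sup_{\theta > 0}\left[\theta x - \frac{1}{n}\sum_{l=1}^{n}C_{x_l}(\theta)\right].$$
Remark. If each $x_l$ is subgaussian with mean 0, then $C_{x_l}(\theta) \le c\theta^2$. This leads to $I(x) \ge x^2/(4c)$.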
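As a concrete instance of the rate function, take all $x_l$ to be Rademacher variables, so that $C_{x_l}(\theta) = \ln\cosh\theta$ and $I(x) = \sup_{\theta > 0}[\theta x - \ln\cosh\theta]$. The sketch below approximates the supremum on a grid and compares it with the closed form $\frac{1}{2}[(1 + x)\ln(1 + x) + (1 - x)\ln(1 - x)]$ (obtained by setting the derivative to zero, $\theta = \operatorname{artanh} x$) and with the subgaussian lower bound $x^2/(4c)$ for $c = 1/2$. The grid range and resolution are assumptions chosen for illustration.

```python
from math import cosh, log

def cumulant_rademacher(theta):
    """ln E[exp(theta * eps)] = ln cosh(theta) for a Rademacher variable eps."""
    return log(cosh(theta))

def rate_function(x, theta_max=10.0, steps=100_000):
    """Grid approximation of I(x) = sup_{theta > 0} [theta x - C(theta)] for identical summands."""
    best = 0.0
    for i in range(1, steps + 1):
        theta = theta_max * i / steps
        best = max(best, theta * x - cumulant_rademacher(theta))
    return best

def rate_closed_form(x):
    """Value of the supremum at theta = artanh(x), valid for 0 < x < 1."""
    return 0.5 * ((1 + x) * log(1 + x) + (1 - x) * log(1 - x))

x = 0.5
print(rate_function(x), rate_closed_form(x), x * x / 2)
```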

Hoeffding inequality
Theorem (Hoeffding inequality) Let $x_1, \dots, x_n$ be independent random variables with mean 0 and $x_l \in [-B, B]$ almost surely for $l = 1, \dots, n$. Then
$$P(\bar{x}_n > x) \le \exp(-nI(x)).$$
Here,
$$I(x) := \frac{2n x^2}{\sum_{i=1}^{n}(2B)^2} = \frac{x^2}{2B^2}.$$

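Proof.
1. Let $S_n = \sum_{l=1}^{n} x_l$. Use Markov's inequality: for $\theta > 0$,
$$P(S_n > t) \le e^{-t\theta}E[\exp(\theta S_n)] = e^{-t\theta}E\left[\prod_{l=1}^{n}\exp(\theta x_l)\right] = e^{-t\theta}\prod_{l=1}^{n}E[\exp(\theta x_l)]$$
$$\le e^{-t\theta}\prod_{l=1}^{n}\exp\left(\frac{B^2}{2}\theta^2\right) = \exp\left(-t\theta + \sum_{l=1}^{n}\frac{B^2}{2}\theta^2\right).$$
2. Write $t = nx$ and take the convex conjugate:
$$I(x) := \sup_{\theta}\left(x\theta - \frac{1}{2}B^2\theta^2\right) = \frac{x^2}{2B^2}.$$
Since $P(S_n > nx) \le \exp\left(-n\left(x\theta - \frac{B^2}{2}\theta^2\right)\right)$ for every $\theta > 0$, choosing the maximizer $\theta = x/B^2$ gives $P(S_n > nx) \le \exp(-nI(x))$.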
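For Rademacher summands ($B = 1$) the tail of the sample mean can be computed from binomial coefficients, which allows a direct comparison with the Hoeffding bound $\exp(-n x^2/(2B^2))$. This is an illustrative sketch; the function names and the chosen values of $n$ and $x$ are assumptions.

```python
from math import comb, exp

def rademacher_mean_tail(n, x):
    """P(mean of n independent +-1 signs > x), computed from binomial coefficients."""
    # S_n = 2K - n with K ~ Binomial(n, 1/2), so S_n > n x  <=>  2K - n > n x.
    return sum(comb(n, k) for k in range(n + 1) if 2 * k - n > n * x) / 2 ** n

def hoeffding_bound(n, x, B=1.0):
    """exp(-n I(x)) with I(x) = x^2 / (2 B^2)."""
    return exp(-n * x * x / (2 * B * B))

for n in (10, 50, 200):
    for x in (0.1, 0.3, 0.5):
        print(n, x, rademacher_mean_tail(n, x), hoeffding_bound(n, x))
```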

Bernstein's inequality
Theorem (Bernstein's inequality) Let $x_1, \dots, x_n$ be independent random variables with mean 0 and variance $\sigma_l^2$, and $x_l \in [-B, B]$ almost surely for $l = 1, \dots, n$. Then
$$P\left(\sum_{l=1}^{n} x_l > t\right) \le \exp\left(-\frac{t^2/2}{\sigma^2 + Bt/3}\right),$$
$$P\left(\left|\sum_{l=1}^{n} x_l\right| > t\right) \le 2\exp\left(-\frac{t^2/2}{\sigma^2 + Bt/3}\right),$$
where $\sigma^2 = \sum_{l=1}^{n}\sigma_l^2$.

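Remark. Hoeffding's inequality does not use the variance information. Its concentration estimate is
$$P(\bar{x}_n > \epsilon) \le \exp\left(-n\frac{\epsilon^2}{2B^2}\right).$$
Bernstein's inequality uses the variance information. Let $\bar{\sigma}^2 := \frac{1}{n}\sum_{l=1}^{n}\sigma_l^2$ be the average variance. With $t = n\epsilon$, Bernstein's inequality reads
$$P(\bar{x}_n > \epsilon) \le \exp\left(-n\frac{\epsilon^2/2}{\bar{\sigma}^2 + B\epsilon/3}\right).$$
Comparing the denominators $2B^2$ and $2(\bar{\sigma}^2 + B\epsilon/3)$, Bernstein's inequality is sharper when $\bar{\sigma}^2 + B\epsilon/3 < B^2$, for example when $\bar{\sigma}$ is much smaller than $B$ and $\epsilon$ is small.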
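The comparison in this remark is easy to tabulate. The sketch below evaluates both bounds for the sample mean, first with a small average variance and then with the largest variance compatible with $|x_l| \le B$; the parameter name `sigma2_bar` for the average variance and the numerical values are illustrative assumptions.

```python
from math import exp

def hoeffding_bound(n, eps, B):
    """exp(-n eps^2 / (2 B^2)); uses only the range [-B, B]."""
    return exp(-n * eps ** 2 / (2 * B ** 2))

def bernstein_bound(n, eps, B, sigma2_bar):
    """exp(-n (eps^2 / 2) / (sigma2_bar + B eps / 3)); sigma2_bar is the average variance."""
    return exp(-n * (eps ** 2 / 2) / (sigma2_bar + B * eps / 3))

n, eps, B = 1000, 0.05, 1.0
small_variance = bernstein_bound(n, eps, B, sigma2_bar=0.01)
large_variance = bernstein_bound(n, eps, B, sigma2_bar=1.0)
print(hoeffding_bound(n, eps, B), small_variance, large_variance)
```

With a small variance the Bernstein bound is far smaller; when the variance is as large as $B^2$, the extra term $B\epsilon/3$ makes it slightly weaker than Hoeffding's.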

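Proof.
1. For a random variable $x_l$ which has mean 0, variance $\sigma_l^2$ and $|x_l| \le B$ almost surely, its moment generating function $E[\exp(\theta x_l)]$ satisfies, for $\theta \ge 0$,
$$E[\exp(\theta x_l)] = E\left[\sum_{k=0}^{\infty}\frac{\theta^k x_l^k}{k!}\right] \le E\left[1 + \sum_{k=2}^{\infty}\frac{\theta^k |x_l|^2 B^{k-2}}{k!}\right] = 1 + \frac{\theta^2\sigma_l^2}{2}\sum_{k=2}^{\infty}\frac{2(\theta B)^{k-2}}{k!} = 1 + \frac{\theta^2\sigma_l^2}{2}F_l(\theta) \le \exp(\theta^2\sigma_l^2 F_l(\theta)/2).$$
Here, using $k! \ge 2\cdot 3^{k-2}$,
$$F_l(\theta) = \sum_{k=2}^{\infty}\frac{2(\theta B)^{k-2}}{k!} \le \sum_{k=2}^{\infty}\frac{(\theta B)^{k-2}}{3^{k-2}} = \frac{1}{1 - B\theta/3} =: \frac{1}{1 - R\theta},$$
where $R := B/3$. We require $0 \le \theta < 1/R$.
2. Let $S_n = \sum_{l=1}^{n} x_l$. Using Cramér's theorem,
$$P(S_n > t) \le e^{-t\theta}\prod_{l=1}^{n}E[\exp(\theta x_l)] \le e^{-\theta t}\exp\left(\sum_{l=1}^{n}\theta^2\sigma_l^2 F_l(\theta)/2\right) \le \exp\left(-\theta t + \frac{\sigma^2\theta^2}{2}\cdot\frac{1}{1 - R\theta}\right).$$
Here, $\sigma^2 = \sum_{l=1}^{n}\sigma_l^2$.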

Proof (cont.)
3. Choose $\theta = t/(\sigma^2 + Rt)$, which satisfies $\theta < 1/R$. We then get
$$P(S_n > t) \le \exp\left(-\frac{t^2}{\sigma^2 + Rt} + \frac{t^2\sigma^2}{2(\sigma^2 + Rt)^2}\cdot\frac{1}{1 - \frac{Rt}{\sigma^2 + Rt}}\right) = \exp\left(-\frac{t^2}{2(\sigma^2 + Rt)}\right). \qquad (0.1)$$
4. Replacing $x_l$ by $-x_l$ yields the same estimate. We then get the estimate for $P(|S_n| > t)$.
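The algebra in step 3 can be sanity-checked numerically: evaluating the exponent $-\theta t + \sigma^2\theta^2/(2(1 - R\theta))$ at $\theta = t/(\sigma^2 + Rt)$ should reproduce $-t^2/(2(\sigma^2 + Rt))$. The sketch below does this for one arbitrary choice of parameters; the function name and the numbers are illustrative assumptions.

```python
def bernstein_exponent(theta, t, sigma2, R):
    """Exponent from step 2: -theta t + sigma^2 theta^2 / (2 (1 - R theta)), for 0 <= theta < 1/R."""
    return -theta * t + sigma2 * theta ** 2 / (2 * (1 - R * theta))

t, sigma2, R = 3.0, 2.0, 0.5
theta_star = t / (sigma2 + R * t)                 # the choice made in step 3
closed_form = -t ** 2 / (2 * (sigma2 + R * t))    # exponent in (0.1)

print(theta_star, bernstein_exponent(theta_star, t, sigma2, R), closed_form)
```

This choice of $\theta$ is a convenient one that yields a clean formula; the check confirms the simplification, not that it is the exact minimizer of the exponent.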

Bernstein inequality for subexponential random variables
We can extend Bernstein's inequality to random variables that are unbounded but whose tails decay exponentially fast, that is, to subexponential random variables.
Corollary Let $x_1, \dots, x_n$ be independent mean-0 subexponential random variables, i.e. $P(|x_l| \ge t) \le \beta e^{-\kappa t}$ for some constants $\beta, \kappa > 0$ and all $t > 0$. Then
$$P\left(\left|\sum_{l=1}^{n} x_l\right| \ge t\right) \le 2\exp\left(-\frac{(\kappa t)^2/2}{2\beta n + \kappa t}\right).$$

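Proof.
1. For a subexponential random variable $x$ and $k \ge 2$,
$$E[|x|^k] = k\int_0^{\infty}P(|x| \ge t)\,t^{k-1}\,dt \le \beta k\int_0^{\infty}e^{-\kappa t}t^{k-1}\,dt = \beta k\kappa^{-k}\int_0^{\infty}e^{-u}u^{k-1}\,du = \beta\kappa^{-k}k!.$$
2. Using this estimate and $E[x_l] = 0$, we get, for $0 \le \theta < \kappa$,
$$E[\exp(\theta x_l)] = E\left[\sum_{k=0}^{\infty}\frac{\theta^k x_l^k}{k!}\right] \le 1 + \sum_{k=2}^{\infty}\frac{\theta^k\beta\kappa^{-k}k!}{k!} = 1 + \beta\frac{\theta^2\kappa^{-2}}{1 - \theta\kappa^{-1}} \le \exp\left(\beta\frac{\theta^2\kappa^{-2}}{1 - \theta\kappa^{-1}}\right).$$
3. Using Cramér's inequality, we have
$$P(S_n \ge t) \le \exp\left(-\theta t + n\beta\frac{\theta^2\kappa^{-2}}{1 - \kappa^{-1}\theta}\right).$$
Comparing this formula with (0.1), with $R = 1/\kappa$ and $\sigma^2 = 2n\beta\kappa^{-2}$, we get
$$P(S_n \ge t) \le \exp\left(-\frac{t^2}{2(\sigma^2 + Rt)}\right) = \exp\left(-\frac{t^2}{2(2n\beta\kappa^{-2} + \kappa^{-1}t)}\right) = \exp\left(-\frac{(\kappa t)^2/2}{2n\beta + \kappa t}\right).$$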