MASSACHUSETTS INSTITUTE OF TECHNOLOGY
6.265/15.070J Fall 2013
Lecture 12, 10/21/2013

Martingale Concentration Inequalities and Applications


Content.

1. Exponential concentration for martingales with bounded increments
2. Concentration for Lipschitz continuous functions
3. Examples in statistics and random graph theory

1 Azuma-Hoeffding inequality

Suppose $X_n$ is a martingale with respect to a filtration $\mathcal{F}_n$ such that $X_0 = 0$. The goal of this lecture is to obtain bounds of the form $P(|X_n| \ge \delta n) \le \exp(-\Theta(n))$ under some condition on $X_n$. Note that since $E[X_n] = 0$, deviation from zero is the right regime in which to look for rare events. It turns out that an exponential bound of the form above holds under the very simple assumption that the increments of $X_n$ are bounded. The theorem below is known as the Azuma-Hoeffding inequality.

Theorem 1 (Azuma-Hoeffding inequality). Suppose $X_n$, $n \ge 1$, is a martingale such that $X_0 = 0$ and $|X_i - X_{i-1}| \le d_i$, $1 \le i \le n$, almost surely for some constants $d_i$, $1 \le i \le n$. Then, for every $t > 0$,
$$
P(|X_n| > t) \le 2 \exp\Big( -\frac{t^2}{2 \sum_{i=1}^{n} d_i^2} \Big).
$$

Notice that in the special case when $d_i = d$, we can take $t = xn$ and obtain an upper bound $2\exp(-x^2 n/(2d^2))$, which is of the form promised above. Note that this is consistent with the Chernoff bound for the special case where $X_n$ is a sum of i.i.d. zero-mean terms, though it is applicable only in the special case of a.s. bounded increments.
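Before the proof, here is a minimal simulation sketch (the parameter values and variable names are illustrative choices, not part of the notes): it takes $X_n$ to be a symmetric $\pm 1$ random walk, a martingale with $X_0 = 0$ and $d_i = 1$, and compares the empirical one-sided tail $P(X_n > t)$ with the bound $\exp(-t^2/(2n))$ derived in the proof below.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 1000          # number of martingale steps
trials = 100_000  # number of independent sample paths
t = 60.0          # deviation threshold

# X_n = position of a symmetric +/-1 random walk after n steps:
# a martingale with X_0 = 0 and increments bounded by d_i = 1.
X_n = 2.0 * rng.binomial(n, 0.5, size=trials) - n

empirical = np.mean(X_n > t)
azuma = np.exp(-t**2 / (2 * n))  # one-sided bound with d_i = 1

print(f"empirical P(X_n > {t}) = {empirical:.4f}")  # roughly 0.03
print(f"bound exp(-t^2/(2n))   = {azuma:.4f}")      # roughly 0.17
```

The bound holds but is not tight; for sums of i.i.d. bounded terms, the Chernoff bound mentioned above exhibits the same Gaussian-type decay in $t$.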

Proof. $f(x) = \exp(\lambda x)$ is a convex function of $x$ for any $\lambda \in \mathbb{R}$, with $f(-d_i) = \exp(-\lambda d_i)$ and $f(d_i) = \exp(\lambda d_i)$. Using convexity, we have that when $|x|/d_i \le 1$,
$$
\begin{aligned}
\exp(\lambda x) = f(x) &= f\Big( \frac{1}{2}\Big(1 + \frac{x}{d_i}\Big) d_i + \frac{1}{2}\Big(1 - \frac{x}{d_i}\Big)(-d_i) \Big) \\
&\le \frac{1}{2}\Big(1 + \frac{x}{d_i}\Big) f(d_i) + \frac{1}{2}\Big(1 - \frac{x}{d_i}\Big) f(-d_i) \\
&= \frac{f(d_i) + f(-d_i)}{2} + \frac{f(d_i) - f(-d_i)}{2 d_i}\, x.
\end{aligned} \tag{1}
$$
Further, for every $a$,
$$
\frac{\exp(a) + \exp(-a)}{2} = \sum_{k \ge 0} \frac{a^{2k}}{(2k)!} \le \sum_{k \ge 0} \frac{a^{2k}}{2^k k!} = \sum_{k \ge 0} \frac{(a^2/2)^k}{k!} = \exp\Big(\frac{a^2}{2}\Big), \tag{2}
$$
because $2^k k! \le (2k)!$. We conclude that for every $x$ such that $|x|/d_i \le 1$,
$$
\exp(\lambda x) \le \exp\Big(\frac{\lambda^2 d_i^2}{2}\Big) + \frac{\exp(\lambda d_i) - \exp(-\lambda d_i)}{2 d_i}\, x. \tag{3}
$$
We now turn to our martingale sequence $X_n$. For every $t > 0$ and every $\lambda > 0$, Markov's inequality gives
$$
P(X_n \ge t) = P(\exp(\lambda X_n) \ge \exp(\lambda t)) \le \exp(-\lambda t)\, E[\exp(\lambda X_n)] = \exp(-\lambda t)\, E\Big[ \exp\Big( \lambda \sum_{1 \le i \le n} (X_i - X_{i-1}) \Big) \Big],
$$
where $X_0 = 0$ was used in the last equality. Applying the tower property of conditional expectation, we have
$$
E\Big[ \exp\Big( \lambda \sum_{1 \le i \le n} (X_i - X_{i-1}) \Big) \Big] = E\Big[ E\Big[ \exp(\lambda (X_n - X_{n-1})) \exp\Big( \lambda \sum_{1 \le i \le n-1} (X_i - X_{i-1}) \Big) \,\Big|\, \mathcal{F}_{n-1} \Big] \Big].
$$

Now, since $X_i$, $i \le n-1$, are measurable with respect to $\mathcal{F}_{n-1}$, we have
$$
\begin{aligned}
& E\Big[ \exp(\lambda (X_n - X_{n-1})) \exp\Big( \lambda \sum_{1 \le i \le n-1} (X_i - X_{i-1}) \Big) \,\Big|\, \mathcal{F}_{n-1} \Big] \\
&\quad = \exp\Big( \lambda \sum_{1 \le i \le n-1} (X_i - X_{i-1}) \Big)\, E[ \exp(\lambda (X_n - X_{n-1})) \mid \mathcal{F}_{n-1} ] \\
&\quad \le \exp\Big( \lambda \sum_{1 \le i \le n-1} (X_i - X_{i-1}) \Big) \Big( \exp\Big(\frac{\lambda^2 d_n^2}{2}\Big) + \frac{\exp(\lambda d_n) - \exp(-\lambda d_n)}{2 d_n}\, E[X_n - X_{n-1} \mid \mathcal{F}_{n-1}] \Big),
\end{aligned}
$$
where (3) was used in the last inequality. The martingale property implies $E[X_n - X_{n-1} \mid \mathcal{F}_{n-1}] = 0$, and we have obtained the upper bound
$$
E\Big[ \exp\Big( \lambda \sum_{1 \le i \le n} (X_i - X_{i-1}) \Big) \Big] \le E\Big[ \exp\Big( \lambda \sum_{1 \le i \le n-1} (X_i - X_{i-1}) \Big) \Big] \exp\Big(\frac{\lambda^2 d_n^2}{2}\Big).
$$
Iterating further, we obtain the following upper bound on $P(X_n \ge t)$:
$$
\exp(-\lambda t) \exp\Big( \frac{\lambda^2 \sum_{1 \le i \le n} d_i^2}{2} \Big).
$$
Optimizing over the choice of $\lambda$, we see that the tightest bound is obtained by setting $\lambda = t/\sum_i d_i^2 > 0$, leading to the upper bound
$$
P(X_n \ge t) \le \exp\Big( -\frac{t^2}{2 \sum_i d_i^2} \Big).
$$
A similar approach using $\lambda < 0$ gives, for every $t > 0$,
$$
P(X_n \le -t) \le \exp\Big( -\frac{t^2}{2 \sum_i d_i^2} \Big).
$$
Combining the two, we obtain the required result.

2 Application to Lipschitz continuous functions of i.i.d. random variables

Suppose $X_1, \ldots, X_n$ are independent random variables. Suppose $g : \mathbb{R}^n \to \mathbb{R}$ is a function and $d_1, \ldots, d_n$ are constants such that for any two vectors $(x_1, \ldots, x_n)$ and $(y_1, \ldots, y_n)$,
$$
|g(x_1, \ldots, x_n) - g(y_1, \ldots, y_n)| \le \sum_{i=1}^{n} d_i\, 1\{x_i \ne y_i\}. \tag{4}
$$
In particular, when a vector $x$ changes value only in its $i$-th coordinate, the amount of change in the function $g$ is at most $d_i$. As a special case, consider the set of vectors $x = (x_1, \ldots, x_n)$ such that $|x_i| \le c$, and suppose $g$ is Lipschitz continuous with constant $K$. Namely, for every $x, y$, $|g(x) - g(y)| \le K \|x - y\|$, where $\|x - y\| = \sum_i |x_i - y_i|$. Then for any two such vectors
$$
|g(x) - g(y)| \le K \|x - y\| \le 2 K c \sum_{i=1}^{n} 1\{x_i \ne y_i\},
$$
and therefore this fits into the previous framework with $d_i = 2Kc$.

Theorem 2. Suppose $X_i$, $1 \le i \le n$, are i.i.d. and the function $g : \mathbb{R}^n \to \mathbb{R}$ satisfies (4). Then for every $t \ge 0$,
$$
P(|g(X_1, \ldots, X_n) - E[g(X_1, \ldots, X_n)]| > t) \le 2 \exp\Big( -\frac{t^2}{2 \sum_{i=1}^{n} d_i^2} \Big).
$$

Proof. Let $\mathcal{F}_i$ be the $\sigma$-field generated by the variables $X_1, \ldots, X_i$: $\mathcal{F}_i = \sigma(X_1, \ldots, X_i)$. For convenience, we also set $\mathcal{F}_0$ to be the trivial $\sigma$-field consisting of $\{\emptyset, \Omega\}$, so that $E[Z \mid \mathcal{F}_0] = E[Z]$ for every r.v. $Z$. Let $M_0 = E[g(X_1, \ldots, X_n)]$, $M_1 = E[g(X_1, \ldots, X_n) \mid \mathcal{F}_1]$, ..., $M_n = E[g(X_1, \ldots, X_n) \mid \mathcal{F}_n]$. Observe that $M_n$ is simply $g(X_1, \ldots, X_n)$, since $X_1, \ldots, X_n$ are measurable with respect to $\mathcal{F}_n$. Then, by the tower property,
$$
E[M_n \mid \mathcal{F}_{n-1}] = E[ E[g(X_1, \ldots, X_n) \mid \mathcal{F}_n] \mid \mathcal{F}_{n-1} ] = M_{n-1}.
$$
Thus $M_i$ is a martingale. We have
$$
M_{i+1} - M_i = E[g(X_1, \ldots, X_n) \mid \mathcal{F}_{i+1}] - E[g(X_1, \ldots, X_n) \mid \mathcal{F}_i] = E\big[ g(X_1, \ldots, X_n) - E[g(X_1, \ldots, X_n) \mid \mathcal{F}_i] \,\big|\, \mathcal{F}_{i+1} \big].
$$
Since the $X_i$'s are independent, $M_i$ is a r.v. which on any vector $x = (x_1, \ldots, x_n) \in \Omega$ takes the value
$$
M_i = \int_{x_{i+1}, \ldots, x_n} g(x_1, \ldots, x_n)\, dP(x_{i+1}) \cdots dP(x_n),
$$
and in particular depends only on the first $i$ coordinates of $x$. Similarly,
$$
M_{i+1} = \int_{x_{i+2}, \ldots, x_n} g(x_1, \ldots, x_n)\, dP(x_{i+2}) \cdots dP(x_n).
$$
Thus, using (4),
$$
\begin{aligned}
|M_{i+1} - M_i| &= \Big| \int_{x_{i+2}, \ldots, x_n} \Big( g(x_1, \ldots, x_n) - \int_{x'_{i+1}} g(x_1, \ldots, x'_{i+1}, \ldots, x_n)\, dP(x'_{i+1}) \Big)\, dP(x_{i+2}) \cdots dP(x_n) \Big| \\
&\le \int_{x_{i+2}, \ldots, x_n} d_{i+1}\, dP(x_{i+2}) \cdots dP(x_n) = d_{i+1}.
\end{aligned}
$$
This derivation expresses the simple idea that $M_i$ and $M_{i+1}$ differ only in averaging out $X_{i+1}$ in $M_i$. Now, defining $\hat{M}_i = M_i - M_0 = M_i - E[g(X_1, \ldots, X_n)]$, we have that $\hat{M}_i$ is also a martingale with differences bounded by $d_i$, but with the additional property $\hat{M}_0 = 0$. Applying Theorem 1, we obtain the required result.
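To make Theorem 2 concrete, here is a small sketch with a hypothetical choice of $g$ (this example is not from the notes): $g(x_1, \ldots, x_n)$ is the number of distinct values in the sample. Resampling a single coordinate changes $g$ by at most one, so property (4) holds with $d_i = 1$.

```python
import numpy as np

rng = np.random.default_rng(1)

def g(sample: np.ndarray) -> int:
    # g = number of distinct values in the sample; changing one
    # coordinate changes g by at most 1, so (4) holds with d_i = 1.
    return np.unique(sample).size

n, m = 500, 200    # sample size; values are uniform on {0, ..., m-1}
trials = 20_000
t = 60.0

values = np.array([g(rng.integers(0, m, size=n)) for _ in range(trials)])
mean_g = values.mean()  # Monte Carlo proxy for E[g(X_1, ..., X_n)]

empirical = np.mean(np.abs(values - mean_g) > t)
bound = 2 * np.exp(-t**2 / (2 * n))  # Theorem 2 with d_i = 1

print(f"E[g] is approximately {mean_g:.1f}")
print(f"empirical P(|g - E g| > {t}) = {empirical:.5f}")
print(f"Theorem 2 bound              = {bound:.5f}")
```

The empirical frequency is typically far below the bound: bounded-differences inequalities trade tightness for generality, since they use only the worst-case sensitivities $d_i$ and not the actual variance of $g$.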

3 Two examples

We now consider two applications of the concentration inequalities developed in the previous sections. Our first example concerns convergence of empirical distributions to the true distributions of random variables. Specifically, suppose we have a distribution function $F$ and an i.i.d. sequence $X_1, \ldots, X_n$ with distribution $F$. From the sample $X_1, \ldots, X_n$ we can build an empirical distribution function
$$
F_n(x) = n^{-1} \sum_{1 \le i \le n} 1\{X_i \le x\}.
$$
Namely, $F_n(x)$ is simply the frequency of observing values at most $x$ in our sample. We should realize that $F_n$ is a random function, since it depends on the sample $X_1, \ldots, X_n$. An important theorem called the Glivenko-Cantelli theorem says that $\sup_{x \in \mathbb{R}} |F_n(x) - F(x)|$ converges to zero almost surely and in expectation, the latter meaning of course that $E[\sup_{x \in \mathbb{R}} |F_n(x) - F(x)|] \to 0$. Proving this result is beyond our scope. However, applying the martingale concentration inequality, we can bound the deviation of $\sup_{x \in \mathbb{R}} |F_n(x) - F(x)|$ around its expectation. For convenience, let $L_n = L_n(X_1, \ldots, X_n) = \sup_{x \in \mathbb{R}} |F_n(x) - F(x)|$, which is commonly called the empirical risk in the statistics and machine learning fields. We need to bound $P(|L_n - E[L_n]| > t)$. Observe that $L_n$ satisfies property (4) with $d_i = 1/n$. Indeed, changing one coordinate $X_i$ to some $X_i'$ changes $F_n$ by at most $1/n$, and thus the same applies to $L_n$. Applying Theorem 2, we obtain
$$
P(|L_n - E[L_n]| > t) \le 2 \exp\Big( -\frac{t^2}{2 n (1/n)^2} \Big) = 2 \exp\Big( -\frac{t^2 n}{2} \Big).
$$
Thus we obtain a large deviations type bound on the difference $L_n - E[L_n]$.
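When $F$ is the Uniform$[0,1]$ distribution, $L_n$ is the classical Kolmogorov-Smirnov statistic and can be computed exactly from the order statistics. The following sketch (with illustrative parameters) estimates the concentration of $L_n$ around its mean and compares it with the bound just derived.

```python
import numpy as np

rng = np.random.default_rng(2)

def L_n(sample: np.ndarray) -> float:
    # sup_x |F_n(x) - F(x)| for F = Uniform[0,1]; the supremum is
    # attained at the jump points of F_n, i.e. at the order statistics.
    x = np.sort(sample)
    n = x.size
    i = np.arange(1, n + 1)
    return max((i / n - x).max(), (x - (i - 1) / n).max())

n, trials, t = 1000, 10_000, 0.1

values = np.array([L_n(rng.random(n)) for _ in range(trials)])
mean_L = values.mean()  # roughly 0.87 / sqrt(n) for uniform samples

empirical = np.mean(np.abs(values - mean_L) > t)
bound = 2 * np.exp(-t**2 * n / 2)

print(f"E[L_n] is approximately {mean_L:.4f}")
print(f"empirical P(|L_n - E L_n| > {t}) = {empirical:.5f}")
print(f"bound 2 exp(-t^2 n / 2)          = {bound:.5f}")
```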

For our second example, we turn to combinatorial optimization on random graphs. We will use the so-called Max-Cut problem as an example, though the approach works for many other optimization and constraint satisfaction problems as well. Consider a simple undirected graph $G = (V, E)$. $V$ is the set of nodes, denoted $1, 2, \ldots, n$, and $E$ is the set of edges, which we describe as a list of pairs $(i_1, j_1), \ldots, (i_{|E|}, j_{|E|})$, where $i_1, \ldots, i_{|E|}, j_1, \ldots, j_{|E|}$ are nodes. The graph is undirected, which means that the edges $(i_1, j_1)$ and $(j_1, i_1)$ are identical. We can also represent the graph as an $n \times n$ zero-one matrix $A$, where $A_{i,j} = 1$ if $(i, j) \in E$ and $A_{i,j} = 0$ otherwise. Then $A$ is a symmetric matrix, namely $A^T = A$, where $A^T$ is the transpose of $A$.

A cut in this graph is a partition of the nodes into two groups, encoded by a function $\sigma : V \to \{0, 1\}$. The value $MC(\sigma)$ of the cut associated with $\sigma$ is the number of edges between the two groups. Formally,
$$
MC(\sigma) = |\{(i, j) \in E : \sigma(i) \ne \sigma(j)\}|.
$$
Clearly $MC(\sigma) \le |E|$. At the same time, a random assignment $\sigma(i) = 0$ with probability $1/2$ and $\sigma(i) = 1$ with probability $1/2$ gives a cut with expected value $E[MC(\sigma)] = (1/2)|E|$. In fact, there is a simple algorithm to construct such a cut explicitly. Now denote by $MC(G)$ the maximum possible value of the cut: $MC(G) = \max_\sigma MC(\sigma)$. Thus $1/2 \le MC(G)/|E| \le 1$. Further, suppose we delete an arbitrary edge from the graph $G$ and obtain a new graph $G'$. Observe that in this case $MC(G') \ge MC(G) - 1$: the Max-Cut value either stays the same or goes down by at most one. Similarly, when we add an edge, the Max-Cut value increases by at most one. Putting this together, if we replace an arbitrary edge $e \in E$ by a different edge $e'$ and leave all the other edges intact, the value of the Max-Cut changes by at most one.

Now suppose the graph $G = G(n, dn)$ is a random Erdős-Rényi graph with $|E| = dn$ edges. Specifically, suppose we choose each of the edges $E_1, \ldots, E_{dn}$ uniformly at random from the total set of possible edges, independently across these $dn$ choices. Denote by $MC_n$ the value of the maximum cut $MC(G(n, dn))$ on this random graph. Since the graph is random, $MC_n$ is a random variable. Furthermore, as we have just established, $d/2 \le MC_n/n \le d$. One of the major open problems in the theory of random graphs is computing the scaling limit of $E[MC_n]/n$ as $n \to \infty$.

However, we can easily obtain bounds on the concentration of $MC_n$ around its expectation using the Azuma-Hoeffding inequality. For this goal, think of the $dn$ random edges $E_1, \ldots, E_{dn}$ as i.i.d. random variables taking values in the set of pairs $(i, j)$, $1 \le i < j \le n$, corresponding to the set of all possible edges on $n$ nodes. Let $g(E_1, \ldots, E_{dn}) = MC_n$. Observe that indeed $g$ is a function of $dn$ i.i.d. random variables. By our observation above, replacing one edge $E_i$ by a different edge $E_i'$ changes $MC_n$ by at most one. Thus we can apply Theorem 2, which gives
$$
P(|MC_n - E[MC_n]| \ge t) \le 2 \exp\Big( -\frac{t^2}{2 dn} \Big).
$$
In particular, taking $t = rn$, where $r > 0$ is a constant, we obtain a large deviations type bound $2 \exp(-\frac{r^2 n}{2d})$. Taking instead $t = r \sqrt{n}$, we obtain a Gaussian type bound $2 \exp(-\frac{r^2}{2d})$. Namely, $MC_n = E[MC_n] + \Theta(\sqrt{n})$. This is a meaningful concentration around the mean since, as we have discussed above, $E[MC_n] = \Theta(n)$.
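A brute-force check of this concentration is feasible only for very small $n$, since Max-Cut is NP-hard in general; the sketch below (all parameters illustrative) enumerates every partition. Edges are drawn i.i.d. uniformly with replacement, matching the model above, and repeated edges are counted with multiplicity so that $g$ stays a function of $dn$ i.i.d. variables with sensitivity one.

```python
import numpy as np

rng = np.random.default_rng(3)

n, d = 14, 2                 # n nodes, dn i.i.d. uniform random edges
num_edges = d * n
trials = 1000
pairs = np.array([(i, j) for i in range(n) for j in range(i + 1, n)])

# Precompute sigma(v) for every partition; fixing sigma(n-1) = 0
# by symmetry halves the search space to 2^(n-1) partitions.
masks = np.arange(1 << (n - 1), dtype=np.uint32)
bits = ((masks[:, None] >> np.arange(n)) & 1).astype(np.int8)

def max_cut(edge_idx: np.ndarray) -> int:
    # Evaluate the cut value of every partition at once and take the max;
    # repeated edges are counted with multiplicity.
    i, j = pairs[edge_idx].T
    return int((bits[:, i] ^ bits[:, j]).sum(axis=1).max())

values = np.array([max_cut(rng.integers(0, len(pairs), size=num_edges))
                   for _ in range(trials)])

t = 10.0
empirical = np.mean(np.abs(values - values.mean()) >= t)
bound = 2 * np.exp(-t**2 / (2 * d * n))

print(f"E[MC_n] is approximately {values.mean():.2f} out of {num_edges} edges")
print(f"empirical P(|MC_n - E MC_n| >= {t}) = {empirical:.4f}")
print(f"bound 2 exp(-t^2/(2dn))             = {bound:.4f}")
```

At this small scale the bound is weak; its strength lies in the regimes $t = \Theta(n)$ and $t = \Theta(\sqrt{n})$ with $n$ large, as described above.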

MIT OpenCourseWare
http://ocw.mit.edu

15.070J / 6.265J Advanced Stochastic Processes
Fall 2013

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.