
1. Subgaussian tails

<1> Definition. Say that a random variable $X$ has a subgaussian distribution with scale factor $\sigma < \infty$ if $P\exp(tX) \le \exp(\sigma^2 t^2/2)$ for all real $t$. For example, if $X$ is distributed $N(0,\sigma^2)$ then it is subgaussian.

<2> Example. Suppose $X$ is a bounded random variable with a symmetric distribution. That is, $|X| \le M$ for some constant $M$ and $-X$ has the same distribution as $X$. Then
$$P\exp(tX) = 1 + \sum_{k \in \mathbb{N}} \frac{t^k\, PX^k}{k!}.$$
By symmetry, $PX^k = 0$ for each odd $k$. For even $k$, bound $PX^k$ by $M^k$, leaving
$$P\exp(tX) \le 1 + \sum_{k \in \mathbb{N}} \frac{t^{2k} M^{2k}}{(2k)!} \le \exp(M^2 t^2/2)$$
because $(2k)! \ge 2^k k!$ for each $k$ in $\mathbb{N}$.

The argument for bounding the maximum of normal random variables carries over to subgaussians.

<3> Theorem. Suppose $X_1, \ldots, X_n$ are subgaussian with scale factors bounded by a constant $\sigma$. Then $P\max_{i \le n}|X_i| \le \tfrac{3}{2}\,\sigma\sqrt{1 + \log(2n)}$.

Proof. For each $t > 0$,
$$\exp\Big(t\,P\max_{i \le n}|X_i|\Big) \le P\max_{i \le n}\exp(t|X_i|) \le \sum_{i \le n}\big(Pe^{tX_i} + Pe^{-tX_i}\big) \le 2n\exp(\tfrac{1}{2}\sigma^2 t^2),$$
so that $P\max_{i \le n}|X_i| \le \log(2n)/t + \sigma^2 t/2$. Choose $t = \sqrt{\log(2n)}/\sigma$.

In fact, we could improve the inequality to give similar bounds for various $L^p$ norms of $\max_{i \le n}|X_i|$ by choosing slightly different convex functions instead of $x \mapsto \exp(tx)$. I won't derive these bounds explicitly because there is an even better inequality obtainable from another characterization of subgaussianity.

<4> Theorem. Suppose $PX = 0$. Then $X$ is subgaussian if and only if there exists a finite constant $C$ for which $P\exp(X^2/C^2) < \infty$.

Proof. If $P\exp(tX) \le \exp(\sigma^2 t^2/2)$ for all real $t$ then
$$P\exp(X^2/4\sigma^2) - 1 = \int_0^\infty P\{X^2/4\sigma^2 \ge t\}\,e^t\,dt = \int_0^\infty P\{|X|\sqrt{t}/\sigma \ge 2t\}\,e^t\,dt$$
$$\le \int_0^\infty \big(P\exp(X\sqrt{t}/\sigma) + P\exp(-X\sqrt{t}/\sigma)\big)e^{-2t}\,e^t\,dt \le \int_0^\infty 2e^{t/2}\,e^{-t}\,dt < \infty.$$
Conversely, if $P\exp(X^2/C^2) = D < \infty$ then, from the inequality $ab \le (a^2 + b^2)/2$, we get
$$P\exp(tX) \le P\exp\Big(\frac{X^2}{C^2} + \frac{C^2 t^2}{4}\Big) = D\exp(C^2 t^2/4).$$
This bound is not quite what we need for subgaussianity. If we bound $t$ away from zero we can eliminate the $D$: if $D \le \exp(MC^2\epsilon^2)$ for some constant $M$ then $P\exp(tX) \le \exp\big((M+1)C^2 t^2\big)$ for $|t| \ge \epsilon$. If $\epsilon$ is small enough, the Taylor expansion gives
$$P\exp(tX) = 1 + tPX + \tfrac{1}{2}t^2 PX^2 + o(t^2) \le \exp\big(\tfrac{1}{2}t^2(1 + PX^2)\big) \quad\text{when } |t| \le \epsilon.$$
The subgaussianity bound follows.

Subgaussian random variables can also be characterized by an exponential tail bound. Take $t = x/\sigma^2$ in the inequality
$$P\{X \ge x\} \le \exp(-tx)\,P\exp(tX) \le \exp(-tx + \sigma^2 t^2/2)$$
to deduce that $P\{X \ge x\} \le \exp(-x^2/2\sigma^2)$ for $x \ge 0$. Replace $X$ by $-X$, which is also subgaussian, then add, to derive the analogous two-sided bound. Conversely, if $P\{|X| \ge x\} \le C\exp(-x^2/2\sigma^2)$ then
$$P\exp(X^2/9\sigma^2) - 1 = \int_0^\infty P\{X^2 \ge 9\sigma^2 t\}\,e^t\,dt = \int_0^\infty P\{|X| \ge 3\sigma\sqrt{t}\}\,e^t\,dt \le \int_0^\infty C\exp(-9t/2 + t)\,dt < \infty,$$
which, via Theorem <4>, gives subgaussianity.
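Both the maximal inequality and the tail bound are easy to sanity-check by simulation. Here is a minimal sketch (my addition, not part of the original notes; it assumes NumPy is available) that compares a Monte Carlo estimate of $P\max_{i \le n}|X_i|$ for $N(0,\sigma^2)$ variables, which have scale factor exactly $\sigma$, against the bound of Theorem <3>:

    import numpy as np

    rng = np.random.default_rng(seed=0)
    sigma, n, reps = 1.0, 10_000, 200

    # N(0, sigma^2) variables are subgaussian with scale factor sigma.
    samples = rng.normal(0.0, sigma, size=(reps, n))

    # Monte Carlo estimate of P max_{i<=n} |X_i| (P denotes expectation).
    estimate = np.abs(samples).max(axis=1).mean()

    # The bound of Theorem <3>: (3/2) * sigma * sqrt(1 + log(2n)).
    bound = 1.5 * sigma * np.sqrt(1.0 + np.log(2.0 * n))
    print(f"estimate {estimate:.3f} <= bound {bound:.3f}")

For $n = 10{,}000$ standard normals the estimate sits near $\sqrt{2\log n} \approx 4.3$, comfortably below the bound of about $4.9$.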

2. Orlicz norms

The convexity argument used to prove Theorem <3> also works for higher moments:
$$\Big(P\max_{i \le N}|X_i|\Big)^p \le P\max_{i \le N}|X_i|^p \le \sum_{i \le N} P|X_i|^p \le N\max_{i \le N} P|X_i|^p.$$
Thus

<5>   $P\max_{i \le N}|X_i| \le N^{1/p}\max_{i \le N}\|X_i\|_p$ for $p \ge 1$.

More generally, if $\Psi$ is a nonnegative, convex, strictly increasing function on $\mathbb{R}^+$, then, for each $\sigma > 0$,
$$\Psi\Big(P\max_{i \le N}\frac{|X_i|}{\sigma}\Big) \le P\,\Psi\Big(\max_{i \le N}\frac{|X_i|}{\sigma}\Big) = P\max_{i \le N}\Psi\Big(\frac{|X_i|}{\sigma}\Big) \le N\max_{i \le N} P\,\Psi\Big(\frac{|X_i|}{\sigma}\Big).$$
If $\sigma$ is such that $P\,\Psi(|X_i|/\sigma) \le 1$ for each $i$ then we have
$$P\max_{i \le N}|X_i| \le \sigma\,\Psi^{-1}(N).$$
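As a concrete instance (my illustration, not in the original text): for $\Psi(x) = \exp(x^2) - 1$ we have $\Psi^{-1}(N) = \sqrt{\log(1+N)}$, so the display above reads
$$P\max_{i \le N}|X_i| \le \sigma\sqrt{\log(1+N)} \quad\text{whenever } P\exp(X_i^2/\sigma^2) \le 2 \text{ for each } i,$$
the same $\sqrt{\log N}$ growth as in Theorem <3>, though now normalized through $P\,\Psi(|X_i|/\sigma) \le 1$ rather than through the scale factor.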

<6> Definition. An Orlicz function is a convex, increasing function $\Psi$ on $\mathbb{R}^+$ with $\Psi(0) < 1$. (Most authors actually require $\Psi(0) = 0$.) Define the Orlicz norm $\|X\|_\Psi$ (a seminorm actually, unless $\Psi(0) = 0$ and one identifies random variables that are almost everywhere equal) by
$$\|X\|_\Psi = \inf\{c > 0 : P\,\Psi(|X|/c) \le 1\},$$
with the understanding that $\|X\|_\Psi = \infty$ if the infimum runs over an empty set.

It is not hard to show (Pollard 2001, Problems 2.22 through 2.24) that $\|X\|_\Psi < \infty$ if and only if $P\,\Psi(|X|/C) < \infty$ for at least one finite constant $C$. The infimum defining $\|X\|_\Psi$ is achieved when the norm is finite.

<7> Example. Let $\Psi(x) = \exp(x^2) - 1$. Then $\|X\|_\Psi < \infty$ if and only if $X - PX$ is subgaussian.

Notice that a bound on an Orlicz norm, $\|X\|_\Psi \le \sigma$, automatically gives a tail bound,
$$P\{|X| \ge x\} \le \frac{P\,\Psi(|X|/\sigma)}{\Psi(x/\sigma)} \le \frac{1}{\Psi(x/\sigma)} \quad\text{for } x \ge 0.$$
For example, if $\Psi(x) = \tfrac{1}{2}\exp(x^2)$ then we get a subgaussian tail bound.

Sometimes it is possible to find $\sigma_0$ such that $P\,\Psi(|X|/\sigma_0) \le K$, for a constant $K > 1$. It then follows from convexity of $\Psi$ that

<8>   $\|X\|_\Psi \le \sigma_0/\theta$ where $\theta = \dfrac{1 - \Psi(0)}{K - \Psi(0)}$,

because $P\,\Psi(\theta|X|/\sigma_0) \le \theta P\,\Psi(|X|/\sigma_0) + (1-\theta)\Psi(0) \le \theta K + (1-\theta)\Psi(0) = 1$.

<9> Example. (Compare with page 96 of van der Vaart & Wellner (1996).) Let $\Psi$ be an Orlicz function (such as $\exp(x^2) - 1$, as in Problem [1]) for which there exists a finite constant $C$ such that $\Psi(\alpha)\Psi(\beta) \le \Psi(C\alpha\beta)$ when $\Psi(\alpha) \wedge \Psi(\beta) \ge 1$. Then

<10>   $\big\|\max_{i \le N}|X_i|\big\|_\Psi \le C_0\,\Psi^{-1}(N)\max_{i \le N}\|X_i\|_\Psi$ where $C_0 := \dfrac{2 - \Psi(0)}{1 - \Psi(0)}\,C$.

To prove the assertion, define $\Delta = \max_{i \le N}\|X_i\|_\Psi$ and $D = C\Delta\,\Psi^{-1}(N)$. Notice that $\Psi(D/C\Delta) = N \ge 1$. When $\Psi(\max_{i \le N}|X_i|/D) \ge 1$,
$$\Psi\Big(\frac{\max_i |X_i|}{D}\Big)\Psi\Big(\frac{D}{C\Delta}\Big) \le \Psi\Big(\frac{\max_i |X_i|}{\Delta}\Big) \le \sum_{i \le N}\Psi\Big(\frac{|X_i|}{\Delta}\Big).$$
That is, whether or not $\Psi(\max_i |X_i|/D) \ge 1$,
$$\Psi\Big(\frac{\max_i |X_i|}{D}\Big) \le 1 + N^{-1}\sum_{i \le N}\Psi\Big(\frac{|X_i|}{\Delta}\Big).$$
Take expectations:
$$P\,\Psi\Big(\frac{\max_i |X_i|}{D}\Big) \le 1 + N^{-1}\sum_{i \le N} P\,\Psi\Big(\frac{|X_i|}{\Delta}\Big) \le 2.$$
Invoke inequality <8> with $K = 2$.

Finally, notice that if $\|X\|_\Psi = \sigma$ for $\Psi(x) = \exp(x^2) - 1$ then
$$\frac{P|X|^{2p}}{\sigma^{2p}} \le p!\,P\exp(X^2/\sigma^2) \le 2\,p!\,.$$
A bound on the Orlicz norm, for this particular $\Psi$, gives a bound on moments of all orders.
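The infimum in Definition <6> is easy to locate numerically because $c \mapsto P\,\Psi(|X|/c)$ is nonincreasing in $c$, so bisection finds the norm. A minimal sketch (my addition; it replaces the expectation by a sample average, so it only estimates $\|X\|_\Psi$ from data):

    import numpy as np

    def empirical_orlicz_norm(x, psi, lo=1e-8, hi=1e8, rel_tol=1e-9):
        """Approximate ||X||_Psi = inf{c > 0 : P Psi(|X|/c) <= 1} by bisection,
        with the expectation P replaced by the average over the sample x."""
        criterion = lambda c: np.mean(psi(np.abs(x) / c))
        if criterion(hi) > 1.0:           # norm lies beyond the search bracket
            return np.inf
        while hi - lo > rel_tol * hi:
            mid = 0.5 * (lo + hi)
            if criterion(mid) <= 1.0:
                hi = mid                  # mid is feasible: the norm is <= mid
            else:
                lo = mid                  # mid is infeasible: the norm exceeds mid
        return hi

    psi2 = lambda u: np.expm1(u ** 2)     # Psi(x) = exp(x^2) - 1
    rng = np.random.default_rng(1)
    sample = rng.normal(size=100_000)
    print(empirical_orlicz_norm(sample, psi2))

For $X \sim N(0,1)$ and $\Psi(x) = \exp(x^2) - 1$, solving $P\exp(X^2/c^2) = 2$ exactly gives $c = \sqrt{8/3} \approx 1.633$, which the empirical estimate should approach as the sample grows.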

<11> Example. For each event $A$ with $PA > 0$, write $P_A$ for the conditional expectation given $A$. Suppose $\|X\|_\Psi < \infty$. From Jensen's inequality and the definition of the Orlicz norm we get
$$\Psi\big(P_A|X|/\|X\|_\Psi\big) \le P_A\,\Psi\big(|X|/\|X\|_\Psi\big) = \frac{P\,\Psi\big(|X|/\|X\|_\Psi\big)\mathbb{1}_A}{PA} \le \frac{1}{PA},$$
from which it follows that

<12>   $P_A|X| \le \|X\|_\Psi\,\Psi^{-1}(1/PA)$.

With cunning choices of $A$, this inequality will deliver a useful maximal inequality for finite collections of random variables, namely,

<13>   $P_A\max_{i \le N}|X_i| \le \sigma\,\Psi^{-1}(N/PA)$ if $\max_{i \le N}\|X_i\|_\Psi \le \sigma$.

Indeed, if $A_1, \ldots, A_N$ denotes a partition of $A$ into subsets, such that $|X_i|$ is the largest of the $|X_j|$ on the set $A_i$, then
$$P_A\max_{i \le N}|X_i| = \sum_{i \le N}\frac{P\,|X_i|\mathbb{1}_{A_i}}{PA} = \sum_{i \le N}\frac{PA_i}{PA}\,P_{A_i}|X_i|.$$
Inequality <12> and concavity of the function $\Psi^{-1}$ bound the last sum by
$$\sigma\sum_{i \le N}\frac{PA_i}{PA}\,\Psi^{-1}\Big(\frac{1}{PA_i}\Big) \le \sigma\,\Psi^{-1}\Big(\sum_{i \le N}\frac{PA_i}{PA}\cdot\frac{1}{PA_i}\Big) = \sigma\,\Psi^{-1}\Big(\frac{N}{PA}\Big).$$

The bound <13> will turn out to be much more powerful than one might at first glance suspect. If we choose $A = \{\max_{i \le N}|X_i| \ge \epsilon\}$ then we get a lower bound for $1/PA$. The full power of this trick will appear in the Chapter on chaining.
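To make that last remark concrete (my unpacking, not spelled out in the original): on $A = \{\max_{i \le N}|X_i| \ge \epsilon\}$ the conditional expectation is at least $\epsilon$, so <13> forces
$$\epsilon \le \sigma\,\Psi^{-1}(N/PA), \qquad\text{that is,}\qquad P\Big\{\max_{i \le N}|X_i| \ge \epsilon\Big\} \le \frac{N}{\Psi(\epsilon/\sigma)},$$
a tail bound for the maximum obtained without any direct appeal to the individual tails.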

3. Problems

[1] Show that $(\exp(x^2) - 1)(\exp(y^2) - 1) \le \exp(2x^2y^2) - 1$ for $x \wedge y \ge 1$.

[2] Suppose $X$ has a symmetric distribution. Show that it is subgaussian if and only if there exists some constant $c$ for which $\|X\|_k \le c\sqrt{k}$ for each $k$ in $\mathbb{N}$. Hints: Note that $\|X\|_k$ is an increasing function of $k$. For $k$ even, try to show that
$$\|X\|_k^k \le \inf_{t > 0}\frac{k!\,P\exp(tX)}{t^k}.$$

[3] Let $X$ and $Y$ be independent, identically distributed random variables with $PX = PY = 0$.
(i) Let $H$ be a convex function. [Any other regularity conditions?] Show that $PH(X) = PH(X - PY) \le PH(X - Y)$.
(ii) Show that $\|X\|_\Psi \le \|X - Y\|_\Psi \le 2\|X\|_\Psi$ for each Orlicz function $\Psi$.
(iii) Generalize the result from Problem [2]: show that the moment characterization of subgaussianity still holds if we replace the symmetry assumption on $X$ by the assumption that $PX = 0$.

4. Notes

Acknowledge Ledoux & Talagrand (1991) for several of the ideas used in this Chapter, including Example <11>. Cite Aad van der Vaart (personal communication; or van der Vaart & Wellner 1996) for the improvement on the method used in Pollard (1990, Section 3). Who first got the characterization in Problems [2] and [3]? I got it from a sharper result in Lugosi (2003, Section 2), but it must be older. Give some history of earlier work: Dudley? Pisier?

References

Ledoux, M. & Talagrand, M. (1991), Probability in Banach Spaces: Isoperimetry and Processes, Springer, New York.

Lugosi, G. (2003), Concentration-of-measure inequalities, notes from the Summer School on Machine Learning, Australian National University. Available at http://www.econ.upf.es/~lugosi/.

Pollard, D. (1990), Empirical Processes: Theory and Applications, Vol. 2 of NSF-CBMS Regional Conference Series in Probability and Statistics, Institute of Mathematical Statistics, Hayward, CA.

Pollard, D. (2001), A User's Guide to Measure Theoretic Probability, Cambridge University Press.

van der Vaart, A. W. & Wellner, J. A. (1996), Weak Convergence and Empirical Processes: With Applications to Statistics, Springer-Verlag, New York.