princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 3: Large deviations bounds and applications Lecturer: Sanjeev Arora

Today's topic is deviation bounds: what is the probability that a random variable deviates from its mean by a lot? Recall that a random variable X is a mapping from a probability space to R. The expectation or mean is denoted E[X] or sometimes as µ.

In many settings we have a set of n random variables X_1, X_2, X_3, ..., X_n defined on the same probability space. To give an example, the probability space could be that of all possible outcomes of n tosses of a fair coin, and X_i is the random variable that is 1 if the i-th toss is a head, and 0 otherwise, which means E[X_i] = 1/2.

The first observation we make is that of the Linearity of Expectation, viz.

    E[Σ_i X_i] = Σ_i E[X_i].

It is important to realize that linearity holds regardless of whether or not the random variables are independent.

Can we say something about E[X_1 X_2]? In general, nothing much, but if X_1, X_2 are independent (formally, this means that for all a, b, Pr[X_1 = a, X_2 = b] = Pr[X_1 = a] Pr[X_2 = b]), then E[X_1 X_2] = E[X_1] E[X_2]. Note that if the X_i's are pairwise independent (i.e., each pair is mutually independent), then var[Σ_i X_i] = Σ_i var[X_i].

1 Three progressively stronger tail bounds

Now we give three methods that give progressively stronger bounds.

1.1 Markov's Inequality (aka averaging)

The first of a number of inequalities presented today, Markov's inequality says that any non-negative random variable X satisfies

    Pr[X ≥ k E[X]] ≤ 1/k.

Note that this is just another way to write the trivial observation that E[X] ≥ k · Pr[X ≥ k].

Can we give any meaningful upper bound on Pr[X < c E[X]] where c < 1, in other words the probability that X is a lot less than its expectation? In general we cannot. However, if we know an upper bound on X then we can. For example, if X ∈ [0, 1] and E[X] = µ, then for any c < 1 we have (simple exercise)

    Pr[X ≤ cµ] ≤ (1 − µ)/(1 − cµ).

Sometimes this is also called an averaging argument.

Example 1 Suppose you took a lot of exams, each scored from 1 to 100. If your average score was 90, then in at least half the exams you scored at least 80.
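The following small Python simulation is not part of the original notes; it is a sketch that sanity-checks the two bounds above. The particular distributions (Exponential(1) for the Markov check, U² with U uniform for the averaging check) are arbitrary illustrative choices.

    import random

    random.seed(0)
    N = 200_000

    # Markov: any non-negative X satisfies Pr[X >= k*E[X]] <= 1/k.
    # Illustrative choice of X (an assumption): X ~ Exponential(1), so E[X] = 1.
    xs = [random.expovariate(1.0) for _ in range(N)]
    mu = sum(xs) / N
    for k in (2, 5, 10):
        tail = sum(x >= k * mu for x in xs) / N
        print(f"Pr[X >= {k} mu] ~ {tail:.4f}   Markov bound 1/k = {1 / k:.4f}")

    # Averaging bound for X in [0, 1]: Pr[X <= c*mu] <= (1 - mu)/(1 - c*mu).
    # Illustrative choice: X = U^2 with U uniform on [0, 1], so E[X] = 1/3.
    ys = [random.random() ** 2 for _ in range(N)]
    mu_y = sum(ys) / N
    for c in (0.25, 0.5):
        frac = sum(y <= c * mu_y for y in ys) / N
        bound = (1 - mu_y) / (1 - c * mu_y)
        print(f"Pr[X <= {c} mu] ~ {frac:.4f}   averaging bound = {bound:.4f}")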

1.2 Chebyshev's Inequality

The variance of a random variable X is one measure (there are others too) of how spread out it is around its mean. It is defined as E[(X − µ)²] = E[X²] − µ².

A more powerful inequality, Chebyshev's inequality, says

    Pr[|X − µ| ≥ kσ] ≤ 1/k²,

where µ and σ² are the mean and variance of X. Recall that σ² = E[(X − µ)²] = E[X²] − µ².

Actually, Chebyshev's inequality is just a special case of Markov's inequality: by definition,

    E[|X − µ|²] = σ²,

and so,

    Pr[|X − µ|² ≥ k²σ²] ≤ 1/k².

Here is a simple fact that is used a lot: if Y_1, Y_2, ..., Y_t are iid (which is jargon for independent and identically distributed) then the variance of their average (1/t) Σ_i Y_i is exactly 1/t times the variance of one of them. Using Chebyshev's inequality, this already implies that the average of iid variables converges sort-of strongly to the mean.

1.2.1 Example: Load balancing

Suppose we toss m balls into n bins. You can think of m jobs being randomly assigned to n processors. Let X = number of balls assigned to the first bin. Then E[X] = m/n. What is the chance that X > 2m/n? Markov's inequality says this is less than 1/2.

To use Chebyshev we need to compute the variance of X. For this let Y_i be the indicator random variable that is 1 iff the i-th ball falls in the first bin. Then X = Σ_i Y_i. Hence

    E[X²] = E[Σ_i Y_i² + 2 Σ_{i<j} Y_i Y_j] = Σ_i E[Y_i²] + 2 Σ_{i<j} E[Y_i Y_j].

Now for independent random variables E[Y_i Y_j] = E[Y_i] E[Y_j], so

    E[X²] = m/n + m(m − 1)/n².

Hence the variance is very close to m/n, and thus Chebyshev implies that

    Pr[X > 2m/n] < n/m.

When m > 3n, say, this is stronger than Markov.
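A quick way to see the gap between the two bounds in this example is to simulate the experiment. The sketch below is my addition (the values of m and n are illustrative, not from the notes); it estimates Pr[X > 2m/n] empirically and prints the Markov and Chebyshev bounds next to it.

    import random

    random.seed(0)
    m, n = 200, 20        # m balls into n bins (illustrative values)
    trials = 50_000

    exceed = 0
    for _ in range(trials):
        # Count how many of the m balls land in the first bin (bin 0).
        first_bin = sum(random.randrange(n) == 0 for _ in range(m))
        if first_bin > 2 * m / n:
            exceed += 1

    print(f"empirical Pr[X > 2m/n] ~ {exceed / trials:.5f}")
    print(f"Markov bound:    1/2 = {1 / 2}")
    print(f"Chebyshev bound: n/m = {n / m}")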

1.3 Large deviation bounds

When we toss a coin many times, the expected number of heads is half the number of tosses. How tightly is this distribution concentrated? Should we be very surprised if after 1000 tosses we have 625 heads?

The Central Limit Theorem says that the sum of n independent random variables (with bounded mean and variance) converges to the famous Gaussian distribution (popularly known as the Bell Curve). This is very useful in algorithm design: we maneuver to design algorithms so that the analysis boils down to estimating the sum of independent (or somewhat independent) random variables.

To do a back of the envelope calculation, if all n coin tosses are fair (Heads has probability 1/2) then the Gaussian approximation implies that the probability of seeing N heads where |N − n/2| > a√n is at most e^(−a²/2). The chance of seeing at least 625 heads in 1000 tosses of an unbiased coin is less than 5.3 × 10^(−7). These are pretty strong bounds! Of course, for finite n the sum of n random variables need not be an exact Gaussian, and that is where Chernoff bounds come in. (By the way, these bounds are also known by other names in different fields since they have been independently discovered.)

First we give an inequality that works for general variables that are real-valued in [−1, 1]. (To apply it to more general bounded variables, just scale them to [−1, 1] first.)

Theorem 1 (Quantitative version of CLT due to H. Chernoff)
Let X_1, X_2, ..., X_n be independent random variables with each X_i ∈ [−1, 1], and let µ_i = E[X_i] and σ_i² = var[X_i]. Then X = Σ_i X_i satisfies

    Pr[|X − µ| > kσ] ≤ 2 exp(−k²/(4n)),

where µ = Σ_i µ_i and σ² = Σ_i σ_i².

Instead of proving the above we prove a simpler theorem for binary valued variables which showcases the basic idea.

Theorem 2
Let X_1, X_2, ..., X_n be independent 0/1-valued random variables and let p_i = E[X_i], where 0 < p_i < 1. Then the sum X = Σ_{i=1}^n X_i, which has mean µ = Σ_{i=1}^n p_i, satisfies

    Pr[X ≥ (1 + δ)µ] ≤ (c_δ)^µ,

where c_δ is shorthand for e^δ/(1 + δ)^(1+δ).

Remark: There is an analogous inequality that bounds the probability of deviation below the mean, in which δ becomes negative, the ≥ in the probability becomes ≤, and the c_δ is very similar.

Proof: Surprisingly, this inequality is also proved using Markov's inequality. We introduce a positive dummy variable t and observe that

    E[exp(tX)] = E[exp(t Σ_i X_i)] = E[Π_i exp(tX_i)] = Π_i E[exp(tX_i)],    (1)

where the last equality holds because the X_i's are independent. Now,

    E[exp(tX_i)] = (1 − p_i) + p_i e^t,

therefore,

    Π_i E[exp(tX_i)] = Π_i [1 + p_i(e^t − 1)] ≤ Π_i exp(p_i(e^t − 1)) = exp(Σ_i p_i(e^t − 1)) = exp(µ(e^t − 1)),    (2)

since 1 + x ≤ e^x. Finally, apply Markov's inequality to the random variable exp(tX), viz.

    Pr[X ≥ (1 + δ)µ] = Pr[exp(tX) ≥ exp(t(1 + δ)µ)] ≤ E[exp(tX)]/exp(t(1 + δ)µ) = exp((e^t − 1)µ)/exp(t(1 + δ)µ),

using lines (1) and (2) and the fact that t is positive. Since t is a dummy variable, we can choose any positive value we like for it. The right hand side is minimized at t = ln(1 + δ) (just differentiate), and this choice leads to the theorem statement.
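To connect Theorem 2 back to the 625-heads question, the short Python computation below (my addition, not part of the notes) evaluates the exact binomial tail Pr[X ≥ 625] for 1000 fair tosses together with the bound (c_δ)^µ for µ = 500 and δ = 0.25.

    import math

    n, mu, delta = 1000, 500, 0.25
    threshold = int((1 + delta) * mu)        # 625 heads

    # Exact tail of Binomial(1000, 1/2): Pr[X >= 625].
    exact = sum(math.comb(n, k) for k in range(threshold, n + 1)) / 2 ** n

    # Chernoff bound from Theorem 2: (e^delta / (1 + delta)^(1 + delta))^mu.
    c_delta = math.exp(delta) / (1 + delta) ** (1 + delta)
    print(f"exact   Pr[X >= 625] = {exact:.3e}")
    print(f"Chernoff bound       = {c_delta ** mu:.3e}")

The bound comes out to roughly 5 × 10^(−7), matching the figure quoted in the back-of-the-envelope calculation above, while the exact tail is smaller still.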

2 Application 1: Sampling/Polling

Opinion polls and statistical sampling rely on tail bounds. Suppose there are n arbitrary numbers in [0, 1]. If we pick t of them at random (with replacement!) then the sample mean is within a (1 ± ε) factor of the true mean with probability at least 1 − δ if t > Ω((1/ε²) log(1/δ)). (Verify this calculation!)

In general, Chernoff bounds imply that taking k independent estimates and taking their mean ensures that the value is highly concentrated about their mean; large deviations happen with exponentially small probability.
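The following Python sketch is my addition (the data set, the target ε and δ, and the leading constant 3 in the sample-size formula are all illustrative assumptions, not derived constants). It repeatedly polls t random samples with replacement and reports how often the sample mean lands within a (1 ± ε) factor of the true mean.

    import math
    import random

    random.seed(0)

    # Illustrative data set: n arbitrary numbers in [0, 1].
    n = 10_000
    data = [random.random() for _ in range(n)]
    true_mean = sum(data) / n

    eps, delta = 0.05, 0.01
    # t = O((1/eps^2) log(1/delta)); the constant 3 is an arbitrary illustrative choice.
    t = math.ceil(3 / eps ** 2 * math.log(2 / delta))

    polls, good = 2_000, 0
    for _ in range(polls):
        sample_mean = sum(random.choice(data) for _ in range(t)) / t
        if abs(sample_mean - true_mean) <= eps * true_mean:
            good += 1

    print(f"t = {t} samples per poll")
    print(f"fraction of polls within a (1 +/- {eps}) factor of the true mean: "
          f"{good / polls:.3f}  (want at least {1 - delta})")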

3 Balls and Bins revisited: Load balancing

Suppose we toss m balls into n bins. You can think of m jobs being randomly assigned to n processors. Then the expected number of balls in each bin is m/n. When m = n this expectation is 1, but we saw in Lecture 1 that the most overloaded bin has Ω(log n / log log n) balls. However, if m = cn log n then the expected number of balls in each bin is c log n. Thus Chernoff bounds imply that the chance of seeing less than 0.5c log n or more than 1.5c log n balls in a given bin is less than γ^(c log n) for some constant γ < 1 (which depends on the 0.5, 1.5, etc.), and this can be made less than, say, 1/n² by choosing c to be a large constant.

Moral: if an office boss is trying to allocate work fairly, he/she should first create more work and then do a random assignment.

4 What about the median?

Given n numbers in [0, 1], can we approximate the median via sampling? This will be part of your homework.

Exercise: Show that it is impossible to estimate the value of the median within, say, a 1.1 factor with o(n) samples.

But what is possible is to produce an approximate median: a number that is greater than at least n/2 − n/t of the numbers and less than at least n/2 − n/t of the numbers. The idea is to take a random sample of a certain size and take the median of that sample. (Hint: Use balls and bins.)

One can use the approximate median algorithm to describe a version of quicksort with very predictable performance. Say we are given n numbers in an array. Recall that (random) quicksort is the sorting algorithm where you randomly pick one of the n numbers as a pivot, then partition the numbers into those that are bigger than and smaller than the pivot (which takes O(n) time). Then you recursively sort the two subsets. This procedure works in expected O(n log n) time, as you may have learnt in an undergrad course. But its performance is uneven because the pivot may not divide the instance into two exactly equal pieces. For instance, the chance that the running time exceeds 10 n log n is quite high. A better way to run quicksort is to first do a quick estimation of the median and then pivot on it. This algorithm runs in very close to n log n time, which is optimal.
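To make the last two sections concrete, here is a hedged Python sketch (my addition, not the algorithm as specified in the notes or the homework). It implements the "quick estimation of the median" step by taking the median of a small random sample and uses that as the quicksort pivot; the sample size s and the small-instance cutoff are illustrative choices.

    import random

    def approx_median(arr, s=101):
        # Median of a random sample of size s serves as an approximate median of arr.
        sample = random.sample(arr, min(s, len(arr)))
        return sorted(sample)[len(sample) // 2]

    def quicksort(arr):
        # Quicksort that pivots on an approximate median found by sampling.
        if len(arr) <= 16:                    # small-instance cutoff (arbitrary)
            return sorted(arr)
        pivot = approx_median(arr)
        smaller = [x for x in arr if x < pivot]
        equal   = [x for x in arr if x == pivot]
        bigger  = [x for x in arr if x > pivot]
        return quicksort(smaller) + equal + quicksort(bigger)

    if __name__ == "__main__":
        random.seed(0)
        data = [random.random() for _ in range(100_000)]
        assert quicksort(data) == sorted(data)
        print("sorted 100,000 numbers using approximate-median pivots")

The sample size controls how balanced the split is: a larger sample gives a pivot whose rank is closer to n/2, at the cost of more time spent choosing it.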