Basics of Probability Theory (for Theory of Computation courses)

Oded Goldreich
Department of Computer Science
Weizmann Institute of Science
Rehovot, Israel.
oded.goldreich@weizmann.ac.il

November 24, 2008

Preface. Probability theory plays a central role in many areas of computer science, and specifically in cryptography and complexity theory. In this text, we present the basic probabilistic notions and notations that are used in various courses in the theory of computation. Specifically, we refer to the notions of a discrete probability space and random variables, and to the corresponding notations. Also included are overviews of three useful probabilistic inequalities: Markov's Inequality, Chebyshev's Inequality, and Chernoff Bound.

1 The Very Basics

Formally speaking, probability theory is merely a quantitative study of the relation between the sizes of various sets. In the simple case, which underlies our applications, these sets are finite, and the probability of an event is merely a shorthand for the density of a corresponding set with respect to an underlying (finite) universe. For example, when we talk of the probability that a coin flip yields the outcome heads, the universe is the set of the two possible outcomes (i.e., heads and tails) and the event we refer to is a subset of this universe (i.e., the singleton set {heads}). In general, the universe corresponds to all possible basic situations (which are assumed or postulated to be equally likely), and events correspond to specific sets of some of these situations. Thus, one may claim that probability theory is just combinatorics; however, as is often the case, good notations (let alone notions) are instrumental to more complex studies.

Throughout the entire text we refer only to discrete probability distributions. Such probability distributions refer to a finite universe, called the probability space (or the sample space), and to subsets of this universe. Specifically, the underlying probability (or sample) space consists of a finite set, denoted Ω, and events correspond to subsets of Ω. The probability of an event A ⊆ Ω is defined as |A|/|Ω|. Indeed, one should think of probabilities by referring to a (mental) experiment in which an element in the space Ω is selected with uniform probability distribution (i.e., each element is selected with equal probability, 1/|Ω|).
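To make the set-density view concrete, here is a minimal Python sketch (ours, not part of the original notes; the two-dice universe and the helper prob are illustrative choices) that computes the probability of an event A ⊆ Ω as |A|/|Ω|:

    from fractions import Fraction
    from itertools import product

    # The universe Omega: all 36 equally likely outcomes of rolling two dice.
    omega = list(product(range(1, 7), repeat=2))

    def prob(event):
        """Probability of the event A = {e in Omega : event(e)} as |A| / |Omega|."""
        A = [e for e in omega if event(e)]
        return Fraction(len(A), len(omega))

    print(prob(lambda e: e == (6, 6)))    # the singleton {(6,6)}: 1/36
    print(prob(lambda e: sum(e) >= 10))   # six outcomes: 1/6

Any predicate over outcomes defines an event; the uniform distribution over Ω is implicit in the counting.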

Random variables. Traditionally, random variables are defined as functions from the sample space to the reals. (For example, for any A ⊆ Ω, we may consider a random variable X : Ω → R such that X(e) = 1 if e ∈ A and X(e) = 0 otherwise.) The probability that X is assigned a particular value v is denoted Pr[X = v] and is defined as the probability of the event {e ∈ Ω : X(e) = v}; that is, Pr[X = v] = |{e ∈ Ω : X(e) = v}|/|Ω|. The support of a random variable is the (finite) set of values that are assigned positive probability; that is, the support of X is the set {X(e) : e ∈ Ω}. The expectation of a random variable X, defined over Ω, is defined as the average value of X when the underlying sample is selected uniformly in Ω (i.e., |Ω|^{-1} · Σ_{e∈Ω} X(e)). We denote this expectation by E[X]. A key observation, referred to as linearity of expectation, is that for any two random variables X, Y : Ω → R it holds that E[X + Y] = E[X] + E[Y]. In particular, for any constants a and b, it holds that E[aX + b] = a·E[X] + b. The variance of a random variable X, denoted Var[X], is defined as E[(X − E[X])²]. Note that µ def= E[X] is a constant, and thus

Var[X] = E[(X − µ)²] = E[X² − 2µX + µ²] = E[X²] − 2µ·E[X] + µ² = E[X²] − E[X]²,

which is upper-bounded by E[X²]. Note that a random variable has positive variance if and only if it is not a constant (i.e., Var[X] > 0 if and only if X assumes at least two different values).

It is common practice to talk about random variables without specifying the underlying probability space. In these cases a random variable is viewed as a discrete probability distribution over the reals; that is, we fix the probability that this random variable is assigned any value. Assuming that these probabilities are all rationals, a corresponding probability space always exists. For example, when we consider a random variable X such that Pr[X = 1] = 1/3 and Pr[X = 2] = 2/3, a possible corresponding probability space is {1, 2, 3} such that X(1) = 1 and X(2) = X(3) = 2.

Statistical difference. The statistical distance (a.k.a. variation distance) between the random variables X and Y is defined as

(1/2) · Σ_v |Pr[X = v] − Pr[Y = v]| = max_S {Pr[X ∈ S] − Pr[Y ∈ S]}.    (1)

We say that X is δ-close (resp., δ-far) to Y if the statistical distance between them is at most (resp., at least) δ.

Our Notational Conventions

Typically, in our courses, the underlying probability space will consist of the set of all strings of a certain length ℓ. That is, the sample space is the set of all ℓ-bit long strings, and each such string is assigned probability measure 2^{−ℓ}. Abusing the traditional terminology, we use the term random variable also when referring to functions mapping the sample space into the set of binary strings. We often do not specify the probability space, but rather talk directly about random variables. For example, we may say that X is a random variable assigned values in the set of all strings such that Pr[X = 00] = 1/4 and Pr[X = 111] = 3/4. (Such a random variable may be defined over the sample space {0,1}², so that X(11) = 00 and X(00) = X(01) = X(10) = 111.) One important case of a random variable is the output of a randomized process (e.g., a probabilistic polynomial-time algorithm).
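Before moving on, the following sketch (ours, not part of the original notes) computes the expectation, the variance, and the statistical distance of Eq. (1) directly from probability mass functions represented as dictionaries; it reproduces the example Pr[X = 1] = 1/3, Pr[X = 2] = 2/3 from above:

    from fractions import Fraction as F

    def expectation(pmf):
        # E[X] = sum over v of Pr[X = v] * v
        return sum(p * v for v, p in pmf.items())

    def variance(pmf):
        # Var[X] = E[X^2] - E[X]^2
        mu = expectation(pmf)
        return sum(p * v * v for v, p in pmf.items()) - mu * mu

    def stat_dist(pmf_x, pmf_y):
        # (1/2) * sum over v of |Pr[X = v] - Pr[Y = v]|, as in Eq. (1)
        support = set(pmf_x) | set(pmf_y)
        return sum(abs(pmf_x.get(v, 0) - pmf_y.get(v, 0)) for v in support) / 2

    X = {1: F(1, 3), 2: F(2, 3)}
    print(expectation(X))                           # 5/3
    print(variance(X))                              # 2/9
    print(stat_dist(X, {1: F(1, 2), 2: F(1, 2)}))   # 1/6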

All our probabilistic statements refer to random variables that are defined beforehand. Typically, we may write Pr[f(X) = 1], where X is a random variable defined beforehand (and f is a function). An important convention is that all occurrences of the same symbol in a probabilistic statement refer to the same (unique) random variable. Hence, if B(·,·) is a Boolean expression depending on two variables, and X is a random variable, then Pr[B(X, X)] denotes the probability that B(x, x) holds when x is chosen with probability Pr[X = x]. For example, for every random variable X, we have Pr[X = X] = 1. We stress that if we wish to discuss the probability that B(x, y) holds when x and y are chosen independently with identical probability distribution, then we will define two independent random variables, each with the same probability distribution.¹ Hence, if X and Y are two independent random variables, then Pr[B(X, Y)] denotes the probability that B(x, y) holds when the pair (x, y) is chosen with probability Pr[X = x] · Pr[Y = y]. For example, for every two independent random variables, X and Y, we have Pr[X = Y] = 1 only if both X and Y are trivial (i.e., assign the entire probability mass to a single string).

We will often use U_n to denote a random variable uniformly distributed over the set of all strings of length n. Namely, Pr[U_n = α] equals 2^{−n} if α ∈ {0,1}^n and equals 0 otherwise. We often refer to the distribution of U_n as the uniform distribution (neglecting to qualify that it is uniform over {0,1}^n). In addition, we occasionally use random variables (arbitrarily) distributed over {0,1}^n or {0,1}^{ℓ(n)}, for some function ℓ : N → N. Such random variables are typically denoted by X_n, Y_n, Z_n, etc. We stress that in some cases X_n is distributed over {0,1}^n, whereas in other cases it is distributed over {0,1}^{ℓ(n)}, for some function ℓ (which is typically a polynomial). We often talk about probability ensembles, which are infinite sequences of random variables {X_n}_{n∈N} such that each X_n ranges over strings of length bounded by a polynomial in n.

2 Three Inequalities

The following probabilistic inequalities are very useful. These inequalities refer to random variables that are assigned real values, and provide upper bounds on the probability that a random variable deviates from its expectation.

2.1 Markov's Inequality

The most basic inequality is Markov's Inequality, which applies to any random variable with a bounded maximum or minimum value. For simplicity, this inequality is stated for random variables that are lower-bounded by zero, and reads as follows: Let X be a non-negative random variable and v be a positive real number. Then

Pr[X ≥ v] ≤ E(X)/v.    (2)

Equivalently, Pr[X ≥ r·E(X)] ≤ 1/r. The proof amounts to the following sequence:

E(X) = Σ_x Pr[X = x]·x ≥ Σ_{x<v} Pr[X = x]·0 + Σ_{x≥v} Pr[X = x]·v = Pr[X ≥ v]·v.

¹ Two random variables, X and Y, are called independent if for every pair of possible values (x, y) it holds that Pr[(X, Y) = (x, y)] = Pr[X = x]·Pr[Y = y].
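As a quick sanity check on Eq. (2) (our illustration, with an arbitrarily chosen distribution), one can compare both sides exactly using rational arithmetic:

    from fractions import Fraction as F

    # A non-negative random variable given by its probability mass function.
    pmf = {0: F(1, 2), 1: F(1, 4), 4: F(1, 4)}
    E = sum(p * v for v, p in pmf.items())               # E(X) = 5/4

    for v in (1, 2, 4):
        tail = sum(p for x, p in pmf.items() if x >= v)  # Pr[X >= v]
        assert tail <= E / v                             # Markov's Inequality
        print(v, tail, E / v)

Note how the bound is loose for small v (Pr[X ≥ 1] = 1/2 versus the bound 5/4) and tightens as v grows.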

2.2 Chebyshev's Inequality

Using Markov's inequality, one gets a potentially stronger bound on the deviation of a random variable from its expectation. This bound, called Chebyshev's inequality, is useful when having additional information concerning the random variable (specifically, a good upper bound on its variance). For a random variable X of finite expectation, we denote by Var(X) def= E[(X − E(X))²] the variance of X, and observe that Var(X) = E(X²) − E(X)². Chebyshev's Inequality then reads as follows: Let X be a random variable, and δ > 0. Then

Pr[|X − E(X)| ≥ δ] ≤ Var(X)/δ².

Proof: We define a random variable Y def= (X − E(X))², and apply Markov's inequality. We get

Pr[|X − E(X)| ≥ δ] = Pr[(X − E(X))² ≥ δ²] ≤ E[(X − E(X))²]/δ²,    (3)

and the claim follows.

Corollary (Pairwise Independent Sampling): Chebyshev's inequality is particularly useful in the analysis of the error probability of approximation via repeated sampling. It suffices to assume that the samples are picked in a pairwise independent manner, where X_1, X_2, ..., X_n are pairwise independent if for every i ≠ j and every α, β it holds that Pr[X_i = α ∧ X_j = β] = Pr[X_i = α]·Pr[X_j = β]. The corollary reads as follows: Let X_1, X_2, ..., X_n be pairwise independent random variables with identical expectation, denoted µ, and identical variance, denoted σ². Then, for every ε > 0, it holds that

Pr[|Σ_{i=1}^n X_i/n − µ| ≥ ε] ≤ σ²/(ε²·n).    (4)

Proof: Define the random variables X̄_i def= X_i − E(X_i). Note that the X̄_i's are pairwise independent, and each has zero expectation. Applying Chebyshev's inequality to the random variable Σ_{i=1}^n X_i/n, and using the linearity of the expectation operator, we get

Pr[|Σ_{i=1}^n X_i/n − µ| ≥ ε] ≤ Var[Σ_{i=1}^n X_i/n]/ε² = E[(Σ_{i=1}^n X̄_i)²]/(ε²·n²).

Now (again using the linearity of expectation)

E[(Σ_{i=1}^n X̄_i)²] = Σ_{i=1}^n E[X̄_i²] + Σ_{1≤i≠j≤n} E[X̄_i·X̄_j].

By the pairwise independence of the X̄_i's, we get E[X̄_i·X̄_j] = E[X̄_i]·E[X̄_j], and using E[X̄_i] = 0, we get

E[(Σ_{i=1}^n X̄_i)²] = n·σ².

The corollary follows.
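The corollary is easy to exercise in code. The sketch below is ours; the construction X_i = (a + b·i) mod P, with a prime P and uniformly random a, b, is a standard pairwise independent generator (the text does not commit to any particular one). It draws n pairwise independent samples, essentially uniform in [0,1), and compares the empirical deviation frequency with the bound of Eq. (4):

    import random

    P = 2**31 - 1  # a Mersenne prime; arithmetic mod P is over the field GF(P)

    def pairwise_independent_samples(n):
        # X_i = (a + b*i) mod P with uniform a, b: for i != j the map
        # (a, b) -> (X_i, X_j) is a bijection of GF(P)^2, so the pair
        # (X_i, X_j) is uniform, which gives pairwise independence.
        a, b = random.randrange(P), random.randrange(P)
        return [((a + b * i) % P) / P for i in range(n)]

    n, trials, eps = 1000, 500, 0.05
    sigma2 = 1 / 12  # variance of a uniform sample in [0,1)
    deviations = sum(
        abs(sum(pairwise_independent_samples(n)) / n - 0.5) >= eps
        for _ in range(trials)
    )
    print(deviations / trials, "<=", sigma2 / (eps**2 * n))  # Eq. (4)

Chebyshev's bound here is about 0.033; the observed deviation frequency is typically far smaller, since Eq. (4) only upper-bounds it.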

2.3 Chernoff Bound

When using pairwise independent sample points, the error probability in the approximation decreases linearly with the number of sample points (see Eq. (4)). When using totally independent sample points, the error probability in the approximation can be shown to decrease exponentially with the number of sample points. (Recall that the random variables X_1, X_2, ..., X_n are said to be totally independent if for every sequence a_1, a_2, ..., a_n it holds that Pr[∧_{i=1}^n X_i = a_i] = Π_{i=1}^n Pr[X_i = a_i].) Probability bounds supporting the foregoing statement are given next. The first bound, commonly referred to as Chernoff Bound, concerns 0-1 random variables (i.e., random variables that are assigned either 0 or 1 as values), and asserts the following. Let p ≤ 1/2, and let X_1, X_2, ..., X_n be independent 0-1 random variables such that Pr[X_i = 1] = p, for each i. Then, for every ε ∈ (0, p], it holds that

Pr[|Σ_{i=1}^n X_i/n − p| > ε] < 2·e^{−c·ε²·n}, where c = max(2, 1/(3p)).    (5)

The more common formulation sets c = 2, but the case c = 1/(3p) is very useful when p is small and one cares about a multiplicative deviation (e.g., ε = p/2).

Proof Sketch: We upper-bound Pr[Σ_{i=1}^n X_i/n − p > ε], and Pr[p − Σ_{i=1}^n X_i/n > ε] is bounded similarly. Letting X̄_i def= X_i − E(X_i), we apply Markov's inequality to the random variable e^{λ·Σ_{i=1}^n X̄_i}, where λ ∈ (0, 1] will be determined to optimize the expressions that we derive. Thus, Pr[Σ_{i=1}^n X̄_i/n > ε] = Pr[e^{λ·Σ_{i=1}^n X̄_i} > e^{λεn}] is upper-bounded by

E[e^{λ·Σ_{i=1}^n X̄_i}]/e^{λεn} = e^{−λεn} · Π_{i=1}^n E[e^{λX̄_i}],

where the equality is due to the independence of the random variables. To simplify the rest of the proof, we establish a sub-optimal bound as follows. Using the Taylor expansion of e^x (e.g., e^x < 1 + x + x² for |x| ≤ 1) and observing that E[X̄_i] = 0, we get E[e^{λX̄_i}] < 1 + λ²·E[X̄_i²], which equals 1 + λ²·p(1−p). Thus, Pr[Σ_{i=1}^n X_i/n − p > ε] is upper-bounded by e^{−λεn}·(1 + λ²·p(1−p))^n < exp(−λεn + λ²·p(1−p)·n), which is optimized at λ = ε/(2p(1−p)), yielding exp(−ε²·n/(4p(1−p))) ≤ exp(−ε²·n). The foregoing proof strategy can be applied in more general settings.²

A more general bound, which refers to independent random variables that are each bounded but are not necessarily identical, is given next (and is commonly referred to as Hoeffding Inequality). Let X_1, X_2, ..., X_n be independent random variables, each ranging in the (real) interval [a, b], and let µ def= (1/n)·Σ_{i=1}^n E(X_i) denote the average expected value of these variables. Then, for every ε > 0,

Pr[|Σ_{i=1}^n X_i/n − µ| > ε] < 2·e^{−2ε²·n/(b−a)²}.    (6)

The special case (of Eq. (6)) that refers to identically distributed random variables is easy to derive from the foregoing Chernoff Bound (by recalling Footnote 2 and using a linear mapping of the interval [a, b] to the interval [0, 1]). This special case is useful in estimating the average value of a (bounded) function defined over a large domain, especially when the desired error probability needs to be negligible (i.e., to decrease faster than any polynomial in the number of samples). Such an estimate can be obtained provided that we can sample the function's domain (and evaluate the function).

² For example, verify that the current proof actually applies to the case that X_i ∈ [0, 1] rather than X_i ∈ {0, 1}, by noting that Var[X_i] ≤ p(1−p) still holds.
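The exponential decay of Eq. (5) is easy to observe empirically. The following Monte Carlo sketch (ours; the parameters are arbitrary) estimates the deviation probability for growing n and prints it next to the bound 2·e^{−2ε²n}, i.e., Eq. (5) with c = 2:

    import math
    import random

    p, eps, trials = 0.5, 0.1, 10_000

    def deviation_prob(n):
        # Estimate Pr[|sum(X_i)/n - p| > eps] for n totally independent
        # 0-1 samples with Pr[X_i = 1] = p.
        hits = 0
        for _ in range(trials):
            s = sum(random.random() < p for _ in range(n))
            hits += abs(s / n - p) > eps
        return hits / trials

    for n in (25, 50, 100, 200):
        print(n, deviation_prob(n), 2 * math.exp(-2 * eps**2 * n))

Note that doubling n squares the exponential factor in the bound, whereas the bound of Eq. (4) would merely be halved.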

2.4 Pairwise independent versus totally independent sampling

In Sections 2.2 and 2.3 we considered two Laws of Large Numbers, which assert that, when sufficiently many trials are made, the average value obtained in these actual trials typically approaches the expected value of a single trial. In Section 2.2 these trials were performed based on pairwise independent samples, whereas in Section 2.3 these trials were performed based on totally independent samples. In this section we shall see that the amount of deviation (of the average from the expectation) is approximately the same in both cases, but the probability of deviation is much smaller in the latter case.

To demonstrate the difference between the sampling bounds provided in Sections 2.2 and 2.3, we consider the problem of estimating the average value of a function f : Ω → [0, 1]. In general, we say that a random variable Z provides an (ε, δ)-approximation of a value v if Pr[|Z − v| > ε] ≤ δ. By Eq. (6), the average value of f evaluated at n = O(ε⁻²·log(1/δ)) independent samples (selected uniformly in Ω) yields an (ε, δ)-approximation of µ = Σ_{x∈Ω} f(x)/|Ω|. Thus, the number of sample points is polynomially related to ε⁻¹ and logarithmically related to δ⁻¹. In contrast, by Eq. (4), an (ε, δ)-approximation by pairwise independent samples calls for setting n = O(ε⁻²·δ⁻¹). We stress that, in both cases, the number of samples is polynomially related to the desired accuracy of the estimation (i.e., ε). The only advantage of totally independent samples over pairwise independent ones is in the dependency of the number of samples on the error probability (i.e., δ).
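To put numbers on this comparison, here is a short sketch (ours) that solves each bound for n: Eq. (6) gives n ≥ ln(2/δ)/(2ε²) for f into [0, 1], and Eq. (4) gives n ≥ σ²/(ε²·δ), where σ² ≤ 1/4 for any random variable ranging in [0, 1]:

    import math

    def n_independent(eps, delta):
        # Totally independent sampling: solve 2*exp(-2*eps^2*n) <= delta.
        return math.ceil(math.log(2 / delta) / (2 * eps**2))

    def n_pairwise(eps, delta, sigma2=0.25):
        # Pairwise independent sampling: solve sigma2 / (eps^2 * n) <= delta.
        return math.ceil(sigma2 / (eps**2 * delta))

    for delta in (1e-2, 1e-4, 1e-6):
        print(delta, n_independent(0.1, delta), n_pairwise(0.1, delta))

For ε = 0.1, the totally independent sample size grows only from about 265 to about 726 as δ drops from 10⁻² to 10⁻⁶, while the pairwise independent sample size grows from 2,500 to 25,000,000, illustrating the logarithmic versus linear dependence on 1/δ.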