Basics of Probability Theory (for Theory of Computation courses)

Oded Goldreich
Department of Computer Science
Weizmann Institute of Science
Rehovot, Israel.
oded.goldreich@weizmann.ac.il

November 24, 2008

Preface. Probability theory plays a central role in many areas of computer science, and specifically in cryptography and complexity theory. In this text, we present the basic probabilistic notions and notations that are used in various courses in the theory of computation. Specifically, we refer to the notions of discrete probability space and random variables, and to corresponding notations. Also included are overviews of three useful probabilistic inequalities: Markov's Inequality, Chebyshev's Inequality, and Chernoff Bound.

1 The Very Basics

Formally speaking, probability theory is merely a quantitative study of the relation between the sizes of various sets. In the simple case, which underlies our applications, these sets are finite and the probability of an event is merely a shorthand for the density of a corresponding set with respect to an underlying (finite) universe. For example, when we talk of the probability that a coin flip yields the outcome heads, the universe is the set of the two possible outcomes (i.e., heads and tails) and the event we refer to is a subset of this universe (i.e., the singleton set {heads}). In general, the universe corresponds to all possible basic situations (which are assumed or postulated to be equally likely), and events correspond to specific sets of some of these situations. Thus, one may claim that probability theory is just combinatorics; however, as is often the case, good notations (let alone notions) are instrumental to more complex studies.

Throughout the entire text we refer only to discrete probability distributions. Such probability distributions refer to a finite universe, called the probability space (or the sample space), and to subsets of this universe. Specifically, the underlying probability (or sample) space consists of a finite set, denoted $\Omega$, and events correspond to subsets of $\Omega$.
The probability of an event $A \subseteq \Omega$ is defined as $|A|/|\Omega|$. Indeed, one should think of probabilities by referring to a (mental) experiment in which an element in the space $\Omega$ is selected with uniform probability distribution (i.e., each element is selected with equal probability, $1/|\Omega|$).

Random variables. Traditionally, random variables are defined as functions from the sample space to the reals. (For example, for any $A \subseteq \Omega$, we may consider a random variable $X : \Omega \to \mathbb{R}$ such that $X(e) = 1$ if $e \in A$ and $X(e) = 0$ otherwise.) The probability that $X$ is assigned a particular value $v$ is denoted $\Pr[X = v]$ and is defined as the probability of the event $\{e \in \Omega : X(e) = v\}$; that is, $\Pr[X = v] = |\{e \in \Omega : X(e) = v\}| / |\Omega|$. The support of a random variable is the (finite) set of values that are assigned positive probability; that is, the support of $X$ is the set $\{X(e) : e \in \Omega\}$.

The expectation of a random variable $X$, defined over $\Omega$, is defined as the average value of $X$ when the underlying sample is selected uniformly in $\Omega$ (i.e., $|\Omega|^{-1} \cdot \sum_{e \in \Omega} X(e)$). We denote this expectation by $E[X]$. A key observation, referred to as linearity of expectation, is that for any two random variables $X, Y : \Omega \to \mathbb{R}$ it holds that $E[X + Y] = E[X] + E[Y]$. In particular, for any constants $a$ and $b$, it holds that $E[aX + b] = a \cdot E[X] + b$.

The variance of a random variable $X$, denoted $Var[X]$, is defined as $E[(X - E[X])^2]$. Note that $\mu \stackrel{\rm def}{=} E[X]$ is a constant, and thus

$$Var[X] = E[(X - \mu)^2] = E[X^2 - 2\mu \cdot X + \mu^2] = E[X^2] - 2\mu \cdot E[X] + \mu^2 = E[X^2] - E[X]^2,$$

which is upper bounded by $E[X^2]$. Note that a random variable has positive variance if and only if it is not a constant (i.e., $Var[X] > 0$ if and only if $X$ assumes at least two different values).

It is common practice to talk about random variables without specifying the underlying probability space. In these cases a random variable is viewed as a discrete probability distribution over the reals; that is, we fix the probability that this random variable is assigned any value. Assuming that these probabilities are all rationals, a corresponding probability space always exists. For example, when we consider a random variable $X$ such that $\Pr[X = 1] = 1/3$ and $\Pr[X = 2] = 2/3$, a possible corresponding probability space is $\{1, 2, 3\}$ such that $X(1) = 1$ and $X(2) = X(3) = 2$.

Statistical difference. The statistical distance (a.k.a. variation distance) between the random variables $X$ and $Y$ is defined as

$$\frac{1}{2} \cdot \sum_{v} \left| \Pr[X = v] - \Pr[Y = v] \right| = \max_{S} \{ \Pr[X \in S] - \Pr[Y \in S] \}. \tag{1}$$

We say that $X$ is $\delta$-close (resp., $\delta$-far) to $Y$ if the statistical distance between them is at most (resp., at least) $\delta$.
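The definitions of this section can be checked on the three-element sample space used in the example above. The following is a minimal sketch, not part of the original text: the helper names and the second random variable `Y` (uniform over {1, 2}) are assumptions introduced for illustration, and exact rational arithmetic is used so that both sides of Eq. (1) can be compared for equality.

```python
from fractions import Fraction
from itertools import combinations

def distribution(sample_space, X):
    """Map each value v to Pr[X = v] = |{e : X(e) = v}| / |sample_space|."""
    dist = {}
    for e in sample_space:
        dist[X(e)] = dist.get(X(e), Fraction(0)) + Fraction(1, len(sample_space))
    return dist

def expectation(dist):
    """E[X] as the probability-weighted sum of the values."""
    return sum(p * v for v, p in dist.items())

def variance(dist):
    """Var[X] = E[(X - E[X])^2]."""
    mu = expectation(dist)
    return sum(p * (v - mu) ** 2 for v, p in dist.items())

def statistical_distance(px, py):
    """Left-hand side of Eq. (1): half the sum of |Pr[X = v] - Pr[Y = v]|."""
    support = set(px) | set(py)
    return sum(abs(px.get(v, 0) - py.get(v, 0)) for v in support) / 2

def max_set_advantage(px, py):
    """Right-hand side of Eq. (1), by brute force over all subsets S."""
    support = sorted(set(px) | set(py))
    best = Fraction(0)
    for r in range(len(support) + 1):
        for subset in combinations(support, r):
            gap = sum(px.get(v, 0) - py.get(v, 0) for v in subset)
            best = max(best, gap)
    return best

# X is the example from the text: X(1) = 1 and X(2) = X(3) = 2 over {1, 2, 3}.
omega = [1, 2, 3]
X = distribution(omega, lambda e: 1 if e == 1 else 2)
# Y is a hypothetical second variable, uniform over {1, 2}.
Y = {1: Fraction(1, 2), 2: Fraction(1, 2)}

second_moment = sum(p * v ** 2 for v, p in X.items())
print(expectation(X), variance(X), second_moment - expectation(X) ** 2)
print(statistical_distance(X, Y), max_set_advantage(X, Y))
```

The brute-force maximum is exponential in the support size, so it is only meant as a sanity check on tiny examples.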
Our Notational Conventions

Typically, in our courses, the underlying probability space will consist of the set of all strings of a certain length $\ell$. That is, the sample space is the set of all $\ell$-bit long strings, and each such string is assigned probability measure $2^{-\ell}$.

Abusing the traditional terminology, we use the term random variable also when referring to functions mapping the sample space into the set of binary strings. We often do not specify the probability space, but rather talk directly about random variables. For example, we may say that $X$ is a random variable assigned values in the set of all strings such that $\Pr[X = 00] = 1/4$ and $\Pr[X = 111] = 3/4$. (Such a random variable may be defined over the sample space $\{0,1\}^2$, so that $X(11) = 00$ and $X(00) = X(01) = X(10) = 111$.) One important case of a random variable is the output of a randomized process (e.g., a probabilistic polynomial-time algorithm).

All our probabilistic statements refer to random variables that are defined beforehand. Typically, we may write $\Pr[f(X) = 1]$, where $X$ is a random variable defined beforehand (and $f$ is a function). An important convention is that all occurrences of the same symbol in a probabilistic statement refer to the same (unique) random variable. Hence, if $B(\cdot,\cdot)$ is a Boolean expression depending on two variables, and $X$ is a random variable, then $\Pr[B(X, X)]$ denotes the probability that $B(x, x)$ holds when $x$ is chosen with probability $\Pr[X = x]$. For example, for every random variable $X$, we have $\Pr[X = X] = 1$. We stress that if we wish to discuss the probability that $B(x, y)$ holds when $x$ and $y$ are chosen independently with identical probability distribution, then we will define two independent random variables each with the same probability distribution (see Footnote 1). Hence, if $X$ and $Y$ are two independent random variables, then $\Pr[B(X, Y)]$ denotes the probability that $B(x, y)$ holds when the pair $(x, y)$ is chosen with probability $\Pr[X = x] \cdot \Pr[Y = y]$. For example, for every two independent random variables, $X$ and $Y$, we have $\Pr[X = Y] = 1$ only if both $X$ and $Y$ are trivial (i.e., assign the entire probability mass to a single string).

We will often use $U_n$ to denote a random variable uniformly distributed over the set of all strings of length $n$. Namely, $\Pr[U_n = \alpha]$ equals $2^{-n}$ if $\alpha \in \{0,1\}^n$ and equals 0 otherwise. We often refer to the distribution of $U_n$ as the uniform distribution (neglecting to qualify that it is uniform over $\{0,1\}^n$). In addition, we occasionally use random variables (arbitrarily) distributed over $\{0,1\}^n$ or $\{0,1\}^{\ell(n)}$, for some function $\ell : \mathbb{N} \to \mathbb{N}$. Such random variables are typically denoted by $X_n$, $Y_n$, $Z_n$, etc. We stress that in some cases $X_n$ is distributed over $\{0,1\}^n$, whereas in other cases it is distributed over $\{0,1\}^{\ell(n)}$, for some function $\ell$ (which is typically a polynomial). We often talk about probability ensembles, which are infinite sequences of random variables $\{X_n\}_{n \in \mathbb{N}}$ such that each $X_n$ ranges over strings of length bounded by a polynomial in $n$.

2 Three Inequalities

The following probabilistic inequalities are very useful.
These inequalities refer to random variables that are assigned real values and provide upper-bounds on the probability that the random variable deviates from its expectation.

2.1 Markov's Inequality

The most basic inequality is Markov's Inequality that applies to any random variable with bounded maximum or minimum value. For simplicity, this inequality is stated for random variables that are lower-bounded by zero, and reads as follows: Let $X$ be a non-negative random variable and $v$ be a positive real number. Then

$$\Pr[X \geq v] \leq \frac{E(X)}{v}. \tag{2}$$

Equivalently, $\Pr[X \geq r \cdot E(X)] \leq 1/r$. The proof amounts to the following sequence:

$$E(X) = \sum_{x} \Pr[X = x] \cdot x \;\geq\; \sum_{x < v} \Pr[X = x] \cdot 0 + \sum_{x \geq v} \Pr[X = x] \cdot v \;=\; \Pr[X \geq v] \cdot v.$$

Footnote 1: Two random variables, $X$ and $Y$, are called independent if for every pair of possible values $(x, y)$ it holds that $\Pr[(X, Y) = (x, y)] = \Pr[X = x] \cdot \Pr[Y = y]$.
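Markov's Inequality (Eq. (2)) can be checked exactly on a small distribution. The sketch below is illustrative only: the fair six-sided die and the helper names are assumptions, not taken from the text.

```python
from fractions import Fraction

def tail_probability(dist, v):
    """Exact Pr[X >= v] for a distribution given as a dict value -> probability."""
    return sum(p for x, p in dist.items() if x >= v)

def markov_bound(dist, v):
    """Right-hand side of Eq. (2): E(X) / v, for v > 0."""
    mean = sum(p * x for x, p in dist.items())
    return mean / v

# Hypothetical non-negative random variable: the outcome of a fair six-sided die.
die = {x: Fraction(1, 6) for x in range(1, 7)}

for v in (2, 4, 6):
    print(v, tail_probability(die, v), markov_bound(die, v))
```

For small thresholds the bound exceeds 1 and is vacuous; it only becomes informative once $v$ is larger than the expectation.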

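2.2 Chebyshev's Inequality

Using Markov's inequality, one gets a potentially stronger bound on the deviation of a random variable from its expectation. This bound, called Chebyshev's inequality, is useful when having additional information concerning the random variable (specifically, a good upper bound on its variance). For a random variable $X$ of finite expectation, we denote by $Var(X) \stackrel{\rm def}{=} E[(X - E(X))^2]$ the variance of $X$, and observe that $Var(X) = E(X^2) - E(X)^2$. Chebyshev's Inequality then reads as follows: Let $X$ be a random variable, and $\delta > 0$. Then

$$\Pr[|X - E(X)| \geq \delta] \leq \frac{Var(X)}{\delta^2}.$$

Proof: We define a random variable $Y \stackrel{\rm def}{=} (X - E(X))^2$, and apply Markov's inequality. We get

$$\Pr[|X - E(X)| \geq \delta] = \Pr[(X - E(X))^2 \geq \delta^2] \leq \frac{E[(X - E(X))^2]}{\delta^2} \tag{3}$$

and the claim follows.

Corollary (Pairwise Independent Sampling): Chebyshev's inequality is particularly useful in the analysis of the error probability of approximation via repeated sampling. It suffices to assume that the samples are picked in a pairwise independent manner, where $X_1, X_2, \ldots, X_n$ are pairwise independent if for every $i \neq j$ and every $\alpha, \beta$ it holds that $\Pr[X_i = \alpha \wedge X_j = \beta] = \Pr[X_i = \alpha] \cdot \Pr[X_j = \beta]$. The corollary reads as follows: Let $X_1, X_2, \ldots, X_n$ be pairwise independent random variables with identical expectation, denoted $\mu$, and identical variance, denoted $\sigma^2$. Then, for every $\epsilon > 0$, it holds that

$$\Pr\left[\left|\frac{\sum_{i=1}^{n} X_i}{n} - \mu\right| \geq \epsilon\right] \leq \frac{\sigma^2}{\epsilon^2 n}. \tag{4}$$

Proof: Define the random variables $\overline{X}_i \stackrel{\rm def}{=} X_i - E(X_i)$. Note that the $\overline{X}_i$'s are pairwise independent, and each has zero expectation. Applying Chebyshev's inequality to the random variable $\sum_{i=1}^{n} X_i / n$, and using the linearity of the expectation operator, we get

$$\Pr\left[\left|\frac{\sum_{i=1}^{n} X_i}{n} - \mu\right| \geq \epsilon\right] \leq \frac{Var\left[\sum_{i=1}^{n} X_i / n\right]}{\epsilon^2} = \frac{E\left[\left(\sum_{i=1}^{n} \overline{X}_i\right)^2\right]}{\epsilon^2 \cdot n^2}.$$

Now (again using the linearity of expectation)

$$E\left[\left(\sum_{i=1}^{n} \overline{X}_i\right)^2\right] = \sum_{i=1}^{n} E\left[\overline{X}_i^2\right] + \sum_{1 \leq i \neq j \leq n} E\left[\overline{X}_i \overline{X}_j\right].$$

By the pairwise independence of the $\overline{X}_i$'s, we get $E[\overline{X}_i \overline{X}_j] = E[\overline{X}_i] \cdot E[\overline{X}_j]$, and using $E[\overline{X}_i] = 0$, we get

$$E\left[\left(\sum_{i=1}^{n} \overline{X}_i\right)^2\right] = n \cdot \sigma^2.$$

The corollary follows.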
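To see Eq. (4) at work, one can build pairwise independent bits from a short random seed and compute the deviation probability exactly. The construction below (one XOR of seed bits per non-empty subset of seed positions) is a standard example that does not appear in the text; it and all helper names are assumptions made for this sketch. Each derived bit is uniform, so $\mu = 1/2$ and $\sigma^2 = 1/4$.

```python
from fractions import Fraction
from itertools import product

def xor_samples(seed):
    """From k seed bits, derive 2^k - 1 bits: one XOR per non-empty subset of seed positions."""
    k = len(seed)
    out = []
    for mask in range(1, 2 ** k):
        bit = 0
        for j in range(k):
            if (mask >> j) & 1:
                bit ^= seed[j]
        out.append(bit)
    return out

def is_pairwise_independent(k):
    """Check Pr[X_i = a and X_j = b] = 1/4 for all i != j and all bits a, b."""
    rows = [xor_samples(s) for s in product((0, 1), repeat=k)]
    n = len(rows[0])
    for i in range(n):
        for j in range(i + 1, n):
            for a, b in product((0, 1), repeat=2):
                joint = sum(1 for r in rows if r[i] == a and r[j] == b)
                if Fraction(joint, len(rows)) != Fraction(1, 4):
                    return False
    return True

def deviation_probability(k, eps):
    """Exact Pr[|sum X_i / n - 1/2| >= eps] over a uniformly chosen k-bit seed."""
    seeds = list(product((0, 1), repeat=k))
    n = 2 ** k - 1
    bad = sum(
        1 for s in seeds
        if abs(Fraction(sum(xor_samples(s)), n) - Fraction(1, 2)) >= eps
    )
    return Fraction(bad, len(seeds))

def chebyshev_sampling_bound(n, eps, sigma_sq=Fraction(1, 4)):
    """Right-hand side of Eq. (4): sigma^2 / (eps^2 * n)."""
    return sigma_sq / (eps ** 2 * n)

k, eps = 4, Fraction(1, 4)
n = 2 ** k - 1
print(is_pairwise_independent(k))
print(deviation_probability(k, eps), chebyshev_sampling_bound(n, eps))
```

In this construction the all-zero seed makes every sample 0, so the deviation probability is $1/(n+1)$: it decreases only linearly in $n$, matching the form of Eq. (4).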

2.3 Chernoff Bound

When using pairwise independent sample points, the error probability in the approximation decreases linearly with the number of sample points (see Eq. (4)). When using totally independent sample points, the error probability in the approximation can be shown to decrease exponentially with the number of sample points. (Recall that the random variables $X_1, X_2, \ldots, X_n$ are said to be totally independent if for every sequence $a_1, a_2, \ldots, a_n$ it holds that $\Pr[\wedge_{i=1}^{n} X_i = a_i] = \prod_{i=1}^{n} \Pr[X_i = a_i]$.) Probability bounds supporting the foregoing statement are given next. The first bound, commonly referred to as Chernoff Bound, concerns 0-1 random variables (i.e., random variables that are assigned as values either 0 or 1), and asserts the following. Let $p \leq \frac{1}{2}$, and $X_1, X_2, \ldots, X_n$ be independent 0-1 random variables such that $\Pr[X_i = 1] = p$, for each $i$. Then, for every $\epsilon \in (0, p]$, it holds that

$$\Pr\left[\left|\frac{\sum_{i=1}^{n} X_i}{n} - p\right| > \epsilon\right] < 2 \cdot e^{-c \cdot \epsilon^2 \cdot n}, \quad \text{where } c = \max\left(2, \frac{1}{3p}\right). \tag{5}$$

The more common formulation sets $c = 2$, but the case $c = 1/3p$ is very useful when $p$ is small and one cares about a multiplicative deviation (e.g., $\epsilon = p/2$).

Proof Sketch: We upper-bound $\Pr[\sum_{i=1}^{n} X_i - pn > \epsilon n]$, and $\Pr[pn - \sum_{i=1}^{n} X_i > \epsilon n]$ is bounded similarly. Letting $\overline{X}_i \stackrel{\rm def}{=} X_i - E(X_i)$, we apply Markov's inequality to the random variable $e^{\lambda \sum_{i=1}^{n} \overline{X}_i}$, where $\lambda \in (0, 1]$ will be determined to optimize the expressions that we derive. Thus, $\Pr[\sum_{i=1}^{n} \overline{X}_i > \epsilon n]$ is upper-bounded by

$$\frac{E\left[e^{\lambda \sum_{i=1}^{n} \overline{X}_i}\right]}{e^{\lambda \epsilon n}} = e^{-\lambda \epsilon n} \cdot \prod_{i=1}^{n} E\left[e^{\lambda \overline{X}_i}\right]$$

where the equality is due to the independence of the random variables. To simplify the rest of the proof, we establish a sub-optimal bound as follows. Using a Taylor expansion of $e^x$ (e.g., $e^x < 1 + x + x^2$ for $|x| \leq 1$) and observing that $E[\overline{X}_i] = 0$, we get $E[e^{\lambda \overline{X}_i}] < 1 + \lambda^2 E[\overline{X}_i^2]$, which equals $1 + \lambda^2 p(1-p)$. Thus, $\Pr[\sum_{i=1}^{n} X_i - pn > \epsilon n]$ is upper-bounded by $e^{-\lambda \epsilon n} \cdot (1 + \lambda^2 p(1-p))^n < \exp(-\lambda \epsilon n + \lambda^2 p(1-p) n)$, which is optimized at $\lambda = \epsilon/(2p(1-p))$ yielding $\exp\left(-\frac{\epsilon^2}{4p(1-p)} \cdot n\right) \leq \exp(-\epsilon^2 \cdot n)$.

The foregoing proof strategy can be applied in more general settings (see Footnote 2).
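The Chernoff Bound of Eq. (5) can be compared against the exact binomial tail for moderate $n$. The following is a rough sketch under assumed parameter values ($p = 1/10$ and $p = 1/2$, chosen only for illustration); the function names are hypothetical, and rational arithmetic is used for the comparison $|k/n - p| > \epsilon$ so that boundary cases are not affected by rounding.

```python
from fractions import Fraction
from math import comb, exp

def exact_deviation_probability(n, p, eps):
    """Exact Pr[|sum X_i / n - p| > eps] for n independent 0-1 variables with Pr[X_i = 1] = p."""
    total = Fraction(0)
    for k in range(n + 1):
        if abs(Fraction(k, n) - p) > eps:
            total += comb(n, k) * p ** k * (1 - p) ** (n - k)
    return float(total)

def chernoff_bound(n, p, eps):
    """Right-hand side of Eq. (5): 2 * exp(-c * eps^2 * n) with c = max(2, 1/(3p))."""
    c = max(2, 1 / (3 * p))
    return 2 * exp(-float(c * eps ** 2 * n))

# Small p with a multiplicative deviation (eps = p/2), where c = 1/(3p) is the larger constant.
p, eps = Fraction(1, 10), Fraction(1, 20)
for n in (100, 200, 400):
    print(n, exact_deviation_probability(n, p, eps), chernoff_bound(n, p, eps))
```

Doubling $n$ squares the exponential factor of the bound, which is the exponential decrease that the text contrasts with the linear decrease of Eq. (4).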
A more general bound, which refers to independent random variables that are each bounded but are not necessarily identical, is given next (and is commonly referred to as Hoeffding Inequality). Let $X_1, X_2, \ldots, X_n$ be $n$ independent random variables, each ranging in the (real) interval $[a, b]$, and let $\mu \stackrel{\rm def}{=} \frac{1}{n} \sum_{i=1}^{n} E(X_i)$ denote the average expected value of these variables. Then, for every $\epsilon > 0$,

$$\Pr\left[\left|\frac{\sum_{i=1}^{n} X_i}{n} - \mu\right| > \epsilon\right] < 2 \cdot e^{-\frac{2\epsilon^2}{(b-a)^2} \cdot n}. \tag{6}$$

The special case (of Eq. (6)) that refers to identically distributed random variables is easy to derive from the foregoing Chernoff Bound (by recalling Footnote 2 and using a linear mapping of the interval $[a, b]$ to the interval $[0, 1]$). This special case is useful in estimating the average value of a (bounded) function defined over a large domain, especially when the desired error probability needs to be negligible (i.e., decrease faster than any polynomial in the number of samples). Such an estimate can be obtained provided that we can sample the function's domain (and evaluate the function).

Footnote 2: For example, verify that the current proof actually applies to the case that $X_i \in [0, 1]$ rather than $X_i \in \{0, 1\}$, by noting that $Var[X_i] \leq p(1-p)$ still holds.
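The estimation task just described can be sketched as follows. Solving $2 \cdot e^{-2\epsilon^2 n/(b-a)^2} \leq \delta$ in Eq. (6) for $n$ gives $n \geq (b-a)^2 \ln(2/\delta) / (2\epsilon^2)$. The domain, the function `f`, the seed and the helper names below are all hypothetical choices made for illustration, and Python's pseudorandom generator stands in for truly independent samples.

```python
import math
import random

def hoeffding_sample_size(eps, delta, a=0.0, b=1.0):
    """Smallest n with 2 * exp(-2 * eps^2 * n / (b - a)^2) <= delta, per Eq. (6)."""
    return math.ceil((b - a) ** 2 * math.log(2 / delta) / (2 * eps ** 2))

def estimate_average(f, domain, n, rng):
    """Average of f over n samples drawn independently and uniformly from domain."""
    return sum(f(rng.choice(domain)) for _ in range(n)) / n

# Hypothetical bounded function over a hypothetical domain; f takes values in [0, 1].
domain = range(10_000)

def f(x):
    return (x * x % 97) / 96

eps, delta = 0.05, 1e-6
n = hoeffding_sample_size(eps, delta)
rng = random.Random(0)  # fixed seed so the run is reproducible
estimate = estimate_average(f, domain, n, rng)
true_average = sum(f(x) for x in domain) / len(domain)
print(n, estimate, true_average)
```

Note that the number of samples does not depend on the size of the domain, only on $\epsilon$, $\delta$ and the range $b - a$.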

2.4 Pairwise independent versus totally independent sampling

In Sections 2.2 and 2.3 we considered two Laws of Large Numbers that assert that, when sufficiently many trials are made, the average value obtained in these actual trials typically approaches the expected value of a trial. In Section 2.2 these trials were performed based on pairwise independent samples, whereas in Section 2.3 these trials were performed based on totally independent samples. In this section we shall see that the amount of deviation (of the average from the expectation) is approximately the same in both cases, but the probability of deviation is much smaller in the latter case.

To demonstrate the difference between the sampling bounds provided in Sections 2.2 and 2.3, we consider the problem of estimating the average value of a function $f : \Omega \to [0, 1]$. In general, we say that a random variable $Z$ provides an $(\epsilon, \delta)$-approximation of a value $v$ if $\Pr[|Z - v| > \epsilon] \leq \delta$. By Eq. (6), the average value of $f$ evaluated at $n = O(\epsilon^{-2} \cdot \log(1/\delta))$ independent samples (selected uniformly in $\Omega$) yields an $(\epsilon, \delta)$-approximation of $\mu = \sum_{x \in \Omega} f(x)/|\Omega|$. Thus, the number of sample points is polynomially related to $\epsilon^{-1}$ and logarithmically related to $\delta^{-1}$. In contrast, by Eq. (4), an $(\epsilon, \delta)$-approximation by $n$ pairwise independent samples calls for setting $n = O(\epsilon^{-2} \cdot \delta^{-1})$. We stress that, in both cases the number of samples is polynomially related to the desired accuracy of the estimation (i.e., $\epsilon$). The only advantage of totally independent samples over pairwise independent ones is in the dependency of the number of samples on the error probability (i.e., $\delta$).
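The two sample counts can be compared numerically. The sketch below makes the constants hidden in the $O(\cdot)$ notation explicit under two assumptions that are not stated in the text: for totally independent samples it solves $2 \cdot e^{-2\epsilon^2 n} \leq \delta$ (Eq. (6) with $[a, b] = [0, 1]$), and for pairwise independent samples it solves $\sigma^2/(\epsilon^2 n) \leq \delta$ (Eq. (4)) using the fact that a $[0, 1]$-valued random variable has variance at most $1/4$. The function names are hypothetical.

```python
import math

def totally_independent_samples(eps, delta):
    """n obtained from Eq. (6) with [a, b] = [0, 1]: 2 * exp(-2 * eps^2 * n) <= delta."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

def pairwise_independent_samples(eps, delta):
    """n obtained from Eq. (4) with sigma^2 <= 1/4: 1 / (4 * eps^2 * n) <= delta."""
    return math.ceil(1 / (4 * eps ** 2 * delta))

eps = 0.1
for delta in (1e-1, 1e-3, 1e-6, 1e-9):
    print(delta,
          totally_independent_samples(eps, delta),
          pairwise_independent_samples(eps, delta))
```

For a fixed $\epsilon$, shrinking $\delta$ by a factor of 1000 adds a constant number of totally independent samples but multiplies the number of pairwise independent samples by about 1000.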