Chapter 1

Random Variables

1.1 Elementary Examples

We will start with elementary and intuitive examples of probability. The most well-known example is that of a fair coin: if flipped, the probabilities of getting a head or a tail both equal $1/2$. If we perform $n$ independent tosses, then the probability of obtaining $n$ heads is equal to $1/2^n$. In fact, let $S_n = X_1 + X_2 + \cdots + X_n$, where
$$X_j = \begin{cases} 1 & \text{if the result of the $j$-th trial is a head,} \\ 0 & \text{if the result of the $j$-th trial is a tail.} \end{cases}$$
Then the probability that we get $k$ heads out of $n$ tosses is equal to
$$\mathrm{Prob}\{S_n = k\} = \frac{1}{2^n}\binom{n}{k}.$$
In particular, using Stirling's formula, we can calculate the asymptotics of the probability of obtaining heads exactly half of the time:
$$\mathrm{Prob}\{S_{2n} = n\} = \frac{1}{2^{2n}}\binom{2n}{n} = \frac{1}{2^{2n}}\,\frac{(2n)!}{(n!)^2} \sim \frac{1}{\sqrt{\pi n}} \to 0,$$
as $n \to \infty$. On the other hand, since we have a fair coin, we do expect to obtain heads roughly half of the time, i.e. $S_{2n}/(2n) \approx 1/2$ for large $n$. Such a statement is indeed true and is embodied in the law of large numbers that we discuss in the next chapter. For the moment let us simply observe that while the probability that $S_{2n}$ equals $n$ goes to zero as $n \to \infty$, the probability that $S_{2n}/(2n)$ is close to $1/2$ goes to one as $n \to \infty$. More precisely, for any $\varepsilon > 0$,
$$\mathrm{Prob}\left\{\left|\frac{S_{2n}}{2n} - \frac{1}{2}\right| > \varepsilon\right\} \to 0,$$
as $n \to \infty$. This is the weak law of large numbers for this particular example.

In the example of the fair coin, the number of outcomes in an experiment is finite. In contrast, the second class of elementary examples involves a continuous set of possible outcomes. Consider the orientation of a unit vector $\tau$. Denote by $S^2$ the unit sphere in $\mathbb{R}^3$. Define $f(n)$, $n \in S^2$, as the orientation distribution function, i.e. for $A \subset S^2$,
$$\mathrm{Prob}\{\tau \in A\} = \int_A f(n)\,d\omega,$$
where $d\omega$ is the surface area element on $S^2$. If $\tau$ does not have a preferred orientation, i.e. it has equal probability of pointing in any direction, then $f(n) = \frac{1}{4\pi}$. On the other hand, if $\tau$ does have a preferred orientation, say $n_0$, then we expect $f(n)$ to be peaked at $n_0$.

1.2 Probability Space

It is useful to put these intuitive notions of probability on a firm mathematical basis, as was done by Kolmogorov. For this purpose, we define the notion of a probability space, often written as a triplet $(\Omega, \mathcal{F}, \mathbb{P})$, where $\Omega$ is a set called the sample space and $\mathcal{F}$ is a collection of subsets of $\Omega$ satisfying the requirements that

(i) $\Omega \in \mathcal{F}$, $\emptyset \in \mathcal{F}$;
(ii) if $A \in \mathcal{F}$, then $\bar{A} \in \mathcal{F}$, where $\bar{A} = \Omega \setminus A$ is the complement of $A$ in $\Omega$;
(iii) if $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{n \in \mathbb{N}} A_n \in \mathcal{F}$.

Such a collection of sets is called a $\sigma$-algebra, and the pair $(\Omega, \mathcal{F})$ is called a measurable space. We often use $\omega$ to denote elements of $\Omega$, which we call sample points, and each set in $\mathcal{F}$ is called an event. $\mathbb{P} : \mathcal{F} \to [0,1]$ is the probability measure or, in short, the probability, defined on $\mathcal{F}$; it satisfies

(a) $\mathbb{P}(\emptyset) = 0$, $\mathbb{P}(\Omega) = 1$;
(b) if $A_1, A_2, \ldots \in \mathcal{F}$ are pairwise disjoint, $A_i \cap A_j = \emptyset$ if $i \neq j$, then
$$\mathbb{P}\Big(\bigcup_{n \in \mathbb{N}} A_n\Big) = \sum_{n \in \mathbb{N}} \mathbb{P}(A_n).$$

Example (Fair coin). The probability space for the outcome of one trial can be defined as follows:
$$\Omega = \{\text{head}, \text{tail}\}, \qquad \mathcal{F} = \text{all subsets of } \Omega = \{\emptyset, \{\text{head}\}, \{\text{tail}\}, \Omega\},$$
and
$$\mathbb{P}(\emptyset) = 0, \qquad \mathbb{P}(\{\text{head}\}) = \mathbb{P}(\{\text{tail}\}) = \tfrac{1}{2}, \qquad \mathbb{P}(\Omega) = 1.$$
For $n$ tosses, we can take $\Omega = \{\text{head}, \text{tail}\}^n$, $\mathcal{F} = $ all subsets of $\Omega$, and
$$\mathbb{P}(A) = \frac{\mathrm{Card}(A)}{2^n},$$
where $\mathrm{Card}(A)$ is the cardinality of the set $A$.

Within this framework, the standard rules of set theory are used to answer probability questions. For instance, if $A, B \in \mathcal{F}$, the probability that $A$ and $B$ occur is given by $\mathbb{P}(A \cap B)$, the probability that $A$ or $B$ occurs by $\mathbb{P}(A \cup B)$, the probability that $A$ but not $B$ occurs by $\mathbb{P}(A \setminus B)$, etc. It is also elementary to show intuitive properties like
$$A \subset B \implies \mathbb{P}(A) \le \mathbb{P}(B),$$
since $B = A \cup (B \setminus A)$, implying by (b) above that $\mathbb{P}(B) = \mathbb{P}(A) + \mathbb{P}(B \setminus A) \ge \mathbb{P}(A)$, or
$$\mathbb{P}(A \cup B) \le \mathbb{P}(A) + \mathbb{P}(B),$$
since $A \cup B = A \cup (B \cap A^c)$, implying by (b) above that $\mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B \cap A^c) \le \mathbb{P}(A) + \mathbb{P}(B)$. This inequality is known as Boole's inequality.

Another useful notion is that of independence. Two events $A, B \in \mathcal{F}$ are independent if
$$\mathbb{P}(A \cap B) = \mathbb{P}(A)\,\mathbb{P}(B).$$
This generalizes to a sequence $\{A_n\}_{n \in \mathbb{N}}$ of independent events as
$$\mathbb{P}\Big(\bigcap_{n \in \mathbb{N}} A_n\Big) = \prod_{n \in \mathbb{N}} \mathbb{P}(A_n).$$

1.3 Conditional Probability

Let $A, B \in \mathcal{F}$ and assume that $\mathbb{P}(B) \neq 0$. Then the conditional probability of $A$ given $B$ is defined as
$$\mathbb{P}(A \mid B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}.$$
Clearly this corresponds to the proportion of events where both $A$ and $B$ occur given that $B$ has occurred. For instance, the probability of obtaining two tails in two tosses of a fair coin is $\frac14$, but the conditional probability of obtaining two tails is $\frac12$ given that the first toss is a tail, and is zero given that the first toss is a head. One has $\mathbb{P}(\Omega \mid B) = 1$,
and therefore if $A_1, A_2, \ldots$ are disjoint sets such that $\bigcup_n A_n = \Omega$, then $\sum_n \mathbb{P}(A_n \mid B) = 1$. Therefore
$$\mathbb{P}(A_j \mid B) = \frac{\mathbb{P}(A_j \cap B)}{\sum_n \mathbb{P}(A_n \cap B)}.$$
This is known as Bayes' rule. Since $\mathbb{P}(A \cap B) = \mathbb{P}(A \mid B)\,\mathbb{P}(B)$ by definition, we also have by iteration
$$\mathbb{P}(A \cap B \cap C) = \mathbb{P}(A \mid B \cap C)\,\mathbb{P}(B \mid C)\,\mathbb{P}(C),$$
and so on.

1.4 Discrete Distributions

Making $n$ tosses of a fair coin is an example of an experiment where the total number of outcomes is finite. More generally, if the number of elements in $\Omega = \{A_1, A_2, \ldots\}$ is finite or enumerable, and associated with each of these events there is a numerical value $X_j$, the function $X$ such that $X(A_j) = X_j$ is called a discrete random variable. It is usually convenient to simply take $X_j \in \{1, 2, \ldots, N\}$ in the finite case, and $X_j \in \mathbb{N}$ or $X_j \in \mathbb{Z}$ in the enumerable case, so that $X : \Omega \to \{1, 2, \ldots, N\}$, $X : \Omega \to \mathbb{N}$, or $X : \Omega \to \mathbb{Z}$, and we will restrict ourselves to these cases. The probability distribution of $X$ is the probability that $X$ takes the value $j$, i.e.
$$p(j) = \mathrm{Prob}\{X = j\} = \mathbb{P}(A_j), \quad j = 1, \ldots,$$
and it obviously satisfies
$$p(j) \ge 0, \qquad \sum_j p(j) = 1.$$
(The second condition also implies that $p(j) \le 1$.) Given a function $f$ of $X$, its expectation is given by
$$\mathbb{E}f(X) = \sum_i f(i)\,p(i),$$
if the sum is well-defined. In particular, the $p$-th moment of the distribution is defined as
$$m_p = \sum_j j^p\, p(j),$$
if the sum is well-defined. The first moment is also called the mean of the random variable and is denoted by $\mathrm{mean}(X)$. Another important quantity is its variance, defined as
$$\mathrm{var}(X) = m_2 - m_1^2 = \sum_j (j - m_1)^2\, p(j).$$
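As a quick sanity check, the mean and variance can be computed directly from a probability mass function. The sketch below (Python; the helper names are our own, not from the text) does this for the fair-coin distribution $p(j) = 2^{-n}\binom{n}{j}$, anticipating the exercise that follows.

```python
import math

# p(j) = C(n, j) / 2^n: distribution of the number of heads in n fair tosses.
def coin_pmf(n):
    return [math.comb(n, j) / 2**n for j in range(n + 1)]

# Moments straight from the definition: m_k = sum_j j^k p(j).
def moment(p, k):
    return sum(j**k * pj for j, pj in enumerate(p))

def mean(p):
    return moment(p, 1)

def var(p):
    return moment(p, 2) - moment(p, 1) ** 2

n = 20
p = coin_pmf(n)
```

For $n = 20$ this returns mean $10 = n/2$ and variance $5 = n/4$, in agreement with the formulas in the exercise below.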
Exercise. $S_n$, the number of heads in an outcome of $n$ tosses, is an example of a random variable whose distribution is given by
$$p(j) = \mathrm{Prob}\{S_n = j\} = \frac{1}{2^n}\binom{n}{j}.$$
Show that the mean and the variance of this random variable are
$$\mathrm{mean}(S_n) = \frac{n}{2}, \qquad \mathrm{var}(S_n) = \frac{n}{4},$$
implying that $\mathrm{var}(S_n)/(\mathrm{mean}(S_n))^2 \to 0$ as $n \to \infty$.

Poisson distribution. This is the simplest example of the distribution of randomly scattered points. Consider a set of randomly scattered points in the plane. Let $A$ be a set in $\mathbb{R}^2$ and let $X_A(\omega)$ be the number of points in $A$. $X_A$ has a Poisson distribution if
$$\mathbb{P}\{X_A(\omega) = n\} = \frac{\lambda^n}{n!}\,e^{-\lambda},$$
where $\lambda$ may depend on $A$ according to the density of the points. If the points are uniformly distributed on the plane with unit density, then $\lambda$ is equal to the area of $A$. Notice that in this case
$$\lambda = \mathrm{mean}(X_A) = \mathrm{var}(X_A).$$

1.5 Continuous Distributions

Consider now the general case when $\Omega$ is not necessarily enumerable. A function $X : \Omega \to \mathbb{R}^n$ is called a random variable on the probability space $(\Omega, \mathcal{F}, \mathbb{P})$ if the set $\{\omega : X(\omega) \in U\}$ is in $\mathcal{F}$ for every open set $U \subset \mathbb{R}^n$. The distribution of the random variable $X$ is a probability measure $\mu$ on $\mathbb{R}^n$, defined for any set $B \subset \mathbb{R}^n$ by
$$\mu(B) = \mathrm{Prob}\{X \in B\} = \mathbb{P}\{\omega \in \Omega : X(\omega) \in B\}.$$
If there exists an integrable function $\rho(x)$ such that
$$\mu(B) = \int_B \rho(x)\,dx,$$
then $\rho$ is called the probability density of $X$. Given its distribution, the expectation of $f(X)$ is defined as
$$\mathbb{E}f(X) = \int_{\mathbb{R}^n} f(x)\,\mu(dx),$$
if $f$ is continuous and the right-hand side is well-defined (i.e. $\int_{\mathbb{R}^n} |f(x)|\,\mu(dx) < \infty$). Two important expectations are the mean of $X$,
$$\mathrm{mean}(X) = \mathbb{E}X,$$
and the variance of $X$,
$$\mathrm{var}(X) = \mathbb{E}\big((X - \mathbb{E}X)(X - \mathbb{E}X)^T\big).$$
The expectation of $f(X)$ can also be written as
$$\mathbb{E}f(X) = \int_\Omega f(X(\omega))\,d\mathbb{P}(\omega),$$
which allows us to make a connection with $L^p$-spaces. For $p \ge 1$ let
$$L^p = L^p(\Omega, \mathcal{F}, \mathbb{P}) = \{X(\omega) : \mathbb{E}|X|^p < \infty\}.$$
Then $L^p$ is a Banach space with norm $\|X\|_p = (\mathbb{E}|X|^p)^{1/p}$, and $L^2$ is a Hilbert space with scalar product $\langle X, Y \rangle = \mathbb{E}(X, Y)$, where $(X, Y)$ denotes the standard scalar product in $\mathbb{R}^d$. In $L^2$, we have Schwarz's inequality
$$\mathbb{E}|(X, Y)| \le \|X\|_2 \|Y\|_2.$$
More generally, we have Hölder's inequality
$$\mathbb{E}|(X, Y)| \le \|X\|_p \|Y\|_q, \qquad p > 1,\ 1/p + 1/q = 1,\ X \in L^p,\ Y \in L^q.$$
The triangle inequality in $L^p$ is called Minkowski's inequality:
$$\|X + Y\|_p \le \|X\|_p + \|Y\|_p, \qquad p \ge 1,\ X, Y \in L^p.$$

Exercise (Jensen's inequality). Let $X$ be a one-dimensional random variable such that $M(\lambda) = \mathbb{E}e^{\lambda X} < \infty$ for some $\lambda \in \mathbb{R}$. Show that $M(\lambda) \ge e^{\lambda \mathbb{E}X}$.

Two random variables $X$ and $Y$ are called independent if
$$\mathbb{E}f(X)g(Y) = \mathbb{E}f(X)\,\mathbb{E}g(Y),$$
for all continuous functions $f$ and $g$. This means that the joint distribution of $X$ and $Y$ is simply the product measure of the distributions of $X$ and $Y$, i.e. $\mu(dx, dy) = \mu_X(dx)\,\mu_Y(dy)$. A weaker notion is the following: $X$ and $Y$ are uncorrelated if $\mathrm{cov}(X, Y) = 0$, where $\mathrm{cov}(X, Y)$ is the covariance matrix of $X$ and $Y$,
$$\mathrm{cov}(X, Y) = \mathbb{E}(X - \mathbb{E}X)(Y - \mathbb{E}Y)^T.$$
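The gap between these two notions can be seen in a small exact computation. A minimal sketch (Python, with exact arithmetic via `fractions`; the particular random variables are our illustrative choice, not from the text): $X$ uniform on $\{-1, 0, 1\}$ and $Y = |X|$ are uncorrelated, yet not independent.

```python
from fractions import Fraction

# X takes values -1, 0, 1 with probability 1/3 each, and Y = |X|.
support = [(-1, 1), (0, 0), (1, 1)]  # (x, y) pairs, each with probability 1/3

def E(f):
    return sum(Fraction(1, 3) * f(x, y) for x, y in support)

# Uncorrelated: cov(X, Y) = E[XY] - E[X]E[Y] = 0.
cov_XY = E(lambda x, y: x * y) - E(lambda x, y: x) * E(lambda x, y: y)

# Not independent: with f(x) = x^2 and g(y) = y,
# E[f(X)g(Y)] = 2/3 while E[f(X)] E[g(Y)] = 4/9.
lhs = E(lambda x, y: x**2 * y)
rhs = E(lambda x, y: x**2) * E(lambda x, y: y)
```

Here `cov_XY` is exactly $0$, but `lhs` $= 2/3 \neq 4/9 =$ `rhs`, so the defining identity of independence fails for these $f$, $g$.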
Exercise (Pairwise independence does not imply independence). Let $X$, $Y$ be two independent random variables such that
$$\mathbb{P}(X = \pm 1) = \mathbb{P}(Y = \pm 1) = \tfrac12,$$
and let $Z = XY$. Check that $X$, $Y$, $Z$ are pairwise independent but not independent, i.e.
$$\mathbb{E}f(X)g(Y)h(Z) = \mathbb{E}f(X)\,\mathbb{E}g(Y)\,\mathbb{E}h(Z)$$
does not hold for all continuous functions $f$, $g$, $h$.

1.6 Examples of Continuous Distributions

We list a few important continuous distributions.

1. Uniform distribution:
$$\rho(x) = \begin{cases} \dfrac{1}{\mathrm{vol}(B)} & \text{if } x \in B, \\ 0 & \text{otherwise.} \end{cases}$$
In one dimension, for $B = [0, 1]$ this reduces to
$$\rho(x) = \begin{cases} 1 & \text{if } x \in [0, 1], \\ 0 & \text{otherwise.} \end{cases}$$
Pseudo-random numbers generated on the computer typically have this distribution.

2. Exponential distribution:
$$\rho(x) = \begin{cases} 0 & \text{if } x < 0, \\ \lambda e^{-\lambda x} & \text{if } x \ge 0. \end{cases}$$
Special cases include the distribution of waiting times for a continuous-time Markov process ($\lambda$ is the rate of the process) and the Boltzmann distribution
$$f(E) = \begin{cases} e^{-\beta E} & \text{if } E \ge 0, \\ 0 & \text{if } E < 0, \end{cases}$$
where $\beta = 1/k_B T$, $k_B$ is the Boltzmann constant, and $T$ is the temperature.
3. Normal distribution:
$$\rho(x) = \frac{e^{-\frac12 (x - \bar{x})^T A^{-1} (x - \bar{x})}}{(2\pi)^{n/2}(\det A)^{1/2}},$$
where $A$ is a symmetric positive definite matrix and $\det A$ is the determinant of $A$; $\bar{x}$ is the mean, $A$ is the covariance, and the random variable with the density above is denoted by $N(\bar{x}, A)$. Such random variables are also called Gaussian random variables. In dimension one, the density of a normal variable reduces to
$$\rho(x) = \frac{e^{-(x - \bar{x})^2/2\sigma^2}}{\sqrt{2\pi\sigma^2}},$$
where $\bar{x}$ is the mean and $\sigma^2$ is the variance.

Exercise. Let $X = (X_1, \ldots, X_n)$ be an $n$-dimensional Gaussian random variable and let $Y = c_1 X_1 + c_2 X_2 + \cdots + c_n X_n$, where $c_1, \ldots, c_n$ are constants. Show that $Y$ is also Gaussian.

1.7 Conditional Expectation

The concept of conditional probability extends straightforwardly to discrete random variables. Let $X$ and $Y$ be two discrete random variables, not necessarily independent, with joint probability $p(i, j) = \mathbb{P}(X = i,\, Y = j)$. Since $\sum_i p(i, j) = \mathbb{P}(Y = j)$, the conditional probability that $X = i$ given that $Y = j$ is given by
$$p(i \mid j) = \frac{p(i, j)}{\sum_i p(i, j)},$$
if $\sum_i p(i, j) > 0$, and is conventionally taken to be zero if $\sum_i p(i, j) = 0$. From this, the natural definition of the conditional expectation of $f(X)$ given that $Y = j$ is
$$\mathbb{E}(f(X) \mid Y = j) = \sum_i f(i)\,p(i \mid j).$$
A difficulty arises when one tries to generalize this concept to continuous random variables. Indeed, given two continuous random variables $X$ and $Y$ with joint probability distribution $\mu(dx, dy)$, the probability that $Y = y$ is zero for most values of $y$. Therefore, the ratio involved in the definition of the conditional probability measure is not defined. An obvious way to try to fix this problem is to take limits, i.e. define the conditional probability of $X \in B$ given that $Y = y$ as
$$\mu(B \mid y) = \lim_{\delta \to 0} \frac{\mu(B, B_\delta(y))}{\int_{\mathbb{R}^n} \mu(dx, B_\delta(y))},$$
where $B_\delta(y)$ is the ball of radius $\delta$ centered at $y$. However, requiring that this limit exists for every $y$ turns out to be very restrictive, and it is better to proceed differently. This is done as follows. One says that a measure $\nu$ is absolutely continuous with respect to a measure $\mu$, which is denoted as $\nu \ll \mu$, if for every open set $B$ such that $\mu(B) = 0$, one also has $\nu(B) = 0$. It can be shown that when $\nu \ll \mu$, there exists a function $\varphi(z)$ such that for any $B$,
$$\nu(B) = \int_B \varphi(z)\,\mu(dz).$$
The function $\varphi$ is unique up to equivalence, i.e. it can be modified at most on a set of zero measure with respect to $\mu(dz)$, and it is called the Radon-Nikodym derivative of $\nu$ with respect to $\mu$. This is denoted as
$$\varphi = \frac{d\nu}{d\mu}.$$
Note that within this terminology, the fact that $\mu$ has a density $\rho(z)$ means that $\mu$ is absolutely continuous with respect to the Lebesgue measure $dz$, with Radon-Nikodym derivative $\rho(z)$.

The conditional probability of $X \in B$ given that $Y = y$, which we denote as $\mu(B \mid y)$, can be defined via the Radon-Nikodym derivative of $\mu(dx, dy)$ with respect to $\mu_Y(B) = \int_{\mathbb{R}^n} \mu(dx, B)$ (the marginal probability of $Y$), i.e. it satisfies, for all open sets $A$, $B$,
$$\mu(A, B) = \int_B \mu(A \mid y)\,\mu_Y(dy).$$
Thus $\mu(A \mid y)$ is only defined up to a set of measure zero with respect to $\mu_Y(dy)$, which is fine in practice and is why this definition is less restrictive than the one above in terms of a limit. If the pair $(X, Y)$ has a density $\rho(x, y)$, then
$$\mu(A \mid y) = \int_A \frac{\rho(x, y)}{\rho(y)}\,dx,$$
where $\rho(y) = \int_{\mathbb{R}^n} \rho(x, y)\,dx$ is the marginal density of $Y$.

1.8 Borel-Cantelli Lemma

Many questions in probability, like the convergence of the normalized sum of independent variables to their mean (i.e. the law of large numbers treated below), involve tail events. A tail event is an event defined on a sequence of events, say $\{A_n\}_{n \in \mathbb{N}}$, whose probability does not depend on the first events $A_1, \ldots, A_k$, no matter how large $k$ is.
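The dichotomy that the Borel-Cantelli lemma below makes precise can be previewed numerically: for independent events $A_n$ with $\mathbb{P}(A_n) = p_n$, a summable series $\sum_n p_n$ yields only finitely many occurrences per realization, while a divergent one keeps producing them. A rough simulation sketch (Python; the choices $p_n = 1/n^2$ versus $p_n = 1/n$, the cutoff $N$, and the number of runs are all illustrative assumptions):

```python
import random

rng = random.Random(0)

# Number of the independent events A_n (n <= N) that occur in one
# realization, with P(A_n) = p(n).
def occurrences(p, N, rng):
    return sum(1 for n in range(1, N + 1) if rng.random() < p(n))

N, runs = 10_000, 100
avg_summable = sum(occurrences(lambda n: 1 / n**2, N, rng) for _ in range(runs)) / runs
avg_divergent = sum(occurrences(lambda n: 1 / n, N, rng) for _ in range(runs)) / runs
```

Here `avg_summable` stays near $\sum_n 1/n^2 = \pi^2/6 \approx 1.64$ no matter how large $N$ is, whereas `avg_divergent` grows like the harmonic sum $\log N$ ($\approx 9.8$ for this $N$), consistent with the two cases of the lemma.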
For instance, if $S_n/n$ is the proportion of heads in $n$ tosses of a fair coin, the event that $S_n/n \to 1/2$ as $n \to \infty$ is a tail event, because the probability that
$$\frac{S_n}{n} = \frac{1}{n}\sum_{j=1}^n X_j \to \frac{1}{2}$$
is the same as the probability that
$$\frac{1}{n}\sum_{j=k}^n X_j \to \frac{1}{2}$$
for any $k \ge 1$. An important example of a tail event defined on $\{A_n\}_{n \in \mathbb{N}}$ is the event that the $A_n$ occur infinitely often, i.e.
$$\{A_n \text{ i.o.}\} = \{\omega : \omega \in A_n \text{ i.o.}\} = \lim_{k \to \infty} \bigcup_{n \ge k} A_n = \bigcap_{k \in \mathbb{N}} \bigcup_{n \ge k} A_n.$$
The probability of such an event is specified by:

Lemma (Borel-Cantelli Lemma).
1. If $\sum_{n=1}^\infty \mathbb{P}(A_n) < \infty$, then $\mathbb{P}\{A_n \text{ i.o.}\} = 0$.
2. If the $A_n$'s are independent and $\sum_{n=1}^\infty \mathbb{P}(A_n) = \infty$, then $\mathbb{P}\{A_n \text{ i.o.}\} = 1$.

Proof. 1. $\mathbb{P}\big\{\bigcap_{n=1}^\infty \bigcup_{k=n}^\infty A_k\big\} \le \mathbb{P}\big\{\bigcup_{k=n}^\infty A_k\big\} \le \sum_{k=n}^\infty \mathbb{P}(A_k)$ for any $n$, but the last term goes to $0$ as $n \to \infty$ since $\sum_{k=1}^\infty \mathbb{P}(A_k) < \infty$ by assumption.

2. Using independence, one has
$$\mathbb{P}\Big(\bigcup_{k=n}^\infty A_k\Big) = 1 - \mathbb{P}\Big(\bigcap_{k=n}^\infty A_k^c\Big) = 1 - \prod_{k=n}^\infty \mathbb{P}(A_k^c) = 1 - \prod_{k=n}^\infty \big(1 - \mathbb{P}(A_k)\big).$$
Using $1 - x \le e^{-x}$, this gives
$$\mathbb{P}\Big(\bigcup_{k=n}^\infty A_k\Big) \ge 1 - \prod_{k=n}^\infty e^{-\mathbb{P}(A_k)} = 1 - e^{-\sum_{k=n}^\infty \mathbb{P}(A_k)} = 1.$$
Since the left-hand side is a probability, the bound must be sharp and we are done. □

As an example of application of this result, we prove:

Lemma. Let $\{X_n\}_{n \in \mathbb{N}}$ be a sequence of identically distributed (not necessarily independent) random variables such that $\mathbb{E}|X_n| < \infty$. Then
$$\lim_{n \to \infty} \frac{X_n}{n} = 0 \quad \text{a.s.}$$
Here and below a.s. stands for almost surely, i.e. for almost all $\omega \in \Omega$ except possibly a set of zero probability (see Definition 1.9.1).

Proof. For any $\varepsilon > 0$, define $A_n^\varepsilon = \{\omega \in \Omega : |X_n(\omega)|/n > \varepsilon\}$.
Then
$$\sum_{n=1}^\infty \mathbb{P}(A_n^\varepsilon) = \sum_{n=1}^\infty \mathbb{P}\{|X_n| > n\varepsilon\} = \sum_{n=1}^\infty \mathbb{P}\{|X_1| > n\varepsilon\} = \sum_{n=1}^\infty \sum_{k=n}^\infty \mathbb{P}\{k\varepsilon < |X_1| \le (k+1)\varepsilon\}$$
$$= \sum_{k=1}^\infty k\,\mathbb{P}\{k\varepsilon < |X_1| \le (k+1)\varepsilon\} = \sum_{k=1}^\infty \int_{k\varepsilon < |x| \le (k+1)\varepsilon} k\,\mu(dx) \le \frac{1}{\varepsilon}\sum_{k=1}^\infty \int_{k\varepsilon < |x| \le (k+1)\varepsilon} |x|\,\mu(dx) = \frac{1}{\varepsilon}\int_{|x| > \varepsilon} |x|\,\mu(dx) \le \frac{1}{\varepsilon}\,\mathbb{E}|X_1| < \infty.$$
Therefore, if we define $B^\varepsilon = \{\omega \in \Omega : \omega \in A_n^\varepsilon \text{ i.o.}\}$, then $\mathbb{P}(B^\varepsilon) = 0$. Let $B = \bigcup_{n \in \mathbb{N}} B^{1/n}$. Then $\mathbb{P}(B) = 0$, and
$$\lim_{n \to \infty} \frac{X_n(\omega)}{n} = 0, \quad \text{if } \omega \notin B. \qquad \square$$

Note that this proof relies on a special case of a useful inequality which we state as a lemma.

Lemma (Chebyshev's Inequality). Let $X$ be a random variable such that $\mathbb{E}|X|^k < \infty$ for some integer $k$. Then
$$\mathbb{P}\{|X| > \lambda\} \le \frac{\mathbb{E}|X|^k}{\lambda^k},$$
for any positive constant $\lambda$.

Proof. For any $\lambda > 0$,
$$\mathbb{E}|X|^k = \int_{\mathbb{R}^n} |x|^k\,\mu(dx) \ge \int_{|x| \ge \lambda} |x|^k\,\mu(dx) \ge \lambda^k \int_{|x| \ge \lambda} \mu(dx) = \lambda^k\,\mathbb{P}\{|X| \ge \lambda\}. \qquad \square$$
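Chebyshev's inequality is easy to test empirically. A quick sketch (Python; the standard normal test case, sample size, and threshold are illustrative assumptions): for $X \sim N(0,1)$ and $k = 2$ we have $\mathbb{E}|X|^2 = 1$, so $\mathbb{P}\{|X| > \lambda\} \le 1/\lambda^2$.

```python
import random

rng = random.Random(1)

# Empirical check of P(|X| > lam) <= E|X|^2 / lam^2 for X ~ N(0, 1),
# where E|X|^2 = 1.
N, lam = 100_000, 2.0
xs = [rng.gauss(0.0, 1.0) for _ in range(N)]
freq = sum(1 for x in xs if abs(x) > lam) / N
bound = 1.0 / lam**2
```

The empirical frequency comes out near the exact two-sided normal tail $2\Phi(-2) \approx 0.046$, comfortably below the Chebyshev bound $0.25$: the bound is crude, but it requires nothing beyond a finite second moment.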
1.9 Notions of Convergence

Let $\{X_n\}_{n=1}^\infty$ be a sequence of random variables defined on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, and let $\mu_n$ be the distribution of $X_n$. Let $X$ be another random variable with distribution $\mu$. We will discuss four notions of convergence: almost sure convergence, convergence in probability, convergence in distribution, and convergence in $L^p$.

Definition 1.9.1 (Almost sure convergence). $X_n$ converges to $X$ almost surely as $n \to \infty$ if
$$\mathbb{P}\{\omega : X_n(\omega) \to X(\omega)\} = 1.$$
We express almost sure convergence as $X_n \to X$ a.s.

Definition (Convergence in probability). $X_n$ converges to $X$ in probability if for any $\varepsilon > 0$,
$$\mathbb{P}\{\omega : |X_n(\omega) - X(\omega)| > \varepsilon\} \to 0, \quad \text{as } n \to \infty.$$

Definition (Convergence in distribution). $X_n$ converges to $X$ in distribution if for any bounded continuous function $f$,
$$\mathbb{E}f(X_n) \to \mathbb{E}f(X).$$
This is denoted as $X_n \xrightarrow{d} X$, or $\mu_n \xrightarrow{d} \mu$, where $\mu_n$, $\mu$ are the distributions of $X_n$ and $X$ respectively.

Definition (Convergence in $L^p$). Let $\{X_n\}_{n \in \mathbb{N}}$ be a sequence of random variables such that $X_n \in L^p$. $X_n$ converges to $X \in L^p$ in $L^p$ (or in $p$-th mean) if
$$\mathbb{E}|X_n - X|^p \to 0.$$
For $p = 1$ we speak about convergence in mean; for $p = 2$, convergence in mean square.

From real analysis, we know that convergence in $L^p$ implies convergence in $L^q$ if $p \ge q$, almost sure convergence implies convergence in probability, convergence in probability implies that there is a subsequence that converges almost surely, and convergence in probability implies convergence in distribution. Convergence in $L^p$ implies convergence in probability.

1.10 Characteristic Functions

The characteristic function of a probability measure $\mu$ is defined as
$$f(\xi) = \mathbb{E}e^{i(\xi, X)} = \int_{\mathbb{R}^n} e^{i(x, \xi)}\,\mu(dx).$$
It has the following obvious properties:
1. $\forall \xi \in \mathbb{R}^n$, $|f(\xi)| \le 1$, $f(\xi) = \overline{f(-\xi)}$;
2. $f$ is uniformly continuous on $\mathbb{R}^n$.

We also have:

Theorem. Let $\{\mu_n\}_{n \in \mathbb{N}}$ be a sequence of probability measures, and $\{f_n\}_{n \in \mathbb{N}}$ be their corresponding characteristic functions. Assume that
1. $f_n$ converges everywhere on $\mathbb{R}^n$ to a limiting function $f$;
2. $f$ is continuous at $\xi = 0$.
Then there exists a probability distribution $\mu$ such that $\mu_n \xrightarrow{d} \mu$. Moreover, $f$ is the characteristic function of $\mu$.

Conversely, if $\mu_n \xrightarrow{d} \mu$, where $\mu$ is some probability distribution, then $f_n$ converges to $f$ uniformly on every finite interval, where $f$ is the characteristic function of $\mu$.

For a proof, see [2].

As with Fourier transforms, one can also define the inverse transform
$$\rho(x) = \frac{1}{(2\pi)^n} \int_{\mathbb{R}^n} e^{-i(\xi, x)} f(\xi)\,d\xi.$$
An interesting question arises as to when this gives the density of a probability measure. To answer this we define:

Definition. A function $f$ is called positive semi-definite if for any finite set of values $\{\xi_1, \ldots, \xi_n\}$, $n \in \mathbb{N}$, the matrix $(f(\xi_i - \xi_j))$ is positive semi-definite, i.e.
$$\sum_{i,j} f(\xi_i - \xi_j)\,v_i \bar{v}_j \ge 0,$$
for any $\{v_1, \ldots, v_n\}$.

Theorem (Khinchin). If $f$ is a positive semi-definite, uniformly continuous function and $f(0) = 1$, then it is the characteristic function of a probability measure.

Exercise (Stable laws). A one-dimensional distribution $\mu(dx)$ is stable if, given any two independent random variables $X$ and $Y$ following this distribution, there exist $a$ and $b$ such that $a(X + Y - b)$ has distribution $\mu(dx)$. Show that $f(\xi) = e^{-|\xi|^\alpha}$ is the characteristic function of a stable distribution for $0 < \alpha \le 2$, and is not a characteristic function for other values of $\alpha$.

Exercise (Girko's circular law for random matrices). If $A$ has i.i.d. entries with mean zero and variance $\sigma^2$, then the eigenvalues of $A/(\sqrt{n}\,\sigma)$ are asymptotically uniformly distributed in the unit disk in the complex plane. Investigate numerically.
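One possible numerical investigation of the last exercise is sketched below (Python with NumPy as an assumed dependency; the matrix size, Gaussian entries, and tolerances are arbitrary choices). Sample a matrix with i.i.d. $N(0, \sigma^2)$ entries, rescale by $\sqrt{n}\,\sigma$, and compare the radial distribution of the eigenvalues with the uniform law on the unit disk, for which the mass inside radius $r$ is $r^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample an n x n matrix with i.i.d. N(0, sigma^2) entries and rescale
# its eigenvalues as in the exercise.
n, sigma = 500, 1.0
A = rng.normal(0.0, sigma, size=(n, n))
radii = np.abs(np.linalg.eigvals(A / (np.sqrt(n) * sigma)))

frac_in_disk = np.mean(radii < 1.05)   # circular law: mass 1 in the unit disk
frac_in_half = np.mean(radii < 0.5)    # uniform law predicts area ratio 0.25
```

For $n = 500$, nearly all rescaled eigenvalues lie inside (a slightly fattened copy of) the unit disk, and about a quarter of them fall inside radius $1/2$, as the uniform law predicts. Gaussian entries are used only for convenience: the statement of the exercise requires just i.i.d. mean-zero entries of finite variance.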
Lecture Notes in Applied Stochastic Analysis

Weinan E and Eric Vanden-Eijnden

This is a rough preliminary version of a textbook. Do not circulate!
More informationFormulas for probability theory and linear models SF2941
Formulas for probability theory and linear models SF2941 These pages + Appendix 2 of Gut) are permitted as assistance at the exam. 11 maj 2008 Selected formulae of probability Bivariate probability Transforms
More informationBayesian statistics, simulation and software
Module 1: Course intro and probability brush-up Department of Mathematical Sciences Aalborg University 1/22 Bayesian Statistics, Simulations and Software Course outline Course consists of 12 half-days
More informationconditional cdf, conditional pdf, total probability theorem?
6 Multiple Random Variables 6.0 INTRODUCTION scalar vs. random variable cdf, pdf transformation of a random variable conditional cdf, conditional pdf, total probability theorem expectation of a random
More informationNotes on Mathematics Groups
EPGY Singapore Quantum Mechanics: 2007 Notes on Mathematics Groups A group, G, is defined is a set of elements G and a binary operation on G; one of the elements of G has particularly special properties
More informationChapter 3: Random Variables 1
Chapter 3: Random Variables 1 Yunghsiang S. Han Graduate Institute of Communication Engineering, National Taipei University Taiwan E-mail: yshan@mail.ntpu.edu.tw 1 Modified from the lecture notes by Prof.
More informationFinal Exam # 3. Sta 230: Probability. December 16, 2012
Final Exam # 3 Sta 230: Probability December 16, 2012 This is a closed-book exam so do not refer to your notes, the text, or any other books (please put them on the floor). You may use the extra sheets
More informationTheorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension. n=1
Chapter 2 Probability measures 1. Existence Theorem 2.1 (Caratheodory). A (countably additive) probability measure on a field has an extension to the generated σ-field Proof of Theorem 2.1. Let F 0 be
More informationMATHEMATICS 154, SPRING 2009 PROBABILITY THEORY Outline #11 (Tail-Sum Theorem, Conditional distribution and expectation)
MATHEMATICS 154, SPRING 2009 PROBABILITY THEORY Outline #11 (Tail-Sum Theorem, Conditional distribution and expectation) Last modified: March 7, 2009 Reference: PRP, Sections 3.6 and 3.7. 1. Tail-Sum Theorem
More informationRandom Variables. Random variables. A numerically valued map X of an outcome ω from a sample space Ω to the real line R
In probabilistic models, a random variable is a variable whose possible values are numerical outcomes of a random phenomenon. As a function or a map, it maps from an element (or an outcome) of a sample
More informationCME 106: Review Probability theory
: Probability theory Sven Schmit April 3, 2015 1 Overview In the first half of the course, we covered topics from probability theory. The difference between statistics and probability theory is the following:
More informationRecap of Basic Probability Theory
02407 Stochastic Processes Recap of Basic Probability Theory Uffe Høgsbro Thygesen Informatics and Mathematical Modelling Technical University of Denmark 2800 Kgs. Lyngby Denmark Email: uht@imm.dtu.dk
More informationIf we want to analyze experimental or simulated data we might encounter the following tasks:
Chapter 1 Introduction If we want to analyze experimental or simulated data we might encounter the following tasks: Characterization of the source of the signal and diagnosis Studying dependencies Prediction
More informationExercises with solutions (Set D)
Exercises with solutions Set D. A fair die is rolled at the same time as a fair coin is tossed. Let A be the number on the upper surface of the die and let B describe the outcome of the coin toss, where
More information3. Probability and Statistics
FE661 - Statistical Methods for Financial Engineering 3. Probability and Statistics Jitkomut Songsiri definitions, probability measures conditional expectations correlation and covariance some important
More information17. Convergence of Random Variables
7. Convergence of Random Variables In elementary mathematics courses (such as Calculus) one speaks of the convergence of functions: f n : R R, then lim f n = f if lim f n (x) = f(x) for all x in R. This
More information1. Probability Measure and Integration Theory in a Nutshell
1. Probability Measure and Integration Theory in a Nutshell 1.1. Measurable Space and Measurable Functions Definition 1.1. A measurable space is a tuple (Ω, F) where Ω is a set and F a σ-algebra on Ω,
More information4 Expectation & the Lebesgue Theorems
STA 205: Probability & Measure Theory Robert L. Wolpert 4 Expectation & the Lebesgue Theorems Let X and {X n : n N} be random variables on a probability space (Ω,F,P). If X n (ω) X(ω) for each ω Ω, does
More informationRecap of Basic Probability Theory
02407 Stochastic Processes? Recap of Basic Probability Theory Uffe Høgsbro Thygesen Informatics and Mathematical Modelling Technical University of Denmark 2800 Kgs. Lyngby Denmark Email: uht@imm.dtu.dk
More informationElementary Probability. Exam Number 38119
Elementary Probability Exam Number 38119 2 1. Introduction Consider any experiment whose result is unknown, for example throwing a coin, the daily number of customers in a supermarket or the duration of
More informationBasic Probability. Introduction
Basic Probability Introduction The world is an uncertain place. Making predictions about something as seemingly mundane as tomorrow s weather, for example, is actually quite a difficult task. Even with
More informationProduct measure and Fubini s theorem
Chapter 7 Product measure and Fubini s theorem This is based on [Billingsley, Section 18]. 1. Product spaces Suppose (Ω 1, F 1 ) and (Ω 2, F 2 ) are two probability spaces. In a product space Ω = Ω 1 Ω
More informationWhy study probability? Set theory. ECE 6010 Lecture 1 Introduction; Review of Random Variables
ECE 6010 Lecture 1 Introduction; Review of Random Variables Readings from G&S: Chapter 1. Section 2.1, Section 2.3, Section 2.4, Section 3.1, Section 3.2, Section 3.5, Section 4.1, Section 4.2, Section
More informationMeasure-theoretic probability
Measure-theoretic probability Koltay L. VEGTMAM144B November 28, 2012 (VEGTMAM144B) Measure-theoretic probability November 28, 2012 1 / 27 The probability space De nition The (Ω, A, P) measure space is
More informationWeek 12-13: Discrete Probability
Week 12-13: Discrete Probability November 21, 2018 1 Probability Space There are many problems about chances or possibilities, called probability in mathematics. When we roll two dice there are possible
More informationProbability Review. Chao Lan
Probability Review Chao Lan Let s start with a single random variable Random Experiment A random experiment has three elements 1. sample space Ω: set of all possible outcomes e.g.,ω={1,2,3,4,5,6} 2. event
More informationProbability and Measure
Chapter 4 Probability and Measure 4.1 Introduction In this chapter we will examine probability theory from the measure theoretic perspective. The realisation that measure theory is the foundation of probability
More informationSome basic elements of Probability Theory
Chapter I Some basic elements of Probability Theory 1 Terminology (and elementary observations Probability theory and the material covered in a basic Real Variables course have much in common. However
More informationChapter 7. Basic Probability Theory
Chapter 7. Basic Probability Theory I-Liang Chern October 20, 2016 1 / 49 What s kind of matrices satisfying RIP Random matrices with iid Gaussian entries iid Bernoulli entries (+/ 1) iid subgaussian entries
More informationTom Salisbury
MATH 2030 3.00MW Elementary Probability Course Notes Part V: Independence of Random Variables, Law of Large Numbers, Central Limit Theorem, Poisson distribution Geometric & Exponential distributions Tom
More information3 Integration and Expectation
3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ
More informationProbability and Statistics. Vittoria Silvestri
Probability and Statistics Vittoria Silvestri Statslab, Centre for Mathematical Sciences, University of Cambridge, Wilberforce Road, Cambridge CB3 0WB, UK Contents Preface 5 Chapter 1. Discrete probability
More informationSummary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016
8. For any two events E and F, P (E) = P (E F ) + P (E F c ). Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016 Sample space. A sample space consists of a underlying
More informationHILBERT SPACES AND THE RADON-NIKODYM THEOREM. where the bar in the first equation denotes complex conjugation. In either case, for any x V define
HILBERT SPACES AND THE RADON-NIKODYM THEOREM STEVEN P. LALLEY 1. DEFINITIONS Definition 1. A real inner product space is a real vector space V together with a symmetric, bilinear, positive-definite mapping,
More informationLecture 22: Variance and Covariance
EE5110 : Probability Foundations for Electrical Engineers July-November 2015 Lecture 22: Variance and Covariance Lecturer: Dr. Krishna Jagannathan Scribes: R.Ravi Kiran In this lecture we will introduce
More informationA primer on basic probability and Markov chains
A primer on basic probability and Markov chains David Aristo January 26, 2018 Contents 1 Basic probability 2 1.1 Informal ideas and random variables.................... 2 1.2 Probability spaces...............................
More information1 Probability theory. 2 Random variables and probability theory.
Probability theory Here we summarize some of the probability theory we need. If this is totally unfamiliar to you, you should look at one of the sources given in the readings. In essence, for the major
More informationPROBABILITY: LIMIT THEOREMS II, SPRING HOMEWORK PROBLEMS
PROBABILITY: LIMIT THEOREMS II, SPRING 218. HOMEWORK PROBLEMS PROF. YURI BAKHTIN Instructions. You are allowed to work on solutions in groups, but you are required to write up solutions on your own. Please
More information2.1 Elementary probability; random sampling
Chapter 2 Probability Theory Chapter 2 outlines the probability theory necessary to understand this text. It is meant as a refresher for students who need review and as a reference for concepts and theorems
More informationSome Basic Concepts of Probability and Information Theory: Pt. 2
Some Basic Concepts of Probability and Information Theory: Pt. 2 PHYS 476Q - Southern Illinois University January 22, 2018 PHYS 476Q - Southern Illinois University Some Basic Concepts of Probability and
More informationSTAT2201. Analysis of Engineering & Scientific Data. Unit 3
STAT2201 Analysis of Engineering & Scientific Data Unit 3 Slava Vaisman The University of Queensland School of Mathematics and Physics What we learned in Unit 2 (1) We defined a sample space of a random
More informationPreliminaries. Probability space
Preliminaries This section revises some parts of Core A Probability, which are essential for this course, and lists some other mathematical facts to be used (without proof) in the following. Probability
More informationProbability reminders
CS246 Winter 204 Mining Massive Data Sets Probability reminders Sammy El Ghazzal selghazz@stanfordedu Disclaimer These notes may contain typos, mistakes or confusing points Please contact the author so
More informationFE 5204 Stochastic Differential Equations
Instructor: Jim Zhu e-mail:zhu@wmich.edu http://homepages.wmich.edu/ zhu/ January 20, 2009 Preliminaries for dealing with continuous random processes. Brownian motions. Our main reference for this lecture
More informationDept. of Linguistics, Indiana University Fall 2015
L645 Dept. of Linguistics, Indiana University Fall 2015 1 / 34 To start out the course, we need to know something about statistics and This is only an introduction; for a fuller understanding, you would
More informationChp 4. Expectation and Variance
Chp 4. Expectation and Variance 1 Expectation In this chapter, we will introduce two objectives to directly reflect the properties of a random variable or vector, which are the Expectation and Variance.
More information