ACM 116: Lectures 3-4

Joint distributions
The multivariate normal distribution
Conditional distributions
Independent random variables
Conditional distributions and Monte Carlo: rejection sampling
Variance of a random variable
Covariance
Conditional expectation
Joint distributions

Two random variables X and Y; we are interested in their joint outcome: $(X, Y) = (x, y)$ means $X = x$ and $Y = y$.

Example: a server on the web.
X: # customers hitting your server in the next hour
Y: # customers in the next 10 minutes, or # customers who will purchase an item in the next hour
Joint frequency function

X, Y discrete r.v.'s taking on values $x_1, x_2, \dots$ and $y_1, y_2, \dots$ respectively.

Joint frequency (distribution) function:
\[ p(x_i, y_j) = P(X = x_i, Y = y_j) \]

Marginal probabilities:
\[ P(X = x_i) = \sum_j P(X = x_i, Y = y_j) \]
Why? The addition rule: the events $\{Y = y_j\}$ partition the sample space.
Joint frequency function

Several random variables: similar story. $X_1, \dots, X_m$ r.v.'s defined on the same sample space:
\[ p(x_1, \dots, x_m) = P(X_1 = x_1, \dots, X_m = x_m) \]
\[ p_{X_1}(x_1) = \sum_{x_2, \dots, x_m} p(x_1, x_2, \dots, x_m), \qquad p_{X_1, X_2}(x_1, x_2) = \sum_{x_3, \dots, x_m} p(x_1, x_2, x_3, \dots, x_m) \]
Example: the multinomial distribution

n fixed independent experiments, each resulting in one of r possible outcomes with probabilities $p_1, \dots, p_r$. Let $X_i$, $1 \le i \le r$, be the number of outcomes of type i. Joint distribution of the $X_i$'s?
\[ P(X_1 = x_1, \dots, X_r = x_r) = \frac{n!}{x_1! \cdots x_r!} \, p_1^{x_1} \cdots p_r^{x_r} \]
Why?
Example: multinomial distribution

Any particular sequence of experiment outcomes with $x_1$ results of type 1, ..., $x_r$ results of type r has probability $p_1^{x_1} \cdots p_r^{x_r}$ (all such sequences are equally likely). Indeed:

First $x_1$ of type 1: $p_1^{x_1}$; next $x_2$ of type 2: $p_2^{x_2}$; ...; last $x_r$ of type r: $p_r^{x_r}$.

How many configurations?
\[ \binom{n}{x_1} \binom{n - x_1}{x_2} \cdots \binom{x_r}{x_r} = \frac{n!}{x_1! \cdots x_r!} \]
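As a sanity check, here is a small simulation sketch (not from the original slides; the numbers n, p, and x are illustrative) that estimates the probability of one configuration by running the n trials directly and compares it with the formula:

% Sketch: empirical check of the multinomial formula.
n = 10; p = [0.2 0.3 0.5]; x = [2 3 5];         % target configuration
trials = 1e5; hits = 0;
edges = [0 cumsum(p)];
for s = 1:trials
    u = rand(1, n);
    counts = histc(u, edges);                    % classify each trial by type
    if isequal(counts(1:3), x), hits = hits + 1; end
end
formula = factorial(n)/prod(factorial(x)) * prod(p.^x);
[hits/trials  formula]                           % the two numbers should be close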
Joint distributions: the continuous case

X and Y continuous r.v.'s. Joint density function $f(x, y)$ such that
\[ P((X, Y) \in A) = \iint_A f(x, y) \, dx \, dy \]
for any reasonable set A. [Figure: $P((X, Y) \in A)$ is the volume under the density surface above A.]
Joint distributions: the continuous case

Interpretation:
\[ P(x \le X \le x + dx, \; y \le Y \le y + dy) \approx f(x, y) \, dx \, dy \]

Marginal density of X:
\[ f_X(x) = \int f_{XY}(x, y) \, dy \]
Joint distributions: the continuous case

Example: a point chosen uniformly in the disk of radius 1.
\[ f(x, y) = \begin{cases} \frac{1}{\pi} & \text{if } x^2 + y^2 \le 1 \\ 0 & \text{otherwise} \end{cases} \]
R: distance from the point to the origin. Density of R?
Joint distributions: the continuous case

R: distance from the point to the origin. Density of R?
\[ P(R \le r) = \frac{\pi r^2}{\pi} = r^2, \qquad f_R(r) = \begin{cases} 2r & \text{if } 0 \le r \le 1 \\ 0 & \text{otherwise} \end{cases} \]

X: first coordinate. Density of X?
\[ f_X(x) = \int f_{XY}(x, y) \, dy = \int_{-\sqrt{1 - x^2}}^{\sqrt{1 - x^2}} \frac{1}{\pi} \, dy = \frac{2}{\pi} \sqrt{1 - x^2} \]
\[ f_X(x) = \begin{cases} \frac{2}{\pi} \sqrt{1 - x^2} & \text{if } -1 \le x \le 1 \\ 0 & \text{otherwise} \end{cases} \]
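A minimal simulation sketch (an addition, not from the lecture) that checks $f_R(r) = 2r$ by drawing uniform points in the disk:

% Sketch: sample uniform points in the unit disk by rejection from the square.
N = 1e5;
xy = 2*rand(3*N, 2) - 1;                % candidates in [-1,1]^2
xy = xy(sum(xy.^2, 2) <= 1, :);         % keep the ~78.5% landing in the disk
R = sqrt(sum(xy.^2, 2));
hist(R, 50)                             % bar heights grow linearly: f_R(r) = 2r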
Bivariate normal density

5 parameters: $-\infty < \mu_x, \mu_y < \infty$, $0 < \sigma_x, \sigma_y < \infty$, $-1 \le \rho \le 1$.

Extensively used in modeling. Galton, late 19th century: modeling of the heights of fathers and sons, e.g. X = father's height, Y = son's height.

Density: with $\mathbf{x} = (x, y)$, $\boldsymbol{\mu} = (\mu_x, \mu_y)$ the vector of means, and covariance matrix
\[ \Sigma = \begin{pmatrix} \sigma_x^2 & \rho \sigma_x \sigma_y \\ \rho \sigma_x \sigma_y & \sigma_y^2 \end{pmatrix}, \]
\[ f_{XY}(x, y) = \frac{1}{2\pi \sqrt{\det(\Sigma)}} \, e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})} \]
Bivariate normal density

Level sets = ellipses.

Interesting calculation: the marginals are normal,
\[ X \sim N(\mu_x, \sigma_x^2), \qquad Y \sim N(\mu_y, \sigma_y^2). \]
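A short sketch (an addition, with illustrative parameter values) for generating bivariate normal samples from the 5 parameters via a Cholesky factor of $\Sigma$; the elliptical point cloud makes the level sets visible:

% Sketch: sample the bivariate normal via a Cholesky factor of Sigma.
mux = 0; muy = 0; sx = 1; sy = 2; rho = 0.7;
Sigma = [sx^2 rho*sx*sy; rho*sx*sy sy^2];
L = chol(Sigma, 'lower');                    % Sigma = L*L'
Z = randn(2, 10000);                         % independent standard normals
XY = repmat([mux; muy], 1, 10000) + L*Z;     % each column is one (X, Y) sample
plot(XY(1,:), XY(2,:), '.')                  % elliptical cloud; cov(XY') ~ Sigma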
Independent random variables

Two random variables are independent if
\[ P(X \in A, Y \in B) = P(X \in A) P(Y \in B) \]
for any subsets A and B.

Discrete r.v.'s: independence iff $P(X = x, Y = y) = P(X = x) P(Y = y)$ for all x, y.

Continuous r.v.'s: independence iff $f_{XY}(x, y) = f_X(x) f_Y(y)$. (Checking the general definition directly would be complicated; factoring the joint density is the practical criterion.)
Example

A node of a communication network. If two packets of information arrive within time τ of each other, they collide and then have to be retransmitted. The times of arrival are independent U[0, T]. What is the probability that they collide?
Independent random variables

$T_1, T_2$ independent U[0, T]. Joint distribution:
\[ f(t_1, t_2) = \frac{1}{T^2}, \]
the uniform distribution over the square $[0, T]^2$. A collision means $|T_1 - T_2| \le \tau$; the complementary no-collision region consists of two triangles of total area $(T - \tau)^2$, so
\[ P(\text{Collision}) = 1 - \left(1 - \frac{\tau}{T}\right)^2 = 2\frac{\tau}{T} - \left(\frac{\tau}{T}\right)^2 = \frac{\tau}{T}\left(2 - \frac{\tau}{T}\right) \]
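A short Monte Carlo check of this probability (an added sketch, with illustrative values T = 1, τ = 0.1):

% Sketch: empirical collision probability vs the exact formula.
T = 1; tau = 0.1; N = 1e6;
t1 = T*rand(N, 1); t2 = T*rand(N, 1);
[mean(abs(t1 - t2) <= tau)  (tau/T)*(2 - tau/T)]   % both near 0.19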
Conditional distributions

Discrete case: X and Y jointly discrete r.v.'s. The conditional probability of $X = x_i$ given $Y = y_j$ is
\[ P(X = x_i \mid Y = y_j) = \frac{P(X = x_i, Y = y_j)}{P(Y = y_j)} \]
Notation: $p_{X|Y}$.

Continuous case: X and Y jointly continuous r.v.'s. The conditional density of Y given X is defined to be
\[ f_{Y|X}(y \mid x) = \frac{f_{XY}(x, y)}{f_X(x)} \]
NB:
\[ P(y \le Y \le y + dy \mid x \le X \le x + dx) \approx \frac{f_{XY}(x, y) \, dx \, dy}{f_X(x) \, dx} = \frac{f_{XY}(x, y) \, dy}{f_X(x)} \]
Conditional distributions

Independence: $f_{Y|X} = f_Y$.

Joint density = conditional × marginal:
\[ f_{XY}(x, y) = f_{Y|X}(y \mid x) f_X(x) \]

Law of total probability:
\[ f_Y(y) = \int f_{Y|X}(y \mid x) f_X(x) \, dx \]

NB: $p_{X|Y}(x \mid Y = y)$ is a distribution: $p_{X|Y}(x \mid Y = y) \ge 0$ and $\sum_x p_{X|Y}(x \mid Y = y) = 1$, i.e. a genuine (conditional) distribution.
Example

An imperfect particle detector detects each incoming particle only with probability p. N is the true number of particles, with a Poisson distribution:
\[ P(N = n) = e^{-\lambda} \frac{\lambda^n}{n!} \]
X: detected number of particles. What is the distribution of the detected number of particles?
Conditional distributions

$X \mid N = n \sim \text{Binomial}(n, p)$.
\[ P(X = k) = \sum_{n=0}^{\infty} P(N = n) P(X = k \mid N = n) = \sum_{n \ge k} e^{-\lambda} \frac{\lambda^n}{n!} \binom{n}{k} p^k (1 - p)^{n - k} \]
\[ = \frac{(\lambda p)^k}{k!} e^{-\lambda} \sum_{n \ge k} \frac{(\lambda (1 - p))^{n - k}}{(n - k)!} = \frac{(\lambda p)^k}{k!} e^{-\lambda} e^{\lambda (1 - p)} = e^{-\lambda p} \frac{(\lambda p)^k}{k!} \]
So $X \sim \text{Poi}(\lambda p)$.
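A simulation sketch of this "thinning" result (an addition; assumes the Statistics Toolbox functions poissrnd and binornd are available):

% Sketch: detected counts from a thinned Poisson are Poisson(lambda*p).
lambda = 8; p = 0.3;
N = poissrnd(lambda, 1e5, 1);           % true particle counts
X = binornd(N, p);                      % each particle detected w.p. p
[mean(X)  var(X)  lambda*p]             % all three should be near 2.4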
Conditional distributions

Example: bivariate normal density. If X and Y have a bivariate normal distribution, the distribution of $Y \mid X = x$ is normal too:
\[ Y \mid X = x \sim N\left(\mu_y + \rho \frac{\sigma_y}{\sigma_x}(x - \mu_x), \; (1 - \rho^2) \sigma_y^2\right) \]
Linear regression!
Sampling & Monte Carlo

A density function f we wish to sample from. f is nonzero on an interval [a, b] and zero outside the interval (a and b may be infinite). Let M be a function such that $M(x) \ge f(x)$ on [a, b] and let
\[ m(x) = \frac{M(x)}{\int_a^b M(x) \, dx} \]
The idea is to choose M so that it is easy to generate random variables from m. If [a, b] is finite, m can be chosen to be the uniform distribution.
Rejection sampling algorithm

Step 1: Generate T with the density m.
Step 2: Generate U, uniform on [0, 1] and independent of T. If $M(T) U \le f(T)$, accept: X = T. Otherwise reject and go to Step 1.

[Figure: on [a, b], the point $(T, M(T)U)$ is accepted if it falls below the graph of f, rejected if it falls between f and M.]

Remark. The algorithm has high efficiency if it accepts with high probability, i.e. if M is close to f.
Why does this work?

To show: $P(X \in A) = \int_A f(t) \, dt$.
\[ P(X \in A) = P(T \in A \mid \text{Accept}) = \frac{P(T \in A \text{ and Accept})}{P(\text{Accept})} \]
Condition on T = t (write $I = \int_a^b M(t) \, dt$, so $m(t) = M(t)/I$):
\[ P(T \in A \text{ and Accept}) = \int_a^b P(T \in A \text{ and Accept} \mid T = t) \, m(t) \, dt = \int_A P(U \le f(t)/M(t)) \, m(t) \, dt \]
\[ = \int_A \frac{f(t)}{M(t)} m(t) \, dt = \frac{1}{I} \int_A f(t) \, dt \]
Similarly,
\[ P(\text{Accept}) = \int_a^b P(\text{Accept} \mid T = t) \, m(t) \, dt = \int_a^b \frac{f(t)}{M(t)} m(t) \, dt = \frac{1}{I}. \]
Dividing the two gives $P(X \in A) = \int_A f(t) \, dt$.
Example

Suppose we want to sample from a density whose graph is shown below.

[Figure: graph of the density f(x) on [0, 1].]

In this case we let M be the maximum of f over the interval [0, 1], a constant function:
\[ M(x) = \max_{0 \le x \le 1} f(x), \]
so that m is the uniform density over the interval [0, 1].
Implementation

N = 1000; t = (1:N)/N;
M = max(f(t));                  % envelope constant: max of f on a grid of [0,1]
n = 2000;
x = rejection_sampling(n, @f, M);
This routine samples n iid draws from the density fun:

function x = rejection_sampling(n, fun, M)
x = [];
for k = 1:n
    OK = 0;
    while ~OK
        T = rand(1);                    % Step 1: proposal T ~ U[0,1]
        U = rand(1);                    % Step 2: uniform for the accept test
        if M*U <= fun(T)                % accept with probability fun(T)/M
            OK = 1;
            x = [x T];
        end
    end
end
Visualization of the results

[Figure 1: Histogram of the sampled data (frequency vs x on [0, 1]), sample size = 5000.]
Variance and Standard Deviation

The expected value can be viewed as an indication of the central value of the density or frequency function. The variance or standard deviation gives an indication of the dispersion of the probability distribution.

The variance of a random variable X is the mean square deviation from $E[X] = \mu$:
\[ \text{Var}(X) = E[(X - \mu)^2] = E[(X - E[X])^2]. \]
The standard deviation is the square root of the variance: $\text{SD}(X) = \sqrt{\text{Var}(X)}$.

We often write $\mu = E[X]$ and $\sigma^2 = \text{Var}(X)$, so that $\sigma = \text{SD}(X)$.

We may also calculate the variance as follows:
\[ \text{Var}(X) = E[X^2] - (E[X])^2. \]
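The shortcut formula follows in one line by expanding the square and using linearity of expectation:
\[ \text{Var}(X) = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - \mu^2 = E[X^2] - (E[X])^2. \]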
Examples

Bernoulli: $X \sim \text{Ber}(p)$:
\[ \text{Var}(X) = p(1 - p) \]

Normal: $X \sim N(\mu, \sigma^2)$:
\[ \text{Var}(X) = \frac{1}{\sqrt{2\pi}\,\sigma} \int (x - \mu)^2 e^{-(x - \mu)^2/(2\sigma^2)} \, dx = \frac{\sigma^2}{\sqrt{2\pi}} \int u^2 e^{-u^2/2} \, du = \sigma^2 \]
(substituting $u = (x - \mu)/\sigma$).

Remark: $Y = a + bX \Rightarrow \text{Var}(Y) = b^2 \text{Var}(X)$.

Example: X standard normal ($\mu = 0$, $\sigma = 1$), $Y = \mu + \sigma X \sim N(\mu, \sigma^2)$. Then $\text{Var}(Y) = \sigma^2 \text{Var}(X) = \sigma^2$.
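The remaining Gaussian integral, a step the slide leaves implicit, can be evaluated by integration by parts:
\[ \int_{-\infty}^{\infty} u^2 e^{-u^2/2} \, du = \left[ -u \, e^{-u^2/2} \right]_{-\infty}^{\infty} + \int_{-\infty}^{\infty} e^{-u^2/2} \, du = 0 + \sqrt{2\pi}, \]
so $\text{Var}(X) = \frac{\sigma^2}{\sqrt{2\pi}} \cdot \sqrt{2\pi} = \sigma^2$.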
Covariance and correlation

The variance is a measure of the variability of a r.v. The covariance is a measure of the joint variability of two r.v.'s.

The covariance of two jointly distributed r.v.'s (X, Y) is given by
\[ \text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[(X - \mu_X)(Y - \mu_Y)]. \]
Another formula:
\[ \text{Cov}(X, Y) = E[XY] - E[X]E[Y]. \]
Interpretation

Positive covariance: when X is larger than its mean, Y tends to be larger than its mean as well; when X is smaller than its mean, Y tends to be smaller than its mean as well.

Negative covariance: when X is larger than its mean, Y tends to be smaller than its mean; when X is smaller than its mean, Y tends to be larger than its mean.
Properties

1. $\text{Cov}(X, X) = \text{Var}(X)$.

2. X and Y two r.v.'s:
\[ \text{Var}(X + Y) = \text{Var}(X) + 2\,\text{Cov}(X, Y) + \text{Var}(Y) \]

3. If X and Y are independent, then $\text{Cov}(X, Y) = 0$ $(= E[XY] - E[X]E[Y])$.

4. Let $X_1, X_2, \dots, X_m$ be m r.v.'s and $Y_1, Y_2, \dots, Y_n$ be n r.v.'s, and let $S_1 = \sum_{i=1}^m X_i$ and $S_2 = \sum_{j=1}^n Y_j$. Then
\[ \text{Var}(S_1) = \text{Var}\Big(\sum_{i=1}^m X_i\Big) = \sum_{i=1}^m \text{Var}(X_i) + 2 \sum_{i < j} \text{Cov}(X_i, X_j), \]
and
\[ \text{Cov}(S_1, S_2) = \sum_{i=1}^m \sum_{j=1}^n \text{Cov}(X_i, Y_j). \]
5. Suppose $X_1, X_2, \dots, X_m$ are independent:
\[ \text{Var}(X_1 + \dots + X_m) = \text{Var}(X_1) + \dots + \text{Var}(X_m). \]
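Property 2 above is just the expansion of a square under the expectation (property 4 iterates the same step):
\[ \text{Var}(X + Y) = E[(X - \mu_X + Y - \mu_Y)^2] = E[(X - \mu_X)^2] + 2E[(X - \mu_X)(Y - \mu_Y)] + E[(Y - \mu_Y)^2], \]
which is $\text{Var}(X) + 2\,\text{Cov}(X, Y) + \text{Var}(Y)$.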
Example

The variance of the binomial: $X \sim \text{Bin}(n, p)$. X is the sum of n independent Bernoullis, $X = I_1 + \dots + I_n$, and therefore
\[ \text{Var}(X) = \sum_{i=1}^n \text{Var}(I_i) = \sum_{i=1}^n p(1 - p) = np(1 - p). \]
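A quick numerical check (an addition), generating binomial draws exactly as sums of Bernoullis:

% Sketch: empirical variance of Bin(n, p) vs np(1-p).
n = 50; p = 0.3;
X = sum(rand(n, 1e5) < p, 1);           % 1e5 binomial draws, each a sum of n Bernoullis
[var(X)  n*p*(1-p)]                     % both near 10.5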
Correlation

The correlation between two r.v.'s X and Y is defined by
\[ \rho = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}}. \]

The correlation is a dimensionless quantity. The correlation ρ between X and Y obeys $-1 \le \rho \le 1$.

The correlation is a measure of the strength of the linear relationship between X and Y: $\rho = \pm 1$ iff $Y = aX + b$ for some constants a and b.

Remark: (X, Y) bivariate normal: the correlation between X and Y is ρ, and $\text{Cov}(X, Y) = \rho \sigma_X \sigma_Y$.
Conditional expectation

The conditional expectation of Y given X = x is the mean of the conditional distribution of Y given X = x. For example, in the discrete case we have
\[ E[Y \mid X = x] = \sum_y y \, p_{Y|X}(y \mid x) \]
and more generally the conditional expectation of h(Y) given X = x is
\[ E[h(Y) \mid X = x] = \sum_y h(y) \, p_{Y|X}(y \mid x). \]
Example

$X \sim \text{Binomial}(n, p)$. Set $m \le n$ and let Y be the number of successes in the first m trials. What are the conditional distribution and mean of Y given X = x?

Conditional distribution (hypergeometric):
\[ P(Y = y \mid X = x) = \frac{\binom{m}{y}\binom{n - m}{x - y}}{\binom{n}{x}}. \]

Conditional mean: $Y = I_1 + \dots + I_m$ and
\[ E(Y \mid X = x) = E(I_1 \mid X = x) + \dots + E(I_m \mid X = x) = P(I_1 = 1 \mid X = x) + \dots + P(I_m = 1 \mid X = x) = \frac{x}{n} + \dots + \frac{x}{n} = \frac{m}{n} x, \]
since by symmetry each of the n trials is equally likely to hold any of the x successes.
Conditional expectation as a random variable

Assuming that the conditional expectation of Y given X = x exists for every x, it is a well-defined function of X and, hence, a random variable. (The expectation of Y given X = x is a function of x: $E[Y \mid X = x] = g(x)$; then $E[Y \mid X] = g(X)$ is a r.v.)

Example: $E(Y \mid X) = \frac{m}{n} X$.

This random variable has an expectation and a variance (provided the relevant sums or integrals converge absolutely). Its expectation is $E[E(Y \mid X)]$.
Iterated Expectation Theorem

\[ E(Y) = E[E(Y \mid X)] \]

Interpretation: the expectation of Y can be calculated by first conditioning on X, finding $E(Y \mid X)$, and then averaging this quantity over X:
\[ E[Y] = \sum_x E[Y \mid X = x] \, p_X(x), \qquad \text{where } E[Y \mid X = x] = \sum_y y \, p_{Y|X}(y \mid x). \]

Example: $E(Y \mid X) = \frac{m}{n} X$ and $E(X) = np$ give
\[ E(Y) = \frac{m}{n} E(X) = mp. \]
Proof

We need to show $E(Y) = \sum_x E(Y \mid X = x) P(X = x)$, where $E(Y \mid X = x) = \sum_y y \, p_{Y|X}(y \mid x)$. We have
\[ \sum_x E(Y \mid X = x) \, p_X(x) = \sum_x \sum_y y \, p_{Y|X}(y \mid x) \, p_X(x) = \sum_y y \sum_x p_{Y|X}(y \mid x) \, p_X(x) = \sum_y y \, p_Y(y) = E(Y). \]
Random Sums

Consider sums of the type
\[ T = \sum_{i=1}^N X_i, \]
where N is a r.v. with finite expectation $E(N) < \infty$, and the $X_i$'s are independent of N, with common mean $E[X_i] = E[X]$ for all i.

Such sums arise in a variety of applications:

An insurance company might receive N claims in a given period of time, and the amounts of the individual claims may be modeled as r.v.'s $X_1, X_2, \dots$

N is the number of jobs in a single-server queue and $X_i$ the service time for the i-th job; T is the time to serve all the jobs in the queue.
\[ E(T) = \sum_{\text{all } n} E(T \mid N = n) P(N = n) \]
and
\[ E(T \mid N = n) = E\Big(\sum_{i=1}^n X_i\Big) = n E(X), \]
i.e.
\[ E(T) = \sum_n n E(X) P(N = n) = E(N) E(X). \]
In other words, the average time to complete all the jobs when their number N is random is the average number of jobs times the average time to complete a job.
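A simulation sketch of $E(T) = E(N)E(X)$ (an addition; assumes poissrnd from the Statistics Toolbox, with exponential service times of mean 2):

% Sketch: random sum with N ~ Poi(5) jobs and Exp(mean 2) service times.
lambda = 5; mu = 2; reps = 1e5;
T = zeros(reps, 1);
for k = 1:reps
    n = poissrnd(lambda);
    T(k) = sum(-mu*log(rand(n, 1)));    % sum of n iid exponentials (mean mu)
end
[mean(T)  lambda*mu]                    % both near E(N)E(X) = 10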
The Moment Generating Function

The moment generating function (mgf) of a random variable X is given by
\[ M(t) = E[e^{tX}] \]
(In the continuous case, $M(t) = \int e^{tx} f(x) \, dx$, essentially the Laplace transform of f.)

The mgf may not exist: if X is Cauchy distributed ($f(x) = (\pi(1 + x^2))^{-1}$), the mgf does not exist for any $t \ne 0$. The mgf of a normal random variable exists for all t.

Important property: if the mgf exists for t in an open interval containing zero, it uniquely determines the probability distribution.
Why the Name MGF?

\[ e^{tX} = 1 + tX + \frac{t^2 X^2}{2!} + \dots + \frac{t^r X^r}{r!} + \dots \]
\[ E(e^{tX}) = 1 + t E(X) + \frac{t^2 E(X^2)}{2!} + \dots + \frac{t^r E(X^r)}{r!} + \dots \]

Suppose that the mgf exists in an open interval containing 0. Then
\[ M(0) = 1, \quad M'(0) = E(X), \quad \dots, \quad M^{(r)}(0) = E(X^r). \]
This may be useful for computing all the moments of a distribution.
Example

Mgf of the Poisson distribution:
\[ M(t) = \sum_{k \ge 0} e^{tk} e^{-\lambda} \frac{\lambda^k}{k!} = e^{-\lambda} \sum_{k \ge 0} \frac{(e^t \lambda)^k}{k!} = e^{-\lambda} e^{\lambda e^t}. \]
This gives
\[ M'(t) = \lambda e^t M(t), \qquad M''(t) = \lambda e^t M(t) + \lambda^2 e^{2t} M(t). \]
It follows that if X is Poisson,
\[ E(X) = \lambda, \quad E(X^2) = \lambda^2 + \lambda, \quad \text{Var}(X) = \lambda. \]
Mgf of Sums of Independent Random Variables

Very useful property: if X and Y are independent r.v.'s with mgf's $M_X$ and $M_Y$, and $S = X + Y$, then
\[ M_S(t) = M_X(t) M_Y(t) \]
on the common interval where both mgf's exist.

Why? By independence,
\[ M_S(t) = E(e^{t(X + Y)}) = E(e^{tX} e^{tY}) = E(e^{tX}) E(e^{tY}) = M_X(t) M_Y(t). \]
This extends in the obvious way to sums of more than two independent random variables.
Example

The sum of two independent Poisson r.v.'s is a Poisson r.v.: if $X \sim \text{Poi}(\lambda)$ and $Y \sim \text{Poi}(\mu)$, then $X + Y \sim \text{Poi}(\lambda + \mu)$.
\[ M_{X+Y}(t) = e^{-\lambda} e^{\lambda e^t} \cdot e^{-\mu} e^{\mu e^t} = e^{-(\lambda + \mu)} e^{(\lambda + \mu) e^t}, \]
but this is the mgf of a Poisson r.v. with parameter $\lambda + \mu$.
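A last numerical check (an addition; assumes poissrnd from the Statistics Toolbox):

% Sketch: the sum of independent Poissons behaves like Poi(lambda + mu).
X = poissrnd(3, 1e5, 1); Y = poissrnd(4, 1e5, 1);
[mean(X + Y)  var(X + Y)]               % both near lambda + mu = 7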