Lecture 25: Review
Statistics 104
Colin Rundel
April 23, 2012
Joint CDF

$$F(x, y) = P[X \le x, Y \le y] = P[(X, Y) \text{ lies south-west of the point } (x, y)]$$

[Figure: the point $(x, y)$ in the $X$-$Y$ plane, with the region to its south-west shaded.]
Joint CDF, cont.

The joint cumulative distribution function follows the same rules as the univariate CDF.

Univariate definition:
- $F(x) = P(X \le x) = \int_{-\infty}^{x} f(z) \, dz$
- $\lim_{x \to -\infty} F(x) = 0$
- $\lim_{x \to \infty} F(x) = 1$
- $x \le y \implies F(x) \le F(y)$

Bivariate definition:
- $F(x, y) = P(X \le x, Y \le y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f(x, y) \, dx \, dy$
- $\lim_{x, y \to -\infty} F(x, y) = 0$
- $\lim_{x, y \to \infty} F(x, y) = 1$
- $x \le x', \; y \le y' \implies F(x, y) \le F(x', y')$
Marginal Distributions

We can define marginal distributions based on the CDF by setting one of the values to infinity:

$$F(x, \infty) = P(X \le x, Y \le \infty) = \int_{-\infty}^{x} \int_{-\infty}^{\infty} f(x, y) \, dy \, dx = P(X \le x) = F_X(x)$$

$$F(\infty, y) = P(X \le \infty, Y \le y) = \int_{-\infty}^{y} \int_{-\infty}^{\infty} f(x, y) \, dx \, dy = P(Y \le y) = F_Y(y)$$
Joint pdf

Similar to the CDF, the probability density function follows the same general rules, except in two dimensions.

Univariate definition:
- $f(x) \ge 0$ for all $x$
- $f(x) = \frac{d}{dx} F(x)$
- $\int_{-\infty}^{\infty} f(x) \, dx = 1$

Bivariate definition:
- $f(x, y) \ge 0$ for all $(x, y)$
- $f(x, y) = \frac{\partial^2}{\partial x \, \partial y} F(x, y)$
- $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y) \, dx \, dy = 1$
Marginal pdfs

Marginal probability density functions are defined by integrating out one of the random variables:

$$f_X(x) = \int_{-\infty}^{\infty} f(x, y) \, dy \qquad f_Y(y) = \int_{-\infty}^{\infty} f(x, y) \, dx$$

Previously we related independence to expectations: if $X$ and $Y$ are independent then $E(XY) = E(X)E(Y)$. In the joint density setting, independence has the characterization

$$f(x, y) = f_X(x) f_Y(y) \text{ for all } x, y \iff X \text{ and } Y \text{ are independent.}$$
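A minimal sketch (not from the lecture) of recovering a marginal pdf by numerically integrating out one variable. The joint density $f(x, y) = x + y$ on the unit square is a hypothetical textbook example chosen because its marginal, $f_X(x) = x + 1/2$, is known; SciPy is assumed to be available.

```python
# Recover a marginal pdf by integrating out y, then check independence.
from scipy.integrate import quad

def f(x, y):
    return x + y  # joint density on [0, 1] x [0, 1]

def f_X(x):
    # f_X(x) = integral over y of f(x, y); analytically x + 1/2 here
    val, _ = quad(lambda y: f(x, y), 0, 1)
    return val

print(f_X(0.3), 0.3 + 0.5)  # both ~0.8

# Independence fails: f(x, y) != f_X(x) * f_Y(y).
# By symmetry f_Y has the same form as f_X.
print(f(0.3, 0.7), f_X(0.3) * f_X(0.7))  # 1.0 vs 0.96
```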
Probability and Expectation

Univariate definition:

$$P(X \in A) = \int_A f(x) \, dx \qquad E[g(X)] = \int_{-\infty}^{\infty} g(x) f(x) \, dx$$

Bivariate definition:

$$P(X \in A, Y \in B) = \int_B \int_A f(x, y) \, dx \, dy \qquad E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) f(x, y) \, dx \, dy$$
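A sketch of estimating a bivariate expectation $E[g(X, Y)]$ by Monte Carlo rather than the double integral. The choice $g(x, y) = x^2 + xy$ with independent unit normals is a hypothetical example whose answer, $E[X^2] + E[X]E[Y] = 1$, is known in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.standard_normal(n)   # X ~ N(0, 1)
y = rng.standard_normal(n)   # Y ~ N(0, 1), independent of X

g = x**2 + x * y             # E[g] = E[X^2] + E[X]E[Y] = 1 + 0 = 1
print(g.mean())              # ~1.0
```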
Bivariate Unit Normal

If $X$ and $Y$ have independent unit normal distributions then their joint distribution $f(x, y)$ is given by

$$f(x, y) = \phi(x)\phi(y) = \frac{1}{2\pi} e^{-\frac{1}{2}(x^2 + y^2)}$$
Rayleigh Distribution

If $X, Y \sim \mathcal{N}(0, 1)$ are independent, the distance from the origin of the point $(X, Y)$, $R = \sqrt{X^2 + Y^2}$, has the Rayleigh distribution:

$$f_R(r) = r e^{-\frac{1}{2} r^2}, \quad r \ge 0$$

$$F_R(r) = \int_0^r s \, e^{-\frac{1}{2} s^2} \, ds = 1 - e^{-\frac{1}{2} r^2}$$

$$E(R) = \sqrt{\frac{\pi}{2}} \qquad Var(R) = \frac{4 - \pi}{2}$$
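A quick simulation check of these results (not part of the lecture): draw independent unit normals, form the distances, and compare the sample moments with $\sqrt{\pi/2}$ and $(4 - \pi)/2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x, y = rng.standard_normal(n), rng.standard_normal(n)
r = np.hypot(x, y)                       # distances from the origin

print(r.mean(), np.sqrt(np.pi / 2))      # both ~1.2533
print(r.var(), (4 - np.pi) / 2)          # both ~0.4292
```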
Sum of Independent Normals

Let $X \sim \mathcal{N}(\mu, \sigma^2)$ and $Y \sim \mathcal{N}(\lambda, \tau^2)$ be independent, then

$$M_X(t) = \exp(\mu t + \tfrac{1}{2} \sigma^2 t^2) \qquad M_Y(t) = \exp(\lambda t + \tfrac{1}{2} \tau^2 t^2)$$

$$M_{X+Y}(t) = M_X(t) M_Y(t) = \exp(\mu t + \tfrac{1}{2} \sigma^2 t^2) \exp(\lambda t + \tfrac{1}{2} \tau^2 t^2) = \exp((\mu + \lambda) t + \tfrac{1}{2} (\sigma^2 + \tau^2) t^2)$$

Therefore $X + Y \sim \mathcal{N}(\mu + \lambda, \sigma^2 + \tau^2)$, which can be generalized to the sum of any number of independent normals.
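A simulation consistent with the MGF argument: the means and variances of independent normals add. The parameter values below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.0, 2.0      # X ~ N(1, 4)
lam, tau = -0.5, 1.5      # Y ~ N(-0.5, 2.25), independent of X

n = 1_000_000
s = rng.normal(mu, sigma, n) + rng.normal(lam, tau, n)
print(s.mean(), mu + lam)              # ~0.5
print(s.var(), sigma**2 + tau**2)      # ~6.25
```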
Operation Distributions

If $X, Y$ have a joint density $f(x, y)$, then $Z = g(X, Y)$ has a density $f_{g(X,Y)}(z)$ which is found by a change of variables to $Z$ and integrating out the remaining variable.

Distribution of sums, $Z = X + Y$:

$$f_{X+Y}(z) = \int_{-\infty}^{\infty} f(x, z - x) \, dx$$

Distribution of ratios, $Z = Y / X$:

$$f_{Y/X}(z) = \int_{-\infty}^{\infty} |x| \, f(x, zx) \, dx$$

Distribution of products, $Z = XY$:

$$f_{XY}(z) = \int_{-\infty}^{\infty} f(x, z/x) \, / \, |x| \, dx$$
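A sketch of the sum formula in action, using two independent Exp(1) variables as a hypothetical example because the answer is known: $X + Y \sim \text{Gamma}(2, 1)$, with density $f(z) = z e^{-z}$.

```python
import numpy as np
from scipy.integrate import quad

def f_exp(x):
    return np.exp(-x) if x >= 0 else 0.0   # Exp(1) density

def f_joint(x, y):
    return f_exp(x) * f_exp(y)             # independence: joint = product

def f_sum(z):
    # f_{X+Y}(z) = integral of f(x, z - x) dx; integrand is zero
    # outside 0 <= x <= z for this example
    val, _ = quad(lambda x: f_joint(x, z - x), 0, z)
    return val

z = 1.7
print(f_sum(z), z * np.exp(-z))            # both ~0.3106
```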
Conditional Expectation

If $X$ and $Y$ are jointly distributed random variables, we define the conditional expectation as

$$E(X \mid Y = y) = \sum_{\text{all } x} x \, f(x \mid y) \quad \text{(discrete case)}$$

$$E(X \mid Y = y) = \int_{-\infty}^{\infty} x \, f(x \mid y) \, dx \quad \text{(continuous case)}$$

We can think of the conditional expectation as a function of the random variable $Y$, thereby making $E(X \mid Y)$ itself a random variable, with

$$E(E(X \mid Y)) = E(X)$$
Conditional Variance

Similar to conditional expectation, we can also define the conditional variance

$$Var(Y \mid X) = E[(Y - E(Y \mid X))^2 \mid X] = E(Y^2 \mid X) - E(Y \mid X)^2$$

There is also an analogue of the law of total expectation above, the law of total variance:

$$Var(Y) = E[Var(Y \mid X)] + Var[E(Y \mid X)]$$
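A simulation sketch of the law of total variance for a hypothetical hierarchical model, $X \sim \mathcal{N}(0, 1)$ and $Y \mid X \sim \mathcal{N}(X, 2^2)$, where both pieces of the decomposition are known.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
x = rng.standard_normal(n)
y = rng.normal(loc=x, scale=2.0)   # Y | X ~ N(X, 4)

# E(Y|X) = X and Var(Y|X) = 4, so the decomposition gives
# Var(Y) = E[4] + Var(X) = 4 + 1 = 5.
print(y.var())                     # ~5.0
```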
Covariance

We have previously discussed covariance in relation to the variance of the sum of two random variables (Review Lecture 8):

$$Var(X + Y) = Var(X) + Var(Y) + 2 \, Cov(X, Y)$$

Specifically, covariance is defined as

$$Cov(X, Y) = E[(X - E(X))(Y - E(Y))] = E(XY) - E(X)E(Y)$$

This is a generalization of variance to two random variables, and it generally measures the degree to which $X$ and $Y$ tend to be large (or small) at the same time, or the degree to which one tends to be large while the other is small.
Properties of Covariance

- $Cov(X, Y) = E[(X - \mu_x)(Y - \mu_y)] = E(XY) - \mu_x \mu_y$
- $Cov(X, Y) = Cov(Y, X)$
- $Cov(X, Y) = 0$ if $X$ and $Y$ are independent
- $Cov(X, c) = 0$
- $Cov(X, X) = Var(X)$
- $Cov(aX, bY) = ab \, Cov(X, Y)$
- $Cov(X + a, Y + b) = Cov(X, Y)$
- $Cov(X, Y + Z) = Cov(X, Y) + Cov(X, Z)$
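A sketch spot-checking several of these identities on simulated data. Since they are algebraic, they also hold exactly for the sample covariance estimator, so the paired values below agree up to floating-point error; the constants $a = 2$, $b = -3$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000
x, y, z = rng.standard_normal((3, n))

def cov(a, b):
    # sample version of Cov(X, Y) = E(XY) - E(X)E(Y)
    return np.mean(a * b) - a.mean() * b.mean()

a, b = 2.0, -3.0
print(cov(a * x, b * y), a * b * cov(x, y))    # Cov(aX, bY) = ab Cov(X, Y)
print(cov(x + 1.0, y - 2.0), cov(x, y))        # shifts don't change covariance
print(cov(x, y + z), cov(x, y) + cov(x, z))    # additivity in one argument
print(cov(x, x), x.var())                      # Cov(X, X) = Var(X)
```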
Correlation

Since $Cov(X, Y)$ depends on the magnitude of $X$ and $Y$, we would prefer a measure of association that is not affected by arbitrary changes in the scales of the random variables. The most common measure of linear association is correlation, which is defined as

$$\rho(X, Y) = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}, \qquad -1 \le \rho(X, Y) \le 1$$

The magnitude of the correlation measures the strength of the linear association, and the sign determines whether it is a positive or negative relationship.
General Bivariate Normal

Let $Z_1, Z_2 \sim \mathcal{N}(0, 1)$ be independent, which we will use to build a general bivariate normal distribution:

$$f(z_1, z_2) = \frac{1}{2\pi} \exp\left[-\frac{1}{2}(z_1^2 + z_2^2)\right]$$

We want to transform these unit normal distributions to have the following arbitrary parameters: $\mu_X$, $\mu_Y$, $\sigma_X$, $\sigma_Y$, $\rho$.

$$X = \sigma_X Z_1 + \mu_X \qquad Y = \sigma_Y \left[\rho Z_1 + \sqrt{1 - \rho^2} \, Z_2\right] + \mu_Y$$

$$f(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y (1 - \rho^2)^{1/2}} \exp\left[-\frac{1}{2(1 - \rho^2)} \left(\frac{(x - \mu_X)^2}{\sigma_X^2} + \frac{(y - \mu_Y)^2}{\sigma_Y^2} - \frac{2\rho (x - \mu_X)(y - \mu_Y)}{\sigma_X \sigma_Y}\right)\right]$$
General Bivariate Normal - Marginals and Cov

The marginal distributions of $X$ and $Y$ are given by

$$X = \sigma_X Z_1 + \mu_X = \sigma_X \mathcal{N}(0, 1) + \mu_X = \mathcal{N}(\mu_X, \sigma_X^2)$$

$$Y = \sigma_Y \left[\rho Z_1 + \sqrt{1 - \rho^2} \, Z_2\right] + \mu_Y = \sigma_Y \left[\rho \mathcal{N}(0, 1) + \sqrt{1 - \rho^2} \, \mathcal{N}(0, 1)\right] + \mu_Y = \sigma_Y \left[\mathcal{N}(0, \rho^2) + \mathcal{N}(0, 1 - \rho^2)\right] + \mu_Y = \sigma_Y \mathcal{N}(0, 1) + \mu_Y = \mathcal{N}(\mu_Y, \sigma_Y^2)$$

$$\begin{aligned}
Cov(X, Y) &= E[(X - E(X))(Y - E(Y))] \\
&= E\left[(\sigma_X Z_1 + \mu_X - \mu_X)\left(\sigma_Y \left[\rho Z_1 + \sqrt{1 - \rho^2} \, Z_2\right] + \mu_Y - \mu_Y\right)\right] \\
&= E\left[(\sigma_X Z_1)\left(\sigma_Y \left[\rho Z_1 + \sqrt{1 - \rho^2} \, Z_2\right]\right)\right] \\
&= \sigma_X \sigma_Y \, E\left[\rho Z_1^2 + \sqrt{1 - \rho^2} \, Z_1 Z_2\right] \\
&= \sigma_X \sigma_Y \rho \, E[Z_1^2] = \sigma_X \sigma_Y \rho
\end{aligned}$$

$$\rho(X, Y) = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} = \rho$$
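A sketch of this construction in code: build $(X, Y)$ from independent unit normals $Z_1, Z_2$ and check that the sample moments match the target parameters (the parameter values are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(5)
mu_x, mu_y, s_x, s_y, rho = 1.0, -2.0, 2.0, 0.5, 0.7

n = 1_000_000
z1, z2 = rng.standard_normal((2, n))
x = s_x * z1 + mu_x
y = s_y * (rho * z1 + np.sqrt(1 - rho**2) * z2) + mu_y

print(x.mean(), y.mean())        # ~1.0, ~-2.0
print(x.std(), y.std())          # ~2.0, ~0.5
print(np.corrcoef(x, y)[0, 1])   # ~0.7
```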
General Bivariate Normal - Density (Matrix Notation)

Obviously, the density for the bivariate normal is ugly, and it only gets worse when we consider higher dimensional joint densities of normals. We can write the density in a more compact form using matrix notation:

$$f(\mathbf{x}) = \frac{1}{2\pi (\det \Sigma)^{1/2}} \exp\left[-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]$$

In the bivariate case,

$$\mathbf{x} = \begin{pmatrix} x \\ y \end{pmatrix} \qquad \boldsymbol{\mu} = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix} \qquad \Sigma = \begin{pmatrix} \sigma_X^2 & \rho \sigma_X \sigma_Y \\ \rho \sigma_X \sigma_Y & \sigma_Y^2 \end{pmatrix}$$
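A sketch comparing the matrix form of the density, evaluated directly, against SciPy's `multivariate_normal`; parameters and the evaluation point are arbitrary.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0])
s_x, s_y, rho = 2.0, 0.5, 0.7
Sigma = np.array([[s_x**2,          rho * s_x * s_y],
                  [rho * s_x * s_y, s_y**2         ]])

xy = np.array([0.5, -1.5])
d = xy - mu
# f(x) = exp(-0.5 (x - mu)^T Sigma^-1 (x - mu)) / (2 pi sqrt(det Sigma))
dens = np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / \
       (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))

print(dens, multivariate_normal(mu, Sigma).pdf(xy))  # identical values
```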
Conditional Expectation of the Bivariate Normal

Using $X = \mu_X + \sigma_X Z_1$ and $Y = \mu_Y + \sigma_Y [\rho Z_1 + (1 - \rho^2)^{1/2} Z_2]$ where $Z_1, Z_2 \sim \mathcal{N}(0, 1)$, we can find $E(Y \mid X)$. Conditioning on $X = x$ fixes $Z_1 = (x - \mu_X)/\sigma_X$, while $Z_2$ remains independent of $X$:

$$\begin{aligned}
E[Y \mid X = x] &= E\left[\mu_Y + \sigma_Y \left(\rho Z_1 + (1 - \rho^2)^{1/2} Z_2\right) \,\middle|\, X = x\right] \\
&= E\left[\mu_Y + \sigma_Y \left(\rho \, \frac{x - \mu_X}{\sigma_X} + (1 - \rho^2)^{1/2} Z_2\right) \,\middle|\, X = x\right] \\
&= \mu_Y + \sigma_Y \left(\rho \, \frac{x - \mu_X}{\sigma_X} + (1 - \rho^2)^{1/2} \, E[Z_2 \mid X = x]\right) \\
&= \mu_Y + \sigma_Y \rho \left(\frac{x - \mu_X}{\sigma_X}\right)
\end{aligned}$$

By symmetry,

$$E[X \mid Y = y] = \mu_X + \sigma_X \rho \left(\frac{y - \mu_Y}{\sigma_Y}\right)$$
Conditional Variance of the Bivariate Normal

Using $X = \mu_X + \sigma_X Z_1$ and $Y = \mu_Y + \sigma_Y [\rho Z_1 + (1 - \rho^2)^{1/2} Z_2]$ where $Z_1, Z_2 \sim \mathcal{N}(0, 1)$, we can find $Var(Y \mid X)$.

$$\begin{aligned}
Var[Y \mid X = x] &= Var\left[\mu_Y + \sigma_Y \left(\rho Z_1 + (1 - \rho^2)^{1/2} Z_2\right) \,\middle|\, X = x\right] \\
&= Var\left[\mu_Y + \sigma_Y \left(\rho \, \frac{x - \mu_X}{\sigma_X} + (1 - \rho^2)^{1/2} Z_2\right) \,\middle|\, X = x\right] \\
&= Var\left[\sigma_Y (1 - \rho^2)^{1/2} Z_2 \,\middle|\, X = x\right] \\
&= \sigma_Y^2 (1 - \rho^2)
\end{aligned}$$

By symmetry,

$$Var[X \mid Y = y] = \sigma_X^2 (1 - \rho^2)$$
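A simulation sketch of both conditional-moment results: restrict to draws with $X$ near a fixed $x$ and compare the conditional sample mean and variance of $Y$ with the two formulas above. The parameters and the conditioning point $x_0 = 1.5$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
mu_x, mu_y, s_x, s_y, rho = 0.0, 1.0, 1.0, 2.0, 0.6

n = 2_000_000
z1, z2 = rng.standard_normal((2, n))
x = s_x * z1 + mu_x
y = s_y * (rho * z1 + np.sqrt(1 - rho**2) * z2) + mu_y

x0 = 1.5
keep = np.abs(x - x0) < 0.01   # condition on X being near x0
print(y[keep].mean(), mu_y + s_y * rho * (x0 - mu_x) / s_x)   # ~2.8
print(y[keep].var(), s_y**2 * (1 - rho**2))                   # ~2.56
```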
Bayesian Inference

Bayes' Theorem states

$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}$$

We are interested in estimating the model parameters based on the observed data and any prior belief about the parameters, which we set up as follows:

$$P(\theta \mid X) = \frac{P(X \mid \theta) \, \pi(\theta)}{P(X)} \propto P(X \mid \theta) \, \pi(\theta)$$
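A sketch of posterior $\propto$ likelihood $\times$ prior in the simplest conjugate case, which the lecture does not spell out: a Beta$(a, b)$ prior on a coin's heads probability with binomial data yields a Beta$(a + \text{heads}, b + \text{tails})$ posterior. The counts below are made up for illustration.

```python
from scipy.stats import beta

a, b = 2, 2          # prior: Beta(2, 2), mildly favoring fair coins
heads, tails = 7, 3  # hypothetical observed data

posterior = beta(a + heads, b + tails)   # conjugacy: Beta(9, 5)
print(posterior.mean())                  # posterior mean: 9/14 ~ 0.643
print(posterior.interval(0.95))          # central 95% credible interval
```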
Bayesian Inference - Terminology

Elements of the Bayesian model:

- $\pi(\theta)$ - Prior distribution - reflects any preexisting information / belief about the distribution of the parameter(s).
- $P(X \mid \theta)$ - Likelihood / sampling distribution - distribution of the data given the parameters, which is the probability model believed to have generated the data.
- $P(X)$ - Marginal distribution of the data - distribution of the observed data marginalized over all possible values of the parameter(s).
- $P(\theta \mid X)$ - Posterior distribution - distribution of the parameter(s) after taking the observed data into account.
Maximum Likelihood

Maximum likelihood estimates the parameter(s) by maximizing the likelihood function given the observed data:

$$L(\theta \mid X) = f(X \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)$$

$$\ell(\theta \mid X) = \log L(\theta \mid X) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$

The maximum likelihood estimate (estimator) is therefore given by

$$\hat{\theta}_{MLE} = \arg\max_{\theta} L(\theta \mid X) = \arg\max_{\theta} \ell(\theta \mid X)$$
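A sketch of computing an MLE numerically when no closed form is handy: minimize the negative log-likelihood with SciPy. Normal data are used here because the analytic MLEs ($\bar{x}$ and the uncorrected sample standard deviation) are available for comparison; the data and starting values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(7)
x = rng.normal(loc=3.0, scale=1.5, size=1000)   # simulated data

def neg_log_lik(theta):
    mu, log_sigma = theta                        # log-parametrize sigma > 0
    return -norm.logpdf(x, mu, np.exp(log_sigma)).sum()

fit = minimize(neg_log_lik, x0=[0.0, 0.0])
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])
print(mu_hat, x.mean())           # matches the analytic MLE for mu
print(sigma_hat, x.std(ddof=0))   # matches the analytic MLE for sigma
```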
Why use the MLE?

The maximum likelihood estimator has the following properties:

- Consistency - as the sample size tends to infinity, the MLE tends to the true value of the parameter(s).
- Asymptotic normality - as the sample size increases, the distribution of the MLE tends to the normal distribution.
- Efficiency - as the sample size tends to infinity, no other unbiased estimator has a lower mean squared error.

Common issues:

- Uniqueness - the estimate may not be unique.
- Computation - the estimate may not have a closed form.