Random vectors

Recall that a random vector $X = (X_1, X_2, \ldots, X_k)'$ is made up of, say, $k$ random variables. A random vector has a joint distribution, e.g. a density $f(x)$, that gives probabilities $P(X \in A) = \int_A f(x)\,dx$. Just as a random variable $X$ has a mean $E(X)$ and variance $\mathrm{var}(X)$, a random vector also has a mean vector $E(X)$ and a covariance matrix $\mathrm{cov}(X)$.
Let $X = (X_1, \ldots, X_k)'$ be a random vector with density $f(x_1, \ldots, x_k)$ or pmf $p(x_1, \ldots, x_k)$. The mean of $X$ is the vector of marginal means
\[
E(X) = E\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_k \end{pmatrix} = \begin{pmatrix} E(X_1) \\ E(X_2) \\ \vdots \\ E(X_k) \end{pmatrix}.
\]
The covariance matrix of $X$ is given by
\[
\mathrm{cov}(X) = \begin{pmatrix}
\mathrm{cov}(X_1, X_1) & \mathrm{cov}(X_1, X_2) & \cdots & \mathrm{cov}(X_1, X_k) \\
\mathrm{cov}(X_2, X_1) & \mathrm{cov}(X_2, X_2) & \cdots & \mathrm{cov}(X_2, X_k) \\
\vdots & \vdots & \ddots & \vdots \\
\mathrm{cov}(X_k, X_1) & \mathrm{cov}(X_k, X_2) & \cdots & \mathrm{cov}(X_k, X_k)
\end{pmatrix}.
\]
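As a quick illustration (not from the original notes), both quantities can be estimated from simulated draws in R; the random vector below is made up for the example:

set.seed(1)
n=10000
u=runif(n)
X=cbind(u, u+runif(n), runif(n)) # draws of a hypothetical 3-dimensional random vector
colMeans(X) # estimates the mean vector E(X)
cov(X) # estimates the 3 x 3 covariance matrix cov(X)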
Multivariate normal distribution

The normal distribution generalizes to multiple dimensions. We'll first look at two jointly distributed normal random variables, then discuss three or more. The bivariate normal density for $(X_1, X_2)$ is given by
\[
f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right) + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 \right] \right\}.
\]
There are 5 parameters: $(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$.

Read: pp. 8-84, 45-46, 48, 567-568.
This density jointly defines $X_1$ and $X_2$, which live in $\mathbb{R}^2 = (-\infty,\infty) \times (-\infty,\infty)$. Marginally, $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$. The correlation between $X_1$ and $X_2$ is given by $\mathrm{corr}(X_1, X_2) = \rho$.

For jointly normal random variables, if the correlation is zero then they are independent. This is not true in general for jointly defined random variables (e.g. homework 5 problem).
\[
E(X) = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \mathrm{cov}(X) = \begin{pmatrix} \sigma_1^2 & \sigma_1\sigma_2\rho \\ \sigma_1\sigma_2\rho & \sigma_2^2 \end{pmatrix}.
\]
Next slide: $\mu_1 = 0, 1$; $\mu_2 = 0, 2$; $\sigma_1^2 = \sigma_2^2 = 1$; $\rho = 0, 0.9, 0.6$.
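For evaluating bivariate normal densities and probabilities numerically, one option is the add-on package mvtnorm; the parameter values below are illustrative, echoing the next slide's settings:

library(mvtnorm)
mu=c(0,2); s1=1; s2=1; rho=0.9 # illustrative parameter values
Sigma=matrix(c(s1^2, rho*s1*s2, rho*s1*s2, s2^2), 2, 2)
dmvnorm(c(0.5,2.5), mean=mu, sigma=Sigma) # the density f(0.5, 2.5)
pmvnorm(upper=c(0,2), mean=mu, sigma=Sigma) # P(X1 <= 0, X2 <= 2)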
[Figure 1: Bivariate normal PDF level curves.]
Proof that $X_1$ is independent of $X_2$ when $\rho = 0$

When $\rho = 0$ the joint density for $(X_1, X_2)$ simplifies to
\[
f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2} \exp\left\{ -\frac{1}{2}\left[ \left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 \right] \right\}
= \left[ \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-0.5(x_1-\mu_1)^2/\sigma_1^2} \right]\left[ \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-0.5(x_2-\mu_2)^2/\sigma_2^2} \right].
\]
Since these factors are respectively functions of $x_1$ and $x_2$ only, and the range of $(X_1, X_2)$ factors into the product of two sets, $X_1$ and $X_2$ are independent; in fact $X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$.
Conditional distributions $[X_1 \mid X_2 = x_2]$ and $[X_2 \mid X_1 = x_1]$

The conditional distribution of $X_1$ given $X_2 = x_2$ is
\[
[X_1 \mid X_2 = x_2] \sim N\left( \mu_1 + \frac{\sigma_1}{\sigma_2}\rho(x_2 - \mu_2),\; \sigma_1^2(1-\rho^2) \right).
\]
Similarly,
\[
[X_2 \mid X_1 = x_1] \sim N\left( \mu_2 + \frac{\sigma_2}{\sigma_1}\rho(x_1 - \mu_1),\; \sigma_2^2(1-\rho^2) \right).
\]
This ties directly to linear regression: to predict $X_2$ from $X_1 = x_1$, we have
\[
E(X_2 \mid X_1 = x_1) = \left[ \mu_2 - \frac{\sigma_2}{\sigma_1}\rho\mu_1 \right] + \left[ \frac{\sigma_2}{\sigma_1}\rho \right] x_1 = \beta_0 + \beta_1 x_1.
\]
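In code, the regression connection amounts to a few lines of arithmetic; this sketch uses made-up parameter values:

mu1=14; mu2=94; sigma1=8; sigma2=14; rho=-0.6 # illustrative values only
beta1=rho*sigma2/sigma1 # slope of E(X2 | X1 = x1)
beta0=mu2-beta1*mu1 # intercept
beta0+beta1*10 # predicted X2 when x1 = 10
sigma2^2*(1-rho^2) # the conditional variance, the same for every x1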
Bivariate normal distribution as data model

Here we assume
\[
\begin{pmatrix} X_{i1} \\ X_{i2} \end{pmatrix} \overset{iid}{\sim} N_2\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} \right),
\]
or succinctly, $X_i \overset{iid}{\sim} N_2(\mu, \Sigma)$. If the bivariate normal model is appropriate for paired outcomes, it provides a convenient probability model with some nice properties. The sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i$ is the MLE of $\mu$, and the sample covariance matrix $\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})'$ is the MLE of $\Sigma$. It can be shown that $\bar{X} \sim N_2\left(\mu, \frac{1}{n}\Sigma\right)$, and that the matrix $n\hat{\Sigma}$ has a Wishart distribution.
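A small R sketch of these MLEs: note that R's built-in cov() divides by $n-1$, so the MLE rescales it by $(n-1)/n$. The function name is ours, not standard R:

mle.binorm=function(X){ # X is an n x 2 data matrix
  n=nrow(X)
  list(mu.hat=colMeans(X), # MLE of mu
       Sigma.hat=cov(X)*(n-1)/n) # MLE of Sigma: rescaled sample covariance
}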
Say $n$ outcome pairs are to be recorded: $\{(X_{11}, X_{12}), (X_{21}, X_{22}), \ldots, (X_{n1}, X_{n2})\}$. The $i$th pair is $(X_{i1}, X_{i2})$. The sample mean vector is given elementwise by
\[
\bar{X} = \begin{pmatrix} \bar{X}_1 \\ \bar{X}_2 \end{pmatrix} = \begin{pmatrix} \frac{1}{n}\sum_{i=1}^n X_{i1} \\ \frac{1}{n}\sum_{i=1}^n X_{i2} \end{pmatrix},
\]
and the sample covariance matrix is given elementwise by
\[
S = \begin{pmatrix}
\frac{1}{n}\sum_{i=1}^n (X_{i1} - \bar{X}_1)^2 & \frac{1}{n}\sum_{i=1}^n (X_{i1} - \bar{X}_1)(X_{i2} - \bar{X}_2) \\
\frac{1}{n}\sum_{i=1}^n (X_{i1} - \bar{X}_1)(X_{i2} - \bar{X}_2) & \frac{1}{n}\sum_{i=1}^n (X_{i2} - \bar{X}_2)^2
\end{pmatrix}.
\]
The sample mean vector $\bar{X}$ estimates $\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$ and the sample covariance matrix $S$ estimates
\[
\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}.
\]
We will place hats on parameter estimators based on the data. So
\[
\hat{\mu}_1 = \bar{X}_1, \quad \hat{\mu}_2 = \bar{X}_2, \quad \hat{\sigma}_1^2 = \frac{1}{n}\sum_{i=1}^n (X_{i1} - \bar{X}_1)^2, \quad \hat{\sigma}_2^2 = \frac{1}{n}\sum_{i=1}^n (X_{i2} - \bar{X}_2)^2.
\]
Also,
\[
\widehat{\mathrm{cov}}(X_1, X_2) = \frac{1}{n}\sum_{i=1}^n (X_{i1} - \bar{X}_1)(X_{i2} - \bar{X}_2).
\]
So a natural estimate of $\rho$ is then
\[
\hat{\rho} = \frac{\widehat{\mathrm{cov}}(X_1, X_2)}{\hat{\sigma}_1 \hat{\sigma}_2} = \frac{\sum_{i=1}^n (X_{i1} - \bar{X}_1)(X_{i2} - \bar{X}_2)}{\sqrt{\sum_{i=1}^n (X_{i1} - \bar{X}_1)^2 \sum_{i=1}^n (X_{i2} - \bar{X}_2)^2}}.
\]
This is in fact the MLE based on the bivariate normal model; it is also a plug-in (method-of-moments) estimator. It is commonly referred to as the Pearson correlation coefficient. You can get it as, e.g., cor(age,Gesell) in R.

This estimate of correlation can be unduly influenced by outliers in the sample. An alternative measure of linear association is the Spearman correlation, which is based on ranks. This correlation estimates something a bit different than the Pearson correlation.

> cor(age,Gesell)
[1] -0.64029
> cor(age,Gesell,method="spearman")
[1] -0.3166224
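As a check on the formula, $\hat{\rho}$ can be computed by hand and compared with R's cor(); the helper function name here is ours:

pearson.by.hand=function(x,y){
  sum((x-mean(x))*(y-mean(y)))/
    sqrt(sum((x-mean(x))^2)*sum((y-mean(y))^2)) # the 1/n factors cancel
}
# pearson.by.hand(age,Gesell) agrees with cor(age,Gesell)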
Gesell data

Recall: $X$ is the age in months at which a child speaks his/her first word, and $Y$ is the child's Gesell adaptive score, a measure of the child's aptitude. Question: how does a child's aptitude change with how long it takes them to speak? Here, $n = 21$. In R we find
\[
\hat{E}(X) = \begin{pmatrix} 14.38 \\ 93.67 \end{pmatrix}, \qquad \widehat{\mathrm{cov}}(X) = \begin{pmatrix} 60.14 & -67.78 \\ -67.78 & 186.32 \end{pmatrix}.
\]
Assuming a bivariate normal model, we plug in the MLEs and obtain the estimated PDF for $(X, Y)$:
\[
f(x, y) = \exp(-60.22 + 1.3006x - 0.0134x^2 + 0.9520y - 0.0098xy - 0.0043y^2).
\]
We can further find, from $Y \sim N(93.67, 186.32)$,
\[
f_Y(y) = \exp(-3.557 - 0.00256(y - 93.67)^2).
\]
[Figure 2: 3D plot of $f(x, y)$ for $(X, Y)$ based on the MLEs.]
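A plot like Figure 2 can be produced with base R's persp(), assuming the grid vectors x, x2 and the density matrix z from the R code slide below have already been created; the viewing angles are arbitrary:

persp(x,x2,z,theta=30,phi=25, # viewing angles, chosen by eye
      xlab="age",ylab="Gesell",zlab="f(x,y)")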
[Figure 3: Solid curve is $f_Y(y)$; the left dashed curve is $f_{Y|X}(y \mid 25)$, the right dashed curve is $f_{Y|X}(y \mid 10)$.]

As the age in months at first word $X = x$ increases, the distribution of Gesell adaptive scores $Y$ shifts toward lower values.
[Figure 4: MLE estimate of the density with the actual data.]
R code to get the estimates $\hat{\mu}$ and $\hat{\Sigma}$, plot level curves of the PDF, and overlay the data:

age<-c(15,26,10,9,15,20,18,11,8,20,7,9,10,11,11,10,12,42,17,11,10)
Gesell<-c(95,71,83,91,102,87,93,100,104,94,113,96,83,84,102,100,105,57,121,86,100)
data=matrix(c(age,Gesell),ncol=2) # make data matrix
m=c(mean(age),mean(Gesell)) # mean vector
s=cov(data) # covariance matrix
x=seq(min(age)-sd(age),max(age)+sd(age),length=200) # grid of representative age values
x2=seq(min(Gesell)-sd(Gesell),max(Gesell)+sd(Gesell),length=200) # grid of Gesell values
f=function(x,y){
  r=s[1,2]/sqrt(s[1,1]*s[2,2])
  term1=(x-m[1])/sqrt(s[1,1]); term2=(y-m[2])/sqrt(s[2,2]); term3=-2*r*term1*term2
  exp(-0.5*(term1^2+term2^2+term3)/(1-r^2))/(2*pi*sqrt(s[1,1]*s[2,2]*(1-r^2)))
} # function that gives the best-fitting bivariate normal density to the data
z=outer(x,x2,f) # compute the joint pdf over the grid
contour(x,x2,z,nlevels=7,xlab="age",ylab="Gesell") # make contour plot
points(age,Gesell,pch=19) # superimpose filled points
Checking bivariate normality

There are no multivariate boxplots. There are multivariate histograms (and smoothed histograms), but they're typically not that useful for checking normality. The easiest thing to do is to check that marginal boxplots/histograms of $X_{11}, \ldots, X_{n1}$ and $X_{12}, \ldots, X_{n2}$ look approximately normal. This does not guarantee joint normality, though. One can additionally look at a simple scatterplot of the data, checking that it shows an approximately elliptical cloud of points with no glaringly obvious outlying values; a sketch of such a diagnostic follows.
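One possible version of these checks in R, using the Gesell data (a sketch, not a formal test of normality):

par(mfrow=c(1,3)) # three plots side by side
qqnorm(age,main="age marginal"); qqline(age) # marginal normal Q-Q plot
qqnorm(Gesell,main="Gesell marginal"); qqline(Gesell)
plot(age,Gesell,main="joint scatter") # look for an elliptical cloud, flag outliers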
Multivariate normal distribution

In general, a $k$-variate normal is defined through the mean vector and covariance matrix:
\[
\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_k \end{pmatrix} \sim N_k\left( \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1k} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{k1} & \sigma_{k2} & \cdots & \sigma_{kk} \end{pmatrix} \right).
\]
Succinctly, $X \sim N_k(\mu, \Sigma)$. Recall that if $Z \sim N(0, 1)$, then $X = \mu + \sigma Z \sim N(\mu, \sigma^2)$. The definition of the multivariate normal distribution just extends this idea.
Instead of one standard normal, we have a list of $k$ independent standard normals $Z = (Z_1, \ldots, Z_k)'$, and we consider the same sort of transformation in the multivariate case using matrices and vectors.

Let $Z_1, \ldots, Z_k \overset{iid}{\sim} N(0, 1)$. The joint pdf of $(Z_1, \ldots, Z_k)$ is given by
\[
f(z_1, \ldots, z_k) = \prod_{i=1}^k \frac{1}{\sqrt{2\pi}} \exp(-0.5 z_i^2).
\]
Let
\[
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{pmatrix} \quad \text{and} \quad \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1k} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{k1} & \sigma_{k2} & \cdots & \sigma_{kk} \end{pmatrix},
\]
where $\Sigma$ is symmetric (i.e. $\Sigma^T = \Sigma$, which implies $\sigma_{ij} = \sigma_{ji}$ for all $1 \le i, j \le k$).
Let $\Sigma^{1/2}$ be any matrix such that $\Sigma^{1/2} \Sigma^{1/2} = \Sigma$. Then
\[
X = \mu + \Sigma^{1/2} Z
\]
is said to have a multivariate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$, written $X \sim N_k(\mu, \Sigma)$. Written in terms of matrices,
\[
X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_k \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{pmatrix} + \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1k} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{k1} & \sigma_{k2} & \cdots & \sigma_{kk} \end{pmatrix}^{1/2} \begin{pmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_k \end{pmatrix}.
\]
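This construction is easy to code. R's chol(Sigma) returns an upper-triangular factor $R$ with $R'R = \Sigma$, so $L = $ t(chol(Sigma)) satisfies $LL' = \Sigma$; any such factor yields the correct covariance, even though it is not the symmetric square root. The function name is ours:

rmvn.by.hand=function(mu,Sigma){
  k=length(mu)
  z=rnorm(k) # k independent N(0,1) draws
  as.vector(mu+t(chol(Sigma))%*%z) # X = mu + Sigma^(1/2) Z
}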
Using some math, it can be shown that the pdf of the new vector $X = (X_1, \ldots, X_k)'$ is given by
\[
f(x_1, \ldots, x_k \mid \mu, \Sigma) = |2\pi\Sigma|^{-1/2} \exp\{-0.5 (x - \mu)' \Sigma^{-1} (x - \mu)\}.
\]
In the one-dimensional case, this simplifies to our old friend
\[
f(x \mid \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\{-0.5 (x - \mu)(\sigma^2)^{-1}(x - \mu)\},
\]
the pdf of a $N(\mu, \sigma^2)$ random variable $X$. There are several important properties of multivariate normal vectors.
Let X N k (µ,σ) Then For each X i in X = (X,,X k ), E(X i ) = µ i and var(x i ) = σ ii That is, marginally, X i N(µ i, σ ii ) 2 For any r k matrix M, MX N r (Mµ,MΣM ) 3 For any two (X i, X j ) where i < j k, cov(x i, X j ) = σ ij The off-diagonal elements of Σ give the covariance between two elements of (X,,X k ) Note then ρ(x i, X j ) = σ ij / σ ii σ jj 4 For any k vector m = (m,,m k ) and Y N k (µ,σ), m + Y N k (m + µ,σ) 22
As a simple example, let
\[
\begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} \sim N_3\left( \begin{pmatrix} 2 \\ 5 \\ 0 \end{pmatrix}, \begin{pmatrix} 4 & -2 & -2 \\ -2 & 3 & -1 \\ -2 & -1 & 4 \end{pmatrix} \right).
\]
Define
\[
M = \begin{pmatrix} 1 & 0 & 0 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} \quad \text{and} \quad Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = MX.
\]
Then $X_2 \sim N(5, 3)$, $\mathrm{cov}(X_2, X_3) = -1$, and
\[
Y = MX \sim N_2\left( \begin{pmatrix} 1 & 0 & 0 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} \begin{pmatrix} 2 \\ 5 \\ 0 \end{pmatrix}, \; \begin{pmatrix} 1 & 0 & 0 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} \begin{pmatrix} 4 & -2 & -2 \\ -2 & 3 & -1 \\ -2 & -1 & 4 \end{pmatrix} \begin{pmatrix} 1 & 1/3 \\ 0 & 1/3 \\ 0 & 1/3 \end{pmatrix} \right),
\]
or simplifying,
\[
\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} \sim N_2\left( \begin{pmatrix} 2 \\ 7/3 \end{pmatrix}, \begin{pmatrix} 4 & 0 \\ 0 & 1/9 \end{pmatrix} \right).
\]
Note that for the transformed vector $Y = (Y_1, Y_2)'$, $\mathrm{cov}(Y_1, Y_2) = 0$ and therefore $Y_1$ and $Y_2$ are uncorrelated, i.e. $\rho(Y_1, Y_2) = 0$.

We will use these properties later on, when we discuss the large-sample distribution of MLE vectors:
\[
\hat{\theta} = \begin{pmatrix} \hat{\theta}_1 \\ \hat{\theta}_2 \\ \vdots \\ \hat{\theta}_p \end{pmatrix} \overset{\cdot}{\sim} N_p\left( \begin{pmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_p \end{pmatrix}, \widehat{\mathrm{cov}}(\hat{\theta}) \right).
\]
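As a numerical check of the worked example (and a Monte Carlo check of property 2), using rmvn.by.hand from the sketch a few slides back:

mu=c(2,5,0)
Sigma=matrix(c(4,-2,-2, -2,3,-1, -2,-1,4),3,3)
M=rbind(c(1,0,0), c(1,1,1)/3)
M%*%mu # exact mean of Y: (2, 7/3)'
M%*%Sigma%*%t(M) # exact covariance of Y: diag(4, 1/9)
draws=t(replicate(10000, rmvn.by.hand(mu,Sigma))) # 10000 draws of X, one per row
cov(draws%*%t(M)) # sample covariance of MX, close to diag(4, 1/9)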
For the linear model (e.g. simple linear regression or the two-sample model) $Y = X\beta + e$, the MLE of the mean parameters has exactly a multivariate normal distribution (p. 574):
\[
\hat{\beta} \sim N_p(\beta, (X'X)^{-1}\sigma^2),
\]
where $p = 2$ is the number of mean parameters here. The MLE $\hat{\sigma}^2$ has a gamma distribution.

Let's try an experiment where we know $\beta = (5, 3)'$. We will generate $m = 200$ independent experiments, each with a sample size of $n = 20$. The model is
\[
Y_i = \beta_0 + \beta_1 x_i + e_i, \quad i = 1, \ldots, 20,
\]
where $\beta_0 = 5$, $\beta_1 = 3$, and $\sigma = 0.2$. Take the $x_i$ to be equally spaced from $x_1 = 0$ to $x_{20} = 1$. The errors satisfy $e_1, \ldots, e_{20} \overset{iid}{\sim} N(0, 0.2^2)$.
Here's R code to sample the $m = 200$ $\hat{\beta}$'s, each obtained from a sample of size $n = 20$:

sigma=0.2
m=200 # number of betahat's sampled
n=20 # sample size going into each experiment
y=rep(0,n)
beta0hat=rep(0,m)
beta1hat=rep(0,m)
x=seq(0,1,length=n) # predictor values evenly spaced over [0,1]
beta=c(5,3) # true (but usually unknown) beta vector
mean=beta[1]+beta[2]*x # true mean function on grid of x values
for(i in 1:m){
  y=mean+rnorm(n,0,sigma)
  fit=lm(y~x)
  beta0hat[i]=fit$coefficients[1]
  beta1hat[i]=fit$coefficients[2]
}
Xdesign=matrix(c(rep(1,n),x),ncol=2)
m=beta # exact mean vector
s=solve(t(Xdesign)%*%Xdesign)*sigma^2 # exact covariance
x=seq(4.7,5.3,length=200) # grid of representative beta0hat values
x2=seq(2.5,3.4,length=200) # grid of representative beta1hat values
f=function(x,y){
  r=s[1,2]/sqrt(s[1,1]*s[2,2])
  term1=(x-m[1])/sqrt(s[1,1]); term2=(y-m[2])/sqrt(s[2,2]); term3=-2*r*term1*term2
  exp(-0.5*(term1^2+term2^2+term3)/(1-r^2))/(2*pi*sqrt(s[1,1]*s[2,2]*(1-r^2)))
} # exact bivariate normal density according to theory
g0=rep(beta[1],200); g1=rep(beta[2],200)
z=outer(x,x2,f) # compute the joint pdf over the grid
contour(x,x2,z,nlevels=5,xlab="beta0hat",ylab="beta1hat") # make contour plot
points(beta0hat,beta1hat) # superimpose betahat's from m experiments of sample size n
lines(x,g1,lty=2); lines(g0,x2,lty=2) # dotted lines crossing at true beta

Exactly, the theory tells us
\[
\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix} \sim N_2\left( \begin{pmatrix} 5 \\ 3 \end{pmatrix}, \begin{pmatrix} 0.00743 & -0.01086 \\ -0.01086 & 0.02171 \end{pmatrix} \right).
\]
Let's see how a plot of the 200 $\hat{\beta}$'s looks superimposed on the exact sampling distribution.
[Figure 5: Theoretical distribution (pdf) with 200 sampled MLE's.]