Random vectors

Recall that a random vector X = (X_1, X_2, \dots, X_k)^T is made up of, say, k random variables. A random vector has a joint distribution, e.g. a density f(x), that gives probabilities
P(X \in A) = \int_A f(x) \, dx.
Just as a random variable X has a mean E(X) and variance var(X), a random vector also has a mean vector E(X) and a covariance matrix cov(X).

Let X = (X_1, \dots, X_k)^T be a random vector with density f(x_1, \dots, x_k) or pmf p(x_1, \dots, x_k). The mean of X is the vector of marginal means
E(X) = \begin{pmatrix} E(X_1) \\ E(X_2) \\ \vdots \\ E(X_k) \end{pmatrix}.
The covariance matrix of X is given by
cov(X) = \begin{pmatrix} cov(X_1, X_1) & cov(X_1, X_2) & \cdots & cov(X_1, X_k) \\ cov(X_2, X_1) & cov(X_2, X_2) & \cdots & cov(X_2, X_k) \\ \vdots & \vdots & \ddots & \vdots \\ cov(X_k, X_1) & cov(X_k, X_2) & \cdots & cov(X_k, X_k) \end{pmatrix}.
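As a quick aside (not from the original slides), the sample analogues of both quantities are one-liners in base R; the data below are simulated purely for illustration:

# sketch: sample analogues of the mean vector and covariance matrix
set.seed(1)
X=matrix(rnorm(300),ncol=3)   # 100 observations on k=3 variables
colMeans(X)                   # estimates the mean vector E(X)
cov(X)                        # estimates the 3 x 3 covariance matrix cov(X)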

Multivariate normal distribution

The normal distribution generalizes to multiple dimensions. We'll first look at two jointly distributed normal random variables, then discuss three or more. The bivariate normal density for (X_1, X_2) is given by
f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right) + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 \right] \right\}.
There are 5 parameters: (\mu_1, \mu_2, \sigma_1, \sigma_2, \rho).

Read: pp. 8-84, 45-46, 48, 567-568.

This density jointly defines X_1 and X_2, which live in R^2 = (-\infty, \infty) \times (-\infty, \infty). Marginally, X_1 \sim N(\mu_1, \sigma_1^2) and X_2 \sim N(\mu_2, \sigma_2^2). The correlation between X_1 and X_2 is given by corr(X_1, X_2) = \rho. For jointly normal random variables, if the correlation is zero then they are independent. This is not true in general for jointly defined random variables (e.g. homework 5 problem).
E(X) = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad cov(X) = \begin{pmatrix} \sigma_1^2 & \sigma_1\sigma_2\rho \\ \sigma_1\sigma_2\rho & \sigma_2^2 \end{pmatrix}.
Next slide: \mu_1 = 0, 1; \mu_2 = 0, 2; \sigma_1^2 = \sigma_2^2 = 1; \rho = 0, 0.9, 0.6.
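These marginal and correlation facts are easy to verify by simulation. A minimal sketch, assuming the MASS package is available for mvrnorm (the parameter values here are arbitrary):

# sketch: simulate a bivariate normal and check marginals and correlation
library(MASS)                                    # provides mvrnorm
mu=c(0,2); Sigma=matrix(c(1,0.9,0.9,1),ncol=2)   # unit variances, rho=0.9
X=mvrnorm(10000,mu,Sigma)
colMeans(X)     # should be near (0, 2)
apply(X,2,sd)   # should be near (1, 1)
cor(X)[1,2]     # should be near 0.9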

Figure 1: Bivariate normal PDF level curves.

Proof that X_1 is independent of X_2 when \rho = 0

When \rho = 0 the joint density for (X_1, X_2) simplifies to
f(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2} \exp\left\{ -\frac{1}{2}\left[ \left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2 \right] \right\} = \left[ \frac{1}{\sqrt{2\pi}\,\sigma_1} e^{-0.5(x_1-\mu_1)^2/\sigma_1^2} \right] \left[ \frac{1}{\sqrt{2\pi}\,\sigma_2} e^{-0.5(x_2-\mu_2)^2/\sigma_2^2} \right].
Since these are each respectively functions of x_1 and x_2 only, and the range of (X_1, X_2) factors into the product of two sets, X_1 and X_2 are independent, and in fact X_1 \sim N(\mu_1, \sigma_1^2) and X_2 \sim N(\mu_2, \sigma_2^2).

Conditional distributions [X_1 | X_2 = x_2] and [X_2 | X_1 = x_1]

The conditional distribution of X_1 given X_2 = x_2 is
[X_1 | X_2 = x_2] \sim N\left( \mu_1 + \frac{\sigma_1}{\sigma_2}\rho(x_2 - \mu_2), \; \sigma_1^2(1 - \rho^2) \right).
Similarly,
[X_2 | X_1 = x_1] \sim N\left( \mu_2 + \frac{\sigma_2}{\sigma_1}\rho(x_1 - \mu_1), \; \sigma_2^2(1 - \rho^2) \right).
This ties directly to linear regression: to predict X_2 from X_1 = x_1, we have
E(X_2 | X_1 = x_1) = \left[ \mu_2 - \frac{\sigma_2}{\sigma_1}\rho\mu_1 \right] + \left[ \frac{\sigma_2}{\sigma_1}\rho \right] x_1 = \beta_0 + \beta_1 x_1.
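To make the regression connection concrete, here is a small simulation sketch (again assuming MASS; the parameter values are made up):

# sketch: the least-squares fit should recover slope rho*sigma2/sigma1
library(MASS)
mu=c(1,2); s1=2; s2=3; rho=0.5
Sigma=matrix(c(s1^2,rho*s1*s2,rho*s1*s2,s2^2),ncol=2)
X=mvrnorm(100000,mu,Sigma)
coef(lm(X[,2]~X[,1]))   # slope near rho*s2/s1 = 0.75; intercept near mu2 - 0.75*mu1 = 1.25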

Bivariate normal distribution as data model

Here we assume
\begin{pmatrix} X_{i1} \\ X_{i2} \end{pmatrix} \stackrel{iid}{\sim} N_2\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} \right),
or succinctly, X_i \stackrel{iid}{\sim} N_2(\mu, \Sigma). If the bivariate normal model is appropriate for paired outcomes, it provides a convenient probability model with some nice properties. The sample mean \bar{X} = n^{-1} \sum_{i=1}^n X_i is the MLE of \mu, and the sample covariance matrix \hat{\Sigma} = n^{-1} \sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})^T is the MLE of \Sigma. It can be shown that \bar{X} \sim N_2(\mu, n^{-1}\Sigma). The matrix (n\hat{\Sigma})^{-1} has an inverted Wishart distribution.
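One practical note, added here as a sketch: R's built-in cov() divides by n - 1, so the 1/n MLE above requires a rescaling. This assumes an n x 2 data matrix like the one built in the Gesell code further below:

# sketch: MLEs of mu and Sigma from an n x 2 data matrix
n=nrow(data)
muhat=colMeans(data)         # MLE of mu
Sigmahat=(n-1)/n*cov(data)   # cov() uses n-1; rescale to get the 1/n MLE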

Say n outcome pairs are to be recorded: {(X_{11}, X_{12}), (X_{21}, X_{22}), \dots, (X_{n1}, X_{n2})}. The i-th pair is (X_{i1}, X_{i2}). The sample mean vector is given elementwise by
\bar{X} = \begin{pmatrix} \bar{X}_1 \\ \bar{X}_2 \end{pmatrix} = \begin{pmatrix} n^{-1}\sum_{i=1}^n X_{i1} \\ n^{-1}\sum_{i=1}^n X_{i2} \end{pmatrix},
and the sample covariance matrix is given elementwise by
S = \begin{pmatrix} n^{-1}\sum_{i=1}^n (X_{i1} - \bar{X}_1)^2 & n^{-1}\sum_{i=1}^n (X_{i1} - \bar{X}_1)(X_{i2} - \bar{X}_2) \\ n^{-1}\sum_{i=1}^n (X_{i1} - \bar{X}_1)(X_{i2} - \bar{X}_2) & n^{-1}\sum_{i=1}^n (X_{i2} - \bar{X}_2)^2 \end{pmatrix}.

The sample mean vector \bar{X} estimates \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} and the sample covariance matrix S estimates
\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}.
We will place hats on parameter estimators based on the data. So
\hat{\mu}_1 = \bar{X}_1, \quad \hat{\mu}_2 = \bar{X}_2, \quad \hat{\sigma}_1^2 = n^{-1}\sum_{i=1}^n (X_{i1} - \bar{X}_1)^2, \quad \hat{\sigma}_2^2 = n^{-1}\sum_{i=1}^n (X_{i2} - \bar{X}_2)^2.
Also,
\hat{cov}(X_1, X_2) = n^{-1}\sum_{i=1}^n (X_{i1} - \bar{X}_1)(X_{i2} - \bar{X}_2).

So a natural estimate of \rho is then
\hat{\rho} = \frac{\hat{cov}(X_1, X_2)}{\hat{\sigma}_1\hat{\sigma}_2} = \frac{\sum_{i=1}^n (X_{i1} - \bar{X}_1)(X_{i2} - \bar{X}_2)}{\sqrt{\sum_{i=1}^n (X_{i1} - \bar{X}_1)^2 \sum_{i=1}^n (X_{i2} - \bar{X}_2)^2}}.
This is in fact the MLE based on the bivariate normal model; it is also a plug-in estimator based on the method of moments. It is commonly referred to as the Pearson correlation coefficient. You can get it as, e.g., cor(age,Gesell) in R.

This estimate of correlation can be unduly influenced by outliers in the sample. An alternative measure of linear association is the Spearman correlation, based on ranks. This correlation estimates something a bit different than the Pearson correlation.

> cor(age,Gesell)
[1] -0.64029
> cor(age,Gesell,method="spearman")
[1] -0.3166224
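To illustrate the outlier sensitivity mentioned above, here is a small sketch with made-up data:

# sketch: one extreme point moves the Pearson correlation far more than the Spearman
set.seed(2)
x=rnorm(20); y=x+rnorm(20,0,0.5)
cor(x,y); cor(x,y,method="spearman")   # both strongly positive
x=c(x,10); y=c(y,-10)                  # add a single gross outlier
cor(x,y); cor(x,y,method="spearman")   # Pearson is pulled down much more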

Gesell data

Recall: X is the age in months at which a child speaks his/her first word, and Y is the Gesell adaptive score, a measure of a child's aptitude. Question: how does the child's aptitude change with how long it takes them to speak? Here, n = 21. In R we find
E(X) = \begin{pmatrix} 14.38 \\ 93.67 \end{pmatrix}. \quad Also, \quad cov(X) = \begin{pmatrix} 60.14 & -67.78 \\ -67.78 & 186.32 \end{pmatrix}.
Assuming a bivariate normal model, we plug in the MLEs and obtain the estimated PDF for (X, Y):
f(x, y) = \exp(-60.22 + 1.3006x - 0.0134x^2 + 0.9520y - 0.0098xy - 0.0043y^2).
We can further find, from Y \sim N(93.67, 186.32),
f_Y(y) = \exp(-3.557 - 0.00256(y - 93.67)^2).

Figure 2: 3D plot of f(x, y) for (X, Y) based on MLEs.

Figure 3: Solid is f_Y(y); the left dashed curve is f_{Y|X}(y | 25), the right dashed curve is f_{Y|X}(y | 10). As the age in months of first words X = x increases, the distribution of Gesell adaptive scores Y shifts toward lower values.

Figure 4: MLE estimate of the density with the actual data.

R code to get estimates \hat{\mu} and \hat{\Sigma}, plot level curves of the PDF, and the data:

age<-c(15,26,10,9,15,20,18,11,8,20,7,9,10,11,11,10,12,42,17,11,10)
Gesell<-c(95,71,83,91,102,87,93,100,104,94,113,96,83,84,102,100,105,57,121,86,100)
data=matrix(c(age,Gesell),ncol=2) # make data matrix
m=c(mean(age),mean(Gesell)) # mean vector
s=cov(data) # covariance matrix
x1=seq(min(age)-sd(age),max(age)+sd(age),length=200) # grid of representative age values
x2=seq(min(Gesell)-sd(Gesell),max(Gesell)+sd(Gesell),length=200) # grid of Gesell values
f=function(x,y){
  r=s[1,2]/sqrt(s[1,1]*s[2,2])
  term1=(x-m[1])/sqrt(s[1,1]); term2=(y-m[2])/sqrt(s[2,2]); term3=-2*r*term1*term2
  exp(-0.5*(term1^2+term2^2+term3)/(1-r^2))/(2*3.14*sqrt(s[1,1]*s[2,2]*(1-r^2)))
} # function that gives the best fitting bivariate normal density to the data
z=outer(x1,x2,f) # compute the joint pdf over the grid
contour(x1,x2,z,nlevels=7,xlab="age",ylab="Gesell") # make contour plot
points(age,Gesell,pch=19) # superimpose filled points

Checking bivariate normality

There are no multivariate boxplots. There are multivariate histograms (and smoothed histograms), but they are typically not that useful for checking normality. The easiest thing to do is to check that marginal boxplots/histograms of X_{11}, \dots, X_{n1} and X_{12}, \dots, X_{n2} are approximately normal. This does not guarantee joint normality, though. One can additionally look at a simple scatterplot of the data, checking to see that it shows an approximately elliptical cloud of points with no glaringly obvious outlying values.
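A minimal version of these checks in R, using the age and Gesell vectors defined in the code above (a sketch, not part of the original slides):

# sketch: marginal normal QQ plots plus a scatterplot of the joint cloud
par(mfrow=c(1,3))
qqnorm(age); qqline(age)         # marginal check, first coordinate
qqnorm(Gesell); qqline(Gesell)   # marginal check, second coordinate
plot(age,Gesell)                 # look for an elliptical cloud; flag obvious outliers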

Multivariate normal distribution

In general, a k-variate normal is defined through the mean vector and covariance matrix:
\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_k \end{pmatrix} \sim N_k\left( \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{pmatrix}, \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1k} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{k1} & \sigma_{k2} & \cdots & \sigma_{kk} \end{pmatrix} \right).
Succinctly, X \sim N_k(\mu, \Sigma). Recall that if Z \sim N(0, 1), then X = \mu + \sigma Z \sim N(\mu, \sigma^2). The definition of the multivariate normal distribution just extends this idea.

Instead of one standard normal, we have a list of k independent standard normals Z = (Z_1, \dots, Z_k)^T, and consider the same sort of transformation in the multivariate case using matrices and vectors. Let Z_1, \dots, Z_k \stackrel{iid}{\sim} N(0, 1). The joint pdf of (Z_1, \dots, Z_k) is given by
f(z_1, \dots, z_k) = \prod_{i=1}^k \exp(-0.5 z_i^2)/\sqrt{2\pi}.
Let
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{pmatrix} \quad and \quad \Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1k} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{k1} & \sigma_{k2} & \cdots & \sigma_{kk} \end{pmatrix},
where \Sigma is symmetric (i.e. \Sigma^T = \Sigma, which implies \sigma_{ij} = \sigma_{ji} for all 1 \le i, j \le k).

Let \Sigma^{1/2} be any matrix such that \Sigma^{1/2}\Sigma^{1/2} = \Sigma. Then X = \mu + \Sigma^{1/2} Z is said to have a multivariate normal distribution with mean vector \mu and covariance matrix \Sigma, written
X \sim N_k(\mu, \Sigma).
Written in terms of matrices and vectors,
\begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_k \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_k \end{pmatrix} + \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1k} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{k1} & \sigma_{k2} & \cdots & \sigma_{kk} \end{pmatrix}^{1/2} \begin{pmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_k \end{pmatrix}.
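In R, one convenient factor is the Cholesky factor: chol(Sigma) returns an upper-triangular matrix R with t(R) %*% R = Sigma, so L = t(R) satisfies L L^T = \Sigma, which is all the construction needs, since cov(LZ) = L L^T. A sketch with arbitrary mu and Sigma:

# sketch: build N_k(mu, Sigma) draws from iid standard normals
mu=c(1,-1); Sigma=matrix(c(2,1,1,3),ncol=2)
L=t(chol(Sigma))                  # lower-triangular, L %*% t(L) = Sigma
Z=matrix(rnorm(2*10000),nrow=2)   # k x n matrix of iid N(0,1) draws
X=mu+L%*%Z                        # each column is one N_2(mu, Sigma) draw
rowMeans(X); cov(t(X))            # should be near mu and Sigma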

Using some math, it can be shown that the pdf of the new vector X = (X_1, \dots, X_k)^T is given by
f(x_1, \dots, x_k | \mu, \Sigma) = |2\pi\Sigma|^{-1/2} \exp\{-0.5(x - \mu)^T \Sigma^{-1} (x - \mu)\}.
In the one-dimensional case, this simplifies to our old friend
f(x | \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\{-0.5(x - \mu)(\sigma^2)^{-1}(x - \mu)\},
the pdf of a N(\mu, \sigma^2) random variable X. There are several important properties of multivariate normal vectors.

Let X N k (µ,σ) Then For each X i in X = (X,,X k ), E(X i ) = µ i and var(x i ) = σ ii That is, marginally, X i N(µ i, σ ii ) 2 For any r k matrix M, MX N r (Mµ,MΣM ) 3 For any two (X i, X j ) where i < j k, cov(x i, X j ) = σ ij The off-diagonal elements of Σ give the covariance between two elements of (X,,X k ) Note then ρ(x i, X j ) = σ ij / σ ii σ jj 4 For any k vector m = (m,,m k ) and Y N k (µ,σ), m + Y N k (m + µ,σ) 22

As a simple example, let
\begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} \sim N_3\left( \begin{pmatrix} 2 \\ 5 \\ 0 \end{pmatrix}, \begin{pmatrix} 2 & 1 & 1 \\ 1 & 3 & -1 \\ 1 & -1 & 4 \end{pmatrix} \right).
Define
M = \begin{pmatrix} 1 & 0 & -1 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} \quad and \quad Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = MX.
Then X_2 \sim N(5, 3), cov(X_2, X_3) = -1, and
Y = MX \sim N_2\left( \begin{pmatrix} 1 & 0 & -1 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} \begin{pmatrix} 2 \\ 5 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & 0 & -1 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} \begin{pmatrix} 2 & 1 & 1 \\ 1 & 3 & -1 \\ 1 & -1 & 4 \end{pmatrix} \begin{pmatrix} 1 & 1/3 \\ 0 & 1/3 \\ -1 & 1/3 \end{pmatrix} \right),

or simplifying,
\begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 & -1 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} \sim N_2\left( \begin{pmatrix} 2 \\ 7/3 \end{pmatrix}, \begin{pmatrix} 4 & 0 \\ 0 & 11/9 \end{pmatrix} \right).
Note that for the transformed vector Y = (Y_1, Y_2)^T, cov(Y_1, Y_2) = 0 and therefore Y_1 and Y_2 are uncorrelated, i.e. \rho(Y_1, Y_2) = 0.

We will use these properties later on, when we discuss the large-sample distribution of MLE vectors:
\hat{\theta} = \begin{pmatrix} \hat{\theta}_1 \\ \hat{\theta}_2 \\ \vdots \\ \hat{\theta}_p \end{pmatrix} \sim N_p\left( \begin{pmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_p \end{pmatrix}, \hat{cov}(\hat{\theta}) \right), approximately.
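The example above can be checked by Monte Carlo using property 2; a sketch, assuming MASS:

# sketch: empirical check that MX has mean M mu and covariance M Sigma M'
library(MASS)
mu=c(2,5,0)
Sigma=matrix(c(2,1,1, 1,3,-1, 1,-1,4),ncol=3)
M=rbind(c(1,0,-1),c(1/3,1/3,1/3))
X=mvrnorm(100000,mu,Sigma)   # rows are draws of (X1,X2,X3)
Y=X%*%t(M)                   # rows are draws of (Y1,Y2)
colMeans(Y)                  # near (2, 7/3)
cov(Y)                       # near diag(4, 11/9); off-diagonals near 0
M%*%Sigma%*%t(M)             # the exact covariance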

For the linear model (e.g. simple linear regression or the two-sample model) Y = X\beta + e, the MLE that describes the mean parameters has exactly a multivariate normal distribution (p. 574):
\hat{\beta} \sim N_p(\beta, (X^T X)^{-1}\sigma^2).
Here p = 2 is the number of mean parameters. The MLE \hat{\sigma}^2 has a gamma distribution.

Let's try an experiment where we know \beta = (5, 3)^T. We will generate 200 independent experiments, each with a sample size of n = 20. The model is
Y_i = \beta_0 + \beta_1 x_i + e_i, \quad i = 1, \dots, 20,
where \beta_0 = 5, \beta_1 = 3, and \sigma = 0.2. Take the x_i to be equally spaced from x_1 = 0 to x_{20} = 1. The errors satisfy e_1, \dots, e_{20} \stackrel{iid}{\sim} N(0, 0.2^2).

Here's R code to sample the m = 200 \hat{\beta}'s, each obtained from a sample of size n = 20:

sigma=0.2
m=200 # number of betahat's sampled
n=20 # sample size going into each experiment
y=rep(0,n)
beta0hat=rep(0,m)
beta1hat=rep(0,m)
x=seq(0,1,length=n) # predictor values evenly spaced over [0,1]
beta=c(5,3) # true (but usually unknown) beta vector
mean=beta[1]+beta[2]*x # true mean function on grid of x values
for(i in 1:m){
  y=mean+rnorm(n,0,sigma)
  fit=lm(y~x)
  beta0hat[i]=fit$coefficients[1]
  beta1hat[i]=fit$coefficients[2]
}
Xdesign=matrix(c(rep(1,n),x),ncol=2)
m=beta # exact mean vector
s=solve(t(Xdesign)%*%Xdesign)*sigma^2 # exact covariance
x1=seq(4.7,5.3,length=200) # grid of representative beta0hat values
x2=seq(2.5,3.4,length=200) # grid of representative beta1hat values

f=function(x,y){
  r=s[1,2]/sqrt(s[1,1]*s[2,2])
  term1=(x-m[1])/sqrt(s[1,1]); term2=(y-m[2])/sqrt(s[2,2]); term3=-2*r*term1*term2
  exp(-0.5*(term1^2+term2^2+term3)/(1-r^2))/(2*3.14*sqrt(s[1,1]*s[2,2]*(1-r^2)))
} # exact bivariate normal density according to theory
g0=rep(beta[1],200); g1=rep(beta[2],200)
z=outer(x1,x2,f) # compute the joint pdf over the grid
contour(x1,x2,z,nlevels=5,xlab="beta0hat",ylab="beta1hat") # make contour plot
points(beta0hat,beta1hat) # superimpose betahat's from m experiments of sample size n
lines(x1,g1,lty=2); lines(g0,x2,lty=2) # dotted lines crossing at true beta

Exactly, the theory tells us
\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix} \sim N_2\left( \begin{pmatrix} 5 \\ 3 \end{pmatrix}, \begin{pmatrix} 0.00743 & -0.01086 \\ -0.01086 & 0.0217 \end{pmatrix} \right).
Let's see how a plot of the 200 \hat{\beta}'s looks superimposed on the exact sampling distribution.

Figure 5: Theoretical distribution (pdf) with 200 sampled MLEs.