Notes on the Multivariate Normal and Related Topics


Version: July 10, 2013

Let me refresh your memory about the distinctions between population and sample; parameters and statistics; population distributions and sampling distributions. One might say that anyone worth knowing knows these distinctions. We start with the simple univariate framework and then move on to the multivariate context.

A) Begin by positing a population of interest that is operationalized by some random variable, say X. In this Theory World framework, X is characterized by parameters, such as the expectation of X, $\mu = E(X)$, or its variance, $\sigma^2 = V(X)$. The random variable X has a (population) distribution, which for us will typically be assumed normal.

B) A sample is generated by taking observations on X, say, $X_1, \ldots, X_n$, considered independent and identically distributed as X, i.e., they are exact copies of X. In this Data World context, statistics are functions of the sample and therefore characterize the sample: the sample mean, $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i$; the sample variance, $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \hat{\mu})^2$, with some possible variation in dividing by $n-1$ to generate an unbiased estimator of $\sigma^2$. The statistics, $\hat{\mu}$ and $\hat{\sigma}^2$, are point estimators of $\mu$ and $\sigma^2$. They are random variables themselves, so they have distributions that are called sampling distributions. The general problem of statistical inference is to ask what the sample statistics, $\hat{\mu}$ and $\hat{\sigma}^2$, tell us about their population counterparts, such as $\mu$ and $\sigma^2$. Can we obtain some notion of accuracy from the sampling distribution, e.g., confidence intervals?
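To make the population/sample distinction concrete, here is a small simulation sketch (in Python with numpy, an assumption on my part since the notes contain no code): it draws repeated samples from a normal population and looks at the empirical sampling distribution of $\hat{\mu}$ and $\hat{\sigma}^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n = 5.0, 4.0, 25        # population parameters (Theory World)

# Draw many samples and compute the statistics for each (Data World)
mu_hats, sigma2_hats = [], []
for _ in range(10_000):
    x = rng.normal(mu, np.sqrt(sigma2), size=n)
    mu_hats.append(x.mean())                        # sample mean
    sigma2_hats.append(((x - x.mean())**2).mean())  # sample variance (divide by n)

# The sampling distribution of the sample mean is centered at mu with variance sigma^2/n;
# the n-divisor variance estimator is biased downward by the factor (n-1)/n.
print(np.mean(mu_hats), np.var(mu_hats), sigma2 / n)
print(np.mean(sigma2_hats), (n - 1) / n * sigma2)
```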

The multivariate problem is generally the same but with more notation:

A) The population (Theory World) is characterized by a collection of p random variables, $\mathbf{X} = [X_1, X_2, \ldots, X_p]$, with parameters: $\mu_i = E(X_i)$; $\sigma_i^2 = V(X_i)$; $\sigma_{ij} = \mathrm{Cov}(X_i, X_j) = E[(X_i - E(X_i))(X_j - E(X_j))]$; or the correlation between $X_i$ and $X_j$, $\rho_{ij} = \sigma_{ij}/\sigma_i\sigma_j$.

B) The sample (Data World) is defined by n independent observations on the random vector $\mathbf{X}$, with the observations placed into an $n \times p$ data matrix (e.g., subject by variable) that we also (with a slight abuse of notation) denote by a bold-face capital letter, $\mathbf{X}_{n \times p}$:

$$\mathbf{X}_{n \times p} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_n' \end{pmatrix}.$$

The statistics corresponding to the parameters of the population are: $\hat{\mu}_i = \frac{1}{n}\sum_{k=1}^{n} x_{ki}$; $\hat{\sigma}_i^2 = \frac{1}{n}\sum_{k=1}^{n} (x_{ki} - \hat{\mu}_i)^2$, with again some possible variation to a division by $n-1$ for an unbiased estimate; $\hat{\sigma}_{ij} = \frac{1}{n}\sum_{k=1}^{n} (x_{ki} - \hat{\mu}_i)(x_{kj} - \hat{\mu}_j)$; and $\hat{\rho}_{ij} = \hat{\sigma}_{ij}/\hat{\sigma}_i\hat{\sigma}_j$.

To obtain a good sense of what the estimators tell us about the population parameters, we will have to make some assumption about the population, e.g., $[X_1, \ldots, X_p]$ has a multivariate normal distribution. As we will see, this assumption leads to some very nice results.
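As a quick illustration, the sample quantities above can be computed directly from a data matrix; the sketch below (Python/numpy, my own illustration rather than anything from the notes) uses the n-divisor convention to match the formulas.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))            # an n x p data matrix (n = 100, p = 3)

mu_hat = X.mean(axis=0)                  # vector of sample means, one per variable
Xc = X - mu_hat                          # center each column
Sigma_hat = (Xc.T @ Xc) / X.shape[0]     # sample covariance matrix (divide by n)
sd_hat = np.sqrt(np.diag(Sigma_hat))
R_hat = Sigma_hat / np.outer(sd_hat, sd_hat)   # sample correlation matrix

print(mu_hat, Sigma_hat, R_hat, sep="\n")
```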

0.1 Developing the Multivariate Normal Distribution

Suppose $X_1, \ldots, X_p$ are p continuous random variables with density functions, $f_1(x_1), \ldots, f_p(x_p)$, and distribution functions, $F_1(x_1), \ldots, F_p(x_p)$, where

$$P(X_i \le x_i) = F_i(x_i) = \int_{-\infty}^{x_i} f_i(x)\,dx.$$

We define a p-dimensional random variable (or random vector) as the vector $\mathbf{X} = [X_1, \ldots, X_p]$; $\mathbf{X}$ has the joint cumulative distribution function

$$F(x_1, \ldots, x_p) = \int_{-\infty}^{x_p} \cdots \int_{-\infty}^{x_1} f(x_1, \ldots, x_p)\,dx_1 \cdots dx_p.$$

If the random variables $X_1, \ldots, X_p$ are independent, then the joint density and cumulative distribution functions factor:

$$f(x_1, \ldots, x_p) = f_1(x_1)\cdots f_p(x_p) \quad\text{and}\quad F(x_1, \ldots, x_p) = F_1(x_1)\cdots F_p(x_p).$$

The independence property will not be assumed in the following; in fact, the whole idea is to investigate what type of dependency exists in the random vector $\mathbf{X}$.

When $\mathbf{X} = [X_1, X_2, \ldots, X_p] \sim \mathrm{MVN}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the joint density function has the form

$$\phi(x_1, \ldots, x_p) = \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\{-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\}.$$

In the univariate case, the density provides the usual bell-shaped curve:

$$\phi(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\{-\tfrac{1}{2}[(x - \mu)/\sigma]^2\}.$$
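A short check of the density formula, assuming Python with numpy/scipy (not part of the original notes): it evaluates the MVN density directly from the expression above and compares it against scipy's implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.3, 0.8])

# Density evaluated straight from the formula
p = len(mu)
dev = x - mu
quad = dev @ np.linalg.inv(Sigma) @ dev
phi = np.exp(-0.5 * quad) / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma)))

# Should agree with scipy's multivariate normal pdf
print(phi, multivariate_normal(mean=mu, cov=Sigma).pdf(x))
```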

Given the MVN assumption on the generation of the data matrix $\mathbf{X}_{n \times p}$, we know the sampling distribution of the vector of means:

$$\hat{\boldsymbol{\mu}} = \begin{pmatrix} \hat{\mu}_1 \\ \hat{\mu}_2 \\ \vdots \\ \hat{\mu}_p \end{pmatrix} \sim \mathrm{MVN}(\boldsymbol{\mu}, (1/n)\boldsymbol{\Sigma}).$$

In the univariate case, we have that $\hat{\mu} \sim N(\mu, (1/n)\sigma^2)$. As might be expected, a Central Limit Theorem (CLT) states that these results also hold asymptotically when the MVN assumption is relaxed. Also, if we estimate the variance-covariance matrix $\boldsymbol{\Sigma}$ with divisions by $n-1$ rather than $n$, and denote the result as $\hat{\boldsymbol{\Sigma}}$, then $E(\hat{\boldsymbol{\Sigma}}) = \boldsymbol{\Sigma}$.

Linear combinations of random variables form the backbone of all of multivariate statistics. For example, suppose $\mathbf{X} = [X_1, X_2, \ldots, X_p] \sim \mathrm{MVN}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, and consider the following q linear combinations:

$$Z_1 = c_{11}X_1 + \cdots + c_{1p}X_p$$
$$\vdots$$
$$Z_q = c_{q1}X_1 + \cdots + c_{qp}X_p$$

Then the vector $\mathbf{Z} = [Z_1, Z_2, \ldots, Z_q] \sim \mathrm{MVN}(\mathbf{C}\boldsymbol{\mu}, \mathbf{C}\boldsymbol{\Sigma}\mathbf{C}')$, where

$$\mathbf{C}_{q \times p} = \begin{pmatrix} c_{11} & \cdots & c_{1p} \\ \vdots & & \vdots \\ c_{q1} & \cdots & c_{qp} \end{pmatrix}.$$

These same results hold in the sample if we observe the q linear combinations $[Z_1, Z_2, \ldots, Z_q]$: i.e., the sample mean vector is $\mathbf{C}\hat{\boldsymbol{\mu}}$ and the sample variance-covariance matrix is $\mathbf{C}\hat{\boldsymbol{\Sigma}}\mathbf{C}'$.
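The rule that $\mathbf{Z} = \mathbf{C}\mathbf{X}$ has mean $\mathbf{C}\boldsymbol{\mu}$ and covariance $\mathbf{C}\boldsymbol{\Sigma}\mathbf{C}'$ is easy to check by simulation; this sketch (Python/numpy, with illustrative numbers of my own) compares the theoretical moments of two linear combinations against moments computed from simulated draws.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
C = np.array([[1.0, -1.0, 0.0],      # Z1 = X1 - X2
              [0.5,  0.5, 0.5]])     # Z2 = average-like combination of the three variables

# Theoretical mean vector and covariance matrix of Z = C X
print(C @ mu)
print(C @ Sigma @ C.T)

# Monte Carlo check
X = rng.multivariate_normal(mu, Sigma, size=200_000)
Z = X @ C.T
print(Z.mean(axis=0))
print(np.cov(Z, rowvar=False))
```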

0.2 Multiple Regression and Partial Correlation (in the Population)

Suppose we partition our vector of p random variables as follows:

$$\mathbf{X} = [X_1, X_2, \ldots, X_p] = [[X_1, X_2, \ldots, X_k], [X_{k+1}, X_{k+2}, \ldots, X_p]] \equiv [\mathbf{X}_1, \mathbf{X}_2].$$

Partitioning the mean vector and variance-covariance matrix in the same way,

$$\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}; \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{12}' & \boldsymbol{\Sigma}_{22} \end{pmatrix},$$

we have $\mathbf{X}_1 \sim \mathrm{MVN}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$ and $\mathbf{X}_2 \sim \mathrm{MVN}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_{22})$. The vectors $\mathbf{X}_1$ and $\mathbf{X}_2$ are statistically independent if and only if $\boldsymbol{\Sigma}_{12} = \mathbf{0}_{k \times (p-k)}$, the $k \times (p-k)$ zero matrix.

What I call the Master Theorem refers to the conditional density of $\mathbf{X}_1$ given $\mathbf{X}_2 = \mathbf{x}_2$; it is multivariate normal with mean vector of order $k \times 1$,

$$\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2),$$

and variance-covariance matrix of order $k \times k$,

$$\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{12}'.$$

If we denote the $(i,j)^{th}$ partial covariance element in this latter matrix ("holding all variables in the second set constant") as $\sigma_{ij\cdot(k+1)\cdots p}$, then the partial correlation of variables i and j, holding all variables in the second set constant, is

$$\rho_{ij\cdot(k+1)\cdots p} = \frac{\sigma_{ij\cdot(k+1)\cdots p}}{\sqrt{\sigma_{ii\cdot(k+1)\cdots p}\,\sigma_{jj\cdot(k+1)\cdots p}}}.$$
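The Master Theorem is pure matrix algebra once $\boldsymbol{\Sigma}$ is partitioned; this sketch (Python/numpy, with made-up numbers) computes the conditional mean and covariance of $\mathbf{X}_1$ given $\mathbf{X}_2 = \mathbf{x}_2$ and the implied partial correlation between the first two variables.

```python
import numpy as np

# Partition: X1 = (X_1, X_2), X2 = (X_3, X_4);  k = 2, p = 4
mu1 = np.array([0.0, 0.0])
mu2 = np.array([1.0, 2.0])
S11 = np.array([[1.0, 0.6], [0.6, 1.0]])
S12 = np.array([[0.4, 0.3], [0.5, 0.2]])
S22 = np.array([[1.0, 0.3], [0.3, 1.0]])

x2 = np.array([1.5, 1.0])
S22_inv = np.linalg.inv(S22)

cond_mean = mu1 + S12 @ S22_inv @ (x2 - mu2)   # conditional mean vector
cond_cov = S11 - S12 @ S22_inv @ S12.T         # partial covariance matrix

# Partial correlation of variables 1 and 2, holding variables 3 and 4 constant
rho_12_34 = cond_cov[0, 1] / np.sqrt(cond_cov[0, 0] * cond_cov[1, 1])
print(cond_mean, cond_cov, rho_12_34, sep="\n")
```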

Notice that in the formulas just given, the inverse of $\boldsymbol{\Sigma}_{22}$ must exist; otherwise, we have what is called multicollinearity, and solutions don't exist (or are very unstable numerically if the inverse almost doesn't exist). So, as one moral: don't use both total and subtest scores in the same set of independent variables.

If the set $\mathbf{X}_1$ contains just one random variable, $X_1$, then the mean of the conditional distribution can be given as

$$E(X_1) + \boldsymbol{\sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2),$$

where $\boldsymbol{\sigma}_{12}$ is $1 \times (p-1)$ and of the form $[\mathrm{Cov}(X_1, X_2), \ldots, \mathrm{Cov}(X_1, X_p)]$. This is nothing but our old regression equation written out in matrix notation. If we let $\boldsymbol{\beta}' = \boldsymbol{\sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}$, then the conditional mean (predicted $X_1$) is equal to

$$E(X_1) + \boldsymbol{\beta}'(\mathbf{x}_2 - \boldsymbol{\mu}_2) = E(X_1) + \beta_1(x_2 - \mu_2) + \cdots + \beta_{p-1}(x_p - \mu_p).$$

The covariance matrix, $\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{12}'$, takes on the form

$$\mathrm{Var}(X_1) - \boldsymbol{\sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\sigma}_{12}',$$

i.e., the variation in $X_1$ that is not explained. If we take the explained variation, $\boldsymbol{\sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\sigma}_{12}'$, and consider its proportion of the total variance, $\mathrm{Var}(X_1)$, we define the squared multiple correlation coefficient:

$$\rho^2_{1\cdot 2\cdots p} = \frac{\boldsymbol{\sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\sigma}_{12}'}{\mathrm{Var}(X_1)}.$$

In fact, the linear combination $\boldsymbol{\beta}'\mathbf{X}_2$ has the highest correlation with $X_1$ of any linear combination of the variables in $\mathbf{X}_2$, and this correlation is the positive square root of the squared multiple correlation coefficient.
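To tie this section together, the sketch below (Python/numpy; the covariance matrix is invented purely for illustration) computes the population regression weights $\boldsymbol{\beta}' = \boldsymbol{\sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}$ and the squared multiple correlation from a partitioned covariance matrix.

```python
import numpy as np

# Covariance matrix for (X1, X2, X3): X1 is the variable to be predicted
Sigma = np.array([[4.0, 1.2, 0.8],
                  [1.2, 2.0, 0.4],
                  [0.8, 0.4, 1.0]])

var_x1 = Sigma[0, 0]               # Var(X1)
sigma_12 = Sigma[0, 1:]            # covariances of X1 with (X2, X3)
Sigma_22 = Sigma[1:, 1:]           # covariance matrix of (X2, X3)

beta = np.linalg.solve(Sigma_22, sigma_12)     # regression weights beta
explained = sigma_12 @ beta                    # sigma_12 Sigma_22^{-1} sigma_12'
rho_sq = explained / var_x1                    # squared multiple correlation
residual_var = var_x1 - explained              # unexplained variation in X1

print(beta, rho_sq, residual_var)
```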

0.3 Moving to the Sample

Up to this point, the concepts of regression and kindred ideas have been discussed only in terms of population parameters. We now have the task of obtaining estimates of these various quantities based on a sample; also, associated inference procedures need to be developed. Generally, we will rely on maximum-likelihood estimation and the related likelihood-ratio tests. To define how maximum-likelihood estimation proceeds, we first give a series of general steps, and then operationalize them for a univariate normal distribution example.

A) Let $X_1, \ldots, X_n$ be univariate observations on X (independent and continuous), depending on some parameters $\theta_1, \ldots, \theta_k$. The density function of $X_i$ is denoted as $f(x_i; \theta_1, \ldots, \theta_k)$.

B) The likelihood of observing $x_1, \ldots, x_n$ as the values of $X_1, \ldots, X_n$ is the joint density of the n observations, and because the observations are independent,

$$L(\theta_1, \ldots, \theta_k) \equiv \prod_{i=1}^{n} f(x_i; \theta_1, \ldots, \theta_k),$$

i.e., we assume $x_1, \ldots, x_n$ are already observed, and that L is a function of the parameters only. Now, choose parameter values such that $L(\theta_1, \ldots, \theta_k)$ is at a maximum, i.e., the probability of observing that particular sample is maximized by the choice of $\theta_1, \ldots, \theta_k$. Generally, it is easier (and equivalent) to maximize the log-likelihood, $l(\theta_1, \ldots, \theta_k) \equiv \log(L(\theta_1, \ldots, \theta_k))$.

C) Generally, the maximum values are found through differentiation, by setting the partial derivatives equal to zero.

Explicitly, we find $\theta_1, \ldots, \theta_k$ such that

$$\frac{\partial}{\partial\theta_j}\, l(\theta_1, \ldots, \theta_k) = 0, \quad\text{for } j = 1, \ldots, k.$$

(We should probably check second derivatives to confirm that we have a maximum, but this is almost always true anyway.)

Example: Suppose $X_i \sim N(\mu, \sigma^2)$, with density

$$f(x_i; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\{-(x_i - \mu)^2/2\sigma^2\}.$$

The likelihood has the form

$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\{-(x_i - \mu)^2/2\sigma^2\} = (1/\sqrt{2\pi})^n (1/\sigma^2)^{n/2} \exp\{-(1/2\sigma^2)\textstyle\sum_{i=1}^{n}(x_i - \mu)^2\}.$$

The log-likelihood reduces to

$$l(\mu, \sigma^2) = \log L(\mu, \sigma^2) = \log(1/\sqrt{2\pi})^n + \log(1/\sigma^2)^{n/2} - (1/2\sigma^2)\sum_{i=1}^{n}(x_i - \mu)^2 = \text{constant} - (n/2)\log\sigma^2 - (1/2\sigma^2)\sum_{i=1}^{n}(x_i - \mu)^2.$$

The partial derivatives have the form

$$\frac{\partial l(\mu, \sigma^2)}{\partial\mu} = -(1/2\sigma^2)\sum_{i=1}^{n} 2(x_i - \mu)(-1)$$

and

$$\frac{\partial l(\mu, \sigma^2)}{\partial\sigma^2} = -(n/2)(1/\sigma^2) - \frac{1}{2(\sigma^2)^2}(-1)\sum_{i=1}^{n}(x_i - \mu)^2.$$

Setting these two expressions to zero gives

$$\hat{\mu} = \sum_{i=1}^{n} x_i/n; \qquad \hat{\sigma}^2 = (1/n)\sum_{i=1}^{n}(x_i - \hat{\mu})^2.$$
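As a sanity check of the closed-form solution, this sketch (Python with numpy/scipy, an illustration rather than part of the notes) maximizes the normal log-likelihood numerically and compares the answer with the sample mean and the n-divisor variance.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.normal(loc=5.0, scale=2.0, size=200)

def neg_loglik(theta):
    mu, log_sigma2 = theta               # optimize log(sigma^2) to keep sigma^2 positive
    sigma2 = np.exp(log_sigma2)
    # negative log-likelihood, dropping the constant (n/2) log(2 pi)
    return 0.5 * len(x) * np.log(sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]))
mu_hat, sigma2_hat = res.x[0], np.exp(res.x[1])

# Closed-form ML estimates: sample mean and the n-divisor variance
print(mu_hat, x.mean())
print(sigma2_hat, ((x - x.mean()) ** 2).mean())
```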

Maximum likelihood (ML) estimates are generally consistent, asymptotically normal (for large n), and efficient; they are functions of the sufficient statistics; and they have an invariance property under transformations, e.g., the ML estimate for $\sigma$ is just $\sqrt{\hat{\sigma}^2}$. As the ML estimator for $\sigma^2$ shows, ML estimators are not necessarily unbiased.

Another Example: Suppose $X_1, \ldots, X_n$ are observations on the Poisson discrete random variable X (having outcomes $0, 1, \ldots$):

$$P(X_i = x_i) = \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}.$$

The likelihood is

$$L(\lambda) = \prod_{i=1}^{n} P(X_i = x_i) = \frac{e^{-n\lambda}\lambda^{\sum_i x_i}}{\prod_i x_i!},$$

and the log-likelihood is

$$l(\lambda) = \log L(\lambda) = -n\lambda + \Big(\sum_i x_i\Big)\log\lambda - \log\prod_i x_i!.$$

The partial derivative

$$\frac{\partial l(\lambda)}{\partial\lambda} = -n + (1/\lambda)\sum_{i=1}^{n} x_i,$$

when set to zero, gives $\hat{\lambda} = \sum_{i=1}^{n} x_i/n$.
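A quick numerical confirmation of the Poisson result, again a Python/numpy sketch of my own: the log-likelihood over a grid of $\lambda$ values peaks at (approximately) the sample mean.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.poisson(lam=3.2, size=500)

lam_grid = np.linspace(0.5, 8.0, 2000)
# Poisson log-likelihood, dropping log(prod x_i!) which is constant in lambda
loglik = -len(x) * lam_grid + x.sum() * np.log(lam_grid)

# The grid maximizer sits at (approximately) the sample mean x-bar
print(lam_grid[np.argmax(loglik)], x.mean())
```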

0.3.1 Likelihood Ratio Tests

Besides using the likelihood concept to find good point estimates, the likelihood can also be used to find good tests of hypotheses. We will develop this idea in terms of a simple example: Suppose $X_1 = x_1, \ldots, X_n = x_n$ denote values obtained on n independent observations on a $N(\mu, \sigma^2)$ random variable, where $\sigma^2$ is assumed known. The standard test of $H_0: \mu = \mu_0$ versus $H_1: \mu \ne \mu_0$ would compare $(\bar{x} - \mu_0)^2/(\sigma^2/n)$ to a $\chi^2$ random variable with one degree of freedom. Or, if we chose the significance level to be .05, we could reject $H_0$ if $|\bar{x} - \mu_0|/(\sigma/\sqrt{n}) \ge 1.96$.

We now develop this same test using likelihood ideas. The likelihood of observing $x_1, \ldots, x_n$, $L(\mu)$, is only a function of $\mu$ (because $\sigma^2$ is assumed to be known):

$$L(\mu) = (1/\sigma\sqrt{2\pi})^n \exp\{-(1/2\sigma^2)\textstyle\sum_{i=1}^{n}(x_i - \mu)^2\}.$$

Under $H_0$, $L(\mu)$ is at a maximum for $\mu = \mu_0$; under $H_1$, $L(\mu)$ is at a maximum for $\mu = \hat{\mu}$, the maximum likelihood estimator. If $H_0$ is true, $L(\hat{\mu})$ and $L(\mu_0)$ should be close to each other; if $H_0$ is false, $L(\hat{\mu})$ should be much larger than $L(\mu_0)$. Thus, the decision rule would be to reject $H_0$ if $L(\mu_0)/L(\hat{\mu}) \le \lambda$, where $\lambda$ is some number less than 1.00 chosen to obtain a particular $\alpha$ level. The ratio is

$$\frac{L(\mu_0)}{L(\hat{\mu})} = \exp\{-\tfrac{1}{2}(\hat{\mu} - \mu_0)^2/(\sigma^2/n)\}.$$

Thus, we could rephrase the decision rule: reject $H_0$ if $(\hat{\mu} - \mu_0)^2/(\sigma^2/n) \ge -2\log\lambda$, or if $|\bar{x} - \mu_0|/(\sigma/\sqrt{n}) \ge \sqrt{-2\log\lambda}$. For a .05 level of significance, choose $\lambda$ so that $1.96 = \sqrt{-2\log\lambda}$.

Generally, we can phrase likelihood ratio tests as

$$-2\log(\text{likelihood ratio}) \sim \chi^2_{\nu - \nu_0},$$

where $\nu$ is the dimension of the parameter space generally, and $\nu_0$ is the dimension under $H_0$.
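The sketch below (Python with numpy/scipy, assumed for illustration) carries out this likelihood-ratio test for $\mu$ with $\sigma^2$ known, and shows that $-2\log\Lambda$ equals the direct statistic $(\bar{x} - \mu_0)^2/(\sigma^2/n)$ compared against $\chi^2_1$.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(5)
sigma2, mu0, n = 4.0, 5.0, 40
x = rng.normal(loc=5.6, scale=np.sqrt(sigma2), size=n)   # data; true mean differs from mu0

xbar = x.mean()                                   # the MLE of mu when sigma^2 is known

def loglik(mu):
    return -np.sum((x - mu) ** 2) / (2 * sigma2)  # log L(mu), up to a constant in mu

lrt = -2 * (loglik(mu0) - loglik(xbar))           # -2 log(likelihood ratio)
z_sq = (xbar - mu0) ** 2 / (sigma2 / n)           # the equivalent direct statistic

p_value = chi2.sf(lrt, df=1)                      # compare to chi-square with 1 df
print(lrt, z_sq, p_value)
```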

0.3.2 Estimation

To obtain estimates of the various quantities we need, merely replace the variances and covariances by their maximum likelihood estimates. This process generates sample partial correlations or covariances; sample squared multiple correlations; sample regression parameters; and so on.
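As a closing plug-in illustration (a Python/numpy sketch with simulated data, not taken from the notes), the sample covariance matrix is substituted into the section 0.2 formulas to obtain sample regression weights and a sample squared multiple correlation.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
X2 = rng.normal(size=(n, 2))                      # two predictor variables
X1 = 1.0 + X2 @ np.array([0.8, -0.5]) + rng.normal(scale=1.0, size=n)
data = np.column_stack([X1, X2])                  # n x 3 data matrix, X1 first

S = np.cov(data, rowvar=False)                    # plug-in (sample) covariance matrix
s12 = S[0, 1:]
S22 = S[1:, 1:]

beta_hat = np.linalg.solve(S22, s12)              # sample regression weights
R_sq = (s12 @ beta_hat) / S[0, 0]                 # sample squared multiple correlation

print(beta_hat, R_sq)
```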