Multivariate Analysis and Likelihood Inference

Outline
1. Joint Distribution of Random Variables
2. Principal Component Analysis (PCA)
3. Multivariate Normal Distribution
4. Likelihood Inference

Joint density function

The distribution of a random vector $X = (X_1,\ldots,X_k)^T$ is characterized by its joint density function $f$ such that
$$P\{X_1 \le a_1,\ldots,X_k \le a_k\} = \int_{-\infty}^{a_k}\cdots\int_{-\infty}^{a_1} f(x_1,\ldots,x_k)\,dx_1\cdots dx_k$$
when $X$ has a differentiable distribution function (the left-hand side above, viewed as a function of $(a_1,\ldots,a_k)$), and, in the case of discrete $X_i$,
$$f(a_1,\ldots,a_k) = P\{X_1 = a_1,\ldots,X_k = a_k\}.$$

Change of variables

Suppose $X$ has a differentiable distribution function with density function $f_X$. Let $g:\mathbb{R}^k \to \mathbb{R}^k$ be continuously differentiable and one-to-one, and let $Y = g(X)$. Since $g$ is a one-to-one transformation, we can solve $Y = g(X)$ to obtain $X = h(Y)$, where $h = (h_1,\ldots,h_k)^T$ is the inverse of the transformation $g$. The density function $f_Y$ of $Y$ is given by
$$f_Y(y) = f_X\bigl(h(y)\bigr)\,\bigl|\det(\partial h/\partial y)\bigr|,$$
where $\partial h/\partial y$ is the Jacobian matrix $(\partial h_i/\partial y_j)_{1\le i,j\le k}$. The inverse function theorem of multivariate calculus gives $\partial h/\partial y = (\partial g/\partial x)^{-1}$, or equivalently $\partial x/\partial y = (\partial y/\partial x)^{-1}$.
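A worked special case (added here for illustration; it is not in the original slides) is the affine transformation $Y = g(X) = AX + b$ with a nonsingular $k\times k$ matrix $A$, for which $h(y) = A^{-1}(y - b)$ and $\partial h/\partial y = A^{-1}$, so
$$f_Y(y) = f_X\bigl(A^{-1}(y-b)\bigr)\,|\det A|^{-1}.$$
Taking $X \sim N(0, I_k)$, $A = V^{1/2}$ and $b = \mu$ recovers the $N(\mu, V)$ density that appears later in these notes.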

Mean and covariance matrix

The mean vector of $Y = (Y_1,\ldots,Y_k)^T$ is $E(Y) = (EY_1,\ldots,EY_k)^T$. The covariance matrix of $Y$ is
$$\mathrm{Cov}(Y) = \bigl(\mathrm{Cov}(Y_i,Y_j)\bigr)_{1\le i,j\le k} = E\bigl\{(Y - E(Y))(Y - E(Y))^T\bigr\}.$$
The following linearity (and bilinearity) properties hold for $EY$ and $\mathrm{Cov}(Y)$: for a nonrandom $p\times k$ matrix $A$ and $p\times 1$ vector $B$,
$$E(AY + B) = A\,EY + B, \qquad \mathrm{Cov}(AY + B) = A\,\mathrm{Cov}(Y)\,A^T.$$

Eigenvalue and eigenvector

Definition 1. Let $V$ be a $p\times p$ matrix. A complex number $\lambda$ is called an eigenvalue of $V$ if there exists a $p\times 1$ vector $a \ne 0$ such that $Va = \lambda a$. Such a vector $a$ is called an eigenvector of $V$ corresponding to the eigenvalue $\lambda$.

We can rewrite $Va = \lambda a$ as $(V - \lambda I)a = 0$. Since $a \ne 0$, this implies that $\lambda$ is a solution of the equation $\det(V - \lambda I) = 0$. Since $\det(V - \lambda I)$ is a polynomial of degree $p$ in $\lambda$, there are $p$ eigenvalues, not necessarily distinct. If $V$ is symmetric, then all its eigenvalues are real and can be ordered as $\lambda_1 \ge \cdots \ge \lambda_p$. Moreover,
$$\mathrm{tr}(V) = \lambda_1 + \cdots + \lambda_p, \qquad \det(V) = \lambda_1\cdots\lambda_p.$$
If $a$ is an eigenvector of $V$ corresponding to the eigenvalue $\lambda$, then so is $ca$ for $c \ne 0$. Moreover, premultiplying $\lambda a = Va$ by $a^T$ yields $\lambda = a^T V a / \|a\|^2$.

Eigenvalue and eigenvector (continued)

We now focus on the case where $V$ is the covariance matrix of a random vector $x = (X_1,\ldots,X_p)^T$.

Proposition 1. The eigenvalues of $V$ are real and nonnegative, since $V$ is symmetric and nonnegative definite.

Consider the linear combination $a^T x$ with $\|a\| = 1$ that has the largest variance over all such linear combinations. Maximizing $a^T V a\ (= \mathrm{Var}(a^T x))$ over $a$ with $\|a\| = 1$ yields $Va = \lambda a$. Since $a \ne 0$, this implies that $\lambda$ is an eigenvalue of $V$ and $a$ is the corresponding eigenvector, and that $\lambda = a^T V a$. Denote $\lambda_1 = \max_{a:\|a\|=1} a^T V a$ and let $a_1$ be the corresponding eigenvector with $\|a_1\| = 1$.

Next consider $a^T x$ that maximizes $\mathrm{Var}(a^T x) = a^T V a$ subject to $a_1^T a = 0$ and $\|a\| = 1$. Its solution should satisfy
$$\frac{\partial}{\partial a_i}\bigl\{a^T V a + \lambda(1 - a^T a) + \eta\, a_1^T a\bigr\} = 0 \quad \text{for } i = 1,\ldots,p.$$
This implies that $\lambda$ is an eigenvalue of $V$ with a corresponding unit eigenvector $a_2$ that is orthogonal to $a_1$. Proceeding inductively in this way, we obtain the eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$ of $V$ with the optimization characterization
$$\lambda_{k+1} = \max_{a:\ \|a\|=1,\ a^T a_j = 0 \text{ for } 1\le j\le k} a^T V a.$$
The maximizer $a_{k+1}$ of the above problem is an eigenvector corresponding to the eigenvalue $\lambda_{k+1}$.

Definition 2. $a_i^T x$ is called the $i$th principal component of $x$.

Principal components

(a) From the preceding derivation, $\lambda_i = \mathrm{Var}(a_i^T x)$.

(b) The elements of the eigenvectors $a_i$ are called factor loadings in PCA. Note that $(a_1,\ldots,a_p)$ is an orthogonal matrix, i.e.,
$$I = (a_1,\ldots,a_p)(a_1,\ldots,a_p)^T = a_1 a_1^T + \cdots + a_p a_p^T.$$
Moreover,
$$V = \lambda_1 a_1 a_1^T + \cdots + \lambda_p a_p a_p^T.$$

(c) Since $V = \mathrm{Cov}(x)$ and $x = (X_1,\ldots,X_p)^T$, $\mathrm{tr}(V) = \sum_{i=1}^p \mathrm{Var}(X_i)$. Hence it follows from the fact $\mathrm{tr}(V) = \sum_{i=1}^p \lambda_i$ that
$$\lambda_1 + \cdots + \lambda_p = \sum_{i=1}^p \mathrm{Var}(X_i).$$

PCA of sample covariance matrix

Suppose $x_1,\ldots,x_n$ are $n$ independent observations from a multivariate population with mean $\mu$ and covariance matrix $V$. The mean vector $\mu$ can be estimated by $\bar{x} = \sum_{i=1}^n x_i/n$, and the covariance matrix can be estimated by the sample covariance matrix
$$\hat{V} = \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T\big/(n-1).$$
Let $\hat{a}_j = (\hat{a}_{1j},\ldots,\hat{a}_{pj})^T$ be the eigenvector corresponding to the $j$th largest eigenvalue $\hat\lambda_j$ of the sample covariance matrix $\hat{V}$.
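As a minimal numerical sketch of this construction (illustrative only; the simulated data and library calls are our own, not part of the slides), one can form the sample covariance matrix, take its eigendecomposition, and check that the eigenvalues sum to the total variance and that the loadings are orthonormal:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 4
X = rng.multivariate_normal(mean=np.zeros(p),
                            cov=np.diag([4.0, 2.0, 1.0, 0.5]), size=n)

# Sample covariance matrix V_hat = sum (x_i - xbar)(x_i - xbar)^T / (n - 1)
xbar = X.mean(axis=0)
V_hat = (X - xbar).T @ (X - xbar) / (n - 1)

# Eigendecomposition of the symmetric matrix V_hat; sort eigenvalues in decreasing order
eigvals, eigvecs = np.linalg.eigh(V_hat)
order = np.argsort(eigvals)[::-1]
lam, A = eigvals[order], eigvecs[:, order]      # columns of A are a_hat_1, ..., a_hat_p

# lambda_1 + ... + lambda_p equals the total variance tr(V_hat)
print(np.isclose(lam.sum(), np.trace(V_hat)))    # True
# The loadings form an orthogonal matrix: A A^T = I
print(np.allclose(A @ A.T, np.eye(p)))           # True
# Principal component scores: y_ij = a_hat_j^T (x_i - xbar)
scores = (X - xbar) @ A
```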

PCA of sample covariance matrix (continued)

For fixed $p$, the following asymptotic results have been established under the assumption $\lambda_1 > \cdots > \lambda_p > 0$, as $n\to\infty$:

(i) $\sqrt{n}(\hat{a}_i - a_i)$ has a limiting normal distribution with mean $0$ and covariance matrix
$$\lambda_i \sum_{h\ne i} \frac{\lambda_h}{(\lambda_i - \lambda_h)^2}\, a_h a_h^T.$$
Note that $\hat{a}_i$ and $a_i$ are chosen so that they have the same sign.

(ii) $\sqrt{n}(\hat\lambda_i - \lambda_i)$ has a limiting $N(0, 2\lambda_i^2)$ distribution. Moreover, for $i \ne j$, the limiting distribution of $\sqrt{n}(\hat\lambda_i - \lambda_i,\ \hat\lambda_j - \lambda_j)$ is that of two independent normals.

Approximating covariance structure

Even when $p$ is large, if many eigenvalues $\lambda_h$ are small compared with the largest $\lambda_i$'s, the covariance structure of the data can be approximated by a few principal components with large eigenvalues.
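A small simulation sketch of result (ii) (added for illustration; the diagonal covariance matrix and the sample sizes are assumptions of ours): for Gaussian data with distinct eigenvalues, the empirical variance of $\sqrt{n}(\hat\lambda_1 - \lambda_1)$ should be close to $2\lambda_1^2$.

```python
import numpy as np

rng = np.random.default_rng(9)
p, n, reps = 3, 2000, 3000
lam_true = np.array([4.0, 2.0, 1.0])              # distinct eigenvalues, V = diag(lam_true)

top_eig = np.empty(reps)
for b in range(reps):
    X = rng.standard_normal((n, p)) * np.sqrt(lam_true)   # covariance matrix diag(lam_true)
    V_hat = np.cov(X, rowvar=False)
    top_eig[b] = np.linalg.eigvalsh(V_hat)[-1]            # largest sample eigenvalue

z = np.sqrt(n) * (top_eig - lam_true[0])
print(z.var(), 2 * lam_true[0]**2)                # empirical variance vs 2*lambda_1^2 = 32
```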

Representation of data in terms of principal components

Let $x_i = (x_{i1},\ldots,x_{ip})^T$, $i = 1,\ldots,n$, be the multivariate sample. Let $X_k = (x_{1k},\ldots,x_{nk})^T$, $1\le k\le p$, and define
$$Y_j = \hat{a}_{1j} X_1 + \cdots + \hat{a}_{pj} X_p, \qquad 1\le j\le p,$$
where $\hat{a}_j = (\hat{a}_{1j},\ldots,\hat{a}_{pj})^T$ is the eigenvector corresponding to the $j$th largest eigenvalue $\hat\lambda_j$ of the sample covariance matrix $\hat{V}$, with $\|\hat{a}_j\| = 1$. Since the matrix $\hat{A} := (\hat{a}_{ij})_{1\le i,j\le p}$ is orthogonal (i.e., $\hat{A}\hat{A}^T = I$), the observed data $X_k$ can be expressed in terms of the principal components $Y_j$ as
$$X_k = \hat{a}_{k1} Y_1 + \cdots + \hat{a}_{kp} Y_p.$$
Moreover, the sample correlation matrix of the transformed data $y_{tj}$ is the identity matrix.

PCA of correlation matrices

An alternative to $\mathrm{Cov}(x)$ is the correlation matrix $\mathrm{Corr}(x)$. Since a primary goal of PCA is to uncover low-dimensional subspaces around which $x$ tends to concentrate, scaling the components of $x$ by their standard deviations may work better for this purpose, and applying PCA to sample correlation matrices may give more stable results.
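A short sketch (illustrative, not from the slides) of the reconstruction $X_k = \sum_j \hat{a}_{kj} Y_j$ and of running PCA on the correlation matrix instead of the covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 3
X = rng.standard_normal((n, p)) @ np.array([[2.0, 0.3, 0.0],
                                            [0.0, 1.0, 0.2],
                                            [0.0, 0.0, 0.5]])
Xc = X - X.mean(axis=0)

def pca(M):
    """Eigenvalues (descending) and corresponding unit eigenvectors of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

# PCA of the sample covariance matrix
V_hat = Xc.T @ Xc / (n - 1)
lam, A = pca(V_hat)
Y = Xc @ A                       # principal component scores, columns Y_1, ..., Y_p

# Reconstruction: the k-th centered data column equals sum_j a_hat_{kj} Y_j
print(np.allclose(Xc, Y @ A.T))  # True

# PCA of the sample correlation matrix (i.e., PCA of standardized data)
R_hat = np.corrcoef(X, rowvar=False)
lam_R, A_R = pca(R_hat)
print(lam_R.sum())               # equals p, since each standardized variable has variance 1
```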

Example: PCA of U.S. Treasury-LIBOR swap rates

[Figure 1: Daily swap rates of four (swp1y, swp4y, swp10y, swp30y) of the eight maturities T = 1, 2, 3, 4, 5, 7, 10, 30 years, from July 3, 2000 to July 15, 2005.]

We now apply PCA to account for the variance of daily changes in swap rates with a few principal components (or factors). Let $d_{kt} = r_{kt} - r_{k,t-1}$ denote the daily changes in the $k$-year swap rates during the period.

[Figure 2: The differenced series for the 1-year (top) and 30-year (bottom) swap rates.]

Example: PCA of U.S. Treasury-LIBOR swap rates (continued)

The sample mean vector and the sample covariance and correlation matrices of the differenced data $\{(d_{1t},\ldots,d_{8t}),\ 1\le t\le 1256\}$ are
$$\hat\mu = (2.412,\ 2.349,\ 2.293,\ 2.245,\ 2.196,\ 2.158,\ 2.094,\ 1.879)^T \times 10^{-3},$$
$$\hat\Sigma = \begin{pmatrix}
0.233 \\
0.300 & 0.438 \\
0.303 & 0.454 & 0.488 \\
0.297 & 0.453 & 0.492 & 0.508 \\
0.292 & 0.451 & 0.494 & 0.514 & 0.543 \\
0.268 & 0.421 & 0.466 & 0.490 & 0.513 & 0.498 \\
0.240 & 0.384 & 0.431 & 0.458 & 0.477 & 0.472 & 0.467 \\
0.169 & 0.278 & 0.318 & 0.344 & 0.362 & 0.365 & 0.369 & 0.322
\end{pmatrix} \times 10^{-2},$$
$$\hat{R} = \begin{pmatrix}
1.000 \\
0.941 & 1.000 \\
0.899 & 0.983 & 1.000 \\
0.862 & 0.961 & 0.988 & 1.000 \\
0.821 & 0.925 & 0.960 & 0.979 & 1.000 \\
0.787 & 0.901 & 0.945 & 0.973 & 0.986 & 1.000 \\
0.729 & 0.850 & 0.902 & 0.941 & 0.948 & 0.978 & 1.000 \\
0.618 & 0.742 & 0.803 & 0.852 & 0.866 & 0.912 & 0.951 & 1.000
\end{pmatrix},$$
where only the lower triangles of the symmetric matrices $\hat\Sigma$ and $\hat{R}$ are shown.

Example: PCA of U.S. Treasury-LIBOR swap rates (continued)

Table 1: PCA of the covariance matrix of the swap rate data.

                   PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8
10^4 lambda_i     324.1   18.44   3.486   1.652   0.984   0.473   0.292   0.253
Proportion        0.926   0.053   0.010   0.005   0.003   0.001   0.001   0.001
Factor loadings   0.231   0.491   0.535   0.580   0.081   0.275   0.006   0.030
                  0.351   0.431   0.150   0.279   0.049   0.725   0.050   0.246
                  0.381   0.263   0.078   0.456   0.057   0.265   0.001   0.706
                  0.393   0.087   0.175   0.299   0.069   0.534   0.062   0.652
                  0.404   0.054   0.486   0.447   0.414   0.131   0.455   0.054
                  0.387   0.211   0.238   0.262   0.089   0.052   0.818   0.038
                  0.365   0.395   0.127   0.034   0.735   0.154   0.344   0.103
                  0.279   0.541   0.588   0.138   0.513   0.032   0.003   0.027

Table 2: PCA of the correlation matrix of the swap rate data.

                   PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8
lambda_i          7.265   0.548   0.103   0.041   0.022   0.011   0.006   0.005
Proportion        0.908   0.067   0.013   0.005   0.003   0.001   0.001   0.001
Factor loadings   0.324   0.599   0.586   0.402   0.025   0.175   0.002   0.013
                  0.356   0.352   0.037   0.436   0.004   0.717   0.044   0.205
                  0.364   0.187   0.215   0.432   0.007   0.349   0.007   0.691
                  0.368   0.043   0.253   0.206   0.076   0.533   0.045   0.681
                  0.365   0.064   0.385   0.455   0.499   0.151   0.482  -0.056
                  0.365   0.192   0.216   0.342   0.014   0.075   0.811   0.052
                  0.356   0.348   0.059   0.154   0.768   0.143   0.324   0.098
                  0.328   0.565   0.589   0.267   0.393   0.025   0.002   0.020

Each row of factor loadings corresponds to one of the eight maturities (1, 2, 3, 4, 5, 7, 10, 30 years).

Example: PCA of U.S. Treasury-LIBOR swap rates (continued)

Let $x_{tk} = (d_{tk} - \hat\mu_k)/\hat\sigma_k$, where $\hat\sigma_k$ is the sample standard deviation of the $d_{tk}$. We represent $X_k = (x_{1k},\ldots,x_{nk})^T$ in terms of the principal components $Y_1,\ldots,Y_8$:
$$X_k = \hat{a}_{k1} Y_1 + \cdots + \hat{a}_{kp} Y_p.$$
Tables 1 and 2 show the PCA results using the sample covariance matrix $\hat\Sigma$ and the sample correlation matrix $\hat{R}$. From Table 2, the proportions of the overall variance explained by the first three principal components are 90.8%, 6.7% and 1.3%, respectively. Hence about 98.9% of the overall variance is explained by the first three principal components, yielding the approximation
$$(d_k - \hat\mu_k)/\hat\sigma_k \doteq \hat{a}_{k1} Y_1 + \hat{a}_{k2} Y_2 + \hat{a}_{k3} Y_3$$
for the differenced swap rates.

[Figure 3: Eigenvectors (factor loadings) of the first three principal components, which represent the parallel shift, tilt, and convexity components of swap rate movements.]

Example: PCA of U.S. Treasury-LIBOR swap rates (continued)

The graphs of the factor loadings show the following constituents of interest rate (yield curve) movements documented in the literature:

(a) Parallel shift component. The factor loadings of the first principal component are roughly constant across maturities. This means that a change in the swap rate for one maturity is accompanied by roughly the same change for the other maturities.

(b) Tilt component. The factor loadings of the second principal component change monotonically with maturity, and the loadings for short and long maturities have opposite signs, so short-maturity and long-maturity swap rates move in opposite directions in this component.

(c) Curvature component. The factor loadings of the third principal component differ between the midterm rates and the average of the short- and long-term rates, revealing a curvature that resembles the convex shape of the relationship between the rates and their maturities.

Joint, marginal and conditional distributions

An $m\times 1$ random vector $Y$ is said to have a multivariate normal distribution $N(\mu, V)$ if its density function is
$$f(y) = \frac{1}{(2\pi)^{m/2}\sqrt{\det(V)}}\exp\Bigl\{-\frac12 (y-\mu)^T V^{-1}(y-\mu)\Bigr\}, \qquad y\in\mathbb{R}^m.$$
Suppose that $Y \sim N(\mu, V)$ is partitioned as
$$Y = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad V = \begin{pmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{pmatrix},$$
where $Y_1$ and $\mu_1$ have dimension $r\ (< m)$ and $V_{11}$ is $r\times r$. Then $Y_1 \sim N(\mu_1, V_{11})$, $Y_2 \sim N(\mu_2, V_{22})$, and the conditional distribution of $Y_1$ given $Y_2 = y_2$ is
$$N\bigl(\mu_1 + V_{12}V_{22}^{-1}(y_2 - \mu_2),\ V_{11} - V_{12}V_{22}^{-1}V_{21}\bigr).$$
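A quick numerical check of the conditional-distribution formula (a sketch under assumed values of $\mu$ and $V$; the crude Monte Carlo windowing is only for illustration):

```python
import numpy as np

# Conditional distribution of Y1 given Y2 = y2 for a partitioned multivariate normal
mu = np.array([1.0, 0.0, -1.0])
V = np.array([[2.0, 0.8, 0.3],
              [0.8, 1.5, 0.5],
              [0.3, 0.5, 1.0]])
r = 1                                   # dimension of Y1
mu1, mu2 = mu[:r], mu[r:]
V11, V12 = V[:r, :r], V[:r, r:]
V21, V22 = V[r:, :r], V[r:, r:]

y2 = np.array([0.5, -0.5])              # conditioning value of Y2
cond_mean = mu1 + V12 @ np.linalg.solve(V22, y2 - mu2)
cond_cov = V11 - V12 @ np.linalg.solve(V22, V21)

# Crude Monte Carlo check: keep draws with Y2 in a small window around y2
rng = np.random.default_rng(2)
Y = rng.multivariate_normal(mu, V, size=1_000_000)
keep = np.all(np.abs(Y[:, r:] - y2) < 0.1, axis=1)
print(cond_mean[0], Y[keep, 0].mean())   # should be close
print(cond_cov[0, 0], Y[keep, 0].var())  # should be close
```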

Orthogonality and independence

Let $\zeta$ be a standard normal random vector of dimension $p$, and let $A$, $B$ be $r\times p$ and $q\times p$ nonrandom matrices such that $AB^T = 0$. Then the components of $U_1 := A\zeta$ are uncorrelated with those of $U_2 := B\zeta$ because
$$E(U_1 U_2^T) = A\,E(\zeta\zeta^T)\,B^T = AB^T = 0.$$
Note that
$$U = \begin{pmatrix} U_1 \\ U_2 \end{pmatrix} = \begin{pmatrix} A \\ B \end{pmatrix}\zeta$$
has a multivariate normal distribution with mean $0$ and covariance matrix
$$V = \begin{pmatrix} AA^T & 0 \\ 0 & BB^T \end{pmatrix}.$$
Hence the density function of $U$ is the product of the density functions of $U_1$ and $U_2$, so $U_1$ and $U_2$ are independent. Note also that $E(U_1 U_2^T) = 0$ means that the components of $U_1$ and $U_2$ are orthogonal in the mean-square sense.

Sample covariance matrix and Wishart distribution

Let $Y_1,\ldots,Y_n$ be independent $m\times 1$ random $N(\mu, \Sigma)$ vectors with $n > m$ and positive definite $\Sigma$. Define
$$\bar{Y} = \frac1n\sum_{i=1}^n Y_i, \qquad W = \sum_{i=1}^n (Y_i - \bar{Y})(Y_i - \bar{Y})^T.$$
It can be shown that the sample mean vector $\bar{Y} \sim N(\mu, \Sigma/n)$, and that $\bar{Y}$ and the sample covariance matrix $W/(n-1)$ are independent. We next generalize $\sum_{t=1}^n (y_t - \bar{y})^2/\sigma^2 \sim \chi^2_{n-1}$ to the multivariate case.

Wishart distribution

Suppose $Y_1,\ldots,Y_n$ are independent $m\times 1$ $N(0, \Sigma)$ random vectors. Then the random matrix $W = \sum_{i=1}^n Y_i Y_i^T$ is said to have a Wishart distribution, denoted by $W_m(\Sigma, n)$.
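A small simulation sketch of the Wishart construction $W = \sum_{i=1}^n Y_i Y_i^T$ with $Y_i \sim N(0,\Sigma)$ (the particular $\Sigma$, $n$ and number of replications are our own choices); it checks empirically that $E(W) = n\Sigma$, which is property (a) stated below:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, reps = 3, 10, 20000
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 1.5]])

# Draw `reps` Wishart matrices W = sum_{i=1}^n Y_i Y_i^T with Y_i ~ N(0, Sigma)
Y = rng.multivariate_normal(np.zeros(m), Sigma, size=(reps, n))   # shape (reps, n, m)
W = np.einsum('rni,rnj->rij', Y, Y)                               # shape (reps, m, m)

# Wishart mean: E(W) = n * Sigma
print(np.round(W.mean(axis=0), 2))
print(np.round(n * Sigma, 2))
```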

Wishart distribution (continued)

The density function of the Wishart distribution $W_m(\Sigma, n)$ generalizes the chi-square density:
$$f(W) = \frac{\det(W)^{(n-m-1)/2}\exp\bigl\{-\frac12\mathrm{tr}(\Sigma^{-1}W)\bigr\}}{[2^m\det(\Sigma)]^{n/2}\,\Gamma_m(n/2)}, \qquad W > 0,$$
in which $W > 0$ denotes that $W$ is positive definite and $\Gamma_m(\cdot)$ denotes the multivariate gamma function
$$\Gamma_m(t) = \pi^{m(m-1)/4}\prod_{i=1}^m \Gamma\Bigl(t - \frac{i-1}{2}\Bigr).$$
Note that the usual gamma function corresponds to the case $m = 1$.

The following properties of the Wishart distribution are generalizations of some well-known properties of the chi-square and gamma$(\alpha,\beta)$ distributions.

(a) If $W \sim W_m(\Sigma, n)$, then $E(W) = n\Sigma$.

(b) Let $W_1,\ldots,W_k$ be independently distributed with $W_j \sim W_m(\Sigma, n_j)$, $j = 1,\ldots,k$. Then $\sum_{j=1}^k W_j \sim W_m\bigl(\Sigma, \sum_{j=1}^k n_j\bigr)$.

(c) Let $W \sim W_m(\Sigma, n)$ and let $A$ be a nonrandom $m\times m$ nonsingular matrix. Then $AWA^T \sim W_m(A\Sigma A^T, n)$. In particular, $a^T W a \sim (a^T\Sigma a)\,\chi^2_n$ for all nonrandom vectors $a \ne 0$.

Multivariate-t distribution

If $Z \sim N(0, \Sigma)$ and $W \sim W_m(\Sigma, k)$ are such that the $m\times 1$ vector $Z$ and the $m\times m$ matrix $W$ are independent, then $(W/k)^{-1/2}Z$ is said to have the $m$-variate $t$-distribution with $k$ degrees of freedom. This distribution has the density function
$$f(t) = \frac{\Gamma\bigl((k+m)/2\bigr)}{(\pi k)^{m/2}\,\Gamma(k/2)}\Bigl(1 + \frac{t^T t}{k}\Bigr)^{-(k+m)/2}, \qquad t\in\mathbb{R}^m.$$
If $t$ has the $m$-variate $t$-distribution with $k$ degrees of freedom and $k \ge m$, then
$$\frac{k-m+1}{km}\, t^T t \ \text{ has the } F_{m,\,k-m+1}\text{-distribution.} \qquad (1)$$
Applying (1) to Hotelling's $T^2$-statistic
$$T^2 = n(\bar{Y} - \mu)^T\bigl(W/(n-1)\bigr)^{-1}(\bar{Y} - \mu), \qquad (2)$$
where $\bar{Y}$ and $W$ are as defined above for the sample of $N(\mu,\Sigma)$ vectors, yields
$$\frac{n-m}{m(n-1)}\, T^2 \sim F_{m,\,n-m}, \qquad (3)$$
noting that
$$\bigl[W/(n-1)\bigr]^{-1/2}\bigl[\sqrt{n}(\bar{Y} - \mu)\bigr] = \bigl[W_m(\Sigma, n-1)/(n-1)\bigr]^{-1/2}\,N(0, \Sigma)$$
has the multivariate $t$-distribution with $n-1$ degrees of freedom.
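A minimal sketch of Hotelling's $T^2$ test based on (2) and (3) (the simulated data and the hypothesized mean $\mu_0$ are assumptions used only for illustration):

```python
import numpy as np
from scipy import stats

# Hotelling's T^2 test of H0: mu = mu0, calibrated via (n - m) T^2 / (m (n - 1)) ~ F_{m, n-m}
rng = np.random.default_rng(4)
m, n = 3, 40
Sigma = np.array([[1.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 1.0]])
Y = rng.multivariate_normal(np.zeros(m), Sigma, size=n)

mu0 = np.zeros(m)                       # hypothesized mean
Ybar = Y.mean(axis=0)
S = np.cov(Y, rowvar=False)             # W / (n - 1), the sample covariance matrix
T2 = n * (Ybar - mu0) @ np.linalg.solve(S, Ybar - mu0)

F_stat = (n - m) * T2 / (m * (n - 1))
p_value = stats.f.sf(F_stat, m, n - m)
print(T2, F_stat, p_value)
```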

Multivariate-t distribution (continued)

In the definition of the multivariate $t$-distribution, it is assumed that $Z$ and $W$ share the same $\Sigma$. More generally, we can consider the case where $Z \sim N(0, V)$ instead. By considering $V^{-1/2}Z$ in place of $Z$, we can assume that $V = I_m$. Then the density function of $(W/k)^{-1/2}Z$, with independent $Z \sim N(0, I_m)$ and $W \sim W_m(\Sigma, k)$, has the general form
$$f(t) = \frac{\Gamma\bigl((k+m)/2\bigr)}{(\pi k)^{m/2}\,\Gamma(k/2)\sqrt{\det(\Sigma)}}\Bigl(1 + \frac{t^T\Sigma^{-1}t}{k}\Bigr)^{-(k+m)/2}, \qquad t\in\mathbb{R}^m.$$
This density function can be used to define multivariate $t$-copulas.

Method of maximum likelihood (MLE)

Suppose $X_1,\ldots,X_n$ have joint density function $f_\theta(x_1,\ldots,x_n)$, $\theta\in\mathbb{R}^p$. The likelihood function based on the observations $X_1,\ldots,X_n$ is $L(\theta) = f_\theta(X_1,\ldots,X_n)$, and the MLE $\hat\theta$ of $\theta$ is the maximizer of $L(\theta)$, or equivalently of the log-likelihood $l(\theta) = \log f_\theta(X_1,\ldots,X_n)$.

Example: MLE in Gaussian regression models

Consider the regression model $y_t = \beta^T x_t + \epsilon_t$, in which the regressors satisfy $x_t \in \mathcal{F}_{t-1} = \{y_{t-1}, x_{t-1},\ldots,y_1,x_1\}$ (i.e., $x_t$ is determined by the past observations) and the $\epsilon_t$ are i.i.d. $N(0, \sigma^2)$. Letting $\theta = (\beta, \sigma^2)$, the likelihood function is
$$L(\theta) = \prod_{t=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}\exp\bigl\{-(y_t - \beta^T x_t)^2/(2\sigma^2)\bigr\}.$$
It then follows that the MLE of $\beta$ is the same as the OLS estimator $\hat\beta$ that minimizes $\sum_{t=1}^n (y_t - \beta^T x_t)^2$, and the MLE of $\sigma^2$ is $\hat\sigma^2 = n^{-1}\sum_{t=1}^n (y_t - \hat\beta^T x_t)^2$.
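A short numerical sketch of this example (simulated data; not from the slides): the Gaussian MLE of $\beta$ coincides with OLS, and the MLE of $\sigma^2$ uses divisor $n$:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0, 0.5])
sigma_true = 0.7
y = X @ beta_true + sigma_true * rng.standard_normal(n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS estimate = Gaussian MLE of beta
resid = y - X @ beta_hat
sigma2_hat = np.mean(resid**2)                    # MLE of sigma^2 (divisor n, not n - p)

def loglik(beta, sigma2):
    """Gaussian log-likelihood l(beta, sigma^2) for the regression model."""
    r = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * np.sum(r**2) / sigma2

print(beta_hat, sigma2_hat)
# The MLE maximizes the log-likelihood, so it beats the true parameter values
print(loglik(beta_hat, sigma2_hat) >= loglik(beta_true, sigma_true**2))  # True
```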

Consistency of MLE

The classical asymptotic theory of maximum likelihood deals with i.i.d. $X_t$, so that $l(\theta) = \sum_{t=1}^n \log f_\theta(X_t)$. Let $\theta_0$ denote the true parameter value. By the law of large numbers,
$$n^{-1} l(\theta) \to E_{\theta_0}\{\log f_\theta(X_1)\} \quad \text{as } n\to\infty, \qquad (4)$$
with probability 1, for every fixed $\theta$.

Jensen's inequality

Suppose $X$ has a finite mean and $\varphi:\mathbb{R}\to\mathbb{R}$ is convex, i.e., $\lambda\varphi(x) + (1-\lambda)\varphi(y) \ge \varphi(\lambda x + (1-\lambda)y)$ for all $0 < \lambda < 1$ and $x, y$. (If the inequality above is strict, then $\varphi$ is called strictly convex.) Then $E\varphi(X) \ge \varphi(EX)$, and the inequality is strict if $\varphi$ is strictly convex, unless $X$ is degenerate (i.e., it takes only one value).

Consistency of MLE (continued)

Since $-\log x$ is a (strictly) convex function, it follows from Jensen's inequality that
$$E_{\theta_0}\Bigl(\log\frac{f_\theta(X_1)}{f_{\theta_0}(X_1)}\Bigr) \le \log E_{\theta_0}\Bigl(\frac{f_\theta(X_1)}{f_{\theta_0}(X_1)}\Bigr) = \log 1 = 0;$$
the inequality above is strict unless the random variable $f_\theta(X_1)/f_{\theta_0}(X_1)$ is degenerate under $P_{\theta_0}$, which means $P_{\theta_0}\{f_\theta(X_1) = f_{\theta_0}(X_1)\} = 1$ since $E_{\theta_0}\{f_\theta(X_1)/f_{\theta_0}(X_1)\} = 1$. Thus, except in the case where $f_\theta(X_1) = f_{\theta_0}(X_1)$ with probability 1,
$$E_{\theta_0}\log f_\theta(X_1) < E_{\theta_0}\log f_{\theta_0}(X_1) \quad \text{for } \theta \ne \theta_0.$$
This suggests that the MLE is consistent (i.e., it converges to $\theta_0$ with probability 1 as $n\to\infty$). A rigorous proof of consistency of the MLE involves sharpening the argument in (4) by considering, instead of fixed $\theta$, $\sup_{\theta\in U} n^{-1}l(\theta)$ over neighborhoods $U$ of $\theta \ne \theta_0$.
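As a concrete illustration of this strict inequality (a worked example added here; it is not part of the original slides), take $f_\theta$ to be the $N(\theta, 1)$ density. Then
$$E_{\theta_0}\bigl\{\log f_\theta(X_1) - \log f_{\theta_0}(X_1)\bigr\} = E_{\theta_0}\Bigl\{\frac{(X_1-\theta_0)^2 - (X_1-\theta)^2}{2}\Bigr\} = -\frac{(\theta - \theta_0)^2}{2} < 0 \quad \text{for } \theta \ne \theta_0,$$
so the expected log-likelihood per observation is uniquely maximized at $\theta_0$.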

Asymptotic normality of MLE

Suppose $f_\theta(x)$ is smooth in $\theta$. Then we can find the MLE by solving the equation $\nabla l(\theta) = 0$, where $\nabla l$ is the gradient vector of partial derivatives $\partial l/\partial\theta_i$. For large $n$, the consistent estimate $\hat\theta$ is near $\theta_0$, and therefore Taylor's theorem yields
$$0 = \nabla l(\hat\theta) = \nabla l(\theta_0) + \nabla^2 l(\theta^*)(\hat\theta - \theta_0), \qquad (5)$$
where $\theta^*$ lies between $\hat\theta$ and $\theta_0$ (and is therefore close to $\theta_0$) and
$$\nabla^2 l(\theta) = \Bigl(\frac{\partial^2 l}{\partial\theta_i\,\partial\theta_j}\Bigr)_{1\le i,j\le p} \qquad (6)$$
is the Hessian matrix of second partial derivatives. From (5) it follows that
$$\hat\theta - \theta_0 = -\bigl(\nabla^2 l(\theta^*)\bigr)^{-1}\nabla l(\theta_0). \qquad (7)$$

Asymptotic normality of MLE (continued)

The central limit theorem (CLT) can be applied to
$$\nabla l(\theta_0) = \sum_{t=1}^n \nabla\log f_\theta(X_t)\big|_{\theta=\theta_0}, \qquad (8)$$
noting that the gradient vector $\nabla\log f_\theta(X_t)\big|_{\theta=\theta_0}$ has mean $0$ and covariance matrix
$$I(\theta_0) = \Bigl(-E_{\theta_0}\Bigl\{\frac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log f_\theta(X_t)\Big|_{\theta=\theta_0}\Bigr\}\Bigr)_{1\le i,j\le p}. \qquad (9)$$
The matrix $I(\theta_0)$ is called the Fisher information matrix. By the law of large numbers,
$$-n^{-1}\nabla^2 l(\theta_0) = -n^{-1}\sum_{t=1}^n \nabla^2\log f_\theta(X_t)\big|_{\theta=\theta_0} \to I(\theta_0) \qquad (10)$$
with probability 1.

Asymptotic normality of MLE (continued)

Moreover, $n^{-1}\nabla^2 l(\theta^*) = n^{-1}\nabla^2 l(\theta_0) + o(1)$ with probability 1, by the consistency of $\hat\theta$. From (7) and (10) it follows that
$$\sqrt{n}(\hat\theta - \theta_0) = \bigl(-n^{-1}\nabla^2 l(\theta^*)\bigr)^{-1}\bigl(\nabla l(\theta_0)/\sqrt{n}\bigr) \qquad (11)$$
converges in distribution to $N(0, I^{-1}(\theta_0))$ as $n\to\infty$, in view of the following theorem.

Slutsky's theorem

If $X_n$ converges in distribution to $X$ and $Y_n$ converges in probability to some nonrandom $C$, then $Y_nX_n$ converges in distribution to $CX$.

It is often more convenient to work with the observed Fisher information matrix $-\nabla^2 l(\hat\theta)$ than with the Fisher information matrix $nI(\theta_0)$, and to use the following variant of the limiting normal distribution in (11):
$$(\hat\theta - \theta_0)^T\bigl(-\nabla^2 l(\hat\theta)\bigr)(\hat\theta - \theta_0) \ \text{has a limiting } \chi^2_p\text{-distribution.} \qquad (12)$$

Asymptotic normality of MLE (continued)

The preceding asymptotic theory has assumed that the $X_t$ are i.i.d. so that the law of large numbers and the CLT can be applied. More generally, without assuming that the $X_t$ are independent, the log-likelihood function can still be written as a sum:
$$l_n(\theta) = \sum_{t=1}^n \log f_\theta(X_t \mid X_1,\ldots,X_{t-1}). \qquad (13)$$
It can be shown that $\{\nabla l_n(\theta_0),\ n\ge 1\}$ is a martingale, or equivalently,
$$E\bigl\{\nabla\log f_\theta(X_t \mid X_1,\ldots,X_{t-1})\big|_{\theta=\theta_0}\ \big|\ X_1,\ldots,X_{t-1}\bigr\} = 0,$$
and martingale strong laws and CLTs can be used to establish the consistency and asymptotic normality of $\hat\theta$ under certain conditions, so (12) still holds in this more general setting.
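A numerical sketch of (11)-(12) (the exponential model is our own choice for illustration): compute the MLE, the observed information $-l''(\hat\theta)$, and the quadratic form in (12), whose distribution should be approximately $\chi^2_1$:

```python
import numpy as np
from scipy import stats

# Illustrative model: X_1, ..., X_n i.i.d. Exponential(theta) with density theta*exp(-theta*x).
#   l(theta)    = n log(theta) - theta * sum(x)
#   MLE         theta_hat = 1 / mean(x)
#   -l''(theta) = n / theta^2   (observed Fisher information, evaluated at theta_hat)
rng = np.random.default_rng(6)
theta0, n, reps = 2.0, 200, 5000

qforms = np.empty(reps)
for b in range(reps):
    x = rng.exponential(scale=1 / theta0, size=n)
    theta_hat = 1 / x.mean()
    obs_info = n / theta_hat**2                   # observed information -l''(theta_hat)
    qforms[b] = (theta_hat - theta0)**2 * obs_info

# Compare the empirical 95th percentile with the chi^2_1 quantile (about 3.84)
print(np.quantile(qforms, 0.95), stats.chi2.ppf(0.95, df=1))
```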

Asymptotic inference: confidence ellipsoid

In view of (12), an approximate $100(1-\alpha)\%$ confidence ellipsoid for $\theta_0$ is
$$\bigl\{\theta : (\hat\theta - \theta)^T\bigl(-\nabla^2 l_n(\hat\theta)\bigr)(\hat\theta - \theta) \le \chi^2_{p;1-\alpha}\bigr\}, \qquad (14)$$
where $\chi^2_{p;1-\alpha}$ is the $(1-\alpha)$-quantile of the $\chi^2_p$-distribution.

Asymptotic inference: hypothesis testing

To test $H_0: \theta\in\Theta_0$, where $\Theta_0$ is a $q$-dimensional subspace of the parameter space with $0 \le q < p$, the generalized likelihood ratio (GLR) statistic is
$$\Lambda = 2\Bigl\{l_n(\hat\theta) - \sup_{\theta\in\Theta_0} l_n(\theta)\Bigr\}, \qquad (15)$$
where $l_n(\theta)$ is the log-likelihood function defined in (13). The derivation of the limiting normal distribution of (11) can be modified to show that
$$\Lambda \ \text{has a limiting } \chi^2_{p-q}\text{-distribution under } H_0. \qquad (16)$$
Hence the GLR test with significance level (type I error probability) $\alpha$ rejects $H_0$ if $\Lambda$ exceeds $\chi^2_{p-q;1-\alpha}$.
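A minimal sketch of the GLR test (the normal-mean example is our own choice): testing $H_0:\mu = 0$ in an $N(\mu,\sigma^2)$ model with $\theta = (\mu,\sigma^2)$, so $p = 2$, $q = 1$ and $\Lambda$ is compared with the $\chi^2_1$ distribution:

```python
import numpy as np
from scipy import stats

# GLR test of H0: mu = 0 for X_1, ..., X_n i.i.d. N(mu, sigma^2).
# Unrestricted MLE: mu_hat = xbar, sigma2_hat = mean((x - xbar)^2).
# Restricted MLE under H0: mu = 0, sigma2_tilde = mean(x^2).
# Lambda = 2{ l(theta_hat) - sup_{Theta_0} l(theta) } = n log(sigma2_tilde / sigma2_hat).
rng = np.random.default_rng(7)
n = 100
x = rng.normal(loc=0.0, scale=1.5, size=n)        # data generated under H0

sigma2_hat = np.mean((x - x.mean())**2)
sigma2_tilde = np.mean(x**2)
Lambda = n * np.log(sigma2_tilde / sigma2_hat)

p_value = stats.chi2.sf(Lambda, df=1)             # limiting chi^2_{p-q} = chi^2_1 under H0
print(Lambda, p_value)
```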

Asymptotic inference: delta method

Let $g:\mathbb{R}^p\to\mathbb{R}$. If $\hat\theta$ is the MLE of $\theta$, then $g(\hat\theta)$ is the MLE of $g(\theta)$. If $g$ is continuously differentiable in some neighborhood of the true parameter value $\theta_0$, then
$$g(\hat\theta) = g(\theta_0) + \bigl(\nabla g(\theta^*)\bigr)^T(\hat\theta - \theta_0),$$
where $\theta^*$ lies between $\hat\theta$ and $\theta_0$. Therefore $g(\hat\theta) - g(\theta_0)$ is asymptotically normal with mean $0$ and variance
$$\bigl(\nabla g(\hat\theta)\bigr)^T\bigl(-\nabla^2 l_n(\hat\theta)\bigr)^{-1}\bigl(\nabla g(\hat\theta)\bigr). \qquad (17)$$
This is called the delta method for evaluating the asymptotic variance of $g(\hat\theta)$. The above asymptotic normality of $g(\hat\theta) - g(\theta_0)$ can be used to construct confidence intervals for $g(\theta_0)$.

Parametric bootstrap

When $\theta_0$ is unknown, the parametric bootstrap draws bootstrap samples $(X_{b,1},\ldots,X_{b,n})$, $b = 1,\ldots,B$, from $F_{\hat\theta}$, where $\hat\theta$ is the MLE of $\theta$. These bootstrap samples can then be used to estimate the bias and standard error of $\hat\theta$. To construct a bootstrap confidence region for $g(\theta_0)$, we can use
$$\bigl(g(\hat\theta) - g(\theta_0)\bigr)\Big/\sqrt{\bigl(\nabla g(\hat\theta)\bigr)^T\bigl(-\nabla^2 l_n(\hat\theta)\bigr)^{-1}\bigl(\nabla g(\hat\theta)\bigr)}$$
as an approximate pivot whose (asymptotic) distribution does not depend on the unknown $\theta_0$, and use the quantiles of its bootstrap distribution in place of standard normal quantiles.
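A short sketch combining the delta method and the parametric bootstrap (the exponential model and $g(\theta) = 1/\theta$ are our own illustrative choices):

```python
import numpy as np

# Delta-method standard error and parametric bootstrap for g(theta) = 1/theta (the mean)
# in an Exponential(theta) model, where theta_hat = 1/mean(x) and -l''(theta_hat) = n/theta_hat^2.
rng = np.random.default_rng(8)
theta0, n, B = 2.0, 150, 2000
x = rng.exponential(scale=1 / theta0, size=n)

theta_hat = 1 / x.mean()                          # MLE of theta
g_hat = 1 / theta_hat                             # MLE of g(theta) = 1/theta

# Delta method: Var(g_hat) is approximately g'(theta_hat)^2 / obs_info
obs_info = n / theta_hat**2
se_delta = np.sqrt((1 / theta_hat**2)**2 / obs_info)

# Parametric bootstrap: resample from F_{theta_hat} and recompute g(theta_hat*)
g_boot = np.empty(B)
for b in range(B):
    xb = rng.exponential(scale=1 / theta_hat, size=n)
    g_boot[b] = xb.mean()                         # g(1/mean(xb)) = mean(xb)

se_boot = g_boot.std(ddof=1)
print(se_delta, se_boot)                          # the two standard errors should be similar
```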