Advanced Quantitative Methods: Maximum Likelihood


Advanced Quantitative Methods: Maximum Likelihood
University College Dublin
4 March 2014


Outline
1. Derivatives
2. Maximum likelihood estimation
3. ML estimation of the linear model
4. Gradient, Hessian and information matrix
5. Properties of ML estimators
6. Likelihood-based tests

Derivatives of straight lines

$y = \frac{1}{2}x + 2 \qquad \frac{dy}{dx} = \frac{1}{2}$

Derivatives of curves

$y = x^2 - 4x + 5$

Using the rules $\frac{d(ax^b)}{dx} = b a x^{b-1}$ and $\frac{d(a+b)}{dx} = \frac{da}{dx} + \frac{db}{dx}$:

$\frac{dy}{dx} = 2x - 4$

Finding location of peaks and valleys

A maximum or minimum value of a curve can be found by setting the derivative equal to zero.

$y = x^2 - 4x + 5 \qquad \frac{dy}{dx} = 2x - 4 = 0 \;\Rightarrow\; 2x = 4 \;\Rightarrow\; x = \frac{4}{2} = 2$

So at $x = 2$ we will find a (local) maximum or minimum value; the plot shows that in this case it is a minimum.
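A minimal R sketch of this idea: base R's optimize() (an illustrative choice, not part of the slides) locates the same minimum numerically.

# Find the minimum of y = x^2 - 4x + 5 numerically; the analytical answer is x = 2.
f <- function(x) x^2 - 4 * x + 5
opt <- optimize(f, interval = c(-10, 10))
opt$minimum    # approximately 2
opt$objective  # approximately 1 (the value of y at the minimum)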

Derivatives in regression

This concept of finding the minimum or maximum is crucial for regression analysis. With ordinary least squares (OLS), we estimate the β-coefficients by finding the minimum of the sum of squared errors. With maximum likelihood (ML), we estimate the β-coefficients by finding the maximum of the (log)likelihood function.

Derivatives of matrices

The derivative of a matrix is the matrix containing the derivatives of each element of the original matrix. For example:

$M = \begin{pmatrix} x^2 & 2x^3 & 2x+4 \\ 3x^3 & 2x-3 & 2 \\ 4x & 2x^2 & 0 \end{pmatrix} \qquad \frac{dM}{dx} = \begin{pmatrix} 2x & 6x^2 & 2 \\ 9x^2 & 2 & 0 \\ 4 & 4x & 0 \end{pmatrix}$

Derivatives of matrices

$\frac{d(v'x)}{dx} = v \qquad \frac{d(x'Ax)}{dx} = (A + A')x \qquad \frac{d(Bx)}{dx} = B \qquad \frac{d^2(x'Ax)}{dx\,dx'} = A + A'$
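A quick numerical sanity check of the quadratic-form rule, using a hand-rolled central finite difference (the matrix A and the point x are arbitrary illustrative values):

# Check d(x'Ax)/dx = (A + A')x at an arbitrary A and x.
A <- matrix(c(2, 1, 0, 3), nrow = 2)
x <- c(1, -2)
quad <- function(x) as.numeric(t(x) %*% A %*% x)
h <- 1e-6
num_grad <- sapply(1:2, function(j) {
  e <- replace(numeric(2), j, h)         # j-th unit vector scaled by h
  (quad(x + e) - quad(x - e)) / (2 * h)  # central difference
})
num_grad          # numerical gradient
(A + t(A)) %*% x  # analytical gradient, should match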


Preliminaries

Ordinary least squares and its variations (GLS, WLS, 2SLS, etc.) are very flexible and applicable in many circumstances, but for some models (e.g. limited dependent variable models) we need more flexible estimation procedures. The most prominent alternative estimators are maximum likelihood (ML) and Bayesian estimators. This lecture is about the former.

Logarithms

Some rules about logarithms, which are important for understanding ML:

$\log ab = \log a + \log b$
$\log e^a = a$
$\log a^b = b \log a$
$\log a > \log b \iff a > b$
$\frac{d(\log a)}{da} = \frac{1}{a}$
$\frac{d(\log f(a))}{da} = \frac{d(f(a))/da}{f(a)}$

Likelihood: intuition

Dice example. Imagine a die is thrown with an unknown number of sides k. We know after throwing the die that the outcome is 5. How likely is this outcome if k = 0? And k = 1? Continue until k = 10. A plot of these likelihoods against k is the likelihood function.
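A small R sketch of the dice example (the plotting choices are mine): the likelihood of observing a 5 as a function of the unknown number of sides k.

# P(outcome = 5 | k sides) is 1/k for k >= 5 and 0 otherwise.
k <- 0:10
lik <- ifelse(k >= 5, 1 / k, 0)
plot(k, lik, type = "h", xlab = "k (number of sides)", ylab = "likelihood")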

Likelihood

"Likelihood is the hypothetical probability that an event that has already occurred would yield a specific outcome. The concept differs from that of a probability in that a probability refers to the occurrence of future events, while a likelihood refers to past events with known outcomes." (Wolfram MathWorld)

Preliminaries

Looking first at just univariate models, we can express the model as the probability density function (PDF) $f(y, \theta)$, where y is the dependent variable, θ the set of parameters, and $f(\cdot)$ expresses the model specification. ML is purely parametric: we need to make strong assumptions about the shape of $f(\cdot)$ before we can estimate θ.

Likelihood

Given θ, $f(\cdot, \theta)$ is the PDF of y. Given y, $f(y, \cdot)$ cannot be interpreted as a PDF and is therefore called the likelihood function. $\hat\theta_{ML}$ is the value of θ that maximizes this likelihood function. (Davidson & MacKinnon 1999, 393-396)

(Log)likelihood function

Generally, observations are assumed to be independent, in which case the joint density of the entire sample is the product of the densities of the individual observations:

$f(y, \theta) = \prod_{i=1}^{n} f(y_i, \theta)$

It is easier to work with the log of this function, because sums are easier to deal with than products and the logarithmic transformation is monotonic:

$\ell(y, \theta) \equiv \log f(y, \theta) = \sum_{i=1}^{n} \log f_i(y_i, \theta) = \sum_{i=1}^{n} \ell_i(y_i, \theta)$

(Davidson & MacKinnon 1999, 393-396)

Example: exponential distribution

For example, take the PDF of y to be

$f(y_i, \theta) = \theta e^{-\theta y_i}, \qquad y_i > 0,\ \theta > 0$

This distribution is useful when modeling something which is by definition positive. (Davidson & MacKinnon 1999, 393-396)

Example: exponential distribution

Deriving the loglikelihood function:

$f(y, \theta) = \prod_{i=1}^{n} f(y_i, \theta) = \prod_{i=1}^{n} \theta e^{-\theta y_i}$

$\ell(y, \theta) = \sum_{i=1}^{n} \log f(y_i, \theta) = \sum_{i=1}^{n} \log(\theta e^{-\theta y_i}) = \sum_{i=1}^{n} (\log \theta - \theta y_i) = n \log \theta - \theta \sum_{i=1}^{n} y_i$

(Davidson & MacKinnon 1999, 393-396)

Maximum Likelihood Estimates

The estimate of θ is the value of θ which maximizes the (log)likelihood function $\ell(y, \theta)$. In some cases, such as the above exponential distribution or the linear model, we can find this value analytically, by taking the derivative and setting it equal to zero. In many other cases, there is no such analytical solution, and we use numerical search algorithms to find the value.

Example: exponential distribution

To find the maximum, we take the derivative and set this equal to zero:

$\ell(y, \theta) = n \log \theta - \theta \sum_{i=1}^{n} y_i$

$\frac{d\left(n \log \theta - \theta \sum_{i=1}^{n} y_i\right)}{d\theta} = \frac{n}{\theta} - \sum_{i=1}^{n} y_i$

$\frac{n}{\hat\theta_{ML}} - \sum_{i=1}^{n} y_i = 0 \quad\Rightarrow\quad \hat\theta_{ML} = \frac{n}{\sum_{i=1}^{n} y_i}$

(Davidson & MacKinnon 1999, 393-396)
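A short R sketch checking the analytical result against a numerical maximization (the simulated data and the use of optimize() are illustrative choices):

# Simulate exponential data with true theta = 2 and compare estimates.
set.seed(1)
y <- rexp(200, rate = 2)
n <- length(y)
n / sum(y)                                   # analytical ML estimate
ll <- function(theta) n * log(theta) - theta * sum(y)
optimize(ll, interval = c(0.01, 10), maximum = TRUE)$maximum  # numerical ML estimate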

Numerical methods

In many cases, closed-form solutions like the examples above do not exist and numerical methods are necessary. A computer algorithm searches for the value that maximizes the loglikelihood. In R: optim()

Newton's Method

$\theta_{(m+1)} = \theta_{(m)} - H_{(m)}^{-1} g_{(m)}$

where $g_{(m)}$ and $H_{(m)}$ are the gradient and Hessian of the loglikelihood evaluated at $\theta_{(m)}$ (see below). (Davidson & MacKinnon 1999, 401)
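A minimal R sketch of this update rule for the exponential example above (the starting value and tolerance are illustrative choices; convergence depends on the starting value):

# Newton's method for l(theta) = n log(theta) - theta * sum(y):
# gradient g = n/theta - sum(y), Hessian H = -n/theta^2.
newton_exp <- function(y, theta = 1, tol = 1e-8, maxit = 100) {
  n <- length(y)
  for (m in 1:maxit) {
    g <- n / theta - sum(y)
    H <- -n / theta^2
    step <- g / H          # H^{-1} g (a scalar here)
    theta <- theta - step
    if (abs(step) < tol) break
  }
  theta
}
# newton_exp(y) should agree with n / sum(y)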


Model: $y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I)$

We assume the elements of ε to be independently distributed and therefore, conditional on X, the elements of y are assumed to be independently distributed as well.

$f_i(y_i, \beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - X_i\beta)^2}{2\sigma^2}}$

(Davidson & MacKinnon 1999, 396-398)

Linear model: loglikelihood function

$f(y, \beta, \sigma^2) = \prod_{i=1}^{n} f_i(y_i, \beta, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - X_i\beta)^2}{2\sigma^2}}$

$\ell(y, \beta, \sigma^2) = \sum_{i=1}^{n} \log f_i(y_i, \beta, \sigma^2) = \sum_{i=1}^{n} \log\left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - X_i\beta)^2}{2\sigma^2}} \right)$
$= \sum_{i=1}^{n} \left( -\frac{1}{2} \log \sigma^2 - \frac{1}{2} \log 2\pi - \frac{1}{2\sigma^2} (y_i - X_i\beta)^2 \right)$
$= -\frac{n}{2} \log \sigma^2 - \frac{n}{2} \log 2\pi - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - X_i\beta)^2$
$= -\frac{n}{2} \log \sigma^2 - \frac{n}{2} \log 2\pi - \frac{1}{2\sigma^2} (y - X\beta)'(y - X\beta)$

Linear model: loglikelihood function

To zoom in on one part of this function:

$\log \frac{1}{\sqrt{2\pi\sigma^2}} = \log \frac{1}{\sigma\sqrt{2\pi}} = \log 1 - \log(\sigma\sqrt{2\pi}) = 0 - (\log \sigma + \log \sqrt{2\pi}) = -\frac{1}{2} \log \sigma^2 - \frac{1}{2} \log 2\pi,$

using $\log \sigma = \frac{1}{2} \log \sigma^2$ and $\log \sqrt{2\pi} = \frac{1}{2} \log 2\pi$.


Linear model: estimating σ²

$\frac{\partial \ell(y, \beta, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} (y - X\beta)'(y - X\beta)$

$0 = -\frac{n}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4} (y - X\beta)'(y - X\beta)$
$\frac{n}{2\hat\sigma^2} = \frac{1}{2\hat\sigma^4} (y - X\beta)'(y - X\beta)$
$n = \frac{1}{\hat\sigma^2} (y - X\beta)'(y - X\beta)$
$\hat\sigma^2 = \frac{1}{n} (y - X\beta)'(y - X\beta)$

Thus $\hat\sigma^2$ is a function of β. (Davidson & MacKinnon 1999, 396-398)

Linear model: estimating σ²

$\hat\sigma^2 = \frac{1}{n} (y - X\beta)'(y - X\beta) = \frac{e'e}{n}$

Note that this is almost equivalent to the OLS variance estimator $\frac{e'e}{n-k}$. It is a biased, but consistent, estimator of σ².

Linear model: estimating β

$\ell(y, \beta, \sigma^2) = -\frac{n}{2} \log \sigma^2 - \frac{n}{2} \log 2\pi - \frac{1}{2\sigma^2} (y - X\beta)'(y - X\beta)$

Substituting $\hat\sigma^2 = \frac{1}{n}(y - X\beta)'(y - X\beta)$ gives the concentrated loglikelihood:

$\ell(y, \beta) = -\frac{n}{2} \log\left( \frac{1}{n} (y - X\beta)'(y - X\beta) \right) - \frac{n}{2} \log 2\pi - \frac{(y - X\beta)'(y - X\beta)}{2 \left( \frac{1}{n} (y - X\beta)'(y - X\beta) \right)}$
$= -\frac{n}{2} \log\left( \frac{1}{n} (y - X\beta)'(y - X\beta) \right) - \frac{n}{2} \log 2\pi - \frac{n}{2}$

Because the only term that depends on β is minus n/2 times the log of the SSR (divided by n), maximizing $\ell(y, \beta)$ with respect to β is equivalent to minimizing the SSR: $\hat\beta_{ML} = \hat\beta_{OLS}$. (Davidson & MacKinnon 1999, 396-398)

$\ell(y, \beta, \sigma^2) = -\frac{n}{2} \log \sigma^2 - \frac{n}{2} \log 2\pi - \frac{1}{2\sigma^2} (y - X\beta)'(y - X\beta)$

ll <- function(par, X, y) {
  s2 <- exp(par[1])
  beta <- par[-1]
  n <- length(y)
  r <- y - X %*% beta
  -n/2 * log(s2) - n/2 * log(2*pi) - 1/(2*s2) * sum(r^2)
}
mlest <- optim(c(1, 0, 0, 0), ll, NULL, X, y,
               control = list(fnscale = -1), hessian = TRUE)

Some practical tricks: By default, optim() minimizes rather than maximizes, hence the need for fnscale = -1. To get $\hat\sigma^2$, we need to take the exponential of the first parameter; this reparametrization ensures that σ² is always positive.
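A usage sketch with simulated data (the data-generating values are assumptions of mine; ll() refers to the function defined above):

# Simulate y = X beta + e and recover the estimates.
set.seed(42)
n <- 200
X <- cbind(1, rnorm(n), rnorm(n))
y <- X %*% c(1, 2, -1) + rnorm(n, sd = 2)
mlest <- optim(c(1, 0, 0, 0), ll, NULL, X, y,
               control = list(fnscale = -1), hessian = TRUE)
exp(mlest$par[1])  # estimate of sigma^2 (true value 4)
mlest$par[-1]      # estimates of beta, roughly c(1, 2, -1)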


Gradient vector

The gradient vector or score vector is a vector with typical element

$g_j(y, \theta) \equiv \frac{\partial \ell(y, \theta)}{\partial \theta_j},$

i.e. the vector of partial derivatives of the loglikelihood function with respect to each parameter in θ. (Davidson & MacKinnon 1999, 400)

Hessian matrix

H(θ) is a k × k matrix with typical element

$h_{ij} = \frac{\partial^2 \ell(\theta)}{\partial \theta_i \partial \theta_j},$

i.e. the matrix of second derivatives of the loglikelihood function. Its asymptotic equivalent is $\mathcal{H}(\theta) \equiv \operatorname{plim}_{n \to \infty} \frac{1}{n} H(y, \theta)$. (Davidson & MacKinnon 1999, 401, 407)

Information matrix

$I(\theta) \equiv \sum_{i=1}^{n} E_\theta\left( (G_i(y, \theta))' \, G_i(y, \theta) \right)$

I(θ) is the information matrix, or the covariance matrix of the score vector. There is also an asymptotic equivalent, $\mathcal{I}(\theta) \equiv \operatorname{plim}_{n \to \infty} \frac{1}{n} I(\theta)$. Under standard regularity conditions $\mathcal{I}(\theta) = -\mathcal{H}(\theta)$: both measure the amount of curvature in the loglikelihood function. (Davidson & MacKinnon 1999, 406-407)
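A rough numerical illustration with the earlier exponential model (the simulation settings are mine): the variance of the per-observation scores matches minus the expected per-observation second derivative, both equal to $1/\theta^2$.

# For l_i = log(theta) - theta * y_i: score_i = 1/theta - y_i, h_i = -1/theta^2.
set.seed(1)
theta <- 2
y <- rexp(1e5, rate = theta)
g_i <- 1 / theta - y   # per-observation scores
var(g_i)               # approximately 1/theta^2 = 0.25
1 / theta^2            # minus the expected second derivative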


Under some weak conditions, ML estimators are consistent, asymptotically efficient and asymptotically normally distributed. New ML estimators thus do not require extensive proof of these properties: once an estimator is shown to be an ML estimator, it inherits them. (Davidson & MacKinnon 1999, 402-403)

Variance-covariance matrix of $\hat\theta_{ML}$

The asymptotic variance of $\sqrt{n}(\hat\theta_{ML} - \theta)$ is

$V\left( \sqrt{n}(\hat\theta_{ML} - \theta) \right) = \mathcal{H}^{-1}(\theta) \, \mathcal{I}(\theta) \, \mathcal{H}^{-1}(\theta) = \mathcal{I}^{-1}(\theta),$

the latter equality holding iff $\mathcal{I}(\theta) = -\mathcal{H}(\theta)$ is true. One estimator of the variance is $\hat V(\hat\theta_{ML}) = -H^{-1}(\hat\theta_{ML})$. For alternative estimators see the reference below. (Davidson & MacKinnon 1999, 409-411)

Estimating $\hat V(\hat\theta_{ML})$

mlest <- optim(..., hessian = TRUE)
V <- solve(-mlest$hessian)
sqrt(diag(V))

Exercise

Imagine we take a random sample of N colored balls from a vase, with replacement. In the vase, a fraction p of the balls are yellow and the rest are blue. In our sample, there are q yellow balls. Assuming that $y_i = 1$ if ball i is yellow and $y_i = 0$ if it is blue, the probability of obtaining our sample is

$P(q) = p^q (1 - p)^{N - q},$

which is the likelihood function with unknown parameter p. Write down the loglikelihood function.

$\ell(p) = \log\left( p^q (1 - p)^{N - q} \right) = q \log p + (N - q) \log(1 - p)$

(Verbeek 2008, 172-173)

Exercise

$\ell(p) = q \log p + (N - q) \log(1 - p)$

Take the derivative with respect to p:

$\frac{d\ell(p)}{dp} = \frac{q}{p} - \frac{N - q}{1 - p}$

Set equal to zero and find $\hat p_{ML}$:

$\frac{q}{\hat p} - \frac{N - q}{1 - \hat p} = 0 \quad\Rightarrow\quad \hat p = \frac{q}{N}$

(Verbeek 2008, 172-173)

Exercise

$\ell(p) = q \log p + (N - q) \log(1 - p)$

Using the loglikelihood function and optim(), find p for N = 100 and q = 30. ML estimates are invariant under reparametrization, so we optimize over an unconstrained parameter $p^*$ and use the logistic transform $p = \frac{1}{1 + e^{-p^*}}$ to avoid probabilities outside [0, 1].

ll <- function(p, N, q) {
  pstar <- 1 / (1 + exp(-p))
  q * log(pstar) + (N - q) * log(1 - pstar)
}
mlest <- optim(0, ll, NULL, N = 100, q = 30, hessian = TRUE,
               control = list(fnscale = -1))
phat <- 1 / (1 + exp(-mlest$par))

Exercise

Using the previous estimate, including the Hessian, calculate a 95% confidence interval around this $\hat p$.

se <- sqrt(-1 / mlest$hessian)
ci <- c(mlest$par - 1.96 * se, mlest$par + 1.96 * se)
ci <- 1 / (1 + exp(-ci))

Repeat for N = 1000, q = 300.

Exercise

Using the uswages.dta data, estimate the model

$\log(wage)_i = \beta_0 + \beta_1 educ_i + \beta_2 exper_i + \varepsilon_i$

with the lm() function and with the optim() function. Compare the estimates of β, σ² and $V(\hat\beta)$. Try different optimization algorithms in optim().


Within the ML framework, there are three common tests:

Likelihood ratio test (LR)
Wald test (W)
Lagrange multiplier test (LM)

These are asymptotically equivalent. Given r restrictions, all have asymptotically a $\chi^2(r)$ distribution. In the following, $\tilde\theta_{ML}$ will denote the restricted estimate and $\hat\theta_{ML}$ the unrestricted one. (Davidson & MacKinnon 1999, 414)

Likelihood ratio test

$LR = 2\left( \ell(\hat\theta_{ML}) - \ell(\tilde\theta_{ML}) \right) = 2 \log \frac{L(\hat\theta_{ML})}{L(\tilde\theta_{ML})},$

i.e. twice the log of the ratio of the likelihood functions. If we have both estimates, this is very easy to compute, or even to eyeball. (Davidson & MacKinnon 1999, 414-416)
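A small R sketch for the vase example from the exercises (the restriction p = 0.5 is a hypothetical choice of mine):

# LR test of p = 0.5 against the unrestricted estimate, N = 100, q = 30.
N <- 100; q <- 30
loglik <- function(p) q * log(p) + (N - q) * log(1 - p)
p_hat <- q / N                           # unrestricted estimate
p_0 <- 0.5                               # restricted value
LR <- 2 * (loglik(p_hat) - loglik(p_0))
LR                                       # about 16.5
pchisq(LR, df = 1, lower.tail = FALSE)   # p-value from chi-squared(1)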

Likelihood ratio test

[Figure: loglikelihood plotted against θ, marking the restricted and unrestricted estimates; the vertical distance between the two loglikelihood values is LR/2.]

Wald test

The basic intuition is that the Wald test is quite comparable to an F-test on a set of restrictions. For a single regression parameter the formula is

$W = (\hat\beta - \beta_0)^2 \left( -\frac{d^2 \ell(\beta)}{d\beta^2} \Big|_{\hat\beta} \right)$

This test does not require the restricted estimate $\tilde\theta_{ML}$. The test is somewhat sensitive to the formulation of the restrictions, so as long as estimating $\tilde\theta_{ML}$ is not too costly, it is better to use an LR or LM test. (Davidson & MacKinnon 1999, 416-418)
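Continuing the sketch above, the Wald statistic for the same hypothetical restriction:

# W = (p_hat - p_0)^2 * (minus the second derivative, evaluated at p_hat).
d2ll <- function(p) -q / p^2 - (N - q) / (1 - p)^2
W <- (p_hat - p_0)^2 * (-d2ll(p_hat))
W                                        # about 19
pchisq(W, df = 1, lower.tail = FALSE)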

Wald test

If the second derivative is larger in absolute value, the slope of $\ell(\cdot)$ changes faster, and thus the difference between the loglikelihoods at the restricted and unrestricted values will be larger.

Wald test

[Figure: loglikelihood plotted against θ, marking the restricted and unrestricted estimates.]

Lagrange multiplier test

For a single parameter:

$LM = \frac{\left( \frac{d\ell(\tilde\theta)}{d\theta} \right)^2}{-\frac{d^2 \ell(\tilde\theta)}{d\theta^2}}$

Thus the steeper the slope of the loglikelihood function at the restricted estimate, the more likely it is that the restricted estimate differs significantly from the unrestricted one, unless this slope changes very quickly (the second derivative is large).
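Continuing the same sketch, the LM statistic evaluated at the restricted value:

# LM = (score at p_0)^2 / (minus the second derivative at p_0).
dll <- function(p) q / p - (N - q) / (1 - p)
LM <- dll(p_0)^2 / (-d2ll(p_0))
LM                                       # exactly 16 here
pchisq(LM, df = 1, lower.tail = FALSE)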

Lagrange multiplier test

[Figure: loglikelihood plotted against θ, marking the restricted and unrestricted estimates; the slope at the restricted estimate drives the LM statistic.]