Advanced Quantitative Methods: maximum likelihood
Advanced Quantitative Methods: Maximum Likelihood. University College Dublin, March 23, 2011.
Introduction
Preliminaries

Ordinary least squares and its variations (GLS, WLS, 2SLS, etc.) are flexible and applicable in many circumstances, but for some models (e.g. limited dependent variable models) we need more flexible estimation procedures. The most prominent alternative estimators are maximum likelihood (ML) and Bayesian estimators. This lecture is about the former.
Logarithms

Some rules about logarithms, which are important for understanding ML:

\log(ab) = \log a + \log b
\log e^a = a
\log a^b = b \log a
\log a > \log b \iff a > b
\frac{d(\log a)}{da} = \frac{1}{a}
\frac{d(\log f(a))}{da} = \frac{df(a)/da}{f(a)}
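These rules are easy to verify numerically. A quick check in R (the values 3 and 7 are arbitrary):

```r
a <- 3; b <- 7
stopifnot(all.equal(log(a * b), log(a) + log(b)))  # log(ab) = log a + log b
stopifnot(all.equal(log(exp(a)), a))               # log e^a = a
stopifnot(all.equal(log(a^b), b * log(a)))         # log a^b = b log a
stopifnot(log(a) < log(b))                         # a < b implies log a < log b
h <- 1e-8                                          # numerical derivative of log at a
stopifnot(abs((log(a + h) - log(a)) / h - 1/a) < 1e-6)
```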
Likelihood: intuition

Dice example. Imagine a die is thrown with an unknown number of sides k. We know after throwing the die that the outcome is 5. How likely is this outcome if k = 1? And k = 2? Continue until k = 10. A plot of these likelihoods against k is the likelihood function.
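The dice example can be computed directly: given the observed outcome 5, the likelihood of a fair k-sided die is 1/k when k ≥ 5 and 0 otherwise. A minimal sketch in R:

```r
outcome <- 5
k <- 1:10
lik <- ifelse(k >= outcome, 1/k, 0)  # P(outcome = 5 | fair k-sided die)
round(lik, 3)
# the likelihood is maximized at k = 5, the smallest die consistent with the data
k[which.max(lik)]
```

Plotting lik against k, e.g. with plot(k, lik, type = "h"), gives the likelihood function from the slide.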
Likelihood

"Likelihood is the hypothetical probability that an event that has already occurred would yield a specific outcome. The concept differs from that of a probability in that a probability refers to the occurrence of future events, while a likelihood refers to past events with known outcomes." (Wolfram MathWorld)
Preliminaries

Looking first at just univariate models, we can express the model as the probability density function (PDF) f(y, θ), where y is the dependent variable, θ the set of parameters, and f(·) expresses the model specification. ML is purely parametric: we need to make strong assumptions about the shape of f(·) before we can estimate θ.
Likelihood

Given θ, f(·, θ) is the PDF of y. Given y, f(y, ·) cannot be interpreted as a PDF and is therefore called the likelihood function. The ML estimate θ̂_ML is the value of θ that maximizes this likelihood function. (Davidson & MacKinnon 1999)
(Log)likelihood function

Generally, observations are assumed to be independent, in which case the joint density of the entire sample is the product of the densities of the individual observations:

f(y, \theta) = \prod_{i=1}^n f(y_i, \theta)

It is easier to work with the log of this function, because sums are easier to deal with than products and the logarithmic transformation is monotonic:

\ell(y, \theta) \equiv \log f(y, \theta) = \sum_{i=1}^n \log f_i(y_i, \theta) = \sum_{i=1}^n \ell_i(y_i, \theta)

(Davidson & MacKinnon 1999)
Example: exponential distribution

For example, take the PDF of y to be

f(y_i, \theta) = \theta e^{-\theta y_i}, \quad y_i > 0, \; \theta > 0

This distribution is useful when modeling something which is by definition positive. (Davidson & MacKinnon 1999)
Example: exponential distribution

Deriving the loglikelihood function:

f(y, \theta) = \prod_{i=1}^n f(y_i, \theta) = \prod_{i=1}^n \theta e^{-\theta y_i}

\ell(y, \theta) = \sum_{i=1}^n \log f(y_i, \theta) = \sum_{i=1}^n \log(\theta e^{-\theta y_i}) = \sum_{i=1}^n (\log\theta - \theta y_i) = n \log\theta - \theta \sum_{i=1}^n y_i

(Davidson & MacKinnon 1999)
Maximum likelihood estimates

The estimate of θ is the value of θ which maximizes the (log)likelihood function ℓ(y, θ). In some cases, such as the above exponential distribution or the linear model, we can find this value analytically, by taking the derivative and setting it equal to zero. In many other cases, there is no such analytical solution, and we use numerical search algorithms to find the value.
Example: exponential distribution

To find the maximum, we take the derivative and set it equal to zero:

\ell(y, \theta) = n \log\theta - \theta \sum_{i=1}^n y_i

\frac{d\ell(y, \theta)}{d\theta} = \frac{n}{\theta} - \sum_{i=1}^n y_i

\frac{n}{\hat\theta_{ML}} - \sum_{i=1}^n y_i = 0 \quad\Rightarrow\quad \hat\theta_{ML} = \frac{n}{\sum_{i=1}^n y_i}

(Davidson & MacKinnon 1999)
Numerical methods

In many cases, closed-form solutions like the examples above do not exist and numerical methods are necessary: a computer algorithm searches for the value that maximizes the loglikelihood. In R: optim().
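As a sketch of how this works for the exponential example above, we can minimize the negative loglikelihood with optim() and compare the result with the analytical solution θ̂_ML = n / Σ yᵢ (the simulated data and the Brent method are assumptions for illustration):

```r
set.seed(1)
y <- rexp(200, rate = 2)   # simulated data, true theta = 2

# negative loglikelihood: -(n log theta - theta * sum(y))
negll <- function(theta, y) -(length(y) * log(theta) - theta * sum(y))

fit <- optim(1, negll, y = y, method = "Brent", lower = 1e-6, upper = 100)
fit$par              # numerical MLE
length(y) / sum(y)   # analytical MLE; the two agree
```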
Newton's method

\theta^{(m+1)} = \theta^{(m)} - H_{(m)}^{-1} g_{(m)}

where g is the gradient and H the Hessian of the loglikelihood, both evaluated at θ^(m). (Davidson & MacKinnon 1999, 401)
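For the scalar exponential example this iteration is easy to write out: g(θ) = n/θ − Σyᵢ and H(θ) = −n/θ², giving the update θ ← θ − g/H. A sketch in R (simulated data assumed):

```r
set.seed(1)
y <- rexp(200, rate = 2)
n <- length(y)

theta <- 1                    # starting value
for (m in 1:25) {
  g <- n / theta - sum(y)     # gradient (score)
  H <- -n / theta^2           # Hessian (second derivative)
  theta <- theta - g / H      # Newton step
}
theta             # converges to the analytical MLE n / sum(y)
```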
The normal linear model
Model:

y = X\beta + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I)

We assume ε to be independently distributed and therefore, conditional on X, y is assumed to be as well.

f_i(y_i, \beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y_i - X_i\beta)^2}{2\sigma^2}}

(Davidson & MacKinnon 1999)
Loglikelihood function

\ell(y, \beta, \sigma^2) = \sum_{i=1}^n \log f_i(y_i, \beta, \sigma^2) = \sum_{i=1}^n \log\left(\frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y_i - X_i\beta)^2}{2\sigma^2}}\right)
Loglikelihood function

To zoom in on one part of this function:

\log\frac{1}{\sqrt{2\pi\sigma^2}} = -\log(\sigma\sqrt{2\pi}) = -(\log\sigma + \log\sqrt{2\pi}) = -\tfrac{1}{2}\log\sigma^2 - \tfrac{1}{2}\log 2\pi

using \log\sigma = \tfrac{1}{2}\log\sigma^2.
Loglikelihood function

\ell(y, \beta, \sigma^2) = \sum_{i=1}^n \left(-\tfrac{1}{2}\log\sigma^2 - \tfrac{1}{2}\log 2\pi - \frac{(y_i - X_i\beta)^2}{2\sigma^2}\right)
= -\frac{n}{2}\log\sigma^2 - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - X_i\beta)^2
= -\frac{n}{2}\log\sigma^2 - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)
Estimating σ²

\frac{\partial \ell(y, \beta, \sigma^2)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta)

Setting this equal to zero:

0 = -\frac{n}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}(y - X\beta)'(y - X\beta)
\frac{n}{2\hat\sigma^2} = \frac{1}{2\hat\sigma^4}(y - X\beta)'(y - X\beta)
n = \frac{1}{\hat\sigma^2}(y - X\beta)'(y - X\beta)
\hat\sigma^2 = \frac{1}{n}(y - X\beta)'(y - X\beta)

Thus σ̂² is a function of β. (Davidson & MacKinnon 1999)
Estimating σ²

\hat\sigma^2 = \frac{1}{n}(y - X\beta)'(y - X\beta) = \frac{e'e}{n}

Note that this is almost the equivalent of the OLS variance estimator, e'e/(n - k). The ML estimator is a biased, but consistent, estimator of σ².
Estimating β

Substituting σ̂² = (1/n)(y − Xβ)'(y − Xβ) into the loglikelihood gives the concentrated loglikelihood:

\ell(y, \beta, \sigma^2) = -\frac{n}{2}\log\sigma^2 - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)

\ell(y, \beta) = -\frac{n}{2}\log\left(\frac{1}{n}(y - X\beta)'(y - X\beta)\right) - \frac{n}{2}\log 2\pi - \frac{(y - X\beta)'(y - X\beta)}{2 \cdot \frac{1}{n}(y - X\beta)'(y - X\beta)}
= -\frac{n}{2}\log\left(\frac{1}{n}(y - X\beta)'(y - X\beta)\right) - \frac{n}{2}\log 2\pi - \frac{n}{2}

Because the only term that depends on β is −n/2 times the log of the SSR (divided by n), maximizing ℓ(y, β) with respect to β is equivalent to minimizing the SSR: β̂_ML = β̂_OLS.
Creating some fake data:

  n <- 100
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  X <- cbind(1, x1, x2)
  y <- X %*% c(1, 2, .5) + rnorm(n, 0, .05)
\ell(y, \beta, \sigma^2) = -\frac{n}{2}\log\sigma^2 - \frac{n}{2}\log 2\pi - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)

  ll <- function(par, X, y) {
    s2 <- exp(par[1])     # first parameter is log(sigma^2)
    beta <- par[-1]
    n <- length(y)
    r <- y - X %*% beta
    n/2 * log(s2) + n/2 * log(2*pi) + 1/(2*s2) * sum(r^2)   # minus the loglikelihood
  }
  mlest <- optim(c(1, 0, 0, 0), ll, NULL, X, y, hessian=TRUE)
Some practical tricks: By default, optim() minimizes rather than maximizes, so we implemented −ℓ(·) rather than ℓ(·). To get σ̂², we need to take the exponential of the first parameter; this parameterization ensures that σ² is always positive.
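Putting the pieces together, a self-contained sketch that re-generates the fake data, runs optim() (using the BFGS method here for a tighter optimum; an assumption, not from the slides) and compares the results with lm():

```r
set.seed(42)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
X <- cbind(1, x1, x2)
y <- X %*% c(1, 2, .5) + rnorm(n, 0, .05)

ll <- function(par, X, y) {          # minus the loglikelihood
  s2 <- exp(par[1]); beta <- par[-1]
  r <- y - X %*% beta
  length(y)/2 * log(s2) + length(y)/2 * log(2*pi) + sum(r^2) / (2*s2)
}
mlest <- optim(c(1, 0, 0, 0), ll, NULL, X = X, y = y,
               method = "BFGS", hessian = TRUE)

beta_ml <- mlest$par[-1]             # ML coefficient estimates
s2_ml <- exp(mlest$par[1])           # ML estimate of sigma^2 (note the exponent)
ols <- lm(y ~ x1 + x2)
cbind(ML = beta_ml, OLS = coef(ols)) # nearly identical
# sigma^2: the ML estimate equals e'e/n; the usual OLS estimator divides by n - k
c(s2_ml, sum(residuals(ols)^2) / n, sum(residuals(ols)^2) / (n - 3))
```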
Gradient, Hessian and information matrix
Gradient vector

The gradient vector or score vector is a vector with typical element

g_j(y, \theta) \equiv \frac{\partial \ell(y, \theta)}{\partial \theta_j},

i.e. the vector of partial derivatives of the loglikelihood function with respect to each parameter in θ. (Davidson & MacKinnon 1999, 400)
Hessian matrix

H(θ) is a k × k matrix with typical element

h_{ij} = \frac{\partial^2 \ell(\theta)}{\partial \theta_i \, \partial \theta_j},

i.e. the matrix of second derivatives of the loglikelihood function. Asymptotic equivalent: \mathcal{H}(\theta) \equiv \operatorname{plim}_{n\to\infty} \frac{1}{n} H(y, \theta). (Davidson & MacKinnon 1999, 401, 407)
Information matrix

I(\theta) = \sum_{i=1}^n E_\theta\!\left(G_i(y, \theta)' \, G_i(y, \theta)\right)

I(θ) is the information matrix, or the covariance matrix of the score vector. There is also an asymptotic equivalent, \mathcal{I}(\theta) \equiv \operatorname{plim}_{n\to\infty} \frac{1}{n} I(\theta). Under standard regularity conditions, \mathcal{I}(\theta) = -\mathcal{H}(\theta): both measure the amount of curvature in the loglikelihood function. (Davidson & MacKinnon 1999)
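The information matrix equality can be checked by simulation for the exponential example: at the true θ the per-observation score is 1/θ − yᵢ, the second derivative is −1/θ², and the variance of the score should equal 1/θ² (a sketch, with θ = 2 chosen for illustration):

```r
set.seed(1)
theta <- 2
y <- rexp(1e5, rate = theta)

score <- 1/theta - y    # per-observation score d log f / d theta at the true theta
mean(score)             # approximately 0 (the expected score is zero)
var(score)              # approximately 1/theta^2 = 0.25 = minus the expected Hessian
```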
Properties of ML estimators
Under some weak conditions, ML estimators are consistent, asymptotically efficient and asymptotically normally distributed. New ML estimators thus do not require extensive proof of these properties: once an estimator is shown to be an ML estimator, it inherits them. (Davidson & MacKinnon 1999)
Variance-covariance matrix of θ̂_ML

V\!\left(\operatorname{plim}_{n\to\infty} \sqrt{n}(\hat\theta_{ML} - \theta)\right) = \mathcal{H}^{-1}(\theta)\,\mathcal{I}(\theta)\,\mathcal{H}^{-1}(\theta) = \mathcal{I}^{-1}(\theta),

the latter equality being true iff \mathcal{I}(\theta) = -\mathcal{H}(\theta) holds. One estimator of the variance is \hat V(\hat\theta_{ML}) = -H^{-1}(\hat\theta_{ML}). For alternative estimators see the reference below. (Davidson & MacKinnon 1999)
Estimating V(θ̂_ML)

  mlest <- optim(..., hessian=TRUE)
  V <- solve(mlest$hessian)
  sqrt(diag(V))

Note that because we implemented −ℓ(·) instead of ℓ(·), optim() returns the Hessian of −ℓ(·), i.e. −H, which is what we invert.
Test statistics
Within the ML framework, there are three common tests:

Likelihood ratio test (LR)
Wald test (W)
Lagrange multiplier test (LM)

These are asymptotically equivalent. Given r restrictions, all have asymptotically a χ²(r) distribution. In the following, θ̃_ML will denote the restricted estimate and θ̂_ML the unrestricted one. (Davidson & MacKinnon 1999, 414)
Likelihood ratio test

LR = 2\left(\ell(\hat\theta_{ML}) - \ell(\tilde\theta_{ML})\right) = 2 \log \frac{L(\hat\theta_{ML})}{L(\tilde\theta_{ML})},

thus twice the log of the ratio of the likelihood functions. If we have the two separate estimates, this is very easy to compute or even eye-ball. (Davidson & MacKinnon 1999)
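For instance, for the vase exercise below (N = 100 draws, q = 30 yellow), an LR test of H₀: p = 0.5 against the unrestricted MLE p̂ = q/N can be computed directly from the two loglikelihoods (a sketch):

```r
N <- 100; q <- 30
loglik <- function(p) q * log(p) + (N - q) * log(1 - p)

p_hat <- q / N       # unrestricted MLE
p_0   <- 0.5         # restricted (hypothesized) value
LR <- 2 * (loglik(p_hat) - loglik(p_0))
LR                                       # about 16.5, well above qchisq(.95, 1) = 3.84
pchisq(LR, df = 1, lower.tail = FALSE)   # p-value: reject H0
```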
[Figure: loglikelihood plotted against θ, marking the restricted and unrestricted estimates; the vertical distance between the two loglikelihood values equals LR/2.]
Wald test

The basic intuition is that the Wald test is quite comparable to an F-test on a set of restrictions. For a single regression parameter the formula would be:

W = (\hat\beta - \beta_0)^2 \left(-\frac{d^2 \ell(\hat\beta)}{d\beta^2}\right)

This test does not require an estimate of the restricted model θ̃_ML. The test is somewhat sensitive to the formulation of the restrictions r, so as long as estimating θ̃_ML is not too costly, it is better to use an LR or LM test. (Davidson & MacKinnon 1999)
Wald test

If the second derivative is larger in absolute value, the slope of ℓ(·) changes faster, and thus the difference between the loglikelihoods at the restricted and unrestricted positions will be larger.
[Figure: loglikelihood plotted against θ, marking the restricted and unrestricted estimates.]
Lagrange multiplier test

For a single parameter:

LM = \left(\frac{d\ell(\tilde\beta)}{d\beta}\right)^2 \left(-\frac{d^2 \ell(\tilde\beta)}{d\beta^2}\right)^{-1}

Thus the steeper the slope of the loglikelihood function at the restricted estimate, the more likely it is significantly different from the unrestricted estimate, unless this slope changes very quickly (the second derivative is large in absolute value).
[Figure: loglikelihood plotted against θ, marking the restricted and unrestricted estimates; the slope at the restricted estimate drives the LM statistic.]
Exercise

Imagine we take a random sample of N colored balls from a vase, with replacement. In the vase, a fraction p of the balls are yellow and the others blue. In our sample, there are q yellow balls. Coding y_i = 1 if ball i is yellow and y_i = 0 if blue, the probability of obtaining our sample is

L(p) = p^q (1 - p)^{N - q},

which is the likelihood function with unknown parameter p. Write down the loglikelihood function:

\ell(p) = \log\left(p^q (1 - p)^{N - q}\right) = q \log p + (N - q) \log(1 - p)

(Verbeek 2008)
Exercise

\ell(p) = q \log p + (N - q) \log(1 - p)

Take the derivative with respect to p:

\frac{d\ell(p)}{dp} = \frac{q}{p} - \frac{N - q}{1 - p}

Set equal to zero and find p̂_ML:

\frac{q}{\hat p} - \frac{N - q}{1 - \hat p} = 0 \quad\Rightarrow\quad \hat p = \frac{q}{N}

(Verbeek 2008)
Exercise

\ell(p) = q \log p + (N - q) \log(1 - p)

Using the loglikelihood function and optim(), find p̂ for N = 100 and q = 30. (Note that log(0) is not possible.)

  ll <- function(p, N, q) {
    if (p > 0 && p < 1) {
      -q * log(p) - (N - q) * log(1 - p)
    } else {
      Inf    # outside (0, 1) the loglikelihood is undefined
    }
  }
  mlest <- optim(0.5, ll, NULL, N=100, q=30, hessian=TRUE)
Exercise

Using the previous estimate, including the Hessian, calculate a 95% confidence interval around this p̂:

  mlest <- optim(0.5, ll, NULL, N=100, q=30, hessian=TRUE)
  se <- sqrt(1/mlest$hessian)
  ci <- c(mlest$par - 1.96 * se, mlest$par + 1.96 * se)

Repeat for N = 1000, q = 300.
Exercise

Using the eaef01.dta data, estimate the model

\log(EARNINGS_i) = \beta_0 + \beta_1 S_i + \beta_2 LIBRARY_i + \varepsilon_i

with the lm() function and with the optim() function. Compare the estimates of β, σ² and V(β̂). Try different optimization algorithms in optim().
Maura Department of Economics and Finance Università Tor Vergata Hypothesis Testing Outline It is a mistake to confound strangeness with mystery Sherlock Holmes A Study in Scarlet Outline 1 The Power Function
More informationSampling distribution of GLM regression coefficients
Sampling distribution of GLM regression coefficients Patrick Breheny February 5 Patrick Breheny BST 760: Advanced Regression 1/20 Introduction So far, we ve discussed the basic properties of the score,
More informationLinear Methods for Prediction
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this
More informationsimple if it completely specifies the density of x
3. Hypothesis Testing Pure significance tests Data x = (x 1,..., x n ) from f(x, θ) Hypothesis H 0 : restricts f(x, θ) Are the data consistent with H 0? H 0 is called the null hypothesis simple if it completely
More informationChapter 4. Theory of Tests. 4.1 Introduction
Chapter 4 Theory of Tests 4.1 Introduction Parametric model: (X, B X, P θ ), P θ P = {P θ θ Θ} where Θ = H 0 +H 1 X = K +A : K: critical region = rejection region / A: acceptance region A decision rule
More informationLecture 17: Likelihood ratio and asymptotic tests
Lecture 17: Likelihood ratio and asymptotic tests Likelihood ratio When both H 0 and H 1 are simple (i.e., Θ 0 = {θ 0 } and Θ 1 = {θ 1 }), Theorem 6.1 applies and a UMP test rejects H 0 when f θ1 (X) f
More informationECONOMETRICS I. Assignment 5 Estimation
ECONOMETRICS I Professor William Greene Phone: 212.998.0876 Office: KMC 7-90 Home page: people.stern.nyu.edu/wgreene Email: wgreene@stern.nyu.edu Course web page: people.stern.nyu.edu/wgreene/econometrics/econometrics.htm
More informationLecture 3 September 1
STAT 383C: Statistical Modeling I Fall 2016 Lecture 3 September 1 Lecturer: Purnamrita Sarkar Scribe: Giorgio Paulon, Carlos Zanini Disclaimer: These scribe notes have been slightly proofread and may have
More informationSolutions for Econometrics I Homework No.3
Solutions for Econometrics I Homework No3 due 6-3-15 Feldkircher, Forstner, Ghoddusi, Pichler, Reiss, Yan, Zeugner April 7, 6 Exercise 31 We have the following model: y T N 1 X T N Nk β Nk 1 + u T N 1
More informationThe Expectation-Maximization Algorithm
1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable
More informationSTA216: Generalized Linear Models. Lecture 1. Review and Introduction
STA216: Generalized Linear Models Lecture 1. Review and Introduction Let y 1,..., y n denote n independent observations on a response Treat y i as a realization of a random variable Y i In the general
More information2.1 Linear regression with matrices
21 Linear regression with matrices The values of the independent variables are united into the matrix X (design matrix), the values of the outcome and the coefficient are represented by the vectors Y and
More informationParametric Techniques
Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure
More informationSpatial Regression. 9. Specification Tests (1) Luc Anselin. Copyright 2017 by Luc Anselin, All Rights Reserved
Spatial Regression 9. Specification Tests (1) Luc Anselin http://spatial.uchicago.edu 1 basic concepts types of tests Moran s I classic ML-based tests LM tests 2 Basic Concepts 3 The Logic of Specification
More informationST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples
ST3241 Categorical Data Analysis I Generalized Linear Models Introduction and Some Examples 1 Introduction We have discussed methods for analyzing associations in two-way and three-way tables. Now we will
More informationSTAT5044: Regression and Anova
STAT5044: Regression and Anova Inyoung Kim 1 / 15 Outline 1 Fitting GLMs 2 / 15 Fitting GLMS We study how to find the maxlimum likelihood estimator ˆβ of GLM parameters The likelihood equaions are usually
More informationDefault priors and model parametrization
1 / 16 Default priors and model parametrization Nancy Reid O-Bayes09, June 6, 2009 Don Fraser, Elisabeta Marras, Grace Yun-Yi 2 / 16 Well-calibrated priors model f (y; θ), F(y; θ); log-likelihood l(θ)
More informationMaster s Written Examination
Master s Written Examination Option: Statistics and Probability Spring 016 Full points may be obtained for correct answers to eight questions. Each numbered question which may have several parts is worth
More informationMA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7
MA 575 Linear Models: Cedric E. Ginestet, Boston University Midterm Review Week 7 1 Random Vectors Let a 0 and y be n 1 vectors, and let A be an n n matrix. Here, a 0 and A are non-random, whereas y is
More informationRegression. ECO 312 Fall 2013 Chris Sims. January 12, 2014
ECO 312 Fall 2013 Chris Sims Regression January 12, 2014 c 2014 by Christopher A. Sims. This document is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License What
More informationLikelihood Inference in the Presence of Nuisance Parameters
PHYSTAT2003, SLAC, September 8-11, 2003 1 Likelihood Inference in the Presence of Nuance Parameters N. Reid, D.A.S. Fraser Department of Stattics, University of Toronto, Toronto Canada M5S 3G3 We describe
More informationMS&E 226: Small Data
MS&E 226: Small Data Lecture 15: Examples of hypothesis tests (v5) Ramesh Johari ramesh.johari@stanford.edu 1 / 32 The recipe 2 / 32 The hypothesis testing recipe In this lecture we repeatedly apply the
More informationGaussian Processes 1. Schedule
1 Schedule 17 Jan: Gaussian processes (Jo Eidsvik) 24 Jan: Hands-on project on Gaussian processes (Team effort, work in groups) 31 Jan: Latent Gaussian models and INLA (Jo Eidsvik) 7 Feb: Hands-on project
More informationA732: Exercise #7 Maximum Likelihood
A732: Exercise #7 Maximum Likelihood Due: 29 Novemer 2007 Analytic computation of some one-dimensional maximum likelihood estimators (a) Including the normalization, the exponential distriution function
More informationMultiple regression. CM226: Machine Learning for Bioinformatics. Fall Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar
Multiple regression CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar Multiple regression 1 / 36 Previous two lectures Linear and logistic
More information2018 2019 1 9 sei@mistiu-tokyoacjp http://wwwstattu-tokyoacjp/~sei/lec-jhtml 11 552 3 0 1 2 3 4 5 6 7 13 14 33 4 1 4 4 2 1 1 2 2 1 1 12 13 R?boxplot boxplotstats which does the computation?boxplotstats
More informationLikelihood Inference in the Presence of Nuisance Parameters
Likelihood Inference in the Presence of Nuance Parameters N Reid, DAS Fraser Department of Stattics, University of Toronto, Toronto Canada M5S 3G3 We describe some recent approaches to likelihood based
More information13. October p. 1
Lecture 8 STK3100/4100 Linear mixed models 13. October 2014 Plan for lecture: 1. The lme function in the nlme library 2. Induced correlation structure 3. Marginal models 4. Estimation - ML and REML 5.
More informationGov 2001: Section 4. February 20, Gov 2001: Section 4 February 20, / 39
Gov 2001: Section 4 February 20, 2013 Gov 2001: Section 4 February 20, 2013 1 / 39 Outline 1 The Likelihood Model with Covariates 2 Likelihood Ratio Test 3 The Central Limit Theorem and the MLE 4 What
More informationInference about the Indirect Effect: a Likelihood Approach
Discussion Paper: 2014/10 Inference about the Indirect Effect: a Likelihood Approach Noud P.A. van Giersbergen www.ase.uva.nl/uva-econometrics Amsterdam School of Economics Department of Economics & Econometrics
More informationStatement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:.
MATHEMATICAL STATISTICS Homework assignment Instructions Please turn in the homework with this cover page. You do not need to edit the solutions. Just make sure the handwriting is legible. You may discuss
More informationMaximum Likelihood Estimation
Chapter 8 Maximum Likelihood Estimation 8. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function
More informationWeek 7: Binary Outcomes (Scott Long Chapter 3 Part 2)
Week 7: (Scott Long Chapter 3 Part 2) Tsun-Feng Chiang* *School of Economics, Henan University, Kaifeng, China April 29, 2014 1 / 38 ML Estimation for Probit and Logit ML Estimation for Probit and Logit
More information