Likelihood & Maximum Likelihood Estimators
February 26, 2018
Debdeep Pati

1 Likelihood

Likelihood is surely one of the most important concepts in statistical theory. We have seen the role it plays in sufficiency, through the factorization theorem. More importantly, the likelihood function establishes a preference among the possible parameter values given data $X = x$: a parameter value $\theta_1$ with larger likelihood is better than a parameter value $\theta_2$ with smaller likelihood, in the sense that the model $P_{\theta_1}$ provides a better fit to the observed data than $P_{\theta_2}$. This leads naturally to procedures for inference which select, as a point estimator, the parameter value that makes the likelihood largest, or which reject a null hypothesis if the hypothesized value has likelihood that is too small. Estimators and hypothesis tests based on the likelihood function have some generally desirable properties; in particular, there are widely applicable large-sample approximations of the relevant sampling distributions. A main focus of this chapter is the mostly rigorous derivation of these important results.

1.1 Likelihood function

Consider a class of probability models $\{P_\theta : \theta \in \Theta\}$, defined on the measurable space $(\mathcal{X}, \mathcal{A})$ and absolutely continuous with respect to a dominating $\sigma$-finite measure $\mu$. In this case, for each $\theta$, the Radon-Nikodym derivative $(dP_\theta/d\mu)(x)$ is the usual probability density function for the observable $X$, written $p_\theta(x)$. For fixed $\theta$, we know that $p_\theta(x)$ characterizes the sampling distribution of $X$, as well as that of any statistic $T = T(X)$. But how do we use and interpret $p_\theta(x)$ as a function of $\theta$ for fixed $x$? This is a special function with its own name: the likelihood function.

Definition 1. Given $X = x$, the likelihood function is $L(\theta) = p_\theta(x)$.

The intuition behind the choice of name is that a $\theta$ for which $L(\theta)$ is large is more likely to be the true value than a $\theta'$ for which $L(\theta')$ is small. The name likelihood was coined by Fisher (1973):

    What has now appeared is that the mathematical concept of probability is... inadequate to express our mental confidence or indifference in making... inferences, and that the mathematical quantity which usually appears to be appropriate for measuring our order of preference among different possible populations does not in fact obey the laws of probability. To distinguish it from probability, I have used the term likelihood to designate this quantity; since both the words likelihood and probability are loosely used in common speech to cover both kinds of relationship.

Fisher's point is that $L(\theta)$ is a measure of how plausible $\theta$ is, but that this measure of plausibility is different from our usual understanding of probability; see Aldrich (1997) for more on Fisher and likelihood. While we understand the probability (density) $p_\theta(x)$, for fixed $\theta$, as a pre-experimental summary of our uncertainty about where $X$ will fall, the likelihood $L(\theta) = p_\theta(x)$, for fixed $x$, gives a post-experimental summary of how likely it is that the model $P_\theta$ produced the observed $X = x$. In other words, the likelihood function provides a ranking of the possible parameter values: those $\theta$ with greater likelihood are better, in that they fit the data better, than those $\theta$ with smaller likelihood. Therefore, only the shape of the likelihood function is relevant, not the scale. The likelihood function is useful across all approaches to statistics. We have already seen some uses of the likelihood function; in particular, the factorization theorem states that the (shape of the) likelihood function depends on the observed data $X = x$ only through the sufficient statistic. The next section discusses some standard and some not-so-standard uses of the likelihood.

2 Maximum Likelihood Estimation and first order theory

Assume $X \sim P_\theta$, $\theta \in \Theta$, with joint pdf (or pmf) $f(x \mid \theta)$. Suppose we observe $X = x$. The likelihood function is $L(\theta \mid x) = f(x \mid \theta)$, viewed as a function of $\theta$ with the data $x$ held fixed. The likelihood function $L(\theta \mid x)$ and the joint pdf $f(x \mid \theta)$ are the same except that $f(x \mid \theta)$ is generally viewed as a function of $x$ with $\theta$ held fixed, and $L(\theta \mid x)$ as a function of $\theta$ with $x$ held fixed. $f(x \mid \theta)$ is a density in $x$ for each fixed $\theta$, but $L(\theta \mid x)$ is not a density (or mass function) in $\theta$ for fixed $x$ (except by coincidence).

Given a class of potential models $P_\theta$ indexed by $\Theta$, a subset of $\mathbb{R}^d$, we observe $X = x$ and would like to know which model is the most likely to have produced this $x$. This defines an optimization problem, and the result, namely
$$\hat\theta = \arg\max_{\theta \in \Theta} L(\theta), \qquad (1)$$
is the maximum likelihood estimate (MLE) of $\theta$. Naturally, $P_{\hat\theta}$ is then considered the most likely model; that is, among the class $\{P_\theta : \theta \in \Theta\}$, the model $P_{\hat\theta}$ provides the best fit to the observed $X = x$. In terms of the ranking intuition, $\hat\theta$ is ranked the highest.

When the likelihood function is smooth, the optimization problem can be posed as a root-finding problem. That is, the MLE $\hat\theta$ can be viewed as a solution to the equation
$$\nabla l(\theta) = 0, \qquad (2)$$
where $\nabla$ denotes the gradient operator, $l = \log L$ is the log-likelihood, and the right-hand side is a $d$-vector of zeroes. Equation (2) is called the likelihood equation. In many examples, the solution of this equation is available in closed form; when it is not, we resort to numerical optimization tools.
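
As a concrete illustration of (1), here is a minimal R sketch (the Bernoulli model, data, and sample size are assumptions chosen for illustration, not part of the notes): it evaluates the Bernoulli likelihood on a grid and picks out the maximizer, which agrees with the closed-form MLE, the sample proportion.

## Minimal sketch (assumed Bernoulli example): L(theta) = prod theta^x_i (1-theta)^(1-x_i)
set.seed(1)
x <- rbinom(20, size = 1, prob = 0.3)    # hypothetical data
theta.grid <- seq(0.001, 0.999, by = 0.001)
loglik <- sapply(theta.grid,
                 function(th) sum(dbinom(x, 1, th, log = TRUE)))
theta.grid[which.max(loglik)]            # grid maximizer of the likelihood
mean(x)                                  # closed-form MLE, the sample proportion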

2.1 A bit about computation

Newton's method is a simple and powerful tool for doing optimization or, more precisely, root finding. You should be familiar with this method from a calculus course. The idea is based on the fact that, locally, any differentiable function can be suitably approximated by a linear function. This linear function is then used to define a recursive procedure that will, under suitable conditions, eventually find the desired solution to (2). The MLE is a solution to this equation, i.e., a root of the gradient of the log-likelihood function.

Assume that the gradient $\nabla l(\theta)$ is itself differentiable, and let $D(\theta)$ denote its matrix of derivatives, i.e., $D(\theta)_{ij} = (\partial^2/\partial\theta_i\partial\theta_j)\, l(\theta)$. Assume $D(\theta)$ is non-singular for all $\theta$. The idea behind Newton's method is as follows. Pick some initial guess $\theta^{(0)}$ of the MLE $\hat\theta$, and approximate $\nabla l(\theta)$ by a linear function:
$$\nabla l(\theta) = \nabla l(\theta^{(0)}) + D(\theta^{(0)})(\theta - \theta^{(0)}) + \text{error}.$$
Ignore the error, solve for $\theta$, and call the solution $\theta^{(1)}$:
$$\theta^{(1)} = \theta^{(0)} - D(\theta^{(0)})^{-1} \nabla l(\theta^{(0)}).$$
If $\theta^{(0)}$ is close to the solution of the likelihood equation, then so is $\theta^{(1)}$. The idea is to iterate this process until the solutions converge: pick a reasonable starting value $\theta^{(0)}$ and, at iteration $t \ge 0$, set
$$\theta^{(t+1)} = \theta^{(t)} - D(\theta^{(t)})^{-1} \nabla l(\theta^{(t)}),$$
stopping when $t$ is large and/or $\|\theta^{(t+1)} - \theta^{(t)}\|$ is small.

There are many tools available for doing optimization; the Newton method described above is just one simple approach. Fortunately, good implementations of these methods are already available in standard software. For example, the routine optim in R is a powerful and simple-to-use tool for generic optimization. For problems of a certain form, specifically problems that can be written in a latent-variable form, there is a very clever tool called the EM algorithm (e.g., Dempster et al. 1977) for maximizing the likelihood. Section 9.6 in Keener (2010) gives some description of this method.
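
To make the iteration concrete, here is a minimal R sketch of the Newton update for a one-parameter problem: the gamma shape $\alpha$ with the scale $\beta_0$ known (a setting treated later in these notes as Situation #2). The simulated data, starting value, and tolerance are assumptions for illustration; the cross-check with optim is just one way to confirm the answer.

## Newton iteration a <- a - l'(a)/l''(a) for the gamma shape with known scale beta0.
set.seed(2)
beta0 <- 2
x  <- rgamma(100, shape = 3, scale = beta0)                 # hypothetical data
n  <- length(x); T1 <- sum(log(x))
score  <- function(a) T1 - n*log(beta0) - n*digamma(a)      # l'(alpha)
dscore <- function(a) -n*trigamma(a)                        # l''(alpha)
a <- mean(x)/beta0                                          # crude starting value
for (t in 1:50) {
  step <- score(a)/dscore(a)
  a <- a - step
  if (abs(step) < 1e-10) break
}
a   # Newton solution of the likelihood equation for alpha
## cross-check with a generic optimizer (optim minimizes, so negate the log-likelihood)
optim(mean(x)/beta0,
      function(a) -(sum((a-1)*log(x)) - sum(x)/beta0 - n*a*log(beta0) - n*lgamma(a)),
      method = "Brent", lower = 1e-6, upper = 100)$par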

2.2 The Maximum Likelihood Estimator (MLE)

A point estimator $\hat\theta = \hat\theta(x)$ is an MLE for $\theta$ if
$$L(\hat\theta \mid x) = \sup_{\theta} L(\theta \mid x),$$
that is, $\hat\theta$ maximizes the likelihood. In most cases the maximum is achieved at a unique value, and we can refer to the MLE and write $\hat\theta(x) = \arg\max_\theta L(\theta \mid x)$. (But there are cases where the likelihood has flat spots and the MLE is not unique.)

2.3 Motivation for MLEs

Note: We often write $L(\theta \mid x) = L(\theta)$, suppressing $x$, which is kept fixed at the observed data. Suppose $x \in \mathbb{R}^n$.

Discrete case: If $f(\cdot \mid \theta)$ is a mass function ($X$ is discrete), then $L(\theta) = f(x \mid \theta) = P_\theta(X = x)$, i.e., $L(\theta)$ is the probability of getting the observed data $x$ when the parameter value is $\theta$.

Continuous case: When $f(\cdot \mid \theta)$ is a continuous density, $P_\theta(X = x) = 0$, but if $B \subset \mathbb{R}^n$ is a very small ball (or cube) centered at the observed data $x$, then
$$P_\theta(X \in B) \approx f(x \mid \theta)\,\mathrm{Volume}(B) \propto L(\theta).$$
So $L(\theta)$ is proportional to the probability that the random data $X$ will be close to the observed data $x$ when the parameter value is $\theta$.

Thus, the MLE $\hat\theta$ is the value of $\theta$ which makes the observed data $x$ most probable. To find $\hat\theta$, we maximize $L(\theta)$. This is usually done by calculus (finding a stationary point), but not always. If the parameter space $\Theta$ contains endpoints or boundary points, the maximum can be achieved at a boundary point without being a stationary point. If $L(\theta)$ is not smooth (continuous and everywhere differentiable), the maximum need not be achieved at a stationary point.

Cautionary Example: Suppose $X_1, \ldots, X_n$ are iid Uniform$(0, \theta)$ and $\Theta = (0, \infty)$. Given data $x = (x_1, \ldots, x_n)$, find the MLE for $\theta$. Here
$$L(\theta) = \prod_{i=1}^n \theta^{-1} I(0 < x_i < \theta) = \theta^{-n} I(0 \le x_{(1)}) I(x_{(n)} \le \theta)
= \begin{cases} \theta^{-n} & \text{for } \theta \ge x_{(n)} \\ 0 & \text{for } 0 < \theta < x_{(n)}, \end{cases}$$
which is maximized at $\theta = x_{(n)}$, a point of discontinuity and certainly not a stationary point. Thus, the MLE is $\hat\theta = x_{(n)}$.

Notes: $L(\theta) = 0$ for $\theta < x_{(n)}$ is just saying that these values of $\theta$ are absolutely ruled out by the data (which is obvious). A strange property of the MLE in this example (not typical): $P_\theta(\hat\theta < \theta) = 1$. The MLE is biased; it is always less than the true value.
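
A quick simulation (sample size, true $\theta$, and number of replications are assumptions for illustration) shows both facts at once: the MLE is the sample maximum, and it always underestimates $\theta$.

## Minimal sketch for the Uniform(0, theta) example.
set.seed(3)
theta <- 5; n <- 50
mle <- replicate(5000, max(runif(n, 0, theta)))  # MLE = x_(n) for each simulated sample
mean(mle < theta)    # equals 1: the MLE is always below the true value
mean(mle)            # roughly n/(n+1) * theta, showing the downward bias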

A Similar Example: Let $X_1, \ldots, X_n$ be iid Uniform$(\alpha, \beta)$ and $\Theta = \{(\alpha, \beta) : \alpha < \beta\}$. Given data $x = (x_1, \ldots, x_n)$, find the MLE for $\theta = (\alpha, \beta)$. Here
$$L(\alpha, \beta) = \prod_{i=1}^n (\beta - \alpha)^{-1} I(\alpha < x_i < \beta) = (\beta - \alpha)^{-n} I(\alpha \le x_{(1)}) I(x_{(n)} \le \beta)
= \begin{cases} (\beta - \alpha)^{-n} & \text{for } \alpha \le x_{(1)},\ x_{(n)} \le \beta \\ 0 & \text{otherwise}, \end{cases}$$
which is maximized by making $\beta - \alpha$ as small as possible without entering the "0 otherwise" region. Clearly, the maximum is achieved at $(\alpha, \beta) = (x_{(1)}, x_{(n)})$. Thus the MLE is $\hat\theta = (\hat\alpha, \hat\beta) = (x_{(1)}, x_{(n)})$. Again, $P_{\alpha,\beta}(\alpha < \hat\alpha,\ \hat\beta < \beta) = 1$.

3 Maximizing the Likelihood (one parameter)

3.1 General Remarks

Basic Result: A continuous function $g(\theta)$ defined on a closed, bounded interval $J$ attains its supremum (but might do so at one of the endpoints). That is, there exists a point $\theta_0 \in J$ such that $g(\theta_0) = \sup_{\theta \in J} g(\theta)$.

Consequence: Suppose $g(\theta)$ is a continuous, non-negative function defined on an open interval $J = (c, d)$ (where perhaps $c = -\infty$ or $d = \infty$). If
$$\lim_{\theta \to c} g(\theta) = \lim_{\theta \to d} g(\theta) = 0,$$
then $g$ attains its supremum. Thus, MLEs usually exist when the likelihood function is continuous.

4 Maxima at Stationary Points

Suppose the function $g(\theta)$ is defined on an interval $\Theta$ (which may be open or closed, infinite or finite). If $g$ is differentiable and attains its supremum at a point $\theta_0$ in the interior of $\Theta$, that point must be a stationary point (that is, $g'(\theta_0) = 0$).

1. If $g'(\theta_0) = 0$ and $g''(\theta_0) < 0$, then $\theta_0$ is a local maximum (but might not be the global maximum).

2. If $g'(\theta_0) = 0$ and $g''(\theta) < 0$ for all $\theta \in \Theta$, then $\theta_0$ is a global maximum (that is, it attains the supremum).

The condition in (1) is necessary (but not sufficient) for $\theta_0$ to be a global maximum. Condition (2) is sufficient (but not necessary). A function satisfying $g''(\theta) < 0$ for all $\theta \in \Theta$ is called strictly concave; it lies below any tangent line. Another useful condition (sufficient, but not necessary) is:

3. If $g'(\theta) > 0$ for $\theta < \theta_0$ and $g'(\theta) < 0$ for $\theta > \theta_0$, then $\theta_0$ is a global maximum.

5 Maximizing the Likelihood (multi-parameter)

5.1 Basic Result

A continuous function $g(\theta)$ defined on a closed, bounded set $J \subset \mathbb{R}^k$ attains its supremum (but might do so on the boundary).

5.2 Consequence

Suppose $g(\theta)$ is a continuous, non-negative function defined for all $\theta \in \mathbb{R}^k$. If $g(\theta) \to 0$ as $\|\theta\| \to \infty$, then $g$ attains its supremum. Thus, MLEs usually exist when the likelihood function is continuous.

Suppose the function $g(\theta)$ is defined on a convex set $\Theta \subset \mathbb{R}^k$ (that is, the line segment joining any two points in $\Theta$ lies entirely inside $\Theta$). If $g$ is differentiable and attains its supremum at a point $\theta_0$ in the interior of $\Theta$, that point must be a stationary point:
$$\frac{\partial g(\theta_0)}{\partial \theta_i} = 0, \quad i = 1, 2, \ldots, k.$$
Define the gradient vector $D$ and Hessian matrix $H$:
$$D(\theta) = \left( \frac{\partial g(\theta)}{\partial \theta_i} \right)_{i=1}^k \ \text{(a $k \times 1$ vector)}, \qquad
H(\theta) = \left( \frac{\partial^2 g(\theta)}{\partial \theta_i \partial \theta_j} \right)_{i,j=1}^k \ \text{(a $k \times k$ matrix)}.$$

5.3 Maxima at Stationary Points

1. If $D(\theta_0) = 0$ and $H(\theta_0)$ is negative definite, then $\theta_0$ is a local maximum (but might not be the global maximum).

2. If $D(\theta_0) = 0$ and $H(\theta)$ is negative definite for all $\theta \in \Theta$, then $\theta_0$ is a global maximum (that is, it attains the supremum).

(1) is necessary (but not sufficient) for $\theta_0$ to be a global maximum. (2) is sufficient (but not necessary). A function for which $H(\theta)$ is negative definite for all $\theta \in \Theta$ is called strictly concave; it lies below any tangent plane.

5.4 Positive and Negative Definite Matrices

Suppose $M$ is a $k \times k$ symmetric matrix. Note: Hessian matrices and covariance matrices are symmetric.

Definitions:

1. $M$ is positive definite (p.d.) if $x' M x > 0$ for all $x \ne 0$ ($x \in \mathbb{R}^k$).
2. $M$ is negative definite (n.d.) if $x' M x < 0$ for all $x \ne 0$.
3. $M$ is non-negative definite (n.n.d., or positive semi-definite) if $x' M x \ge 0$ for all $x \in \mathbb{R}^k$.

Facts:

1. $M$ is p.d. iff all its eigenvalues are positive.
2. $M$ is n.d. iff all its eigenvalues are negative.
3. $M$ is n.n.d. iff all its eigenvalues are non-negative.
4. $M$ is p.d. iff $-M$ is n.d.
5. If $M$ is p.d., all its diagonal elements must be positive.
6. If $M$ is n.d., all its diagonal elements must be negative.
7. The determinant of a symmetric matrix is equal to the product of its eigenvalues.

$2 \times 2$ Symmetric Matrices:
$$M = (m_{ij}) = \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix}, \quad m_{21} = m_{12}, \qquad
|M| = m_{11} m_{22} - m_{12} m_{21} = m_{11} m_{22} - m_{12}^2.$$
A $2 \times 2$ matrix is p.d. when the determinant is positive and the diagonal elements are positive. A $2 \times 2$ matrix is n.d. when the determinant is positive and the diagonal elements are negative. The bare minimum you need to check: $M$ is p.d. if $m_{11} > 0$ (or $m_{22} > 0$) and $|M| > 0$; $M$ is n.d. if $m_{11} < 0$ (or $m_{22} < 0$) and $|M| > 0$.
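
As a small numerical aid, the following R sketch (the matrix is an arbitrary illustrative example, not from the notes) checks negative definiteness both via eigenvalues and via the $2 \times 2$ shortcut above.

## Minimal sketch: checking negative definiteness of a symmetric 2x2 matrix.
M <- matrix(c(-3, 1,
               1, -2), nrow = 2, byrow = TRUE)   # assumed example matrix
eigen(M, symmetric = TRUE)$values                # all negative  =>  M is n.d.
M[1, 1] < 0 && det(M) > 0                        # the 2x2 shortcut gives the same answer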

Example: Let $X_1, X_2, \ldots, X_n$ be iid Gamma$(\alpha, \beta)$.

Preliminaries:
$$L(\alpha, \beta) = \prod_{i=1}^n \frac{x_i^{\alpha-1} e^{-x_i/\beta}}{\beta^\alpha \Gamma(\alpha)}.$$
Maximizing $L$ is the same as maximizing $l = \log L$, given by
$$l(\alpha, \beta) = (\alpha - 1) T_1 - T_2/\beta - n\alpha \log\beta - n \log\Gamma(\alpha),$$
where $T_1 = \sum_i \log x_i$ and $T_2 = \sum_i x_i$. Note that $T = (T_1, T_2)$ is the natural sufficient statistic of this 2pef. The first and second derivatives are
$$\frac{\partial l}{\partial \alpha} = T_1 - n\log\beta - n\psi(\alpha), \qquad \psi(\alpha) \equiv \frac{d}{d\alpha}\log\Gamma(\alpha) = \frac{\Gamma'(\alpha)}{\Gamma(\alpha)},$$
$$\frac{\partial l}{\partial \beta} = \frac{T_2}{\beta^2} - \frac{n\alpha}{\beta} = \frac{1}{\beta^2}(T_2 - n\alpha\beta),$$
$$\frac{\partial^2 l}{\partial \alpha^2} = -n\psi'(\alpha), \qquad
\frac{\partial^2 l}{\partial \beta^2} = -\frac{2T_2}{\beta^3} + \frac{n\alpha}{\beta^2} = -\frac{1}{\beta^3}(2T_2 - n\alpha\beta), \qquad
\frac{\partial^2 l}{\partial \alpha \partial \beta} = -\frac{n}{\beta}.$$

Situation #1: Suppose $\alpha = \alpha_0$ is known. Find the MLE for $\beta$. (Drop $\alpha$ from the arguments: $l(\beta) = l(\alpha_0, \beta)$, etc.) $l(\beta)$ is continuous and differentiable, and has a unique stationary point:
$$l'(\beta) = \frac{\partial l}{\partial \beta} = \frac{1}{\beta^2}(T_2 - n\alpha_0\beta) = 0 \iff T_2 = n\alpha_0\beta \iff \beta = \frac{T_2}{n\alpha_0} \ (\equiv \beta^*).$$
Now we check the second derivative:
$$l''(\beta) = \frac{\partial^2 l}{\partial \beta^2} = -\frac{1}{\beta^3}(2T_2 - n\alpha_0\beta) = -\frac{1}{\beta^3}\{T_2 + (T_2 - n\alpha_0\beta)\}.$$

Note that $l''(\beta^*) < 0$ since $T_2 - n\alpha_0\beta^* = 0$, but $l''(\beta) > 0$ for $\beta > 2T_2/(n\alpha_0)$. Thus the stationary point satisfies the necessary condition for a global maximum, but not the sufficient condition (i.e., $l(\beta)$ is not a strictly concave function). How can we be sure that we have found the global maximum, and not just a local maximum? In this case there is a simple argument: the stationary point $\beta^*$ is unique, and $l'(\beta) > 0$ for $\beta < \beta^*$ and $l'(\beta) < 0$ for $\beta > \beta^*$. This ensures $\beta^*$ is the unique global maximizer. Conclusion:
$$\hat\beta = \frac{T_2}{n\alpha_0}.$$
(This is a function of $T_2$, which is a sufficient statistic for $\beta$ when $\alpha$ is known.)

Situation #2: Suppose $\beta = \beta_0$ is known. Find the MLE for $\alpha$. (Drop $\beta$ from the arguments: $l(\alpha) = l(\alpha, \beta_0)$, etc.) Note that $l'(\alpha)$ and $l''(\alpha)$ involve $\psi(\alpha)$. The function $\psi$ is infinitely differentiable on the interval $(0, \infty)$ and satisfies $\psi'(\alpha) > 0$ and $\psi''(\alpha) < 0$ for all $\alpha > 0$ (it is strictly increasing and strictly concave). Also,
$$\lim_{\alpha \to 0^+} \psi(\alpha) = -\infty, \qquad \lim_{\alpha \to \infty} \psi(\alpha) = \infty.$$
Thus $\psi^{-1} : \mathbb{R} \to (0, \infty)$ exists. $l(\alpha)$ is continuous and differentiable, and has a unique stationary point:
$$l'(\alpha) = T_1 - n\log\beta_0 - n\psi(\alpha) = 0 \iff \psi(\alpha) = T_1/n - \log\beta_0 \iff \alpha = \psi^{-1}(T_1/n - \log\beta_0).$$
This is the unique global maximizer since $l''(\alpha) = -n\psi'(\alpha) < 0$ for all $\alpha > 0$. Thus $\hat\alpha = \psi^{-1}(T_1/n - \log\beta_0)$ is the MLE. (This is a function of $T_1$, which is a sufficient statistic for $\alpha$ when $\beta$ is known.)

Situation #3: Find the MLE for $\theta = (\alpha, \beta)$. $l(\alpha, \beta)$ is continuous and differentiable. A stationary point must satisfy the system of two equations:
$$\frac{\partial l}{\partial \alpha} = T_1 - n\log\beta - n\psi(\alpha) = 0, \qquad
\frac{\partial l}{\partial \beta} = \frac{1}{\beta^2}(T_2 - n\alpha\beta) = 0.$$
Solving the second equation for $\beta$ gives
$$\beta = \frac{T_2}{n\alpha}.$$

Plugging this into the first equation and rearranging a bit leads to
$$\frac{T_1}{n} - \log\left(\frac{T_2}{n}\right) = \psi(\alpha) - \log\alpha \equiv H(\alpha).$$
The function $H(\alpha)$ is continuous and strictly increasing from $(0, \infty)$ onto $(-\infty, 0)$, so it has an inverse mapping $(-\infty, 0)$ to $(0, \infty)$. Thus the solution to the above equation can be written
$$\alpha = H^{-1}\left\{\frac{T_1}{n} - \log\left(\frac{T_2}{n}\right)\right\},$$
and the unique stationary point is
$$\hat\alpha = H^{-1}\left\{\frac{T_1}{n} - \log\left(\frac{T_2}{n}\right)\right\}, \qquad \hat\beta = \frac{T_2}{n\hat\alpha}.$$
Is this the MLE? Let us examine the Hessian:
$$H(\alpha, \beta) = \begin{pmatrix} \frac{\partial^2 l}{\partial \alpha^2} & \frac{\partial^2 l}{\partial \alpha \partial \beta} \\ \frac{\partial^2 l}{\partial \alpha \partial \beta} & \frac{\partial^2 l}{\partial \beta^2} \end{pmatrix}
= \begin{pmatrix} -n\psi'(\alpha) & -\frac{n}{\beta} \\ -\frac{n}{\beta} & -\frac{1}{\beta^3}(2T_2 - n\alpha\beta) \end{pmatrix},
\qquad
H(\hat\alpha, \hat\beta) = \begin{pmatrix} -n\psi'(\hat\alpha) & -\frac{n^2\hat\alpha}{T_2} \\ -\frac{n^2\hat\alpha}{T_2} & -\frac{n^3\hat\alpha^3}{T_2^2} \end{pmatrix}.$$
The diagonal elements are both negative, and the determinant equals
$$\frac{n^4\hat\alpha^2}{T_2^2}\,(\hat\alpha\,\psi'(\hat\alpha) - 1),$$
which is positive since $\alpha\psi'(\alpha) - 1 > 0$ for all $\alpha > 0$. This guarantees that $H(\hat\alpha, \hat\beta)$ is negative definite, so that $(\hat\alpha, \hat\beta)$ is at least a local maximum.
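
The stationary equations above are easy to solve numerically. Here is a minimal R sketch (simulated data with assumed true values, not part of the notes) that inverts $H(\alpha) = \psi(\alpha) - \log\alpha$ with uniroot and then sets $\hat\beta = T_2/(n\hat\alpha)$.

## Minimal sketch: solving the gamma stationary equations numerically.
set.seed(4)
x  <- rgamma(200, shape = 2.5, scale = 1.7)   # hypothetical data
rhs  <- mean(log(x)) - log(mean(x))           # T1/n - log(T2/n), always < 0 by Jensen
Hfun <- function(a) digamma(a) - log(a)       # H(alpha) = psi(alpha) - log(alpha)
alpha.hat <- uniroot(function(a) Hfun(a) - rhs, interval = c(1e-6, 1e6))$root
beta.hat  <- mean(x) / alpha.hat              # T2/(n * alpha.hat)
c(alpha.hat, beta.hat)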

6 Invariance principle for MLEs

If $\eta = \tau(\theta)$ and $\hat\theta$ is the MLE of $\theta$, then $\hat\eta = \tau(\hat\theta)$ is the MLE of $\eta$.

Comments:

1. If $\tau(\theta)$ is a 1-1 function, this is a trivial theorem.
2. If $\tau(\theta)$ is not 1-1, this is essentially true by definition of the induced likelihood (see below).

Example: $X = (X_1, X_2, \ldots, X_n)$ iid $N(\mu, \sigma^2)$. The usual parameters $\theta = (\mu, \sigma^2)$ are related to the natural parameters $\eta = (\mu/\sigma^2,\ -1/(2\sigma^2))$ of the 2pef by a 1-1 function: $\eta = \tau(\theta)$. The likelihood in terms of $\theta$ is
$$L_1(\theta) = (2\pi\sigma^2)^{-n/2}\, e^{-n\mu^2/(2\sigma^2)}\, e^{(\mu/\sigma^2) T_1 - (1/(2\sigma^2)) T_2},$$
where $T_1 = \sum X_i$ and $T_2 = \sum X_i^2$.

Simple Example: $X = (X_1, X_2, \ldots, X_n)$ iid Bernoulli$(p)$. It is known that the MLE of $p$ is $\hat p = \bar X$. Thus

1. the MLE of $p^2$ is $\hat p^2 = \bar X^2$;
2. the MLE of $p(1-p)$ is $\bar X(1 - \bar X)$.

The function of $p$ in 1. is 1-1, but the one in 2. is not.

6.1 Induced Likelihood

Definition 2. If $\eta = \tau(\theta)$, then the induced likelihood is
$$L^*(\eta) \equiv \sup_{\theta : \tau(\theta) = \eta} L(\theta).$$
Go back to the example $X_1, X_2, \ldots, X_n \sim N(\mu, \sigma^2)$ iid. If the MLE $\hat\eta$ of $\eta$ is defined to be the value which maximizes $L^*(\eta)$, then it is easily seen that $\hat\eta = \tau(\hat\theta)$. The likelihood in terms of $\eta$ is obtained by substituting into $L_1(\theta)$, that is, evaluating $L_1$ at
$$\mu = -\eta_1/(2\eta_2), \qquad \sigma^2 = -1/(2\eta_2), \qquad \theta = (\mu, \sigma^2) = (-\eta_1/(2\eta_2),\ -1/(2\eta_2)) = \tau^{-1}(\eta),$$
which gives
$$L_2(\eta) = (-\pi/\eta_2)^{-n/2}\, e^{n\eta_1^2/(4\eta_2)}\, e^{\eta_1 T_1 + \eta_2 T_2}.$$

Stated abstractly, $L_2(\eta) = L_1(\tau^{-1}(\eta))$, so that $L_2$ is maximized when $\tau^{-1}(\eta) = \hat\theta$, that is, by $\eta = \tau(\hat\theta)$. The MLE of $\theta$ is known to be
$$\hat\theta = (\hat\mu, \hat\sigma^2) = \left(\bar X,\ \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2\right),$$
so the invariance principle says the MLE of $\eta$ is
$$\hat\eta = \tau(\hat\theta) = \left(\frac{\hat\mu}{\hat\sigma^2},\ -\frac{1}{2\hat\sigma^2}\right).$$

Continuation of example: What is the MLE of $\alpha = \mu + \sigma^2$? Note that
$$\alpha = g(\mu, \sigma^2) = \mu + \sigma^2$$
is not a 1-1 function of $\theta$, but by invariance
$$\hat\alpha = g(\hat\mu, \hat\sigma^2) = \hat\mu + \hat\sigma^2 = \bar X + SS/n,$$
where $SS = \sum_{i=1}^n (X_i - \bar X)^2$. What is the MLE of $\mu$? Of $\sigma^2$? With $g_1(x, y) = x$ and $g_2(x, y) = y$, we have $\mu = g_1(\theta)$ and $\sigma^2 = g_2(\theta)$, so that the MLEs are
$$\hat\mu = g_1(\hat\theta) = \bar X, \qquad \hat\sigma^2 = g_2(\hat\theta) = SS/n.$$
Thus the invariance principle implies
$$\widehat{(\mu, \sigma^2)}\ (\text{MLE as a pair}) = \big(\hat\mu\ (\text{MLE of } \mu),\ \hat\sigma^2\ (\text{MLE of } \sigma^2)\big).$$
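
A tiny R sketch of the invariance principle for the normal example (simulated data with assumed true values): compute the MLE of $(\mu, \sigma^2)$ and read off the MLE of $\mu + \sigma^2$ directly.

## Minimal sketch of MLE invariance for N(mu, sigma^2); true values are assumptions.
set.seed(5)
x <- rnorm(100, mean = 1, sd = 2)
mu.hat     <- mean(x)
sigma2.hat <- mean((x - mu.hat)^2)      # SS/n, the MLE (not the n-1 version)
alpha.hat  <- mu.hat + sigma2.hat       # MLE of mu + sigma^2 by invariance
c(mu.hat, sigma2.hat, alpha.hat)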

6.2 MLE for Exponential Families

The invariance principle for MLEs allows us to work with the natural parameter $\eta$ (which is a 1-1 function of $\theta$).

1pef: $f(x \mid \theta) = c(\theta) h(x) \exp\{w(\theta) t(x)\}$. Natural parameter: $\eta = w(\theta)$. With a little abuse of notation (writing $f(x \mid \eta)$ for $f^*(x \mid \eta) = f(x \mid w^{-1}(\eta))$ and $c(\eta)$ for $c^*(\eta) = c(w^{-1}(\eta))$), we can write
$$f(x \mid \eta) = c(\eta) h(x) \exp\{\eta\, t(x)\}.$$
For clarity of notation, we will use $x = (x_1, \ldots, x_N)$ for the observed data and $X = (X_1, X_2, \ldots, X_N)$ for the random data. If $X_1, \ldots, X_N$ are iid from $f(x \mid \eta)$, then
$$l(\eta) = N \log c(\eta) + \sum_{i=1}^N \log h(x_i) + \eta \sum_{i=1}^N t(x_i).$$
Since by 3.32(a) $E\, t(X_i) = -\frac{\partial}{\partial \eta} \log c(\eta)$, we have
$$l'(\eta) = N \frac{\partial}{\partial \eta} \log c(\eta) + \sum_{i=1}^N t(x_i) = -E\left[\sum_{i=1}^N t(X_i)\right] + \sum_{i=1}^N t(x_i) = -E\, T(X) + T(x), \qquad (3)$$
where $T(X) = \sum_{i=1}^N t(X_i)$. Hence the condition for a stationary point is equivalent to
$$E_\eta\, T(X) = T(x).$$
Note that, using (3),
$$l''(\eta) = N \frac{\partial^2}{\partial \eta^2} \log c(\eta) = N\{-\mathrm{Var}_\eta\, t(X_i)\} < 0 \quad \text{for all } \eta.$$
Thus any interior stationary point (not on the boundary of the natural parameter space $\{w(\theta) : \theta \in \Theta\}$) is automatically a global maximum, so long as that space is convex. In one dimension, this means the natural parameter space must be an interval of some sort (possibly infinite). Ignoring this fine point, for a 1pef the log-likelihood will have a unique stationary point, which will be the MLE.

k-pef: $f(x \mid \theta) = c(\theta) h(x) \exp\{\sum_{j=1}^k w_j(\theta) t_j(x)\}$. Natural parameter: $\eta = (\eta_1, \ldots, \eta_k) = (w_1(\theta), \ldots, w_k(\theta))$, that is, $\eta_j = w_j(\theta)$, and
$$f(x \mid \eta) = c(\eta) h(x) \exp\left\{\sum_{j=1}^k \eta_j t_j(x)\right\}.$$

If $X_1, X_2, \ldots, X_N$ are iid from $f(x \mid \eta)$, then
$$l(\eta) = N \log c(\eta) + \sum_{i=1}^N \log h(x_i) + \sum_{j=1}^k \eta_j \left\{\sum_{i=1}^N t_j(x_i)\right\},$$
$$\frac{\partial l}{\partial \eta_j} = N \frac{\partial}{\partial \eta_j} \log c(\eta) + \sum_{i=1}^N t_j(x_i) = -E\left[\sum_{i=1}^N t_j(X_i)\right] + \sum_{i=1}^N t_j(x_i),$$
$$\frac{\partial^2 l}{\partial \eta_j \partial \eta_l} = N \frac{\partial^2}{\partial \eta_j \partial \eta_l} \log c(\eta) = N\big({-\mathrm{Cov}(t_j(X_i), t_l(X_i))}\big).$$
Thus the equations for a stationary point, $\partial l/\partial \eta_j = 0$, $j = 1, \ldots, k$, are equivalent to
$$E_\eta\, T_j(X) = T_j(x), \quad j = 1, \ldots, k, \qquad (4)$$
where $T_j(X) = \sum_{i=1}^N t_j(X_i)$ and $T_j(x) = \sum_{i=1}^N t_j(x_i)$; or, in vector notation, $E_\eta\, T(X) = T(x)$, where $T(X) = (T_1(X), \ldots, T_k(X))$ and $T(x) = (T_1(x), \ldots, T_k(x))$.

The Hessian matrix $H(\eta) = \left(\frac{\partial^2 l}{\partial \eta_i \partial \eta_j}\right)_{i,j=1}^k$ is given by
$$H(\eta) = -N\,\Sigma(\eta),$$
where $\Sigma(\eta)$ is the $k \times k$ covariance matrix of $(t_1(X_1), t_2(X_1), \ldots, t_k(X_1))$. A covariance matrix will be positive definite (except in degenerate cases), so that $H(\eta)$ will be negative definite for all $\eta$.

Conclusion: An interior stationary point (i.e., a solution of (4)) must be the unique global maximum, and hence the MLE. This result also holds in the original parameterization, with (4) restated as $E_\theta\, T_j(X) = T_j(x)$, $j = 1, \ldots, k$.

Connection with MOM: For a 1pef with $t(x) = x$, MOM and MLE agree. For a kpef with $t_j(x) = x^j$, MOM and MLE agree. Why? Because then (4) is equivalent to the equations for the MOM estimator.
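
For instance, for Poisson data (a 1pef with $t(x) = x$), condition (4) reads $N\lambda = \sum_i x_i$, so the MLE is the sample mean. A minimal R sketch (hypothetical data, assumed for illustration) confirms that solving the moment equation numerically in the natural parameter $\eta = \log\lambda$ reproduces $\bar x$.

## Minimal sketch: E_eta T(X) = T(x) for a Poisson 1pef.
set.seed(6)
x <- rpois(60, lambda = 4)
N <- length(x)
## stationary condition in the natural parameter eta = log(lambda):
## N * exp(eta) - sum(x) = 0
eta.hat <- uniroot(function(eta) N * exp(eta) - sum(x),
                   interval = c(-10, 10))$root
c(exp(eta.hat), mean(x))   # MLE of lambda equals the sample mean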

7 Revisiting the Gamma Example

The system of equations for the MLE of $(\alpha, \beta)$ may be derived directly from (4):
$$E\, T_1(X) = T_1(x), \qquad E\, T_2(X) = T_2(x),$$
which becomes
$$E \sum_{i=1}^n \log X_i = n\, E \log X_1 = n(\log\beta + \psi(\alpha)) = T_1(x),$$
$$E \sum_{i=1}^n X_i = n\, E X_1 = n\alpha\beta = T_2(x).$$
These are the same as the equations for a stationary point derived earlier. For $X \sim$ Gamma$(\alpha, \beta)$, we have used
$$E \log X = \int_0^\infty \log x\, \frac{x^{\alpha-1} e^{-x/\beta}}{\beta^\alpha \Gamma(\alpha)}\, dx
= \int_0^\infty (\log(x/\beta) + \log\beta)\, \frac{(x/\beta)^{\alpha-1} e^{-x/\beta}}{\Gamma(\alpha)}\, \frac{dx}{\beta}
= \int_0^\infty (\log z + \log\beta)\, \frac{z^{\alpha-1} e^{-z}}{\Gamma(\alpha)}\, dz$$
$$= \log\beta + \frac{1}{\Gamma(\alpha)} \int_0^\infty \underbrace{(z^{\alpha-1} \log z)}_{\frac{\partial}{\partial\alpha} z^{\alpha-1}}\, e^{-z}\, dz
= \log\beta + \frac{1}{\Gamma(\alpha)} \frac{\partial}{\partial\alpha} \int_0^\infty z^{\alpha-1} e^{-z}\, dz
= \log\beta + \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} = \log\beta + \psi(\alpha).$$

Verifying the Stationary Point is a Global Maximum: The Gamma family is a 2pef (or a 1pef if $\alpha$ or $\beta$ is held fixed). Switching to the natural parameters $\eta_1 = \alpha - 1$, $\eta_2 = -1/\beta$ (or just making the substitution $\lambda = 1/\beta$) simplifies the second derivatives with respect to $\eta_2$ (or $\lambda$). The Hessian matrix is then negative definite for all $\eta = (\eta_1, \eta_2)$, which is a sufficient condition for the stationary point to be the global maximum.

7.1 MLEs for More General Exponential Families

Proposition 1. Suppose $X \sim P_\theta$, $\theta \in \Theta$, where $P_\theta$ has a joint pdf (or pmf) from an $n$-variate, $k$-parameter exponential family
$$f(x \mid \theta) = c(\theta) h(x) \exp\left\{\sum_{j=1}^k w_j(\theta) t_j(x)\right\} \quad \text{for } x \in \mathbb{R}^n,\ \theta \in \Theta \subset \mathbb{R}^k.$$
Then the MLE of $\theta$ based on the observed data $x$ is the solution (in $\theta$) of the system of equations
$$E_\theta\, T_j(X) = T_j(x), \quad j = 1, \ldots, k,$$
provided the solution (call it $\hat\theta$) satisfies $w(\hat\theta) \in$ interior of $\{w(\theta) : \theta \in \Theta\}$.

Proof. Essentially the same as for the ordinary kpef.

Example: Simple linear regression with known variance. $Y_1, Y_2, \ldots, Y_n$ are independent with
$$Y_i \sim N(\beta_0 + \beta_1 x_i,\ \sigma_0^2), \qquad \theta = (\beta_0, \beta_1).$$
The joint distribution of $\tilde Y = (Y_1, Y_2, \ldots, Y_n)$ forms an exponential family, with natural sufficient statistic $t(\tilde Y) = \left(\sum_i Y_i,\ \sum_i x_i Y_i\right)$. The condition $E_\theta\, t(\tilde Y) = t(\tilde y)$ has the form
$$E\left(\sum_i Y_i\right) = \sum_i y_i, \qquad E\left(\sum_i x_i Y_i\right) = \sum_i x_i y_i.$$
Thus the MLE $\hat\theta = (\hat\beta_0, \hat\beta_1)$ is the solution of
$$\sum_i (\beta_0 + \beta_1 x_i) = \sum_i y_i, \qquad \sum_i x_i(\beta_0 + \beta_1 x_i) = \sum_i x_i y_i.$$
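
These are just the usual normal equations. A brief R sketch (simulated covariates and responses, assumed for illustration) solves them directly and checks the answer against lm, which computes the same least-squares solution.

## Minimal sketch: solving the two normal equations for (beta0, beta1) directly.
set.seed(7)
n  <- 40
xc <- runif(n, 0, 10)
y  <- 1.5 + 0.8 * xc + rnorm(n, sd = 1)    # hypothetical data-generating values
A  <- rbind(c(n,       sum(xc)),
            c(sum(xc), sum(xc^2)))
b  <- c(sum(y), sum(xc * y))
solve(A, b)                   # (beta0.hat, beta1.hat) from the normal equations
coef(lm(y ~ xc))              # same answer from R's least-squares fit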

7.2 Sufficient statistics and MLEs

If $T = T(X)$ is a sufficient statistic for $\theta$, then there is an MLE which is a function of $T$. (If the MLE is unique, then we can say the MLE is a function of $T$.)

Proof. By the factorization theorem, $f(x \mid \theta) = g(T(x), \theta) h(x)$. Assume for convenience that the MLE is unique. Then the MLE is
$$\hat\theta(x) = \arg\max_\theta f(x \mid \theta) = \arg\max_\theta g(T(x), \theta),$$
which is clearly a function of $T(x)$.

MLE coincides with Least Squares: Consider independent normal random variables with constant variance $\sigma^2$ (known or unknown): $Y_1, Y_2, \ldots, Y_n$ independent with
$$Y_i \sim N(\beta_0 + \beta_1 x_i,\ \sigma^2), \qquad \theta = (\beta_0, \beta_1),$$
or, more generally,
$$Y_i \sim N(g(x_i, \beta),\ \sigma^2),$$
where $\beta$ is possibly a vector. Then
$$L(\beta, \sigma^2) = f(\tilde y \mid \theta) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n \exp\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - g(x_i, \beta))^2\right\}.$$
For any fixed value of $\sigma^2$, maximizing $L(\beta, \sigma^2)$ with respect to $\beta$ is equivalent to minimizing $\sum_{i=1}^n (y_i - g(x_i, \beta))^2$ with respect to $\beta$. Hence the MLE and least squares give the same estimates of the $\beta$ parameters.

8 Limiting distribution of the MLE

A large-sample property is one that describes the limiting distribution. This (i) gives an exact characterization of the rate of convergence, and (ii) allows for the construction of asymptotically exact statistical procedures. Though it is possible to get non-normal limits, all standard problems that admit a limiting distribution have a normal limit. From previous experience, we know that MLEs typically have an asymptotic normality property.

Here is one version of such a theorem, similar to Theorem 9.14 in Keener (2010), with conditions given in C1-C4 below. Condition C3 is the most difficult to check, but it does hold for regular exponential families. We focus on the one-dimensional case, but the exact same theorem, with obvious modifications, holds for $d$-dimensional $\theta$.

C1. The support of $P_\theta$ does not depend on $\theta$.

C2. For each $x$ in the support, $f_x(\theta) := \log p_\theta(x)$ is three times differentiable with respect to $\theta$ in an interval $(\theta^* - \delta, \theta^* + \delta)$; moreover, $E_{\theta^*} f_X'(\theta^*)$ and $E_{\theta^*} f_X''(\theta^*)$ are finite, and there exists a function $M(x)$ such that
$$\sup_{\theta \in (\theta^* - \delta,\, \theta^* + \delta)} |f_x'''(\theta)| \le M(x), \qquad E_{\theta^*}[M(X)] < \infty. \qquad (5)$$

C3. Expectation with respect to $P_{\theta^*}$ and differentiation at $\theta^*$ can be interchanged, which implies that the score function has mean zero and that the Fisher information exists and can be evaluated using either of the two familiar formulas.

C4. The Fisher information at $\theta^*$ is positive.

Theorem 1. Suppose $X_1, X_2, \ldots, X_n$ are iid $P_{\theta^*}$, where $\theta^* \in \Theta \subset \mathbb{R}$. Assume C1-C4, and let $\hat\theta_n$ be a consistent sequence of solutions of (2). Then, for any interior point $\theta^*$,
$$n^{1/2}(\hat\theta_n - \theta^*) \to N(0,\ I(\theta^*)^{-1}), \quad \text{in distribution under } P_{\theta^*}.$$

Proof. Let $l_n(\theta) = n^{-1} \log L_n(\theta)$ be the scaled log-likelihood. Since $\theta^*$ is an interior point, there exists an open neighborhood $A$ of $\theta^*$ contained in $\Theta$. By consistency of $\hat\theta_n$, the event $\{\hat\theta_n \in A\}$ has $P_{\theta^*}$-probability converging to 1. Therefore it suffices (prove!) to consider the behavior of $\hat\theta_n$ only when it is in $A$, where the log-likelihood is well-behaved; in particular, $l_n'(\hat\theta_n) = 0$. Next, take a second-order Taylor approximation of $l_n'(\hat\theta_n)$ around $\theta^*$:
$$0 = l_n'(\theta^*) + l_n''(\theta^*)(\hat\theta_n - \theta^*) + \tfrac{1}{2} l_n'''(\tilde\theta_n)(\hat\theta_n - \theta^*)^2,$$
where $\tilde\theta_n$ is between $\hat\theta_n$ and $\theta^*$. After a bit of simple algebra, we get
$$n^{1/2}(\hat\theta_n - \theta^*) = \frac{n^{1/2}\, l_n'(\theta^*)}{-l_n''(\theta^*) - \tfrac{1}{2} l_n'''(\tilde\theta_n)(\hat\theta_n - \theta^*)}, \quad \text{for } \hat\theta_n \text{ near } \theta^*.$$
So it remains to show that the right-hand side above has the stated asymptotically normal distribution. Let us look at the numerator and denominator separately.

Numerator. The numerator can be written as
$$n^{1/2} l_n'(\theta^*) = n^{-1/2} \sum_{i=1}^n \frac{\partial}{\partial\theta} \log p_\theta(X_i) \Big|_{\theta = \theta^*}.$$

The summands are iid with mean zero and variance $I(\theta^*)$, by our assumptions about interchanging derivatives and integrals. Therefore the standard Central Limit Theorem says that $n^{1/2} l_n'(\theta^*) \to N(0, I(\theta^*))$ in distribution.

Denominator. The first term in the denominator, $-l_n''(\theta^*)$, converges in $P_{\theta^*}$-probability to $I(\theta^*)$ by the usual law of large numbers. It remains to show that the second term in the denominator is negligible. For this, note that by (5),
$$|l_n'''(\tilde\theta_n)| \le \frac{1}{n} \sum_{i=1}^n M(X_i), \quad \text{for } \hat\theta_n \text{ close to } \theta^*.$$
The ordinary law of large numbers, again, says that the upper bound converges to $E_{\theta^*}[M(X_1)]$, which is finite. Consequently, $l_n'''(\tilde\theta_n)$ is bounded in probability, and since $\hat\theta_n - \theta^* \to 0$ in $P_{\theta^*}$-probability by assumption, we may conclude that
$$(\hat\theta_n - \theta^*)\, l_n'''(\tilde\theta_n) \to 0, \quad \text{in } P_{\theta^*}\text{-probability}.$$
It then follows from Slutsky's Theorem that
$$\frac{n^{1/2}\, l_n'(\theta^*)}{-l_n''(\theta^*) - \tfrac{1}{2}(\hat\theta_n - \theta^*)\, l_n'''(\tilde\theta_n)} \to \frac{N(0, I(\theta^*))}{I(\theta^*)} = N(0, I(\theta^*)^{-1}),$$
in distribution, which is the desired result.

The take-away message here is that, under certain conditions, if $n$ is large then the MLE $\hat\theta$ has a sampling distribution close to $N(\theta, [nI(\theta)]^{-1})$ under $P_\theta$. To apply this result, e.g., to construct an asymptotically approximate confidence interval, one needs to replace $I(\theta)$ with a quantity that does not depend on the unknown parameter. Standard choices are the expected Fisher information $I(\hat\theta)$ and the observed Fisher information $-l_n''(\hat\theta)$; the latter is often preferred, for it has some desirable conditioning properties.

With asymptotic normality of the MLE, it is possible to derive the asymptotic distribution of any smooth function of the MLE. This is the well-known delta theorem. The delta theorem also offers an alternative to the plug-in rules discussed above, called variance stabilizing transformations, for eliminating $\theta$ from the variance in the asymptotic normal approximation.

It is possible to drop the requirement that the likelihood be three times differentiable if one assumes that the second derivative exists and has a certain Lipschitz property: $\log p_\theta(x)$ is twice differentiable at $\theta$, and there exists a function $g_r(x, \theta)$ such that, for each interior point $\theta^*$,
$$\sup_{\theta : |\theta - \theta^*| \le r} \left| \frac{\partial^2}{\partial\theta^2} \log p_\theta(x) - \frac{\partial^2}{\partial\theta^2} \log p_\theta(x)\Big|_{\theta = \theta^*} \right| \le g_r(x, \theta^*),$$
with $\lim_{r \to 0} E_\theta\{g_r(X, \theta)\} = 0$ for each $\theta$. With this assumption, the same asymptotic normality result holds. Interestingly, it is possible to get asymptotic normality under a much weaker condition, namely differentiability in quadratic mean, which assumes less than differentiability of $\theta \mapsto p_\theta(x)$, but the details are a bit more technical.
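
A small simulation (exponential model with an assumed rate, sample size, and replication count, not part of the notes) illustrates the take-away message: for the Exponential$(\lambda)$ model the MLE is $\hat\lambda = 1/\bar X$ and $I(\lambda) = 1/\lambda^2$, so $\sqrt{n}(\hat\lambda_n - \lambda)$ should be approximately $N(0, \lambda^2)$.

## Minimal sketch: Monte Carlo check of asymptotic normality of the MLE.
set.seed(8)
lambda <- 2; n <- 500
z <- replicate(4000, {
  x <- rexp(n, rate = lambda)
  sqrt(n) * (1/mean(x) - lambda)     # sqrt(n)(lambda.hat - lambda)
})
c(sd = sd(z), theory = lambda)       # empirical sd should be close to lambda = I(lambda)^(-1/2)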

9 Likelihood ratio tests

For two competing hypotheses $H_0$ and $H_1$ about the parameter $\theta$, the likelihood ratio is often used to make a comparison. For example, for $H_0 : \theta = \theta_0$ versus $H_1 : \theta = \theta_1$, the likelihood ratio is $L(\theta_0)/L(\theta_1)$, and large (resp. small) values of this ratio indicate that the data $x$ favor $H_0$ (resp. $H_1$). A more difficult and somewhat more general problem is $H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \notin \Theta_0$, where $\Theta_0$ is a subset of $\Theta$. In this case, one can define the likelihood ratio as
$$T_n = T_n(X, \Theta_0) = \frac{\sup_{\theta \in \Theta_0} L(\theta)}{\sup_{\theta \in \Theta} L(\theta)}.$$
The interpretation of this likelihood ratio is the same as before, i.e., if the ratio is small, then the data lend little evidence to the null hypothesis. For practical purposes, we need to know what it means for the ratio to be "small"; this means we need the null distribution of $T_n$, i.e., the distribution of $T_n$ under $P_\theta$ when $\theta \in \Theta_0$.

For $\Theta \subset \mathbb{R}^d$, consider the testing problem $H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \notin \Theta_0$, where $\Theta_0$ is the subset of $\Theta$ that specifies the values $\theta_{01}, \ldots, \theta_{0m}$ of $\theta_1, \ldots, \theta_m$, with $m$ a fixed integer between 1 and $d$. The following result, known as Wilks's Theorem, gives conditions under which the null distribution of $W_n = -2 \log T_n$ is asymptotically of a convenient form.

Theorem 2. Suppose the conditions of Theorem 1 hold. Under the setup described in the previous paragraph, $W_n \to \mathrm{ChiSq}(m)$ in distribution, under $P_\theta$ with $\theta \in \Theta_0$.

Proof. We focus here only on the case $d = m = 1$. That is, $\Theta_0 = \{\theta_0\}$ is a singleton, and we want to know the limiting distribution of $W_n$ under $P_{\theta_0}$. Clearly,
$$W_n = -2 l_n(\theta_0) + 2 l_n(\hat\theta_n),$$
where $\hat\theta_n$ is the MLE and $l_n$ is the log-likelihood. By the assumed smoothness of the log-likelihood, do a two-term Taylor approximation of $l_n(\theta_0)$ around $\hat\theta_n$:
$$l_n(\theta_0) = l_n(\hat\theta_n) + l_n'(\hat\theta_n)(\theta_0 - \hat\theta_n) + \frac{l_n''(\tilde\theta_n)}{2}(\theta_0 - \hat\theta_n)^2,$$
where $\tilde\theta_n$ is between $\theta_0$ and $\hat\theta_n$. Since $l_n'(\hat\theta_n) = 0$, we get
$$W_n = -l_n''(\tilde\theta_n)(\theta_0 - \hat\theta_n)^2 = \frac{-l_n''(\tilde\theta_n)}{n}\, \{n^{1/2}(\hat\theta_n - \theta_0)\}^2.$$

From Theorem 1 we have that $\sqrt{n}(\hat\theta_n - \theta_0) \to N(0, I(\theta_0)^{-1})$ in distribution as $n \to \infty$. Also, $l_n''(\tilde\theta_n) = l_n''(\theta_0) + \{l_n''(\tilde\theta_n) - l_n''(\theta_0)\}$, and
$$\frac{1}{n}\left| l_n''(\tilde\theta_n) - l_n''(\theta_0) \right| \le \frac{1}{n}\sum_{i=1}^n \left| \frac{\partial^2}{\partial\theta^2}\log p_\theta(X_i)\Big|_{\theta = \tilde\theta_n} - \frac{\partial^2}{\partial\theta^2}\log p_\theta(X_i)\Big|_{\theta = \theta_0} \right|.$$
Using Condition C2, the upper bound is itself bounded by $n^{-1}\sum_i M(X_i)\,|\tilde\theta_n - \theta_0|$, which goes to zero in probability under $P_{\theta_0}$ since $\tilde\theta_n$ is consistent. Therefore $-l_n''(\tilde\theta_n)/n$ has the same limiting behavior as $-l_n''(\theta_0)/n$, namely it converges in $P_{\theta_0}$-probability to $I(\theta_0)$. Finally, by Slutsky, we get
$$W_n \to I(\theta_0)\,\{N(0, I(\theta_0)^{-1})\}^2 \sim N(0, 1)^2 \sim \mathrm{ChiSq}(1).$$

Wilks's theorem facilitates the construction of an approximate size-$\alpha$ test of $H_0$ when $n$ is large, i.e., reject $H_0$ iff $W_n$ exceeds $\chi^2_{m, 1-\alpha}$, the $100(1-\alpha)$ percentile of the ChiSq$(m)$ distribution. The advantage of Wilks's theorem appears in cases where the exact sampling distribution of $W_n$ is intractable, so that an exact (analytical) size-$\alpha$ test is not available. Monte Carlo can often be used to find a test, but Wilks's theorem gives a good answer and only requires use of a simple chi-square table.

One can also use the Wilks theorem result to obtain approximate confidence regions. Let $W_n(\theta_0) = -2 \log T_n(X; \theta_0)$, where $\theta_0$ is a fixed generic value of the full $d$-dimensional parameter $\theta$, i.e., $H_0 : \theta = \theta_0$. Then an approximate $100(1-\alpha)\%$ confidence region for $\theta$ is $\{\theta_0 : W_n(\theta_0) \le \chi^2_{m, 1-\alpha}\}$.

An interesting and often overlooked aspect of Wilks's theorem is that the asymptotic null distribution does not depend on the true values of those parameters unspecified under the null. For example, in a gamma distribution problem with the goal of testing whether the shape is equal to some specified value, the null distribution of $W_n$ does not depend on the true value of the scale.
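
A quick Monte Carlo sanity check of Wilks's theorem in the simplest case (Exponential rate, testing $H_0 : \lambda = \lambda_0$ with $m = 1$; all simulation settings are assumptions) compares the simulated null distribution of $W_n$ with ChiSq(1).

## Minimal sketch: simulate W_n under H0 for an Exponential(lambda) model and
## compare with the ChiSq(1) limit from Wilks's theorem.
set.seed(9)
lambda0 <- 1.5; n <- 200
loglik <- function(lam, x) length(x) * log(lam) - lam * sum(x)
W <- replicate(4000, {
  x <- rexp(n, rate = lambda0)          # data generated under the null
  lam.hat <- 1/mean(x)                  # MLE
  2 * (loglik(lam.hat, x) - loglik(lambda0, x))
})
c(mean(W), var(W))                       # should be close to 1 and 2 (ChiSq(1) moments)
quantile(W, 0.95); qchisq(0.95, df = 1)  # simulated vs theoretical 95th percentile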

9.1 Cautions concerning the first-order theory

One might be tempted to conclude that the desirable properties of the likelihood-based methods presented in the previous section are universal, i.e., that maximum likelihood estimators will always work. Moreover, based on the form of the asymptotic variance of the MLE and its similarity to the Cramer-Rao lower bound in Chapter 2, it is tempting to conclude that the MLE is asymptotically efficient. However, both of these conclusions are technically false in general. Indeed, there are examples where

1. the MLE is not unique or even does not exist;
2. the MLE works (in the sense of consistency), but the conditions of the theory are not met, so asymptotic normality fails; and
3. the MLE is not even consistent!

Non-uniqueness or non-existence of the MLE are roadblocks to practical implementation but, for some reason, are not viewed as much of a concern from a theoretical point of view. The case where the MLE works but is not asymptotically normal is also not really a problem, provided that one recognizes the non-regular nature of the problem and makes the necessary adjustments. The most concerning of these points is inconsistency of the MLE. Since consistency is a rather weak property, inconsistency of the MLE means that its performance is poor and it can give very misleading results. The most famous example of inconsistency of the MLE, due to Neyman and Scott (1948), is given next.

Example (Neyman and Scott 1948). Let $X_{ij}$ be independent normal random variables, $X_{ij} \sim N(\mu_i, \sigma^2)$, $i = 1, \ldots, n$ and $j = 1, 2$; the case of two $j$ levels is the simplest, but the result holds for any fixed number of levels. The point here is that $X_{i1}$ and $X_{i2}$ have the same mean $\mu_i$, but there are possibly $n$ different means. The full parameter is $\theta = (\mu_1, \ldots, \mu_n, \sigma^2)$, which is of dimension $n + 1$. It is easy to check that the MLEs are given by
$$\hat\mu_i = \tfrac{1}{2}(X_{i1} + X_{i2}), \quad i = 1, \ldots, n, \qquad
\hat\sigma^2 = \frac{1}{4n} \sum_{i=1}^n (X_{i1} - X_{i2})^2.$$
It is easy to see that, as $n \to \infty$, $\hat\sigma^2 \to \sigma^2/2$ in probability, so the MLE of $\sigma^2$ is inconsistent! The issue causing the inconsistency is that the dimension of the nuisance parameter, the means $\mu_1, \ldots, \mu_n$, is increasing with $n$. In general, when the dimension of the parameter depends on $n$, consistency of the MLE will be a concern, so one should be careful.
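
A short simulation (the values of $\sigma$, the $\mu_i$, and $n$ are assumptions for illustration) makes the inconsistency visible: as $n$ grows, $\hat\sigma^2$ settles near $\sigma^2/2$ rather than $\sigma^2$.

## Minimal sketch of the Neyman-Scott inconsistency.
set.seed(10)
sigma <- 2
for (n in c(100, 1000, 10000)) {
  mu <- rnorm(n, 0, 5)                       # arbitrary individual means
  x1 <- rnorm(n, mu, sigma); x2 <- rnorm(n, mu, sigma)
  sigma2.hat <- sum((x1 - x2)^2) / (4 * n)   # the MLE of sigma^2
  cat(n, sigma2.hat, "\n")                   # approaches sigma^2/2 = 2, not sigma^2 = 4
}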


More information

An exponential family of distributions is a parametric statistical model having densities with respect to some positive measure λ of the form.

An exponential family of distributions is a parametric statistical model having densities with respect to some positive measure λ of the form. Stat 8112 Lecture Notes Asymptotics of Exponential Families Charles J. Geyer January 23, 2013 1 Exponential Families An exponential family of distributions is a parametric statistical model having densities

More information

Sampling distribution of GLM regression coefficients

Sampling distribution of GLM regression coefficients Sampling distribution of GLM regression coefficients Patrick Breheny February 5 Patrick Breheny BST 760: Advanced Regression 1/20 Introduction So far, we ve discussed the basic properties of the score,

More information

Lecture 4 September 15

Lecture 4 September 15 IFT 6269: Probabilistic Graphical Models Fall 2017 Lecture 4 September 15 Lecturer: Simon Lacoste-Julien Scribe: Philippe Brouillard & Tristan Deleu 4.1 Maximum Likelihood principle Given a parametric

More information

6 Classic Theory of Point Estimation

6 Classic Theory of Point Estimation 6 Classic Theory of Point Estimation Point estimation is usually a starting point for more elaborate inference, such as construction of confidence intervals. Centering a confidence interval at a point

More information

Chapter 7. Hypothesis Testing

Chapter 7. Hypothesis Testing Chapter 7. Hypothesis Testing Joonpyo Kim June 24, 2017 Joonpyo Kim Ch7 June 24, 2017 1 / 63 Basic Concepts of Testing Suppose that our interest centers on a random variable X which has density function

More information

1. Fisher Information

1. Fisher Information 1. Fisher Information Let f(x θ) be a density function with the property that log f(x θ) is differentiable in θ throughout the open p-dimensional parameter set Θ R p ; then the score statistic (or score

More information

Information in Data. Sufficiency, Ancillarity, Minimality, and Completeness

Information in Data. Sufficiency, Ancillarity, Minimality, and Completeness Information in Data Sufficiency, Ancillarity, Minimality, and Completeness Important properties of statistics that determine the usefulness of those statistics in statistical inference. These general properties

More information

Hypothesis Testing: The Generalized Likelihood Ratio Test

Hypothesis Testing: The Generalized Likelihood Ratio Test Hypothesis Testing: The Generalized Likelihood Ratio Test Consider testing the hypotheses H 0 : θ Θ 0 H 1 : θ Θ \ Θ 0 Definition: The Generalized Likelihood Ratio (GLR Let L(θ be a likelihood for a random

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation University of Pavia Maximum Likelihood Estimation Eduardo Rossi Likelihood function Choosing parameter values that make what one has observed more likely to occur than any other parameter values do. Assumption(Distribution)

More information

Hypothesis Test. The opposite of the null hypothesis, called an alternative hypothesis, becomes

Hypothesis Test. The opposite of the null hypothesis, called an alternative hypothesis, becomes Neyman-Pearson paradigm. Suppose that a researcher is interested in whether the new drug works. The process of determining whether the outcome of the experiment points to yes or no is called hypothesis

More information

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided Let us first identify some classes of hypotheses. simple versus simple H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided H 0 : θ θ 0 versus H 1 : θ > θ 0. (2) two-sided; null on extremes H 0 : θ θ 1 or

More information

Mathematical statistics

Mathematical statistics October 4 th, 2018 Lecture 12: Information Where are we? Week 1 Week 2 Week 4 Week 7 Week 10 Week 14 Probability reviews Chapter 6: Statistics and Sampling Distributions Chapter 7: Point Estimation Chapter

More information

Math 494: Mathematical Statistics

Math 494: Mathematical Statistics Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/

More information

Chapter 4: Asymptotic Properties of the MLE

Chapter 4: Asymptotic Properties of the MLE Chapter 4: Asymptotic Properties of the MLE Daniel O. Scharfstein 09/19/13 1 / 1 Maximum Likelihood Maximum likelihood is the most powerful tool for estimation. In this part of the course, we will consider

More information

Miscellaneous Errors in the Chapter 6 Solutions

Miscellaneous Errors in the Chapter 6 Solutions Miscellaneous Errors in the Chapter 6 Solutions 3.30(b In this problem, early printings of the second edition use the beta(a, b distribution, but later versions use the Poisson(λ distribution. If your

More information

Lecture 3 January 16

Lecture 3 January 16 Stats 3b: Theory of Statistics Winter 28 Lecture 3 January 6 Lecturer: Yu Bai/John Duchi Scribe: Shuangning Li, Theodor Misiakiewicz Warning: these notes may contain factual errors Reading: VDV Chater

More information

Introduction to Statistical Inference

Introduction to Statistical Inference Introduction to Statistical Inference Ping Yu Department of Economics University of Hong Kong Ping Yu (HKU) Statistics 1 / 30 1 Point Estimation 2 Hypothesis Testing Ping Yu (HKU) Statistics 2 / 30 The

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Maximum Likelihood Estimation Guy Lebanon February 19, 2011 Maximum likelihood estimation is the most popular general purpose method for obtaining estimating a distribution from a finite sample. It was

More information

Some General Types of Tests

Some General Types of Tests Some General Types of Tests We may not be able to find a UMP or UMPU test in a given situation. In that case, we may use test of some general class of tests that often have good asymptotic properties.

More information

Theory of Maximum Likelihood Estimation. Konstantin Kashin

Theory of Maximum Likelihood Estimation. Konstantin Kashin Gov 2001 Section 5: Theory of Maximum Likelihood Estimation Konstantin Kashin February 28, 2013 Outline Introduction Likelihood Examples of MLE Variance of MLE Asymptotic Properties What is Statistical

More information

Statistical Inference

Statistical Inference Statistical Inference Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA. Asymptotic Inference in Exponential Families Let X j be a sequence of independent,

More information

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems Principles of Statistical Inference Recap of statistical models Statistical inference (frequentist) Parametric vs. semiparametric

More information

Outline of GLMs. Definitions

Outline of GLMs. Definitions Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

Hypothesis Testing. A rule for making the required choice can be described in two ways: called the rejection or critical region of the test.

Hypothesis Testing. A rule for making the required choice can be described in two ways: called the rejection or critical region of the test. Hypothesis Testing Hypothesis testing is a statistical problem where you must choose, on the basis of data X, between two alternatives. We formalize this as the problem of choosing between two hypotheses:

More information

BTRY 4090: Spring 2009 Theory of Statistics

BTRY 4090: Spring 2009 Theory of Statistics BTRY 4090: Spring 2009 Theory of Statistics Guozhang Wang September 25, 2010 1 Review of Probability We begin with a real example of using probability to solve computationally intensive (or infeasible)

More information

Statement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:.

Statement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:. MATHEMATICAL STATISTICS Homework assignment Instructions Please turn in the homework with this cover page. You do not need to edit the solutions. Just make sure the handwriting is legible. You may discuss

More information

3.1 General Principles of Estimation.

3.1 General Principles of Estimation. 154 Chapter 3 Basic Theory of Point Estimation. Suppose X is a random observable taking values in a measurable space (Ξ, G) and let P = {P θ : θ Θ} denote the family of possible distributions of X. An

More information

Statistics. Lecture 4 August 9, 2000 Frank Porter Caltech. 1. The Fundamentals; Point Estimation. 2. Maximum Likelihood, Least Squares and All That

Statistics. Lecture 4 August 9, 2000 Frank Porter Caltech. 1. The Fundamentals; Point Estimation. 2. Maximum Likelihood, Least Squares and All That Statistics Lecture 4 August 9, 2000 Frank Porter Caltech The plan for these lectures: 1. The Fundamentals; Point Estimation 2. Maximum Likelihood, Least Squares and All That 3. What is a Confidence Interval?

More information

Practice Problems Section Problems

Practice Problems Section Problems Practice Problems Section 4-4-3 4-4 4-5 4-6 4-7 4-8 4-10 Supplemental Problems 4-1 to 4-9 4-13, 14, 15, 17, 19, 0 4-3, 34, 36, 38 4-47, 49, 5, 54, 55 4-59, 60, 63 4-66, 68, 69, 70, 74 4-79, 81, 84 4-85,

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information