Maximum Likelihood Estimation

Assume X ∼ P_θ, θ ∈ Θ, with joint pdf (or pmf) f(x | θ). Suppose we observe X = x.

The likelihood function is

    L(θ | x) = f(x | θ),

viewed as a function of θ with the data x held fixed.

The likelihood function L(θ | x) and the joint pdf f(x | θ) are the same function; the difference is that f(x | θ) is generally viewed as a function of x with θ held fixed, while L(θ | x) is viewed as a function of θ with x held fixed. Note that f(x | θ) is a density in x for each fixed θ, but L(θ | x) is not a density (or mass function) in θ for fixed x (except by coincidence).

The Maximum Likelihood Estimator (MLE)

A point estimator ˆθ = ˆθ(x) is an MLE for θ if

    L(ˆθ | x) = sup_θ L(θ | x),

that is, ˆθ maximizes the likelihood. In most cases the maximum is achieved at a unique value, and we can refer to the MLE and write

    ˆθ(x) = argmax_θ L(θ | x).

(But there are cases where the likelihood has flat spots and the MLE is not unique.)

Motivation for MLEs

Note: We often write L(θ | x) = L(θ), suppressing x, which is kept fixed at the observed data. Suppose x ∈ R^n.

Discrete case: If f(· | θ) is a mass function (X is discrete), then

    L(θ) = f(x | θ) = P_θ(X = x).

L(θ) is the probability of getting the observed data x when the parameter value is θ.

Continuous case: When f(· | θ) is a continuous density, P_θ(X = x) = 0, but if B ⊂ R^n is a very, very small ball (or cube) centered at the observed data x, then

    P_θ(X ∈ B) ≈ f(x | θ) · Volume(B) ∝ L(θ).

L(θ) is proportional to the probability that the random data X will be close to the observed data x when the parameter value is θ.

Thus, the MLE ˆθ is the value of θ which makes the observed data x most probable.
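The discrete-case interpretation can be made concrete with a small numerical sketch (not part of the original notes): for a hypothetical iid Bernoulli(θ) sample, the likelihood L(θ) = P_θ(X = x) is evaluated on a grid and its maximizer located directly.

```python
import numpy as np

# Minimal sketch: likelihood for n iid Bernoulli(theta) observations.
# L(theta) = P_theta(X = x) = theta^(sum x) * (1 - theta)^(n - sum x).
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])   # hypothetical observed data

def likelihood(theta, x):
    s, n = x.sum(), len(x)
    return theta**s * (1 - theta)**(n - s)

# Crude grid search for the maximizer (here the MLE is sum(x)/n = 0.7).
grid = np.linspace(0.001, 0.999, 9999)
theta_hat = grid[np.argmax(likelihood(grid, x))]
print(theta_hat)   # approximately 0.7
```

A grid search is crude, but it makes the "most probable data" interpretation visible; the following slides do the same maximization by calculus.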

To find ˆθ, we maximize L(θ). This is usually done by calculus (finding a stationary point), but not always:

If the parameter space Θ contains endpoints or boundary points, the maximum can be achieved at a boundary point without being a stationary point.

If L(θ) is not smooth (continuous and everywhere differentiable), the maximum does not have to be achieved at a stationary point.

Cautionary Example: Suppose X_1, ..., X_n are iid Uniform(0, θ) and Θ = (0, ∞). Given data x = (x_1, ..., x_n), find the MLE for θ.

    L(θ) = ∏_{i=1}^n θ^{-1} I(0 ≤ x_i ≤ θ) = θ^{-n} I(0 ≤ min x_i) I(max x_i ≤ θ)
         = θ^{-n}   for θ ≥ max x_i,
         = 0        for 0 < θ < max x_i.

(Draw this!) The likelihood is maximized at θ = max x_i, which is a point of discontinuity (and certainly not a stationary point). Thus, the MLE is ˆθ = max x_i = x_(n).

Notes: L(θ) = 0 for θ < max x_i is just saying that these values of θ are absolutely ruled out by the data (which is obvious). A strange property of the MLE in this example (not typical):

    P_θ(ˆθ < θ) = 1.

The MLE is biased; it is always less than the true value.
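As a quick numerical illustration (a sketch using simulated data, not from the notes), one can evaluate L(θ) on a grid for a hypothetical Uniform(0, θ) sample and confirm that the maximizer is max x_i and that the estimate always falls below the true θ.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 3.0
x = rng.uniform(0, theta_true, size=20)          # hypothetical sample

def L(theta, x):
    # theta^{-n} if theta >= max(x), else 0
    return np.where(theta >= x.max(), theta**(-len(x)), 0.0)

grid = np.linspace(0.01, 6.0, 100000)
print(grid[np.argmax(L(grid, x))], x.max())      # the grid maximizer sits (numerically) at max(x)

# Illustrating P_theta(theta_hat < theta) = 1: the MLE never exceeds theta_true.
print(x.max() < theta_true)                      # True
```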

A Similar Example: Let X_1, ..., X_n be iid Uniform(α, β) and Θ = {(α, β) : α < β}. Given data x = (x_1, ..., x_n), find the MLE for θ = (α, β).

    L(α, β) = ∏_{i=1}^n (β − α)^{-1} I(α ≤ x_i ≤ β)
            = (β − α)^{-n} I(α ≤ min x_i) I(max x_i ≤ β)
            = (β − α)^{-n}   for α ≤ min x_i and max x_i ≤ β,
            = 0              otherwise,

which is maximized by making β − α as small as possible without entering the "0 otherwise" region. Clearly, the maximum is achieved at (α, β) = (min x_i, max x_i). Thus the MLE is

    ˆθ = (ˆα, ˆβ) = (min x_i, max x_i).

Again, P_{α,β}(α < ˆα, ˆβ < β) = 1.
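The same shrink-the-interval argument is easy to check numerically (a sketch with simulated data, not from the notes): any interval strictly inside [min x_i, max x_i] excludes some observation and has likelihood 0, while any wider interval has a larger β − α and hence a smaller (β − α)^{-n}.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2.0, 5.0, size=30)              # hypothetical Uniform(alpha, beta) sample

alpha_hat, beta_hat = x.min(), x.max()           # MLE: shrink (beta - alpha) as far as the data allow
print(alpha_hat, beta_hat)

def L(a, b, x):
    return (b - a)**(-len(x)) if (a <= x.min() and x.max() <= b) else 0.0

# A slightly wider interval always has strictly smaller likelihood.
print(L(alpha_hat, beta_hat, x) > L(alpha_hat - 0.1, beta_hat + 0.1, x))   # True
```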

Maximizing the Likelihood (one parameter)

General Remarks

Basic Result: A continuous function g(θ) defined on a closed, bounded interval J attains its supremum (but might do so at one of the endpoints). (That is, there exists a point θ_0 ∈ J such that g(θ_0) = sup_{θ∈J} g(θ).)

Consequence: Suppose g(θ) is a continuous, non-negative function defined on an open interval J = (c, d) (where perhaps c = −∞ or d = +∞). If

    lim_{θ→c} g(θ) = lim_{θ→d} g(θ) = 0,

then g attains its supremum.

Thus, MLEs usually exist when the likelihood function is continuous.

Maxima at Stationary Points

Suppose the function g(θ) is defined on an interval Θ (which may be open or closed, infinite or finite). If g is differentiable and attains its supremum at a point θ_0 in the interior of Θ, that point must be a stationary point (that is, g′(θ_0) = 0).

(1) If g′(θ_0) = 0 and g″(θ_0) < 0, then θ_0 is a local maximum (but might not be the global maximum).

(2) If g′(θ_0) = 0 and g″(θ) < 0 for all θ ∈ Θ, then θ_0 is a global maximum (that is, it attains the supremum).

The condition in (1) is necessary (but not sufficient) for θ_0 to be a global maximum. Condition (2) is sufficient (but not necessary). A function satisfying g″(θ) < 0 for all θ ∈ Θ is called strictly concave. It lies below any tangent line.

Another useful condition (sufficient, but not necessary) is:

(3) If g′(θ) > 0 for θ < θ_0 and g′(θ) < 0 for θ > θ_0, then θ_0 is a global maximum.

Maximizing the Likelihood (multi-parameter)

Basic Result: A continuous function g(θ) defined on a closed, bounded set J ⊂ R^k attains its supremum (but might do so on the boundary).

Consequence: Suppose g(θ) is a continuous, non-negative function defined for all θ ∈ R^k. If g(θ) → 0 as ‖θ‖ → ∞, then g attains its supremum. Thus, MLEs usually exist when the likelihood function is continuous.

Suppose the function g(θ) is defined on a convex set Θ ⊂ R^k (that is, the line segment joining any two points in Θ lies entirely inside Θ). If g is differentiable and attains its supremum at a point θ_0 in the interior of Θ, that point must be a stationary point:

    ∂g(θ_0)/∂θ_i = 0   for i = 1, 2, ..., k.

Define the gradient vector D and the Hessian matrix H:

    D(θ) = ( ∂g(θ)/∂θ_i )_{i=1,...,k}           (a k × 1 vector),
    H(θ) = ( ∂²g(θ)/∂θ_i ∂θ_j )_{i,j=1,...,k}   (a k × k matrix),

where θ = (θ_1, θ_2, ..., θ_k).

Maxima at Stationary Points

(1) If D(θ_0) = 0 and H(θ_0) is negative definite, then θ_0 is a local maximum (but might not be the global maximum).

(2) If D(θ_0) = 0 and H(θ) is negative definite for all θ ∈ Θ, then θ_0 is a global maximum (that is, it attains the supremum).

(1) is necessary (but not sufficient) for θ_0 to be a global maximum. (2) is sufficient (but not necessary). A function for which H(θ) is negative definite for all θ ∈ Θ is called strictly concave. It lies below any tangent plane.

Positive and Negative Definite Matrices

Suppose M is a k × k symmetric matrix. Note: Hessian matrices and covariance matrices are symmetric.

Definitions:

M is positive definite if x′Mx > 0 for all x ≠ 0 (x ∈ R^k).
M is negative definite if x′Mx < 0 for all x ≠ 0.
M is non-negative definite (or positive semi-definite) if x′Mx ≥ 0 for all x ∈ R^k.

Facts:

M is p.d. iff all its eigenvalues are positive.
M is n.d. iff all its eigenvalues are negative.
M is n.n.d. iff all its eigenvalues are non-negative.
M is p.d. iff −M is n.d.
If M is p.d., all its diagonal elements must be positive.
If M is n.d., all its diagonal elements must be negative.
The determinant of a symmetric matrix is equal to the product of its eigenvalues.

2 × 2 Symmetric Matrices:

    M = (m_ij) = [ m_11  m_12 ]
                 [ m_21  m_22 ]   with m_12 = m_21,

    |M| = m_11 m_22 − m_12 m_21 = m_11 m_22 − m_12².

A 2 × 2 symmetric matrix is p.d. when its determinant is positive and its diagonal elements are positive. It is n.d. when its determinant is positive and its diagonal elements are negative.

The bare minimum you need to check:

M is p.d. if m_11 > 0 (or m_22 > 0) and |M| > 0.
M is n.d. if m_11 < 0 (or m_22 < 0) and |M| > 0.
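These checks are easy to automate. The sketch below (illustrative only; the numpy-based helper names are my own) tests negative definiteness both via eigenvalues and via the 2 × 2 shortcut from the notes.

```python
import numpy as np

def is_negative_definite(M):
    """Check negative definiteness of a symmetric matrix via its eigenvalues."""
    return bool(np.all(np.linalg.eigvalsh(M) < 0))

def is_nd_2x2(M):
    """2x2 shortcut from the notes: one negative diagonal entry and a positive determinant."""
    return M[0, 0] < 0 and np.linalg.det(M) > 0

M = np.array([[-2.0,  1.0],
              [ 1.0, -3.0]])
print(is_negative_definite(M), is_nd_2x2(M))   # True True

N = np.array([[-2.0,  3.0],
              [ 3.0, -3.0]])                   # diagonal negative, but det = 6 - 9 < 0
print(is_negative_definite(N), is_nd_2x2(N))   # False False
```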

Example: Let X_1, ..., X_n be iid Gamma(α, β).

Preliminaries: the likelihood is

    L(α, β) = ∏_{i=1}^n x_i^{α−1} e^{−x_i/β} / (β^α Γ(α)).

Maximizing L is the same as maximizing l = log L, given by

    l(α, β) = (α − 1) T_1 − T_2/β − nα log β − n log Γ(α),

where T_1 = Σ_i log x_i and T_2 = Σ_i x_i. Note that T = (T_1, T_2) is the natural sufficient statistic of this two-parameter exponential family.

The partial derivatives are

    ∂l/∂α = T_1 − n log β − nψ(α),   where ψ(α) ≡ (d/dα) log Γ(α) = Γ′(α)/Γ(α),
    ∂l/∂β = T_2/β² − nα/β = (1/β²)(T_2 − nαβ),
    ∂²l/∂α² = −nψ′(α),
    ∂²l/∂β² = −2T_2/β³ + nα/β² = −(1/β³)(2T_2 − nαβ),
    ∂²l/∂α∂β = −n/β.
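A quick way to catch algebra slips in derivatives like these is to compare them with finite differences. The sketch below uses simulated data (not from the notes); scipy's gammaln and digamma stand in for log Γ and ψ, and the analytic first-order partials are checked at an arbitrary point.

```python
import numpy as np
from scipy.special import gammaln, digamma

rng = np.random.default_rng(2)
x = rng.gamma(shape=2.5, scale=1.5, size=200)        # hypothetical Gamma(alpha, beta) sample
T1, T2, n = np.log(x).sum(), x.sum(), len(x)

def loglik(alpha, beta):
    return (alpha - 1) * T1 - T2 / beta - n * alpha * np.log(beta) - n * gammaln(alpha)

def grad(alpha, beta):
    dl_da = T1 - n * np.log(beta) - n * digamma(alpha)
    dl_db = (T2 - n * alpha * beta) / beta**2
    return np.array([dl_da, dl_db])

# Sanity check: analytic gradient vs. central finite differences at an arbitrary point.
a0, b0, h = 2.0, 1.2, 1e-6
fd = np.array([(loglik(a0 + h, b0) - loglik(a0 - h, b0)) / (2 * h),
               (loglik(a0, b0 + h) - loglik(a0, b0 - h)) / (2 * h)])
print(np.allclose(grad(a0, b0), fd, rtol=1e-4))      # True
```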

Situation #1: Suppose α = α_0 is known. Find the MLE for β. (Drop α from the arguments: l(β) = l(α_0, β), etc.)

l(β) is continuous and differentiable, and has a unique stationary point:

    l′(β) = ∂l/∂β = (1/β²)(T_2 − nα_0 β) = 0   iff   T_2 = nα_0 β   iff   β = T_2/(nα_0)  (≡ β*).

Now we check the second derivative:

    l″(β) = ∂²l/∂β² = −(1/β³)(2T_2 − nα_0 β) = −(1/β³)(T_2 + (T_2 − nα_0 β)).

Note that l″(β*) < 0 since T_2 − nα_0 β* = 0, but l″(β) > 0 for β > 2T_2/(nα_0). Thus, the stationary point satisfies the necessary condition for a global maximum, but not the sufficient condition (i.e., l(β) is not a strictly concave function).

How can we be sure that we have found the global maximum, and not just a local maximum? In this case, there is a simple argument: the stationary point β* is unique, with l′(β) > 0 for β < β* and l′(β) < 0 for β > β*. By condition (3) above, this ensures β* is the unique global maximizer.

Conclusion: ˆβ = T_2/(nα_0) is the MLE. (This is a function of T_2, which is a sufficient statistic for β when α is known.)
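A minimal numerical sanity check (simulated data; the minimize_scalar call is just one convenient optimizer, not part of the notes): maximize l(β) directly and compare with the closed-form ˆβ = T_2/(nα_0).

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
alpha0 = 2.5                                          # assumed known shape
x = rng.gamma(shape=alpha0, scale=1.7, size=500)      # hypothetical data, true beta = 1.7
T1, T2, n = np.log(x).sum(), x.sum(), len(x)

beta_hat = T2 / (n * alpha0)                          # closed-form MLE from the stationary point

def neg_loglik(beta):
    return -((alpha0 - 1) * T1 - T2 / beta - n * alpha0 * np.log(beta) - n * gammaln(alpha0))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 50.0), method="bounded")
print(beta_hat, res.x)                                # the two agree up to optimizer tolerance
```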

Situation #2: Suppose β = β_0 is known. Find the MLE for α. (Drop β from the arguments: l(α) = l(α, β_0), etc.)

Note: l′(α) and l″(α) involve the digamma function ψ. The function ψ is infinitely differentiable on the interval (0, ∞) and satisfies ψ′(α) > 0 and ψ″(α) < 0 for all α > 0 (it is strictly increasing and strictly concave). Also

    lim_{α→0+} ψ(α) = −∞   and   lim_{α→∞} ψ(α) = ∞.

Thus ψ⁻¹ : R → (0, ∞) exists. (Draw a picture.)

l(α) is continuous and differentiable, and has a unique stationary point:

    l′(α) = T_1 − n log β_0 − nψ(α) = 0   iff   ψ(α) = T_1/n − log β_0   iff   α = ψ⁻¹(T_1/n − log β_0).

This is the unique global maximizer since l″(α) = −nψ′(α) < 0 for all α > 0.

Thus ˆα = ψ⁻¹(T_1/n − log β_0) is the MLE. (This is a function of T_1, which is a sufficient statistic for α when β is known.)
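There is no closed form for ψ⁻¹, but since ψ is strictly increasing it can be inverted numerically with a bracketing root-finder. A sketch with simulated data (scipy's digamma and brentq are my choice of tools, not prescribed by the notes):

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(4)
beta0 = 1.5                                           # assumed known scale
alpha_true = 3.0
x = rng.gamma(shape=alpha_true, scale=beta0, size=500)
T1, n = np.log(x).sum(), len(x)

# alpha_hat = psi^{-1}(T1/n - log beta0); invert the strictly increasing digamma by root-finding.
target = T1 / n - np.log(beta0)
alpha_hat = brentq(lambda a: digamma(a) - target, 1e-8, 1e6)
print(alpha_hat)                                      # close to alpha_true for a large sample
```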

Situation #3: Find the MLE for θ = (α, β).

l(α, β) is continuous and differentiable. A stationary point must satisfy the system of two equations:

    ∂l/∂α = T_1 − n log β − nψ(α) = 0,
    ∂l/∂β = (1/β²)(T_2 − nαβ) = 0.

Solving the second equation for β gives β = T_2/(nα). Plugging this into the first equation, and rearranging a bit, leads to

    T_1/n − log(T_2/n) = ψ(α) − log α ≡ H(α).

The function H(α) is continuous and strictly increasing from (0, ∞) to (−∞, 0), so it has an inverse mapping (−∞, 0) to (0, ∞). Thus, the solution to the above equation can be written

    α = H⁻¹( T_1/n − log(T_2/n) ).
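As with ψ⁻¹, H⁻¹ has no closed form, but monotonicity makes it straightforward to compute. A sketch with simulated data (not from the notes; brentq brackets the root of H(α) minus the right-hand side):

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(5)
x = rng.gamma(shape=2.0, scale=3.0, size=1000)        # hypothetical Gamma(alpha, beta) data
T1, T2, n = np.log(x).sum(), x.sum(), len(x)

# Solve H(alpha) = psi(alpha) - log(alpha) = T1/n - log(T2/n); H is strictly increasing,
# so a bracketing root-finder inverts it.
rhs = T1 / n - np.log(T2 / n)
alpha_hat = brentq(lambda a: digamma(a) - np.log(a) - rhs, 1e-8, 1e6)
beta_hat = T2 / (n * alpha_hat)
print(alpha_hat, beta_hat)                            # close to (2.0, 3.0) for a large sample
```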

Thus, the unique stationary point is

    ˆα = H⁻¹( T_1/n − log(T_2/n) ),
    ˆβ = T_2/(nˆα).

Is this the MLE? Let us examine the Hessian:

    H(α, β) = [ ∂²l/∂α²     ∂²l/∂α∂β ]  =  [ −nψ′(α)    −n/β                  ]
              [ ∂²l/∂α∂β    ∂²l/∂β²  ]     [ −n/β       −(1/β³)(2T_2 − nαβ)   ]

Plugging in ˆβ = T_2/(nˆα) (so that T_2 = nˆαˆβ):

    H(ˆα, ˆβ) = [ −nψ′(ˆα)      −n²ˆα/T_2     ]
                [ −n²ˆα/T_2     −n³ˆα³/T_2²   ]

The diagonal elements are both negative, and the determinant is equal to

    (n⁴ˆα²/T_2²) ( ˆα ψ′(ˆα) − 1 ).

This is positive since αψ′(α) − 1 > 0 for all α > 0. This guarantees that H(ˆα, ˆβ) is negative definite, so that (ˆα, ˆβ) is (at least) a local maximum.
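Putting the pieces together, the sketch below (simulated data, not from the notes; polygamma(1, ·) is scipy's ψ′) computes (ˆα, ˆβ) and confirms numerically that the Hessian there is negative definite, both via eigenvalues and via the 2 × 2 determinant shortcut.

```python
import numpy as np
from scipy.special import digamma, polygamma
from scipy.optimize import brentq

rng = np.random.default_rng(6)
x = rng.gamma(shape=2.0, scale=3.0, size=1000)        # hypothetical Gamma(alpha, beta) data
T1, T2, n = np.log(x).sum(), x.sum(), len(x)

# Stationary point: invert H(alpha) = psi(alpha) - log(alpha), then beta_hat = T2/(n alpha_hat).
rhs = T1 / n - np.log(T2 / n)
alpha_hat = brentq(lambda a: digamma(a) - np.log(a) - rhs, 1e-8, 1e6)
beta_hat = T2 / (n * alpha_hat)

# Hessian of the log-likelihood evaluated at the stationary point.
H = np.array([[-n * polygamma(1, alpha_hat), -n / beta_hat],
              [-n / beta_hat, -(2 * T2 - n * alpha_hat * beta_hat) / beta_hat**3]])
print(np.all(np.linalg.eigvalsh(H) < 0))          # True: negative definite, so a local maximum
print(H[0, 0] < 0 and np.linalg.det(H) > 0)       # the 2x2 shortcut gives the same conclusion
```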
