Likelihood & Maximum Likelihood Estimators
February 26, 2018
Debdeep Pati

1 Likelihood

Likelihood is surely one of the most important concepts in statistical theory. We have seen the role it plays in sufficiency, through the factorization theorem. More importantly, the likelihood function establishes a preference among the possible parameter values given data $X = x$: a parameter value $\theta_1$ with larger likelihood is better than a parameter value $\theta_2$ with smaller likelihood, in the sense that the model $P_{\theta_1}$ provides a better fit to the observed data than $P_{\theta_2}$. This leads naturally to procedures for inference which select, as a point estimator, the parameter value that makes the likelihood largest, or which reject a null hypothesis if the hypothesized value has likelihood that is too small. Estimators and hypothesis tests based on the likelihood function have some generally desirable properties; in particular, there are widely applicable large-sample approximations of the relevant sampling distributions. A main focus of this chapter is the mostly rigorous derivation of these important results.

1.1 Likelihood function

Consider a class of probability models $\{P_\theta : \theta \in \Theta\}$, defined on the measurable space $(\mathcal{X}, \mathcal{A})$ and absolutely continuous with respect to a dominating $\sigma$-finite measure $\mu$. In this case, for each $\theta$, the Radon-Nikodym derivative $(dP_\theta/d\mu)(x)$ is the usual probability density function for the observable $X$, written $p_\theta(x)$. For fixed $\theta$, we know that $p_\theta(x)$ characterizes the sampling distribution of $X$, as well as that of any statistic $T = T(X)$. But how do we use and interpret $p_\theta(x)$ as a function of $\theta$ for fixed $x$? This is a special function with its own name: the likelihood function.

Definition 1. Given $X = x$, the likelihood function is $L(\theta) = p_\theta(x)$.

The intuition behind the choice of name is that a $\theta$ for which $L(\theta)$ is large is more likely to be the true value than a $\theta'$ for which $L(\theta')$ is small. The name likelihood was coined by Fisher (1973):

    What has now appeared is that the mathematical concept of probability is... inadequate to express our mental confidence or indifference in making... inferences, and that the mathematical quantity which usually appears to be appropriate for measuring our order of preference among different possible populations does not in fact obey the laws of probability. To distinguish it from probability, I have used the term likelihood to designate this quantity; since both the words likelihood and probability are loosely used in common speech to cover both kinds of relationship.

Fisher's point is that $L(\theta)$ is a measure of how plausible $\theta$ is, but that this measure of plausibility is different from our usual understanding of probability; see Aldrich (1997) for more on Fisher and likelihood. While we understand the probability (density) $p_\theta(x)$, for fixed $\theta$, as a pre-experimental summary of our uncertainty about where $X$ will fall, the likelihood $L(\theta) = p_\theta(x)$, for fixed $x$, gives a post-experimental summary of how likely it is that the model $P_\theta$ produced the observed $X = x$. In other words, the likelihood function provides a ranking of the possible parameter values: those $\theta$ with greater likelihood are better, in that they fit the data better, than those $\theta$ with smaller likelihood. Therefore, only the shape of the likelihood function is relevant, not the scale. The likelihood function is useful across all approaches to statistics. We have already seen some uses of the likelihood function; in particular, the factorization theorem states that the (shape of the) likelihood function depends on the observed data $X = x$ only through the sufficient statistic. The next section discusses some standard and some not-so-standard uses of the likelihood.

2 Maximum Likelihood Estimation and first order theory

Assume $X \sim P_\theta$, $\theta \in \Theta$, with joint pdf (or pmf) $f(x \mid \theta)$. Suppose we observe $X = x$. The likelihood function is $L(\theta \mid x) = f(x \mid \theta)$, viewed as a function of $\theta$ with the data $x$ held fixed. The likelihood function $L(\theta \mid x)$ and the joint pdf $f(x \mid \theta)$ are the same except that $f(x \mid \theta)$ is generally viewed as a function of $x$ with $\theta$ held fixed, and $L(\theta \mid x)$ as a function of $\theta$ with $x$ held fixed. $f(x \mid \theta)$ is a density in $x$ for each fixed $\theta$, but $L(\theta \mid x)$ is not a density (or mass function) in $\theta$ for fixed $x$ (except by coincidence).

Given a class of potential models $P_\theta$ indexed by $\Theta$, a subset of $\mathbb{R}^d$, we observe $X = x$ and would like to know which model is the most likely to have produced this $x$. This defines an optimization problem, and the result, namely
$$\hat\theta = \arg\max_{\theta \in \Theta} L(\theta), \qquad (1)$$
is the maximum likelihood estimate (MLE) of $\theta$. Naturally, $P_{\hat\theta}$ is then considered the most likely model; that is, among the class $\{P_\theta : \theta \in \Theta\}$, the model $P_{\hat\theta}$ provides the best fit to the observed $X = x$. In terms of the ranking intuition, $\hat\theta$ is ranked the highest.

When the likelihood function is smooth, the optimization problem can be posed as a root-finding problem. That is, the MLE $\hat\theta$ can be viewed as a solution to the equation
$$\nabla l(\theta) = 0, \qquad (2)$$
where $\nabla$ denotes the gradient operator, $l = \log L$ is the log-likelihood, and the right-hand side is a $d$-vector of zeroes. Equation (2) is called the likelihood equation. In many examples, the solution of this equation is available in closed form; when it is not, we resort to numerical optimization tools.
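
As a concrete illustration of (1), here is a minimal R sketch (the Bernoulli model, data, and sample size are assumptions chosen for illustration, not part of the notes): it evaluates the Bernoulli likelihood on a grid and picks out the maximizer, which agrees with the closed-form MLE, the sample proportion.

## Minimal sketch (assumed Bernoulli example): L(theta) = prod theta^x_i (1-theta)^(1-x_i)
set.seed(1)
x <- rbinom(20, size = 1, prob = 0.3)    # hypothetical data
theta.grid <- seq(0.001, 0.999, by = 0.001)
loglik <- sapply(theta.grid,
                 function(th) sum(dbinom(x, 1, th, log = TRUE)))
theta.grid[which.max(loglik)]            # grid maximizer of the likelihood
mean(x)                                  # closed-form MLE, the sample proportion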

2.1 A bit about computation

Newton's method is a simple and powerful tool for doing optimization or, more precisely, root finding. You should be familiar with this method from a calculus course. The idea is based on the fact that, locally, any differentiable function can be suitably approximated by a linear function. This linear function is then used to define a recursive procedure that will, under suitable conditions, eventually find the desired solution to (2). The MLE is a solution to this equation, i.e., a root of the gradient of the log-likelihood function.

Assume that the gradient $\nabla l(\theta)$ is itself differentiable, and let $D(\theta)$ denote its matrix of derivatives, i.e., $D(\theta)_{ij} = (\partial^2/\partial\theta_i\partial\theta_j)\, l(\theta)$. Assume $D(\theta)$ is non-singular for all $\theta$. The idea behind Newton's method is as follows. Pick some initial guess $\theta^{(0)}$ of the MLE $\hat\theta$, and approximate $\nabla l(\theta)$ by a linear function:
$$\nabla l(\theta) = \nabla l(\theta^{(0)}) + D(\theta^{(0)})(\theta - \theta^{(0)}) + \text{error}.$$
Ignore the error, solve for $\theta$, and call the solution $\theta^{(1)}$:
$$\theta^{(1)} = \theta^{(0)} - D(\theta^{(0)})^{-1} \nabla l(\theta^{(0)}).$$
If $\theta^{(0)}$ is close to the solution of the likelihood equation, then so is $\theta^{(1)}$. The idea is to iterate this process until the solutions converge: pick a reasonable starting value $\theta^{(0)}$ and, at iteration $t \ge 0$, set
$$\theta^{(t+1)} = \theta^{(t)} - D(\theta^{(t)})^{-1} \nabla l(\theta^{(t)}),$$
stopping when $t$ is large and/or $\|\theta^{(t+1)} - \theta^{(t)}\|$ is small.

There are many tools available for doing optimization; the Newton method described above is just one simple approach. Fortunately, good implementations of these methods are already available in standard software. For example, the routine optim in R is a powerful and simple-to-use tool for generic optimization. For problems of a certain form, specifically problems that can be written in a latent-variable form, there is a very clever tool called the EM algorithm (e.g., Dempster et al. 1977) for maximizing the likelihood. Section 9.6 in Keener (2010) gives some description of this method.
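
To make the iteration concrete, here is a minimal R sketch of the Newton update for a one-parameter problem: the gamma shape $\alpha$ with the scale $\beta_0$ known (a setting treated later in these notes as Situation #2). The simulated data, starting value, and tolerance are assumptions for illustration; the cross-check with optim is just one way to confirm the answer.

## Newton iteration a <- a - l'(a)/l''(a) for the gamma shape with known scale beta0.
set.seed(2)
beta0 <- 2
x  <- rgamma(100, shape = 3, scale = beta0)                 # hypothetical data
n  <- length(x); T1 <- sum(log(x))
score  <- function(a) T1 - n*log(beta0) - n*digamma(a)      # l'(alpha)
dscore <- function(a) -n*trigamma(a)                        # l''(alpha)
a <- mean(x)/beta0                                          # crude starting value
for (t in 1:50) {
  step <- score(a)/dscore(a)
  a <- a - step
  if (abs(step) < 1e-10) break
}
a   # Newton solution of the likelihood equation for alpha
## cross-check with a generic optimizer (optim minimizes, so negate the log-likelihood)
optim(mean(x)/beta0,
      function(a) -(sum((a-1)*log(x)) - sum(x)/beta0 - n*a*log(beta0) - n*lgamma(a)),
      method = "Brent", lower = 1e-6, upper = 100)$par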

2.2 The Maximum Likelihood Estimator (MLE)

A point estimator $\hat\theta = \hat\theta(x)$ is an MLE for $\theta$ if
$$L(\hat\theta \mid x) = \sup_{\theta} L(\theta \mid x),$$
that is, $\hat\theta$ maximizes the likelihood. In most cases the maximum is achieved at a unique value, and we can refer to the MLE and write $\hat\theta(x) = \arg\max_\theta L(\theta \mid x)$. (But there are cases where the likelihood has flat spots and the MLE is not unique.)

2.3 Motivation for MLEs

Note: We often write $L(\theta \mid x) = L(\theta)$, suppressing $x$, which is kept fixed at the observed data. Suppose $x \in \mathbb{R}^n$.

Discrete case: If $f(\cdot \mid \theta)$ is a mass function ($X$ is discrete), then $L(\theta) = f(x \mid \theta) = P_\theta(X = x)$, i.e., $L(\theta)$ is the probability of getting the observed data $x$ when the parameter value is $\theta$.

Continuous case: When $f(\cdot \mid \theta)$ is a continuous density, $P_\theta(X = x) = 0$, but if $B \subset \mathbb{R}^n$ is a very small ball (or cube) centered at the observed data $x$, then
$$P_\theta(X \in B) \approx f(x \mid \theta)\,\mathrm{Volume}(B) \propto L(\theta).$$
So $L(\theta)$ is proportional to the probability that the random data $X$ will be close to the observed data $x$ when the parameter value is $\theta$.

Thus, the MLE $\hat\theta$ is the value of $\theta$ which makes the observed data $x$ most probable. To find $\hat\theta$, we maximize $L(\theta)$. This is usually done by calculus (finding a stationary point), but not always. If the parameter space $\Theta$ contains endpoints or boundary points, the maximum can be achieved at a boundary point without being a stationary point. If $L(\theta)$ is not smooth (continuous and everywhere differentiable), the maximum need not be achieved at a stationary point.

Cautionary Example: Suppose $X_1, \ldots, X_n$ are iid Uniform$(0, \theta)$ and $\Theta = (0, \infty)$. Given data $x = (x_1, \ldots, x_n)$, find the MLE for $\theta$. Here
$$L(\theta) = \prod_{i=1}^n \theta^{-1} I(0 < x_i < \theta) = \theta^{-n} I(0 \le x_{(1)}) I(x_{(n)} \le \theta)
= \begin{cases} \theta^{-n} & \text{for } \theta \ge x_{(n)} \\ 0 & \text{for } 0 < \theta < x_{(n)}, \end{cases}$$
which is maximized at $\theta = x_{(n)}$, a point of discontinuity and certainly not a stationary point. Thus, the MLE is $\hat\theta = x_{(n)}$.

Notes: $L(\theta) = 0$ for $\theta < x_{(n)}$ is just saying that these values of $\theta$ are absolutely ruled out by the data (which is obvious). A strange property of the MLE in this example (not typical): $P_\theta(\hat\theta < \theta) = 1$. The MLE is biased; it is always less than the true value.
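
A quick simulation (sample size, true $\theta$, and number of replications are assumptions for illustration) shows both facts at once: the MLE is the sample maximum, and it always underestimates $\theta$.

## Minimal sketch for the Uniform(0, theta) example.
set.seed(3)
theta <- 5; n <- 50
mle <- replicate(5000, max(runif(n, 0, theta)))  # MLE = x_(n) for each simulated sample
mean(mle < theta)    # equals 1: the MLE is always below the true value
mean(mle)            # roughly n/(n+1) * theta, showing the downward bias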

A Similar Example: Let $X_1, \ldots, X_n$ be iid Uniform$(\alpha, \beta)$ and $\Theta = \{(\alpha, \beta) : \alpha < \beta\}$. Given data $x = (x_1, \ldots, x_n)$, find the MLE for $\theta = (\alpha, \beta)$. Here
$$L(\alpha, \beta) = \prod_{i=1}^n (\beta - \alpha)^{-1} I(\alpha < x_i < \beta) = (\beta - \alpha)^{-n} I(\alpha \le x_{(1)}) I(x_{(n)} \le \beta)
= \begin{cases} (\beta - \alpha)^{-n} & \text{for } \alpha \le x_{(1)},\ x_{(n)} \le \beta \\ 0 & \text{otherwise}, \end{cases}$$
which is maximized by making $\beta - \alpha$ as small as possible without entering the "0 otherwise" region. Clearly, the maximum is achieved at $(\alpha, \beta) = (x_{(1)}, x_{(n)})$. Thus the MLE is $\hat\theta = (\hat\alpha, \hat\beta) = (x_{(1)}, x_{(n)})$. Again, $P_{\alpha,\beta}(\alpha < \hat\alpha,\ \hat\beta < \beta) = 1$.

3 Maximizing the Likelihood (one parameter)

3.1 General Remarks

Basic Result: A continuous function $g(\theta)$ defined on a closed, bounded interval $J$ attains its supremum (but might do so at one of the endpoints). That is, there exists a point $\theta_0 \in J$ such that $g(\theta_0) = \sup_{\theta \in J} g(\theta)$.

Consequence: Suppose $g(\theta)$ is a continuous, non-negative function defined on an open interval $J = (c, d)$ (where perhaps $c = -\infty$ or $d = \infty$). If
$$\lim_{\theta \to c} g(\theta) = \lim_{\theta \to d} g(\theta) = 0,$$
then $g$ attains its supremum. Thus, MLEs usually exist when the likelihood function is continuous.

4 Maxima at Stationary Points

Suppose the function $g(\theta)$ is defined on an interval $\Theta$ (which may be open or closed, infinite or finite). If $g$ is differentiable and attains its supremum at a point $\theta_0$ in the interior of $\Theta$, that point must be a stationary point (that is, $g'(\theta_0) = 0$).

1. If $g'(\theta_0) = 0$ and $g''(\theta_0) < 0$, then $\theta_0$ is a local maximum (but might not be the global maximum).

2. If $g'(\theta_0) = 0$ and $g''(\theta) < 0$ for all $\theta \in \Theta$, then $\theta_0$ is a global maximum (that is, it attains the supremum).

The condition in (1) is necessary (but not sufficient) for $\theta_0$ to be a global maximum. Condition (2) is sufficient (but not necessary). A function satisfying $g''(\theta) < 0$ for all $\theta \in \Theta$ is called strictly concave; it lies below any tangent line. Another useful condition (sufficient, but not necessary) is:

3. If $g'(\theta) > 0$ for $\theta < \theta_0$ and $g'(\theta) < 0$ for $\theta > \theta_0$, then $\theta_0$ is a global maximum.

5 Maximizing the Likelihood (multi-parameter)

5.1 Basic Result

A continuous function $g(\theta)$ defined on a closed, bounded set $J \subset \mathbb{R}^k$ attains its supremum (but might do so on the boundary).

5.2 Consequence

Suppose $g(\theta)$ is a continuous, non-negative function defined for all $\theta \in \mathbb{R}^k$. If $g(\theta) \to 0$ as $\|\theta\| \to \infty$, then $g$ attains its supremum. Thus, MLEs usually exist when the likelihood function is continuous.

Suppose the function $g(\theta)$ is defined on a convex set $\Theta \subset \mathbb{R}^k$ (that is, the line segment joining any two points in $\Theta$ lies entirely inside $\Theta$). If $g$ is differentiable and attains its supremum at a point $\theta_0$ in the interior of $\Theta$, that point must be a stationary point:
$$\frac{\partial g(\theta_0)}{\partial \theta_i} = 0, \quad i = 1, 2, \ldots, k.$$
Define the gradient vector $D$ and Hessian matrix $H$:
$$D(\theta) = \left( \frac{\partial g(\theta)}{\partial \theta_i} \right)_{i=1}^k \ \text{(a $k \times 1$ vector)}, \qquad
H(\theta) = \left( \frac{\partial^2 g(\theta)}{\partial \theta_i \partial \theta_j} \right)_{i,j=1}^k \ \text{(a $k \times k$ matrix)}.$$

5.3 Maxima at Stationary Points

1. If $D(\theta_0) = 0$ and $H(\theta_0)$ is negative definite, then $\theta_0$ is a local maximum (but might not be the global maximum).

2. If $D(\theta_0) = 0$ and $H(\theta)$ is negative definite for all $\theta \in \Theta$, then $\theta_0$ is a global maximum (that is, it attains the supremum).

(1) is necessary (but not sufficient) for $\theta_0$ to be a global maximum. (2) is sufficient (but not necessary). A function for which $H(\theta)$ is negative definite for all $\theta \in \Theta$ is called strictly concave; it lies below any tangent plane.

5.4 Positive and Negative Definite Matrices

Suppose $M$ is a $k \times k$ symmetric matrix. Note: Hessian matrices and covariance matrices are symmetric.

Definitions:

1. $M$ is positive definite (p.d.) if $x' M x > 0$ for all $x \ne 0$ ($x \in \mathbb{R}^k$).
2. $M$ is negative definite (n.d.) if $x' M x < 0$ for all $x \ne 0$.
3. $M$ is non-negative definite (n.n.d., or positive semi-definite) if $x' M x \ge 0$ for all $x \in \mathbb{R}^k$.

Facts:

1. $M$ is p.d. iff all its eigenvalues are positive.
2. $M$ is n.d. iff all its eigenvalues are negative.
3. $M$ is n.n.d. iff all its eigenvalues are non-negative.
4. $M$ is p.d. iff $-M$ is n.d.
5. If $M$ is p.d., all its diagonal elements must be positive.
6. If $M$ is n.d., all its diagonal elements must be negative.
7. The determinant of a symmetric matrix is equal to the product of its eigenvalues.

$2 \times 2$ Symmetric Matrices:
$$M = (m_{ij}) = \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix}, \quad m_{21} = m_{12}, \qquad
|M| = m_{11} m_{22} - m_{12} m_{21} = m_{11} m_{22} - m_{12}^2.$$
A $2 \times 2$ matrix is p.d. when the determinant is positive and the diagonal elements are positive. A $2 \times 2$ matrix is n.d. when the determinant is positive and the diagonal elements are negative. The bare minimum you need to check: $M$ is p.d. if $m_{11} > 0$ (or $m_{22} > 0$) and $|M| > 0$; $M$ is n.d. if $m_{11} < 0$ (or $m_{22} < 0$) and $|M| > 0$.
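
As a small numerical aid, the following R sketch (the matrix is an arbitrary illustrative example, not from the notes) checks negative definiteness both via eigenvalues and via the $2 \times 2$ shortcut above.

## Minimal sketch: checking negative definiteness of a symmetric 2x2 matrix.
M <- matrix(c(-3, 1,
               1, -2), nrow = 2, byrow = TRUE)   # assumed example matrix
eigen(M, symmetric = TRUE)$values                # all negative  =>  M is n.d.
M[1, 1] < 0 && det(M) > 0                        # the 2x2 shortcut gives the same answer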

Example: Let $X_1, X_2, \ldots, X_n$ be iid Gamma$(\alpha, \beta)$.

Preliminaries:
$$L(\alpha, \beta) = \prod_{i=1}^n \frac{x_i^{\alpha-1} e^{-x_i/\beta}}{\beta^\alpha \Gamma(\alpha)}.$$
Maximizing $L$ is the same as maximizing $l = \log L$, given by
$$l(\alpha, \beta) = (\alpha - 1) T_1 - T_2/\beta - n\alpha \log\beta - n \log\Gamma(\alpha),$$
where $T_1 = \sum_i \log x_i$ and $T_2 = \sum_i x_i$. Note that $T = (T_1, T_2)$ is the natural sufficient statistic of this 2pef. The first and second derivatives are
$$\frac{\partial l}{\partial \alpha} = T_1 - n\log\beta - n\psi(\alpha), \qquad \psi(\alpha) \equiv \frac{d}{d\alpha}\log\Gamma(\alpha) = \frac{\Gamma'(\alpha)}{\Gamma(\alpha)},$$
$$\frac{\partial l}{\partial \beta} = \frac{T_2}{\beta^2} - \frac{n\alpha}{\beta} = \frac{1}{\beta^2}(T_2 - n\alpha\beta),$$
$$\frac{\partial^2 l}{\partial \alpha^2} = -n\psi'(\alpha), \qquad
\frac{\partial^2 l}{\partial \beta^2} = -\frac{2T_2}{\beta^3} + \frac{n\alpha}{\beta^2} = -\frac{1}{\beta^3}(2T_2 - n\alpha\beta), \qquad
\frac{\partial^2 l}{\partial \alpha \partial \beta} = -\frac{n}{\beta}.$$

Situation #1: Suppose $\alpha = \alpha_0$ is known. Find the MLE for $\beta$. (Drop $\alpha$ from the arguments: $l(\beta) = l(\alpha_0, \beta)$, etc.) $l(\beta)$ is continuous and differentiable, and has a unique stationary point:
$$l'(\beta) = \frac{\partial l}{\partial \beta} = \frac{1}{\beta^2}(T_2 - n\alpha_0\beta) = 0 \iff T_2 = n\alpha_0\beta \iff \beta = \frac{T_2}{n\alpha_0} \ (\equiv \beta^*).$$
Now we check the second derivative:
$$l''(\beta) = \frac{\partial^2 l}{\partial \beta^2} = -\frac{1}{\beta^3}(2T_2 - n\alpha_0\beta) = -\frac{1}{\beta^3}\{T_2 + (T_2 - n\alpha_0\beta)\}.$$

Note that $l''(\beta^*) < 0$ since $T_2 - n\alpha_0\beta^* = 0$, but $l''(\beta) > 0$ for $\beta > 2T_2/(n\alpha_0)$. Thus the stationary point satisfies the necessary condition for a global maximum, but not the sufficient condition (i.e., $l(\beta)$ is not a strictly concave function). How can we be sure that we have found the global maximum, and not just a local maximum? In this case there is a simple argument: the stationary point $\beta^*$ is unique, and $l'(\beta) > 0$ for $\beta < \beta^*$ and $l'(\beta) < 0$ for $\beta > \beta^*$. This ensures $\beta^*$ is the unique global maximizer. Conclusion:
$$\hat\beta = \frac{T_2}{n\alpha_0}.$$
(This is a function of $T_2$, which is a sufficient statistic for $\beta$ when $\alpha$ is known.)

Situation #2: Suppose $\beta = \beta_0$ is known. Find the MLE for $\alpha$. (Drop $\beta$ from the arguments: $l(\alpha) = l(\alpha, \beta_0)$, etc.) Note that $l'(\alpha)$ and $l''(\alpha)$ involve $\psi(\alpha)$. The function $\psi$ is infinitely differentiable on the interval $(0, \infty)$ and satisfies $\psi'(\alpha) > 0$ and $\psi''(\alpha) < 0$ for all $\alpha > 0$ (it is strictly increasing and strictly concave). Also,
$$\lim_{\alpha \to 0^+} \psi(\alpha) = -\infty, \qquad \lim_{\alpha \to \infty} \psi(\alpha) = \infty.$$
Thus $\psi^{-1} : \mathbb{R} \to (0, \infty)$ exists. $l(\alpha)$ is continuous and differentiable, and has a unique stationary point:
$$l'(\alpha) = T_1 - n\log\beta_0 - n\psi(\alpha) = 0 \iff \psi(\alpha) = T_1/n - \log\beta_0 \iff \alpha = \psi^{-1}(T_1/n - \log\beta_0).$$
This is the unique global maximizer since $l''(\alpha) = -n\psi'(\alpha) < 0$ for all $\alpha > 0$. Thus $\hat\alpha = \psi^{-1}(T_1/n - \log\beta_0)$ is the MLE. (This is a function of $T_1$, which is a sufficient statistic for $\alpha$ when $\beta$ is known.)

Situation #3: Find the MLE for $\theta = (\alpha, \beta)$. $l(\alpha, \beta)$ is continuous and differentiable. A stationary point must satisfy the system of two equations:
$$\frac{\partial l}{\partial \alpha} = T_1 - n\log\beta - n\psi(\alpha) = 0, \qquad
\frac{\partial l}{\partial \beta} = \frac{1}{\beta^2}(T_2 - n\alpha\beta) = 0.$$
Solving the second equation for $\beta$ gives
$$\beta = \frac{T_2}{n\alpha}.$$

Plugging this into the first equation and rearranging a bit leads to
$$\frac{T_1}{n} - \log\left(\frac{T_2}{n}\right) = \psi(\alpha) - \log\alpha \equiv H(\alpha).$$
The function $H(\alpha)$ is continuous and strictly increasing from $(0, \infty)$ onto $(-\infty, 0)$, so it has an inverse mapping $(-\infty, 0)$ to $(0, \infty)$. Thus the solution to the above equation can be written
$$\alpha = H^{-1}\left\{\frac{T_1}{n} - \log\left(\frac{T_2}{n}\right)\right\},$$
and the unique stationary point is
$$\hat\alpha = H^{-1}\left\{\frac{T_1}{n} - \log\left(\frac{T_2}{n}\right)\right\}, \qquad \hat\beta = \frac{T_2}{n\hat\alpha}.$$
Is this the MLE? Let us examine the Hessian:
$$H(\alpha, \beta) = \begin{pmatrix} \frac{\partial^2 l}{\partial \alpha^2} & \frac{\partial^2 l}{\partial \alpha \partial \beta} \\ \frac{\partial^2 l}{\partial \alpha \partial \beta} & \frac{\partial^2 l}{\partial \beta^2} \end{pmatrix}
= \begin{pmatrix} -n\psi'(\alpha) & -\frac{n}{\beta} \\ -\frac{n}{\beta} & -\frac{1}{\beta^3}(2T_2 - n\alpha\beta) \end{pmatrix},
\qquad
H(\hat\alpha, \hat\beta) = \begin{pmatrix} -n\psi'(\hat\alpha) & -\frac{n^2\hat\alpha}{T_2} \\ -\frac{n^2\hat\alpha}{T_2} & -\frac{n^3\hat\alpha^3}{T_2^2} \end{pmatrix}.$$
The diagonal elements are both negative, and the determinant equals
$$\frac{n^4\hat\alpha^2}{T_2^2}\,(\hat\alpha\,\psi'(\hat\alpha) - 1),$$
which is positive since $\alpha\psi'(\alpha) - 1 > 0$ for all $\alpha > 0$. This guarantees that $H(\hat\alpha, \hat\beta)$ is negative definite, so that $(\hat\alpha, \hat\beta)$ is at least a local maximum.
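
The stationary equations above are easy to solve numerically. Here is a minimal R sketch (simulated data with assumed true values, not part of the notes) that inverts $H(\alpha) = \psi(\alpha) - \log\alpha$ with uniroot and then sets $\hat\beta = T_2/(n\hat\alpha)$.

## Minimal sketch: solving the gamma stationary equations numerically.
set.seed(4)
x  <- rgamma(200, shape = 2.5, scale = 1.7)   # hypothetical data
rhs  <- mean(log(x)) - log(mean(x))           # T1/n - log(T2/n), always < 0 by Jensen
Hfun <- function(a) digamma(a) - log(a)       # H(alpha) = psi(alpha) - log(alpha)
alpha.hat <- uniroot(function(a) Hfun(a) - rhs, interval = c(1e-6, 1e6))$root
beta.hat  <- mean(x) / alpha.hat              # T2/(n * alpha.hat)
c(alpha.hat, beta.hat)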

6 Invariance principle for MLEs

If $\eta = \tau(\theta)$ and $\hat\theta$ is the MLE of $\theta$, then $\hat\eta = \tau(\hat\theta)$ is the MLE of $\eta$.

Comments:

1. If $\tau(\theta)$ is a 1-1 function, this is a trivial theorem.
2. If $\tau(\theta)$ is not 1-1, this is essentially true by definition of the induced likelihood (see below).

Example: $X = (X_1, X_2, \ldots, X_n)$ iid $N(\mu, \sigma^2)$. The usual parameters $\theta = (\mu, \sigma^2)$ are related to the natural parameters $\eta = (\mu/\sigma^2,\ -1/(2\sigma^2))$ of the 2pef by a 1-1 function: $\eta = \tau(\theta)$. The likelihood in terms of $\theta$ is
$$L_1(\theta) = (2\pi\sigma^2)^{-n/2}\, e^{-n\mu^2/(2\sigma^2)}\, e^{(\mu/\sigma^2) T_1 - (1/(2\sigma^2)) T_2},$$
where $T_1 = \sum X_i$ and $T_2 = \sum X_i^2$.

Simple Example: $X = (X_1, X_2, \ldots, X_n)$ iid Bernoulli$(p)$. It is known that the MLE of $p$ is $\hat p = \bar X$. Thus

1. the MLE of $p^2$ is $\hat p^2 = \bar X^2$;
2. the MLE of $p(1-p)$ is $\bar X(1 - \bar X)$.

The function of $p$ in 1. is 1-1, but the one in 2. is not.

6.1 Induced Likelihood

Definition 2. If $\eta = \tau(\theta)$, then the induced likelihood is
$$L^*(\eta) \equiv \sup_{\theta : \tau(\theta) = \eta} L(\theta).$$
Go back to the example $X_1, X_2, \ldots, X_n \sim N(\mu, \sigma^2)$ iid. If the MLE $\hat\eta$ of $\eta$ is defined to be the value which maximizes $L^*(\eta)$, then it is easily seen that $\hat\eta = \tau(\hat\theta)$. The likelihood in terms of $\eta$ is obtained by substituting into $L_1(\theta)$, that is, evaluating $L_1$ at
$$\mu = -\eta_1/(2\eta_2), \qquad \sigma^2 = -1/(2\eta_2), \qquad \theta = (\mu, \sigma^2) = (-\eta_1/(2\eta_2),\ -1/(2\eta_2)) = \tau^{-1}(\eta),$$
which gives
$$L_2(\eta) = (-\pi/\eta_2)^{-n/2}\, e^{n\eta_1^2/(4\eta_2)}\, e^{\eta_1 T_1 + \eta_2 T_2}.$$

Stated abstractly, $L_2(\eta) = L_1(\tau^{-1}(\eta))$, so that $L_2$ is maximized when $\tau^{-1}(\eta) = \hat\theta$, that is, by $\eta = \tau(\hat\theta)$. The MLE of $\theta$ is known to be
$$\hat\theta = (\hat\mu, \hat\sigma^2) = \left(\bar X,\ \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2\right),$$
so the invariance principle says the MLE of $\eta$ is
$$\hat\eta = \tau(\hat\theta) = \left(\frac{\hat\mu}{\hat\sigma^2},\ -\frac{1}{2\hat\sigma^2}\right).$$

Continuation of example: What is the MLE of $\alpha = \mu + \sigma^2$? Note that
$$\alpha = g(\mu, \sigma^2) = \mu + \sigma^2$$
is not a 1-1 function of $\theta$, but by invariance
$$\hat\alpha = g(\hat\mu, \hat\sigma^2) = \hat\mu + \hat\sigma^2 = \bar X + SS/n,$$
where $SS = \sum_{i=1}^n (X_i - \bar X)^2$. What is the MLE of $\mu$? Of $\sigma^2$? With $g_1(x, y) = x$ and $g_2(x, y) = y$, we have $\mu = g_1(\theta)$ and $\sigma^2 = g_2(\theta)$, so that the MLEs are
$$\hat\mu = g_1(\hat\theta) = \bar X, \qquad \hat\sigma^2 = g_2(\hat\theta) = SS/n.$$
Thus the invariance principle implies
$$\widehat{(\mu, \sigma^2)}\ (\text{MLE as a pair}) = \big(\hat\mu\ (\text{MLE of } \mu),\ \hat\sigma^2\ (\text{MLE of } \sigma^2)\big).$$
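
A tiny R sketch of the invariance principle for the normal example (simulated data with assumed true values): compute the MLE of $(\mu, \sigma^2)$ and read off the MLE of $\mu + \sigma^2$ directly.

## Minimal sketch of MLE invariance for N(mu, sigma^2); true values are assumptions.
set.seed(5)
x <- rnorm(100, mean = 1, sd = 2)
mu.hat     <- mean(x)
sigma2.hat <- mean((x - mu.hat)^2)      # SS/n, the MLE (not the n-1 version)
alpha.hat  <- mu.hat + sigma2.hat       # MLE of mu + sigma^2 by invariance
c(mu.hat, sigma2.hat, alpha.hat)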

6.2 MLE for Exponential Families

The invariance principle for MLEs allows us to work with the natural parameter $\eta$ (which is a 1-1 function of $\theta$).

1pef: $f(x \mid \theta) = c(\theta) h(x) \exp\{w(\theta) t(x)\}$. Natural parameter: $\eta = w(\theta)$. With a little abuse of notation (writing $f(x \mid \eta)$ for $f^*(x \mid \eta) = f(x \mid w^{-1}(\eta))$ and $c(\eta)$ for $c^*(\eta) = c(w^{-1}(\eta))$), we can write
$$f(x \mid \eta) = c(\eta) h(x) \exp\{\eta\, t(x)\}.$$
For clarity of notation, we will use $x = (x_1, \ldots, x_N)$ for the observed data and $X = (X_1, X_2, \ldots, X_N)$ for the random data. If $X_1, \ldots, X_N$ are iid from $f(x \mid \eta)$, then
$$l(\eta) = N \log c(\eta) + \sum_{i=1}^N \log h(x_i) + \eta \sum_{i=1}^N t(x_i).$$
Since by 3.32(a) $E\, t(X_i) = -\frac{\partial}{\partial \eta} \log c(\eta)$, we have
$$l'(\eta) = N \frac{\partial}{\partial \eta} \log c(\eta) + \sum_{i=1}^N t(x_i) = -E\left[\sum_{i=1}^N t(X_i)\right] + \sum_{i=1}^N t(x_i) = -E\, T(X) + T(x), \qquad (3)$$
where $T(X) = \sum_{i=1}^N t(X_i)$. Hence the condition for a stationary point is equivalent to
$$E_\eta\, T(X) = T(x).$$
Note that, using (3),
$$l''(\eta) = N \frac{\partial^2}{\partial \eta^2} \log c(\eta) = N\{-\mathrm{Var}_\eta\, t(X_i)\} < 0 \quad \text{for all } \eta.$$
Thus any interior stationary point (not on the boundary of the natural parameter space $\{w(\theta) : \theta \in \Theta\}$) is automatically a global maximum, so long as that space is convex. In one dimension, this means the natural parameter space must be an interval of some sort (possibly infinite). Ignoring this fine point, for a 1pef the log-likelihood will have a unique stationary point, which will be the MLE.

k-pef: $f(x \mid \theta) = c(\theta) h(x) \exp\{\sum_{j=1}^k w_j(\theta) t_j(x)\}$. Natural parameter: $\eta = (\eta_1, \ldots, \eta_k) = (w_1(\theta), \ldots, w_k(\theta))$, that is, $\eta_j = w_j(\theta)$, and
$$f(x \mid \eta) = c(\eta) h(x) \exp\left\{\sum_{j=1}^k \eta_j t_j(x)\right\}.$$

If $X_1, X_2, \ldots, X_N$ are iid from $f(x \mid \eta)$, then
$$l(\eta) = N \log c(\eta) + \sum_{i=1}^N \log h(x_i) + \sum_{j=1}^k \eta_j \left\{\sum_{i=1}^N t_j(x_i)\right\},$$
$$\frac{\partial l}{\partial \eta_j} = N \frac{\partial}{\partial \eta_j} \log c(\eta) + \sum_{i=1}^N t_j(x_i) = -E\left[\sum_{i=1}^N t_j(X_i)\right] + \sum_{i=1}^N t_j(x_i),$$
$$\frac{\partial^2 l}{\partial \eta_j \partial \eta_l} = N \frac{\partial^2}{\partial \eta_j \partial \eta_l} \log c(\eta) = N\big({-\mathrm{Cov}(t_j(X_i), t_l(X_i))}\big).$$
Thus the equations for a stationary point, $\partial l/\partial \eta_j = 0$, $j = 1, \ldots, k$, are equivalent to
$$E_\eta\, T_j(X) = T_j(x), \quad j = 1, \ldots, k, \qquad (4)$$
where $T_j(X) = \sum_{i=1}^N t_j(X_i)$ and $T_j(x) = \sum_{i=1}^N t_j(x_i)$; or, in vector notation, $E_\eta\, T(X) = T(x)$, where $T(X) = (T_1(X), \ldots, T_k(X))$ and $T(x) = (T_1(x), \ldots, T_k(x))$.

The Hessian matrix $H(\eta) = \left(\frac{\partial^2 l}{\partial \eta_i \partial \eta_j}\right)_{i,j=1}^k$ is given by
$$H(\eta) = -N\,\Sigma(\eta),$$
where $\Sigma(\eta)$ is the $k \times k$ covariance matrix of $(t_1(X_1), t_2(X_1), \ldots, t_k(X_1))$. A covariance matrix will be positive definite (except in degenerate cases), so that $H(\eta)$ will be negative definite for all $\eta$.

Conclusion: An interior stationary point (i.e., a solution of (4)) must be the unique global maximum, and hence the MLE. This result also holds in the original parameterization, with (4) restated as $E_\theta\, T_j(X) = T_j(x)$, $j = 1, \ldots, k$.

Connection with MOM: For a 1pef with $t(x) = x$, MOM and MLE agree. For a kpef with $t_j(x) = x^j$, MOM and MLE agree. Why? Because then (4) is equivalent to the equations for the MOM estimator.
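
For instance, for Poisson data (a 1pef with $t(x) = x$), condition (4) reads $N\lambda = \sum_i x_i$, so the MLE is the sample mean. A minimal R sketch (hypothetical data, assumed for illustration) confirms that solving the moment equation numerically in the natural parameter $\eta = \log\lambda$ reproduces $\bar x$.

## Minimal sketch: E_eta T(X) = T(x) for a Poisson 1pef.
set.seed(6)
x <- rpois(60, lambda = 4)
N <- length(x)
## stationary condition in the natural parameter eta = log(lambda):
## N * exp(eta) - sum(x) = 0
eta.hat <- uniroot(function(eta) N * exp(eta) - sum(x),
                   interval = c(-10, 10))$root
c(exp(eta.hat), mean(x))   # MLE of lambda equals the sample mean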

7 Revisiting the Gamma Example

The system of equations for the MLE of $(\alpha, \beta)$ may be derived directly from (4):
$$E\, T_1(X) = T_1(x), \qquad E\, T_2(X) = T_2(x),$$
which becomes
$$E \sum_{i=1}^n \log X_i = n\, E \log X_1 = n(\log\beta + \psi(\alpha)) = T_1(x),$$
$$E \sum_{i=1}^n X_i = n\, E X_1 = n\alpha\beta = T_2(x).$$
These are the same as the equations for a stationary point derived earlier. For $X \sim$ Gamma$(\alpha, \beta)$, we have used
$$E \log X = \int_0^\infty \log x\, \frac{x^{\alpha-1} e^{-x/\beta}}{\beta^\alpha \Gamma(\alpha)}\, dx
= \int_0^\infty (\log(x/\beta) + \log\beta)\, \frac{(x/\beta)^{\alpha-1} e^{-x/\beta}}{\Gamma(\alpha)}\, \frac{dx}{\beta}
= \int_0^\infty (\log z + \log\beta)\, \frac{z^{\alpha-1} e^{-z}}{\Gamma(\alpha)}\, dz$$
$$= \log\beta + \frac{1}{\Gamma(\alpha)} \int_0^\infty \underbrace{(z^{\alpha-1} \log z)}_{\frac{\partial}{\partial\alpha} z^{\alpha-1}}\, e^{-z}\, dz
= \log\beta + \frac{1}{\Gamma(\alpha)} \frac{\partial}{\partial\alpha} \int_0^\infty z^{\alpha-1} e^{-z}\, dz
= \log\beta + \frac{\Gamma'(\alpha)}{\Gamma(\alpha)} = \log\beta + \psi(\alpha).$$

Verifying the Stationary Point is a Global Maximum: The Gamma family is a 2pef (or a 1pef if $\alpha$ or $\beta$ is held fixed). Switching to the natural parameters $\eta_1 = \alpha - 1$, $\eta_2 = -1/\beta$ (or just making the substitution $\lambda = 1/\beta$) simplifies the second derivatives with respect to $\eta_2$ (or $\lambda$). The Hessian matrix is then negative definite for all $\eta = (\eta_1, \eta_2)$, which is a sufficient condition for the stationary point to be the global maximum.

7.1 MLEs for More General Exponential Families

Proposition 1. Suppose $X \sim P_\theta$, $\theta \in \Theta$, where $P_\theta$ has a joint pdf (or pmf) from an $n$-variate, $k$-parameter exponential family
$$f(x \mid \theta) = c(\theta) h(x) \exp\left\{\sum_{j=1}^k w_j(\theta) t_j(x)\right\} \quad \text{for } x \in \mathbb{R}^n,\ \theta \in \Theta \subset \mathbb{R}^k.$$
Then the MLE of $\theta$ based on the observed data $x$ is the solution (in $\theta$) of the system of equations
$$E_\theta\, T_j(X) = T_j(x), \quad j = 1, \ldots, k,$$
provided the solution (call it $\hat\theta$) satisfies $w(\hat\theta) \in$ interior of $\{w(\theta) : \theta \in \Theta\}$.

Proof. Essentially the same as for the ordinary kpef.

Example: Simple linear regression with known variance. $Y_1, Y_2, \ldots, Y_n$ are independent with
$$Y_i \sim N(\beta_0 + \beta_1 x_i,\ \sigma_0^2), \qquad \theta = (\beta_0, \beta_1).$$
The joint distribution of $\tilde Y = (Y_1, Y_2, \ldots, Y_n)$ forms an exponential family, with natural sufficient statistic $t(\tilde Y) = \left(\sum_i Y_i,\ \sum_i x_i Y_i\right)$. The condition $E_\theta\, t(\tilde Y) = t(\tilde y)$ has the form
$$E\left(\sum_i Y_i\right) = \sum_i y_i, \qquad E\left(\sum_i x_i Y_i\right) = \sum_i x_i y_i.$$
Thus the MLE $\hat\theta = (\hat\beta_0, \hat\beta_1)$ is the solution of
$$\sum_i (\beta_0 + \beta_1 x_i) = \sum_i y_i, \qquad \sum_i x_i(\beta_0 + \beta_1 x_i) = \sum_i x_i y_i.$$
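
These are just the usual normal equations. A brief R sketch (simulated covariates and responses, assumed for illustration) solves them directly and checks the answer against lm, which computes the same least-squares solution.

## Minimal sketch: solving the two normal equations for (beta0, beta1) directly.
set.seed(7)
n  <- 40
xc <- runif(n, 0, 10)
y  <- 1.5 + 0.8 * xc + rnorm(n, sd = 1)    # hypothetical data-generating values
A  <- rbind(c(n,       sum(xc)),
            c(sum(xc), sum(xc^2)))
b  <- c(sum(y), sum(xc * y))
solve(A, b)                   # (beta0.hat, beta1.hat) from the normal equations
coef(lm(y ~ xc))              # same answer from R's least-squares fit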

7.2 Sufficient statistics and MLEs

If $T = T(X)$ is a sufficient statistic for $\theta$, then there is an MLE which is a function of $T$. (If the MLE is unique, then we can say the MLE is a function of $T$.)

Proof. By the factorization theorem, $f(x \mid \theta) = g(T(x), \theta) h(x)$. Assume for convenience that the MLE is unique. Then the MLE is
$$\hat\theta(x) = \arg\max_\theta f(x \mid \theta) = \arg\max_\theta g(T(x), \theta),$$
which is clearly a function of $T(x)$.

MLE coincides with Least Squares: Consider independent normal random variables with constant variance $\sigma^2$ (known or unknown): $Y_1, Y_2, \ldots, Y_n$ independent with
$$Y_i \sim N(\beta_0 + \beta_1 x_i,\ \sigma^2), \qquad \theta = (\beta_0, \beta_1),$$
or, more generally,
$$Y_i \sim N(g(x_i, \beta),\ \sigma^2),$$
where $\beta$ is possibly a vector. Then
$$L(\beta, \sigma^2) = f(\tilde y \mid \theta) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^n \exp\left\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - g(x_i, \beta))^2\right\}.$$
For any fixed value of $\sigma^2$, maximizing $L(\beta, \sigma^2)$ with respect to $\beta$ is equivalent to minimizing $\sum_{i=1}^n (y_i - g(x_i, \beta))^2$ with respect to $\beta$. Hence the MLE and least squares give the same estimates of the $\beta$ parameters.

8 Limiting distribution of the MLE

A large-sample property is one that describes the limiting distribution. This (i) gives an exact characterization of the rate of convergence, and (ii) allows for the construction of asymptotically exact statistical procedures. Though it is possible to get non-normal limits, all standard problems that admit a limiting distribution have a normal limit. From previous experience, we know that MLEs typically have an asymptotic normality property.

Here is one version of such a theorem, similar to Theorem 9.14 in Keener (2010), with conditions given in C1-C4 below. Condition C3 is the most difficult to check, but it does hold for regular exponential families. We focus on the one-dimensional case, but the exact same theorem, with obvious modifications, holds for $d$-dimensional $\theta$.

C1. The support of $P_\theta$ does not depend on $\theta$.

C2. For each $x$ in the support, $f_x(\theta) := \log p_\theta(x)$ is three times differentiable with respect to $\theta$ in an interval $(\theta^* - \delta, \theta^* + \delta)$; moreover, $E_{\theta^*} f_X'(\theta^*)$ and $E_{\theta^*} f_X''(\theta^*)$ are finite, and there exists a function $M(x)$ such that
$$\sup_{\theta \in (\theta^* - \delta,\, \theta^* + \delta)} |f_x'''(\theta)| \le M(x), \qquad E_{\theta^*}[M(X)] < \infty. \qquad (5)$$

C3. Expectation with respect to $P_{\theta^*}$ and differentiation at $\theta^*$ can be interchanged, which implies that the score function has mean zero and that the Fisher information exists and can be evaluated using either of the two familiar formulas.

C4. The Fisher information at $\theta^*$ is positive.

Theorem 1. Suppose $X_1, X_2, \ldots, X_n$ are iid $P_{\theta^*}$, where $\theta^* \in \Theta \subset \mathbb{R}$. Assume C1-C4, and let $\hat\theta_n$ be a consistent sequence of solutions of (2). Then, for any interior point $\theta^*$,
$$n^{1/2}(\hat\theta_n - \theta^*) \to N(0,\ I(\theta^*)^{-1}), \quad \text{in distribution under } P_{\theta^*}.$$

Proof. Let $l_n(\theta) = n^{-1} \log L_n(\theta)$ be the scaled log-likelihood. Since $\theta^*$ is an interior point, there exists an open neighborhood $A$ of $\theta^*$ contained in $\Theta$. By consistency of $\hat\theta_n$, the event $\{\hat\theta_n \in A\}$ has $P_{\theta^*}$-probability converging to 1. Therefore it suffices (prove!) to consider the behavior of $\hat\theta_n$ only when it is in $A$, where the log-likelihood is well-behaved; in particular, $l_n'(\hat\theta_n) = 0$. Next, take a second-order Taylor approximation of $l_n'(\hat\theta_n)$ around $\theta^*$:
$$0 = l_n'(\theta^*) + l_n''(\theta^*)(\hat\theta_n - \theta^*) + \tfrac{1}{2} l_n'''(\tilde\theta_n)(\hat\theta_n - \theta^*)^2,$$
where $\tilde\theta_n$ is between $\hat\theta_n$ and $\theta^*$. After a bit of simple algebra, we get
$$n^{1/2}(\hat\theta_n - \theta^*) = \frac{n^{1/2}\, l_n'(\theta^*)}{-l_n''(\theta^*) - \tfrac{1}{2} l_n'''(\tilde\theta_n)(\hat\theta_n - \theta^*)}, \quad \text{for } \hat\theta_n \text{ near } \theta^*.$$
So it remains to show that the right-hand side above has the stated asymptotically normal distribution. Let us look at the numerator and denominator separately.

Numerator. The numerator can be written as
$$n^{1/2} l_n'(\theta^*) = n^{-1/2} \sum_{i=1}^n \frac{\partial}{\partial\theta} \log p_\theta(X_i) \Big|_{\theta = \theta^*}.$$

The summands are iid with mean zero and variance $I(\theta^*)$, by our assumptions about interchanging derivatives and integrals. Therefore the standard Central Limit Theorem says that $n^{1/2} l_n'(\theta^*) \to N(0, I(\theta^*))$ in distribution.

Denominator. The first term in the denominator, $-l_n''(\theta^*)$, converges in $P_{\theta^*}$-probability to $I(\theta^*)$ by the usual law of large numbers. It remains to show that the second term in the denominator is negligible. For this, note that by (5),
$$|l_n'''(\tilde\theta_n)| \le \frac{1}{n} \sum_{i=1}^n M(X_i), \quad \text{for } \hat\theta_n \text{ close to } \theta^*.$$
The ordinary law of large numbers, again, says that the upper bound converges to $E_{\theta^*}[M(X_1)]$, which is finite. Consequently, $l_n'''(\tilde\theta_n)$ is bounded in probability, and since $\hat\theta_n - \theta^* \to 0$ in $P_{\theta^*}$-probability by assumption, we may conclude that
$$(\hat\theta_n - \theta^*)\, l_n'''(\tilde\theta_n) \to 0, \quad \text{in } P_{\theta^*}\text{-probability}.$$
It then follows from Slutsky's Theorem that
$$\frac{n^{1/2}\, l_n'(\theta^*)}{-l_n''(\theta^*) - \tfrac{1}{2}(\hat\theta_n - \theta^*)\, l_n'''(\tilde\theta_n)} \to \frac{N(0, I(\theta^*))}{I(\theta^*)} = N(0, I(\theta^*)^{-1}),$$
in distribution, which is the desired result.

The take-away message here is that, under certain conditions, if $n$ is large then the MLE $\hat\theta$ has a sampling distribution close to $N(\theta, [nI(\theta)]^{-1})$ under $P_\theta$. To apply this result, e.g., to construct an asymptotically approximate confidence interval, one needs to replace $I(\theta)$ with a quantity that does not depend on the unknown parameter. Standard choices are the expected Fisher information $I(\hat\theta)$ and the observed Fisher information $-l_n''(\hat\theta)$; the latter is often preferred, for it has some desirable conditioning properties.

With asymptotic normality of the MLE, it is possible to derive the asymptotic distribution of any smooth function of the MLE. This is the well-known delta theorem. The delta theorem also offers an alternative to the plug-in rules discussed above, called variance stabilizing transformations, for eliminating $\theta$ from the variance in the asymptotic normal approximation.

It is possible to drop the requirement that the likelihood be three times differentiable if one assumes that the second derivative exists and has a certain Lipschitz property: $\log p_\theta(x)$ is twice differentiable at $\theta$, and there exists a function $g_r(x, \theta)$ such that, for each interior point $\theta^*$,
$$\sup_{\theta : |\theta - \theta^*| \le r} \left| \frac{\partial^2}{\partial\theta^2} \log p_\theta(x) - \frac{\partial^2}{\partial\theta^2} \log p_\theta(x)\Big|_{\theta = \theta^*} \right| \le g_r(x, \theta^*),$$
with $\lim_{r \to 0} E_\theta\{g_r(X, \theta)\} = 0$ for each $\theta$. With this assumption, the same asymptotic normality result holds. Interestingly, it is possible to get asymptotic normality under a much weaker condition, namely differentiability in quadratic mean, which assumes less than differentiability of $\theta \mapsto p_\theta(x)$, but the details are a bit more technical.
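
A small simulation (exponential model with an assumed rate, sample size, and replication count, not part of the notes) illustrates the take-away message: for the Exponential$(\lambda)$ model the MLE is $\hat\lambda = 1/\bar X$ and $I(\lambda) = 1/\lambda^2$, so $\sqrt{n}(\hat\lambda_n - \lambda)$ should be approximately $N(0, \lambda^2)$.

## Minimal sketch: Monte Carlo check of asymptotic normality of the MLE.
set.seed(8)
lambda <- 2; n <- 500
z <- replicate(4000, {
  x <- rexp(n, rate = lambda)
  sqrt(n) * (1/mean(x) - lambda)     # sqrt(n)(lambda.hat - lambda)
})
c(sd = sd(z), theory = lambda)       # empirical sd should be close to lambda = I(lambda)^(-1/2)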

9 Likelihood ratio tests

For two competing hypotheses $H_0$ and $H_1$ about the parameter $\theta$, the likelihood ratio is often used to make a comparison. For example, for $H_0 : \theta = \theta_0$ versus $H_1 : \theta = \theta_1$, the likelihood ratio is $L(\theta_0)/L(\theta_1)$, and large (resp. small) values of this ratio indicate that the data $x$ favor $H_0$ (resp. $H_1$). A more difficult and somewhat more general problem is $H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \notin \Theta_0$, where $\Theta_0$ is a subset of $\Theta$. In this case, one can define the likelihood ratio as
$$T_n = T_n(X, \Theta_0) = \frac{\sup_{\theta \in \Theta_0} L(\theta)}{\sup_{\theta \in \Theta} L(\theta)}.$$
The interpretation of this likelihood ratio is the same as before, i.e., if the ratio is small, then the data lend little evidence to the null hypothesis. For practical purposes, we need to know what it means for the ratio to be "small"; this means we need the null distribution of $T_n$, i.e., the distribution of $T_n$ under $P_\theta$ when $\theta \in \Theta_0$.

For $\Theta \subset \mathbb{R}^d$, consider the testing problem $H_0 : \theta \in \Theta_0$ versus $H_1 : \theta \notin \Theta_0$, where $\Theta_0$ is the subset of $\Theta$ that specifies the values $\theta_{01}, \ldots, \theta_{0m}$ of $\theta_1, \ldots, \theta_m$, with $m$ a fixed integer between 1 and $d$. The following result, known as Wilks's Theorem, gives conditions under which the null distribution of $W_n = -2 \log T_n$ is asymptotically of a convenient form.

Theorem 2. Suppose the conditions of Theorem 1 hold. Under the setup described in the previous paragraph, $W_n \to \mathrm{ChiSq}(m)$ in distribution, under $P_\theta$ with $\theta \in \Theta_0$.

Proof. We focus here only on the case $d = m = 1$. That is, $\Theta_0 = \{\theta_0\}$ is a singleton, and we want to know the limiting distribution of $W_n$ under $P_{\theta_0}$. Clearly,
$$W_n = -2 l_n(\theta_0) + 2 l_n(\hat\theta_n),$$
where $\hat\theta_n$ is the MLE and $l_n$ is the log-likelihood. By the assumed smoothness of the log-likelihood, do a two-term Taylor approximation of $l_n(\theta_0)$ around $\hat\theta_n$:
$$l_n(\theta_0) = l_n(\hat\theta_n) + l_n'(\hat\theta_n)(\theta_0 - \hat\theta_n) + \frac{l_n''(\tilde\theta_n)}{2}(\theta_0 - \hat\theta_n)^2,$$
where $\tilde\theta_n$ is between $\theta_0$ and $\hat\theta_n$. Since $l_n'(\hat\theta_n) = 0$, we get
$$W_n = -l_n''(\tilde\theta_n)(\theta_0 - \hat\theta_n)^2 = \frac{-l_n''(\tilde\theta_n)}{n}\, \{n^{1/2}(\hat\theta_n - \theta_0)\}^2.$$

From Theorem 1 we have that $\sqrt{n}(\hat\theta_n - \theta_0) \to N(0, I(\theta_0)^{-1})$ in distribution as $n \to \infty$. Also, $l_n''(\tilde\theta_n) = l_n''(\theta_0) + \{l_n''(\tilde\theta_n) - l_n''(\theta_0)\}$, and
$$\frac{1}{n}\left| l_n''(\tilde\theta_n) - l_n''(\theta_0) \right| \le \frac{1}{n}\sum_{i=1}^n \left| \frac{\partial^2}{\partial\theta^2}\log p_\theta(X_i)\Big|_{\theta = \tilde\theta_n} - \frac{\partial^2}{\partial\theta^2}\log p_\theta(X_i)\Big|_{\theta = \theta_0} \right|.$$
Using Condition C2, the upper bound is itself bounded by $n^{-1}\sum_i M(X_i)\,|\tilde\theta_n - \theta_0|$, which goes to zero in probability under $P_{\theta_0}$ since $\tilde\theta_n$ is consistent. Therefore $-l_n''(\tilde\theta_n)/n$ has the same limiting behavior as $-l_n''(\theta_0)/n$, namely it converges in $P_{\theta_0}$-probability to $I(\theta_0)$. Finally, by Slutsky, we get
$$W_n \to I(\theta_0)\,\{N(0, I(\theta_0)^{-1})\}^2 \sim N(0, 1)^2 \sim \mathrm{ChiSq}(1).$$

Wilks's theorem facilitates the construction of an approximate size-$\alpha$ test of $H_0$ when $n$ is large, i.e., reject $H_0$ iff $W_n$ exceeds $\chi^2_{m, 1-\alpha}$, the $100(1-\alpha)$ percentile of the ChiSq$(m)$ distribution. The advantage of Wilks's theorem appears in cases where the exact sampling distribution of $W_n$ is intractable, so that an exact (analytical) size-$\alpha$ test is not available. Monte Carlo can often be used to find a test, but Wilks's theorem gives a good answer and only requires use of a simple chi-square table.

One can also use the Wilks theorem result to obtain approximate confidence regions. Let $W_n(\theta_0) = -2 \log T_n(X; \theta_0)$, where $\theta_0$ is a fixed generic value of the full $d$-dimensional parameter $\theta$, i.e., $H_0 : \theta = \theta_0$. Then an approximate $100(1-\alpha)\%$ confidence region for $\theta$ is $\{\theta_0 : W_n(\theta_0) \le \chi^2_{m, 1-\alpha}\}$.

An interesting and often overlooked aspect of Wilks's theorem is that the asymptotic null distribution does not depend on the true values of those parameters unspecified under the null. For example, in a gamma distribution problem with the goal of testing whether the shape is equal to some specified value, the null distribution of $W_n$ does not depend on the true value of the scale.
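
A quick Monte Carlo sanity check of Wilks's theorem in the simplest case (Exponential rate, testing $H_0 : \lambda = \lambda_0$ with $m = 1$; all simulation settings are assumptions) compares the simulated null distribution of $W_n$ with ChiSq(1).

## Minimal sketch: simulate W_n under H0 for an Exponential(lambda) model and
## compare with the ChiSq(1) limit from Wilks's theorem.
set.seed(9)
lambda0 <- 1.5; n <- 200
loglik <- function(lam, x) length(x) * log(lam) - lam * sum(x)
W <- replicate(4000, {
  x <- rexp(n, rate = lambda0)          # data generated under the null
  lam.hat <- 1/mean(x)                  # MLE
  2 * (loglik(lam.hat, x) - loglik(lambda0, x))
})
c(mean(W), var(W))                       # should be close to 1 and 2 (ChiSq(1) moments)
quantile(W, 0.95); qchisq(0.95, df = 1)  # simulated vs theoretical 95th percentile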

9.1 Cautions concerning the first-order theory

One might be tempted to conclude that the desirable properties of the likelihood-based methods presented in the previous section are universal, i.e., that maximum likelihood estimators will always work. Moreover, based on the form of the asymptotic variance of the MLE and its similarity to the Cramer-Rao lower bound in Chapter 2, it is tempting to conclude that the MLE is asymptotically efficient. However, both of these conclusions are technically false in general. Indeed, there are examples where

1. the MLE is not unique or even does not exist;
2. the MLE works (in the sense of consistency), but the conditions of the theory are not met, so asymptotic normality fails; and
3. the MLE is not even consistent!

Non-uniqueness or non-existence of the MLE are roadblocks to practical implementation but, for some reason, are not viewed as much of a concern from a theoretical point of view. The case where the MLE works but is not asymptotically normal is also not really a problem, provided that one recognizes the non-regular nature of the problem and makes the necessary adjustments. The most concerning of these points is inconsistency of the MLE. Since consistency is a rather weak property, inconsistency of the MLE means that its performance is poor and it can give very misleading results. The most famous example of inconsistency of the MLE, due to Neyman and Scott (1948), is given next.

Example (Neyman and Scott 1948). Let $X_{ij}$ be independent normal random variables, $X_{ij} \sim N(\mu_i, \sigma^2)$, $i = 1, \ldots, n$ and $j = 1, 2$; the case of two $j$ levels is the simplest, but the result holds for any fixed number of levels. The point here is that $X_{i1}$ and $X_{i2}$ have the same mean $\mu_i$, but there are possibly $n$ different means. The full parameter is $\theta = (\mu_1, \ldots, \mu_n, \sigma^2)$, which is of dimension $n + 1$. It is easy to check that the MLEs are given by
$$\hat\mu_i = \tfrac{1}{2}(X_{i1} + X_{i2}), \quad i = 1, \ldots, n, \qquad
\hat\sigma^2 = \frac{1}{4n} \sum_{i=1}^n (X_{i1} - X_{i2})^2.$$
It is easy to see that, as $n \to \infty$, $\hat\sigma^2 \to \sigma^2/2$ in probability, so the MLE of $\sigma^2$ is inconsistent! The issue causing the inconsistency is that the dimension of the nuisance parameter, the means $\mu_1, \ldots, \mu_n$, is increasing with $n$. In general, when the dimension of the parameter depends on $n$, consistency of the MLE will be a concern, so one should be careful.
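
A short simulation (the values of $\sigma$, the $\mu_i$, and $n$ are assumptions for illustration) makes the inconsistency visible: as $n$ grows, $\hat\sigma^2$ settles near $\sigma^2/2$ rather than $\sigma^2$.

## Minimal sketch of the Neyman-Scott inconsistency.
set.seed(10)
sigma <- 2
for (n in c(100, 1000, 10000)) {
  mu <- rnorm(n, 0, 5)                       # arbitrary individual means
  x1 <- rnorm(n, mu, sigma); x2 <- rnorm(n, mu, sigma)
  sigma2.hat <- sum((x1 - x2)^2) / (4 * n)   # the MLE of sigma^2
  cat(n, sigma2.hat, "\n")                   # approaches sigma^2/2 = 2, not sigma^2 = 4
}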


More information

An exponential family of distributions is a parametric statistical model having densities with respect to some positive measure λ of the form.

An exponential family of distributions is a parametric statistical model having densities with respect to some positive measure λ of the form. Stat 8112 Lecture Notes Asymptotics of Exponential Families Charles J. Geyer January 23, 2013 1 Exponential Families An exponential family of distributions is a parametric statistical model having densities

More information

Sampling distribution of GLM regression coefficients

Sampling distribution of GLM regression coefficients Sampling distribution of GLM regression coefficients Patrick Breheny February 5 Patrick Breheny BST 760: Advanced Regression 1/20 Introduction So far, we ve discussed the basic properties of the score,

More information

Lecture 4 September 15

Lecture 4 September 15 IFT 6269: Probabilistic Graphical Models Fall 2017 Lecture 4 September 15 Lecturer: Simon Lacoste-Julien Scribe: Philippe Brouillard & Tristan Deleu 4.1 Maximum Likelihood principle Given a parametric

More information

6 Classic Theory of Point Estimation

6 Classic Theory of Point Estimation 6 Classic Theory of Point Estimation Point estimation is usually a starting point for more elaborate inference, such as construction of confidence intervals. Centering a confidence interval at a point

More information

Chapter 7. Hypothesis Testing

Chapter 7. Hypothesis Testing Chapter 7. Hypothesis Testing Joonpyo Kim June 24, 2017 Joonpyo Kim Ch7 June 24, 2017 1 / 63 Basic Concepts of Testing Suppose that our interest centers on a random variable X which has density function

More information

1. Fisher Information

1. Fisher Information 1. Fisher Information Let f(x θ) be a density function with the property that log f(x θ) is differentiable in θ throughout the open p-dimensional parameter set Θ R p ; then the score statistic (or score

More information

Information in Data. Sufficiency, Ancillarity, Minimality, and Completeness

Information in Data. Sufficiency, Ancillarity, Minimality, and Completeness Information in Data Sufficiency, Ancillarity, Minimality, and Completeness Important properties of statistics that determine the usefulness of those statistics in statistical inference. These general properties

More information

Hypothesis Testing: The Generalized Likelihood Ratio Test

Hypothesis Testing: The Generalized Likelihood Ratio Test Hypothesis Testing: The Generalized Likelihood Ratio Test Consider testing the hypotheses H 0 : θ Θ 0 H 1 : θ Θ \ Θ 0 Definition: The Generalized Likelihood Ratio (GLR Let L(θ be a likelihood for a random

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation University of Pavia Maximum Likelihood Estimation Eduardo Rossi Likelihood function Choosing parameter values that make what one has observed more likely to occur than any other parameter values do. Assumption(Distribution)

More information

Hypothesis Test. The opposite of the null hypothesis, called an alternative hypothesis, becomes

Hypothesis Test. The opposite of the null hypothesis, called an alternative hypothesis, becomes Neyman-Pearson paradigm. Suppose that a researcher is interested in whether the new drug works. The process of determining whether the outcome of the experiment points to yes or no is called hypothesis

More information

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided Let us first identify some classes of hypotheses. simple versus simple H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided H 0 : θ θ 0 versus H 1 : θ > θ 0. (2) two-sided; null on extremes H 0 : θ θ 1 or

More information

Mathematical statistics

Mathematical statistics October 4 th, 2018 Lecture 12: Information Where are we? Week 1 Week 2 Week 4 Week 7 Week 10 Week 14 Probability reviews Chapter 6: Statistics and Sampling Distributions Chapter 7: Point Estimation Chapter

More information

Math 494: Mathematical Statistics

Math 494: Mathematical Statistics Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/

More information

Chapter 4: Asymptotic Properties of the MLE

Chapter 4: Asymptotic Properties of the MLE Chapter 4: Asymptotic Properties of the MLE Daniel O. Scharfstein 09/19/13 1 / 1 Maximum Likelihood Maximum likelihood is the most powerful tool for estimation. In this part of the course, we will consider

More information

Miscellaneous Errors in the Chapter 6 Solutions

Miscellaneous Errors in the Chapter 6 Solutions Miscellaneous Errors in the Chapter 6 Solutions 3.30(b In this problem, early printings of the second edition use the beta(a, b distribution, but later versions use the Poisson(λ distribution. If your

More information

Lecture 3 January 16

Lecture 3 January 16 Stats 3b: Theory of Statistics Winter 28 Lecture 3 January 6 Lecturer: Yu Bai/John Duchi Scribe: Shuangning Li, Theodor Misiakiewicz Warning: these notes may contain factual errors Reading: VDV Chater

More information

Introduction to Statistical Inference

Introduction to Statistical Inference Introduction to Statistical Inference Ping Yu Department of Economics University of Hong Kong Ping Yu (HKU) Statistics 1 / 30 1 Point Estimation 2 Hypothesis Testing Ping Yu (HKU) Statistics 2 / 30 The

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Maximum Likelihood Estimation Guy Lebanon February 19, 2011 Maximum likelihood estimation is the most popular general purpose method for obtaining estimating a distribution from a finite sample. It was

More information

Some General Types of Tests

Some General Types of Tests Some General Types of Tests We may not be able to find a UMP or UMPU test in a given situation. In that case, we may use test of some general class of tests that often have good asymptotic properties.

More information

Theory of Maximum Likelihood Estimation. Konstantin Kashin

Theory of Maximum Likelihood Estimation. Konstantin Kashin Gov 2001 Section 5: Theory of Maximum Likelihood Estimation Konstantin Kashin February 28, 2013 Outline Introduction Likelihood Examples of MLE Variance of MLE Asymptotic Properties What is Statistical

More information

Statistical Inference

Statistical Inference Statistical Inference Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA. Asymptotic Inference in Exponential Families Let X j be a sequence of independent,

More information

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems Principles of Statistical Inference Recap of statistical models Statistical inference (frequentist) Parametric vs. semiparametric

More information

Outline of GLMs. Definitions

Outline of GLMs. Definitions Outline of GLMs Definitions This is a short outline of GLM details, adapted from the book Nonparametric Regression and Generalized Linear Models, by Green and Silverman. The responses Y i have density

More information

Testing Restrictions and Comparing Models

Testing Restrictions and Comparing Models Econ. 513, Time Series Econometrics Fall 00 Chris Sims Testing Restrictions and Comparing Models 1. THE PROBLEM We consider here the problem of comparing two parametric models for the data X, defined by

More information

Hypothesis Testing. A rule for making the required choice can be described in two ways: called the rejection or critical region of the test.

Hypothesis Testing. A rule for making the required choice can be described in two ways: called the rejection or critical region of the test. Hypothesis Testing Hypothesis testing is a statistical problem where you must choose, on the basis of data X, between two alternatives. We formalize this as the problem of choosing between two hypotheses:

More information

BTRY 4090: Spring 2009 Theory of Statistics

BTRY 4090: Spring 2009 Theory of Statistics BTRY 4090: Spring 2009 Theory of Statistics Guozhang Wang September 25, 2010 1 Review of Probability We begin with a real example of using probability to solve computationally intensive (or infeasible)

More information

Statement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:.

Statement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:. MATHEMATICAL STATISTICS Homework assignment Instructions Please turn in the homework with this cover page. You do not need to edit the solutions. Just make sure the handwriting is legible. You may discuss

More information

3.1 General Principles of Estimation.

3.1 General Principles of Estimation. 154 Chapter 3 Basic Theory of Point Estimation. Suppose X is a random observable taking values in a measurable space (Ξ, G) and let P = {P θ : θ Θ} denote the family of possible distributions of X. An

More information

Statistics. Lecture 4 August 9, 2000 Frank Porter Caltech. 1. The Fundamentals; Point Estimation. 2. Maximum Likelihood, Least Squares and All That

Statistics. Lecture 4 August 9, 2000 Frank Porter Caltech. 1. The Fundamentals; Point Estimation. 2. Maximum Likelihood, Least Squares and All That Statistics Lecture 4 August 9, 2000 Frank Porter Caltech The plan for these lectures: 1. The Fundamentals; Point Estimation 2. Maximum Likelihood, Least Squares and All That 3. What is a Confidence Interval?

More information

Practice Problems Section Problems

Practice Problems Section Problems Practice Problems Section 4-4-3 4-4 4-5 4-6 4-7 4-8 4-10 Supplemental Problems 4-1 to 4-9 4-13, 14, 15, 17, 19, 0 4-3, 34, 36, 38 4-47, 49, 5, 54, 55 4-59, 60, 63 4-66, 68, 69, 70, 74 4-79, 81, 84 4-85,

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information