Information in Data: Sufficiency, Ancillarity, Minimality, and Completeness


Important properties of statistics determine the usefulness of those statistics in statistical inference. These general properties can often be used as guides in seeking optimal statistical procedures.

Approaches to Statistical Inference:
- use of the empirical cumulative distribution function (ECDF), for example, the method of moments
- use of a likelihood function, for example, maximum likelihood
- fitting expected values, for example, least squares
- fitting a probability distribution, for example, maximum entropy
- definition and use of a loss function, for example, uniform minimum variance unbiased estimation

The Likelihood Function The domain of the likelihood function is some class of distributions specified by their probability densities, P = {p_i(x)}, where all PDFs are with respect to a common σ-finite measure. In applications, the PDFs often have a common parametric form, so equivalently we can think of the domain of the likelihood function as being a parameter space, say Θ, and the family of densities can be written as P = {p_θ(x)}, where θ ∈ Θ, the known parameter space. For a sample x = (x_1, ..., x_n) we often write the likelihood function as L(θ ; x) = ∏_{i=1}^n p(x_i ; θ). Although we have written L(θ ; x), the expression L(p_θ ; x) may be more appropriate because it reminds us of an essential ingredient in the likelihood, namely a PDF.
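A quick numerical sketch (not part of the slides): the likelihood is the joint density of the fixed observations, viewed as a function of the parameter. The Bernoulli family and the particular sample below are illustrative assumptions.

```python
# Likelihood L(pi; x) = prod_i pi^{x_i} (1 - pi)^{1 - x_i} for a
# Bernoulli(pi) sample; the data x are held fixed, pi varies.
def bernoulli_likelihood(pi, x):
    L = 1.0
    for xi in x:
        L *= pi**xi * (1.0 - pi)**(1 - xi)
    return L

x = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]      # 7 ones out of 10 (toy sample)
# The likelihood is larger near the sample proportion 0.7 than far from it.
L_at_07 = bernoulli_likelihood(0.7, x)
L_at_02 = bernoulli_likelihood(0.2, x)
```

Evaluating the same function at several parameter values, with the data fixed, is exactly the shift of viewpoint the slide describes.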

[Figure: the PDF p_X(x ; θ) plotted as a function of x for fixed θ (θ = 1, θ = 5), and the likelihood L(θ ; x) plotted as a function of θ for fixed x (x = 1, x = 5).]

What Likelihood Is Not A likelihood is neither a probability nor a probability density. Notice, for example, that the definite integral over IR_+ of the likelihood of a single observation x from an exponential distribution is 1/x²; that is, the likelihood does not integrate to 1. It is not appropriate to refer to the likelihood of an observation. We use the term likelihood in the sense of the likelihood of a model or the likelihood of a distribution given observations.
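A numerical check of the claim above, under the assumption that the slide means one observation x from the exponential density θe^{−θx}: integrating the likelihood over θ gives 1/x², not 1.

```python
import math

# L(theta; x) = theta * exp(-theta * x) for one exponential observation.
# Integrating over theta in (0, inf) yields 1/x^2, so the likelihood is
# not a probability density in the parameter.
x = 2.0

def L(theta):
    return theta * math.exp(-theta * x)

# Crude Riemann/trapezoid quadrature on (0, 50]; the integrand is
# negligible beyond that.
h = 1e-4
integral = h * sum(L(i * h) for i in range(1, int(50 / h) + 1))

target = 1.0 / x**2   # = 0.25 for x = 2
```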

Likelihood Principle The likelihood principle in statistical inference asserts that all of the information which the data provide concerning the relative merits of two hypotheses (two possible distributions that give rise to the data) is contained in the likelihood ratio of those hypotheses given the data. An alternative statement of the likelihood principle is that if, for observations x and y, L(θ ; x) = c(x, y) L(θ ; y) for all θ, where c(x, y) is constant for given x and y, then any inference about θ based on x should be in agreement with any inference about θ based on y.

Example 1 The likelihood principle in sampling in a Bernoulli distribution Consider the problem of making inferences on the parameter π in a family of Bernoulli distributions. One approach is to take a random sample of size n, X_1, ..., X_n, from the Bernoulli(π), and then use T = Σ X_i, which has a binomial distribution with parameters n and π. Another approach is to take a sequential sample, X_1, X_2, ..., until a fixed number t of 1's have occurred. The size of the sample, N, is then random and has the negative binomial distribution with parameters t and π.

Now, suppose we take the first approach with n = 10 and we observe t = 5; and then we take the second approach with t = 5 and we observe n = 10. We get the likelihoods L_B(π) = (10 choose 5) π^5 (1 − π)^5 and L_NB(π) = (9 choose 4) π^5 (1 − π)^5. Because L_B(π)/L_NB(π) does not involve π, in accordance with the likelihood principle, any decision about π based on a binomial observation of 5 out of 10 should be the same as any decision about π based on a negative binomial observation of 10 for 5 1's. This seems reasonable. We see that the likelihood principle allows the likelihood function to be defined as any member of an equivalence class {cL : c ∈ IR_+}.
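The two likelihoods on this slide can be compared directly; the ratio (10 choose 5)/(9 choose 4) = 2 is free of π:

```python
from math import comb

# Binomial: 5 successes in fixed n = 10.
def L_binom(pi):
    return comb(10, 5) * pi**5 * (1 - pi)**5

# Negative binomial: sample until t = 5 successes, observing n = 10.
def L_negbinom(pi):
    return comb(9, 4) * pi**5 * (1 - pi)**5

# The ratio is the same at every pi: the two designs carry the same
# likelihood information about pi.
ratios = [L_binom(p) / L_negbinom(p) for p in (0.1, 0.3, 0.5, 0.9)]
```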

MLE Let L(θ ; x) be the likelihood of θ ∈ Θ for the observations x from a distribution with PDF with respect to a σ-finite measure ν. A maximum likelihood estimate, or MLE, of θ, written θ̂, is defined as θ̂ = arg max_{θ ∈ Θ} L(θ ; x), if it exists. There may be more than one solution; any one of them is an MLE.

Example 2 MLE in a linear model Let Y = Xβ + E, where Y and E are n-vectors with E(E) = 0 and V(E) = σ²I_n, X is an n × p matrix whose rows are the x_iᵀ, and β is the p-vector parameter. A least squares estimator of β is b_* = arg min_{b ∈ B} ||Y − Xb||² = (XᵀX)⁻ Xᵀ Y. Even if X is not of full rank, in which case the least squares estimator is not unique, we found that the least squares estimator has certain optimal properties for estimable functions. Of course, at this point we could not use maximum likelihood; we do not have a distribution.

Then we considered the additional assumption in the model that E ~ N_n(0, σ²I_n), or equivalently Y ~ N_n(Xβ, σ²I_n). In that case, we found that the least squares estimator yielded the unique UMVUE for any estimable function of β and for σ². We could define a least squares estimator without an assumption on the distribution of Y or E, but for an MLE we need an assumption on the distribution. Again, let us assume Y ~ N_n(Xβ, σ²I_n), yielding, for an observed y, the log-likelihood l_L(β, σ² ; y, X) = −(n/2) log(2πσ²) − (1/(2σ²)) (y − Xβ)ᵀ(y − Xβ). So an MLE of β is the same as a least squares estimator of β.

Estimation of σ², however, is different. In the case of least squares estimation, with no specific assumptions about the distribution of E in the model, we have no basis for forming an objective function of squares to minimize. Even with an assumption of normality, instead of explicitly forming a least squares problem for estimating σ², we used the least squares estimator of β, which is complete and sufficient, and we used the distribution of (Y − Xb_*)ᵀ(Y − Xb_*) to form the UMVUE of σ², s² = (Y − Xb_*)ᵀ(Y − Xb_*)/(n − r), where r = rank(X).

In the case of maximum likelihood, we directly determine the value of σ² that maximizes the expression. This is an easy optimization problem. The solution is σ̂² = (y − Xβ̂)ᵀ(y − Xβ̂)/n, where β̂ = b_* is an MLE of β. Compare the MLE of σ² with the least squares estimator, and note that the MLE is biased. Recall that we have encountered these two estimators in simpler cases.
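A small sketch of the divisor-n versus divisor-(n − r) contrast, using a toy simple linear regression (r = rank(X) = 2; the data are made up, not from the slides):

```python
# Fit y = b0 + b1 x by least squares, then compare the two scale estimators.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
     sum((x - xbar)**2 for x in xs)
b0 = ybar - b1 * xbar

rss = sum((y - (b0 + b1 * x))**2 for x, y in zip(xs, ys))
sigma2_mle = rss / n         # MLE of sigma^2: biased downward
s2_umvue = rss / (n - 2)     # UMVUE s^2, divisor n - r with r = 2
```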

Example 3 one-way random-effects AOV model Consider the linear model Y_ij = µ + δ_i + ɛ_ij, i = 1, ..., m, j = 1, ..., n, where the δ_i are identically distributed with E(δ_i) = 0, V(δ_i) = σ²_δ, and Cov(δ_i, δ_ĩ) = 0 for i ≠ ĩ, and the ɛ_ij are independent of the δ_i and are identically distributed with E(ɛ_ij) = 0, V(ɛ_ij) = σ²_ɛ, and Cov(ɛ_ij, ɛ_ĩj̃) = 0 for either i ≠ ĩ or j ≠ j̃. A model such as this may be appropriate when there are a large number of possible treatments and m of them are chosen randomly and applied to experimental units whose responses Y_ij are observed. While in the fixed-effects model we are interested in whether α_1 = ··· = α_m = 0, in the random-effects model we are interested in whether σ²_δ = 0, which would lead to a similar practical decision about the treatments.

In the random-effects model the variance of each Y_ij is σ²_δ + σ²_ɛ, and our interest in using the model is to make inference on the relative sizes of the components of the variance, σ²_δ and σ²_ɛ. The model is sometimes called a variance components model. To estimate the variance components, we first observe that
Cov(Y_ij, Y_ĩj̃) = σ²_δ + σ²_ɛ for i = ĩ, j = j̃,
Cov(Y_ij, Y_ĩj̃) = σ²_δ for i = ĩ, j ≠ j̃,
Cov(Y_ij, Y_ĩj̃) = 0 for i ≠ ĩ.

To continue with the analysis, we follow the same steps as in the fixed-effects case, and get the same decomposition of the adjusted total sum of squares: Σ_{i=1}^m Σ_{j=1}^n (Y_ij − Ȳ)² = SSA + SSE. Now, to make inferences about σ²_δ, we will add to our distribution assumptions, which at this point involve only means, variances, covariances, and independence of the δ_i and ɛ_ij. Let us suppose now that δ_i iid N(0, σ²_δ) and ɛ_ij iid N(0, σ²_ɛ). Again, we get chi-squared distributions, but the distribution involving SSA is not the same as for the fixed-effects model.

Forming MSA = SSA/(m − 1) and MSE = SSE/(m(n − 1)), we see that E(MSA) = nσ²_δ + σ²_ɛ and E(MSE) = σ²_ɛ. Unbiased estimators of σ²_δ and σ²_ɛ are therefore s²_δ = (MSA − MSE)/n and s²_ɛ = MSE, and we can also see that these are UMVUEs.
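A sketch of these ANOVA-type estimators on a small made-up balanced data set (m = 3 groups, n = 3 observations each; the numbers are illustrative only):

```python
# Balanced one-way layout: rows are groups.
y = [[9.3, 8.8, 9.1],
     [10.2, 10.5, 9.9],
     [9.0, 9.4, 8.9]]
m, n = len(y), len(y[0])

grand = sum(sum(row) for row in y) / (m * n)
group_means = [sum(row) / n for row in y]

# Decomposition of the adjusted total sum of squares: SSA + SSE.
ssa = n * sum((gm - grand)**2 for gm in group_means)
sse = sum((yij - gm)**2 for row, gm in zip(y, group_means) for yij in row)

msa = ssa / (m - 1)
mse = sse / (m * (n - 1))

s2_eps = mse                  # unbiased estimator of sigma^2_eps
s2_del = (msa - mse) / n      # unbiased estimator of sigma^2_delta
```

Note that s²_δ as computed here can come out negative for some samples, which the next slide takes up.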

At this point, we note something that might at first glance be surprising: s²_δ may be negative. This occurs if (m − 1)MSA/m < MSE. (This will be the case if the variation among the Y_ij for fixed i is relatively large compared to the variation among the Ȳ_i.) Now, we consider the MLEs of σ²_δ and σ²_ɛ. In the case of the model with the assumption of normality and independence, it is relatively easy to write the log-likelihood,
l_L(µ, σ²_δ, σ²_ɛ ; y) = −(1/2)( mn log(2π) + m(n − 1) log(σ²_ɛ) + m log(σ²_ɛ + nσ²_δ) + SSE/σ²_ɛ + SSA/(σ²_ɛ + nσ²_δ) + mn(ȳ − µ)²/(σ²_ɛ + nσ²_δ) ).

The MLEs must be in the closure of the parameter space, which for σ²_δ and σ²_ɛ is IR_+ ∪ {0}. From this we have the MLEs:
if (m − 1)MSA/m ≥ MSE,
σ̂²_δ = ( (m − 1)MSA/m − MSE ) / n and σ̂²_ɛ = MSE;
if (m − 1)MSA/m < MSE,
σ̂²_δ = 0 and σ̂²_ɛ = (m − 1)MSE/m.
This was easy because the model was balanced. If, instead, we had j = 1, ..., n_i, it would be much harder.

Any approach to estimation may occasionally yield very poor or meaningless estimators. In addition to the possibly negative UMVUEs for variance components, the UMVUE of g(θ) = e^{−3θ} in a Poisson distribution is not a very good estimator. While in some cases the MLE is more reasonable, in other cases the MLE may be very poor. An example of a likelihood function that is not very useful without some modification arises in nonparametric probability density estimation. Suppose we assume that a sample comes from a distribution with continuous PDF p(x). The likelihood is ∏_{i=1}^n p(x_i). Even under the assumption of continuity, the likelihood is unbounded, so there is no solution.

C. R. Rao cites another example in which the likelihood function is not very meaningful. Consider an urn containing N balls labeled 1, ..., N and also labeled with distinct real numbers θ_1, ..., θ_N (with N known). For a sample without replacement of size n < N in which we observe (x_i, y_i) = (label, θ_label), what is the likelihood function? It is either 0, if the label and θ_label for at least one observation are inconsistent, or (N choose n)^{−1} otherwise; and, of course, we don't know which! This likelihood function is not informative, and could not be used, for example, for estimating θ = θ_1 + ··· + θ_N. (There is a pretty good estimator of θ; it is N(Σ y_i)/n.)

Finding an MLE Notice that the problem of obtaining an MLE is a constrained optimization problem; that is, an objective function is to be optimized subject to the constraints that the solution be within the closure of the parameter space. In some cases the MLE occurs at a stationary point, which can be identified by differentiation. That is not always the case, however. A standard example in which the MLE does not occur at a stationary point is a distribution in which the range depends on the parameter, and the simplest such distribution is the uniform U(0, θ).
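A sketch of the U(0, θ) case just mentioned: the likelihood is θ^{−n} for θ ≥ max(x) and 0 otherwise, so it is decreasing wherever it is positive and the maximum sits at the boundary max(x), not at a stationary point.

```python
# Likelihood of theta for a U(0, theta) sample: theta^{-n} on
# [max(x), inf), zero below max(x).  No derivative is zero at the max.
def unif_likelihood(theta, x):
    n = len(x)
    return theta**(-n) if theta >= max(x) else 0.0

x = [0.9, 2.7, 1.8, 0.4, 2.2]     # toy sample
mle = max(x)                      # the MLE is the largest observation
# The likelihood strictly decreases to the right of the MLE.
vals = [unif_likelihood(t, x) for t in (mle, 3.0, 4.0)]
```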

Derivatives of the Likelihood and Log-Likelihood If the likelihood function and consequently the log-likelihood are twice differentiable within Θ, finding an MLE may be relatively easy. The derivatives of the log-likelihood function relate directly to useful concepts in statistical inference. If it exists, the derivative of the log-likelihood is the relative rate of change, with respect to the parameter placeholder θ, of the probability density function at a fixed observation. If θ is a scalar, some positive function of the derivative, such as its square or its absolute value, is obviously a measure of the effect of change in the parameter, or of change in the estimate of the parameter.

More generally, the outer product of the gradient with itself is a useful measure of the changes in the components of the parameter: ∇l_L(θ ; x) (∇l_L(θ ; x))ᵀ. Notice that the expectation of this quantity with respect to the probability density of the random variable X, I(θ ; X) = E_θ( ∇l_L(θ ; X) (∇l_L(θ ; X))ᵀ ), is the information matrix for an observation on X about the parameter θ.

If θ is a scalar, the expected square of the first derivative is the negative of the expected second derivative, E_θ( (∂l_L(θ ; x)/∂θ)² ) = −E_θ( ∂²l_L(θ ; x)/∂θ² ), or, in general, E_θ( ∇l_L(θ ; x) (∇l_L(θ ; x))ᵀ ) = −E_θ( H_{l_L}(θ ; x) ).

Computations If the log-likelihood is twice differentiable and if the range does not depend on the parameter, Newton's method could be used to solve the likelihood equation. Newton's equation H_{l_L}(θ^(k−1) ; x) d^(k) = −∇l_L(θ^(k−1) ; x) is used to determine the step direction in the k-th iteration. A quasi-Newton method uses a matrix H̃_{l_L}(θ^(k−1)) in place of the Hessian H_{l_L}(θ^(k−1)). The second derivative, or an approximation of it, is thus used in a Newton-like method to solve the maximization problem.

The Likelihood Equation In the regular case, with the likelihood and hence the log-likelihood differentiable within Θ, we call ∇L(θ ; x) = 0, or equivalently ∇l_L(θ ; x) = 0, the likelihood equations. If the maximum occurs within Θ, then every MLE is a root of the likelihood equations. There may be other roots within Θ, of course. A likelihood equation is an estimating equation, similar to the estimating equation for least squares estimators.

The Likelihood Equation Any root of the likelihood equations, which is called an RLE, may be an MLE. A theorem from functional analysis, usually proved in the context of numerical optimization, states that if θ_* is an RLE and H_{l_L}(θ_* ; x) is negative definite, then there is a local maximum at θ_*. This may allow us to determine that an RLE is an MLE. There are, of course, other ways of determining whether an RLE is an MLE. In maximum likelihood estimation, the determination that an RLE is actually an MLE is an important step in the process.

Consider the gamma family of distributions with parameters α and β. Given a random sample x_1, ..., x_n, the log-likelihood of α and β is
l_L(α, β ; x) = −nα log(β) − n log(Γ(α)) + (α − 1) Σ log(x_i) − (1/β) Σ x_i.
This yields the likelihood equations
−n log(β) − n Γ′(α)/Γ(α) + Σ log(x_i) = 0
and
−nα/β + (1/β²) Σ x_i = 0.
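A numerical sketch of solving these equations. The second equation gives β = x̄/α; substituting into the first leaves one equation in α, log(α) − ψ(α) = log(x̄) − mean(log x), where ψ = Γ′/Γ. The digamma approximation below (differencing `math.lgamma`) and the toy data are assumptions, not from the slides.

```python
import math

def digamma(a, h=1e-6):
    # psi(a) approximated by a central difference of log Gamma.
    return (math.lgamma(a + h) - math.lgamma(a - h)) / (2 * h)

def gamma_mle(x):
    n = len(x)
    xbar = sum(x) / n
    rhs = math.log(xbar) - sum(math.log(xi) for xi in x) / n
    # g(a) = log(a) - psi(a) is positive and strictly decreasing, so
    # bisection on a bracket is enough to find the root.
    lo, hi = 1e-3, 1e3
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if math.log(mid) - digamma(mid) > rhs:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    return alpha, xbar / alpha          # (alpha_hat, beta_hat)

x = [0.8, 1.4, 2.3, 0.5, 1.9, 1.1, 3.2, 0.9]   # toy sample
alpha_hat, beta_hat = gamma_mle(x)
```

By construction α̂β̂ = x̄, which is exactly the second likelihood equation.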

Nondifferentiable Likelihood Functions The definition of MLEs does not depend on the existence of a likelihood equation. The likelihood function may not be differentiable with respect to the parameter, as in the case of the hypergeometric distribution, in which the parameter must be an integer (see Shao, Example 4.32).

Properties of MLEs As we have mentioned, MLEs have a nice intuitive property, and as we have seen, they have a certain equivariance property. We will see later that they also often have good asymptotic properties. We now consider some other properties; some useful and some less desirable.

Relation to Sufficient Statistics Theorem 1 If there is a sufficient statistic and an MLE exists, then an MLE is a function of the sufficient statistic. Proof. This follows directly from the factorization theorem.

Relation to Efficient Statistics Given the three Fisher information regularity conditions, we have defined Fisher efficient estimators as unbiased estimators that achieve the lower bound on their variance. Theorem 2 Assume the FI regularity conditions for a family of distributions {P θ } with the additional Le Cam-type requirement that the Fisher information matrix I(θ) is positive definite for all θ. Let T (X) be a Fisher efficient estimator of θ. Then T (X) is an MLE of θ.

Proof. Let p_θ(x) be the PDF. We have ∇_θ log(p_θ(x)) = I(θ)(T(x) − θ) for any θ and x. Clearly, for θ = T(x), this equation is 0 (hence, T(X) is an RLE). Because I(θ), which is the negative of the Hessian of the log-likelihood, is positive definite, T(x) maximizes the likelihood. Notice that without the additional requirement of a positive definite information matrix, Theorem 2 would yield only the conclusion that T(X) is an RLE.

Equivariance of MLEs If ˆθ is a good estimator of θ, it would seem to be reasonable that g(ˆθ) is a good estimator of g(θ), where g is a Borel function. Good, of course, is relative to some criterion. If the criterion is UMVU, then the estimator in general will not have this equivariance property; that is, if ˆθ is a UMVUE of θ, then g(ˆθ) may not be a UMVUE of g(θ). Maximum likelihood estimators almost have this equivariance property, however. Because of the word almost in the previous sentence, which we will elaborate on below, we will just define MLEs of functions so as to be equivariant.

Definition 1 (functions of estimands and of MLEs) If θ̂ is an MLE of θ, and g is a Borel function, then g(θ̂) is an MLE of the estimand g(θ). The reason that this does not follow from the definition of an MLE is that the function may not be one-to-one. We could just restrict our attention to one-to-one functions. Instead, however, we will now define a function of the likelihood, and show that g(θ̂) is an MLE w.r.t. this transformed likelihood, called the induced likelihood. Note, of course, that the expression g(θ) does not allow other variables and functions; e.g., we would not say that E((Y − Xβ)ᵀ(Y − Xβ)) qualifies as a g(θ) in the definition above. (Otherwise, what would an MLE of σ² be?)

Definition 2 (induced likelihood) Let {p_θ : θ ∈ Θ} with Θ ⊆ IR^d be a family of PDFs w.r.t. a common σ-finite measure, and let L(θ) be the likelihood associated with this family, given observations. Now let g be a Borel function from Θ to Λ ⊆ IR^{d₁}, where 1 ≤ d₁ ≤ d. Then L(λ) = sup_{θ : θ ∈ Θ and g(θ) = λ} L(θ) is called the induced likelihood function for the transformed parameter. Theorem 3 Suppose {p_θ : θ ∈ Θ} with Θ ⊆ IR^d is a family of PDFs w.r.t. a common σ-finite measure with associated likelihood L(θ). Let θ̂ be an MLE of θ. Now let g be a Borel function from Θ to Λ ⊆ IR^{d₁}, where 1 ≤ d₁ ≤ d, and let L(λ) be the resulting induced likelihood. Then g(θ̂) maximizes L(λ).

Example 4 MLE of the variance in a Bernoulli distribution Consider the Bernoulli family of distributions with parameter π. The variance of a Bernoulli distribution is g(π) = π(1 − π). Given a random sample x_1, ..., x_n, the MLE of π is π̂ = Σ_{i=1}^n x_i / n; hence the MLE of the variance is (1/n) Σ x_i (1 − Σ x_i / n). Note that this estimator is biased and that it is the same estimator as that of the variance in a normal distribution: Σ(x_i − x̄)²/n. The UMVUE of the variance in a Bernoulli distribution is (1/(n − 1)) Σ x_i (1 − Σ x_i / n).

The difference between the MLE and the UMVUE of the variance in the Bernoulli distribution is the same as the difference between the corresponding estimators of the variance in the normal distribution. How do the MSEs of the estimators of the variance in a Bernoulli distribution compare? (Exercise.) Whenever the variance of a distribution can be expressed as a function of other parameters, g(θ), as in the case of the Bernoulli distribution, the MLE of the variance is g(θ̂), where θ̂ is an MLE of θ. The MLE of the variance of the gamma distribution, for example, is α̂β̂², where α̂ and β̂ are the MLEs. The plug-in estimator of the variance of the gamma distribution is different. Given the sample X_1, X_2, ..., X_n, as always, it is (1/n) Σ_{i=1}^n (X_i − X̄)².
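A Monte Carlo sketch of the MSE comparison (the exercise asks for an analysis; this is only a numerical check at one assumed setting, π = 0.3 and n = 10):

```python
import random

# Compare MSEs of the MLE (divisor n) and UMVUE (divisor n - 1) of the
# Bernoulli variance g(pi) = pi (1 - pi).  Seeded for reproducibility.
random.seed(1)
pi, n, reps = 0.3, 10, 20000
g = pi * (1 - pi)

se_mle = se_umvue = sum_umvue = 0.0
for _ in range(reps):
    t = sum(1 for _ in range(n) if random.random() < pi)  # binomial draw
    mle = (t / n) * (1 - t / n)
    umvue = n / (n - 1) * mle        # UMVUE = n/(n-1) times the MLE
    se_mle += (mle - g)**2
    se_umvue += (umvue - g)**2
    sum_umvue += umvue

mse_mle, mse_umvue = se_mle / reps, se_umvue / reps
mean_umvue = sum_umvue / reps        # should be near g: UMVUE is unbiased
```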

Other Properties Other properties of MLEs are not always desirable. First of all, we note that an MLE may be biased. The most familiar example of this is the MLE of the variance. Another example is the MLE of the location parameter in the uniform distribution. Although the MLE approach is usually an intuitively logical one, it is not based on a formal decision theory, so it is not surprising that MLEs may not possess certain desirable properties that are formulated from that perspective. There are other interesting examples in which MLEs do not have desirable (or expected) properties.

An MLE may be discontinuous, as, for example, in the case of the ɛ-mixture distribution family with CDF P_{x,ɛ}(y) = (1 − ɛ)P(y) + ɛ I_{[x,∞[}(y), where 0 ≤ ɛ ≤ 1. An MLE may not be a function of a sufficient statistic (if the MLE is not unique). An MLE may not satisfy the likelihood equation, as, for example, when the likelihood function is not differentiable at its maximum. An MLE may differ from a simple method of moments estimator; in particular, an MLE of the population mean may not be the sample mean.

The likelihood equation may have a unique root, yet no MLE exists. While there are examples in which the roots of the likelihood equations occur at minima of the likelihood, this situation does not arise in any realistic distribution (that I am aware of). Romano and Siegel (1986) construct a location family of distributions with support on IR and singularities at x_1 + θ and x_2 + θ, where x_1 < x_2 are known but θ is unknown, with a Lebesgue density p(x) that rises to a singularity as x → x_1 + θ and rises to a singularity as x → x_2 + θ, and that is continuous and strictly convex over ]x_1 + θ, x_2 + θ[. With a single observation, the likelihood equation has a root at the minimum of the convex portion of the density between x_1 + θ and x_2 + θ, but the likelihood increases without bound at both x_1 + θ and x_2 + θ.

Nonuniqueness There are many cases in which the MLEs are not unique (and I'm not just referring to RLEs). The following examples illustrate this. Example 5 likelihood in a Cauchy family Consider the Cauchy distribution with location parameter θ. The likelihood equation is Σ_{i=1}^n 2(x_i − θ)/(1 + (x_i − θ)²) = 0. This may have multiple roots (depending on the sample), and in that case the one yielding the maximum is the MLE. Depending on the sample, however, multiple roots can yield the same value of the likelihood function.
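A numerical sketch of the multiple-roots phenomenon: with a sample split into two distant clusters (a made-up configuration, not from the slides), scanning the Cauchy score function over a grid of θ values shows several sign changes, i.e. several roots of the likelihood equation.

```python
# Score (derivative of the Cauchy log-likelihood in the location theta).
def score(theta, x):
    return sum(2 * (xi - theta) / (1 + (xi - theta)**2) for xi in x)

x = [-5.0, -1.0, 0.0, 10.0, 11.0]         # two widely separated clusters
thetas = [i * 0.01 for i in range(-800, 1400)]   # grid on [-8, 14)
scores = [score(t, x) for t in thetas]

# Each sign change brackets a root; local maxima near each cluster and a
# local minimum between them give at least three roots.
sign_changes = sum(1 for a, b in zip(scores, scores[1:]) if a * b < 0)
```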

Example 6 likelihood in a uniform family with fixed range Another example in which the MLE is not unique is U(θ − 1/2, θ + 1/2). The likelihood function for U(θ − 1/2, θ + 1/2) is I_{]x_(n) − 1/2, x_(1) + 1/2[}(θ). It is maximized at any value between x_(n) − 1/2 and x_(1) + 1/2.

Nonexistence and Other Properties We have already mentioned situations in which the likelihood approach does not seem to be the logical way, and have seen that sometimes in nonparametric problems, the MLE does not exist. This often happens when there are more things to estimate than there are observations. This can also happen in parametric problems. In this case, some people prefer to say that the likelihood function does not exist; that is, they suggest that the definition of a likelihood function include boundedness.

MLE and the Exponential Class If X has a distribution in the exponential class and we write its density in the natural or canonical form, the likelihood has the form L(η ; x) = exp(ηᵀT(x) − ζ(η)) h(x). The log-likelihood equation is particularly simple: T(x) − ∂ζ(η)/∂η = 0. Newton's method for solving the likelihood equation is
η^(k) = η^(k−1) + ( ∂²ζ(η)/∂η(∂η)ᵀ |_{η=η^(k−1)} )^{−1} ( T(x) − ∂ζ(η)/∂η |_{η=η^(k−1)} ).
Note that the second term includes the Fisher information matrix for η. (The expectation is constant.) (Note that the FI is not for a distribution; it is for a parametrization.)

We have V(T(X)) = ∂²ζ(η)/∂η(∂η)ᵀ. Note that the variance is evaluated at the true η. If we have a full-rank member of the exponential class, then V is positive definite, and hence there is a unique maximum. If we write µ(η) = ∂ζ(η)/∂η, then in the full-rank case µ^{−1} exists and we have the solution to the likelihood equation: η̂ = µ^{−1}(T(x)). So maximum likelihood estimation is very nice for the exponential class.
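A concrete sketch using the Poisson family (my choice of example, not the slides'): in canonical form η = log λ, T(x) = Σ x_i, and ζ(η) = n e^η, so µ(η) = n e^η and η̂ = µ^{−1}(T(x)) = log(T/n). Newton's method on η reaches the same point.

```python
import math

x = [3, 1, 4, 2, 0, 5, 2, 3]       # toy Poisson counts
n, T = len(x), sum(x)

# Closed form: eta_hat = mu^{-1}(T(x)) = log(T / n), i.e. lambda_hat = xbar.
eta_hat = math.log(T / n)
lam_hat = math.exp(eta_hat)

# Newton iteration eta <- eta + (zeta'')^{-1} (T - zeta'(eta)); the
# log-likelihood eta*T - n*exp(eta) is concave, so it converges.
eta = 0.0
for _ in range(50):
    grad = T - n * math.exp(eta)   # T(x) - zeta'(eta)
    hess = n * math.exp(eta)       # zeta''(eta): the Fisher information
    eta = eta + grad / hess
```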

EM Methods Although EM methods do not rely on missing data, they can be explained most easily in terms of a random sample that consists of two components, one observed and one unobserved or missing. A simple example of missing data occurs in life-testing, when, for example, a number of electrical units are switched on and the time when each fails is recorded. In such an experiment, it is usually necessary to curtail the recordings prior to the failure of all units. The failure times of the units still working are unobserved, but the number of censored observations and the time of the censoring obviously provide information about the distribution of the failure times.

Another common example that motivates the EM algorithm is a finite mixture model. Each observation comes from an unknown one of an assumed set of distributions. The missing data is the distribution indicator. The parameters of the distributions are to be estimated. As a side benefit, the class membership indicator is estimated. The missing data can be missing observations on the same random variable that yields the observed sample, as in the case of the censoring example; or the missing data can be from a different random variable that is related somehow to the random variable observed. Many common applications of EM methods involve missing-data problems, but this is not necessary.

Example 7 MLE in a normal mixture model A two-component normal mixture model can be defined by two normal distributions, N(µ_1, σ²_1) and N(µ_2, σ²_2), and the probability that the random variable (the observable) arises from the first distribution is w. The parameter in this model is the vector θ = (w, µ_1, σ²_1, µ_2, σ²_2). (Note that w and the σ²s have the obvious constraints.) The PDF of the mixture is p(y ; θ) = w p_1(y ; µ_1, σ²_1) + (1 − w) p_2(y ; µ_2, σ²_2), where p_j(y ; µ_j, σ²_j) is the normal PDF with parameters µ_j and σ²_j.

We denote the complete data as C = (X, U), where X represents the observed data and the unobserved U represents class membership. Let U = 1 if the observation is from the first distribution and U = 0 if it is from the second. The unconditional E(U) is the probability that an observation comes from the first distribution, which of course is w. Suppose we have n observations on X, x_1, ..., x_n. Given a provisional value θ^(k) of θ, we can compute the conditional expected value E(U | x) for any realization of X. It is merely
E(U | x, θ^(k)) = w^(k) p_1(x ; µ_1^(k), σ²_1^(k)) / p(x ; w^(k), µ_1^(k), σ²_1^(k), µ_2^(k), σ²_2^(k)).

The M step is just the familiar MLE of the parameters, with q^(k)(x_i) = E(U | x_i, θ^(k)):
w^(k+1) = (1/n) Σ q^(k)(x_i)
µ_1^(k+1) = (1/(n w^(k+1))) Σ q^(k)(x_i) x_i
σ²_1^(k+1) = (1/(n w^(k+1))) Σ q^(k)(x_i) (x_i − µ_1^(k+1))²
µ_2^(k+1) = (1/(n(1 − w^(k+1)))) Σ (1 − q^(k)(x_i)) x_i
σ²_2^(k+1) = (1/(n(1 − w^(k+1)))) Σ (1 − q^(k)(x_i)) (x_i − µ_2^(k+1))²
(Recall that the MLE of σ² has a divisor of n, rather than n − 1.)
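The E and M steps above can be sketched end to end. The synthetic data, the starting values, and the fixed iteration count are all assumptions for illustration; they are not from the slides.

```python
import math, random

# Synthetic two-component sample: half N(0, 1), half N(5, 1), seeded.
random.seed(2)
data = [random.gauss(0.0, 1.0) for _ in range(150)] + \
       [random.gauss(5.0, 1.0) for _ in range(150)]

def npdf(y, mu, v):
    return math.exp(-(y - mu)**2 / (2 * v)) / math.sqrt(2 * math.pi * v)

w, mu1, v1, mu2, v2 = 0.5, -1.0, 1.0, 6.0, 1.0   # provisional theta^(0)
n = len(data)
for _ in range(100):
    # E step: q_i = E(U | x_i, theta^(k)).
    q = [w * npdf(y, mu1, v1) /
         (w * npdf(y, mu1, v1) + (1 - w) * npdf(y, mu2, v2)) for y in data]
    # M step: weighted MLEs (divisor n, not n - 1).
    w = sum(q) / n
    mu1 = sum(qi * y for qi, y in zip(q, data)) / (n * w)
    v1 = sum(qi * (y - mu1)**2 for qi, y in zip(q, data)) / (n * w)
    mu2 = sum((1 - qi) * y for qi, y in zip(q, data)) / (n * (1 - w))
    v2 = sum((1 - qi) * (y - mu2)**2 for qi, y in zip(q, data)) / (n * (1 - w))
```

With well-separated components, the iteration settles near w ≈ 1/2, µ_1 ≈ 0, µ_2 ≈ 5.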

Asymptotic Properties of MLEs, RLEs, and GEE Estimators The argmax of the likelihood function, that is, the MLE of the argument of the likelihood function, is obviously an important statistic. In many cases, a likelihood equation exists, and often in those cases, the MLE is a root of the likelihood equation. In some cases there are roots of the likelihood equation (RLEs) that may or may not be MLEs.

Asymptotic Distributions of MLEs and RLEs We recall that asymptotic expectations are defined as expectations in asymptotic distributions (rather than as limits of expectations). The first step in studying asymptotic properties is to determine the asymptotic distribution.

Example 8 asymptotic distribution of the MLE of the variance in a Bernoulli family In Example 4 we determined the MLE of the variance g(π) = π(1 − π) in the Bernoulli family of distributions with parameter π. The MLE of g(π) is T_n = X̄(1 − X̄). Now, let us determine its asymptotic distribution.

From the central limit theorem, √n(X̄ − π) →_d N(0, g(π)). If π ≠ 1/2, then g′(π) ≠ 0, and we can use the delta method and the CLT to get √n(g(π) − T_n) →_d N(0, π(1 − π)(1 − 2π)²). (I have written (g(π) − T_n) instead of (T_n − g(π)) so the expression looks more similar to one we get next.) If π = 1/2, this is a degenerate distribution. (The limiting variance actually is 0, but the degenerate distribution is not very useful.)

Let's take a different approach for the case π = 1/2. We have from the CLT, √n(X̄ − 1/2) →_d N(0, 1/4). Hence, if we scale and square, we get 4n(X̄ − 1/2)² →_d χ²_1, or, since g(π) − T_n = 1/4 − X̄(1 − X̄) = (X̄ − 1/2)², 4n(g(π) − T_n) →_d χ²_1. This is a specific instance of a second-order delta method.
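A simulation sketch of this degenerate case (sample size, replication count, and seed are arbitrary choices): the statistic 4n(X̄ − 1/2)² should behave like a χ²₁ variable, whose mean is 1.

```python
import random

random.seed(3)
n, reps = 400, 5000
vals = []
for _ in range(reps):
    t = sum(1 for _ in range(n) if random.random() < 0.5)  # Bin(n, 1/2)
    xbar = t / n
    vals.append(4 * n * (xbar - 0.5)**2)   # = 4n (g(pi) - T_n) at pi = 1/2

mean_val = sum(vals) / reps                # approx E(chi^2_1) = 1
# P(chi^2_1 <= 1) = P(|Z| <= 1), roughly 0.68 (discreteness nudges it up).
frac_below_1 = sum(1 for v in vals if v <= 1.0) / reps
```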

Following the same methods as for deriving the first-order delta method using a Taylor series expansion, for the univariate case we can see that if √n(S_n − c) →_d N(0, σ²), g′(c) = 0, and g″(c) ≠ 0, then 2n(g(S_n) − g(c))/(σ² g″(c)) →_d χ²_1.