Review and continuation from last week: Properties of MLEs

As we have mentioned, MLEs have a nice intuitive property, and as we have seen, they have a certain equivariance property. We will see later that they also often have good asymptotic properties. We now consider some other properties, some useful and some less desirable.

Relation to Sufficient Statistics

Theorem 1. If there is a sufficient statistic and an MLE exists, then an MLE is a function of the sufficient statistic.

Proof. This follows directly from the factorization theorem: if $T$ is sufficient, the likelihood factors as $L(\theta\,;x) = g(T(x),\theta)\,h(x)$, so, as a function of $\theta$, the likelihood depends on the data only through $T(x)$; hence a maximizer over $\theta$ can be chosen as a function of $T(x)$.

Relation to Efficient Statistics

Given the three Fisher information regularity conditions, we have defined Fisher efficient estimators as unbiased estimators that achieve the Cramér–Rao lower bound on their variance.

Theorem 2. Assume the FI regularity conditions for a family of distributions $\{P_\theta\}$, with the additional Le Cam-type requirement that the Fisher information matrix $I(\theta)$ is positive definite for all $\theta$. Let $T(X)$ be a Fisher efficient estimator of $\theta$. Then $T(X)$ is an MLE of $\theta$.

Proof. Let $p_\theta(x)$ be the PDF. Because $T(X)$ attains the Cramér–Rao lower bound, the score must be a linear function of $T(x)$; specifically,
$$\frac{\partial}{\partial\theta}\log p_\theta(x) = I(\theta)\,(T(x) - \theta)$$
for any $\theta$ and $x$. Clearly, for $\theta = T(x)$, this expression is 0 (hence, $T(X)$ is an RLE). Because $I(\theta)$, which at $\theta = T(x)$ is the negative of the Hessian of the log-likelihood, is positive definite, $T(x)$ maximizes the likelihood.

Notice that without the additional requirement of a positive definite information matrix, Theorem 2 would yield only the conclusion that $T(X)$ is an RLE.

Equivariance of MLEs

If $\hat\theta$ is a good estimator of $\theta$, it would seem reasonable that $g(\hat\theta)$ is a good estimator of $g(\theta)$, where $g$ is a Borel function. "Good", of course, is relative to some criterion. If the criterion is UMVU, then the estimator in general will not have this equivariance property; that is, if $\hat\theta$ is a UMVUE of $\theta$, then $g(\hat\theta)$ may not be a UMVUE of $g(\theta)$. Maximum likelihood estimators almost have this equivariance property, however. Because of the word "almost" in the previous sentence, which we will elaborate on below, we will simply define MLEs of functions so as to be equivariant.

Definition 1 (functions of estimands and of MLEs). If $\hat\theta$ is an MLE of $\theta$, and $g$ is a Borel function, then $g(\hat\theta)$ is an MLE of the estimand $g(\theta)$.

The reason that this does not follow from the definition of an MLE is that the function may not be one-to-one. We could just restrict our attention to one-to-one functions. Instead, however, we will now define a function of the likelihood, and show that $g(\hat\theta)$ is an MLE w.r.t. this transformed likelihood, called the induced likelihood. Note, of course, that the expression $g(\theta)$ does not allow other variables and functions; e.g., we would not say that $E\big((Y - X\beta)^T(Y - X\beta)\big)$ qualifies as a $g(\theta)$ in the definition above. (Otherwise, what would an MLE of $\sigma^2$ be?)

Definition 2 (induced likelihood). Let $\{p_\theta : \theta \in \Theta\}$, with $\Theta \subseteq \mathbb{R}^d$, be a family of PDFs w.r.t. a common $\sigma$-finite measure, and let $L(\theta)$ be the likelihood associated with this family, given observations. Now let $g$ be a Borel function from $\Theta$ to $\Lambda \subseteq \mathbb{R}^{d_1}$, where $1 \le d_1 \le d$. Then
$$\bar L(\lambda) = \sup_{\{\theta \,:\, \theta \in \Theta \text{ and } g(\theta) = \lambda\}} L(\theta)$$
is called the induced likelihood function for the transformed parameter.

Theorem 3. Suppose $\{p_\theta : \theta \in \Theta\}$, with $\Theta \subseteq \mathbb{R}^d$, is a family of PDFs w.r.t. a common $\sigma$-finite measure, with associated likelihood $L(\theta)$. Let $\hat\theta$ be an MLE of $\theta$. Now let $g$ be a Borel function from $\Theta$ to $\Lambda \subseteq \mathbb{R}^{d_1}$, where $1 \le d_1 \le d$, and let $\bar L(\lambda)$ be the resulting induced likelihood. Then $g(\hat\theta)$ maximizes $\bar L(\lambda)$.
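As a concrete (and entirely hypothetical) numerical illustration of Definition 2, the following sketch computes the induced likelihood for a Bernoulli sample with the non-one-to-one function $g(\pi) = \pi(1-\pi)$, the variance function used in Example 1 below. The induced likelihood at $\lambda$ is the larger of the two likelihood values at the roots of $\pi(1-\pi) = \lambda$, and its maximizer agrees with $g(\hat\pi)$. The data are made up for the illustration.

```python
import numpy as np

# Hypothetical Bernoulli sample; g(pi) = pi*(1-pi) is not one-to-one.
x = np.array([1, 0, 0, 1, 1, 0, 1, 1])
n, s = len(x), x.sum()

def loglik(pi):
    # Bernoulli log-likelihood
    return s * np.log(pi) + (n - s) * np.log(1 - pi)

def induced_loglik(lam):
    # pi*(1-pi) = lam has two roots for 0 < lam < 1/4; take the sup over them
    root = np.sqrt(0.25 - lam)
    return max(loglik(0.5 - root), loglik(0.5 + root))

pi_hat = s / n
grid = np.linspace(0.01, 0.2499, 500)
vals = [induced_loglik(lam) for lam in grid]
print("g(pi_hat) =", pi_hat * (1 - pi_hat),
      " argmax of induced likelihood over grid =", grid[np.argmax(vals)])
```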

Example 1 (MLE of the variance in a Bernoulli distribution). Consider the Bernoulli family of distributions with parameter $\pi$. The variance of a Bernoulli distribution is $g(\pi) = \pi(1-\pi)$. Given a random sample $x_1, \ldots, x_n$, the MLE of $\pi$ is $\hat\pi = \sum_{i=1}^n x_i / n$; hence the MLE of the variance is
$$\frac{1}{n}\sum x_i\Big(1 - \sum x_i / n\Big).$$
Note that this estimator is biased and that it is the same estimator as that of the variance in a normal distribution: $\sum (x_i - \bar x)^2 / n$. The UMVUE of the variance in a Bernoulli distribution is
$$\frac{1}{n-1}\sum x_i\Big(1 - \sum x_i / n\Big).$$

The difference between the MLE and the UMVUE of the variance in the Bernoulli distribution is the same as the difference between the corresponding estimators of the variance in the normal distribution (divisor $n$ versus divisor $n-1$). How do the MSEs of the two estimators of the variance in a Bernoulli distribution compare? (Exercise; a small simulation sketch is given below.)

Whenever the variance of a distribution can be expressed as a function of other parameters, $g(\theta)$, as in the case of the Bernoulli distribution, the MLE of the variance is $g(\hat\theta)$, where $\hat\theta$ is an MLE of $\theta$. The MLE of the variance of the gamma distribution, for example, is $\hat\alpha\hat\beta^2$, where $\hat\alpha$ and $\hat\beta$ are the MLEs. The plug-in estimator of the variance of the gamma distribution is different; given the sample $X_1, X_2, \ldots, X_n$, as always, it is $\frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2$.
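For the MSE comparison left as an exercise above, here is a minimal Monte Carlo sketch; the sample size, the values of $\pi$, and the replication count are arbitrary choices, not from the notes.

```python
import numpy as np

# Compare MSE of the MLE (divisor n) and the UMVUE (divisor n-1) of pi*(1-pi).
rng = np.random.default_rng(0)
n, reps = 20, 100_000

for pi in (0.1, 0.3, 0.5):
    x = rng.binomial(1, pi, size=(reps, n))
    xbar = x.mean(axis=1)
    mle = xbar * (1 - xbar)                    # g(pi_hat), divisor n
    umvue = n * xbar * (1 - xbar) / (n - 1)    # divisor n - 1
    g = pi * (1 - pi)
    print(f"pi={pi}:  MSE(MLE)={np.mean((mle - g)**2):.6f}  "
          f"MSE(UMVUE)={np.mean((umvue - g)**2):.6f}")
```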

Other Properties

Other properties of MLEs are not always desirable. First of all, we note that an MLE may be biased. The most familiar example of this is the MLE of the variance. Another example is the MLE of the location parameter in the uniform distribution.

Although the MLE approach is usually an intuitively logical one, it is not based on a formal decision theory, so it is not surprising that MLEs may not possess certain desirable properties that are formulated from that perspective. There are other interesting examples in which MLEs do not have desirable (or expected) properties.

An MLE may be discontinuous, as, for example, in the case of the $\epsilon$-mixture distribution family with CDF
$$P_{x,\epsilon}(y) = (1-\epsilon)P(y) + \epsilon\, I_{[x,\infty[}(y),$$
where $0 \le \epsilon \le 1$.

An MLE may not be a function of a sufficient statistic (if the MLE is not unique).

An MLE may not satisfy the likelihood equation, as, for example, when the likelihood function is not differentiable at its maximum.

An MLE may differ from a simple method of moments estimator; in particular, an MLE of the population mean may not be the sample mean.

The likelihood equation may have a unique root, yet no MLE exists. While there are examples in which the roots of the likelihood equations occur at minima of the likelihood, this situation does not arise in any realistic distribution (that I am aware of). Romano and Siegel (1986) construct a location family of distributions on $\mathbb{R}$, with known points $x_1 < x_2$ and unknown location parameter $\theta$, whose Lebesgue density $p(x)$ rises to a singularity as $x \to x_1$, rises to a singularity as $x \to x_2$, and is continuous and strictly convex over $]x_1, x_2[$. With a single observation, the likelihood equation has a root at the minimum of the convex portion of the density between $x_1 + \theta$ and $x_2 + \theta$, but the likelihood increases without bound at both $x_1 + \theta$ and $x_2 + \theta$, so no MLE exists.

Nonuniqueness

There are many cases in which the MLEs are not unique (and I'm not just referring to RLEs). The following examples illustrate this.

Example 2 (likelihood in a Cauchy family). Consider the Cauchy distribution with location parameter $\theta$. The likelihood equation is
$$\sum_{i=1}^n \frac{2(x_i - \theta)}{1 + (x_i - \theta)^2} = 0.$$
This may have multiple roots (depending on the sample), and so the one yielding the maximum would be the MLE. Depending on the sample, however, multiple roots can yield the same value of the likelihood function.
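A small numerical sketch of Example 2, with a deliberately spread-out hypothetical sample, locates the roots of the likelihood equation on a grid and compares them with the global maximizer.

```python
import numpy as np

# Cauchy location model: log-likelihood and its derivative (the score).
x = np.array([-5.0, -4.5, 0.2, 4.8, 5.3])   # hypothetical sample
theta = np.linspace(-10, 10, 4001)

loglik = np.array([-np.sum(np.log(1 + (x - t) ** 2)) for t in theta])
score = np.array([np.sum(2 * (x - t) / (1 + (x - t) ** 2)) for t in theta])

# sign changes of the score locate roots of the likelihood equation
roots = theta[:-1][np.diff(np.sign(score)) != 0]
print("approximate roots of the likelihood equation:", np.round(roots, 2))
print("grid MLE (global maximizer):", theta[np.argmax(loglik)])
```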

Another example in which the MLE is not unique is $U(\theta - \tfrac{1}{2},\ \theta + \tfrac{1}{2})$.

Example 3 (likelihood in a uniform family with fixed range). The likelihood function for $U(\theta - \tfrac{1}{2},\ \theta + \tfrac{1}{2})$ is
$$L(\theta) = I_{]x_{(n)} - 1/2,\ x_{(1)} + 1/2[}(\theta).$$
It is maximized at any value between $x_{(n)} - 1/2$ and $x_{(1)} + 1/2$.
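A similar sketch for Example 3 (again with hypothetical data) shows the flat stretch of the likelihood: every $\theta$ in the interval gives likelihood 1, so any point of the interval is an MLE.

```python
import numpy as np

x = np.array([0.31, 0.55, 0.62, 0.80, 1.05])   # hypothetical sample

def likelihood(theta):
    # 1 if all observations lie in (theta - 1/2, theta + 1/2), else 0
    return float(np.all((x > theta - 0.5) & (x < theta + 0.5)))

lo, hi = x.max() - 0.5, x.min() + 0.5
print("maximizing interval:", (lo, hi))
print([likelihood(t) for t in np.linspace(lo + 1e-6, hi - 1e-6, 5)])  # all 1.0
```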

Nonexistence and Other Properties

We have already mentioned situations in which the likelihood approach does not seem to be the logical way to proceed, and have seen that sometimes in nonparametric problems the MLE does not exist. This often happens when there are more things to estimate than there are observations. It can also happen in parametric problems. In such cases, some people prefer to say that the likelihood function does not exist; that is, they suggest that the definition of a likelihood function include boundedness.

MLE and the Exponential Class

If $X$ has a distribution in the exponential class and we write its density in the natural or canonical form, the likelihood has the form
$$L(\eta\,; x) = \exp\big(\eta^T T(x) - \zeta(\eta)\big)h(x).$$
The log-likelihood equation is particularly simple:
$$T(x) - \frac{\partial \zeta(\eta)}{\partial \eta} = 0.$$
Newton's method for solving the likelihood equation is
$$\eta^{(k)} = \eta^{(k-1)} + \left(\frac{\partial^2 \zeta(\eta)}{\partial\eta\,(\partial\eta)^T}\bigg|_{\eta=\eta^{(k-1)}}\right)^{-1}\left(T(x) - \frac{\partial \zeta(\eta)}{\partial\eta}\bigg|_{\eta=\eta^{(k-1)}}\right).$$
Note that the inverted matrix is the Fisher information matrix for $\eta$; because the Hessian of $\zeta$ does not involve the data, the observed and expected information coincide. (Note also that the FI is not for a distribution; it is for a parametrization.)

We have
$$V(T(X)) = \frac{\partial^2 \zeta(\eta)}{\partial\eta\,(\partial\eta)^T},$$
evaluated at the true $\eta$. If we have a full-rank member of the exponential class, then $V$ is positive definite, and hence there is a unique maximum. If we write
$$\mu(\eta) = \frac{\partial \zeta(\eta)}{\partial\eta},$$
then in the full-rank case $\mu^{-1}$ exists, and so we have the solution to the likelihood equation in closed form:
$$\hat\eta = \mu^{-1}(T(x)).$$
So maximum likelihood estimation is very nice for the exponential class.
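As an illustration of the Newton (equivalently, Fisher scoring) iteration above, here is a minimal sketch using the Poisson family in canonical form as an assumed example, with $\eta = \log\lambda$, $\zeta(\eta) = n e^\eta$, and $T(x) = \sum x_i$; the data are simulated, and the closed-form solution $\mu^{-1}(T(x)) = \bar x$ is shown for comparison.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.poisson(3.7, size=50)          # hypothetical data
n, T = len(x), x.sum()

eta = 0.0                              # starting value
for k in range(25):
    grad = T - n * np.exp(eta)         # score: T(x) - zeta'(eta)
    hess = n * np.exp(eta)             # zeta''(eta) = Fisher information for eta
    step = grad / hess
    eta += step                        # eta_new = eta_old + I(eta)^{-1} * score
    if abs(step) < 1e-10:
        break

print("Newton estimate of lambda:", np.exp(eta), " closed form x-bar:", T / n)
```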

EM Methods

Although EM methods do not rely on missing data, they can be explained most easily in terms of a random sample that consists of two components, one observed and one unobserved or missing.

A simple example of missing data occurs in life-testing, when, for example, a number of electrical units are switched on and the time when each fails is recorded. In such an experiment, it is usually necessary to curtail the recordings prior to the failure of all units. The failure times of the units still working are unobserved, but the number of censored observations and the time of the censoring obviously provide information about the distribution of the failure times.

Another common example that motivates the EM algorithm is a finite mixture model. Each observation comes from one of an assumed set of distributions, but which one is unknown. The missing data are the distribution indicators. The parameters of the distributions are to be estimated; as a side benefit, the class membership indicators are estimated.

The missing data can be missing observations on the same random variable that yields the observed sample, as in the case of the censoring example; or the missing data can be from a different random variable that is related somehow to the random variable observed. Many common applications of EM methods involve missing-data problems, but this is not necessary.

Example 4 (MLE in a normal mixture model). A two-component normal mixture model can be defined by two normal distributions, $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$, and the probability that the random variable (the observable) arises from the first distribution is $w$. The parameter in this model is the vector $\theta = (w, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2)$. (Note that $w$ and the $\sigma$'s have the obvious constraints.) The pdf of the mixture is
$$p(y; \theta) = w\, p_1(y; \mu_1, \sigma_1^2) + (1-w)\, p_2(y; \mu_2, \sigma_2^2),$$
where $p_j(y; \mu_j, \sigma_j^2)$ is the normal pdf with parameters $\mu_j$ and $\sigma_j^2$.

We denote the complete data as $C = (X, U)$, where $X$ represents the observed data and the unobserved $U$ represents class membership. Let $U = 1$ if the observation is from the first distribution and $U = 0$ if the observation is from the second distribution. The unconditional $E(U)$ is the probability that an observation comes from the first distribution, which of course is $w$.

Suppose we have $n$ observations on $X$, $x_1, \ldots, x_n$. Given a provisional value of $\theta$, we can compute the conditional expected value $E(U \mid x)$ for any realization of $X$. It is merely
$$E(U \mid x, \theta^{(k)}) = \frac{w^{(k)}\, p_1(x; \mu_1^{(k)}, \sigma_1^{2(k)})}{p(x;\, w^{(k)}, \mu_1^{(k)}, \sigma_1^{2(k)}, \mu_2^{(k)}, \sigma_2^{2(k)})}.$$

The M step is just the familiar MLE of the parameters, weighted by the conditional expectations $q^{(k)}(x_i, \theta^{(k)}) = E(U \mid x_i, \theta^{(k)})$ from the E step:
$$w^{(k+1)} = \frac{1}{n}\sum q^{(k)}(x_i, \theta^{(k)})$$
$$\mu_1^{(k+1)} = \frac{1}{n w^{(k+1)}}\sum q^{(k)}(x_i, \theta^{(k)})\, x_i$$
$$\sigma_1^{2(k+1)} = \frac{1}{n w^{(k+1)}}\sum q^{(k)}(x_i, \theta^{(k)})\,(x_i - \mu_1^{(k+1)})^2$$
$$\mu_2^{(k+1)} = \frac{1}{n(1 - w^{(k+1)})}\sum \big(1 - q^{(k)}(x_i, \theta^{(k)})\big)\, x_i$$
$$\sigma_2^{2(k+1)} = \frac{1}{n(1 - w^{(k+1)})}\sum \big(1 - q^{(k)}(x_i, \theta^{(k)})\big)\,(x_i - \mu_2^{(k+1)})^2$$
(Recall that the MLE of $\sigma^2$ has a divisor of $n$, rather than $n-1$.)
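Putting the E step and the M step together, here is a minimal runnable sketch of the iteration for Example 4; the data, the starting values, and the fixed number of iterations are arbitrary assumptions, not part of the notes.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(4.0, 1.5, 50)])
n = len(x)

w, mu1, s1, mu2, s2 = 0.5, x.min(), x.var(), x.max(), x.var()   # crude starts
for k in range(200):
    # E step: q_i = E(U | x_i, theta^(k)), the conditional prob. of component 1
    p1 = w * norm.pdf(x, mu1, np.sqrt(s1))
    p2 = (1 - w) * norm.pdf(x, mu2, np.sqrt(s2))
    q = p1 / (p1 + p2)
    # M step: weighted MLEs (divisors n*w and n*(1-w), not n-1)
    w = q.mean()
    mu1 = np.sum(q * x) / np.sum(q)
    s1 = np.sum(q * (x - mu1) ** 2) / np.sum(q)
    mu2 = np.sum((1 - q) * x) / np.sum(1 - q)
    s2 = np.sum((1 - q) * (x - mu2) ** 2) / np.sum(1 - q)

print("w, mu1, s1, mu2, s2 =", w, mu1, s1, mu2, s2)
```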

Asymptotic Properties of MLEs, RLEs, and GEE Estimators

The argmax of the likelihood function, that is, the MLE of the argument of the likelihood function, is obviously an important statistic. In many cases, a likelihood equation exists, and often in those cases the MLE is a root of the likelihood equation. In some cases there are roots of the likelihood equation (RLEs) that may or may not be MLEs.

Asymptotic Distributions of MLEs and RLEs

We recall that asymptotic expectations are defined as expectations in asymptotic distributions (rather than as limits of expectations). The first step in studying asymptotic properties is to determine the asymptotic distribution.

Example 5 (asymptotic distribution of the MLE of the variance in a Bernoulli family). In Example 1 we determined the MLE of the variance $g(\pi) = \pi(1-\pi)$ in the Bernoulli family of distributions with parameter $\pi$. The MLE of $g(\pi)$ is $T_n = \bar X(1 - \bar X)$. Now let us determine its asymptotic distributions.

From the central limit theorem,
$$\sqrt{n}(\bar X - \pi) \to N(0, g(\pi)).$$
If $\pi \ne 1/2$, so that $g'(\pi) \ne 0$, we can use the delta method and the CLT to get
$$\sqrt{n}\big(g(\pi) - T_n\big) \to N\big(0,\ \pi(1-\pi)(1-2\pi)^2\big).$$
(I have written $(g(\pi) - T_n)$ instead of $(T_n - g(\pi))$ so the expression looks more similar to one we get next.) If $\pi = 1/2$, this is a degenerate distribution. (The limiting variance actually is 0, but the degenerate distribution is not very useful.)

Let's take a different approach for the case $\pi = 1/2$. We have from the CLT,
$$\sqrt{n}(\bar X - 1/2) \to N\Big(0, \frac{1}{4}\Big).$$
Hence, if we scale and square, we get
$$4n(\bar X - 1/2)^2 \to_d \chi^2_1,$$
or, since $g(1/2) - T_n = 1/4 - \bar X(1 - \bar X) = (\bar X - 1/2)^2$,
$$4n\big(g(\pi) - T_n\big) \to_d \chi^2_1.$$
This is a specific instance of a second order delta method.
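A quick simulation sketch (with arbitrary choices of $n$, $\pi$, and the replication count) checks both limiting results numerically: the delta-method variance for $\pi \ne 1/2$ and the $\chi^2_1$ limit for $\pi = 1/2$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 2000, 50_000

# Case pi != 1/2: sqrt(n)*(g(pi) - T_n) approx N(0, pi*(1-pi)*(1-2*pi)^2)
pi = 0.3
xbar = rng.binomial(n, pi, reps) / n
z = np.sqrt(n) * (pi * (1 - pi) - xbar * (1 - xbar))
print("sample var:", z.var(), " delta-method var:", pi * (1 - pi) * (1 - 2 * pi) ** 2)

# Case pi = 1/2: 4*n*(g(pi) - T_n) approx chi-square with 1 df
xbar = rng.binomial(n, 0.5, reps) / n
c = 4 * n * (0.25 - xbar * (1 - xbar))
print("mean, var:", c.mean(), c.var(), " (chi^2_1 has mean 1, variance 2)")
```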

Following the same methods as for deriving the first order delta method using a Taylor series expansion, for the univariate case we can see that
$$\sqrt{n}(S_n - c) \to N(0, \sigma^2),\quad g'(c) = 0,\quad g''(c) \ne 0 \;\Longrightarrow\; \frac{2n\big(g(S_n) - g(c)\big)}{\sigma^2 g''(c)} \to_d \chi^2_1.$$
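To see where this comes from (assuming $g$ is twice continuously differentiable at $c$), expand $g(S_n)$ about $c$:
$$g(S_n) = g(c) + g'(c)(S_n - c) + \tfrac{1}{2}g''(c)(S_n - c)^2 + o_p\big((S_n - c)^2\big).$$
Since $g'(c) = 0$ and $\sqrt{n}(S_n - c) \to_d N(0, \sigma^2)$,
$$\frac{2n\big(g(S_n) - g(c)\big)}{\sigma^2 g''(c)} = \left(\frac{\sqrt{n}(S_n - c)}{\sigma}\right)^2 + o_p(1) \to_d \chi^2_1,$$
because the square of a standard normal random variable is $\chi^2_1$.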