
0.1 Introduction

R. A. Fisher, a pioneer in the development of mathematical statistics, introduced a measure of the amount of information contained in an observation from $f(x\mid\theta)$. Fisher information can be obtained by differentiating the identity
$$1 = \int f(x\mid\theta)\,dx$$
with respect to $\theta$. Assuming smoothness conditions that permit differentiating under the integral sign, we first obtain
$$0 = \frac{\partial}{\partial\theta}\,1 = \frac{\partial}{\partial\theta}\int f(x\mid\theta)\,dx = \int \frac{\partial}{\partial\theta} f(x\mid\theta)\,dx = \int \frac{\partial}{\partial\theta}\ln\bigl(f(x\mid\theta)\bigr)\,f(x\mid\theta)\,dx,$$
where the last equality follows after multiplying and dividing by $f(x\mid\theta)$. This last condition states that
$$E\!\left[\frac{\partial}{\partial\theta}\ln\bigl(f(X_1\mid\theta)\bigr)\right] = \int \frac{\partial}{\partial\theta}\ln\bigl(f(x\mid\theta)\bigr)\,f(x\mid\theta)\,dx = 0.$$
Fisher information is defined as the variance of $\frac{\partial}{\partial\theta}\ln\bigl(f(X_1\mid\theta)\bigr)$, or
$$I_1(\theta) = E\!\left[\left(\frac{\partial}{\partial\theta}\ln\bigl(f(X_1\mid\theta)\bigr)\right)^{2}\right].$$

Remark: The information about $\theta$ in a random sample of size $n$ from $f(x\mid\theta)$ is
$$I_n(\theta) = n\,E\!\left[\left(\frac{\partial}{\partial\theta}\ln f(X_1\mid\theta)\right)^{2}\right] = n\,I_1(\theta).$$

It is sometimes convenient to use an alternative form of the Fisher information that involves second derivatives,
$$I_1(\theta) = -E\!\left[\frac{\partial^{2}}{\partial\theta^{2}}\ln f(X_1\mid\theta)\right].$$
This is obtained by differentiating a second time to obtain
$$0 = \frac{\partial}{\partial\theta}\int \frac{\partial}{\partial\theta}\ln\bigl(f(x\mid\theta)\bigr)\,f(x\mid\theta)\,dx = \int \frac{\partial^{2}}{\partial\theta^{2}}\ln\bigl(f(x\mid\theta)\bigr)\,f(x\mid\theta)\,dx + \int \left(\frac{\partial}{\partial\theta}\ln\bigl(f(x\mid\theta)\bigr)\right)^{2} f(x\mid\theta)\,dx.$$
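As a quick numerical check of the two identities above, the following sketch (a Monte Carlo simulation in Python; the exponential density $f(x\mid\theta)=\theta e^{-\theta x}$ and the particular value of $\theta$ are chosen only for illustration) verifies that the score has mean zero and that both forms of $I_1(\theta)$ give $1/\theta^{2}$.

import numpy as np

rng = np.random.default_rng(0)
theta = 2.0
x = rng.exponential(scale=1.0 / theta, size=200_000)   # draws from f(x|theta) = theta*exp(-theta*x)

score = 1.0 / theta - x            # (d/dtheta) ln f(x|theta) = 1/theta - x
second_deriv = -1.0 / theta**2     # (d^2/dtheta^2) ln f(x|theta), constant in x for this density

print("mean of score:        ", score.mean())    # approximately 0
print("variance of score:    ", score.var())     # approximately 1/theta^2 = 0.25
print("-E[second derivative]:", -second_deriv)   # exactly 1/theta^2 = 0.25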

Example. Let $X_1,\dots,X_n$ be a random sample of size $n$ from a normal distribution with known variance $\sigma^{2}$.
(a) Evaluate the Fisher information $I_1(\mu)$.
(b) Evaluate the alternative form of the Fisher information, $I_1(\mu) = -E\!\left[\frac{\partial^{2}}{\partial\mu^{2}}\ln f(X_1\mid\mu)\right]$.

Solution. (a) Since the probability density function of a single observation is
$$f(x_1\mid\mu) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\,e^{-(x_1-\mu)^{2}/2\sigma^{2}},$$
we have
$$\frac{\partial}{\partial\mu}\ln f(x_1\mid\mu) = \frac{\partial}{\partial\mu}\left(-\tfrac{1}{2}\ln(2\pi\sigma^{2}) - \frac{1}{2\sigma^{2}}(x_1-\mu)^{2}\right) = \frac{1}{\sigma^{2}}(x_1-\mu)$$
and
$$\left(\frac{\partial}{\partial\mu}\ln f(x_1\mid\mu)\right)^{2} = \frac{1}{\sigma^{2}}\,\frac{(x_1-\mu)^{2}}{\sigma^{2}}.$$
But $E[(X_1-\mu)^{2}/\sigma^{2}] = \operatorname{var}(X_1)/\sigma^{2} = 1$, so
$$I_1(\mu) = \frac{1}{\sigma^{2}}.$$
(b) Taking the second partial derivative gives
$$\frac{\partial^{2}}{\partial\mu^{2}}\ln f(x_1\mid\mu) = \frac{\partial}{\partial\mu}\,\frac{(x_1-\mu)}{\sigma^{2}} = -\frac{1}{\sigma^{2}},$$
so $I_1(\mu) = -E[-1/\sigma^{2}] = 1/\sigma^{2}$, which agrees with the original calculation involving only the first derivative.
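A similar simulation sketch for this example (the values $\mu = 1$ and $\sigma = 2$ are arbitrary choices) confirms numerically that $E\bigl[(\partial/\partial\mu\,\ln f(X_1\mid\mu))^{2}\bigr] \approx 1/\sigma^{2}$.

import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=200_000)

score = (x - mu) / sigma**2   # (d/dmu) ln f(x|mu) for the normal density with known sigma
print("E[score^2] (simulated):", np.mean(score**2))   # approximately 1/sigma^2 = 0.25
print("1/sigma^2 (exact):     ", 1.0 / sigma**2)      # the Fisher information I_1(mu)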

0.2 Maximum Likelihood Estimation

A very general approach to estimation, proposed by R. A. Fisher, is called the method of maximum likelihood. To set the ideas, we begin with a special case. Suppose that one of two distributions must prevail. For example, let $X$ take the possible values 0, 1, 2, 3, or 4 with probabilities specified by distribution 1 or with probabilities specified by distribution 2 (see Table 1 and Figure 1).

Table 1. Two Possible Distributions for X

Distribution 1
  x      0       1       2       3       4
  p(x)  .0625   .2500   .3750   .2500   .0625

Distribution 2
  x      0       1       2       3       4
  p(x)  .2401   .4116   .2646   .0756   .0081

[Figure 1: The two possible distributions for X.]

The first is the binomial distribution with p = .5 and the second the binomial with p = .3, but this fact is not important to the argument. If we observe X = 3, should our estimate of the underlying distribution be distribution 1 or distribution 2? Suppose we take the attitude that we will select the distribution for which the observed value x = 3 has the highest probability of occurring. Because this calculation is done after the data are obtained, we use the terminology of maximizing likelihood rather than probability.

For the first distribution, P[X = 3] = .2500, and for the second distribution, P[X = 3] = .0756, so we estimate that the first distribution is the distribution that actually produced the observation 3. If, instead, we had observed X = 1, the estimate would be distribution 2, since .4116 is larger than .2500.
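The comparison above can also be carried out directly with the binomial probability mass function; a minimal sketch (using scipy, with Bin(4, .5) and Bin(4, .3) as the two candidate distributions) is:

from scipy.stats import binom

for x_obs in (3, 1):
    p1 = binom.pmf(x_obs, n=4, p=0.5)   # distribution 1
    p2 = binom.pmf(x_obs, n=4, p=0.3)   # distribution 2
    choice = 1 if p1 > p2 else 2
    print(f"x = {x_obs}: P1 = {p1:.4f}, P2 = {p2:.4f} -> choose distribution {choice}")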

Let us take this example a step further and assume that $X$ follows a binomial distribution with $n = 4$ but that $0 \le p \le 1$ is unknown. The count $X$ then has the distribution
$$\binom{4}{x} p^{x}(1-p)^{4-x} \quad \text{for } x = 0, 1, 2, 3, 4.$$
If we again observe $X = 3$, we evaluate the binomial distribution at $x = 3$ and obtain
$$L(p) = 4p^{3}(1-p) \quad \text{for } 0 \le p \le 1,$$
which is a function of $p$. We now vary $p$ to best explain the observed result. This curve, $L(p)$, is shown in Figure 2. We take the value at which the maximum occurs as our estimate. Using calculus, the maximum occurs at the value of $p$ for which the derivative is zero:
$$\frac{d}{dp}\,4p^{3}(1-p) = 4(3p^{2} - 4p^{3}) = 0,$$
so our estimate is $\hat{p} = .75$. To review, this value maximizes the after-the-fact probability of observing the value 3.

[Figure 2: The curve $L(p) = 4p^{3}(1-p)$.]
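The same maximization can be done numerically; a short sketch (using scipy's optimizer on the negative log likelihood, with the observed count 3 and n = 4 as above) recovers $\hat{p} = .75$:

from scipy.optimize import minimize_scalar
from scipy.stats import binom

neg_log_lik = lambda p: -binom.logpmf(3, n=4, p=p)   # -ln L(p) for the observed count x = 3
result = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print("numerical MLE of p:", result.x)               # approximately 0.75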

More generally, a random sample of size $n$ is taken from a distribution that depends on a parameter $\theta$. The random sample produces $n$ values $x_1, x_2, \dots, x_n$, which we substitute into the joint probability distribution, or probability density function, and then study the resulting function of $\theta$.

Definition. The function of $\theta$ that is obtained by substituting the observed values of the random sample $X_1 = x_1, \dots, X_n = x_n$ into the joint probability distribution or the density function for $X_1, X_2, \dots, X_n$ is called the likelihood function for $\theta$:
$$L(\theta \mid x_1,\dots,x_n) = \prod_{i=1}^{n} f(x_i \mid \theta).$$
We often simplify the notation and write $L(\theta)$, with the understanding that the likelihood function does depend on the values $x_1, x_2, \dots, x_n$ from the random sample.

Given the values $x_1, x_2, \dots, x_n$ from a random sample, one distinctive feature of the likelihood function is the value or values of $\theta$ at which it attains its maximum.

Definition. A statistic $\hat{\theta}(x_1,\dots,x_n)$ is a maximum likelihood estimator of $\theta$ if $\hat{\theta}$ is a value of the parameter that maximizes the likelihood function $L(\theta \mid x_1,\dots,x_n)$.

Maximum likelihood estimators satisfy a general invariance property. Specifically, if $g(\theta)$ is a continuous one-to-one function of $\theta$ and $\hat{\theta}$ is the maximum likelihood estimator of $\theta$, the maximum likelihood estimator of $g(\theta)$ is obtained by simple substitution:
$$\text{maximum likelihood estimator of } g(\theta) = g(\hat{\theta}).$$
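A minimal sketch of the invariance property, continuing the binomial example: maximizing the same likelihood over the odds $\eta = g(p) = p/(1-p)$ (a one-to-one reparametrization chosen here only for illustration) gives $\hat{\eta} = g(\hat{p}) = .75/.25 = 3$.

from scipy.optimize import minimize_scalar
from scipy.stats import binom

def neg_log_lik_odds(eta):
    p = eta / (1.0 + eta)                 # invert the reparametrization eta = p/(1-p)
    return -binom.logpmf(3, n=4, p=p)     # the same likelihood, expressed in terms of eta

res = minimize_scalar(neg_log_lik_odds, bounds=(1e-6, 1e3), method="bounded")
print("MLE of the odds:", res.x)          # approximately 3.0 = g(p_hat)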

Example. As in the earlier example, let $X_1,\dots,X_n$ be a random sample of size $n$ from the Poisson distribution
$$f(x \mid \lambda) = \frac{\lambda^{x} e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \dots$$
Obtain the maximum likelihood estimator of $P[X_1 = 0] = e^{-\lambda}$.

From that example, the maximum likelihood estimator of $\lambda$ is $\hat{\lambda} = \bar{x}$. Consequently, by the invariance property, the maximum likelihood estimator of $e^{-\lambda}$ is $e^{-\bar{x}}$.

0.3 Large Sample Distributions of MLE's

When the likelihood has continuous partial derivatives and other regularity conditions hold, the distribution of the MLE (maximum likelihood estimator) $\hat{\theta}$ tends to a normal distribution as the sample size increases. To simplify the statement, we assume that the maximum likelihood estimator is uniquely defined.

Theorem [Asymptotic Normality]. Let $\theta$ be the prevailing value of the parameter and let $\hat{\theta}$ be the maximum likelihood estimator. Under suitable regularity conditions,
$$P_{\theta}\bigl[\sqrt{n}(\hat{\theta} - \theta) \le y\bigr] \longrightarrow \int_{-\infty}^{y} \frac{I_1^{1/2}(\theta)}{\sqrt{2\pi}} \exp\!\left(-\tfrac{1}{2} I_1(\theta) u^{2}\right) du.$$
That is, $\sqrt{n}(\hat{\theta} - \theta)$ converges in distribution to a normal distribution having mean 0 and variance $1/I_1(\theta)$, where $I_1(\theta)$ is the Fisher information in a sample of size one, evaluated at the prevailing parameter.

The result also holds with $I_1(\hat{\theta})$ or even the empirical information
$$\text{empirical information} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{\partial}{\partial\theta}\ln f(X_i \mid \theta)\right)^{2} \quad\text{or}\quad \text{empirical information} = -\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^{2}}{\partial\theta^{2}}\ln f(X_i \mid \theta),$$
which can be evaluated at the prevailing $\theta$ or at the MLE $\hat{\theta}$.

A large sample approximate $100(1-\alpha)\%$ confidence interval for $\theta$ is
$$\left(\hat{\theta} - z_{\alpha/2}\,\frac{1}{\sqrt{n\,I_1(\hat{\theta})}},\ \ \hat{\theta} + z_{\alpha/2}\,\frac{1}{\sqrt{n\,I_1(\hat{\theta})}}\right).$$
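A sketch putting these pieces together for simulated Poisson data ($\lambda = 2.5$, $n = 400$, and the seed are arbitrary choices): the MLE $\hat{\lambda} = \bar{x}$, the MLE of $P[X = 0]$ by invariance, and the large-sample interval using $I_1(\lambda) = 1/\lambda$ for the Poisson model.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
lam, n, alpha = 2.5, 400, 0.05
x = rng.poisson(lam, size=n)

lam_hat = x.mean()                 # MLE of lambda
p0_hat = np.exp(-lam_hat)          # MLE of P[X = 0] = exp(-lambda), by invariance
info = 1.0 / lam_hat               # I_1(lambda) evaluated at the MLE (Poisson: 1/lambda)
half_width = norm.ppf(1 - alpha / 2) / np.sqrt(n * info)

print("lambda_hat:     ", lam_hat)
print("MLE of P[X = 0]:", p0_hat)
print("approx. 95% CI: ", (lam_hat - half_width, lam_hat + half_width))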

Some students may be interested in the idea of the proof; it can be skipped without loss of continuity.

Idea of Proof of the Normal Limit

Recall the Taylor expansion
$$h(u) = h(u_0) + h'(u_0)(u - u_0) + h''(u^{*})(u - u_0)^{2}/2,$$
where $u^{*}$ lies between $u$ and $u_0$. Applying this to the score of the whole sample, with $\ln f(\mathbf{X} \mid \theta) = \sum_{i=1}^{n} \ln f(X_i \mid \theta)$, and expanding about $\theta$ gives
$$\frac{\partial}{\partial\theta}\ln f(\mathbf{X} \mid \theta)\Big|_{\hat{\theta}} = \frac{\partial}{\partial\theta}\ln f(\mathbf{X} \mid \theta)\Big|_{\theta} + \frac{\partial^{2}}{\partial\theta^{2}}\ln f(\mathbf{X} \mid \theta)\Big|_{\theta}(\hat{\theta} - \theta) + \frac{\partial^{3}}{\partial\theta^{3}}\ln f(\mathbf{X} \mid \theta)\Big|_{\theta^{*}}(\hat{\theta} - \theta)^{2}/2,$$
where $\theta^{*}$ lies between $\theta$ and $\hat{\theta}$. The term on the left side vanishes under appropriate conditions, since $\hat{\theta}$ maximizes the log likelihood. The first term on the right is the sum of $n$ independent and identically distributed random variables having mean 0 and variance $I_1(\theta)$. Dividing both sides by $\sqrt{n}$, we conclude from the central limit theorem that this first term converges in distribution to a normal distribution with mean 0 and variance $I_1(\theta)$. The second term on the right-hand side, after dividing by $\sqrt{n}$, can be written as
$$\frac{1}{\sqrt{n}}\,\frac{\partial^{2}}{\partial\theta^{2}}\ln f(\mathbf{X} \mid \theta)\,(\hat{\theta} - \theta) = \left(\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^{2}}{\partial\theta^{2}}\ln f(X_i \mid \theta)\right)\sqrt{n}\,(\hat{\theta} - \theta).$$
The average in parentheses converges to $-I_1(\theta)$ by the law of large numbers, so, solving for $\sqrt{n}(\hat{\theta} - \theta)$, the result follows. We neglect the last term on the right.

Remark 2. In the vector parameter case, the limiting distribution is the multivariate normal distribution with mean vector $\mathbf{0}$ and covariance matrix $I_1^{-1}(\theta_1, \theta_2)$, the inverse of the Fisher information matrix.

The regularity conditions do not hold for the uniform distributions $f(x;\theta) = 1/\theta$ for $0 < x < \theta$.
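As a numerical illustration of the theorem (not a proof), one can simulate repeated Poisson samples and check that $\sqrt{n}(\hat{\lambda} - \lambda)$ has variance close to $1/I_1(\lambda) = \lambda$; the value of $\lambda$, the sample size, and the number of replications below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(3)
lam, n, reps = 2.5, 200, 5000
samples = rng.poisson(lam, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - lam)   # sqrt(n)*(lambda_hat - lambda) for each replication

print("empirical mean:    ", z.mean())   # approximately 0
print("empirical variance:", z.var())    # approximately lambda = 2.5 = 1/I_1(lambda)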