Maximum Likelihood Estimation


Assume X ~ P_θ, θ ∈ Θ, with joint pdf (or pmf) f(x | θ). Suppose we observe X = x.

The likelihood function is L(θ | x) = f(x | θ), viewed as a function of θ (with the data x held fixed).

The likelihood function L(θ | x) and the joint pdf f(x | θ) are the same function; the difference is that f(x | θ) is generally viewed as a function of x with θ held fixed, and L(θ | x) as a function of θ with x held fixed. For each fixed θ, f(x | θ) is a density (or mass function) in x. But L(θ | x) is not a density (or mass function) in θ for fixed x (except by coincidence).

The Maximum Likelihood Estimator (MLE)

A point estimator θ̂ = θ̂(x) is an MLE for θ if
$$ L(\hat\theta \mid x) = \sup_{\theta} L(\theta \mid x), $$
that is, θ̂ maximizes the likelihood.

In most cases the maximum is achieved at a unique value, so we can refer to the MLE and write θ̂(x) = argmax_θ L(θ | x). (But there are cases where the likelihood has flat spots and the MLE is not unique.)
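
As a concrete illustration (not part of the original notes), the likelihood of an iid Bernoulli(θ) sample can be evaluated on a grid and maximized directly; the data vector below is made up for the sketch.

```python
import numpy as np

# Hypothetical iid Bernoulli(theta) observations; any 0/1 vector works here.
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

def likelihood(theta, x):
    """L(theta | x) = prod_i theta^{x_i} (1 - theta)^{1 - x_i}."""
    return np.prod(theta ** x * (1 - theta) ** (1 - x))

# Evaluate L on a fine grid of theta values and locate the maximizer.
grid = np.linspace(0.001, 0.999, 999)
L = np.array([likelihood(t, x) for t in grid])
theta_hat = grid[np.argmax(L)]

print(theta_hat, x.mean())  # the grid maximizer agrees with the sample mean
```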

Motivation for MLE's

Note: We often write L(θ | x) = L(θ), suppressing x, which is kept fixed at the observed data. Suppose x ∈ R^n.

Discrete case: If f(· | θ) is a mass function (X is discrete), then L(θ) = f(x | θ) = P_θ(X = x). That is, L(θ) is the probability of getting the observed data x when the parameter value is θ.

Continuous case: When f(· | θ) is a continuous density, P_θ(X = x) = 0, but if B ⊂ R^n is a very, very small ball (or cube) centered at the observed data x, then
$$ P_\theta(X \in B) \approx f(x \mid \theta) \cdot \mathrm{Volume}(B) \propto L(\theta). $$
So L(θ) is proportional to the probability that the random data X will be close to the observed data x when the parameter value is θ.

Thus the MLE θ̂ is the value of θ which makes the observed data x most probable.

To find θ̂, we maximize L(θ). This is usually done by calculus (finding a stationary point), but not always. If the parameter space Θ contains endpoints or boundary points, the maximum can be achieved at a boundary point without being a stationary point. If L(θ) is not smooth (continuous and everywhere differentiable), the maximum does not have to be achieved at a stationary point.

Cautionary example: Suppose X_1, ..., X_n are iid Uniform(0, θ) and Θ = (0, ∞). Given data x = (x_1, ..., x_n), find the MLE for θ.
$$ L(\theta) = \prod_{i=1}^{n} \theta^{-1} I(0 \le x_i \le \theta) = \theta^{-n}\, I(0 \le \min_i x_i)\, I(\max_i x_i \le \theta) = \begin{cases} \theta^{-n} & \text{for } \theta \ge \max_i x_i \\ 0 & \text{for } 0 < \theta < \max_i x_i \end{cases} $$
(Draw this!) L(θ) is maximized at θ = max_i x_i, which is a point of discontinuity (and certainly not a stationary point). Thus the MLE is θ̂ = max_i x_i = x_(n).

Notes: L(θ) = 0 for θ < max_i x_i just says that these values of θ are absolutely ruled out by the data (which is obvious). A strange property of the MLE in this example (not typical): P_θ(θ̂ < θ) = 1. The MLE is biased; it is always less than the true value.
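
A quick simulation (a sketch with made-up sample sizes, not part of the notes) illustrates that θ̂ = max x_i always falls below the true θ, and that the bias shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 2.0  # true parameter, chosen only for the sketch

for n in (5, 50, 500):
    # 10,000 Monte Carlo replications of the MLE max(X_1, ..., X_n)
    samples = rng.uniform(0, theta, size=(10_000, n))
    mle = samples.max(axis=1)
    print(n, (mle < theta).mean(), mle.mean())  # P(theta_hat < theta) = 1; mean is below theta
```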

A Similar Example: Let X_1, ..., X_n be iid Uniform(α, β) and Θ = {(α, β) : α < β}. Given data x = (x_1, ..., x_n), find the MLE for θ = (α, β).
$$ L(\alpha, \beta) = \prod_{i=1}^{n} (\beta - \alpha)^{-1} I(\alpha \le x_i \le \beta) = (\beta - \alpha)^{-n}\, I(\alpha \le \min_i x_i)\, I(\max_i x_i \le \beta) = \begin{cases} (\beta - \alpha)^{-n} & \text{for } \alpha \le \min_i x_i,\ \max_i x_i \le \beta \\ 0 & \text{otherwise} \end{cases} $$
which is maximized by making β − α as small as possible without entering the "0 otherwise" region. Clearly the maximum is achieved at (α, β) = (min_i x_i, max_i x_i). Thus the MLE is θ̂ = (α̂, β̂) = (min_i x_i, max_i x_i). Again, P_{α,β}(α < α̂, β̂ < β) = 1.

Maximizing the Likelihood (one parameter): General Remarks

Basic result: A continuous function g(θ) defined on a closed, bounded interval J attains its supremum (but might do so at one of the endpoints). (That is, there exists a point θ_0 ∈ J such that g(θ_0) = sup_{θ ∈ J} g(θ).)

Consequence: Suppose g(θ) is a continuous, non-negative function defined on an open interval J = (c, d) (where perhaps c = −∞ or d = +∞). If
$$ \lim_{\theta \to c} g(\theta) = \lim_{\theta \to d} g(\theta) = 0, $$
then g attains its supremum.

Thus MLE's usually exist when the likelihood function is continuous.

Maxima at Stationary Points

Suppose the function g(θ) is defined on an interval Θ (which may be open or closed, infinite or finite). If g is differentiable and attains its supremum at a point θ_0 in the interior of Θ, that point must be a stationary point (that is, g′(θ_0) = 0).

(1) If g′(θ_0) = 0 and g″(θ_0) < 0, then θ_0 is a local maximum (but might not be the global maximum).

(2) If g′(θ_0) = 0 and g″(θ) < 0 for all θ ∈ Θ, then θ_0 is a global maximum (that is, it attains the supremum).

The condition in (1) is necessary (but not sufficient) for θ_0 to be a global maximum. Condition (2) is sufficient (but not necessary). A function satisfying g″(θ) < 0 for all θ ∈ Θ is called strictly concave. It lies below any tangent line.

Another useful condition (sufficient, but not necessary) is:

(3) If g′(θ) > 0 for θ < θ_0 and g′(θ) < 0 for θ > θ_0, then θ_0 is a global maximum.
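
As a quick illustration (not from the notes): for iid Exponential data with rate λ the log-likelihood is ℓ(λ) = n log λ − λ Σ x_i, so ℓ″(λ) = −n/λ² < 0 everywhere, and condition (2) makes the unique stationary point λ̂ = n/Σ x_i the global maximum. The sketch below, on made-up data, confirms this numerically.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.exponential(scale=0.5, size=300)   # made-up data; true rate lambda = 2
n, T = len(x), x.sum()

def neg_loglik(lam):
    """-l(lambda) = -(n log(lambda) - lambda * sum(x_i))."""
    return -(n * np.log(lam) - lam * T)

# Closed-form stationary point vs. direct numerical maximization of l(lambda)
lam_hat = n / T
res = minimize_scalar(neg_loglik, bounds=(1e-8, 100.0), method="bounded")
print(lam_hat, res.x)   # these agree; l''(lambda) = -n/lambda^2 < 0 everywhere
```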

Maximizing the Likelihood (multi-parameter)

Basic result: A continuous function g(θ) defined on a closed, bounded set J ⊂ R^k attains its supremum (but might do so on the boundary).

Consequence: Suppose g(θ) is a continuous, non-negative function defined for all θ ∈ R^k. If g(θ) → 0 as ‖θ‖ → ∞, then g attains its supremum. Thus MLE's usually exist when the likelihood function is continuous.

Suppose the function g(θ) is defined on a convex set Θ ⊂ R^k (that is, the line segment joining any two points of Θ lies entirely inside Θ). If g is differentiable and attains its supremum at a point θ_0 in the interior of Θ, that point must be a stationary point:
$$ \frac{\partial g(\theta_0)}{\partial \theta_i} = 0 \quad \text{for } i = 1, 2, \ldots, k. $$
Define the gradient vector D and the Hessian matrix H:
$$ D(\theta) = \left( \frac{\partial g(\theta)}{\partial \theta_i} \right)_{i=1}^{k} \ \text{(a } k \times 1 \text{ vector)}, \qquad H(\theta) = \left( \frac{\partial^2 g(\theta)}{\partial \theta_i\,\partial \theta_j} \right)_{i,j=1}^{k} \ \text{(a } k \times k \text{ matrix)}, $$
where θ = (θ_1, θ_2, ..., θ_k).

Maxima at Stationary Points

(1) If D(θ_0) = 0 and H(θ_0) is negative definite, then θ_0 is a local maximum (but might not be the global maximum).

(2) If D(θ_0) = 0 and H(θ) is negative definite for all θ ∈ Θ, then θ_0 is a global maximum (that is, it attains the supremum).

(1) is necessary (but not sufficient) for θ_0 to be a global maximum. (2) is sufficient (but not necessary). A function for which H(θ) is negative definite for all θ ∈ Θ is called strictly concave. It lies below any tangent plane.

Positive and Negative Definite Matrices

Suppose M is a k × k symmetric matrix. (Note: Hessian matrices and covariance matrices are symmetric.)

Definitions: M is positive definite if x′Mx > 0 for all x ≠ 0 (x ∈ R^k). M is negative definite if x′Mx < 0 for all x ≠ 0. M is non-negative definite (or positive semi-definite) if x′Mx ≥ 0 for all x ∈ R^k.

Facts: M is p.d. iff all its eigenvalues are positive. M is n.d. iff all its eigenvalues are negative. M is n.n.d. iff all its eigenvalues are non-negative. M is p.d. iff −M is n.d. If M is p.d., all its diagonal elements must be positive. If M is n.d., all its diagonal elements must be negative. The determinant of a symmetric matrix is equal to the product of its eigenvalues.

2 × 2 symmetric matrices:
$$ M = (m_{ij}) = \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix} \quad \text{with } m_{12} = m_{21}, \qquad |M| = m_{11} m_{22} - m_{12} m_{21} = m_{11} m_{22} - m_{12}^2. $$
A 2 × 2 symmetric matrix is p.d. exactly when the determinant is positive and the diagonal elements are positive. It is n.d. exactly when the determinant is positive and the diagonal elements are negative.

The bare minimum you need to check: M is p.d. if m_{11} > 0 (or m_{22} > 0) and |M| > 0. M is n.d. if m_{11} < 0 (or m_{22} < 0) and |M| > 0.
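
The eigenvalue characterization and the 2 × 2 shortcut are easy to check numerically; here is a small sketch (the matrix is made up for illustration).

```python
import numpy as np

def is_negative_definite(M):
    """Check negative definiteness of a symmetric matrix via its eigenvalues."""
    return np.all(np.linalg.eigvalsh(M) < 0)

M = np.array([[-2.0, 1.0],
              [ 1.0, -3.0]])

# Eigenvalue check and the 2x2 shortcut (negative diagonal entry, positive determinant)
print(is_negative_definite(M))
print(M[0, 0] < 0 and np.linalg.det(M) > 0)
```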

Example: Let X_1, ..., X_n be iid Gamma(α, β).

Preliminaries. The likelihood is
$$ L(\alpha, \beta) = \prod_{i=1}^{n} \frac{x_i^{\alpha - 1} e^{-x_i/\beta}}{\beta^{\alpha}\,\Gamma(\alpha)}. $$
Maximizing L is the same as maximizing ℓ = log L, given by
$$ \ell(\alpha, \beta) = (\alpha - 1) T_1 - T_2/\beta - n\alpha \log \beta - n \log \Gamma(\alpha), $$
where T_1 = Σ_i log x_i and T_2 = Σ_i x_i. Note that T = (T_1, T_2) is the natural sufficient statistic of this two-parameter exponential family (2pef).

The first and second partial derivatives are
$$ \frac{\partial \ell}{\partial \alpha} = T_1 - n \log \beta - n\psi(\alpha), \qquad \text{where } \psi(\alpha) \equiv \frac{d}{d\alpha} \log \Gamma(\alpha) = \frac{\Gamma'(\alpha)}{\Gamma(\alpha)}, $$
$$ \frac{\partial \ell}{\partial \beta} = \frac{T_2}{\beta^2} - \frac{n\alpha}{\beta} = \frac{1}{\beta^2}\,(T_2 - n\alpha\beta), $$
$$ \frac{\partial^2 \ell}{\partial \alpha^2} = -n\psi'(\alpha), \qquad \frac{\partial^2 \ell}{\partial \beta^2} = -\frac{2T_2}{\beta^3} + \frac{n\alpha}{\beta^2} = -\frac{1}{\beta^3}\,(2T_2 - n\alpha\beta), \qquad \frac{\partial^2 \ell}{\partial \alpha\,\partial \beta} = -\frac{n}{\beta}. $$
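
The log-likelihood and these derivatives can be written directly with scipy's gamma-related special functions (gammaln for log Γ and digamma for ψ); this is only a sketch of the formulas above, with simulated data standing in for observations.

```python
import numpy as np
from scipy.special import gammaln, digamma

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=3.0, size=200)  # simulated Gamma(alpha=2, beta=3) data
n, T1, T2 = len(x), np.sum(np.log(x)), np.sum(x)

def loglik(alpha, beta):
    """l(alpha, beta) = (alpha - 1) T1 - T2/beta - n alpha log(beta) - n log Gamma(alpha)."""
    return (alpha - 1) * T1 - T2 / beta - n * alpha * np.log(beta) - n * gammaln(alpha)

def score(alpha, beta):
    """Partial derivatives dl/dalpha and dl/dbeta."""
    d_alpha = T1 - n * np.log(beta) - n * digamma(alpha)
    d_beta = (T2 - n * alpha * beta) / beta ** 2
    return d_alpha, d_beta

# The score vanishes exactly at the MLE; near the true parameters it is small for large n.
print(loglik(2.0, 3.0), score(2.0, 3.0))
```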

Situation #1: Suppose α = α_0 is known. Find the MLE for β. (Drop α from the arguments: ℓ(β) = ℓ(α_0, β), etc.)

ℓ(β) is continuous and differentiable, and it has a unique stationary point:
$$ \ell'(\beta) = \frac{\partial \ell}{\partial \beta} = \frac{1}{\beta^2}(T_2 - n\alpha_0 \beta) = 0 \iff T_2 = n\alpha_0 \beta \iff \beta = \frac{T_2}{n\alpha_0} \ (\equiv \beta^*). $$
Now we check the second derivative:
$$ \ell''(\beta) = \frac{\partial^2 \ell}{\partial \beta^2} = -\frac{1}{\beta^3}(2T_2 - n\alpha_0\beta) = -\frac{1}{\beta^3}\bigl(T_2 + (T_2 - n\alpha_0\beta)\bigr). $$
Note that ℓ″(β*) < 0 since T_2 − nα_0β* = 0, but ℓ″(β) > 0 for β > 2T_2/(nα_0). Thus the stationary point satisfies the necessary condition for a global maximum, but not the sufficient condition (that is, ℓ(β) is not a strictly concave function).

How can we be sure that we have found the global maximum, and not just a local maximum? In this case there is a simple argument: the stationary point β* is unique, ℓ′(β) > 0 for β < β*, and ℓ′(β) < 0 for β > β*. This ensures β* is the unique global maximizer.

Conclusion: β̂ = T_2/(nα_0) is the MLE. (This is a function of T_2, which is a sufficient statistic for β when α is known.)
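
A short numeric check (a sketch with simulated data and an assumed α_0, not part of the notes) that the closed form β̂ = T_2/(nα_0) matches a direct numerical maximization of ℓ(β):

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
alpha0 = 2.0                                   # assumed known shape parameter
x = rng.gamma(shape=alpha0, scale=3.0, size=400)
n, T1, T2 = len(x), np.sum(np.log(x)), np.sum(x)

def neg_loglik(beta):
    """-l(alpha0, beta) for the Gamma log-likelihood with alpha fixed at alpha0."""
    return -((alpha0 - 1) * T1 - T2 / beta - n * alpha0 * np.log(beta) - n * gammaln(alpha0))

beta_hat = T2 / (n * alpha0)                   # closed-form MLE when alpha is known
res = minimize_scalar(neg_loglik, bounds=(1e-6, 100.0), method="bounded")
print(beta_hat, res.x)                         # the two agree (up to optimizer tolerance)
```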

Situation #2: Suppose β = β_0 is known. Find the MLE for α. (Drop β from the arguments: ℓ(α) = ℓ(α, β_0), etc.)

Note: ℓ′(α) and ℓ″(α) involve ψ(α). The function ψ is infinitely differentiable on the interval (0, ∞), and satisfies ψ′(α) > 0 and ψ″(α) < 0 for all α > 0. (The function is strictly increasing and strictly concave.) Also
$$ \lim_{\alpha \to 0^+} \psi(\alpha) = -\infty \quad \text{and} \quad \lim_{\alpha \to \infty} \psi(\alpha) = \infty. $$
Thus ψ^{-1} : R → (0, ∞) exists. (Draw a picture.)

ℓ(α) is continuous and differentiable, and it has a unique stationary point:
$$ \ell'(\alpha) = T_1 - n \log \beta_0 - n\psi(\alpha) = 0 \iff \psi(\alpha) = T_1/n - \log \beta_0 \iff \alpha = \psi^{-1}(T_1/n - \log \beta_0). $$
This is the unique global maximizer since ℓ″(α) = −nψ′(α) < 0 for all α > 0. Thus α̂ = ψ^{-1}(T_1/n − log β_0) is the MLE. (This is a function of T_1, which is a sufficient statistic for α when β is known.)
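
Since ψ has no closed-form inverse, α̂ = ψ^{-1}(T_1/n − log β_0) is found numerically. A minimal sketch using scipy's digamma and a root finder (the data and β_0 are made up for the example):

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(2)
beta0 = 3.0                                    # assumed known scale parameter
x = rng.gamma(shape=2.0, scale=beta0, size=500)
n, T1 = len(x), np.sum(np.log(x))

target = T1 / n - np.log(beta0)                # we need psi(alpha) = target

# psi is strictly increasing, so bracket the root and solve psi(alpha) - target = 0.
alpha_hat = brentq(lambda a: digamma(a) - target, 1e-8, 1e6)
print(alpha_hat)                               # close to the true shape 2.0 for this sample
```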

Situation #3: Find the MLE for θ = (α, β).

ℓ(α, β) is continuous and differentiable. A stationary point must satisfy the system of two equations:
$$ \frac{\partial \ell}{\partial \alpha} = T_1 - n \log \beta - n\psi(\alpha) = 0, \qquad \frac{\partial \ell}{\partial \beta} = \frac{1}{\beta^2}(T_2 - n\alpha\beta) = 0. $$
Solving the second equation for β gives β = T_2/(nα). Plugging this into the first equation and rearranging a bit leads to
$$ \frac{T_1}{n} - \log\!\left(\frac{T_2}{n}\right) = \psi(\alpha) - \log \alpha \ \equiv\ H(\alpha). $$
The function H(α) is continuous and strictly increasing from (0, ∞) onto (−∞, 0), so it has an inverse mapping (−∞, 0) to (0, ∞). Thus the solution of the above equation can be written
$$ \alpha = H^{-1}\!\left( \frac{T_1}{n} - \log\!\left(\frac{T_2}{n}\right) \right). $$
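
H also has no closed-form inverse, so α̂ is again computed by root finding, and β̂ = T_2/(nα̂) follows. A sketch on simulated data (the true values 2 and 3 are chosen only for the example):

```python
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(3)
x = rng.gamma(shape=2.0, scale=3.0, size=500)   # simulated data for the sketch
n, T1, T2 = len(x), np.sum(np.log(x)), np.sum(x)

# Solve H(alpha) = psi(alpha) - log(alpha) = T1/n - log(T2/n); H is strictly increasing.
rhs = T1 / n - np.log(T2 / n)
alpha_hat = brentq(lambda a: digamma(a) - np.log(a) - rhs, 1e-8, 1e6)
beta_hat = T2 / (n * alpha_hat)
print(alpha_hat, beta_hat)                      # roughly (2, 3) for this simulated sample
```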

Thus, the unique stationary point is
$$ \hat\alpha = H^{-1}\!\left( \frac{T_1}{n} - \log\!\left(\frac{T_2}{n}\right) \right), \qquad \hat\beta = \frac{T_2}{n\hat\alpha}. $$
Is this the MLE? Let us examine the Hessian:
$$ H(\alpha, \beta) = \begin{pmatrix} \dfrac{\partial^2 \ell}{\partial \alpha^2} & \dfrac{\partial^2 \ell}{\partial \alpha\,\partial \beta} \\[2ex] \dfrac{\partial^2 \ell}{\partial \alpha\,\partial \beta} & \dfrac{\partial^2 \ell}{\partial \beta^2} \end{pmatrix} = \begin{pmatrix} -n\psi'(\alpha) & -\dfrac{n}{\beta} \\[2ex] -\dfrac{n}{\beta} & -\dfrac{1}{\beta^3}(2T_2 - n\alpha\beta) \end{pmatrix}. $$
Plugging in β̂ = T_2/(nα̂) (so that T_2 = nα̂β̂) gives
$$ H(\hat\alpha, \hat\beta) = \begin{pmatrix} -n\psi'(\hat\alpha) & -\dfrac{n^2\hat\alpha}{T_2} \\[2ex] -\dfrac{n^2\hat\alpha}{T_2} & -\dfrac{n^3\hat\alpha^3}{T_2^2} \end{pmatrix}. $$
The diagonal elements are both negative, and the determinant equals
$$ \frac{n^4\hat\alpha^2}{T_2^2}\bigl(\hat\alpha\,\psi'(\hat\alpha) - 1\bigr), $$
which is positive since αψ′(α) − 1 > 0 for all α > 0. This guarantees that H(α̂, β̂) is negative definite, so that (α̂, β̂) is (at least) a local maximum.
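
The negative-definiteness check at (α̂, β̂) can also be done numerically, since ψ′ is available as polygamma(1, ·) in scipy; this sketch continues the hypothetical simulated-data example above.

```python
import numpy as np
from scipy.special import polygamma

def gamma_hessian(alpha, beta, n, T2):
    """Hessian of the Gamma log-likelihood, evaluated at (alpha, beta)."""
    return np.array([
        [-n * polygamma(1, alpha),                   -n / beta],
        [-n / beta,                -(2 * T2 - n * alpha * beta) / beta ** 3],
    ])

# Using alpha_hat, beta_hat, n, T2 from the previous sketch:
# H_hat = gamma_hessian(alpha_hat, beta_hat, n, T2)
# print(np.linalg.eigvalsh(H_hat))  # both eigenvalues negative => negative definite
```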