18.650 Statistics for Applications. Chapter 3: Maximum Likelihood Estimation. 1/23

Total variation distance (1)

Let $(E, (\mathbb{P}_\theta)_{\theta \in \Theta})$ be a statistical model associated with a sample of i.i.d. r.v. $X_1, \ldots, X_n$. Assume that there exists $\theta^* \in \Theta$ such that $X_1 \sim \mathbb{P}_{\theta^*}$: $\theta^*$ is the true parameter.

Statistician's goal: given $X_1, \ldots, X_n$, find an estimator $\hat\theta = \hat\theta(X_1, \ldots, X_n)$ such that $\mathbb{P}_{\hat\theta}$ is close to $\mathbb{P}_{\theta^*}$ for the true parameter $\theta^*$. This means: $|\mathbb{P}_{\hat\theta}(A) - \mathbb{P}_{\theta^*}(A)|$ is small for all $A \subset E$.

Definition. The total variation distance between two probability measures $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ is defined by
$$\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \max_{A \subset E}\, |\mathbb{P}_\theta(A) - \mathbb{P}_{\theta'}(A)|.$$

2/23

Total variation distance (2)

Assume that $E$ is discrete (i.e., finite or countable). This includes Bernoulli, Binomial, Poisson, ... Therefore $X$ has a PMF (probability mass function): $\mathbb{P}_\theta(X = x) = p_\theta(x)$ for all $x \in E$, with $p_\theta(x) \ge 0$ and $\sum_{x \in E} p_\theta(x) = 1$.

The total variation distance between $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ is a simple function of the PMFs $p_\theta$ and $p_{\theta'}$:
$$\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \frac{1}{2} \sum_{x \in E} |p_\theta(x) - p_{\theta'}(x)|.$$

3/23
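As a quick numerical illustration of this formula, here is a minimal sketch (assuming numpy and scipy are available; the choice of two Poisson distributions and the truncation of their support to $\{0, \ldots, 50\}$ are illustrative, not part of the definition):

```python
import numpy as np
from scipy.stats import poisson

def tv_discrete(p, q):
    """TV = (1/2) * sum_x |p(x) - q(x)| for two PMFs given on a common grid."""
    return 0.5 * np.sum(np.abs(p - q))

# PMFs of Poiss(2) and Poiss(3) on {0, ..., 50}; the mass beyond 50 is negligible
x = np.arange(51)
print(tv_discrete(poisson.pmf(x, 2.0), poisson.pmf(x, 3.0)))  # ~0.25
```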

Total variation distance (3)

Assume that $E$ is continuous. This includes Gaussian, Exponential, ... Assume that $X$ has a density: $\mathbb{P}_\theta(X \in A) = \int_A f_\theta(x)\,dx$ for all $A \subset E$, with $f_\theta(x) \ge 0$ and $\int_E f_\theta(x)\,dx = 1$.

The total variation distance between $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ is a simple function of the densities $f_\theta$ and $f_{\theta'}$:
$$\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \frac{1}{2} \int_E |f_\theta(x) - f_{\theta'}(x)|\,dx.$$

4/23

Total variation distance (4)

Properties of total variation:
- $\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \mathrm{TV}(\mathbb{P}_{\theta'}, \mathbb{P}_\theta)$ (symmetric)
- $\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \ge 0$ (nonnegative)
- If $\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = 0$, then $\mathbb{P}_\theta = \mathbb{P}_{\theta'}$ (definite)
- $\mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \le \mathrm{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta''}) + \mathrm{TV}(\mathbb{P}_{\theta''}, \mathbb{P}_{\theta'})$ (triangle inequality)

These imply that total variation is a distance between probability distributions.

5/23

Total variation distance (5)

An estimation strategy: build an estimator $\widehat{\mathrm{TV}}(\mathbb{P}_\theta, \mathbb{P}_{\theta^*})$ for all $\theta \in \Theta$. Then find $\hat\theta$ that minimizes the function $\theta \mapsto \widehat{\mathrm{TV}}(\mathbb{P}_\theta, \mathbb{P}_{\theta^*})$.

Problem: unclear how to build $\widehat{\mathrm{TV}}(\mathbb{P}_\theta, \mathbb{P}_{\theta^*})$!

6/23

Kullback-Leibler (KL) divergence (1)

There are many distances between probability measures that could replace total variation. Let us choose one that is more convenient.

Definition. The Kullback-Leibler (KL) divergence between two probability measures $\mathbb{P}_\theta$ and $\mathbb{P}_{\theta'}$ is defined by
$$\mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \begin{cases} \displaystyle\sum_{x \in E} p_\theta(x) \log\Big(\frac{p_\theta(x)}{p_{\theta'}(x)}\Big) & \text{if } E \text{ is discrete} \\[1.5ex] \displaystyle\int_E f_\theta(x) \log\Big(\frac{f_\theta(x)}{f_{\theta'}(x)}\Big)\, dx & \text{if } E \text{ is continuous} \end{cases}$$

7/23
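For intuition, a minimal sketch of the discrete case (numpy/scipy assumed; the Poisson pair and the truncation grid are illustrative choices). Computing both orderings already previews the asymmetry discussed on the next slide:

```python
import numpy as np
from scipy.stats import poisson

def kl_discrete(p, q):
    """KL(p, q) = sum_x p(x) * log(p(x) / q(x)), with the convention 0 * log 0 = 0."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

x = np.arange(51)
p, q = poisson.pmf(x, 2.0), poisson.pmf(x, 3.0)
print(kl_discrete(p, q), kl_discrete(q, p))  # ~0.19 vs ~0.22: not symmetric
```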

Kullback-Leibler (KL) divergence (2)

Properties of KL-divergence:
- $\mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \neq \mathrm{KL}(\mathbb{P}_{\theta'}, \mathbb{P}_\theta)$ in general
- $\mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \ge 0$
- If $\mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = 0$, then $\mathbb{P}_\theta = \mathbb{P}_{\theta'}$ (definite)
- $\mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) \not\le \mathrm{KL}(\mathbb{P}_\theta, \mathbb{P}_{\theta''}) + \mathrm{KL}(\mathbb{P}_{\theta''}, \mathbb{P}_{\theta'})$ in general

So KL is not a distance; it is called a divergence. Asymmetry is the key to our ability to estimate it!

8/23

Kullback-Leibler (KL) divergence (3)

$$\mathrm{KL}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta) = \mathbb{E}_{\theta^*}\Big[\log\Big(\frac{p_{\theta^*}(X)}{p_\theta(X)}\Big)\Big] = \mathbb{E}_{\theta^*}\big[\log p_{\theta^*}(X)\big] - \mathbb{E}_{\theta^*}\big[\log p_\theta(X)\big]$$

So the function $\theta \mapsto \mathrm{KL}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta)$ is of the form: constant $- \mathbb{E}_{\theta^*}[\log p_\theta(X)]$.

It can be estimated: $\mathbb{E}_{\theta^*}[h(X)] \approx \frac{1}{n}\sum_{i=1}^n h(X_i)$ (by the LLN). Hence
$$\widehat{\mathrm{KL}}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta) = \text{constant} - \frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i).$$

9/23

Kullback-Leibler (KL) divergence (4)

$$\widehat{\mathrm{KL}}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta) = \text{constant} - \frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i)$$

$$\min_{\theta \in \Theta} \widehat{\mathrm{KL}}(\mathbb{P}_{\theta^*}, \mathbb{P}_\theta) \iff \min_{\theta \in \Theta}\, -\frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i) \iff \max_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i)$$
$$\iff \max_{\theta \in \Theta} \log \prod_{i=1}^n p_\theta(X_i) \iff \max_{\theta \in \Theta} \prod_{i=1}^n p_\theta(X_i)$$

This is the maximum likelihood principle.

10/23
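To see the principle at work, a minimal simulation sketch (numpy assumed; the Bernoulli model, the true value $p^* = 0.3$, and the grid over $p$ are illustrative choices). The $p$ maximizing the average log-likelihood, equivalently minimizing the estimated KL, lands at the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)  # i.i.d. Ber(p*) sample with p* = 0.3

p_grid = np.linspace(0.01, 0.99, 99)
# average log-likelihood over the sample: (1/n) * sum_i log p_theta(X_i)
avg_loglik = np.mean(
    x[:, None] * np.log(p_grid) + (1 - x[:, None]) * np.log(1 - p_grid), axis=0
)
print(p_grid[np.argmax(avg_loglik)], x.mean())  # both close to 0.3
```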

Interlude: maximizing/minimizing functions (1)

Note that $\min_{\theta \in \Theta} h(\theta) = -\max_{\theta \in \Theta} \big(-h(\theta)\big)$. In this class, we focus on maximization.

Maximization of arbitrary functions can be difficult. Example: $\theta \mapsto \prod_{i=1}^n (\theta - X_i)$, a degree-$n$ polynomial with many local extrema.

11/23

Interlude: maximizing/minimizing functions (2)

Definition. A twice differentiable function $h : \Theta \subset \mathbb{R} \to \mathbb{R}$ is said to be concave if its second derivative satisfies $h''(\theta) \le 0$ for all $\theta \in \Theta$. It is said to be strictly concave if the inequality is strict: $h''(\theta) < 0$. Moreover, $h$ is said to be (strictly) convex if $-h$ is (strictly) concave, i.e. $h''(\theta) \ge 0$ ($h''(\theta) > 0$).

Examples:
- $\Theta = \mathbb{R}$, $h(\theta) = -\theta^2$
- $\Theta = (0, \infty)$, $h(\theta) = \sqrt{\theta}$
- $\Theta = (0, \infty)$, $h(\theta) = \log\theta$
- $\Theta = [0, \pi]$, $h(\theta) = \sin(\theta)$
- $\Theta = \mathbb{R}$, $h(\theta) = 2\theta - 3$ (concave and convex, but not strictly)

12/23

Interlude: maximizing/minimizing functions (3)

More generally, for a multivariate function $h : \Theta \subset \mathbb{R}^d \to \mathbb{R}$, $d \ge 2$, define the

- gradient vector: $\nabla h(\theta) = \Big(\frac{\partial h}{\partial \theta_1}(\theta), \ldots, \frac{\partial h}{\partial \theta_d}(\theta)\Big)^\top \in \mathbb{R}^d$;
- Hessian matrix: $\nabla^2 h(\theta) = \Big(\frac{\partial^2 h}{\partial \theta_i \partial \theta_j}(\theta)\Big)_{1 \le i,j \le d} \in \mathbb{R}^{d \times d}$.

$h$ is concave $\iff x^\top \nabla^2 h(\theta)\, x \le 0$ for all $x \in \mathbb{R}^d$ and $\theta \in \Theta$.
$h$ is strictly concave $\iff x^\top \nabla^2 h(\theta)\, x < 0$ for all $x \in \mathbb{R}^d \setminus \{0\}$ and $\theta \in \Theta$.

Examples: $\Theta = \mathbb{R}^2$, $h(\theta) = -\theta_1^2 - 2\theta_2^2$ or $h(\theta) = -(\theta_1 - \theta_2)^2$; $\Theta = (0, \infty)^2$, $h(\theta) = \log(\theta_1 + \theta_2)$.

13/23

Interlude: maximizing/minimizing functions (4)

Strictly concave functions are easy to maximize: if they have a maximum, then it is unique. It is the unique solution to $h'(\theta) = 0$, or, in the multivariate case, $\nabla h(\theta) = 0 \in \mathbb{R}^d$.

There are many algorithms to find it numerically: this is the theory of convex optimization. In this class, we will often have a closed-form formula for the maximum.

14/23
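When no closed form is available, a generic solver does the job. A minimal sketch (scipy assumed; the strictly concave function $h(\theta) = \log\theta - \theta$, whose unique maximizer is $\theta = 1$ since $h'(\theta) = 1/\theta - 1$, is an illustrative choice):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# maximize the strictly concave h(theta) = log(theta) - theta
# by minimizing -h on a bounded interval; h'(theta) = 0 at theta = 1
res = minimize_scalar(lambda t: -(np.log(t) - t), bounds=(1e-9, 10.0), method="bounded")
print(res.x)  # ~1.0, the unique maximizer
```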

Likelihood, Discrete case (1)

Let $(E, (\mathbb{P}_\theta)_{\theta \in \Theta})$ be a statistical model associated with a sample of i.i.d. r.v. $X_1, \ldots, X_n$. Assume that $E$ is discrete (i.e., finite or countable).

Definition. The likelihood of the model is the map $L_n$ (or just $L$) defined as:
$$L_n : E^n \times \Theta \to \mathbb{R}, \quad (x_1, \ldots, x_n, \theta) \mapsto \mathbb{P}_\theta[X_1 = x_1, \ldots, X_n = x_n].$$

15/23

Likelihood, Discrete case (2)

Example 1 (Bernoulli trials): If $X_1, \ldots, X_n \stackrel{iid}{\sim} \mathrm{Ber}(p)$ for some $p \in (0, 1)$:
- $E = \{0, 1\}$; $\Theta = (0, 1)$;
- $\forall (x_1, \ldots, x_n) \in \{0, 1\}^n$, $\forall p \in (0, 1)$,
$$L(x_1, \ldots, x_n, p) = \prod_{i=1}^n \mathbb{P}_p[X_i = x_i] = \prod_{i=1}^n p^{x_i}(1 - p)^{1 - x_i} = p^{\sum_{i=1}^n x_i}(1 - p)^{\,n - \sum_{i=1}^n x_i}.$$

16/23
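A quick sanity check that the product form and the closed form above agree, as a minimal sketch (numpy assumed; the sample and the value of $p$ are arbitrary illustrative choices):

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0])  # a small {0,1}-valued sample
p = 0.4
product_form = np.prod(p**x * (1 - p) ** (1 - x))           # prod_i p^{x_i} (1-p)^{1-x_i}
closed_form = p ** x.sum() * (1 - p) ** (len(x) - x.sum())  # p^{sum x_i} (1-p)^{n - sum x_i}
print(product_form, closed_form)  # identical
```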

Likelihood, Discrete case (3)

Example 2 (Poisson model): If $X_1, \ldots, X_n \stackrel{iid}{\sim} \mathrm{Poiss}(\lambda)$ for some $\lambda > 0$:
- $E = \mathbb{N}$; $\Theta = (0, \infty)$;
- $\forall (x_1, \ldots, x_n) \in \mathbb{N}^n$, $\forall \lambda > 0$,
$$L(x_1, \ldots, x_n, \lambda) = \prod_{i=1}^n \mathbb{P}_\lambda[X_i = x_i] = \prod_{i=1}^n e^{-\lambda}\frac{\lambda^{x_i}}{x_i!} = e^{-n\lambda}\, \frac{\lambda^{\sum_{i=1}^n x_i}}{x_1! \cdots x_n!}.$$

17/23

Likelihood, Continuous case (1)

Let $(E, (\mathbb{P}_\theta)_{\theta \in \Theta})$ be a statistical model associated with a sample of i.i.d. r.v. $X_1, \ldots, X_n$. Assume that all the $\mathbb{P}_\theta$ have density $f_\theta$.

Definition. The likelihood of the model is the map $L$ defined as:
$$L : E^n \times \Theta \to \mathbb{R}, \quad (x_1, \ldots, x_n, \theta) \mapsto \prod_{i=1}^n f_\theta(x_i).$$

18/23

Likelihood, Continuous case (2)

Example 1 (Gaussian model): If $X_1, \ldots, X_n \stackrel{iid}{\sim} \mathcal{N}(\mu, \sigma^2)$ for some $\mu \in \mathbb{R}$, $\sigma^2 > 0$:
- $E = \mathbb{R}$; $\Theta = \mathbb{R} \times (0, \infty)$;
- $\forall (x_1, \ldots, x_n) \in \mathbb{R}^n$, $\forall (\mu, \sigma^2) \in \mathbb{R} \times (0, \infty)$,
$$L(x_1, \ldots, x_n, \mu, \sigma^2) = \frac{1}{(\sigma\sqrt{2\pi})^n}\, \exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\Big).$$

19/23

Maximum likelihood estimator (1)

Let $X_1, \ldots, X_n$ be an i.i.d. sample associated with a statistical model $(E, (\mathbb{P}_\theta)_{\theta \in \Theta})$ and let $L$ be the corresponding likelihood.

Definition. The maximum likelihood estimator of $\theta$ is defined as:
$$\hat\theta_n^{\mathrm{MLE}} = \operatorname*{argmax}_{\theta \in \Theta}\, L(X_1, \ldots, X_n, \theta),$$
provided it exists.

Remark (log-likelihood): In practice, we use the fact that
$$\hat\theta_n^{\mathrm{MLE}} = \operatorname*{argmax}_{\theta \in \Theta}\, \log L(X_1, \ldots, X_n, \theta).$$

20/23

Maximum likelihood estimator (2)

Examples:
- Bernoulli trials: $\hat p^{\mathrm{MLE}} = \bar X_n$.
- Poisson model: $\hat\lambda^{\mathrm{MLE}} = \bar X_n$.
- Gaussian model: $(\hat\mu^{\mathrm{MLE}}, \hat\sigma^{2,\mathrm{MLE}}) = (\bar X_n, \hat S_n)$, where $\hat S_n = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X_n)^2$ is the sample variance.

21/23
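These closed forms can be checked against a generic numerical maximizer. A minimal sketch for the Gaussian case (numpy/scipy assumed; the true parameters and the starting point are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.5, size=500)  # N(mu*, sigma*^2) sample, mu* = 2, sigma* = 1.5

def neg_loglik(theta):
    # negative Gaussian log-likelihood as a function of (mu, sigma^2)
    mu, sigma2 = theta
    return 0.5 * len(x) * np.log(2 * np.pi * sigma2) + np.sum((x - mu) ** 2) / (2 * sigma2)

res = minimize(neg_loglik, x0=[0.0, 1.0], bounds=[(None, None), (1e-6, None)])
print(res.x)              # numerical MLE
print(x.mean(), x.var())  # closed form: sample mean and 1/n sample variance
```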

Maximum likelihood estimator (3)

Definition (Fisher information). Define the log-likelihood for one observation as:
$$\ell(\theta) = \log L_1(X, \theta), \quad \theta \in \Theta \subset \mathbb{R}^d.$$
Assume that $\ell$ is a.s. twice differentiable. Under some regularity conditions, the Fisher information of the statistical model is defined as:
$$I(\theta) = \mathbb{E}\big[\nabla\ell(\theta)\,\nabla\ell(\theta)^\top\big] - \mathbb{E}\big[\nabla\ell(\theta)\big]\,\mathbb{E}\big[\nabla\ell(\theta)\big]^\top = -\mathbb{E}\big[\nabla^2\ell(\theta)\big].$$
If $\Theta \subset \mathbb{R}$, we get:
$$I(\theta) = \operatorname{var}\big[\ell'(\theta)\big] = -\mathbb{E}\big[\ell''(\theta)\big].$$

22/23
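As a worked instance of the one-dimensional formula, consider the Bernoulli model of Example 1 (a standard computation): $\ell(p) = X\log p + (1 - X)\log(1 - p)$, so $\ell''(p) = -X/p^2 - (1 - X)/(1 - p)^2$, and
$$I(p) = -\mathbb{E}\big[\ell''(p)\big] = \frac{p}{p^2} + \frac{1 - p}{(1 - p)^2} = \frac{1}{p} + \frac{1}{1 - p} = \frac{1}{p(1 - p)}.$$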

Maximum likelihood estimator (4)

Theorem. Let $\theta^* \in \Theta$ (the true parameter). Assume the following:
1. the model is identified;
2. for all $\theta \in \Theta$, the support of $\mathbb{P}_\theta$ does not depend on $\theta$;
3. $\theta^*$ is not on the boundary of $\Theta$;
4. $I(\theta)$ is invertible in a neighborhood of $\theta^*$;
5. a few more technical conditions.

Then $\hat\theta_n^{\mathrm{MLE}}$ satisfies:
- $\hat\theta_n^{\mathrm{MLE}} \xrightarrow{\ \mathbb{P}\ } \theta^*$ w.r.t. $\mathbb{P}_{\theta^*}$ (consistency);
- $\sqrt{n}\,\big(\hat\theta_n^{\mathrm{MLE}} - \theta^*\big) \xrightarrow{\ (d)\ } \mathcal{N}_d\big(0, I(\theta^*)^{-1}\big)$ w.r.t. $\mathbb{P}_{\theta^*}$ (asymptotic normality).

23/23
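Both conclusions can be checked by simulation. A minimal sketch (numpy assumed; the Bernoulli model with illustrative $p^* = 0.3$, for which $I(p^*)^{-1} = p^*(1 - p^*) = 0.21$ by the computation above):

```python
import numpy as np

rng = np.random.default_rng(2)
p_star, n, reps = 0.3, 1000, 2000

# the MLE in each replication is the sample mean; rescale the error by sqrt(n)
p_hat = rng.binomial(1, p_star, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (p_hat - p_star)
print(z.mean(), z.var())  # mean ~0, variance ~0.21 = I(p*)^{-1}
```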

MIT OpenCourseWare https://ocw.mit.edu 18.650 / 18.6501 Statistics for Applications, Fall 2016. For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.