Methods of Estimation

Methods of Estimation
MIT 18.655
Dr. Kempthorne
Spring 2016

Outline
1 Methods of Estimation I

X ∈ 𝒳, X ~ P ∈ 𝒫 = {P_θ, θ ∈ Θ}. Problem: finding a function θ̂(X) which is close to θ. Consider ρ : 𝒳 × Θ → R and define D(θ_0, θ) = E_{θ_0} ρ(X, θ) to measure the discrepancy between θ and the true value θ_0. As a discrepancy measure, D makes sense if the value of θ minimizing the function is θ = θ_0. If P_{θ_0} were true and we knew D(θ_0, ·), we could obtain θ_0 as the minimizer. Instead of observing D(θ_0, θ), we observe ρ(X, θ). ρ(·, ·) is a contrast function, and the minimizer θ̂(X) of ρ(X, ·) is a minimum-contrast estimate.
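
To make the idea concrete, here is a minimal numerical sketch (simulated data and a squared-error contrast chosen purely for illustration; none of this is from the slides): the observed contrast ρ(X, θ) = Σ_i (X_i − θ)² is minimized numerically, and its minimizer is the sample mean.

```python
# Minimal sketch of a minimum-contrast estimate (illustrative, not from the slides).
# Contrast rho(X, theta) = sum_i (X_i - theta)^2; its expectation D(theta_0, theta)
# is minimized at theta = theta_0, so minimizing rho itself gives the estimate.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
theta0 = 2.5
x = rng.normal(loc=theta0, scale=1.0, size=200)   # observed sample

def rho(theta):
    return np.sum((x - theta) ** 2)

theta_hat = minimize_scalar(rho).x                # minimum-contrast estimate
print(theta_hat, x.mean())                        # both close to theta0
```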

The definition extends to Euclidean Θ ⊂ R^d, with θ_0 an interior point of Θ. For a smooth mapping θ ↦ D(θ_0, θ), θ = θ_0 solves
∇_θ D(θ_0, θ) = 0, where ∇_θ = (∂/∂θ_1, ..., ∂/∂θ_d)^T.
Substitute ρ(X, θ) for D(θ_0, θ) and solve ∇_θ ρ(X, θ) = 0 at θ = θ̂.

Estimating Equations: Ψ : 𝒳 × R^d → R^d, where Ψ = (ψ_1, ..., ψ_d)^T. For every θ_0 ∈ Θ, the expectation of Ψ under P_{θ_0},
V(θ_0, θ) = E_{θ_0}[Ψ(X, θ)] = 0,
has the unique solution θ = θ_0.

Example 2.1.1 Least Squares. µ(z) = g(β, z), β ∈ R^d. X = {(z_i, Y_i) : 1 ≤ i ≤ n}, where Y_1, ..., Y_n are independent. Define
ρ(X, β) = ||Y − µ||² = Σ_{i=1}^n [Y_i − g(β, z_i)]².
Consider Y_i = µ(z_i) + E_i, where µ(z_i) = g(β, z_i) and the E_i are iid N(0, σ_0²). Then β parametrizes the model and we can write
D(β_0, β) = E_{β_0} ρ(X, β) = nσ_0² + Σ_{i=1}^n [g(β_0, z_i) − g(β, z_i)]².
This is minimized by β = β_0, and uniquely so iff β is identifiable.

The least-squares estimate β̂ minimizes ρ(X, β). Conditions that guarantee the existence of β̂: continuity of g(·, z_i), and the minimum of ρ(X, ·) being attained on a compact set of β, e.g., because lim_{|β|→∞} g(β, z_i) = ∞. If g(β, z_i) is differentiable in β, then β̂ satisfies the Normal Equations, obtained by taking partial derivatives of
ρ(X, β) = ||Y − µ||² = Σ_{i=1}^n [Y_i − g(β, z_i)]²
and solving ∂ρ(X, β)/∂β_j = 0, j = 1, ..., d.

ρ(X, β) = ||Y − µ||² = Σ_{i=1}^n [Y_i − g(β, z_i)]².
Solve ∂ρ(X, β)/∂β_j = 0:
Σ_{i=1}^n 2[Y_i − g(β, z_i)] (−1) ∂g(β, z_i)/∂β_j = 0
⟺ Σ_{i=1}^n Y_i ∂g(β, z_i)/∂β_j − Σ_{i=1}^n g(β, z_i) ∂g(β, z_i)/∂β_j = 0

Linear case: g(β, z_i) = Σ_{j=1}^d z_{ij} β_j = z_i^T β. The equations ∂ρ(X, β)/∂β_j = 0 become
Σ_{i=1}^n Y_i ∂g(β, z_i)/∂β_j − Σ_{i=1}^n g(β, z_i) ∂g(β, z_i)/∂β_j = 0
⟺ Σ_{i=1}^n z_{ij} Y_i − Σ_{i=1}^n z_{ij} (z_i^T β) = 0
⟺ Σ_{i=1}^n z_{ij} Y_i − Σ_{k=1}^d (Σ_{i=1}^n z_{ij} z_{ik}) β_k = 0, j = 1, ..., d
⟺ Z_D^T Y − Z_D^T Z_D β = 0,
where Z_D is the (n × d) design matrix with (i, j) element z_{ij}.
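
A short numerical sketch (simulated data, an assumption of this illustration rather than part of the slides) of solving the normal equations Z_D^T Z_D β = Z_D^T Y directly:

```python
# Solve the normal equations Z_D^T Z_D beta = Z_D^T Y for the linear model
# g(beta, z_i) = z_i^T beta (illustrative sketch with simulated data).
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 3
Z = rng.normal(size=(n, d))                    # design matrix Z_D
beta_true = np.array([1.0, -2.0, 0.5])
Y = Z @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y)   # normal equations
print(beta_hat)                                # close to beta_true
```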

Note: Least Squares exemplifies minimum contrast and estimating equation methodology. Distribution assumptions are not necessary to motivate the estimate as a mathematical approximation.

Method of Moments. X_1, ..., X_n iid X ~ P_θ, θ ∈ R^d. µ_1(θ), µ_2(θ), ..., µ_d(θ): µ_j(θ) = µ_j = E[X^j | θ], the jth moment of X. Sample moments:
µ̂_j = (1/n) Σ_{i=1}^n X_i^j, j = 1, ..., d.
Method of Moments: solve for θ in the system of equations
µ_1(θ) = µ̂_1
µ_2(θ) = µ̂_2
...
µ_d(θ) = µ̂_d
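
As an illustration (the Gamma model here is an added example, not one the slides use), matching the first two moments of a Gamma(α, λ) distribution, µ_1 = α/λ and µ_2 = α(α + 1)/λ², gives closed-form method-of-moments estimates:

```python
# Method-of-moments sketch for a Gamma(alpha, lambda) model (rate parametrization).
# mu_1 = alpha/lambda and mu_2 = alpha(alpha+1)/lambda^2, so the two moment
# equations can be solved in closed form.
import numpy as np

rng = np.random.default_rng(2)
alpha_true, lam_true = 3.0, 2.0
x = rng.gamma(shape=alpha_true, scale=1.0 / lam_true, size=5000)

m1 = np.mean(x)          # sample first moment
m2 = np.mean(x ** 2)     # sample second moment

lam_hat = m1 / (m2 - m1 ** 2)          # = mu_1 / (mu_2 - mu_1^2)
alpha_hat = m1 ** 2 / (m2 - m1 ** 2)   # = mu_1^2 / (mu_2 - mu_1^2)
print(alpha_hat, lam_hat)              # close to (3.0, 2.0)
```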

Notes: θ must be identifiable. Existence of µ_j: lim_{n→∞} µ̂_j = µ_j with |µ_j| < ∞. If q(θ) = h(µ_1, ..., µ_d), then the Method-of-Moments estimate of q(θ) is q̂(θ) = h(µ̂_1, ..., µ̂_d). The MOM estimate of θ may not be unique! (See Problem 2.1.11)

Plug-In and Extension Principles. Frequency Plug-In. Multinomial sample: X_1, ..., X_n taking K values v_1, ..., v_K with P(X_i = v_j) = p_j, j = 1, ..., K. Plug-in estimates: p̂_j = N_j / n, where N_j = count({i : X_i = v_j}). Apply to any function q(p_1, ..., p_K): q̂ = q(p̂_1, ..., p̂_K). This is equivalent to substituting for the true distribution function P_θ(t) = P(X ≤ t | θ) underlying an iid sample the empirical distribution function
P̂(t) = (1/n) Σ_{i=1}^n 1{x_i ≤ t}.
P̂ is an estimate of P, and ν(P̂) is an estimate of ν(P).
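
A small frequency plug-in sketch (simulated multinomial data; the choice of q as the entropy is illustrative only, not taken from the slides):

```python
# Frequency plug-in sketch: estimate p_j by N_j / n and plug into a function
# q(p_1, ..., p_K); here q is the Shannon entropy (illustrative choice).
import numpy as np

rng = np.random.default_rng(3)
p_true = np.array([0.5, 0.3, 0.2])
n = 1000
x = rng.choice(len(p_true), size=n, p=p_true)    # sample of category labels

counts = np.bincount(x, minlength=len(p_true))   # N_j
p_hat = counts / n                               # frequency plug-in estimates

def q(p):
    return -np.sum(p * np.log(p))                # entropy as the target q(p)

print(q(p_hat), q(p_true))                       # plug-in estimate vs. true value
```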

Example: αth population quantile, 0 < α < 1:
ν_α(P) = (1/2)[F^{-1}(α) + F_U^{-1}(α)],
where F^{-1}(α) = inf{x : F(x) ≥ α} and F_U^{-1}(α) = sup{x : F(x) ≤ α}.
The plug-in estimate is ν̂_α(P) = ν_α(P̂) = (1/2)[F̂^{-1}(α) + F̂_U^{-1}(α)].
Example: Method-of-Moments estimate of the jth moment: ν(P) = µ_j = E(X^j), and ν̂(P) = ν(P̂) = µ̂_j = (1/n) Σ_{i=1}^n x_i^j.
Extension Principle. Objective: estimate q(θ), a function of θ. Assume q(θ) = h(p_1(θ), ..., p_K(θ)), where h(·) is continuous. The extension principle estimates q(θ) with q̂(θ) = h(p̂_1, ..., p̂_K). h(·) may not be unique: which h(·) is optimal?
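
A sketch of the quantile plug-in estimate computed from the empirical distribution function (simulated data, not from the slides; for α = 1/2 and even n it reduces to the usual sample median):

```python
# Plug-in estimate of the alpha-th quantile via the empirical distribution
# function F_hat: average its lower and upper generalized inverses.
import numpy as np

rng = np.random.default_rng(4)
x = np.sort(rng.normal(size=500))
n, alpha = len(x), 0.5

F_hat = np.arange(1, n + 1) / n                           # F_hat at the order statistics
lower = x[np.searchsorted(F_hat, alpha, side='left')]     # inf{x : F_hat(x) >= alpha}
upper = x[np.searchsorted(F_hat, alpha, side='right')]    # sup{x : F_hat(x) <= alpha}
nu_hat = 0.5 * (lower + upper)
print(nu_hat, np.median(x))                               # identical for alpha = 0.5
```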

Notes on Method-of-Moments / Frequency Plug-In Estimates:
Easy to compute.
Valuable as initial estimates in iterative algorithms.
Consistent estimates (close to the true parameter in large samples).
The best frequency plug-in estimates are maximum-likelihood estimates.
In some cases MOM estimators are foolish (see Example 2.1.7).

Outline
1 Methods of Estimation I

Least Squares. General Model: Only Y Random. X = {(z_i, Y_i) : 1 ≤ i ≤ n}, where Y_1, ..., Y_n are independent and z_1, ..., z_n ∈ R^d are fixed, non-random. For cases i = 1, ..., n,
Y_i = µ(z_i) + E_i, where µ(z) = g(β, z), β ∈ R^d,
and the E_i are independent with E[E_i] = 0. The least-squares contrast function is
ρ(X, β) = ||Y − µ||² = Σ_{i=1}^n [Y_i − g(β, z_i)]².
β parametrizes the model and we can write the discrepancy function D(β_0, β) = E_{β_0} ρ(X, β).

Least Squares: Only Y Random. Contrast function:
ρ(X, β) = ||Y − µ||² = Σ_{i=1}^n [Y_i − g(β, z_i)]².
Discrepancy function:
D(β_0, β) = E_{β_0} ρ(X, β) = Σ_{i=1}^n Var(E_i) + Σ_{i=1}^n [g(β_0, z_i) − g(β, z_i)]².
The model is semiparametric with unknown parameter β and unknown (joint) distribution P_E of E = (E_1, ..., E_n).
Gauss-Markov Assumptions. Assume that the distribution of E satisfies: E(E_i) = 0, Var(E_i) = σ², and Cov(E_i, E_j) = 0 for i ≠ j.

General Model: (Y, Z) Both Random. (Y_1, Z_1), ..., (Y_n, Z_n) are i.i.d. as X = (Y, Z) ~ P. Define µ(z) = E[Y | Z = z] = g(β, z), where g(·, ·) is a known function and β ∈ R^d is an unknown parameter. Given Z_i = z_i, define E_i = Y_i − µ(z_i) for i = 1, ..., n. Conditioning on the z_i we can write
Y_i = g(β, z_i) + E_i, i = 1, 2, ..., n,
where E = (E_1, ..., E_n) has (joint) distribution P_E. The least-squares estimate β̂ is the plug-in estimate β(P̂), where P̂ is the empirical distribution of the sample {(Z_i, Y_i), i = 1, ..., n}. The function g(β, z) can be linear or nonlinear in β and z. Closed-form solutions for β̂ exist when g is linear in β.
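
When g is nonlinear in β there is generally no closed form, and the least-squares contrast is minimized numerically; here is a sketch under an assumed exponential mean function (an illustrative choice, not an example from the slides):

```python
# Nonlinear least squares sketch: g(beta, z) = beta_0 * exp(beta_1 * z),
# fit by numerically minimizing the least-squares contrast (no closed form).
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(5)
z = rng.uniform(0, 2, size=200)
beta_true = np.array([1.5, -0.8])
y = beta_true[0] * np.exp(beta_true[1] * z) + rng.normal(scale=0.05, size=z.size)

def residuals(beta):
    return y - beta[0] * np.exp(beta[1] * z)

fit = least_squares(residuals, x0=np.array([1.0, 0.0]))
print(fit.x)                                     # close to beta_true
```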

Outline
1 Methods of Estimation I

Gauss-Markov Theorem: Assumptions. Data y = (y_1, y_2, ..., y_n)^T and the (n × p) matrix X with (i, j) element x_{i,j} follow a linear model satisfying the Gauss-Markov assumptions if y is an observation of the random vector Y = (Y_1, Y_2, ..., Y_n)^T with
E(Y | X, β) = Xβ, where β = (β_1, β_2, ..., β_p)^T is the p-vector of regression parameters, and
Cov(Y | X, β) = σ² I_n, for some σ² > 0.
I.e., the random variables generating the observations are uncorrelated and have constant variance σ² (conditional on X and β).

For known constants c_1, c_2, ..., c_p, c_{p+1}, consider the problem of estimating θ = c_1 β_1 + c_2 β_2 + ... + c_p β_p + c_{p+1}. Under the Gauss-Markov assumptions, the estimator θ̂ = c_1 β̂_1 + c_2 β̂_2 + ... + c_p β̂_p + c_{p+1}, where β̂_1, β̂_2, ..., β̂_p are the least-squares estimates, is
1) an unbiased estimator of θ, and
2) a linear estimator of θ, that is, θ̂ = Σ_{i=1}^n b_i y_i for some known (given X) constants b_i.
Theorem: Under the Gauss-Markov assumptions, the estimator θ̂ has the smallest ("best") variance among all linear unbiased estimators of θ, i.e., θ̂ is BLUE.
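
A simulation sketch of the theorem (hypothetical design matrix and constants, not from the slides): the least-squares coefficients d and an alternative set b = d + f with f orthogonal to the columns of X both give unbiased linear estimators of θ = c^T β, but d^T y has the smaller variance.

```python
# Simulation sketch of the Gauss-Markov theorem: compare Var(d^T y) for the
# least-squares coefficients d = X (X^T X)^{-1} c with Var(b^T y) for another
# linear unbiased estimator b = d + f, where f is orthogonal to the columns of X.
import numpy as np

rng = np.random.default_rng(6)
n, p = 40, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, 2.0, -1.0])
c = np.array([1.0, -1.0, 0.5])
sigma = 1.0

d = X @ np.linalg.solve(X.T @ X, c)              # least-squares coefficients
# Build f orthogonal to the column space of X (so b^T y stays unbiased).
u = rng.normal(size=n)
f = u - X @ np.linalg.solve(X.T @ X, X.T @ u)
b = d + f

theta = c @ beta
est_d, est_b = [], []
for _ in range(20000):
    y = X @ beta + sigma * rng.normal(size=n)
    est_d.append(d @ y)
    est_b.append(b @ y)

print(np.mean(est_d), np.mean(est_b), theta)     # both approximately unbiased
print(np.var(est_d), np.var(est_b))              # Var(d^T y) <= Var(b^T y)
```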

Gauss-Markov Theorem: Proof. Without loss of generality, assume c_{p+1} = 0 and define c = (c_1, c_2, ..., c_p)^T. The least-squares estimate of θ = c^T β is
θ̂ = c^T β̂ = c^T (X^T X)^{-1} X^T y ≡ d^T y,
a linear estimate in y with coefficients d = (d_1, d_2, ..., d_n)^T. Consider an alternative linear estimate of θ: θ* = b^T y with fixed coefficients b = (b_1, ..., b_n)^T. Define f = b − d and note that
θ* = b^T y = (d + f)^T y = θ̂ + f^T y.
If θ* is unbiased then, because θ̂ is unbiased,
0 = E(f^T y) = f^T E(y) = f^T (Xβ) for all β ∈ R^p
⟹ f is orthogonal to the column space of X
⟹ f is orthogonal to d = X(X^T X)^{-1} c.

If θ* is unbiased, then the orthogonality of f to d implies
Var(θ*) = Var(b^T y) = Var(d^T y + f^T y)
= Var(d^T y) + Var(f^T y) + 2 Cov(d^T y, f^T y)
= Var(θ̂) + Var(f^T y) + 2 d^T Cov(y) f
= Var(θ̂) + Var(f^T y) + 2 d^T (σ² I_n) f
= Var(θ̂) + Var(f^T y) + 2 σ² d^T f
= Var(θ̂) + Var(f^T y) + 2 σ² · 0
≥ Var(θ̂).

Outline
1 Methods of Estimation I

Generalized Least Squares (GLS) Estimates. Consider generalizing the Gauss-Markov assumptions for the linear regression model to
Y = Xβ + E,
where the random n-vector E satisfies E[E] = 0_n and E[E E^T] = σ² Σ; σ² is an unknown scale parameter and Σ is a known (n × n) positive definite matrix specifying the relative variances and correlations of the component observations. Transform the data (Y, X) to Y* = Σ^{-1/2} Y and X* = Σ^{-1/2} X; the model becomes
Y* = X* β + E*, where E[E*] = 0_n and E[E* (E*)^T] = σ² I_n.
By the Gauss-Markov Theorem, the BLUE (the GLS estimate) of β is
β̂ = [(X*)^T (X*)]^{-1} (X*)^T Y* = [X^T Σ^{-1} X]^{-1} (X^T Σ^{-1} Y).
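
A numerical sketch of the GLS formula (the AR(1)-type Σ below is an illustrative assumption, not from the slides; Σ is taken as known):

```python
# GLS sketch: with E[E E^T] = sigma^2 * Sigma and Sigma known, the BLUE of beta
# is beta_hat = (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} Y.
import numpy as np

rng = np.random.default_rng(7)
n, p = 60, 2
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0])

# Known correlation structure: AR(1)-type covariance (illustrative choice).
rho = 0.6
Sigma = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
L = np.linalg.cholesky(Sigma)
Y = X @ beta_true + L @ rng.normal(size=n)       # errors with covariance Sigma

Sigma_inv = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ Y)
beta_ols = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_gls, beta_ols)                        # both near beta_true
```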

Outline
1 Methods of Estimation I

Maximum Likelihood Estimation. X ~ P_θ, θ ∈ Θ, with density or pmf function p(x | θ). Given an observation X = x, define the likelihood function
L_x(θ) = p(x | θ), a mapping Θ → R.
θ̂_ML = θ̂_ML(x), the maximum-likelihood estimate of θ, is the value making L_x(·) a maximum: θ̂ is the MLE if
L_x(θ̂) = max_{θ ∈ Θ} L_x(θ).
The MLE θ̂_ML(x) identifies the distribution making x most likely. The MLE coincides with the mode of the posterior distribution if the prior distribution on Θ is uniform:
π(θ | x) ∝ p(x | θ) π(θ) ∝ p(x | θ).

Examples.
Example 2.2.4: Normal Distribution with Known Variance.
Example 2.2.5: Size of a Population. X_1, ..., X_n are iid U{1, 2, ..., θ}, with θ ∈ {1, 2, ...}. For x = (x_1, ..., x_n),
L_x(θ) = Π_{i=1}^n θ^{-1} 1(1 ≤ x_i ≤ θ) = θ^{-n} 1(max(x_1, ..., x_n) ≤ θ),
which is 0 for θ = 0, 1, ..., max(x_i) − 1 and equals θ^{-n} for θ ≥ max(x_i). Since θ^{-n} is decreasing in θ, the MLE is θ̂_ML = max(x_1, ..., x_n).
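
A tiny sketch of Example 2.2.5 with simulated counts (the population size 120 below is an arbitrary illustrative value):

```python
# MLE for X_1, ..., X_n iid uniform on {1, ..., theta}: the likelihood is
# theta^{-n} for theta >= max(x_i) and 0 otherwise, so theta_hat = max(x_i).
import numpy as np

rng = np.random.default_rng(8)
theta_true = 120
x = rng.integers(1, theta_true + 1, size=50)   # iid U{1, ..., theta_true}
theta_hat = x.max()                            # maximum-likelihood estimate
print(theta_hat)                               # at most theta_true, typically close
```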

Maximum Likelihood as a Minimum Contrast Method. Define ℓ_x(θ) = −log L_x(θ) = −log p(x | θ). Because −log(·) is monotone decreasing, θ̂_ML(x) minimizes ℓ_x(θ). For an iid sample X = (X_1, ..., X_n) with densities p(x_i | θ),
ℓ_X(θ) = −log p(x_1, ..., x_n | θ) = −log [Π_{i=1}^n p(x_i | θ)] = −Σ_{i=1}^n log p(x_i | θ).
As a minimum contrast function, ρ(x, θ) = ℓ_X(θ) yields the MLE θ̂_ML(x). The discrepancy function corresponding to the contrast function ρ(x, θ) is
D(θ_0, θ) = E[ρ(X, θ) | θ_0] = −E[log p(X | θ) | θ_0].
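
Viewing the negative log-likelihood as a contrast suggests computing the MLE by numerical minimization; here is a sketch for an iid N(µ, σ²) sample (an added example using scipy, not part of the slides; in this case the MLE also has a closed form, which the printout matches):

```python
# MLE as minimum contrast: minimize the negative log-likelihood numerically.
# Sketch for an iid N(mu, sigma^2) sample, parametrized as (mu, log sigma).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(9)
x = rng.normal(loc=1.0, scale=2.0, size=300)

def neg_log_lik(params):
    mu, log_sigma = params
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

fit = minimize(neg_log_lik, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = fit.x[0], np.exp(fit.x[1])
print(mu_hat, sigma_hat)       # close to x.mean() and the (biased) sample sd
```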

Suppose that θ = θ_0 uniquely minimizes D(θ_0, ·). Then
D(θ_0, θ) − D(θ_0, θ_0) = −E[log p(X | θ) | θ_0] − (−E[log p(X | θ_0) | θ_0]) = −E[log (p(X | θ) / p(X | θ_0)) | θ_0] > 0, unless θ = θ_0.
This difference is the Kullback-Leibler Information Divergence between the distributions P_{θ_0} and P_θ:
K(P_{θ_0}, P_θ) = −E[log (p(X | θ) / p(X | θ_0)) | θ_0].
Lemma 2.2.1 (Shannon, 1948). The mutual entropy K(P_{θ_0}, P_θ) is always well defined and K(P_{θ_0}, P_θ) ≥ 0. Equality holds if and only if {x : p(x | θ) = p(x | θ_0)} has probability 1 under both P_{θ_0} and P_θ.
Proof: apply Jensen's Inequality (B.9.3).
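
A discrete sketch of Lemma 2.2.1 (hypothetical pmfs, chosen for illustration): the divergence is strictly positive for distinct pmfs and zero when they coincide.

```python
# Kullback-Leibler divergence K(P0, P) = sum_x p0(x) * log(p0(x) / p(x)) for
# discrete pmfs with common (strictly positive) support; >= 0, = 0 iff p0 == p.
import numpy as np

def kl(p0, p):
    p0, p = np.asarray(p0, float), np.asarray(p, float)
    return np.sum(p0 * np.log(p0 / p))

p0 = np.array([0.5, 0.3, 0.2])
p1 = np.array([0.4, 0.4, 0.2])
print(kl(p0, p1))    # strictly positive
print(kl(p0, p0))    # 0.0
```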

Likelihood Equations. Suppose: X ~ P_θ with θ ∈ Θ, an open parameter space; the likelihood function L_X(θ) is differentiable in θ; and θ̂_ML(x) exists. Then θ̂_ML(x) must satisfy the Likelihood Equation(s)
∇_θ log L_X(θ) = 0.
Important cases: for independent X_i with densities/pmfs p_i(x_i | θ),
∇_θ log L_X(θ) = Σ_{i=1}^n ∇_θ log p_i(x_i | θ) = 0.
NOTE: p_i(· | θ) may vary with i.
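
As an illustration of solving a likelihood equation that has no closed form (the Gamma shape model is an added example, not one of the slide's cases), the score equation Σ_i log x_i − n ψ(α) = 0, with ψ the digamma function, can be solved by one-dimensional root finding:

```python
# Solve the likelihood equation for the shape alpha of an iid Gamma(alpha, 1)
# sample: d/dalpha log-likelihood = sum(log x_i) - n * digamma(alpha) = 0.
import numpy as np
from scipy.optimize import brentq
from scipy.special import digamma

rng = np.random.default_rng(10)
alpha_true = 4.0
x = rng.gamma(shape=alpha_true, scale=1.0, size=1000)

def score(alpha):
    return np.sum(np.log(x)) - len(x) * digamma(alpha)

alpha_hat = brentq(score, 1e-3, 100.0)   # root of the score function
print(alpha_hat)                         # close to alpha_true
```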

Examples:
Hardy-Weinberg Proportions (Example 2.2.6)
Queues: Poisson Process Models (Exponential Arrival Times and Poisson Counts) (Example 2.2.7)
Multinomial Trials (Example 2.2.8)
Normal Regression Models (Example 2.2.9)

MIT OpenCourseWare http://ocw.mit.edu 18.655 Mathematical Statistics Spring 2016 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.