The loss function and estimating equations

Chapter 6: The loss function and estimating equations

6.1 Loss functions

Up until now our main focus has been on parameter estimation via the maximum likelihood. However, the negative log-likelihood criterion belongs to a wider class of criteria known as loss functions. Loss functions are usually distances, such as the $\ell_1$ and $\ell_2$ distance. Typically we estimate a parameter by minimising the loss function, using as the estimator the parameter which minimises the loss. Usually (but not always) the way to minimise the loss is to differentiate it and equate the derivative to zero. This motivates another means of estimating parameters: finding the solution of an equation (such equations are usually called estimating equations). For many examples there is an estimating equation which corresponds to a loss function, and a loss function which corresponds to an estimating equation, but this will not always be the case.

Example 6.1.1 (Examples where the loss function does not have a derivative everywhere) (i) Consider the Laplacian (also known as the double exponential) distribution, which is defined as

$$f(y;\theta,\mu) = \frac{1}{2\theta}\exp\Big(-\frac{|y-\mu|}{\theta}\Big) = \begin{cases} \frac{1}{2\theta}\exp\big(\frac{y-\mu}{\theta}\big) & y < \mu \\ \frac{1}{2\theta}\exp\big(-\frac{y-\mu}{\theta}\big) & y \geq \mu. \end{cases}$$

We observe $\{Y_t\}$ and our objective is to estimate the location parameter $\mu$; for now the scale parameter $\theta$ is not of interest. The log-likelihood is

$$\mathcal{L}_T(\mu) = -T\log 2\theta - \frac{1}{\theta}\sum_{i=1}^{T}|Y_i - \mu|.$$

We maximise the above to estimate $\mu$. We see that this is equivalent to minimising the loss function

$$L_T(\mu) = \sum_{i=1}^{T}|Y_i - \mu| = \sum_{Y_i > \mu}(Y_i - \mu) + \sum_{Y_i \leq \mu}(\mu - Y_i).$$

If we plot $L_T$ against $\mu$ and consider how it behaves at the ordered observations $\{Y_{(t)}\}$, we see that it is piecewise linear (it is continuous, but non-differentiable at the points $Y_{(t)}$). On closer inspection (if $T$ is odd) we see that $L_T$ has its minimum at $\hat{\mu}_T = Y_{((T+1)/2)}$, which is the sample median (to prove this to yourself, take an example of four observations and construct $L_T$; you can then extend the argument by a pseudo-induction). Heuristically, the derivative of the loss function is (number of $Y_i$ less than $\mu$) minus (number of $Y_i$ greater than $\mu$); this reasoning gives another argument as to why the sample median minimises the loss. Note that it is a little ambiguous what the minimum is when $T$ is even.

In summary, the normal distribution gives rise to the $\ell_2$-loss function and the sample mean; in contrast, the Laplacian gives rise to the $\ell_1$-loss function and the sample median.

(ii) Consider the generalisation of the Laplacian, usually called the asymmetric Laplacian, which is defined as

$$f(y;\theta,\mu) = \begin{cases} \frac{p(1-p)}{\theta}\exp\big(\frac{(1-p)(y-\mu)}{\theta}\big) & y < \mu \\ \frac{p(1-p)}{\theta}\exp\big(-\frac{p(y-\mu)}{\theta}\big) & y \geq \mu, \end{cases}$$

where $0 < p < 1$. Maximising the log-likelihood over $\mu$ is equivalent to minimising the loss function

$$L_T(\mu) = \sum_{Y_i > \mu} p(Y_i - \mu) + \sum_{Y_i \leq \mu}(1-p)(\mu - Y_i).$$

Using similar arguments to those in part (i), it can be shown that the minimum of $L_T$ is approximately the $p$th sample quantile.
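The connection between these losses and the sample median/quantiles is easy to check numerically. The following short Python sketch is not part of these notes; it assumes numpy is available, uses arbitrary simulated data, and simply evaluates the $\ell_1$ loss and the check loss on a grid, comparing the minimisers with the sample median and the sample 0.9-quantile.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.laplace(loc=2.0, scale=1.0, size=101)   # odd sample size T = 101 (illustrative)

def l1_loss(theta, y):
    """ell_1 loss: sum of absolute deviations."""
    return np.sum(np.abs(y - theta))

def check_loss(theta, y, p):
    """Asymmetric (check) loss arising from the asymmetric Laplacian."""
    u = y - theta
    return np.sum(np.where(u >= 0, p * u, (p - 1) * u))

grid = np.linspace(y.min(), y.max(), 2001)
theta_l1 = grid[np.argmin([l1_loss(t, y) for t in grid])]
theta_q = grid[np.argmin([check_loss(t, y, p=0.9) for t in grid])]

print(theta_l1, np.median(y))          # ell_1 minimiser ~ sample median
print(theta_q, np.quantile(y, 0.9))    # check-loss minimiser ~ 0.9 sample quantile
```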

6.2 Estimating functions

See also Section 7.2 in Davison (2002) and Lars Peter Hansen (1982, Econometrica). Estimating functions provide a unification and generalisation of maximum likelihood methods and the method of moments. They are close cousins of the generalised method of moments and of generalised estimating equations. We first consider a few examples and later describe a feature common to all of them.

Example 6.2.1 (i) Let us suppose that $\{Y_t\}$ are iid random variables with $Y_t \sim N(\mu, \sigma^2)$. The log-likelihood is proportional to

$$\mathcal{L}_T(\mu, \sigma^2) = -\frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=1}^{T}(Y_t - \mu)^2.$$

We know that to estimate $\mu$ and $\sigma^2$ we use the $\mu$ and $\sigma^2$ which are the solution of

$$-\frac{T}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{t=1}^{T}(Y_t - \mu)^2 = 0, \qquad \frac{1}{\sigma^2}\sum_{t=1}^{T}(Y_t - \mu) = 0. \qquad (6.1)$$

(ii) More generally, suppose $\{Y_t\}$ are iid random variables with density $f(\cdot;\theta)$. The log-likelihood is $\mathcal{L}_T(\theta) = \sum_{t=1}^{T}\log f(Y_t;\theta)$. If the regularity conditions are satisfied, then to estimate $\theta$ we use the solution of

$$\frac{\partial \mathcal{L}_T(\theta)}{\partial\theta} = 0. \qquad (6.2)$$

(iii) Let us suppose that $\{X_t\}$ are iid random variables with a Weibull distribution

$$f(x;\alpha,\phi) = \Big(\frac{\alpha}{\phi}\Big)\Big(\frac{x}{\phi}\Big)^{\alpha-1}\exp\big(-(x/\phi)^{\alpha}\big),$$

where $\alpha, \phi > 0$. We know that (see Section 9, Example 9.2) $E(X) = \phi\Gamma(1+\alpha^{-1})$ and $E(X^2) = \phi^2\Gamma(1+2\alpha^{-1})$. Therefore $E(X) - \phi\Gamma(1+\alpha^{-1}) = 0$ and $E(X^2) - \phi^2\Gamma(1+2\alpha^{-1}) = 0$. Hence, by solving

$$\frac{1}{T}\sum_{t=1}^{T}X_t - \phi\Gamma(1+\alpha^{-1}) = 0, \qquad \frac{1}{T}\sum_{t=1}^{T}X_t^2 - \phi^2\Gamma(1+2\alpha^{-1}) = 0, \qquad (6.3)$$

we obtain estimators of $\alpha$ and $\phi$. This is essentially a method of moments estimator of the parameters in a Weibull distribution.

(iv) We can generalise the above. It can be shown that $E(X^r) = \phi^r\Gamma(1+r\alpha^{-1})$. Therefore, for any distinct $r$ and $s$ we can estimate $\alpha$ and $\phi$ using the solution of

$$\frac{1}{T}\sum_{t=1}^{T}X_t^r - \phi^r\Gamma(1+r\alpha^{-1}) = 0, \qquad \frac{1}{T}\sum_{t=1}^{T}X_t^s - \phi^s\Gamma(1+s\alpha^{-1}) = 0. \qquad (6.4)$$

(v) Consider the simple linear regression $Y_t = \alpha x_t + \varepsilon_t$, with $E(\varepsilon_t) = 0$ and $\mathrm{var}(\varepsilon_t) = 1$; the least squares estimator of $\alpha$ is the solution of

$$\sum_{t=1}^{T}(Y_t - a x_t)x_t = 0. \qquad (6.5)$$
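As a concrete illustration of (iii), the moment equations (6.3) can be handed to a numerical root finder. The following is a sketch only (scipy is assumed; the true parameter values, sample size and starting values are arbitrary choices, not part of the notes).

```python
import numpy as np
from scipy.special import gamma
from scipy.optimize import fsolve

rng = np.random.default_rng(1)
x = rng.weibull(a=1.5, size=5000) * 2.0   # shape alpha = 1.5, scale phi = 2.0 (illustrative)

def moment_equations(params, x):
    """The estimating equations (6.3): match the first two sample moments
    of a Weibull(alpha, phi) sample to their theoretical counterparts."""
    alpha, phi = params
    eq1 = np.mean(x) - phi * gamma(1 + 1 / alpha)
    eq2 = np.mean(x**2) - phi**2 * gamma(1 + 2 / alpha)
    return [eq1, eq2]

# Starting values matter for the root finder; (1, 1) is an arbitrary choice.
alpha_hat, phi_hat = fsolve(moment_equations, x0=[1.0, 1.0], args=(x,))
print(alpha_hat, phi_hat)                 # should be close to (1.5, 2.0)
```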

We observe that all of the above estimators can be written as the solution of a system of equations; see equations (6.1), (6.2), (6.3), (6.4) and (6.5). In other words, for each case we can define a random function $G_T(\theta)$ such that the above estimators are the solutions of $G_T(\hat{\theta}_T) = 0$. In the case that $\{Y_t\}$ are iid, $G_T(\theta) = \sum_{t=1}^{T} g(Y_t;\theta)$ for some function $g(\cdot\,;\theta)$. The function $G_T(\theta)$ is called an estimating function. All the functions $G_T$ defined above satisfy the unbiasedness property, which we define below.

Definition 6.2.1 An estimating function $G_T$ is called unbiased if at the true parameter $\theta_0$ it satisfies $E[G_T(\theta_0)] = 0$.

Hence the estimating function is an alternative way of viewing a parameter estimator. Until now, parameter estimators have been defined in terms of the maximum of the likelihood. However, an alternative method for defining an estimator is as the solution of an equation. For example, suppose that $\{Y_t\}$ are random variables whose distribution depends in some way on the parameter $\theta_0$. We want to estimate $\theta_0$, and we know that there exists a function $G$ such that $G(\theta_0) = 0$. Therefore, using the data $\{Y_t\}$ we can define a random function $G_T$ with $E[G_T(\theta)] = G(\theta)$, and use the parameter $\hat{\theta}_T$ which satisfies $G_T(\hat{\theta}_T) = 0$ as an estimator of $\theta_0$. We observe that such estimators include most maximum likelihood estimators and method of moments estimators.

Example 6.2.2 Based on the examples above we see that:

(i) The estimating function is

$$G_T(\mu, \sigma^2) = \begin{pmatrix} \frac{1}{\sigma^2}\sum_{t=1}^{T}(Y_t - \mu) \\ -\frac{T}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{t=1}^{T}(Y_t - \mu)^2 \end{pmatrix}.$$

(ii) The estimating function is $G_T(\theta) = \frac{\partial \mathcal{L}_T(\theta)}{\partial\theta}$.

(iii) The estimating function is

$$G_T(\alpha, \phi) = \begin{pmatrix} \frac{1}{T}\sum_{t=1}^{T}X_t - \phi\Gamma(1+\alpha^{-1}) \\ \frac{1}{T}\sum_{t=1}^{T}X_t^2 - \phi^2\Gamma(1+2\alpha^{-1}) \end{pmatrix}.$$

(iv) The estimating function is

$$G_T(\alpha, \phi) = \begin{pmatrix} \frac{1}{T}\sum_{t=1}^{T}X_t^r - \phi^r\Gamma(1+r\alpha^{-1}) \\ \frac{1}{T}\sum_{t=1}^{T}X_t^s - \phi^s\Gamma(1+s\alpha^{-1}) \end{pmatrix}.$$

(v) The estimating function is $G_T(a) = \sum_{t=1}^{T}(Y_t - a x_t)x_t$.
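The abstraction can be made concrete in a few lines of code. The following is a minimal sketch (not part of the notes): supply a per-observation function $g(y;\theta)$ with $E[g(Y_t;\theta_0)] = 0$, sum it over the data and pass $G_T$ to a root finder. The helper name `solve_estimating_equation`, the simulated data and the true slope are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import fsolve

def solve_estimating_equation(g, data, theta0):
    """Solve G_T(theta) = sum_t g(data_t; theta) = 0 for theta."""
    G_T = lambda theta: np.sum([g(d, theta) for d in data], axis=0)
    return fsolve(G_T, theta0)

# Example 6.2.1(v): least squares as an estimating equation,
# with g((y, x); a) = (y - a x) x, so that E[g] = 0 at the true slope.
rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 1.7 * x + rng.normal(size=200)        # true slope 1.7 (illustrative)

a_hat = solve_estimating_equation(lambda d, a: (d[0] - a * d[1]) * d[1],
                                  list(zip(y, x)), theta0=0.0)
print(a_hat, np.sum(x * y) / np.sum(x**2))   # matches the least squares slope
```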

The advantage of this approach is that sometimes the solution of an estimating equation will have a smaller finite-sample variance than the MLE, even though asymptotically (under certain conditions) the MLE attains the Cramér-Rao bound (the smallest asymptotic variance). Moreover, MLE estimators are based on the assumption that the distribution is known (otherwise the estimator is misspecified; see Section 9), whereas an estimating equation can sometimes be free of such assumptions. Recall from Example 6.2.1(v) that

$$E\big[(Y_t - \alpha x_t)x_t\big] = 0 \qquad (6.6)$$

is true regardless of the distribution of $Y_t$ (equivalently of $\{\varepsilon_t\}$), and remains true if the $\{Y_t\}$ are dependent random variables. We recall that Example 6.2.1(v) is the least squares estimator, which is a consistent estimator of the parameters in a wide variety of settings (see Rao (1973), Linear Statistical Inference and its Applications, and your STAT612 notes).

Example 6.2.3 In many statistical situations it is relatively straightforward to find a suitable estimating function, rather than the likelihood. Consider the time series $\{X_t\}$ which satisfies

$$X_t = a_1 X_{t-1} + a_2 X_{t-2} + \sigma(a_1, a_2)\varepsilon_t,$$

where $\{\varepsilon_t\}$ are iid random variables. We do not know the distribution of $\varepsilon_t$, but because $\varepsilon_t$ is independent of $X_{t-1}$ and $X_{t-2}$, by multiplying the above equation by $X_{t-1}$ and by $X_{t-2}$ and taking expectations we have

$$E(X_t X_{t-1}) = a_1 E(X_{t-1}^2) + a_2 E(X_{t-1}X_{t-2}), \qquad E(X_t X_{t-2}) = a_1 E(X_{t-1}X_{t-2}) + a_2 E(X_{t-2}^2).$$

Since the above time series is stationary (we have not formally defined this, but roughly it means the properties of $\{X_t\}$ do not vary over time), it can be shown that $\frac{1}{T}\sum_t X_t X_{t-r}$ is an estimator of $E(X_t X_{t-r})$ and that $E\big(\frac{1}{T}\sum_t X_t X_{t-r}\big) \approx E(X_t X_{t-r})$. Hence, replacing the expectations above with their estimators, we obtain the estimating equations

$$G_T(a_1, a_2) = \begin{pmatrix} \frac{1}{T}\sum_t X_t X_{t-1} - a_1 \frac{1}{T}\sum_t X_{t-1}^2 - a_2 \frac{1}{T}\sum_t X_{t-1}X_{t-2} \\ \frac{1}{T}\sum_t X_t X_{t-2} - a_1 \frac{1}{T}\sum_t X_{t-1}X_{t-2} - a_2 \frac{1}{T}\sum_t X_{t-2}^2 \end{pmatrix}.$$
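Because $G_T(a_1, a_2)$ above is linear in $(a_1, a_2)$, setting it to zero reduces to a two-by-two linear system in the sample second moments. A quick numerical sketch (numpy assumed; the true coefficients and sample size are arbitrary, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(3)
T, a1, a2 = 5000, 0.5, -0.3               # illustrative AR(2) coefficients
x = np.zeros(T)
for t in range(2, T):
    x[t] = a1 * x[t - 1] + a2 * x[t - 2] + rng.normal()

# Sample analogues of the expectations appearing in G_T(a_1, a_2).
c01 = np.mean(x[2:] * x[1:-1])            # ~ E(X_t X_{t-1})
c02 = np.mean(x[2:] * x[:-2])             # ~ E(X_t X_{t-2})
c11 = np.mean(x[1:-1] ** 2)               # ~ E(X_{t-1}^2)
c12 = np.mean(x[1:-1] * x[:-2])           # ~ E(X_{t-1} X_{t-2})
c22 = np.mean(x[:-2] ** 2)                # ~ E(X_{t-2}^2)

# G_T(a_1, a_2) = 0 is linear in (a_1, a_2), so it is a 2x2 linear system.
A = np.array([[c11, c12],
              [c12, c22]])
b = np.array([c01, c02])
a1_hat, a2_hat = np.linalg.solve(A, b)
print(a1_hat, a2_hat)                     # close to (0.5, -0.3)
```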

We now show that under certain conditions $\hat{\theta}_T$ is a consistent estimator of $\theta_0$.

Theorem 6.2.1 Suppose that $G_T(\theta)$ is an unbiased estimating function, where $G_T(\hat{\theta}_T) = 0$ and $E[G_T(\theta_0)] = 0$.

(i) If $\theta$ is a scalar, for every $T$ the function $G_T(\theta)$ is continuous and monotonically decreasing in $\theta$, $E[G_T(\theta)] \neq 0$ for $\theta \neq \theta_0$, and for every fixed $\theta$ we have $T^{-1}G_T(\theta) - E[T^{-1}G_T(\theta)] \xrightarrow{P} 0$, then $\hat{\theta}_T \xrightarrow{P} \theta_0$.

(ii) If $\sup_{\theta}\big|T^{-1}G_T(\theta) - E[T^{-1}G_T(\theta)]\big| \xrightarrow{P} 0$ and $E[T^{-1}G_T(\theta)]$ is uniquely zero at $\theta = \theta_0$, then $\hat{\theta}_T \xrightarrow{P} \theta_0$.

PROOF. The proof of case (i) is relatively straightforward (see also page 38 in Davison (2002)), and is best understood by making a plot of $G_T(\theta)$ against $\theta$, marking the points $\theta_0 - \varepsilon$ and $\theta_0$. We first note that since $E[G_T(\theta_0)] = 0$ and $E[G_T(\theta)]$ is monotonically decreasing in $\theta$ (because $G_T$ is monotonically decreasing for every $T$), for any fixed $\varepsilon > 0$ we have

$$T^{-1}G_T(\theta_0 - \varepsilon) \xrightarrow{P} E\big[T^{-1}G_T(\theta_0 - \varepsilon)\big] > 0. \qquad (6.7)$$

Now, since $G_T(\theta)$ is monotonically decreasing, we see that $\hat{\theta}_T \leq (\theta_0 - \varepsilon)$ implies $G_T(\hat{\theta}_T) \geq G_T(\theta_0 - \varepsilon)$ (and vice versa), hence

$$P\big(\hat{\theta}_T - (\theta_0 - \varepsilon) \leq 0\big) = P\big(G_T(\hat{\theta}_T) - G_T(\theta_0 - \varepsilon) \geq 0\big) = P\big(G_T(\theta_0 - \varepsilon) \leq 0\big),$$

where the last equality uses $G_T(\hat{\theta}_T) = 0$. But we have from (6.7) that $T^{-1}G_T(\theta_0 - \varepsilon)$ converges in probability to a strictly positive limit. Thus $P\big(G_T(\theta_0 - \varepsilon) \leq 0\big) \to 0$, and therefore $P\big(\hat{\theta}_T - (\theta_0 - \varepsilon) \leq 0\big) \to 0$ as $T \to \infty$. A similar argument can be used to show that $P\big(\hat{\theta}_T - (\theta_0 + \varepsilon) \geq 0\big) \to 0$ as $T \to \infty$. As the above is true for every $\varepsilon > 0$, together they imply that $\hat{\theta}_T \xrightarrow{P} \theta_0$ as $T \to \infty$.

The proof of (ii) is more involved, but essentially follows the lines of the proof of Theorem 4 (an exercise for those wanting to do some more theory).

We now show asymptotic normality, which will give us the variance of the limiting distribution of $\hat{\theta}_T$.

Theorem 6.2.2 Let us suppose that $\{Y_t\}$ are iid random variables and $G_T(\theta) = \sum_{t=1}^{T} g(Y_t;\theta)$. Suppose that $\hat{\theta}_T \xrightarrow{P} \theta_0$ and the first and second derivatives of $G_T(\cdot)$ have a finite expectation (we will assume that $\theta$ is a scalar to simplify notation). Then we have, as $T \to \infty$,

$$\sqrt{T}\big(\hat{\theta}_T - \theta_0\big) \xrightarrow{D} N\Bigg(0,\; \frac{\mathrm{var}\big(g(Y_t;\theta_0)\big)}{\Big[E\Big(\frac{\partial g(Y_t;\theta)}{\partial\theta}\big|_{\theta_0}\Big)\Big]^2}\Bigg).$$

PROOF. We use the standard Taylor expansion to prove the result (which you should be expert in by now). Using a Taylor expansion we have

$$G_T(\hat{\theta}_T) = G_T(\theta_0) + (\hat{\theta}_T - \theta_0)\frac{\partial G_T(\theta)}{\partial\theta}\Big|_{\bar{\theta}_T}, \qquad (6.8)$$

where $\bar{\theta}_T$ lies between $\hat{\theta}_T$ and $\theta_0$. Since $G_T(\hat{\theta}_T) = 0$, rearranging gives

$$\sqrt{T}\big(\hat{\theta}_T - \theta_0\big) = -\Big(E\Big[\frac{\partial g(Y_t;\theta)}{\partial\theta}\Big|_{\theta_0}\Big]\Big)^{-1}\frac{1}{\sqrt{T}}G_T(\theta_0) + o_p(1),$$

where the above is due to $\frac{1}{T}\frac{\partial G_T(\theta)}{\partial\theta}\big|_{\bar{\theta}_T} \xrightarrow{P} E\big[\frac{\partial g(Y_t;\theta)}{\partial\theta}\big|_{\theta_0}\big]$ as $T \to \infty$. Now, since $G_T(\theta_0) = \sum_t g(Y_t;\theta_0)$ is the sum of iid random variables, we have

$$\frac{1}{\sqrt{T}}G_T(\theta_0) \xrightarrow{D} N\big(0, \mathrm{var}(g(Y_t;\theta_0))\big) \qquad (6.9)$$

(since $E[g(Y_t;\theta_0)] = 0$). Therefore (6.8) and (6.9) together give

$$\sqrt{T}\big(\hat{\theta}_T - \theta_0\big) \xrightarrow{D} N\Bigg(0,\; \frac{\mathrm{var}\big(g(Y_t;\theta_0)\big)}{\Big[E\Big(\frac{\partial g(Y_t;\theta)}{\partial\theta}\big|_{\theta_0}\Big)\Big]^2}\Bigg),$$

as required.

Remark 6.2.1 In general, if $\{Y_t\}$ are not iid random variables then (under certain conditions) we have

$$\sqrt{T}\big(\hat{\theta}_T - \theta_0\big) \xrightarrow{D} N\Bigg(0,\; \frac{\mathrm{var}\big(T^{-1/2}G_T(\theta_0)\big)}{\Big[E\Big(T^{-1}\frac{\partial G_T(\theta)}{\partial\theta}\big|_{\theta_0}\Big)\Big]^2}\Bigg).$$

Example 6.2.4 (The Huber estimator) We describe the Huber estimator, which is a well-known estimator of the mean that is robust to outliers; the estimator can be written in terms of an estimating function. Let us suppose that $\{Y_t\}$ are iid random variables with mean $\theta$ and a density function which is symmetric about $\theta$. So that outliers do not unduly affect the estimate of the mean, a robust method of estimation is to truncate the residuals and define the function

$$g^{(c)}(Y_t;\theta) = \begin{cases} -c & Y_t < -c + \theta \\ Y_t - \theta & -c + \theta \leq Y_t \leq c + \theta \\ c & Y_t > c + \theta. \end{cases}$$

The estimating equation is

$$G_{c,T}(\theta) = \sum_{t=1}^{T} g^{(c)}(Y_t;\theta),$$

and we use as an estimator of $\theta$ the $\hat{\theta}_T$ which solves $G_{c,T}(\hat{\theta}_T) = 0$.
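A minimal sketch of the Huber estimating equation (scipy assumed; the contaminated data and the cut-off $c = 1.345$, a common choice when the scale is roughly one, are illustrative assumptions, not part of the notes). Since $G_{c,T}$ is continuous and monotonically decreasing in $\theta$, its root can be bracketed and found directly.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(4)
y = np.concatenate([rng.normal(loc=1.0, size=95),     # bulk of the data
                    rng.normal(loc=20.0, size=5)])    # a few gross outliers

def huber_psi(y, theta, c):
    """g^(c)(y; theta): the residual y - theta, truncated to [-c, c]."""
    return np.clip(y - theta, -c, c)

def G(theta, y, c=1.345):
    """Huber estimating function G_{c,T}(theta)."""
    return np.sum(huber_psi(y, theta, c))

# G is continuous and monotonically decreasing in theta, so bracket and solve.
theta_hat = brentq(G, y.min(), y.max(), args=(y,))
print(theta_hat, np.mean(y), np.median(y))   # Huber estimate vs sample mean vs median
```

With the outliers present, the Huber estimate stays close to the centre of the bulk of the data, unlike the sample mean.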

(i) In the case that $c = \infty$, we observe that $G_{\infty,T}(\theta) = \sum_{t=1}^{T}(Y_t - \theta)$, and the estimator is $\hat{\theta}_T = \bar{Y}$. Hence without truncation the estimator of the mean is the sample mean.

(ii) In the case that $c$ is small, then we have truncated many of the observations.

It is clear that

$$\frac{\mathrm{var}\big(g(Y_t;\theta_0)\big)}{\Big[E\Big(\frac{\partial g(Y_t;\theta)}{\partial\theta}\big|_{\theta_0}\Big)\Big]^2} \geq I(\theta_0)^{-1}$$

(where $I(\theta)$ is the Fisher information). Hence for extremely large sample sizes, the MLE is the best estimator.

6.2.1 Optimal estimating functions

As illustrated in Example 6.2.2(iii, iv), there can be several different estimators of the same parameters. A natural question is: which estimator should one use? We can answer this question by using the result in Theorem 6.2.2. Roughly speaking, Theorem 6.2.2 implies (though this is not true in the strictest sense) that

$$\mathrm{var}\big(\hat{\theta}_T\big) \approx \frac{1}{T}\,\frac{\mathrm{var}\big(g(Y_t;\theta_0)\big)}{\Big[E\Big(\frac{\partial g(Y_t;\theta)}{\partial\theta}\big|_{\theta_0}\Big)\Big]^2}. \qquad (6.10)$$

In the case that $\theta$ is a multivariate vector, the above generalises to

$$\mathrm{var}\big(\hat{\theta}_T\big) \approx \frac{1}{T}\Big(E\Big[\frac{\partial g(Y_t;\theta)}{\partial\theta}\Big|_{\theta_0}\Big]\Big)^{-1}\mathrm{var}\big(g(Y_t;\theta_0)\big)\Big(E\Big[\frac{\partial g(Y_t;\theta)}{\partial\theta}\Big|_{\theta_0}\Big]^{\top}\Big)^{-1}. \qquad (6.11)$$

Hence, we should choose the estimating function which leads to the smallest variance.

Example 6.2.5 Consider Example 6.2.2(iv), where we are estimating the Weibull parameters $\alpha$ and $\phi$. Different $r$ and $s$ give different estimating equations, but all estimate the same parameters. To decide which $r$ and $s$ to use, we calculate the asymptotic variances for different $r$ and $s$ and choose the pair with the smallest variance. Now substituting

$$E\Big[\frac{\partial g(X_t;\alpha,\phi)}{\partial(\alpha,\phi)}\Big] = \begin{pmatrix} r\alpha^{-2}\phi^{r}\Gamma'(1+r\alpha^{-1}) & -r\phi^{r-1}\Gamma(1+r\alpha^{-1}) \\ s\alpha^{-2}\phi^{s}\Gamma'(1+s\alpha^{-1}) & -s\phi^{s-1}\Gamma(1+s\alpha^{-1}) \end{pmatrix}$$

and

$$\mathrm{var}\big(g(X_t;\alpha,\phi)\big) = \begin{pmatrix} \mathrm{var}(X_t^r) & \mathrm{cov}(X_t^r, X_t^s) \\ \mathrm{cov}(X_t^r, X_t^s) & \mathrm{var}(X_t^s) \end{pmatrix}$$

(using that $E(X^r) = \phi^r\Gamma(1+r\alpha^{-1})$) into (6.11), for different $r$ and $s$ we can calculate the asymptotic variance and choose the estimating function with the smallest variance.
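The comparison in Example 6.2.5 can be sketched numerically. The code below is an illustration only (scipy assumed): it evaluates the sandwich (6.11) in closed form for a few $(r, s)$ pairs, using $E(X^k) = \phi^k\Gamma(1+k\alpha^{-1})$ for both the "bread" and the "meat", and $\Gamma'(z) = \Gamma(z)\psi(z)$ with $\psi$ the digamma function. The chosen $\alpha$, $\phi$ and $(r, s)$ pairs are arbitrary.

```python
import numpy as np
from scipy.special import gamma, digamma

alpha, phi = 1.5, 2.0                     # illustrative Weibull parameters

def m(k):
    """Theoretical moment E(X^k) = phi^k Gamma(1 + k/alpha)."""
    return phi**k * gamma(1 + k / alpha)

def asymptotic_cov(r, s):
    """Per-observation sandwich covariance (6.11) for the (r, s) equations."""
    # "Meat": variance-covariance matrix of (X^r, X^s), via E(X^k).
    S = np.array([[m(2 * r) - m(r)**2,     m(r + s) - m(r) * m(s)],
                  [m(r + s) - m(r) * m(s), m(2 * s) - m(s)**2]])
    # "Bread": E[dg/d(alpha, phi)], using Gamma'(z) = Gamma(z) digamma(z).
    D = np.array([[(r / alpha**2) * m(r) * digamma(1 + r / alpha), -(r / phi) * m(r)],
                  [(s / alpha**2) * m(s) * digamma(1 + s / alpha), -(s / phi) * m(s)]])
    Dinv = np.linalg.inv(D)
    return Dinv @ S @ Dinv.T

for r, s in [(1, 2), (1, 3), (2, 3)]:
    # Diagonal entries: asymptotic variances (up to 1/T) of (alpha_hat, phi_hat).
    print((r, s), np.diag(asymptotic_cov(r, s)))
```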

We now consider an example which is a close precursor to GEE (generalised estimating equations). Usually GEE is a method for estimating the parameters in generalised linear models where there is dependence in the data.

Example 6.2.6 Suppose that $\{Y_t\}$ are independent random variables with means $\{\mu_t(\theta_0)\}$ and variances $\{V_t(\theta_0)\}$, where the parametric forms of $\mu_t(\cdot)$ and $V_t(\cdot)$ are known but $\theta_0$ is unknown. A possible estimating function is

$$G_T(\theta) = \sum_{t=1}^{T}\big(Y_t - \mu_t(\theta)\big),$$

where we note that $E[G_T(\theta_0)] = 0$. However, this function does not take the variance into account; instead let us consider the general weighted estimating function

$$G_T^{(W)}(\theta) = \sum_{t=1}^{T} w_t(\theta)\big(Y_t - \mu_t(\theta)\big).$$

Again we observe that $E[G_T^{(W)}(\theta_0)] = 0$, but we need to decide which weights $w_t(\theta)$ to use. Hence we choose the weights which minimise the asymptotic variance of the resulting estimator (cf. (6.10) and Remark 6.2.1). Since $\{Y_t\}$ are independent, we observe that

$$\mathrm{var}\big(G_T^{(W)}(\theta_0)\big) = \sum_{t=1}^{T} w_t(\theta_0)^2 V_t(\theta_0)$$

and

$$E\Big[\frac{\partial G_T^{(W)}(\theta)}{\partial\theta}\Big|_{\theta_0}\Big] = E\Big[\sum_{t=1}^{T}\big(w_t'(\theta_0)(Y_t - \mu_t(\theta_0)) - w_t(\theta_0)\mu_t'(\theta_0)\big)\Big] = -\sum_{t=1}^{T} w_t(\theta_0)\mu_t'(\theta_0).$$

Hence we have

$$\mathrm{var}\big(\hat{\theta}_T\big) \approx \frac{\sum_{t=1}^{T} w_t(\theta_0)^2 V_t(\theta_0)}{\big(\sum_{t=1}^{T} w_t(\theta_0)\mu_t'(\theta_0)\big)^2}.$$

Now we want to choose the weights, and thus the estimating function, which give the smallest variance; therefore we look for weights which minimise the above. Since the above is a ratio which is unchanged when all the weights are rescaled by a common factor, we can fix the denominator and carry out the minimisation using Lagrange multipliers. That is, we set $\big(\sum_t w_t(\theta_0)\mu_t'(\theta_0)\big)^2 = 1$ (which means $\sum_t w_t(\theta_0)\mu_t'(\theta_0) = 1$) and minimise

$$\sum_{t=1}^{T} w_t(\theta_0)^2 V_t(\theta_0) + \lambda\Big(\sum_{t=1}^{T} w_t(\theta_0)\mu_t'(\theta_0) - 1\Big)$$

with respect to $\{w_t(\theta_0)\}$ and $\lambda$. Partially differentiating the above with respect to $\{w_t(\theta_0)\}$ and $\lambda$ and setting the derivatives to zero leads to

$$w_t(\theta) = -\frac{\lambda\,\mu_t'(\theta)}{2V_t(\theta)} \propto \frac{\mu_t'(\theta)}{V_t(\theta)}.$$

Hence the optimal estimating function is

$$G_T^{(\mu,V)}(\theta) = \sum_{t=1}^{T}\frac{\mu_t'(\theta)}{V_t(\theta)}\big(Y_t - \mu_t(\theta)\big).$$
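A sketch of the optimal weighted estimating function for one concrete (and purely illustrative) model, not taken from the notes: a Poisson-type mean/variance relationship $\mu_t(\theta) = \exp(\theta x_t)$ and $V_t(\theta) = \mu_t(\theta)$, with scipy assumed. In this case $\mu_t'(\theta)/V_t(\theta) = x_t$, so the optimal estimating equation reduces to $\sum_t x_t\big(Y_t - \exp(\theta x_t)\big) = 0$.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(5)
theta0 = 0.8                                   # illustrative true parameter
x = rng.uniform(0.0, 1.0, size=500)
y = rng.poisson(lam=np.exp(theta0 * x))        # mean mu_t(theta0) = exp(theta0 x_t) = V_t(theta0)

def G_optimal(theta, y, x):
    """Optimal weighted estimating function sum_t mu_t'(theta)/V_t(theta) (Y_t - mu_t(theta)).
    Here mu_t(theta) = exp(theta x_t) and V_t(theta) = mu_t(theta), so the weight
    mu_t'(theta)/V_t(theta) collapses to x_t."""
    return np.sum(x * (y - np.exp(theta * x)))

theta_hat = brentq(G_optimal, -5.0, 5.0, args=(y, x))
print(theta_hat)                               # close to 0.8
```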

Observe that the above estimating equation resembles the derivative of the weighted least squares criterion

$$\sum_{t=1}^{T}\frac{1}{V_t(\theta)}\big(Y_t - \mu_t(\theta)\big)^2,$$

but an important difference is that the term involving the derivative of $V_t(\theta)$ has been ignored.

Remark 6.2.2 We conclude this section by mentioning that one generalisation of estimating equations is the generalised method of moments. We observe the random vectors $\{Y_t\}$ and it is known that there exists a function $g(\cdot\,;\theta)$ such that $E[g(Y_t;\theta_0)] = 0$. To estimate $\theta_0$, rather than find the solution of $\sum_t g(Y_t;\theta) = 0$, a matrix $M_T$ (typically positive definite) is defined and the parameter which minimises

$$\Big(\sum_{t=1}^{T} g(Y_t;\theta)\Big)^{\top} M_T \Big(\sum_{t=1}^{T} g(Y_t;\theta)\Big)$$

is used as an estimator of $\theta_0$.
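A sketch of the GMM idea in Remark 6.2.2 (assumptions: scipy available, $M_T$ taken to be the identity, three Weibull moment conditions so the system is over-identified; the data, parameter values and log-parameterisation are illustrative choices):

```python
import numpy as np
from scipy.special import gamma
from scipy.optimize import minimize

rng = np.random.default_rng(6)
x = rng.weibull(a=1.5, size=5000) * 2.0            # shape alpha = 1.5, scale phi = 2.0 (illustrative)

def gbar(alpha, phi, x, powers=(1, 2, 3)):
    """Averaged moment conditions (1/T) sum_t g(X_t; theta), one per power r."""
    return np.array([np.mean(x**r) - phi**r * gamma(1 + r / alpha) for r in powers])

def gmm_objective(log_params, x):
    """Quadratic form gbar' M_T gbar with M_T = identity; log-parameterised
    so that alpha and phi stay positive during the search."""
    alpha, phi = np.exp(log_params)
    g = gbar(alpha, phi, x)
    return g @ g

res = minimize(gmm_objective, x0=np.log([1.0, 1.0]), args=(x,), method="Nelder-Mead")
print(np.exp(res.x))                               # close to (1.5, 2.0)
```

Taking $M_T$ to be the identity is the simplest choice; in the GMM literature an estimate of the inverse covariance of the moment conditions is typically used instead, in the same spirit as the optimal weights of Example 6.2.6.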