Chapter 6: The loss function and estimating equations

6.1 Loss functions

Up until now our main focus has been on parameter estimation via maximum likelihood. However, the negative log-likelihood is one member of a class of criteria known as loss functions. Loss functions are usually distances, such as the $\ell_1$ and $\ell_2$ distance. Typically we estimate a parameter by minimising the loss function, using as the estimator the parameter which minimises the loss. Usually (but not always) the way to minimise the loss function is to differentiate it and equate the derivative to zero. This motivates another means of estimating parameters: finding the solution of an equation (such functions are usually called estimating functions). For many examples there is an estimating function which corresponds to a loss function, and a loss function which corresponds to an estimating equation, but this will not always be the case.

Example 6.1.1 (Examples where the loss function does not have a derivative) Consider the Laplacian (also known as the double exponential) distribution, which is defined as
\[
f(y;\theta,\mu) = \frac{1}{2\theta}\exp\big(-|y-\mu|/\theta\big) = \begin{cases} \frac{1}{2\theta}\exp\big((y-\mu)/\theta\big) & y < \mu \\ \frac{1}{2\theta}\exp\big(-(y-\mu)/\theta\big) & y \ge \mu. \end{cases}
\]
We observe $\{Y_t\}$ and our objective is to estimate the location parameter $\mu$; for now the scale parameter $\theta$ is not of interest. The log-likelihood is
\[
\mathcal{L}_n(\mu) = -n\log 2\theta - \frac{1}{\theta}\sum_{i=1}^{n} |Y_i - \mu|.
\]
We maximise the above to estimate $\mu$. We see that this is equivalent to minimising the loss function
\[
L_n(\mu) = \sum_{i=1}^{n} |Y_i - \mu| = \sum_{Y_i > \mu}(Y_i - \mu) + \sum_{Y_i \le \mu}(\mu - Y_i).
\]
If we make a plot of $L_n$ over $\mu$, and consider how $L_n$ behaves at the ordered observations $\{Y_{(t)}\}$, we see that it is piecewise linear (that is, it is continuous but non-differentiable at the points $Y_{(t)}$). On closer inspection (if $n$ is odd) we see that $L_n$ has its minimum at $\mu = Y_{((n+1)/2)}$, which is the sample median (to prove this to yourself, take an example of four observations and construct $L_n$; you can then extend this argument via a pseudo-induction method). Heuristically, the derivative of the loss function is (number of $Y_i$ less than $\mu$) minus (number of $Y_i$ greater than or equal to $\mu$); this reasoning gives another argument as to why the sample median minimises the loss. Note that it is a little ambiguous what the minimum is when $n$ is even.

In summary, the normal distribution gives rise to the $\ell_2$-loss function and the sample mean. In contrast, the Laplacian gives rise to the $\ell_1$-loss function and the sample median.

Consider the generalisation of the Laplacian, usually called the asymmetric Laplacian, which is defined as
\[
f(y;\theta,\mu) = \begin{cases} \frac{p(1-p)}{\theta}\exp\big((1-p)(y-\mu)/\theta\big) & y < \mu \\ \frac{p(1-p)}{\theta}\exp\big(-p(y-\mu)/\theta\big) & y \ge \mu, \end{cases}
\]
where $0 < p < 1$. Up to constants, the corresponding loss function is
\[
L_n(\mu) = \sum_{Y_i > \mu} p(Y_i - \mu) + \sum_{Y_i \le \mu}(1-p)(\mu - Y_i).
\]
Using similar arguments to those above, it can be shown that the minimum of $L_n$ is approximately the $p$th sample quantile.

6.2 Estimating functions

See also Section 7.2 in Davison (2002) and Lars Peter Hansen (1982, Econometrica). Estimating functions are a unification and generalisation of maximum likelihood methods and the method of moments. They are close cousins of the generalised method of moments and the generalised estimating equation. We first consider a few examples and will later describe a feature common to all these examples.
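Before turning to the examples, the two minimisation claims in Example 6.1.1 (the $\ell_1$ loss is minimised at the sample median, and the asymmetric loss near the $p$th sample quantile) can be checked numerically. The following is a minimal sketch, not part of the original notes: it evaluates both losses on simulated $N(0,1)$ data over a grid of candidate values (the sample size, grid and seed are arbitrary choices).

```python
import random

def l1_loss(theta, ys):
    # Example 6.1.1: L_n(theta) = sum_i |Y_i - theta|
    return sum(abs(y - theta) for y in ys)

def quantile_loss(theta, ys, p):
    # asymmetric Laplacian loss: weight p above theta, (1 - p) below
    return sum(p * (y - theta) if y > theta else (1 - p) * (theta - y)
               for y in ys)

random.seed(0)
ys = [random.gauss(0.0, 1.0) for _ in range(401)]   # n odd
grid = [i / 200 for i in range(-700, 701)]          # candidate theta values

theta_l1 = min(grid, key=lambda t: l1_loss(t, ys))
theta_q90 = min(grid, key=lambda t: quantile_loss(t, ys, p=0.9))

srt = sorted(ys)
median = srt[len(ys) // 2]        # sample median (n odd)
q90 = srt[int(0.9 * len(ys))]     # crude 0.9 sample quantile
```

The grid minimiser of the $\ell_1$ loss lands on the sample median, and the asymmetric-loss minimiser on (essentially) the 0.9 sample quantile, as the heuristic derivative argument predicts.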
Example 6.2.1 (i) Let us suppose that $\{Y_t\}$ are iid random variables with $Y_t \sim N(\mu, \sigma^2)$. The log-likelihood is proportional to
\[
\mathcal{L}_n(\mu,\sigma^2) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=1}^{n}(Y_t-\mu)^2.
\]
We know that to estimate $\mu$ and $\sigma^2$ we use the $\mu$ and $\sigma^2$ which are the solution of
\[
\frac{1}{\sigma^2}\sum_{t=1}^{n}(Y_t-\mu) = 0, \qquad -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{t=1}^{n}(Y_t-\mu)^2 = 0. \qquad (6.1)
\]

(ii) In general, suppose $\{Y_t\}$ are iid random variables with $Y_t \sim f(\cdot;\theta)$. The log-likelihood is $\mathcal{L}_n(\theta) = \sum_{t=1}^{n}\log f(Y_t;\theta)$. If the regularity conditions are satisfied, then to estimate $\theta$ we use the solution of
\[
\frac{\partial \mathcal{L}_n(\theta)}{\partial\theta} = 0. \qquad (6.2)
\]

(iii) Let us suppose that $\{Y_t\}$ are iid random variables with a Weibull distribution
\[
f(x;\alpha,\phi) = \Big(\frac{\alpha}{\phi}\Big)\Big(\frac{x}{\phi}\Big)^{\alpha-1}\exp\big(-(x/\phi)^{\alpha}\big),
\]
where $\alpha, \phi > 0$. We know that (see Section 9, Example 9.2) $E(Y) = \phi\Gamma(1+\alpha^{-1})$ and $E(Y^2) = \phi^2\Gamma(1+2\alpha^{-1})$. Therefore $E(Y) - \phi\Gamma(1+\alpha^{-1}) = 0$ and $E(Y^2) - \phi^2\Gamma(1+2\alpha^{-1}) = 0$. Hence by solving
\[
\frac{1}{n}\sum_{t=1}^{n} Y_t - \phi\Gamma(1+\alpha^{-1}) = 0, \qquad \frac{1}{n}\sum_{t=1}^{n} Y_t^2 - \phi^2\Gamma(1+2\alpha^{-1}) = 0, \qquad (6.3)
\]
we obtain estimators of $\alpha$ and $\phi$. This is essentially a method-of-moments estimator of the parameters in a Weibull distribution.

(iv) We can generalise the above. It can be shown that $E(Y^r) = \phi^r\Gamma(1+r\alpha^{-1})$. Therefore, for any distinct $r$ and $s$ we can estimate $\alpha$ and $\phi$ using the solution of
\[
\frac{1}{n}\sum_{t=1}^{n} Y_t^r - \phi^r\Gamma(1+r\alpha^{-1}) = 0, \qquad \frac{1}{n}\sum_{t=1}^{n} Y_t^s - \phi^s\Gamma(1+s\alpha^{-1}) = 0. \qquad (6.4)
\]

(v) Consider the simple linear regression $Y_t = \alpha x_t + \varepsilon_t$, with $E(\varepsilon_t) = 0$ and $var(\varepsilon_t) < \infty$; the least squares estimator of $\alpha$ is the solution of
\[
\sum_{t=1}^{n}(Y_t - a x_t)x_t = 0. \qquad (6.5)
\]
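Equation (6.3) can be solved numerically. Below is a sketch (an illustration with arbitrary "true" parameter values, not part of the notes) which exploits that the ratio $E(Y^2)/E(Y)^2 = \Gamma(1+2/\alpha)/\Gamma(1+1/\alpha)^2$ does not depend on $\phi$, so $\alpha$ can be found by one-dimensional bisection and $\phi$ recovered from the first moment.

```python
import math
import random

def weibull_mom(xs):
    """Method-of-moments estimates (alpha_hat, phi_hat) for the Weibull
    density f(x) = (alpha/phi)(x/phi)^(alpha-1) exp(-(x/phi)^alpha),
    using E(X^r) = phi^r Gamma(1 + r/alpha) with r = 1, 2."""
    n = len(xs)
    m1 = sum(xs) / n
    m2 = sum(x * x for x in xs) / n
    target = m2 / m1 ** 2   # = Gamma(1+2/a)/Gamma(1+1/a)^2, free of phi

    def ratio(a):
        return math.gamma(1 + 2 / a) / math.gamma(1 + 1 / a) ** 2

    # ratio(a) is decreasing in a, so solve ratio(a) = target by bisection
    lo, hi = 0.1, 50.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if ratio(mid) > target:
            lo = mid
        else:
            hi = mid
    alpha = 0.5 * (lo + hi)
    phi = m1 / math.gamma(1 + 1 / alpha)
    return alpha, phi

random.seed(1)
# random.weibullvariate(scale, shape) draws from this parameterisation
xs = [random.weibullvariate(1.5, 2.0) for _ in range(20000)]
alpha_hat, phi_hat = weibull_mom(xs)
```

With 20,000 simulated observations the estimates land close to the (hypothetical) truth $\alpha = 2$, $\phi = 1.5$.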
We observe that all the above estimators can be written as the solution of a system of equations; see equations (6.1), (6.2), (6.3), (6.4) and (6.5). In other words, for each case we can define a random function $G_n(\theta)$ such that the above estimators are the solutions of $G_n(\hat\theta_n) = 0$. In the case that $\{Y_t\}$ are iid, $G_n(\theta) = \sum_{t=1}^{n} g(Y_t;\theta)$ for some function $g(\cdot;\theta)$. The function $G_n(\theta)$ is called an estimating function. All the functions $G_n$ defined above satisfy the unbiasedness property, which we define below.

Definition 6.2.1 An estimating function $G_n$ is called unbiased if, at the true parameter $\theta_0$, $G_n(\cdot)$ satisfies
\[
E\big(G_n(\theta_0)\big) = 0.
\]

Hence the estimating function is an alternative way of viewing parameter estimation. Until now, parameter estimators have been defined in terms of the maximum of the likelihood. However, an alternative method for defining an estimator is as the solution of an equation. For example, suppose that $\{Y_t\}$ are random variables whose distribution depends in some way on the parameter $\theta_0$. We want to estimate $\theta_0$, and we know that there exists a function $G$ such that $G(\theta_0) = 0$. Therefore, using the data $\{Y_t\}$ we can define a random function $G_n$ where $E(G_n(\theta)) = G(\theta)$. Hence we can use the parameter $\hat\theta_n$ which satisfies $G_n(\hat\theta_n) = 0$ as an estimator of $\theta_0$. We observe that such estimators include most maximum likelihood estimators and method-of-moments estimators.

Example 6.2.2 Based on the examples above we see that:

(i) The estimating function is
\[
G_n(\mu,\sigma^2) = \begin{pmatrix} \frac{1}{\sigma^2}\sum_{t=1}^{n}(Y_t-\mu) \\ -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{t=1}^{n}(Y_t-\mu)^2 \end{pmatrix}.
\]

(ii) The estimating function is $G_n(\theta) = \frac{\partial \mathcal{L}_n(\theta)}{\partial\theta}$.

(iii) The estimating function is
\[
G_n(\alpha,\phi) = \begin{pmatrix} \frac{1}{n}\sum_{t=1}^{n} Y_t - \phi\Gamma(1+\alpha^{-1}) \\ \frac{1}{n}\sum_{t=1}^{n} Y_t^2 - \phi^2\Gamma(1+2\alpha^{-1}) \end{pmatrix}.
\]

(iv) The estimating function is
\[
G_n(\alpha,\phi) = \begin{pmatrix} \frac{1}{n}\sum_{t=1}^{n} Y_t^r - \phi^r\Gamma(1+r\alpha^{-1}) \\ \frac{1}{n}\sum_{t=1}^{n} Y_t^s - \phi^s\Gamma(1+s\alpha^{-1}) \end{pmatrix}.
\]
(v) The estimating function is $G_n(a) = \sum_{t=1}^{n}(Y_t - a x_t)x_t$.

One advantage of this approach is that sometimes the solution of an estimating equation has a smaller finite sample variance than the MLE, even though asymptotically (under certain conditions) the MLE attains the Cramer-Rao bound (the smallest possible asymptotic variance). Moreover, MLE estimators are based on the assumption that the distribution is known (else the estimator is misspecified; see Section 9), whereas an estimating equation can sometimes be free of such assumptions. Recall from Example 6.2.1(v) that
\[
E\big((Y_t - \alpha x_t)x_t\big) = 0 \qquad (6.6)
\]
is true regardless of the distribution of $\{\varepsilon_t\}$, and is also true if $\{Y_t\}$ are dependent random variables. We recall that Example 6.2.1(v) is the least squares estimator, which is a consistent estimator of the parameters in a variety of settings (see Rao (1973), Linear Statistical Inference and its Applications, and your SA62 notes).

Example 6.2.3 In many statistical situations it is relatively straightforward to find a suitable estimating function, rather than the likelihood. Consider the time series $\{X_t\}$ which satisfies
\[
X_t = a_1 X_{t-1} + a_2 X_{t-2} + \sigma\varepsilon_t,
\]
where $\{\varepsilon_t\}$ are iid random variables. We do not know the distribution of $\varepsilon_t$, but because $\varepsilon_t$ is independent of $X_{t-1}$ and $X_{t-2}$, by multiplying the above equation by $X_{t-1}$ (and then by $X_{t-2}$) and taking expectations we have
\[
E(X_t X_{t-1}) = a_1 E(X_{t-1}^2) + a_2 E(X_{t-1}X_{t-2}), \qquad E(X_t X_{t-2}) = a_1 E(X_{t-1}X_{t-2}) + a_2 E(X_{t-2}^2).
\]
Since the above time series is stationary (we have not formally defined this, but basically it means the properties of $\{X_t\}$ do not vary over time), it can be shown that $\frac{1}{n}\sum_t X_t X_{t-r}$ is an estimator of $E(X_t X_{t-r})$ and that $E\big(\frac{1}{n}\sum_t X_t X_{t-r}\big) \approx E(X_t X_{t-r})$. Hence, replacing the expectations above with their estimators, we obtain the estimating equations
\[
G_n(a_1,a_2) = \begin{pmatrix} \frac{1}{n}\sum_t X_t X_{t-1} - a_1 \frac{1}{n}\sum_t X_{t-1}^2 - a_2 \frac{1}{n}\sum_t X_{t-1}X_{t-2} \\ \frac{1}{n}\sum_t X_t X_{t-2} - a_1 \frac{1}{n}\sum_t X_{t-1}X_{t-2} - a_2 \frac{1}{n}\sum_t X_{t-2}^2 \end{pmatrix}.
\]

We now show that under certain conditions $\hat\theta_n$ is a consistent estimator of $\theta_0$.
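Before the theory, Example 6.2.3 can be made concrete: setting $G_n(a_1,a_2) = 0$ gives a $2\times 2$ linear system in $(a_1,a_2)$, which can be solved directly. The sketch below (not from the notes; the coefficients and sample size are arbitrary choices within the stationary region) simulates the model and solves the system by Cramer's rule.

```python
import random

random.seed(2)
a1, a2, sigma = 0.5, 0.2, 1.0      # hypothetical true values (stationary)

# simulate X_t = a1*X_{t-1} + a2*X_{t-2} + sigma*eps_t, discarding a burn-in
n, burn = 20000, 500
x = [0.0, 0.0]
for _ in range(n + burn):
    x.append(a1 * x[-1] + a2 * x[-2] + sigma * random.gauss(0.0, 1.0))
x = x[burn:]

def m(i, j):
    # sample moment (1/n) sum_t X_{t-i} X_{t-j}
    return sum(x[t - i] * x[t - j] for t in range(2, len(x))) / (len(x) - 2)

c01, c02, c11, c12, c22 = m(0, 1), m(0, 2), m(1, 1), m(1, 2), m(2, 2)

# G_n(a1, a2) = 0 is the linear system
#   c01 = a1*c11 + a2*c12
#   c02 = a1*c12 + a2*c22
# solved here by Cramer's rule
det = c11 * c22 - c12 * c12
a1_hat = (c01 * c22 - c02 * c12) / det
a2_hat = (c02 * c11 - c01 * c12) / det
```

Note that no distributional assumption on $\varepsilon_t$ was used; the Gaussian draw above is only a convenient way to simulate data.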
Theorem 6.2.1 Suppose that $G_n(\theta)$ is an unbiased estimating function, where $G_n(\hat\theta_n) = 0$ and $E(G_n(\theta_0)) = 0$.

(i) If $\theta$ is a scalar, for every $n$ $G_n(\theta)$ is a continuous, monotonically decreasing function in $\theta$, and for every $\theta$ we have $G_n(\theta) - E(G_n(\theta)) \xrightarrow{P} 0$ with $E(G_n(\theta)) \ne 0$ for $\theta \ne \theta_0$, then $\hat\theta_n \xrightarrow{P} \theta_0$.

(ii) If we can show that $\sup_\theta |G_n(\theta) - E(G_n(\theta))| \xrightarrow{P} 0$ and $E(G_n(\theta))$ is uniquely zero at $\theta_0$, then $\hat\theta_n \xrightarrow{P} \theta_0$.

PROOF. The proof of case (i) is relatively straightforward (see also page 38 in Davison (2002)), and is best understood by making a plot of $G_n(\theta)$ for $\theta$ in a neighbourhood of $\theta_0$. We first note that since $E(G_n(\theta_0)) = 0$ and $E(G_n(\theta))$ is decreasing, for any fixed $\varepsilon > 0$ we have
\[
G_n(\theta_0 - \varepsilon) \xrightarrow{P} E\big(G_n(\theta_0 - \varepsilon)\big) > 0. \qquad (6.7)
\]
Now, since $G_n(\theta)$ is monotonically decreasing and $G_n(\hat\theta_n) = 0$, we see that $\hat\theta_n \le \theta_0 - \varepsilon$ implies $G_n(\theta_0 - \varepsilon) \le G_n(\hat\theta_n) = 0$ (and vice versa), hence
\[
P\big(\hat\theta_n \le \theta_0 - \varepsilon\big) = P\big(G_n(\theta_0 - \varepsilon) \le 0\big).
\]
But we have from (6.7) that $G_n(\theta_0 - \varepsilon)$ converges in probability to a strictly positive limit, thus $P(G_n(\theta_0 - \varepsilon) \le 0) \to 0$ and so $P(\hat\theta_n \le \theta_0 - \varepsilon) \to 0$ as $n \to \infty$. A similar argument can be used to show that $P(\hat\theta_n \ge \theta_0 + \varepsilon) \to 0$ as $n \to \infty$. As the above is true for all $\varepsilon > 0$, together they imply that $\hat\theta_n \xrightarrow{P} \theta_0$.

The proof of (ii) is more involved, but essentially follows the lines of the proof of Theorem 4 (an exercise for those wanting to do some more theory).

We now show asymptotic normality, which will give us the variance of the limiting distribution of $\hat\theta_n$.

Theorem 6.2.2 Let us suppose that $\{Y_t\}$ are iid random variables and $G_n(\theta) = \frac{1}{n}\sum_{t=1}^{n} g(Y_t;\theta)$. Suppose that $\hat\theta_n \xrightarrow{P} \theta_0$ and the first and second derivatives of $g(\cdot;\theta)$ have finite expectation (we will assume that $\theta$ is a scalar to simplify notation). Then we have, as $n \to \infty$,
\[
\sqrt{n}\big(\hat\theta_n - \theta_0\big) \xrightarrow{D} N\Bigg(0, \frac{var\big(g(Y_t;\theta_0)\big)}{E\big(\frac{\partial g(Y_t;\theta)}{\partial\theta}\big|_{\theta_0}\big)^2}\Bigg).
\]
PROOF. We use the standard Taylor expansion to prove the result (which you should be expert in by now). Using a Taylor expansion we have
\[
G_n(\hat\theta_n) = G_n(\theta_0) + (\hat\theta_n - \theta_0)\frac{\partial G_n(\theta)}{\partial\theta}\Big|_{\bar\theta_n}, \qquad (6.8)
\]
where $\bar\theta_n$ lies between $\hat\theta_n$ and $\theta_0$. Since $G_n(\hat\theta_n) = 0$, this gives
\[
\sqrt{n}\big(\hat\theta_n - \theta_0\big) = -\Big(E\Big(\frac{\partial g(Y_t;\theta)}{\partial\theta}\Big|_{\theta_0}\Big)\Big)^{-1}\sqrt{n}\,G_n(\theta_0) + o_p(1),
\]
where the above is due to $\frac{\partial G_n(\theta)}{\partial\theta}\big|_{\bar\theta_n} \xrightarrow{P} E\big(\frac{\partial g(Y_t;\theta)}{\partial\theta}\big|_{\theta_0}\big)$ as $n \to \infty$. Now, since $G_n(\theta_0) = \frac{1}{n}\sum_t g(Y_t;\theta_0)$ is the average of iid random variables with $E(g(Y_t;\theta_0)) = 0$, the central limit theorem gives
\[
\sqrt{n}\,G_n(\theta_0) \xrightarrow{D} N\big(0, var(g(Y_t;\theta_0))\big). \qquad (6.9)
\]
Therefore (6.8) and (6.9) together give
\[
\sqrt{n}\big(\hat\theta_n - \theta_0\big) \xrightarrow{D} N\Bigg(0, \frac{var\big(g(Y_t;\theta_0)\big)}{E\big(\frac{\partial g(Y_t;\theta)}{\partial\theta}\big|_{\theta_0}\big)^2}\Bigg),
\]
as required.

Remark 6.2.1 In general, if $\{Y_t\}$ are not iid random variables then (under certain conditions, and with suitable normalisation) we have
\[
\sqrt{n}\big(\hat\theta_n - \theta_0\big) \xrightarrow{D} N\Bigg(0, \frac{n\,var\big(G_n(\theta_0)\big)}{E\big(\frac{\partial G_n(\theta)}{\partial\theta}\big|_{\theta_0}\big)^2}\Bigg).
\]

Example 6.2.4 (The Huber estimator) We describe the Huber estimator, which is a well-known estimator of the mean that is robust to outliers. The estimator can be written as the solution of an estimating function. Let us suppose that $\{Y_t\}$ are iid random variables with mean $\theta$ and a density function which is symmetric about $\theta$. So that outliers do not affect the estimate too strongly, a robust method of estimation is to truncate large residuals and define the function
\[
g^{(c)}(Y_t;\theta) = \begin{cases} -c & Y_t < \theta - c \\ Y_t - \theta & \theta - c \le Y_t \le \theta + c \\ c & Y_t > \theta + c. \end{cases}
\]
The estimating equation is
\[
G_{n,c}(\theta) = \sum_{t=1}^{n} g^{(c)}(Y_t;\theta),
\]
and we use as an estimator of $\theta$ the $\hat\theta_n$ which solves $G_{n,c}(\hat\theta_n) = 0$.
(i) In the case that $c = \infty$, we observe that $G_{n,\infty}(\theta) = \sum_{t=1}^{n}(Y_t - \theta)$, and the estimator is $\hat\theta_n = \bar{Y}$. Hence without truncation, the estimator of the mean is the sample mean.

(ii) In the case that $c$ is small, we have truncated many observations.

It is clear that
\[
\frac{var\big(g(Y_t;\theta_0)\big)}{E\big(\frac{\partial g(Y_t;\theta)}{\partial\theta}\big|_{\theta_0}\big)^2} \ge I(\theta_0)^{-1}
\]
(where $I(\theta)$ is the Fisher information). Hence for extremely large sample sizes, the MLE is the best estimator.

6.2.1 Optimal estimating functions

As illustrated in Example 6.2.2(iii,iv), there can be several different estimators of the same parameters. A natural question is: which estimator should one use? We can answer this question by using the result in Theorem 6.2.2. Roughly speaking, Theorem 6.2.2 implies (though this is not true in the strictest sense) that
\[
var(\hat\theta_n) \approx \frac{1}{n}\,\frac{var\big(g(Y_t;\theta_0)\big)}{E\big(\frac{\partial g(Y_t;\theta)}{\partial\theta}\big|_{\theta_0}\big)^2}. \qquad (6.10)
\]
In the case that $\theta$ is a multivariate vector, the above generalises to
\[
var(\hat\theta_n) \approx \frac{1}{n}\,E\Big(\frac{\partial g(Y_t;\theta)}{\partial\theta}\Big)^{-1} var\big(g(Y_t;\theta_0)\big)\,E\Big(\frac{\partial g(Y_t;\theta)}{\partial\theta}\Big)^{-\mathsf{T}}. \qquad (6.11)
\]
Hence, we should choose the estimating function which leads to the smallest variance.

Example 6.2.5 Consider Example 6.2.2(iv), where we are estimating the Weibull parameters $\phi$ and $\alpha$. Different $s$ and $r$ give different equations, but all are estimating the same parameters. To decide which $s$ and $r$ to use, we calculate the variances for different $s$ and $r$ and choose the pair with the smallest variance. Substituting
\[
E\Big(\frac{\partial g(Y_t;\alpha,\phi)}{\partial(\alpha,\phi)}\Big) = \begin{pmatrix} \frac{r}{\alpha^2}\phi^r\Gamma'(1+r\alpha^{-1}) & -r\phi^{r-1}\Gamma(1+r\alpha^{-1}) \\ \frac{s}{\alpha^2}\phi^s\Gamma'(1+s\alpha^{-1}) & -s\phi^{s-1}\Gamma(1+s\alpha^{-1}) \end{pmatrix}
\]
and (using that $E(Y^r) = \phi^r\Gamma(1+r\alpha^{-1})$)
\[
var\big(g(Y_t;\alpha,\phi)\big) = \begin{pmatrix} var(Y_t^r) & cov(Y_t^r, Y_t^s) \\ cov(Y_t^r, Y_t^s) & var(Y_t^s) \end{pmatrix}
\]
into (6.11), for different $r$ and $s$ we can calculate the variance and choose the estimating function with the smallest variance.

We now consider an example which is a close precursor to GEE (generalised estimating equations). Usually GEE is a method for estimating the parameters in generalised linear models where there is dependence in the data.
Example 6.2.6 Suppose that $\{Y_t\}$ are independent random variables with means $\{\mu_t(\theta_0)\}$ and variances $\{V_t(\theta_0)\}$, where the parametric forms of $\{\mu_t(\cdot)\}$ and $\{V_t(\cdot)\}$ are known but $\theta_0$ is unknown. A possible estimating function is
\[
G_n(\theta) = \sum_{t=1}^{n}\big(Y_t - \mu_t(\theta)\big),
\]
where we note that $E(G_n(\theta_0)) = 0$. However, this function does not take the variance into account; instead let us consider the general weighted function
\[
G_n^{(W)}(\theta) = \sum_{t=1}^{n} w_t(\theta)\big(Y_t - \mu_t(\theta)\big).
\]
Again we observe that $E(G_n^{(W)}(\theta_0)) = 0$. But we need to know which weights $w_t(\theta)$ to use; hence we choose the weights which minimise the variance in (6.10). Since $\{Y_t\}$ are independent we observe that
\[
var\big(G_n^{(W)}(\theta_0)\big) = \sum_{t=1}^{n} w_t(\theta_0)^2 V_t(\theta_0)
\]
and
\[
E\Big(\frac{\partial G_n^{(W)}(\theta)}{\partial\theta}\Big|_{\theta_0}\Big) = E\Big(\sum_{t=1}^{n} w_t'(\theta_0)\big(Y_t - \mu_t(\theta_0)\big) - \sum_{t=1}^{n} w_t(\theta_0)\mu_t'(\theta_0)\Big) = -\sum_{t=1}^{n} w_t(\theta_0)\mu_t'(\theta_0).
\]
Hence we have
\[
var(\hat\theta_n) \approx \frac{\sum_{t=1}^{n} w_t(\theta_0)^2 V_t(\theta_0)}{\big(\sum_{t=1}^{n} w_t(\theta_0)\mu_t'(\theta_0)\big)^2}.
\]
Now we want to choose the weights, and thus the estimating function, with the smallest variance; therefore we look for weights which minimise the above ratio. Since the ratio is unchanged when all the weights are rescaled, we may do the minimisation using Lagrange multipliers: set $\big(\sum_t w_t(\theta_0)\mu_t'(\theta_0)\big)^2 = 1$ (which means $\sum_t w_t(\theta_0)\mu_t'(\theta_0) = 1$) and minimise
\[
\sum_{t=1}^{n} w_t(\theta_0)^2 V_t(\theta_0) + \lambda\Big(\sum_{t=1}^{n} w_t(\theta_0)\mu_t'(\theta_0) - 1\Big)
\]
with respect to $\{w_t(\theta_0)\}$ and $\lambda$. Partially differentiating the above with respect to each $w_t(\theta_0)$ and $\lambda$ and setting the derivatives to zero leads to
\[
w_t(\theta) = -\frac{\lambda\,\mu_t'(\theta)}{2V_t(\theta)} \propto \frac{\mu_t'(\theta)}{V_t(\theta)}.
\]
Hence the optimal estimating function is
\[
G_n^{(\mu,V)}(\theta) = \sum_{t=1}^{n}\frac{\mu_t'(\theta)}{V_t(\theta)}\big(Y_t - \mu_t(\theta)\big).
\]
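To make the optimal weighting concrete, here is a small sketch (not from the notes) for the hypothetical model $\mu_t(\theta) = \exp(\theta x_t)$ with $V_t(\theta) = \mu_t(\theta)$, as would hold for Poisson counts. In that case $\frac{\mu_t'}{V_t}(Y_t - \mu_t) = x_t(Y_t - e^{\theta x_t})$, so the optimal estimating function coincides with the Poisson score; for positive covariates it is decreasing in $\theta$, so its root can be found by bisection.

```python
import math
import random

random.seed(5)
theta0 = 0.7   # hypothetical true parameter
xs = [0.1 + 0.9 * random.random() for _ in range(2000)]   # positive covariates

def sample_poisson(lam):
    # inverse-transform Poisson sampler (the stdlib has no Poisson draw)
    u, k, p = random.random(), 0, math.exp(-lam)
    cdf = p
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

ys = [sample_poisson(math.exp(theta0 * x)) for x in xs]

def G_opt(theta):
    # optimal weights w_t = mu_t'/V_t give g_t = x_t * (Y_t - exp(theta*x_t)),
    # since mu_t(theta) = V_t(theta) = exp(theta * x_t)
    return sum(x * (y - math.exp(theta * x)) for x, y in zip(xs, ys))

lo, hi = -5.0, 5.0   # G_opt is decreasing in theta, so bisect its root
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if G_opt(mid) > 0:
        lo = mid
    else:
        hi = mid
theta_hat = 0.5 * (lo + hi)
```

Only the first two moments of $Y_t$ were used here; the Poisson draws merely supply data consistent with the assumed mean-variance relationship.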
Observe that the optimal estimating function $G_n^{(\mu,V)}$ resembles the derivative of the weighted least squares criterion
\[
\sum_{t=1}^{n}\frac{1}{V_t(\theta)}\big(Y_t - \mu_t(\theta)\big)^2,
\]
but an important difference is that the term involving the derivative of $V_t(\theta)$ has been ignored.

Remark 6.2.2 We conclude this section by mentioning that one generalisation of estimating equations is the generalised method of moments. We observe the random vectors $\{Y_t\}$ and it is known that there exists a function $g(\cdot;\theta)$ such that $E(g(Y_t;\theta_0)) = 0$. To estimate $\theta_0$, rather than find the solution of $\sum_t g(Y_t;\theta) = 0$, a positive definite matrix $M$ is defined, and the parameter which minimises
\[
\Big(\sum_{t=1}^{n} g(Y_t;\theta)\Big)^{\mathsf{T}} M \Big(\sum_{t=1}^{n} g(Y_t;\theta)\Big)
\]
is used as an estimator of $\theta_0$.
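A minimal sketch of the generalised method of moments just described, reusing the two Weibull moment conditions of Example 6.2.2(iii) with $M$ set to the identity (a crude grid search stands in for a proper numerical optimiser; all numerical choices are arbitrary):

```python
import math
import random

random.seed(6)
# simulated data from a Weibull with hypothetical shape 2.0, scale 1.5
xs = [random.weibullvariate(1.5, 2.0) for _ in range(5000)]
m1 = sum(xs) / len(xs)
m2 = sum(x * x for x in xs) / len(xs)

def gbar(alpha, phi):
    # sample average of the two moment conditions g(Y; alpha, phi)
    return (m1 - phi * math.gamma(1 + 1 / alpha),
            m2 - phi ** 2 * math.gamma(1 + 2 / alpha))

def objective(alpha, phi, M=((1.0, 0.0), (0.0, 1.0))):
    # GMM criterion: quadratic form gbar' M gbar (identity weight matrix)
    g1, g2 = gbar(alpha, phi)
    return (g1 * (M[0][0] * g1 + M[0][1] * g2)
            + g2 * (M[1][0] * g1 + M[1][1] * g2))

# crude grid search over (alpha, phi) in [0.5, 4) with step 0.02
best = min(((a / 50, p / 50) for a in range(25, 200) for p in range(25, 200)),
           key=lambda ap: objective(ap[0], ap[1]))
alpha_hat, phi_hat = best
```

With two moment conditions and two parameters both conditions can be satisfied simultaneously, so the minimiser coincides (up to the grid resolution) with the method-of-moments solution; the weight matrix $M$ matters when there are more conditions than parameters.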