The Maximum Likelihood Estimator


Chapter 4
The Maximum Likelihood Estimator

4.1 The maximum likelihood estimator

As illustrated for the exponential family of distributions discussed above, the maximum likelihood estimator of $\theta_0$ (the true parameter) is defined as
$$\hat{\theta}_T = \arg\max_{\theta\in\Theta}\mathcal{L}_T(X;\theta) = \arg\max_{\theta\in\Theta}\mathcal{L}_T(\theta).$$
Often we find that $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta=\hat\theta_T}=0$, hence the solution can be obtained by solving the derivative of the log-likelihood (often called the score function). However, if $\theta_0$ lies on or close to the boundary of the parameter space this will not necessarily be true. Below we consider the sampling properties of $\hat\theta_T$ when the true parameter $\theta_0$ lies in the interior of the parameter space $\Theta$.

We note that the likelihood is invariant to transformations of the data. For example, if $X$ has the density $f(\cdot;\theta)$ and we define the transformed random variable $Z=g(X)$, where the function $g$ has an inverse (it is a 1-1 transformation), then it is easy to show that the density of $Z$ is $f(g^{-1}(z);\theta)\,\big|\frac{\partial g^{-1}(z)}{\partial z}\big|$. Therefore the likelihood of $\{Z_t=g(X_t)\}$ is
$$\prod_{t=1}^{T} f(g^{-1}(Z_t);\theta)\,\Big|\frac{\partial g^{-1}(z)}{\partial z}\Big|_{z=Z_t} = \prod_{t=1}^{T} f(X_t;\theta)\,\Big|\frac{\partial g^{-1}(z)}{\partial z}\Big|_{z=Z_t}.$$
Hence it is proportional to the likelihood of $\{X_t\}$, and the maximum of the likelihood of $\{Z_t=g(X_t)\}$ is the same as the maximum of the likelihood of $\{X_t\}$.

4.1.1 Evaluating the MLE: Examples

Example 4.1.1 $\{X_t\}$ are iid random variables which follow a Normal (Gaussian) distribution $\mathcal{N}(\mu,\sigma^2)$. The log-likelihood is proportional to
$$\mathcal{L}_T(X;\mu,\sigma^2) = -\frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=1}^{T}(X_t-\mu)^2.$$
Maximising the above with respect to $\mu$ and $\sigma^2$ gives $\hat\mu = \bar X$ and $\hat\sigma^2 = \frac{1}{T}\sum_{t=1}^{T}(X_t-\bar X)^2$.

Example 4.1.2 Question: $\{X_t\}$ are iid random variables which follow a Weibull distribution, which has the density $\frac{\alpha y^{\alpha-1}}{\theta^{\alpha}}\exp\big(-(y/\theta)^{\alpha}\big)$, $\theta,\alpha>0$. Suppose that $\alpha$ is known, but $\theta$ is unknown and we need to estimate it. What is the maximum likelihood estimator of $\theta$?

Solution: The log-likelihood of interest is proportional to
$$\mathcal{L}_T(X;\theta) = \sum_{t=1}^{T}\Big(\log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - \Big(\frac{Y_t}{\theta}\Big)^{\alpha}\Big) \propto \sum_{t=1}^{T}\Big(-\alpha\log\theta - \Big(\frac{Y_t}{\theta}\Big)^{\alpha}\Big).$$
The derivative of the log-likelihood with respect to $\theta$ is
$$\frac{\partial\mathcal{L}_T}{\partial\theta} = -\frac{T\alpha}{\theta} + \frac{\alpha}{\theta^{\alpha+1}}\sum_{t=1}^{T}Y_t^{\alpha} = 0.$$
Solving the above gives $\hat\theta_T = \big(\frac{1}{T}\sum_{t=1}^{T}Y_t^{\alpha}\big)^{1/\alpha}$.

Example 4.1.3 Notice that if $\alpha$ is given, an explicit solution for the maximum of the likelihood in the above example can be obtained. Consider instead the maximum of the likelihood with respect to both $\alpha$ and $\theta$, i.e.
$$\arg\max_{\theta,\alpha}\sum_{t=1}^{T}\Big(\log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - \Big(\frac{Y_t}{\theta}\Big)^{\alpha}\Big).$$

The derivatives of the log-likelihood are
$$\frac{\partial\mathcal{L}_T}{\partial\theta} = -\frac{T\alpha}{\theta} + \frac{\alpha}{\theta^{\alpha+1}}\sum_{t=1}^{T}Y_t^{\alpha} = 0,$$
$$\frac{\partial\mathcal{L}_T}{\partial\alpha} = \frac{T}{\alpha} + \sum_{t=1}^{T}\big(\log Y_t - \log\theta\big) - \sum_{t=1}^{T}\Big(\frac{Y_t}{\theta}\Big)^{\alpha}\log\Big(\frac{Y_t}{\theta}\Big) = 0.$$
It is clear that an explicit expression for the solution of the above does not exist, and we need to find alternative methods for finding a solution. Below we shall describe numerical routines which can be used in the maximisation. In special cases one can use other methods, such as the profile likelihood (we cover this later on).

Numerical routines

In an ideal world, to maximise a likelihood we would consider the derivative of the likelihood, solve $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta=\hat\theta_T}=0$, and an explicit expression would exist for this solution. In reality this rarely happens, as we illustrated in the section above. Usually we will be unable to obtain an explicit expression for the MLE. In such cases one has to do the maximisation using alternative, numerical methods. Typically it is relatively straightforward to maximise the likelihood of random variables which belong to the exponential family (numerical algorithms sometimes have to be used, but they tend to be fast and attain the maximum of the likelihood, not just a local maximum). However, the story becomes more complicated even if we consider mixtures of exponential family distributions: these do not belong to the exponential family, and can be difficult to maximise using conventional numerical routines. We give an example of such a distribution here.

Let us suppose that $\{X_t\}$ are iid random variables which follow the classical normal mixture distribution
$$f(y;\theta) = p\,f_1(y;\theta_1) + (1-p)\,f_2(y;\theta_2),$$
where $f_1$ is the density of the normal with mean $\mu_1$ and variance $\sigma_1^2$ and $f_2$ is the density of the normal with mean $\mu_2$ and variance $\sigma_2^2$. The log-likelihood is
$$\mathcal{L}_T(Y;\theta) = \sum_{t=1}^{T}\log\Bigg(\frac{p}{\sqrt{2\pi\sigma_1^2}}\exp\Big(-\frac{(X_t-\mu_1)^2}{2\sigma_1^2}\Big) + \frac{1-p}{\sqrt{2\pi\sigma_2^2}}\exp\Big(-\frac{(X_t-\mu_2)^2}{2\sigma_2^2}\Big)\Bigg).$$
Studying the above, it is clear that no explicit solution for the maximum exists. Hence one needs to use a numerical algorithm to maximise the above likelihood. We discuss a few such methods below.
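As a small illustration (my own sketch, not part of the notes), the two-component mixture log-likelihood above can be handed to a general-purpose optimiser. The simulated data, parameterisation and starting values below are illustrative choices; the point is only that several starting values help guard against a poor local maximum.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
# Illustrative data from a two-component normal mixture.
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

def neg_loglik(par):
    # par = (logit p, mu1, mu2, log sigma1, log sigma2); the transforms keep the
    # optimiser unconstrained while p stays in (0,1) and the sigmas stay positive.
    p = 1.0 / (1.0 + np.exp(-par[0]))
    mu1, mu2 = par[1], par[2]
    s1, s2 = np.exp(par[3]), np.exp(par[4])
    dens = p * norm.pdf(x, mu1, s1) + (1.0 - p) * norm.pdf(x, mu2, s2)
    return -np.sum(np.log(dens))

# Run from several starting values and keep the fit with the largest likelihood.
starts = [np.array([0.0, -1.0, 1.0, 0.0, 0.0]),
          np.array([0.0, 0.0, 3.0, 0.0, -1.0])]
fits = [minimize(neg_loglik, s, method="Nelder-Mead") for s in starts]
best = min(fits, key=lambda r: r.fun)
print(best.x, -best.fun)
```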

The Newton-Raphson routine

The Newton-Raphson routine is the standard method to numerically maximise the likelihood; this can often be done automatically in R by using the R functions optim or nlm. To apply Newton-Raphson we have to assume that the derivative of the likelihood exists (this is not always the case — think about $\ell_1$-norm based estimators!) and that the maximum lies inside the parameter space, such that $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta=\hat\theta_T}=0$. We choose an initial value $\theta_1$ and apply the routine
$$\theta_{n+1} = \theta_n - \Big(\frac{\partial^2\mathcal{L}_T(\theta)}{\partial\theta^2}\Big|_{\theta_n}\Big)^{-1}\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\Big|_{\theta_n}.$$
Where this routine comes from will be clear by using the Taylor expansion of $\mathcal{L}_T(\theta)$ about $\theta_0$ (see Section 4.1.3). If the likelihood has just one global maximum and no local maxima (for example when it is concave), then it is quite easy to maximise. If, on the other hand, the likelihood has a few local maxima and the initial value $\theta_1$ is not chosen close enough to the true maximum, then the routine may converge to a local maximum (not good!). In this case it may be a good idea to run the routine several times from several different initial values $\{\theta_1^{(i)}\}_i$. For each converged value $\hat\theta^{(i)}$ evaluate the likelihood $\mathcal{L}_T(\hat\theta^{(i)})$ and select the value which gives the largest likelihood. It is best to avoid these problems by starting with an informed choice of initial value. Implementing a Newton-Raphson routine without any thought can lead to estimators which take an incredibly long time to converge. If one carefully considers the likelihood, one can often shorten the convergence time by rewriting the likelihood and using faster methods (often based on the Newton-Raphson).

Iterative least squares

This is a method that we shall describe later, when we consider Generalised Linear Models. As the name suggests, the algorithm has to be iterated; at each step weighted least squares is implemented (see later in the course).

The EM-algorithm

This works by the introduction of dummy variables, which leads to a new unobserved likelihood which can easily be maximised. In fact, one of the simplest methods of maximising the likelihood of mixture distributions is to use the EM-algorithm. We cover this later in the course. See Example 4.3 in Davison (2002).

The likelihood for dependent data

We mention that the likelihood for dependent data can also be constructed (though often the estimation and the asymptotic properties can be a lot harder to derive). Using Bayes rule, i.e.

$$P(A_1,A_2,\ldots,A_T) = P(A_1)\prod_{i=2}^{T}P(A_i\mid A_{i-1},\ldots,A_1),$$
we have
$$\mathcal{L}_T(X;\theta) = f(X_1;\theta)\prod_{t=2}^{T}f(X_t\mid X_{t-1},\ldots,X_1;\theta).$$
Under certain conditions on $\{X_t\}$ the structure of $\prod_{t=2}^{T}f(X_t\mid X_{t-1},\ldots,X_1;\theta)$ can be simplified. For example, if $X_t$ were Markovian then $X_t$ conditioned on the past only depends on the most recent past observation, i.e. $f(X_t\mid X_{t-1},\ldots,X_1;\theta) = f(X_t\mid X_{t-1};\theta)$; in this case the above likelihood reduces to
$$\mathcal{L}_T(X;\theta) = f(X_1;\theta)\prod_{t=2}^{T}f(X_t\mid X_{t-1};\theta).\qquad(4.1)$$

Example 4.1.4 A lot of the material we will cover in this class will be for independent observations. However, likelihood methods can also work for dependent observations too. Consider the AR(1) time series
$$X_t = aX_{t-1} + \varepsilon_t,$$
where $\{\varepsilon_t\}$ are iid random variables with mean zero. We will assume that $|a|<1$. We see from the above that the observation $X_{t-1}$ has a linear influence on the next observation, and it is Markovian: given $X_{t-1}$, the random variable $X_{t-2}$ has no influence on $X_t$ (to see this consider the distribution function $P(X_t\le x\mid X_{t-1},X_{t-2})$). Therefore, by using (4.1) the likelihood of $\{X_t\}_t$ is
$$\mathcal{L}_T(X;a) = f(X_1;a)\prod_{t=2}^{T}f_\varepsilon(X_t - aX_{t-1}),\qquad(4.2)$$
where $f_\varepsilon$ is the density of $\varepsilon$ and $f(X_1;a)$ is the marginal density of $X_1$. This means the likelihood of $\{X_t\}$ only depends on $f_\varepsilon$ and the marginal density of $X_1$. We use $\hat a_T = \arg\max_a\mathcal{L}_T(X;a)$ as the MLE of $a$.

Often we ignore the term $f(X_1;a)$, because this is often hard to know (try and figure it out — it is relatively easy in the Gaussian case), and consider what is called the conditional likelihood
$$Q_T(X;a) = \prod_{t=2}^{T}f_\varepsilon(X_t - aX_{t-1}).\qquad(4.3)$$
We use $\tilde a_T = \arg\max_a Q_T(X;a)$ as the quasi-MLE estimator of $a$.

Exercise: What is the conditional likelihood proportional to in the case that $\{\varepsilon_t\}$ are Gaussian random variables with mean zero?

It should be mentioned that often the conditional likelihood is derived as if the errors $\{\varepsilon_t\}$ were Gaussian, even if they are not. This is often called the quasi- or pseudo-likelihood.
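As an illustration (my own simulation, not from the notes), the sketch below computes the Gaussian conditional (quasi-)likelihood estimator of $a$ for a simulated AR(1) series; with Gaussian $f_\varepsilon$ the maximiser of $Q_T$ has the least-squares form used for the cross-check.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
T, a_true = 500, 0.6
x = np.zeros(T)
for t in range(1, T):                       # simulate X_t = a X_{t-1} + eps_t
    x[t] = a_true * x[t - 1] + rng.normal()

def neg_cond_loglik(a):
    # minus log Q_T(X; a) with standard normal errors (up to constants)
    resid = x[1:] - a * x[:-1]
    return 0.5 * np.sum(resid ** 2)

num = minimize_scalar(neg_cond_loglik, bounds=(-0.99, 0.99), method="bounded")
closed_form = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)   # least-squares solution
print(num.x, closed_form)                   # the two estimates agree
```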

4.1.2 A quick review of the central limit theorem

In this section we will not endeavour to prove the central limit theorem, which is usually based on showing that the characteristic function (a close cousin of the moment generating function) of the average converges to the characteristic function of the normal distribution. However, we will recall the general statement of the CLT and generalisations of it. The purpose of this section is not to lumber you with unnecessary mathematics but to help you understand when an estimator is close to normal or not.

Lemma 4.1.1 (The famous CLT) Let us suppose that $\{X_t\}$ are iid random variables; let $\mu = E(X_t)<\infty$ and $\sigma^2 = \mathrm{var}(X_t)<\infty$. Define $\bar X = \frac{1}{T}\sum_{t=1}^{T}X_t$. Then we have
$$\sqrt T(\bar X - \mu)\xrightarrow{D}\mathcal{N}(0,\sigma^2),\qquad\text{alternatively}\qquad\frac{\sqrt T(\bar X-\mu)}{\sigma}\xrightarrow{D}\mathcal{N}(0,1).$$
What this means is that if we have a large enough sample size and plotted the histogram of several replications of the average, this should be close to normal.

Remark 4.1.1 (i) The above lemma appears to be restricted to just averages. However, it can be used in several different contexts. Averages arise in several different situations; it is not just restricted to the average of the observations. By judicious algebraic manipulations, one can show that several estimators can be rewritten as an average, or approximately as an average. At first appearance the MLE of the Weibull parameters given in Example 4.1.3 does not look like an average; however, in the sections below we consider general maximum likelihood estimators and show that they can be rewritten as an average, hence the CLT applies to them too.

(ii) The CLT can be extended in several ways: (a) to random variables whose variances are not all the same (i.e. independent but not identically distributed random variables); (b) to dependent random variables (so long as the dependency decays in some way); (c) to not just averages but weighted averages too, so long as the weights behave in a certain way — the weights should be distributed well over all the random variables. That is, suppose that $\{X_t\}$ are iid random variables. Then it is clear that $\frac{1}{20}\sum_{t=1}^{20}X_t$ will never be normal unless $\{X_t\}$ is normal (observe 20 is fixed!), but it seems plausible that $\frac{1}{n}\sum_{t=1}^{n}\sin(\pi t/n)X_t$ is close to normal despite this not being the sum of iid random variables.
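A quick way to see the CLT "in action" (a simulation of my own, not from the notes) is to replicate the standardised sample mean many times and compare its moments and tail behaviour with the N(0,1) limit; here the data are Exp(1), an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)
T, reps = 200, 5000
mu, sigma = 1.0, 1.0                        # mean and sd of an Exp(1) variable

# reps replications of sqrt(T) * (Xbar - mu) / sigma for iid Exp(1) data
samples = rng.exponential(1.0, size=(reps, T))
z = np.sqrt(T) * (samples.mean(axis=1) - mu) / sigma

# Mean, variance and upper tail probability should be close to 0, 1 and 0.025
print(z.mean(), z.var(), np.mean(z > 1.96))
```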

There exist several theorems which one can use to prove normality. But really the take-home message is: look at your estimator and see whether asymptotic normality looks plausible — you could even check it through simulations.

Example 4.1.5 (Some problem cases) One should think a little before blindly applying the CLT. Suppose that the iid random variables $\{X_t\}$ follow a t-distribution with 2 degrees of freedom, i.e. the density function is
$$f(x) = \frac{\Gamma(3/2)}{\sqrt{2\pi}}\Big(1+\frac{x^2}{2}\Big)^{-3/2}.$$
Let $\bar X = \frac{1}{n}\sum_{t=1}^{n}X_t$ denote the sample mean. It is well known that the mean of the t-distribution with two degrees of freedom exists, but the variance does not (it is too thick-tailed). Thus the assumptions required for the CLT to hold are violated, and the (appropriately normalised) sample mean is not asymptotically normally distributed (in fact it follows a stable law distribution). Intuitively this is clear: recall that the chance of outliers for a t-distribution with a small number of degrees of freedom is large. This makes it impossible for averages to be well behaved; there is a large chance that an average could also be too large or too small.

To see why the variance is infinite, study the form of the t-distribution with two degrees of freedom. For the variance to be finite, the tails of the distribution should converge to zero fast enough; in other words the probability of outliers should not be too large. Observe that the tails of the t-distribution for large $x$ behave like $f(x)\approx Cx^{-3}$ (make a plot in Maple to check); thus the second moment
$$E(X^2)\geq \int_M^{\infty}Cx^{-3}\,x^2\,dx = \int_M^{\infty}Cx^{-1}\,dx,$$
for some $C$ and $M$, is clearly not finite! (This argument can be made precise.)

4.1.3 The Taylor series expansion — the statistician's tool

The Taylor series is used all over the place in statistics and you should be completely fluent with using it. It can be used to prove consistency of an estimator, normality (based on the assumption that averages converge to a normal distribution), to obtain the limiting variance of an estimator, etc. We start by demonstrating its use for the log-likelihood.

We recall that the mean value theorem in the univariate case states that
$$f(x) = f(x_0) + (x-x_0)f'(\bar x)\qquad\text{and}\qquad f(x) = f(x_0) + (x-x_0)f'(x_0) + \frac{1}{2}(x-x_0)^2 f''(\tilde x),$$
where $\bar x$ and $\tilde x$ both lie between $x$ and $x_0$. In the case that $f$ is a multivariate function, we have
$$f(x) = f(x_0) + (x-x_0)'\,\nabla f(x)\big|_{x=\bar x}\qquad\text{and}\qquad f(x) = f(x_0) + (x-x_0)'\,\nabla f(x)\big|_{x=x_0} + \frac{1}{2}(x-x_0)'\,\nabla^2 f(x)\big|_{x=\tilde x}\,(x-x_0),$$

where $\bar x$ and $\tilde x$ both lie between $x$ and $x_0$. In the case that $f(x)$ is vector-valued, the mean value theorem does not directly work; strictly speaking we cannot say that $f(x) = f(x_0) + (x-x_0)'\nabla f(x)|_{x=\bar x}$, where $\bar x$ lies between $x$ and $x_0$. However, it is quite straightforward to overcome this inconvenience. The mean value theorem does hold pointwise, for every element of the vector $f(x) = (f_1(x),\ldots,f_d(x))$, i.e. for every $1\le i\le d$ we have
$$f_i(x) = f_i(x_0) + (x-x_0)'\,\nabla f_i(x)\big|_{x=\bar x_i},$$
where $\bar x_i$ lies between $x$ and $x_0$. Thus if $\nabla f_i(x)|_{x=\bar x_i}\approx\nabla f_i(x)|_{x=x_0}$, we do have that $f(x)\approx f(x_0) + (x-x_0)'\nabla f(x)|_{x=x_0}$. We use the above below.

Application 1: An expression for $\mathcal{L}_T(\hat\theta_T)-\mathcal{L}_T(\theta_0)$ in terms of $(\hat\theta_T-\theta_0)$.

The expansion of $\mathcal{L}_T(\hat\theta_T)$ about $\theta_0$ (the true parameter) gives
$$\mathcal{L}_T(\theta_0) - \mathcal{L}_T(\hat\theta_T) = \frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\Big|_{\hat\theta_T}(\theta_0-\hat\theta_T) + \frac{1}{2}(\theta_0-\hat\theta_T)'\,\frac{\partial^2\mathcal{L}_T(\theta)}{\partial\theta^2}\Big|_{\bar\theta}\,(\theta_0-\hat\theta_T),$$
where $\bar\theta$ lies between $\theta_0$ and $\hat\theta_T$. If $\hat\theta_T$ lies in the interior of the parameter space (this is an extremely important assumption here), then $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\big|_{\hat\theta_T}=0$. Moreover, if it can be shown that $\hat\theta_T\xrightarrow{P}\theta_0$ (we show this in the section below), then under certain conditions on $\mathcal{L}_T(\theta)$ (such as the existence of the third derivative, etc.) it can be shown that
$$-\frac{1}{T}\frac{\partial^2\mathcal{L}_T(\theta)}{\partial\theta^2}\Big|_{\bar\theta}\xrightarrow{P}E\Big(-\frac{1}{T}\frac{\partial^2\mathcal{L}_T(\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big) = I(\theta_0).$$
Hence the above is roughly
$$2\big(\mathcal{L}_T(\hat\theta_T) - \mathcal{L}_T(\theta_0)\big)\approx T\,(\hat\theta_T-\theta_0)'\,I(\theta_0)\,(\hat\theta_T-\theta_0).$$
Note that in many of the derivations below we will use that $-\frac{1}{T}\frac{\partial^2\mathcal{L}_T(\theta)}{\partial\theta^2}\big|_{\bar\theta}\xrightarrow{P}I(\theta_0)$. But it should be noted that this is only true if (i) $\hat\theta_T\xrightarrow{P}\theta_0$ and (ii) $\frac{1}{T}\frac{\partial^2\mathcal{L}_T(\theta)}{\partial\theta^2}$ converges uniformly to its expectation. We consider below another closely related application.
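A small numerical check of this quadratic approximation (my own sketch, not from the notes), using the Weibull example with $\alpha$ known; a quick calculation, not done in the notes, gives the per-observation information $I(\theta)=\alpha^2/\theta^2$ in this case.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, theta0, T = 2.0, 1.5, 1000
y = theta0 * rng.weibull(alpha, size=T)      # Weibull(shape alpha, scale theta0)

def loglik(theta):
    # log-likelihood up to constants not involving theta
    return np.sum(-alpha * np.log(theta) - (y / theta) ** alpha)

theta_hat = np.mean(y ** alpha) ** (1.0 / alpha)   # explicit MLE from Example 4.1.2
lhs = 2 * (loglik(theta_hat) - loglik(theta0))
rhs = T * (theta_hat - theta0) ** 2 * alpha ** 2 / theta0 ** 2  # T (th-th0)^2 I(th0)
print(lhs, rhs)     # the two quantities should be close for large T
```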

Application 2: An expression for $(\hat\theta_T-\theta_0)$ in terms of $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta_0}$.

The expansion of the $p$-dimensional vector $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\big|_{\hat\theta_T}$ pointwise about $\theta_0$ (the true parameter) gives, for each $1\le i\le p$,
$$\frac{\partial\mathcal{L}_{T,i}(\theta)}{\partial\theta}\Big|_{\hat\theta_T} = \frac{\partial\mathcal{L}_{T,i}(\theta)}{\partial\theta}\Big|_{\theta_0} + \frac{\partial^2\mathcal{L}_{T,i}(\theta)}{\partial\theta^2}\Big|_{\bar\theta_i}\,(\hat\theta_T-\theta_0).$$
Now, by using the same argument as in Application 1 we have
$$\frac{1}{T}\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\Big|_{\theta_0}\approx I(\theta_0)\,(\hat\theta_T-\theta_0).$$
We mention that $U_T(\theta_0) = \frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta_0}$ is often called the score or U statistic. And we see that the asymptotic sampling properties of $U_T$ determine the sampling properties of $(\hat\theta_T-\theta_0)$.

Example 4.1.6 (The Weibull) Evaluate the second derivative of the likelihood given in Example 4.1.3 and take the expectation of this, $I(\theta,\alpha) = -E(\nabla^2\mathcal{L})$ (we use $\nabla^2$ to denote the matrix of second derivatives with respect to the parameters $\alpha$ and $\theta$). Exercise: Evaluate $I(\theta,\alpha)$.

Application 2 implies that the maximum likelihood estimators $\hat\theta_T$ and $\hat\alpha_T$ (recalling that no explicit expression for them exists) can be written as
$$\begin{pmatrix}\hat\theta_T-\theta\\ \hat\alpha_T-\alpha\end{pmatrix}\approx\frac{1}{T}\,I(\theta,\alpha)^{-1}\sum_{t=1}^{T}\begin{pmatrix}-\frac{\alpha}{\theta}+\frac{\alpha}{\theta^{\alpha+1}}Y_t^{\alpha}\\[4pt]\frac{1}{\alpha}+\log Y_t-\log\theta-\big(\frac{Y_t}{\theta}\big)^{\alpha}\log\big(\frac{Y_t}{\theta}\big)\end{pmatrix}.$$

4.1.4 Sampling properties of the maximum likelihood estimator

See also Section 4.4, Davison (2002). These proofs will not be examined, but you should have some idea why Theorem 4.1.2 is true.

We have shown that under certain conditions the maximum likelihood estimator can often be the minimum variance unbiased estimator (for example, in the case of the exponential family of distributions). However, for finite samples the MLE may not attain the Cramer-Rao lower bound; hence for finite samples $\mathrm{var}(\hat\theta_T) > I_T(\theta)^{-1}$. However, it can be shown that asymptotically the variance of the MLE attains the Cramer-Rao lower bound. In other words, for large samples the variance of the MLE is close to the Cramer-Rao bound. We will prove the result in the case that $\mathcal{L}_T$ is the log-likelihood of independent, identically distributed random variables. The proof can be generalised to the case of non-identically distributed random variables. We first state sufficient conditions for this to be true.

Assumption 4.1.1 (Regularity Conditions) Let $\{X_t\}$ be iid random variables with density $f(x;\theta)$.

(i) Suppose the conditions of the regularity assumption stated earlier in the notes hold.

(ii) Almost sure uniform convergence (this is optional): 
$$\sup_{\theta\in\Theta}\Big|\frac{1}{T}\mathcal{L}_T(X;\theta) - E\Big(\frac{1}{T}\mathcal{L}_T(\theta)\Big)\Big|\xrightarrow{a.s.}0.$$
We mention that directly verifying uniform convergence can be difficult. However, it can be established by showing that the parameter space is compact, together with pointwise convergence of the likelihood to its expectation and (stochastic) equicontinuity.

(iii) Model identifiability: for every $\theta\in\Theta$ there does not exist another $\theta'\in\Theta$ such that $f(x;\theta)=f(x;\theta')$ for all $x$.

(iv) The parameter space $\Theta$ is finite-dimensional and compact.

(v) $\sup_{\theta\in\Theta}E\big|\frac{1}{T}\mathcal{L}_T(X;\theta)\big|<\infty$.

We require Assumption 4.1.1(ii),(iii) to show consistency, and Assumption 4.1.1(iii)-(v) (together with the earlier regularity conditions) to show asymptotic normality.

Theorem 4.1.1 Suppose Assumption 4.1.1(ii),(iii) holds. Let $\theta_0$ be the true parameter and $\hat\theta_T$ be the MLE. Then we have $\hat\theta_T\xrightarrow{a.s.}\theta_0$ (consistency).

PROOF. To prove the result we first need to show that the expectation of the log-likelihood is maximised at the true parameter and that this maximum is unique. In other words, we need to show that $E(\frac{1}{T}\mathcal{L}_T(X;\theta)) - E(\frac{1}{T}\mathcal{L}_T(X;\theta_0))\le 0$ for all $\theta\in\Theta$. To do this, we have
$$E\Big(\frac{1}{T}\mathcal{L}_T(X;\theta)\Big) - E\Big(\frac{1}{T}\mathcal{L}_T(X;\theta_0)\Big) = \int\log\frac{f(x;\theta)}{f(x;\theta_0)}\,f(x;\theta_0)\,dx = E\Big(\log\frac{f(X;\theta)}{f(X;\theta_0)}\Big).$$
Now by using Jensen's inequality we have
$$E\Big(\log\frac{f(X;\theta)}{f(X;\theta_0)}\Big)\le\log E\Big(\frac{f(X;\theta)}{f(X;\theta_0)}\Big) = \log\int f(x;\theta)\,dx = 0,$$
thus giving $E(\frac{1}{T}\mathcal{L}_T(X;\theta)) - E(\frac{1}{T}\mathcal{L}_T(X;\theta_0))\le 0$. To prove that $E(\frac{1}{T}\mathcal{L}_T(X;\theta)) - E(\frac{1}{T}\mathcal{L}_T(X;\theta_0)) = 0$ only when $\theta=\theta_0$, we note the identifiability assumption (Assumption 4.1.1(iii)), which means that $f(x;\theta)=f(x;\theta_0)$ for all $x$ only when $\theta=\theta_0$, and no other function gives equality.

Hence $E(\frac{1}{T}\mathcal{L}_T(X;\theta))$ is uniquely maximised at $\theta_0$.

Finally, we need to show that $\hat\theta_T\xrightarrow{a.s.}\theta_0$. By Assumption 4.1.1(ii) (and also the LLN) we have, for all $\theta\in\Theta$, that $\frac{1}{T}\mathcal{L}_T(X;\theta)\xrightarrow{a.s.}\ell(\theta):=E(\frac{1}{T}\mathcal{L}_T(X;\theta))$. To bound $E(\frac{1}{T}\mathcal{L}_T(X;\theta_0)) - E(\frac{1}{T}\mathcal{L}_T(X;\hat\theta_T))$ we note the decomposition
$$E\big(\tfrac{1}{T}\mathcal{L}_T(X;\theta_0)\big) - E\big(\tfrac{1}{T}\mathcal{L}_T(X;\hat\theta_T)\big) = \big\{E\big(\tfrac{1}{T}\mathcal{L}_T(X;\theta_0)\big) - \tfrac{1}{T}\mathcal{L}_T(X;\theta_0)\big\} + \big\{\tfrac{1}{T}\mathcal{L}_T(X;\hat\theta_T) - E\big(\tfrac{1}{T}\mathcal{L}_T(X;\hat\theta_T)\big)\big\} + \big\{\tfrac{1}{T}\mathcal{L}_T(X;\theta_0) - \tfrac{1}{T}\mathcal{L}_T(X;\hat\theta_T)\big\},$$
where the last term is non-positive since $\hat\theta_T$ maximises the likelihood. Bounding each term by the uniform deviation, and using Assumption 4.1.1(ii), we have
$$0\le E\big(\tfrac{1}{T}\mathcal{L}_T(X;\theta_0)\big) - E\big(\tfrac{1}{T}\mathcal{L}_T(X;\hat\theta_T)\big)\le 3\sup_{\theta\in\Theta}\Big|\tfrac{1}{T}\mathcal{L}_T(X;\theta) - E\big(\tfrac{1}{T}\mathcal{L}_T(X;\theta)\big)\Big|\xrightarrow{a.s.}0.$$
Since $E(\frac{1}{T}\mathcal{L}_T(\theta))$ has a unique maximum at $\theta_0$, this implies $\hat\theta_T\xrightarrow{a.s.}\theta_0$. Hence we have shown consistency of the MLE. We now need to show asymptotic normality.

Theorem 4.1.2 Suppose Assumption 4.1.1 is satisfied.

(i) Then the score statistic satisfies
$$\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta}\Big|_{\theta_0}\xrightarrow{D}\mathcal{N}\Big(0,\ E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)\Big).\qquad(4.5)$$

(ii) Then the MLE satisfies
$$\sqrt T(\hat\theta_T-\theta_0)\xrightarrow{D}\mathcal{N}\Big(0,\ E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)^{-1}\Big).$$

(iii) The log-likelihood ratio satisfies
$$2\big(\mathcal{L}_T(X;\hat\theta_T) - \mathcal{L}_T(X;\theta_0)\big)\xrightarrow{D}\chi^2_p.$$

PROOF. First we will prove (i). We recall that because $\{X_t\}$ are iid random variables,
$$\frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta}\Big|_{\theta_0} = \sum_{t=1}^{T}\frac{\partial\log f(X_t;\theta)}{\partial\theta}\Big|_{\theta_0}.$$
Hence $\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta}\big|_{\theta_0}$ is the (scaled) sum of iid random variables with mean zero and variance $\mathrm{var}\big(\frac{\partial\log f(X_t;\theta)}{\partial\theta}\big|_{\theta_0}\big)$. Therefore, by the CLT for iid random variables we have (4.5).

We use (i) and the Taylor (mean value) theorem to prove (ii). We first note that by the mean value theorem we have
$$\frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta}\Big|_{\hat\theta_T} = \frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta}\Big|_{\theta_0} + (\hat\theta_T-\theta_0)\frac{\partial^2\mathcal{L}_T(X;\theta)}{\partial\theta^2}\Big|_{\bar\theta}.\qquad(4.6)$$
Now it can be shown (because $\Theta$ has a compact support, $\hat\theta_T\xrightarrow{a.s.}\theta_0$ and the expectation of the third derivative of $\mathcal{L}_T$ is bounded) that
$$\frac{1}{T}\frac{\partial^2\mathcal{L}_T(X;\theta)}{\partial\theta^2}\Big|_{\bar\theta}\xrightarrow{P}E\Big(\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big).\qquad(4.7)$$
Substituting (4.7) into (4.6), and recalling that the left-hand side of (4.6) is zero, gives
$$\sqrt T(\hat\theta_T-\theta_0) = E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)^{-1}\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta}\Big|_{\theta_0} + o_p(1).$$
We mention that the proof above is for univariate $\frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta}$, but by redoing the above steps pointwise it can easily be generalised to the multivariate case too. Hence, by substituting (4.5) into the above we have (ii). It is straightforward to prove (iii) by using
$$2\big(\mathcal{L}_T(X;\hat\theta_T) - \mathcal{L}_T(X;\theta_0)\big)\approx T(\hat\theta_T-\theta_0)'I(\theta_0)(\hat\theta_T-\theta_0),$$
(i), and the result that if $X\sim\mathcal{N}(0,\Sigma)$ then $AX\sim\mathcal{N}(0,A\Sigma A')$. □

Example 4.1.7 (The Weibull) By using Example 4.1.6 we have
$$\begin{pmatrix}\hat\theta_T-\theta\\ \hat\alpha_T-\alpha\end{pmatrix}\approx\frac{1}{T}\,I(\theta,\alpha)^{-1}\sum_{t=1}^{T}\begin{pmatrix}-\frac{\alpha}{\theta}+\frac{\alpha}{\theta^{\alpha+1}}Y_t^{\alpha}\\[4pt]\frac{1}{\alpha}+\log Y_t-\log\theta-\big(\frac{Y_t}{\theta}\big)^{\alpha}\log\big(\frac{Y_t}{\theta}\big)\end{pmatrix}.$$

Now we observe that the RHS consists of a sum of iid random variables (this can be viewed as an average). Since the variance of this exists (you can show that it is $I(\theta,\alpha)$), the CLT can be applied and we have that
$$\sqrt T\begin{pmatrix}\hat\theta_T-\theta\\ \hat\alpha_T-\alpha\end{pmatrix}\xrightarrow{D}\mathcal{N}\big(0,\ I(\theta,\alpha)^{-1}\big).$$

Remark 4.1.2 (i) We recall that for iid random variables the Fisher information for sample size $T$ is
$$I_T(\theta) = E\Big(\Big(\frac{\partial\log\mathcal{L}_T(X;\theta)}{\partial\theta}\Big|_{\theta_0}\Big)^2\Big) = T\,E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big).$$
Hence, comparing with the above theorem, we see that for iid random variables (so long as the regularity conditions are satisfied) the MLE asymptotically attains the Cramer-Rao bound, even if for finite samples this may not be true. Moreover, since
$$\sqrt T(\hat\theta_T-\theta_0)\approx I(\theta_0)^{-1}\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\Big|_{\theta_0},\qquad\mathrm{var}\Big(\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\Big|_{\theta_0}\Big) = I(\theta_0),$$
it can be seen that $(\hat\theta_T-\theta_0) = O_p(T^{-1/2})$.

(ii) Under suitable conditions a similar result holds true for data which is not iid.

In summary, the MLE under certain regularity conditions tends to have the smallest variance, and for large samples the variance is close to the lower bound, which is the Cramer-Rao bound. In the case that Assumption 4.1.1 is satisfied, the MLE is said to be asymptotically efficient. This means that for finite samples the MLE may not attain the C-R bound, but asymptotically it will.

(iii) A simple application of Theorem 4.1.2 is the derivation of the distribution of $\sqrt T\,I(\theta_0)^{1/2}(\hat\theta_T-\theta_0)$. It is clear that by using Theorem 4.1.2 we have
$$\sqrt T\,I(\theta_0)^{1/2}(\hat\theta_T-\theta_0)\xrightarrow{D}\mathcal{N}(0,I_p),$$
where $I_p$ is the identity matrix, and
$$T(\hat\theta_T-\theta_0)'\,I(\theta_0)\,(\hat\theta_T-\theta_0)\xrightarrow{D}\chi^2_p.$$

(iv) Note that these results apply when $\theta_0$ lies inside the parameter space $\Theta$. As $\theta_0$ gets closer to the boundary of the parameter space, these asymptotic results may no longer hold.

Remark 4.1.3 (Generalised estimating equations) Closely related to the MLE are generalised estimating equations (GEE), which are related to the score statistic. These are estimators not based on maximising the likelihood, but on equating the score statistic (the derivative of the likelihood) to zero and solving for the unknown parameters. Often they are equivalent to the MLE, but they can be adapted to be useful in themselves, and some adaptations will not be the derivative of a likelihood.

The Fisher information

See also Section 4.3, Davison (2002). Let us return to the Fisher information. We recall that under certain regularity conditions an unbiased estimator $\tilde\theta(X)$ of a parameter $\theta_0$ is such that
$$\mathrm{var}\big(\tilde\theta(X)\big)\ge I_T(\theta_0)^{-1},\qquad\text{where}\qquad I_T(\theta) = E\Big(\Big(\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\Big)^2\Big) = -E\Big(\frac{\partial^2\mathcal{L}_T(\theta)}{\partial\theta^2}\Big)$$
is the Fisher information. Furthermore, under suitable regularity conditions, the MLE will asymptotically attain this bound. It is reasonable to ask how one can interpret this bound.

(i) Situation 1. $I(\theta_0) = E\big(-\frac{\partial^2\mathcal{L}_T(\theta)}{\partial\theta^2}\big|_{\theta_0}\big)$ is large (hence the variance of the MLE will be small). Then it means that the gradient of $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}$ is steep. Hence even for small deviations from $\theta_0$, $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}$ is likely to be far from zero. This means the MLE $\hat\theta_T$ is likely to be in a close neighbourhood of $\theta_0$.

(ii) Situation 2. $I(\theta_0)$ is small (hence the variance of the MLE will be large). In this case the gradient of the likelihood $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}$ is flatter, and hence $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\approx 0$ over a large neighbourhood about the true parameter $\theta_0$. Therefore the MLE $\hat\theta_T$ can lie in a large neighbourhood of $\theta_0$.

This is one explanation as to why $I(\theta)$ is called the Fisher information: it contains information on how close any estimator of $\theta_0$ can be. Look at the censoring example in Davison (2002).
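The relation var(θ̂) ≈ 1/(T·I(θ)) is easy to check by simulation. The sketch below (my own, not from the notes) uses the Weibull with α known, for which a direct calculation, not done in the notes, gives I(θ) = α²/θ²; the parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, theta0, T, reps = 2.0, 1.5, 400, 2000

# Monte Carlo: replicate the explicit MLE theta_hat = (mean(Y^alpha))^(1/alpha)
estimates = np.empty(reps)
for r in range(reps):
    y = theta0 * rng.weibull(alpha, size=T)
    estimates[r] = np.mean(y ** alpha) ** (1.0 / alpha)

empirical_var = estimates.var()
asymptotic_var = theta0 ** 2 / (alpha ** 2 * T)    # 1 / (T * I(theta0))
print(empirical_var, asymptotic_var)                # the two should be close
```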

Chapter 5
Confidence Intervals

5.1 Confidence intervals and testing

We first summarise the results in the previous section which will be useful in this section. For convenience, we will assume that the likelihood is for iid random variables whose density is $f(x;\theta_0)$ (though it is relatively simple to see how this can be generalised to general likelihoods of not necessarily iid random variables). Let us suppose that $\theta_0$ is the true parameter that we wish to estimate. Based on Theorem 4.1.2 we have
$$\sqrt T(\hat\theta_T-\theta_0)\xrightarrow{D}\mathcal{N}\Big(0,\ E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)^{-1}\Big),\qquad(5.1)$$
$$\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\xrightarrow{D}\mathcal{N}\Big(0,\ E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)\Big)\qquad(5.2)$$
and
$$2\big(\mathcal{L}_T(\hat\theta_T) - \mathcal{L}_T(\theta_0)\big)\xrightarrow{D}\chi^2_p,\qquad(5.3)$$
where $p$ is the number of parameters in the vector $\theta$. Using any of (5.1), (5.2) and (5.3) we can construct a 95% CI for $\theta_0$.

5.1.1 Constructing confidence intervals using the likelihood

See also Section 4.5, Davison (2002). One of the main reasons that we show asymptotic normality of an estimator (it is usually not possible to derive normality for finite samples) is to construct confidence intervals (CIs) and to test.

In the case that $\theta_0$ is a scalar (a vector of dimension one), it is easy to use (5.1) to obtain
$$\sqrt T\,E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)^{1/2}(\hat\theta_T-\theta_0)\xrightarrow{D}\mathcal{N}(0,1).\qquad(5.4)$$
Based on the above, the 95% CI for $\theta_0$ is
$$\Big[\hat\theta_T - \frac{1}{\sqrt T}E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)^{-1/2}z_{\alpha/2},\ \ \hat\theta_T + \frac{1}{\sqrt T}E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)^{-1/2}z_{\alpha/2}\Big].$$
The above, of course, requires an estimate of the standardised Fisher information $E\big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\big|_{\theta_0}\big)$. Usually we evaluate the second derivative of the log-likelihood $\mathcal{L}_T(\theta)$ and replace $\theta$ with its estimator $\hat\theta_T$.

Exercise: Use (5.2) to construct a CI for $\theta_0$ based on the score.

The CI constructed above works well if $\theta$ is a scalar. But beyond dimension one, constructing a CI based on (5.1) and the $p$-dimensional normal is extremely difficult. More precisely, if $\theta_0$ is a $p$-dimensional vector then the analogous version of (5.4) is
$$\sqrt T\,E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)^{1/2}(\hat\theta_T-\theta_0)\xrightarrow{D}\mathcal{N}(0,I_p);$$
using this it is difficult to obtain a CI for $\theta_0$. One way to construct the CI is to "square" $(\hat\theta_T-\theta_0)$ and use
$$T(\hat\theta_T-\theta_0)'\,E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)(\hat\theta_T-\theta_0)\xrightarrow{D}\chi^2_p.\qquad(5.5)$$
Based on the above, a 95% CI is
$$\Big\{\theta:\ T(\hat\theta_T-\theta)'\,E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)(\hat\theta_T-\theta)\le\chi^2_p(0.95)\Big\}.\qquad(5.6)$$
Note that, as in the scalar case, this leads to the interval with the smallest length. A disadvantage of (5.6) is that we have to (a) estimate the information matrix and (b) try to find all $\theta$ such that the above holds. This can be quite unwieldy. An alternative method, which is asymptotically equivalent to the above but removes the need to estimate the information matrix, is to use (5.3). By using (5.3), a $100(1-\alpha)\%$ CI for $\theta_0$ is
$$\Big\{\theta:\ 2\big(\mathcal{L}_T(\hat\theta_T)-\mathcal{L}_T(\theta)\big)\le\chi^2_p(1-\alpha)\Big\}.\qquad(5.7)$$
The above is not easy to calculate, but it is feasible.
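For a scalar parameter, (5.7) can be evaluated numerically by finding where the likelihood-ratio statistic crosses the chi-squared cut-off. The sketch below (my own, using the Weibull scale θ with α known; the search brackets around θ̂ are heuristic choices) does exactly that.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

rng = np.random.default_rng(5)
alpha, theta0, T = 2.0, 1.5, 200
y = theta0 * rng.weibull(alpha, size=T)

def loglik(theta):
    return np.sum(-alpha * np.log(theta) - (y / theta) ** alpha)

theta_hat = np.mean(y ** alpha) ** (1.0 / alpha)
crit = chi2.ppf(0.95, df=1)

def g(theta):
    # zero where 2*(L(theta_hat) - L(theta)) equals the chi-squared threshold
    return 2 * (loglik(theta_hat) - loglik(theta)) - crit

lower = brentq(g, 0.5 * theta_hat, theta_hat)   # heuristic brackets either side
upper = brentq(g, theta_hat, 2.0 * theta_hat)
print(theta_hat, (lower, upper))
```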

Example 5.1.1 In the case that $\theta_0$ is a scalar, the 95% CI based on (5.7) is
$$\Big\{\theta:\ 2\big(\mathcal{L}_T(\hat\theta_T)-\mathcal{L}_T(\theta)\big)\le\chi^2_1(0.95)\Big\}.$$

Both the 95% CIs in (5.6) and (5.7) will be very close for relatively large sample sizes. However, one advantage of using (5.7) instead of (5.6) is that it is easier to evaluate — there is no need to obtain the second derivative of the likelihood, etc. Another feature which differentiates the CIs in (5.6) and (5.7) is that the CI based on (5.6) is symmetric about $\hat\theta_T$ (recall that $(\bar X - 1.96\sigma/\sqrt T,\ \bar X + 1.96\sigma/\sqrt T)$ is symmetric about $\bar X$), whereas the symmetry may not hold in finite samples when constructing a CI for $\theta_0$ using (5.7). This is a positive advantage of using (5.7) instead of (5.6). A disadvantage of using (5.7) instead of (5.6) is that the CI based on (5.7) may sometimes consist of more than one interval.

As you can see, if the dimension of $\theta$ is large it is quite difficult to evaluate the CI (try it for the simple case that the dimension is two!). Indeed for dimensions greater than three it is extremely hard. However, in most cases we are only interested in constructing CIs for certain parameters of interest; the other unknown parameters are simply nuisance parameters and CIs for them are not of interest. For example, for the normal distribution we may only be interested in CIs for the mean but not the variance. It is clear that directly using the log-likelihood ratio to construct CIs (and also tests) will mean also constructing CIs for the nuisance parameters. Therefore, in Chapter 6 we construct a variant of the likelihood called the profile likelihood, which allows us to deal with nuisance parameters in a more efficient way.

5.1.2 Testing using the likelihood

Let us suppose we wish to test the hypothesis $H_0:\theta=\theta_0$ against the alternative $H_A:\theta\ne\theta_0$. We can use any of the results in (5.1), (5.2) and (5.3) to do the test; they will lead to slightly different p-values, but asymptotically they are all equivalent, because they are all based essentially on the same derivation. We now list the three tests that one can use.

The Wald test

The Wald statistic is based on (5.1). We recall from (5.1) that if the null is true, then we have
$$\sqrt T(\hat\theta_T-\theta_0)\xrightarrow{D}\mathcal{N}\Big(0,\ E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)^{-1}\Big).$$

Thus we can use as the test statistic
$$T_1 = \sqrt T\,E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)^{1/2}(\hat\theta_T-\theta_0)\xrightarrow{D}\mathcal{N}(0,1).$$
Let us now consider how the test statistic behaves under the alternative $H_A:\theta=\theta_1$. If the null is not true, then we have that
$$\hat\theta_T-\theta_0 = (\hat\theta_T-\theta_1) + (\theta_1-\theta_0)\approx I(\theta_1)^{-1}\frac{1}{T}\sum_{t=1}^{T}\frac{\partial\log f(X_t;\theta)}{\partial\theta}\Big|_{\theta_1} + (\theta_1-\theta_0).$$
Thus the distribution of the test statistic becomes centered about $\sqrt T\,E\big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\big|_{\theta_0}\big)^{1/2}(\theta_1-\theta_0)$; the larger the sample size, the more likely we are to reject the null.

Remark 5.1.1 (Types of alternatives) In the case that the alternative is fixed, it is clear that the power of the test goes to 100%. Therefore, often to see the effectiveness of the test, one lets the alternative get closer to the null as $T\to\infty$. For example:
1. Suppose that $\theta_1 = \theta_0 + T^{-1}$; then the centre of $T_1$ is $T^{-1/2}E\big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\big|_{\theta_0}\big)^{1/2}\to 0$. Thus the alternative is too close to the null for us to discriminate between the two.
2. Suppose that $\theta_1 = \theta_0 + T^{-1/2}$; then the centre of $T_1$ is $E\big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\big|_{\theta_0}\big)^{1/2}$. Therefore the test does have power, but it is not 100%.

In the case that the dimension of $\theta$ is greater than one, we use the quadratic form $T(\hat\theta_T-\theta_0)'E\big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\big|_{\theta_0}\big)(\hat\theta_T-\theta_0)$ instead of $T_1$, noting that its distribution is a chi-squared with $p$ degrees of freedom.

The Score test

The score test is based on the score. We recall from (5.2) that under the null the distribution of the score is
$$\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T}{\partial\theta}\Big|_{\theta=\theta_0}\xrightarrow{D}\mathcal{N}\Big(0,\ E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)\Big).$$
Thus we use as the test statistic
$$T_2 = \frac{1}{\sqrt T}\,E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)^{-1/2}\frac{\partial\mathcal{L}_T}{\partial\theta}\Big|_{\theta=\theta_0}\xrightarrow{D}\mathcal{N}(0,1).$$
An advantage of this test is that the maximum likelihood estimator (under either the null or the alternative) does not have to be calculated. Exercise: What does the test statistic look like under the alternative?
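To make the three statistics concrete, here is a small sketch of my own (again the Weibull with α known, for which I take I(θ) = α²/θ², a value derived outside the notes), computing the Wald, score and log-likelihood ratio statistics for H₀: θ = θ₀ on simulated data:

```python
import numpy as np
from scipy.stats import chi2, norm

rng = np.random.default_rng(6)
alpha, theta_null, T = 2.0, 1.5, 300
y = theta_null * rng.weibull(alpha, size=T)     # data generated under the null

def loglik(theta):
    return np.sum(-alpha * np.log(theta) - (y / theta) ** alpha)

theta_hat = np.mean(y ** alpha) ** (1.0 / alpha)
info = alpha ** 2 / theta_null ** 2             # per-observation I(theta) at the null

wald = np.sqrt(T * info) * (theta_hat - theta_null)
score = -T * alpha / theta_null + alpha * np.sum(y ** alpha) / theta_null ** (alpha + 1)
score_stat = score / np.sqrt(T * info)
lrt = 2 * (loglik(theta_hat) - loglik(theta_null))

# Two-sided p-values; the three should be similar for large T
print(2 * norm.sf(abs(wald)), 2 * norm.sf(abs(score_stat)), chi2.sf(lrt, df=1))
```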

The log-likelihood ratio test

Probably one of the most popular tests is the log-likelihood ratio test. This test is based on (5.3), and the test statistic is
$$T_3 = 2\big(\mathcal{L}_T(\hat\theta_T)-\mathcal{L}_T(\theta_0)\big)\xrightarrow{D}\chi^2_p.$$
An advantage of this test statistic is that it is pivotal, in the sense that the Fisher information etc. does not have to be calculated, only the maximum likelihood estimator. Exercise: What does the test statistic look like under the alternative?

5.1.3 Applications of the log-likelihood ratio to the multinomial distribution

Example 5.1.2 (The multinomial distribution) This is a generalisation of the binomial distribution. In this case, at any given trial there can arise $m$ different events (in the binomial case $m=2$). Let $Z_i$ denote the outcome of the $i$th trial and assume $P(Z_i=k)=\pi_k$, with $\pi_1+\cdots+\pi_m=1$. Suppose there were $n$ trials conducted and let $Y_1$ denote the number of times event 1 arises, $Y_2$ the number of times event 2 arises, and so on. Then it is straightforward to show that
$$P(Y_1=k_1,\ldots,Y_m=k_m) = \binom{n}{k_1,\ldots,k_m}\prod_{i=1}^{m}\pi_i^{k_i}.$$
If we do not impose any constraints on the probabilities $\{\pi_i\}$, given $\{Y_i\}_{i=1}^{m}$ it is straightforward to derive the MLE of $\{\pi_i\}$ (it is very intuitive too!). Noting that $\pi_m = 1-\sum_{i=1}^{m-1}\pi_i$, the log-likelihood of the multinomial is proportional to
$$\mathcal{L}_n(\pi) = \sum_{i=1}^{m-1}y_i\log\pi_i + y_m\log\Big(1-\sum_{i=1}^{m-1}\pi_i\Big).$$
Differentiating the above with respect to $\pi_i$ and solving gives the MLE $\hat\pi_i = Y_i/n$ — which is what we would have expected! We observe that though there are $m$ probabilities to estimate, due to the constraint $\pi_m = 1-\sum_{i=1}^{m-1}\pi_i$ we only have to estimate $m-1$ probabilities.

We mention that the same estimators can also be obtained by using Lagrange multipliers, that is, maximising $\mathcal{L}_n(\pi)$ subject to the parameter constraint $\sum_{j=1}^{m}\pi_j = 1$. To enforce this constraint, we normally add an additional term to $\mathcal{L}_n(\pi)$ and include the dummy variable $\lambda$. That is, we define the constrained likelihood
$$\mathcal{L}_n(\pi,\lambda) = \sum_{i=1}^{m}y_i\log\pi_i + \lambda\Big(1-\sum_{i=1}^{m}\pi_i\Big).$$

Now, if we maximise $\mathcal{L}_n(\pi,\lambda)$ with respect to $\{\pi_i\}_{i=1}^{m}$ and $\lambda$, we will obtain the estimators $\hat\pi_i = Y_i/n$, which is the same as the maximum of $\mathcal{L}_n(\pi)$. To derive the limiting distribution we note that the second derivative is
$$\frac{\partial^2\mathcal{L}_n(\pi)}{\partial\pi_i\partial\pi_j} = \begin{cases}-\dfrac{y_i}{\pi_i^2}-\dfrac{y_m}{\big(1-\sum_{r=1}^{m-1}\pi_r\big)^2} & i=j,\\[6pt] -\dfrac{y_m}{\big(1-\sum_{r=1}^{m-1}\pi_r\big)^2} & i\ne j.\end{cases}$$
Hence, taking expectations of the above, the information matrix is the $(m-1)\times(m-1)$ matrix
$$I(\pi) = n\begin{pmatrix}\frac{1}{\pi_1}+\frac{1}{\pi_m} & \frac{1}{\pi_m} & \cdots & \frac{1}{\pi_m}\\ \frac{1}{\pi_m} & \frac{1}{\pi_2}+\frac{1}{\pi_m} & \cdots & \frac{1}{\pi_m}\\ \vdots & & \ddots & \vdots\\ \frac{1}{\pi_m} & \cdots & & \frac{1}{\pi_{m-1}}+\frac{1}{\pi_m}\end{pmatrix}.$$
Provided none of the $\pi_i$ is equal to either 0 or 1 (which would drop the dimension $m$ and make $I(\pi)$ singular), the asymptotic distribution of the MLE is normal with variance $I(\pi)^{-1}$.

Sometimes the probabilities $\{\pi_i\}$ will not be free and will be determined by a parameter $\theta$, where $\theta$ is an $r$-dimensional vector with $r<m$, i.e. $\pi_i = \pi_i(\theta)$. In this case the log-likelihood of the multinomial is
$$\mathcal{L}_n(\theta) = \sum_{i=1}^{m-1}y_i\log\pi_i(\theta) + y_m\log\Big(1-\sum_{i=1}^{m-1}\pi_i(\theta)\Big).$$
By differentiating the above with respect to $\theta$ and solving, we obtain the MLE.

Pearson's goodness of fit test

We now derive Pearson's goodness-of-fit test using the log-likelihood ratio, though Pearson did not use this method to derive his test. Suppose the null is $H_0:\pi_1=\tilde\pi_1,\ldots,\pi_m=\tilde\pi_m$, where $\{\tilde\pi_i\}$ are some pre-set probabilities, and $H_A$: the probabilities are not the given probabilities. Hence we are testing a restricted model (where we do not have to estimate anything) against the full model (where we estimate the probabilities using $\hat\pi_i = Y_i/n$). The log-likelihood ratio in this case is
$$W = 2\big\{\max_\pi\mathcal{L}_n(\pi) - \mathcal{L}_n(\tilde\pi)\big\}.$$
Under the null we know that $W = 2\{\max_\pi\mathcal{L}_n(\pi) - \mathcal{L}_n(\tilde\pi)\}\xrightarrow{D}\chi^2_{m-1}$ (because we have to estimate $m-1$ parameters). We now derive an expression for $W$ and show that the Pearson

statistic is an approximation of this:
$$W = 2\Big(\sum_{i=1}^{m-1}Y_i\log\frac{Y_i}{n} + Y_m\log\frac{Y_m}{n} - \sum_{i=1}^{m-1}Y_i\log\tilde\pi_i - Y_m\log\tilde\pi_m\Big) = 2\sum_{i=1}^{m}Y_i\log\frac{Y_i}{n\tilde\pi_i}.$$
Recall that $Y_i$ is often called the observed ($Y_i = O_i$) and $n\tilde\pi_i$ the expected under the null ($E_i = n\tilde\pi_i$). Then
$$W = 2\sum_{i=1}^{m}O_i\log\frac{O_i}{E_i}\xrightarrow{D}\chi^2_{m-1}.$$
By using the fact that for $x$ close to $a$, a Taylor expansion of $x\log(x/a)$ about $x=a$ gives
$$x\log\frac{x}{a}\approx a\log\frac{a}{a} + (x-a) + \frac{(x-a)^2}{2a},$$
and letting $O=x$ and $E=a$, then, assuming the null is true and $E_i\approx O_i$, we have
$$W = 2\sum_{i=1}^{m}Y_i\log\frac{Y_i}{n\tilde\pi_i}\approx 2\sum_{i=1}^{m}\Big((O_i-E_i) + \frac{(O_i-E_i)^2}{2E_i}\Big).$$
Now we note that $\sum_{i=1}^{m}E_i = \sum_{i=1}^{m}O_i = n$, hence the above reduces to
$$W\approx\sum_{i=1}^{m}\frac{(O_i-E_i)^2}{E_i}\xrightarrow{D}\chi^2_{m-1}.$$
We recall that the above is the Pearson test statistic. Hence this is one method for deriving the Pearson chi-squared test for goodness of fit. By using a similar argument, we can also obtain the test statistic of the chi-squared test for independence (and an explanation for the rather strange number of degrees of freedom!).
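A quick numerical comparison of the two statistics (my own sketch; the null probabilities and sample size are illustrative, and the log term requires all observed counts to be positive):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
pi_null = np.array([0.2, 0.3, 0.5])
n = 500
observed = rng.multinomial(n, pi_null)           # O_i, generated under the null
expected = n * pi_null                            # E_i = n * pi_tilde_i

W = 2 * np.sum(observed * np.log(observed / expected))       # likelihood ratio
pearson = np.sum((observed - expected) ** 2 / expected)      # Pearson X^2

df = len(pi_null) - 1
print(W, pearson, chi2.sf(W, df), chi2.sf(pearson, df))      # close for large n
```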

Chapter 6
The Profile Likelihood

6.1 The Profile Likelihood

See also Section 4.5, Davison (2002).

6.1.1 The method of profiling

Let us suppose that the unknown parameters $\theta$ can be partitioned as $\theta = (\psi,\lambda)$, where $\psi$ are the $p$-dimensional parameters of interest (e.g. the mean) and $\lambda$ are the $q$-dimensional nuisance parameters (e.g. the variance). We will need to estimate both $\psi$ and $\lambda$, but our interest is in testing only the parameter $\psi$, without any information on $\lambda$, and in constructing confidence intervals for $\psi$ without constructing unnecessary confidence intervals for $\lambda$ — confidence intervals for a large number of parameters are wider than those for a few parameters. To achieve this one often uses the profile likelihood. To motivate the profile likelihood, we first describe a method to estimate the parameters $(\psi,\lambda)$ in two stages and consider some examples.

Let us suppose that $\{X_t\}$ are iid random variables with density $f(x;\psi,\lambda)$, where our objective is to estimate $\psi$ and $\lambda$. In this case the log-likelihood is
$$\mathcal{L}_T(\psi,\lambda) = \sum_{t=1}^{T}\log f(X_t;\psi,\lambda).$$
To estimate $\psi$ and $\lambda$ one can use $(\hat\lambda_T,\hat\psi_T) = \arg\max_{\lambda,\psi}\mathcal{L}_T(\psi,\lambda)$. However, this can be quite difficult and lead to expressions which are hard to maximise. Instead let us consider a different method, which may sometimes be easier to evaluate. Suppose, for now, that $\psi$ is known; then we rewrite the likelihood as $\mathcal{L}_T(\psi,\lambda) = \mathcal{L}_\psi(\lambda)$ (to show that $\psi$ is fixed but $\lambda$ varies). To estimate $\lambda$ we maximise $\mathcal{L}_\psi(\lambda)$ with respect to $\lambda$, i.e.
$$\hat\lambda_\psi = \arg\max_{\lambda}\mathcal{L}_\psi(\lambda).$$

In reality $\psi$ is unknown, hence for each $\psi$ we can evaluate $\hat\lambda_\psi$. Note that for each $\psi$ we have a new curve $\mathcal{L}_\psi(\lambda)$ over $\lambda$. Now to estimate $\psi$, we evaluate the maximum of $\mathcal{L}_\psi(\lambda)$ over $\lambda$, and choose the $\psi$ which is the maximum over all these curves. In other words, we evaluate
$$\hat\psi_T = \arg\max_{\psi}\mathcal{L}_\psi(\hat\lambda_\psi) = \arg\max_{\psi}\mathcal{L}_T(\psi,\hat\lambda_\psi).$$
A bit of logical deduction shows that $\hat\psi_T$ and $\hat\lambda_{\hat\psi_T}$ are the maximum likelihood estimators, $(\hat\lambda_T,\hat\psi_T) = \arg\max_{\psi,\lambda}\mathcal{L}_T(\psi,\lambda)$. We note that we have profiled out the nuisance parameter $\lambda$, and the likelihood $\mathcal{L}_\psi(\hat\lambda_\psi) = \mathcal{L}_T(\psi,\hat\lambda_\psi)$ is completely in terms of the parameter of interest $\psi$. The advantage of this is best illustrated through some examples.

Example 6.1.1 Let us suppose that $\{X_t\}$ are iid random variables from a Weibull distribution with density $f(x;\alpha,\theta) = \frac{\alpha y^{\alpha-1}}{\theta^{\alpha}}\exp(-(y/\theta)^{\alpha})$. We know from Example 4.1.2 that if $\alpha$ were known, an explicit expression for the MLE can be derived:
$$\hat\theta_\alpha = \arg\max_{\theta}\mathcal{L}_\alpha(\theta) = \arg\max_{\theta}\sum_{t=1}^{T}\Big(\log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - \Big(\frac{Y_t}{\theta}\Big)^{\alpha}\Big) = \Big(\frac{1}{T}\sum_{t=1}^{T}Y_t^{\alpha}\Big)^{1/\alpha},$$
where $\mathcal{L}_\alpha(X;\theta) = \sum_{t=1}^{T}\big(\log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - (Y_t/\theta)^{\alpha}\big)$. Thus, for a given $\alpha$, the maximum likelihood estimator of $\theta$ can be derived. The maximum likelihood estimator of $\alpha$ is then
$$\hat\alpha_T = \arg\max_{\alpha}\sum_{t=1}^{T}\Big(\log\alpha + (\alpha-1)\log Y_t - \log\Big(\frac{1}{T}\sum_{s=1}^{T}Y_s^{\alpha}\Big) - \frac{Y_t^{\alpha}}{\frac{1}{T}\sum_{s=1}^{T}Y_s^{\alpha}}\Big).$$
Therefore, the maximum likelihood estimator of $\theta$ is $\big(\frac{1}{T}\sum_{t}Y_t^{\hat\alpha_T}\big)^{1/\hat\alpha_T}$. We observe that evaluating $\hat\alpha_T$ can be tricky, but it is no worse than maximising the likelihood $\mathcal{L}_T(\alpha,\theta)$ over $\alpha$ and $\theta$.

As we mentioned above, we often do not have any interest in the nuisance parameter $\lambda$ and are only interested in testing and constructing CIs for $\psi$. In this case, we are interested in the limiting distribution of the MLE $\hat\psi_T$. This can easily be derived by observing that
$$\sqrt T\begin{pmatrix}\hat\psi_T-\psi\\ \hat\lambda_T-\lambda\end{pmatrix}\xrightarrow{D}\mathcal{N}\Big(0,\ \begin{pmatrix}I_{\psi\psi} & I_{\psi\lambda}\\ I_{\lambda\psi} & I_{\lambda\lambda}\end{pmatrix}^{-1}\Big),$$

where
$$\begin{pmatrix}I_{\psi\psi} & I_{\psi\lambda}\\ I_{\lambda\psi} & I_{\lambda\lambda}\end{pmatrix} = \begin{pmatrix}-E\big(\frac{\partial^2\log f(X_t;\psi,\lambda)}{\partial\psi^2}\big) & -E\big(\frac{\partial^2\log f(X_t;\psi,\lambda)}{\partial\psi\,\partial\lambda}\big)\\ -E\big(\frac{\partial^2\log f(X_t;\psi,\lambda)}{\partial\lambda\,\partial\psi}\big) & -E\big(\frac{\partial^2\log f(X_t;\psi,\lambda)}{\partial\lambda^2}\big)\end{pmatrix}.\qquad(6.1)$$
To derive an exact expression for the limiting variance of $\sqrt T(\hat\psi_T-\psi)$, we note that the inverse of a block matrix is
$$\begin{pmatrix}A & B\\ C & D\end{pmatrix}^{-1} = \begin{pmatrix}(A-BD^{-1}C)^{-1} & -(A-BD^{-1}C)^{-1}BD^{-1}\\ -D^{-1}C(A-BD^{-1}C)^{-1} & D^{-1}+D^{-1}C(A-BD^{-1}C)^{-1}BD^{-1}\end{pmatrix}.$$
Thus the above implies that
$$\sqrt T(\hat\psi_T-\psi)\xrightarrow{D}\mathcal{N}\big(0,\ (I_{\psi\psi}-I_{\psi\lambda}I_{\lambda\lambda}^{-1}I_{\lambda\psi})^{-1}\big).$$
Thus if $\psi$ is a scalar we can easily use the above to construct confidence intervals for $\psi$. Exercise: How would you estimate $I_{\psi\psi}-I_{\psi\lambda}I_{\lambda\lambda}^{-1}I_{\lambda\psi}$?

6.1.2 The score and the log-likelihood ratio for the profile likelihood

To ease notation, let us suppose that $\psi_0$ and $\lambda_0$ are the true parameters in the distribution. The above gives us the limiting distribution of $\sqrt T(\hat\psi_T-\psi_0)$; this allows us to test $\psi$, however the test ignores any dependency that may exist with the nuisance parameter estimator $\hat\lambda_T$. An alternative test, which circumvents this issue, is to do a log-likelihood ratio test of the type
$$2\Big\{\max_{\psi,\lambda}\mathcal{L}_T(\psi,\lambda) - \max_{\lambda}\mathcal{L}_T(\psi_0,\lambda)\Big\}.\qquad(6.2)$$
However, deriving the limiting distribution of this statistic is a little more complicated than for the log-likelihood ratio test that does not involve nuisance parameters. This is because a direct Taylor expansion does not work. However, we observe that
$$\max_{\psi,\lambda}\mathcal{L}_T(\psi,\lambda) - \max_{\lambda}\mathcal{L}_T(\psi_0,\lambda) = \Big\{\max_{\psi,\lambda}\mathcal{L}_T(\psi,\lambda) - \mathcal{L}_T(\psi_0,\lambda_0)\Big\} - \Big\{\max_{\lambda}\mathcal{L}_T(\psi_0,\lambda) - \mathcal{L}_T(\psi_0,\lambda_0)\Big\},$$
and we will show below that by using a few Taylor expansions we can derive the limiting distribution of (6.2). In the theorem below we will derive the distribution of the score and the nested log-likelihood. (Please note: you do not have to learn this proof.)

Theorem 6.1.1 Suppose Assumption 4.1.1 holds. Suppose that $\psi_0,\lambda_0$ are the true parameters. Then we have
$$\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\hat\lambda_{\psi_0},\psi_0}\approx\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\lambda_0,\psi_0} - I_{\psi\lambda}I_{\lambda\lambda}^{-1}\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\psi_0,\lambda_0}\qquad(6.3)$$

and
$$\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\hat\lambda_{\psi_0},\psi_0}\xrightarrow{D}\mathcal{N}\big(0,\ I_{\psi_0\psi_0} - I_{\psi_0\lambda_0}I_{\lambda_0\lambda_0}^{-1}I_{\lambda_0\psi_0}\big),\qquad(6.4)$$
$$2\Big\{\mathcal{L}_T(\hat\psi,\hat\lambda) - \mathcal{L}_T(\psi_0,\hat\lambda_{\psi_0})\Big\}\xrightarrow{D}\chi^2_p,\qquad(6.5)$$
where $I$ is defined as in (6.1).

PROOF. We first prove (6.3), which is the basis of the proofs of (6.4) and (6.5); in the remark below we try to interpret (6.3). To avoid notational difficulties, by considering the elements of the vectors $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\hat\lambda_{\psi_0},\psi_0}$ and $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\lambda}\big|_{\lambda_0,\psi_0}$ pointwise (as discussed in Section 4.1.3), we will suppose that these are univariate random variables.

Our objective is to find an expression for $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\hat\lambda_{\psi_0},\psi_0}$ in terms of $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\lambda}\big|_{\lambda_0,\psi_0}$ and $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\lambda_0,\psi_0}$, which will allow us to obtain its variance and asymptotic distribution easily. Now, making a Taylor expansion of $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\hat\lambda_{\psi_0},\psi_0}$ about $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\lambda_0,\psi_0}$ gives
$$\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\hat\lambda_{\psi_0},\psi_0}\approx\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\lambda_0,\psi_0} + \frac{\partial^2\mathcal{L}_T(\psi,\lambda)}{\partial\lambda\,\partial\psi}\Big|_{\lambda_0,\psi_0}\,(\hat\lambda_{\psi_0}-\lambda_0).$$
Notice that we have used $\approx$ instead of $=$ because we replace the second derivative with its value at the true parameters. Now, if the sample size $T$ is large enough, then we can say that $\frac{1}{T}\frac{\partial^2\mathcal{L}_T(\psi,\lambda)}{\partial\lambda\,\partial\psi}\big|_{\lambda_0,\psi_0}\approx E\big(\frac{1}{T}\frac{\partial^2\mathcal{L}_T(\psi,\lambda)}{\partial\lambda\,\partial\psi}\big|_{\lambda_0,\psi_0}\big)$. To see why this is true, consider the case of iid random variables; then
$$\frac{1}{T}\frac{\partial^2\mathcal{L}_T(\psi,\lambda)}{\partial\lambda\,\partial\psi}\Big|_{\lambda_0,\psi_0} = \frac{1}{T}\sum_{t=1}^{T}\frac{\partial^2\log f(X_t;\psi,\lambda)}{\partial\lambda\,\partial\psi}\Big|_{\lambda_0,\psi_0}\xrightarrow{P}E\Big(\frac{\partial^2\log f(X_t;\psi,\lambda)}{\partial\lambda\,\partial\psi}\Big|_{\lambda_0,\psi_0}\Big).$$
Therefore we have
$$\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\hat\lambda_{\psi_0},\psi_0}\approx\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\lambda_0,\psi_0} - T\,I_{\psi\lambda}\,(\hat\lambda_{\psi_0}-\lambda_0).\qquad(6.6)$$
Hence we have the first part of the decomposition of $\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\psi}\big|_{\hat\lambda_{\psi_0}}$ into distributions which are known; now we need to find a decomposition of $(\hat\lambda_{\psi_0}-\lambda_0)$ into known distributions. We first recall that, since $\hat\lambda_{\psi_0}=\arg\max_\lambda\mathcal{L}_T(\psi_0,\lambda)$, we have
$$\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\hat\lambda_{\psi_0}}=0,$$
as long as the parameter space is large enough and the maximum is not on the boundary. Therefore, making a Taylor expansion of $\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\big|_{\hat\lambda_{\psi_0}}$ about $\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\big|_{\lambda_0}$ gives
$$\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\hat\lambda_{\psi_0}}\approx\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\lambda_0,\psi_0} + \frac{\partial^2\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda^2}\Big|_{\lambda_0,\psi_0}\,(\hat\lambda_{\psi_0}-\lambda_0).$$

Again, using the same trick as in (6.6) we have
$$0 = \frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\hat\lambda_{\psi_0}}\approx\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\lambda_0,\psi_0} - T\,I_{\lambda\lambda}\,(\hat\lambda_{\psi_0}-\lambda_0).$$
Therefore
$$\hat\lambda_{\psi_0}-\lambda_0\approx(T\,I_{\lambda\lambda})^{-1}\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\lambda_0,\psi_0}.\qquad(6.7)$$
Therefore, substituting (6.7) into (6.6) gives
$$\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\hat\lambda_{\psi_0},\psi_0}\approx\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\lambda_0,\psi_0} - I_{\psi\lambda}I_{\lambda\lambda}^{-1}\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\psi_0,\lambda_0},$$
and hence (6.3).

To prove (6.4) (i.e. to obtain the asymptotic distribution and limiting variance of $\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\psi}\big|_{\hat\lambda_{\psi_0}}$), we recall that the regular score function satisfies
$$\frac{1}{\sqrt T}\begin{pmatrix}\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\lambda_0,\psi_0}\\ \frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\lambda}\big|_{\lambda_0,\psi_0}\end{pmatrix}\xrightarrow{D}\mathcal{N}\big(0,\ I(\theta_0)\big).$$
Now, by substituting the above into (6.3) we immediately obtain (6.4). Finally, to prove (6.5) we use the following decomposition, Taylor expansions and the trick in (6.6) to obtain
$$2\big\{\mathcal{L}_T(\hat\psi,\hat\lambda)-\mathcal{L}_T(\psi_0,\hat\lambda_{\psi_0})\big\} = 2\big\{\mathcal{L}_T(\hat\psi,\hat\lambda)-\mathcal{L}_T(\psi_0,\lambda_0)\big\} - 2\big\{\mathcal{L}_T(\psi_0,\hat\lambda_{\psi_0})-\mathcal{L}_T(\psi_0,\lambda_0)\big\}$$
$$\approx T(\hat\theta_T-\theta_0)'I(\theta_0)(\hat\theta_T-\theta_0) - T(\hat\lambda_{\psi_0}-\lambda_0)'I_{\lambda\lambda}(\hat\lambda_{\psi_0}-\lambda_0),\qquad(6.8)$$
where $\hat\theta_T = (\hat\psi,\hat\lambda)$ is the MLE. Now we want to rewrite $(\hat\lambda_{\psi_0}-\lambda_0)$ in terms of $(\hat\theta_T-\theta_0)$. We start by recalling from (6.7) that $\hat\lambda_{\psi_0}-\lambda_0\approx(T\,I_{\lambda\lambda})^{-1}\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\big|_{\lambda_0,\psi_0}$. Now we rewrite $\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\big|_{\lambda_0,\psi_0}$ in terms of $(\hat\theta_T-\theta_0)$ by using
$$\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\Big|_{\theta_0}\approx\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\Big|_{\hat\theta_T} + T\,I(\theta_0)(\hat\theta_T-\theta_0) = T\,I(\theta_0)(\hat\theta_T-\theta_0).$$
Therefore, concentrating on the subvector $\frac{\partial\mathcal{L}_T(\theta)}{\partial\lambda}\big|_{\psi_0,\lambda_0}$, we see that
$$\frac{\partial\mathcal{L}_T(\theta)}{\partial\lambda}\Big|_{\psi_0,\lambda_0}\approx T\,I_{\lambda\psi}(\hat\psi-\psi_0) + T\,I_{\lambda\lambda}(\hat\lambda-\lambda_0).\qquad(6.9)$$

Substituting (6.9) into (6.7) gives
$$\hat\lambda_{\psi_0}-\lambda_0\approx I_{\lambda\lambda}^{-1}I_{\lambda\psi}(\hat\psi-\psi_0) + (\hat\lambda-\lambda_0).$$
Finally, substituting the above into (6.8) and making lots of cancellations, we have
$$2\big\{\mathcal{L}_T(\hat\psi,\hat\lambda)-\mathcal{L}_T(\psi_0,\hat\lambda_{\psi_0})\big\}\approx T(\hat\psi-\psi_0)'\big(I_{\psi\psi}-I_{\psi\lambda}I_{\lambda\lambda}^{-1}I_{\lambda\psi}\big)(\hat\psi-\psi_0).$$
Finally, since $\sqrt T(\hat\psi-\psi_0)\xrightarrow{D}\mathcal{N}\big(0,(I_{\psi\psi}-I_{\psi\lambda}I_{\lambda\lambda}^{-1}I_{\lambda\psi})^{-1}\big)$ (by using the inversion formula for block matrices), this gives the desired result. □

Remark 6.1.1 (i) We first make the rather interesting observation: the limiting variance of $\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\psi_0,\lambda_0}$ is $I_{\psi\psi}$, whereas the limiting variance of $\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\hat\lambda_{\psi_0},\psi_0}$ is $I_{\psi\psi}-I_{\psi\lambda}I_{\lambda\lambda}^{-1}I_{\lambda\psi}$, and the limiting variance of $\sqrt T(\hat\psi-\psi_0)$ is $(I_{\psi\psi}-I_{\psi\lambda}I_{\lambda\lambda}^{-1}I_{\lambda\psi})^{-1}$.

(ii) Look again at the expression
$$\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\hat\lambda_{\psi_0},\psi_0}\approx\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\lambda_0,\psi_0} - I_{\psi\lambda}I_{\lambda\lambda}^{-1}\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\lambda_0,\psi_0}.\qquad(6.10)$$
It is useful to understand where it came from. Consider the problem of linear regression. Suppose $X$ and $Y$ are random variables and we want to construct the best linear predictor of $Y$ given $X$. We know that the best linear predictor is $\hat Y(X)=\frac{E(XY)}{E(X^2)}X$, the residual is $Y-\hat Y(X)$, and the mean squared error is $E\big(Y-\frac{E(XY)}{E(X^2)}X\big)^2 = E(Y^2)-\frac{E(XY)^2}{E(X^2)}$. Compare this with (6.10): we see that, in some sense, $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\hat\lambda_{\psi_0},\psi_0}$ can be treated as the residual of the projection of $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\lambda_0,\psi_0}$ onto $\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\big|_{\lambda_0,\psi_0}$. This is quite surprising!

We now aim to use the above result. It is immediately clear that (6.5) can be used both for constructing CIs and for testing. For example, to construct a 95% CI for $\psi$ we can use the MLE $\hat\theta_T=(\hat\psi,\hat\lambda)$ and the profile likelihood, and use the 95% CI
$$\Big\{\psi:\ 2\big(\mathcal{L}_T(\hat\psi,\hat\lambda)-\mathcal{L}_T(\psi,\hat\lambda_\psi)\big)\le\chi^2_p(0.95)\Big\}.$$
As you can see, by profiling out the parameter $\lambda$ we have avoided the need to also construct a CI for $\lambda$. This has many advantages; from a practical perspective, it reduces the dimension of the parameters.
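Putting the pieces together for Example 6.1.1, here is a sketch of my own (not from the notes): the profile log-likelihood of α for the Weibull, with θ profiled out explicitly, and the corresponding 95% profile-likelihood confidence interval for α. The simulation settings and the root-finding brackets are heuristic choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar, brentq
from scipy.stats import chi2

rng = np.random.default_rng(8)
alpha0, theta0, T = 2.0, 1.5, 300
y = theta0 * rng.weibull(alpha0, size=T)

def profile_loglik(a):
    # theta profiled out explicitly: theta_hat(a) = (mean(Y^a))^(1/a)
    theta_a = np.mean(y ** a) ** (1.0 / a)
    return np.sum(np.log(a) + (a - 1) * np.log(y) - a * np.log(theta_a)
                  - (y / theta_a) ** a)

fit = minimize_scalar(lambda a: -profile_loglik(a), bounds=(0.1, 10.0),
                      method="bounded")
alpha_hat, lp_max = fit.x, -fit.fun

def g(a):
    # zero where the profile likelihood-ratio statistic hits the 95% cut-off
    return 2 * (lp_max - profile_loglik(a)) - chi2.ppf(0.95, df=1)

ci = (brentq(g, 0.1, alpha_hat), brentq(g, alpha_hat, 10.0))
print(alpha_hat, ci)
```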

The log-likelihood ratio test in the presence of nuisance parameters

An application of Theorem 6.1.1 is to nested hypothesis testing, as stated at the beginning of this section. Equation (6.5) can be used to test $H_0:\psi=\psi_0$ against $H_A:\psi\ne\psi_0$, since
$$2\Big\{\max_{\psi,\lambda}\mathcal{L}_T(\psi,\lambda) - \max_{\lambda}\mathcal{L}_T(\psi_0,\lambda)\Big\}\xrightarrow{D}\chi^2_p.$$

Example 6.1.2 (χ²-test for independence) It is worth noting that, using the profile likelihood, one can derive the chi-squared test for independence in much the same way that the Pearson goodness-of-fit test was derived using the log-likelihood ratio test. Do this as an exercise (see Davison (2002), Example 4.37).

The score test in the presence of nuisance parameters

We recall that we used Theorem 6.1.1 to obtain the distribution of $2\{\max_{\psi,\lambda}\mathcal{L}_T(\psi,\lambda)-\max_{\lambda}\mathcal{L}_T(\psi_0,\lambda)\}$ under the null; we now motivate an alternative test of the same hypothesis which uses the same theorem. We recall that under the null $H_0:\psi=\psi_0$ the derivative $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\lambda}\big|_{\hat\lambda_{\psi_0},\psi_0}=0$, but the same is not true of $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\hat\lambda_{\psi_0},\psi_0}$. However, if the null is true, we would expect $\hat\lambda_{\psi_0}$ to be close to the true $\lambda_0$ and $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\hat\lambda_{\psi_0},\psi_0}$ to be close to zero. Indeed, this is what we showed in (6.4), where we showed that under the null
$$\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\hat\lambda_{\psi_0},\psi_0}\xrightarrow{D}\mathcal{N}\big(0,\ I_{\psi\psi}-I_{\psi\lambda}I_{\lambda\lambda}^{-1}I_{\lambda\psi}\big),\qquad(6.11)$$
where $\hat\lambda_{\psi_0}=\arg\max_\lambda\mathcal{L}_T(\psi_0,\lambda)$. Therefore (6.11) suggests an alternative test for $H_0:\psi=\psi_0$ against $H_A:\psi\ne\psi_0$: we can use $\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\hat\lambda_{\psi_0},\psi_0}$ as the test statistic. This is called the score or LM test.

The log-likelihood ratio test and the score test are asymptotically equivalent. There are advantages and disadvantages of both:
(i) An advantage of the log-likelihood ratio test is that we do not need to calculate the information matrix.
(ii) An advantage of the score test is that we do not have to evaluate the maximum likelihood estimates under the alternative model.

6.1.3 Examples

Example: An application of profiling to frequency estimation

Question

Suppose that the observations $\{X_t;\ t=1,\ldots,T\}$ satisfy the following nonlinear regression model
$$X_t = A\cos(\omega t) + B\sin(\omega t) + \varepsilon_t,$$
where $\{\varepsilon_t\}$ are iid standard normal random variables and $0<\omega<\pi$. The parameters $A$, $B$ and $\omega$ are real and unknown. Some useful identities are given at the end of the question.

(i) Ignoring constants, obtain the log-likelihood of $\{X_t\}$. Denote this likelihood as $\mathcal{L}_T(A,B,\omega)$.

(ii) Let
$$S_T(A,B,\omega) = \sum_{t=1}^{T}\Big(X_t^2 - 2X_t\big(A\cos(\omega t)+B\sin(\omega t)\big)\Big) + \frac{T}{2}(A^2+B^2).$$
Show that
$$\mathcal{L}_T(A,B,\omega) + \frac{1}{2}S_T(A,B,\omega) = -\frac{A^2-B^2}{4}\sum_{t=1}^{T}\cos(2\omega t) - \frac{AB}{2}\sum_{t=1}^{T}\sin(2\omega t).$$
Thus show that $\mathcal{L}_T(A,B,\omega)+\frac{1}{2}S_T(A,B,\omega)=O(1)$ (i.e. the difference does not grow with $T$). Since $\mathcal{L}_T(A,B,\omega)$ and $-\frac{1}{2}S_T(A,B,\omega)$ are asymptotically equivalent, for the rest of this question use $S_T(A,B,\omega)$ instead of the likelihood $\mathcal{L}_T(A,B,\omega)$.

(iii) Obtain the profile likelihood of $\omega$. (Hint: profile out the parameters $A$ and $B$, to show that $\hat\omega_T = \arg\max_\omega\big|\sum_{t=1}^{T}X_t\exp(it\omega)\big|^2$.) Suggest a graphical method for evaluating $\hat\omega_T$.

(iv) By using the identity
$$\sum_{t=1}^{T}\exp(i\Omega t) = \begin{cases}\dfrac{\exp\big(\frac{i}{2}(T+1)\Omega\big)\sin\big(\frac{1}{2}T\Omega\big)}{\sin\big(\frac{1}{2}\Omega\big)} & 0<\Omega<2\pi,\\[6pt] T & \Omega=0\ \text{or}\ 2\pi,\end{cases}\qquad(6.12)$$
show that for $0<\Omega<2\pi$ we have
$$\sum_{t=1}^{T}t\cos(\Omega t)=O(T),\quad\sum_{t=1}^{T}t\sin(\Omega t)=O(T),\quad\sum_{t=1}^{T}t^2\cos(\Omega t)=O(T^2),\quad\sum_{t=1}^{T}t^2\sin(\Omega t)=O(T^2).$$

(v) By using the results in part (iv), show that the Fisher Information of $\mathcal{L}_T(A,B,\omega)$ (denoted as $I(A,B,\omega)$) is asymptotically equivalent to
$$I(A,B,\omega) = \frac{1}{2}E\big(\nabla^2 S_T\big) = \begin{pmatrix}\frac{T}{2} & 0 & \frac{BT^2}{4}+O(T)\\ 0 & \frac{T}{2} & -\frac{AT^2}{4}+O(T)\\ \frac{BT^2}{4}+O(T) & -\frac{AT^2}{4}+O(T) & \frac{T^3(A^2+B^2)}{6}+O(T^2)\end{pmatrix}.$$

(vi) Derive the asymptotic variance of the maximum likelihood estimator $\hat\omega_T$ derived in part (iii). Comment on the rate of convergence of $\hat\omega_T$.

Useful information: In this question the following quantities may be useful:
$$\sum_{t=1}^{T}\exp(i\Omega t) = \begin{cases}\dfrac{\exp\big(\frac{i}{2}(T+1)\Omega\big)\sin\big(\frac{1}{2}T\Omega\big)}{\sin\big(\frac{1}{2}\Omega\big)} & 0<\Omega<2\pi,\\[6pt] T & \Omega=0\ \text{or}\ 2\pi;\end{cases}\qquad(6.13)$$
the trigonometric identities $\sin(2\Omega)=2\sin\Omega\cos\Omega$, $\cos(2\Omega)=2\cos^2\Omega-1=1-2\sin^2\Omega$, $\exp(i\Omega)=\cos\Omega+i\sin\Omega$; and $\sum_{t=1}^{T}t=\frac{T(T+1)}{2}$, $\sum_{t=1}^{T}t^2=\frac{T(T+1)(2T+1)}{6}$.

Solution

(i) Since $\{\varepsilon_t\}$ are standard normal iid random variables, the log-likelihood is
$$\mathcal{L}_T(A,B,\omega) = -\frac{1}{2}\sum_{t=1}^{T}\big(X_t - A\cos(\omega t) - B\sin(\omega t)\big)^2.$$

(ii) It is straightforward to show that
$$2\mathcal{L}_T(A,B,\omega) = -\sum_{t=1}^{T}X_t^2 + 2\sum_{t=1}^{T}X_t\big(A\cos(\omega t)+B\sin(\omega t)\big) - \sum_{t=1}^{T}\big(A\cos(\omega t)+B\sin(\omega t)\big)^2,$$
and, using $\cos^2(\omega t)=\frac{1+\cos(2\omega t)}{2}$, $\sin^2(\omega t)=\frac{1-\cos(2\omega t)}{2}$ and $2\sin(\omega t)\cos(\omega t)=\sin(2\omega t)$,
$$\sum_{t=1}^{T}\big(A\cos(\omega t)+B\sin(\omega t)\big)^2 = \frac{T(A^2+B^2)}{2} + \frac{A^2-B^2}{2}\sum_{t=1}^{T}\cos(2\omega t) + AB\sum_{t=1}^{T}\sin(2\omega t).$$
Hence
$$2\mathcal{L}_T(A,B,\omega) = -S_T(A,B,\omega) - \frac{A^2-B^2}{2}\sum_{t=1}^{T}\cos(2\omega t) - AB\sum_{t=1}^{T}\sin(2\omega t).$$
Now, by using (6.13), we have $\mathcal{L}_T(A,B,\omega) = -\frac{1}{2}S_T(A,B,\omega) + O(1)$, as required.

(iii) To obtain the profile likelihood, let us suppose that $\omega$ is known. Then the minimisers of $S_T$ over $A$ and $B$ are
$$\hat A_T(\omega) = \frac{2}{T}\sum_{t=1}^{T}X_t\cos(\omega t),\qquad\hat B_T(\omega) = \frac{2}{T}\sum_{t=1}^{T}X_t\sin(\omega t).$$
Thus the profile objective (using the approximation $S_T$) is
$$S_p(\omega) = S_T\big(\hat A_T(\omega),\hat B_T(\omega),\omega\big) = \sum_{t=1}^{T}X_t^2 - 2\sum_{t=1}^{T}X_t\big(\hat A_T(\omega)\cos(\omega t)+\hat B_T(\omega)\sin(\omega t)\big) + \frac{T}{2}\big(\hat A_T(\omega)^2+\hat B_T(\omega)^2\big) = \sum_{t=1}^{T}X_t^2 - \frac{T}{2}\big(\hat A_T(\omega)^2+\hat B_T(\omega)^2\big).$$

Thus the $\omega$ which maximises $-\frac{1}{2}S_p(\omega)$ (equivalently, minimises $S_p(\omega)$) is the parameter that maximises $\hat A_T(\omega)^2+\hat B_T(\omega)^2$. Since $\hat A_T(\omega)^2+\hat B_T(\omega)^2 = \frac{4}{T^2}\big|\sum_{t=1}^{T}X_t\exp(it\omega)\big|^2$, we have
$$\hat\omega_T = \arg\min_{\omega}S_p(\omega) = \arg\max_{\omega}\big(\hat A_T(\omega)^2+\hat B_T(\omega)^2\big) = \arg\max_{\omega}\Big|\sum_{t=1}^{T}X_t\exp(it\omega)\Big|^2,$$
as required.

(iv) Differentiating both sides of (6.13) with respect to $\Omega$ and considering the real and imaginary parts gives $\sum_t t\cos(\Omega t)=O(T)$ and $\sum_t t\sin(\Omega t)=O(T)$. Differentiating both sides of (6.13) twice with respect to $\Omega$ gives the second pair of bounds.

(v) Differentiating $S_T(A,B,\omega) = \sum_t X_t^2 - 2\sum_t X_t(A\cos\omega t + B\sin\omega t) + \frac{T}{2}(A^2+B^2)$ twice with respect to $A$, $B$ and $\omega$ gives
$$\frac{\partial S_T}{\partial A} = -2\sum_t X_t\cos(\omega t) + TA,\qquad\frac{\partial S_T}{\partial B} = -2\sum_t X_t\sin(\omega t) + TB,\qquad\frac{\partial S_T}{\partial\omega} = 2\sum_t tX_t\big(A\sin(\omega t)-B\cos(\omega t)\big),$$
and
$$\frac{\partial^2 S_T}{\partial A^2} = \frac{\partial^2 S_T}{\partial B^2} = T,\quad\frac{\partial^2 S_T}{\partial A\partial B}=0,\quad\frac{\partial^2 S_T}{\partial\omega\partial A} = 2\sum_t tX_t\sin(\omega t),\quad\frac{\partial^2 S_T}{\partial\omega\partial B} = -2\sum_t tX_t\cos(\omega t),\quad\frac{\partial^2 S_T}{\partial\omega^2} = 2\sum_t t^2X_t\big(A\cos(\omega t)+B\sin(\omega t)\big).$$
Now, taking expectations of the above and using (iv), we have
$$E\Big(\frac{\partial^2 S_T}{\partial\omega\partial A}\Big) = 2\sum_{t=1}^{T}t\sin(\omega t)\big(A\cos(\omega t)+B\sin(\omega t)\big) = A\sum_{t=1}^{T}t\sin(2\omega t) + B\sum_{t=1}^{T}t\big(1-\cos(2\omega t)\big) = \frac{BT^2}{2}+O(T).$$

Using a similar argument we can show that $E\big(\frac{\partial^2 S_T}{\partial\omega\partial B}\big) = -\frac{AT^2}{2}+O(T)$ and
$$E\Big(\frac{\partial^2 S_T}{\partial\omega^2}\Big) = 2\sum_{t=1}^{T}t^2\big(A\cos(\omega t)+B\sin(\omega t)\big)^2 = \frac{(A^2+B^2)T^3}{3}+O(T^2).$$
Since $E(-\nabla^2\mathcal{L}_T)\approx\frac{1}{2}E(\nabla^2 S_T)$, this gives the required result.

(vi) Noting that the asymptotic variance of the profile likelihood estimator $\hat\omega_T$ is $\big(I_{\omega\omega}-I_{\omega,AB}\,I_{AB,AB}^{-1}\,I_{AB,\omega}\big)^{-1}$, by substituting (v) into the above we have
$$I_{\omega\omega}-I_{\omega,AB}\,I_{AB,AB}^{-1}\,I_{AB,\omega} = \frac{T^3(A^2+B^2)}{6} - \frac{T^3(A^2+B^2)}{8} + O(T^2) = \frac{T^3(A^2+B^2)}{24}+O(T^2).$$
Thus we observe that the asymptotic variance of $\hat\omega_T$ is $O(T^{-3})$. Typically estimators have a variance of order $O(T^{-1})$, so we see that the estimator $\hat\omega_T$ has a variance which converges to zero much faster. Thus the estimator is extremely good compared with the majority of parameter estimators.

Example: An application of profiling in survival analysis

Question This question also uses some methods from Survival Analysis, which is covered later in this course. Let $T_i$ denote the survival time of an electrical component. It is known that the regressors $x_i$ influence the survival time $T_i$. To model the influence the regressors have on the survival time, the Cox proportional hazards model is used with the exponential distribution as the baseline distribution and $\psi(x_i;\beta)=\exp(\beta x_i)$ as the link function. More precisely, the survival function of $T_i$ is
$$\mathcal{F}_i(t) = \mathcal{F}_0(t)^{\psi(x_i;\beta)},\qquad\text{where}\qquad\mathcal{F}_0(t)=\exp(-t/\theta).$$
Not all the survival times of the electrical components are observed, and there can arise censoring. Hence we observe $Y_i = \min(T_i,c_i)$, where $c_i$ is the censoring time, and $\delta_i$, where $\delta_i$ is the indicator variable with $\delta_i=0$ denoting censoring of the $i$th component and $\delta_i=1$ denoting that it is not censored. The parameters $\beta$ and $\theta$ are unknown.

(i) Derive the log-likelihood of $\{Y_i,\delta_i\}$.

(ii) Compute the profile likelihood of the regression parameter $\beta$, profiling out the baseline parameter $\theta$.

Solution

(i) The density and the survival function are
$$f_i(t) = \psi(x_i;\beta)\,\big\{\mathcal{F}_0(t)\big\}^{\psi(x_i;\beta)-1}f_0(t)\qquad\text{and}\qquad\mathcal{F}_i(t) = \mathcal{F}_0(t)^{\psi(x_i;\beta)}.$$
Hence, for this example, we have
$$\log f_i(t) = \log\psi(x_i;\beta) - \big(\psi(x_i;\beta)-1\big)\frac{t}{\theta} - \log\theta - \frac{t}{\theta},\qquad\log\mathcal{F}_i(t) = -\psi(x_i;\beta)\frac{t}{\theta}.$$
Therefore, the log-likelihood is
$$\mathcal{L}_n(\beta,\theta) = \sum_{i=1}^{n}\delta_i\big\{\log\psi(x_i;\beta) + \log f_0(Y_i) + (\psi(x_i;\beta)-1)\log\mathcal{F}_0(Y_i)\big\} + \sum_{i=1}^{n}(1-\delta_i)\,\psi(x_i;\beta)\log\mathcal{F}_0(Y_i)$$
$$= \sum_{i=1}^{n}\delta_i\big\{\log\psi(x_i;\beta) - \log\theta\big\} - \sum_{i=1}^{n}\psi(x_i;\beta)\frac{Y_i}{\theta}.$$

(ii) Keeping $\beta$ fixed, differentiating the above with respect to $\theta$ and equating to zero gives
$$\frac{\partial\mathcal{L}_n}{\partial\theta} = -\frac{\sum_{i=1}^{n}\delta_i}{\theta} + \frac{\sum_{i=1}^{n}\psi(x_i;\beta)Y_i}{\theta^2} = 0\qquad\text{and hence}\qquad\hat\theta(\beta) = \frac{\sum_{i=1}^{n}\psi(x_i;\beta)Y_i}{\sum_{i=1}^{n}\delta_i}.$$
Hence the profile likelihood is
$$\ell_P(\beta) = \sum_{i=1}^{n}\delta_i\big\{\log\psi(x_i;\beta) - \log\hat\theta(\beta)\big\} - \sum_{i=1}^{n}\frac{\psi(x_i;\beta)Y_i}{\hat\theta(\beta)}.$$
Hence, to obtain an estimator of $\beta$, we maximise the above with respect to $\beta$.
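A numerical sketch of part (ii) (my own simulation, not part of the notes; the covariates, censoring distribution and true parameter values are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(9)
n, beta_true, theta_true = 400, 0.8, 2.0
x = rng.normal(size=n)
psi_true = np.exp(beta_true * x)
# Survival function F_i(t) = exp(-psi * t / theta), i.e. exponential with
# mean theta / psi; censoring times are exponential as an illustrative choice.
t_surv = rng.exponential(theta_true / psi_true)
c = rng.exponential(3.0, size=n)
y = np.minimum(t_surv, c)
delta = (t_surv <= c).astype(float)          # 1 = observed, 0 = censored

def neg_profile_loglik(beta):
    psi = np.exp(beta * x)                   # log psi(x_i; beta) = beta * x_i
    theta_b = np.sum(psi * y) / np.sum(delta)      # theta profiled out
    return -(np.sum(delta * (beta * x - np.log(theta_b)))
             - np.sum(psi * y) / theta_b)

fit = minimize_scalar(neg_profile_loglik, bounds=(-3.0, 3.0), method="bounded")
print(fit.x)      # should be near beta_true for large n
```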


Stat 5102 Lecture Slides Deck 3. Charles J. Geyer School of Statistics University of Minnesota Stat 5102 Lecture Slides Deck 3 Charles J. Geyer School of Statistics University of Minnesota 1 Likelihood Inference We have learned one very general method of estimation: method of moments. the Now we

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang

More information

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it

More information

Master s Written Examination

Master s Written Examination Master s Written Examination Option: Statistics and Probability Spring 016 Full points may be obtained for correct answers to eight questions. Each numbered question which may have several parts is worth

More information

Regression Estimation Least Squares and Maximum Likelihood

Regression Estimation Least Squares and Maximum Likelihood Regression Estimation Least Squares and Maximum Likelihood Dr. Frank Wood Frank Wood, fwood@stat.columbia.edu Linear Regression Models Lecture 3, Slide 1 Least Squares Max(min)imization Function to minimize

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Chapter 7 Maximum Likelihood Estimation 7. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

Statement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:.

Statement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:. MATHEMATICAL STATISTICS Homework assignment Instructions Please turn in the homework with this cover page. You do not need to edit the solutions. Just make sure the handwriting is legible. You may discuss

More information

STAT 730 Chapter 4: Estimation

STAT 730 Chapter 4: Estimation STAT 730 Chapter 4: Estimation Timothy Hanson Department of Statistics, University of South Carolina Stat 730: Multivariate Analysis 1 / 23 The likelihood We have iid data, at least initially. Each datum

More information

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n Chapter 9 Hypothesis Testing 9.1 Wald, Rao, and Likelihood Ratio Tests Suppose we wish to test H 0 : θ = θ 0 against H 1 : θ θ 0. The likelihood-based results of Chapter 8 give rise to several possible

More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

Chapter 3 : Likelihood function and inference

Chapter 3 : Likelihood function and inference Chapter 3 : Likelihood function and inference 4 Likelihood function and inference The likelihood Information and curvature Sufficiency and ancilarity Maximum likelihood estimation Non-regular models EM

More information

Econometrics I, Estimation

Econometrics I, Estimation Econometrics I, Estimation Department of Economics Stanford University September, 2008 Part I Parameter, Estimator, Estimate A parametric is a feature of the population. An estimator is a function of the

More information

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing 1 In most statistics problems, we assume that the data have been generated from some unknown probability distribution. We desire

More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

Lecture 4 September 15

Lecture 4 September 15 IFT 6269: Probabilistic Graphical Models Fall 2017 Lecture 4 September 15 Lecturer: Simon Lacoste-Julien Scribe: Philippe Brouillard & Tristan Deleu 4.1 Maximum Likelihood principle Given a parametric

More information

Linear regression COMS 4771

Linear regression COMS 4771 Linear regression COMS 4771 1. Old Faithful and prediction functions Prediction problem: Old Faithful geyser (Yellowstone) Task: Predict time of next eruption. 1 / 40 Statistical model for time between

More information

f(x θ)dx with respect to θ. Assuming certain smoothness conditions concern differentiating under the integral the integral sign, we first obtain

f(x θ)dx with respect to θ. Assuming certain smoothness conditions concern differentiating under the integral the integral sign, we first obtain 0.1. INTRODUCTION 1 0.1 Introduction R. A. Fisher, a pioneer in the development of mathematical statistics, introduced a measure of the amount of information contained in an observaton from f(x θ). Fisher

More information

Correlation and Regression

Correlation and Regression Correlation and Regression October 25, 2017 STAT 151 Class 9 Slide 1 Outline of Topics 1 Associations 2 Scatter plot 3 Correlation 4 Regression 5 Testing and estimation 6 Goodness-of-fit STAT 151 Class

More information

Lecture 4: Testing Stuff

Lecture 4: Testing Stuff Lecture 4: esting Stuff. esting Hypotheses usually has three steps a. First specify a Null Hypothesis, usually denoted, which describes a model of H 0 interest. Usually, we express H 0 as a restricted

More information

Master s Written Examination

Master s Written Examination Master s Written Examination Option: Statistics and Probability Spring 05 Full points may be obtained for correct answers to eight questions Each numbered question (which may have several parts) is worth

More information

Peter Hoff Linear and multilinear models April 3, GLS for multivariate regression 5. 3 Covariance estimation for the GLM 8

Peter Hoff Linear and multilinear models April 3, GLS for multivariate regression 5. 3 Covariance estimation for the GLM 8 Contents 1 Linear model 1 2 GLS for multivariate regression 5 3 Covariance estimation for the GLM 8 4 Testing the GLH 11 A reference for some of this material can be found somewhere. 1 Linear model Recall

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Statistical inference

Statistical inference Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall

More information

Chapter 3. Point Estimation. 3.1 Introduction

Chapter 3. Point Estimation. 3.1 Introduction Chapter 3 Point Estimation Let (Ω, A, P θ ), P θ P = {P θ θ Θ}be probability space, X 1, X 2,..., X n : (Ω, A) (IR k, B k ) random variables (X, B X ) sample space γ : Θ IR k measurable function, i.e.

More information

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided Let us first identify some classes of hypotheses. simple versus simple H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided H 0 : θ θ 0 versus H 1 : θ > θ 0. (2) two-sided; null on extremes H 0 : θ θ 1 or

More information

Estimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators

Estimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators Estimation theory Parametric estimation Properties of estimators Minimum variance estimator Cramer-Rao bound Maximum likelihood estimators Confidence intervals Bayesian estimation 1 Random Variables Let

More information

ECE531 Lecture 10b: Maximum Likelihood Estimation

ECE531 Lecture 10b: Maximum Likelihood Estimation ECE531 Lecture 10b: Maximum Likelihood Estimation D. Richard Brown III Worcester Polytechnic Institute 05-Apr-2011 Worcester Polytechnic Institute D. Richard Brown III 05-Apr-2011 1 / 23 Introduction So

More information

Inference in non-linear time series

Inference in non-linear time series Intro LS MLE Other Erik Lindström Centre for Mathematical Sciences Lund University LU/LTH & DTU Intro LS MLE Other General Properties Popular estimatiors Overview Introduction General Properties Estimators

More information

More Empirical Process Theory

More Empirical Process Theory More Empirical Process heory 4.384 ime Series Analysis, Fall 2008 Recitation by Paul Schrimpf Supplementary to lectures given by Anna Mikusheva October 24, 2008 Recitation 8 More Empirical Process heory

More information

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Due Thursday, September 19, in class What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Statistical Inference

Statistical Inference Statistical Inference Classical and Bayesian Methods Revision Class for Midterm Exam AMS-UCSC Th Feb 9, 2012 Winter 2012. Session 1 (Revision Class) AMS-132/206 Th Feb 9, 2012 1 / 23 Topics Topics We will

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

EM Algorithm II. September 11, 2018

EM Algorithm II. September 11, 2018 EM Algorithm II September 11, 2018 Review EM 1/27 (Y obs, Y mis ) f (y obs, y mis θ), we observe Y obs but not Y mis Complete-data log likelihood: l C (θ Y obs, Y mis ) = log { f (Y obs, Y mis θ) Observed-data

More information

University of Regina. Lecture Notes. Michael Kozdron

University of Regina. Lecture Notes. Michael Kozdron University of Regina Statistics 252 Mathematical Statistics Lecture Notes Winter 2005 Michael Kozdron kozdron@math.uregina.ca www.math.uregina.ca/ kozdron Contents 1 The Basic Idea of Statistics: Estimating

More information

Economics 583: Econometric Theory I A Primer on Asymptotics

Economics 583: Econometric Theory I A Primer on Asymptotics Economics 583: Econometric Theory I A Primer on Asymptotics Eric Zivot January 14, 2013 The two main concepts in asymptotic theory that we will use are Consistency Asymptotic Normality Intuition consistency:

More information

STA 2201/442 Assignment 2

STA 2201/442 Assignment 2 STA 2201/442 Assignment 2 1. This is about how to simulate from a continuous univariate distribution. Let the random variable X have a continuous distribution with density f X (x) and cumulative distribution

More information

1 Appendix A: Matrix Algebra

1 Appendix A: Matrix Algebra Appendix A: Matrix Algebra. Definitions Matrix A =[ ]=[A] Symmetric matrix: = for all and Diagonal matrix: 6=0if = but =0if 6= Scalar matrix: the diagonal matrix of = Identity matrix: the scalar matrix

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

The Multivariate Gaussian Distribution [DRAFT]

The Multivariate Gaussian Distribution [DRAFT] The Multivariate Gaussian Distribution DRAFT David S. Rosenberg Abstract This is a collection of a few key and standard results about multivariate Gaussian distributions. I have not included many proofs,

More information

Multivariate Analysis and Likelihood Inference

Multivariate Analysis and Likelihood Inference Multivariate Analysis and Likelihood Inference Outline 1 Joint Distribution of Random Variables 2 Principal Component Analysis (PCA) 3 Multivariate Normal Distribution 4 Likelihood Inference Joint density

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

The outline for Unit 3

The outline for Unit 3 The outline for Unit 3 Unit 1. Introduction: The regression model. Unit 2. Estimation principles. Unit 3: Hypothesis testing principles. 3.1 Wald test. 3.2 Lagrange Multiplier. 3.3 Likelihood Ratio Test.

More information

Chapter 4 HOMEWORK ASSIGNMENTS. 4.1 Homework #1

Chapter 4 HOMEWORK ASSIGNMENTS. 4.1 Homework #1 Chapter 4 HOMEWORK ASSIGNMENTS These homeworks may be modified as the semester progresses. It is your responsibility to keep up to date with the correctly assigned homeworks. There may be some errors in

More information

Lecture 5: LDA and Logistic Regression

Lecture 5: LDA and Logistic Regression Lecture 5: and Logistic Regression Hao Helen Zhang Hao Helen Zhang Lecture 5: and Logistic Regression 1 / 39 Outline Linear Classification Methods Two Popular Linear Models for Classification Linear Discriminant

More information

Institute of Actuaries of India

Institute of Actuaries of India Institute of Actuaries of India Subject CT3 Probability & Mathematical Statistics May 2011 Examinations INDICATIVE SOLUTION Introduction The indicative solution has been written by the Examiners with the

More information

Asymptotics for Nonlinear GMM

Asymptotics for Nonlinear GMM Asymptotics for Nonlinear GMM Eric Zivot February 13, 2013 Asymptotic Properties of Nonlinear GMM Under standard regularity conditions (to be discussed later), it can be shown that where ˆθ(Ŵ) θ 0 ³ˆθ(Ŵ)

More information

Testing Statistical Hypotheses

Testing Statistical Hypotheses E.L. Lehmann Joseph P. Romano Testing Statistical Hypotheses Third Edition 4y Springer Preface vii I Small-Sample Theory 1 1 The General Decision Problem 3 1.1 Statistical Inference and Statistical Decisions

More information

Semiparametric Regression

Semiparametric Regression Semiparametric Regression Patrick Breheny October 22 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/23 Introduction Over the past few weeks, we ve introduced a variety of regression models under

More information

Introduction to Simple Linear Regression

Introduction to Simple Linear Regression Introduction to Simple Linear Regression Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Introduction to Simple Linear Regression 1 / 68 About me Faculty in the Department

More information

exp{ (x i) 2 i=1 n i=1 (x i a) 2 (x i ) 2 = exp{ i=1 n i=1 n 2ax i a 2 i=1

exp{ (x i) 2 i=1 n i=1 (x i a) 2 (x i ) 2 = exp{ i=1 n i=1 n 2ax i a 2 i=1 4 Hypothesis testing 4. Simple hypotheses A computer tries to distinguish between two sources of signals. Both sources emit independent signals with normally distributed intensity, the signals of the first

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Likelihood Inference in the Presence of Nuisance Parameters

Likelihood Inference in the Presence of Nuisance Parameters PHYSTAT2003, SLAC, September 8-11, 2003 1 Likelihood Inference in the Presence of Nuance Parameters N. Reid, D.A.S. Fraser Department of Stattics, University of Toronto, Toronto Canada M5S 3G3 We describe

More information

Brief Review on Estimation Theory

Brief Review on Estimation Theory Brief Review on Estimation Theory K. Abed-Meraim ENST PARIS, Signal and Image Processing Dept. abed@tsi.enst.fr This presentation is essentially based on the course BASTA by E. Moulines Brief review on

More information

Review and continuation from last week Properties of MLEs

Review and continuation from last week Properties of MLEs Review and continuation from last week Properties of MLEs As we have mentioned, MLEs have a nice intuitive property, and as we have seen, they have a certain equivariance property. We will see later that

More information

ECE 275A Homework 7 Solutions

ECE 275A Homework 7 Solutions ECE 275A Homework 7 Solutions Solutions 1. For the same specification as in Homework Problem 6.11 we want to determine an estimator for θ using the Method of Moments (MOM). In general, the MOM estimator

More information

Vector Auto-Regressive Models

Vector Auto-Regressive Models Vector Auto-Regressive Models Laurent Ferrara 1 1 University of Paris Nanterre M2 Oct. 2018 Overview of the presentation 1. Vector Auto-Regressions Definition Estimation Testing 2. Impulse responses functions

More information

Lecture 1: August 28

Lecture 1: August 28 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 1: August 28 Our broad goal for the first few lectures is to try to understand the behaviour of sums of independent random

More information

VAR Models and Applications

VAR Models and Applications VAR Models and Applications Laurent Ferrara 1 1 University of Paris West M2 EIPMC Oct. 2016 Overview of the presentation 1. Vector Auto-Regressions Definition Estimation Testing 2. Impulse responses functions

More information

Greene, Econometric Analysis (6th ed, 2008)

Greene, Econometric Analysis (6th ed, 2008) EC771: Econometrics, Spring 2010 Greene, Econometric Analysis (6th ed, 2008) Chapter 17: Maximum Likelihood Estimation The preferred estimator in a wide variety of econometric settings is that derived

More information

BTRY 4090: Spring 2009 Theory of Statistics

BTRY 4090: Spring 2009 Theory of Statistics BTRY 4090: Spring 2009 Theory of Statistics Guozhang Wang September 25, 2010 1 Review of Probability We begin with a real example of using probability to solve computationally intensive (or infeasible)

More information

HT Introduction. P(X i = x i ) = e λ λ x i

HT Introduction. P(X i = x i ) = e λ λ x i MODS STATISTICS Introduction. HT 2012 Simon Myers, Department of Statistics (and The Wellcome Trust Centre for Human Genetics) myers@stats.ox.ac.uk We will be concerned with the mathematical framework

More information

Statistical Estimation

Statistical Estimation Statistical Estimation Use data and a model. The plug-in estimators are based on the simple principle of applying the defining functional to the ECDF. Other methods of estimation: minimize residuals from

More information

Exercises and Answers to Chapter 1

Exercises and Answers to Chapter 1 Exercises and Answers to Chapter The continuous type of random variable X has the following density function: a x, if < x < a, f (x), otherwise. Answer the following questions. () Find a. () Obtain mean

More information

Math 494: Mathematical Statistics

Math 494: Mathematical Statistics Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/

More information

ELEG 5633 Detection and Estimation Minimum Variance Unbiased Estimators (MVUE)

ELEG 5633 Detection and Estimation Minimum Variance Unbiased Estimators (MVUE) 1 ELEG 5633 Detection and Estimation Minimum Variance Unbiased Estimators (MVUE) Jingxian Wu Department of Electrical Engineering University of Arkansas Outline Minimum Variance Unbiased Estimators (MVUE)

More information

Lecture 15. Hypothesis testing in the linear model

Lecture 15. Hypothesis testing in the linear model 14. Lecture 15. Hypothesis testing in the linear model Lecture 15. Hypothesis testing in the linear model 1 (1 1) Preliminary lemma 15. Hypothesis testing in the linear model 15.1. Preliminary lemma Lemma

More information

Robustness and Distribution Assumptions

Robustness and Distribution Assumptions Chapter 1 Robustness and Distribution Assumptions 1.1 Introduction In statistics, one often works with model assumptions, i.e., one assumes that data follow a certain model. Then one makes use of methodology

More information

Terminology Suppose we have N observations {x(n)} N 1. Estimators as Random Variables. {x(n)} N 1

Terminology Suppose we have N observations {x(n)} N 1. Estimators as Random Variables. {x(n)} N 1 Estimation Theory Overview Properties Bias, Variance, and Mean Square Error Cramér-Rao lower bound Maximum likelihood Consistency Confidence intervals Properties of the mean estimator Properties of the

More information