The Maximum Likelihood Estimator


Chapter 4
The Maximum Likelihood Estimator

4.1 The maximum likelihood estimator

As illustrated for the exponential family of distributions discussed above, the maximum likelihood estimator of $\theta_0$ (the true parameter) is defined as
$$\hat{\theta}_T = \arg\max_{\theta\in\Theta}\mathcal{L}_T(X;\theta) = \arg\max_{\theta\in\Theta}\mathcal{L}_T(\theta).$$
Often we find that $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta=\hat\theta_T}=0$, hence the solution can be obtained by solving the derivative of the log-likelihood (often called the score function). However, if $\theta_0$ lies on or close to the boundary of the parameter space this will not necessarily be true. Below we consider the sampling properties of $\hat\theta_T$ when the true parameter $\theta_0$ lies in the interior of the parameter space $\Theta$.

We note that the likelihood is invariant to transformations of the data. For example, if $X$ has the density $f(\cdot;\theta)$ and we define the transformed random variable $Z=g(X)$, where the function $g$ has an inverse (it is a 1-1 transformation), then it is easy to show that the density of $Z$ is $f(g^{-1}(z);\theta)\,\big|\frac{\partial g^{-1}(z)}{\partial z}\big|$. Therefore the likelihood of $\{Z_t=g(X_t)\}$ is
$$\prod_{t=1}^{T} f(g^{-1}(Z_t);\theta)\,\Big|\frac{\partial g^{-1}(z)}{\partial z}\Big|_{z=Z_t} = \prod_{t=1}^{T} f(X_t;\theta)\,\Big|\frac{\partial g^{-1}(z)}{\partial z}\Big|_{z=Z_t}.$$
Hence it is proportional to the likelihood of $\{X_t\}$, and the maximum of the likelihood of $\{Z_t=g(X_t)\}$ is the same as the maximum of the likelihood of $\{X_t\}$.

4.1.1 Evaluating the MLE: Examples

Example 4.1.1 $\{X_t\}$ are iid random variables which follow a Normal (Gaussian) distribution $\mathcal{N}(\mu,\sigma^2)$. The log-likelihood is proportional to
$$\mathcal{L}_T(X;\mu,\sigma^2) = -\frac{T}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{t=1}^{T}(X_t-\mu)^2.$$
Maximising the above with respect to $\mu$ and $\sigma^2$ gives $\hat\mu = \bar X$ and $\hat\sigma^2 = \frac{1}{T}\sum_{t=1}^{T}(X_t-\bar X)^2$.

Example 4.1.2 Question: $\{X_t\}$ are iid random variables which follow a Weibull distribution, which has the density $\frac{\alpha y^{\alpha-1}}{\theta^{\alpha}}\exp\big(-(y/\theta)^{\alpha}\big)$, $\theta,\alpha>0$. Suppose that $\alpha$ is known, but $\theta$ is unknown and we need to estimate it. What is the maximum likelihood estimator of $\theta$?

Solution: The log-likelihood of interest is proportional to
$$\mathcal{L}_T(X;\theta) = \sum_{t=1}^{T}\Big(\log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - \Big(\frac{Y_t}{\theta}\Big)^{\alpha}\Big) \propto \sum_{t=1}^{T}\Big(-\alpha\log\theta - \Big(\frac{Y_t}{\theta}\Big)^{\alpha}\Big).$$
The derivative of the log-likelihood with respect to $\theta$ is
$$\frac{\partial\mathcal{L}_T}{\partial\theta} = -\frac{T\alpha}{\theta} + \frac{\alpha}{\theta^{\alpha+1}}\sum_{t=1}^{T}Y_t^{\alpha} = 0.$$
Solving the above gives $\hat\theta_T = \big(\frac{1}{T}\sum_{t=1}^{T}Y_t^{\alpha}\big)^{1/\alpha}$.

Example 4.1.3 Notice that if $\alpha$ is given, an explicit solution for the maximum of the likelihood in the above example can be obtained. Consider instead the maximum of the likelihood with respect to both $\alpha$ and $\theta$, i.e.
$$\arg\max_{\theta,\alpha}\sum_{t=1}^{T}\Big(\log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - \Big(\frac{Y_t}{\theta}\Big)^{\alpha}\Big).$$

The derivatives of the log-likelihood are
$$\frac{\partial\mathcal{L}_T}{\partial\theta} = -\frac{T\alpha}{\theta} + \frac{\alpha}{\theta^{\alpha+1}}\sum_{t=1}^{T}Y_t^{\alpha} = 0,$$
$$\frac{\partial\mathcal{L}_T}{\partial\alpha} = \frac{T}{\alpha} + \sum_{t=1}^{T}\big(\log Y_t - \log\theta\big) - \sum_{t=1}^{T}\Big(\frac{Y_t}{\theta}\Big)^{\alpha}\log\Big(\frac{Y_t}{\theta}\Big) = 0.$$
It is clear that an explicit expression for the solution of the above does not exist, and we need to find alternative methods for finding a solution. Below we shall describe numerical routines which can be used in the maximisation. In special cases one can use other methods, such as the profile likelihood (we cover this later on).

Numerical routines

In an ideal world, to maximise a likelihood we would consider the derivative of the likelihood, solve $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta=\hat\theta_T}=0$, and an explicit expression would exist for this solution. In reality this rarely happens, as we illustrated in the section above. Usually we will be unable to obtain an explicit expression for the MLE. In such cases one has to do the maximisation using alternative, numerical methods. Typically it is relatively straightforward to maximise the likelihood of random variables which belong to the exponential family (numerical algorithms sometimes have to be used, but they tend to be fast and attain the maximum of the likelihood, not just a local maximum). However, the story becomes more complicated even if we consider mixtures of exponential family distributions: these do not belong to the exponential family, and can be difficult to maximise using conventional numerical routines. We give an example of such a distribution here.

Let us suppose that $\{X_t\}$ are iid random variables which follow the classical normal mixture distribution
$$f(y;\theta) = p\,f_1(y;\theta_1) + (1-p)\,f_2(y;\theta_2),$$
where $f_1$ is the density of the normal with mean $\mu_1$ and variance $\sigma_1^2$ and $f_2$ is the density of the normal with mean $\mu_2$ and variance $\sigma_2^2$. The log-likelihood is
$$\mathcal{L}_T(Y;\theta) = \sum_{t=1}^{T}\log\Bigg(\frac{p}{\sqrt{2\pi\sigma_1^2}}\exp\Big(-\frac{(X_t-\mu_1)^2}{2\sigma_1^2}\Big) + \frac{1-p}{\sqrt{2\pi\sigma_2^2}}\exp\Big(-\frac{(X_t-\mu_2)^2}{2\sigma_2^2}\Big)\Bigg).$$
Studying the above, it is clear that no explicit solution for the maximum exists. Hence one needs to use a numerical algorithm to maximise the above likelihood. We discuss a few such methods below.
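As a small illustration (my own sketch, not part of the notes), the two-component mixture log-likelihood above can be handed to a general-purpose optimiser. The simulated data, parameterisation and starting values below are illustrative choices; the point is only that several starting values help guard against a poor local maximum.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
# Illustrative data from a two-component normal mixture.
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(3.0, 0.5, 200)])

def neg_loglik(par):
    # par = (logit p, mu1, mu2, log sigma1, log sigma2); the transforms keep the
    # optimiser unconstrained while p stays in (0,1) and the sigmas stay positive.
    p = 1.0 / (1.0 + np.exp(-par[0]))
    mu1, mu2 = par[1], par[2]
    s1, s2 = np.exp(par[3]), np.exp(par[4])
    dens = p * norm.pdf(x, mu1, s1) + (1.0 - p) * norm.pdf(x, mu2, s2)
    return -np.sum(np.log(dens))

# Run from several starting values and keep the fit with the largest likelihood.
starts = [np.array([0.0, -1.0, 1.0, 0.0, 0.0]),
          np.array([0.0, 0.0, 3.0, 0.0, -1.0])]
fits = [minimize(neg_loglik, s, method="Nelder-Mead") for s in starts]
best = min(fits, key=lambda r: r.fun)
print(best.x, -best.fun)
```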

The Newton-Raphson routine

The Newton-Raphson routine is the standard method to numerically maximise the likelihood; this can often be done automatically in R by using the R functions optim or nlm. To apply Newton-Raphson we have to assume that the derivative of the likelihood exists (this is not always the case — think about $\ell_1$-norm based estimators!) and that the maximum lies inside the parameter space, such that $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta=\hat\theta_T}=0$. We choose an initial value $\theta_1$ and apply the routine
$$\theta_{n+1} = \theta_n - \Big(\frac{\partial^2\mathcal{L}_T(\theta)}{\partial\theta^2}\Big|_{\theta_n}\Big)^{-1}\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\Big|_{\theta_n}.$$
Where this routine comes from will be clear by using the Taylor expansion of $\mathcal{L}_T(\theta)$ about $\theta_0$ (see Section 4.1.3). If the likelihood has just one global maximum and no local maxima (for example when it is concave), then it is quite easy to maximise. If, on the other hand, the likelihood has a few local maxima and the initial value $\theta_1$ is not chosen close enough to the true maximum, then the routine may converge to a local maximum (not good!). In this case it may be a good idea to run the routine several times from several different initial values $\{\theta_1^{(i)}\}_i$. For each converged value $\hat\theta^{(i)}$ evaluate the likelihood $\mathcal{L}_T(\hat\theta^{(i)})$ and select the value which gives the largest likelihood. It is best to avoid these problems by starting with an informed choice of initial value. Implementing a Newton-Raphson routine without any thought can lead to estimators which take an incredibly long time to converge. If one carefully considers the likelihood, one can often shorten the convergence time by rewriting the likelihood and using faster methods (often based on the Newton-Raphson).

Iterative least squares

This is a method that we shall describe later, when we consider Generalised Linear Models. As the name suggests, the algorithm has to be iterated; at each step weighted least squares is implemented (see later in the course).

The EM-algorithm

This works by the introduction of dummy variables, which leads to a new unobserved likelihood which can easily be maximised. In fact, one of the simplest methods of maximising the likelihood of mixture distributions is to use the EM-algorithm. We cover this later in the course. See Example 4.3 in Davison (2002).

The likelihood for dependent data

We mention that the likelihood for dependent data can also be constructed (though often the estimation and the asymptotic properties can be a lot harder to derive). Using Bayes rule, i.e.

$$P(A_1,A_2,\ldots,A_T) = P(A_1)\prod_{i=2}^{T}P(A_i\mid A_{i-1},\ldots,A_1),$$
we have
$$\mathcal{L}_T(X;\theta) = f(X_1;\theta)\prod_{t=2}^{T}f(X_t\mid X_{t-1},\ldots,X_1;\theta).$$
Under certain conditions on $\{X_t\}$ the structure of $\prod_{t=2}^{T}f(X_t\mid X_{t-1},\ldots,X_1;\theta)$ can be simplified. For example, if $X_t$ were Markovian then $X_t$ conditioned on the past only depends on the most recent past observation, i.e. $f(X_t\mid X_{t-1},\ldots,X_1;\theta) = f(X_t\mid X_{t-1};\theta)$; in this case the above likelihood reduces to
$$\mathcal{L}_T(X;\theta) = f(X_1;\theta)\prod_{t=2}^{T}f(X_t\mid X_{t-1};\theta).\qquad(4.1)$$

Example 4.1.4 A lot of the material we will cover in this class will be for independent observations. However, likelihood methods can also work for dependent observations too. Consider the AR(1) time series
$$X_t = aX_{t-1} + \varepsilon_t,$$
where $\{\varepsilon_t\}$ are iid random variables with mean zero. We will assume that $|a|<1$. We see from the above that the observation $X_{t-1}$ has a linear influence on the next observation, and it is Markovian: given $X_{t-1}$, the random variable $X_{t-2}$ has no influence on $X_t$ (to see this consider the distribution function $P(X_t\le x\mid X_{t-1},X_{t-2})$). Therefore, by using (4.1) the likelihood of $\{X_t\}_t$ is
$$\mathcal{L}_T(X;a) = f(X_1;a)\prod_{t=2}^{T}f_\varepsilon(X_t - aX_{t-1}),\qquad(4.2)$$
where $f_\varepsilon$ is the density of $\varepsilon$ and $f(X_1;a)$ is the marginal density of $X_1$. This means the likelihood of $\{X_t\}$ only depends on $f_\varepsilon$ and the marginal density of $X_1$. We use $\hat a_T = \arg\max_a\mathcal{L}_T(X;a)$ as the MLE of $a$.

Often we ignore the term $f(X_1;a)$, because this is often hard to know (try and figure it out — it is relatively easy in the Gaussian case), and consider what is called the conditional likelihood
$$Q_T(X;a) = \prod_{t=2}^{T}f_\varepsilon(X_t - aX_{t-1}).\qquad(4.3)$$
We use $\tilde a_T = \arg\max_a Q_T(X;a)$ as the quasi-MLE estimator of $a$.

Exercise: What is the conditional likelihood proportional to in the case that $\{\varepsilon_t\}$ are Gaussian random variables with mean zero?

It should be mentioned that often the conditional likelihood is derived as if the errors $\{\varepsilon_t\}$ were Gaussian, even if they are not. This is often called the quasi- or pseudo-likelihood.
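As an illustration (my own simulation, not from the notes), the sketch below computes the Gaussian conditional (quasi-)likelihood estimator of $a$ for a simulated AR(1) series; with Gaussian $f_\varepsilon$ the maximiser of $Q_T$ has the least-squares form used for the cross-check.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
T, a_true = 500, 0.6
x = np.zeros(T)
for t in range(1, T):                       # simulate X_t = a X_{t-1} + eps_t
    x[t] = a_true * x[t - 1] + rng.normal()

def neg_cond_loglik(a):
    # minus log Q_T(X; a) with standard normal errors (up to constants)
    resid = x[1:] - a * x[:-1]
    return 0.5 * np.sum(resid ** 2)

num = minimize_scalar(neg_cond_loglik, bounds=(-0.99, 0.99), method="bounded")
closed_form = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)   # least-squares solution
print(num.x, closed_form)                   # the two estimates agree
```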

4.1.2 A quick review of the central limit theorem

In this section we will not endeavour to prove the central limit theorem, which is usually based on showing that the characteristic function (a close cousin of the moment generating function) of the average converges to the characteristic function of the normal distribution. However, we will recall the general statement of the CLT and generalisations of it. The purpose of this section is not to lumber you with unnecessary mathematics but to help you understand when an estimator is close to normal or not.

Lemma 4.1.1 (The famous CLT) Let us suppose that $\{X_t\}$ are iid random variables; let $\mu = E(X_t)<\infty$ and $\sigma^2 = \mathrm{var}(X_t)<\infty$. Define $\bar X = \frac{1}{T}\sum_{t=1}^{T}X_t$. Then we have
$$\sqrt T(\bar X - \mu)\xrightarrow{D}\mathcal{N}(0,\sigma^2),\qquad\text{alternatively}\qquad\frac{\sqrt T(\bar X-\mu)}{\sigma}\xrightarrow{D}\mathcal{N}(0,1).$$
What this means is that if we have a large enough sample size and plotted the histogram of several replications of the average, this should be close to normal.

Remark 4.1.1 (i) The above lemma appears to be restricted to just averages. However, it can be used in several different contexts. Averages arise in several different situations; it is not just restricted to the average of the observations. By judicious algebraic manipulations, one can show that several estimators can be rewritten as an average, or approximately as an average. At first appearance the MLE of the Weibull parameters given in Example 4.1.3 does not look like an average; however, in the sections below we consider general maximum likelihood estimators and show that they can be rewritten as an average, hence the CLT applies to them too.

(ii) The CLT can be extended in several ways: (a) to random variables whose variances are not all the same (i.e. independent but not identically distributed random variables); (b) to dependent random variables (so long as the dependency decays in some way); (c) to not just averages but weighted averages too, so long as the weights behave in a certain way — the weights should be distributed well over all the random variables. That is, suppose that $\{X_t\}$ are iid random variables. Then it is clear that $\frac{1}{20}\sum_{t=1}^{20}X_t$ will never be normal unless $\{X_t\}$ is normal (observe 20 is fixed!), but it seems plausible that $\frac{1}{n}\sum_{t=1}^{n}\sin(\pi t/n)X_t$ is close to normal despite this not being the sum of iid random variables.
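A quick way to see the CLT "in action" (a simulation of my own, not from the notes) is to replicate the standardised sample mean many times and compare its moments and tail behaviour with the N(0,1) limit; here the data are Exp(1), an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)
T, reps = 200, 5000
mu, sigma = 1.0, 1.0                        # mean and sd of an Exp(1) variable

# reps replications of sqrt(T) * (Xbar - mu) / sigma for iid Exp(1) data
samples = rng.exponential(1.0, size=(reps, T))
z = np.sqrt(T) * (samples.mean(axis=1) - mu) / sigma

# Mean, variance and upper tail probability should be close to 0, 1 and 0.025
print(z.mean(), z.var(), np.mean(z > 1.96))
```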

There exist several theorems which one can use to prove normality. But really the take-home message is: look at your estimator and see whether asymptotic normality looks plausible — you could even check it through simulations.

Example 4.1.5 (Some problem cases) One should think a little before blindly applying the CLT. Suppose that the iid random variables $\{X_t\}$ follow a t-distribution with 2 degrees of freedom, i.e. the density function is
$$f(x) = \frac{\Gamma(3/2)}{\sqrt{2\pi}}\Big(1+\frac{x^2}{2}\Big)^{-3/2}.$$
Let $\bar X = \frac{1}{n}\sum_{t=1}^{n}X_t$ denote the sample mean. It is well known that the mean of the t-distribution with two degrees of freedom exists, but the variance does not (it is too thick-tailed). Thus the assumptions required for the CLT to hold are violated, and the (appropriately normalised) sample mean is not asymptotically normally distributed (in fact it follows a stable law distribution). Intuitively this is clear: recall that the chance of outliers for a t-distribution with a small number of degrees of freedom is large. This makes it impossible for averages to be well behaved; there is a large chance that an average could also be too large or too small.

To see why the variance is infinite, study the form of the t-distribution with two degrees of freedom. For the variance to be finite, the tails of the distribution should converge to zero fast enough; in other words the probability of outliers should not be too large. Observe that the tails of the t-distribution for large $x$ behave like $f(x)\approx Cx^{-3}$ (make a plot in Maple to check); thus the second moment
$$E(X^2)\geq \int_M^{\infty}Cx^{-3}\,x^2\,dx = \int_M^{\infty}Cx^{-1}\,dx,$$
for some $C$ and $M$, is clearly not finite! (This argument can be made precise.)

4.1.3 The Taylor series expansion — the statistician's tool

The Taylor series is used all over the place in statistics and you should be completely fluent with using it. It can be used to prove consistency of an estimator, normality (based on the assumption that averages converge to a normal distribution), to obtain the limiting variance of an estimator, etc. We start by demonstrating its use for the log-likelihood.

We recall that the mean value theorem in the univariate case states that
$$f(x) = f(x_0) + (x-x_0)f'(\bar x)\qquad\text{and}\qquad f(x) = f(x_0) + (x-x_0)f'(x_0) + \frac{1}{2}(x-x_0)^2 f''(\tilde x),$$
where $\bar x$ and $\tilde x$ both lie between $x$ and $x_0$. In the case that $f$ is a multivariate function, we have
$$f(x) = f(x_0) + (x-x_0)'\,\nabla f(x)\big|_{x=\bar x}\qquad\text{and}\qquad f(x) = f(x_0) + (x-x_0)'\,\nabla f(x)\big|_{x=x_0} + \frac{1}{2}(x-x_0)'\,\nabla^2 f(x)\big|_{x=\tilde x}\,(x-x_0),$$

where $\bar x$ and $\tilde x$ both lie between $x$ and $x_0$. In the case that $f(x)$ is vector-valued, the mean value theorem does not directly work; strictly speaking we cannot say that $f(x) = f(x_0) + (x-x_0)'\nabla f(x)|_{x=\bar x}$, where $\bar x$ lies between $x$ and $x_0$. However, it is quite straightforward to overcome this inconvenience. The mean value theorem does hold pointwise, for every element of the vector $f(x) = (f_1(x),\ldots,f_d(x))$, i.e. for every $1\le i\le d$ we have
$$f_i(x) = f_i(x_0) + (x-x_0)'\,\nabla f_i(x)\big|_{x=\bar x_i},$$
where $\bar x_i$ lies between $x$ and $x_0$. Thus if $\nabla f_i(x)|_{x=\bar x_i}\approx\nabla f_i(x)|_{x=x_0}$, we do have that $f(x)\approx f(x_0) + (x-x_0)'\nabla f(x)|_{x=x_0}$. We use the above below.

Application 1: An expression for $\mathcal{L}_T(\hat\theta_T)-\mathcal{L}_T(\theta_0)$ in terms of $(\hat\theta_T-\theta_0)$.

The expansion of $\mathcal{L}_T(\hat\theta_T)$ about $\theta_0$ (the true parameter) gives
$$\mathcal{L}_T(\theta_0) - \mathcal{L}_T(\hat\theta_T) = \frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\Big|_{\hat\theta_T}(\theta_0-\hat\theta_T) + \frac{1}{2}(\theta_0-\hat\theta_T)'\,\frac{\partial^2\mathcal{L}_T(\theta)}{\partial\theta^2}\Big|_{\bar\theta}\,(\theta_0-\hat\theta_T),$$
where $\bar\theta$ lies between $\theta_0$ and $\hat\theta_T$. If $\hat\theta_T$ lies in the interior of the parameter space (this is an extremely important assumption here), then $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\big|_{\hat\theta_T}=0$. Moreover, if it can be shown that $\hat\theta_T\xrightarrow{P}\theta_0$ (we show this in the section below), then under certain conditions on $\mathcal{L}_T(\theta)$ (such as the existence of the third derivative, etc.) it can be shown that
$$-\frac{1}{T}\frac{\partial^2\mathcal{L}_T(\theta)}{\partial\theta^2}\Big|_{\bar\theta}\xrightarrow{P}E\Big(-\frac{1}{T}\frac{\partial^2\mathcal{L}_T(\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big) = I(\theta_0).$$
Hence the above is roughly
$$2\big(\mathcal{L}_T(\hat\theta_T) - \mathcal{L}_T(\theta_0)\big)\approx T\,(\hat\theta_T-\theta_0)'\,I(\theta_0)\,(\hat\theta_T-\theta_0).$$
Note that in many of the derivations below we will use that $-\frac{1}{T}\frac{\partial^2\mathcal{L}_T(\theta)}{\partial\theta^2}\big|_{\bar\theta}\xrightarrow{P}I(\theta_0)$. But it should be noted that this is only true if (i) $\hat\theta_T\xrightarrow{P}\theta_0$ and (ii) $\frac{1}{T}\frac{\partial^2\mathcal{L}_T(\theta)}{\partial\theta^2}$ converges uniformly to its expectation. We consider below another closely related application.
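A small numerical check of this quadratic approximation (my own sketch, not from the notes), using the Weibull example with $\alpha$ known; a quick calculation, not done in the notes, gives the per-observation information $I(\theta)=\alpha^2/\theta^2$ in this case.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, theta0, T = 2.0, 1.5, 1000
y = theta0 * rng.weibull(alpha, size=T)      # Weibull(shape alpha, scale theta0)

def loglik(theta):
    # log-likelihood up to constants not involving theta
    return np.sum(-alpha * np.log(theta) - (y / theta) ** alpha)

theta_hat = np.mean(y ** alpha) ** (1.0 / alpha)   # explicit MLE from Example 4.1.2
lhs = 2 * (loglik(theta_hat) - loglik(theta0))
rhs = T * (theta_hat - theta0) ** 2 * alpha ** 2 / theta0 ** 2  # T (th-th0)^2 I(th0)
print(lhs, rhs)     # the two quantities should be close for large T
```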

Application 2: An expression for $(\hat\theta_T-\theta_0)$ in terms of $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta_0}$.

The expansion of the $p$-dimensional vector $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\big|_{\hat\theta_T}$ pointwise about $\theta_0$ (the true parameter) gives, for each $1\le i\le p$,
$$\frac{\partial\mathcal{L}_{T,i}(\theta)}{\partial\theta}\Big|_{\hat\theta_T} = \frac{\partial\mathcal{L}_{T,i}(\theta)}{\partial\theta}\Big|_{\theta_0} + \frac{\partial^2\mathcal{L}_{T,i}(\theta)}{\partial\theta^2}\Big|_{\bar\theta_i}\,(\hat\theta_T-\theta_0).$$
Now, by using the same argument as in Application 1 we have
$$\frac{1}{T}\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\Big|_{\theta_0}\approx I(\theta_0)\,(\hat\theta_T-\theta_0).$$
We mention that $U_T(\theta_0) = \frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\big|_{\theta_0}$ is often called the score or U statistic. And we see that the asymptotic sampling properties of $U_T$ determine the sampling properties of $(\hat\theta_T-\theta_0)$.

Example 4.1.6 (The Weibull) Evaluate the second derivative of the likelihood given in Example 4.1.3 and take the expectation of this, $I(\theta,\alpha) = -E(\nabla^2\mathcal{L})$ (we use $\nabla^2$ to denote the matrix of second derivatives with respect to the parameters $\alpha$ and $\theta$). Exercise: Evaluate $I(\theta,\alpha)$.

Application 2 implies that the maximum likelihood estimators $\hat\theta_T$ and $\hat\alpha_T$ (recalling that no explicit expression for them exists) can be written as
$$\begin{pmatrix}\hat\theta_T-\theta\\ \hat\alpha_T-\alpha\end{pmatrix}\approx\frac{1}{T}\,I(\theta,\alpha)^{-1}\sum_{t=1}^{T}\begin{pmatrix}-\frac{\alpha}{\theta}+\frac{\alpha}{\theta^{\alpha+1}}Y_t^{\alpha}\\[4pt]\frac{1}{\alpha}+\log Y_t-\log\theta-\big(\frac{Y_t}{\theta}\big)^{\alpha}\log\big(\frac{Y_t}{\theta}\big)\end{pmatrix}.$$

4.1.4 Sampling properties of the maximum likelihood estimator

See also Section 4.4, Davison (2002). These proofs will not be examined, but you should have some idea why Theorem 4.1.2 is true.

We have shown that under certain conditions the maximum likelihood estimator can often be the minimum variance unbiased estimator (for example, in the case of the exponential family of distributions). However, for finite samples the MLE may not attain the Cramer-Rao lower bound; hence for finite samples $\mathrm{var}(\hat\theta_T) > I_T(\theta)^{-1}$. However, it can be shown that asymptotically the variance of the MLE attains the Cramer-Rao lower bound. In other words, for large samples the variance of the MLE is close to the Cramer-Rao bound. We will prove the result in the case that $\mathcal{L}_T$ is the log-likelihood of independent, identically distributed random variables. The proof can be generalised to the case of non-identically distributed random variables. We first state sufficient conditions for this to be true.

Assumption 4.1.1 (Regularity Conditions) Let $\{X_t\}$ be iid random variables with density $f(x;\theta)$.

(i) Suppose the conditions of the regularity assumption stated earlier in the notes hold.

(ii) Almost sure uniform convergence (this is optional): 
$$\sup_{\theta\in\Theta}\Big|\frac{1}{T}\mathcal{L}_T(X;\theta) - E\Big(\frac{1}{T}\mathcal{L}_T(\theta)\Big)\Big|\xrightarrow{a.s.}0.$$
We mention that directly verifying uniform convergence can be difficult. However, it can be established by showing that the parameter space is compact, together with pointwise convergence of the likelihood to its expectation and (stochastic) equicontinuity.

(iii) Model identifiability: for every $\theta\in\Theta$ there does not exist another $\theta'\in\Theta$ such that $f(x;\theta)=f(x;\theta')$ for all $x$.

(iv) The parameter space $\Theta$ is finite-dimensional and compact.

(v) $\sup_{\theta\in\Theta}E\big|\frac{1}{T}\mathcal{L}_T(X;\theta)\big|<\infty$.

We require Assumption 4.1.1(ii),(iii) to show consistency, and Assumption 4.1.1(iii)-(v) (together with the earlier regularity conditions) to show asymptotic normality.

Theorem 4.1.1 Suppose Assumption 4.1.1(ii),(iii) holds. Let $\theta_0$ be the true parameter and $\hat\theta_T$ be the MLE. Then we have $\hat\theta_T\xrightarrow{a.s.}\theta_0$ (consistency).

PROOF. To prove the result we first need to show that the expectation of the log-likelihood is maximised at the true parameter and that this maximum is unique. In other words, we need to show that $E(\frac{1}{T}\mathcal{L}_T(X;\theta)) - E(\frac{1}{T}\mathcal{L}_T(X;\theta_0))\le 0$ for all $\theta\in\Theta$. To do this, we have
$$E\Big(\frac{1}{T}\mathcal{L}_T(X;\theta)\Big) - E\Big(\frac{1}{T}\mathcal{L}_T(X;\theta_0)\Big) = \int\log\frac{f(x;\theta)}{f(x;\theta_0)}\,f(x;\theta_0)\,dx = E\Big(\log\frac{f(X;\theta)}{f(X;\theta_0)}\Big).$$
Now by using Jensen's inequality we have
$$E\Big(\log\frac{f(X;\theta)}{f(X;\theta_0)}\Big)\le\log E\Big(\frac{f(X;\theta)}{f(X;\theta_0)}\Big) = \log\int f(x;\theta)\,dx = 0,$$
thus giving $E(\frac{1}{T}\mathcal{L}_T(X;\theta)) - E(\frac{1}{T}\mathcal{L}_T(X;\theta_0))\le 0$. To prove that $E(\frac{1}{T}\mathcal{L}_T(X;\theta)) - E(\frac{1}{T}\mathcal{L}_T(X;\theta_0)) = 0$ only when $\theta=\theta_0$, we note the identifiability assumption (Assumption 4.1.1(iii)), which means that $f(x;\theta)=f(x;\theta_0)$ for all $x$ only when $\theta=\theta_0$, and no other function gives equality.

Hence $E(\frac{1}{T}\mathcal{L}_T(X;\theta))$ is uniquely maximised at $\theta_0$.

Finally, we need to show that $\hat\theta_T\xrightarrow{a.s.}\theta_0$. By Assumption 4.1.1(ii) (and also the LLN) we have, for all $\theta\in\Theta$, that $\frac{1}{T}\mathcal{L}_T(X;\theta)\xrightarrow{a.s.}\ell(\theta):=E(\frac{1}{T}\mathcal{L}_T(X;\theta))$. To bound $E(\frac{1}{T}\mathcal{L}_T(X;\theta_0)) - E(\frac{1}{T}\mathcal{L}_T(X;\hat\theta_T))$ we note the decomposition
$$E\big(\tfrac{1}{T}\mathcal{L}_T(X;\theta_0)\big) - E\big(\tfrac{1}{T}\mathcal{L}_T(X;\hat\theta_T)\big) = \big\{E\big(\tfrac{1}{T}\mathcal{L}_T(X;\theta_0)\big) - \tfrac{1}{T}\mathcal{L}_T(X;\theta_0)\big\} + \big\{\tfrac{1}{T}\mathcal{L}_T(X;\hat\theta_T) - E\big(\tfrac{1}{T}\mathcal{L}_T(X;\hat\theta_T)\big)\big\} + \big\{\tfrac{1}{T}\mathcal{L}_T(X;\theta_0) - \tfrac{1}{T}\mathcal{L}_T(X;\hat\theta_T)\big\},$$
where the last term is non-positive since $\hat\theta_T$ maximises the likelihood. Bounding each term by the uniform deviation, and using Assumption 4.1.1(ii), we have
$$0\le E\big(\tfrac{1}{T}\mathcal{L}_T(X;\theta_0)\big) - E\big(\tfrac{1}{T}\mathcal{L}_T(X;\hat\theta_T)\big)\le 3\sup_{\theta\in\Theta}\Big|\tfrac{1}{T}\mathcal{L}_T(X;\theta) - E\big(\tfrac{1}{T}\mathcal{L}_T(X;\theta)\big)\Big|\xrightarrow{a.s.}0.$$
Since $E(\frac{1}{T}\mathcal{L}_T(\theta))$ has a unique maximum at $\theta_0$, this implies $\hat\theta_T\xrightarrow{a.s.}\theta_0$. Hence we have shown consistency of the MLE. We now need to show asymptotic normality.

Theorem 4.1.2 Suppose Assumption 4.1.1 is satisfied.

(i) Then the score statistic satisfies
$$\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta}\Big|_{\theta_0}\xrightarrow{D}\mathcal{N}\Big(0,\ E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)\Big).\qquad(4.5)$$

(ii) Then the MLE satisfies
$$\sqrt T(\hat\theta_T-\theta_0)\xrightarrow{D}\mathcal{N}\Big(0,\ E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)^{-1}\Big).$$

(iii) The log-likelihood ratio satisfies
$$2\big(\mathcal{L}_T(X;\hat\theta_T) - \mathcal{L}_T(X;\theta_0)\big)\xrightarrow{D}\chi^2_p.$$

PROOF. First we will prove (i). We recall that because $\{X_t\}$ are iid random variables,
$$\frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta}\Big|_{\theta_0} = \sum_{t=1}^{T}\frac{\partial\log f(X_t;\theta)}{\partial\theta}\Big|_{\theta_0}.$$
Hence $\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta}\big|_{\theta_0}$ is the (scaled) sum of iid random variables with mean zero and variance $\mathrm{var}\big(\frac{\partial\log f(X_t;\theta)}{\partial\theta}\big|_{\theta_0}\big)$. Therefore, by the CLT for iid random variables we have (4.5).

We use (i) and the Taylor (mean value) theorem to prove (ii). We first note that by the mean value theorem we have
$$\frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta}\Big|_{\hat\theta_T} = \frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta}\Big|_{\theta_0} + (\hat\theta_T-\theta_0)\frac{\partial^2\mathcal{L}_T(X;\theta)}{\partial\theta^2}\Big|_{\bar\theta}.\qquad(4.6)$$
Now it can be shown (because $\Theta$ has a compact support, $\hat\theta_T\xrightarrow{a.s.}\theta_0$ and the expectation of the third derivative of $\mathcal{L}_T$ is bounded) that
$$\frac{1}{T}\frac{\partial^2\mathcal{L}_T(X;\theta)}{\partial\theta^2}\Big|_{\bar\theta}\xrightarrow{P}E\Big(\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big).\qquad(4.7)$$
Substituting (4.7) into (4.6), and recalling that the left-hand side of (4.6) is zero, gives
$$\sqrt T(\hat\theta_T-\theta_0) = E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)^{-1}\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta}\Big|_{\theta_0} + o_p(1).$$
We mention that the proof above is for univariate $\frac{\partial\mathcal{L}_T(X;\theta)}{\partial\theta}$, but by redoing the above steps pointwise it can easily be generalised to the multivariate case too. Hence, by substituting (4.5) into the above we have (ii). It is straightforward to prove (iii) by using
$$2\big(\mathcal{L}_T(X;\hat\theta_T) - \mathcal{L}_T(X;\theta_0)\big)\approx T(\hat\theta_T-\theta_0)'I(\theta_0)(\hat\theta_T-\theta_0),$$
(i), and the result that if $X\sim\mathcal{N}(0,\Sigma)$ then $AX\sim\mathcal{N}(0,A\Sigma A')$. □

Example 4.1.7 (The Weibull) By using Example 4.1.6 we have
$$\begin{pmatrix}\hat\theta_T-\theta\\ \hat\alpha_T-\alpha\end{pmatrix}\approx\frac{1}{T}\,I(\theta,\alpha)^{-1}\sum_{t=1}^{T}\begin{pmatrix}-\frac{\alpha}{\theta}+\frac{\alpha}{\theta^{\alpha+1}}Y_t^{\alpha}\\[4pt]\frac{1}{\alpha}+\log Y_t-\log\theta-\big(\frac{Y_t}{\theta}\big)^{\alpha}\log\big(\frac{Y_t}{\theta}\big)\end{pmatrix}.$$

Now we observe that the RHS consists of a sum of iid random variables (this can be viewed as an average). Since the variance of this exists (you can show that it is $I(\theta,\alpha)$), the CLT can be applied and we have that
$$\sqrt T\begin{pmatrix}\hat\theta_T-\theta\\ \hat\alpha_T-\alpha\end{pmatrix}\xrightarrow{D}\mathcal{N}\big(0,\ I(\theta,\alpha)^{-1}\big).$$

Remark 4.1.2 (i) We recall that for iid random variables the Fisher information for sample size $T$ is
$$I_T(\theta) = E\Big(\Big(\frac{\partial\log\mathcal{L}_T(X;\theta)}{\partial\theta}\Big|_{\theta_0}\Big)^2\Big) = T\,E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big).$$
Hence, comparing with the above theorem, we see that for iid random variables (so long as the regularity conditions are satisfied) the MLE asymptotically attains the Cramer-Rao bound, even if for finite samples this may not be true. Moreover, since
$$\sqrt T(\hat\theta_T-\theta_0)\approx I(\theta_0)^{-1}\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\Big|_{\theta_0},\qquad\mathrm{var}\Big(\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\Big|_{\theta_0}\Big) = I(\theta_0),$$
it can be seen that $(\hat\theta_T-\theta_0) = O_p(T^{-1/2})$.

(ii) Under suitable conditions a similar result holds true for data which is not iid.

In summary, the MLE under certain regularity conditions tends to have the smallest variance, and for large samples the variance is close to the lower bound, which is the Cramer-Rao bound. In the case that Assumption 4.1.1 is satisfied, the MLE is said to be asymptotically efficient. This means that for finite samples the MLE may not attain the C-R bound, but asymptotically it will.

(iii) A simple application of Theorem 4.1.2 is the derivation of the distribution of $\sqrt T\,I(\theta_0)^{1/2}(\hat\theta_T-\theta_0)$. It is clear that by using Theorem 4.1.2 we have
$$\sqrt T\,I(\theta_0)^{1/2}(\hat\theta_T-\theta_0)\xrightarrow{D}\mathcal{N}(0,I_p),$$
where $I_p$ is the identity matrix, and
$$T(\hat\theta_T-\theta_0)'\,I(\theta_0)\,(\hat\theta_T-\theta_0)\xrightarrow{D}\chi^2_p.$$

(iv) Note that these results apply when $\theta_0$ lies inside the parameter space $\Theta$. As $\theta_0$ gets closer to the boundary of the parameter space, these asymptotic results may no longer hold.

Remark 4.1.3 (Generalised estimating equations) Closely related to the MLE are generalised estimating equations (GEE), which are related to the score statistic. These are estimators not based on maximising the likelihood, but on equating the score statistic (the derivative of the likelihood) to zero and solving for the unknown parameters. Often they are equivalent to the MLE, but they can be adapted to be useful in themselves, and some adaptations will not be the derivative of a likelihood.

The Fisher information

See also Section 4.3, Davison (2002). Let us return to the Fisher information. We recall that under certain regularity conditions an unbiased estimator $\tilde\theta(X)$ of a parameter $\theta_0$ is such that
$$\mathrm{var}\big(\tilde\theta(X)\big)\ge I_T(\theta_0)^{-1},\qquad\text{where}\qquad I_T(\theta) = E\Big(\Big(\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\Big)^2\Big) = -E\Big(\frac{\partial^2\mathcal{L}_T(\theta)}{\partial\theta^2}\Big)$$
is the Fisher information. Furthermore, under suitable regularity conditions, the MLE will asymptotically attain this bound. It is reasonable to ask how one can interpret this bound.

(i) Situation 1. $I(\theta_0) = E\big(-\frac{\partial^2\mathcal{L}_T(\theta)}{\partial\theta^2}\big|_{\theta_0}\big)$ is large (hence the variance of the MLE will be small). Then it means that the gradient of $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}$ is steep. Hence even for small deviations from $\theta_0$, $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}$ is likely to be far from zero. This means the MLE $\hat\theta_T$ is likely to be in a close neighbourhood of $\theta_0$.

(ii) Situation 2. $I(\theta_0)$ is small (hence the variance of the MLE will be large). In this case the gradient of the likelihood $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}$ is flatter, and hence $\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\approx 0$ over a large neighbourhood about the true parameter $\theta_0$. Therefore the MLE $\hat\theta_T$ can lie in a large neighbourhood of $\theta_0$.

This is one explanation as to why $I(\theta)$ is called the Fisher information: it contains information on how close any estimator of $\theta_0$ can be. Look at the censoring example in Davison (2002).
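The relation var(θ̂) ≈ 1/(T·I(θ)) is easy to check by simulation. The sketch below (my own, not from the notes) uses the Weibull with α known, for which a direct calculation, not done in the notes, gives I(θ) = α²/θ²; the parameter values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, theta0, T, reps = 2.0, 1.5, 400, 2000

# Monte Carlo: replicate the explicit MLE theta_hat = (mean(Y^alpha))^(1/alpha)
estimates = np.empty(reps)
for r in range(reps):
    y = theta0 * rng.weibull(alpha, size=T)
    estimates[r] = np.mean(y ** alpha) ** (1.0 / alpha)

empirical_var = estimates.var()
asymptotic_var = theta0 ** 2 / (alpha ** 2 * T)    # 1 / (T * I(theta0))
print(empirical_var, asymptotic_var)                # the two should be close
```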

Chapter 5
Confidence Intervals

5.1 Confidence intervals and testing

We first summarise the results in the previous section which will be useful in this section. For convenience, we will assume that the likelihood is for iid random variables whose density is $f(x;\theta_0)$ (though it is relatively simple to see how this can be generalised to general likelihoods of not necessarily iid random variables). Let us suppose that $\theta_0$ is the true parameter that we wish to estimate. Based on Theorem 4.1.2 we have
$$\sqrt T(\hat\theta_T-\theta_0)\xrightarrow{D}\mathcal{N}\Big(0,\ E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)^{-1}\Big),\qquad(5.1)$$
$$\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\Big|_{\theta=\theta_0}\xrightarrow{D}\mathcal{N}\Big(0,\ E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)\Big)\qquad(5.2)$$
and
$$2\big(\mathcal{L}_T(\hat\theta_T) - \mathcal{L}_T(\theta_0)\big)\xrightarrow{D}\chi^2_p,\qquad(5.3)$$
where $p$ is the number of parameters in the vector $\theta$. Using any of (5.1), (5.2) and (5.3) we can construct a 95% CI for $\theta_0$.

5.1.1 Constructing confidence intervals using the likelihood

See also Section 4.5, Davison (2002). One of the main reasons that we show asymptotic normality of an estimator (it is usually not possible to derive normality for finite samples) is to construct confidence intervals (CIs) and to test.

In the case that $\theta_0$ is a scalar (a vector of dimension one), it is easy to use (5.1) to obtain
$$\sqrt T\,E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)^{1/2}(\hat\theta_T-\theta_0)\xrightarrow{D}\mathcal{N}(0,1).\qquad(5.4)$$
Based on the above, the 95% CI for $\theta_0$ is
$$\Big[\hat\theta_T - \frac{1}{\sqrt T}E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)^{-1/2}z_{\alpha/2},\ \ \hat\theta_T + \frac{1}{\sqrt T}E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)^{-1/2}z_{\alpha/2}\Big].$$
The above, of course, requires an estimate of the standardised Fisher information $E\big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\big|_{\theta_0}\big)$. Usually we evaluate the second derivative of the log-likelihood $\mathcal{L}_T(\theta)$ and replace $\theta$ with its estimator $\hat\theta_T$.

Exercise: Use (5.2) to construct a CI for $\theta_0$ based on the score.

The CI constructed above works well if $\theta$ is a scalar. But beyond dimension one, constructing a CI based on (5.1) and the $p$-dimensional normal is extremely difficult. More precisely, if $\theta_0$ is a $p$-dimensional vector then the analogous version of (5.4) is
$$\sqrt T\,E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)^{1/2}(\hat\theta_T-\theta_0)\xrightarrow{D}\mathcal{N}(0,I_p);$$
using this it is difficult to obtain a CI for $\theta_0$. One way to construct the CI is to "square" $(\hat\theta_T-\theta_0)$ and use
$$T(\hat\theta_T-\theta_0)'\,E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)(\hat\theta_T-\theta_0)\xrightarrow{D}\chi^2_p.\qquad(5.5)$$
Based on the above, a 95% CI is
$$\Big\{\theta:\ T(\hat\theta_T-\theta)'\,E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)(\hat\theta_T-\theta)\le\chi^2_p(0.95)\Big\}.\qquad(5.6)$$
Note that, as in the scalar case, this leads to the interval with the smallest length. A disadvantage of (5.6) is that we have to (a) estimate the information matrix and (b) try to find all $\theta$ such that the above holds. This can be quite unwieldy. An alternative method, which is asymptotically equivalent to the above but removes the need to estimate the information matrix, is to use (5.3). By using (5.3), a $100(1-\alpha)\%$ CI for $\theta_0$ is
$$\Big\{\theta:\ 2\big(\mathcal{L}_T(\hat\theta_T)-\mathcal{L}_T(\theta)\big)\le\chi^2_p(1-\alpha)\Big\}.\qquad(5.7)$$
The above is not easy to calculate, but it is feasible.
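For a scalar parameter, (5.7) can be evaluated numerically by finding where the likelihood-ratio statistic crosses the chi-squared cut-off. The sketch below (my own, using the Weibull scale θ with α known; the search brackets around θ̂ are heuristic choices) does exactly that.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

rng = np.random.default_rng(5)
alpha, theta0, T = 2.0, 1.5, 200
y = theta0 * rng.weibull(alpha, size=T)

def loglik(theta):
    return np.sum(-alpha * np.log(theta) - (y / theta) ** alpha)

theta_hat = np.mean(y ** alpha) ** (1.0 / alpha)
crit = chi2.ppf(0.95, df=1)

def g(theta):
    # zero where 2*(L(theta_hat) - L(theta)) equals the chi-squared threshold
    return 2 * (loglik(theta_hat) - loglik(theta)) - crit

lower = brentq(g, 0.5 * theta_hat, theta_hat)   # heuristic brackets either side
upper = brentq(g, theta_hat, 2.0 * theta_hat)
print(theta_hat, (lower, upper))
```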

Example 5.1.1 In the case that $\theta_0$ is a scalar, the 95% CI based on (5.7) is
$$\Big\{\theta:\ 2\big(\mathcal{L}_T(\hat\theta_T)-\mathcal{L}_T(\theta)\big)\le\chi^2_1(0.95)\Big\}.$$

Both the 95% CIs in (5.6) and (5.7) will be very close for relatively large sample sizes. However, one advantage of using (5.7) instead of (5.6) is that it is easier to evaluate — there is no need to obtain the second derivative of the likelihood, etc. Another feature which differentiates the CIs in (5.6) and (5.7) is that the CI based on (5.6) is symmetric about $\hat\theta_T$ (recall that $(\bar X - 1.96\sigma/\sqrt T,\ \bar X + 1.96\sigma/\sqrt T)$ is symmetric about $\bar X$), whereas the symmetry may not hold in finite samples when constructing a CI for $\theta_0$ using (5.7). This is a positive advantage of using (5.7) instead of (5.6). A disadvantage of using (5.7) instead of (5.6) is that the CI based on (5.7) may sometimes consist of more than one interval.

As you can see, if the dimension of $\theta$ is large it is quite difficult to evaluate the CI (try it for the simple case that the dimension is two!). Indeed for dimensions greater than three it is extremely hard. However, in most cases we are only interested in constructing CIs for certain parameters of interest; the other unknown parameters are simply nuisance parameters and CIs for them are not of interest. For example, for the normal distribution we may only be interested in CIs for the mean but not the variance. It is clear that directly using the log-likelihood ratio to construct CIs (and also tests) will mean also constructing CIs for the nuisance parameters. Therefore, in Chapter 6 we construct a variant of the likelihood called the profile likelihood, which allows us to deal with nuisance parameters in a more efficient way.

5.1.2 Testing using the likelihood

Let us suppose we wish to test the hypothesis $H_0:\theta=\theta_0$ against the alternative $H_A:\theta\ne\theta_0$. We can use any of the results in (5.1), (5.2) and (5.3) to do the test; they will lead to slightly different p-values, but asymptotically they are all equivalent, because they are all based essentially on the same derivation. We now list the three tests that one can use.

The Wald test

The Wald statistic is based on (5.1). We recall from (5.1) that if the null is true, then we have
$$\sqrt T(\hat\theta_T-\theta_0)\xrightarrow{D}\mathcal{N}\Big(0,\ E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)^{-1}\Big).$$

Thus we can use as the test statistic
$$T_1 = \sqrt T\,E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)^{1/2}(\hat\theta_T-\theta_0)\xrightarrow{D}\mathcal{N}(0,1).$$
Let us now consider how the test statistic behaves under the alternative $H_A:\theta=\theta_1$. If the null is not true, then we have that
$$\hat\theta_T-\theta_0 = (\hat\theta_T-\theta_1) + (\theta_1-\theta_0)\approx I(\theta_1)^{-1}\frac{1}{T}\sum_{t=1}^{T}\frac{\partial\log f(X_t;\theta)}{\partial\theta}\Big|_{\theta_1} + (\theta_1-\theta_0).$$
Thus the distribution of the test statistic becomes centered about $\sqrt T\,E\big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\big|_{\theta_0}\big)^{1/2}(\theta_1-\theta_0)$; the larger the sample size, the more likely we are to reject the null.

Remark 5.1.1 (Types of alternatives) In the case that the alternative is fixed, it is clear that the power of the test goes to 100%. Therefore, often to see the effectiveness of the test, one lets the alternative get closer to the null as $T\to\infty$. For example:
1. Suppose that $\theta_1 = \theta_0 + T^{-1}$; then the centre of $T_1$ is $T^{-1/2}E\big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\big|_{\theta_0}\big)^{1/2}\to 0$. Thus the alternative is too close to the null for us to discriminate between the two.
2. Suppose that $\theta_1 = \theta_0 + T^{-1/2}$; then the centre of $T_1$ is $E\big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\big|_{\theta_0}\big)^{1/2}$. Therefore the test does have power, but it is not 100%.

In the case that the dimension of $\theta$ is greater than one, we use the quadratic form $T(\hat\theta_T-\theta_0)'E\big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\big|_{\theta_0}\big)(\hat\theta_T-\theta_0)$ instead of $T_1$, noting that its distribution is a chi-squared with $p$ degrees of freedom.

The Score test

The score test is based on the score. We recall from (5.2) that under the null the distribution of the score is
$$\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T}{\partial\theta}\Big|_{\theta=\theta_0}\xrightarrow{D}\mathcal{N}\Big(0,\ E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)\Big).$$
Thus we use as the test statistic
$$T_2 = \frac{1}{\sqrt T}\,E\Big(-\frac{\partial^2\log f(X;\theta)}{\partial\theta^2}\Big|_{\theta_0}\Big)^{-1/2}\frac{\partial\mathcal{L}_T}{\partial\theta}\Big|_{\theta=\theta_0}\xrightarrow{D}\mathcal{N}(0,1).$$
An advantage of this test is that the maximum likelihood estimator (under either the null or the alternative) does not have to be calculated. Exercise: What does the test statistic look like under the alternative?
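To make the three statistics concrete, here is a small sketch of my own (again the Weibull with α known, for which I take I(θ) = α²/θ², a value derived outside the notes), computing the Wald, score and log-likelihood ratio statistics for H₀: θ = θ₀ on simulated data:

```python
import numpy as np
from scipy.stats import chi2, norm

rng = np.random.default_rng(6)
alpha, theta_null, T = 2.0, 1.5, 300
y = theta_null * rng.weibull(alpha, size=T)     # data generated under the null

def loglik(theta):
    return np.sum(-alpha * np.log(theta) - (y / theta) ** alpha)

theta_hat = np.mean(y ** alpha) ** (1.0 / alpha)
info = alpha ** 2 / theta_null ** 2             # per-observation I(theta) at the null

wald = np.sqrt(T * info) * (theta_hat - theta_null)
score = -T * alpha / theta_null + alpha * np.sum(y ** alpha) / theta_null ** (alpha + 1)
score_stat = score / np.sqrt(T * info)
lrt = 2 * (loglik(theta_hat) - loglik(theta_null))

# Two-sided p-values; the three should be similar for large T
print(2 * norm.sf(abs(wald)), 2 * norm.sf(abs(score_stat)), chi2.sf(lrt, df=1))
```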

The log-likelihood ratio test

Probably one of the most popular tests is the log-likelihood ratio test. This test is based on (5.3), and the test statistic is
$$T_3 = 2\big(\mathcal{L}_T(\hat\theta_T)-\mathcal{L}_T(\theta_0)\big)\xrightarrow{D}\chi^2_p.$$
An advantage of this test statistic is that it is pivotal, in the sense that the Fisher information etc. does not have to be calculated, only the maximum likelihood estimator. Exercise: What does the test statistic look like under the alternative?

5.1.3 Applications of the log-likelihood ratio to the multinomial distribution

Example 5.1.2 (The multinomial distribution) This is a generalisation of the binomial distribution. In this case, at any given trial there can arise $m$ different events (in the binomial case $m=2$). Let $Z_i$ denote the outcome of the $i$th trial and assume $P(Z_i=k)=\pi_k$, with $\pi_1+\cdots+\pi_m=1$. Suppose there were $n$ trials conducted and let $Y_1$ denote the number of times event 1 arises, $Y_2$ the number of times event 2 arises, and so on. Then it is straightforward to show that
$$P(Y_1=k_1,\ldots,Y_m=k_m) = \binom{n}{k_1,\ldots,k_m}\prod_{i=1}^{m}\pi_i^{k_i}.$$
If we do not impose any constraints on the probabilities $\{\pi_i\}$, given $\{Y_i\}_{i=1}^{m}$ it is straightforward to derive the MLE of $\{\pi_i\}$ (it is very intuitive too!). Noting that $\pi_m = 1-\sum_{i=1}^{m-1}\pi_i$, the log-likelihood of the multinomial is proportional to
$$\mathcal{L}_n(\pi) = \sum_{i=1}^{m-1}y_i\log\pi_i + y_m\log\Big(1-\sum_{i=1}^{m-1}\pi_i\Big).$$
Differentiating the above with respect to $\pi_i$ and solving gives the MLE $\hat\pi_i = Y_i/n$ — which is what we would have expected! We observe that though there are $m$ probabilities to estimate, due to the constraint $\pi_m = 1-\sum_{i=1}^{m-1}\pi_i$ we only have to estimate $m-1$ probabilities.

We mention that the same estimators can also be obtained by using Lagrange multipliers, that is, maximising $\mathcal{L}_n(\pi)$ subject to the parameter constraint $\sum_{j=1}^{m}\pi_j = 1$. To enforce this constraint, we normally add an additional term to $\mathcal{L}_n(\pi)$ and include the dummy variable $\lambda$. That is, we define the constrained likelihood
$$\mathcal{L}_n(\pi,\lambda) = \sum_{i=1}^{m}y_i\log\pi_i + \lambda\Big(1-\sum_{i=1}^{m}\pi_i\Big).$$

Now, if we maximise $\mathcal{L}_n(\pi,\lambda)$ with respect to $\{\pi_i\}_{i=1}^{m}$ and $\lambda$, we will obtain the estimators $\hat\pi_i = Y_i/n$, which is the same as the maximum of $\mathcal{L}_n(\pi)$. To derive the limiting distribution we note that the second derivative is
$$\frac{\partial^2\mathcal{L}_n(\pi)}{\partial\pi_i\partial\pi_j} = \begin{cases}-\dfrac{y_i}{\pi_i^2}-\dfrac{y_m}{\big(1-\sum_{r=1}^{m-1}\pi_r\big)^2} & i=j,\\[6pt] -\dfrac{y_m}{\big(1-\sum_{r=1}^{m-1}\pi_r\big)^2} & i\ne j.\end{cases}$$
Hence, taking expectations of the above, the information matrix is the $(m-1)\times(m-1)$ matrix
$$I(\pi) = n\begin{pmatrix}\frac{1}{\pi_1}+\frac{1}{\pi_m} & \frac{1}{\pi_m} & \cdots & \frac{1}{\pi_m}\\ \frac{1}{\pi_m} & \frac{1}{\pi_2}+\frac{1}{\pi_m} & \cdots & \frac{1}{\pi_m}\\ \vdots & & \ddots & \vdots\\ \frac{1}{\pi_m} & \cdots & & \frac{1}{\pi_{m-1}}+\frac{1}{\pi_m}\end{pmatrix}.$$
Provided none of the $\pi_i$ is equal to either 0 or 1 (which would drop the dimension $m$ and make $I(\pi)$ singular), the asymptotic distribution of the MLE is normal with variance $I(\pi)^{-1}$.

Sometimes the probabilities $\{\pi_i\}$ will not be free and will be determined by a parameter $\theta$, where $\theta$ is an $r$-dimensional vector with $r<m$, i.e. $\pi_i = \pi_i(\theta)$. In this case the log-likelihood of the multinomial is
$$\mathcal{L}_n(\theta) = \sum_{i=1}^{m-1}y_i\log\pi_i(\theta) + y_m\log\Big(1-\sum_{i=1}^{m-1}\pi_i(\theta)\Big).$$
By differentiating the above with respect to $\theta$ and solving, we obtain the MLE.

Pearson's goodness of fit test

We now derive Pearson's goodness-of-fit test using the log-likelihood ratio, though Pearson did not use this method to derive his test. Suppose the null is $H_0:\pi_1=\tilde\pi_1,\ldots,\pi_m=\tilde\pi_m$, where $\{\tilde\pi_i\}$ are some pre-set probabilities, and $H_A$: the probabilities are not the given probabilities. Hence we are testing a restricted model (where we do not have to estimate anything) against the full model (where we estimate the probabilities using $\hat\pi_i = Y_i/n$). The log-likelihood ratio in this case is
$$W = 2\big\{\max_\pi\mathcal{L}_n(\pi) - \mathcal{L}_n(\tilde\pi)\big\}.$$
Under the null we know that $W = 2\{\max_\pi\mathcal{L}_n(\pi) - \mathcal{L}_n(\tilde\pi)\}\xrightarrow{D}\chi^2_{m-1}$ (because we have to estimate $m-1$ parameters). We now derive an expression for $W$ and show that the Pearson

statistic is an approximation of this:
$$W = 2\Big(\sum_{i=1}^{m-1}Y_i\log\frac{Y_i}{n} + Y_m\log\frac{Y_m}{n} - \sum_{i=1}^{m-1}Y_i\log\tilde\pi_i - Y_m\log\tilde\pi_m\Big) = 2\sum_{i=1}^{m}Y_i\log\frac{Y_i}{n\tilde\pi_i}.$$
Recall that $Y_i$ is often called the observed ($Y_i = O_i$) and $n\tilde\pi_i$ the expected under the null ($E_i = n\tilde\pi_i$). Then
$$W = 2\sum_{i=1}^{m}O_i\log\frac{O_i}{E_i}\xrightarrow{D}\chi^2_{m-1}.$$
By using the fact that for $x$ close to $a$, a Taylor expansion of $x\log(x/a)$ about $x=a$ gives
$$x\log\frac{x}{a}\approx a\log\frac{a}{a} + (x-a) + \frac{(x-a)^2}{2a},$$
and letting $O=x$ and $E=a$, then, assuming the null is true and $E_i\approx O_i$, we have
$$W = 2\sum_{i=1}^{m}Y_i\log\frac{Y_i}{n\tilde\pi_i}\approx 2\sum_{i=1}^{m}\Big((O_i-E_i) + \frac{(O_i-E_i)^2}{2E_i}\Big).$$
Now we note that $\sum_{i=1}^{m}E_i = \sum_{i=1}^{m}O_i = n$, hence the above reduces to
$$W\approx\sum_{i=1}^{m}\frac{(O_i-E_i)^2}{E_i}\xrightarrow{D}\chi^2_{m-1}.$$
We recall that the above is the Pearson test statistic. Hence this is one method for deriving the Pearson chi-squared test for goodness of fit. By using a similar argument, we can also obtain the test statistic of the chi-squared test for independence (and an explanation for the rather strange number of degrees of freedom!).
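A quick numerical comparison of the two statistics (my own sketch; the null probabilities and sample size are illustrative, and the log term requires all observed counts to be positive):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
pi_null = np.array([0.2, 0.3, 0.5])
n = 500
observed = rng.multinomial(n, pi_null)           # O_i, generated under the null
expected = n * pi_null                            # E_i = n * pi_tilde_i

W = 2 * np.sum(observed * np.log(observed / expected))       # likelihood ratio
pearson = np.sum((observed - expected) ** 2 / expected)      # Pearson X^2

df = len(pi_null) - 1
print(W, pearson, chi2.sf(W, df), chi2.sf(pearson, df))      # close for large n
```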

Chapter 6
The Profile Likelihood

6.1 The Profile Likelihood

See also Section 4.5, Davison (2002).

6.1.1 The method of profiling

Let us suppose that the unknown parameters $\theta$ can be partitioned as $\theta = (\psi,\lambda)$, where $\psi$ are the $p$-dimensional parameters of interest (e.g. the mean) and $\lambda$ are the $q$-dimensional nuisance parameters (e.g. the variance). We will need to estimate both $\psi$ and $\lambda$, but our interest is in testing only the parameter $\psi$, without any information on $\lambda$, and in constructing confidence intervals for $\psi$ without constructing unnecessary confidence intervals for $\lambda$ — confidence intervals for a large number of parameters are wider than those for a few parameters. To achieve this one often uses the profile likelihood. To motivate the profile likelihood, we first describe a method to estimate the parameters $(\psi,\lambda)$ in two stages and consider some examples.

Let us suppose that $\{X_t\}$ are iid random variables with density $f(x;\psi,\lambda)$, where our objective is to estimate $\psi$ and $\lambda$. In this case the log-likelihood is
$$\mathcal{L}_T(\psi,\lambda) = \sum_{t=1}^{T}\log f(X_t;\psi,\lambda).$$
To estimate $\psi$ and $\lambda$ one can use $(\hat\lambda_T,\hat\psi_T) = \arg\max_{\lambda,\psi}\mathcal{L}_T(\psi,\lambda)$. However, this can be quite difficult and lead to expressions which are hard to maximise. Instead let us consider a different method, which may sometimes be easier to evaluate. Suppose, for now, that $\psi$ is known; then we rewrite the likelihood as $\mathcal{L}_T(\psi,\lambda) = \mathcal{L}_\psi(\lambda)$ (to show that $\psi$ is fixed but $\lambda$ varies). To estimate $\lambda$ we maximise $\mathcal{L}_\psi(\lambda)$ with respect to $\lambda$, i.e.
$$\hat\lambda_\psi = \arg\max_{\lambda}\mathcal{L}_\psi(\lambda).$$

In reality $\psi$ is unknown, hence for each $\psi$ we can evaluate $\hat\lambda_\psi$. Note that for each $\psi$ we have a new curve $\mathcal{L}_\psi(\lambda)$ over $\lambda$. Now to estimate $\psi$, we evaluate the maximum of $\mathcal{L}_\psi(\lambda)$ over $\lambda$, and choose the $\psi$ which is the maximum over all these curves. In other words, we evaluate
$$\hat\psi_T = \arg\max_{\psi}\mathcal{L}_\psi(\hat\lambda_\psi) = \arg\max_{\psi}\mathcal{L}_T(\psi,\hat\lambda_\psi).$$
A bit of logical deduction shows that $\hat\psi_T$ and $\hat\lambda_{\hat\psi_T}$ are the maximum likelihood estimators, $(\hat\lambda_T,\hat\psi_T) = \arg\max_{\psi,\lambda}\mathcal{L}_T(\psi,\lambda)$. We note that we have profiled out the nuisance parameter $\lambda$, and the likelihood $\mathcal{L}_\psi(\hat\lambda_\psi) = \mathcal{L}_T(\psi,\hat\lambda_\psi)$ is completely in terms of the parameter of interest $\psi$. The advantage of this is best illustrated through some examples.

Example 6.1.1 Let us suppose that $\{X_t\}$ are iid random variables from a Weibull distribution with density $f(x;\alpha,\theta) = \frac{\alpha y^{\alpha-1}}{\theta^{\alpha}}\exp(-(y/\theta)^{\alpha})$. We know from Example 4.1.2 that if $\alpha$ were known, an explicit expression for the MLE can be derived:
$$\hat\theta_\alpha = \arg\max_{\theta}\mathcal{L}_\alpha(\theta) = \arg\max_{\theta}\sum_{t=1}^{T}\Big(\log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - \Big(\frac{Y_t}{\theta}\Big)^{\alpha}\Big) = \Big(\frac{1}{T}\sum_{t=1}^{T}Y_t^{\alpha}\Big)^{1/\alpha},$$
where $\mathcal{L}_\alpha(X;\theta) = \sum_{t=1}^{T}\big(\log\alpha + (\alpha-1)\log Y_t - \alpha\log\theta - (Y_t/\theta)^{\alpha}\big)$. Thus, for a given $\alpha$, the maximum likelihood estimator of $\theta$ can be derived. The maximum likelihood estimator of $\alpha$ is then
$$\hat\alpha_T = \arg\max_{\alpha}\sum_{t=1}^{T}\Big(\log\alpha + (\alpha-1)\log Y_t - \log\Big(\frac{1}{T}\sum_{s=1}^{T}Y_s^{\alpha}\Big) - \frac{Y_t^{\alpha}}{\frac{1}{T}\sum_{s=1}^{T}Y_s^{\alpha}}\Big).$$
Therefore, the maximum likelihood estimator of $\theta$ is $\big(\frac{1}{T}\sum_{t}Y_t^{\hat\alpha_T}\big)^{1/\hat\alpha_T}$. We observe that evaluating $\hat\alpha_T$ can be tricky, but it is no worse than maximising the likelihood $\mathcal{L}_T(\alpha,\theta)$ over $\alpha$ and $\theta$.

As we mentioned above, we often do not have any interest in the nuisance parameter $\lambda$ and are only interested in testing and constructing CIs for $\psi$. In this case, we are interested in the limiting distribution of the MLE $\hat\psi_T$. This can easily be derived by observing that
$$\sqrt T\begin{pmatrix}\hat\psi_T-\psi\\ \hat\lambda_T-\lambda\end{pmatrix}\xrightarrow{D}\mathcal{N}\Big(0,\ \begin{pmatrix}I_{\psi\psi} & I_{\psi\lambda}\\ I_{\lambda\psi} & I_{\lambda\lambda}\end{pmatrix}^{-1}\Big),$$

where
$$\begin{pmatrix}I_{\psi\psi} & I_{\psi\lambda}\\ I_{\lambda\psi} & I_{\lambda\lambda}\end{pmatrix} = \begin{pmatrix}-E\big(\frac{\partial^2\log f(X_t;\psi,\lambda)}{\partial\psi^2}\big) & -E\big(\frac{\partial^2\log f(X_t;\psi,\lambda)}{\partial\psi\,\partial\lambda}\big)\\ -E\big(\frac{\partial^2\log f(X_t;\psi,\lambda)}{\partial\lambda\,\partial\psi}\big) & -E\big(\frac{\partial^2\log f(X_t;\psi,\lambda)}{\partial\lambda^2}\big)\end{pmatrix}.\qquad(6.1)$$
To derive an exact expression for the limiting variance of $\sqrt T(\hat\psi_T-\psi)$, we note that the inverse of a block matrix is
$$\begin{pmatrix}A & B\\ C & D\end{pmatrix}^{-1} = \begin{pmatrix}(A-BD^{-1}C)^{-1} & -(A-BD^{-1}C)^{-1}BD^{-1}\\ -D^{-1}C(A-BD^{-1}C)^{-1} & D^{-1}+D^{-1}C(A-BD^{-1}C)^{-1}BD^{-1}\end{pmatrix}.$$
Thus the above implies that
$$\sqrt T(\hat\psi_T-\psi)\xrightarrow{D}\mathcal{N}\big(0,\ (I_{\psi\psi}-I_{\psi\lambda}I_{\lambda\lambda}^{-1}I_{\lambda\psi})^{-1}\big).$$
Thus if $\psi$ is a scalar we can easily use the above to construct confidence intervals for $\psi$. Exercise: How would you estimate $I_{\psi\psi}-I_{\psi\lambda}I_{\lambda\lambda}^{-1}I_{\lambda\psi}$?

6.1.2 The score and the log-likelihood ratio for the profile likelihood

To ease notation, let us suppose that $\psi_0$ and $\lambda_0$ are the true parameters in the distribution. The above gives us the limiting distribution of $\sqrt T(\hat\psi_T-\psi_0)$; this allows us to test $\psi$, however the test ignores any dependency that may exist with the nuisance parameter estimator $\hat\lambda_T$. An alternative test, which circumvents this issue, is to do a log-likelihood ratio test of the type
$$2\Big\{\max_{\psi,\lambda}\mathcal{L}_T(\psi,\lambda) - \max_{\lambda}\mathcal{L}_T(\psi_0,\lambda)\Big\}.\qquad(6.2)$$
However, deriving the limiting distribution of this statistic is a little more complicated than for the log-likelihood ratio test that does not involve nuisance parameters. This is because a direct Taylor expansion does not work. However, we observe that
$$\max_{\psi,\lambda}\mathcal{L}_T(\psi,\lambda) - \max_{\lambda}\mathcal{L}_T(\psi_0,\lambda) = \Big\{\max_{\psi,\lambda}\mathcal{L}_T(\psi,\lambda) - \mathcal{L}_T(\psi_0,\lambda_0)\Big\} - \Big\{\max_{\lambda}\mathcal{L}_T(\psi_0,\lambda) - \mathcal{L}_T(\psi_0,\lambda_0)\Big\},$$
and we will show below that by using a few Taylor expansions we can derive the limiting distribution of (6.2). In the theorem below we will derive the distribution of the score and the nested log-likelihood. (Please note: you do not have to learn this proof.)

Theorem 6.1.1 Suppose Assumption 4.1.1 holds. Suppose that $\psi_0,\lambda_0$ are the true parameters. Then we have
$$\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\hat\lambda_{\psi_0},\psi_0}\approx\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\lambda_0,\psi_0} - I_{\psi\lambda}I_{\lambda\lambda}^{-1}\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\psi_0,\lambda_0}\qquad(6.3)$$

and
$$\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\hat\lambda_{\psi_0},\psi_0}\xrightarrow{D}\mathcal{N}\big(0,\ I_{\psi_0\psi_0} - I_{\psi_0\lambda_0}I_{\lambda_0\lambda_0}^{-1}I_{\lambda_0\psi_0}\big),\qquad(6.4)$$
$$2\Big\{\mathcal{L}_T(\hat\psi,\hat\lambda) - \mathcal{L}_T(\psi_0,\hat\lambda_{\psi_0})\Big\}\xrightarrow{D}\chi^2_p,\qquad(6.5)$$
where $I$ is defined as in (6.1).

PROOF. We first prove (6.3), which is the basis of the proofs of (6.4) and (6.5); in the remark below we try to interpret (6.3). To avoid notational difficulties, by considering the elements of the vectors $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\hat\lambda_{\psi_0},\psi_0}$ and $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\lambda}\big|_{\lambda_0,\psi_0}$ pointwise (as discussed in Section 4.1.3), we will suppose that these are univariate random variables.

Our objective is to find an expression for $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\hat\lambda_{\psi_0},\psi_0}$ in terms of $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\lambda}\big|_{\lambda_0,\psi_0}$ and $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\lambda_0,\psi_0}$, which will allow us to obtain its variance and asymptotic distribution easily. Now, making a Taylor expansion of $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\hat\lambda_{\psi_0},\psi_0}$ about $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\lambda_0,\psi_0}$ gives
$$\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\hat\lambda_{\psi_0},\psi_0}\approx\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\lambda_0,\psi_0} + \frac{\partial^2\mathcal{L}_T(\psi,\lambda)}{\partial\lambda\,\partial\psi}\Big|_{\lambda_0,\psi_0}\,(\hat\lambda_{\psi_0}-\lambda_0).$$
Notice that we have used $\approx$ instead of $=$ because we replace the second derivative with its value at the true parameters. Now, if the sample size $T$ is large enough, then we can say that $\frac{1}{T}\frac{\partial^2\mathcal{L}_T(\psi,\lambda)}{\partial\lambda\,\partial\psi}\big|_{\lambda_0,\psi_0}\approx E\big(\frac{1}{T}\frac{\partial^2\mathcal{L}_T(\psi,\lambda)}{\partial\lambda\,\partial\psi}\big|_{\lambda_0,\psi_0}\big)$. To see why this is true, consider the case of iid random variables; then
$$\frac{1}{T}\frac{\partial^2\mathcal{L}_T(\psi,\lambda)}{\partial\lambda\,\partial\psi}\Big|_{\lambda_0,\psi_0} = \frac{1}{T}\sum_{t=1}^{T}\frac{\partial^2\log f(X_t;\psi,\lambda)}{\partial\lambda\,\partial\psi}\Big|_{\lambda_0,\psi_0}\xrightarrow{P}E\Big(\frac{\partial^2\log f(X_t;\psi,\lambda)}{\partial\lambda\,\partial\psi}\Big|_{\lambda_0,\psi_0}\Big).$$
Therefore we have
$$\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\hat\lambda_{\psi_0},\psi_0}\approx\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\lambda_0,\psi_0} - T\,I_{\psi\lambda}\,(\hat\lambda_{\psi_0}-\lambda_0).\qquad(6.6)$$
Hence we have the first part of the decomposition of $\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\psi}\big|_{\hat\lambda_{\psi_0}}$ into distributions which are known; now we need to find a decomposition of $(\hat\lambda_{\psi_0}-\lambda_0)$ into known distributions. We first recall that, since $\hat\lambda_{\psi_0}=\arg\max_\lambda\mathcal{L}_T(\psi_0,\lambda)$, we have
$$\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\hat\lambda_{\psi_0}}=0,$$
as long as the parameter space is large enough and the maximum is not on the boundary. Therefore, making a Taylor expansion of $\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\big|_{\hat\lambda_{\psi_0}}$ about $\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\big|_{\lambda_0}$ gives
$$\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\hat\lambda_{\psi_0}}\approx\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\lambda_0,\psi_0} + \frac{\partial^2\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda^2}\Big|_{\lambda_0,\psi_0}\,(\hat\lambda_{\psi_0}-\lambda_0).$$

Again, using the same trick as in (6.6) we have
$$0 = \frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\hat\lambda_{\psi_0}}\approx\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\lambda_0,\psi_0} - T\,I_{\lambda\lambda}\,(\hat\lambda_{\psi_0}-\lambda_0).$$
Therefore
$$\hat\lambda_{\psi_0}-\lambda_0\approx(T\,I_{\lambda\lambda})^{-1}\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\lambda_0,\psi_0}.\qquad(6.7)$$
Therefore, substituting (6.7) into (6.6) gives
$$\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\hat\lambda_{\psi_0},\psi_0}\approx\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\lambda_0,\psi_0} - I_{\psi\lambda}I_{\lambda\lambda}^{-1}\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\psi_0,\lambda_0},$$
and hence (6.3).

To prove (6.4) (i.e. to obtain the asymptotic distribution and limiting variance of $\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\psi}\big|_{\hat\lambda_{\psi_0}}$), we recall that the regular score function satisfies
$$\frac{1}{\sqrt T}\begin{pmatrix}\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\lambda_0,\psi_0}\\ \frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\lambda}\big|_{\lambda_0,\psi_0}\end{pmatrix}\xrightarrow{D}\mathcal{N}\big(0,\ I(\theta_0)\big).$$
Now, by substituting the above into (6.3) we immediately obtain (6.4). Finally, to prove (6.5) we use the following decomposition, Taylor expansions and the trick in (6.6) to obtain
$$2\big\{\mathcal{L}_T(\hat\psi,\hat\lambda)-\mathcal{L}_T(\psi_0,\hat\lambda_{\psi_0})\big\} = 2\big\{\mathcal{L}_T(\hat\psi,\hat\lambda)-\mathcal{L}_T(\psi_0,\lambda_0)\big\} - 2\big\{\mathcal{L}_T(\psi_0,\hat\lambda_{\psi_0})-\mathcal{L}_T(\psi_0,\lambda_0)\big\}$$
$$\approx T(\hat\theta_T-\theta_0)'I(\theta_0)(\hat\theta_T-\theta_0) - T(\hat\lambda_{\psi_0}-\lambda_0)'I_{\lambda\lambda}(\hat\lambda_{\psi_0}-\lambda_0),\qquad(6.8)$$
where $\hat\theta_T = (\hat\psi,\hat\lambda)$ is the MLE. Now we want to rewrite $(\hat\lambda_{\psi_0}-\lambda_0)$ in terms of $(\hat\theta_T-\theta_0)$. We start by recalling from (6.7) that $\hat\lambda_{\psi_0}-\lambda_0\approx(T\,I_{\lambda\lambda})^{-1}\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\big|_{\lambda_0,\psi_0}$. Now we rewrite $\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\big|_{\lambda_0,\psi_0}$ in terms of $(\hat\theta_T-\theta_0)$ by using
$$\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\Big|_{\theta_0}\approx\frac{\partial\mathcal{L}_T(\theta)}{\partial\theta}\Big|_{\hat\theta_T} + T\,I(\theta_0)(\hat\theta_T-\theta_0) = T\,I(\theta_0)(\hat\theta_T-\theta_0).$$
Therefore, concentrating on the subvector $\frac{\partial\mathcal{L}_T(\theta)}{\partial\lambda}\big|_{\psi_0,\lambda_0}$, we see that
$$\frac{\partial\mathcal{L}_T(\theta)}{\partial\lambda}\Big|_{\psi_0,\lambda_0}\approx T\,I_{\lambda\psi}(\hat\psi-\psi_0) + T\,I_{\lambda\lambda}(\hat\lambda-\lambda_0).\qquad(6.9)$$

Substituting (6.9) into (6.7) gives
$$\hat\lambda_{\psi_0}-\lambda_0\approx I_{\lambda\lambda}^{-1}I_{\lambda\psi}(\hat\psi-\psi_0) + (\hat\lambda-\lambda_0).$$
Finally, substituting the above into (6.8) and making lots of cancellations, we have
$$2\big\{\mathcal{L}_T(\hat\psi,\hat\lambda)-\mathcal{L}_T(\psi_0,\hat\lambda_{\psi_0})\big\}\approx T(\hat\psi-\psi_0)'\big(I_{\psi\psi}-I_{\psi\lambda}I_{\lambda\lambda}^{-1}I_{\lambda\psi}\big)(\hat\psi-\psi_0).$$
Finally, since $\sqrt T(\hat\psi-\psi_0)\xrightarrow{D}\mathcal{N}\big(0,(I_{\psi\psi}-I_{\psi\lambda}I_{\lambda\lambda}^{-1}I_{\lambda\psi})^{-1}\big)$ (by using the inversion formula for block matrices), this gives the desired result. □

Remark 6.1.1 (i) We first make the rather interesting observation: the limiting variance of $\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\psi_0,\lambda_0}$ is $I_{\psi\psi}$, whereas the limiting variance of $\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\hat\lambda_{\psi_0},\psi_0}$ is $I_{\psi\psi}-I_{\psi\lambda}I_{\lambda\lambda}^{-1}I_{\lambda\psi}$, and the limiting variance of $\sqrt T(\hat\psi-\psi_0)$ is $(I_{\psi\psi}-I_{\psi\lambda}I_{\lambda\lambda}^{-1}I_{\lambda\psi})^{-1}$.

(ii) Look again at the expression
$$\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\hat\lambda_{\psi_0},\psi_0}\approx\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\lambda_0,\psi_0} - I_{\psi\lambda}I_{\lambda\lambda}^{-1}\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\Big|_{\lambda_0,\psi_0}.\qquad(6.10)$$
It is useful to understand where it came from. Consider the problem of linear regression. Suppose $X$ and $Y$ are random variables and we want to construct the best linear predictor of $Y$ given $X$. We know that the best linear predictor is $\hat Y(X)=\frac{E(XY)}{E(X^2)}X$, the residual is $Y-\hat Y(X)$, and the mean squared error is $E\big(Y-\frac{E(XY)}{E(X^2)}X\big)^2 = E(Y^2)-\frac{E(XY)^2}{E(X^2)}$. Compare this with (6.10): we see that, in some sense, $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\hat\lambda_{\psi_0},\psi_0}$ can be treated as the residual of the projection of $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\lambda_0,\psi_0}$ onto $\frac{\partial\mathcal{L}_T(\psi_0,\lambda)}{\partial\lambda}\big|_{\lambda_0,\psi_0}$. This is quite surprising!

We now aim to use the above result. It is immediately clear that (6.5) can be used both for constructing CIs and for testing. For example, to construct a 95% CI for $\psi$ we can use the MLE $\hat\theta_T=(\hat\psi,\hat\lambda)$ and the profile likelihood, and use the 95% CI
$$\Big\{\psi:\ 2\big(\mathcal{L}_T(\hat\psi,\hat\lambda)-\mathcal{L}_T(\psi,\hat\lambda_\psi)\big)\le\chi^2_p(0.95)\Big\}.$$
As you can see, by profiling out the parameter $\lambda$ we have avoided the need to also construct a CI for $\lambda$. This has many advantages; from a practical perspective, it reduces the dimension of the parameters.
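Putting the pieces together for Example 6.1.1, here is a sketch of my own (not from the notes): the profile log-likelihood of α for the Weibull, with θ profiled out explicitly, and the corresponding 95% profile-likelihood confidence interval for α. The simulation settings and the root-finding brackets are heuristic choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar, brentq
from scipy.stats import chi2

rng = np.random.default_rng(8)
alpha0, theta0, T = 2.0, 1.5, 300
y = theta0 * rng.weibull(alpha0, size=T)

def profile_loglik(a):
    # theta profiled out explicitly: theta_hat(a) = (mean(Y^a))^(1/a)
    theta_a = np.mean(y ** a) ** (1.0 / a)
    return np.sum(np.log(a) + (a - 1) * np.log(y) - a * np.log(theta_a)
                  - (y / theta_a) ** a)

fit = minimize_scalar(lambda a: -profile_loglik(a), bounds=(0.1, 10.0),
                      method="bounded")
alpha_hat, lp_max = fit.x, -fit.fun

def g(a):
    # zero where the profile likelihood-ratio statistic hits the 95% cut-off
    return 2 * (lp_max - profile_loglik(a)) - chi2.ppf(0.95, df=1)

ci = (brentq(g, 0.1, alpha_hat), brentq(g, alpha_hat, 10.0))
print(alpha_hat, ci)
```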

The log-likelihood ratio test in the presence of nuisance parameters

An application of Theorem 6.1.1 is to nested hypothesis testing, as stated at the beginning of this section. Equation (6.5) can be used to test $H_0:\psi=\psi_0$ against $H_A:\psi\ne\psi_0$, since
$$2\Big\{\max_{\psi,\lambda}\mathcal{L}_T(\psi,\lambda) - \max_{\lambda}\mathcal{L}_T(\psi_0,\lambda)\Big\}\xrightarrow{D}\chi^2_p.$$

Example 6.1.2 (χ²-test for independence) It is worth noting that, using the profile likelihood, one can derive the chi-squared test for independence in much the same way that the Pearson goodness-of-fit test was derived using the log-likelihood ratio test. Do this as an exercise (see Davison (2002), Example 4.37).

The score test in the presence of nuisance parameters

We recall that we used Theorem 6.1.1 to obtain the distribution of $2\{\max_{\psi,\lambda}\mathcal{L}_T(\psi,\lambda)-\max_{\lambda}\mathcal{L}_T(\psi_0,\lambda)\}$ under the null; we now motivate an alternative test of the same hypothesis which uses the same theorem. We recall that under the null $H_0:\psi=\psi_0$ the derivative $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\lambda}\big|_{\hat\lambda_{\psi_0},\psi_0}=0$, but the same is not true of $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\hat\lambda_{\psi_0},\psi_0}$. However, if the null is true, we would expect $\hat\lambda_{\psi_0}$ to be close to the true $\lambda_0$ and $\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\hat\lambda_{\psi_0},\psi_0}$ to be close to zero. Indeed, this is what we showed in (6.4), where we showed that under the null
$$\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\Big|_{\hat\lambda_{\psi_0},\psi_0}\xrightarrow{D}\mathcal{N}\big(0,\ I_{\psi\psi}-I_{\psi\lambda}I_{\lambda\lambda}^{-1}I_{\lambda\psi}\big),\qquad(6.11)$$
where $\hat\lambda_{\psi_0}=\arg\max_\lambda\mathcal{L}_T(\psi_0,\lambda)$. Therefore (6.11) suggests an alternative test for $H_0:\psi=\psi_0$ against $H_A:\psi\ne\psi_0$: we can use $\frac{1}{\sqrt T}\frac{\partial\mathcal{L}_T(\psi,\lambda)}{\partial\psi}\big|_{\hat\lambda_{\psi_0},\psi_0}$ as the test statistic. This is called the score or LM test.

The log-likelihood ratio test and the score test are asymptotically equivalent. There are advantages and disadvantages of both:
(i) An advantage of the log-likelihood ratio test is that we do not need to calculate the information matrix.
(ii) An advantage of the score test is that we do not have to evaluate the maximum likelihood estimates under the alternative model.

6.1.3 Examples

Example: An application of profiling to frequency estimation

Question

Suppose that the observations $\{X_t;\ t=1,\ldots,T\}$ satisfy the following nonlinear regression model
$$X_t = A\cos(\omega t) + B\sin(\omega t) + \varepsilon_t,$$
where $\{\varepsilon_t\}$ are iid standard normal random variables and $0<\omega<\pi$. The parameters $A$, $B$ and $\omega$ are real and unknown. Some useful identities are given at the end of the question.

(i) Ignoring constants, obtain the log-likelihood of $\{X_t\}$. Denote this likelihood as $\mathcal{L}_T(A,B,\omega)$.

(ii) Let
$$S_T(A,B,\omega) = \sum_{t=1}^{T}\Big(X_t^2 - 2X_t\big(A\cos(\omega t)+B\sin(\omega t)\big)\Big) + \frac{T}{2}(A^2+B^2).$$
Show that
$$\mathcal{L}_T(A,B,\omega) + \frac{1}{2}S_T(A,B,\omega) = -\frac{A^2-B^2}{4}\sum_{t=1}^{T}\cos(2\omega t) - \frac{AB}{2}\sum_{t=1}^{T}\sin(2\omega t).$$
Thus show that $\mathcal{L}_T(A,B,\omega)+\frac{1}{2}S_T(A,B,\omega)=O(1)$ (i.e. the difference does not grow with $T$). Since $\mathcal{L}_T(A,B,\omega)$ and $-\frac{1}{2}S_T(A,B,\omega)$ are asymptotically equivalent, for the rest of this question use $S_T(A,B,\omega)$ instead of the likelihood $\mathcal{L}_T(A,B,\omega)$.

(iii) Obtain the profile likelihood of $\omega$. (Hint: profile out the parameters $A$ and $B$, to show that $\hat\omega_T = \arg\max_\omega\big|\sum_{t=1}^{T}X_t\exp(it\omega)\big|^2$.) Suggest a graphical method for evaluating $\hat\omega_T$.

(iv) By using the identity
$$\sum_{t=1}^{T}\exp(i\Omega t) = \begin{cases}\dfrac{\exp\big(\frac{i}{2}(T+1)\Omega\big)\sin\big(\frac{1}{2}T\Omega\big)}{\sin\big(\frac{1}{2}\Omega\big)} & 0<\Omega<2\pi,\\[6pt] T & \Omega=0\ \text{or}\ 2\pi,\end{cases}\qquad(6.12)$$
show that for $0<\Omega<2\pi$ we have
$$\sum_{t=1}^{T}t\cos(\Omega t)=O(T),\quad\sum_{t=1}^{T}t\sin(\Omega t)=O(T),\quad\sum_{t=1}^{T}t^2\cos(\Omega t)=O(T^2),\quad\sum_{t=1}^{T}t^2\sin(\Omega t)=O(T^2).$$

(v) By using the results in part (iv), show that the Fisher Information of $\mathcal{L}_T(A,B,\omega)$ (denoted as $I(A,B,\omega)$) is asymptotically equivalent to
$$I(A,B,\omega) = \frac{1}{2}E\big(\nabla^2 S_T\big) = \begin{pmatrix}\frac{T}{2} & 0 & \frac{BT^2}{4}+O(T)\\ 0 & \frac{T}{2} & -\frac{AT^2}{4}+O(T)\\ \frac{BT^2}{4}+O(T) & -\frac{AT^2}{4}+O(T) & \frac{T^3(A^2+B^2)}{6}+O(T^2)\end{pmatrix}.$$

(vi) Derive the asymptotic variance of the maximum likelihood estimator $\hat\omega_T$ derived in part (iii). Comment on the rate of convergence of $\hat\omega_T$.

Useful information: In this question the following quantities may be useful:
$$\sum_{t=1}^{T}\exp(i\Omega t) = \begin{cases}\dfrac{\exp\big(\frac{i}{2}(T+1)\Omega\big)\sin\big(\frac{1}{2}T\Omega\big)}{\sin\big(\frac{1}{2}\Omega\big)} & 0<\Omega<2\pi,\\[6pt] T & \Omega=0\ \text{or}\ 2\pi;\end{cases}\qquad(6.13)$$
the trigonometric identities $\sin(2\Omega)=2\sin\Omega\cos\Omega$, $\cos(2\Omega)=2\cos^2\Omega-1=1-2\sin^2\Omega$, $\exp(i\Omega)=\cos\Omega+i\sin\Omega$; and $\sum_{t=1}^{T}t=\frac{T(T+1)}{2}$, $\sum_{t=1}^{T}t^2=\frac{T(T+1)(2T+1)}{6}$.

Solution

(i) Since $\{\varepsilon_t\}$ are standard normal iid random variables, the log-likelihood is
$$\mathcal{L}_T(A,B,\omega) = -\frac{1}{2}\sum_{t=1}^{T}\big(X_t - A\cos(\omega t) - B\sin(\omega t)\big)^2.$$

(ii) It is straightforward to show that
$$2\mathcal{L}_T(A,B,\omega) = -\sum_{t=1}^{T}X_t^2 + 2\sum_{t=1}^{T}X_t\big(A\cos(\omega t)+B\sin(\omega t)\big) - \sum_{t=1}^{T}\big(A\cos(\omega t)+B\sin(\omega t)\big)^2,$$
and, using $\cos^2(\omega t)=\frac{1+\cos(2\omega t)}{2}$, $\sin^2(\omega t)=\frac{1-\cos(2\omega t)}{2}$ and $2\sin(\omega t)\cos(\omega t)=\sin(2\omega t)$,
$$\sum_{t=1}^{T}\big(A\cos(\omega t)+B\sin(\omega t)\big)^2 = \frac{T(A^2+B^2)}{2} + \frac{A^2-B^2}{2}\sum_{t=1}^{T}\cos(2\omega t) + AB\sum_{t=1}^{T}\sin(2\omega t).$$
Hence
$$2\mathcal{L}_T(A,B,\omega) = -S_T(A,B,\omega) - \frac{A^2-B^2}{2}\sum_{t=1}^{T}\cos(2\omega t) - AB\sum_{t=1}^{T}\sin(2\omega t).$$
Now, by using (6.13), we have $\mathcal{L}_T(A,B,\omega) = -\frac{1}{2}S_T(A,B,\omega) + O(1)$, as required.

(iii) To obtain the profile likelihood, let us suppose that $\omega$ is known. Then the minimisers of $S_T$ over $A$ and $B$ are
$$\hat A_T(\omega) = \frac{2}{T}\sum_{t=1}^{T}X_t\cos(\omega t),\qquad\hat B_T(\omega) = \frac{2}{T}\sum_{t=1}^{T}X_t\sin(\omega t).$$
Thus the profile objective (using the approximation $S_T$) is
$$S_p(\omega) = S_T\big(\hat A_T(\omega),\hat B_T(\omega),\omega\big) = \sum_{t=1}^{T}X_t^2 - 2\sum_{t=1}^{T}X_t\big(\hat A_T(\omega)\cos(\omega t)+\hat B_T(\omega)\sin(\omega t)\big) + \frac{T}{2}\big(\hat A_T(\omega)^2+\hat B_T(\omega)^2\big) = \sum_{t=1}^{T}X_t^2 - \frac{T}{2}\big(\hat A_T(\omega)^2+\hat B_T(\omega)^2\big).$$

Thus the $\omega$ which maximises $-\frac{1}{2}S_p(\omega)$ (equivalently, minimises $S_p(\omega)$) is the parameter that maximises $\hat A_T(\omega)^2+\hat B_T(\omega)^2$. Since $\hat A_T(\omega)^2+\hat B_T(\omega)^2 = \frac{4}{T^2}\big|\sum_{t=1}^{T}X_t\exp(it\omega)\big|^2$, we have
$$\hat\omega_T = \arg\min_{\omega}S_p(\omega) = \arg\max_{\omega}\big(\hat A_T(\omega)^2+\hat B_T(\omega)^2\big) = \arg\max_{\omega}\Big|\sum_{t=1}^{T}X_t\exp(it\omega)\Big|^2,$$
as required.

(iv) Differentiating both sides of (6.13) with respect to $\Omega$ and considering the real and imaginary parts gives $\sum_t t\cos(\Omega t)=O(T)$ and $\sum_t t\sin(\Omega t)=O(T)$. Differentiating both sides of (6.13) twice with respect to $\Omega$ gives the second pair of bounds.

(v) Differentiating $S_T(A,B,\omega) = \sum_t X_t^2 - 2\sum_t X_t(A\cos\omega t + B\sin\omega t) + \frac{T}{2}(A^2+B^2)$ twice with respect to $A$, $B$ and $\omega$ gives
$$\frac{\partial S_T}{\partial A} = -2\sum_t X_t\cos(\omega t) + TA,\qquad\frac{\partial S_T}{\partial B} = -2\sum_t X_t\sin(\omega t) + TB,\qquad\frac{\partial S_T}{\partial\omega} = 2\sum_t tX_t\big(A\sin(\omega t)-B\cos(\omega t)\big),$$
and
$$\frac{\partial^2 S_T}{\partial A^2} = \frac{\partial^2 S_T}{\partial B^2} = T,\quad\frac{\partial^2 S_T}{\partial A\partial B}=0,\quad\frac{\partial^2 S_T}{\partial\omega\partial A} = 2\sum_t tX_t\sin(\omega t),\quad\frac{\partial^2 S_T}{\partial\omega\partial B} = -2\sum_t tX_t\cos(\omega t),\quad\frac{\partial^2 S_T}{\partial\omega^2} = 2\sum_t t^2X_t\big(A\cos(\omega t)+B\sin(\omega t)\big).$$
Now, taking expectations of the above and using (iv), we have
$$E\Big(\frac{\partial^2 S_T}{\partial\omega\partial A}\Big) = 2\sum_{t=1}^{T}t\sin(\omega t)\big(A\cos(\omega t)+B\sin(\omega t)\big) = A\sum_{t=1}^{T}t\sin(2\omega t) + B\sum_{t=1}^{T}t\big(1-\cos(2\omega t)\big) = \frac{BT^2}{2}+O(T).$$

Using a similar argument we can show that $E\big(\frac{\partial^2 S_T}{\partial\omega\partial B}\big) = -\frac{AT^2}{2}+O(T)$ and
$$E\Big(\frac{\partial^2 S_T}{\partial\omega^2}\Big) = 2\sum_{t=1}^{T}t^2\big(A\cos(\omega t)+B\sin(\omega t)\big)^2 = \frac{(A^2+B^2)T^3}{3}+O(T^2).$$
Since $E(-\nabla^2\mathcal{L}_T)\approx\frac{1}{2}E(\nabla^2 S_T)$, this gives the required result.

(vi) Noting that the asymptotic variance of the profile likelihood estimator $\hat\omega_T$ is $\big(I_{\omega\omega}-I_{\omega,AB}\,I_{AB,AB}^{-1}\,I_{AB,\omega}\big)^{-1}$, by substituting (v) into the above we have
$$I_{\omega\omega}-I_{\omega,AB}\,I_{AB,AB}^{-1}\,I_{AB,\omega} = \frac{T^3(A^2+B^2)}{6} - \frac{T^3(A^2+B^2)}{8} + O(T^2) = \frac{T^3(A^2+B^2)}{24}+O(T^2).$$
Thus we observe that the asymptotic variance of $\hat\omega_T$ is $O(T^{-3})$. Typically estimators have a variance of order $O(T^{-1})$, so we see that the estimator $\hat\omega_T$ has a variance which converges to zero much faster. Thus the estimator is extremely good compared with the majority of parameter estimators.

Example: An application of profiling in survival analysis

Question This question also uses some methods from Survival Analysis, which is covered later in this course. Let $T_i$ denote the survival time of an electrical component. It is known that the regressors $x_i$ influence the survival time $T_i$. To model the influence the regressors have on the survival time, the Cox proportional hazards model is used with the exponential distribution as the baseline distribution and $\psi(x_i;\beta)=\exp(\beta x_i)$ as the link function. More precisely, the survival function of $T_i$ is
$$\mathcal{F}_i(t) = \mathcal{F}_0(t)^{\psi(x_i;\beta)},\qquad\text{where}\qquad\mathcal{F}_0(t)=\exp(-t/\theta).$$
Not all the survival times of the electrical components are observed, and there can arise censoring. Hence we observe $Y_i = \min(T_i,c_i)$, where $c_i$ is the censoring time, and $\delta_i$, where $\delta_i$ is the indicator variable with $\delta_i=0$ denoting censoring of the $i$th component and $\delta_i=1$ denoting that it is not censored. The parameters $\beta$ and $\theta$ are unknown.

(i) Derive the log-likelihood of $\{Y_i,\delta_i\}$.

(ii) Compute the profile likelihood of the regression parameter $\beta$, profiling out the baseline parameter $\theta$.

Solution

(i) The density and the survival function are
$$f_i(t) = \psi(x_i;\beta)\,\big\{\mathcal{F}_0(t)\big\}^{\psi(x_i;\beta)-1}f_0(t)\qquad\text{and}\qquad\mathcal{F}_i(t) = \mathcal{F}_0(t)^{\psi(x_i;\beta)}.$$
Hence, for this example, we have
$$\log f_i(t) = \log\psi(x_i;\beta) - \big(\psi(x_i;\beta)-1\big)\frac{t}{\theta} - \log\theta - \frac{t}{\theta},\qquad\log\mathcal{F}_i(t) = -\psi(x_i;\beta)\frac{t}{\theta}.$$
Therefore, the log-likelihood is
$$\mathcal{L}_n(\beta,\theta) = \sum_{i=1}^{n}\delta_i\big\{\log\psi(x_i;\beta) + \log f_0(Y_i) + (\psi(x_i;\beta)-1)\log\mathcal{F}_0(Y_i)\big\} + \sum_{i=1}^{n}(1-\delta_i)\,\psi(x_i;\beta)\log\mathcal{F}_0(Y_i)$$
$$= \sum_{i=1}^{n}\delta_i\big\{\log\psi(x_i;\beta) - \log\theta\big\} - \sum_{i=1}^{n}\psi(x_i;\beta)\frac{Y_i}{\theta}.$$

(ii) Keeping $\beta$ fixed, differentiating the above with respect to $\theta$ and equating to zero gives
$$\frac{\partial\mathcal{L}_n}{\partial\theta} = -\frac{\sum_{i=1}^{n}\delta_i}{\theta} + \frac{\sum_{i=1}^{n}\psi(x_i;\beta)Y_i}{\theta^2} = 0\qquad\text{and hence}\qquad\hat\theta(\beta) = \frac{\sum_{i=1}^{n}\psi(x_i;\beta)Y_i}{\sum_{i=1}^{n}\delta_i}.$$
Hence the profile likelihood is
$$\ell_P(\beta) = \sum_{i=1}^{n}\delta_i\big\{\log\psi(x_i;\beta) - \log\hat\theta(\beta)\big\} - \sum_{i=1}^{n}\frac{\psi(x_i;\beta)Y_i}{\hat\theta(\beta)}.$$
Hence, to obtain an estimator of $\beta$, we maximise the above with respect to $\beta$.
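A numerical sketch of part (ii) (my own simulation, not part of the notes; the covariates, censoring distribution and true parameter values are illustrative choices):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(9)
n, beta_true, theta_true = 400, 0.8, 2.0
x = rng.normal(size=n)
psi_true = np.exp(beta_true * x)
# Survival function F_i(t) = exp(-psi * t / theta), i.e. exponential with
# mean theta / psi; censoring times are exponential as an illustrative choice.
t_surv = rng.exponential(theta_true / psi_true)
c = rng.exponential(3.0, size=n)
y = np.minimum(t_surv, c)
delta = (t_surv <= c).astype(float)          # 1 = observed, 0 = censored

def neg_profile_loglik(beta):
    psi = np.exp(beta * x)                   # log psi(x_i; beta) = beta * x_i
    theta_b = np.sum(psi * y) / np.sum(delta)      # theta profiled out
    return -(np.sum(delta * (beta * x - np.log(theta_b)))
             - np.sum(psi * y) / theta_b)

fit = minimize_scalar(neg_profile_loglik, bounds=(-3.0, 3.0), method="bounded")
print(fit.x)      # should be near beta_true for large n
```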


Stat 5102 Lecture Slides Deck 3. Charles J. Geyer School of Statistics University of Minnesota Stat 5102 Lecture Slides Deck 3 Charles J. Geyer School of Statistics University of Minnesota 1 Likelihood Inference We have learned one very general method of estimation: method of moments. the Now we

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 23, 2015 1 / 50 I-Hsiang

More information

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it

More information

Master s Written Examination

Master s Written Examination Master s Written Examination Option: Statistics and Probability Spring 016 Full points may be obtained for correct answers to eight questions. Each numbered question which may have several parts is worth

More information

Regression Estimation Least Squares and Maximum Likelihood

Regression Estimation Least Squares and Maximum Likelihood Regression Estimation Least Squares and Maximum Likelihood Dr. Frank Wood Frank Wood, fwood@stat.columbia.edu Linear Regression Models Lecture 3, Slide 1 Least Squares Max(min)imization Function to minimize

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Chapter 7 Maximum Likelihood Estimation 7. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

Statement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:.

Statement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:. MATHEMATICAL STATISTICS Homework assignment Instructions Please turn in the homework with this cover page. You do not need to edit the solutions. Just make sure the handwriting is legible. You may discuss

More information

STAT 730 Chapter 4: Estimation

STAT 730 Chapter 4: Estimation STAT 730 Chapter 4: Estimation Timothy Hanson Department of Statistics, University of South Carolina Stat 730: Multivariate Analysis 1 / 23 The likelihood We have iid data, at least initially. Each datum

More information

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n Chapter 9 Hypothesis Testing 9.1 Wald, Rao, and Likelihood Ratio Tests Suppose we wish to test H 0 : θ = θ 0 against H 1 : θ θ 0. The likelihood-based results of Chapter 8 give rise to several possible

More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

Chapter 3 : Likelihood function and inference

Chapter 3 : Likelihood function and inference Chapter 3 : Likelihood function and inference 4 Likelihood function and inference The likelihood Information and curvature Sufficiency and ancilarity Maximum likelihood estimation Non-regular models EM

More information

Econometrics I, Estimation

Econometrics I, Estimation Econometrics I, Estimation Department of Economics Stanford University September, 2008 Part I Parameter, Estimator, Estimate A parametric is a feature of the population. An estimator is a function of the

More information

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing 1 In most statistics problems, we assume that the data have been generated from some unknown probability distribution. We desire

More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

Lecture 4 September 15

Lecture 4 September 15 IFT 6269: Probabilistic Graphical Models Fall 2017 Lecture 4 September 15 Lecturer: Simon Lacoste-Julien Scribe: Philippe Brouillard & Tristan Deleu 4.1 Maximum Likelihood principle Given a parametric

More information

Linear regression COMS 4771

Linear regression COMS 4771 Linear regression COMS 4771 1. Old Faithful and prediction functions Prediction problem: Old Faithful geyser (Yellowstone) Task: Predict time of next eruption. 1 / 40 Statistical model for time between

More information

f(x θ)dx with respect to θ. Assuming certain smoothness conditions concern differentiating under the integral the integral sign, we first obtain

f(x θ)dx with respect to θ. Assuming certain smoothness conditions concern differentiating under the integral the integral sign, we first obtain 0.1. INTRODUCTION 1 0.1 Introduction R. A. Fisher, a pioneer in the development of mathematical statistics, introduced a measure of the amount of information contained in an observaton from f(x θ). Fisher

More information

Correlation and Regression

Correlation and Regression Correlation and Regression October 25, 2017 STAT 151 Class 9 Slide 1 Outline of Topics 1 Associations 2 Scatter plot 3 Correlation 4 Regression 5 Testing and estimation 6 Goodness-of-fit STAT 151 Class

More information

Lecture 4: Testing Stuff

Lecture 4: Testing Stuff Lecture 4: esting Stuff. esting Hypotheses usually has three steps a. First specify a Null Hypothesis, usually denoted, which describes a model of H 0 interest. Usually, we express H 0 as a restricted

More information

Master s Written Examination

Master s Written Examination Master s Written Examination Option: Statistics and Probability Spring 05 Full points may be obtained for correct answers to eight questions Each numbered question (which may have several parts) is worth

More information

Peter Hoff Linear and multilinear models April 3, GLS for multivariate regression 5. 3 Covariance estimation for the GLM 8

Peter Hoff Linear and multilinear models April 3, GLS for multivariate regression 5. 3 Covariance estimation for the GLM 8 Contents 1 Linear model 1 2 GLS for multivariate regression 5 3 Covariance estimation for the GLM 8 4 Testing the GLH 11 A reference for some of this material can be found somewhere. 1 Linear model Recall

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Statistical inference

Statistical inference Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall

More information

Chapter 3. Point Estimation. 3.1 Introduction

Chapter 3. Point Estimation. 3.1 Introduction Chapter 3 Point Estimation Let (Ω, A, P θ ), P θ P = {P θ θ Θ}be probability space, X 1, X 2,..., X n : (Ω, A) (IR k, B k ) random variables (X, B X ) sample space γ : Θ IR k measurable function, i.e.

More information

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided

Let us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided Let us first identify some classes of hypotheses. simple versus simple H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided H 0 : θ θ 0 versus H 1 : θ > θ 0. (2) two-sided; null on extremes H 0 : θ θ 1 or

More information

Estimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators

Estimation theory. Parametric estimation. Properties of estimators. Minimum variance estimator. Cramer-Rao bound. Maximum likelihood estimators Estimation theory Parametric estimation Properties of estimators Minimum variance estimator Cramer-Rao bound Maximum likelihood estimators Confidence intervals Bayesian estimation 1 Random Variables Let

More information

ECE531 Lecture 10b: Maximum Likelihood Estimation

ECE531 Lecture 10b: Maximum Likelihood Estimation ECE531 Lecture 10b: Maximum Likelihood Estimation D. Richard Brown III Worcester Polytechnic Institute 05-Apr-2011 Worcester Polytechnic Institute D. Richard Brown III 05-Apr-2011 1 / 23 Introduction So

More information

Inference in non-linear time series

Inference in non-linear time series Intro LS MLE Other Erik Lindström Centre for Mathematical Sciences Lund University LU/LTH & DTU Intro LS MLE Other General Properties Popular estimatiors Overview Introduction General Properties Estimators

More information

More Empirical Process Theory

More Empirical Process Theory More Empirical Process heory 4.384 ime Series Analysis, Fall 2008 Recitation by Paul Schrimpf Supplementary to lectures given by Anna Mikusheva October 24, 2008 Recitation 8 More Empirical Process heory

More information

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley

Review of Classical Least Squares. James L. Powell Department of Economics University of California, Berkeley Review of Classical Least Squares James L. Powell Department of Economics University of California, Berkeley The Classical Linear Model The object of least squares regression methods is to model and estimate

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Due Thursday, September 19, in class What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Statistical Inference

Statistical Inference Statistical Inference Classical and Bayesian Methods Revision Class for Midterm Exam AMS-UCSC Th Feb 9, 2012 Winter 2012. Session 1 (Revision Class) AMS-132/206 Th Feb 9, 2012 1 / 23 Topics Topics We will

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

EM Algorithm II. September 11, 2018

EM Algorithm II. September 11, 2018 EM Algorithm II September 11, 2018 Review EM 1/27 (Y obs, Y mis ) f (y obs, y mis θ), we observe Y obs but not Y mis Complete-data log likelihood: l C (θ Y obs, Y mis ) = log { f (Y obs, Y mis θ) Observed-data

More information

University of Regina. Lecture Notes. Michael Kozdron

University of Regina. Lecture Notes. Michael Kozdron University of Regina Statistics 252 Mathematical Statistics Lecture Notes Winter 2005 Michael Kozdron kozdron@math.uregina.ca www.math.uregina.ca/ kozdron Contents 1 The Basic Idea of Statistics: Estimating

More information

Economics 583: Econometric Theory I A Primer on Asymptotics

Economics 583: Econometric Theory I A Primer on Asymptotics Economics 583: Econometric Theory I A Primer on Asymptotics Eric Zivot January 14, 2013 The two main concepts in asymptotic theory that we will use are Consistency Asymptotic Normality Intuition consistency:

More information

STA 2201/442 Assignment 2

STA 2201/442 Assignment 2 STA 2201/442 Assignment 2 1. This is about how to simulate from a continuous univariate distribution. Let the random variable X have a continuous distribution with density f X (x) and cumulative distribution

More information

1 Appendix A: Matrix Algebra

1 Appendix A: Matrix Algebra Appendix A: Matrix Algebra. Definitions Matrix A =[ ]=[A] Symmetric matrix: = for all and Diagonal matrix: 6=0if = but =0if 6= Scalar matrix: the diagonal matrix of = Identity matrix: the scalar matrix

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

The Multivariate Gaussian Distribution [DRAFT]

The Multivariate Gaussian Distribution [DRAFT] The Multivariate Gaussian Distribution DRAFT David S. Rosenberg Abstract This is a collection of a few key and standard results about multivariate Gaussian distributions. I have not included many proofs,

More information

Multivariate Analysis and Likelihood Inference

Multivariate Analysis and Likelihood Inference Multivariate Analysis and Likelihood Inference Outline 1 Joint Distribution of Random Variables 2 Principal Component Analysis (PCA) 3 Multivariate Normal Distribution 4 Likelihood Inference Joint density

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

The outline for Unit 3

The outline for Unit 3 The outline for Unit 3 Unit 1. Introduction: The regression model. Unit 2. Estimation principles. Unit 3: Hypothesis testing principles. 3.1 Wald test. 3.2 Lagrange Multiplier. 3.3 Likelihood Ratio Test.

More information

Chapter 4 HOMEWORK ASSIGNMENTS. 4.1 Homework #1

Chapter 4 HOMEWORK ASSIGNMENTS. 4.1 Homework #1 Chapter 4 HOMEWORK ASSIGNMENTS These homeworks may be modified as the semester progresses. It is your responsibility to keep up to date with the correctly assigned homeworks. There may be some errors in

More information

Lecture 5: LDA and Logistic Regression

Lecture 5: LDA and Logistic Regression Lecture 5: and Logistic Regression Hao Helen Zhang Hao Helen Zhang Lecture 5: and Logistic Regression 1 / 39 Outline Linear Classification Methods Two Popular Linear Models for Classification Linear Discriminant

More information

Institute of Actuaries of India

Institute of Actuaries of India Institute of Actuaries of India Subject CT3 Probability & Mathematical Statistics May 2011 Examinations INDICATIVE SOLUTION Introduction The indicative solution has been written by the Examiners with the

More information

Asymptotics for Nonlinear GMM

Asymptotics for Nonlinear GMM Asymptotics for Nonlinear GMM Eric Zivot February 13, 2013 Asymptotic Properties of Nonlinear GMM Under standard regularity conditions (to be discussed later), it can be shown that where ˆθ(Ŵ) θ 0 ³ˆθ(Ŵ)

More information

Testing Statistical Hypotheses

Testing Statistical Hypotheses E.L. Lehmann Joseph P. Romano Testing Statistical Hypotheses Third Edition 4y Springer Preface vii I Small-Sample Theory 1 1 The General Decision Problem 3 1.1 Statistical Inference and Statistical Decisions

More information

Semiparametric Regression

Semiparametric Regression Semiparametric Regression Patrick Breheny October 22 Patrick Breheny Survival Data Analysis (BIOS 7210) 1/23 Introduction Over the past few weeks, we ve introduced a variety of regression models under

More information

Introduction to Simple Linear Regression

Introduction to Simple Linear Regression Introduction to Simple Linear Regression Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Introduction to Simple Linear Regression 1 / 68 About me Faculty in the Department

More information

exp{ (x i) 2 i=1 n i=1 (x i a) 2 (x i ) 2 = exp{ i=1 n i=1 n 2ax i a 2 i=1

exp{ (x i) 2 i=1 n i=1 (x i a) 2 (x i ) 2 = exp{ i=1 n i=1 n 2ax i a 2 i=1 4 Hypothesis testing 4. Simple hypotheses A computer tries to distinguish between two sources of signals. Both sources emit independent signals with normally distributed intensity, the signals of the first

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Likelihood Inference in the Presence of Nuisance Parameters

Likelihood Inference in the Presence of Nuisance Parameters PHYSTAT2003, SLAC, September 8-11, 2003 1 Likelihood Inference in the Presence of Nuance Parameters N. Reid, D.A.S. Fraser Department of Stattics, University of Toronto, Toronto Canada M5S 3G3 We describe

More information

Brief Review on Estimation Theory

Brief Review on Estimation Theory Brief Review on Estimation Theory K. Abed-Meraim ENST PARIS, Signal and Image Processing Dept. abed@tsi.enst.fr This presentation is essentially based on the course BASTA by E. Moulines Brief review on

More information

Review and continuation from last week Properties of MLEs

Review and continuation from last week Properties of MLEs Review and continuation from last week Properties of MLEs As we have mentioned, MLEs have a nice intuitive property, and as we have seen, they have a certain equivariance property. We will see later that

More information

ECE 275A Homework 7 Solutions

ECE 275A Homework 7 Solutions ECE 275A Homework 7 Solutions Solutions 1. For the same specification as in Homework Problem 6.11 we want to determine an estimator for θ using the Method of Moments (MOM). In general, the MOM estimator

More information

Vector Auto-Regressive Models

Vector Auto-Regressive Models Vector Auto-Regressive Models Laurent Ferrara 1 1 University of Paris Nanterre M2 Oct. 2018 Overview of the presentation 1. Vector Auto-Regressions Definition Estimation Testing 2. Impulse responses functions

More information

Lecture 1: August 28

Lecture 1: August 28 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 1: August 28 Our broad goal for the first few lectures is to try to understand the behaviour of sums of independent random

More information

VAR Models and Applications

VAR Models and Applications VAR Models and Applications Laurent Ferrara 1 1 University of Paris West M2 EIPMC Oct. 2016 Overview of the presentation 1. Vector Auto-Regressions Definition Estimation Testing 2. Impulse responses functions

More information

Greene, Econometric Analysis (6th ed, 2008)

Greene, Econometric Analysis (6th ed, 2008) EC771: Econometrics, Spring 2010 Greene, Econometric Analysis (6th ed, 2008) Chapter 17: Maximum Likelihood Estimation The preferred estimator in a wide variety of econometric settings is that derived

More information

BTRY 4090: Spring 2009 Theory of Statistics

BTRY 4090: Spring 2009 Theory of Statistics BTRY 4090: Spring 2009 Theory of Statistics Guozhang Wang September 25, 2010 1 Review of Probability We begin with a real example of using probability to solve computationally intensive (or infeasible)

More information

HT Introduction. P(X i = x i ) = e λ λ x i

HT Introduction. P(X i = x i ) = e λ λ x i MODS STATISTICS Introduction. HT 2012 Simon Myers, Department of Statistics (and The Wellcome Trust Centre for Human Genetics) myers@stats.ox.ac.uk We will be concerned with the mathematical framework

More information

Statistical Estimation

Statistical Estimation Statistical Estimation Use data and a model. The plug-in estimators are based on the simple principle of applying the defining functional to the ECDF. Other methods of estimation: minimize residuals from

More information

Exercises and Answers to Chapter 1

Exercises and Answers to Chapter 1 Exercises and Answers to Chapter The continuous type of random variable X has the following density function: a x, if < x < a, f (x), otherwise. Answer the following questions. () Find a. () Obtain mean

More information

Math 494: Mathematical Statistics

Math 494: Mathematical Statistics Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/

More information

ELEG 5633 Detection and Estimation Minimum Variance Unbiased Estimators (MVUE)

ELEG 5633 Detection and Estimation Minimum Variance Unbiased Estimators (MVUE) 1 ELEG 5633 Detection and Estimation Minimum Variance Unbiased Estimators (MVUE) Jingxian Wu Department of Electrical Engineering University of Arkansas Outline Minimum Variance Unbiased Estimators (MVUE)

More information

Lecture 15. Hypothesis testing in the linear model

Lecture 15. Hypothesis testing in the linear model 14. Lecture 15. Hypothesis testing in the linear model Lecture 15. Hypothesis testing in the linear model 1 (1 1) Preliminary lemma 15. Hypothesis testing in the linear model 15.1. Preliminary lemma Lemma

More information

Robustness and Distribution Assumptions

Robustness and Distribution Assumptions Chapter 1 Robustness and Distribution Assumptions 1.1 Introduction In statistics, one often works with model assumptions, i.e., one assumes that data follow a certain model. Then one makes use of methodology

More information

Terminology Suppose we have N observations {x(n)} N 1. Estimators as Random Variables. {x(n)} N 1

Terminology Suppose we have N observations {x(n)} N 1. Estimators as Random Variables. {x(n)} N 1 Estimation Theory Overview Properties Bias, Variance, and Mean Square Error Cramér-Rao lower bound Maximum likelihood Consistency Confidence intervals Properties of the mean estimator Properties of the

More information